In the Margins

The Scrub Tool

The Scrub tool allows you to make document-wide edits on collections of selected documents, including:

  1. Detecting a file uploaded or scraped from gutenberg.org and removing most of the front and end boilerplate material
  2. Making text all lowercase (including tags and HTML entities)
  3. Removing digits
  4. Removing Unicode punctuation, with options for how to handle hyphens and possessive apostrophes, and for how to treat whitespace

Additionally, Scrub allows you to input a list of stop words (or keep words), lemmas, consolidations (character replacements), and special characters (e.g., older HTML entities, MUFI 3/4), and to make basic edits to the tags of marked-up documents (e.g., XML). Each of these options is explained in more detail below.

The directions include prompts for further discussion of the implications of each choice and, when possible, effective practices for making preprocessing decisions.

Scrubbing is a challenge, as Hoover (2016) has most recently reminded us. Our aim here is to share how we scrub documents, since scrubbing is itself a significant step in the Methods of any computational experiment on texts. By being as explicit as possible about the many functions involved in scrubbing texts, including those that we know Lexos does not handle richly, such as hyphens (cf. Hoover 2016), we hope to encourage the effective practice of recording the Methods used.

Using Lexos
The first part of the scrubbing process involves several simple options which affect the entire document. In almost all cases, their names make their function clear, so the important thing to remember is that they take effect in each of your selected (active) documents or document segments in the order listed below.

Note: Scrubbing is hard, in part because of the order of operations and special markup entities (e.g., &ae;).
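To see why the order of operations matters, consider the minimal sketch below. It is not the Lexos implementation: the function names are hypothetical, only a few of the steps are included, and treating "digits" as Unicode decimal digits is an assumption. The steps run in the same relative order as the scrubbing order given in this section, so that a lowercase stop word list is applied only after the text itself has been lowercased.

```python
import re
import unicodedata


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Drop characters whose Unicode category code starts with P or S.
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in "PS")


def remove_digits(text):
    # Assumption: "digits" means Unicode decimal digits, which \d matches in str patterns.
    return re.sub(r"\d", "", text)


def normalize_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(text, stop_words):
    return " ".join(word for word in text.split() if word not in stop_words)


def scrub(text, stop_words=frozenset()):
    # Hypothetical pipeline: lowercase, punctuation, digits, whitespace, stop words.
    for step in (lowercase, remove_punctuation, remove_digits, normalize_whitespace):
        text = step(text)
    return remove_stop_words(text, stop_words)


print(scrub("The 3 Dogs -- barked!", stop_words={"the"}))  # prints: dogs barked
```

If the stop word step ran first, "The" would not match the lowercase entry "the" and would survive scrubbing, which is one reason the order needs to be fixed and recorded.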
"""
Scrubbing order:
0. Gutenberg
1. lower
2. special characters
3. tags
4. punctuation (hyphens, apostrophes, ampersands)
5. digits
6. white space
7. consolidations
8. lemmatize
9. stop words/keep words
"""
0. Gutenberg files
Documents from gutenberg.org contain "boilerplate" material about the text at the beginning and end of the file. When Scrub detects such a file, it removes most of this material.
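As a rough illustration of this kind of cleanup, the sketch below keeps only the text between the "*** START OF ..." and "*** END OF ..." marker lines that many Project Gutenberg files carry. This is an assumption about the file layout, not a description of how Lexos detects or trims these files; older files use other conventions, and the function name is hypothetical.

```python
import re

# Assumed marker lines; many, but not all, Project Gutenberg files use them.
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if they are present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    # Ignore an end marker that appears before the start marker.
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()
```

A file without either marker is returned unchanged apart from leading and trailing whitespace, so the function is safe to apply to documents that did not come from gutenberg.org.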


1. Make lowercase:

Remove punctuation
Every Unicode character has an associated set of metadata that classifies its "type". If this option is selected, any Unicode character in the active texts that has a Punctuation Character Property (its category code begins with 'P') or a Symbol Character Property (its category code begins with 'S') is removed. The specific punctuation and symbol categories that are removed are listed below:
 
P   Punctuation
Pc  Connector punctuation
Pd  Dash punctuation
Pe  Close punctuation
Pf  Final punctuation
Pi  Initial punctuation
Po  Other punctuation
Ps  Open punctuation
S   Symbol
Sc  Currency symbol
Sk  Modifier symbol
Sm  Mathematical symbol
So  Other symbol
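The category codes above can be inspected with Python's standard unicodedata module. The sketch below is an illustration only, with a hypothetical function name; it is not the Lexos code. It removes every character whose category code begins with 'P' or 'S' and also reports which characters were removed under which code, which can be useful when recording the Methods used.

```python
from collections import defaultdict
import unicodedata


def split_punctuation(text):
    """Return the text without P*/S* characters, plus the removed characters by category."""
    kept = []
    removed = defaultdict(set)
    for ch in text:
        category = unicodedata.category(ch)  # two-letter code such as "Pd" or "Sc"
        if category[0] in ("P", "S"):
            removed[category].add(ch)
        else:
            kept.append(ch)
    return "".join(kept), dict(removed)


cleaned, removed = split_punctuation("Cost: $5 (approx.)")
print(cleaned)  # prints: Cost 5 approx
```

Here the colon and period are removed as Po, the dollar sign as Sc, and the parentheses as Ps and Pe. Note that digits and whitespace are untouched, since their category codes begin with 'N' and 'Z'.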
If Remove Punctuation is selected, two additional options are presented:

 
