The Scrub Tool
- Detect a file uploaded or scraped from gutenberg.org and remove most of the front and end boilerplate material.
- Make text all lowercase (including tags and HTML entities)
- Remove digits
Scrub offers options for removing Unicode punctuation and/or digits, with choices about case and whitespace and about how to handle hyphens and possessive apostrophes. Scrub also lets you supply a list of stop words (or keep words), lemmas, consolidations (character replacements), and special characters (e.g., older HTML entities, MUFI 3/4), and make basic edits to the tags of marked-up documents (e.g., XML). Each of these options is explained in more detail below.
The directions include prompts for further discussion of the implications of these choices and, when possible, effective practices for making preprocessing decisions.
Scrubbing is a challenge, as Hoover (2016) most recently reminded us. Our aim here is to share how we scrub documents, since scrubbing is itself a significant step in the Methods of any computational experiment on texts. By being as explicit as possible about the many functions involved in scrubbing texts, including those that we know Lexos does not handle richly, such as hyphens (cf. Hoover 2016), we encourage the effective practice of recording the Methods used.
Using Lexos
The first part of the scrubbing process involves several simple options that affect the entire document. In almost all cases, their names make their function clear, so the important thing to remember is that they take effect in each of your selected (active) documents or document segments in this order.
Note: scrubbing is hard: order of operations, special markup entities (e.g., &ae;)
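To illustrate why the order of operations matters, here is a minimal Python sketch of a few scrubbing steps chained in a fixed sequence. It is not the Lexos implementation: the function names (`lowercase`, `remove_punctuation`, `remove_digits`, `collapse_whitespace`, `scrub`) are hypothetical, the treatment of digits as Unicode category `Nd` is an assumption, and the special character, tag, consolidation, and lemma steps are left out for brevity.

```python
import re
import unicodedata


def lowercase(text: str) -> str:
    """Convert the whole text to lowercase."""
    return text.lower()


def remove_punctuation(text: str) -> str:
    """Drop characters whose Unicode category starts with P or S."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in "PS")


def remove_digits(text: str) -> str:
    """Drop decimal digits (assumption: Unicode category Nd only)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Nd")


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()


def scrub(text: str, stop_words=frozenset()) -> str:
    """Apply the steps in a fixed order; stop words are removed last."""
    for step in (lowercase, remove_punctuation, remove_digits, collapse_whitespace):
        text = step(text)
    return " ".join(word for word in text.split() if word not in stop_words)


print(scrub("The 3 Cats, and the dog!", stop_words={"the", "and"}))
```

Because stop words are matched only after lowercasing and punctuation removal, "The" is caught by the stop word "the". Running the stop word step first would leave it in place.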
"""0. Gutenberg files
Scrubbing order:
0. Gutenberg
1. lower
2. special characters
3. tags
4. punctuation (hyphens, apostrophes, ampersands)
5. digits
6. white space
7. consolidations
8. lemmatize
9. stop words/keep words
"""
Documents from gutenberg.org contain "boilerplate" material about the text at the beginning and end of the file; Scrub removes most of it.
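A rough sketch of boilerplate removal, for illustration only: it assumes the file carries the `*** START OF ...` and `*** END OF ...` marker lines found in many Project Gutenberg plain-text files, and it is not necessarily how Lexos detects them. The function name `strip_gutenberg_boilerplate` is hypothetical.

```python
import re

# Assumption: marker lines begin with "*** START OF" and "*** END OF".
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the start and end marker lines."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


sample = (
    "Project Gutenberg header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body text\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Files without these markers pass through unchanged apart from trimmed leading and trailing whitespace, which is consistent with removing "most" rather than all boilerplate.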
1. Make lowercase:
Remove punctuation
Every Unicode character has an associated set of metadata that classifies its "type" of character. If this option is selected, any Unicode character in each of the active texts that has a Punctuation Character Property (begins with 'P') or a Symbol Character Property (begins with 'S') is removed. The specific punctuation and symbol categories that are removed are listed below:
| Code | Character property |
| P | Punctuation |
| Pc | Connector punctuation |
| Pd | Dash punctuation |
| Pe | Close punctuation |
| Pf | Final punctuation |
| Pi | Initial punctuation |
| Po | Other punctuation |
| Ps | Open punctuation |
| S | Symbol |
| Sc | Currency symbol |
| Sk | Modifier symbol |
| Sm | Mathematical symbol |
| So | Other symbol |
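The category test described above can be sketched with Python's standard `unicodedata` module, whose `category()` function returns the two-letter codes from the table. This is a minimal illustration, not the Lexos code itself; the helper names `is_removable` and `remove_punctuation_and_symbols` are hypothetical.

```python
import unicodedata


def is_removable(ch: str) -> bool:
    """True if the character's Unicode category starts with P or S."""
    return unicodedata.category(ch)[0] in ("P", "S")


def remove_punctuation_and_symbols(text: str) -> str:
    """Drop every character in a Punctuation or Symbol category."""
    return "".join(ch for ch in text if not is_removable(ch))


# Show the category code behind a few removed characters.
for ch in ",($+":
    print(repr(ch), unicodedata.category(ch))

print(remove_punctuation_and_symbols("Hello, world! (a+b) $5"))
```

Note that this simple filter also removes hyphens (`Pd`) and apostrophes (`Po`), so any special handling of hyphens or possessives has to happen before it runs.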