Scrubbing
0. Gutenberg files
Documents from gutenberg.org contain "boilerplate" material about the text at the beginning and end of
1. Make lowercase:
Remove punctuation
Every Unicode character has an associated set of metadata that classifies its "type" of character. If this option is selected, any Unicode character in each of the active texts that has a Punctuation Character Property (its category code begins with 'P') or a Symbol Character Property (its category code begins with 'S') is removed. The specific punctuation and symbol categories that are removed are listed below:
| P | Punctuation |
| Pc | Connector punctuation |
| Pd | Dash punctuation |
| Pe | Close punctuation |
| Pf | Final punctuation |
| Pi | Initial punctuation |
| Po | Other punctuation |
| Ps | Open punctuation |
| S | Symbol |
| Sc | Currency symbol |
| Sk | Modifier symbol |
| Sm | Mathematical symbol |
| So | Other symbol |
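The category-based removal described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual implementation: the function name `remove_punctuation_and_symbols` is hypothetical, and the sketch assumes that Python's standard `unicodedata.category` lookup corresponds to the character properties listed in the table.

```python
import unicodedata


def remove_punctuation_and_symbols(text: str) -> str:
    """Drop every character whose Unicode general category starts with 'P' or 'S'.

    Hypothetical helper for illustration; the real tool may apply extra rules.
    """
    return "".join(
        ch for ch in text
        # category() returns a two-letter code such as 'Pd' or 'Sc';
        # the first letter identifies the major class.
        if unicodedata.category(ch)[0] not in ("P", "S")
    )


sample = "Hello, world! It costs $5 (approx.) + tax."
print(remove_punctuation_and_symbols(sample))
```

Note that a purely category-based filter like this one also strips apostrophes and hyphens (both fall under 'P' categories), so "don't" becomes "dont" and "well-known" becomes "wellknown". Whitespace is left untouched, which means removing a symbol that sits between two spaces leaves both spaces in place.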