In the Margins

Scrubbing

Scrubbing topics go here. (This section is not yet organized.)

0. Gutenberg files
Documents from gutenberg.org containing "boilerplate" material about the text at the beginning and end of


1. Make lowercase:

Remove Punctuation
Every Unicode character has an associated set of metadata that classifies its "type". If this option is selected, any Unicode character in the active texts that has a Punctuation character property (a code beginning with 'P') or a Symbol character property (a code beginning with 'S') is removed. The specific punctuation and symbol categories that are removed are listed below:
 
P   Punctuation
Pc  Connector punctuation
Pd  Dash punctuation
Pe  Close punctuation
Pf  Final punctuation
Pi  Initial punctuation
Po  Other punctuation
Ps  Open punctuation
S   Symbol
Sc  Currency symbol
Sk  Modifier symbol
Sm  Mathematical symbol
So  Other symbol

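
The category-based removal described above can be illustrated with a minimal Python sketch. It is only an illustration, not the tool's actual code: the function name remove_punctuation is hypothetical, and the sketch assumes that filtering on the first letter of the general category reported by Python's standard unicodedata module matches the behavior described.

```python
import unicodedata


def remove_punctuation(text: str) -> str:
    """Drop every character whose Unicode general category begins with 'P' or 'S'.

    Hypothetical helper for illustration; the name is an assumption,
    not taken from the tool itself.
    """
    return "".join(
        ch for ch in text
        # unicodedata.category returns a two-letter code such as 'Po' or 'Sc'
        if unicodedata.category(ch)[0] not in ("P", "S")
    )


sample = "Hello, world! It costs $5 (or €4)."
print(remove_punctuation(sample))  # Hello world It costs 5 or 4
```

Whitespace, letters, and digits fall outside the P and S categories, so they are left untouched; only the characters classified in the list above are dropped.
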
If Remove Punctuation is selected, two additional options are presented:

 
