The Scrub Tool

The Scrubbing tool allows you to make document-wide edits on collections of selected documents, including:

Detect files uploaded or scraped from gutenberg.org and remove most of the front and end boilerplate material.
Make text all lowercase (including tags and HTML entities)
Remove digits (
Scrub also handles the removal of Unicode punctuation and/or digits, case and whitespace, and the treatment of hyphens and possessive apostrophes. Additionally, Scrub allows you to input a list of stop words (or keep words), lemmas, consolidations (character replacements), and special characters (e.g., older HTML entities, MUFI 3/4), and to make basic edits to the tags of marked-up documents (e.g., XML). Each of these options is explained in more detail below.

The directions include prompts to further discussions on the implications of your choices and, where possible, effective practices for making preprocessing decisions.

Scrubbing is a challenge, as Hoover (2016) most recently reminds us. Our aim here is to share how we scrub documents, since scrubbing is itself a significant step in the Methods of any computational experiment on texts. By being as explicit as possible about the many functions involved in scrubbing texts, including those that we know Lexos does not handle richly, such as hyphens (cf. Hoover 2016), we encourage the effective practice of recording the Methods used.

Using Lexos
The first part of the scrubbing process involves several simple options which affect the entire document. In almost all cases, their names make their function clear, so the important thing to remember is that they take effect in each of your selected (active) documents or document segments in the order listed below.

Note: scrubbing is hard because of the order of operations and special markup entities (e.g., &ae;).

"""
Scrubbing order:
0. Gutenberg
1. lower
2. special characters
3. tags
4. punctuation (hyphens, apostrophes, ampersands)
5. digits
6. white space
7. consolidations
8. lemmatize
9. stop words/keep words
"""
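
Because the order is fixed, it can help to think of scrubbing as a list of steps applied one after another. The following is a minimal sketch of that idea, not the Lexos source code: the names `scrub`, `PIPELINE`, and the three step functions are hypothetical, and only three of the ten steps above are modelled.

```python
import re


def lowercase(text):
    """Step 1 in the list above: make everything lowercase."""
    return text.lower()


def remove_digits(text):
    """Step 5: drop digits (\\d matches any Unicode decimal digit)."""
    return re.sub(r"\d", "", text)


def collapse_whitespace(text):
    """Step 6 (one possible behaviour): squeeze runs of white space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical pipeline: the position of a step in this list is its
# place in the scrubbing order, regardless of which options are enabled.
PIPELINE = [
    ("lower", lowercase),
    ("digits", remove_digits),
    ("white space", collapse_whitespace),
]


def scrub(text, enabled):
    """Apply every enabled step to the text, always in PIPELINE order."""
    for name, step in PIPELINE:
        if name in enabled:
            text = step(text)
    return text
```

For example, `scrub("Chapter 12  The END", {"digits", "lower", "white space"})` lowercases first, then removes the digits, and only then cleans up the white space that the removed digits left behind.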

0. Gutenberg files
Documents from gutenberg.org containing "boilerplate" materials about the text at the beginning and end of

1. Make lowercase:

Remove Punctuation
Every Unicode character has an associated set of metadata that classifies its "type" of character. If this option is selected, any Unicode character in each of the active texts that has a Punctuation Character Property (its code begins with 'P') or a Symbol Character Property (its code begins with 'S') is removed. The specific Punctuation and Symbol categories that are removed are listed below:

Punctuation
Pc	Connector punctuation
Pd	Dash punctuation
Pe	Close punctuation
Pf	Final punctuation
Pi	Initial punctuation
Po	Other punctuation
Ps	Open punctuation
S	Symbol
Sc	Currency symbol
Sk	Modifier symbol
Sm	Mathematical symbol
So	Other symbol
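
The category test described above can be sketched in a few lines of Python using the standard library's `unicodedata.category`, which returns the two-letter codes listed in the table. This is an illustration under that assumption, not the Lexos implementation; the function name `remove_punctuation` is hypothetical.

```python
import unicodedata


def remove_punctuation(text):
    """Drop every character whose Unicode category starts with P or S."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S")
    )
```

Note that this removes only the punctuation and symbol characters themselves. The white space around them is left in place, so `"$5 + 3"` becomes `"5  3"` with two spaces.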

If Remove Punctuation is selected, two additional options are presented:
Keep Hyphens - If this option is selected, any character in a set of hyphen characters is replaced with the (ASCII) hyphen-minus character, and the hyphen-minus characters then remain in the text. A hyphen character is any of the following Unicode hyphen characters: [u'\u058A', u'\u05BE', u'\u2010', u'\u2011', u'\u2012', u'\u2013', u'\u2014', u'\u2015', u'\uFE58', u'\uFE63', u'\uFF0D'].
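
The Keep Hyphens behaviour can be sketched as two passes: first map every character in the hyphen set to the hyphen-minus, then remove punctuation and symbols while letting the hyphen-minus through. This is a minimal sketch under those assumptions, not the Lexos code; `HYPHENS`, `HYPHEN_TABLE`, and `remove_punctuation_keep_hyphens` are hypothetical names.

```python
import unicodedata

# The hyphen characters listed above.
HYPHENS = [
    "\u058A", "\u05BE", "\u2010", "\u2011", "\u2012", "\u2013",
    "\u2014", "\u2015", "\uFE58", "\uFE63", "\uFF0D",
]

# Translation table mapping each hyphen code point to ASCII hyphen-minus.
HYPHEN_TABLE = {ord(ch): "-" for ch in HYPHENS}


def remove_punctuation_keep_hyphens(text):
    """Normalize hyphens to '-', then strip all other P* and S* characters."""
    text = text.translate(HYPHEN_TABLE)
    return "".join(
        ch for ch in text
        if ch == "-" or unicodedata.category(ch)[0] not in ("P", "S")
    )
```

With this sketch, `"well\u2010known, e\u2014mail!"` becomes `"well-known e-mail"`: the Unicode hyphen and the em dash both survive as hyphen-minus characters, while the comma and exclamation mark are removed.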
