In the Margins

The Scrubber Tool

Preprocessing your texts, what we refer to as "scrubbing", is a critical step when analyzing digitized texts and is a primary step in the Lexos workflow. To encourage conscious consideration of the many small decisions involved, scrubbing options are presented as individual choices. If for no other reason, your careful deliberation over these options makes it easier for you, and for others, to reproduce your analyses in the future.

Scrubbing affects all active documents and cannot be undone, so be sure to deactivate any documents you do not wish to scrub using the Manage tool. If you apply scrubbing and later wish to revert to the unscrubbed version, you will have to upload another copy to Lexos.

The Preview Scrubbing button allows you to view the effects of your options without permanently saving the changes. Only the beginning and end of each document are displayed, separated by ellipses (…). When you are satisfied that you have achieved the desired effect, click the Apply Scrubbing button. Your documents will be scrubbed, and the scrubbed versions will be used by all the other Lexos tools.

The Download Scrubbed Files button downloads a .zip archive containing the scrubbed versions of your files. Keeping these files can facilitate future analyses: uploading previously scrubbed versions lets you skip the scrubbing step and ensures that your analyses are based on the same scrubbing options.

Scrubbing is an algorithm: a series of steps applied in a specific order. If you wish to change that order, you will need to de-select some options, scrub, re-select them, and then scrub again. The default order of operations is provided in The Lexos Scrubber Algorithm section below.

Scrubbing Options

  1. Remove Project Gutenberg boilerplate material: Upon entering the Scrubber page, if you have uploaded a file from the Project Gutenberg website without removing the boilerplate material (i.e., text added by the Project Gutenberg site at the top and license material at the end of the text), you will receive the following warning:

    One or more files you uploaded contain Project Gutenberg licensure material. You should remove the beginning and ending material, save, and re-upload the edited version. If you Apply Scrubbing with a text with Gutenberg boilerplate, Lexos will attempt to remove the majority of the Project Gutenberg Licensure, however there may still be some unwanted material left over.

    Note that if you click the Apply Scrubbing button without removing this extra text, Lexos will attempt to remove the Project Gutenberg boilerplate material at the top and end of the file. However, since Project Gutenberg texts do not have a consistent boilerplate format, we suggest you remove the boilerplate material with a text editor before uploading to Lexos, in order to prevent unwanted text from being included in subsequent analyses (e.g., Project Gutenberg license material appearing in your word counts). If you choose to let Lexos do the work for you, we recommend that you preview the beginning and end of the document in the Manage tool after scrubbing, to ensure that Lexos has not left any boilerplate or deleted any of your text. Lexos' attempt to remove beginning and ending boilerplate material applies only to files from the Project Gutenberg website. When choosing a file from that website, we recommend the “Plain Text UTF-8” version: it is smaller, so it uploads faster, and you will not have to remove any HTML markup.
     

  2. Remove All Punctuation: Lexos assumes that uploaded files may be in any language and automatically converts them to Unicode using UTF-8 character encoding. This enables Lexos to recognize punctuation marks from a wide variety of languages. Every Unicode character has associated metadata classifying its “type”, e.g., as a letter, punctuation mark, or symbol. If the Remove All Punctuation option is selected, any Unicode character in each of the active texts with a “Punctuation Character Property” (that character’s property begins with a ‘P’) or a “Symbol Character Property” (begins with ‘S’) is removed. A guide to Unicode character categories can be found on fileformat.info.

    If Remove All Punctuation is selected, three additional sub-options are available:

    • Keep Hyphens: Selecting this option converts all variations of Unicode hyphens to a single type of hyphen ("-"), which is left in the text. Hyphenated words (e.g., “computer-aided”) will subsequently be treated as single tokens. Note: Dealing with hyphens "correctly" is a challenging task (cf. Hoover, 2012); we hope to increase the functionality of this option in future work.
    • Keep Word-Internal Apostrophes: If this option is selected, apostrophes will be retained in contractions (e.g., can’t) and possessives (Scott’s), but not those in plural possessives (students’ becomes the term students) nor those that appear at the start of a token ('bout becomes the term bout). As with hyphens, there is still more work needed to sharpen Lexos' handling of apostrophes.
    • Keep Ampersands: This option does not treat ampersands as punctuation marks and retains them in the text. Note that HTML, XML, and SGML entities such as &aelig; (æ) are handled separately and prior to the Keep Ampersands option. You can choose how to convert these entities to standard Unicode characters using the Special Characters option below.
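    The Unicode-category check described above can be sketched in a few lines of Python. This is a simplified illustration, not Lexos's actual implementation; the `keep` parameter merely stands in for the Keep Hyphens/Apostrophes/Ampersands sub-options:

```python
import unicodedata

def remove_punctuation(text, keep=""):
    # Drop every character whose Unicode general category begins with
    # 'P' (punctuation) or 'S' (symbol), except characters listed in `keep`.
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch)[0] not in "PS"
    )
```

    For example, remove_punctuation("Hello, world!") returns "Hello world". Note that the real Keep Word-Internal Apostrophes option is position-sensitive (it keeps can't but not students'), which a simple keep-list like this cannot express.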
  3. Make Lowercase: Converts all uppercase characters to lowercase characters so that the tokens The and the will be considered as the same term. In addition, all contents (whether in uploaded files or entered manually) for the Stop Words/Keep Words, Lemmas, Consolidations, or Special Characters options will also have all uppercase characters changed to lowercase. Lowercase is not applied inside any HTML, XML, or SGML markup tags remaining in the text.
  4. Remove Digits: Removes all number characters from the text. As with punctuation marks, any Unicode character in each of the active texts with a “Number Character Property” is removed. For example, this option will remove a Chinese three (㈢) and an Eastern Arabic six (۶) from the text. Note: at present, Lexos does not match real numbers as a unit. For 3.14, for example, Lexos will remove only the 3, 1, and 4; the decimal point will be removed only if the Remove All Punctuation option is also selected. Remove Digits is not applied inside any HTML, XML, or SGML markup tags remaining in the text.
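    The same Unicode-category mechanism handles digits; a minimal sketch (again, not Lexos's actual code):

```python
import unicodedata

def remove_digits(text):
    # Drop every character whose Unicode general category begins with 'N'
    # (Nd, Nl, No), i.e. any "Number Character Property".
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("N")
    )
```

    As described above, remove_digits("3.14") leaves the decimal point behind unless punctuation removal also runs.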
  5. Remove Whitespace: Removes all whitespace characters (blank spaces, tabs, and line breaks), except in HTML, XML, and SGML markup tags. Removing whitespace characters may be useful when you are working with languages such as Chinese that do not use whitespace for word boundaries. In addition, this option may be desired when tokenizing by character n-grams if you do not want spaces to be part of your n-grams. See the section on Tokenization for further discussion of tokenizing by character n-grams. If Remove Whitespace is selected, the following sub-options allow you to fine-tune the handling of whitespace:
    • Remove Spaces: each blank space will be removed.
    • Remove Tabs: each tab character ( \t ) will be removed.
    • Remove Line Break: each newline character ( \n ) and carriage return character ( \r ) will be removed.
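    The three sub-options amount to independent character deletions, roughly as follows (a sketch, assuming the option names map to these parameters):

```python
def remove_whitespace(text, spaces=True, tabs=True, line_breaks=True):
    # Each sub-option strips one class of whitespace character.
    if spaces:
        text = text.replace(" ", "")
    if tabs:
        text = text.replace("\t", "")
    if line_breaks:
        text = text.replace("\n", "").replace("\r", "")
    return text
```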
  6. Scrub Tags: Handles markup tags in angular brackets, such as those used in XML, HTML, and SGML. In markup languages like these, start and end tags like <p>...</p> are used to designate an “element”. Elements may be modified by “attributes” specified inside the start tag. For instance, a text using the Text Encoding Initiative (TEI) specification for XML might contain the markup <p rend="italic">...</p> for a paragraph in italics. When this option is selected, a gear icon will appear. Click the icon to open the tag scrubbing dialog, which allows you to choose one of four options to handle each type of tag, or to handle all the tags at once:
    • Remove Tag Only (default): Removes the start and end tags but keeps the content in between. For instance, <p>Some text</p> will be replaced by Some text .
    • Remove Element and All Its Contents: Removes the start and end tags and all the content in between. For instance, <p>Some text</p> will be removed entirely.
    • Replace Element’s Contents with Attribute Value: Replaces the element with the value of one of its attributes. Since elements may have multiple attributes, Lexos allows you to enter the name of the attribute you wish to use. For instance, if you have some markup like <stage type="setting"> Scene <view>Morning-room in Algernon's flat in Half-Moon Street.</view></stage> , you could use this option to replace the entire scene description with setting if you entered type as the attribute name.
    • Leave Tag Alone: This option will leave the specified element untouched in the text. This is especially useful if you want to scrub other markup tags.

    Troubleshooting scrub tags: Lexos compiles a list of the tags in your documents by first attempting to parse the documents as XML. If the markup is not well-formed XML, it next tries to parse the documents as HTML using Python’s BeautifulSoup library. This generally works, with the proviso that BeautifulSoup automatically converts all tags to lowercase. As a result, the Lexos scrubbing function will miss HTML (and SGML) tags that contain uppercase letters. You may therefore need to check each of the tags Lexos finds against your original document to make sure it does not contain uppercase letters; if Lexos is not scrubbing tags containing capital letters, you will have to change them in an editor before uploading the files. This issue does not affect valid XML files, since XML parsers are case sensitive. If Lexos is unable to compile an accurate list of the tags in your XML file, we recommend testing the file with an XML validator.
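    The four tag-handling strategies can be illustrated with simple regular expressions. This is a rough sketch only; Lexos's actual XML/HTML parsing is more robust, and regexes like these fail on nested elements with the same name:

```python
import re

def scrub_tag(text, tag, mode="remove-tag", attribute=None):
    # Match "<tag ...>content</tag>" pairs for one element name.
    element = re.compile(r"<{0}\b([^>]*)>(.*?)</{0}>".format(tag), re.DOTALL)
    if mode == "remove-tag":            # keep the content between the tags
        return element.sub(lambda m: m.group(2), text)
    if mode == "remove-element":        # drop the tags and everything inside
        return element.sub("", text)
    if mode == "replace-with-attribute":
        def repl(m):
            attr = re.search(r'{}="([^"]*)"'.format(attribute), m.group(1))
            return attr.group(1) if attr else ""
        return element.sub(repl, text)
    return text                         # "leave-alone": element untouched
```

    For example, scrub_tag('<stage type="setting">Scene</stage>', "stage", mode="replace-with-attribute", attribute="type") yields setting.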

Additional Options

  1. Stop Words/Keep Words: “Stop Words” represents a list of words or terms to remove from your documents, and “Keep Words” represents a list of words or terms that should remain in your documents with all other words removed. In both cases, words must be entered as comma-separated or line-separated lists like the following:
    a, some, that, the, which

    or

    a
    some
    that
    the
    which


    You may enter these lists manually in the provided form area or upload a file (e.g. stopWords.txt ). Note that the Make Lowercase option will be applied to your list of stop/keep words if that option is also selected.
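    Parsing and applying such a list might look like the following (a sketch; Lexos's tokenization and matching rules are more involved):

```python
import re

def parse_word_list(raw):
    # Accept comma-separated and/or line-separated entries.
    return {w.strip() for w in re.split(r"[,\n]", raw) if w.strip()}

def apply_stop_keep(tokens, words, mode="stop"):
    # "stop" removes the listed terms; "keep" removes every other term.
    if mode == "stop":
        return [t for t in tokens if t not in words]
    return [t for t in tokens if t in words]
```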
  2. Lemmas: Replaces all instances of the terms in a list with a common replacement term called a “lemma”. Lemmas might be conceived of as dictionary headwords. Using the Lemmas option allows you to count a lemma and all of its variants (such as grammatically inflected forms) as a single term. For instance, in Old English the word for “king”, cyning, may occur as cyninges (possessive) or cyningas (plural), amongst other variants. If each of these forms occurs one time in a text, the Lemmas function will instruct Lexos to treat this as three occurrences of the type cyning. Lemmas are specified by providing a comma-separated list of variants followed by a colon and then the lemma. Multiple lemmas can be specified on separate lines as shown below:
    cyninges, cyningas: cyning
    Beowulfes, Beowulfe: Beowulf

    The list may be entered manually in the form provided or uploaded from a file. Note that the Make Lowercase option will be applied to your list of tokens and lemmas if that option is also selected. To replace individual characters with other characters, you should use the Consolidation option.
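    The "variants: lemma" format can be parsed and applied along these lines (a sketch, not Lexos's actual code):

```python
def parse_lemma_rules(raw):
    # Each line has the form "variant1, variant2: lemma".
    rules = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue
        variants, lemma = line.split(":", 1)
        for variant in variants.split(","):
            rules[variant.strip()] = lemma.strip()
    return rules

def apply_lemmas(tokens, rules):
    # Replace each variant token with its lemma; leave other tokens alone.
    return [rules.get(t, t) for t in tokens]
```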
  3. Consolidations: Replaces a list of characters with a different character, typically to consolidate symbols considered equivalent. For instance, in Old English the common character “eth” (ð) is interchangeable with the character “thorn” (þ). The Consolidations option allows you to merge the two into a single character. Consolidations should be entered in the format ð: þ , where you wish to change all occurrences of ð to þ . Multiple consolidations can be separated by commas or line breaks. Consolidations can be entered manually in the provided form field or uploaded from a file. Note that the Make Lowercase option will be applied to your list of characters if that option is also selected. To replace entire words (terms) with other words, you should use the Lemmas option.
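    A consolidation rule amounts to a character-for-character substitution, conceptually like this (a simplified sketch):

```python
def parse_consolidations(raw):
    # Rules like "ð: þ", separated by commas or line breaks.
    rules = {}
    for chunk in raw.replace("\n", ",").split(","):
        if ":" in chunk:
            src, dst = chunk.split(":", 1)
            rules[src.strip()] = dst.strip()
    return rules

def apply_consolidations(text, rules):
    for src, dst in rules.items():
        text = text.replace(src, dst)
    return text
```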
  4. Special Characters: Replaces character entities with their glyph equivalents. A character entity is a symbolic representation of an actual character symbol (glyph). Entities are used by markup languages like HTML, XML, and SGML when the symbol itself cannot be entered in the editor used to produce the text or when the method of rendering the character is left to independent software like a web browser. For instance, in HTML, the Old English character “ash” (æ) is represented by the entity &aelig; . Since Lexos works entirely with Unicode characters, you will most likely want to replace character entities with their Unicode equivalents prior to further analysis. The Special Characters option can be used to replace an entity like &aelig; with its corresponding Unicode glyph æ. Lexos provides four rule sets of pre-defined entities and their corresponding glyphs:
    • Early English HTML: Transforms a variety of HTML entities used to encode Old English, Middle English, and Early Modern English into their corresponding glyphs.
    • Dictionary of Old English SGML: Transforms SGML entities used by the Dictionary of Old English into their corresponding glyphs.
    • MUFI 3: Transforms entities specified in version 3.0 of the Medieval Unicode Font Initiative (MUFI 3) to their corresponding glyphs.
    • MUFI 4: Transforms entities specified in version 4.0 of the Medieval Unicode Font Initiative (MUFI 4) to their corresponding glyphs.

    Note: Selecting MUFI 3 or MUFI 4 will convert entities specified by the Medieval Unicode Font Initiative (MUFI) to their Unicode equivalents. In this case, the Preview window will be changed to use the Junicode font, which correctly displays most MUFI characters. However, if you download your files after scrubbing, these characters may not display correctly on your computer if you do not have a MUFI-compatible font installed. Information about MUFI and MUFI-compatible fonts can be found on the MUFI website.

    Note: Any special characters that appear inside tags will be modified.

    You may also design your own rule set if you are not using a language covered by one of the pre-defined rule sets. To do this, enter your transformation rules in the provided form field. Each entity should be separated from its replacement glyph by a comma (e.g. &aelig;, æ ). Multiple transformation rules should be listed on separate lines. The Lexomics Project welcomes submissions of new rule sets. Please use the Feedback and Support button in Lexos or click here to contact us about adding a pre-defined rule set to Lexos.
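    A custom rule set of this form amounts to a series of string replacements, for example (a sketch only, not Lexos's implementation):

```python
def apply_entity_rules(text, raw_rules):
    # raw_rules holds one "entity, glyph" pair per line, e.g. "&aelig;, æ".
    for line in raw_rules.splitlines():
        if "," in line:
            entity, glyph = line.split(",", 1)
            text = text.replace(entity.strip(), glyph.strip())
    return text
```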

Replacing Patterns

Sometimes it is necessary to replace a pattern rather than a precise string. For instance, if a document contains multiple URLs like http://lexos.wheatoncollege.edu and http://scalar.usc.edu/works/lexos/ , and you need to strip these URLs, a method is required for matching all URLs without knowing them in advance. One technique is regular expression (regex) pattern matching. Lexos uses regular expressions internally to perform some of its scrubbing options, but, as of version 3.0, it does not provide a way for users to supply their own regular expression patterns when scrubbing. If you need to strip or replace patterns with regular expressions, you will need to do so with a separate script or tool before using Lexos. A useful regular expressions tutorial can be found at RegexOne. Most modern text editors, such as Sublime Text and TextWrangler, accept regular expressions in their search and replace functions and may be a convenient means of performing these actions. We hope to add regular expression pattern matching to Lexos in the future.
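As an illustration of this pre-processing step, a short Python script can strip URLs before upload. The pattern here is deliberately simple; robust URL matching requires a more careful expression:

```python
import re

# Match "http://" or "https://" followed by any run of non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    return URL_PATTERN.sub("", text)
```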

The Lexos Scrubber Algorithm

Lexos scrubs documents by applying rules in the following order:

Whether you click the Preview Scrubbing button or the Apply Scrubbing button, the rules are applied in the same order; Preview simply shows a sample of what will change without permanently modifying the text. Markup tags in angular brackets are not affected by the rules below except rule 4.

1.  Remove Project Gutenberg boilerplate, if present.
2.  Convert stopwords, keepwords, lemmas, consolidations, and special characters to lowercase (the actual text is converted to lowercase later, in step 5).
3.  Apply special character transformations.
4.  Apply markup tag scrubbing rules.
5.  Convert text to lowercase.
6.  Apply consolidation rules.
7.  Apply lemmatization rules.
8.  Apply stopword/keepword lists.
9.  Remove punctuation (hyphens, apostrophes, ampersands).
10.  Remove digits.
11.  Remove whitespace.
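The fixed order matters: for instance, lowercasing (step 5) runs before stopword matching (step 8), so when Make Lowercase is selected, stopword lists match regardless of case. A minimal sketch of such a fixed pipeline, covering only three of the steps (with invented parameter names):

```python
import unicodedata

def scrub(text, make_lowercase=True, remove_punct=True, remove_digits=True):
    # Steps always run in the documented order, regardless of the
    # order in which the options were selected.
    if make_lowercase:                      # step 5
        text = text.lower()
    if remove_punct:                        # step 9
        text = "".join(c for c in text
                       if unicodedata.category(c)[0] not in "PS")
    if remove_digits:                       # step 10
        text = "".join(c for c in text
                       if not unicodedata.category(c).startswith("N"))
    return text
```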
