In the Margins

Rolling Window Analysis

Rolling window analysis is a method of tracing the frequency of features within a designated window of tokens over the course of a document. It can be used to identify small- and large-scale patterns of usage of individual features or to compare these patterns for multiple features.
 
A typical issue in text analysis, especially cluster analysis, is the question of how many documents should be used and how large they should be. Ultimately, one must adopt a system of trial and error to find a “sweet spot” which produces the most meaningful results. This challenge can arise even if whole texts are used, but it is felt most acutely when cutting texts into smaller segments. The result may yield meaningful patterns between segments, but it may also obscure smaller meaningful patterns both within and across them. For methods of manipulating segment size to detect such patterns, see Identifying Robust Clusters.

Rolling window analysis is another method of identifying smaller patterns that can be used in conjunction with cluster analysis or on its own. Unlike cluster analysis, rolling window analysis does not require that we pre-specify boundaries; in fact, the technique allows us to identify possible boundaries or segments within texts. Rolling window analysis tabulates token frequency not in discrete segments but as part of a continuously moving metric. Beginning with the selection of a window, say 100 tokens, rolling window analysis traces the frequency of a token within tokens 1-100, then 2-101, then 3-102, and so on until the end of the document is reached. The result can be plotted as a line graph so that it is possible to observe gradual changes in a token’s frequency as the text progresses. Plotting different tokens on the same graph allows us to compare their frequencies. For instance, the following graph compares the letters þ and ð in the Old English poem Beowulf.

 
The greater preponderance of ð in the second half of the poem is evidence of a change in scribe, but the many variations that occur in both halves may correlate to other phenomena such as different source material. Variations in the use of spellings with þ and ð in Old English helped inspire the development of rolling window analysis as a Lexomic method and a Lexos tool. To find out more with an extended case study, see Rolling Windows Analysis of Old English Orthography.
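As a rough illustration of how a trace like the þ/ð comparison could be computed, the sketch below counts a single character in every full window of a sequence of character tokens. It is not the Lexos implementation: the function name rolling_counts and the sample string are invented for this example, and the string is not a quotation from Beowulf.

```python
def rolling_counts(tokens, target, window_size):
    """Count occurrences of `target` in every full window of `window_size` tokens."""
    if window_size < 1 or window_size > len(tokens):
        return []  # no full window fits in the text
    # Count the first window (tokens 1 to window_size) directly.
    count = sum(1 for tok in tokens[:window_size] if tok == target)
    counts = [count]
    # Slide one token at a time: drop the token that leaves the window,
    # add the token that enters it.
    for i in range(window_size, len(tokens)):
        if tokens[i - window_size] == target:
            count -= 1
        if tokens[i] == target:
            count += 1
        counts.append(count)
    return counts


# Hypothetical character tokens (an invented string, for illustration only).
sample = list("þaðþþaðð")
thorn = rolling_counts(sample, "þ", 4)
eth = rolling_counts(sample, "ð", 4)
print(thorn)  # [2, 2, 2, 2, 1]
print(eth)    # [1, 1, 1, 1, 2]
```

An eight-token text with a four-token window yields five values, one for each full window. Plotting the two lists on the same axes would give a miniature version of the comparison described above.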

Rolling window analysis can be performed on any type of token, including n-grams of characters or words, or even whole lines of text. Different resolutions can be achieved by changing the size of the token window. In addition, different metrics for tabulating the data may be used. Lexos currently provides two:

Rolling Window Average: The number of times a specific token appears in the window, divided by the overall size of the window.

Rolling Window Ratio: The number of times a specific token appears divided by the sum of the appearances of the token and a second token. This metric can be used for comparing two mutually exclusive features.
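The two metrics can be sketched in a few lines of Python. This is a minimal illustration of the definitions above, not the code Lexos uses. The function names and the sample sentence are assumptions made for the example, as is the decision to return None for a window that contains neither token, where the ratio is undefined.

```python
def rolling_average(tokens, target, window_size):
    """Occurrences of `target` in each full window, divided by the window size."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    n_windows = max(len(tokens) - window_size + 1, 0)
    return [
        tokens[start:start + window_size].count(target) / window_size
        for start in range(n_windows)
    ]


def rolling_ratio(tokens, first, second, window_size):
    """Occurrences of `first` divided by occurrences of `first` plus `second`."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    ratios = []
    for start in range(max(len(tokens) - window_size + 1, 0)):
        window = tokens[start:start + window_size]
        a, b = window.count(first), window.count(second)
        # Assumption for this sketch: a window with neither token gets None.
        ratios.append(a / (a + b) if a + b else None)
    return ratios


# Hypothetical word tokens, invented for illustration.
words = "the cat and the dog and a bird".split()
averages = rolling_average(words, "the", 4)
ratios = rolling_ratio(words, "the", "and", 4)
print(averages)  # [0.5, 0.25, 0.25, 0.25, 0.0]
```

The average depends on the window size, whereas the ratio depends only on how the two tokens are balanced against each other within the window. This is why the ratio suits mutually exclusive features such as alternative spellings.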

By default, Lexos performs a forward-looking window analysis, in which the metric cannot be calculated for the final stretch of tokens. For example, with a 100-token window we cannot compute the average frequency of a specific token for windows beginning in the final 99 tokens, because each of those windows would necessarily be smaller than the window used for the rest of the text. Hence a number of tokens equal to the window size minus 1 is omitted from the analysis. Theoretically, it is possible to reverse this using a backward-looking window that begins at the end of the text and moves towards the beginning. In this case, we would lose a number of tokens at the beginning equal to the window size minus 1. We can also use a centered window, in which the first window is centered on a point half a window size into the text. This method leaves undefined sections at both the beginning and the end of the text, but each is only half the size of the one generated by a forward- or backward-looking window. Currently, Lexos does not enable the use of backward-looking or centered windows. To achieve these effects, it is necessary to cut the documents and re-order the resulting segments. In practice, forward-looking windows are typically sufficient unless one is particularly interested in the end of the text.
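The difference between the three alignments comes down to which token positions receive a value. The sketch below makes that bookkeeping explicit. The function defined_range and its alignment parameter are hypothetical names for this example and do not correspond to any Lexos option; as noted above, Lexos itself offers only the forward-looking window.

```python
def defined_range(n_tokens, window_size, alignment="forward"):
    """Return the (first, last) 0-based positions that receive a metric value.

    forward:  the value at position i covers tokens i .. i + window_size - 1
    backward: the value at position i covers tokens i - window_size + 1 .. i
    centered: the window is placed as evenly as possible around position i
    """
    if window_size < 1 or window_size > n_tokens:
        raise ValueError("window_size must be between 1 and n_tokens")
    gap = window_size - 1  # number of positions that never get a full window
    if alignment == "forward":
        return 0, n_tokens - 1 - gap
    if alignment == "backward":
        return gap, n_tokens - 1
    if alignment == "centered":
        left = gap // 2  # undefined positions at the start
        return left, n_tokens - 1 - (gap - left)
    raise ValueError(f"unknown alignment: {alignment}")


for mode in ("forward", "backward", "centered"):
    print(mode, defined_range(1000, 100, mode))
# forward (0, 900)
# backward (99, 999)
# centered (49, 949)
```

In all three cases a 1,000-token text with a 100-token window yields 901 values. Only the location of the 99 undefined positions changes.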

Selecting a Window Size


The window size should be significantly smaller than the total number of tokens to be examined. You can quickly find out the total count for a token in the Statistics tool. The window size limits the resolution of the analysis, allowing us to localize anomalies only to within the given window in which they appear. Smaller windows can improve resolution, but, because the standard deviation of the data increases as the window size decreases, small windows have the potential to amplify the influence of random variations. Larger window sizes tend to smooth the data, eliminating random fluctuations but potentially obscuring smaller features.
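The trade-off between resolution and noise can be explored with synthetic data. The sketch below builds an artificial text in which a token occurs at random about 10% of the time, then measures how widely the rolling average varies for several window sizes. The random text, the 10% rate, the window sizes, and the function name are all assumptions chosen for illustration; nothing here is drawn from a real document or from Lexos.

```python
import random
import statistics


def rolling_average(tokens, target, window_size):
    """Occurrences of `target` in each full window, divided by the window size."""
    n_windows = max(len(tokens) - window_size + 1, 0)
    return [
        tokens[start:start + window_size].count(target) / window_size
        for start in range(n_windows)
    ]


# Synthetic text: "x" appears at random roughly 10% of the time,
# so any variation in the rolling average is pure noise.
random.seed(0)
tokens = ["x" if random.random() < 0.1 else "o" for _ in range(5000)]

# Standard deviation of the rolling average for each window size.
spread = {
    size: statistics.pstdev(rolling_average(tokens, "x", size))
    for size in (20, 100, 500)
}
for size, value in spread.items():
    print(f"window {size:>3}: standard deviation {value:.4f}")
```

Because the underlying rate never changes in this artificial text, every peak and trough in the small-window trace is a random fluctuation. A real feature must stand out against that background to be detected.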

We are not yet able to determine the optimal window size for a given text in the abstract, but a rule of thumb that has worked reasonably well when comparing experiments to control texts has been to use windows of between 100 and 500 words (approximately 20-100 lines in poetry), with a preference towards larger windows for longer texts.

The undefined (unexamined) sections at the end of the document in rolling window analyses should be taken into consideration when choosing window size. For instance, in a 3,000-line poem, a rolling analysis using a 100-line window would exclude about 3% of the information at the end of the text. In a 750-line poem, the same 100-line window would obscure about 13% of the poem. It might also be true that interpolated or intertextual sections are less extensive in shorter texts, and to detect them we need to apply a smaller window.
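The percentages above follow from the rule that a forward-looking window leaves a number of units equal to the window size minus 1 undefined. A small helper (the name excluded_fraction is made up for this example) makes it easy to check other combinations of text length and window size before running an analysis.

```python
def excluded_fraction(total_units, window_size):
    """Share of the text left without a full forward-looking window."""
    if not 1 <= window_size <= total_units:
        raise ValueError("window_size must be between 1 and total_units")
    # The last (window_size - 1) units cannot start a full window.
    return (window_size - 1) / total_units


for lines in (3000, 750):
    share = excluded_fraction(lines, 100)
    print(f"{lines}-line poem, 100-line window: {share:.1%} undefined")
# 3000-line poem, 100-line window: 3.3% undefined
# 750-line poem, 100-line window: 13.2% undefined
```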
