Extracting Linguistic Patterns from Texts with LIWC (“luke”) for Analysis
By Shalin Hai-Jew, Kansas State University
If researchers were looking for a common point-of-entry to understand social and human phenomena, natural language would be a very logical place to start.
Language is a critical aspect of how people interact and socialize, both verbally and in writing. Historically, research has been recorded and codified in language, mostly written. Text is a common denominator with Web data, with text forms of multimedia (such as transcripts of audio and video, alt-texts of images, and so on).
Quite a few software tools enable a range of complex linguistic analytic capabilities. In mainline data analytics research suites, many include some basic linguistic analysis.
Figure 1: A Sample Text Corpus Analysis in LIWC
Linguistic Features from Counting
Linguistic Inquiry and Word Count (LIWC2015) is a simple software tool that merely counts words through a lightweight processor, but its dictionaries and its use in over two decades of empirical research have made it one of the leading (if not the leading) software programs.
To get a sense of its capabilities, it helps to review the contents of its main built-in dictionary, which includes summary dimensions of the ingested text, punctuation marks, function words, affect terms (including sentiment analysis), social terms, cognitive process terms, perceptual processes, biological processes, human drives, time orientation, relativity, personal concerns, and informal language (including “netspeak”).
On the surface, the software offers simple counts in these various categories, but these dictionaries are based on psychometric and other types of research. The dictionaries have been used by researchers in various fields (more on this later). The respective dictionaries have been vetted by trained students and researchers. The validity of the various dimensions are addressed in published research, and many researchers use various approaches to improve internal and external validity of findings based on LIWC linguistic analyses.
While LIWC was originally conceptualized and created in English, there have been non-English versions since its second and third versions. There are downloadable dictionaries in German (2011), Spanish, French, Russian, Italian, and Dutch (2007).
Also, this tool has a built-in capability for users to create their own dictionaries using a simple structure of constructs and then words indicating those constructs. These customized dictionaries may be loaded to run against various text documents and text corpora. (There were four such dictionaries available when this author visited the protected site, and only two seemed to be structured correctly and in a possibly usable way. The others were not coherently set up. Suffice it to say that it helps for builders of dictionaries to know their subject areas well first and then to have a clear conceptualization about how the dictionaries have to be structured for use in LIWC. Custom linguistic analysis dictionaries also should be tested thoroughly before being shared broadly, and how they were created should be documented in the academic research literature, in the same way that any other research instrument is. That said, it’s very helpful that Pennebaker Conglomerates, Inc. does enable people to share their custom dictionaries quite easily as .dic / dictionary files.)
Once the numbers are extracted from LIWC, the data have to be saved out as a .xl, .xlsx, or .csv format file and analyzed in Excel. If there are more complex types of analyses desired, the extracted quantitative data may be restructured for analysis in SPSS or SAS.
A Brief History
Linguistic Inquiry and Word Count (LIWC) was created initially in 1993 at the University of Texas. Dr. James W. Pennebaker, a social psychologist, was interested in examining the therapeutic value of writing and wanted an easy-to-use software tool to capture various linguistic features of texts. Over the next several years, he and Martha E. Francis, then a graduate student and programmer, worked on the software. Much of the focus was on various dimensions of language as reflected in dictionaries.
In 2001, 2007, and 2015, newer versions of LIWC were released. The tool has been quite popular and has resulted in hundreds of academic publications in a variety of fields.
In 2013, Pennebaker gave a talk at TEDxAustin on “The Secret Life of Pronouns,” with the title echoing a 2011 book that he wrote by the same name.
A Skim of the Applied Research
Since the mid-1990s, a range of research has been done using LIWC.
Gender differences. There have been gender differences found in how males and females write and especially in how they use certain function words, emotion words, cognitive words, and social words. (It is not what people might guess.) Of course, there are examples of people of different genders who write cross-gender-wise.
Differences in writing styles by age. Researchers have found differences in how people write at different phases of the human life span.
Authorship. Researchers have studied historical works of literature that involved contested authorship and used the hidden linguistics features to show likelihood of authorship. The styles of famous authors have been mapped using stylometry. Even more complex would be the uses of psycholinguistics for author fingerprinting, which combines the psychological metrics in LIWC with the linguistic ones.
Power and language. Researchers have also been able to identify differences in speech patterns between those with power and status and those without. One of the insights comes from the use of the word “I.” Do those in power tend to use more instances of “I” or fewer? Why or why not? (The answer will be at the bottom.)*
Remote profiling. Linguistic analyses have also been used to profile various writers to understand their personalities (traits) and their mental states and moods (more transitory and ephemeral). These approaches have been applied to historical figures based on speeches and writings, and they have been applied to politicians and political (and other) leaders today.
Suicidal intention. Using various psychometrics built into the built-in dictionary, researchers have studied the differences between the poems by poets who’d committed suicide versus those who didn’t, and also lyrics of songwriters who’d committed suicide vs. those who didn’t—in order to know linguistic indicators or “markers” to look for in order to intervene with a person who may be deeply depressed and at risk of suicide.
Three truths and a lie, and a lie, and a lie. This software has been applied to understand intentionality in the realm of deception. For example, with known works of deception by journalists and authors caught engaging in fraudulent writing (making up journalistic stories, making up on-the-record sources, pretending to be public figures writing biographies), researchers have been able to find measureable differences between honest works by those individuals vs. dishonest ones. The thinking is that people are aware when they’re lying, and they engage in certain behaviors to try to be believed. The thinking is also that deception detection may be helpful in other contexts, such as in written confessions, in anonymous posted threats, and so on.
Baselines. Beyond research design, methodology, and analytics, the setting of baselines is important. There have been some critical central works done that set baseline counts for various types of function words in particular contexts. Some of these have been found to hold across cultures and languages. Others are generalizable in more limited ways. The power of a baseline is that
Based on common practice, the ideas usually is that the more data that is available, the more accurately a “tell” may be found to generalize from. However, I have also seen in various works that researchers will use a single sentence or a single microblogging entry (Tweet) to generalize about the author or the message. In such cases, very sparse data has been used to generalize, which may be quite risky. (The rationale may have been that the analytic standards came from big data even though the findings were applied to a mere sentence or Tweet.)
Some Data Runs with LIWC
The basic process is as follows:
- Theoretical underpinning(s)
- Research design
- Text collection
- Text cleaning (normalization, text searchability in the documents)
- LIWC runs and re-runs (word counts, percentages of words in content categories)
- Analytic conclusions
- Further research within and beyond LIWC
At every step of the process, the decisions made and the actions taken will affect what may be asserted. Data curation is an important step. For example, the types of texts that a researcher has access to and how he or she or they handle the texts will affect the findings from LIWC.
* Those in positions of high-status and social power tend to use “I” less than those who have lower status. The thinking is that those in higher leadership positions tend to think more abstractly and so do not deal with issues of explaining themselves.
Pennebaker, J.W., Booth, R.J., Boyd, R.L., & Francis, M.E. (2015). Linguistic Inquiry and Word Count: LIWC2015. Operator's Manual. Retrieved May 16, 2016, from https://s3-us-west-2.amazonaws.com/downloads.liwc.net/LIWC2015_OperatorManual.pdf. 1 - 22.
About the Author
Shalin Hai-Jew works as an instructional designer at Kansas State University. She may be reached at email@example.com.
|Previous page on path||Cover, page 17 of 26||Next page on path|