Introduction to Text Analysis
With Zoë Wilkinson Saldaña
Lesson Overview:
This session provides learners with a framework for exploring text corpora using computational methods. Through a hands-on introduction to iterative analysis with AntConc and jsLDA, learners will practice posing and pursuing research questions using techniques such as collocation detection and topic modeling. The session introduces the assumptions, benefits, and tradeoffs of various text analysis methods, such as the impact of part-of-speech tagging and the interpretability of topics. It also helps learners identify how additional methods, such as running text analysis in a programming language or visualizing outcomes, can support their inquiry; learners can pursue these further in small group sessions.
Learner Outcomes:
- Learners will be able to describe texts as data and identify the features of a text corpus
- Learners will learn about the exploratory data analysis pipeline and its relevance to text analysis
- Learners will become aware of the types of techniques commonly used in text analysis, from counting methods to machine learning methods
- Learners will be able to use AntConc software to run and configure ngram detection, collocation detection, and related techniques
- Learners will be able to use the jsLDA platform to generate and interpret topic models at an introductory level
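For learners who want to see what n-gram and collocation detection involve outside of AntConc, the following is a minimal Python sketch. It is not AntConc's implementation: the function names (`tokenize`, `ngrams`, `collocates`), the regular-expression tokenizer, the window size, and the sample sentence are all assumptions made for illustration. It ranks collocates by raw co-occurrence count rather than by a statistical association measure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def collocates(tokens, node, window=2):
    """Count words that appear within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Hypothetical sample text, not drawn from the First Person Narratives collection.
sample = "The river was high. We crossed the river at dawn, and the river was cold."
tokens = tokenize(sample)
bigram_counts = ngrams(tokens, n=2)
river_collocates = collocates(tokens, "river", window=2)

print(bigram_counts.most_common(3))
print(river_collocates.most_common(3))
```

Changing `n` or `window` here is roughly analogous to adjusting the n-gram size or the collocate span in a tool's settings, which is one way to see how configuration choices shape the results.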
Tools and Datasets:
- AntConc
- jsLDA: https://mimno.infosci.cornell.edu/jsLDA/
- Documenting the American South’s First Person Narratives collection: https://docsouth.unc.edu/fpn/
- (Either jsLDA default State of the Union corpus or custom dataset - decision forthcoming)
- For small group discussion: Topic Modeling Vogue tool
Outline:
- Introductions, housekeeping notes
- Pair discussion: explore the First Person Narrative text collection. Discuss the following questions:
  - What types of perspectives, experiences, positionalities, etc. does this collection of text capture?
  - What research questions might you be interested in investigating with this data?
  - What limitations or difficulties would this collection pose in pursuing those questions? What’s missing?
- Share out: What did you find? Focus on surfacing questions about (1) text as data, (2) text as relevant to specific people and communities, (3) issues of complexity and limitations of inference, and (4) unexpected things!
- Core Lecture 1:
  - What is Text Analysis?
  - How do you approach texts as data? Why generate secondary data?
  - What is the exploratory data analysis pipeline?
  - Text cleaning and prep concerns: stop words, stemming, etc.
  - (To be fleshed out, PPT forthcoming)
- Introduce AntConc software & make sure it’s up and running on student computers
- Hands-on: Generate concordance plots with First Person Narratives + AntConc (download First Person Narratives)
- Hands-on: Generate ngrams and collocates with AntConc
- Pair discussion: Explore the Topic Modeling Vogue project. Discuss the following questions:
  - What are the central ideas or questions this project communicates to its users?
  - What are the “texts” in this instance? And what kind of analysis/secondary data is being generated about those texts?
  - What research questions might you be able to pose and pursue with this tool?
- Share out:
- Core Lecture 2:
  - Why machine learning and natural language processing methods?
  - What is topic modeling?
  - What other ML methods might be relevant (classification, clustering, etc.)?
  - What is part-of-speech tagging? Why is it relevant?
  - What are feature extraction and Named Entity Recognition? Why are they relevant?
  - **focus on entity extraction importance**
  - Sentiment analysis?
  - (We will not have time to explain many of these in depth, but the goal is to provide a workable starting point)
- Hands-on: Try out jsLDA together
  - If there is time, give the option to upload and use an additional dataset (unlikely but just in case!)
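
Two of the steps in the outline above, stop-word removal during text preparation and concordance searching, can also be sketched in Python for learners who continue into programming-based text analysis. This is a rough illustration under stated assumptions, not how AntConc works internally: the stop list, the function names (`remove_stop_words`, `kwic`, `hit_positions`), and the sample passage are all hypothetical.

```python
import re

# A tiny illustrative stop list (an assumption); real stop lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "i", "my"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def hit_positions(tokens, keyword):
    """Relative position (0.0 to 1.0) of each hit: the data behind a concordance plot."""
    last = max(len(tokens) - 1, 1)  # avoid dividing by zero on one-token texts
    return [i / last for i, tok in enumerate(tokens) if tok == keyword]


# Hypothetical passage, not drawn from the First Person Narratives collection.
passage = ("I was born in the spring. My mother worked in the fields, "
           "and I worked beside my mother.")
tokens = tokenize(passage)

print(remove_stop_words(tokens))
for left, word, right in kwic(tokens, "mother", width=2):
    print(f"{left:>12} | {word} | {right}")
print(hit_positions(tokens, "mother"))
```

Note that the concordance is run on the unfiltered tokens: removing stop words first would change which words count as neighbors, which is one of the text-preparation tradeoffs raised in Core Lecture 1.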