Digital Humanities Research Institute: Binghamton 2019
Supported by the National Endowment for the Humanities and GC Digital Initiatives at The Graduate Center, City University of New York

Introduction to Text Analysis

With Zoë Wilkinson Saldaña

Lesson Overview:

This session will provide learners with a framework for exploring text corpora using computational methods. Through a hands-on introduction to iterative analysis with AntConc and jsLDA, learners will pose and pursue research questions using techniques like collocation detection and topic modeling. The session will introduce the assumptions, benefits, and tradeoffs of various methods of text analysis, such as the impact of part-of-speech tagging and the interpretability of topics. The session will also help learners identify how additional methods can support their inquiry, e.g. running text analysis in a programming language or visualizing outcomes, which they can pursue further in small group sessions.

Learner Outcomes:

Learners will be able to describe texts as data and the features of a text corpus
Learners will learn about the exploratory data analysis pipeline and its relevance to text analysis
Learners will become aware of the types of techniques commonly used in text analysis, from counting methods to machine learning methods
Learners will be able to use AntConc software to run and configure ngram detection, collocation detection, and related techniques
Learners will be able to use the jsLDA platform to generate and interpret topic models at an introductory level

Tools and Datasets:

AntConc
jsLDA: https://mimno.infosci.cornell.edu/jsLDA/
Documenting the American South’s First Person Narratives collection: https://docsouth.unc.edu/fpn/
(Either jsLDA default State of the Union corpus or custom dataset - decision forthcoming)
For small group discussion: Topic Modeling Vogue tool

Outline:

Introductions, housekeeping notes
Pair discussion: Explore the First Person Narratives text collection. Discuss the following questions:
- What types of perspectives, experiences, positionalities, etc. does this collection of text capture?
- What research questions might you be interested in investigating with this data?
- What limitations or difficulties would this collection pose in pursuing those questions? What’s missing?
Share out: What did you find? Focus on surfacing questions about (1) text as data, (2) text as relevant to specific people and communities, (3) issues of complexity/limitations of inference, and (4) unexpected things!
Core Lecture 1:
- What is Text Analysis?
- How do you approach texts as data? Why generate secondary data?
- What is the exploratory data analysis pipeline?
- Text cleaning and prep concerns: stop words, stemming, etc.
- (To be fleshed out, PPT forthcoming)
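To make the cleaning and prep concerns above concrete, here is a minimal Python sketch of tokenizing, stop-word removal, and stemming. Everything in it is an illustrative assumption rather than part of the session materials: the stop-word list is a tiny hand-picked sample, `crude_stem` is a toy suffix stripper (not a real stemming algorithm such as Porter's), and the sample sentence is invented rather than drawn from the First Person Narratives collection.

```python
import re

# A tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "i", "my"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop very common function words that carry little topical meaning."""
    return [t for t in tokens if t not in stop_words]


def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Invented sample sentence, used only for illustration.
sample = "I was born in the mountains and walked to the fields"
tokens = tokenize(sample)
content_words = remove_stop_words(tokens)
stems = [crude_stem(t) for t in content_words]
print(stems)  # ['born', 'mountain', 'walk', 'field']
```

Each step discards information (case, function words, inflection), so the choices here shape every count or model built on top of them.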
Introduce AntConc software & make sure it’s up and running on student computers
Hands-on: Generate concordance plots with First Person Narratives + AntConc (download First Person Narratives)
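For learners curious about what a concordance tool computes, the following Python sketch shows two related ideas: keyword-in-context (KWIC) lines, and the relative position of each hit within a text, which is the kind of information a concordance plot displays. This is a simplified illustration under stated assumptions, not AntConc's implementation; the function names `kwic` and `hit_positions` are hypothetical, and the sample text is invented.

```python
import re


def word_tokens(text):
    """Split text into word tokens, keeping original capitalization."""
    return re.findall(r"[A-Za-z']+", text)


def kwic(text, keyword, width=2):
    """Return (left context, hit, right context) for each keyword match."""
    tokens = word_tokens(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def hit_positions(text, keyword):
    """Return each hit's position as a fraction of the text length in tokens."""
    tokens = word_tokens(text)
    return [i / len(tokens) for i, tok in enumerate(tokens)
            if tok.lower() == keyword.lower()]


# Invented sample text, used only for illustration.
sample = ("The river was wide. We crossed the river at night "
          "and followed the river north.")

for left, hit, right in kwic(sample, "river"):
    print(f"{left:>14} | {hit} | {right}")

print(hit_positions(sample, "river"))
```

Reading the KWIC lines side by side is a quick way to see how a word is used; the positions show whether its uses cluster in one part of a text or spread throughout.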
Hands-on: Generate ngrams and collocates with AntConc
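As a rough picture of what n-gram and collocate detection involve, here is a minimal Python sketch that counts bigrams and counts the words appearing within a fixed window around a node word. It is an illustration only: it uses raw counts, whereas tools like AntConc can also rank collocates with statistical measures, and the function names and the invented sample tokens are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def collocates(tokens, node, window=2):
    """Count words that occur within `window` tokens of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Invented sample, already lowercased and tokenized for simplicity.
tokens = "we crossed the river at night and followed the river north".split()

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))  # [(('the', 'river'), 2)]

river_collocates = collocates(tokens, "river", window=1)
print(river_collocates["the"])  # 2
```

Note how "the" dominates both results: without stop-word filtering or a statistical measure, frequent function words tend to crowd out more interesting collocates.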
Pair discussion: Explore the Topic Modeling Vogue project. Discuss the following questions:
- What are the central ideas or questions this project communicates to its users?
- What are the “texts” in this instance? And what kind of analysis/secondary data is being generated about those texts?
- What research questions might you be able to pose and pursue with this tool?
Share out:
Core Lecture 2:
- Why machine learning and natural language processing methods?
- What is topic modeling?
- What other ML methods might be relevant (classification, clustering, etc.)?
- What is part-of-speech tagging? Why is it relevant?
- What are feature extraction and Named Entity Recognition? Why are they relevant?
  - **focus on entity extraction importance**
- Sentiment analysis?
- (We will not have time to explain many of these in depth, but we hope to provide a workable starting point)
Hands-on: Try out jsLDA together
- If there is time, give the option to upload and use an additional dataset (unlikely but just in case!)