Data for Humanists
Data
'Data' can be a difficult term for humanists. As Miriam Posner of the Department of Information Studies at UCLA explains in "Humanities Data: A Necessary Contradiction:”
When you call something data, you imply that it exists in discrete fungible units; that it is computationally tractable; that its meaningful qualities can be enumerated in a finite list; that someone else performing the same operations on the same data will come up with the same results. This is not how humanists think of the material they work with.
Despite discomfort with the term, humanists today engage with data on a regular basis. The data that shapes our professional lives can be defined as "a digital, selectively constructed, machine-actionable abstraction representing some aspects of a given object of humanistic inquiry" [1]. As this definition suggests, the state of our data - and its utility for research - depends on the construction process. For analogue objects, the process begins with digitization. From there, both digitized and born-digital objects need to be described, curated, structured and/or annotated to facilitate human and computational analysis.
Data and the Digital Humanities
In the digital humanities, there are two basic approaches to working with data. The first approach is rooted in the field of 'big' data research. Oriented towards the social sciences, big data research in the digital humanities focuses on "large or dense cultural datasets, which call for new processing and interpretation methods" [2]. The second approach focuses on constructing 'small' or 'smart' datasets that critically engage with - and frequently challenge - traditional classification systems, archives, and cannons. Whereas the first approach uses computational methods to perform macro-level analyses, the second uses web-based technologies to redress absences and biases in "how people process and document human cultures and ideas" [3]. Explore the project websites for selfiecity and The Caribbean Memory Project, which are pictured on the right. Explain how each project represents its respective approach to working with data in the digital humanities.
Activity
In "Big? Smart? Clean? Messy? Data in the Humanities," Christof Schöch argues that digital humanists need to combine the big and smart data approaches:
We need smart big data because it can not only adequately represent a sufficient number of relevant features of humanistic objects of inquiry to enable the level of precision and nuance scholars in the humanities need, but it can also provide us with a sufficient amount of data to enable quantitative methods of inquiry that help us transgress the limitations inherent in methods based on close reading strategies. To put it in a nutshell: only smart big data enables intelligent quantitative methods.
Explore Voyages: The Trans-Atlantic Slave Trade Database. In groups, discuss whether - and if so to what extent - the project successfully combines the big and smart data approaches. Do you agree with Schöch that digital humanists should combine these approaches? Why or why not?
Constructing Data
The rest of this workshop will focus on constructing data for digital humanities work, beginning with how to describe and organize digitized and/or born-digital objects.
Recommended Readings and Tutorials
Seth van Hooland, Ruben Verborgh, and Max De Wilde, "Cleaning Data with OpenRefine," The Programming Historian 2 (2013), https://programminghistorian.org/en/lessons/cleaning-data-with-openrefine.
Social Science Research Council, "Thinking about Data," SSRC Labs: Doing Digital Scholarship (2018), https://labs.ssrc.org/dds/articles/6-thinking-about-data/.