Capture
This module explores what we mean when we say data, moving beyond common assumptions that understand data as raw and objective to an understanding of data as a product of capture. The module includes theories of capture, relevant readings, an in-class activity on "capturing" data, and a lab session using Python and Google Colab.
What is data? How is data captured?
Philip Agre introduced the concept of capture to communication and information studies in his 1994 article "Surveillance and Capture: Two Models of Privacy". Agre notes that capture is a common term in the vocabulary of software engineers and computing professionals when discussing data, and he contrasts it with the concept of surveillance. When we talk about surveillance, we typically think of people or devices watching us (such as cameras and policing). Agre suggests we can also understand these technologies as performing a form of capture. Capturing involves generating data about an object or activity and reorganizing it so it fits a standard language that computers can process. Agre argues that through this reorganization, we ourselves are "captured" by these systems: both as data and as subjects who come to rely on the system to operate and to generate data.
Thinking of data as captured rather than collected enables a more critical interpretation of what happens when something becomes data. Before anything can exist and function as data, it must first be imagined as data; in this sense, data requires our participation to exist. We use our perception, imagination, and available tools to define what can be data. Capturing data therefore entails interpretation: we create a data object by combining our perception of the world with various instruments (e.g., diagrams, documentation, code). Capturing things as data involves deciding what goes in and what stays out, what matters, and what describes an object, often in ways constrained by the tools used.
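The point about capture as decision-making can be made concrete in code. The following sketch (a hypothetical example, not from the module's lab materials) defines a record for capturing "a person in a classroom" as data. Every field is a choice about what matters, and every type constrains what can be said:

```python
from dataclasses import dataclass

# A hypothetical record for capturing "a person in a classroom" as data.
# Each field is a decision: we keep name, age, and seat, and discard
# everything else (mood, conversation, context). The types also constrain
# what can be expressed: 'age' as an int cannot capture "about thirty".
@dataclass
class PersonRecord:
    name: str
    age: int
    seat: str  # e.g., "row 1, seat 4"

observation = PersonRecord(name="Ada", age=36, seat="row 1, seat 4")
print(observation)
```

Nothing about the person forces this schema; a different discipline, purpose, or tool would capture a different data object from the same scene.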
Analogy to photography
The process of capturing data can be compared to taking pictures. A camera, as a tool, limits what can be framed in a picture, and both the photographer and the camera emphasize some aspects over others. Just as a photo might not precisely reflect reality (e.g., northern lights appearing brighter due to camera settings, or edited to appear better with filtering), data collection tools influence what can be captured and what is emphasized.
Data diagrams as a form of language
Technical diagrams used to describe data (such as the entity-relationship diagram, or ERD) are a communication instrument and a form of framing. They are a language that people in the tech industry are trained to understand and use to communicate with each other. They are also a form of abstraction and formalization. Diagrams have properties that allow certain things to be expressed while limiting others.
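The kind of formalization an ERD expresses can be sketched in code. In this minimal, hypothetical example, two entities become tables and their relationship becomes a foreign key; the notation can state "a photo belongs to one photographer," but it cannot express why the photo was taken or what it meant:

```python
import sqlite3

# A minimal sketch of an ERD made executable: entities as tables,
# the relationship as a foreign key. The schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE photographer (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE photo (
    id INTEGER PRIMARY KEY,
    caption TEXT,
    photographer_id INTEGER REFERENCES photographer(id)
);
""")
conn.execute("INSERT INTO photographer (id, name) VALUES (1, 'Ada')")
conn.execute(
    "INSERT INTO photo (id, caption, photographer_id) VALUES (1, 'Northern lights', 1)"
)
row = conn.execute("""
    SELECT photographer.name, photo.caption
    FROM photo JOIN photographer ON photo.photographer_id = photographer.id
""").fetchone()
print(row)  # ('Ada', 'Northern lights')
```

The diagram's language makes some relations easy to state and leaves everything else, by design, unsayable.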
The impact of disciplinary norms and tools
Every discipline has its own norms and tools for capturing and communicating data. For example, in communication studies, we use surveys, infographics, or tools like NVivo to collect and organize data. These choices shape the data at various points of interaction and production.
Etymology of the word data
Scholars have questioned the appropriateness of the word "data" to describe what we today call data. For example, in the 1950s, sociologist Howard Jensen argued that the use of the word "datum" (Latin for "that which has been given") was an "unfortunate accident of history". He argued that, in reality, science deals with "captum" – "that which has been taken" or selected by the scientist according to their purpose, thereby acknowledging the scientist's active role in framing data (Kitchin, 2014). Similarly, Johanna Drucker (2011) proposed the term "capta," from the Latin for "taken," to emphasize that data is captured rather than given.
History and use of the word data
Daniel Rosenberg's chapter, "Data before the fact," in the book Raw Data is an Oxymoron, complements these arguments by revealing the etymology and historical meaning of data:
Historical Evolution of "Data"
- 17th Century: The term "data" was used in mathematics (for given premises, like X=3) and theology (to refer to the unquestionable word of God in scriptures). In both cases, it represented a given foundation for an argument that was not susceptible to questioning and not necessarily factual.
- 18th Century: A significant shift occurred where "data" gained a scientific connotation, spreading to disciplines like medicine and finance. It began to refer to what is obtained as a result of investigation or experimentation, rather than an unquestionable premise. This meant "data started to come before the fact," implying that data was needed to determine if something was true.
Distinguishing Fact, Evidence, and Data:
- Fact: Something that occurs or exists; the concept has ontological roots (e.g., the fact that everyone is present in the same room).
- Evidence: Something observed that indicates an event happened; the concept has epistemological roots (e.g., disarranged tables serving as evidence that people were in the room). Evidence allows for knowledge production based on information.
- Data: While data can be used as evidence to prove a theory, Rosenberg argues that data's existence is independent of facts and truth. As he states, "False data is data nonetheless". An experiment with a faulty instrument still yields data, but that data is not necessarily accurate or usable as evidence. Equating data with evidence should therefore be approached with caution.
- Rosenberg, therefore, concludes that "Data has no truth." This does not mean data cannot be used to prove truth, but rather that data itself is not inherently representative of the truth. This requires careful consideration of the processes used to derive conclusions from data.
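Rosenberg's point about faulty instruments can be illustrated with a toy simulation (a hypothetical example, not from the reading): a miscalibrated sensor still produces data, and those readings exist as data regardless of their accuracy:

```python
# A toy illustration: a miscalibrated thermometer still produces data.
# The readings below are data in Rosenberg's sense, yet they are not
# accurate measurements, so they cannot simply be treated as evidence.
true_temperature_c = 20.0
calibration_error_c = 5.0  # hypothetical faulty offset

readings = [true_temperature_c + calibration_error_c for _ in range(3)]
print(readings)  # [25.0, 25.0, 25.0] -- data, but not truth
```

Whether these numbers count as evidence depends on scrutinizing the instrument and the capture process, not on the numbers themselves.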
Data and AI
Kate Crawford (2021) highlights the decontextualization of data in the age of AI, where the complex process of data construction is often overlooked. There is a rush to gather data for AI systems with little discussion about their quality, context, or what they truly represent (e.g., mugshots treated as "ground truth" for facial recognition without understanding the circumstances of the people being arrested).
Understanding the history of the word “data” is crucial because contemporary computational usage, particularly in AI, often treats “everything as data,” thereby losing sight of the historical tensions surrounding fact, truth, and evidence. Systems like facial recognition use datasets (e.g., mugshots) as “ground truth” without considering context or the truth embedded within them, sometimes perpetuating flawed theories.
Conclusion
The module concludes by challenging common assumptions about data and proposing a more critical understanding:
- Common assumptions: Data is often assumed to be transparent, self-evident, the truth, objective, and something that can be "collected" in nature.
- Data is:
- Contested: Not self-evident or transparent.
- Framed: Shaped by the various tools used to make sense of the world as data.
- Contextual: Possessing a context that can be lost as it circulates.
- Social: Dependent on humans, social processes, imagination, and interpretation to exist.
As Gitelman and Jackson (2013) put it, the phrase "raw data" is an oxymoron: data is never truly raw; it is always "cooked".
This page has paths:
- Critical and Creative Data Studies, by Carina Albrecht