Future Look: Data Repositories for NVivo-based Data Sets?
Researchers who use quantitative methods in the social sciences often publish their data sets to online repositories at the time they publish the work based on those data. These datasets are original data with a clear provenance, placed in formats that are easily ingestible into any number of quantitative data analysis software tools. (One assumption is that a researcher's work may be checked for reproducibility and repeatability, an assumption that is not present in any precise sense in qualitative, mixed methods, or multi-method research. Another is that the datasets may be used for further research, given the different backgrounds of the respective researchers and their access to other datasets that may have cross-referencing value.) Some universities host dataverses or data repositories based on the work of their faculty.
Datasets are often scrubbed of human identifiers ("personally identifiable information") and "noise" before their release into public spaces. They are shared generally for two reasons: (1) to enable other researchers to test the findings of the prior researcher who originated the dataset, and (2) to enable other researchers to surface new findings from the released data (possibly using new methods, new technologies, or new combinations of the two).
Open-source data repositories tend mostly to share quantitative data, such as those specializing in government data and map data.
Currently, there is no online space per se for the release and distribution of NVivo datasets, which are based on qualitative and mixed methods research. Part of the concern is that raw qualitative data raise privacy issues: the data may be used to re-identify research participants. Another issue is the difficulty of trying to "replicate" findings from qualitative or mixed methods data, given the differences in methodologies and their underlying theories. NVivo does not generally enable scrubbing of ingested data, so if identifiers were included initially, they would likely remain throughout the use of the software. Also, integrated secondary sources (already-published articles) would potentially be "re-published" and "distributed" with the release of NVivo projects, which may raise intellectual property and copyright issues.
Proprietary (.nvp and .nvpx) Project Files
Also, NVivo datasets would require NVivo to open and access, given that these are proprietary .nvp files. (By contrast, most quantitative data are in non-proprietary file formats, or may be easily converted to them, which makes the data much easier to query and manipulate using a variety of software tools.) Researchers may use NVivo to output some table data in .csv, .xls, or .xlsx formats that are more easily manipulable in other software programs, but entire projects remain in the proprietary .nvp format.
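Once table data has been exported from NVivo to a plain format like .csv, it can be handled by any general-purpose tool. The sketch below, using only the Python standard library, reads a hypothetical coding-summary export; the column names ("Code," "Files," "References") are illustrative assumptions, since the actual headers depend on the export chosen in NVivo.

```python
import csv
import io

# Hypothetical excerpt of an NVivo coding-summary .csv export;
# real column names depend on the export selected in the software.
exported = io.StringIO(
    "Code,Files,References\n"
    "Trust,4,12\n"
    "Privacy,6,21\n"
    "Risk,3,7\n"
)

rows = list(csv.DictReader(exported))

# Once in plain CSV, the data can be queried with any tool,
# e.g., totaling coded references across all codes:
total_refs = sum(int(row["References"]) for row in rows)
print(total_refs)  # 40
```

The same exported file could just as easily be loaded into a spreadsheet or a statistics package, which is the portability advantage non-proprietary formats provide.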
Some Thoughts about Preparing an NVivo Project for Purposeful Public Sharing
As a thought experiment, it may be helpful to consider how to prepare an NVivo project for eventual public sharing upon publication. The basic idea is to know what explicit and implicit data are in the project, and what may be learned directly or inferred indirectly from it. Then, control for what may be seen by others without compromising your data or your codebook. Generally, it is important to exploit your data as fully as possible before moving to public sharing (IMHO).
I am actually not confident that a person can set up a full-blown research project and check all the boxes for safe information sharing through the sharing of an NVivo project...but I am open to being proven wrong if anyone wants to have a go at it, in a friendly way.
Project setup:
- It seems to make optimal sense to clean data of any personal identifiers before anything is ingested into an NVivo project.
Project event logs:
- There may be data leakages from the event log.
Sensitive codebooks:
- Be careful about any sharing of codebooks with identifiers included. The coding you want has to do with themes, not identities per se.
Classification sheets (applied to case nodes):
- Be careful about re-identification of participants from classification sheets applied to case nodes.
- Be careful about the potential for cross-referencing contents to re-identify individuals.
Ingested content and metadata:
- There may be data leakages from metadata (like EXIF data in digital imagery) and other embedded information.
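EXIF segments in JPEG files can carry GPS coordinates, timestamps, and device identifiers. The standard-library sketch below only flags whether a JPEG contains an APP1/EXIF segment, so that such files can be cleaned with a dedicated tool before ingestion; it does not do the stripping itself, and the sample byte strings are synthetic, not real image data.

```python
import struct

def has_exif(jpeg_bytes: bytes) -> bool:
    """Return True if the JPEG data contains an APP1/EXIF segment."""
    if jpeg_bytes[:2] != b"\xff\xd8":          # SOI marker: not a JPEG
        return False
    i = 2
    while i + 4 <= len(jpeg_bytes):
        marker, length = struct.unpack(">HH", jpeg_bytes[i:i + 4])
        if marker == 0xFFE1 and jpeg_bytes[i + 4:i + 10] == b"Exif\x00\x00":
            return True
        if not (0xFF01 <= marker <= 0xFFFE):   # ran off the marker segments
            return False
        i += 2 + length                        # skip to the next segment
    return False

# Minimal synthetic examples (not real image data):
with_exif = b"\xff\xd8\xff\xe1\x00\x10Exif\x00\x00" + b"\x00" * 10
without = b"\xff\xd8\xff\xdb\x00\x04\x00\x00"
print(has_exif(with_exif), has_exif(without))  # True False
```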
Physical maps (from social media platforms):
- Some excerpted information from social media platforms will include physical maps of locations of accounts, along with identifiers (or at least name handles).
- There are also social networks extracted, with name handles of the respective accounts.
- These can be cross-referenced with in-world data to possibly re-identify participants.
- Depending on the social media platform, there may be additional data that may ride with the downloaded files from NCapture. (I have not explored these sufficiently yet to see what rides.)
Imagery / photos:
- Photos of people can be re-identified to a person because of the prevalence of facial recognition software, online reverse image searches, and the broad mapping of the WWW and Internet. This applies to videos as well: videos and photos that show people's faces may be as good as identifying them outright.
- Social media sleuths also enable re-identification of people. Further, there are doxing ("documenting") practices that involve the spillage of private personal information in various modalities: photos, video, voice recordings, and others.
There could well be other data and other paths to possible re-identification and data compromise.
A Literate Programming Approach
A current movement in quantitative research is to enable authors to present research online as a stream of human-readable text and machine-readable code, woven together so that readers may access the analytical data (at minimum) as well as the modeling code, and so that the data analysis may be verified. It is possible that such approaches may flow over to qualitative and mixed methods research. One popular engine for such dynamic report creation is knitr (pronounced "knitter"). This approach mixes a documentation language for human readability with a programming language for machine readability, and as such, it bridges some of the ambitions of the Semantic Web (which envisions a Web that is both human- and machine-readable).
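The "weaving" idea can be illustrated with a toy sketch: prose lines pass through unchanged, while code chunks are executed and replaced by their output in the rendered document. (knitr itself operates on R Markdown documents; this Python fragment, with its made-up `>>> ` chunk convention, only mimics the concept.)

```python
import contextlib
import io

# A toy literate-programming source: prose mixed with one code chunk,
# marked here (arbitrarily) by a ">>> " prefix.
source = """\
The sample has three cases.
>>> print(sum([4, 6, 11]))
The total coded references appear above.
"""

woven = []
for line in source.splitlines():
    if line.startswith(">>> "):              # machine-readable chunk
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(line[4:])                   # run the chunk
        woven.append(buf.getvalue().rstrip())  # weave in its output
    else:                                    # human-readable prose
        woven.append(line)

print("\n".join(woven))
```

In a real knitr workflow the document, the code, and the computed results stay synchronized by construction, which is what makes the analysis verifiable by readers.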
Additional Online Dataset Exploration
One of the first articles to address this was by Dr. Lisa Cliggett in "Qualitative Data Archiving in the Digital Age: Strategies for Data Preservation and Sharing." This was published in The Qualitative Report (TQR, Vol. 18, 1-11) in 2013.
Cornell University links to a number of Internet data sources for social scientists through the Cornell Institute for Social and Economic Research. The Inter-university Consortium for Political and Social Research (ICPSR) is another source. The Pew Research Center offers datasets on social and demographic trends.
Various commercial companies will release scrubbed datasets of internally collected data for researcher exploration. Others have created web-based application programming interfaces to enable access to extracted "shadow datasets" of some limited data; others, similarly, enable access to datasets with suppressed data values. These datasets may often only be used under certain legal constraints.
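One common disclosure-control technique behind such restricted releases is small-cell suppression: any tabulated count below a threshold is masked before the data leave the organization. The sketch below is illustrative only; the threshold value, group names, and counts are assumptions, not drawn from any actual release policy.

```python
# Minimum cell size before a count may be published; the value 5 is a
# commonly cited illustrative threshold, not a universal rule.
THRESHOLD = 5

# Hypothetical tallies of participants by residence type.
counts = {"urban": 42, "suburban": 17, "rural": 3}

released = {
    group: (n if n >= THRESHOLD else "<5")  # suppress small cells
    for group, n in counts.items()
}
print(released)  # {'urban': 42, 'suburban': 17, 'rural': '<5'}
```

Suppression of this kind is one reason released "shadow datasets" can be safely shared yet only partially reproduce analyses run on the full internal data.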
An Idea
Perhaps someone will create an automated way to strip identifiers from an NVivo project...extract key data...and report out with accuracy. Ha ha!
As in machine learning, this would mean pulling out data patterns from multi-modal data, but at the project level.