Curation Workflows and Checklists
Curation Workflows
Curation is the process of preparing data for sharing and for preservation or archiving. It typically reviews whether the data are operational, understandable, and discoverable. For example, when curating, you may check that files are not corrupt and are in suitably accessible formats. Curation is often left to the last minute; do your best not to let that happen. Curation is also not the same as verification, which is a separate set of steps concerned with scientific quality and the accuracy of file outputs, and which may involve rerunning all the experiments in the study. Here’s a lightning talk (watch 13:12 – 20:21; alternatively, see video below) about the difference between curation and verification.
Various workflows and checklists have been created to help streamline the curation process. If curation and documentation have been happening throughout the research data process, curation should go smoothly. If they have not, you will likely encounter long pauses between each check in the curation workflow list.
ICPSR is a repository with dedicated staff who curate data for researchers before the data files are deposited. In this webinar (watch from 10:23 – 17:38; alternatively, see video below), they describe their curation process, including risk disclosure (jump to Risk Assessment and De-identification for more).
For those of us who guide others on curation rather than doing the curation and anonymization ourselves, Pisani et al. (2019) offer a case study of a research group at Crisis Text Line, a nonprofit company, that collected data from a texting hotline for people in crisis. The case study details how the group formed a data ethics collaboration committee, then identified and launched a protocols model and appropriate technical solutions for sharing these data ethically.
Another study, by O’Donnell and Brundy (2022), details the risk assessment process as a collaborative workflow model.
Checklists
Data Curation Network created the C.U.R.A.T.E.D. checklist of steps to perform on data to ensure they are ready for deposit. It is a significant workflow source for data curators. Included in this version are key ethical considerations for each step of the checklist. CURATED refers to:
- Check files and read documentation (risk mitigation, file inventory, appraisal/selection)
- Understand the data (or try to), if not… (run files/environment, QA/QC issues, readme)
- Request missing information or changes (tracking provenance of any changes and why)
- Augment metadata for findability (DOIs, metadata standards, discoverability)
- Transform file formats for reuse (data preservation, conversion tools, data viz)
- Evaluate for FAIRness (licenses, responsibility standards, metrics for tracking use)
- Document your curation activities (Curator Log, correspondence)
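As an illustration of the first step (checking files and building a file inventory), here is a minimal Python sketch. It is not part of the CURATED checklist itself. The function names `sha256_of` and `build_inventory` and the `PREFERRED_EXTENSIONS` list are hypothetical, and the list of "preferred" formats is an assumption that should be replaced with your repository's own format policy. The sketch records each file's size and SHA-256 checksum and flags empty files and formats that may need converting.

```python
import hashlib
from pathlib import Path

# Assumption: a hypothetical list of open, widely readable formats.
# Replace it with the formats your repository actually recommends.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".json", ".pdf", ".md"}


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root: Path) -> list[dict]:
    """List every file under root with its size, checksum, and any flags."""
    inventory = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        size = path.stat().st_size
        flags = []
        if size == 0:
            # An empty file often signals a failed export or transfer.
            flags.append("empty file")
        if path.suffix.lower() not in PREFERRED_EXTENSIONS:
            flags.append("consider converting format")
        inventory.append({
            "file": path.relative_to(root).as_posix(),
            "bytes": size,
            "sha256": sha256_of(path),
            "flags": flags,
        })
    return inventory
```

Calling `build_inventory(Path("my_dataset"))` on a dataset folder (the folder name is a placeholder) returns one record per file. Saving those records alongside the Curator Log lets a later check confirm, by comparing checksums, that files have not changed or been corrupted since curation. A checksum match only shows that a file is unchanged; it does not show that the file opens correctly or that its contents are understandable.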
The Poverty Action Lab also provides a curation checklist of steps.
Besides these operational checklist steps, here are some more holistic questions to ask yourself as you are depositing datasets:
- How long will the data exist in this repository?
- Did you get consent from your participants for subsequent data use?
- What sort of ethical responsibilities will future users have if they want to reuse your data?
- How will you ensure that appropriate data provenance (i.e. the earliest known origin of the final data: your repository data record) and ownership (i.e. you and/or the human subjects’ community) are maintained if future users want to reuse your data?
- Could the data be re-identified after de-identification? Would not de-identifying the data expose participants to risk?
Sources:
- Choate, R., Adeniyi, K., Akbarifard, A., Beaubien, A., Imbody, S., & Curation Unit. (2021, October 8). ICPSR Curation: The Who, What, Where, Why, and How of Curating Data at ICPSR [Presentation]. 2021 ICPSR Biennial Meeting. ICPSR Youtube Channel. https://youtu.be/AqRRccPpRcw?list=PLqC9lrhW1VvbtV7GtM4u4ZnI1RsDKHIBj&t=623
- Data Curation Network. (2022). CURATE(D): Checklist for Data Curation (version 2). https://datacurationnetwork.org/outputs/workflows/
- Kopper, S., Sautmann, A., & Turitto, J. (2020, January). J-Pal Guide to Publishing Research Data. Abdul Latif Jameel Poverty Action Lab. https://www.povertyactionlab.org/resource/data-publication#A-checklist-for-data-publication
- Lang, J., Deardorff, A., Bruno, I., & Lewis Christian, T-M. (2020, Feb 6). Here Come the Data [Presentation]. Charleston Conference Youtube Channel. https://www.youtube.com/watch?v=zG3dcoCVZP0&t=792s
- Markham, A. (2012). Charting Ethical Questions by Data and Type. In Ethical Decision-Making and Internet Research 2.0, Association of Internet Researchers, https://aoir.org/reports/ethics2.pdf
- O'Donnell, M. N. & Brundy, C. (2022). Bringing All the Stakeholders to the Table: A Collaborative Approach to Data Sharing. Journal of eScience Librarianship, 11(1), 2. https://doi.org/10.7191/jeslib.2022.1224
- Pisani, A. R., Kanuri, N., Filbin, B., Gallo, C., Gould, M., Lehmann, L. S., Levine, R., Marcotte, J. E., Pascal, B., Rousseau, D., Turner, S., Yen, S., & Ranney, M. L. (2019). Protecting User Privacy and Rights in Academic Data-Sharing Partnerships: Principles From a Pilot Program at Crisis Text Line. Journal of Medical Internet Research, 21(1), 1-11. https://doi.org/10.2196/11507