Learning Data Ethics for Open Data Sharing

See more about Restricted Access

Jump to the Restricted Access in FAIR Sharing page to see more about the language and methods for restricting access to your dataset in a data repository.

FAIR Data Sharing
Activity: Make it FAIR
If you haven’t already, play Level 3 in the League of Data game: https://lod.sshopencloud.eu/LodGame/ It helps you break down the advantages of depositing your data in a FAIR data repository compared to posting it with your journal article.
Source: SSHOC. (2020). Data Publication Challenge [video game]. Social Sciences and Humanities Open Cloud (SSHOC) League of Data (LOD). https://lod.sshopencloud.eu/
---------------------------------

Open Science has increasingly been promoting F.A.I.R.: Findable, Accessible, Interoperable, and Reusable. FAIR sharing includes principles that individually can increase the discovery and usability of your final research data for others. There are no hard and fast rules for completing each and every one of these features; in addition, there may be cases where you absolutely do NOT want to be 100% FAIR, as total openness and reproducibility could cause harm to your research subjects. Therefore, view the features of FAIR as individual opportunities you can select that will increase the value of your data.

One way to make your data findable is to deposit it into a data repository that assigns a DOI and has machine-readable metadata fields. This will make it easier for people to reach your data from their search results. It will be even more findable to relevant audiences if the metadata you use is subject-specific.

You could make your data accessible by providing a link that lets people open or download it. Can someone actually access your data? You could make it totally accessible by uploading the file into a repository and enabling access to it. (It would be slightly less accessible if you uploaded the file in a proprietary file type that requires specific software to open.)

Interoperability can automatically occur if you deposit your data in approved repositories set up for interoperability, such as Dataverse. Doing so will make your data findable in other search indexes, such as Web of Science and Google, even though you didn’t deposit your data in those systems. Ask your librarian for help selecting an approved data repository.

Reusability depends on how clean and well organized your files and documentation are. Can someone understand what your data file contains? Do they know what your variables and columns refer to? Are they able to know the methods you used to clean your data, such as a decision you made to remove certain rows because they were outliers or had missing values? The documentation you write helps others understand what your data is, so they can figure out what they could do with it themselves and how to reproduce your study.
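As one possible starting point for that documentation, here is a minimal Python sketch that builds an empty data dictionary from a CSV file's columns. The function name `codebook_template`, the sample columns, and the choice of fields (`n_missing`, `n_distinct`) are assumptions for illustration, not a repository requirement; the descriptions still have to be written by you.

```python
import csv
import io

def codebook_template(csv_text):
    """Build a starter data dictionary with one entry per column.

    Only counts are recorded (no example values), so the template itself
    does not copy any participant data.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        non_missing = [v for v in values if v not in ("", None)]
        entries.append({
            "variable": column,
            "description": "",  # to be filled in by the researcher
            "n_missing": len(values) - len(non_missing),
            "n_distinct": len(set(non_missing)),
        })
    return entries

# Hypothetical three-column file, made up for this example.
sample_csv = "id,age_group,county\n1,20-29,Orange\n2,,Orange\n"
codebook = codebook_template(sample_csv)
```

Each entry can then be saved alongside the dataset (for example as a README table) once the descriptions and any cleaning decisions have been added.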

The Australian Research Data Commons created a FAIR Data Self Assessment Tool to assess the FAIRness of a dataset and determine how to enhance its FAIRness. Check it out!
Data Use Agreements
We will not be going into great detail about Data Use Agreements (DUAs), because these are more often formally talked about when you’re using vendor proprietary data. In those cases, they often look like long contracts with legal clauses. However, Data Use Agreements can be short and easy to read too. Think of a Creative Commons CC-BY license, which is often the default open license placed on your data repository record. This is essentially stating, “User, you are allowed to USE this DATA, provided that you AGREE to acknowledge me as the data source.” Dataverse, an open data repository, has provided a sample Data Use Agreement for deidentified data that is a more contractual-looking version of what you’ll often see in the data repository record itself.

If we break down the clauses, we’ll essentially see (underlines added):
- III. Use of the Materials: “Use of the Materials include but are not limited to viewing parts or the whole of the content included in the Materials; comparing data or content from the Materials with data or content in other Materials; verifying research results with the content included in the Materials; and extracting and/or appropriating any part of the content included in the Materials for use in other projects, publications, research, or other related work products.”
Meaning, you can look at, work with, and reuse this data.
- “III.1.B. Restrictions In his/her Use of the Materials, Downloader cannot: … produce connections or links among the information included in User’s datasets (including information in the Materials), or between the information included in User’s datasets (including information in the Materials) and other third-party information that could be used to identify any individuals or organizations, not limited to research subject.”
Meaning, don’t intentionally try to re-identify participant subjects.

University of North Carolina at Chapel Hill has provided two flowcharts about whether or not you should share data and, if so, whether or not you should have a Data Use Agreement. Check them out: https://research.unc.edu/wp-content/uploads/sites/61/2013/04/CCM3_039360.pdf

What would your Data Use Agreement say?
Source
- Office of Sponsored Research. (2013). Data Use Agreement Guidance. University of North Carolina at Chapel Hill. https://research.unc.edu/wp-content/uploads/sites/61/2013/04/CCM3_039360.pdf
Risk Assessment and De-identification
Activity: That's PII
If you haven’t already, play Level 6 in the League of Data game: https://lod.sshopencloud.eu/LodGame/ It helps you think about what sort of information you should anonymize.
Source: SSHOC. (2020). Data Publication Challenge [video game]. Social Sciences and Humanities Open Cloud (SSHOC) League of Data (LOD). https://lod.sshopencloud.eu/
--------------

Now that we’ve talked about ethics and sharing, one key component involved in preparing data for sharing is conducting a risk assessment. This can also be called assessing the risk of disclosure. Disclosure occurs when an individual participant in your study is directly or indirectly identified. You don’t want that to happen! It is very often the researcher’s responsibility to conduct a risk assessment and act accordingly to adequately protect their data.

Here are some resources you may find helpful:
- UNC-Wilmington facilitated a webinar in 2022 introducing disclosure risk, as well as how to assess and remediate it. The recording (1 hour) and PPT slides are available: https://library.uncw.edu/love_data_week#risk
- The Sensitive Data Expert Group of the Portage Network in Canada developed this Human Participant Research Data Risk Matrix (version 2, as of 10/1/2020), which breaks down data classification levels by how they should be handled at different stages of the research data process (including data sharing).
- The University of Michigan Open Data DEIA Toolkit (version 1.0, last updated July 2021) lists recommended resources related to DEIA at different stages of the research data process (including data sharing).
- The Poverty Action Lab (2020) explains and provides examples of relevant identifier variables for various processes of de-identification, including:
  - Encoding identifiers with anonymous ID numbers
  - Removing or partitioning identifiers by separating identified and de-identified data
  - Redacting values
  - Aggregating variables
  - Geographic masking, jittering, or displacement by offsetting spatial data points in a systematically random manner
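
Two of the processes listed above, encoding identifiers with anonymous ID numbers (with the identified data partitioned off) and geographic jittering, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Poverty Action Lab's own procedure: the records, field names, `P-` ID format, and the 0.05-degree offset are all made up for the example, and a real project would choose its displacement distance based on its own risk assessment.

```python
import random
import secrets

# Hypothetical raw records; names and coordinates are invented for illustration.
raw_records = [
    {"name": "Participant One", "lat": 35.9132, "lon": -79.0558, "score": 12},
    {"name": "Participant Two", "lat": 35.9940, "lon": -78.8986, "score": 17},
]

def encode_and_partition(records, id_field="name"):
    """Replace a direct identifier with a random study ID.

    Returns (deidentified_records, linkage_key). The linkage key maps study
    IDs back to identifiers and should be stored separately and securely.
    """
    deidentified, linkage_key = [], {}
    for record in records:
        study_id = "P-" + secrets.token_hex(4)  # random, not derived from the name
        while study_id in linkage_key:  # regenerate on the rare collision
            study_id = "P-" + secrets.token_hex(4)
        linkage_key[study_id] = record[id_field]
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["study_id"] = study_id
        deidentified.append(cleaned)
    return deidentified, linkage_key

def jitter_point(lat, lon, max_offset_deg=0.05, rng=random):
    """Displace a coordinate by a random offset of up to max_offset_deg per axis."""
    return (
        lat + rng.uniform(-max_offset_deg, max_offset_deg),
        lon + rng.uniform(-max_offset_deg, max_offset_deg),
    )

deidentified, linkage_key = encode_and_partition(raw_records)
for row in deidentified:
    row["lat"], row["lon"] = jitter_point(row["lat"], row["lon"])
```

Only `deidentified` would be a candidate for sharing; `linkage_key` is the identified partition and stays with the research team.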
Once you have done a basic risk assessment to determine what sensitive information you have in your dataset, you can decide whether you want to restrict access to your dataset or anonymize your dataset (i.e. remove the sensitive information). Restricted access can certainly be a viable option; it preserves the original variables and is released only to approved users who agree to conditions respecting the study participants’ confidentiality. Restricted use agreements may require the prospective researcher to hash out a summary of their research question, explain why they need access to this dataset, and provide a data protection plan that explains how they will safeguard the data during a specific period of time and then destroy or return the data once the study is complete.

You will need to decide what information to keep for its usefulness in analysis and what can be removed, altered (or ideally never even collected) while still being helpful for data reuse by others.

Here are some basic methods that can be done to anonymize or protect the data:
- Remove, mask, generalize, or pseudonymize identifiers (ex: occupation instead of specific position title)
- Reduce precision/detail of identifiers (ex: birth year/decade instead of date of birth)
- Recode geographic areas to be higher levels (ex: show county/region instead of zip code)
- Remove or mask unique outliers
- Recategorize and collapse codes if some coded categories are too small (ex: 9 = “other”)
- Create ranges/bins for variables (ex: ages 0-9 and 10-19, or “child” and “teenager”)
- Combine variables (ex: school in North Carolina, an aggregated location, instead of East Chapel Hill High School, an individual place name)
- Report in the aggregate instead of at individual levels
- If a variable is sensitive and not essential, consider removing this variable
- If you use “find” and “replace” techniques on qualitative data, be aware this may not catch everything!
- If you make replacements (such as pseudonyms) or removals, use brackets to show where the cleaned copy has been altered.
- Check for potential “hidden” metadata, such as metadata in image files or in NVivo analysis files
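
A few of the methods in the list above (reducing precision, creating bins, and collapsing small categories) can be sketched with plain Python. The function names, the 10-year bin width, and the minimum category size of 3 are assumptions chosen for illustration; appropriate thresholds depend on your dataset and your risk assessment.

```python
from collections import Counter
from datetime import date

def birth_decade(dob: date) -> str:
    """Reduce a full date of birth to its decade, e.g. 1987-05-03 becomes '1980s'."""
    return f"{dob.year // 10 * 10}s"

def age_bin(age: int, width: int = 10) -> str:
    """Place an age into a fixed-width range, e.g. 7 becomes '0-9'."""
    low = age // width * width
    return f"{low}-{low + width - 1}"

def collapse_small_categories(values, min_count=3, other_label="other"):
    """Recode any category that appears fewer than min_count times as 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical occupation codes; "judge" appears only once, so it is collapsed.
occupations = ["teacher", "teacher", "teacher", "judge"]
recoded = collapse_small_categories(occupations)
```

Here `age_bin(7)` gives `"0-9"` and `age_bin(19)` gives `"10-19"`, matching the ranges in the list, and `recoded` becomes `["teacher", "teacher", "teacher", "other"]`.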
Check out this simplified infographic “5 Things to check for data deidentification” (note that further on the page is some sample R code you can use for step 5).

Beyond what is listed above, know that there are more complex “Statistical Disclosure Control” methods for de-identification, which should be employed if you are especially concerned about risk from indirect identifiers; ask your IRB office to direct you towards these “Expert Determination” methods rather than just “Safe Harbor” standards for direct identifiers.

Remember FAIR? Part of Reusability is writing documentation. There is value in documenting the steps you took to anonymize your data: not the specifics of what you changed from what into what, but the general processes, such as removing names of municipalities and deciding that the municipalities will now appear as distinct numbers. Here is an example of documentation about a deidentification process done on a research project.

How possible is it to re-identify data that has been anonymized? Here’s an article (2019) examining how difficult it could be to re-identify a dataset: Estimating the success of re-identifications in incomplete datasets using generative models

You may want to double-check your risk assessment after you conduct de-identification. Here are some quick checks you can make:
1. Check the consent form
2. Check the codebook, data dictionary, or other documentation of variables and data elements to get a review of potential leftover identifiers. Is there documentation about which variables were transformed or de-identified?
3. Review data for remaining direct identifiers, especially dates and geography
4. Look for indirect identifiers that could possibly link to external datasets
5. Run descriptive statistics and crosstabs on quantitative data to see if there are any unique cases and combinations of variables
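
Check 5 can be approximated with a simple count of how many records share each combination of indirect identifiers. The sketch below is a minimal Python illustration, not a full Statistical Disclosure Control method; the function name `small_cells`, the variable names, the sample rows, and the threshold `k=2` are assumptions for the example.

```python
from collections import Counter

def small_cells(records, quasi_identifiers, k=2):
    """Return value combinations shared by fewer than k records.

    A combination that appears only once describes a unique case,
    which may be re-identifiable when linked to outside information.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical de-identified rows, made up for this example.
sample = [
    {"age_group": "20-29", "county": "Orange", "occupation": "teacher"},
    {"age_group": "20-29", "county": "Orange", "occupation": "teacher"},
    {"age_group": "60-69", "county": "Orange", "occupation": "judge"},
]
risky = small_cells(sample, ["age_group", "county", "occupation"])
```

In this sample, `risky` contains the single combination `("60-69", "Orange", "judge")`, which flags that row for further generalization, recoding, or removal.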
---------------------------------
Activity: De-identify a transcript
Download or make a copy of one of the sample interview transcripts in this Raw Interviews folder—you do not have edit access to these originals.

Read through the transcript and try to make changes to anonymize the text.
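
If you want to support your manual read-through with find-and-replace, the following Python sketch replaces a list of known identifiers with bracketed placeholders and reports how many times each was found. The function name `pseudonymize`, the sample sentence, and the placeholder labels are assumptions for illustration. As noted earlier, find-and-replace will not catch everything (misspellings, nicknames, and indirect descriptions slip through), so it supplements careful reading rather than replacing it.

```python
import re

def pseudonymize(text, replacements):
    """Replace each known identifier with a bracketed placeholder.

    Returns (cleaned_text, counts), where counts records how many times
    each term was replaced. Longer terms are handled first so that a long
    name is not partially replaced by a shorter term it contains.
    """
    counts = {}
    for term in sorted(replacements, key=len, reverse=True):
        placeholder = f"[{replacements[term]}]"
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        # A function replacement keeps the placeholder text literal.
        text, n = pattern.subn(lambda match, p=placeholder: p, text)
        counts[term] = n
    return text, counts

# Hypothetical sentence and placeholder labels, made up for this example.
original = "I taught at East Chapel Hill High School before moving away from Chapel Hill."
cleaned, counts = pseudonymize(
    original,
    {"East Chapel Hill High School": "School A", "Chapel Hill": "Town 1"},
)
```

Here `cleaned` reads "I taught at [School A] before moving away from [Town 1]." A count of zero for a term you expected to find is a prompt to search for alternate spellings.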

As you are making changes, document your processes. You can use this Open Data Release Form if you like it (you will have to download or make a copy of this Open Data Release Toolkit document as you don’t have edit access).

Once you’re done, compare your changes against the transcripts in the Cleaned Interviews folder.
--------------------------
Sources
- Darragh, J. (2022, October 11). Sharing Human Participants Data: Challenges and Potential Solutions [Presentation]. Duke University Libraries.
- Darragh, J., Hofelich Mohr, A., Hunt, S., Woodbrook, R., Fearon, D., Moore, J., & Hadley, H. (2020, March 2). Human Subjects Data Essentials Data Curation Primer. Data Curation Network GitHub Repository.
- Dunning, T., & Camp, E. (2015). 0_ Camp_Anonymization Protocol [data documentation]. Brokers, voters, and clientelism: The puzzle of distributive politics. Qualitative Data Repository. https://doi.org/10.5064/F6Z60KZB/0NR0VZ.
- Emmelhainz, C., & Sackmann, A. (2022, February 15). What do I do with all of this text? Cleaning and coding data for qualitative analysis [Presentation]. UC Berkeley Love Data Week 2022. https://docs.google.com/presentation/d/1Twsi1Zmt2kXscXsFGKz-gQNBUEk1CDmp6FpXrFmXpU0/edit?usp=sharing
- Finkle, E. (2016). Open Data Release Toolkit: Privacy Edition [Guide, version 1.2]. DataSF. https://datasf.org/resources/open-data-release-toolkit/
- ICPSR. (n.d.) Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle, 6th Edition [Report]. https://www.icpsr.umich.edu/web/pages/deposit/guide/index.html
- Kopper, S., Sautmann, A., & Turitto, J. (2020, January). J-Pal Guide to De-identifying Data. Abdul Latif Jameel Poverty Action Lab. https://www.povertyactionlab.org/resource/data-de-identification
- Office for Civil Rights (OCR). (2012, November 26). Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule [Guidance]. U.S. Department of Health & Human Services. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
- Qualitative Data Repository. (n.d.) De-Identification. https://qdr.syr.edu/guidance/human-participants/deidentification
- Tarrant, D., Thereaux, O., & Mezeklieva, V. (2020, June). Anonymising data in times of crisis [Report]. Open Data Institute (ODI). https://theodi.org/article/anonymising-data-in-times-of-crisis/
- UK Data Service. (n.d.) Research Data Management: Anonymisation. https://ukdataservice.ac.uk/learning-hub/research-data-management/#anonymisation