Book Review: Three Cases of Mass-Scale Data Sharing
By Shalin Hai-Jew, Kansas State University
Data Sharing: Recent Progress and Remaining Challenges
By Yousef Ibrahim Daradkeh and Petr Mikhailovich Korolev
New York: Nova Science Publishers
For a book about a topic as big as data sharing, including only three disparate topical chapters (on educational data sharing, structural biology data sharing, and data sharing in Uganda) may seem a little disappointing. Does the dearth of materials suggest the state of the world (that there is not a lot of data sharing going on)? Does it suggest something about the limited reach of the co-editors or the publishers? Does it suggest that there are not researchers willing to engage the topic from their particular unique points-of-view? Or could it be a combination of factors, occurring in complex human systems?
Drs. Yousef Ibrahim Daradkeh and Petr Mikhailovich Korolev’s Data Sharing: Recent Progress and Remaining Challenges (2019) suggests the importance of data sharing to advance respective fields of study, to set high standards for research and data collection and data handling, and to enable reproducibility and some types of peer oversight. Their new book showcases some of the challenges of data sharing, including the following:
- How can researchers be encouraged to share primary data with their colleagues and peers when such data may confer competitive advantage?
- How can the data sharing portals be supported when the work of collecting and sharing such data may be technologically demanding and require complex skillsets?
- How can data sharing be encouraged when there is limited funding available for such continuing endeavors?
While the co-editors Daradkeh and Korolev do not contribute any chapters to this work, they do assert that databases are like “books collected in the library” that reader-researchers can review and learn from (2019, p. viii).
Sharing Educational Data
Yang Song, Edward F. Gehringer, and Zhewei Hu’s “Sharing Educational Data” (Ch. 1) suggests a number of ways in which educational data [“data generated in the educational process” (Song, Gehringer, & Hu, 2019, p. 1)] may enhance teaching, advance research, further data mining and exploration, and advance human learning. Yet, for all the potential gains, educational data sharing is challenging because the different types of educational data available come with different data schemas (p. 8); for example, different types of data may be extracted from e-learning and learning management systems, intelligent tutoring systems, game-based learning, educational peer assessment, and reading tutor environments (Song, Gehringer, & Hu, 2019, pp. 8-9). There is no universal data schema for educational data.
Even the respective forms of educational data may be different, based on how close the data is to the original source and how it is acquired. Educational data may be raw and unprocessed; they may be “derived”; they may be “results” from analyzing educational data (Paton, 2008, as cited in Song, Gehringer, & Hu, 2019, p. 4), or some combination of forms.
Even from learning management systems (LMSes), the data may capture “dozens of entities such as students, student teams, assignments, submission timestamps, etc.” (Song, Gehringer, & Hu, 2019, p. 1). The co-authors write:
The large number of related entities and the variety of relations between them make general-purpose data sharing—combining the data from two LMS’s (sic) from two schools, for example—extremely hard. However, if the scope of data is limited to a much smaller range (e.g., to share the code written for 100-level Computer Science courses), data sharing is more feasible. The tradeoff behind those two cases is that, with fewer entities involved, data can be shared relatively easily, but the shared data can only be used to answer specific research questions. If the shared data contains more entities, the shared data can be used to answer more research questions and thereby be used by more people, but the data-sharing processes (especially the data-sharing protocols) have to be carefully designed. (Song, Gehringer, & Hu, 2019, p. 2)
Beyond getting data aligned for analytics and comparisons, there are important practical challenges from the Family Educational Rights and Privacy Act (FERPA). Student records, beyond general catalog information, are considered private. Any version of the data released must be anonymized, so that dimensions of the data cannot be linked back to a person. While there are sophisticated ways to anonymize data and to fuzz identities (such as through “perturbations”), with sufficient access to data, inference attacks may be applied to the educational datasets, in many cases enabling the re-identification of data records to individual people.
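To make the re-identification risk concrete, here is a minimal sketch in Python. It is an illustration only, not a method taken from the chapter: the toy records, the field names, and the choice of quasi-identifiers are all hypothetical. It shows that removing names is not enough when a combination of remaining attributes is unique to one record.

```python
from collections import Counter

# Hypothetical "anonymized" course records: names are removed, but
# quasi-identifiers (major, class year, section) remain in the release.
released = [
    {"major": "CS", "year": 2, "section": "A", "grade": "B+"},
    {"major": "CS", "year": 2, "section": "A", "grade": "A"},
    {"major": "Math", "year": 4, "section": "B", "grade": "C"},
    {"major": "CS", "year": 3, "section": "B", "grade": "A-"},
]

# Assumed quasi-identifiers: fields an outsider could plausibly know.
QUASI_IDENTIFIERS = ("major", "year", "section")


def quasi_key(record):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def unique_records(records):
    """Return records whose quasi-identifier combination occurs only once."""
    counts = Counter(quasi_key(r) for r in records)
    return [r for r in records if counts[quasi_key(r)] == 1]


# Records that anyone who knows a student's major, year, and section
# could link back to that one student.
at_risk = unique_records(released)
```

In this toy release, two of the four records are unique on the chosen attributes, so their grades could be tied to a single person by anyone holding a class roster. Generalizing or perturbing such fields before release is one common response, at some cost to the data's usefulness for research.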
The authors laud the potentials of using data to make public policy decision-making more transparent. They mention endeavors at various institutions of higher education to enable the sharing of educational data from massive open online courses (MOOCs) and other sources. However, the mentions seem somewhat distant from practice.
Chapter 1 offers a light and high-level review of some of the complexities of educational data sharing. As such, this may be read as an introductory work to the topic.
Song, the first author, is from the University of North Carolina in Wilmington, and the latter two, Gehringer and Hu, are from North Carolina State University.
Data from Structural Biology
“Data Sharing in Structural Biology: Advances and Challenges” (Ch. 2), written by M. Grabowski, I.G. Shabalin, P.J. Porebski, M.J. Domagalski, H. Zheng, D.R. Cooper, B.S. Venkataramany, P.E. Bourne, and W. Minor, describes another specific “use case” related to structural biology and the mapping of macromolecular structures using X-ray crystallography (with resulting large datasets ranging from tens of gigabytes to terabytes). These co-authors argue the importance of sharing data in the sciences, given the “reproducibility crisis in biomedical research” (Grabowski, et al., 2019, p. 29), in which there were low rates of reproducibility. While professionals in structural biology had “embraced sharing research data since the outset,” with setting up respective structural databases from the 1960s onwards, “the raw diffraction images collected for X-ray crystallography, the dominant method of macromolecular structure determination, usually have not been shared” until recently, with two all-purpose data repositories: the Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) and the SBGrid Data Bank (Grabowski, et al., 2019, p. 30). Some advances have occurred based on the sharing of open data “by allowing identification of problems in data collection protocols as well as correction of problematic crystal structures” and potentially enabling the reproducibility of biomedical experiments (Grabowski, et al., 2019, p. 30). The structural biologists played a critical role in advancing the work of biological repositories such as the Protein Data Bank (started in 1971), which has received over 10,000 deposits annually in the last five years and has been widely used in the field (Grabowski, et al., 2019, pp. 31 - 32).
Figure 1. Structure of the FEN1 Protein (based on PyMOL Rendering of PDB 1ul1), by EMW, Dec. 2009, on Wikipedia
Professionals in the field served on expert committees to set guidelines on how to share the data. They established requirements for such model coordinate sharing as a pre-requisite to formal publication, from the 1970s onwards. This is not to say that researchers were not resistant to sharing their data due to the extra attention and perhaps the revelation of mistakes (Grabowski, et al., 2019, p. 40). Sharing such datasets with the proper labels and tags also takes additional effort. In an environment of uncertain research funding, how well such dataset provisioning will be supported is also in question (Grabowski, et al., 2019, pp. 47 - 48).
During each phase of structural biology work, described in a simplified pipeline [“sample preparation -> data collection -> data processing -> data analysis -> data dissemination” (Morris 2018, as cited in Grabowski, et al., 2019, p. 34)], there are various types of raw data generated or capture-able, which may be valuable to other researchers. Structural biologists created guidelines for data sharing and incentives to do so. The co-authors write: “The PDB’s success should be measured not only by the impressive number of deposited structural models but also by its impact on the development of the field, implementation of guidelines, and in particular, assurance of data quality” (Grabowski, et al., 2019, p. 32). The advent of social media meant more extensive data sharing. More assiduous capturing and sharing of data (not just in researcher notebooks) may prevent the loss of important data from history. The structural biology datasets range from tens of gigabytes to terabytes in size, even though these may be made smaller using specialized software (Grabowski, et al., 2019, p. 35). [The two samples provided in the chapter are visually evocative.]
This chapter was funded in part by NIH grants and the support of other federal agencies (NIAID, DHHS) (Grabowski, et al., 2019, p. 59).
Data Sharing in the Developing World
Finally, the last chapter focuses on “Data Sharing Practice and Policy Challenges in Developing Countries: The Context of Uganda” (Ch. 3), by Isaac Tomusange, Ayoung Yoon, and Norman Mukasa. There are “inherent barriers to data sharing in developing countries such as scarcity of resources, limited technical knowledge, and data unreliability” (Tomusange, Yoon, & Mukasa, 2019, p. 69). Yet, to understand how well a country is developing and to make informed decisions, the leadership and peoples still require accurate data and statistics.
The co-authors provide a review of Uganda’s advances in this space. For example, their information and communications technology (ICT) policy in 1996 enabled “government deregulation of the country’s telecommunication sector, which led to the development of a platform for effective information sharing” (Tomusange, Yoon, & Mukasa, 2019, p. 70). The data were consumed among “scholarly researchers, private companies, public entities, telecommunication agencies, developers, and security agencies” (Tomusange, Yoon, & Mukasa, 2019, p. 70). The country still faces weaknesses in “existing policy and culture” (Tomusange, Yoon, & Mukasa, 2019, p. 70). The authors refer to the Open Data Barometer organization which offers an international measure of a nation’s standing in terms of “the readiness, implementation, and emerging impact of open data” and how governments use data for “accountability, innovation, and social impact” (Tomusange, Yoon, & Mukasa, 2019, p. 75). 2016 data from the Open Data Barometer site shows Uganda’s progress.
What are some ways forward for Uganda? The authoring team suggests building out “a supporting infrastructure,” by defining data sharing standards and policies, setting up links between government units for improved coordination, building capacity for e-governance, encouraging public and private partnerships, and establishing ICT standards and regulatory bodies to enforce these, among others (Tomusange, Yoon, & Mukasa, 2019, p. 92). From their writing, it is clear that the Uganda Bureau of Statistics (UBOS) will play an important role, given that it serves as “the national repository for demographic social survey data” (Tomusange, Yoon, & Mukasa, 2019, pp. 85 - 86). Their chapter shows the importance of considering the holistic context, with various stakeholders and entities competing over limited resources in their efforts to advance competing objectives. To advance an open data approach, they observe that Uganda faces technical barriers, skill and competency barriers, motivational and economic barriers, political barriers, legal and ethical barriers, and others.
The authors observe:
Several challenges also overwhelm the government plans for open data and open government in Uganda. Some are finance-related, such as the high cost of voice/data connections; expensive content and application hosting; multiplicity of data sources; establishment of a centrally managed national databank and data sharing standards; unreliable interactions within government agencies; expensive software for data handling and management; and repetitive and unlinked data collection processes. (Tomusange, Yoon, & Mukasa, 2019, p. 76)
Understanding some of the challenges of a government to enable broader sharing of data may stand to benefit other data sharing endeavors in the real world.
Daradkeh and Korolev’s Data Sharing: Recent Progress and Remaining Challenges (2019) is a sparse collection, but these three works are solid and insightful in their own ways and stand to contribute to data sharing in academic and political contexts. These respective works offer some insights on how to get a handle on known extant challenges.
About the Author
Shalin Hai-Jew works as an instructional designer at Kansas State University. Her email is email@example.com.
Earlier in 2019, she created an interactive online presentation as a digital leave-behind from an invited presentation on Data Science & Analytics at the TILTed Tech Event at FHSU.