Best Practices of Metadata Creation in the USC Digital Library
Wayne Shoaf, November 2016
The Voltaire letters and poems in this project were digitized, cataloged and made available publicly in the USC Digital Library at http://digitallibrary.usc.edu/cdm/search/collection/p15799coll47.
We document below the principles behind the organization and cataloging of items in the Digital Library, in particular the Voltaire materials. The metadata for the Voltaire materials described below is available through several search services, which increases the chances of the materials being found: the USC Digital Library (http://digitallibrary.usc.edu/) itself, WorldCat (https://www.worldcat.org/), the Digital Public Library of America (https://dp.la/), and Google (https://www.google.com/). Our system shares its metadata via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and by submitting sitemaps to Google.
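To illustrate how an aggregator might harvest records over OAI-PMH, here is a minimal Python sketch. It is not a description of the Digital Library's actual configuration: the endpoint address (BASE_URL) is a placeholder, the use of the collection alias p15799coll47 as the OAI set name is an assumption, and the sample response is an invented stand-in for what a server would return. Only the verb and parameter names (ListRecords, metadataPrefix, set) and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Placeholder endpoint: the real OAI-PMH base URL is not stated in this document.
BASE_URL = "https://example.org/oai"

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for one set (collection)."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_spec}
    )
    return f"{base_url}?{query}"


def extract_titles(xml_text):
    """Return every Dublin Core title found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iterfind(".//dc:title", NS)]


# Invented sample response, used here instead of a live network request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:demo/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample letter title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # Assumption: the collection alias doubles as the OAI set name.
    print(list_records_url(BASE_URL, "p15799coll47"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would fetch the URL, follow any resumptionToken the server returns, and repeat until the list is exhausted; those steps are omitted here to keep the sketch short.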
Metadata regarding the digital Voltaire letters and poems has several objectives:
1) Facilitate discoverability [descriptive metadata];
2) Support digital preservation [preservation metadata];
3) Ease navigability [structural metadata];
4) Manage intellectual property [administrative metadata].
These objectives are achieved through the use of the four specified types of metadata. All of these must be created and managed with limited resources. Before discussing discoverability – the metadata requiring the most intellectual effort – we will briefly touch on the other three.
Preservation metadata tells the user or system where the digital files are stored, what their names are, when they were created, what their formats are, etc.[1] Structural metadata tells the system how to organize the various files into a cohesive, understandable whole so that the user can move from one page or one part to the next with ease. Administrative metadata provides the user and the system with information about the legal uses one can make of the files, for instance, whether they are in the public domain, or whether permission for use must be requested and from whom.
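As a rough illustration of the preservation fields just listed (storage location, file name, date, format), the following Python sketch gathers them for a single file. It is a sketch under stated assumptions, not the repository's actual schema: the function name, the dictionary keys, and the added size field are all hypothetical, and the file's modification time stands in for its creation date because creation time is not available portably.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def preservation_record(path):
    """Collect basic preservation facts about one file (hypothetical field names)."""
    file_path = Path(path)
    stat = file_path.stat()
    # Guess the format from the file extension; a real system would inspect the file itself.
    media_type, _encoding = mimetypes.guess_type(file_path.name)
    return {
        "location": str(file_path.resolve().parent),
        "file_name": file_path.name,
        # Modification time, used here as a stand-in for the creation date.
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "format": media_type or "unknown",
        "size_bytes": stat.st_size,
    }
```

Calling preservation_record on each scanned page image would yield one such record per file. Because most of these values can be read from the file system, this kind of metadata requires little manual effort.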
Whereas most preservation, structural and administrative metadata is the same (or similar) across all metadata records in a collection such as the Voltaire Collection and, as such, requires little human effort, descriptive metadata is the most costly to produce because it relies heavily on intellectual and manual effort. If an item, such as a letter, contains text, it can sometimes be indexed as searchable full text by running optical character recognition (OCR) against it. OCR is only as good as the text it reads: it performs best on typeset or printed text and very poorly on handwritten text. Since the Voltaire letters are all handwritten, OCR cannot expose their full text to search. We therefore obtained a published version of each letter and included it in the record alongside the handwritten letter, so that OCR could make the text accessible to full-text search. Even then OCR is not perfect, but it provides much richer metadata and improves the results of keyword and key-phrase searching.
In addition to this automated extraction of full text as metadata, we manually transcribe the first line of each letter (following the salutation) into the metadata. This provides a more accurate rendering of at least that small portion of the letter for searching. It also helps to uniquely identify the letter if the other particulars – date, writer, recipient, location – are missing, unknown, or incorrect. The date, writer, recipient and location form the basis of the metadata title and description and, when formatted consistently across an entire corpus of letters, have the added benefit of allowing searchers to sort their results in a meaningful way. There are also individual metadata fields for date, writer, recipient and location, which make fielded search possible and give searchers a means of limiting their results to any of those aspects.
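The benefit of a consistently formatted title and of separate fields can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than the Digital Library's actual schema: the class name, the field names, the title pattern (date, then writer and recipient, then location), and the bracketed placeholders for missing values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LetterRecord:
    """One letter's descriptive metadata (hypothetical field names)."""
    date: Optional[str]        # ISO 8601 (YYYY-MM-DD), so plain string order is chronological
    writer: Optional[str]
    recipient: Optional[str]
    location: Optional[str]
    first_line: str            # transcribed manually, following the salutation

    def title(self):
        """Build the title in one fixed pattern so that titles sort meaningfully."""
        correspondents = f"{self.writer or '[unknown]'} to {self.recipient or '[unknown]'}"
        return ", ".join(
            [self.date or "[undated]", correspondents, self.location or "[place unknown]"]
        )


def filter_by(records, **fields):
    """Fielded search: keep records whose named fields equal the given values."""
    return [
        record for record in records
        if all(getattr(record, name) == value for name, value in fields.items())
    ]


def find_by_first_line(records, text):
    """Identify letters by a fragment of the first line, ignoring case."""
    needle = text.lower()
    return [record for record in records if needle in record.first_line.lower()]
```

With records built this way, sorted(records, key=LetterRecord.title) orders the letters chronologically, with undated letters at the end, and filter_by(records, writer="...") limits the results to a single writer. A letter whose date or recipient is unknown can still be located through find_by_first_line.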
Finally, it is important that we provide metadata which refers the reader to additional sources of information. This includes links to other publicly available online versions of the letter and, where possible, to the more important published editions. Because this last metadata is ascertained by the cataloger in the course of creating metadata for the individual records in the Digital Library, it is usually not an exhaustive list of publications or sources. Rather, it lists one or two key references, which include standard numbers or other rubrics by which the particular letter may be known. While we rarely revisit metadata to edit or correct it, that possibility exists if a user notifies us of inaccuracies or provides additional critical information that will benefit future discoverability.
WS: 9-15-16; rev 9-21-16
[1] As an aside: we store our files in the USC Digital Repository (https://repository.usc.edu/). This fulfills most of the requirements of a trusted digital repository.