The BitCurator NLP project is developing software that collecting institutions can use to extract, analyze, and report on features of interest in text extracted from born-digital materials in their collections. We are using open source natural language processing libraries to identify items likely to be relevant to preservation, information organization, and access activities. These may include entities (e.g. persons, places, and organizations), potential relationships among entities (e.g. entities that appear together within a document or set of documents), and topic models that provide insight into how concepts are naturally clustered within the documents. The software will allow users to create customized reports from text discovered in disk images, and will provide both command-line executables and a public Python API to extend the capabilities of external tools.
Rationale and Technical Foundation
Born-digital collections often include a wide range of complex file formats (for example, Office documents, PDF files, email, and audiovisual materials) from which text may be extracted directly or by automated transcription. Text extraction from arbitrary collections of files is itself a non-trivial task; solving this problem is not a focus of the project. Instead, the BitCurator NLP project uses existing software (including textract, textacy, spaCy, scikit-learn, and TextBlob) to extract text from heterogeneous collections of files and to execute NLP tasks such as entity and entity relationship identification, topic modeling (and topic model visualization), and document summarization. The software will allow users to select candidate files from a collection and create human-readable reports (PDF or text) and machine-readable reports (JSON) that meaningfully characterize the contents of those files based on the raw (unannotated) text that they contain.
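As a rough illustration of the entity relationship and JSON reporting ideas described above, the following is a minimal sketch using only the Python standard library. It is not the project's API: the function name `cooccurrence_report`, the report layout, and the sample input are hypothetical. The sketch assumes entities have already been extracted from each file as (text, label) pairs, for example by an NER library, and treats two entities as potentially related when they appear in the same document.

```python
import json
from collections import Counter
from itertools import combinations


def cooccurrence_report(doc_entities):
    """Build a JSON-serializable report of entity counts and co-occurrences.

    doc_entities maps a file name to the list of (text, label) entity pairs
    found in that file. Hypothetical helper, not part of BitCurator NLP.
    """
    entity_counts = Counter()
    pair_counts = Counter()
    for entities in doc_entities.values():
        # Count every mention of every entity.
        entity_counts.update(entities)
        # Count each unordered pair once per document, however many
        # times either entity is mentioned in that document.
        unique = sorted(set(entities))
        pair_counts.update(combinations(unique, 2))
    return {
        "documents": len(doc_entities),
        "entities": [
            {"text": text, "label": label, "mentions": n}
            for (text, label), n in entity_counts.most_common()
        ],
        "cooccurrences": [
            {"a": a[0], "b": b[0], "documents": n}
            for (a, b), n in pair_counts.most_common()
        ],
    }


# Hypothetical input: entities already extracted from two files.
sample = {
    "memo.docx": [
        ("Ada Lovelace", "PERSON"),
        ("London", "GPE"),
        ("Ada Lovelace", "PERSON"),
    ],
    "letter.pdf": [
        ("Ada Lovelace", "PERSON"),
        ("London", "GPE"),
        ("Royal Society", "ORG"),
    ],
}
report = cooccurrence_report(sample)
print(json.dumps(report, indent=2))
```

Counting pairs once per document, rather than once per mention, keeps a single long file from dominating the relationship counts. A real pipeline would replace the hand-written `sample` with the output of the text extraction and entity recognition steps.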
Generating topic models from disk image contents
Get Help or Contribute
Source code in the BitCurator NLP GitHub repositories is licensed under the LGPL v3 except where otherwise noted. This wiki, documentation, and other materials generated by the BitCurator team are licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). All other software included in the BitCurator environment is distributed in accordance with its original licenses.
Development, Funding, and Partners
The BitCurator development team is hosted by the School of Information and Library Science at the University of North Carolina at Chapel Hill (UNC SILS). Grants from the Andrew W. Mellon Foundation supported the BitCurator project (a partnership between UNC SILS and the Maryland Institute for Technology in the Humanities) through September 2014, and the BitCurator Access project through September 2016. A grant from the Andrew W. Mellon Foundation currently supports the BitCurator NLP project (2016-2018).