BitCurator NLP

The BitCurator NLP project will develop software that collecting institutions can use to extract, analyze, and report on features of interest in text extracted from born-digital materials in their collections. The software will use existing natural language processing libraries to identify and report on items likely to be relevant to ongoing preservation, information organization, and access activities. These may include entities (e.g., persons, places, and organizations), potential relationships among entities (for example, entities that appear together within a document or set of documents), and topic models that provide insight into how concepts cluster within the documents.

Downloads

Please check back in the coming months. The project linked below is a placeholder.

bitcurator-nlp (GitHub)

Current and past releases

Online Help

BitCurator User Group: Get support and discuss issues with the community.
Screencasts and Video Tutorials: Useful screencasts on our YouTube channel.

Licenses

The source code in the BitCurator NLP GitHub repositories is licensed under the GPL v3. This wiki, the documentation, and other materials generated by the BitCurator team are licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). All other software included in the BitCurator environment is distributed in accordance with its original licenses.

Research and Development Areas

The BitCurator NLP project will produce software that allows institutions to extract, analyze, and report on features in text extracted from collection materials. The software will rely on existing NLP libraries to identify items likely to be relevant to ongoing preservation, information organization, and access activities. These may include entities (e.g., persons, places, and organizations), potential relationships among entities (e.g., entities that appear together within a document or set of documents), and topic models that provide insight into how concepts cluster within the documents. The software will provide an API and a RESTful interface for select operations in the toolset, allowing users to create customized reports from extracted text.
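One way to read "entities that appear together within documents" is as a document-level co-occurrence count. The sketch below illustrates that idea using only the Python standard library. It is not the project's implementation: the function name `cooccurrence_counts` and the sample data are hypothetical, and the entity lists are written by hand to stand in for the output of a named-entity recognizer.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_entities):
    """Count, for each unordered entity pair, the number of documents containing both.

    doc_entities maps a document name to the list of entity strings found in it.
    (Hypothetical input shape; a real pipeline would fill it from an NER step.)
    """
    counts = Counter()
    for entities in doc_entities.values():
        # Deduplicate within a document and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        counts.update(combinations(unique, 2))
    return counts


# Hand-written entity lists standing in for recognizer output (assumption).
sample = {
    "memo1.txt": ["Ada Lovelace", "London", "Royal Society"],
    "memo2.txt": ["Ada Lovelace", "London"],
    "memo3.txt": ["Royal Society", "London", "London"],
}

pairs = cooccurrence_counts(sample)
for (first, second), n_docs in pairs.most_common():
    print(f"{first} <-> {second}: {n_docs} document(s)")
```

Counting documents rather than raw mentions keeps a single long file from dominating the result; counting mentions instead would be an equally reasonable choice for some collections.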

Born-digital collections often include a wide range of complex file formats (for example, Office documents, PDF files, email, and audiovisual materials) from which text may be extracted directly or by automated transcription. Text extraction from arbitrary collections of files is itself a non-trivial task; solving this problem is not a focus of the project. Instead, BitCurator NLP will use existing software platforms (including textract, textacy, spaCy, scikit-learn, and TextBlob) to extract text from heterogeneous collections of files and to execute NLP tasks such as part-of-speech tagging, entity and entity relationship extraction, and topic segmentation and recognition. The software will allow users to select candidate files from a collection and create human-readable reports (PDF or text) and machine-readable reports (JSON) that meaningfully characterize the contents of those files based on the raw (unannotated) text they contain.
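The split between machine-readable and human-readable reports can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the function names, the report fields (`characters`, `words`, `top_terms`), and the simple word counting, which stands in for the richer features the NLP libraries named above would supply. The input is text that has already been extracted from the selected files.

```python
import json
import re
from collections import Counter


def summarize(text, top_n=3):
    """Return simple descriptive features for one file's raw text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "characters": len(text),
        "words": len(words),
        # Most frequent terms; a real report would likely filter stop words.
        "top_terms": Counter(words).most_common(top_n),
    }


def build_report(texts):
    """Build a report dict from a mapping of file name to extracted text."""
    return {"files": {name: summarize(text) for name, text in texts.items()}}


def to_json(report):
    """Machine-readable form of the report."""
    return json.dumps(report, indent=2, sort_keys=True)


def to_text(report):
    """Human-readable plain-text form of the same report."""
    lines = []
    for name, info in sorted(report["files"].items()):
        terms = ", ".join(f"{term} ({count})" for term, count in info["top_terms"])
        lines.append(f"{name}: {info['words']} words, {info['characters']} characters")
        lines.append(f"  frequent terms: {terms}")
    return "\n".join(lines)


# Hypothetical extracted text for two selected files.
selected = {
    "a.txt": "the cat and the hat",
    "b.txt": "Minutes of the meeting held in London",
}
report = build_report(selected)
print(to_json(report))
print(to_text(report))
```

Building one report structure and rendering it two ways keeps the JSON and text outputs consistent with each other; a PDF renderer could consume the same structure.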