Documenting the American South: DocSouth Data

What is DocSouth Data?

DocSouth Data provides access to some of the Documenting the American South collections in formats that work well with common text mining and data analysis tools.

Documenting the American South is one of the longest-running digital publishing initiatives at the University of North Carolina. It was designed to give researchers digital access to some of the library’s unique collections in the form of high-quality page scans as well as structured, corrected and machine-readable text.

DocSouth Data is an extension of this original goal and has been designed for researchers who want to use emerging technology to look for patterns across entire texts or to compare patterns found in multiple texts. We have made it easy to use tools such as Voyant to conduct simple word counts and frequency visualizations (such as word clouds) or to use other tools to perform more complex processes such as topic modeling, named-entity recognition or sentiment analysis.

How do I use it?

Click one of the collection links below, and your computer should begin downloading a .zip file containing the data for that collection. More collections will be added in the future.

The Church in the Southern Black Community

First-Person Narratives of the American South

Library of Southern Literature

North American Slave Narratives

Once you download and unzip a collection from DocSouth Data, you will have a folder that contains a subfolder named Data as well as several documents related to the BagIt process that was used to package the collection (for more information on BagIt, look here: http://www.digitalpreservation.gov/multimedia/videos/bagit0609.html).
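
For readers who want to confirm that a download arrived intact, the BagIt documents can be checked with a short script. The following is a minimal sketch, not part of DocSouth Data itself. It assumes the unzipped folder follows the standard BagIt layout, in which each manifest file is named manifest-<algorithm>.txt and lists one checksum and one relative file path per line. The function name verify_bag is hypothetical, and which checksum algorithm a given collection uses is not stated here, so the script reads it from the manifest file name.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Check files against every manifest-<algorithm>.txt found in a bag.

    Returns a list of (relative_path, problem) tuples. An empty list means
    every file listed in the manifests was found and matched its checksum.
    """
    bag_dir = Path(bag_dir)
    problems = []
    for manifest in sorted(bag_dir.glob("manifest-*.txt")):
        # The algorithm name is the part of the file name after "manifest-".
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # Each line is "<checksum> <path relative to the bag folder>".
            expected, relative_path = line.split(None, 1)
            relative_path = relative_path.strip()
            target = bag_dir / relative_path
            if not target.is_file():
                problems.append((relative_path, "missing"))
                continue
            actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
            if actual != expected.lower():
                problems.append((relative_path, "checksum mismatch"))
    return problems
```

Calling verify_bag with the path to the unzipped folder and getting back an empty list would suggest that nothing was lost or altered in transit. The script reads each file fully into memory, which is a simplification that suits collections of text files.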

Most users will find the Data folder to be most useful. It contains the following:

A folder containing each of the texts in the collection as plain text;
A folder containing each of the texts with complete TEI/XML markup;
A Table of Contents (TOC) file, which is a .csv file that can be opened in Excel as a spreadsheet;
A file named “Read Me”;
A file named “text-only.xsl”.

The plain text files can be used in text mining projects such as topic modeling, sentiment analysis and natural language processing.
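
As an illustration of the simplest kind of text mining, the sketch below counts word frequencies, first in a single string and then across every .txt file in a folder. It is only one possible approach. The function names are hypothetical, the folder path passed to collection_frequencies is whatever the plain text folder is called in your download, and the tokenizer is deliberately crude (lowercase letters with optional internal apostrophes).

```python
import re
from collections import Counter
from pathlib import Path


def word_frequencies(text):
    """Count lowercase word tokens in a string."""
    # Treat runs of letters, optionally joined by apostrophes, as words.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


def collection_frequencies(folder):
    """Sum word counts over every .txt file directly inside a folder."""
    totals = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps the loop going if a file is not valid UTF-8
        # (the encoding of the real files is an assumption here).
        text = path.read_text(encoding="utf-8", errors="replace")
        totals.update(word_frequencies(text))
    return totals


if __name__ == "__main__":
    # Invented sample sentence, used only to show the call.
    sample = "The river was wide, and the river was deep."
    print(word_frequencies(sample).most_common(3))
```

For a real collection, collection_frequencies(path_to_plain_text_folder).most_common(20) would list the twenty most frequent words. A stop-word list would normally be applied before interpreting such counts.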

The TEI/XML files have been included for advanced users who would like to isolate particular parts of text for analysis.
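
To give a sense of what isolating parts of a text can look like, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module to pull out the text of every element with a given name. The sample markup is invented for illustration and is far simpler than a real TEI file. Which elements appear in the DocSouth files, and whether those files declare an XML namespace, are assumptions, so the sketch matches on the element name alone. Real files that rely on a DTD or on named character entities may need extra handling before this parser accepts them.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_elements(xml_source, element_name):
    """Return the text of every element whose local name matches."""
    root = ET.fromstring(xml_source)
    results = []
    for element in root.iter():
        if isinstance(element.tag, str) and local_name(element.tag) == element_name:
            # itertext() gathers text from the element and its descendants;
            # split/join collapses line breaks and repeated spaces.
            results.append(" ".join("".join(element.itertext()).split()))
    return results


# Invented sample markup, not taken from any DocSouth file.
SAMPLE = """<TEI><text><body>
<div type="chapter"><head>Chapter I</head><p>I was born in <name>Virginia</name>.</p></div>
<div type="chapter"><head>Chapter II</head><p>We moved south.</p></div>
</body></text></TEI>"""

if __name__ == "__main__":
    print(extract_elements(SAMPLE, "head"))
    print(extract_elements(SAMPLE, "p"))
```

The same function could be pointed at paragraphs, headings or names, depending on which part of the text an analysis needs.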

The .csv file acts as a table of contents for the collection. It includes the Title, Author and Publication Date of each text, along with two web addresses: one for the original DocSouth page and a unique URL pointing to a web-accessible version of the text (the latter is particularly useful with Voyant).
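
The table of contents can also be read by a script rather than opened in Excel. The sketch below uses Python's standard csv module to load the rows and filter them by a column value. The sample rows are invented, and the exact column headers in the real file are an assumption, so it is worth printing the headers of the actual file before filtering on them.

```python
import csv
import io


def read_toc(csv_file):
    """Read a table-of-contents CSV into a list of dicts keyed by header."""
    return list(csv.DictReader(csv_file))


def find_rows(rows, column, needle):
    """Return rows whose value in `column` contains `needle`, ignoring case."""
    needle = needle.lower()
    return [row for row in rows if needle in (row.get(column) or "").lower()]


# Invented sample data; the real headers and rows may differ.
SAMPLE_TOC = (
    "Title,Author,Publication Date\n"
    'A Sample Narrative,"Doe, Jane",1850\n'
    'Another Sample Text,"Roe, Richard",1861\n'
)

if __name__ == "__main__":
    rows = read_toc(io.StringIO(SAMPLE_TOC))
    print(list(rows[0].keys()))  # inspect the column headers first
    for row in find_rows(rows, "Author", "doe"):
        print(row["Title"], row["Publication Date"])
```

With a downloaded collection, the same functions would be called on a file opened with open(path, newline="", encoding="utf-8"), where path points to the .csv file and the encoding is a guess to adjust if needed.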

The Read Me file provides documentation about the collection, including information about any quirks or idiosyncrasies researchers will need to be aware of when working with the data.

The text-only.xsl file is the script that was used to create the folder of plain text files.

Feedback

Your work drives our work. Please let us know how you are using the data and whether you have any suggestions for making it even more useful. Send any feedback to wilsonlibrary@unc.edu.

Copyright Statement

The University Library at the University of North Carolina at Chapel Hill has made the texts, encoding, and metadata in DocSouth Data available for use without restriction. It is ultimately the responsibility of the user to make the final determination about the legality of re-using materials from this collection. Please consider crediting the University of North Carolina at Chapel Hill for making this material available.