Data

We have many different corpora that can be measured, investigated, and utilized for substantive research. We are offering some of our smaller academic corpora, as well as some samples from our larger corpora, free for downloading. For everything else, email us to discuss availability and pricing. Our collections have been cleaned up and normalized, and will work seamlessly with our free investigative software.



Individually and collectively, these corpora allow for a vast range of analysis and investigation; however, it is simply not practical to do research in the traditional way, reading and analyzing every document. There is also more to working with text than simply tallying raw frequencies of words or phrases. We have the ability to help you answer a universe of questions, employing a range of principled methodologies. Contact us. We can help.

Note: All of the following corpora are composed of public-domain documents or documents released by permission of the owner/author.

Twitter

We have a Stratified Random Sample (SRS) of Twitter for 2012 and 2013 containing approximately 30 million tweets per year. We also have an English-only full set containing approximately 12 million tweets. If you're interested in our SRS procedure, this text file explains everything. Also, if you are interested in the full Twitter SRS, please contact us.

We have just released sub-samples of the Twitter Stratified Random Sample for 2012 and 2013. We have a 10% English-only sample (approximately 1.2 million tweets). We also have a 1% sample (approximately 120,000 tweets) and a 0.1% sample (approximately 12,000 tweets). We highly recommend that you use analysis software such as KwicKwic to search the data.
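For readers unfamiliar with stratified random sampling, the general idea can be sketched in a few lines of Python. This is only an illustration of the technique, not our actual SRS procedure (the text file mentioned above is the authoritative description). The function name stratified_sample, the choice of month as the stratum, and the record layout are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum, without replacement.

    records    -- any iterable of items (hypothetical: tweets)
    stratum_of -- function mapping an item to its stratum key
                  (hypothetical: the month the tweet was posted)
    fraction   -- sampling fraction, e.g. 0.10, 0.01, or 0.001
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable

    # Group the records by stratum.
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    # Sample each stratum independently, in a stable order.
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Toy data: (month, id) pairs standing in for tweets.
toy = [(month, i) for month in ("2012-01", "2012-02") for i in range(1000)]
ten_percent = stratified_sample(toy, lambda rec: rec[0], 0.10)
```

Because each stratum is sampled at the same rate, the sub-sample keeps the proportions of the full set, which is the point of stratifying rather than drawing one simple random sample.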

We have been collecting the tweets from 25 major news organizations in English going back to 2010 and continuing to the present. There are approximately 40,000 news org tweets per month. If you're interested in this corpus, contact us.

UGA Tobacco Documents

This is our favorite small corpus for preliminary tests of our programs as well as for teaching. It contains 883 public domain files (roughly 1 million words) from the University of Georgia Tobacco Documents Project website. Please see the website for a very detailed description of the corpus content, but in a nutshell, these documents represent a well-bounded, stratified, random sample of American tobacco-industry discourse from the latter half of the 20th Century. This collection contains the documents from the "quota" and "supplemental" corpora, which together form the overall sample of tobacco-industry documents. The documents were downloaded and processed in March, 2011. See the included README file for more detailed information about the arrangement of this corpus.

Linguistic Atlas of the Middle Rockies (LAMR) Transcripts

We are very happy to be able to provide to you the transcripts from the Linguistic Atlas of the Middle Rockies. This monumental collection, the first of its kind to be released to the public, contains 70 transcripts of interviews conducted with informants throughout the states of Colorado, Utah, and Wyoming as part of a Linguistic Atlas of the Western States. Conducted between 1988 and 2004, each interview was approximately 3 hours long. In total, the 70 interviews contain 1.7 million words in over 95,000 prompt and response pairs. The transcripts cover areas such as family, ranching and farming, household goods and clothing, local flora and fauna, and geography. Because it includes relatively long elicited passages of discourse from most informants, in addition to rather short responses to specific survey questions, the corpus allows analysts to go beyond investigations of the lexicon to examine morphological, syntactic and discourse features in a region that has received little attention in the variationist literature. See the included README files for more detailed information about the construction and arrangement of this corpus. Many thanks to Dr. Lamont Antieau for compiling this unique corpus and allowing us to distribute it.

The U.S. Federal Courts

This corpus consists of over 750,000 documents of U.S. Court opinions, including state-level courts, the district appeals courts, and the U.S. Supreme Court opinions, oral argument transcripts, and amicus briefs. The corpus is primarily for the years from 2005 through 2013. This corpus is growing substantially every month. The U.S. Federal Courts handle thousands of cases each year, from business bankruptcy through federal criminal cases. Questions can be asked and answered about how particular types of cases or industries are handled in the court system, at the national aggregate level or the local level. It is also possible to investigate the success or failure of different types of legal arguments. We will be offering a small sample of this collection shortly. Check back soon.

The U.S. Federal Government

This corpus consists of over 850,000 documents from 1928 through 2013. The Legislative (Congress) documents consist of transcripts from the House and Senate debates and presentations, testimony, and reports. The Executive (President) documents include transcripts from speeches, press briefings and conferences, letters, Executive Orders, Presidential Proclamations, and Presidential Statements. The federal administration component of the corpus consists of documents, speeches, and transcripts collected from government departments, agencies, and bureaus. This includes all the National Labor Relations Board rulings, FDA and SEC opinions, and extensive documents from the State Department. The range of research that this corpus supports is substantial. Questions can be asked about Congressional-Presidential relations in different periods of the 20th century; how policymakers and public administrators handle particular issues, topics, or industries; how speeches and debates translate into laws and rules; or how different or specific issues are dealt with by the federal government. We will be offering a small sample of this collection shortly. Check back soon.

The Nuclear Power Industry in the United States

This corpus consists of over 950,000 documents from the interactions and correspondence of the Nuclear Regulatory Commission (NRC) with the civilian nuclear power industry between 1999 and 2013. The NRC regulates commercial nuclear power and the use of radioactive isotopes for nuclear medicine. The corpus provides for investigations of government regulation and business interactions, as well as a detailed analysis of the operations of the nuclear power industry in the United States. We will be offering a small sample of this collection shortly. Check back soon.

Enron Email

The following corpora are based on the full Enron Email set available from William W. Cohen at cs.cmu.edu/~enron/. Our documents have been modified. They have been "cleaned" and renamed to allow easier text analysis rather than email analysis as originally intended. Basically, we have dropped most of the email header lines, keeping the four main header items and the text. You can read the included README files for more exact details. If you are interested in the full Enron collection, please contact us.
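As a rough illustration of this kind of cleaning, the sketch below uses Python's standard email module to keep a handful of headers plus the message text and drop everything else. It is not the script we used. In particular, the four headers listed in KEPT_HEADERS are an assumption made for the example (the README files state which items were actually kept), and simplify_email is a hypothetical helper name.

```python
from email import message_from_string

# Assumption for illustration only: the README files document the
# header items that were actually retained.
KEPT_HEADERS = ("Date", "From", "To", "Subject")

def simplify_email(raw):
    """Return the kept header lines, a blank line, then the body text."""
    msg = message_from_string(raw)

    # Keep only the selected headers, in a fixed order.
    header_lines = [
        f"{name}: {msg[name]}"
        for name in KEPT_HEADERS
        if msg[name] is not None
    ]

    # Plain-text messages have a string payload; for multipart
    # messages, join the text of the leaf parts.
    if msg.is_multipart():
        body = "\n".join(
            part.get_payload()
            for part in msg.walk()
            if not part.is_multipart()
        )
    else:
        body = msg.get_payload()

    return "\n".join(header_lines) + "\n\n" + body
```

Reducing each message to a few header lines and the text is what lets the files be searched as ordinary documents rather than parsed as email.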

For ease of use, if you just want to get a feel for the Enron email set, this is a stratified random sample of 1,144 sent-email documents taken from the full set of 150 persons in the Enron set. This is a good place to start. Please see the included README file for more information on how the sample was taken.

This corpus was built in exactly the same way as the 1,144-document set above, except that the sampling was expanded to include 23,112 sent emails. You can check the included README file for more exact information.

Note: The Enron documents contain sexually explicit text that may be offensive. Don't say we didn't warn you.

General Industry and Business

This corpus consists of over 200,000 documents from different industries, primarily from 2005 through 2013. The collection consists of transcripts of speeches, conference calls, and other oral presentations. It also includes some press releases and news reports. This corpus provides excellent data for investigations of the discourse that businesses have with their shareholders and the public at large, as well as with other business leaders and policymakers. Since the corpus covers the years before and after the Great Recession, it allows for trend analysis of business, executive, and industry changes. We will be offering a small sample of this collection shortly. Check back soon.


Check back. We're adding more corpora as they become available.