Summary

The capstone project applies data science to natural language processing. The data, a collection of text documents also known as a corpus, were gathered from web pages of several different types. In particular, for this project we analyze American English corpora from three distinct sources: Twitter, blogs, and news media sites. The files have been language filtered but may still contain some foreign-language text. Below, we report the major features identified in the data and briefly summarize the plan for building a prediction algorithm.

Data Exploration

The data consist of three collections of texts, or corpora: en_US.blogs.txt taken from blogs, en_US.news.txt from news sites, and en_US.twitter.txt from Twitter. In what follows we refer to each data file, or corpus, simply by its source, i.e. blogs, news, and twitter. The following table gives the line and word counts for each of the three corpora.

Corpus              Lines      Words
en_US.blogs.txt     899,288    37,334,690
en_US.news.txt      1,010,242  34,372,720
en_US.twitter.txt   2,360,148  30,374,206

We use the tm framework for cleaning the documents and the RWeka package for tokenization. Given the sheer amount of data in the three corpora, the analysis is performed on a random sample of approximately 1% of the total data.
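As an illustration, a minimal sketch of the sampling step in R is shown below; the file paths and the binomial line-selection mechanism are assumptions, since the report does not specify the exact procedure.

    # Minimal sampling sketch; paths and the rbinom-based selection are assumptions
    library(tm)

    set.seed(1234)
    sample_lines <- function(path, frac = 0.01) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), 1, frac) == 1]   # keep roughly 1% of the lines
    }

    blogs <- sample_lines("final/en_US/en_US.blogs.txt")

    # Wrap the sample in a tm corpus for cleaning and tokenization;
    # the same steps are repeated for the news and twitter files
    corpus <- VCorpus(VectorSource(blogs))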

In the process of cleaning and preprocessing the data, a series of standard text transformations was applied to the sampled corpora, as sketched below.
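A minimal sketch of such a cleaning pipeline with tm follows; the exact set of transformations used in this report is not listed, so the steps shown are assumptions (stop-word removal is implied by the discussion at the end).

    # Assumed cleaning steps applied to each sampled corpus
    corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case all text
    corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
    corpus <- tm_map(corpus, removeNumbers)                       # strip digits
    corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop common stop words
    corpus <- tm_map(corpus, stripWhitespace)                     # collapse repeated whitespace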

After cleaning and preprocessing, the number of unique words in each sampled corpus is as follows:

Corpus              Unique words  Total words
en_US.news.txt      25,080        134,796
en_US.twitter.txt   21,579        124,318
en_US.blogs.txt     22,229        116,281

We computed the top ten most frequent words for each of the corpora; a sketch of the computation is shown below.
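The word frequencies can be obtained from a term-document matrix built per corpus; converting it to a dense matrix is affordable here because only a 1% sample is used.

    # Word frequencies from the cleaned sample of one corpus
    tdm  <- TermDocumentMatrix(corpus)
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    head(freq, 10)   # the ten most frequent words and their counts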

We also examine how many unique words are needed to cover a given fraction of all word occurrences in the language (sample); a sketch of this coverage computation follows.
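Based on the frequency vector above, the coverage can be computed from the cumulative sum of the sorted frequencies.

    # Cumulative share of word occurrences covered by the most frequent words
    coverage <- cumsum(freq) / sum(freq)
    min(which(coverage >= 0.5))   # unique words needed to cover 50% of occurrences
    min(which(coverage >= 0.9))   # unique words needed to cover 90% of occurrences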

To achieve acceptable accuracy in the prediction algorithm, it is important to obtain the frequencies of word pairs, or bigrams. Accuracy is further improved by incorporating into the model the frequencies of longer word sequences: three-, four-, or, in general, n-grams.
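A sketch of the bigram counting step with RWeka is shown below; the same tokenizer with min = max = 3 yields trigrams, and likewise for higher-order n-grams.

    # Bigram frequencies using RWeka's n-gram tokenizer
    library(RWeka)
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm2    <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
    bigrams <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
    head(bigrams, 10)   # the most frequent word pairs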

Discussion

There are subtle differences among the three corpora. It therefore seems that the optimal prediction algorithm would combine, through a weighted average, the n-gram frequencies obtained from the three sources at our disposal. Accuracy can also be improved with more thorough cleaning of the data and by adapting the preprocessing to the type of source. Stop words, which were removed during exploration, will be added back so that they can be predicted, without adding too much complexity to the prediction algorithm.
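As an illustration of the planned weighting scheme only, the hypothetical helper below combines relative n-gram frequencies from the three sources using per-source weights; neither this function nor the equal default weights appear in the report.

    # Hypothetical combination of per-source n-gram frequencies (sketch only)
    combine_ngrams <- function(freq_blogs, freq_news, freq_twitter,
                               w = c(1/3, 1/3, 1/3)) {
      terms <- union(names(freq_blogs), union(names(freq_news), names(freq_twitter)))
      rel <- function(f) {                      # relative frequencies over all terms
        out <- setNames(rep(0, length(terms)), terms)
        out[names(f)] <- f / sum(f)
        out
      }
      # Weighted average of the three relative-frequency vectors
      sort(w[1] * rel(freq_blogs) + w[2] * rel(freq_news) + w[3] * rel(freq_twitter),
           decreasing = TRUE)
    }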