Summary

The capstone project applies data science to natural language processing. The data, a collection of text documents also known as a corpus, were gathered from web pages of several different types. In particular, for this project we analyze American English corpora from three distinct sources: Twitter, blogs, and news media sites. The files have been language filtered but may still contain some foreign-language text. Below, we report the major features identified in the data and briefly summarize the plan for building a prediction algorithm.

Data Exploration

The data consist of three collections of texts, or corpora: en_US.blogs.txt taken from blogs, en_US.news.txt from news sites, and en_US.twitter.txt from Twitter. In what follows we refer to each data file, or corpus, simply by its source, i.e. blogs, news, and twitter. The following table gives the line and word counts for each of the three corpora.

Corpus              Lines      Words
en_US.blogs.txt     899,288    37,334,690
en_US.news.txt      1,010,242  34,372,720
en_US.twitter.txt   2,360,148  30,374,206

We use the tm framework for cleaning the documents and the RWeka package for tokenization. Given the sheer amount of data in the three corpora, the analysis is performed on a random sample of approximately 1% of the total data.
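As an illustration, a minimal sketch of the sampling step in R is shown below; the file paths and the binomial line-selection mechanism are assumptions, since the report does not specify the exact procedure.

    # Minimal sampling sketch; paths and the rbinom-based selection are assumptions
    library(tm)

    set.seed(1234)
    sample_lines <- function(path, frac = 0.01) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), 1, frac) == 1]   # keep roughly 1% of the lines
    }

    blogs <- sample_lines("final/en_US/en_US.blogs.txt")

    # Wrap the sample in a tm corpus for cleaning and tokenization;
    # the same steps are repeated for the news and twitter files
    corpus <- VCorpus(VectorSource(blogs))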

In the process of cleaning and preprocessing the data, a series of standard text transformations was applied to the sampled corpora, as sketched below.
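A minimal sketch of such a cleaning pipeline with tm follows; the exact set of transformations used in this report is not listed, so the steps shown are assumptions (stop-word removal is implied by the discussion at the end).

    # Assumed cleaning steps applied to each sampled corpus
    corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case all text
    corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
    corpus <- tm_map(corpus, removeNumbers)                       # strip digits
    corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop common stop words
    corpus <- tm_map(corpus, stripWhitespace)                     # collapse repeated whitespace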

After cleaning and preprocessing, the number of unique words in each sampled corpus is as follows:

Corpus              Unique words  Total words
en_US.news.txt      25,080        134,796
en_US.twitter.txt   21,579        124,318
en_US.blogs.txt     22,229        116,281

We computed the top ten most frequent words for each of the corpora; a sketch of the computation is shown below.
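The word frequencies can be obtained from a term-document matrix built per corpus; converting it to a dense matrix is affordable here because only a 1% sample is used.

    # Word frequencies from the cleaned sample of one corpus
    tdm  <- TermDocumentMatrix(corpus)
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    head(freq, 10)   # the ten most frequent words and their counts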

We also examine how many unique words are needed to cover a given fraction of all word occurrences in the language (sample); a sketch of this coverage computation follows.
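Based on the frequency vector above, the coverage can be computed from the cumulative sum of the sorted frequencies.

    # Cumulative share of word occurrences covered by the most frequent words
    coverage <- cumsum(freq) / sum(freq)
    min(which(coverage >= 0.5))   # unique words needed to cover 50% of occurrences
    min(which(coverage >= 0.9))   # unique words needed to cover 90% of occurrences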

To achieve acceptable accuracy in the prediction algorithm, it is important to obtain the frequencies of word pairs, or bigrams. Accuracy is further improved by incorporating into the model the frequencies of longer word sequences: three-, four-, or, in general, n-grams.
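A sketch of the bigram counting step with RWeka is shown below; the same tokenizer with min = max = 3 yields trigrams, and likewise for higher-order n-grams.

    # Bigram frequencies using RWeka's n-gram tokenizer
    library(RWeka)
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm2    <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
    bigrams <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
    head(bigrams, 10)   # the most frequent word pairs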

Discussion

There are subtle differences among the three corpora. It therefore seems that the optimal prediction algorithm would combine, through a weighted average, the n-gram frequencies obtained from the three sources at our disposal. Accuracy can also be improved with more thorough cleaning of the data and by adapting the preprocessing to the type of source. Stop words, which were removed during exploration, will be added back so that they can be predicted, without adding too much complexity to the prediction algorithm.
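As an illustration of the planned weighting scheme only, the hypothetical helper below combines relative n-gram frequencies from the three sources using per-source weights; neither this function nor the equal default weights appear in the report.

    # Hypothetical combination of per-source n-gram frequencies (sketch only)
    combine_ngrams <- function(freq_blogs, freq_news, freq_twitter,
                               w = c(1/3, 1/3, 1/3)) {
      terms <- union(names(freq_blogs), union(names(freq_news), names(freq_twitter)))
      rel <- function(f) {                      # relative frequencies over all terms
        out <- setNames(rep(0, length(terms)), terms)
        out[names(f)] <- f / sum(f)
        out
      }
      # Weighted average of the three relative-frequency vectors
      sort(w[1] * rel(freq_blogs) + w[2] * rel(freq_news) + w[3] * rel(freq_twitter),
           decreasing = TRUE)
    }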