To begin this assignment, I first downloaded the three English datasets: Twitter, Blogs, and News. I read them into R and explored some of their basic properties: the file size, number of lines, word count, and number of characters in each file. Here is a summary of the data.
| File | File Size | Lines | Words | Characters |
|---|---|---|---|---|
| en_US.twitter.txt | 159 MB | 2,360,148 | 30,451,170 | 162,096,241 |
| en_US.blogs.txt | 200 MB | 899,288 | 37,570,839 | 206,824,505 |
| en_US.news.txt | 196 MB | 77,259 | 2,651,432 | 15,639,408 |
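For reference, here is a minimal sketch of how these summary statistics could be computed; the file paths are my own assumption and should point at the downloaded data sets.

```r
# Sketch: compute file size, line count, word count, and character count.
# The paths below are assumed to point at the downloaded English data sets.
files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")

summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File       = basename(path),
    FileSizeMB = round(file.info(path)$size / 1024^2),
    Lines      = length(lines),
    Words      = sum(sapply(strsplit(lines, "\\s+"), length)),
    Characters = sum(nchar(lines))
  )
}

do.call(rbind, lapply(files, summarize_file))
```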
I also decided to look at the number of characters per line in each of the files. To make the box plot more readable, I used a logarithmic scale on the y-axis, since the data covers a large range of values.
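A minimal sketch of how such a box plot could be produced, assuming the three files have already been read into character vectors `twitter`, `blogs`, and `news`:

```r
# Box plot of characters per line for each source, on a logarithmic y-axis.
# Assumes `twitter`, `blogs`, and `news` were read in with readLines() above.
chars_per_line <- list(
  Twitter = nchar(twitter),
  Blogs   = nchar(blogs),
  News    = nchar(news)
)
boxplot(chars_per_line, log = "y",
        ylab = "Characters per line (log scale)",
        main = "Characters per line by source")
```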
We can see that the average number of characters per line is smallest in the Twitter data, which makes sense given Twitter's character limit. The average number of characters per line is slightly larger in the news data than it is in the blog data. Since there is a lot of data across these files combined, I decided to look at only a sample of the data for some exploratory analysis.
I only want to look at a portion of the data, but I also want to keep the proportions across the three sources the same, so I randomly sampled 1% of the lines in each file and then combined the samples into one file. From there, I explored the same properties as above: the file size, number of lines, word count, and number of characters in the new sample file. Here is a summary of my sample data.
| File | File Size | Lines | Words | Characters |
|---|---|---|---|---|
| full_sample.txt | 4 MB | 33,365 | 702,754 | 3,837,690 |
So, we can see that this is a more manageable set of data to begin exploring. The number of lines is cut down to 33,365.
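The sampling step could look roughly like this; the seed value is my own assumption, added only so the sample is reproducible.

```r
# Randomly sample 1% of the lines from each source and combine into one file.
set.seed(1234)  # assumed seed, for reproducibility

sample_lines <- function(lines, frac = 0.01) {
  lines[sample(length(lines), size = floor(frac * length(lines)))]
}

full_sample <- c(sample_lines(twitter), sample_lines(blogs), sample_lines(news))
writeLines(full_sample, "full_sample.txt")
```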
Next, I used the tm package to create a corpus and clean the data. I first removed any URLs (beginning with http) and then removed profanity. The list of profanity words I used was found here: List of Profanity Words. I then converted all letters to lowercase, removed punctuation, removed numbers, stripped the whitespace, and saved my cleaned data set. Here is a sample of 10 lines from my cleaned data set (a sketch of the cleaning steps follows the sample).
| wild oak dr |
| the list begins with carefully selected bytheglass offerings including the selene hyde vineyard sauvignon blanc and the miner family napa valley syrah thats followed by halfbottles sparkling wines and champagnes and more than fullbottle selections |
| salt lake city the san antonio spurs were feeling good monday night after sweeping their firstround western conference series with the utah jazz |
| free throw name the coaching legend who recently died he got his coaching start with the cavs |
| before graduating uc santa barbara in he joined the rotc and served a stint in the army which might seem another stereotypebreaker for poet but mallory sees no contradiction i love my country why cant i also be a poet mallory said |
| that does not bode well for the eternal hope the run for the |
| the agreement was reached shortly before parvaizs wife was fatally shot a killing for which he and a female companion have been charged |
| made by swissbased syngenta under the trademark enogen the corn was approved over the objections of the biggest names in the us snack and cereals industry syngenta tests show that one kernel in can liquefy grits |
| it really is a casebycase basis close said |
| for a look at some of the action in the previous rounds of the ncaa basketball tournament check out our photo galleries throughout the east regional round we will be posting additional galleries after each game so make sure you check back |
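Here is a rough sketch of that cleaning pipeline with the tm package; the profanity file name is a placeholder for whichever word list is downloaded from the link above.

```r
library(tm)

sample_text <- readLines("full_sample.txt", encoding = "UTF-8", skipNul = TRUE)
profanity   <- readLines("profanity_words.txt")  # placeholder file name

corpus <- VCorpus(VectorSource(sample_text))

# Remove URLs (anything beginning with http), then profanity words.
remove_urls <- content_transformer(function(x) gsub("http\\S*", "", x))
corpus <- tm_map(corpus, remove_urls)
corpus <- tm_map(corpus, removeWords, profanity)

# Lowercase, strip punctuation and numbers, and collapse extra whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Save the cleaned text back out to a file.
cleaned <- sapply(seq_along(corpus), function(i) content(corpus[[i]]))
writeLines(cleaned, "full_sample_clean.txt")
```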
Next, I used the RWeka package to tokenize the data and look at the most commonly used unigrams, bigrams, and trigrams (a sketch of the tokenization step is shown below). Here are histograms of the 10 most commonly used unigrams, bigrams, and trigrams.
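The tokenization could be done roughly as follows with RWeka; this assumes `corpus` is the cleaned corpus from the previous sketch, and the `top_terms()` helper is my own illustration.

```r
library(RWeka)

bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigram_tdm <- TermDocumentMatrix(corpus)
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Helper (illustrative): the n most frequent terms in a term-document matrix.
top_terms <- function(tdm, n = 10) {
  head(sort(slam::row_sums(tdm), decreasing = TRUE), n)
}

# One bar chart per n-gram size, e.g. for unigrams:
barplot(top_terms(unigram_tdm), las = 2, main = "Top 10 unigrams")
```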
Looking at these most common unigrams, bigrams, and trigrams, I can see that almost all of them are stopwords. I originally left them in my data set because they will be important when predicting text. But for some additional exploratory analysis, I decided to remove them to see which common unigrams, bigrams, and trigrams did not include stopwords (a sketch of this step is shown below). Here are histograms of the 10 most commonly used unigrams, bigrams, and trigrams after excluding stopwords.
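Removing the stopwords before re-tokenizing could look like this, reusing the corpus and the `top_terms()` helper from the sketches above:

```r
# Remove English stopwords, then rebuild the term-document matrices.
corpus_nostop <- tm_map(corpus, removeWords, stopwords("english"))
corpus_nostop <- tm_map(corpus_nostop, stripWhitespace)

unigram_tdm_nostop <- TermDocumentMatrix(corpus_nostop)
barplot(top_terms(unigram_tdm_nostop), las = 2,
        main = "Top 10 unigrams (stopwords removed)")
```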
My plan for building a prediction model is to create a program that uses the most common n-grams to predict the next word in a sentence. For example, given 2 words, I would search the trigrams for the most common occurrence of those 2 words as the first two words of a trigram, and then use the third word of that trigram as my prediction.
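As an illustration only (not a final implementation), that lookup could work like the sketch below; `trigram_freq` is assumed to be a named vector of trigram counts, with names such as "thanks for the", like the sorted frequencies built in the tokenization sketch above.

```r
# Given two words, return the third word of the most frequent matching trigram.
predict_next_word <- function(word1, word2, trigram_freq) {
  prefix  <- paste(word1, word2)
  matches <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
  if (length(matches) == 0) return(NA_character_)   # no matching trigram found
  best <- names(which.max(matches))                 # most frequent matching trigram
  strsplit(best, " ")[[1]][3]                       # its third word is the prediction
}

# Hypothetical usage: predict_next_word("thanks", "for", trigram_freq)
```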