To begin this assignment, I first downloaded the three English datasets: Twitter, Blogs, and News. I read them into R and explored some of their basic properties: the file size, number of lines, word count, and number of characters in each file. Here is a summary of the data.
| File | File Size | Lines | Words | Characters |
|---|---|---|---|---|
| en_US.twitter.txt | 159 MB | 2,360,148 | 30,451,170 | 162,096,241 |
| en_US.blogs.txt | 200 MB | 899,288 | 37,570,839 | 206,824,505 |
| en_US.news.txt | 196 MB | 77,259 | 2,651,432 | 15,639,408 |
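For reference, here is a minimal sketch of how these summary statistics could be computed; the file paths are my own assumption and should point at the downloaded data sets.

```r
# Sketch: compute file size, line count, word count, and character count.
# The paths below are assumed to point at the downloaded English data sets.
files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")

summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File       = basename(path),
    FileSizeMB = round(file.info(path)$size / 1024^2),
    Lines      = length(lines),
    Words      = sum(sapply(strsplit(lines, "\\s+"), length)),
    Characters = sum(nchar(lines))
  )
}

do.call(rbind, lapply(files, summarize_file))
```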
I also decided to look at the number of characters per line in each of the files. To make the box plot more readable, I used a logarithmic scale on the y-axis, since the data covers a large range of values.
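A minimal sketch of how such a box plot could be produced, assuming the three files have already been read into character vectors `twitter`, `blogs`, and `news`:

```r
# Box plot of characters per line for each source, on a logarithmic y-axis.
# Assumes `twitter`, `blogs`, and `news` were read in with readLines() above.
chars_per_line <- list(
  Twitter = nchar(twitter),
  Blogs   = nchar(blogs),
  News    = nchar(news)
)
boxplot(chars_per_line, log = "y",
        ylab = "Characters per line (log scale)",
        main = "Characters per line by source")
```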
We can see that the average number of characters per line is smallest in the Twitter data, which makes sense given Twitter's character limit. The average number of characters per line is slightly larger in the news data than it is in the blog data. Since there is a lot of data across these files combined, I decided to look at only a sample of the data for some exploratory analysis.
I only want to look at a portion of the data, but I also want to keep the proportions across the three sources the same, so I randomly sampled 1% of the lines in each file and then combined the samples into one file. From there, I explored the same properties as above: the file size, number of lines, word count, and number of characters in the new sample file. Here is a summary of my sample data.
| File | File Size | Lines | Words | Characters |
|---|---|---|---|---|
| full_sample.txt | 4 MB | 33,365 | 702,754 | 3,837,690 |
So, we can see that this is a more manageable set of data to begin exploring. The number of lines is cut down to 33,365.
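The sampling step could look roughly like this; the seed value is my own assumption, added only so the sample is reproducible.

```r
# Randomly sample 1% of the lines from each source and combine into one file.
set.seed(1234)  # assumed seed, for reproducibility

sample_lines <- function(lines, frac = 0.01) {
  lines[sample(length(lines), size = floor(frac * length(lines)))]
}

full_sample <- c(sample_lines(twitter), sample_lines(blogs), sample_lines(news))
writeLines(full_sample, "full_sample.txt")
```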
Next, I used the tm package to create a corpus and clean the data. I first removed any URLs (beginning with http) and then removed profanity. The list of profanity words I used was found here: List of Profanity Words. I then converted all letters to lowercase, removed punctuation, removed numbers, stripped the whitespace, and saved my cleaned data set. Here is a sample of 10 lines from my cleaned data set (a sketch of the cleaning steps follows the sample).
| wild oak dr |
| the list begins with carefully selected bytheglass offerings including the selene hyde vineyard sauvignon blanc and the miner family napa valley syrah thats followed by halfbottles sparkling wines and champagnes and more than fullbottle selections |
| salt lake city the san antonio spurs were feeling good monday night after sweeping their firstround western conference series with the utah jazz |
| free throw name the coaching legend who recently died he got his coaching start with the cavs |
| before graduating uc santa barbara in he joined the rotc and served a stint in the army which might seem another stereotypebreaker for poet but mallory sees no contradiction i love my country why cant i also be a poet mallory said |
| that does not bode well for the eternal hope the run for the |
| the agreement was reached shortly before parvaizs wife was fatally shot a killing for which he and a female companion have been charged |
| made by swissbased syngenta under the trademark enogen the corn was approved over the objections of the biggest names in the us snack and cereals industry syngenta tests show that one kernel in can liquefy grits |
| it really is a casebycase basis close said |
| for a look at some of the action in the previous rounds of the ncaa basketball tournament check out our photo galleries throughout the east regional round we will be posting additional galleries after each game so make sure you check back |
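Here is a rough sketch of that cleaning pipeline with the tm package; the profanity file name is a placeholder for whichever word list is downloaded from the link above.

```r
library(tm)

sample_text <- readLines("full_sample.txt", encoding = "UTF-8", skipNul = TRUE)
profanity   <- readLines("profanity_words.txt")  # placeholder file name

corpus <- VCorpus(VectorSource(sample_text))

# Remove URLs (anything beginning with http), then profanity words.
remove_urls <- content_transformer(function(x) gsub("http\\S*", "", x))
corpus <- tm_map(corpus, remove_urls)
corpus <- tm_map(corpus, removeWords, profanity)

# Lowercase, strip punctuation and numbers, and collapse extra whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Save the cleaned text back out to a file.
cleaned <- sapply(seq_along(corpus), function(i) content(corpus[[i]]))
writeLines(cleaned, "full_sample_clean.txt")
```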
Next, I used the RWeka package to tokenize the data and look at the most commonly used unigrams, bigrams, and trigrams (a sketch of the tokenization step is shown below). Here are histograms of the 10 most commonly used unigrams, bigrams, and trigrams.
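The tokenization could be done roughly as follows with RWeka; this assumes `corpus` is the cleaned corpus from the previous sketch, and the `top_terms()` helper is my own illustration.

```r
library(RWeka)

bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigram_tdm <- TermDocumentMatrix(corpus)
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Helper (illustrative): the n most frequent terms in a term-document matrix.
top_terms <- function(tdm, n = 10) {
  head(sort(slam::row_sums(tdm), decreasing = TRUE), n)
}

# One bar chart per n-gram size, e.g. for unigrams:
barplot(top_terms(unigram_tdm), las = 2, main = "Top 10 unigrams")
```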
Looking at these most common unigrams, bigrams, and trigrams, I can see that almost all of them are stopwords. I originally left them in my data set because they will be important when predicting text. But for some additional exploratory analysis, I decided to remove them to see which common unigrams, bigrams, and trigrams did not include stopwords (a sketch of this step is shown below). Here are histograms of the 10 most commonly used unigrams, bigrams, and trigrams after excluding stopwords.
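Removing the stopwords before re-tokenizing could look like this, reusing the corpus and the `top_terms()` helper from the sketches above:

```r
# Remove English stopwords, then rebuild the term-document matrices.
corpus_nostop <- tm_map(corpus, removeWords, stopwords("english"))
corpus_nostop <- tm_map(corpus_nostop, stripWhitespace)

unigram_tdm_nostop <- TermDocumentMatrix(corpus_nostop)
barplot(top_terms(unigram_tdm_nostop), las = 2,
        main = "Top 10 unigrams (stopwords removed)")
```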
My plan for building a prediction model is to create a program that uses the most common n-grams to predict the next word in a sentence. For example, given 2 words, I would search the trigrams for the most common occurrence of those 2 words as the first two words of a trigram, and then use the third word of that trigram as my prediction.
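As an illustration only (not a final implementation), that lookup could work like the sketch below; `trigram_freq` is assumed to be a named vector of trigram counts, with names such as "thanks for the", like the sorted frequencies built in the tokenization sketch above.

```r
# Given two words, return the third word of the most frequent matching trigram.
predict_next_word <- function(word1, word2, trigram_freq) {
  prefix  <- paste(word1, word2)
  matches <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
  if (length(matches) == 0) return(NA_character_)   # no matching trigram found
  best <- names(which.max(matches))                 # most frequent matching trigram
  strsplit(best, " ")[[1]][3]                       # its third word is the prediction
}

# Hypothetical usage: predict_next_word("thanks", "for", trigram_freq)
```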