The goal of this project is to show that I have become comfortable working with the data and that I am on track to create a prediction algorithm. The motivation for this project is to:

- demonstrate that the data has been downloaded and successfully loaded;
- report basic summary statistics about the data sets;
- explore the data with n-gram frequencies and plots as groundwork for the prediction model.
A link to all the archives is here. The data consists of sentences from three different sources: news, Twitter, and blogs. The archive covers several languages, but I decided to stick with English.
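For reference, here is a minimal sketch of loading the three English files, assuming the archive has been unzipped into the working directory with the standard en_US.* file names:

```r
# Read each corpus line by line; skipNul avoids warnings from
# embedded NUL characters present in the raw files.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```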
Line and word counts for each file:

| File | Lines | Words |
| --- | ---: | ---: |
| en_US.blogs.txt | 899288 | 37132075 |
| en_US.news.txt | 77259 | 2622252 |
| en_US.twitter.txt | 2360148 | 29581900 |
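A sketch of how these counts can be computed from the loaded vectors; here a word is approximated as a whitespace-separated token, which may differ slightly from the original counting method:

```r
files <- list(blogs = blogs, news = news, twitter = twitter)

# Line counts: one vector element per line in the file.
sapply(files, length)

# Word counts: split each line on whitespace and sum the tokens.
sapply(files, function(x) sum(lengths(strsplit(x, "\\s+"))))
```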
A few sample lines from the data:
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
Set a seed so the sampling below is reproducible.
```r
set.seed(3999)
```
Since the full datasets are too big to work with directly, it's wise to take a smaller random sample of each.
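A sketch of the sampling step; the 1% fraction is my assumption, as the report does not state the exact sample size:

```r
sample_frac <- 0.01  # assumed fraction, not stated in the report
blogs_s   <- sample(blogs,   floor(length(blogs)   * sample_frac))
news_s    <- sample(news,    floor(length(news)    * sample_frac))
twitter_s <- sample(twitter, floor(length(twitter) * sample_frac))

# Combined sample used for tokenization and plots below.
combined <- c(blogs_s, news_s, twitter_s)
```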
Using NGramTokenizer from the RWeka package, we can split our sample into n-gram tokens for the prediction model; a code sketch follows the frequency table below. The most frequent bigrams in the sample:
| Bigram | Frequency |
| --- | ---: |
| of the | 803 |
| in the | 762 |
| to the | 356 |
| for the | 287 |
| on the | 286 |
| and the | 267 |
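A sketch of how such a table can be built; the exact tokenizer settings behind the numbers above are not shown, so treat this as illustrative:

```r
library(RWeka)

# Tokenize the combined sample into bigrams (two-word sequences)
# and count how often each one occurs.
bigrams     <- NGramTokenizer(combined, Weka_control(min = 2, max = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq)
```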
Now let's make some visualizations.
A word cloud can show the most popular words in our combined sample.
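A sketch of how such a word cloud can be drawn with the wordcloud and tm packages; the specific cleaning steps here are my assumptions:

```r
library(tm)
library(wordcloud)

# Build a cleaned corpus from the combined sample.
corpus <- VCorpus(VectorSource(combined))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Count word frequencies and plot the 100 most common words.
tdm       <- TermDocumentMatrix(corpus)
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(word_freq), word_freq, max.words = 100)
```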
And some barplots for two- and three-word combinations (bigrams and trigrams).
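A sketch of the barplots, reusing the bigram counts from above; the trigram step just widens the n-gram window:

```r
# The 10 most frequent bigrams; las = 2 rotates the axis labels.
barplot(head(bigram_freq, 10), las = 2,
        main = "Top 10 bigrams", ylab = "Frequency")

# Trigrams follow the same pattern with a three-word window.
trigrams     <- NGramTokenizer(combined, Weka_control(min = 3, max = 3))
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
barplot(head(trigram_freq, 10), las = 2,
        main = "Top 10 trigrams", ylab = "Frequency")
```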
So far, I have downloaded the data and split it into words and n-grams. These objects can be used to build prediction models.
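To make that last point concrete, here is a minimal illustration (my own sketch, not this project's eventual model) of how a bigram frequency table supports next-word prediction:

```r
# Predict the next word as the second word of the most frequent
# bigram that starts with the given word.
predict_next <- function(word, freq = bigram_freq) {
  hits <- freq[startsWith(names(freq), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  strsplit(names(hits)[1], " ")[[1]][2]
}

predict_next("of")  # "the", judging by the frequency table above
```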