Milestone report

Goal

The goal of this project is to show that I’ve gotten used to working with the data and that I am on track to create a prediction algorithm.

Motivation

The motivation for this project is to:

  • Demonstrate that I’ve downloaded the data and have successfully loaded it in.
  • Create a basic report of summary statistics about the data sets.
  • Report any interesting findings that I have amassed so far.
  • Get feedback on plans for creating a prediction algorithm and Shiny app.

Downloading files

The link to all archives is here. The data consists of sentences from three sources: news, Twitter and blogs. Several languages are available, but I decided to stick with English.

Basic summary

Line count (first value) and word count (second value) for each file:

  • Blogs
## [1] 899288
## [1] 37132075
  • News
## [1] 77259
## [1] 2622252
  • Twitter
## [1] 2360148
## [1] 29581900
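
Counts like the ones above could be obtained along the following lines. This is a minimal base R sketch, not necessarily how the numbers in this report were produced: the helper name `count_lines_words` is hypothetical, and "word" is assumed to mean a whitespace-separated token. Opening the connection in binary mode is a common precaution, since embedded control characters can end a text-mode read early on some platforms.

```r
# Hypothetical helper: count lines and whitespace-separated words in a text file.
count_lines_words <- function(path) {
  con <- file(path, open = "rb")  # binary mode, see the note above
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split each trimmed line on runs of whitespace; empty lines contribute 0 words
  words <- sum(lengths(strsplit(trimws(lines), "\\s+")))
  c(lines = length(lines), words = words)
}

# Toy file standing in for the real archives
tmp <- tempfile(fileext = ".txt")
writeLines(c("How are you?", "Go Cubs Go!", ""), tmp)
counts <- count_lines_words(tmp)
print(counts)
```

Other tokenization rules (for example, splitting on punctuation) would give different word counts, so the definition should be fixed before comparing files.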

What the data looks like

A few sample lines from the data:

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"

Exploratory analysis

Set a seed so that the sampling is reproducible.

set.seed(3999)

Since the datasets are too big to process in full, it’s wise to work with random samples of them.
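
As a rough illustration of the sampling step, the sketch below keeps a random fraction of the lines. The helper `sample_lines`, the 5% fraction, and the toy input are assumptions made for the example; the report does not state which sample size was actually used.

```r
set.seed(3999)  # same seed as in the report

# Hypothetical helper: keep a random fraction of the lines, without replacement.
sample_lines <- function(lines, fraction = 0.01) {
  n <- round(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

# Toy input standing in for the lines of one of the real files
toy <- sprintf("line %d", 1:1000)
toy_sample <- sample_lines(toy, fraction = 0.05)
length(toy_sample)
```

Sampling each source separately and then combining the samples keeps all three sources represented in the combined sample.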

Using NGramTokenizer, we can split the data into n-gram tokens that can be used in a prediction model. The most frequent bigrams in the sample:

##          Var1 Freq
## 58147  of the  803
## 41504  in the  762
## 88125  to the  356
## 31127 for the  287
## 59169  on the  286
## 7007  and the  267
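
To make the idea of an n-gram frequency table concrete, here is a small sketch in base R. It deliberately does not use NGramTokenizer, so it runs without extra packages; the helper `make_ngrams`, its simple lowercase-and-split rule, and the toy sentences are assumptions for illustration and will not reproduce the counts above.

```r
# Hypothetical helper: build n-grams from one line of text using base R only.
make_ngrams <- function(line, n = 2) {
  # Lowercase, then split on anything that is not a letter or an apostrophe
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy sentences standing in for the sampled lines
toy <- c("one of the best days of the year", "in the middle of the night")
bigrams <- unlist(lapply(toy, make_ngrams, n = 2))

# Frequency table sorted from most to least frequent
freq <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
freq <- freq[order(-freq$Freq), ]
head(freq)
```

Calling `make_ngrams` with `n = 3` gives trigrams in the same way.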

Now let’s make some visualizations

A word cloud shows the most popular words in our combined sample.

And some bar plots for two- and three-word combinations:

Bigrams

Trigrams
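
Bar plots of this kind could be drawn roughly as follows. This is a sketch with base R graphics, and the plotting package actually used for the report is not stated. The frequencies are copied from the bigram table earlier in this report, while the column names `ngram` and `freq` are chosen for the example.

```r
# Frequencies copied from the bigram table in this report
top <- data.frame(
  ngram = c("of the", "in the", "to the", "for the", "on the", "and the"),
  freq  = c(803, 762, 356, 287, 286, 267)
)

# barplot() draws bars in the order given, so sort by descending frequency first
top <- top[order(-top$freq), ]

# las = 2 rotates the axis labels so longer n-grams stay readable
bar_mids <- barplot(top$freq, names.arg = top$ngram, las = 2,
                    main = "Most frequent bigrams", ylab = "Frequency")
```

The same call works for trigrams once a trigram frequency table is available.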

Conclusion

So far, I’ve downloaded the data and split it into words and n-grams. These objects can be used to build prediction models.