Download and read the data.

I have already downloaded the files to the Capstone Project directory.

setwd("~/RDIR/Capstone Project")
list.files("en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Read the .txt files into R objects (one character vector per file, one element per line).

blogs <- readLines("en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("en_US/en_US.twitter.txt", encoding="UTF-8", skipNul = TRUE)
news <- readLines("en_US/en_US.news.txt", encoding="UTF-8", skipNul = TRUE)

Create a basic report of summary statistics about the data sets.

## Warning: package 'RWeka' was built under R version 3.1.3

In-memory size, total number of lines, and word count for each file:

files_summary
##     files    memory   lines    words
## 1   blogs 260564320  899288 38308421
## 2 twitter 302322752 2360148 29354795
## 3    news 261759048 1010242 35624448
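
The code that builds files_summary is not shown in this report. A minimal sketch of how such a table could be produced is given here, assuming that the memory column comes from object.size() and that words are counted by splitting on whitespace. The helpers count_words and summarise_corpus and the toy data are hypothetical, not the original code.

```r
# Hypothetical helper: count whitespace-separated tokens in each line.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Hypothetical helper: build one summary row for a character vector.
summarise_corpus <- function(name, x) {
  data.frame(
    files  = name,
    memory = as.numeric(object.size(x)),  # size of the R object in bytes
    lines  = length(x),                   # one element per line
    words  = sum(count_words(x)),
    stringsAsFactors = FALSE
  )
}

# Toy stand-ins for the real blogs, twitter and news vectors.
toy <- list(
  blogs   = c("a short blog line", "another one"),
  twitter = c("hi there"),
  news    = c("breaking news today", "", "more news")
)

toy_summary <- do.call(rbind, Map(summarise_corpus, names(toy), toy))
rownames(toy_summary) <- NULL
print(toy_summary)
```

With the real data, the list would hold blogs, twitter and news instead of the toy vectors.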

Line length statistics (characters per line) for each file.

## [1] "Blogs"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40830
## [1] "Twitter"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00
## [1] "News"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11380.0

Word count statistics (words per line) for each file.

## [1] "Blogs"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     9.0    29.0    42.6    61.0  6851.0
## [1] "Twitter"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   13.14   19.00   47.00
## [1] "News"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   35.26   47.00 1928.00
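
The chunks that printed the line length and word count statistics are hidden. One way to get comparable output, shown on a toy vector, is to apply summary() to per-line measures. This sketch assumes that line length means nchar() and that words are whitespace-separated tokens; the original may have counted words differently.

```r
# Toy stand-in for one of the corpora (e.g. blogs).
toy <- c("a short line", "", "a somewhat longer line of text")

# Characters per line.
line_chars <- nchar(toy)

# Words per line: trim, split on whitespace, count the pieces.
line_words <- lengths(strsplit(trimws(toy), "\\s+"))

print("Toy")
print(summary(line_chars))
print(summary(line_words))
```

An empty line yields zero words under this definition, which would be consistent with the minimum of 0 reported for blogs.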

Clean the data by removing non-letter characters, converting all letters to lowercase, and collapsing extra whitespace:

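# Drop everything except ASCII letters and spaces
blogs <- gsub(pattern = "[^A-Za-z ]", replacement = "", x = blogs)
# Lowercase, then collapse runs of two or more spaces into one
blogs <- gsub(pattern = " {2,}", replacement = " ", x = tolower(blogs))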

The same cleaning is applied to the twitter and news files; that code is not shown.
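
Since the same steps run on all three files, they could be wrapped in a single function. The sketch below is one possible way to do that; clean_text is a hypothetical name and does not appear in the original code.

```r
# Hypothetical helper wrapping the cleaning steps described above.
clean_text <- function(x) {
  x <- gsub("[^A-Za-z ]", "", x)  # keep only ASCII letters and spaces
  x <- tolower(x)                 # convert to lowercase
  x <- gsub(" {2,}", " ", x)      # collapse runs of spaces into one
  x
}

# Toy input standing in for a few raw lines.
cleaned <- clean_text(c("Hello,  World!", "R  is   FUN 123"))
print(cleaned)
```

With the real data this would be called as twitter <- clean_text(twitter) and news <- clean_text(news). Note that these steps leave any leading or trailing space in place and also strip apostrophes and accented letters.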

Look at the distribution of word frequencies in blogs (quantiles of how often each distinct word occurs).

## 25% 50% 75% 95% 99% 
##   1   2   6 108 964

Interesting fact: 99% of the distinct words in blogs occur fewer than 1,000 times, and half of them occur at most twice.

Most common words in blogs
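
The code behind the frequency quantiles and the most common words is not shown. A minimal sketch of one way to compute both from cleaned text follows, assuming words are separated by single spaces after cleaning. The toy vector stands in for the cleaned blogs data.

```r
# Toy stand-in for the cleaned blogs vector.
toy <- c("the cat sat", "the dog sat", "the end")

# Split every line into words and drop empty tokens.
words <- unlist(strsplit(toy, " +"))
words <- words[words != ""]

# Frequency of each distinct word.
freq <- table(words)

# Quantiles of the per-word frequencies.
print(quantile(as.vector(freq), probs = c(0.25, 0.5, 0.75, 0.95, 0.99)))

# Most common words, highest count first.
top_words <- head(sort(freq, decreasing = TRUE), 3)
print(top_words)
```

On the full corpus the table can be large, so sampling lines before counting may be worth considering.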

Plans for creating a prediction algorithm and Shiny app.

  1. Read more about working with n-grams in R, Markov chains, and other methods useful for text prediction.
  2. Find and remove other unhelpful words from the model.
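
To make the first item of the plan concrete, here is a rough base-R sketch of a bigram (first-order Markov) next-word guess. It is only an illustration of the idea, not the planned implementation; line_bigrams, predict_next and the toy corpus are hypothetical.

```r
# Toy corpus standing in for the cleaned text.
corpus <- c("i love r", "i love data", "i like r")

# Hypothetical helper: all adjacent word pairs in one line.
line_bigrams <- function(line) {
  w <- strsplit(line, " +")[[1]]
  w <- w[w != ""]
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}

# Stack the bigrams from every line into one data frame.
bigrams <- do.call(rbind, lapply(corpus, line_bigrams))

# Hypothetical helper: the most frequent follower of `word`,
# or NA if the word was never seen as the first half of a bigram.
predict_next <- function(word) {
  followers <- bigrams$second[bigrams$first == word]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))
}

print(predict_next("i"))
print(predict_next("like"))
```

A real model would need longer n-grams, a fallback for unseen words, and a more compact storage format than a data frame of pairs.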