Executive Summary

The goal of this capstone project is to build a model that predicts the most probable next word. This milestone report covers importing the data, sampling it, and understanding it through exploratory analysis. Three files (en_US.blogs.txt, en_US.twitter.txt, and en_US.news.txt) were used as the source data.

Data import and processing

File information

Below is a summary table of the three text files.

##                 en_US.blog en_US.twitter en_US.news
## File Size [Mb]         210           167        206
## Number of lines     899288       2360148      77259
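For reference, a table like this can be produced with base R. The sketch below is an assumption of how it might be done, not the code behind the report; it assumes the three raw files are in the working directory.

#Sketch of how the file information above can be gathered (file names assumed)
files <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")
info <- sapply(files, function(f) {
    c("File Size [Mb]"  = round(file.size(f) / 1024^2),   #size in megabytes
      "Number of lines" = length(readLines(f, warn = FALSE)))
})
info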

Data sampling

Since the files are quite large, we will sample 10,000 lines from each file and save each sample as a separate file to reduce memory usage.

set.seed(1234)

#Total line counts taken from the file information table above
blogline    <- 899288
twitterline <- 2360148
newsline    <- 77259

con <- file("en_US.blogs.txt", "r")
blogsample <- readLines(con)[sample(1:blogline, 10000)]
close(con)
con <- file("blogs.data.txt", open = "wt")
writeLines(blogsample, con)
close(con)

con <- file("en_US.twitter.txt", "r")
twittersample <- readLines(con)[sample(1:twitterline, 10000)]
close(con)
con <- file("twitter.data.txt", open = "wt")
writeLines(twittersample, con)
close(con)

con <- file("en_US.news.txt", "r")
newssample <- readLines(con)[sample(1:newsline, 10000)]
close(con)
con <- file("news.data.txt", open = "wt")
writeLines(newssample, con)
close(con)

Split data

We will split each sampled file into a training and a test dataset. A custom write() function assigns each line to the training or test file in an 80/20 ratio using a biased rbinom() draw; a sketch of this helper follows.
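The write() function called below is not shown in this report; the sketch below is one possible implementation under the stated 80/20 split, not the original code. Note that this name masks base R's write(); a different name would be safer, but the code below calls it write().

#Possible implementation of the write() helper (assumed, not the original code):
#reads one line from the input connection and routes it to the training file
#with probability 0.8, otherwise to the test file
write <- function(con, trainCon, testCon) {
    line <- readLines(con, n = 1)
    if (length(line) == 0) return(invisible(NULL))    #end of file reached
    if (rbinom(1, size = 1, prob = 0.8) == 1) {
        writeLines(line, trainCon)
    } else {
        writeLines(line, testCon)
    }
}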

#Split en_US.blogs.txt into blogs.train.txt and blogs.test.txt
con <- file("blogs.data.txt", "r") 
blogTrain <- file("blogs.train.txt", open = "wt")
blogTest <- file("blogs.test.txt", open = "wt")
splitblog <- sapply(1:10000, function(x) write(con, blogTrain, blogTest))
close(con)
close(blogTrain)
close(blogTest)

#Split en_US.twitter.txt into twitter.train.txt, twitter.test.txt
con <- file("twitter.data.txt", "r") 
twitterTrain <- file("twitter.train.txt", open = "wt")
twitterTest <- file("twitter.test.txt", open = "wt")
splittwitter <- sapply(1:10000, function(x) write(con, twitterTrain, twitterTest))
close(con)
close(twitterTrain)
close(twitterTest)

#Split en_US.news.txt into news.train.txt and news.test.txt
con <- file("news.data.txt", "r") 
newsTrain <- file("news.train.txt", open = "wt")
newsTest <- file("news.test.txt", open = "wt")
splitnews <- sapply(1:10000, function(x) write(con, newsTrain, newsTest))
close(con)
close(newsTrain)
close(newsTest)

Clean the data

Now we will clean and tokenize the sampled training data. Two helper functions, clean_line and clean_file, clean and tokenize the input files; a sketch of these helpers is given below.
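clean_line and clean_file are not shown in this report either; the sketch below is one plausible implementation, assuming lowercasing, removal of numbers and punctuation, and whitespace normalization, rather than the original code.

#Possible implementations of clean_line() and clean_file() (assumed, not the original code)
clean_line <- function(line) {
    line <- tolower(line)                   #lowercase everything
    line <- gsub("[^a-z' ]", " ", line)     #drop numbers, punctuation, and symbols
    line <- gsub("\\s+", " ", line)         #collapse repeated whitespace
    trimws(line)
}

clean_file <- function(text, outcon) {
    cleaned <- vapply(text, clean_line, character(1), USE.NAMES = FALSE)
    cleaned <- cleaned[nchar(cleaned) > 0]  #skip lines that became empty
    writeLines(cleaned, outcon)
}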

#Tokenize blog train data
con <- file("blogs.train.txt", "r")
text <- readLines(con)
close(con) 

blogtokenized <- file("blogtokenized.txt", open = "wt")
clean_file(text, blogtokenized)
close(blogtokenized)

#Tokenize twitter train data
con <- file("twitter.train.txt", "r")
text <- readLines(con)
close(con) 


twittertokenized <- file("twittertokenized.txt", open = "wt")
clean_file(text, twittertokenized)
close(twittertokenized)

#Tokenize news train data
con <- file("news.train.txt", "r")
text <- readLines(con)
close(con) 

newstokenized <- file("newstokenized.txt", open = "wt")
clean_file(text, newstokenized)
close(newstokenized)

Exploratory Data Analysis

Below is a summary of statistics for each training dataset. Based on the table, lines tend to contain the most words in the blog data, followed by news and then Twitter.

##                 en_US.blog en_US.twitter en_US.news
## File Size [Mb]        1.89          0.56       1.62
## Number of lines    8014.00       7989.00    7967.00
## Word Count            0.00          0.00       0.00
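For reference, a summary like this can be assembled with base R. The sketch below is an assumption of how it might be done (it counts words by splitting each line on whitespace), not the exact code behind the table; the training file names are those produced in the splitting step.

#Sketch of the summary statistics (file names assumed from the splitting step)
trainFiles <- c("blogs.train.txt", "twitter.train.txt", "news.train.txt")
stats <- sapply(trainFiles, function(f) {
    lines <- readLines(f, warn = FALSE)
    c("File Size [Mb]"  = round(file.size(f) / 1024^2, 2),
      "Number of lines" = length(lines),
      "Word Count"      = sum(lengths(strsplit(lines, "\\s+"))))
})
stats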

Word Frequency

Below is an exploratory analysis of word frequency and distribution for each file, showing the number of unique 1-grams, 2-grams, and 3-grams; a sketch of how such counts can be computed follows the tables.

##                          en_US.blog en_US.twitter en_US.news
## Unique single word count      26580         13268      26340

##                          en_US.blog en_US.twitter en_US.news
## Unique 2-gram word count     178081         64634     165672

##                          en_US.blog en_US.twitter en_US.news
## Unique 3-gram word count     298941         93754     252999
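For reference, unique n-gram counts like these can be computed with base R alone. The sketch below is an assumption, not the code used for this report; it reads one of the tokenized files produced earlier and slides a window of length n over the whole token stream, so its counts may differ slightly from a strictly line-by-line tokenization.

#Sketch of unique n-gram counting (tokenized file name assumed from the cleaning step)
count_unique_ngrams <- function(file, n) {
    tokens <- unlist(strsplit(readLines(file, warn = FALSE), "\\s+"))
    tokens <- tokens[nchar(tokens) > 0]
    if (n == 1) return(length(unique(tokens)))
    #slide a window of length n over the token stream to form n-grams
    idx <- 1:(length(tokens) - n + 1)
    ngrams <- sapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
    length(unique(ngrams))
}

count_unique_ngrams("blogtokenized.txt", 2)   #unique 2-grams in the blog sample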