The goal of this capstone project is to build a model that predicts the most probable next word. This milestone report covers importing the data, sampling it, and understanding it through exploratory analysis. Three files (en_US.blogs.txt, en_US.twitter.txt, and en_US.news.txt) were used to build the dataset.
The table below summarizes the three text files.
##                 en_US.blog en_US.twitter en_US.news
## File Size [MB]         210           167        206
## Number of lines     899288       2360148      77259
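The code chunk behind this table is not shown in the report. The following is a plausible sketch of it; it also defines the line-count variables (blogline, twitterline, newsline) that the sampling code below relies on.

# Hypothetical sketch of the unshown chunk behind the table above: file sizes
# in MB and line counts for the three raw files.
lineCount <- function(path) length(readLines(path))
blogline    <- lineCount("en_US.blogs.txt")
twitterline <- lineCount("en_US.twitter.txt")
newsline    <- lineCount("en_US.news.txt")
round(file.size(c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")) / 1024^2)
c(blogline, twitterline, newsline)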
Since the files are quite large, we sample 10,000 lines from each file and save them as separate files to reduce the burden on memory.
set.seed(1234)

# Sample 10,000 lines from the blogs file and write them to blogs.data.txt
con <- file("en_US.blogs.txt", "r")
blogsample <- readLines(con)[sample(1:blogline, 10000)]
close(con)
con <- file("blogs.data.txt", open = "wt")
writeLines(blogsample, con)
close(con)

# Sample 10,000 lines from the twitter file and write them to twitter.data.txt
con <- file("en_US.twitter.txt", "r")
twittersample <- readLines(con)[sample(1:twitterline, 10000)]
close(con)
con <- file("twitter.data.txt", open = "wt")
writeLines(twittersample, con)
close(con)

# Sample 10,000 lines from the news file and write them to news.data.txt
con <- file("en_US.news.txt", "r")
newssample <- readLines(con)[sample(1:newsline, 10000)]
close(con)
con <- file("news.data.txt", open = "wt")
writeLines(newssample, con)
close(con)
We will split each sampled file into training and test datasets. A custom write function splits the lines in an 80/20 ratio using a biased rbinom draw; since its definition is not included in this report, a sketch of what it might look like is shown below.
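A minimal sketch, assuming the helper reads one line at a time from the input connection and routes it to the training file with probability 0.8, otherwise to the test file:

# Hypothetical sketch of the line-splitting helper used below (the original
# definition is not shown in this report).
# Note: this masks base::write() for the duration of the session.
write <- function(con, trainCon, testCon) {
  line <- readLines(con, n = 1)
  if (length(line) == 0) return(invisible(NULL))   # nothing left to read
  # biased coin flip: roughly 80% of lines go to the training file
  if (rbinom(1, size = 1, prob = 0.8) == 1) {
    writeLines(line, trainCon)
  } else {
    writeLines(line, testCon)
  }
}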
# Split blogs.data.txt into blogs.train.txt and blogs.test.txt
con <- file("blogs.data.txt", "r")
blogTrain <- file("blogs.train.txt", open = "wt")
blogTest <- file("blogs.test.txt", open = "wt")
splitblog <- sapply(1:10000, function(x) write(con, blogTrain, blogTest))
close(con)
close(blogTrain)
close(blogTest)

# Split twitter.data.txt into twitter.train.txt and twitter.test.txt
con <- file("twitter.data.txt", "r")
twitterTrain <- file("twitter.train.txt", open = "wt")
twitterTest <- file("twitter.test.txt", open = "wt")
splittwitter <- sapply(1:10000, function(x) write(con, twitterTrain, twitterTest))
close(con)
close(twitterTrain)
close(twitterTest)

# Split news.data.txt into news.train.txt and news.test.txt
con <- file("news.data.txt", "r")
newsTrain <- file("news.train.txt", open = "wt")
newsTest <- file("news.test.txt", open = "wt")
splitnews <- sapply(1:10000, function(x) write(con, newsTrain, newsTest))
close(con)
close(newsTrain)
close(newsTest)
Now we clean and tokenize the sampled training data. Two helper functions, clean_line and clean_file, clean and tokenize the input files; their definitions are not included in this report, so a sketch of what they might look like is given below.
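A minimal sketch, assuming clean_line lowercases a line and keeps only letters, apostrophes, and spaces, and clean_file applies it line by line and writes the non-empty results to an output connection:

# Hypothetical sketches of the cleaning helpers (the original definitions
# are not shown in this report).
clean_line <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]", " ", line)   # keep only letters, apostrophes, spaces
  line <- gsub("\\s+", " ", line)       # collapse repeated whitespace
  trimws(line)
}

clean_file <- function(text, outCon) {
  cleaned <- vapply(text, clean_line, character(1), USE.NAMES = FALSE)
  writeLines(cleaned[nchar(cleaned) > 0], outCon)
}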
# Tokenize blog train data
con <- file("blogs.train.txt", "r")
text <- readLines(con)
close(con)
blogtokenized <- file("blogtokenized.txt", open = "wt")
clean_file(text, blogtokenized)
close(blogtokenized)
# Tokenize twitter train data
con <- file("twitter.train.txt", "r")
text <- readLines(con)
close(con)
twittertokenized <- file("twittertokenized.txt", open = "wt")
clean_file(text, twittertokenized)
close(twittertokenized)
# Tokenize news train data
con <- file("news.train.txt", "r")
text <- readLines(con)
close(con)
newstokenized <- file("newstokenized.txt", open = "wt")
clean_file(text, newstokenized)
close(newstokenized)
Below is a summary of statistics for each training dataset. Judging by file size relative to the number of lines, lines tend to contain the most words in the blogs data, followed by news and then twitter.
##                 en_US.blog en_US.twitter en_US.news
## File Size [MB]        1.89          0.56       1.62
## Number of lines    8014.00       7989.00    7967.00
## Word Count            0.00          0.00       0.00
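The code that produced this summary is likewise not shown. A minimal sketch, assuming the tokenized training files created above, of how file size, line count, and word counts per line could be computed:

# Hypothetical sketch (not the report's actual summary code): file size in MB,
# number of lines, total word count, and mean words per line for one file.
summarise_file <- function(path) {
  lines <- readLines(path)
  words <- lengths(strsplit(lines, "\\s+"))   # word count per line
  c(`File Size [MB]`  = round(file.size(path) / 1024^2, 2),
    `Number of lines` = length(lines),
    `Word Count`      = sum(words),
    `Words per line`  = round(mean(words), 1))
}

sapply(c("blogtokenized.txt", "twittertokenized.txt", "newstokenized.txt"),
       summarise_file)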
Below is an exploratory analysis of word frequency and distribution for each file.
##                          en_US.blog en_US.twitter en_US.news
## Unique single word count      26580         13268      26340
## Unique 2-gram word count     178081         64634     165672
## Unique 3-gram word count     298941         93754     252999
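The n-gram counting code is not included above either. A minimal sketch in base R, assuming the tokenized training files, of how the unique 1-, 2-, and 3-gram counts could be obtained:

# Hypothetical sketch (not the report's actual code): count unique n-grams
# in a tokenized file using base R only.
count_unique_ngrams <- function(path, n) {
  tokens <- strsplit(readLines(path), "\\s+")
  tokens <- lapply(tokens, function(w) w[nchar(w) > 0])   # drop empty tokens
  ngrams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    # paste each run of n consecutive words into a single n-gram string
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  length(unique(ngrams))
}

count_unique_ngrams("blogtokenized.txt", 1)   # unique single words
count_unique_ngrams("blogtokenized.txt", 2)   # unique 2-grams
count_unique_ngrams("blogtokenized.txt", 3)   # unique 3-grams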