The goal of this capstone project is to build a model that predicts the most probable next word. This milestone report covers importing the data, sampling it, and understanding it through exploratory analysis. Three files (en_US.blogs.txt, en_US.twitter.txt, and en_US.news.txt) were used to build the dataset.
The table below summarizes the three text files.
##                 en_US.blog en_US.twitter en_US.news
## File Size [MB]         210           167        206
## Number of lines     899288       2360148      77259
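The code chunk behind this table is not shown in the report. The following is a plausible sketch of it; it also defines the line-count variables (blogline, twitterline, newsline) that the sampling code below relies on.

# Hypothetical sketch of the unshown chunk behind the table above: file sizes
# in MB and line counts for the three raw files.
lineCount <- function(path) length(readLines(path))
blogline    <- lineCount("en_US.blogs.txt")
twitterline <- lineCount("en_US.twitter.txt")
newsline    <- lineCount("en_US.news.txt")
round(file.size(c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")) / 1024^2)
c(blogline, twitterline, newsline)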
Since the files are quite large, we sample 10,000 lines from each file and save them as separate files to reduce the burden on memory.
set.seed(1234)

# Sample 10,000 lines from the blogs file and write them to blogs.data.txt
con <- file("en_US.blogs.txt", "r")
blogsample <- readLines(con)[sample(1:blogline, 10000)]
close(con)
con <- file("blogs.data.txt", open = "wt")
writeLines(blogsample, con)
close(con)

# Sample 10,000 lines from the twitter file and write them to twitter.data.txt
con <- file("en_US.twitter.txt", "r")
twittersample <- readLines(con)[sample(1:twitterline, 10000)]
close(con)
con <- file("twitter.data.txt", open = "wt")
writeLines(twittersample, con)
close(con)

# Sample 10,000 lines from the news file and write them to news.data.txt
con <- file("en_US.news.txt", "r")
newssample <- readLines(con)[sample(1:newsline, 10000)]
close(con)
con <- file("news.data.txt", open = "wt")
writeLines(newssample, con)
close(con)
We will split each sampled file into training and test datasets. A custom write function splits the lines in an 80/20 ratio using a biased rbinom draw; since its definition is not included in this report, a sketch of what it might look like is shown below.
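A minimal sketch, assuming the helper reads one line at a time from the input connection and routes it to the training file with probability 0.8, otherwise to the test file:

# Hypothetical sketch of the line-splitting helper used below (the original
# definition is not shown in this report).
# Note: this masks base::write() for the duration of the session.
write <- function(con, trainCon, testCon) {
  line <- readLines(con, n = 1)
  if (length(line) == 0) return(invisible(NULL))   # nothing left to read
  # biased coin flip: roughly 80% of lines go to the training file
  if (rbinom(1, size = 1, prob = 0.8) == 1) {
    writeLines(line, trainCon)
  } else {
    writeLines(line, testCon)
  }
}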
# Split blogs.data.txt into blogs.train.txt and blogs.test.txt
con <- file("blogs.data.txt", "r")
blogTrain <- file("blogs.train.txt", open = "wt")
blogTest <- file("blogs.test.txt", open = "wt")
splitblog <- sapply(1:10000, function(x) write(con, blogTrain, blogTest))
close(con)
close(blogTrain)
close(blogTest)

# Split twitter.data.txt into twitter.train.txt and twitter.test.txt
con <- file("twitter.data.txt", "r")
twitterTrain <- file("twitter.train.txt", open = "wt")
twitterTest <- file("twitter.test.txt", open = "wt")
splittwitter <- sapply(1:10000, function(x) write(con, twitterTrain, twitterTest))
close(con)
close(twitterTrain)
close(twitterTest)

# Split news.data.txt into news.train.txt and news.test.txt
con <- file("news.data.txt", "r")
newsTrain <- file("news.train.txt", open = "wt")
newsTest <- file("news.test.txt", open = "wt")
splitnews <- sapply(1:10000, function(x) write(con, newsTrain, newsTest))
close(con)
close(newsTrain)
close(newsTest)
Now we clean and tokenize the sampled training data. Two helper functions, clean_line and clean_file, clean and tokenize the input files; their definitions are not included in this report, so a sketch of what they might look like is given below.
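A minimal sketch, assuming clean_line lowercases a line and keeps only letters, apostrophes, and spaces, and clean_file applies it line by line and writes the non-empty results to an output connection:

# Hypothetical sketches of the cleaning helpers (the original definitions
# are not shown in this report).
clean_line <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]", " ", line)   # keep only letters, apostrophes, spaces
  line <- gsub("\\s+", " ", line)       # collapse repeated whitespace
  trimws(line)
}

clean_file <- function(text, outCon) {
  cleaned <- vapply(text, clean_line, character(1), USE.NAMES = FALSE)
  writeLines(cleaned[nchar(cleaned) > 0], outCon)
}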
# Tokenize blog train data
con <- file("blogs.train.txt", "r")
text <- readLines(con)
close(con)
blogtokenized <- file("blogtokenized.txt", open = "wt")
clean_file(text, blogtokenized)
close(blogtokenized)
# Tokenize twitter train data
con <- file("twitter.train.txt", "r")
text <- readLines(con)
close(con)
twittertokenized <- file("twittertokenized.txt", open = "wt")
clean_file(text, twittertokenized)
close(twittertokenized)
# Tokenize news train data
con <- file("news.train.txt", "r")
text <- readLines(con)
close(con)
newstokenized <- file("newstokenized.txt", open = "wt")
clean_file(text, newstokenized)
close(newstokenized)
Below is a summary of statistics for each training dataset. Judging by file size relative to the number of lines, lines tend to contain the most words in the blogs data, followed by news and then twitter.
##                 en_US.blog en_US.twitter en_US.news
## File Size [MB]        1.89          0.56       1.62
## Number of lines    8014.00       7989.00    7967.00
## Word Count            0.00          0.00       0.00
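The code that produced this summary is likewise not shown. A minimal sketch, assuming the tokenized training files created above, of how file size, line count, and word counts per line could be computed:

# Hypothetical sketch (not the report's actual summary code): file size in MB,
# number of lines, total word count, and mean words per line for one file.
summarise_file <- function(path) {
  lines <- readLines(path)
  words <- lengths(strsplit(lines, "\\s+"))   # word count per line
  c(`File Size [MB]`  = round(file.size(path) / 1024^2, 2),
    `Number of lines` = length(lines),
    `Word Count`      = sum(words),
    `Words per line`  = round(mean(words), 1))
}

sapply(c("blogtokenized.txt", "twittertokenized.txt", "newstokenized.txt"),
       summarise_file)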
Below is an exploratory analysis of word frequency and distribution for each file.
##                          en_US.blog en_US.twitter en_US.news
## Unique single word count      26580         13268      26340
## Unique 2-gram word count     178081         64634     165672
## Unique 3-gram word count     298941         93754     252999
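The n-gram counting code is not included above either. A minimal sketch in base R, assuming the tokenized training files, of how the unique 1-, 2-, and 3-gram counts could be obtained:

# Hypothetical sketch (not the report's actual code): count unique n-grams
# in a tokenized file using base R only.
count_unique_ngrams <- function(path, n) {
  tokens <- strsplit(readLines(path), "\\s+")
  tokens <- lapply(tokens, function(w) w[nchar(w) > 0])   # drop empty tokens
  ngrams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    # paste each run of n consecutive words into a single n-gram string
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  length(unique(ngrams))
}

count_unique_ngrams("blogtokenized.txt", 1)   # unique single words
count_unique_ngrams("blogtokenized.txt", 2)   # unique 2-grams
count_unique_ngrams("blogtokenized.txt", 3)   # unique 3-grams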