Introduction

This report was prepared for the Data Science Capstone Project. It describes the initial steps in the development of a predictive text model. The goal is to build an application that predicts the next word as the user types a word or phrase.

The analysis in this report is based on three data files. Text mining techniques are used to clean and analyze the text.

Data

The three files used for the development of the prediction model are:

  1. en_US.blogs.txt
  2. en_US.news.txt
  3. en_US.twitter.txt
# working directory
setwd("D:/Work/Capstone")

# load the three source files (opened in binary mode so readLines is not
# cut short by embedded control characters)
blg <- file("final/en_US/en_US.blogs.txt", open = "rb")
blog <- readLines(blg, encoding = "latin1", skipNul = TRUE)
close(blg)

nws <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(nws, encoding = "latin1", skipNul = TRUE)
close(nws)

twts <- file("final/en_US/en_US.twitter.txt", open = "rb")
twitter <- readLines(twts, encoding = "latin1", skipNul = TRUE)
close(twts)

Exploratory Analysis

library("stringi")

# file size (MB), number of lines and total word count for each file
blogs.stat <- c(file.info("final/en_US/en_US.blogs.txt")$size / 1024^2,
                length(blog), sum(stri_count_words(blog)))
news.stat <- c(file.info("final/en_US/en_US.news.txt")$size / 1024^2,
               length(news), sum(stri_count_words(news)))
twitter.stat <- c(file.info("final/en_US/en_US.twitter.txt")$size / 1024^2,
                  length(twitter), sum(stri_count_words(twitter)))

stat <- data.frame(blogs.stat, twitter.stat, news.stat)
rownames(stat) <- c("File Size(MB)", "# of lines", "Total number of words")

options("scipen"=100, "digits"=4)
stat
##                       blogs.stat twitter.stat  news.stat
## File Size(MB)              200.4        159.4      196.3
## # of lines              899288.0    2360148.0  1010242.0
## Total number of words 38153767.0   30195719.0 35016742.0

As the results above show, the data files are very large. A sample is therefore drawn from each file and saved in a new directory.

# Create a random sample of 50,000 lines from each file
set.seed(1022)

sample_blog <- blog[sample(1:length(blog), 50000)]
sample_news <- news[sample(1:length(news), 50000)]
sample_twitter <- twitter[sample(1:length(twitter), 50000)]

# save the samples in a new directory
dir.create("sample")
setwd("D:/Work/Capstone/sample")

file1 <- file("sample_blog.txt")
writeLines(sample_blog, file1)
close(file1)

file2 <- file("sample_news.txt")
writeLines(sample_news, file2)
close(file2)

file3 <- file("sample_twitter.txt")
writeLines(sample_twitter, file3)
close(file3)

# free the memory used by the full data sets
rm(blog, news, twitter)

Transformations

Once the sample is selected, the text files are modified to prepare the words as tokens. The function ‘transformations’ shown below applies the following steps to every document in the corpus via tm_map():

  1. Import directory as corpus
  2. Convert to lower case
  3. Remove punctuation
  4. Remove all numbers
  5. Remove stop words
  6. Convert words to word stem
  7. Remove special characters
  8. Remove extra whitespace
  9. Convert to plain text document
transformations <- function(text) {
  library(NLP)
  library(tm)
  # 1. import directory as corpus
  corpus <- Corpus(DirSource(text), readerControl = list(language = "english"))
  # re-encode to UTF-8, marking any invalid bytes
  corpus <- tm_map(corpus, function(x) iconv(enc2utf8(x$content), sub = "byte"))
  # 2.-6. lower case, punctuation, numbers, stop words, stemming
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"), lazy = TRUE)
  corpus <- tm_map(corpus, stemDocument, language = "english", lazy = TRUE)
  # 7. drop remaining non-ASCII (special) characters
  corpus <- tm_map(corpus, function(x) iconv(x, "latin1", "ASCII", sub = ""))
  # 8. collapse extra whitespace
  corpus <- tm_map(corpus, stripWhitespace, lazy = TRUE)
  # 9. convert back to plain text documents
  corpus <- tm_map(corpus, PlainTextDocument)
  corpus
}

sample_dir <- "D:/Work/Capstone/sample"
corpus1 <- transformations(sample_dir)

Word Frequencies

A term-document matrix is created from the cleaned corpus. Each element of this matrix is the frequency with which a term occurs in a document: the rows correspond to terms and the columns correspond to the files/documents.

tdm    <- TermDocumentMatrix(corpus1)
findFreqTerms(tdm, 2000, Inf)
##   [1] "also"      "always"    "another"   "around"    "away"     
##   [6] "back"      "best"      "better"    "big"       "book"     
##  [11] "can"       "cant"      "city"      "come"      "day"      
##  [16] "days"      "didnt"     "dont"      "end"       "even"     
##  [21] "every"     "family"    "feel"      "find"      "first"    
##  [26] "found"     "game"      "get"       "getting"   "give"     
##  [31] "going"     "good"      "got"       "great"     "help"     
##  [36] "home"      "house"     "its"       "ive"       "just"     
##  [41] "keep"      "know"      "last"      "life"      "like"     
##  [46] "little"    "long"      "look"      "lot"       "love"     
##  [51] "made"      "make"      "man"       "many"      "may"      
##  [56] "much"      "need"      "never"     "new"       "next"     
##  [61] "night"     "now"       "old"       "one"       "part"     
##  [66] "people"    "place"     "play"      "put"       "really"   
##  [71] "right"     "said"      "say"       "says"      "school"   
##  [76] "see"       "show"      "since"     "something" "state"    
##  [81] "still"     "sure"      "take"      "team"      "thanks"   
##  [86] "thats"     "thing"     "things"    "think"     "though"   
##  [91] "thought"   "three"     "time"      "today"     "two"      
##  [96] "use"       "used"      "want"      "way"       "week"     
## [101] "well"      "went"      "will"      "work"      "world"    
## [106] "year"      "years"     "youre"

The words listed above occur at least 2,000 times across the three sample documents.

library("wordcloud")
ap.m <- as.matrix(tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(ap.d$word,ap.d$freq, scale=c(4,.2),min.freq=500,
          max.words=200, random.order=FALSE, rot.per=.15, colors=pal2)

As the word cloud above shows, the words “said”, “will”, “just”, “like” and “can” are among the most frequently occurring words. This is also supported by the bar chart below.
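The bar chart can be drawn from the same frequency table used for the word cloud; a minimal sketch using base graphics is shown below (displaying the top 20 terms is an arbitrary choice).

# bar chart of the most frequent terms, using the ap.d frequency table above;
# showing 20 terms is arbitrary
top <- head(ap.d, 20)
barplot(top$freq, names.arg = top$word, las = 2, col = "steelblue",
        main = "Most frequent terms in the sample corpus", ylab = "Frequency")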

Next, the n-gram frequencies are analyzed; an n-gram is a contiguous sequence of n words.
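One common way to tabulate n-grams within the tm workflow used above is an n-gram tokenizer from the RWeka package. The sketch below assumes RWeka (and a Java runtime) is installed and uses an arbitrary frequency threshold of 200 to list the most common bigrams.

library(RWeka)

# tokenizer that produces bigrams (pairs of adjacent words)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# bigram term-document matrix and its most frequent terms
tdm2 <- TermDocumentMatrix(corpus1, control = list(tokenize = BigramTokenizer))
findFreqTerms(tdm2, 200, Inf)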

Next Steps

  1. Build a predictive model based on the data modeling steps
  2. Evaluate the model for efficiency and accuracy
  3. Use smoothing and back-off methods to account for words and n-grams not observed in the corpus (see the sketch after this list)
  4. Develop a text prediction product
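As a purely illustrative sketch of the back-off idea in step 3, the function below returns the most frequent continuation of the last two words, backing off to shorter contexts when no match is found. The count tables tri_freq, bi_freq and uni_freq are hypothetical named numeric vectors of n-gram counts; they are not built in this report.

# Illustrative back-off lookup. tri_freq, bi_freq and uni_freq are
# hypothetical named numeric vectors of trigram, bigram and unigram counts.
predict_next <- function(w1, w2, tri_freq, bi_freq, uni_freq) {
  # try trigrams starting with "w1 w2"
  prefix <- paste(w1, w2)
  cand <- tri_freq[startsWith(names(tri_freq), paste0(prefix, " "))]
  if (length(cand) > 0)
    return(substring(names(which.max(cand)), nchar(prefix) + 2))
  # back off to bigrams starting with "w2"
  cand <- bi_freq[startsWith(names(bi_freq), paste0(w2, " "))]
  if (length(cand) > 0)
    return(substring(names(which.max(cand)), nchar(w2) + 2))
  # final fallback: the single most frequent word
  names(which.max(uni_freq))
}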