Introduction

This report was prepared for the Data Science Capstone Project. It describes the initial steps in the development of a predictive text model. The goal is to build an application that predicts the next word as the user types a word or phrase.

The analysis in this report is based on three data files. Text mining techniques are used to clean and analyze the text.

Data

The three files used for the development of the prediction model are:

  1. en_US.blogs.txt
  2. en_US.news.txt
  3. en_US.twitter.txt
# working directory
setwd("D:/Work/Capstone")

# load the three source files (opened in binary mode so readLines is not
# cut short by embedded control characters)
blg <- file("final/en_US/en_US.blogs.txt", open = "rb")
blog <- readLines(blg, encoding = "latin1", skipNul = TRUE)
close(blg)

nws <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(nws, encoding = "latin1", skipNul = TRUE)
close(nws)

twts <- file("final/en_US/en_US.twitter.txt", open = "rb")
twitter <- readLines(twts, encoding = "latin1", skipNul = TRUE)
close(twts)

Exploratory Analysis

library("stringi")

# file size (MB), number of lines and total word count for each file
blogs.stat <- c(file.info("final/en_US/en_US.blogs.txt")$size / 1024^2,
                length(blog), sum(stri_count_words(blog)))
news.stat <- c(file.info("final/en_US/en_US.news.txt")$size / 1024^2,
               length(news), sum(stri_count_words(news)))
twitter.stat <- c(file.info("final/en_US/en_US.twitter.txt")$size / 1024^2,
                  length(twitter), sum(stri_count_words(twitter)))

stat <- data.frame(blogs.stat, twitter.stat, news.stat)
rownames(stat) <- c("File Size(MB)", "# of lines", "Total number of words")

options("scipen"=100, "digits"=4)
stat
##                       blogs.stat twitter.stat  news.stat
## File Size(MB)              200.4        159.4      196.3
## # of lines              899288.0    2360148.0  1010242.0
## Total number of words 38153767.0   30195719.0 35016742.0

As the results above show, the data files are very large. A sample is therefore drawn from each file and saved in a new directory.

# Create a random sample of 50,000 lines from each file
set.seed(1022)

sample_blog <- blog[sample(1:length(blog), 50000)]
sample_news <- news[sample(1:length(news), 50000)]
sample_twitter <- twitter[sample(1:length(twitter), 50000)]

# save the samples in a new directory
dir.create("sample")
setwd("D:/Work/Capstone/sample")

file1 <- file("sample_blog.txt")
writeLines(sample_blog, file1)
close(file1)

file2 <- file("sample_news.txt")
writeLines(sample_news, file2)
close(file2)

file3 <- file("sample_twitter.txt")
writeLines(sample_twitter, file3)
close(file3)

# free the memory used by the full data sets
rm(blog, news, twitter)

Transformations

Once the sample is selected, the text files are modified to prepare the words as tokens. The function ‘transformations’ shown below applies the following steps to every document in the corpus via tm_map():

  1. Import directory as corpus
  2. Convert to lower case
  3. Remove punctuation
  4. Remove all numbers
  5. Remove stop words
  6. Convert words to word stem
  7. Remove special characters
  8. Remove extra whitespace
  9. Convert to plain text document
transformations <- function(text) {
  library(NLP)
  library(tm)
  # 1. import directory as corpus
  corpus <- Corpus(DirSource(text), readerControl = list(language = "english"))
  # re-encode to UTF-8, marking any invalid bytes
  corpus <- tm_map(corpus, function(x) iconv(enc2utf8(x$content), sub = "byte"))
  # 2.-6. lower case, punctuation, numbers, stop words, stemming
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"), lazy = TRUE)
  corpus <- tm_map(corpus, stemDocument, language = "english", lazy = TRUE)
  # 7. drop remaining non-ASCII (special) characters
  corpus <- tm_map(corpus, function(x) iconv(x, "latin1", "ASCII", sub = ""))
  # 8. collapse extra whitespace
  corpus <- tm_map(corpus, stripWhitespace, lazy = TRUE)
  # 9. convert back to plain text documents
  corpus <- tm_map(corpus, PlainTextDocument)
  corpus
}

sample_dir <- "D:/Work/Capstone/sample"
corpus1 <- transformations(sample_dir)

Word Frequencies

A term-document matrix is created from the cleaned corpus. Each element of this matrix is the frequency with which a term occurs in a document: the rows correspond to terms and the columns correspond to the files/documents.

tdm    <- TermDocumentMatrix(corpus1)
findFreqTerms(tdm, 2000, Inf)
##   [1] "also"      "always"    "another"   "around"    "away"     
##   [6] "back"      "best"      "better"    "big"       "book"     
##  [11] "can"       "cant"      "city"      "come"      "day"      
##  [16] "days"      "didnt"     "dont"      "end"       "even"     
##  [21] "every"     "family"    "feel"      "find"      "first"    
##  [26] "found"     "game"      "get"       "getting"   "give"     
##  [31] "going"     "good"      "got"       "great"     "help"     
##  [36] "home"      "house"     "its"       "ive"       "just"     
##  [41] "keep"      "know"      "last"      "life"      "like"     
##  [46] "little"    "long"      "look"      "lot"       "love"     
##  [51] "made"      "make"      "man"       "many"      "may"      
##  [56] "much"      "need"      "never"     "new"       "next"     
##  [61] "night"     "now"       "old"       "one"       "part"     
##  [66] "people"    "place"     "play"      "put"       "really"   
##  [71] "right"     "said"      "say"       "says"      "school"   
##  [76] "see"       "show"      "since"     "something" "state"    
##  [81] "still"     "sure"      "take"      "team"      "thanks"   
##  [86] "thats"     "thing"     "things"    "think"     "though"   
##  [91] "thought"   "three"     "time"      "today"     "two"      
##  [96] "use"       "used"      "want"      "way"       "week"     
## [101] "well"      "went"      "will"      "work"      "world"    
## [106] "year"      "years"     "youre"

The words listed above occur at least 2,000 times across the three sample documents.

library("wordcloud")
ap.m <- as.matrix(tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(ap.d$word,ap.d$freq, scale=c(4,.2),min.freq=500,
          max.words=200, random.order=FALSE, rot.per=.15, colors=pal2)

As the word cloud above shows, the words “said”, “will”, “just”, “like” and “can” are among the most frequently occurring words. This is also supported by the bar chart below.
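The bar chart can be drawn from the same frequency table used for the word cloud; a minimal sketch using base graphics is shown below (displaying the top 20 terms is an arbitrary choice).

# bar chart of the most frequent terms, using the ap.d frequency table above;
# showing 20 terms is arbitrary
top <- head(ap.d, 20)
barplot(top$freq, names.arg = top$word, las = 2, col = "steelblue",
        main = "Most frequent terms in the sample corpus", ylab = "Frequency")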

Next, the n-gram frequencies are analyzed; an n-gram is a contiguous sequence of n words.
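One common way to tabulate n-grams within the tm workflow used above is an n-gram tokenizer from the RWeka package. The sketch below assumes RWeka (and a Java runtime) is installed and uses an arbitrary frequency threshold of 200 to list the most common bigrams.

library(RWeka)

# tokenizer that produces bigrams (pairs of adjacent words)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# bigram term-document matrix and its most frequent terms
tdm2 <- TermDocumentMatrix(corpus1, control = list(tokenize = BigramTokenizer))
findFreqTerms(tdm2, 200, Inf)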

Next Steps

  1. Build a predictive model based on the data modeling steps
  2. Evaluate the model for efficiency and accuracy
  3. Use smoothing and back-off methods to account for words and n-grams not observed in the corpus (see the sketch after this list)
  4. Develop a text prediction product
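As a purely illustrative sketch of the back-off idea in step 3, the function below returns the most frequent continuation of the last two words, backing off to shorter contexts when no match is found. The count tables tri_freq, bi_freq and uni_freq are hypothetical named numeric vectors of n-gram counts; they are not built in this report.

# Illustrative back-off lookup. tri_freq, bi_freq and uni_freq are
# hypothetical named numeric vectors of trigram, bigram and unigram counts.
predict_next <- function(w1, w2, tri_freq, bi_freq, uni_freq) {
  # try trigrams starting with "w1 w2"
  prefix <- paste(w1, w2)
  cand <- tri_freq[startsWith(names(tri_freq), paste0(prefix, " "))]
  if (length(cand) > 0)
    return(substring(names(which.max(cand)), nchar(prefix) + 2))
  # back off to bigrams starting with "w2"
  cand <- bi_freq[startsWith(names(bi_freq), paste0(w2, " "))]
  if (length(cand) > 0)
    return(substring(names(which.max(cand)), nchar(w2) + 2))
  # final fallback: the single most frequent word
  names(which.max(uni_freq))
}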