This report is prepared for the Data Science Capstone Project. It describes the initial steps in the development of a predictive text model. The goal is to build an application that predicts the next word when a word or phrase is entered.
The analysis in this report is based on three data files. Text mining techniques are used to clean and analyze them.
The three files used for the development of the prediction model are en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. They are loaded as follows:
# set the working directory
setwd("D://Work//Capstone")

# read each file in binary mode, skipping embedded nulls
blg <- file("final//en_US//en_US.blogs.txt", open = "rb")
blog <- readLines(blg, encoding = "latin1", skipNul = TRUE)
close(blg)

nws <- file("final//en_US//en_US.news.txt", open = "rb")
news <- readLines(nws, encoding = "latin1", skipNul = TRUE)
close(nws)

twts <- file("final//en_US//en_US.twitter.txt", open = "rb")
twitter <- readLines(twts, encoding = "latin1", skipNul = TRUE)
close(twts)
library("stringi")
blogs.stat <- c(file.info("final//en_US//en_US.blogs.txt")$size/1024^2,length(blog), sum(stri_count_words(blog)))
news.stat <- c(file.info("final//en_US//en_US.news.txt")$size/1024^2,length(news), sum(stri_count_words(news)))
twitter.stat <- c(file.info("final//en_US//en_US.twitter.txt")$size/1024^2,length(twitter), sum(stri_count_words(twitter)))
stat <- data.frame(blogs.stat, twitter.stat, news.stat)
rownames(stat) <- c("File Size(MB)", "# of lines", "Total number of words")
options("scipen"=100, "digits"=4)
stat
## blogs.stat twitter.stat news.stat
## File Size(MB) 200.4 159.4 196.3
## # of lines 899288.0 2360148.0 1010242.0
## Total number of words 38153767.0 30195719.0 35016742.0
As the results above show, the data files are very large. A random sample is therefore drawn from each file and saved in a new directory.
# create a random sample of 50,000 lines from each file
set.seed(1022)
sample_blog    <- blog[sample(1:length(blog), 50000)]
sample_news    <- news[sample(1:length(news), 50000)]
sample_twitter <- twitter[sample(1:length(twitter), 50000)]

# write the samples to a new "sample" directory
dir.create("sample")
setwd("D://Work//Capstone//sample")

file1 <- file("sample_blog.txt")
writeLines(sample_blog, file1)
close(file1)

file2 <- file("sample_news.txt")
writeLines(sample_news, file2)
close(file2)

file3 <- file("sample_twitter.txt")
writeLines(sample_twitter, file3)
close(file3)

# free the memory used by the full data sets
remove(blog, news, twitter)
Once the samples are selected, the text files are cleaned to prepare the words for tokenization. As shown below, the function ‘transformations’ applies a series of cleaning steps with tm_map(), which runs each transformation over all documents in the corpus.
transformations <- function(text) {
  library(NLP)
  library(tm)
  corpus <- Corpus(DirSource(text), readerControl = list(language = "english"))
  # re-encode to UTF-8, marking non-convertible bytes with their hex codes
  corpus <- tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"), lazy = TRUE)
  corpus <- tm_map(corpus, stemDocument, language = "english", lazy = TRUE)
  # drop any remaining non-ASCII characters
  corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))
  corpus <- tm_map(corpus, stripWhitespace, lazy = TRUE)
  # content_transformer() keeps the documents as PlainTextDocuments, so no final conversion is needed
  corpus
}
sample_dir <- "D://Work//Capstone//sample"
corpus1 <- transformations(sample_dir)
A term-document matrix is created from the cleaned corpus. Each element of this matrix is the frequency with which a term occurs in a document; the rows correspond to terms and the columns correspond to the files/documents.
tdm <- TermDocumentMatrix(corpus1)
# terms that occur at least 2000 times across the sampled documents
findFreqTerms(tdm, 2000, Inf)
## [1] "also" "always" "another" "around" "away"
## [6] "back" "best" "better" "big" "book"
## [11] "can" "cant" "city" "come" "day"
## [16] "days" "didnt" "dont" "end" "even"
## [21] "every" "family" "feel" "find" "first"
## [26] "found" "game" "get" "getting" "give"
## [31] "going" "good" "got" "great" "help"
## [36] "home" "house" "its" "ive" "just"
## [41] "keep" "know" "last" "life" "like"
## [46] "little" "long" "look" "lot" "love"
## [51] "made" "make" "man" "many" "may"
## [56] "much" "need" "never" "new" "next"
## [61] "night" "now" "old" "one" "part"
## [66] "people" "place" "play" "put" "really"
## [71] "right" "said" "say" "says" "school"
## [76] "see" "show" "since" "something" "state"
## [81] "still" "sure" "take" "team" "thanks"
## [86] "thats" "thing" "things" "think" "though"
## [91] "thought" "three" "time" "today" "two"
## [96] "use" "used" "want" "way" "week"
## [101] "well" "went" "will" "work" "world"
## [106] "year" "years" "youre"
Listed above are the words that occur at least 2,000 times across the three sampled files/documents.
library("wordcloud")
ap.m <- as.matrix(tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(ap.d$word,ap.d$freq, scale=c(4,.2),min.freq=500,
max.words=200, random.order=FALSE, rot.per=.15, colors=pal2)
As shown above, the word cloud indicates that “said”, “will”, “just”, “like”, and “can” are among the most frequently occurring words. This is also supported by the bar chart below.
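The bar chart code is not reproduced in this section; a minimal sketch of how it could be generated from the frequency data frame ap.d built above, assuming ggplot2 is used for plotting and using an illustrative top-20 cutoff:

library(ggplot2)

# top 20 most frequent terms (ap.d is already sorted by decreasing frequency)
top_terms <- head(ap.d, 20)

ggplot(top_terms, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Term", y = "Frequency",
       title = "Most frequent terms in the sampled corpus")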
Next, the n-gram frequencies are analysed; an n-gram is a sequence of n consecutive words.
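A minimal sketch of how bigram and trigram frequencies could be computed from the cleaned corpus, assuming RWeka's NGramTokenizer is used together with TermDocumentMatrix (the frequency thresholds shown are illustrative, not final):

library(RWeka)
library(tm)

# tokenizers for 2-grams and 3-grams
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# term-document matrices of bigrams and trigrams built from the cleaned corpus
tdm_bigram  <- TermDocumentMatrix(corpus1, control = list(tokenize = bigram_tokenizer))
tdm_trigram <- TermDocumentMatrix(corpus1, control = list(tokenize = trigram_tokenizer))

# most frequent bigrams and trigrams (thresholds chosen for illustration)
findFreqTerms(tdm_bigram, 500)
findFreqTerms(tdm_trigram, 100)

These n-gram frequency tables will provide the basis for the next-word prediction model in the later stages of the project.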