The Data Science Capstone involves predictive text analytics. The overall objective is to help users complete sentences by analyzing the words they have already entered and predicting the next word. For example, if the first few words of a text are “I want a case of …”, then the model may predict “beer” based on the estimated probabilities.
The purpose of this Milestone Report is to demonstrate progress towards the end goal of this project. The specific sections are as follows:
Prepare the session by loading initial packages and clearing the global workspace.
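The setup chunk is not echoed in this report; a minimal sketch of it, assuming the tm and RWeka packages used in the rest of the analysis, would be:
library(tm)     # corpus construction and text cleaning
library(RWeka)  # NGramTokenizer for n-gram generation
rm(list = ls()) # clear the global workspace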
# Open connections to the three raw data files and read them line by line
con  <- file("./en_US.twitter.txt", open = "r")
con2 <- file("./en_US.news.txt",    open = "r")
con3 <- file("./en_US.blogs.txt",   open = "r")
twitter <- readLines(con,  encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(con2, encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines(con3, encoding = "UTF-8", skipNul = TRUE)
close(con); close(con2); close(con3)  # release the file connections
# Basic summary statistics: lines, words and average words per line per source
blog_lines <- length(blogs)
blog_words <- sum(sapply(strsplit(blogs, "\\s+"), length))
blog_wpl   <- round(blog_words / blog_lines, 2)
news_lines <- length(news)
news_words <- sum(sapply(strsplit(news, "\\s+"), length))
news_wpl   <- round(news_words / news_lines, 2)
twit_lines <- length(twitter)
twit_words <- sum(sapply(strsplit(twitter, "\\s+"), length))
twit_wpl   <- round(twit_words / twit_lines, 2)
# Combine the statistics into a summary table
twit <- round(rbind(twit_lines, twit_words, twit_wpl))
new  <- round(rbind(news_lines, news_words, news_wpl))
blo  <- round(rbind(blog_lines, blog_words, blog_wpl))
dt <- data.frame(twit, new, blo,
                 row.names = c("lines", "words", "words per line"))
dt
## twit new blo
## lines 2360148 77259 899288
## words 30373583 2643969 37334131
## words per line 13 34 42
The words-per-line statistic is interesting: blogs are the highest at 41.52, news is in the middle at 34.22, and Twitter is the lowest at 12.87 (the table above shows these values rounded to whole numbers). This makes intuitive sense, since tweets are limited to 140 characters and are naturally more concise, while blogs, as the most free-form style of communication, tend to be the most verbose.
Next, I sample 1% of each file so that the rest of the analysis runs smoothly within the available memory.
# Randomly sample 1% of each source and write the combined sample to disk
data_sample <- c(sample(blogs,   length(blogs)   * 0.01),
                 sample(news,    length(news)    * 0.01),
                 sample(twitter, length(twitter) * 0.01))
if (!dir.exists("./data_sample")) dir.create("./data_sample")
write(data_sample, file = "./data_sample/sample_data.txt")
# Clean up unused objects in memory.
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5693142 304.1 11689148 624.3 5720416 305.6
## Vcells 65194521 497.4 123359177 941.2 102732644 783.8
rm(list = ls())  # clear the workspace before building the corpus
setwd("~/Imparare_R/Capstone/Capstone_proj")
# Build a tm corpus from the sampled file
dir <- DirSource("./data_sample")
corpus <- Corpus(dir,
                 readerControl = list(reader = readPlain,
                                      language = "en_US"))
Now we have the corpus for our analysis: a single file ready for further cleaning and analysis. This section described the process used to create the sample file (training dataset): 1% of the lines were randomly sampled from each of the three raw data files (blogs, news, Twitter).
The cleaning procedure, performed with the help of the tm package, is the following:
corpus <- tm_map(corpus, FUN = stripWhitespace)    # removes extra whitespace
corpus <- tm_map(corpus, FUN = removeNumbers)      # drops digits
corpus <- tm_map(corpus, FUN = removePunctuation)  # drops punctuation
corpus <- tm_map(corpus, FUN = stemDocument)       # stems words (e.g. "happy" -> "happi")
corpus <- tm_map(corpus, FUN = content_transformer(tolower))  # lower-cases the text
corpus <- tm_map(corpus, FUN = removeWords, stopwords("english"))  # removes English stop words
# (stop words are removed after punctuation stripping, so contractions survive as "im", "dont", etc.)
saveRDS(corpus, file = "./sam.rds")  # save the cleaned corpus for later use
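As an optional sanity check (not part of the original report), the first few lines of the cleaned document can be inspected:
# Look at the beginning of the first cleaned document
writeLines(head(as.character(corpus[[1]]), 5))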
Exploratory data analysis will be performed to fulfill the primary goal of this report. Several techniques will be employed to develop an understanding of the training data, including looking at the most frequently used words, tokenizing, and n-gram generation. N-grams are a useful tool to identify the frequency of certain words and word patterns:
1-gram (unigram): indicates the frequency of single words
2-gram (bigram): indicates the frequency of two-word patterns
3-gram (trigram): indicates the frequency of three-word patterns
A bar chart and frequency tables will be constructed to illustrate the most common unigrams, bigrams and trigrams.
# Tokenize the cleaned corpus into n-grams with RWeka's NGramTokenizer
unigram <- NGramTokenizer(corpus, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
bigrams  <- bigram(corpus)
trigrams <- trigram(corpus)
gc() #clean up some memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3154210 168.5 9351319 499.5 7437155 397.2
## Vcells 11658100 89.0 98687342 753.0 102732644 783.8
# Tabulate unigram frequencies
dfcs <- data.frame(factor(unigram))
fs <- data.frame(dd = factor(dfcs$factor.unigram.))
fc <- table(fs$dd)  # frequency of each unique unigram
# Bar chart of the 15 most frequent unigrams
plot(sort(fc, decreasing = TRUE)[1:15], ylab = "Frequencies", col = "darkred",
     main = "15 Most Common Unigrams")
sort(fc, decreasing = TRUE)[1:15]
##
## just get like one go im love time can day make know good
## 2563 2434 2417 2244 2189 1948 1946 1935 1911 1813 1611 1560 1536
## thank now
## 1436 1405
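The bigram and trigram frequency tables below were produced in the same way as the unigram table; since that chunk is not echoed above, the following is a minimal reconstruction (the object names bigram_freq and trigram_freq are placeholders):
# Tabulate and sort bigram and trigram frequencies
bigram_freq  <- sort(table(bigrams),  decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
bigram_freq[1:15]   # 15 most common bigrams
trigram_freq[1:15]  # 15 most common trigrams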
##
## right now look like last night cant wait look forward
## 212 178 171 161 138
## feel like thank follow dont know im go let know
## 130 109 108 105 94
## happi birthday year ago just got dont want last year
## 87 85 82 81 80
##
## happi mother day cant wait see let us know happi new year
## 35 29 25 19
## dream come true look forward see new york citi cinco de mayo
## 16 14 14 13
## im pretti sure dont even know im go go just got back
## 11 10 10 10
## make dream come make feel like thank veri much
## 10 10 10
The final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app that serves as the user interface. The Shiny app will take a phrase (multiple words) entered in a text box as input and output a prediction of the next word.
The predictive algorithm will be developed using an n-gram model with a word-frequency lookup similar to the one performed in the exploratory data analysis section of this report. A strategy will be built based on the knowledge gathered during the exploratory analysis. For example, as n increases, the frequency of each individual n-gram decreases. One possible strategy is therefore to have the model first suggest the most frequent unigram while a word is being typed and, once a full word has been entered followed by a space, look up the most frequent matching bigram, and so on.
Another possible strategy is to predict the next word using the trigram model; if no matching trigram is found, the algorithm backs off to the bigram model, and if still nothing matches, it falls back to the unigram model, as sketched below.
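A minimal sketch of that back-off lookup, assuming sorted, named n-gram frequency tables like those built above (the function predict_next_word and its exact interface are illustrative, not the final implementation):
# Back-off next-word prediction (sketch): try trigrams, then bigrams, then unigrams
predict_next_word <- function(phrase, trigram_freq, bigram_freq, unigram_freq) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  n <- length(words)
  last_word <- function(ngram) tail(strsplit(ngram, " ")[[1]], 1)

  # 1. Trigrams whose first two words match the end of the phrase
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
    if (length(hits) > 0) return(last_word(names(hits)[1]))
  }

  # 2. Back off to bigrams starting with the last word of the phrase
  if (n >= 1) {
    hits <- bigram_freq[startsWith(names(bigram_freq), paste0(words[n], " "))]
    if (length(hits) > 0) return(last_word(names(hits)[1]))
  }

  # 3. Fall back to the single most frequent unigram
  names(unigram_freq)[1]
}

# Example, using the frequency tables built in the exploratory analysis
unigram_freq <- sort(fc, decreasing = TRUE)
predict_next_word("I want a case of", trigram_freq, bigram_freq, unigram_freq)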
The final strategy will be the one that offers the best trade-off between efficiency and prediction accuracy.
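For reference, the Shiny interface described above could be wired to such a lookup function roughly as follows (a minimal sketch, not the final app):
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # assumes the n-gram frequency tables have been loaded into the app environment
  output$prediction <- renderText({
    predict_next_word(input$phrase, trigram_freq, bigram_freq, unigram_freq)
  })
}

shinyApp(ui = ui, server = server)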