The Data Science Capstone involves predictive text analytics. The overall objective is to help users complete sentences by analyzing the words they have already entered and predicting the next word. For example, if the first few words of a text are “I want a case of …”, then the model may predict “beer” based on the estimated probabilities.
The purpose of this Milestone Report is to demonstrate progress towards the end goal of this project. The specific sections are as follows:
Prepare the session by loading initial packages and clearing the global workspace.
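The setup chunk is not echoed in this report; a minimal sketch of it, assuming the tm and RWeka packages used in the rest of the analysis, would be:
library(tm)     # corpus construction and text cleaning
library(RWeka)  # NGramTokenizer for n-gram generation
rm(list = ls()) # clear the global workspace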
# Open connections to the three raw data files and read them line by line
con  <- file("./en_US.twitter.txt", open = "r")
con2 <- file("./en_US.news.txt",    open = "r")
con3 <- file("./en_US.blogs.txt",   open = "r")
twitter <- readLines(con,  encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(con2, encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines(con3, encoding = "UTF-8", skipNul = TRUE)
close(con); close(con2); close(con3)  # release the file connections
# Basic summary statistics: lines, words and average words per line per source
blog_lines <- length(blogs)
blog_words <- sum(sapply(strsplit(blogs, "\\s+"), length))
blog_wpl   <- round(blog_words / blog_lines, 2)
news_lines <- length(news)
news_words <- sum(sapply(strsplit(news, "\\s+"), length))
news_wpl   <- round(news_words / news_lines, 2)
twit_lines <- length(twitter)
twit_words <- sum(sapply(strsplit(twitter, "\\s+"), length))
twit_wpl   <- round(twit_words / twit_lines, 2)
# Combine the statistics into a summary table
twit <- round(rbind(twit_lines, twit_words, twit_wpl))
new  <- round(rbind(news_lines, news_words, news_wpl))
blo  <- round(rbind(blog_lines, blog_words, blog_wpl))
dt <- data.frame(twit, new, blo,
                 row.names = c("lines", "words", "words per line"))
dt
## twit new blo
## lines 2360148 77259 899288
## words 30373583 2643969 37334131
## words per line 13 34 42
The words-per-line statistic is interesting: blogs are the highest at 41.52, news is in the middle at 34.22, and Twitter is the lowest at 12.87 (the table above shows these values rounded to whole numbers). This makes intuitive sense, since tweets are limited to 140 characters and are naturally more concise, while blogs, as the most free-form style of communication, tend to be the most verbose.
Next, I sample 1% of each file so that the rest of the analysis runs smoothly within the available memory.
# Randomly sample 1% of each source and write the combined sample to disk
data_sample <- c(sample(blogs,   length(blogs)   * 0.01),
                 sample(news,    length(news)    * 0.01),
                 sample(twitter, length(twitter) * 0.01))
if (!dir.exists("./data_sample")) dir.create("./data_sample")
write(data_sample, file = "./data_sample/sample_data.txt")
# Clean up unused objects in memory.
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5693142 304.1 11689148 624.3 5720416 305.6
## Vcells 65194521 497.4 123359177 941.2 102732644 783.8
rm(list = ls())  # clear the workspace before building the corpus
setwd("~/Imparare_R/Capstone/Capstone_proj")
# Build a tm corpus from the sampled file
dir <- DirSource("./data_sample")
corpus <- Corpus(dir,
                 readerControl = list(reader = readPlain,
                                      language = "en_US"))
Now we have the corpus for our analysis: a single file ready for further cleaning and analysis. This section described the process used to create the sample file (training dataset): 1% of the lines were randomly sampled from each of the three raw data files (blogs, news, Twitter).
The cleaning procedure, performed with the help of the tm package, is the following:
corpus <- tm_map(corpus, FUN = stripWhitespace)    # removes extra whitespace
corpus <- tm_map(corpus, FUN = removeNumbers)      # drops digits
corpus <- tm_map(corpus, FUN = removePunctuation)  # drops punctuation
corpus <- tm_map(corpus, FUN = stemDocument)       # stems words (e.g. "happy" -> "happi")
corpus <- tm_map(corpus, FUN = content_transformer(tolower))  # lower-cases the text
corpus <- tm_map(corpus, FUN = removeWords, stopwords("english"))  # removes English stop words
# (stop words are removed after punctuation stripping, so contractions survive as "im", "dont", etc.)
saveRDS(corpus, file = "./sam.rds")  # save the cleaned corpus for later use
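As an optional sanity check (not part of the original report), the first few lines of the cleaned document can be inspected:
# Look at the beginning of the first cleaned document
writeLines(head(as.character(corpus[[1]]), 5))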
Exploratory data analysis will be performed to fulfill the primary goal of this report. Several techniques will be employed to develop an understanding of the training data, including looking at the most frequently used words, tokenizing, and n-gram generation. N-grams are a useful tool to identify the frequency of certain words and word patterns:
1-gram (unigram): indicates the frequency of single words
2-gram (bigram): indicates the frequency of two-word patterns
3-gram (trigram): indicates the frequency of three-word patterns
A bar chart and frequency tables will be constructed to illustrate the most common unigrams, bigrams and trigrams.
# Tokenize the cleaned corpus into n-grams with RWeka's NGramTokenizer
unigram <- NGramTokenizer(corpus, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
bigrams  <- bigram(corpus)
trigrams <- trigram(corpus)
gc() #clean up some memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3154210 168.5 9351319 499.5 7437155 397.2
## Vcells 11658100 89.0 98687342 753.0 102732644 783.8
# Tabulate unigram frequencies
dfcs <- data.frame(factor(unigram))
fs <- data.frame(dd = factor(dfcs$factor.unigram.))
fc <- table(fs$dd)  # frequency of each unique unigram
# Bar chart of the 15 most frequent unigrams
plot(sort(fc, decreasing = TRUE)[1:15], ylab = "Frequencies", col = "darkred",
     main = "15 Most Common Unigrams")
sort(fc, decreasing = TRUE)[1:15]
##
## just get like one go im love time can day make know good
## 2563 2434 2417 2244 2189 1948 1946 1935 1911 1813 1611 1560 1536
## thank now
## 1436 1405
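The bigram and trigram frequency tables below were produced in the same way as the unigram table; since that chunk is not echoed above, the following is a minimal reconstruction (the object names bigram_freq and trigram_freq are placeholders):
# Tabulate and sort bigram and trigram frequencies
bigram_freq  <- sort(table(bigrams),  decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
bigram_freq[1:15]   # 15 most common bigrams
trigram_freq[1:15]  # 15 most common trigrams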
##
## right now look like last night cant wait look forward
## 212 178 171 161 138
## feel like thank follow dont know im go let know
## 130 109 108 105 94
## happi birthday year ago just got dont want last year
## 87 85 82 81 80
##
## happi mother day cant wait see let us know happi new year
## 35 29 25 19
## dream come true look forward see new york citi cinco de mayo
## 16 14 14 13
## im pretti sure dont even know im go go just got back
## 11 10 10 10
## make dream come make feel like thank veri much
## 10 10 10
The final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app that serves as the user interface. The Shiny app will take a phrase (multiple words) entered in a text box as input and output a prediction of the next word.
The predictive algorithm will be developed using an n-gram model with a word-frequency lookup similar to the one performed in the exploratory data analysis section of this report. A strategy will be built based on the knowledge gathered during the exploratory analysis. For example, as n increases, the frequency of each individual n-gram decreases. One possible strategy is therefore to have the model first suggest the most frequent unigram while a word is being typed and, once a full word has been entered followed by a space, look up the most frequent matching bigram, and so on.
Another possible strategy is to predict the next word using the trigram model; if no matching trigram is found, the algorithm backs off to the bigram model, and if still nothing matches, it falls back to the unigram model, as sketched below.
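A minimal sketch of that back-off lookup, assuming sorted, named n-gram frequency tables like those built above (the function predict_next_word and its exact interface are illustrative, not the final implementation):
# Back-off next-word prediction (sketch): try trigrams, then bigrams, then unigrams
predict_next_word <- function(phrase, trigram_freq, bigram_freq, unigram_freq) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  n <- length(words)
  last_word <- function(ngram) tail(strsplit(ngram, " ")[[1]], 1)

  # 1. Trigrams whose first two words match the end of the phrase
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
    if (length(hits) > 0) return(last_word(names(hits)[1]))
  }

  # 2. Back off to bigrams starting with the last word of the phrase
  if (n >= 1) {
    hits <- bigram_freq[startsWith(names(bigram_freq), paste0(words[n], " "))]
    if (length(hits) > 0) return(last_word(names(hits)[1]))
  }

  # 3. Fall back to the single most frequent unigram
  names(unigram_freq)[1]
}

# Example, using the frequency tables built in the exploratory analysis
unigram_freq <- sort(fc, decreasing = TRUE)
predict_next_word("I want a case of", trigram_freq, bigram_freq, unigram_freq)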
The final strategy will be the one that offers the best trade-off between efficiency and prediction accuracy.
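For reference, the Shiny interface described above could be wired to such a lookup function roughly as follows (a minimal sketch, not the final app):
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # assumes the n-gram frequency tables have been loaded into the app environment
  output$prediction <- renderText({
    predict_next_word(input$phrase, trigram_freq, bigram_freq, unigram_freq)
  })
}

shinyApp(ui = ui, server = server)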