Introduction

The Coursera Data Science Capstone involves predictive text analytics. The overall objective is to help users complete sentences by analyzing the words they have already entered and predicting the next word.

The purpose of this Milestone Report is to demonstrate progress towards that goal. It covers loading the data and computing summary statistics, sampling the data and building a corpus, exploring the corpus with n-grams, and outlining the next steps towards the prediction algorithm.

Load the Data and Compute Summary Statistics

library(stringi)
library(tm)
## Loading required package: NLP
library(RWeka)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
news <- readLines("en_US.news.txt")
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
blogs <- readLines("en_US.blogs.txt")
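The warnings above are benign: a handful of Twitter lines contain embedded nul characters, and en_US.news.txt simply does not end with a newline. If we want to avoid the nul warnings, the files could be re-read with skipNul = TRUE and an explicit encoding, for example (a sketch, using the same file names as above):

#Optional re-read: skip embedded nul characters and read the files as UTF-8
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)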

In the next step we count the number of lines and the number of words in each of our datasets.

#Number of lines
paste("We have ", length(twitter), "lines in the twitter dataset")
## [1] "We have  2360148 lines in the twitter dataset"
paste("We have ", length(news), "lines in the news dataset")
## [1] "We have  77259 lines in the news dataset"
paste("We have ", length(blogs), "lines in the blogs dataset")
## [1] "We have  899288 lines in the blogs dataset"
#Number of words
paste("We have ", sum(stri_count_words(twitter)), "words in the twitter dataset")
## [1] "We have  30218125 words in the twitter dataset"
paste("We have ", sum(stri_count_words(news)), "words in the news dataset")
## [1] "We have  2693898 words in the news dataset"
paste("We have ", sum(stri_count_words(blogs)), "words in the blogs dataset")
## [1] "We have  38154238 words in the blogs dataset"

In the next step we randomly sample each dataset in order to obtain a smaller combined dataset that is easier to work with.

Create Sample Dataset and Corpus

#Sample lines from each file
twitter_sample <- sample(twitter, length(twitter)*0.01)
news_sample <- sample(news, length(news)*0.01)
blogs_sample <- sample(blogs, length(blogs)*0.01)

length(twitter_sample)
## [1] 23601
length(news_sample)
## [1] 772
length(blogs_sample)
## [1] 8992
alldata_sample <- c(twitter_sample, news_sample, blogs_sample)

corpus_sample <- VCorpus(VectorSource(alldata_sample))
#Placeholder for potentially removing stopwords at a later stage
##corpus_sample <- tm_map(corpus_sample, removeWords, stopwords("en"))
corpus_sample <- tm_map(corpus_sample, removeNumbers)
corpus_sample <- tm_map(corpus_sample, stripWhitespace)
corpus_sample <- tm_map(corpus_sample, content_transformer(tolower))
corpus_sample <- tm_map(corpus_sample, removePunctuation)
text_corpus <- data.frame(text = unlist(sapply(corpus_sample, `[`, "content")), stringsAsFactors = F)
text_corpus[1:5,]
## [1] "he went surfing today so lets pray thats why it looks like crap"                                                                   
## [2] "had fun playing checkers flying helicopters and riding atvs with the family yesterday more hiking today d"                         
## [3] "great movie have fun"                                                                                                              
## [4] "just found out some very exciting news about the launch of our app stay tuned to find out"                                         
## [5] "tree infoif you have tree damage from storms make sure you get a certifed arborist they can tell you what can and cant be salvaged"

Data Exploration with N-grams

#Tokenize the character vector of cleaned lines into n-grams of size 1 to 3
unigrams <- NGramTokenizer(text_corpus$text, Weka_control(min=1, max=1))
bigrams <- NGramTokenizer(text_corpus$text, Weka_control(min=2, max=2))
trigrams <- NGramTokenizer(text_corpus$text, Weka_control(min=3, max=3))

unigrams_table <- data.frame(table(unigrams))
bigrams_table <- data.frame(table(bigrams))
trigrams_table <- data.frame(table(trigrams))

unigrams_table_top20 <- head(unigrams_table[order(unigrams_table$Freq, decreasing = T),],20)
bigrams_table_top20 <- head(bigrams_table[order(bigrams_table$Freq, decreasing = T),],20)
trigrams_table_top20 <- head(trigrams_table[order(trigrams_table$Freq, decreasing = T),],20)

Top 20 Unigrams

ggplot(unigrams_table_top20, aes(x=unigrams, y=Freq)) + geom_bar(stat = "identity") + geom_text(aes(label=Freq)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Top 20 Bigrams

ggplot(bigrams_table_top20, aes(x=bigrams, y=Freq)) + geom_bar(stat = "identity") + geom_text(aes(label=Freq)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Top 20 Trigrams

ggplot(trigrams_table_top20, aes(x=trigrams, y=Freq)) + geom_bar(stat = "identity") + geom_text(aes(label=Freq)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Next Steps

The next step of this project will be to build the prediction algorithm and deploy it as a Shiny application. Our current strategy is to use the trigrams created in the analysis above to predict the next word; if there is no matching trigram, the algorithm falls back to the bigrams, and to the unigrams as a last resort. In the final stage of the project we will also need to take into account the time required for each prediction, since the model should be responsive in "real time".
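As a rough illustration of this back-off idea, the frequency tables built above could be queried as follows. This is only a sketch: predict_next_word() is a hypothetical helper, the regular-expression prefix matching is simplistic, and the final algorithm will need proper smoothing and much faster lookups.

#Sketch of a simple back-off lookup using the n-gram frequency tables above
#predict_next_word() is a hypothetical helper, not the final algorithm
predict_next_word <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  #1. Try trigrams whose first two words match the last two words typed
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams_table[grepl(paste0("^", prefix, " "), trigrams_table$trigrams), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$trigrams[which.max(hits$Freq)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  #2. Fall back to bigrams starting with the last word typed
  hits <- bigrams_table[grepl(paste0("^", words[n], " "), bigrams_table$bigrams), ]
  if (nrow(hits) > 0) {
    best <- as.character(hits$bigrams[which.max(hits$Freq)])
    return(tail(unlist(strsplit(best, " ")), 1))
  }
  #3. Last resort: the single most frequent unigram
  as.character(unigrams_table$unigrams[which.max(unigrams_table$Freq)])
}

predict_next_word("thanks for the")

In the final application the n-gram tables would be pre-computed, pruned and saved (for example with saveRDS()) so that a lookup of this kind remains fast enough for real-time use in Shiny.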