In this milestone report, we demonstrate predictive text analytics with n-grams. The goals of this assignment are to:
1. Demonstrate that the data has been downloaded and loaded successfully.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Give feedback on plans for creating a prediction algorithm and Shiny app.
The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). The following English files will be used to build the corpus for this project: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
library(tm)           # text mining framework
## Loading required package: NLP
library(wordcloud)    # word cloud plots
## Loading required package: RColorBrewer
library(RColorBrewer) # colour palettes for the word clouds
library(RWeka)        # NGramTokenizer for building n-grams
setwd("~/Coursera/10. Data Science Cap Stone Project/Coursera-SwiftKey/final/en_US")
blogs <- file("en_US.blogs.txt", "r")
news <- file("en_US.news.txt", "r")
tweets <- file("en_US.twitter.txt", "r")
blogdata <- readLines(blogs)
newsdata <- readLines(news)
## Warning in readLines(news): incomplete final line found on 'en_US.news.txt'
tweetdata <- readLines(tweets)
## Warning in readLines(tweets): line 167155 appears to contain an embedded
## nul
## Warning in readLines(tweets): line 268547 appears to contain an embedded
## nul
## Warning in readLines(tweets): line 1274086 appears to contain an embedded
## nul
## Warning in readLines(tweets): line 1759032 appears to contain an embedded
## nul
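These warnings flag an incomplete final line in the news file and a few embedded nul characters in the Twitter file; the reads still succeed. If we wanted to silence the nul warnings, readLines() accepts a skipNul argument. A minimal sketch (the report below keeps tweetdata exactly as read above):
# Optional: re-read the Twitter file, skipping embedded nul characters
tweetdata <- readLines("en_US.twitter.txt", skipNul = TRUE)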
# Summarise each dataset: file size in bytes, number of lines, and a
# rough word count (tokens obtained by splitting on single spaces)
data.Summary <- data.frame(Dataset = c("Blogs", "News", "Tweets"),
                           Filesize = c(file.size("en_US.blogs.txt"),
                                        file.size("en_US.news.txt"),
                                        file.size("en_US.twitter.txt")),
                           Lines = c(length(blogdata),
                                     length(newsdata),
                                     length(tweetdata)),
                           Words = c(sum(lengths(strsplit(blogdata, " "))),
                                     sum(lengths(strsplit(newsdata, " "))),
                                     sum(lengths(strsplit(tweetdata, " ")))))
# Close the file connections
close(blogs)
close(news)
close(tweets)
print(data.Summary)
## Dataset Filesize Lines Words
## 1 Blogs 210160014 899288 37334131
## 2 News 205811889 77259 2643969
## 3 Tweets 167105338 2360148 30373543
Because the full data files are too large to process comfortably, we take a 10% sample of each dataset. We then clean the combined sample: remove numbers and punctuation, strip extra whitespace, convert all words to lowercase, and remove English stop words.
# Sample 10% of each dataset without replacement
set.seed(1234)
blogs.Sample <- sample(blogdata, length(blogdata) * 0.1, replace = FALSE)
news.Sample <- sample(newsdata, length(newsdata) * 0.1, replace = FALSE)
tweets.Sample <- sample(tweetdata, length(tweetdata) * 0.1, replace = FALSE)
# Combine the samples into one corpus and clean it
sample.Corpus <- c(blogs.Sample, news.Sample, tweets.Sample)
sample.Corpus <- VCorpus(VectorSource(sample.Corpus))
sample.Corpus <- tm_map(sample.Corpus, removeNumbers)
sample.Corpus <- tm_map(sample.Corpus, removePunctuation)
sample.Corpus <- tm_map(sample.Corpus, stripWhitespace)
sample.Corpus <- tm_map(sample.Corpus, content_transformer(tolower))
sample.Corpus <- tm_map(sample.Corpus, removeWords, stopwords("english"))
sample.Corpus <- tm_map(sample.Corpus, PlainTextDocument)
# Tokenizers for single words and word pairs
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# Build term-document matrices and drop very sparse terms
unigram.tdm <- TermDocumentMatrix(sample.Corpus, control = list(tokenize = unigram))
bigram.tdm <- TermDocumentMatrix(sample.Corpus, control = list(tokenize = bigram))
unigram.tdm.temp <- removeSparseTerms(unigram.tdm, sparse = 0.99)
bigram.tdm.temp <- removeSparseTerms(bigram.tdm, sparse = 0.999)
# Total frequency of each term, sorted from most to least common
unitdmf <- sort(rowSums(as.matrix(unigram.tdm.temp)), decreasing = TRUE)
bitdmf <- sort(rowSums(as.matrix(bigram.tdm.temp)), decreasing = TRUE)
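Since the prediction model described later will consider sequences of up to three words, a trigram term-document matrix can be built the same way. A minimal sketch following the pattern above (these trigram objects are illustrative and not used elsewhere in this report):
# Sketch: trigram tokenizer and TDM, following the unigram/bigram pattern
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram.tdm <- TermDocumentMatrix(sample.Corpus, control = list(tokenize = trigram))
# Trigrams are much sparser than bigrams, so use a looser sparsity threshold
trigram.tdm.temp <- removeSparseTerms(trigram.tdm, sparse = 0.9999)
tritdmf <- sort(rowSums(as.matrix(trigram.tdm.temp)), decreasing = TRUE)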
Bar plot of the most frequent unigrams
barplot(head(unitdmf,5), main = "Most Frequent Unigrams - Top 5", col="deepskyblue1")
Word cloud of unigrams
wordcloud(names(unitdmf), unitdmf, colors = brewer.pal(6, "Paired"))
Bar plot of the most frequent bigrams
barplot(head(bitdmf,5), main = "Most Frequent Bigrams - Top 5", col="deepskyblue1")
Word cloud of bigrams
wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6, "Paired"))
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): cant wait could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): dont know could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): right now could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): last night could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): can get could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): happy birthday could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): thanks following could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): even though could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): thanks much could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): look like could not be fit on page. It will not be plotted.
The above analysis summarises the most frequently used words and word pairs in the corpus provided to us. We will use these findings to build a predictive model based on commonly used n-grams: given the last one, two or three words a user has typed, the model will suggest the most likely next word, with suggestions ranked by how frequently the matching n-grams occur in the corpus.
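To illustrate the idea, here is a minimal sketch of next-word prediction from the bigram frequencies computed above (bitdmf); predictNext() is a hypothetical helper for this report, not the final model:
# Sketch: suggest the n most frequent words that follow `word` in the
# bigram frequency table (names are of the form "first second")
predictNext <- function(word, bigram.freqs = bitdmf, n = 3) {
    word <- tolower(word)
    matches <- bigram.freqs[startsWith(names(bigram.freqs), paste0(word, " "))]
    if (length(matches) == 0) return(character(0))
    # Keep only the second word of the most frequent matching bigrams
    completions <- sub("^\\S+\\s+", "", names(sort(matches, decreasing = TRUE)))
    head(completions, n)
}
predictNext("happy")
The final model would extend this lookup to trigram contexts and back off to shorter n-grams when no match is found.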
Also, since the training data is large, we will create a sampling strategy to reduce its size.
Finally, we will deploy our model in a Shiny app: users will enter a short phrase, and the app will suggest the most suitable next word using our predictive model.
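As a rough sketch of how the app might look, assuming a predictNext() helper like the one sketched above is available to it:
# Minimal Shiny app sketch: read a phrase, suggest next words
library(shiny)
ui <- fluidPage(
    textInput("phrase", "Enter a short phrase:"),
    textOutput("suggestion")
)
server <- function(input, output) {
    output$suggestion <- renderText({
        words <- strsplit(trimws(input$phrase), "\\s+")[[1]]
        if (length(words) == 0) return("")
        # Predict from the last word typed
        paste(predictNext(tail(words, 1)), collapse = ", ")
    })
}
shinyApp(ui, server)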