Summary

In this milestone report, we demonstrate predictive text analytics with n-grams. The goals of this assignment are to:

  1. Demonstrate that the data have been downloaded and successfully loaded.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings made so far.
  4. Outline plans for creating a prediction algorithm and Shiny app.

Dataset

The data are from a corpus called HC Corpora (www.corpora.heliohost.org). The following files will be used to build the corpus for this project; a sketch of how to download them is shown after the list.

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt
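
The files were obtained from the Coursera-SwiftKey archive. The snippet below is a minimal download sketch; the URL is the one commonly used for this course and is an assumption here, so adjust it (and the working directory) if your copy of the data lives elsewhere.

# Minimal sketch (assumed URL for the Coursera-SwiftKey archive; adjust as needed)
zip.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(zip.url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")  # extracts final/en_US/en_US.*.txt among others
}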

Load the packages needed for exploring datasets

library(tm)
## Loading required package: NLP
library(NLP)
library(wordcloud)
## Loading required package: RColorBrewer
library(stringi)
library(manipulate)
library(openNLP)
library(RColorBrewer)
library(RWeka)

Loading data

setwd("~/Coursera/10. Data Science Cap Stone Project/Coursera-SwiftKey/final/en_US")

blogs <- file("en_US.blogs.txt", "r")
news <- file("en_US.news.txt", "r")
tweets <- file("en_US.twitter.txt", "r")

blogdata <- readLines(blogs)
newsdata <- readLines(news)
## Warning in readLines(news): incomplete final line found on 'en_US.news.txt'
tweetdata <- readLines(tweets)
## Warning in readLines(tweets): line 167155 appears to contain an embedded
## nul
## Warning in readLines(tweets): line 268547 appears to contain an embedded
## nul
## Warning in readLines(tweets): line 1274086 appears to contain an embedded
## nul
## Warning in readLines(tweets): line 1759032 appears to contain an embedded
## nul
data.Summary <- data.frame(Dataset = c("Blogs", "News", "Tweets"),
                       Filesize = c(file.size("en_US.blogs.txt"),
                                    file.size("en_US.news.txt"),
                                    file.size("en_US.twitter.txt")),
                       Lines = c(length(blogdata),
                                 length(newsdata),
                                 length(tweetdata)),
                       Words = c(sum(sapply(strsplit(blogdata, " "), FUN = length, simplify = TRUE)),
                                 sum(sapply(strsplit(newsdata, " "), FUN = length, simplify = TRUE)),
                                 sum(sapply(strsplit(tweetdata, " "), FUN = length, simplify = TRUE))
                                 )
                       )
close(blogs)                  
close(news)
close(tweets)


print(data.Summary)
##   Dataset  Filesize   Lines    Words
## 1   Blogs 210160014  899288 37334131
## 2    News 205811889   77259  2643969
## 3  Tweets 167105338 2360148 30373543
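
Note that the file sizes above are reported in bytes. The word counts were computed by splitting each line on single spaces; as an optional sanity check, the already-loaded stringi package offers stri_count_words(), sketched below. Its counts may differ slightly because it uses word boundaries rather than spaces.

# Optional cross-check of the word counts using stringi's word-boundary counter
sum(stri_count_words(blogdata))
sum(stri_count_words(newsdata))
sum(stri_count_words(tweetdata))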

Create a sample corpus to explore the data

Because the full data files are very large, we take a 10% sample of each dataset. We then cleanse the sample by removing numbers, punctuation, extra whitespace, and English stop words, and by converting all text to lowercase.

set.seed(1234)
blogs.Sample <- sample(blogdata, length(blogdata)*0.1, replace=FALSE)
news.Sample <- sample(newsdata, length(newsdata)*0.1, replace=FALSE)
tweets.Sample <- sample(tweetdata, length(tweetdata)*0.1, replace=FALSE)

sample.Corpus <- c(blogs.Sample, news.Sample, tweets.Sample)
sample.Corpus <- VCorpus(VectorSource(sample.Corpus))

sample.Corpus <- tm_map(sample.Corpus, removeNumbers)
sample.Corpus <- tm_map(sample.Corpus, removePunctuation)
sample.Corpus <- tm_map(sample.Corpus, stripWhitespace)
sample.Corpus <- tm_map(sample.Corpus, content_transformer(tolower))
sample.Corpus <- tm_map(sample.Corpus, removeWords, stopwords("english"))
sample.Corpus <- tm_map(sample.Corpus, PlainTextDocument)
# RWeka tokenizers for single words (unigrams) and word pairs (bigrams)
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# Build term-document matrices and drop very sparse terms to keep them manageable
unigram.tdm <- TermDocumentMatrix(sample.Corpus, control = list(tokenize = unigram))
bigram.tdm <- TermDocumentMatrix(sample.Corpus, control = list(tokenize = bigram))
unigram.tdm.temp <- removeSparseTerms(unigram.tdm, sparse = 0.99)
bigram.tdm.temp <- removeSparseTerms(bigram.tdm, sparse = 0.999)
# Total frequency of each n-gram across the sample, sorted most frequent first
unitdmf <- sort(rowSums(as.matrix(unigram.tdm.temp)), decreasing=TRUE)
bitdmf <- sort(rowSums(as.matrix(bigram.tdm.temp)), decreasing=TRUE)

Data Exploration

Bar plot of the most common unigrams

barplot(head(unitdmf,5), main = "Most Frequent Unigrams - Top 5", col="deepskyblue1")

Word cloud of unigrams

wordcloud(names(unitdmf), unitdmf, colors = brewer.pal(6, "Paired"))

Bar plot of the most common bigrams

barplot(head(bitdmf,5), main = "Most Frequent Bigrams - Top 5", col="deepskyblue1")

Word cloud of bigrams

wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6, "Paired"))
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): cant wait could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): dont know could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): right now could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): last night could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): can get could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): happy birthday could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): thanks following could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): even though could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): thanks much could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(bitdmf), bitdmf, colors = brewer.pal(6,
## "Paired")): look like could not be fit on page. It will not be plotted.

Plan for creating a prediction algorithm and Shiny app

The above analysis summarises the most frequently used words and word pairs in the corpus provided to us. We will use these findings to build a predictive model based on commonly used n-grams. Suggestions will be ranked by frequency, so that users of the model can select from a list of the most likely continuations, which could be one, two, or three words long, in that order of priority.
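
A minimal sketch of the intended frequency-based lookup is shown below, assuming the bigram frequencies bitdmf computed above are available. The predict_next helper is illustrative only, not the final implementation; the eventual model will also incorporate trigrams and a back-off strategy.

# Illustrative sketch: look up the most frequent completions of a word
# in the sampled bigram frequencies (bitdmf) computed above
bigram.df <- data.frame(ngram = names(bitdmf),
                        freq  = as.numeric(bitdmf),
                        stringsAsFactors = FALSE)
parts <- strsplit(bigram.df$ngram, " ")
bigram.df$first  <- sapply(parts, `[`, 1)
bigram.df$second <- sapply(parts, `[`, 2)

predict_next <- function(word, n = 3) {
    candidates <- bigram.df[bigram.df$first == tolower(word), ]
    candidates <- candidates[order(-candidates$freq), ]
    head(candidates$second, n)
}

predict_next("last")  # e.g. could suggest "night" if "last night" is frequent in the sample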

Also, since the training data are large, we will refine the sampling strategy used above to reduce the size of the training data.

Finally, we will deploy the model in a Shiny app where users can enter a short phrase, and the app will suggest the most suitable next word using our predictive model.
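
A rough outline of such an app is sketched below, assuming a predict_next() function like the one above; the input and output names are placeholders, not the final design.

# Minimal Shiny sketch: suggest next words based on the last word typed
library(shiny)

ui <- fluidPage(
    titlePanel("Next-Word Prediction"),
    textInput("phrase", "Enter a short phrase:"),
    verbatimTextOutput("suggestions")
)

server <- function(input, output) {
    output$suggestions <- renderPrint({
        words <- strsplit(trimws(tolower(input$phrase)), "\\s+")[[1]]
        if (length(words) == 0) return("Waiting for input...")
        predict_next(tail(words, 1))  # predict from the last word typed
    })
}

shinyApp(ui = ui, server = server)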