Executive Summary

This milestone report is for the Data Science Capstone Project. The goal of the project is to build a predictive text model that predicts the next word as the user types a sentence, similar to the suggestions offered when texting on a smartphone.

This report focuses on the exploratory data analysis and the goals for the final application and algorithm.

The dataset used for training the model comes from a corpus called HC Corpora (http://www.corpora.heliohost.org/). The data provided for this project comes from three sources: blogs, news and Twitter feeds. It is also available in four languages: English, German, Finnish and Russian. For this project we will use the English data.

Getting and Cleaning Data

Loading the Required Libraries

library(tm)
## Loading required package: NLP
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RWeka)
library(knitr)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)

Loading the Dataset

blogs <- readLines("./data/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./data/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./data/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Initial Analysis of the Raw Data

dataset_stats <- function(name, filename, dataset) {
  # File size in megabytes
  fs <- file.info(filename)$size / 1024^2
  
  # Number of lines
  lc <- length(dataset)
  
  # Length (in characters) of the longest line
  max_lc <- max(nchar(dataset))
  
  # Approximate word count, splitting each line on spaces
  words <- strsplit(dataset, " ")
  wc <- sum(sapply(words, length))
  
  return(list(filename=name, filesize=fs, lines=lc, max_line=max_lc, words=wc))
}

stats_blogs <- dataset_stats("en_US.blogs", "./data/en_US/en_US.blogs.txt", blogs)
stats_news <- dataset_stats("en_US.news", "./data/en_US/en_US.news.txt", news)
stats_twitter <- dataset_stats("en_US.twitter", "./data/en_US/en_US.twitter.txt", twitter)

stats_df <- data.frame(filename=c(stats_blogs$filename, stats_news$filename, stats_twitter$filename),
                       filesize=c(stats_blogs$filesize, stats_news$filesize, stats_twitter$filesize),
                       line_count=c(stats_blogs$lines, stats_news$lines, stats_twitter$lines),
                       max_line=c(stats_blogs$max_line, stats_news$max_line, stats_twitter$max_line),
                       word_count=c(stats_blogs$words, stats_news$words, stats_twitter$words))

kable(stats_df, col.names = c("Filename", "Filesize (MB)", "Line Count", "Longest Line", "Word Count"),
      digits = 2)
Filename        Filesize (MB)   Line Count   Longest Line   Word Count
en_US.blogs            200.42       899288          40833     37334131
en_US.news             196.28      1010242          11384     34372530
en_US.twitter          159.36      2360148            140     30373583

Processing the Data

As the datasets are so large, a sample of each will be extracted and used in the analysis.

# Convert the Twitter data to ASCII, substituting non-convertible bytes with their hex codes
twitter_clean <- iconv(twitter, "UTF-8", "ASCII", "byte")

set.seed(12321)
sample_blogs <- blogs[sample(1:length(blogs), 10000)]
sample_news <- news[sample(1:length(news), 10000)]
sample_twitter <- twitter_clean[sample(1:length(twitter_clean), 10000)]
sample_data <- c(sample_blogs, sample_news, sample_twitter)
writeLines(sample_data, "./data/sample_data.txt")

We will now load the sampled data as a corpus of text using the ‘Corpus’ function from the tm package.

cname <- file.path("~", "src", "R", "Coursera", "Capstone", "data")
dir(cname)
## [1] "en_US"           "sample_data.txt"
txt.corpus <- Corpus(DirSource(cname))
summary(txt.corpus)
##                 Length Class             Mode
## sample_data.txt 2      PlainTextDocument list

Before performing the exploratory data analysis, we need to tidy up the data. This consists of converting all characters to lowercase, removing punctuation, numbers and profanity, stripping extra whitespace and converting the documents back to plain text.

# Profanity list used to filter offensive words from the corpus
profanity <- readLines("./profanity_list.txt")

txt.corpus <- tm_map(txt.corpus, tolower)            # note: newer tm versions expect content_transformer(tolower)
txt.corpus <- tm_map(txt.corpus, removePunctuation)
txt.corpus <- tm_map(txt.corpus, removeNumbers)
txt.corpus <- tm_map(txt.corpus, removeWords, profanity)
txt.corpus <- tm_map(txt.corpus, stripWhitespace)
txt.corpus <- tm_map(txt.corpus, PlainTextDocument)  # convert the documents back to plain text

You will have noticed that stopwords were not removed during the cleaning stage. This was deliberate: stopwords occur naturally in sentences and are often exactly the word a user expects to be suggested next.
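For comparison, dropping stopwords would only take one more transformation; the line below is shown purely for reference and is not applied to the corpus used in this report.

# Not run: stopwords are intentionally kept for next-word prediction
# txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("en"))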

Exploratory Data Analysis

We will now create Term Document Matrices (TDMs), which record how often each term (n-gram) occurs in each document of the corpus.

# Tokenizer functions
unigram_token <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram_token <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram_token <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
quadgram_token <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))

# Create a Term Document Matrix
unigram_tdm <- TermDocumentMatrix(txt.corpus, control = list(tokenize=unigram_token))
bigram_tdm <- TermDocumentMatrix(txt.corpus, control = list(tokenize=bigram_token))
trigram_tdm <- TermDocumentMatrix(txt.corpus, control = list(tokenize=trigram_token))
quadgram_tdm <- TermDocumentMatrix(txt.corpus, control = list(tokenize=quadgram_token))

We are now going to explore the n-grams and find the most frequently occurring words and phrases.

For the Unigram matrix, the following words occurred more than 4000 times.

findFreqTerms(unigram_tdm, 4000)
##  [1] "and"  "are"  "but"  "for"  "have" "that" "the"  "this" "was"  "with"
## [11] "you"

For the Bigram matrix, the following phrases occurred more than 1000 times.

findFreqTerms(bigram_tdm, 1000)
##  [1] "and the"  "at the"   "for the"  "in a"     "in the"   "of the"  
##  [7] "on the"   "to be"    "to the"   "with the"

For the Trigram matrix, the following phrases occurred more than 100 times.

findFreqTerms(trigram_tdm, 100)
##  [1] "a couple of"    "a lot of"       "as well as"     "be able to"    
##  [5] "going to be"    "i dont know"    "i want to"      "it was a"      
##  [9] "one of the"     "out of the"     "part of the"    "some of the"   
## [13] "thanks for the" "the end of"     "the rest of"    "to be a"

For the Quadgram matrix, the following phrases occurred more than 30 times.

findFreqTerms(quadgram_tdm, 30)
##  [1] "a lot of people"    "at the end of"      "at the same time"  
##  [4] "for the first time" "going to be a"      "if you want to"    
##  [7] "in the middle of"   "is one of the"      "one of the most"   
## [10] "the end of the"     "the rest of the"    "to be able to"     
## [13] "when it comes to"

Because TDMs are stored as sparse matrices, we will now convert the n-gram matrices into ordinary matrices for use in the plots.

# Convert to normal matrix
unigram_matrix <- as.matrix(unigram_tdm)
bigram_matrix <- as.matrix(bigram_tdm)
trigram_matrix <- as.matrix(trigram_tdm)
quadgram_matrix <- as.matrix(quadgram_tdm)

To create a plot of the Unigrams we will sort the unigram matrix and create a data frame.

# Sort Unigram matrix and create Unigram data frame
unigram_sort <- sort(rowSums(unigram_matrix), decreasing = TRUE)
unigram_words <- data.frame(word = names(unigram_sort), freq = unigram_sort)

Create a Word Cloud of the Unigram Matrix

# Create a Word Cloud
wordcloud(unigram_words$word, unigram_words$freq, scale=c(10, .8), max.words=100, 
          random.order=FALSE, colors=brewer.pal(10, "Paired"))

This is a plot of the Top 20 Unigrams by Frequency. As you can see, the top three words are ‘the’, ‘and’ and ‘that’.

# Create a Unigram Plot
ggplot(unigram_words[1:20,], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat="identity") +
  labs(x="Words", y="Frequency") +
  ggtitle("Top 20 Unigrams by Frequency") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1, size=10),
        axis.title=element_text(size=14))

Sort the Bigram matrix into frequency order and create a data frame.

# Sort Bigram matrix and create Bigram data frame
bigram_sort <- sort(rowSums(bigram_matrix), decreasing = TRUE)
bigram_words <- data.frame(word = names(bigram_sort), freq = bigram_sort)

Create a Word Cloud of the Bigram Matrix

# Create a Word Cloud
wordcloud(bigram_words$word, bigram_words$freq, scale=c(6, .4), max.words=100, 
          random.order=FALSE, colors=brewer.pal(10, "Paired"))

This is a plot of the Top 20 Bigrams by Frequency. The top 3 Bigrams are: ‘of the’, ‘in the’, ‘to the’.

ggplot(bigram_words[1:20,], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat="identity") +
  labs(x="Words", y="Frequency") +
  ggtitle("Top 20 Bigrams by Frequency") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1, size=10),
        axis.title=element_text(size=14))

Sort the Trigram matrix into frequency order and create a data frame.

# Sort Trigram matrix and create Trigram data frame
trigram_sort <- sort(rowSums(trigram_matrix), decreasing = TRUE)
trigram_words <- data.frame(word = names(trigram_sort), freq = trigram_sort)

Create a Word Cloud of the Trigram Matrix

# Create a Word Cloud
wordcloud(trigram_words$word, trigram_words$freq, scale=c(3, .3), max.words=100, 
          random.order=FALSE, colors=brewer.pal(10, "Paired"))

This is a plot of the Top 20 Trigrams by Frequency. The top 3 Trigrams are: ‘one of the’, ‘a lot of’, ‘to be a’.

ggplot(trigram_words[1:20,], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat="identity") + 
  labs(x="Words", y="Frequency") +
  ggtitle("Top 20 Trigrams by Frequency") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1, size=10),
        axis.title=element_text(size=14))

Sort the Quadgram matrix into frequency order and create a data frame.

# Sort Quadgram matrix and create Quadgram data frame
quadgram_sort <- sort(rowSums(quadgram_matrix), decreasing = TRUE)
quadgram_words <- data.frame(word = names(quadgram_sort), freq = quadgram_sort)

Create a Word Cloud of the Quadgram Matrix

# Create a Word Cloud
wordcloud(quadgram_words$word, quadgram_words$freq, scale=c(2, .2), max.words=100, 
          random.order=FALSE, colors=brewer.pal(10, "Paired"))

This is a plot of the Top 20 Quadgrams by Frequency. The top 3 Quadgrams are: ‘at the end of’, ‘the rest of the’, ‘the end of the’.

ggplot(quadgram_words[1:20,], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat="identity") + 
  labs(x="Words", y="Frequency") +
  ggtitle("Top 20 Quadgrams by Frequency") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1, size=10),
        axis.title=element_text(size=14))

Plans for Prediction Algorithms and Shiny App

Now that we have performed some exploratory data analysis, we are ready to build the prediction algorithm.

For the predictive model we will use an n-gram model with a frequency lookup, combined with a back-off strategy. If the phrase the user has typed contains fewer than four words, we will use the (word count + 1)-gram model; for example, if the user has entered ‘What time’, we will use the trigram model to predict the next word. Once the user has entered four or more words, we will use the quadgram model. If no matching quadgram is found, the algorithm backs off to the next model in the series, in this case the trigram; if that also finds no match, it falls back to the bigram, and so on. A sketch of this back-off lookup is shown below.
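To illustrate the intended lookup, here is a minimal sketch of such a back-off function, assuming the n-gram frequency data frames built above (quadgram_words, trigram_words, bigram_words and unigram_words, each with a word column holding the full n-gram and a freq column, sorted by decreasing frequency). The function name predict_next_word and its exact matching logic are illustrative placeholders rather than the final implementation.

# Sketch of a simple frequency back-off lookup (illustrative only)
predict_next_word <- function(phrase,
                              ngram_tables = list(quadgram_words, trigram_words, bigram_words)) {
  # Split the input phrase into lowercase tokens
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  for (tbl in ngram_tables) {
    # Order of this n-gram table, inferred from its first (most frequent) entry
    n <- length(unlist(strsplit(as.character(tbl$word[1]), " ")))
    if (length(tokens) < n - 1) next  # not enough context for this model
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    # Keep n-grams whose first n-1 words match the end of the phrase
    matches <- tbl[startsWith(as.character(tbl$word), paste0(prefix, " ")), ]
    if (nrow(matches) > 0) {
      # Return the final word of the top three matching n-grams
      return(head(sapply(strsplit(as.character(matches$word), " "), tail, 1), 3))
    }
  }
  # No match at any level: fall back to the most frequent single words
  head(as.character(unigram_words$word), 3)
}

predict_next_word("What time")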

The user interface of the Shiny app will consist of a text input box that allows the user to type a phrase. As each word is entered, the prediction model will attempt to predict the next word and offer three suggestions, as in the sketch below.
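As a rough illustration of that interface, the sketch below shows a single-file Shiny app; the input IDs, layout and the call to the hypothetical predict_next_word() helper from the previous sketch are assumptions rather than the final design.

library(shiny)

# Minimal sketch of the planned app: a text box plus three suggested next words
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  h4("Suggested next words:"),
  textOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    # 'predict_next_word' is the illustrative helper sketched earlier
    paste(predict_next_word(input$phrase), collapse = " | ")
  })
}

shinyApp(ui = ui, server = server)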