Introduction

This is the Milestone Report for week 2 of the Coursera Data Science Capstone project.

The goal of this report is to perform exploratory analysis to understand statistical properties of the data set that can later be used when building the prediction model for the final Shiny application. Here we will identify the major features of the training data and then summarize plans for the predictive model.

The model will be trained using a unified document corpus compiled from the following three sources of text data:

  1. Blogs
  2. News
  3. Twitter

The data can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Unzip the archive and load the data from the current working directory.

The files are:

  * Blogs: en_US.blogs.txt
  * News: en_US.news.txt
  * Twitter: en_US.twitter.txt
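
If the files are not already present locally, the download and extraction step can be scripted. The following is a minimal sketch only; the data/ destination folder is an assumption chosen to match the file paths used in the code below:

# download and unzip the data set into ./data (run once)
if (!file.exists("data/final/en_US/en_US.blogs.txt")) {
    url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip", exdir = "data")
}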

# blogs
blogsFileName <- "data/final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# news
newsFileName <- "data/final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete final
## line found on 'data/final/en_US/en_US.news.txt'
close(con)

# twitter
twitterFileName <- "data/final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

rm(con)

Basic Data Summary

Basic summary of the three text corpora.

library(knitr)
library(stringi)
library(kableExtra)
# assign sample size (0.2% of each file)
sampleSize <- 0.002
# file size
fileSizeMB <- round(file.info(c(blogsFileName,
                                newsFileName,
                                twitterFileName))$size / 1024 ^ 2)
# num lines per file
numLines <- sapply(list(blogs, news, twitter), length)
# num characters per file
numChars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
# num words per file
numWords <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]
# words per line
wpl <- lapply(list(blogs, news, twitter), function(x) stri_count_words(x))
# words per line summary
wplSummary <- sapply(list(blogs, news, twitter),
                     function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wplSummary) <- c('WPL.Min', 'WPL.Mean', 'WPL.Max')
summary <- data.frame(
    File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
    FileSize = paste0(fileSizeMB, " MB"),
    Lines = numLines,
    Characters = numChars,
    Words = numWords,
    t(round(wplSummary))
)
kable(summary,
      row.names = FALSE,
      align = c("l", rep("r", 7)),
      caption = "") %>% kable_styling(position = "left")
File                 FileSize      Lines   Characters      Words   WPL.Min   WPL.Mean   WPL.Max
en_US.blogs.txt        200 MB     899288    206824505   37570839         0         42      6726
en_US.news.txt         196 MB      77259     15639408    2651432         1         35      1123
en_US.twitter.txt      159 MB    2360148    162096241   30451170         1         13        47

Histogram of Words per Line

Histogram of words per line for the three text corpora.

library(ggplot2)
library(gridExtra)
plot1 <- ggplot(data.frame(wpl = wpl[[1]]), aes(x = wpl)) +
    geom_histogram(binwidth = 5) +
    labs(title = "US Blogs", x = "Words per Line", y = "Frequency")
plot2 <- ggplot(data.frame(wpl = wpl[[2]]), aes(x = wpl)) +
    geom_histogram(binwidth = 5) +
    labs(title = "US News", x = "Words per Line", y = "Frequency")
plot3 <- ggplot(data.frame(wpl = wpl[[3]]), aes(x = wpl)) +
    geom_histogram(binwidth = 1) +
    labs(title = "US Twitter", x = "Words per Line", y = "Frequency")
plotList = list(plot1, plot2, plot3)
do.call(grid.arrange, c(plotList, list(ncol = 1)))

# free up some memory
rm(plot1, plot2, plot3)

Prepare the Data

Prior to performing exploratory data analysis, the three data sets will be sampled at 0.2% (sampleSize = 0.002) to improve performance. All non-English characters will be removed from the subset of data, and the three samples will then be combined into a single data set. The combined sample data set, which contains 6,672 lines and 141,607 words, will be written to disk.

The next step is to create a corpus from the sampled data set. A custom function named buildCorpus will be employed to perform the following transformation steps for each document:

  1. Remove URLs, Twitter handles, and email addresses by converting them to spaces using a custom content transformer
  2. Remove profanity
  3. Convert all words to lowercase
  4. Remove common English stop words
  5. Remove punctuation marks
  6. Remove numbers
  7. Trim whitespace
  8. Convert to plain text documents

# set seed for reproducibility
set.seed(666)
# sample all three data sets
sampleBlogs <- sample(blogs, length(blogs) * sampleSize, replace = FALSE)
sampleNews <- sample(news, length(news) * sampleSize, replace = FALSE)
sampleTwitter <- sample(twitter, length(twitter) * sampleSize, replace = FALSE)
# remove all non-English characters from the sampled data
sampleBlogs <- iconv(sampleBlogs, "latin1", "ASCII", sub = "")
sampleNews <- iconv(sampleNews, "latin1", "ASCII", sub = "")
sampleTwitter <- iconv(sampleTwitter, "latin1", "ASCII", sub = "")
# combine all three data sets into a single data set and write to disk
sampleData <- c(sampleBlogs, sampleNews, sampleTwitter)
sampleDataFileName <- "data/final/en_US/en_US.sample.txt"
con <- file(sampleDataFileName, open = "w")
writeLines(sampleData, con)
close(con)
# get number of lines and words from the sample data set
sampleDataLines <- length(sampleData)
sampleDataWords <- sum(stri_count_words(sampleData))
# remove variables no longer needed to free up memory
rm(blogs, news, twitter, sampleBlogs, sampleNews, sampleTwitter)

The corpus will then be written to disk in two formats: a serialized R object in RDS format and as a text file. Finally, the first 10 documents (lines) from the corpus will be displayed.

library(tm)
buildCorpus <- function (dataSet) {
    docs <- VCorpus(VectorSource(dataSet))
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

    # remove URL, Twitter handle and email patterns
    docs <- tm_map(docs, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
    docs <- tm_map(docs, toSpace, "@[^\\s]+")
    docs <- tm_map(docs, toSpace, "[[:alnum:]._-]+@[[:alnum:].-]+")

    # remove profane words using a public "bad words" list, keeping a few
    # entries from that list that are harmless in this context
    profanity <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
                           encoding = "UTF-8", skipNul = TRUE)
    profanity <- profanity[!profanity %in% c("screw", "looser", "color")]
    profanity <- iconv(profanity, "latin1", "ASCII", sub = "")
    docs <- tm_map(docs, removeWords, profanity)

    # standard tm clean-up steps
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removeWords, stopwords("english"))
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, stripWhitespace)
    docs <- tm_map(docs, PlainTextDocument)
    return(docs)
}
# build the corpus and write to disk (RDS)
corpus <- buildCorpus(sampleData)
saveRDS(corpus, file = "data/final/en_US/en_US.corpus.rds")
# convert corpus to a dataframe and write lines/words to disk (text)
corpusText <- data.frame(text = unlist(sapply(corpus, '[', "content")), stringsAsFactors = FALSE)
con <- file("data/final/en_US/en_US.corpus.txt", open = "w")
writeLines(corpusText$text, con)
close(con)
kable(head(corpusText$text, 10),
      row.names = FALSE,
      col.names = NULL,
      align = c("l"),
      caption = "First 10 Documents") %>% kable_styling(position = "left")
First 10 Documents
began sophmore year high school school year began simple enough class worried earth science taken lower level class chemistry something enjoyed walked classroom noticed two things friends class collection dumbest people grown throughout elementary middle school response nope promptly class ended went counselors office change schedule end got switched chemistry friends class full idiots never worse teacher life kind teacher makes cringe anything remotely related subject comes outside school made hate chemistry much feeling grown love
declutters makes piles unpile reclutter every week cant believe three words arent words declutter unpile reclutter hold call danny webster give whatfor pound fist knock change
also fact write one day will never thing will write next day perceptions meaning life whatever constantly changing knowing think im fanatic journal keeper
took yr blended circles side brush tip just first petal amount time effort now can see big different two petals first still harsh lines second smooth blend
see juan saw young male rob meijers hiphop style clothing sneakers nikes
crime delinquency council issues annual media awards
familiar exact constitution tarot pack
bashing depression
posting photos taken last week installation process landscape futures instruments devices architectural inventions nevada museum art reno exhibition opens saturday august will february
april newhouse news service reported richard nixon commissioned rand study feasibility canceling election rand denied reviewed recent work possible sources story said review fruitless
# remove variables no longer needed to free up memory
rm(sampleData)

Word Frequencies

A bar chart and word cloud will be constructed to illustrate unique word frequencies.

library(wordcloud)
library(RColorBrewer)
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq)
# plot the top 10 most frequent words
g <- ggplot(wordFreq[1:10, ], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Word Frequencies")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Words")
print(g)

# construct word cloud
suppressWarnings (
    wordcloud(words = wordFreq$word,
              freq = wordFreq$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35, 
              colors=brewer.pal(8, "Dark2"))
)

# remove variables no longer needed to free up memory
rm(tdm, freq, wordFreq, g)

Exploratory data analysis will be performed to fulfill the primary goal of this report. Several techniques will be employed to develop an understanding of the training data, including examining the most frequently used words, tokenization, and n-gram generation.

Tokenizing and N-Gram Generation

The predictive model I plan to develop for the Shiny application will handle unigrams, bigrams, and trigrams. In this section, I will use the tokenizers package to build tokenizer functions and demonstrate, on a small example corpus, how they can be used to construct matrices of unigrams, bigrams, and trigrams.

Tokenize Functions

library(tokenizers)
# N-gram tokenizer factory: returns a function that splits text into n-grams
ngramTokenizer <- function(n) {
  function(x) unlist(tokenize_ngrams(as.character(x), n = n, n_min = n))
}

# tokenizer functions for unigrams, bigrams, and trigrams
unigramTokenizer <- ngramTokenizer(1)
bigramTokenizer <- ngramTokenizer(2)
trigramTokenizer <- ngramTokenizer(3)

# quick check on an example sentence
text <- "This is an example sentence for tokenization."
bigramTokenizer(text)

Unigrams

# Load necessary libraries
library(tm)
library(ggplot2)

# Create a simple example corpus
docs <- c("This is the first document.",
          "This document is the second document.",
          "And this is the third one.",
          "Is this the first document?")

# use a VCorpus so the custom tokenizer passed via `control` is honored
corpus <- VCorpus(VectorSource(docs))

# Define the unigramTokenizer function
unigramTokenizer <- function(x) unlist(strsplit(tolower(as.character(x)), "\\W+"))

# Create document-term matrix for the corpus
unigramMatrix <- DocumentTermMatrix(corpus, control = list(tokenize = unigramTokenizer))

# Convert the DocumentTermMatrix to a matrix
termMatrix <- as.matrix(unigramMatrix)

# Get the frequency of the most common unigrams (the columns of a DTM are terms)
unigramFreq <- sort(colSums(termMatrix), decreasing = TRUE)
unigramMatrixFreq <- data.frame(word = names(unigramFreq), freq = unigramFreq)

# Generate plot
g <- ggplot(head(unigramMatrixFreq, 20), aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Unigrams")
print(g)

Bigrams

# create term document matrix for the corpus
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
bigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(bigramMatrix, 0.999))), decreasing = TRUE)
bigramMatrixFreq <- data.frame(word = names(bigramMatrixFreq), freq = bigramMatrixFreq)
# generate plot
g <- ggplot(head(bigramMatrixFreq, 20), aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Bigrams")
print(g)

Trigrams

# create term document matrix for the corpus
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
trigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(trigramMatrix, 0.9999))), decreasing = TRUE)
trigramMatrixFreq <- data.frame(word = names(trigramMatrixFreq), freq = trigramMatrixFreq)
# generate plot
g <- ggplot(head(trigramMatrixFreq, 20), aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Trigrams")
print(g)

The final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app that serves as the user interface. The app should take a phrase (multiple words) as input in a text box and output a prediction of the next word.

The predictive algorithm will be developed using an n-gram model with a word-frequency lookup similar to the one performed in the exploratory data analysis section of this report. The strategy will build on the knowledge gathered during the exploratory analysis; for example, as n increased for each n-gram, the frequency of its terms decreased. One possible strategy is therefore to have the model first look for the most frequent unigram completion of the text being typed; once a full term has been entered followed by a space, look up the most frequent bigram continuation, and so on.

Another possible strategy is to predict the next word using the trigram model. If no matching trigram can be found, the algorithm would back off to the bigram model, and if there is still no match, fall back to the unigram model. A rough sketch of this backoff lookup is shown below.
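
As an illustration only, here is a minimal sketch of such a backoff lookup in R. It assumes n-gram frequency tables of the kind built above, stored as data frames with a word column (the space-separated n-gram) and a freq column; the function name predictNextWord and the table layout are assumptions for this sketch, not the final model.

# illustrative backoff lookup (sketch only, not the final model)
# assumes data frames with columns `word` (space-separated n-gram) and `freq`
predictNextWord <- function(phrase, trigramFreq, bigramFreq, unigramFreq) {
    tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(tokens)
    # try the trigram table first: match on the last two words entered
    if (n >= 2) {
        prefix <- paste(tokens[n - 1], tokens[n])
        hits <- trigramFreq[startsWith(trigramFreq$word, paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
            best <- hits$word[which.max(hits$freq)]
            return(tail(unlist(strsplit(best, " ")), 1))
        }
    }
    # back off to the bigram table: match on the last word entered
    if (n >= 1) {
        hits <- bigramFreq[startsWith(bigramFreq$word, paste0(tokens[n], " ")), ]
        if (nrow(hits) > 0) {
            best <- hits$word[which.max(hits$freq)]
            return(tail(unlist(strsplit(best, " ")), 1))
        }
    }
    # final fallback: the single most frequent unigram
    as.character(unigramFreq$word[which.max(unigramFreq$freq)])
}

A weighted variant (for example a "stupid backoff" that discounts lower-order matches) could be layered on top of this, but the simple version above is enough to illustrate the lookup order.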

The final strategy will be chosen based on which approach offers the best balance of efficiency and prediction accuracy.