This is the milestone report for week 2 of the Johns Hopkins University on Coursera Data Science Capstone project. The overall goal of the Capstone project is to build a predictive text model using Natural Language Processing (NLM) along with a predictive text application that will determine the most likely next word when a user inputs a word or a phrase.
The purpose of this milestone report is to demonstrate how the data was downloaded, imported into R, and cleaned. This report also contains an exploratory analysis of the data including summary statistics about the three separate data sets (blogs, news and tweets), graphics that illustrate features of the data, interesting findings discovered along the way, and an outline of the next steps that will be taken toward building the predictive application.
library(tm)
## Loading required package: NLP
library(RWeka)
library(stringi)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pryr)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
##
## Attaching package: 'pryr'
## The following object is masked from 'package:tm':
##
## inspect
library(RColorBrewer)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(textmineR)
## Loading required package: Matrix
##
## Attaching package: 'textmineR'
## The following object is masked from 'package:Matrix':
##
## update
## The following object is masked from 'package:stats':
##
## update
library(wordcloud)
if (!file.exists("Coursera-SwiftKey.zip")){
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
blogs <- readLines("./final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
stats <- data.frame(
FileName=c("blogs", "news", "twitter"),
FileSize=sapply(list(blogs, news, twitter), function(x){format(object.size(x), "MB")}),
# FileSizeMB=c(file.info("./en_US.blogs.txt")$size/1024^2,
#file.info("./en_US.news.txt")$size/1024^2,
#file.info("./en_US.twitter.txt")$size/1024^2),
t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),#[c("Lines", "Chars"),],
Words = sapply(list(blogs, news, twitter), stri_stats_latex)[4,])
)
)
stats
## FileName FileSize Lines LinesNEmpty Chars CharsNWhite Words
## 1 blogs 255.4 Mb 899288 899288 206824382 170389539 37570839
## 2 news 19.8 Mb 77259 77259 15639408 13072698 2651432
## 3 twitter 319 Mb 2360148 2360148 162096241 134082806 30451170
From the summary, we can see the sizes of the data files are quite large (the smallest file in the set is nearly 160MB). So, we are going to subset the data into three new data files containing a 1% sample of each of the original data files. We are going to start with a 1% sample and check the size of the VCorpus (Virtual Corpus) object that will be loaded into memory.
We will set a seed so the sampling will be reproducible. Before building the corpus, we will create a combined sample file and once again check the summary statistics to make sure the file sizes are not too large.
set.seed(2705)
sampleSize <- 0.01
blogsSub <- sample(blogs, length(blogs) * sampleSize)
newsSub <- sample(news, length(news) * sampleSize)
twitterSub <- sample(twitter, length(twitter) * sampleSize)
sampleData <- c(sample(blogs, length(blogs) * sampleSize),
sample(news, length(news) * sampleSize),
sample(twitter, length(twitter) * sampleSize))
sampleStats <- data.frame(
FileName=c("blogsSub", "newsSub", "twitterSub", "sampleData"),
FileSize=sapply(list(blogsSub, newsSub, twitterSub, sampleData), function(x){format(object.size(x), "MB")}),
t(rbind(sapply(list(blogsSub, newsSub, twitterSub, sampleData), stri_stats_general),#[c("Lines", "Chars"),],
Words = sapply(list(blogsSub, newsSub, twitterSub, sampleData), stri_stats_latex)[4,])
)
)
sampleStats
## FileName FileSize Lines LinesNEmpty Chars CharsNWhite Words
## 1 blogsSub 2.6 Mb 8992 8992 2069944 1705415 376661
## 2 newsSub 0.2 Mb 772 772 159267 132961 27112
## 3 twitterSub 3.2 Mb 23601 23601 1619285 1339902 304540
## 4 sampleData 6 Mb 33365 33365 3867535 3193720 709304
Build the corpus.
corpus <- VCorpus(VectorSource(sampleData))
Check the size of the corpus in memory using the object_size function from the pryr package.
object_size(corpus)
## 77.8 MB
The VCorpus object is quite large (77.8 MB), even when the sample size is only 1%. This may be an issue due to memory constraints when it comes time to build the predictive model. But, we will start here and see where this approach leads us.
We next need to clean the corpus Data using functions from the tm package. Common text mining cleaning tasks include:
Convert everything to lower case
Remove punctuation marks, numbers, extra whitespace, and stopwords (common words like “and”, “or”, “is”, “in”, etc.)
Filtering out unwanted words
At this early stage, I am not sure if I want to remove the stopwords, even though they may have an adverse affect on the two and three N-Grams. I would like to see how the predictive model works first, before removing the stopwords.
I did notice the source data contains some profanity. I am also not sure if I want to filter these out yet as doing so could leave sentences in the data that make no sense. I want to see if the final application predicts a profane word to the users first. If so, then I will go back and remove these words before finalizing the application.
cleanCorpus <- tm_map(corpus, content_transformer(tolower))
cleanCorpus <- tm_map(cleanCorpus, removePunctuation)
cleanCorpus <- tm_map(cleanCorpus, removeNumbers)
cleanCorpus <- tm_map(cleanCorpus, stripWhitespace)
cleanCorpus <- tm_map(cleanCorpus, PlainTextDocument)
We next need to tokenize the clean Corpus (i.e., break the text up into words and short phrases) and construct a set of N-grams. We will start with the following three N-Grams:
Unigram - A matrix containing individual words
Bigram - A matrix containing two-word patterns
Trigram - A matrix containing three-word patterns
We could also contrust a Quadgram matrix based on four words, but at this point in the project we have decided to start with with the first three N-Grams and see how the predictive model works with these first.
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
uniMatrix <- TermDocumentMatrix(cleanCorpus, control = list(tokenize = uniTokenizer))
biMatrix <- TermDocumentMatrix(cleanCorpus, control = list(tokenize = biTokenizer))
triMatrix <- TermDocumentMatrix(cleanCorpus, control = list(tokenize = triTokenizer))
We now need to calculate the frequencies of the N-Grams and see what these look like.
uniCorpus <- findFreqTerms(uniMatrix, lowfreq = 20)
biCorpus <- findFreqTerms(biMatrix, lowfreq = 20)
triCorpus <- findFreqTerms(triMatrix, lowfreq = 20)
uniCorpusFreq <- rowSums(as.matrix(uniMatrix[uniCorpus,]))
uniCorpusFreq <- data.frame(word = names(uniCorpusFreq), frequency = uniCorpusFreq)
biCorpusFreq <- rowSums(as.matrix(biMatrix[biCorpus,]))
biCorpusFreq <- data.frame(word = names(biCorpusFreq), frequency = biCorpusFreq)
triCorpusFreq <- rowSums(as.matrix(triMatrix[triCorpus,]))
triCorpusFreq <- data.frame(word = names(triCorpusFreq), frequency = triCorpusFreq)
head(uniCorpusFreq)
## word frequency
## “the “the 71
## “we “we 27
## “you “you 27
## ability ability 46
## able able 212
## about about 2141
head(biCorpusFreq)
## word frequency
## – and – and 29
## – the – the 27
## “ i “ i 26
## a bad a bad 40
## a beautiful a beautiful 42
## a better a better 57
head(triCorpusFreq)
## word frequency
## a bit of a bit of 64
## a bunch of a bunch of 27
## a chance to a chance to 35
## a couple of a couple of 84
## a few days a few days 33
## a few of a few of 21
We need to set the order of each corpus frequency to descending as a preparation step for visualizing the data.
uniCorpusFreqDescend <- arrange(uniCorpusFreq, desc(frequency))
biCorpusFreqDescend <- arrange(biCorpusFreq, desc(frequency))
triCorpusFreqDescend <- arrange(triCorpusFreq, desc(frequency))
The final step will be to create visualizations of the data.
uniBar <- ggplot(data = uniCorpusFreqDescend[1:20,], aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "orange") +
xlab("Words") +
ylab("Frequency") +
ggtitle(paste("Top 20 Unigrams")) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
biBar <- ggplot(data = biCorpusFreqDescend[1:20,], aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "red") +
xlab("Words") +
ylab("Frequency") +
ggtitle(paste("Top 20 Bigrams")) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
triBar <- ggplot(data = triCorpusFreqDescend[1:20,], aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "springgreen") +
xlab("Words") +
ylab("Frequency") +
ggtitle(paste("Top 20 Trigrams")) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
uniBar
biBar
triBar
A word cloud is another interesting way to visualize the data. Word clouds are easy to understand as the words with the highest frequency stand out better. Word clouds are also visually engaging and work well for presentations.
uniCloud <- wordcloud(uniCorpusFreq$word, uniCorpusFreq$frequency, scale = c(2, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
biCloud <- wordcloud(biCorpusFreq$word, biCorpusFreq$frequency, scale = c(2, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
triCloud <- wordcloud(triCorpusFreq$word, triCorpusFreq$frequency, scale = c(2, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
One question I have is whether a 1% sample of the data is enough? I may find I need to increase the sample size, but doing so could affect the performance of the application.
The VCorpus object is also quite large (77.8 MB), even with a sample size of only 1%. This may create issues due to memory constraints when it comes time to build the predictive model.
We may need to try different sample sizes to get a balance between enough data, memory consumption and acceptable performance.
We also need to determine whether stopwords need to be removed and create a filter if profane words are suggested when a word or phrase is entered by the user.
1.- Build and test different prediction models and evaluate each based on their performance.
2.- Make and test any necessary modifications to resolve any issues encountered during modeling.
3.- Build, test and deploy a Shiny app with a simple user interface that has acceptable run time and reliably and accurately predicts the next word based on a word or phrase entered by the user.
4.- Decide whether to remove the stopwords and filter out profanity, if necessary.
CRAN Task View: Natural Language Processing: [link] (https://cran.r-project.org/web/views/NaturalLanguageProcessing.html)
Text Mining Infrastructure in R: [link] (https://www.jstatsoft.org/article/view/v025i05)
Text mining and word cloud fundamentals in r: 5 simple steps you should know: [link] (http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know)