This is the Milestone Report for the Johns Hopkins University Coursera Data Science Specialization Capstone project. The goal of this report is to explore the data and understand its statistical properties in order to build a prediction model for the final data product: the Shiny app. Through exploratory data analysis, the main characteristics of the training data are described and the approach for the predictive model is outlined. The model is trained on a corpus of unified documents compiled from three text data sources: blogs, news, and Twitter. The text data is provided in four different languages; only the English data is used in this project.
Prepare the work session: load the required packages, clean up the workspace, and download and read the English training data (blogs, news, and Twitter).
library(knitr)
rm(list = ls(all.names = TRUE))
#setwd("./a_Ciencia de Datos/Curso_10_proyecto_final_ciencia_datos/Projecto/milestoneReportNicolas")
trainURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
trainDataFile <- "data/Coursera-SwiftKey.zip"
if (!file.exists('data')) {
dir.create('data')
}
if (!file.exists("data/final/en_US")) {
tempFile <- tempfile()
download.file(trainURL, tempFile)
unzip(tempFile, exdir = "data")
unlink(tempFile)
}
# blogs
blogsFileName <- "data/final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# news
newsFileName <- "data/final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete final
## line found on 'data/final/en_US/en_US.news.txt'
close(con)
# twitter
twitterFileName <- "data/final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
An overview of the three text corpora is presented, including the file size, number of lines, number of characters, and number of words for each text file, along with the minimum, mean, and maximum number of words per line (WPL).
| File | File Size | Lines | Characters | Words | WPL Min | WPL Mean | WPL Max |
|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200 MB | 899288 | 206824505 | 37570839 | 0 | 42 | 6726 |
| en_US.news.txt | 196 MB | 77259 | 15639408 | 2651432 | 1 | 35 | 1123 |
| en_US.twitter.txt | 159 MB | 2360148 | 162096241 | 30451170 | 1 | 13 | 47 |
The source code for the above table is attached as A.1 Basic Data Summary in the Appendix section.
Note: In general, a relatively low number of words per line is observed across the three text corpora. Blogs tend to have the most words per line, followed by news and then Twitter. Because of the platform's character limit, tweets have by far the fewest words per line; for example, after Twitter raised the limit from 140 to 280 characters in 2017, only about 1% of tweets hit the 280-character limit and about 12% exceeded 140 characters. The text files themselves are quite large, so to speed up processing, a 1% sample will be drawn from each of the three data sets and then combined into a unified document corpus for further analysis.
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
Note: The histograms confirm that the number of words per line is low in all three files. This tendency toward short, concise messages is a useful property for the prediction task in this project.
The source code for the above plot is attached as A.2 Histogram of Words per Line in the Appendix section.
A 1% sample is drawn from each of the three data sets. Non-English characters are removed from the subsets, which are then combined into a single data set. This combined data set is written to disk and contains 33,365 lines and 697,575 words. A corpus is then created from the sampled data. The buildCorpus function is used to transform each document as follows:

1. Remove URLs, Twitter handles, and email patterns by converting them to spaces via a custom content transformer
2. Convert all words to lowercase
3. Remove common English stop words
4. Remove punctuation marks
5. Remove numbers
6. Strip extra whitespace
7. Remove profanity
8. Convert to plain text documents

The corpus is then written to disk in two formats: as a serialized R object (RDS) and as a text file. Finally, the first 10 lines of the corpus are displayed below.
| splash fresh lime juice |
| found hidden just behind screens |
| well cant take just piece specifically inspirational piece |
| los angeles clippers s grizzlies pm tntlac |
| can may ventured boldly |
| aside hit house |
| long starting first current fulltime social work position one clients experienced something eerily similar something identify connected decision enter field potential comparisons seemed endless fathers death equally unexpected age grade happened month wasnt just experience everything didnthappen afterwards eventually made propelled towards social work course felt immediately itthat moment one |
| today looked like raining day sad pumped spring already tired coats time put fur wool away year |
| sartorially uniqloj far cry another collection launched week burberrys winter storm w lean mean much magnetic hinges upon pieces complex texture craftsmanship plush quilted biker jackets thoroughly modern update classic quintessential trench shimmering oilslick patent black |
| pull bowls oven remove foil add cheese egg mixture stir combine place mixture well cover foil place back oven minutes remove foil bake minutes serve |
The source code for preparing the data is attached as A.3 Sample and Clean the Data and A.4 Build Corpus in the Appendix section.
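For reference, a minimal sketch of what these sampling and corpus-building steps could look like is shown below, assuming the tm package is used; the random seed, output file names, and profanity word list are placeholders rather than the exact values used to produce the output above.

```r
library(tm)

# --- 1% sample, non-English characters removed, sets combined (sketch) ---
set.seed(101)                             # arbitrary seed for reproducibility
sampleSize <- 0.01
sampleBlogs   <- sample(blogs,   round(length(blogs)   * sampleSize))
sampleNews    <- sample(news,    round(length(news)    * sampleSize))
sampleTwitter <- sample(twitter, round(length(twitter) * sampleSize))

# drop non-English (non-ASCII) characters and combine into one data set
sampleData <- iconv(c(sampleBlogs, sampleNews, sampleTwitter),
                    from = "latin1", to = "ASCII", sub = "")
writeLines(sampleData, "data/en_US.sample.txt")

# --- build and clean the corpus (sketch of buildCorpus) ---
buildCorpus <- function(dataSet, profanityWords = character(0)) {
  docs <- VCorpus(VectorSource(dataSet))
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

  docs <- tm_map(docs, toSpace, "(f|ht)tp\\S+")            # URLs
  docs <- tm_map(docs, toSpace, "@\\S+")                   # Twitter handles
  docs <- tm_map(docs, toSpace, "\\S+@\\S+")               # email addresses
  docs <- tm_map(docs, content_transformer(tolower))       # lowercase
  docs <- tm_map(docs, removeWords, stopwords("english"))  # stop words
  docs <- tm_map(docs, removePunctuation)                  # punctuation
  docs <- tm_map(docs, removeNumbers)                      # numbers
  docs <- tm_map(docs, stripWhitespace)                    # extra whitespace
  docs <- tm_map(docs, removeWords, profanityWords)        # profanity
  docs <- tm_map(docs, PlainTextDocument)                  # plain text documents
  docs
}

corpus <- buildCorpus(sampleData)  # pass a profanity word list in practice

# save the corpus in two formats: a serialized R object and a plain text file
saveRDS(corpus, "data/en_US.corpus.rds")
corpusText <- vapply(seq_along(corpus),
                     function(i) paste(as.character(corpus[[i]]), collapse = " "),
                     character(1))
writeLines(corpusText, "data/en_US.corpus.txt")
```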
Exploratory data analysis is performed to understand the training data: the most frequently used words are identified, the text is tokenized, and n-grams are generated.
The frequencies of unique words are shown in a bar chart and a word cloud.
The source code for the word frequency bar chart and the word cloud is attached as A.5 Word Frequencies in the Appendix section.
The predictive model will be based on unigrams, bigrams, and trigrams. The RWeka package is used to write tokenizer functions that split the sample data into unigrams, bigrams, and trigrams and to build the corresponding term-document matrices.
The source code for this section is attached as A.6 Tokenizing and N-Gram Generation in the Appendix section.
The objective of the project is to create a predictive algorithm implemented in a Shiny application. The Shiny app must take a phrase (one or more words) entered in a text box and predict the next word. The predictive algorithm will be built on an n-gram model with a word-frequency lookup, as performed in the exploratory data analysis of this report. The model first looks up the unigram corresponding to the entered text; once a complete term followed by a space is entered, it searches for the most common bigram that begins with that term, and so on up to trigrams. To predict the next word, the algorithm first consults the trigram model; if no matching trigram is found, it backs off to the bigram model, and failing that, to the unigram model. The final design will be based on the strategy that offers the best balance of efficiency and accuracy. A minimal illustration of this backoff idea is sketched below.
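The following is a hypothetical sketch only, not the final implementation. It assumes the n-gram frequency tables built in A.6 (unigramMatrixFreq, bigramMatrixFreq, trigramMatrixFreq), whose word column holds the full n-gram and whose rows are sorted by decreasing frequency.

```r
# Hypothetical sketch of the trigram -> bigram -> unigram backoff lookup.
predictNextWord <- function(phrase) {
    words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
    n <- length(words)

    # 1. try the trigram model: trigrams starting with the last two words
    if (n >= 2) {
        pattern <- paste0("^", words[n - 1], " ", words[n], " ")
        hits <- grep(pattern, trigramMatrixFreq$word, value = TRUE)
        if (length(hits) > 0) return(sub(pattern, "", hits[1]))
    }
    # 2. back off to the bigram model: bigrams starting with the last word
    if (n >= 1) {
        pattern <- paste0("^", words[n], " ")
        hits <- grep(pattern, bigramMatrixFreq$word, value = TRUE)
        if (length(hits) > 0) return(sub(pattern, "", hits[1]))
    }
    # 3. fall back to the most frequent unigram
    as.character(unigramMatrixFreq$word[1])
}
```

For example, `predictNextWord("thanks for the")` would return the most frequent word observed after "for the" in the sampled corpus, falling back to shorter contexts when no matching trigram exists.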
Basic summary of the three sets of texts.
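A minimal sketch of how this summary could be computed is shown below; it assumes the stringi package for the word counts and also builds the wpl list of words-per-line counts used by the histograms in A.2 (the exact original code may differ).

```r
library(stringi)

# file sizes in megabytes
fileNames  <- c(blogsFileName, newsFileName, twitterFileName)
fileSizeMB <- round(file.info(fileNames)$size / 1024^2)

# lines, characters, and words per line for each data set
dataSets <- list(blogs, news, twitter)
numLines <- sapply(dataSets, length)
numChars <- sapply(dataSets, function(x) sum(nchar(x)))
wpl      <- lapply(dataSets, stri_count_words)   # words per line, per file
numWords <- sapply(wpl, sum)

summaryTable <- data.frame(
    File       = basename(fileNames),
    FileSize   = paste(fileSizeMB, "MB"),
    Lines      = numLines,
    Characters = numChars,
    Words      = numWords,
    WPL.Min    = sapply(wpl, min),
    WPL.Mean   = round(sapply(wpl, mean)),
    WPL.Max    = sapply(wpl, max)
)
summaryTable
```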
library(ggplot2)
library(gridExtra)
plot1 <- qplot(wpl[[1]],
geom = "histogram",
main = "US Blogs",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 5)
plot2 <- qplot(wpl[[2]],
geom = "histogram",
main = "US News",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 5)
plot3 <- qplot(wpl[[3]],
geom = "histogram",
main = "US Twitter",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 1)
plotList = list(plot1, plot2, plot3)
do.call(grid.arrange, c(plotList, list(ncol = 1)))
# free memory
rm(plot1, plot2, plot3)
Histograms of words per line for the three text corpora.
library(wordcloud)
library(RColorBrewer)
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq)
# plot the top 10 most frequent words
g <- ggplot(wordFreq[1:10, ], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Word Frequencies")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Words")
print(g)
# construct word cloud
suppressWarnings (
wordcloud(words = wordFreq$word,
freq = wordFreq$freq,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(8, "Dark2"))
)
# free memory
rm(tdm, freq, wordFreq, g)
Tokenize Functions
library(RWeka)
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
Unigrams
# create term document matrix for the corpus
unigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
unigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(unigramMatrix, 0.99))), decreasing = TRUE)
unigramMatrixFreq <- data.frame(word = names(unigramMatrixFreq), freq = unigramMatrixFreq)
# generate plot
g <- ggplot(unigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("Most Common Unigrams")
print(g)
Bigrams
# create term document matrix for the corpus
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
bigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(bigramMatrix, 0.999))), decreasing = TRUE)
bigramMatrixFreq <- data.frame(word = names(bigramMatrixFreq), freq = bigramMatrixFreq)
# generate plot
g <- ggplot(bigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("Most Common Bigrams")
print(g)
Trigrams
# create term document matrix for the corpus
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
trigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(trigramMatrix, 0.9999))), decreasing = TRUE)
trigramMatrixFreq <- data.frame(word = names(trigramMatrixFreq), freq = trigramMatrixFreq)
# generate plot
g <- ggplot(trigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("Most Common Trigrams")
print(g)