Introduction

This is the Milestone Report for the Johns Hopkins University Data Science Specialization Capstone project on Coursera. The goal of this report is to explore the data and understand its statistical properties in order to build a prediction model for the final data product: a Shiny app. Through exploratory data analysis, the main characteristics of the training data are described, and the predictive model is then created. The model is trained on a corpus of unified documents compiled from three text data sources: blogs, news, and Twitter. The provided text data is available in four languages; only the English data is used in this project.

Environment Setup

Prepare the work session: load the required packages and clear the workspace.

library(knitr)
rm(list = ls(all.names = TRUE))

Load the Data

trainURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
trainDataFile <- "data/Coursera-SwiftKey.zip"

if (!file.exists('data')) {
    dir.create('data')
}

if (!file.exists("data/final/en_US")) {
    tempFile <- tempfile()
    download.file(trainURL, tempFile)
    unzip(tempFile, exdir = "data")
    unlink(tempFile)
}

# blogs
blogsFileName <- "data/final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# news
newsFileName <- "data/final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete final
## line found on 'data/final/en_US/en_US.news.txt'
close(con)

# twitter
twitterFileName <- "data/final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

rm(con)

Summary of Initial Data Analysis

An overview of the three text corpora is presented, including file size, number of lines, number of characters, and number of words for each text file, along with the minimum, mean, and maximum number of words per line.

File               FileSize  Lines     Characters   Words      WPL.Min  WPL.Mean  WPL.Max
en_US.blogs.txt    200 MB    899288    206824505    37570839   0        42        6726
en_US.news.txt     196 MB    77259     15639408     2651432    1        35        1123
en_US.twitter.txt  159 MB    2360148   162096241    30451170   1        13        47

The source code for the above table is attached as A.1 Basic Data Summary in the Appendix section.

Note: A relatively low number of words per line is observed across the three corpora. Blogs tend to have the most words per line, followed by news and then Twitter. Because of the character limit, tweets have far fewer words per line; for example, when Twitter raised the limit from 140 to 280 characters in 2017, only about 1% of tweets hit the 280-character limit and about 12% exceeded 140 characters. The text files are quite large, so to keep processing manageable a 1% sample will be drawn from each of the three data sets and combined into a unified document corpus for further analysis.

Histograms of Words per Line

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

Note: The histograms show that most lines in all three files contain relatively few words. This tendency toward short, concise messages suits the next-word prediction task of this project.

The source code for the above plot is attached as A.2 Histogram of Words per Line in the Appendix section.

Data Sampling and Cleaning

The data is sampled at 1% from each of the three data sets. Non-English characters are removed from the subsets, which are then combined into a single data set. The combined sample, containing 33,365 lines and 697,575 words, is written to disk. A corpus is then created from the sampled data. The buildCorpus function applies the following transformations to each document:

1. Remove URLs, Twitter handles, and email patterns by converting them to spaces via a custom content transformer
2. Convert all words to lowercase
3. Remove common English stop words
4. Remove punctuation marks
5. Remove numbers
6. Strip extra whitespace
7. Remove profanity
8. Convert to plain text documents

The corpus is then written to disk in two formats: a serialized R object (RDS) and a plain text file. Finally, the first 10 documents of the corpus are displayed.

First 10 Documents
splash fresh lime juice
found hidden just behind screens
well cant take just piece specifically inspirational piece
los angeles clippers s grizzlies pm tntlac
can may ventured boldly
aside hit house
long starting first current fulltime social work position one clients experienced something eerily similar something identify connected decision enter field potential comparisons seemed endless fathers death equally unexpected age grade happened month wasnt just experience everything didnthappen afterwards eventually made propelled towards social work course felt immediately itthat moment one
today looked like raining day sad pumped spring already tired coats time put fur wool away year
sartorially uniqloj far cry another collection launched week burberrys winter storm w lean mean much magnetic hinges upon pieces complex texture craftsmanship plush quilted biker jackets thoroughly modern update classic quintessential trench shimmering oilslick patent black
pull bowls oven remove foil add cheese egg mixture stir combine place mixture well cover foil place back oven minutes remove foil bake minutes serve

The source code for preparing the data is attached as A.3 Sample and Clean the Data and A.4 Build Corpus in the Appendix section.

Exploratory Data Analysis

An exploratory data analysis is performed employing techniques in order to obtain an understanding of the training data such as observing the most used words, tokenization and generation of n-grams.

Word Frequencies

Through a bar chart and word cloud, the frequencies of unique words are represented.

The source code for the word frequency bar chart and for constructing the word cloud is attached as A.5 Word Frequencies in the Appendix section.

Tokenization and N-Gram Creation

The predictive model will be based on unigrams, bigrams, and trigrams. The RWeka package is used to write functions that tokenize the sample data and build matrices of unigrams, bigrams, and trigrams.

Unigrams

Bigrams

Trigrams

The source code for this section is attached as A.6 Tokenizing and N-Gram Generation in the Appendix section.

Next Steps

The objective of the project is to create a predictive algorithm deployed as a Shiny application. The app will take a phrase (one or more words) entered in a text box and predict the next word. The algorithm will be based on an n-gram model with word-frequency lookups, as in the exploratory analysis of this report. Prediction will use a back-off strategy: the last words of the input are matched against the trigram model first; if no matching trigram is found, the algorithm backs off to the bigram model, and finally to the unigram model. The final implementation will use whichever strategy gives the best balance of efficiency and accuracy. A minimal sketch of the back-off idea follows.
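
The sketch below illustrates the back-off idea only; it is not the final implementation. The function name predictNextWord is hypothetical, and the sketch assumes the sorted n-gram frequency tables built in Appendix A.6 (unigramMatrixFreq, bigramMatrixFreq, trigramMatrixFreq) are available in memory.

library(stringr)

# hypothetical helper: return a likely next word for a phrase, backing off
# from the trigram model to the bigram model to the unigram model
predictNextWord <- function(phrase,
                            uni = unigramMatrixFreq,
                            bi  = bigramMatrixFreq,
                            tri = trigramMatrixFreq) {
    tokens <- str_split(str_squish(tolower(phrase)), " ")[[1]]
    n <- length(tokens)

    # trigram model: match the last two words, take the most frequent continuation
    if (n >= 2) {
        prefix <- paste(tokens[n - 1], tokens[n])
        hits <- as.character(tri$word[startsWith(as.character(tri$word), paste0(prefix, " "))])
        if (length(hits) > 0) return(word(hits[1], 3))
    }

    # back off to the bigram model: match the last word only
    hits <- as.character(bi$word[startsWith(as.character(bi$word), paste0(tokens[n], " "))])
    if (length(hits) > 0) return(word(hits[1], 2))

    # back off to the single most frequent unigram
    as.character(uni$word[1])
}

predictNextWord("thanks for the")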

Appendix

A.1 Basic Data Summary

Basic summary of the three sets of texts.
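
A minimal sketch of the code behind the summary table, assuming the blogs, news, and twitter vectors loaded above. Object names other than wpl (which the histogram code in A.2 expects) are illustrative.

library(stringi)

fileNames <- c(blogsFileName, newsFileName, twitterFileName)
dataSets  <- list(blogs, news, twitter)

# file sizes in megabytes
fileSizes <- round(file.info(fileNames)$size / 1024^2)

# words per line for each data set (also used for the histograms in A.2)
wpl <- lapply(dataSets, stri_count_words)

summaryTable <- data.frame(
    File       = basename(fileNames),
    FileSize   = paste(fileSizes, "MB"),
    Lines      = sapply(dataSets, length),
    Characters = sapply(dataSets, function(x) sum(nchar(x))),
    Words      = sapply(wpl, sum),
    WPL.Min    = sapply(wpl, min),
    WPL.Mean   = round(sapply(wpl, mean)),
    WPL.Max    = sapply(wpl, max)
)

kable(summaryTable)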

A.2 Histogram of Words per Line

Histogram of words per line for the three text corpora.

library(ggplot2)
library(gridExtra)

plot1 <- qplot(wpl[[1]],
               geom = "histogram",
               main = "US Blogs",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5)

plot2 <- qplot(wpl[[2]],
               geom = "histogram",
               main = "US News",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5)

plot3 <- qplot(wpl[[3]],
               geom = "histogram",
               main = "US Twitter",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 1)

plotList = list(plot1, plot2, plot3)
do.call(grid.arrange, c(plotList, list(ncol = 1)))

# free memory
rm(plot1, plot2, plot3)

A.3 Sample and Clean the Data
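
A minimal sketch of the sampling step described in the body: draw a 1% sample from each data set, remove non-English characters, combine, and write the result to disk. It assumes the blogs, news, and twitter vectors loaded earlier; object names and the output file path are illustrative.

library(stringi)

# sample 1% of each data set (arbitrary seed, for reproducibility of the sketch)
set.seed(1234)
sampleSize <- 0.01
sampleBlogs   <- sample(blogs,   round(length(blogs)   * sampleSize))
sampleNews    <- sample(news,    round(length(news)    * sampleSize))
sampleTwitter <- sample(twitter, round(length(twitter) * sampleSize))

# remove non-English (non-ASCII) characters from the sampled lines
sampleBlogs   <- iconv(sampleBlogs,   "UTF-8", "ASCII", sub = "")
sampleNews    <- iconv(sampleNews,    "UTF-8", "ASCII", sub = "")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub = "")

# combine into a single data set and write it to disk
sampleData <- c(sampleBlogs, sampleNews, sampleTwitter)
writeLines(sampleData, "data/final/en_US/en_US.sample.txt")

# number of lines and words in the combined sample
length(sampleData)
sum(stri_count_words(sampleData))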

A.4 Build Corpus
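
A minimal sketch of the buildCorpus function using the tm package, following the transformation steps listed in the Data Sampling and Cleaning section. The regular expressions, the placeholder profanity list, and the output file names are illustrative; sampleData is the combined sample from A.3.

library(tm)

buildCorpus <- function(dataSet) {
    docs <- VCorpus(VectorSource(dataSet))

    # 1. replace URLs, Twitter handles, and email patterns with spaces
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    docs <- tm_map(docs, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
    docs <- tm_map(docs, toSpace, "@[^\\s]+")
    docs <- tm_map(docs, toSpace, "[[:alnum:].-]+@[[:alnum:].-]+")

    # 2-6. lowercase, stop words, punctuation, numbers, whitespace
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removeWords, stopwords("english"))
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, stripWhitespace)

    # 7. remove profanity (tiny placeholder list; the report uses a full word list)
    profanity <- c("profanityword1", "profanityword2")
    docs <- tm_map(docs, removeWords, profanity)

    # 8. the documents remain plain text documents throughout
    docs
}

corpus <- buildCorpus(sampleData)

# write the corpus to disk as a serialized R object and as plain text
saveRDS(corpus, file = "data/final/en_US/en_US.corpus.rds")
corpusText <- sapply(seq_len(length(corpus)), function(i) content(corpus[[i]]))
writeLines(corpusText, "data/final/en_US/en_US.corpus.txt")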

A.5 Word Frequencies

library(wordcloud)
library(RColorBrewer)

tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq)

# plot the top 10 most frequent words
g <- ggplot(wordFreq[1:10, ], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Word Frequencies")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Words")
print(g)

# construct word cloud
suppressWarnings (
    wordcloud(words = wordFreq$word,
              freq = wordFreq$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35, 
              colors=brewer.pal(8, "Dark2"))
)

# free memory
rm(tdm, freq, wordFreq, g)

A.6 Tokenizing and N-Gram Generation

Tokenize Functions

library(RWeka)

unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

Unigrams

# create term document matrix for the corpus
unigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
unigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(unigramMatrix, 0.99))), decreasing = TRUE)
unigramMatrixFreq <- data.frame(word = names(unigramMatrixFreq), freq = unigramMatrixFreq)

# generate plot
g <- ggplot(unigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("Most Common Unigrams")
print(g)

Bigrams

# create term document matrix for the corpus
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
bigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(bigramMatrix, 0.999))), decreasing = TRUE)
bigramMatrixFreq <- data.frame(word = names(bigramMatrixFreq), freq = bigramMatrixFreq)

# generate plot
g <- ggplot(bigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("Most Common Bigrams")
print(g)

Trigrams

# create term document matrix for the corpus
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
trigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(trigramMatrix, 0.9999))), decreasing = TRUE)
trigramMatrixFreq <- data.frame(word = names(trigramMatrixFreq), freq = trigramMatrixFreq)

# generate plot
g <- ggplot(trigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("Most Common Trigrams")
print(g)