To prepare the environment for this analysis, let us first set the working directory where the project is stored.
setwd("~/Desktop/Coursera/Capstone Project/")
Let us also load all the packages that we are going to use during this analysis:
library(RColorBrewer)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(wordcloud)
library(stringr)
library(tm)
library(tidytext)
Let us now download the data from the original location provided by Coursera. Since the file is quite large, we shall only download it if it is not already present in the working directory, and then unzip the archive.
if(!file.exists("./Coursera-SwiftKey.zip")){
    fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    download.file(fileUrl, destfile = "Coursera-SwiftKey.zip", method = "curl")
    unzip("Coursera-SwiftKey.zip")
    rm(fileUrl)
}
It is also important to check what we have actually downloaded and unpacked.
list.files()
## [1] "all.R" "badwords.txt" "BiGram.txt"
## [4] "Capstone.R" "Coursera-SwiftKey.zip" "en_US.corpus.txt"
## [7] "en_US.sample.txt" "final" "milestone_cache"
## [10] "milestone_files" "milestone.html" "milestone.md"
## [13] "milestone.Rmd" "MonoGram.txt" "Project.R"
## [16] "rsconnect" "tidy-sample.txt" "TriGram.txt"
list.files("./final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files("./final/en_US/")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
We can now load the files into R:
# Reading files
f1 <- file("./final/en_US/en_US.blogs.txt", open = "r")
blogs <- readLines(f1, encoding = "UTF-8", skipNul = TRUE)
close(f1)
f2 <- file("./final/en_US/en_US.news.txt", open = "r")
news <- readLines(f2, encoding = "UTF-8", skipNul = TRUE)
close(f2)
f3 <- file("./final/en_US/en_US.twitter.txt", open = "r")
twitter <- readLines(f3, encoding = "UTF-8", skipNul = TRUE)
close(f3)
rm(f1, f2, f3)
Before building the unified document corpus and cleaning the data, it is convenient to produce a brief summary of the three data sets we have loaded. In particular, since they consist of lines of text, we shall be interested in the file size, the number of lines, the number of characters, the number of words and the number of words per line.
# Size of files (in MB)
fileNames <- c("./final/en_US/en_US.blogs.txt",
               "./final/en_US/en_US.news.txt",
               "./final/en_US/en_US.twitter.txt")
size <- round(file.size(fileNames) / 1024^2, 0)
# Number of lines
lines <- sapply(list(blogs, news, twitter), length)
# Number of characters
characters <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
# Number of words
words <- sapply(list(blogs, news, twitter), function(x) sum(str_count(x, "\\w+")))
# Number of words per line
wpl <- lapply(list(blogs, news, twitter), function(x) str_count(x, "\\w+"))
names(wpl) <- c("Blogs", "News", "Twitter")
wplSummary <- t(sapply(wpl, function(x) round(c(min(x), mean(x), max(x)), 0)))
colnames(wplSummary) <- c('WPL.Min', 'WPL.Mean', 'WPL.Max')
summarised.data <- data.frame(
    File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
    FileSize = size,
    Lines = lines,
    Characters = characters,
    Words = words,
    wplSummary)
The result is summarised in the following table:
| Source | File | FileSize (MB) | Lines | Characters | Words | WPL.Min | WPL.Mean | WPL.Max |
|---|---|---|---|---|---|---|---|---|
| Blogs | en_US.blogs.txt | 200 | 899288 | 206824505 | 38309620 | 1 | 43 | 6851 |
| News | en_US.news.txt | 196 | 1010242 | 203223159 | 35624454 | 1 | 35 | 1928 |
| Twitter | en_US.twitter.txt | 159 | 2360148 | 162096241 | 31003544 | 1 | 13 | 47 |
This preliminary analysis shows that each text has a relatively small number of words per line, but the three files contain many lines and are large overall (from 159 MB to 200 MB). In order to improve the processing time, we shall therefore take a random sample of 1% of each data set.
It is also interesting to notice that the maximum number of words per line in the Twitter data set, 47, is much smaller than the corresponding maxima in the other two data sets, 1928 and 6851, which is expected since Twitter limits the number of characters per tweet.
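As a quick sanity check (a sketch only, assuming the blogs, news and twitter vectors are still in memory), we can also compare the maximum line length in characters across the three sources; the Twitter value should be bounded by the platform's character limit.
# Maximum number of characters in a single line, per source
sapply(list(Blogs = blogs, News = news, Twitter = twitter),
       function(x) max(nchar(x)))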
Let us also plot a histogram of the number of words per line in the three data sets.
wpl.df <- stack(wpl)
names(wpl.df) <- c("WPL", "Source")
ggplot(data = wpl.df, aes(x = WPL,y = after_stat(density), colour = Source)) +
geom_histogram(binwidth = 5, fill = "white") +
coord_cartesian(xlim = c(0, 150)) +
facet_grid(Source ~ .) +
xlab("Words per line") +
ylab("Frequency Density")
The relatively small number of words per line is also visible from the three histograms, which are restricted to the interval between 0 and 150 words per line, as the rest of the tail is negligible. This observation seems to support a general trend towards short and concise communication, which may turn out to be useful later on.
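To check that the tail beyond 150 words per line is indeed negligible, we can look at a high quantile of each distribution (a quick sketch, assuming the list wpl is still in memory):
# 99th percentile of words per line for each source; if these values fall well
# below 150, the truncation of the x-axis discards only a negligible tail
sapply(wpl, quantile, probs = 0.99)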
Before performing the exploratory data analysis, we shall sample the three data sets, keeping only 1% of the data. This will improve the computational performance, and the random sampling should not affect the analysis, because the uncertainty introduced by the sampling can be quantified with standard statistical methods. We shall also remove all non-English characters from the subsets before combining them into a single data set.
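As an aside, here is a minimal sketch of how this sampling uncertainty could be quantified: for a word observed a given number of times in the sample, a binomial confidence interval for its true relative frequency can be computed directly (the counts below are purely hypothetical).
# Purely hypothetical example: a word observed 300 times among 40000 sampled words.
# binom.test() returns a 95% confidence interval for its true relative frequency.
binom.test(x = 300, n = 40000)$conf.int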
Afterwards, we shall create a corpus from the sampled data set. To clean it, we shall focus on the following points: removing URLs, email addresses, Twitter handles and hashtags; converting the text to lower case; and removing punctuation, numbers, English stop words, extra white spaces and profanities.
Let us first set the seed, in order to make the sampling fully reproducible:
set.seed(1010011)
With the function sample, let us now take a 1% sample of each of the three data sets:
sampleSize <- 0.01
sampleBlogs <- sample(blogs, length(blogs) * sampleSize, replace = FALSE)
sampleNews <- sample(news, length(news) * sampleSize, replace = FALSE)
sampleTwitter <- sample(twitter, length(twitter) * sampleSize, replace = FALSE)
We shall also combine these three samples into a single list, in order to simplify the code.
sampledList <- list(sampleBlogs, sampleNews, sampleTwitter)
names(sampledList) <- c("Blogs", "News", "Twitter")
Let us now remove all non-English characters from the sampled data. This is done by converting the data from “latin1” to “ASCII” and dropping any characters that cannot be converted.
sampledList <- lapply(sampledList, iconv, "latin1", "ASCII", sub = "")
Let us now combine the three sampled data sets into a single data set.
sampleData <- c(sampledList$Blogs, sampledList$News, sampledList$Twitter)
rm(sampledList)
Let us finally store the total number of lines and the total number of words in the sampled data in the variables sampleDataLines and sampleDataWords, respectively.
sampleDataLines <- length(sampleData)
sampleDataWords <- sum(str_count(sampleData, "\\w+"))
Let us also save a copy of this sampled data set to the hard drive:
con <- file("./en_US.sample.txt", open = "w")
writeLines(sampleData, con)
close(con); rm(con)
Finally, let us free memory by removing all objects from the workspace:
rm(list = ls())
Before we can proceed to create the corpus and tidy up the data, let us reload the sampled data from disk:
f <- file("./en_US.sample.txt", open = "r")
sampleData <- readLines(f)
close(f); rm(f)
Let us first find a list of bad words and load it into R. To this end, a freely downloadable list of bad words is hosted on the Google Code archive. This list needs some cleaning before it can be used.
# Downloading the bad words list from the Google Code archive
if(!file.exists("./badwords.txt")){
    fileUrl <- "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt"
    download.file(fileUrl, destfile = "badwords.txt", method = "curl")
    rm(fileUrl)
}
# Reading bad words file
f <- file("./badwords.txt", open = "r")
profanities <- readLines(f)
close(f); rm(f)
# Cleaning up the list: dropping entries that start or end with an asterisk
profanities <- profanities[-grep("(^[*])|([*]$)", profanities)]
We shall create the corpus with the aid of the package
tm.
myCorpus <- VCorpus(VectorSource(sampleData))
In order to tidy up the corpus, we first need to create a function that removes the characters and strings we wish to exclude from the analysis, such as URLs, email addresses, Twitter handles and hashtags, but also punctuation, numbers, English stop words and profanities. The function cleanup accomplishes this job.
cleanup <- function(corpus, profanity = NULL){
    subs <- content_transformer(function(x, pattern) gsub(pattern, "", x))
    # Removing URLs, email addresses, Twitter handles and hashtags
    corpus <- tm_map(corpus, subs, "(f|ht)tp(s?)://(.*)[.][a-z]+")
    corpus <- tm_map(corpus, subs, "[[:alnum:]._-]+@[[:alnum:].-]+")
    corpus <- tm_map(corpus, subs, "@[^\\s]+")
    corpus <- tm_map(corpus, subs, "#(.*)")
    # Removing stray single letters
    corpus <- tm_map(corpus, subs, "\\b[A-Za-z]\\b")
    # Transforming characters to lower case, removing punctuation, numbers,
    # English stop words and extra white spaces
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)
    # Removing profanities, if a list is supplied
    if(!is.null(profanity)) corpus <- tm_map(corpus, removeWords, profanity)
    return(corpus)
}
Let us now complete the work by cleaning up the corpus and transforming it into a tidy data frame.
# Making a clean corpus
myCorpus <- cleanup(myCorpus, profanities)
# Transforming the corpus into a vector and a tidy data frame
cleanSample <- as.vector(tidy(myCorpus)$text)
tidyData <- data.frame(line = 1 : length(cleanSample), text = cleanSample)
This has already consumed a lot of memory, so it is wise to save the tidy data frame to an external file and clean up the workspace.
f <- file("./tidy-sample.txt", open = "w")
saveRDS(tidyData, f, ascii = TRUE)
close(f); rm(f)
# Cleaning up messy memory
rm(list = ls())
Once we have a tidy data frame, we can now begin the proper investigation. To this end, let us start by reading the data frame we have just saved:
f <- file("./tidy-sample.txt", open = "r")
tidyData <- readRDS(f)
close(f); rm(f)
We now wish to create three further data frames containing the mono-grams, bi-grams and tri-grams found in the corpus that we have sampled and cleaned. We shall do this with the aid of the package tidytext, and in particular of the function unnest_tokens.
monoGrams <- tidyData %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE)
biGrams <- tidyData %>%
    unnest_tokens(word, text, token = "ngrams", n = 2) %>%
    count(word, sort = TRUE) %>%
    filter(!is.na(word))
triGrams <- tidyData %>%
    unnest_tokens(word, text, token = "ngrams", n = 3) %>%
    count(word, sort = TRUE) %>%
    filter(!is.na(word))
Let us save these three data frames into three separate files, which shall later be used by the prediction model.
# Saving mono-grams, bi-grams and tri-grams
f1 <- file("./MonoGram.txt", open = "w")
saveRDS(monoGrams, f1, ascii = TRUE)
close(f1)
f2 <- file("./BiGram.txt", open = "w")
saveRDS(biGrams, f2, ascii = TRUE)
close(f2)
f3 <- file("./TriGram.txt", open = "w")
saveRDS(triGrams, f3, ascii = TRUE)
close(f3)
rm(f1, f2, f3)
Now let us complete the exploration by plotting a graph of the most common mono-, bi- and tri-grams.
ggplot(data = monoGrams[1:20,], aes(x = reorder(word, -n), y = n)) +
    geom_bar(stat = "identity", aes(colour = word), fill = "white") +
    geom_text(aes(label = n), vjust = -0.5, size = 3) +
    xlab("") + ylab("Frequency of Words") +
    ggtitle("20 Most Common Words") +
    theme(legend.position = "none",
          axis.text.x = element_text(hjust = 1.0, angle = 45),
          plot.title = element_text(hjust = 0.5, face = "bold"))
suppressWarnings(with(monoGrams, wordcloud(words = word, freq = n,
min.freq = 1, max.words = 100, random.order = FALSE,
rot.per = 0.35, colors = brewer.pal(9, "Purples"))))
ggplot(data = biGrams[1:20,], aes(x = reorder(word, -n), y = n)) +
    geom_bar(stat = "identity", aes(colour = word), fill = "white") +
    geom_text(aes(label = n), vjust = -0.5, size = 3) +
    xlab("") + ylab("Frequency of Bigrams") +
    ggtitle("20 Most Common Bigrams") +
    theme(legend.position = "none",
          axis.text.x = element_text(hjust = 1.0, angle = 45),
          plot.title = element_text(hjust = 0.5, face = "bold"))
suppressWarnings(with(biGrams, wordcloud(words = word, freq = n,
min.freq = 1, max.words = 100, random.order = FALSE,
rot.per = 0.35, colors = brewer.pal(9, "Greens"))))
ggplot(data = triGrams[1:20,], aes(x = reorder(word, -n), y = n)) +
    geom_bar(stat = "identity", aes(colour = word), fill = "white") +
    geom_text(aes(label = n), vjust = -0.5, size = 3) +
    xlab("") + ylab("Frequency of Trigrams") +
    ggtitle("20 Most Common Trigrams") +
    theme(legend.position = "none",
          axis.text.x = element_text(hjust = 1.0, angle = 45),
          plot.title = element_text(hjust = 0.5, face = "bold"))
suppressWarnings(with(triGrams, wordcloud(words = word, freq = n,
min.freq = 1, max.words = 100, random.order = FALSE,
rot.per = 0.35, colors = brewer.pal(9, "Reds"))))
The next step is to create a Shiny app that performs word prediction. More precisely, we shall construct a Shiny app that takes a phrase as input in a text box and outputs a prediction of the word most likely to come next.
The predictive algorithm that we shall develop uses n-gram models with a frequency lookup similar to the one performed in the exploratory data analysis above. The strategy will be built on the knowledge gathered during that analysis; for example, as n increases, the frequency of each individual n-gram decreases. One possible strategy is therefore to first look up the partially entered text in the mono-gram table; once a full term has been entered, followed by a space, to look up the most frequent bi-gram starting with that term, and so on for higher-order n-grams.
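As an illustration only, the following is a minimal sketch of how the saved n-gram tables could be queried to suggest the next word. The function predictNextWord, its simple back-off from tri-grams to bi-grams to the most frequent single words, and the example phrase are assumptions made for the purpose of this sketch, not the final implementation.
# Minimal sketch of a next-word lookup based on the saved n-gram tables.
# predictNextWord() and its back-off logic are illustrative assumptions.
library(dplyr)
library(stringr)
monoGrams <- readRDS("./MonoGram.txt")
biGrams <- readRDS("./BiGram.txt")
triGrams <- readRDS("./TriGram.txt")
predictNextWord <- function(phrase, n = 3){
    # Assumes the input has been cleaned in the same way as the corpus
    tokens <- str_split(str_to_lower(str_squish(phrase)), " ")[[1]]
    len <- length(tokens)
    # Try the tri-gram table first, matching the last two words typed
    # (the tables are already sorted by frequency, so head() keeps the top matches)
    if(len >= 2){
        hits <- filter(triGrams, str_detect(word, paste0("^", tokens[len - 1], " ", tokens[len], " ")))
        if(nrow(hits) > 0) return(word(head(hits$word, n), 3))
    }
    # Back off to the bi-gram table, matching the last word typed
    if(len >= 1){
        hits <- filter(biGrams, str_detect(word, paste0("^", tokens[len], " ")))
        if(nrow(hits) > 0) return(word(head(hits$word, n), 2))
    }
    # Fall back to the most frequent single words
    head(monoGrams$word, n)
}
predictNextWord("thanks for the")
In the Shiny app, a lookup of this kind would simply be wrapped around the text-box input, returning the suggested words as the output.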