To prepare the environment for this analysis, let us first set the working directory where the project is stored.
setwd("~/Desktop/Coursera/Capstone Project/")
Let us also load all the packages that we are going to use during this analysis:
library(RColorBrewer)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(wordcloud)
library(stringr)
library(tm)
library(tidytext)
Let us now download the data from the original location provided by Coursera. Since the file is quite large, we shall only download it if it is not already present in the working directory, and then unzip the archive.
if(!file.exists("./Coursera-SwiftKey.zip")){
    fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    download.file(fileUrl, destfile = "Coursera-SwiftKey.zip", method = "curl")
    unzip("Coursera-SwiftKey.zip")
    rm(fileUrl)
}
It is also important to check what we have actually downloaded and unpacked.
list.files()
## [1] "all.R" "badwords.txt" "BiGram.txt"
## [4] "Capstone.R" "Coursera-SwiftKey.zip" "en_US.corpus.txt"
## [7] "en_US.sample.txt" "final" "milestone_cache"
## [10] "milestone_files" "milestone.html" "milestone.md"
## [13] "milestone.Rmd" "MonoGram.txt" "Project.R"
## [16] "rsconnect" "tidy-sample.txt" "TriGram.txt"
list.files("./final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files("./final/en_US/")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
We can now load the files into R:
# Reading files
f1 <- file("./final/en_US/en_US.blogs.txt", open = "r")
blogs <- readLines(f1, encoding = "UTF-8", skipNul = TRUE)
close(f1)
f2 <- file("./final/en_US/en_US.news.txt", open = "r")
news <- readLines(f2, encoding = "UTF-8", skipNul = TRUE)
close(f2)
f3 <- file("./final/en_US/en_US.twitter.txt", open = "r")
twitter <- readLines(f3, encoding = "UTF-8", skipNul = TRUE)
close(f3)
rm(f1, f2, f3)
Before building the unified document corpus and cleaning the data, it is convenient to produce a brief summary of the three data sets we have loaded. In particular, since they consist of lines of text, we shall be interested in the file size, the number of lines, the number of characters, the number of words and the number of words per line.
# Size of files (in MB)
fileNames <- c("./final/en_US/en_US.blogs.txt",
               "./final/en_US/en_US.news.txt",
               "./final/en_US/en_US.twitter.txt")
size <- round(file.size(fileNames) / 1024^2, 0)
# Number of lines
lines <- sapply(list(blogs, news, twitter), length)
# Number of characters
characters <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
# Number of words
words <- sapply(list(blogs, news, twitter), function(x) sum(str_count(x, "\\w+")))
# Number of words per line
wpl <- lapply(list(blogs, news, twitter), function(x) str_count(x, "\\w+"))
names(wpl) <- c("Blogs", "News", "Twitter")
wplSummary <- t(sapply(wpl, function(x) round(c(min(x), mean(x), max(x)), 0)))
colnames(wplSummary) <- c('WPL.Min', 'WPL.Mean', 'WPL.Max')
summarised.data <- data.frame(
    File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
    FileSize = size,
    Lines = lines,
    Characters = characters,
    Words = words,
    wplSummary)
The result is summarised in the following table:
| Source | File | FileSize (MB) | Lines | Characters | Words | WPL.Min | WPL.Mean | WPL.Max |
|---|---|---|---|---|---|---|---|---|
| Blogs | en_US.blogs.txt | 200 | 899288 | 206824505 | 38309620 | 1 | 43 | 6851 |
| News | en_US.news.txt | 196 | 1010242 | 203223159 | 35624454 | 1 | 35 | 1928 |
| Twitter | en_US.twitter.txt | 159 | 2360148 | 162096241 | 31003544 | 1 | 13 | 47 |
This preliminary analysis shows that each text has a relatively small number of words per line, but the three files contain many lines and are large overall (from 159 MB to 200 MB). In order to improve the processing time, we shall therefore take a random sample of 1% of each data set.
It is also interesting to notice that the maximum number of words per line in the Twitter data set, 47, is much smaller than the corresponding maxima in the other two data sets, 1928 and 6851, which is expected since Twitter limits the number of characters per tweet.
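As a quick sanity check (a sketch only, assuming the blogs, news and twitter vectors are still in memory), we can also compare the maximum line length in characters across the three sources; the Twitter value should be bounded by the platform's character limit.
# Maximum number of characters in a single line, per source
sapply(list(Blogs = blogs, News = news, Twitter = twitter),
       function(x) max(nchar(x)))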
Let us also plot a histogram of the number of words per line in the three data sets.
wpl.df <- stack(wpl)
names(wpl.df) <- c("WPL", "Source")
ggplot(data = wpl.df, aes(x = WPL,y = after_stat(density), colour = Source)) +
geom_histogram(binwidth = 5, fill = "white") +
coord_cartesian(xlim = c(0, 150)) +
facet_grid(Source ~ .) +
xlab("Words per line") +
ylab("Frequency Density")
The relatively small number of words per line is also visible from the three histograms, which are restricted to the interval between 0 and 150 words per line, as the rest of the tail is negligible. This observation seems to support a general trend towards short and concise communication, which may turn out to be useful later on.
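To check that the tail beyond 150 words per line is indeed negligible, we can look at a high quantile of each distribution (a quick sketch, assuming the list wpl is still in memory):
# 99th percentile of words per line for each source; if these values fall well
# below 150, the truncation of the x-axis discards only a negligible tail
sapply(wpl, quantile, probs = 0.99)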
Before performing the exploratory data analysis, we shall sample the three data sets, keeping only 1% of the data. This will improve the computational performance, and the random sampling should not affect the analysis, because the uncertainty introduced by the sampling can be quantified with standard statistical methods. We shall also remove all non-English characters from the subsets before combining them into a single data set.
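As an aside, here is a minimal sketch of how this sampling uncertainty could be quantified: for a word observed a given number of times in the sample, a binomial confidence interval for its true relative frequency can be computed directly (the counts below are purely hypothetical).
# Purely hypothetical example: a word observed 300 times among 40000 sampled words.
# binom.test() returns a 95% confidence interval for its true relative frequency.
binom.test(x = 300, n = 40000)$conf.int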
Afterwards, we shall create a corpus from the sampled data set. To clean it, we shall focus on the following points: removing URLs, email addresses, Twitter handles and hashtags; converting the text to lower case; and removing punctuation, numbers, English stop words, extra white spaces and profanities.
Let us first set the seed, in order to make the sampling fully reproducible:
set.seed(1010011)
With the function sample, let us now take a 1% sample of each of the three data sets:
sampleSize <- 0.01
sampleBlogs <- sample(blogs, length(blogs) * sampleSize, replace = FALSE)
sampleNews <- sample(news, length(news) * sampleSize, replace = FALSE)
sampleTwitter <- sample(twitter, length(twitter) * sampleSize, replace = FALSE)
We shall also combine these three samples into a single list, in order to simplify the code.
sampledList <- list(sampleBlogs, sampleNews, sampleTwitter)
names(sampledList) <- c("Blogs", "News", "Twitter")
Let us now remove all non-English characters from the sampled data. This is done by converting the data from “latin1” to “ASCII” and dropping any characters that cannot be converted.
sampledList <- lapply(sampledList, iconv, "latin1", "ASCII", sub = "")
Let us now combine the three sampled data sets into a single data set.
sampleData <- c(sampledList$Blogs, sampledList$News, sampledList$Twitter)
rm(sampledList)
Let us finally store the total number of lines and the total number of words in the sampled data in the variables sampleDataLines and sampleDataWords, respectively.
sampleDataLines <- length(sampleData)
sampleDataWords <- sum(str_count(sampleData, "\\w+"))
Let us also save a copy of this sampled data set to the hard drive:
con <- file("./en_US.sample.txt", open = "w")
writeLines(sampleData, con)
close(con); rm(con)
Finally, let us free memory by removing all objects from the workspace:
rm(list = ls())
Before we can proceed to create the corpus and tidy up the data, let us reload the sampled data from disk:
f <- file("./en_US.sample.txt", open = "r")
sampleData <- readLines(f)
close(f); rm(f)
Let us first find a list of bad words and load it into R. To this end, a freely downloadable list of bad words is hosted on the Google Code archive. This list needs some cleaning before it can be used.
# Downloading the bad words list from the Google Code archive
if(!file.exists("./badwords.txt")){
    fileUrl <- "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt"
    download.file(fileUrl, destfile = "badwords.txt", method = "curl")
    rm(fileUrl)
}
# Reading bad words file
f <- file("./badwords.txt", open = "r")
profanities <- readLines(f)
close(f); rm(f)
# Cleaning up the list: dropping entries that start or end with an asterisk
profanities <- profanities[-grep("(^[*])|([*]$)", profanities)]
We shall create the corpus with the aid of the package
tm.
myCorpus <- VCorpus(VectorSource(sampleData))
In order to tidy up the corpus, we first need to create a function that removes the characters and strings we wish to exclude from the analysis, such as URLs, email addresses, Twitter handles and hashtags, but also punctuation, numbers, English stop words and profanities. The function cleanup accomplishes this job.
cleanup <- function(corpus, profanity = NULL){
    subs <- content_transformer(function(x, pattern) gsub(pattern, "", x))
    # Removing URLs, email addresses, Twitter handles and hashtags
    corpus <- tm_map(corpus, subs, "(f|ht)tp(s?)://(.*)[.][a-z]+")
    corpus <- tm_map(corpus, subs, "[[:alnum:]._-]+@[[:alnum:].-]+")
    corpus <- tm_map(corpus, subs, "@[^\\s]+")
    corpus <- tm_map(corpus, subs, "#(.*)")
    # Removing stray single letters
    corpus <- tm_map(corpus, subs, "\\b[A-Za-z]\\b")
    # Transforming characters to lower case, removing punctuation, numbers,
    # English stop words and extra white spaces
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)
    # Removing profanities, if a list is supplied
    if(!is.null(profanity)) corpus <- tm_map(corpus, removeWords, profanity)
    return(corpus)
}
Let us now complete the work by cleaning up the corpus and transforming it into a tidy data frame.
# Making a clean corpus
myCorpus <- cleanup(myCorpus, profanities)
# Transforming the corpus into a vector and a tidy data frame
cleanSample <- as.vector(tidy(myCorpus)$text)
tidyData <- data.frame(line = 1 : length(cleanSample), text = cleanSample)
This has already consumed a lot of memory, so it is wise to save the tidy data frame to an external file and clean up the workspace.
f <- file("./tidy-sample.txt", open = "w")
saveRDS(tidyData, f, ascii = TRUE)
close(f); rm(f)
# Cleaning up messy memory
rm(list = ls())
Once we have a tidy data frame, we can now begin the proper investigation. To this end, let us start by reading the data frame we have just saved:
f <- file("./tidy-sample.txt", open = "r")
tidyData <- readRDS(f)
close(f); rm(f)
We now wish to create three further data frames containing the mono-grams, bi-grams and tri-grams found in the corpus that we have sampled and cleaned. We shall do this with the aid of the package tidytext, and in particular of the function unnest_tokens.
monoGrams <- tidyData %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE)
biGrams <- tidyData %>%
    unnest_tokens(word, text, token = "ngrams", n = 2) %>%
    count(word, sort = TRUE) %>%
    filter(!is.na(word))
triGrams <- tidyData %>%
    unnest_tokens(word, text, token = "ngrams", n = 3) %>%
    count(word, sort = TRUE) %>%
    filter(!is.na(word))
Let us save these three data frames into three separate files, which shall later be used by the prediction model.
# Saving mono-grams, bi-grams and tri-grams
f1 <- file("./MonoGram.txt", open = "w")
saveRDS(monoGrams, f1, ascii = TRUE)
close(f1)
f2 <- file("./BiGram.txt", open = "w")
saveRDS(biGrams, f2, ascii = TRUE)
close(f2)
f3 <- file("./TriGram.txt", open = "w")
saveRDS(triGrams, f3, ascii = TRUE)
close(f3)
rm(f1, f2, f3)
Now let us complete the exploration by plotting a graph of the most common mono-, bi- and tri-grams.
ggplot(data = monoGrams[1:20,], aes(x = reorder(word, -n), y = n)) +
    geom_bar(stat = "identity", aes(colour = word), fill = "white") +
    geom_text(aes(label = n), vjust = -0.5, size = 3) +
    xlab("") + ylab("Frequency of Words") +
    ggtitle("20 Most Common Words") +
    theme(legend.position = "none",
          axis.text.x = element_text(hjust = 1.0, angle = 45),
          plot.title = element_text(hjust = 0.5, face = "bold"))
suppressWarnings(with(monoGrams, wordcloud(words = word, freq = n,
min.freq = 1, max.words = 100, random.order = FALSE,
rot.per = 0.35, colors = brewer.pal(9, "Purples"))))
ggplot(data = biGrams[1:20,], aes(x = reorder(word, -n), y = n)) +
    geom_bar(stat = "identity", aes(colour = word), fill = "white") +
    geom_text(aes(label = n), vjust = -0.5, size = 3) +
    xlab("") + ylab("Frequency of Bigrams") +
    ggtitle("20 Most Common Bigrams") +
    theme(legend.position = "none",
          axis.text.x = element_text(hjust = 1.0, angle = 45),
          plot.title = element_text(hjust = 0.5, face = "bold"))
suppressWarnings(with(biGrams, wordcloud(words = word, freq = n,
min.freq = 1, max.words = 100, random.order = FALSE,
rot.per = 0.35, colors = brewer.pal(9, "Greens"))))
ggplot(data = triGrams[1:20,], aes(x = reorder(word, -n), y = n)) +
    geom_bar(stat = "identity", aes(colour = word), fill = "white") +
    geom_text(aes(label = n), vjust = -0.5, size = 3) +
    xlab("") + ylab("Frequency of Trigrams") +
    ggtitle("20 Most Common Trigrams") +
    theme(legend.position = "none",
          axis.text.x = element_text(hjust = 1.0, angle = 45),
          plot.title = element_text(hjust = 0.5, face = "bold"))
suppressWarnings(with(triGrams, wordcloud(words = word, freq = n,
min.freq = 1, max.words = 100, random.order = FALSE,
rot.per = 0.35, colors = brewer.pal(9, "Reds"))))
The next step is to create a Shiny app that performs word prediction. More precisely, we shall construct a Shiny app that takes a phrase as input in a text box and outputs a prediction of the word most likely to come next.
The predictive algorithm that we shall develop uses n-gram models with a frequency lookup similar to the one performed in the exploratory data analysis above. The strategy will be built on the knowledge gathered during that analysis; for example, as n increases, the frequency of each individual n-gram decreases. One possible strategy is therefore to first look up the partially entered text in the mono-gram table; once a full term has been entered, followed by a space, to look up the most frequent bi-gram starting with that term, and so on for higher-order n-grams.
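As an illustration only, the following is a minimal sketch of how the saved n-gram tables could be queried to suggest the next word. The function predictNextWord, its simple back-off from tri-grams to bi-grams to the most frequent single words, and the example phrase are assumptions made for the purpose of this sketch, not the final implementation.
# Minimal sketch of a next-word lookup based on the saved n-gram tables.
# predictNextWord() and its back-off logic are illustrative assumptions.
library(dplyr)
library(stringr)
monoGrams <- readRDS("./MonoGram.txt")
biGrams <- readRDS("./BiGram.txt")
triGrams <- readRDS("./TriGram.txt")
predictNextWord <- function(phrase, n = 3){
    # Assumes the input has been cleaned in the same way as the corpus
    tokens <- str_split(str_to_lower(str_squish(phrase)), " ")[[1]]
    len <- length(tokens)
    # Try the tri-gram table first, matching the last two words typed
    # (the tables are already sorted by frequency, so head() keeps the top matches)
    if(len >= 2){
        hits <- filter(triGrams, str_detect(word, paste0("^", tokens[len - 1], " ", tokens[len], " ")))
        if(nrow(hits) > 0) return(word(head(hits$word, n), 3))
    }
    # Back off to the bi-gram table, matching the last word typed
    if(len >= 1){
        hits <- filter(biGrams, str_detect(word, paste0("^", tokens[len], " ")))
        if(nrow(hits) > 0) return(word(head(hits$word, n), 2))
    }
    # Fall back to the most frequent single words
    head(monoGrams$word, n)
}
predictNextWord("thanks for the")
In the Shiny app, a lookup of this kind would simply be wrapped around the text-box input, returning the suggested words as the output.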