This is the milestone report for the Coursera Data Science Specialization Capstone Project. The goal of the capstone project is to create a predictive text model using a large corpus of text documents as training data. Natural Language Processing (NLP) techniques will be used to perform the analysis and build the predictive model.
This milestone report describes the major features of the training data with our exploratory data analysis and summarizes our plans for creating the predictive model.
We will preload the necessary R libraries and create a parallel cluster to reduce execution time.
# Clears environment variables
rm(list = ls())
# Loads libraries
library(doParallel)
library(readr)
library(stringi)
library(quanteda)
library(magrittr)
library(ggplot2)
# Sets up parallel cluster
job_cluster <- makeCluster(detectCores())
invisible(clusterEvalQ(job_cluster, library(stringi)))
invisible(clusterEvalQ(job_cluster, library(quanteda)))
The text files are provided in a zip file available at the link below.
# Downloads and extracts the dataset
url_file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("../data")) dir.create("../data")
if (!file.exists("../data/Coursera-SwiftKey.zip")) {
download.file(url = url_file, destfile = "../data/Coursera-SwiftKey.zip")
unzip(zipfile = "../data/Coursera-SwiftKey.zip", exdir = "../data")
file.remove("../data/Coursera-SwiftKey.zip")
}
These datasets consist of text from three different sources: news, blogs and Twitter feeds. The text data are provided in four different languages: German, English (US), Finnish and Russian. In this work, we will only focus on the English (US) datasets.
# Reads text files
blogs <- read_lines("../data/en_US/en_US.blogs.txt")
news <- read_lines("../data/en_US/en_US.news.txt")
twitter <- read_lines("../data/en_US/en_US.twitter.txt")
# Gets file sizes in MB
blogs_size <- file.size("../data/en_US/en_US.blogs.txt") / 2 ^ 20
news_size <- file.size("../data/en_US/en_US.news.txt") / 2 ^ 20
twitter_size <- file.size("../data/en_US/en_US.twitter.txt") / 2 ^ 20
# Gets number of words
blogs_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)
# Creates summary of the data sets
summ <- data.frame(source_file = c("blogs", "news", "twitter"),
                   file_size_MB = c(blogs_size, news_size, twitter_size),
                   num_lines = c(length(blogs), length(news),
                                 length(twitter)),
                   num_words = c(sum(blogs_words), sum(news_words),
                                 sum(twitter_words)),
                   words_per_line = c(mean(blogs_words), mean(news_words),
                                      mean(twitter_words)))
print(summ)
##   source_file file_size_MB num_lines num_words words_per_line
## 1       blogs     200.4242    899288  37546246       41.75108
## 2        news     196.2775   1010242  34762395       34.40997
## 3     twitter     159.3641   2360148  30093369       12.75063
Processing these files on a desktop computer can be time consuming, so we will take a 10% sample of each source and analyse that instead.
Before analysing the text, we need to clean it. The final product will be a single character vector containing all sampled lines.
# Cleans variables - memory management
rm("blogs_size", "news_size", "twitter_size", "blogs_words", "news_words",
"twitter_words", "summ")
# Sets the sampling proportion
sample_perc <- .10
# Gets sample of files
set.seed(15)
txt <- c(sample(x = blogs, size = length(blogs) * sample_perc),
         sample(x = news, size = length(news) * sample_perc),
         sample(x = twitter, size = length(twitter) * sample_perc)) %>%
  # Converts to lower case
  toLower()
head(txt)
## [1] "“i think this winter, as whooping cough upticks, measles continues to be under-vaccinated, we’re going to increasingly see pockets of communicable infections that a few decades ago we thought, frankly, we had eradicated from the united states. these illnesses should not be seen in the united states with the vaccinations that we have at hand,”says snyderman."
## [2] "magazines are already embracing the internet, though some are being braver with the platform than others."
## [3] "pew also breaks the country down by how they get their news into integrators, net-newsers, traditionalists and disengaged, with traditionalists being by far the largest segment (46%). this is the only segment that is almost solely reliant on tv for their news. however, the integrators (23%) also use tv as their main news source. integrators are defined as those who use traditional sources (tv, magazine, newspaper) and the internet. they tend to be middle-aged americans who are “well-educated and affluent.” this means that, taken as an aggregate, 69% of all americans rely on tv for all or most of their news."
## [4] "‘back in 2004, president bush ran a smear campaign against challenger sen. john kerry (d-ma) which undermined his service in vietnam and questioned kerry’s ability and determination to protect the united states — just three years removed from the 9/11 attacks — from another terror strike."
## [5] "walking out, our spirits filled, we stand outside the oncology department as we wait for the elevator to the parking deck."
## [6] "instead, the agency should number no more than 5,000, and carry out his original intent, which was to monitor terrorist threats and collect intelligence."
Now we can analyse the data. First, we will create document-feature matrices for each of the n-gram types of interest: unigrams, bigrams and trigrams.
# Cleans variables - memory management
rm("blogs", "news", "twitter")
# Creates document-feature matrices
## unigrams
dfm_uni <- txt %>%
  tokenize(removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
           removeSymbols = TRUE, removeTwitter = TRUE) %>%
  dfm()
## bigrams
dfm_bi <- txt %>%
  tokenize(removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
           removeSymbols = TRUE, removeTwitter = TRUE, ngrams = 2) %>%
  dfm()
## trigrams
dfm_tri <- txt %>%
  tokenize(removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
           removeSymbols = TRUE, removeTwitter = TRUE, ngrams = 3) %>%
  dfm()
Stop words and profanity are not removed because the main goal of this capstone project is to predict the word the user will type next on a smartphone keyboard, so every word matters and removing any of them would not make sense.
Second, we will plot the 20 most common unigrams, bigrams and trigrams.
# Creates top-20 frequency tables for each DFM
top_uni <- topfeatures(dfm_uni, 20)
top_uni <- data.frame(word = names(top_uni), n = top_uni)
top_bi <- topfeatures(dfm_bi, 20)
top_bi <- data.frame(word = names(top_bi), n = top_bi)
top_tri <- topfeatures(dfm_tri, 20)
top_tri <- data.frame(word = names(top_tri), n = top_tri)
# Plots graphs
## Unigrams
g_uni <- ggplot(data = top_uni, aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity", colour = "lightblue", fill = "lightblue") +
  xlab(NULL) + coord_flip() + ggtitle("20 Most Common Unigrams")
g_uni
## Bigrams
g_bi <- ggplot(data = top_bi, aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity", colour = "lightgreen", fill = "lightgreen") +
  xlab(NULL) + coord_flip() + ggtitle("20 Most Common Bigrams")
g_bi
## Trigrams
g_tri <- ggplot(data = top_tri, aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity", colour = "orange", fill = "orange") +
  xlab(NULL) + coord_flip() + ggtitle("20 Most Common Trigrams")
g_tri
The next steps of this capstone project will be to build a predictive algorithm and to deploy it as a Shiny app.
Our predictive algorithm will be an n-gram model with frequency lookup similar to the exploratory analysis above. One possible strategy is to use the trigram model to predict the next word; if no matching trigram is found, the algorithm backs off to the bigram model, and then to the unigram model if needed, as sketched below.
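To make the backoff idea concrete, here is a minimal sketch in R. It assumes the n-gram counts have already been reshaped into lookup data frames (freq_tri, freq_bi, freq_uni) with prefix, next_word and n columns; these table names, the toy values and the predict_next_word() function are illustrative assumptions, not part of the analysis above.
# Hypothetical lookup tables: prefix = preceding word(s), next_word = candidate,
# n = frequency observed in the sampled corpus (toy values shown here)
freq_tri <- data.frame(prefix = c("of the", "a lot"),
                       next_word = c("most", "of"),
                       n = c(1200, 950), stringsAsFactors = FALSE)
freq_bi <- data.frame(prefix = c("the", "of"),
                      next_word = c("first", "course"),
                      n = c(3000, 2500), stringsAsFactors = FALSE)
freq_uni <- data.frame(next_word = c("the", "to", "and"),
                       n = c(50000, 40000, 35000), stringsAsFactors = FALSE)
# Predicts the next word: trigram lookup, then bigram, then unigram fallback
predict_next_word <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  if (length(words) == 2) {
    hit <- subset(freq_tri, prefix == paste(words, collapse = " "))
    if (nrow(hit) > 0) return(hit$next_word[which.max(hit$n)])
  }
  hit <- subset(freq_bi, prefix == tail(words, 1))
  if (nrow(hit) > 0) return(hit$next_word[which.max(hit$n)])
  freq_uni$next_word[which.max(freq_uni$n)]
}
predict_next_word("It was one of the")  # returns "most" with these toy tables
A real implementation would work from much larger frequency tables, handle ties, and possibly weight the backoff steps (e.g. "stupid backoff"), but the lookup logic would follow this pattern.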
The user interface of the Shiny app will consist of a text input box that allows the user to enter a phrase. After a short delay, the app will use our algorithm to suggest the most likely next word.
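As a rough sketch only, a minimal Shiny skeleton for this interface could look as follows; predict_next_word() is the hypothetical backoff function sketched above, and the widget labels are placeholders.
library(shiny)
# Minimal UI: a text box for the phrase and an output area for the suggestion
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  h4("Suggested next word:"),
  textOutput("suggestion")
)
server <- function(input, output) {
  output$suggestion <- renderText({
    # Predicts on every keystroke; the final app could add a short delay
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next_word(input$phrase)
  })
}
shinyApp(ui = ui, server = server)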