By Sandy Sng
12 June 2018
This is the Milestone Report for the Coursera Data Science Capstone Project.
The goal of the Capstone Project is to create an algorithm and build a predictive text mining application to predict the next word based on previous words typed by a user. Using three databases of english sentences (extracted from blogs, news, and twitter), we will build and analyse basic n-gram models for predicting the next word based on previous frequently occuring words.
The motivation for this Milestone Report is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Download the three datasets from this Data Source. We will only use the English versions (File name: en_US) for this analysis.
if (!file.exists("final")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Examine datasets for these information: file sizes, line counts, word counts, and mean words per line.
library(stringi)
# Get file sizes
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# Get no. of words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the datasets
data.frame(source = c("blogs", "news", "twitter"),
file.size.MB = c(blogs.size, news.size, twitter.size),
num.lines = c(length(blogs), length(news), length(twitter)),
num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
## source file.size.MB num.lines num.words mean.num.words
## 1 blogs 200.4242 899288 37546239 41.75107
## 2 news 196.2775 1010242 34762395 34.40997
## 3 twitter 159.3641 2360148 30093413 12.75065
This involves removing URLs, special characters, punctuations, numbers, excess whitespace, stopwords, and changing the text to lower case. Since the data sets are quite large, we will randomly choose 1% of the data to demonstrate the data cleaning and exploratory analysis.
library(tm)
# Sample the data (random at 1%)
set.seed(324)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample)) # Vcorpus: used to create volatile corpora, so it is fully kept in memory and any changes affect only the corresponding R object. For reference check the tm #package.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
Words with the highest frequency of occurence are plotted first in the center of the wordcloud.
library(wordcloud)
wordcloud(corpus, max.words = 1000, random.order = FALSE, rot.per = 0.3, use.r.layout = FALSE, colors = brewer.pal(4, "BuPu"))
While working on my report, I was prompted by an “Error in NGramTokenizer”. Thereafter, I reloaded the required RWeka package and got a java runtime error:
“Unable to find any JVMs matching version “(null)”.
No Java runtime present, try –request to install.”
Below was the code I was working on, which I presume will run smoothly when the required packages and systems are up.
# Build basic n-gram models for predicting the next word based on the frequency occuring words in the data.
library(ngram)
library(RWeka)
options(mc.cores=1) # RWeka bug workaround
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
getFreq <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
freq1 <- getFreq(TermDocumentMatrix(corpus, control = list(unigram)))
freq2 <- getFreq(TermDocumentMatrix(corpus, control = list(bigram)))
freq3 <- getFreq(TermDocumentMatrix(corpus, control = list(trigram)))
makePlot <- function(data, label) {
ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("grey50"))
}
# Here is a histogram of the 30 most common unigrams, bigrams, and trigrams, in the data sample.
makePlot(freq1, "30 Most Common Unigrams")
makePlot(freq2, "30 Most Common Bigrams")
makePlot(freq3, "30 Most Common Trigrams")
I will be working on finding a solution for the “Error in NGramTokenizer” for above code to be run successfully. In addition, I will next aim to:
- increase our sample size from the current 1% to a larger sample size,
- improve accuracy by fine tuning the prediction model,
- create a Shiny app for a friendly user-interface.