This is the milestone report for the Coursera Data Science Specialization’s Capstone project. The goals of this report are two:
The following libraries are necessary to run this report. Note that RWeka requires Java. (If you are running a 64-bit computer, don’t make the same mistake that I did: your RStudio and Java versions (32-bit vs. 64-bit) must match! 32-bit versions of either Java or RStudio can be installed on a 64-bit computer.)
Sys.setenv(JAVA_HOME="C:/Program Files/Java/jre1.8.0_151")
library(RWeka)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringi)
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(SnowballC)
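One way to check for the 32-bit vs. 64-bit mismatch mentioned above is to print the bitness of the running R session and the `JAVA_HOME` value it sees. This is only a minimal sketch using base R. The variable names `r_bits` and `java_home` are my own, and the check says nothing about the Java installation itself beyond the path it points to.

```r
# Bitness of the running R session: pointers are 4 bytes on 32-bit R
# and 8 bytes on 64-bit R.
r_bits <- 8 * .Machine$sizeof.pointer
cat("R session:", r_bits, "bit\n")

# JAVA_HOME may be unset, so ask for NA in that case instead of "".
java_home <- Sys.getenv("JAVA_HOME", unset = NA)
if (is.na(java_home)) {
  cat("JAVA_HOME is not set\n")
} else {
  cat("JAVA_HOME:", java_home, "\n")
}
```

If the session is 64-bit, the Java installation that `JAVA_HOME` points to should be 64-bit as well, and vice versa.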
The data consists of excerpts taken from blogs, Twitter, and news sources, which are packaged into three separate files. This data was provided by Coursera for the Data Science Specialization capstone.
After downloading, unzipping, and reading in the data, I summarized basic information about the data:
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip","swiftkey.zip")
unzip("swiftkey.zip",files=c("final/en_US/en_US.twitter.txt","final/en_US/en_US.news.txt","final/en_US/en_US.blogs.txt"), overwrite=TRUE, junkpaths = TRUE, exdir="swiftkey")
blogs <- "swiftkey/en_US.blogs.txt"
news <- "swiftkey/en_US.news.txt"
twitter <- "swiftkey/en_US.twitter.txt"
conn <- file(blogs, "r")
blogs <- readLines(conn, encoding="UTF-8")
close(conn)
conn <- file(news, "r")
news <- readLines(conn, encoding="UTF-8")
## Warning in readLines(conn, encoding = "UTF-8"): incomplete final line found
## on 'swiftkey/en_US.news.txt'
close(conn)
conn <- file(twitter, "r")
twitter <- readLines(conn, encoding="UTF-8")
## Warning in readLines(conn, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
close(conn)
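The warnings above suggest that the files contain control characters that text-mode reading handles poorly. On Windows, an "incomplete final line" warning on a large file can also mean that the read stopped early at a control character, so it is worth comparing line counts against a binary-mode read. A minimal sketch of that approach follows. The helper `read_text_lines` and the small demo file are hypothetical and were not part of the original analysis.

```r
# Hypothetical helper: read a text file through a binary-mode connection
# and skip embedded nul characters instead of warning about them.
read_text_lines <- function(path) {
  conn <- file(path, "rb")
  on.exit(close(conn))
  readLines(conn, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a temporary file whose second line contains an embedded nul.
demo_path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\n"),
           charToRaw("second"), as.raw(0), charToRaw(" line\n")),
         demo_path)
demo_lines <- read_text_lines(demo_path)
print(demo_lines)
```

The same helper could be applied to the three `swiftkey/` files. The summary statistics in this report were computed from the text-mode reads shown above, so they would need to be recomputed after such a change.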
line_stats <- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Max.','Mean')])
rownames(line_stats) <- c('min_words','max_words','avg_words')
stats <- data.frame(
FileName=c("en_US.blogs","en_US.news","en_US.twitter"),
t(rbind(
sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
line_stats)
))
head(stats)
##        FileName   Lines     Chars    Words min_words max_words avg_words
## 1   en_US.blogs  899288 206824382 37570839         0      6726  41.75108
## 2    en_US.news   77259  15639408  2651432         1      1123  34.61779
## 3 en_US.twitter 2360148 162096031 30451128         1        47  12.75063
It quickly became clear that processing the entire data sets would be impossible on my computer because of their size. I tried sampling 10% and 5% of the data, but even these samples were too big. Ultimately I used a 1% sample.
set.seed(1138)
sample_size<-0.01
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
sample_set <- c(sample(blogs, length(blogs) * sample_size),
sample(news, length(news) * sample_size),
sample(twitter, length(twitter) * sample_size))
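The sampling step can be wrapped in a small function so that the sample size is computed explicitly and the draw is reproducible. This is only a sketch. The helper name `sample_fraction` and the toy input are assumptions for illustration, not part of the original code.

```r
# Hypothetical helper: draw a random fraction of the elements of x
# without replacement. The sample size is rounded down to a whole number.
sample_fraction <- function(x, fraction) {
  n <- floor(length(x) * fraction)
  x[sample.int(length(x), n)]
}

# Toy data standing in for one of the text files.
toy <- paste("line", 1:1000)

# Resetting the seed before each draw gives the same sample twice.
set.seed(1138)
first <- sample_fraction(toy, 0.01)
set.seed(1138)
second <- sample_fraction(toy, 0.01)

length(first)
identical(first, second)
```

Indexing with `sample.int` avoids the special case in `sample()` where a single number is treated as the range `1:x`.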
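Using the tm text mining package, I built a corpus and made modifications to enable the calculation of n-grams, including converting the text to lowercase, removing punctuation and numbers, and collapsing extra whitespace: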
corpus <- VCorpus(VectorSource(sample_set))
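# tolower is a base R function, so wrap it in content_transformer() to keep the documents as tm text documents
corpus <- tm_map(corpus, content_transformer(tolower))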
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
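To see what these cleaning steps do to a single line of text, here is a rough base R approximation. It is a sketch for illustration only: the function `clean_text` is hypothetical, and the regular expressions are my assumption of what the tm transformations do, not a copy of the package's code.

```r
# Hypothetical base R approximation of the cleaning steps above.
clean_text <- function(x) {
  x <- tolower(x)                    # lowercase everything
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  x
}

cleaned <- clean_text("Hello,  World! It's 2017 already.")
print(cleaned)
```

Note that removing punctuation also removes apostrophes, so contractions such as "it's" become "its" in the resulting n-grams.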
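The next step, as I discovered after much research, is needed to calculate the frequency of each n-gram: tokenizing the corpus into unigrams, bigrams, and trigrams and building a term-document matrix for each.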
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
Unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
Bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
Trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
Unigrams
## <<TermDocumentMatrix (terms: 44656, documents: 33365)>>
## Non-/sparse entries: 475754/1489471686
## Sparsity : 100%
## Maximal term length: 255
## Weighting : term frequency (tf)
Bigrams
## <<TermDocumentMatrix (terms: 315027, documents: 33365)>>
## Non-/sparse entries: 649543/10510226312
## Sparsity : 100%
## Maximal term length: 260
## Weighting : term frequency (tf)
Trigrams
## <<TermDocumentMatrix (terms: 533662, documents: 33365)>>
## Non-/sparse entries: 626374/17805006256
## Sparsity : 100%
## Maximal term length: 264
## Weighting : term frequency (tf)
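For intuition about what the tokenizers produce, the sketch below builds n-grams from a vector of words in base R and counts them. The function `make_ngrams` and the toy sentence are hypothetical. This illustrates the idea only and is not how RWeka implements `NGramTokenizer`.

```r
# Hypothetical helper: slide a window of n words over the input and
# paste each window into a single n-gram string.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- strsplit("the cat sat on the cat mat", " ", fixed = TRUE)[[1]]
bigrams <- make_ngrams(words, 2)

# Count how often each bigram occurs, most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

A term-document matrix stores the same kind of counts, but per document, which is why most of its entries are zero.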
Below are the plots of the most frequent n-grams:
FrequentUnigrams <- findFreqTerms(Unigrams,lowfreq = 50)
FrequentUnigrams <- rowSums(as.matrix(Unigrams[FrequentUnigrams,]))
FrequentUnigrams <- data.frame(word=names(FrequentUnigrams), frequency=FrequentUnigrams)
FrequentBigrams <- findFreqTerms(Bigrams,lowfreq=50)
FrequentBigrams <- rowSums(as.matrix(Bigrams[FrequentBigrams,]))
FrequentBigrams <- data.frame(word=names(FrequentBigrams), frequency=FrequentBigrams)
FrequentTrigrams <- findFreqTerms(Trigrams,lowfreq=50)
FrequentTrigrams <- rowSums(as.matrix(Trigrams[FrequentTrigrams,]))
FrequentTrigrams <- data.frame(word=names(FrequentTrigrams), frequency=FrequentTrigrams)
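plot_n_grams <- function(data, title, num) {
  # findFreqTerms() does not sort by frequency, so order the rows first;
  # head() also copes with fewer than `num` rows without producing NA rows
  data <- head(data[order(-data$frequency), ], num)
  my_plot <- ggplot(data = data, aes(x = reorder(word, -frequency), y = frequency)) + geom_bar(stat="identity")
  my_plot <- my_plot + labs(x = "N-gram", y = "Frequency", title = title)
  my_plot <- my_plot + theme(axis.text.x=element_text(angle=90))
  my_plot
}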
plot_n_grams(FrequentUnigrams,"Top Unigrams",20)
plot_n_grams(FrequentBigrams,"Top Bigrams",20)
plot_n_grams(FrequentTrigrams,"Top Trigrams",20)
That was a huge pain and not well documented anywhere. Nonetheless, I’m sure that the code will help with the rest of the capstone.