Introduction

This is the milestone report for the Coursera Data Science Specialization’s Capstone project. The goals of this report are to:

Download data and present basic information about the data sets
Clean and sample the data
Build a corpus
Tokenize the data
Plot the most frequent 1-, 2-, and 3-grams

Load Libraries

The following libraries are needed to run this report. Note that RWeka requires Java. (If you are running a 64-bit computer, don’t make the same mistake that I did: your RStudio and Java versions must match, either both 32-bit or both 64-bit! 32-bit versions of both Java and RStudio can be installed on a 64-bit computer.)

Sys.setenv(JAVA_HOME="C:/Program Files/Java/jre1.8.0_151")
library(RWeka)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringi)
library(tm)

## Loading required package: NLP

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

library(SnowballC)

Step 1: Download Data and Summarize Data

The data consist of excerpts taken from blogs, Twitter, and news sources, packaged into three separate files. The data were provided by Coursera for the Data Science Specialization capstone.

After downloading, unzipping, and reading in the data, I summarized basic information about the data:

download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip","swiftkey.zip")
unzip("swiftkey.zip",files=c("final/en_US/en_US.twitter.txt","final/en_US/en_US.news.txt","final/en_US/en_US.blogs.txt"), overwrite=TRUE, junkpaths = TRUE, exdir="swiftkey")

blogs <- "swiftkey/en_US.blogs.txt"
news <- "swiftkey/en_US.news.txt"
twitter <- "swiftkey/en_US.twitter.txt"

conn <- file(blogs, "r")
blogs <- readLines(conn, encoding="UTF-8")
close(conn)

conn <- file(news, "r")
news <- readLines(conn, encoding="UTF-8")

## Warning in readLines(conn, encoding = "UTF-8"): incomplete final line found
## on 'swiftkey/en_US.news.txt'

close(conn)

conn <- file(twitter, "r")
twitter <- readLines(conn, encoding="UTF-8")

## Warning in readLines(conn, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul

## Warning in readLines(conn, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul

## Warning in readLines(conn, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul

## Warning in readLines(conn, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul

close(conn)

line_stats <- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Max.','Mean')])
rownames(line_stats) <- c('min_words','max_words','avg_words')
stats <- data.frame(
  FileName=c("en_US.blogs","en_US.news","en_US.twitter"),      
  t(rbind(
    sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
    Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
    line_stats)
  ))
head(stats)

##        FileName   Lines     Chars    Words min_words max_words avg_words
## 1   en_US.blogs  899288 206824382 37570839         0      6726  41.75108
## 2    en_US.news   77259  15639408  2651432         1      1123  34.61779
## 3 en_US.twitter 2360148 162096031 30451128         1        47  12.75063
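
The warnings printed while reading the news and Twitter files ("incomplete final line" and "embedded nul") suggest that a text-mode connection may not handle every byte in these files. The following is a minimal sketch of one common workaround, assuming the aim is to read every line: open the connection in binary mode ("rb") and pass skipNul = TRUE to readLines. The helper name read_all_lines and the temporary demo file are hypothetical and not part of the original report.

```r
# Hypothetical helper (not part of the original report): read a text file
# through a binary-mode connection and skip embedded nul bytes.
read_all_lines <- function(path) {
  conn <- file(path, "rb")   # binary mode: no platform-specific text translation
  on.exit(close(conn))       # close the connection even if readLines fails
  readLines(conn, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file whose second line
# contains an embedded nul byte between "d" and "ef".
demo_path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("abc\nd"), as.raw(0), charToRaw("ef\nghi\n")), demo_path)

demo_lines <- read_all_lines(demo_path)
print(demo_lines)
```

On Windows, a text-mode connection can stop early at certain control characters, so line counts from the two approaches may differ. It is worth comparing them before relying on the summary table.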

Step 2: Clean and Sample Data Sets

It quickly became clear that processing the entire data sets would be impossible on my computer because of their size. I tried sampling 10% and 5% of the data, but even these samples were too big. Ultimately I used a 1% sample. Before sampling, I removed all non-ASCII characters with iconv.

set.seed(1138)
sample_size<-0.01

blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")

sample_set <- c(sample(blogs, length(blogs) * sample_size),
                 sample(news, length(news) * sample_size),
                 sample(twitter, length(twitter) * sample_size))
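
The sampling step can be illustrated on toy data. This is a sketch rather than the report's exact code: the helper sample_fraction and the vector toy_lines are made up for illustration. The helper makes the truncation to a whole number of lines explicit with floor(), and the example shows that resetting the seed reproduces the same sample.

```r
# Hypothetical helper (not in the original report): draw a fixed fraction of
# the lines in a character vector, using an explicit whole-number sample size.
sample_fraction <- function(lines, fraction) {
  n <- floor(length(lines) * fraction)   # explicit truncation to a whole number
  sample(lines, n)
}

toy_lines <- paste("line", 1:1000)       # made-up stand-in for a real file

set.seed(1138)
first_draw <- sample_fraction(toy_lines, 0.01)

set.seed(1138)
second_draw <- sample_fraction(toy_lines, 0.01)

length(first_draw)                       # 10 lines, i.e. 1% of 1000
identical(first_draw, second_draw)       # TRUE: same seed, same sample
```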

Step 3: Build Corpus

Using the text mining package (tm), I built a corpus and modified it to enable the calculation of n-grams, including:

Lower case all letters
Remove punctuation
Remove numbers
Strip extra whitespace
Convert to plain text documents

corpus <- VCorpus(VectorSource(sample_set))
# tolower is a base R function, so wrap it in content_transformer() to keep each document's structure
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
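
To make the effect of these transformations concrete, here is a rough base-R approximation applied to a single string. It is only a sketch: clean_text is a hypothetical helper, the regular expressions are assumptions rather than the tm package's implementation, and the final trimws() call is an extra step not in the list above.

```r
# Hypothetical base-R approximation of the cleaning steps (not the tm code)
clean_text <- function(x) {
  x <- tolower(x)                      # lower case all letters
  x <- gsub("[[:punct:]]+", "", x)     # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)     # remove numbers
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  trimws(x)                            # extra step: trim leading/trailing spaces
}

cleaned <- clean_text("Hello, World!  It's 2018.")
print(cleaned)   # "hello world its"
```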

Step 4: Tokenize data to enable the calculation of frequencies

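This step, as I discovered after much research, is needed to calculate the frequency of each n-gram. Each tokenizer splits the text into sequences of n consecutive words, and the term-document matrix then counts how often each sequence appears in each document.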

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

Unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
Bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
Trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

Unigrams

## <<TermDocumentMatrix (terms: 44656, documents: 33365)>>
## Non-/sparse entries: 475754/1489471686
## Sparsity           : 100%
## Maximal term length: 255
## Weighting          : term frequency (tf)

Bigrams

## <<TermDocumentMatrix (terms: 315027, documents: 33365)>>
## Non-/sparse entries: 649543/10510226312
## Sparsity           : 100%
## Maximal term length: 260
## Weighting          : term frequency (tf)

Trigrams

## <<TermDocumentMatrix (terms: 533662, documents: 33365)>>
## Non-/sparse entries: 626374/17805006256
## Sparsity           : 100%
## Maximal term length: 264
## Weighting          : term frequency (tf)
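
What the tokenizers above produce can be sketched in base R on a toy sentence. The function make_ngrams is a hypothetical stand-in that splits on whitespace only. RWeka's NGramTokenizer has its own delimiter rules, so treat this as an illustration of the idea rather than an equivalent implementation.

```r
# Hypothetical helper: build all n-grams (runs of n consecutive words)
# from a single string, splitting on whitespace only.
make_ngrams <- function(text, n) {
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]                 # drop empty tokens
  if (length(words) < n) return(character(0))   # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_text <- "the cat sat on the cat mat"        # made-up example sentence
bigrams <- make_ngrams(toy_text, 2)

# Count each distinct bigram and sort from most to least frequent
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

In this toy sentence "the cat" appears twice and every other bigram once, which is the kind of count a term-document matrix stores per document.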

Step 5: Plot most frequent 1-, 2-, and 3-grams

Below are plots of the 20 most frequent n-grams of each size, drawn from the n-grams that appear at least 50 times in the sample:

FrequentUnigrams <- findFreqTerms(Unigrams,lowfreq = 50)
FrequentUnigrams <- rowSums(as.matrix(Unigrams[FrequentUnigrams,]))
FrequentUnigrams <- data.frame(word=names(FrequentUnigrams), frequency=FrequentUnigrams)

FrequentBigrams <- findFreqTerms(Bigrams,lowfreq=50)
FrequentBigrams <- rowSums(as.matrix(Bigrams[FrequentBigrams,]))
FrequentBigrams <- data.frame(word=names(FrequentBigrams), frequency=FrequentBigrams)

FrequentTrigrams <- findFreqTerms(Trigrams,lowfreq=50)
FrequentTrigrams <- rowSums(as.matrix(Trigrams[FrequentTrigrams,]))
FrequentTrigrams <- data.frame(word=names(FrequentTrigrams), frequency=FrequentTrigrams)

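plot_n_grams <- function(data, title, num) {
  # findFreqTerms returns terms in matrix order, not by frequency, so sort
  # by descending frequency before keeping the first num rows
  top <- head(data[order(-data$frequency), ], num)
  my_plot <- ggplot(data = top, aes(x = reorder(word, -frequency), y = frequency)) + geom_bar(stat = "identity")
  my_plot <- my_plot + labs(x = "N-gram", y = "Frequency", title = title)
  my_plot <- my_plot + theme(axis.text.x = element_text(angle = 90))
  my_plot
}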

plot_n_grams(FrequentUnigrams,"Top Unigrams",20)

plot_n_grams(FrequentBigrams,"Top Bigrams",20)

plot_n_grams(FrequentTrigrams,"Top Trigrams",20)

Conclusion

That was a huge pain and not well documented anywhere. Nonetheless, I’m sure that the code will help with the rest of the capstone.

Data Science Capstone Week 2 Project

Rohit Joshi

January 7, 2018