Executive Summary

Here we will do some exploratory data analysis for the Coursera Data Science Capstone Project. The ultimate goal is to use a corpus of blog, news, and Twitter text (provided here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) to develop a predictive text algorithm. In this Milestone Report we take a look at the corpus and some of its features.

Let’s load our libraries

library(dplyr)
## Warning: Installed Rcpp (0.12.10) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(stringi)
library(tm)
## Warning: package 'tm' was built under R version 3.4.1
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.4.1
library(SnowballC)
library(RColorBrewer)
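
The package startup messages and masking notes above are harmless; if we wanted a quieter report we could load the noisier packages like this instead (an optional sketch, not part of the original analysis):

suppressPackageStartupMessages({
  library(dplyr)
  library(tm)
})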

Here we define the URL for the corpus, download and unzip it if it is not already present, and open connections to the three English-language files

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip")
  # extract into a "Coursera-SwiftKey" directory so the paths below resolve
  unzip("Coursera-SwiftKey.zip", exdir = "Coursera-SwiftKey")
}
blog_file <- file("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
news_file <- file("Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
twitter_file <- file("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")

Let’s look at a few lines from each file in the corpus

readLines(blog_file, 2)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."
## [2] "We love you Mr. Brown."
readLines(news_file, 5)
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
readLines(twitter_file, 5)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"
blog_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = 'UTF-8', skipNul = TRUE)
news_data <- readLines( "Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = 'UTF-8', skipNul = TRUE)
## Warning in readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt",
## encoding = "UTF-8", : incomplete final line found on 'Coursera-SwiftKey/
## final/en_US/en_US.news.txt'
twitter_data <- readLines( "Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = 'UTF-8', skipNul = TRUE)
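
Since we opened explicit connections above for the previews, it is good practice to close them once we are done with them (a small housekeeping addition, not in the original script):

close(blog_file)
close(news_file)
close(twitter_file)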

And now let’s look at some stats about the corpus:

blog_words <- stri_count_words(blog_data)
news_words <- stri_count_words(news_data)
twitter_words <- stri_count_words(twitter_data)
all_words <- c(blog_words, news_words, twitter_words)
ave_words <- mean(all_words)
max_words <- max(all_words)
min_words <- min(all_words)
words_data_table <- summary(all_words)
print(words_data_table)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.00   14.00   21.07   22.00 6726.00
blog_wordcount <- sum(blog_words)
news_wordcount <- sum(news_words)
twitter_wordcount <- sum(twitter_words)
total_wordcount <- sum(blog_wordcount, news_wordcount, twitter_wordcount)
all_wordcounts <- data.frame(c(blog_wordcount, news_wordcount, 
                               twitter_wordcount), 
                             row.names = c("Blog", 
                                           "News",
                                           "Twitter"))
print(all_wordcounts)
##         c.blog_wordcount..news_wordcount..twitter_wordcount.
## Blog                                                37546246
## News                                                 2674536
## Twitter                                             30093410
blog_lines <- length(blog_data)
news_lines <- length(news_data)
twitter_lines <- length(twitter_data)
all_lines <- data.frame(c(blog_lines, news_lines, twitter_lines), 
                        row.names = c("Blog",
                                      "News",
                                      "Twitter"))
print(all_lines)
##         c.blog_lines..news_lines..twitter_lines.
## Blog                                      899288
## News                                       77259
## Twitter                                  2360148
data_table <- (as.data.frame(c(all_wordcounts, all_lines), 
                          row.names = c("Blog",
                                      "News",
                                      "Twitter"),
                          col.names = c("Wordcount", "Linecount")))

print(data_table)
##         Wordcount Linecount
## Blog     37546246    899288
## News      2674536     77259
## Twitter  30093410   2360148
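
Beyond word and line counts, we could also report how large each file is on disk; a minimal sketch using file.info() (the paths are the same ones used above, and the output is not shown here):

file_sizes <- file.info(c("Coursera-SwiftKey/final/en_US/en_US.blogs.txt",
                          "Coursera-SwiftKey/final/en_US/en_US.news.txt",
                          "Coursera-SwiftKey/final/en_US/en_US.twitter.txt"))$size
# convert bytes to megabytes for readability
round(file_sizes / 1024^2, 1)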

Now let’s sample the data to make it easier to work with
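
For reproducibility it would be sensible to fix the random seed before sampling; the seed value below is an arbitrary choice of mine:

set.seed(1234)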

# combine the three sources and draw a random sample of lines
data_samps <- sample(c(blog_data, news_data, twitter_data), size = 10000)
corpus <- Corpus(VectorSource(data_samps))

Next we clean the data: convert it to lower case, strip Twitter handles and URLs, and remove stop words, punctuation, numbers, and extra whitespace

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, content_transformer(tolower))             # lower-case first so stop words match
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # drop Twitter handles
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # drop URLs before punctuation is removed
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
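
As a quick sanity check (not part of the original analysis), we can inspect a couple of the cleaned documents to confirm the transformations behaved as expected:

inspect(corpus[1:2])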

Now we tokenize the data into unigrams, bigrams, and trigrams

Tokenizers:

Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
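
To see what these tokenizers produce, we can apply the bigram tokenizer to a toy phrase (the phrase is made up for illustration, and the output is not shown here):

BigramTokenizer("thanks for the rt")
# should return the adjacent word pairs: "thanks for", "for the", "the rt"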

Term Document Matrices:

tdm1 <- TermDocumentMatrix(corpus, control = list(tokenize = Tokenizer))
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
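
These term-document matrices are large and sparse; a quick way to gauge their size is to check their dimensions (a small sketch, output omitted):

dim(tdm1)  # distinct unigrams x sampled documents
dim(tdm2)  # distinct bigrams  x sampled documents
dim(tdm3)  # distinct trigrams x sampled documents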

Identify and quantify the most frequently occurring unigrams

frequent_terms1 <- findFreqTerms(tdm1, lowfreq = 1000)
unigram_frequency <- rowSums(as.matrix(tdm1[frequent_terms1,]))
unigram_frequency <- data.frame(unigram = names(unigram_frequency), frequency = unigram_frequency)
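
If we want the bars in the plot below ordered from most to least frequent, we can reorder the factor levels first (an optional tweak that is not in the original analysis):

unigram_frequency$unigram <- reorder(unigram_frequency$unigram, -unigram_frequency$frequency)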

Plot the unigram frequencies

g <- ggplot(unigram_frequency, aes(x = unigram, y = frequency)) +
    geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(legend.title = element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    labs(title = "Frequency of Most Frequent Unigrams")
print(g)

Identify and quantify the most frequently occurring bigrams

frequent_terms2 <- findFreqTerms(tdm2, lowfreq = 1500)
bigram_frequency <- rowSums(as.matrix(tdm2[frequent_terms2,]))
bigram_frequency <- data.frame(bigram = names(bigram_frequency), frequency = bigram_frequency)

Plot the bigram frequencies

h <- ggplot(bigram_frequency, aes(x = bigram, y = frequency)) +
    geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(legend.title = element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    labs(title = "Frequency of Most Frequent Bigrams")
print(h)

Identify and quantify the most frequently occurring trigrams

frequent_terms3 <- findFreqTerms(tdm3, lowfreq = 1100)
trigram_frequency <- rowSums(as.matrix(tdm3[frequent_terms3,]))
trigram_frequency <- data.frame(trigram = names(trigram_frequency), frequency = trigram_frequency)

Plot the trigram frequencies

i <- ggplot(trigram_frequency, aes(x = trigram, y = frequency)) +
    geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(legend.title = element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    labs(title = "Frequency of Most Frequent Trigrams")
print(i)

Conclusion

Now that we have worked out how to obtain, clean, and analyze the data, building the n-gram prediction models should be straightforward. Looking forward to the rest of this course.