Summary

This document serves as a Milestone Report for the Capstone course in Coursera’s Data Science Specialization. The ultimate goal of the Capstone project is to produce a prediction algorithm and Shiny app that will serve as a predictive text product. This exploratory analysis takes stock of the development process by showing that the data has been successfully loaded, summary statistics are being evaluated, interesting trends are being uncovered, and ideas are percolating for the next steps.

Getting and Cleaning the Data

The data sets for this project are available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. After loading the necessary R packages, we read the U.S. English Twitter, news, and blog data into R.

library(readtext)
## Warning: package 'readtext' was built under R version 3.4.4
library(stringi)
## Warning: package 'stringi' was built under R version 3.4.4
library(tm)
## Warning: package 'tm' was built under R version 3.4.4
## Loading required package: NLP
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip", exdir = "~/Coursera-SwiftKey")
}
twitter_text<- readLines("~/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news_text<- readLines("~/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blogs_text<- readLines("~/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

By taking a cursory look at the data, we can get a better feel for what we are working with and a clearer perspective on our ultimate goals. As requested, let's compute some basic summaries: file sizes, line counts, and word counts.

twitter_size <- file.info("~/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size /(1024 ^ 2)
news_size <- file.info("~/Coursera-SwiftKey/final/en_US/en_US.news.txt")$size /(1024 ^ 2)
blogs_size <- file.info("~/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size /(1024 ^ 2)
twitter_words <- stri_count_words(twitter_text)
news_words <- stri_count_words(news_text)
blogs_words <- stri_count_words(blogs_text)
data.frame(source = c("Twitter", "news", "blogs"),
           file_size_MB = c(twitter_size, news_size, blogs_size),
           num_lines = c(length(twitter_text), length(news_text), length(blogs_text)),
           num_words = c(sum(twitter_words), sum(news_words), sum(blogs_words)),
           mean_num_words = c(mean(twitter_words), mean(news_words), mean(blogs_words)))
##    source file_size_MB num_lines num_words mean_num_words
## 1 Twitter     159.3641   2360148  30093410       12.75065
## 2    news     196.2775   1010242  34762395       34.40997
## 3   blogs     200.4242    899288  37546246       41.75108

For the purposes of this exploratory analysis, we will randomly choose 1% of the data from each of the three sets. We will then transform the data to eliminate non-words (URLs, Twitter handles, numbers, stop words, and punctuation) and convert the text to lower case.

set.seed(1234)
sample_text <- c(sample(twitter_text, length(twitter_text) * 0.01),
                 sample(news_text, length(news_text) * 0.01),
                 sample(blogs_text, length(blogs_text) * 0.01))
clean_sample_text <- VCorpus(VectorSource(sample_text))
rm(twitter_text, twitter_size, twitter_words, news_text, news_size, news_words, blogs_text, blogs_size, blogs_words)
now_a_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
clean_sample_text <- tm_map(clean_sample_text, now_a_space, "(f|ht)tp(s?)://(.*)[.][a-z]+")
clean_sample_text <- tm_map(clean_sample_text, now_a_space, "@[^\\s]+")
clean_sample_text <- tm_map(clean_sample_text, removeNumbers)
clean_sample_text <- tm_map(clean_sample_text, removeWords, stopwords("english"))
clean_sample_text <- tm_map(clean_sample_text, removePunctuation)
clean_sample_text <- tm_map(clean_sample_text, stripWhitespace)
clean_sample_text <- tm_map(clean_sample_text, PlainTextDocument)

# We almost forgot to remove the profanity. Let's do that now.
profanity_file <- file("http://www.bannedwordlist.com/lists/swearWords.txt", open = "rb")
profanity <- readLines(profanity_file, encoding = "UTF-8", warn=TRUE, skipNul=TRUE)
## Warning in readLines(profanity_file, encoding = "UTF-8", warn =
## TRUE, skipNul = TRUE): incomplete final line found on 'http://
## www.bannedwordlist.com/lists/swearWords.txt'
close(profanity_file)
rm(profanity_file)
clean_sample_text <- tm_map(clean_sample_text, removeWords, profanity)
clean_sample_text <- tm_map(clean_sample_text, content_transformer(tolower))

Exploratory Analysis

Finally, “it’s time to make the doughnuts.” Let’s get a taste of what the text data holds by analyzing our 1% in terms of n-grams. We will find it productive to look at the most popular individual words, two-word, and three-word combinations. This will undoubtedly help us as we prepare to develop our model.

clean_sample_text <- tm_map(clean_sample_text, PlainTextDocument)
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

freq_df <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_df <- data.frame(word=names(freq), freq=freq)
  return(freq_df)
}

unigram <- removeSparseTerms(TermDocumentMatrix(clean_sample_text), 0.9999)
unigram_freq <- freq_df(unigram)

bigram <- removeSparseTerms(TermDocumentMatrix(clean_sample_text, control = list(tokenize = BigramTokenizer)), 0.9999)
bigram_freq <- freq_df(bigram)

trigram <- removeSparseTerms(TermDocumentMatrix(clean_sample_text, control = list(tokenize = TrigramTokenizer)), 0.9999)
trigram_freq <- freq_df(trigram)
freq_plot <- function(data, title) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
         labs(x = "Words/Phrases", y = "Frequency") +
         ggtitle(title) +
         theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = "purple3")
}

freq_plot(unigram_freq, "Top 30 Unigrams")

freq_plot(bigram_freq, "Top 30 Bigrams")

freq_plot(trigram_freq, "Top 30 Trigrams")

Conclusions and Next Steps

There is a great amount of data here, and building the corpus is time-consuming. We will need to examine our data cleansing more thoroughly. There may also be speed benefits from other packages we have not yet considered or gotten to work properly (e.g., RWeka). In this rudimentary study, however, we have gained a better feel for the data. We're off to a good start. We may even consider input from four-grams. The big steps of training, testing, and validating our model still remain. The Shiny app and presentation will be the culmination of these efforts, and their design should begin as soon as the predictive model becomes viable.
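
To make these next steps a bit more concrete, below is a minimal sketch of where the model could head. It assumes the objects built earlier in this report (clean_sample_text, bigram_freq, trigram_freq, and the freq_df helper) are still in memory; QuadgramTokenizer and predict_next are hypothetical names, not part of the analysis above. The lookup simply returns the most frequent trigram continuations of the last two (lowercased) words and backs off to bigrams when nothing matches.

# Four-gram tokenizer, mirroring the bigram/trigram tokenizers above
QuadgramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
quadgram <- removeSparseTerms(TermDocumentMatrix(clean_sample_text, control = list(tokenize = QuadgramTokenizer)), 0.9999)
quadgram_freq <- freq_df(quadgram)

# Hypothetical next-word lookup: return the most frequent trigram
# continuations of the last two words, backing off to bigrams if none match.
predict_next <- function(last_two, trigrams = trigram_freq, bigrams = bigram_freq, n = 3) {
  hits <- trigrams[grepl(paste0("^", last_two, " "), trigrams$word), ]
  if (nrow(hits) == 0) {
    last_one <- tail(strsplit(last_two, " ")[[1]], 1)
    hits <- bigrams[grepl(paste0("^", last_one, " "), bigrams$word), ]
  }
  # freq_df already sorts by frequency, so the top rows are the best guesses
  head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n)
}
# e.g., predict_next("happy new")   # output depends on the sampled corpus

A real model will need proper smoothing or backoff weights and a training/validation/test split, but even a simple frequency lookup like this should be enough to start wiring up the Shiny app.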