Synopsis

Predictive text algorithms are a useful means of simplifying keyboard input on a smartphone. The following document outlines the preliminary steps necessary to build such an algorithm: acquire and explore the data, create a corpus, and identify the primary elements of a predictive model.

Create Working Environment in R

Load all required libraries.

library(knitr)
library(tm)
library(ggplot2)
library(dplyr)
library(gridExtra)

Download and unzip data sources.

download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
              destfile = "Coursera-SwiftKey.zip")
unzip("Coursera-Swiftkey.zip")

Read in each of the three files in the English data set.

conn <- file("final/en_US/en_US.twitter.txt", "r") 
twit <- readLines(conn)
close(conn)

conn <- file("final/en_US/en_US.blogs.txt", "r") 
blog <- readLines(conn)
close(conn)

conn <- file("final/en_US/en_US.news.txt", "r") 
news <- readLines(conn)
close(conn)

rm(conn)

Exploratory Analysis

After preparing the working environment, some basic summary statistics can be computed.

word_stats <- data.frame()

#for each source, compute line count, word count, mean words per line,
#and mean word length, then append the results to word_stats
for (i in c("twit", "blog", "news")){
      
      assign(paste(i, "_lines", sep = ""), length(get(i)))
      assign(paste(i, "_words", sep = ""), strsplit(get(i), " "))
      assign(paste(i, "_wc", sep = ""), sum(sapply(get(paste(i, "_words", sep = "")), length)))
      assign(paste(i, "_wpl", sep = ""), mean(sapply(get(paste(i, "_words", sep = "")), length)))
      assign(paste(i, "_wl", sep = ""), mean(nchar(unlist(get(paste(i, "_words", sep = ""))))))
      
      word_dat <- data.frame(source = i, line_count = get(paste(i, "_lines", sep = "")),
                             word_count = get(paste(i, "_wc", sep = "")),
                             mean_words_per_line = get(paste(i, "_wpl", sep = "")), 
                             mean_word_length = get(paste(i, "_wl", sep = "")))
      
      word_stats <- rbind(word_dat, word_stats)
}

kable(word_stats, digits = 2, caption = "Summary Statistics")
Summary Statistics

source    line_count   word_count   mean_words_per_line   mean_word_length
news           77259      2643969                 34.22               4.96
blog          899288     37334131                 41.52               4.61
twit         2360148     30373543                 12.87               4.42

The mean word length shows minimal variation across the three sources. It was assumed that the twitter source would show a lower mean word length due to the abbreviated text that is common on the messaging platform. To help clarify the distribution of word length, a frequency polygon is plotted. This visualization shows that the word length distribution is almost identical across all three sources. The summary statistics indicate that words per line exhibits more variability, so a frequency polygon is also plotted for this variable. The twitter source exhibits a much higher density at the shorter end of the x-axis than the other two sources. Given that tweets are artificially constrained to 140 characters, this is as expected.
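
The plotting code is not reproduced in this report; the following is a minimal sketch of how the two frequency polygons might be generated with ggplot2 and gridExtra, assuming the twit_words, blog_words, and news_words objects created in the summary-statistics loop are still in memory. The bin widths and axis limits are illustrative choices.

#combine per-source measurements into long data frames for plotting
len_dat <- rbind(data.frame(source = "twit", word_length = nchar(unlist(twit_words))),
                 data.frame(source = "blog", word_length = nchar(unlist(blog_words))),
                 data.frame(source = "news", word_length = nchar(unlist(news_words))))

wpl_dat <- rbind(data.frame(source = "twit", words_per_line = sapply(twit_words, length)),
                 data.frame(source = "blog", words_per_line = sapply(blog_words, length)),
                 data.frame(source = "news", words_per_line = sapply(news_words, length)))

#density-scaled frequency polygons; the x-axis limits trim the long tails
p1 <- ggplot(len_dat, aes(word_length, colour = source)) +
      geom_freqpoly(aes(y = ..density..), binwidth = 1) +
      xlim(0, 20) + labs(title = "Word Length")
p2 <- ggplot(wpl_dat, aes(words_per_line, colour = source)) +
      geom_freqpoly(aes(y = ..density..), binwidth = 5) +
      xlim(0, 100) + labs(title = "Words per Line")

grid.arrange(p1, p2, ncol = 2)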

Corpus Processing

The next step is to create a corpus from the three data sources (news, blogs, and twitter). This will serve as the data source for the predictive text model. Given the size of the data involved, a 10% sample is drawn from each source; this reduces the processing time and memory required to build the resulting model.

sam_percent <- 0.1
set.seed(232344)
twit_sam <- twit[sample(1:length(twit),length(twit)*sam_percent)]
blog_sam <- blog[sample(1:length(blog),length(blog)*sam_percent)]
news_sam <- news[sample(1:length(news),length(news)*sam_percent)]

docs <- c(twit_sam, blog_sam, news_sam)
corp <- VCorpus(VectorSource(docs))

The corpus can now be cleaned up. The text is converted to lowercase; punctuation, numbers, profanity, and English stopwords are removed; and the resulting white space is stripped.

#download Google's list of profane words and read into R
download.file("http://www.freewebheaders.com/wordpress/wp-content/uploads/full-list-of-bad-words-banned-by-google-txt-file.zip",
              destfile = "badwords.zip")
unzip("badwords.zip")
badwords <- read.csv(list.files(pattern = "full-list-of-bad-words")[1],
                     strip.white = TRUE, header = FALSE, stringsAsFactors = FALSE)
badwords <- sub("\\s+$", "", badwords$V1)

#transform corpus text
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, badwords)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

The corpus is now processed into 1-, 2-, and 3-word tokens (n-grams). These tokens can then be analyzed in terms of their frequencies and correlations. The results of this analysis will begin to inform the development of a predictive text model.

#create unigram document-term matrix and calculate frequencies
dtm1 <- DocumentTermMatrix(corp)
dtm1 <- removeSparseTerms(dtm1, .999)
freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)

#create bigram document-term matrix and calculate frequencies
BigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

dtm2 <- DocumentTermMatrix(corp, control = list(tokenize = BigramTokenizer))
dtm2 <- removeSparseTerms(dtm2, .999)
freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)

#create trigram document-term matrix and calculate frequencies
TrigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

dtm3 <- DocumentTermMatrix(corp, control = list(tokenize = TrigramTokenizer))
dtm3 <- removeSparseTerms(dtm3, .9999)
freq3 <- sort(colSums(as.matrix(dtm3)), decreasing = TRUE)
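
The frequency vectors can now be examined. Below is a minimal sketch of how the most frequent n-grams might be listed and plotted; the cutoffs of 10 and 15 terms are arbitrary choices for illustration.

#list the most frequent tokens in each n-gram set
head(freq1, 10)
head(freq2, 10)
head(freq3, 10)

#plot the top bigrams as a horizontal bar chart
top2 <- data.frame(bigram = names(freq2)[1:15], count = freq2[1:15])
ggplot(top2, aes(x = reorder(bigram, count), y = count)) +
      geom_col() +
      coord_flip() +
      labs(x = "bigram", y = "frequency", title = "Top 15 Bigrams")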

Next Steps

A predictive model will be constructed using the tokenized corpus. Similar to the exploratory analysis conducted above, n-gram frequencies will be used to estimate the most likely next word given the preceding tokens. The resulting model will then be integrated into a Shiny app that will provide predictions based on entered text.

Based on the top n-grams as listed above, common English stopwords will no longer be removed from the corpus. Some of the phrases, such as “can’t wait see”, only make sense with the linking words intact. The prediction that results from “can’t wait” should be “to see.” A model that returns only the word “see” would not be an effective engine for a predictive text system.
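
As a rough illustration of the intended approach (not the final model), a next-word lookup can be sketched directly from the trigram frequencies computed above. The helper name predict_next and its interface are hypothetical, and the eventual model will need smoothing and back-off to shorter n-grams.

#minimal sketch of a frequency-based next-word lookup using the trigram
#counts computed above (freq3 is already sorted by decreasing frequency)
predict_next <- function(phrase, freq = freq3, n = 3) {
      phrase <- tolower(phrase)
      #keep trigrams whose first two words match the entered phrase
      hits <- freq[grepl(paste0("^", phrase, " "), names(freq))]
      if (length(hits) == 0) return(character(0))
      #return the final word of the n most frequent matching trigrams
      sapply(strsplit(names(head(hits, n)), " "), tail, 1)
}

#with the current stopword-stripped corpus (and punctuation removed, so
#"can't" becomes "cant"), this is expected to return "see" rather than "to"
predict_next("cant wait")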