Executive Summary

The Johns Hopkins Data Science Capstone aims to develop a predictive text model for the SwiftKey smart keyboard. This summary of the exploratory analysis indicates that the next phase of predictive modeling should focus on sampling the blog text. Sampling the text will yield a more efficient, mobile-phone-friendly predictive model. Focusing on the blog text further streamlines the exercise because its rich language is broad enough to cover common user experiences. Accuracy testing against the news and twitter samples will assess the usefulness of the final model. Although it is beyond the scope of this project, UX testing should also be applied for refinement prior to launch.

Background and Objectives

Around the world, people are spending an increasing amount of time on their mobile devices. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.

The data for this project come from a corpus called HC Corpora and include text files for four locales: en_US, de_DE, ru_RU and fi_FI. The analysis focuses on the en_US blog, news and twitter files.

At the time of this report there is no information available about who authored the data or where, when, why or how it was acquired.

The reproducible code in this analysis appears at the end of this summary.

Key Findings

Finding 1: Sample the text data to maximize efficiency

The data sets are large and require substantial system time to process, making them too heavy for standard mobile phone use. Therefore, mining the information from representative, random samples, following the principles of inferential statistics, is an efficient alternative to processing the entire body of text in the files.

The twitter file is the smallest in the collection.

## # A tibble: 4 × 2
##           file_name size_mb
##               <chr>   <dbl>
## 1         milestone    0.00
## 2 en_US.twitter.txt  159.36
## 3    en_US.news.txt  196.28
## 4   en_US.blogs.txt  200.42

On a lean Windows machine with a 2.13 GHz processor, even this smallest file needs more than 30 seconds of CPU time to read. Therefore, using representative, random samples to explore and analyze the data and to build predictive models will be both reliable and efficient.

##    user  system elapsed 
##   34.84    0.59   36.30
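
Concretely, a representative sample can be drawn with a simple Bernoulli draw per line. The snippet below is only a minimal sketch of the idea (the 1% keep rate is an arbitrary illustration); the full extraction function actually used for this analysis appears in the appendix.

set.seed(1234)
all.lines     <- readLines("en_US.twitter.txt", skipNul = TRUE)  # full file
keep          <- rbinom(n = length(all.lines), size = 1, prob = 0.01) == 1
sampled.lines <- all.lines[keep]                                 # ~1% random sample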

Finding 2: The blog text has the richest language.

After cleaning and filtering for stopwords and profanity, the blog sample has a greater number of words as well as a broader range of word types. This suggests that a predictive text model built on the language of the blog text may be extendable to news and twitter texts. If so, building a single model will be most efficient.

Vocabulary = the total number of words in the sample.

Type = number of unique words in the sample.

TTR = Type/Vocabulary ratio. This indicates the complexity of the language; higher scores indicate more complex language. Here twitter is high because its language is creative, while news is low because it relies on standard, common terms.

Diversity = Type divided by the square root of (2 x Vocabulary). This is a measure of language diversity that is independent of the sample size. A small worked example follows the table below.

##            blog.sample.txt news.sample.txt twit.sample.txt
## vocabulary       247512.00       215686.00        77224.00
## type              12276.00        10168.00        11653.00
## TTR                   0.05            0.05            0.15
## Diversity            17.45           15.48           29.65
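
As a quick check on these definitions, the blog column of the table can be reproduced directly from its vocabulary and type counts. This is only a worked illustration of the formulas above, using the figures already reported in the table.

vocabulary <- 247512                     # total words in the blog sample
type       <- 12276                      # unique words in the blog sample
round(type / vocabulary, 2)              # TTR       = 0.05
round(type / sqrt(2 * vocabulary), 2)    # Diversity = 17.45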

This plot of the top 100 most frequent words comes from the blog, news and twitter vocabularies combined. The solid green presence reflects how the news text uses and reuses common words. Twitter accounts for the meager blue presence, reflecting its more diverse and fluid vocabulary.

Finally, these word clouds demonstrate the spirit and flavor of each type of text. This word cloud includes the top terms from the blog sample. It features terms that tend to be related to emotions and personal life, such as “always”, “never”, “good”, “bad” and “fun”.

In this word cloud for the news text notice the appearance of formal terms associated with politics and business such as “president”, “contributions”, “million” and “assets”.

This word cloud reveals the unique and casual nature of the twitter text, which makes it diverse but, at the same time, a poor candidate as a predictor of common language. Examples include “lol”, “hey”, and “guys”.

Next Steps - Strategy for Building the Predictive Text Model

The next phase of the project will focus on building a predictive text model using a sample of the blog text. It will incorporate n-gram algorithms. An n-gram is a contiguous sequence of n words, and an n-gram model predicts the next word from the words that precede it. Standard machine learning techniques will be used to split the sample into training, validation and test subsets. Accuracy tests against samples of the news and twitter texts will assess the strength of the model.
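
As a brief illustration of the n-gram building block (not the final model), the RWeka package loaded in the appendix can tokenize a sentence into two- and three-word sequences; the example sentence below is invented for illustration.

library(RWeka)
sentence <- "thanks for the follow and have a great day"
NGramTokenizer(sentence, Weka_control(min = 2, max = 2))   # bigrams,  e.g. "thanks for"
NGramTokenizer(sentence, Weka_control(min = 3, max = 3))   # trigrams, e.g. "have a great"
# a trigram model would predict "great" from the preceding words "have a"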


Appendix

Reproducible code that drove this analysis

Built with R version 3.3.2.

#Initialize with libraries and data sets
library(tm)
library(wordcloud)
library(RWeka)
library(tidyr)
library(ggplot2)
library(plyr)
library(dplyr)
library(tibble)
library(reshape2)

options(stringsAsFactors = FALSE)

#Load text data
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip", exdir = "./")

#check size of files
file.size <- file.info(dir())
file.size <- tibble::rownames_to_column(file.size, "file_name")
file.size <- tbl_df(file.size)
file.size <- file.size %>% arrange(size) %>% 
  mutate(size_mb = round(size/2^20, 2)) %>%
  select(file_name, size_mb)

file.size

#time the read of the smallest text file
smallfile    <- "en_US.twitter.txt"
mydata       <- file(smallfile, "r")
readTime     <- system.time(readLines(mydata))
close(mydata)
readTime
rm(mydata)
rm(smallfile)
gc()


#Load profanity filter
url <- "https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/Terms-to-Block.csv"

download.file(url, destfile = "Terms-to-Block.csv")
naughty <- read.csv("Terms-to-Block.csv", 
                    skip=3,
                    stringsAsFactors = FALSE)
naughty <- naughty[,1]
naughty  <- gsub(",", "", naughty)


#extract samples from the English language data

# Draw a random sample of roughly `sample.size` lines from a large text file.
# The file is read once and split into chunks; each chunk is kept with
# probability sample.size / total lines, so the sample spans the whole file
# and averages about sample.size lines.
extract <- function(filename, sample.size) {
  mydata      <- file(filename, "r")
  dataR       <- readLines(mydata, skipNul = TRUE)
  close(mydata)

  chunksize   <- max(1L, as.integer(2 * (length(dataR) / sample.size)))
  chunkstarts <- seq(1, length(dataR), by = chunksize)
  keep.prob   <- min(1, sample.size / length(dataR))
  sample.text <- as.data.frame(NULL)

  set.seed(1234)                      # seed once so the sample is reproducible
  for (start in chunkstarts) {
    end <- min(start + chunksize - 1, length(dataR))
    if (rbinom(n = 1, size = 1, prob = keep.prob) == 1) {
      pull        <- data.frame(pull = dataR[start:end])
      sample.text <- rbind(sample.text, pull)
    }
  }
  rm(dataR)
  gc()
  return(sample.text)
}


news.sample <- extract("en_US.news.txt",    10000)
blog.sample <- extract("en_US.blogs.txt",   10000)
twit.sample <- extract("en_US.twitter.txt", 10000)

gc()

dir.create("milestone")
setwd("milestone")

pathname <- getwd()   # the milestone directory; the sample files are written here below

#Build and clean Corpus
write.table(news.sample, "news.sample.txt", quote=FALSE)
write.table(blog.sample, "blog.sample.txt", quote=FALSE)
write.table(twit.sample, "twit.sample.txt", quote=FALSE)

milestone.corpus <- Corpus(DirSource(pathname))

cleaning     <- function(corpusfile) {
  cleaned    <- tm_map(corpusfile, removePunctuation)
  cleaned    <- tm_map(cleaned, content_transformer(tolower))
  cleaned    <- tm_map(cleaned, stripWhitespace)
  cleaned    <- tm_map(cleaned, removeWords, stopwords("english"))
  cleaned    <- tm_map(cleaned, removeWords, naughty)
  return(cleaned)
}

milestone.cleaned.corpus <- cleaning(milestone.corpus)



#Create Term-Document Matrix (terms as rows, documents as columns) and look for patterns

dtm          <- TermDocumentMatrix(milestone.cleaned.corpus)
dtm2         <- as.matrix(dtm)

#Explore data patterns
##1. Characterize the Corpus by word count, type, ratios & Diversity

library(ngram)
corpus.summary <- function(data) {
 vocabulary    <- round(colSums(data), 0)          # total words per document
 type          <- round(colSums(data != 0), 0)     # unique words per document
 TTR           <- round(type/vocabulary, 2)
 Diversity     <- round(type/sqrt(2*vocabulary), 2)
 Corpora.Stats <- rbind(vocabulary, type, TTR, Diversity)
 print(Corpora.Stats)
}


corpus.summary(dtm2)


# 2. Are  high frequency terms evenly distributed across the texts?

highfrequency   <- function(data, n) {
  frequency     <- rowSums(data)
  trm.mtrx.freq <- cbind(data, frequency)
  trm.mtrx.freq <- trm.mtrx.freq[order(trm.mtrx.freq[,"frequency"],
                                       decreasing=TRUE),]
  top           <- as.data.frame(trm.mtrx.freq[1:n,])
  top           <- tibble::rownames_to_column(top, "token")
  return(top)
}

tophundred <- highfrequency(dtm2, 100)


#Create ggplot that demonstrates distribution of words

tophundred$rank <- as.factor(rank(tophundred$frequency, ties.method="first"))
tophundred.melt <- melt(tophundred, id=c("token", "frequency", "rank"),
                        measure.vars=c("blog.sample.txt",
                                       "news.sample.txt",
                                       "twit.sample.txt"))
tophundred.melt <- rename(tophundred.melt, media = variable, word.count=value)


ggplot(data = tophundred.melt, aes(x = rank, y = word.count, fill = media)) + 
  geom_bar(stat="identity")+coord_flip()

#3. Compare nature of the language in sample
par(mfrow=c(1,3))
blog.cloud <- sort(dtm2[,"blog.sample.txt"], decreasing=TRUE)
blog.cloud <- data.frame(word=names(blog.cloud), freq=blog.cloud)
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "High Frequency Words in Blog Sample")
wordcloud(blog.cloud$word, blog.cloud$freq, 
          max.words = 100,
          ordered.colors = TRUE,
          random.order = FALSE)


news.cloud <- sort(dtm2[,"news.sample.txt"], decreasing=TRUE)
news.cloud <- data.frame(word=names(news.cloud), freq=news.cloud)
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "High Frequency Words in News Sample")
wordcloud(news.cloud$word, news.cloud$freq, 
          max.words = 100,
          ordered.colors = TRUE,
          random.order = FALSE)


twit.cloud <- sort(dtm2[,"twit.sample.txt"], decreasing=TRUE)
twit.cloud <- data.frame(word=names(twit.cloud), freq=twit.cloud)
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "High Frequency Words in Twitter Sample")
wordcloud(twit.cloud$word, twit.cloud$freq, 
          max.words = 100,
          ordered.colors = TRUE,
          random.order = FALSE)