Executive Summary

This milestone report is part of the Data Science Capstone project provided by SwiftKey, a corporate partner in this course. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices, and one cornerstone of that keyboard is its predictive text models.

Our goal is to understand and build predictive text models like those used by SwiftKey, which suggest three options for the next word as users type on the keyboard. The first step toward that goal is a thorough exploratory analysis of the training data set provided by SwiftKey.

The motivation for the analysis is to:

  1. Create a basic report of summary statistics about the data sets
  2. Understand the distribution of words and relationship between the words in the corpora
  3. Build figures and tables to understand variation in the frequencies of word pairs in the data
  4. Briefly summarize the plan for creating the prediction algorithm and Shiny app

Set Up The Environment

library(stringr)
library(tm)
library(quanteda)
library(quanteda.textstats)  # provides textstat_frequency() in quanteda 3.0 and later
library(ggplot2)
library(wordcloud)

Load The SwiftKey Data

blogspath = "Coursera-SwiftKey/en_US/en_US.blogs.txt"
blogs = readLines(blogspath, encoding = "UTF-8", skipNul = TRUE)

newspath = "Coursera-SwiftKey/en_US/en_US.news.txt"
# en_US.news.txt is read in binary mode so that an embedded control character does not truncate the file
con = file(newspath, open = 'rb')
news = readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

twitterpath = "Coursera-SwiftKey/en_US/en_US.twitter.txt"
twitter = readLines(twitterpath, encoding = "UTF-8", skipNul = TRUE)

Basic Summaries

filePath = list(blogspath, newspath, twitterpath)
fileSize = round(sapply(filePath, file.size)/(1024^2), 2)

fileComb = list(blogs, news, twitter)
fileLength = sapply(fileComb, length)

lineWords = lapply(fileComb, function(fl) 
      sapply(fl, str_count, pattern = '\\w+'))
fileWords = sapply(lineWords, sum)
lineMean = round(sapply(lineWords, mean), 0)
lineMax = sapply(lineWords, max)

BasicSum = data.frame(FileName = c('en_US.blogs.txt', 
                                   'en_US.news.txt', 
                                   'en_US.twitter.txt'), 
                      FileSize = paste(fileSize, 'MB', sep = ''), 
                      TotalLines = fileLength, TotalWords = fileWords, 
                      MeanWPL = lineMean, MaxWPL = lineMax)
BasicSum
##            FileName FileSize TotalLines TotalWords MeanWPL MaxWPL
## 1   en_US.blogs.txt 200.42MB     899288   38309620      43   6851
## 2    en_US.news.txt 196.28MB    1010242   35624454      35   1928
## 3 en_US.twitter.txt 159.36MB    2360148   31003544      13     47

Take Samples from fileComb

Since the original data sets are large and processing them in full would take too long, we take a 1% random sample from each of the three files and combine the samples for further analysis.

set.seed(1872)
sampleSize = 0.01
sampleComb = sapply(fileComb, function(fl) 
      sample(fl, size = length(fl)*sampleSize, replace = FALSE))
sampleComb = unlist(sampleComb)
sampleComb = iconv(sampleComb, 'latin1', 'ASCII', sub = '')

# Remove the original data and save the sample as .txt
rm(list = c('blogs', 'news', 'twitter', 'fileComb', 'lineWords'))
samplePath = 'Coursera-SwiftKey/en_US/en_US.sample.txt'
write(sampleComb, file = samplePath)

Basic Summaries for The Sample Data

scSize = round(file.size(samplePath)/(1024^2), 2)
scLength = length(sampleComb)
sclineWords = sapply(sampleComb, str_count, pattern = '\\w+')
scWords = sum(sclineWords)
sclineMean = round(mean(sclineWords), 0)
sclineMax = max(sclineWords)

scBasic = data.frame(FileName = 'en_US.sample.txt', 
                     FileSize = paste(scSize, 'MB', sep = ''), 
                     TotalLines = scLength, TotalWords = scWords, 
                     MeanWPL = sclineMean, MaxWPL = sclineMax)
scBasic
##           FileName FileSize TotalLines TotalWords MeanWPL MaxWPL
## 1 en_US.sample.txt   5.51MB      42695    1042070      24    411

Now the combined sample data set is only about 5.5 MB, with 42695 lines and 1042070 words.

Create Corpus and Clean The Data

The tm package is used here to provide a framework for text mining. Several steps are taken to clean the data: 1) convert to lowercase, 2) remove URLs, mentions and email addresses, 3) remove profanity, 4) remove numbers, 5) remove punctuation, 6) remove stop words, 7) strip extra white space.

spCorpus = VCorpus(VectorSource(sampleComb))

# Function to remove URLs, mentions and email addresses
removePattern = function(x) {
      x = gsub('(f|ht)tp[^[:space:]]*', ' ', x)          # URLs
      x = gsub('[[:alnum:]._-]+@[[:alnum:].-]+', ' ', x)  # email addresses
      x = gsub('@\\w+', ' ', x)                           # mentions
      x
}

# The bad words list is downloaded from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt
# Regex metacharacters in the list are escaped before it is passed to removeWords
badwordsPath = 'Coursera-SwiftKey/en_US/badwords.txt'
badwords = readLines(badwordsPath, encoding = "UTF-8", skipNul = TRUE)
badwords = gsub('\\*', '\\\\*', badwords)
badwords = gsub('\\(', '\\\\(', badwords)

# Clean data with tm_map
spCorpus = tm_map(spCorpus, content_transformer(tolower))
spCorpus = tm_map(spCorpus, content_transformer(removePattern))
spCorpus = tm_map(spCorpus, removeWords, badwords)
spCorpus = tm_map(spCorpus, removeNumbers)
spCorpus = tm_map(spCorpus, removePunctuation)
spCorpus = tm_map(spCorpus, removeWords, stopwords("english"))
spCorpus = tm_map(spCorpus, stripWhitespace)
# Save spCorpus as .txt
writeLines(sapply(spCorpus, as.character), 
           con = 'Coursera-SwiftKey/en_US/scCorpus.txt')

Exploratory Analysis

In this part, the distributions of words and word pairs are analyzed to better understand the features of the training data. The quanteda package is used to create tokens objects and sets of n-grams. The 10 most frequent words and the 20 most frequent 2-grams and 3-grams are shown in bar charts. Also, since some words are used much more widely than others, the numbers of unique words needed to cover 50% and 90% of all word instances are calculated for later use.

Word Frequencies

tdm = TermDocumentMatrix(spCorpus)
wordFreq = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreqdf = data.frame(word = names(wordFreq), freq = wordFreq)
rownames(wordFreqdf) = NULL

# Top 10 most frequent words
plotdf1 = head(wordFreqdf, 10)
plot1 = ggplot(plotdf1, aes(x = reorder(word, -freq), y = freq))
plot1 = plot1 + geom_bar(stat = "identity") + 
      geom_text(aes(label = freq), vjust = -0.25) + 
      labs(x = 'Word', y = 'Frequency', 
           title = 'Top 10 Most Frequent Words')
plot1

# Generate a word cloud
set.seed(9999)
wordcloud(words = wordFreqdf$word, freq = wordFreqdf$freq, 
          max.words=200, random.order=FALSE, rot.per=0.33, 
          scale=c(3, 0.5), colors=brewer.pal(8, "Set2"))

2-gram Frequencies

toks = tokens(corpus(spCorpus))
toks2gram = tokens_ngrams(toks, n = 2, concatenator = ' ')
dfm2gram = dfm(toks2gram)
freq2gram = textstat_frequency(dfm2gram)

# Top 20 most frequent 2-grams
plotdf2 = head(freq2gram, 20)
plot2 = ggplot(plotdf2, aes(x = reorder(feature, -frequency), 
                            y = frequency))
plot2 = plot2 + geom_bar(stat = "identity") + 
      geom_text(aes(label = frequency), vjust = -0.25) + 
      labs(x = '2-grams', y = 'Frequency', 
           title = 'Top 20 Most Frequent 2-grams') + 
      theme(axis.text.x = element_text(angle = 45))
plot2

3-gram Frequencies

toks3gram = tokens_ngrams(toks, n = 3, concatenator = ' ')
dfm3gram = dfm(toks3gram)
freq3gram = textstat_frequency(dfm3gram)

# Top 20 most frequent 3-grams
plotdf3 = head(freq3gram, 20)
plot3 = ggplot(plotdf3, aes(x = reorder(feature, -frequency), 
                            y = frequency))
plot3 = plot3 + geom_bar(stat = "identity") + 
      geom_text(aes(label = frequency), vjust = -0.25) + 
      labs(x = '3-grams', y = 'Frequency', 
           title = 'Top 20 Most Frequent 3-grams') + 
      theme(axis.text.x = element_text(angle = 45))
plot3

Word Coverage

coverage = cumsum(wordFreqdf$freq)/sum(wordFreqdf$freq)
which(coverage > 0.5)[1]
## [1] 1281
which(coverage > 0.5)[1]/length(coverage)
## [1] 0.01833982
which(coverage > 0.9)[1]
## [1] 22685
which(coverage > 0.9)[1]/length(coverage)
## [1] 0.3247767

As we can see, only the 1281 most frequent words (about 1.8% of the unique words) are needed to cover 50% of all word instances in the corpus. For 90% coverage, 22685 words (about 32.5%) are needed.
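
As a rough sketch of the "later use" mentioned in the Exploratory Analysis section (the 90% threshold and the names cutoff90 and prunedDict are assumptions made here purely for illustration), the coverage vector could be turned into a pruned dictionary so that the prediction tables only store frequent words.

# Keep just enough of the most frequent words to cover 90% of all word
# instances; less frequent words could be mapped to an unknown-word token
cutoff90 = which(coverage > 0.9)[1]
prunedDict = wordFreqdf$word[1:cutoff90]
length(prunedDict)  # 22685 words, about a third of the unique vocabulary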

Next Steps

The final predictive text model will be built on the results of this exploratory analysis. The working assumption is that the probability of a word depends only on the previous n-1 words, which is known as an n-gram model. However, the probabilities usually cannot be derived directly from frequency counts, because many possible n-grams never appear in the corpus. Some form of smoothing is necessary to assign part of the total probability mass to unseen words and n-grams.
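
As a minimal sketch of this idea (not the final predictor), an unsmoothed bigram model can be built directly from the frequency tables above; the names freq1gram, uniCount, biCount and bigramProb are introduced here only for illustration.

# Unsmoothed maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1)
# Any bigram missing from the sample gets probability 0, which is exactly
# why smoothing or backoff is needed in the final model
freq1gram = textstat_frequency(dfm(toks))
uniCount = setNames(freq1gram$frequency, freq1gram$feature)
biCount = setNames(freq2gram$frequency, freq2gram$feature)

bigramProb = function(w1, w2) {
      cnt = biCount[paste(w1, w2)]
      if (is.na(cnt) || is.na(uniCount[w1])) return(0)
      unname(cnt / uniCount[w1])
}

# To suggest three options, the final model would return the three words w2
# with the highest P(w2 | w1) given the user's last typed word w1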

In the next few weeks, the n-gram model and smoothing techniques such as Good-Turing discounting will be explored to determine the best way to balance the performance and efficiency of the model, which will then be deployed as a Shiny app.