This milestone report is part of the Data Science Capstone project offered with SwiftKey, a corporate partner in this course that builds a smart keyboard making it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is its predictive text models.
Our goal is to understand and build predictive text models like those used by SwiftKey, which suggest three options for the next word as users type on the keyboard. To accomplish this, the first step is a thorough exploratory analysis of the training data set provided by SwiftKey.
The motivation for the analysis is to: 1) demonstrate that the data have been downloaded and loaded successfully, 2) summarize the basic statistics of the three data sets, 3) explore the distributions of words and n-grams, and 4) outline a plan for the predictive model.
library(stringr)
library(tm)
library(quanteda)
library(ggplot2)
library(wordcloud)
blogspath = "Coursera-SwiftKey/en_US/en_US.blogs.txt"
blogs = readLines(blogspath, encoding = "UTF-8", skipNul = TRUE)
newspath = "Coursera-SwiftKey/en_US/en_US.news.txt"
# en_US.news.txt is read in binary mode so an embedded control character does not truncate the read
con = file(newspath, open = 'rb')
news = readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
twitterpath = "Coursera-SwiftKey/en_US/en_US.twitter.txt"
twitter = readLines(twitterpath, encoding = "UTF-8", skipNul = TRUE)
filePath = list(blogspath, newspath, twitterpath)
fileSize = round(sapply(filePath, file.size)/(1024^2), 2)
fileComb = list(blogs, news, twitter)
fileLength = sapply(fileComb, length)
lineWords = lapply(fileComb, function(fl)
sapply(fl, str_count, pattern = '\\w+'))
fileWords = sapply(lineWords, sum)
lineMean = round(sapply(lineWords, mean), 0)
lineMax = sapply(lineWords, max)
BasicSum = data.frame(FileName = c('en_US.blogs.txt',
'en_US.news.txt',
'en_US.twitter.txt'),
FileSize = paste(fileSize, 'MB', sep = ''),
TotalLines = fileLength, TotalWords = fileWords,
MeanWPL = lineMean, MaxWPL = lineMax)
BasicSum
## FileName FileSize TotalLines TotalWords MeanWPL MaxWPL
## 1 en_US.blogs.txt 200.42MB 899288 38309620 43 6851
## 2 en_US.news.txt 196.28MB 1010242 35624454 35 1928
## 3 en_US.twitter.txt 159.36MB 2360148 31003544 13 47
Because the original data sets are large and processing them in full would take too long, we take a 1% random sample from each of the three files and combine the samples for further analysis.
set.seed(1872)
sampleSize = 0.01
sampleComb = sapply(fileComb, function(fl)
sample(fl, size = length(fl)*sampleSize, replace = FALSE))
sampleComb = unlist(sampleComb)
sampleComb = iconv(sampleComb, 'latin1', 'ASCII', sub = '')
# Remove the original data and save the sample as .txt
rm(list = c('blogs', 'news', 'twitter', 'fileComb', 'lineWords'))
samplePath = 'Coursera-SwiftKey/en_US/en_US.sample.txt'
write(sampleComb, file = samplePath)
scSize = round(file.size(samplePath)/(1024^2), 2)
scLength = length(sampleComb)
sclineWords = sapply(sampleComb, str_count, pattern = '\\w+')
scWords = sum(sclineWords)
sclineMean = round(mean(sclineWords), 0)
sclineMax = max(sclineWords)
scBasic = data.frame(FileName = 'en_US.sample.txt',
FileSize = paste(scSize, 'MB', sep = ''),
TotalLines = scLength, TotalWords = scWords,
MeanWPL = sclineMean, MaxWPL = sclineMax)
scBasic
## FileName FileSize TotalLines TotalWords MeanWPL MaxWPL
## 1 en_US.sample.txt 5.51MB 42695 1042070 24 411
The combined sample is now only about 5.5 MB, with 42695 lines and 1042070 words.
The tm package is used here to provide a framework for text mining. Several steps are taken to clean the data: 1) convert to lowercase, 2) remove URLs, mentions and emails, 3) remove profanity, 4) remove numbers, 5) remove punctuation, 6) remove stop words, 7) strip extra white space.
spCorpus = VCorpus(VectorSource(sampleComb))
# Function to remove URLs, emails and mentions (chained so every substitution is kept)
removePattern = function(x) {
  x = gsub('(f|ht)tp[^[:space:]]*', ' ', x)            # URLs
  x = gsub('[[:alnum:]._-]+@[[:alnum:].-]+', ' ', x)   # emails
  gsub('@\\w+', ' ', x)                                # mentions
}
# The badwords list are downloaded from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt
# The data is processed before use
badwordsPath = 'Coursera-SwiftKey/en_US/badwords.txt'
badwords = readLines(badwordsPath, encoding = "UTF-8", skipNul = TRUE)
badwords = gsub('\\*', '\\\\*', badwords)
badwords = gsub('\\(', '\\\\(', badwords)
# Clean data with tm_map
# content_transformer() keeps the corpus structure intact when applying plain functions
spCorpus = tm_map(spCorpus, content_transformer(tolower))
spCorpus = tm_map(spCorpus, content_transformer(removePattern))
spCorpus = tm_map(spCorpus, removeWords, badwords)
spCorpus = tm_map(spCorpus, removeNumbers)
spCorpus = tm_map(spCorpus, removePunctuation)
spCorpus = tm_map(spCorpus, removeWords, stopwords("english"))
spCorpus = tm_map(spCorpus, stripWhitespace)
# Save spCorpus as .txt
writeLines(sapply(spCorpus, as.character),
           con = 'Coursera-SwiftKey/en_US/scCorpus.txt')
In this part, the distributions of words and word pairs are analyzed to better understand the features of the training data. The quanteda package is used to create tokens objects and sets of n-grams. The 10 most frequent words and the 20 most frequent 2-grams and 3-grams are shown in bar charts. Also, since some words are used much more widely than others, the numbers of unique words needed to cover 50% and 90% of all word instances are calculated for later use.
tdm = TermDocumentMatrix(spCorpus)
wordFreq = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreqdf = data.frame(word = names(wordFreq), freq = wordFreq)
rownames(wordFreqdf) = NULL
# Top 10 most frequent words
plotdf1 = head(wordFreqdf, 10)
plot1 = ggplot(plotdf1, aes(x = reorder(word, -freq), y = freq))
plot1 = plot1 + geom_bar(stat = "identity") +
geom_text(aes(label = freq), vjust = -0.25) +
labs(x = 'Word', y = 'Frequency',
title = 'Top 10 Most Frequent Words')
plot1
# Generate a word cloud
set.seed(9999)
wordcloud(words = wordFreqdf$word, freq = wordFreqdf$freq,
max.words=200, random.order=FALSE, rot.per=0.33,
scale=c(3, 0.5), colors=brewer.pal(8, "Set2"))
# Convert the cleaned corpus to a character vector for quanteda
toks = tokens(corpus(sapply(spCorpus, as.character)))
toks2gram = tokens_ngrams(toks, n = 2, concatenator = ' ')
dfm2gram = dfm(toks2gram)
freq2gram = textstat_frequency(dfm2gram)
# Top 20 most frequent 2-grams
plotdf2 = head(freq2gram, 20)
plot2 = ggplot(plotdf2, aes(x = reorder(feature, -frequency),
y = frequency))
plot2 = plot2 + geom_bar(stat = "identity") +
geom_text(aes(label = frequency), vjust = -0.25) +
labs(x = '2-grams', y = 'Frequency',
title = 'Top 20 Most Frequent 2-grams') +
theme(axis.text.x = element_text(angle = 45))
plot2
toks3gram = tokens_ngrams(toks, n = 3, concatenator = ' ')
dfm3gram = dfm(toks3gram)
freq3gram = textstat_frequency(dfm3gram)
# Top 20 most frequent 3-grams
plotdf3 = head(freq3gram, 20)
plot3 = ggplot(plotdf3, aes(x = reorder(feature, -frequency),
y = frequency))
plot3 = plot3 + geom_bar(stat = "identity") +
geom_text(aes(label = frequency), vjust = -0.25) +
labs(x = '3-grams', y = 'Frequency',
title = 'Top 20 Most Frequent 3-grams') +
theme(axis.text.x = element_text(angle = 45))
plot3
coverage = cumsum(wordFreqdf$freq)/sum(wordFreqdf$freq)
which(coverage > 0.5)[1]
## [1] 1281
which(coverage > 0.5)[1]/length(coverage)
## [1] 0.01833982
which(coverage > 0.9)[1]
## [1] 22685
which(coverage > 0.9)[1]/length(coverage)
## [1] 0.3247767
As we can see, only the 1281 most frequent words (about 1.8% of all unique words) are needed to cover 50% of all word instances in the corpus, while 22685 words (32.5%) are needed for 90% coverage.
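As one possible use of these coverage figures (an illustrative step only, with dict90 as a hypothetical name), the sorted word list could be truncated to a dictionary that still covers 90% of all word instances, which would shrink the final model considerably.
# Hypothetical pruned dictionary covering 90% of all word instances
dict90 = wordFreqdf$word[1:which(coverage > 0.9)[1]]
length(dict90)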
The final predictive text model will be built on the results of this exploratory analysis. The working assumption is that the probability of a word depends only on the previous n-1 words, which is known as an n-gram model. However, the probabilities usually cannot be derived directly from frequency counts, because many valid n-grams never appear in the corpus. Some form of smoothing is therefore necessary to assign part of the total probability mass to unseen words or n-grams.
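To make the idea concrete, the sketch below is a minimal illustration (not the final model) of how the 2-gram and 3-gram frequency tables built above could suggest the next word: it returns the most frequent 3-grams that start with the last two typed words and backs off to 2-grams when too few matches are found. The function name predictNext and the number of suggestions are hypothetical choices, and because stop words were removed from this corpus, queries containing them will not match.
# Illustrative next-word lookup with simple frequency-based backoff
predictNext = function(phrase, n = 3) {
  words = tail(unlist(strsplit(tolower(phrase), '\\s+')), 2)
  hits = character(0)
  if (length(words) == 2) {
    # Most frequent 3-grams starting with the last two words
    prefix3 = paste0('^', paste(words, collapse = ' '), ' ')
    hits = sub(prefix3, '', grep(prefix3, freq3gram$feature, value = TRUE))
  }
  if (length(hits) < n) {
    # Back off to 2-grams starting with the last word
    prefix2 = paste0('^', tail(words, 1), ' ')
    hits = unique(c(hits, sub(prefix2, '',
                    grep(prefix2, freq2gram$feature, value = TRUE))))
  }
  head(hits, n)
}
predictNext('happy new')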
In the next few weeks, n-gram models of different orders, together with smoothing and backoff techniques such as Good-Turing discounting, will be explored to find the best balance between the accuracy and efficiency of the model.
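As a rough illustration of the Good-Turing idea (a sketch only, using the 2-gram frequency table from above), the adjusted count c* = (c + 1) * N(c + 1) / N(c) can be computed from the counts-of-counts, and N(1) divided by the total count gives the probability mass reserved for unseen 2-grams; a practical implementation such as Simple Good-Turing would additionally smooth the counts-of-counts before applying the formula.
# Illustrative Good-Turing adjustment of the observed 2-gram counts
countOfCounts = table(freq2gram$frequency)     # N(c): number of 2-grams seen c times
cVals = as.numeric(names(countOfCounts))
Nc = as.numeric(countOfCounts)
# Adjusted count c* = (c + 1) * N(c + 1) / N(c), defined where N(c + 1) exists
cStar = (cVals + 1) * Nc[match(cVals + 1, cVals)] / Nc
head(data.frame(c = cVals, Nc = Nc, cStar = round(cStar, 3)))
# Probability mass reserved for unseen 2-grams
Nc[match(1, cVals)] / sum(freq2gram$frequency)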