Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices.

The goal of this milestone report is to demonstrate that we are on track to build a prediction algorithm that suggests the next word or phrase, using the capstone dataset. Specifically, the report aims to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

R packages

In this step, we load the following R packages:

library(stringi)
library(stringr)
library(dplyr)
library(tm)
library(wordcloud)
library(RColorBrewer)  # provides brewer.pal(), used for the word cloud palette
library(slam)
library(ggplot2)

Data

We use the data provided by SwiftKey for this capstone project. The zip file is more than 500 MB in size, so it is most convenient to download it with a web browser or a download manager.
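
Alternatively, the download and extraction can be scripted. Below is a minimal sketch; the URL is the course-provided link at the time of writing and may change, and the checks simply avoid repeating work that has already been done.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
    download.file(url, zip_file, mode = "wb")  # > 500 MB, may take a while
}
if (!dir.exists("final")) {
    unzip(zip_file)  # extracts into the "final" folder used below
}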

Below is the structure of the extracted package:

cat(system("tree final /f /a", intern = TRUE), sep = "\n")
## +---final
## |   +---de_DE
## |   |       de_DE.blogs.txt
## |   |       de_DE.news.txt
## |   |       de_DE.twitter.txt
## |   |       
## |   +---en_US
## |   |       en_US.blogs.txt
## |   |       en_US.news.txt
## |   |       en_US.twitter.txt
## |   |       
## |   +---fi_FI
## |   |       fi_FI.blogs.txt
## |   |       fi_FI.news.txt
## |   |       fi_FI.twitter.txt
## |   |       
## |   \---ru_RU
## |           ru_RU.blogs.txt
## |           ru_RU.news.txt
## |           ru_RU.twitter.txt

There are four folders, each containing text data for one language: German, English, Finnish and Russian. We focus on the English data (folder “en_US”). First we look at the file sizes (in bytes):

file_list <- paste0(getwd(),"/final","/en_US/", list.files("final/en_US/"))
lapply(file_list, function(x) file.info(x)$size)
## [[1]]
## [1] 210160014
## 
## [[2]]
## [1] 205811889
## 
## [[3]]
## [1] 167105338
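
The sizes above are reported in bytes; converting them to megabytes makes the output easier to read (a small optional step):

# Report the same file sizes in megabytes for readability
data.frame(file = basename(file_list),
           size_MB = round(sapply(file_list, function(x) file.info(x)$size) / 1024^2, 1),
           row.names = NULL)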

The data was collected from blogs, news and Twitter (one file per source). We count how many entries (lines) and how many words each file contains:

# Read a file in binary mode (avoids problems with embedded special
# characters) and report its number of lines and an approximate word count
read_binary <- function(file_path)
{
    print(file_path)
    con <- file(file_path, open = "rb")
    data <- readLines(con, encoding = "UTF-8")
    print(paste("The number of lines:", length(data)))
    # approximate word count: number of non-word separators per line, plus one
    count_of_word <- sapply(gregexpr("\\W+", data), length) + 1
    print(paste("The number of words:", sum(count_of_word)))
    close(con)
    rm(con)
    gc()
    return(data)
}

blogs <- read_binary("final/en_US/en_US.blogs.txt")
## [1] "final/en_US/en_US.blogs.txt"
## [1] "The number of lines: 899288"
## [1] "The number of words: 39121566"
twitter <- read_binary("final/en_US/en_US.twitter.txt")
## [1] "final/en_US/en_US.twitter.txt"
## [1] "The number of lines: 2360148"
## [1] "The number of words: 32793388"
news <- read_binary("final/en_US/en_US.news.txt")
## [1] "final/en_US/en_US.news.txt"
## [1] "The number of lines: 1010242"
## [1] "The number of words: 36721104"

The distribution of line lengths is also interesting:

blogs_length <- nchar(blogs)
news_length <- nchar(news)
twitter_length <- nchar(twitter)
hist(blogs_length)

hist(news_length)

hist(twitter_length)

summary(blogs_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40830
summary(news_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11380.0
summary(twitter_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00

Twitter restricts the length of each tweet (the maximum observed here is 140 characters), which explains why its line lengths differ noticeably from the other two sources.

Sampling and cleaning data

Before further analysis, we sample the data. The strategy is as follows:

  1. For each file, extract randomly 100,000 entries.
  2. Clean the data: keep only alphanumeric characters, remove profanity (“bad words”), and so on. Since we analyze English only, characters outside the standard alphabet can simply be dropped, which makes cleaning easier than for languages with non-standard characters. A sketch of the profanity filter appears after the sampling code below.
  3. Combine those entries into one data set and use it as our training data.

# Sample 100,000 entries from each source (seed fixed for reproducibility)
set.seed(0)
sample_blogs <- sample(blogs, 100000)
set.seed(0)
sample_news <- sample(news, 100000)
set.seed(0)
sample_twitter <- sample(twitter, 100000)

# Drop apostrophes, then replace every remaining non-alphanumeric character
# with a space
drop_non_alphabet <- function(text_data)
{
    clean_data <- str_replace_all(text_data, "\'", "")
    clean_data <- str_replace_all(clean_data, "[^[:alnum:]]", " ")
    return(clean_data)
}
sample_blogs <- drop_non_alphabet(sample_blogs)
sample_twitter <- drop_non_alphabet(sample_twitter)
sample_news <- drop_non_alphabet(sample_news)
sample_data <- c(sample_blogs, sample_news, sample_twitter)
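
The profanity-removal step mentioned above is only sketched here. It assumes a plain-text word list, one word per line, in a hypothetical file badwords.txt; both the file name and its contents are placeholders, not part of the SwiftKey data.

# Sketch of the profanity filter; "badwords.txt" is a hypothetical
# one-word-per-line list that must be supplied separately
badwords <- readLines("badwords.txt", encoding = "UTF-8")
remove_badwords <- function(text_data, words)
{
    pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
    str_replace_all(text_data, regex(pattern, ignore_case = TRUE), " ")
}
sample_data <- remove_badwords(sample_data, badwords)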

Descriptive analysis

We investigate our sample data size:

print(paste("The number of lines:", length(sample_data)))
## [1] "The number of lines: 300000"
count_of_word <- sapply(gregexpr("\\W+", sample_data), length) + 1
print(paste("The number of word:", sum(count_of_word)))
## [1] "The number of word: 9363382"

Thanks to the tm package, we initialize the corpus and lowercase all tokens:

myCorpus <- Corpus(VectorSource(sample_data))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
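
A few more tm transformations could be applied at this stage, for example dropping numbers and collapsing whitespace. The sketch below keeps them in a separate object (the name cleanCorpus is illustrative) so the exploratory figures that follow stay unchanged.

# Possible additional cleaning steps, kept separate from the exploratory corpus
cleanCorpus <- tm_map(myCorpus, removeNumbers)      # digits add little to word prediction
cleanCorpus <- tm_map(cleanCorpus, stripWhitespace) # collapse repeated spaces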

We use a word cloud to visualize word frequencies:

ggColors <- function(n) {
  hues = seq(15, 375, length=n+1)
  hcl(h=hues, l=65, c=100)[1:n]
}
gg.cols <- ggColors(8)
bp.cols<- c("light blue","cornflowerblue", "coral2", brewer.pal(8,"Dark2"))
wordcloud(
    myCorpus, 
    max.words = 500,
    colors = bp.cols
    )

A bar chart shows more detail for the top 20 tokens:

myTDM <- TermDocumentMatrix(myCorpus)
freq <- sort(row_sums(myTDM, na.rm=TRUE), decreasing=TRUE)
frequency_table <- data.frame(word = names(freq), freq = freq)
sum <- sum(frequency_table$freq)
frequency_table$percent <- (100*frequency_table$freq)/sum

plot_data <- frequency_table[1:20,]
rownames(plot_data) <- NULL

g <- ggplot(plot_data, aes(reorder(word, percent), percent))
g + geom_bar(stat = "identity") + coord_flip() + xlab("word") + ylab("frequency (%)")

As the charts show, the most frequent words are common words that carry little meaning. Filtering them out gives a more interesting picture: the most “meaningful” words become visible, although their frequencies drop considerably.

myCorpus <- tm_map(myCorpus, removeWords,
                   c(stopwords("SMART"), "thy", "thou", "thee", "the", "and", "but"))
wordcloud(
    myCorpus, 
    max.words = 500,
    colors = bp.cols
    )

myTDM <- TermDocumentMatrix(myCorpus)
freq <- sort(row_sums(myTDM, na.rm=TRUE), decreasing=TRUE)
frequency_table <- data.frame(word = names(freq), freq = freq)
sum <- sum(frequency_table$freq)
frequency_table$percent <- (100*frequency_table$freq)/sum

plot_data <- frequency_table[1:20,]
rownames(plot_data) <- NULL

g <- ggplot(plot_data, aes(reorder(word, percent), percent))
g + geom_bar(stat = "identity") + coord_flip() + xlab("word") + ylab("frequency (%)")

Analytic strategy

To predict the next typed word, we use an n-gram model. The idea is simple: from the sequences in the sample text data, we calculate the probability of the next token given the most recently typed tokens (the previous 1, 2, 3 or 4 tokens). From the corpus above, we build four n-gram tables (a construction sketch follows the list):

  1. Bigram
  2. Trigram
  3. Four-gram
  4. Five-gram
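
As a rough illustration of how such frequency tables could be built from the cleaned sample, a minimal sketch is shown below; build_ngrams() and its output format are illustrative only, not the final implementation.

# Build an n-gram frequency table from a character vector of cleaned text;
# this simple version lets n-grams span line boundaries, which the final
# implementation should avoid
build_ngrams <- function(text_data, n)
{
    tokens <- unlist(strsplit(tolower(text_data), "\\s+"))
    tokens <- tokens[tokens != ""]
    m <- length(tokens) - n + 1
    # slide a window of size n over the token stream by shifting the vector
    cols <- lapply(seq_len(n), function(i) tokens[i:(i + m - 1)])
    sort(table(do.call(paste, cols)), decreasing = TRUE)
}

bigram_freq <- build_ngrams(sample_data, 2)
head(bigram_freq)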

We also plan to implement the Stupid Backoff procedure: when the five-gram table fails to produce a prediction for a sequence of tokens, the search continues in the four-gram table, and so on. This algorithm was chosen for its conceptual simplicity, the availability of existing packages, and the ease of updating the model on the fly.
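
A minimal sketch of that backoff lookup is given below. It assumes pre-built lookup tables ngram_tables[[n]] with columns context (the preceding n-1 words), next_word and freq; these names, and the customary penalty alpha = 0.4, are assumptions rather than the final design.

# Stupid Backoff sketch over assumed tables ngram_tables[[2]]..ngram_tables[[5]],
# each a data frame with columns context, next_word and freq
predict_next <- function(input_words, ngram_tables, alpha = 0.4)
{
    for (n in 5:2) {
        context <- paste(tail(input_words, n - 1), collapse = " ")
        hits <- ngram_tables[[n]][ngram_tables[[n]]$context == context, ]
        if (nrow(hits) > 0) {
            # score candidates by relative frequency, discounted by alpha for
            # every level we had to back off from the five-gram
            hits$score <- alpha^(5 - n) * hits$freq / sum(hits$freq)
            return(hits$next_word[which.max(hits$score)])
        }
    }
    NA_character_  # in practice, fall back to the most frequent unigram
}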

Next step

Storing and querying n-grams for word prediction

The final application will be deployed to shinyapps.io. Given the limited resources there, we need to consider carefully how to store and query the pre-calculated n-gram model. There are two options:

  1. Store the n-grams in an R data object that is loaded when the application initializes. Loading may take some time, but queries should then be fast because all n-grams sit in RAM; however, the RAM limit is easily reached.
  2. Store each of our n-gram tables in an SQLite file (partitioned this way for better query speed) and use RSQLite to query the data file on demand. The simplicity of n-grams means they can be mapped onto a relational database with little effort. This approach looks promising and, to me, is closer to what a real-life application would do: using a database (possibly even an external server such as PostgreSQL or MariaDB) makes it easier to scale the application, reduces the loading time significantly, and keeps the Shiny code cleaner. Whether the query speed improves is less certain, so we will benchmark both approaches to see which gives the better user experience. A query sketch follows this list.
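
A minimal sketch of option 2 is shown below, assuming an SQLite file ngrams.sqlite that contains a table trigram(context, next_word, freq); the file name and schema are assumptions, not the final design.

# Query a pre-built SQLite n-gram store for a two-word trigram context
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "ngrams.sqlite")
dbGetQuery(con,
           "SELECT next_word, freq FROM trigram
            WHERE context = ? ORDER BY freq DESC LIMIT 5",
           params = list("thanks for"))
dbDisconnect(con)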

Building user-specific n-grams

In a real-life application, we should make the prediction smarter. The pre-calculated n-grams built from the SwiftKey data should serve only as the initial model; in addition, we build another set of n-grams from the user's own input. Whenever the user successfully enters a word, the related sequences are counted and added to these user-specific n-grams (a small update sketch follows the list below). The text-prediction procedure would be as follows:

  1. Look up the input sequence in the user-specific n-grams; if there are results, return them to the user.
  2. If the user-specific n-grams yield no result, run the query against the initial model built from the SwiftKey data.
  3. Once the input is accepted, calculate and update the user-specific n-grams. This approach seems workable: after some time, when the user-specific n-grams are large enough, the application should work more efficiently, because they are smaller than the initial model and tailored to the specific user.
  4. The user-specific n-grams are kept in RAM and saved to an R data file.
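
A minimal sketch of the update step is shown below, using a simple named count vector as the in-memory structure; the representation, function name and file name are illustrative assumptions.

# Update a user-specific bigram table after a word has been accepted
update_user_bigrams <- function(user_bigrams, previous_word, accepted_word)
{
    key <- paste(previous_word, accepted_word)
    if (is.na(user_bigrams[key])) {
        user_bigrams[key] <- 1                    # first time this bigram is seen
    } else {
        user_bigrams[key] <- user_bigrams[key] + 1
    }
    user_bigrams
}

user_bigrams <- c("thank you" = 1)
user_bigrams <- update_user_bigrams(user_bigrams, "thank", "you")
saveRDS(user_bigrams, "user_ngrams.rds")          # persist to an R data file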

Conclusion

We use the SwiftKey data to build an initial n-gram model that predicts the next input word or phrase. Furthermore, we build user-specific n-grams from each user's own input for better performance and accuracy.