Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices.
The requirement for the milestone project is to show that we are on track to creating a prediction algorithm that auto-suggests the next word or phrase using the capstone dataset. The motivation for this project is to:
At this step, we use the following R libraries:
library(stringi)
library(stringr)
library(dplyr)
library(tm)
library(wordcloud)
library(slam)
library(ggplot2)
We use the data provided by SwiftKey for this capstone project. The zip file is more than 500 MB, so it is best downloaded with a web browser or a download manager.
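For reproducibility, the download and extraction can also be scripted. Below is a minimal sketch; the URL is the one distributed with the course materials and should be treated as an assumption here.

# Hedged sketch: download and extract the SwiftKey dataset.
# The URL is assumed to be the course download link; adjust if it has changed.
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
    download.file(zip_url, destfile = zip_file, mode = "wb")
}
if (!dir.exists("final")) {
    unzip(zip_file)  # extracts the "final/" folder shown below
}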
Below is the structure of the extracted package:
cat(system("tree final /f /a", intern = TRUE), sep = "\n")
## +---final
## | +---de_DE
## | | de_DE.blogs.txt
## | | de_DE.news.txt
## | | de_DE.twitter.txt
## | |
## | +---en_US
## | | en_US.blogs.txt
## | | en_US.news.txt
## | | en_US.twitter.txt
## | |
## | +---fi_FI
## | | fi_FI.blogs.txt
## | | fi_FI.news.txt
## | | fi_FI.twitter.txt
## | |
## | \---ru_RU
## | ru_RU.blogs.txt
## | ru_RU.news.txt
## | ru_RU.twitter.txt
There are four folders, each containing text data for one language: English, Russian, German and Finnish. We focus on English (the “en_US” folder) and first look at the file sizes:
file_list <- paste0(getwd(),"/final","/en_US/", list.files("final/en_US/"))
lapply(file_list, function(x) file.info(x)$size)
## [[1]]
## [1] 210160014
##
## [[2]]
## [1] 205811889
##
## [[3]]
## [1] 167105338
The data was collected from blogs, news and Twitter (one file per source). We count how many entries (lines) and how many words each file contains:
read_binary <- function(file_path)
{
    print(file_path)
    # Open in binary mode so embedded control characters do not truncate the file
    con <- file(file_path, open = "rb")
    data <- readLines(con, encoding = "UTF-8")
    print(paste("The number of lines:", length(data)))
    # Approximate word count: split each line on runs of non-word characters
    count_of_word <- sapply(gregexpr("\\W+", data), length) + 1
    print(paste("The number of words:", sum(count_of_word)))
    close(con)
    rm(con)
    gc()
    return(data)
}
blogs <- read_binary("final/en_US/en_US.blogs.txt")
## [1] "final/en_US/en_US.blogs.txt"
## [1] "The number of lines: 899288"
## [1] "The number of words: 39121566"
twitter <- read_binary("final/en_US/en_US.twitter.txt")
## [1] "final/en_US/en_US.twitter.txt"
## [1] "The number of lines: 2360148"
## [1] "The number of words: 32793388"
news <- read_binary("final/en_US/en_US.news.txt")
## [1] "final/en_US/en_US.news.txt"
## [1] "The number of lines: 1010242"
## [1] "The number of words: 36721104"
The distribution of line lengths is also interesting:
blogs_length <- nchar(blogs)
news_length <- nchar(news)
twitter_length <- nchar(twitter)
hist(blogs_length)
hist(news_length)
hist(twitter_length)
summary(blogs_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40830
summary(news_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 110.0 185.0 201.2 268.0 11380.0
summary(twitter_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
Twitter restricts the length of each tweet (the maximum line length here is 140 characters), so its lines are noticeably shorter than those of the other data sources.
For further analysis we need to sample the data. The sampling strategy is described below:
# Sampling the data
set.seed(0)
sample_blogs <- sample(blogs, 100000)
set.seed(0)
sample_news <- sample(news, 100000)
set.seed(0)
sample_twitter <- sample(twitter, 100000)

# Drop apostrophes, then replace remaining non-alphanumeric characters with spaces
drop_non_alphabet <- function(text_data)
{
    clean_data <- str_replace_all(text_data, "\'", "")
    clean_data <- str_replace_all(clean_data, "[^[:alnum:]]", " ")
    return(clean_data)
}
sample_blogs <- drop_non_alphabet(sample_blogs)
sample_twitter <- drop_non_alphabet(sample_twitter)
sample_news <- drop_non_alphabet(sample_news)
sample_data <- c(sample_blogs, sample_news, sample_twitter)
We check the size of the sampled data:
print(paste("The number of lines:", length(sample_data)))
## [1] "The number of lines: 300000"
count_of_word <- sapply(gregexpr("\\W+", sample_data), length) + 1
print(paste("The number of words:", sum(count_of_word)))
## [1] "The number of words: 9363382"
Thanks to the tm package, we initialize the corpus and convert all tokens to lower case:
myCorpus <- Corpus(VectorSource(sample_data))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
We use a word cloud to visualize the word frequencies:
ggColors <- function(n) {
hues = seq(15, 375, length=n+1)
hcl(h=hues, l=65, c=100)[1:n]
}
gg.cols <- ggColors(8)
bp.cols<- c("light blue","cornflowerblue", "coral2", brewer.pal(8,"Dark2"))
wordcloud(
myCorpus,
max.words = 500,
colors = bp.cols
)
A bar chart shows more detail for the top 20 tokens:
myTDM <- TermDocumentMatrix(myCorpus)
freq <- sort(row_sums(myTDM, na.rm = TRUE), decreasing = TRUE)
frequency_table <- data.frame(word = names(freq), freq = freq)
total_freq <- sum(frequency_table$freq)
frequency_table$percent <- (100 * frequency_table$freq) / total_freq
plot_data <- frequency_table[1:20, ]
rownames(plot_data) <- NULL
g <- ggplot(plot_data, aes(reorder(word, percent), percent))
g + geom_bar(stat = "identity") + coord_flip() + xlab("word") + ylab("frequency (%)")
As the charts show, the most frequent tokens are common words that carry little meaning. Filtering out these common words gives a more interesting view: the most “meaningful” words become visible, although their frequencies are much lower.
myCorpus <- tm_map(myCorpus, removeWords,
c(stopwords("SMART"), "thy", "thou", "thee", "the", "and", "but"))
wordcloud(
myCorpus,
max.words = 500,
colors = bp.cols
)
myTDM <- TermDocumentMatrix(myCorpus)
freq <- sort(row_sums(myTDM, na.rm = TRUE), decreasing = TRUE)
frequency_table <- data.frame(word = names(freq), freq = freq)
total_freq <- sum(frequency_table$freq)
frequency_table$percent <- (100 * frequency_table$freq) / total_freq
plot_data <- frequency_table[1:20, ]
rownames(plot_data) <- NULL
g <- ggplot(plot_data, aes(reorder(word, percent), percent))
g + geom_bar(stat = "identity") + coord_flip() + xlab("word") + ylab("frequency (%)")
To predict the next typed word, we use an n-gram model. The idea is rather simple: using the sequences in the sample text data, we estimate the probability of the next token based on the most recently typed tokens (the previous 2, 3, 4 tokens, and so on). From the corpus above, we build four n-gram tables (from 2-grams up to 5-grams), as sketched below.
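As a rough, hedged sketch (not the final implementation), such tables could be built from the cleaned sample with a simple base-R helper; build_ngrams() is a hypothetical name introduced here for illustration.

# Hedged sketch: count n-grams in a character vector of cleaned lines.
# build_ngrams() is a hypothetical helper, not part of the final model code.
build_ngrams <- function(lines, n) {
    tokens <- strsplit(tolower(lines), "\\s+")
    ngrams <- unlist(lapply(tokens, function(w) {
        w <- w[w != ""]
        if (length(w) < n) return(character(0))
        sapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
    }))
    sort(table(ngrams), decreasing = TRUE)
}

# For illustration only (slow on the full 300,000-line sample):
bigram_counts  <- build_ngrams(sample_data[1:10000], 2)
trigram_counts <- build_ngrams(sample_data[1:10000], 3)
head(bigram_counts, 5)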
We also plan to implement the Stupid Backoff procedure: when a sequence of tokens cannot be matched in the 5-gram table, the search falls back to the 4-gram table, and so on. This algorithm was chosen for its simple concept, the availability of existing packages, and the ease of updating the model on the fly.
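A minimal sketch of Stupid Backoff over such count tables might look as follows; the table and helper names come from the previous sketch and are assumptions, and 0.4 is a commonly used penalty value.

# Hedged sketch of Stupid Backoff: try the highest-order n-gram table first,
# then back off to lower orders with a fixed penalty (alpha = 0.4).
# ngram_tables[[n]] is assumed to hold the n-gram counts (element 1 unused).
predict_next <- function(prefix, ngram_tables, alpha = 0.4) {
    words <- unlist(strsplit(tolower(prefix), "\\s+"))
    penalty <- 1
    for (n in rev(seq_along(ngram_tables))) {
        if (n < 2 || length(words) < n - 1) next
        counts  <- ngram_tables[[n]]
        context <- paste(tail(words, n - 1), collapse = " ")
        hits <- counts[startsWith(names(counts), paste0(context, " "))]
        if (length(hits) > 0) {
            best <- names(hits)[which.max(hits)]
            return(list(word  = tail(strsplit(best, " ")[[1]], 1),
                        score = penalty * max(hits) / sum(hits)))
        }
        penalty <- penalty * alpha
    }
    NULL
}

# Example usage with the illustrative tables built above:
# predict_next("thank you", list(NULL, bigram_counts, trigram_counts))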
The final application will be deployed to shinyapps.io. Because resources there are limited, we need to consider carefully how we store and query the pre-calculated n-gram model. There are two options:
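Whichever option is chosen, the pre-computed tables have to be shipped with the app in some serialized form. Below is a minimal sketch using R's native serialization; the file name and the object names (reused from the illustrative tables above) are assumptions.

# Hedged sketch: serialize the pre-computed n-gram tables once, offline,
# and load them when the Shiny app starts. The file name is an assumption.
saveRDS(list(bigrams = bigram_counts, trigrams = trigram_counts),
        file = "ngram_model.rds", compress = "xz")

# In the Shiny app (e.g. at the top of server.R):
# ngram_model <- readRDS("ngram_model.rds")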
In a real-life application, we should make the keyboard smarter. The pre-calculated n-grams built from the SwiftKey data should only serve as an initial model. Besides those, we build another set of n-grams based on the user's own input: whenever the user successfully types a word, the related sequences are counted and added to these user n-grams. The text prediction procedure would be as follows:
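As a rough illustration of the on-the-fly update step only (not of the full prediction procedure), newly typed text could be merged into the user's n-gram tables as sketched here, reusing the hypothetical build_ngrams() helper from the earlier sketch.

# Hedged sketch: merge counts from newly typed text into user-specific n-gram
# tables. merge_counts() and update_user_ngrams() are hypothetical helpers.
merge_counts <- function(a, b) {
    merged <- c(a, b)
    tapply(as.integer(merged), names(merged), sum)
}

update_user_ngrams <- function(user_tables, typed_text, max_n = 3) {
    for (n in 2:max_n) {
        new_counts <- build_ngrams(typed_text, n)
        user_tables[[n]] <- merge_counts(user_tables[[n]], new_counts)
    }
    user_tables
}

# Example: start with empty user tables and feed in a typed sentence.
# user_tables <- vector("list", 3)
# user_tables <- update_user_ngrams(user_tables, "see you at the coffee shop")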
We use the data from SwiftKey to build an initial n-gram model for predicting the next word or phrase. Furthermore, we build custom n-grams from each user's own input for better performance and accuracy.