Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.
In this capstone I will work on understanding and building predictive text models like those used by SwiftKey. This project will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, I will use the knowledge gained in data products to build a predictive text product.
library(NLP)
library(tm)
library(stringi)
library(dplyr)
library(ggplot2)
library(textcat)
library(RWeka)
library(wordcloud)
Load data:
con <- file("./Data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt","r")
lineBlogs <- readLines(con)
close(con)  # close each connection before opening the next one
con <- file("./Data/Coursera-SwiftKey/final/en_US/en_US.news.txt","r")
lineNews <- readLines(con)
close(con)
con <- file("./Data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt","r")
lineTwitter <- readLines(con)
close(con)
lineAll <- c(lineBlogs, lineNews, lineTwitter)
Check the size of the three files (MB), the number of lines in each, and the maximum line length (words):
  fileName fileSize_MB linesNumber_lines maxLinesLength_words
1    Blogs    200.4242            899288                40835
2     News    196.2775             77259                 5760
3  Twitter    159.3641           2360148                  213
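The code that produced this table is not shown. As a minimal sketch, the same three quantities could be computed with base R roughly as follows. The helper name `summarizeFile` and the whitespace-based word count are assumptions made for illustration, not necessarily what was used for the table above.

```r
# Hypothetical helper: size (MB), line count, and longest line (in words) of one text file.
summarizeFile <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Assumption: a "word" is any run of non-whitespace characters.
  wordsPerLine <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    fileSize_MB = file.size(path) / 1024^2,
    linesNumber_lines = length(lines),
    maxLinesLength_words = max(wordsPerLine)
  )
}

# Usage on a small temporary file, so the sketch runs without the SwiftKey data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("one two three", "four five", "six"), tmp)
demoSummary <- summarizeFile(tmp)
print(demoSummary)
```

Calling the helper once per file and combining the rows with `rbind` would give a table of the same shape as the one above.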
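Sampling: keep each line with probability 0.1 (roughly 10% of the data), then build the corpus from the sample: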
set.seed(314159)
lenAll <- length(lineAll)
index <- rbinom(n = lenAll, size = 1, prob = 0.1) %>%
  as.logical()
lineSub <- lineAll[index]
rm(lineAll)
corpus <- VCorpus(VectorSource(lineSub))
Clean the corpus: remove numbers and punctuation, convert to lower case, and collapse extra white space. A bad-word list is loaded as well, but the step that removes those words is commented out:
# Get bad words
badWords <- read.csv("./Data/BadWords.csv", stringsAsFactors = FALSE)
badWords <- as.character(badWords[,1])
# Clean Data
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
corpus <- tm_map(corpus, content_transformer(tolower))
#corpus <- tm_map(corpus, removeWords, badWords)
corpus <- tm_map(corpus, stripWhitespace)
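To make the effect of these transformations concrete, here is a rough base-R approximation applied to a single line. It is only a sketch: the function name `cleanLine` is hypothetical, and the regular expressions imitate the `tm` steps rather than reproduce them exactly.

```r
# Rough base-R approximation of the cleaning steps (not the tm implementation).
cleanLine <- function(x) {
  # Remove numbers.
  x <- gsub("[[:digit:]]+", "", x)
  # Remove punctuation, except a dash that sits between two word characters.
  x <- gsub("(?!(?<=\\w)-(?=\\w))[[:punct:]]", "", x, perl = TRUE)
  # Convert to lower case.
  x <- tolower(x)
  # Collapse runs of white space into a single space.
  gsub("[[:space:]]+", " ", x)
}

cleaned <- cleanLine("The 2 well-known   Models -- don't FAIL!")
print(cleaned)
```

The intra-word dash in "well-known" survives, while the free-standing dashes, the apostrophe, and the exclamation mark are dropped.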
Then convert the corpus to a data frame for later use and, at the same time, drop the lines that are not identified as English.
lineInf <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
lineInf$language <- textcat(lineInf$text)
lineInf <- filter(lineInf, language == "english") %>%
  select(text)
With the corpus cleaned, use the RWeka package to tokenize it. A helper function wraps the tokenization to save time. Here N runs from 1 to 4, which gives unigrams, bigrams, trigrams, and quadgrams.
# Sample first
lenLine <- length(lineInf[,1])
index <- rbinom(n = lenLine, size = 1, prob = 0.5) %>%
  as.logical()
lineSample <- lineInf[index,]
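# Construct function: return the 1000 most frequent n-grams of size numControl
NGT <- function(text, numControl){
  NGTProcess <- NGramTokenizer(text,
                               control = Weka_control(min = numControl, max = numControl))
  NGTProcess <- data.frame(table(NGTProcess))
  NGTProcess <- NGTProcess[order(NGTProcess$Freq, decreasing = TRUE),]
  colnames(NGTProcess) <- c("String","Freq")
  # head() avoids NA rows when fewer than 1000 distinct n-grams exist
  NGTProcess <- head(NGTProcess, 1000)
  NGTProcess$String <- as.character(NGTProcess$String)
  # Return the data frame explicitly; ending on the assignment above
  # would return only the character vector
  NGTProcess
}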
# Tokenize
NGT1 <- NGT(lineSample, 1)
NGT2 <- NGT(lineSample, 2)
NGT3 <- NGT(lineSample, 3)
NGT4 <- NGT(lineSample, 4)
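For readers without RWeka installed, the idea behind these frequency tables can be sketched in base R. The function `countNGrams` below is a hypothetical stand-in that splits on single spaces; it is not the RWeka tokenizer and may split text differently.

```r
# Hypothetical base-R stand-in for an n-gram frequency table (not the RWeka tokenizer).
countNGrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    words <- words[words != ""]
    # A line with fewer than n words contributes no n-grams.
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
  }))
  # Count each distinct n-gram and sort by decreasing frequency.
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(String = names(counts), Freq = as.integer(counts), stringsAsFactors = FALSE)
}

demoLines <- c("i want to go", "i want to eat", "you want to go")
bigrams <- countNGrams(demoLines, 2)
print(head(bigrams, 3))
```

The result has the same `String` and `Freq` columns as the tables returned by `NGT`, so it can be inspected in the same way.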
As an example, here are the most frequent unigrams:
head(NGT1)
      String   Freq
97815    the 132257
99361     to  88587
6869     and  75250
44         a  67606
69576     of  62705
48165      i  57410
Plot the frequency of the unigrams, bigrams, trigrams, and quadgrams.
The same frequencies are also shown as a word cloud:
Now that the N-gram frequency tables are built (N from 1 to 4), the next step is to build the prediction algorithm and develop the data product. For the algorithm, the idea is to search the N-gram tables according to the user's input. For example, search the quadgrams first and store the results. If the results are not satisfactory, search the trigrams, and so on. In the end, compare the frequencies and make the final decision. As for the data product, the plan is to use the Shiny package in R to develop an interactive web application.
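One simple reading of this search strategy can be sketched as a back-off lookup: try the highest-order table first and fall back to shorter contexts until something matches. The function `predictNext`, the toy tables, and the rule of stopping at the first table with a match are all assumptions made for illustration; the final algorithm may combine frequencies across tables differently.

```r
# Hypothetical back-off lookup over n-gram tables with columns String and Freq.
# `tables` is a list named by n-gram order, e.g. list("1" = ..., "2" = ..., "3" = ...).
predictNext <- function(input, tables) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  # Try the highest-order table first; an n-gram table needs the last n - 1 words as context.
  for (n in sort(as.integer(names(tables)), decreasing = TRUE)) {
    if (length(words) < n - 1) next
    tab <- tables[[as.character(n)]]
    context <- paste(tail(words, n - 1), collapse = " ")
    # Unigrams have no context, so every row is a candidate.
    hits <- if (n == 1) tab else tab[startsWith(tab$String, paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      best <- hits$String[which.max(hits$Freq)]
      # The prediction is the last word of the most frequent matching n-gram.
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))
    }
  }
  NA_character_
}

# Toy tables (made-up counts) in the same String/Freq layout as the tables above.
demoTables <- list(
  "1" = data.frame(String = c("the", "to"), Freq = c(10L, 8L), stringsAsFactors = FALSE),
  "2" = data.frame(String = c("to go", "to eat", "want to"), Freq = c(4L, 5L, 6L),
                   stringsAsFactors = FALSE),
  "3" = data.frame(String = c("want to go", "want to eat"), Freq = c(3L, 1L),
                   stringsAsFactors = FALSE)
)

p1 <- predictNext("I want to", demoTables)  # matched in the trigram table
p2 <- predictNext("have to", demoTables)    # no trigram match, backs off to bigrams
p3 <- predictNext("hello", demoTables)      # no context match, backs off to unigrams
print(c(p1, p2, p3))
```

Stopping at the first match keeps the lookup fast, which matters for an interactive application; a variant that collects candidates from every table and compares their frequencies would follow the description above more literally.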