Each day, people spend an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on these devices can be a serious pain. The idea of this project is to build a smart keyboard that predicts the next word that might be typed by the user.
In this report, we first describe the main features of the dataset that will be used to build the prediction model. We then analyze a large corpus of text documents to discover the structure in the data and how words are put together. Finally, we summarize our plans for the prediction algorithm and the application.
Note: Technical aspects of the analysis and data manipulation are left to the optional appendix for those who would like to reproduce the analysis.
The data provided for this project comes from three sources (blogs, news and twitter feeds) in four different languages (German, English, Finnish and Russian). For this project, the English dataset will be used. The dataset can be downloaded from this link. Table 1 shows general information about the English dataset.
| File | Size (MB) | Number of Lines | Number of Words |
|---|---|---|---|
| blogs | 200.4 | 899,288 | 37,570,839 |
| news | 196.3 | 1,010,242 | 34,494,539 |
| twitter | 159.4 | 2,360,148 | 30,451,128 |
Since the dataset is fairly large, and in order to reduce the time needed for preprocessing and cleaning, a text corpus was created by combining a 5% sample from each of the three sources.
The corpus was cleaned by:

- removing non-graphical characters, hashtags, Twitter handles, email addresses and URLs
- removing numbers
- converting all text to lowercase
- removing profanity
- stripping punctuation (keeping apostrophes) and extra whitespace

Stopwords were left in since they are part of normal language. The profanity list was downloaded from http://www.bannedwordlist.com. Stemming was not applied to reduce words to their root form, but it is something we may decide to apply at a later stage.
The first step in building a text prediction model is understanding the distribution of and relationship between words, tokens, and phrases in the text. To that end, exploratory data analysis was performed on the corpus. We tokenized the sample into unigrams, bigrams and trigrams using the RWeka package. Figures 1, 2 and 3 show the most frequent one-word, two-word and three-word combinations.
Figure 1. Most frequent single words
Figure 2. Most frequent two-word combinations
Figure 3. Most frequent three-word combinations
We also analyzed the number of words needed to cover the corpus. To do this, the unigram dictionary obtained from the sample data, sorted by frequency, was applied to the entire twitter dataset. Results are shown in Table 2, and a sketch of the computation is given after the table. Notably, fewer than 2,000 distinct words are enough to cover more than 80% of the text, which suggests that a prediction algorithm based on a fairly small dictionary could perform well.
| Number of Words | Coverage (%) |
|---|---|
| 5 | 10 |
| 26 | 20 |
| 62 | 30 |
| 136 | 40 |
| 269 | 50 |
| 555 | 60 |
| 1,021 | 70 |
| 1,886 | 80 |
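The coverage figures in Table 2 can be reproduced from a frequency-sorted unigram table. Below is a minimal sketch of the idea, assuming a vector of unigram counts such as the `frequency` column of the `One` data frame built in the appendix; the report applies the dictionary to the full twitter dataset, but the cumulative-coverage calculation is the same.

```r
# Minimal sketch: how many of the most frequent words are needed to reach
# each coverage level (assumes a vector of unigram counts, e.g. One$frequency).
coverage_table <- function(freqs, levels = seq(0.1, 0.8, by = 0.1)) {
  freqs <- sort(freqs, decreasing = TRUE)
  cum_cov <- cumsum(freqs) / sum(freqs)
  data.frame(Coverage = levels * 100,
             Words = sapply(levels, function(p) which(cum_cov >= p)[1]))
}
# Example: coverage_table(One$frequency)
```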
For the prediction algorithm we will use frequency-sorted dictionaries of unigrams, bigrams and trigrams, backing off to models with shorter histories when a longer n-gram is not found. The model with the most reliable information about a given history will be used to provide the best result. A sketch of this back-off strategy is shown below.
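The following is a minimal sketch of the planned back-off lookup, not a final implementation. It assumes the frequency-sorted data frames `One`, `Two` and `Three` built in the appendix; the function name `predict_next` and the handling of ties, smoothing and unseen words are illustrative only.

```r
# Minimal back-off sketch (assumes One, Two and Three from the appendix,
# each sorted by decreasing frequency).
predict_next <- function(text, n = 3) {
  words <- tolower(unlist(strsplit(text, "\\s+")))
  k <- length(words)
  # 1. Try the trigram table with the last two words
  if (k >= 2) {
    hits <- subset(Three, Word1 == words[k - 1] & Word2 == words[k])
    if (nrow(hits) > 0) return(head(hits$Word3, n))
  }
  # 2. Back off to the bigram table with the last word
  if (k >= 1) {
    hits <- subset(Two, Word1 == words[k])
    if (nrow(hits) > 0) return(head(hits$Word2, n))
  }
  # 3. Back off to the most frequent unigrams
  head(One$Word1, n)
}
# Example: predict_next("thanks for the")
```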
The Shiny app will have a text box for entering text. In another box, the app will show the top three predicted words. The user will be able to select one of the predicted words to use next. The app will also include a section with instructions. A minimal sketch of such an interface is shown below.
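This is only a rough sketch of the planned interface, assuming the hypothetical `predict_next()` function from the previous sketch; the final app will present the three choices as selectable options and include an instructions section.

```r
library(shiny)

# Rough interface sketch (assumes predict_next() is defined elsewhere).
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("text", "Enter text:"),
  verbatimTextOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderPrint({
    if (nchar(input$text) > 0) predict_next(input$text, n = 3)
  })
}

shinyApp(ui = ui, server = server)
```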
library("stringi")
library("pander")
library("tm")
library("qdap")
library("RWeka")
library("ggplot2")
library("knitr")
setwd("C:/Users/Adsi/Documents/Coursera/Data Science/10_Capstone project/")
# Blogs Data in binary mode
connection <- file("./final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(connection, encoding = "UTF-8")
close(connection)
# News data in binary mode
connection <- file("./final/en_US/en_US.news.txt", open = "rb")
news <- readLines(connection, encoding = "UTF-8")
close(connection)
# Twitter data in binary mode
connection <- file("./final/en_US/en_US.twitter.txt", open = "rb")
twitter <- readLines(connection, encoding = "UTF-8")
close(connection)
rm(connection)
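# Build Table 1: file size, number of lines and number of words for each source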
n <- 3
desc <- data.frame(File = rep(0, n),
                   Size = rep(0, n),
                   Lines = rep(0, n),
                   Words = rep(0, n))
Files <- c("./final/en_US/en_US.blogs.txt", "./final/en_US/en_US.news.txt",
           "./final/en_US/en_US.twitter.txt")
desc$File <- c("blogs","news","twitter")
tempo <- lapply(list(blogs,news,twitter), stri_stats_latex)
tempo <- data.frame(matrix(unlist(tempo), nrow=3, byrow=T))
desc$Words <- tempo$X4
tempo <- unlist(lapply(Files, file.size))
desc$Size <- tempo/(1024*1024)
desc$Lines <- unlist(lapply(list(blogs,news,twitter), length))
names(desc)<- c("File","Size (MB)","Number of Lines","Number of Words")
rm(Files,n,tempo)
set.seed(1234)
# Draw a 5% random sample of lines from each source
mysample.blogs <- blogs[sample(length(blogs), length(blogs) * 0.05)]
mysample.news <- news[sample(length(news), length(news) * 0.05)]
mysample.twitter <- twitter[sample(length(twitter), length(twitter) * 0.05)]
mysample <- c(mysample.blogs,mysample.news,mysample.twitter)
rm(mysample.blogs,mysample.news,mysample.twitter)
mycorpus <- Corpus(VectorSource(mysample))
rm(blogs, news, twitter)
rm(mysample)
gc()
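# Load the profanity word list used to filter the corpus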
connection <- file("final/swearWords.txt", open = "r")
profanityWords <-readLines(connection)
close(connection)
rm(connection)
# Cleaning the corpus object
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
toEmpty <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))})
mycorpus<-tm_map(mycorpus, toSpace,"[^[:graph:]]")
mycorpus<-tm_map(mycorpus, toEmpty, "#\\w+") # hashtags
mycorpus<-tm_map(mycorpus, toEmpty , "(\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)") # email address
mycorpus<-tm_map(mycorpus, toEmpty, "@\\w+")
mycorpus<-tm_map(mycorpus, toEmpty, "http[^[:space:]]*")
mycorpus<-tm_map(mycorpus, toSpace, "/|@|\\|")
mycorpus<-tm_map(mycorpus, removeNumbers)
mycorpus<-tm_map(mycorpus, content_transformer(tolower))
mycorpus<-tm_map(mycorpus, removeWords, profanityWords)
mycorpus<-tm_map(mycorpus, content_transformer(strip),char.keep="'")
mycorpus<-tm_map(mycorpus, stripWhitespace)
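# NOTE: the tokenizers and the nGramas() helper used below are not defined in
# this report. The following is a sketch of how they could be implemented with
# RWeka, assuming the original code built a TermDocumentMatrix with an
# NGramTokenizer and returned each n-gram together with its frequency.
UniGram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BiGram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriGram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
nGramas <- function(tokenizer, low = 1, high = Inf) {
  tdm <- TermDocumentMatrix(mycorpus,
                            control = list(tokenize = tokenizer,
                                           bounds = list(global = c(low, high))))
  freq <- slam::row_sums(tdm)
  data.frame(ngram = names(freq), frequency = freq,
             row.names = NULL, stringsAsFactors = FALSE)
}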
UniGramas <- nGramas(UniGram,low = 1, high=Inf)
UniGramas <- UniGramas[order(UniGramas$frequency, decreasing = TRUE), ]
head(UniGramas,20)
OneGram <- UniGramas
One<- OneGram
names(One) <- c("Word1","frequency")
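# Plot code for Figure 1 (most frequent single words) is not included in the
# report; this sketch mirrors the bigram and trigram plots further below.
g <- ggplot(OneGram[1:40,], aes(x = reorder(ngram, frequency), y = frequency, fill = frequency)) +
     geom_bar(stat = "identity") + coord_flip() +
     theme(legend.title = element_blank()) +
     theme(plot.title = element_text(size = 9)) +
     theme(axis.title = element_text(size = 8)) +
     theme(axis.text = element_text(size = 7)) +
     xlab("1-gram") + ylab("Frequency") +
     labs(title = "Top 40 Unigrams")
print(g)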
BiGramas <- nGramas(BiGram, low=1,high=Inf)
BiGramas <- BiGramas[order(BiGramas$frequency, decreasing = TRUE), ]
head(BiGramas,20)
# Bigrams are saved to a CSV file so they do not have to be generated again
write.table(BiGramas, file = "Bigramas.csv", sep = ",", row.names = FALSE, col.names = TRUE)
TwoGram <- read.csv("Bigramas.csv", header=TRUE, sep=",")
# Each bigram is split into two words to facilitate lookup
Two <- read.table(text = as.character(TwoGram$ngram), sep = " ", colClasses = "character")
Two<-cbind(Two,TwoGram$frequency)
names(Two) <- c("Word1","Word2","frequency")
g <- ggplot(TwoGram[1:40,], aes(x = reorder(ngram, frequency), y = frequency, fill = frequency)) +
     geom_bar(stat = "identity") + coord_flip() +
     theme(legend.title = element_blank()) +
     theme(plot.title = element_text(size = 9)) +
     theme(axis.title = element_text(size = 8)) +
     theme(axis.text = element_text(size = 7)) +
     xlab("2-gram") + ylab("Frequency") +
     labs(title = "Top 40 Bigrams")
print(g)
TriGramas <- nGramas(TriGram, low=300,high=Inf)
TriGramas <- TriGramas[order(TriGramas$frequency, decreasing = TRUE), ]
# Trigrams are saved to a CSV file so they do not have to be generated again
write.table(TriGramas, file = "Trigramas.csv", sep = ",", row.names = FALSE, col.names = TRUE)
ThreeGram <- read.csv("Trigramas.csv", header=TRUE, sep=",")
# Each trigram is split into three words to facilitate lookup
Three <- read.table(text = as.character(ThreeGram$ngram), sep = " ", colClasses = "character")
Three<-cbind(Three,ThreeGram$frequency)
names(Three) <- c("Word1","Word2","Word3","frequency")
g <- ggplot(ThreeGram[1:40,], aes(x = reorder(ngram, frequency), y = frequency, fill = frequency)) +
     geom_bar(stat = "identity") + coord_flip() +
     theme(legend.title = element_blank()) +
     theme(plot.title = element_text(size = 9)) +
     theme(axis.title = element_text(size = 8)) +
     theme(axis.text = element_text(size = 7)) +
     xlab("3-gram") + ylab("Frequency") +
     labs(title = "Top 40 Trigrams")
print(g)
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.12.3 ggplot2_2.0.0 RWeka_0.4-26
## [4] qdap_2.2.4 RColorBrewer_1.1-2 qdapTools_1.3.1
## [7] qdapRegex_0.6.0 qdapDictionaries_1.0.6 tm_0.6-2
## [10] NLP_0.1-8 pander_0.5.2 stringi_1.0-1
##
## loaded via a namespace (and not attached):
## [1] gtools_3.5.0 wordcloud_2.5 venneuler_1.1-0
## [4] slam_0.1-32 reshape2_1.4.1 rJava_0.9-6
## [7] reports_0.1.4 colorspace_1.2-6 htmltools_0.3
## [10] yaml_2.1.13 chron_2.3-45 XML_3.98-1.2
## [13] DBI_0.3.1 plyr_1.8.3 stringr_1.0.0
## [16] munsell_0.4.2 gtable_0.1.2 evaluate_0.8
## [19] labeling_0.3 RWekajars_3.7.13-1 gender_0.5.1
## [22] parallel_3.2.0 xlsxjars_0.6.1 Rcpp_0.12.2
## [25] scales_0.3.0 formatR_1.2.1 gdata_2.17.0
## [28] plotrix_3.6-1 xlsx_0.5.7 openNLPdata_1.5.3-2
## [31] gridExtra_2.0.0 digest_0.6.9 dplyr_0.4.3
## [34] grid_3.2.0 tools_3.2.0 bitops_1.0-6
## [37] magrittr_1.5 RCurl_1.95-4.7 data.table_1.9.6
## [40] assertthat_0.1 rmarkdown_0.8 openNLP_0.2-6
## [43] R6_2.1.1 igraph_1.0.1