Executive Summary

Each day, people spend an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities, but typing on these devices can be a serious pain. The goal of this project is to build a smart keyboard that predicts the next word the user is likely to type.

In this report, we first describe the main features of the dataset that will be used to build the prediction model. We then analyze a large corpus of text documents to discover the structure in the data and how words are put together. Finally, we summarize the plans for creating the prediction algorithm and the application.

Note: Technical aspects of the analysis and data manipulation are left to the optional appendix for those who would like to reproduce the analysis.

Getting Data

The data provided for this project comes from three sources (blogs, news and twitter feeds) in four different languages (German, English, Finnish and Russian). For this project, the English dataset is used. The dataset can be downloaded from this link. Table 1 shows general information about the English dataset.

Table 1. File Size, Line Count and Word Count of the Datasets

File      Size (MB)   Number of Lines   Number of Words
blogs       200.4         899,288        37,570,839
news        196.3       1,010,242        34,494,539
twitter     159.4       2,360,148        30,451,128

Sampling Data

Since the dataset is fairly large, and in order to reduce the time needed for preprocessing and cleaning, a text corpus was created by combining a 5% sample from each of the three sources.

Preprocessing

The corpus was cleaned to:

- remove hashtags, Twitter handles, email addresses and URLs,
- remove numbers,
- convert all text to lowercase,
- remove profanity, and
- strip extra whitespace.

Stopwords were left in since they are used in normal language. The profanity list was downloaded from http://www.bannedwordlist.com. Stemming was not applied to reduce words to their root form, but it is something we may decide to apply at a later stage.

Exploratory Data Analysis

The first step in building a text prediction model is understanding the distribution and relationship between words, tokens, and phrases in the text. With this in mind, some exploratory data analysis was performed on the corpus. The sample was tokenized into unigrams, bigrams and trigrams using the RWeka package. Figures 1, 2 and 3 show the most frequent single words, two-word combinations and three-word combinations.

Figure 1. Most frequent single words

Figure 2. Most frequent two-word combinations

Figure 3. Most frequent three-word combinations

We also analyzed the number of words needed to cover the corpus. To do this, the frequency-sorted unigram dictionary obtained from the sample data was applied to the entire twitter dataset. Results are shown in Table 2, followed by a sketch of the coverage computation. An important observation is that fewer than 2,000 distinct words are enough to cover more than 80% of the text. This suggests that prediction algorithms based on relatively small dictionaries may still perform well.

Table 2. Number of words needed to cover a percentage of the twitter data

Number of Words   Coverage (%)
      5                10
     26                20
     62                30
    136                40
    269                50
    555                60
  1,021                70
  1,886                80
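
The coverage figures in Table 2 can be reproduced roughly as in the sketch below. This is not the original code: it assumes the frequency-sorted unigram table One built in the appendix and that the raw twitter lines are still in memory.

# Rough sketch (not the original code) of the coverage computation
twitterWords <- tolower(unlist(strsplit(twitter, "\\s+")))
counts       <- table(twitterWords)

dictCounts   <- counts[as.character(One$word1)]   # twitter counts of each dictionary word
dictCounts[is.na(dictCounts)] <- 0                # dictionary words absent from twitter

covered <- cumsum(dictCounts) / length(twitterWords)
# Smallest dictionary size reaching 10%, 20%, ..., 80% coverage
sapply(seq(0.1, 0.8, by = 0.1), function(p) which(covered >= p)[1])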

Plans for creating the prediction algorithm and Shiny app

For the prediction algorithm we will use frequency-sorted dictionaries of unigrams, bigrams and trigrams, backing off to models with shorter histories under certain conditions. The model with the most reliable information about a given history will be used to provide the best results.
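
As an illustration, a minimal back-off lookup over the One, Two and Three frequency tables built in the appendix could look like the sketch below; the function name predictNextWord and the exact back-off rules are placeholders, not the final implementation.

# Minimal back-off sketch (assumes the One, Two and Three frequency tables
# from the appendix; predictNextWord is a placeholder name)
predictNextWord <- function(phrase, n = 3) {
    words <- tolower(unlist(strsplit(phrase, "\\s+")))
    len   <- length(words)

    # Trigram model first: condition on the last two words
    if (len >= 2) {
        hits <- Three[Three$Word1 == words[len-1] & Three$Word2 == words[len], ]
        if (nrow(hits) > 0) return(head(hits[order(-hits$frequency), "Word3"], n))
    }
    # Back off to the bigram model: condition on the last word
    if (len >= 1) {
        hits <- Two[Two$Word1 == words[len], ]
        if (nrow(hits) > 0) return(head(hits[order(-hits$frequency), "Word2"], n))
    }
    # Back off to the most frequent unigrams
    head(One[order(-One$frequency), "word1"], n)
}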

The Shiny app will have a text box where the user enters text. In another panel, the app will show the top three predicted words. The user should be able to select the word they would like to use next if it is among the predictions. The app will also include a section with instructions.
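
A minimal skeleton of such an app, assuming a predictNextWord() function like the sketch above, might look as follows.

library(shiny)

ui <- fluidPage(
    titlePanel("Next Word Prediction"),
    textInput("text", "Enter text:"),
    h4("Top 3 predicted words"),
    verbatimTextOutput("predictions"),
    p("Instructions: type a phrase and the app will suggest the next word.")
)

server <- function(input, output) {
    output$predictions <- renderPrint({
        if (nchar(input$text) > 0) predictNextWord(input$text, n = 3)
        else "Waiting for input..."
    })
}

shinyApp(ui = ui, server = server)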

Appendix

Libraries required

library("stringi")
library("pander")
library("tm")
library("qdap")
library("RWeka")
library("ggplot2")
library("knitr")

setwd("C:/Users/Adsi/Documents/Coursera/Data Science/10_Capstone project/")

Data Read

# Blogs Data in binary mode
connection <- file("./final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(connection, encoding = "UTF-8")

close(connection)

# News data in binary mode
connection <- file("./final/en_US/en_US.news.txt", open = "rb")
news <- readLines(connection, encoding = "UTF-8")
close(connection)

# Twitter data in binary mode
connection <- file("./final/en_US/en_US.twitter.txt", open = "rb")
twitter <- readLines(connection, encoding = "UTF-8")
close(connection)

rm(connection)

Descriptive Statistics

n <- 3
desc   <- data.frame( File  = rep(0,n),
                      Size  = rep(0,n),
                      Lines = rep(0,n), 
                      Words  =rep(0,n))


Files <- c("./final/en_US/en_US.blogs.txt", "./final/en_US/en_US.news.txt",
           "./final/en_US/en_US.twitter.txt")

desc$File   <-  c("blogs","news","twitter")
tempo       <-  lapply(list(blogs,news,twitter), stri_stats_latex)
tempo       <-  data.frame(matrix(unlist(tempo), nrow=3, byrow=T)) 
desc$Words  <-  tempo$X4
tempo       <-  unlist(lapply(Files, file.size))
desc$Size   <-  tempo/(1024*1024)
desc$Lines  <-  unlist(lapply(list(blogs,news,twitter), length))
names(desc)<- c("File","Size (MB)","Number of Lines","Number of Words")


rm(Files,n,tempo)

Data Sampling

set.seed(1234)

# Draw a random 5% sample of the lines from each source
mysample.blogs     <- sample(blogs,   round(length(blogs)   * 0.05))
mysample.news      <- sample(news,    round(length(news)    * 0.05))
mysample.twitter   <- sample(twitter, round(length(twitter) * 0.05))
mysample           <- c(mysample.blogs,mysample.news,mysample.twitter)

rm(mysample.blogs,mysample.news,mysample.twitter)

mycorpus <- Corpus(VectorSource(mysample))
rm(blogs, news, twitter)
rm(mysample)
gc()

Profanity List

connection <- file("final/swearWords.txt", open = "r")
profanityWords  <-readLines(connection)
close(connection)

rm(connection)

Data Cleaning

# Cleaning the corpus object

toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
toEmpty <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))})

mycorpus<-tm_map(mycorpus, toSpace,"[^[:graph:]]")  # non-printable characters
mycorpus<-tm_map(mycorpus, toEmpty, "#\\w+")       # hashtags
mycorpus<-tm_map(mycorpus, toEmpty , "(\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)") # email address
mycorpus<-tm_map(mycorpus, toEmpty, "@\\w+")       # Twitter handles
mycorpus<-tm_map(mycorpus, toEmpty, "http[^[:space:]]*")  # URLs
mycorpus<-tm_map(mycorpus, toSpace, "/|@|\\|")     # slashes, at signs and pipes
mycorpus<-tm_map(mycorpus, removeNumbers)
mycorpus<-tm_map(mycorpus, content_transformer(tolower))
mycorpus<-tm_map(mycorpus, removeWords, profanityWords) 
mycorpus<-tm_map(mycorpus, content_transformer(strip),char.keep="'")
mycorpus<-tm_map(mycorpus, stripWhitespace)

Creation of Term-Document-Matrix for token analysis
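
The tokenizer functions (UniGram, BiGram, TriGram) and the helper nGramas used in the sections below are not included in the original appendix. The following is a minimal sketch of one possible definition, assuming nGramas builds a term-document matrix over mycorpus with the given tokenizer and returns a data frame of n-grams and their total frequencies, keeping only those with a frequency between low and high.

# Sketch (assumption): possible definitions of the tokenizers and the
# nGramas helper referenced below; they are not part of the original code
UniGram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BiGram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriGram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

nGramas <- function(tokenizer, low = 1, high = Inf) {
    # Term-document matrix over the cleaned corpus using the given tokenizer
    tdm  <- TermDocumentMatrix(mycorpus, control = list(tokenize = tokenizer))
    # Total frequency of each n-gram across all documents
    freq <- slam::row_sums(tdm)
    freq <- freq[freq >= low & freq <= high]
    data.frame(ngram = names(freq), frequency = as.numeric(freq),
               stringsAsFactors = FALSE)
}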

Generating 1-gram

UniGramas         <- nGramas(UniGram,low = 1, high=Inf)
UniGramas         <- UniGramas[order(UniGramas$frequency, decreasing = TRUE), ]

head(UniGramas,20)

OneGram <- UniGramas
One<- OneGram
names(One)<- c("word1","frequency")

1-gram Plot
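
The plot code for unigrams is not included in the original appendix; by analogy with the 2-gram and 3-gram plots below (and assuming OneGram keeps the ngram and frequency columns), it could look like this.

g <- ggplot(OneGram[1:40,], aes(x=reorder(ngram, frequency), y=frequency, fill=frequency)) +
     geom_bar(stat = "identity") +  coord_flip() +
     theme(legend.title=element_blank()) +
     theme(plot.title = element_text(size=9))+
     theme(axis.title=element_text(size=8)) +
     theme(axis.text=element_text(size=7))+
     xlab("1-gram") + ylab("Frequency") +
     labs(title = "Top 40 Unigrams")
print(g)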

Generating 2-gram

BiGramas        <- nGramas(BiGram, low=1,high=Inf)
BiGramas        <- BiGramas[order(BiGramas$frequency, decreasing = TRUE), ]
head(BiGramas,20)

# Bigrams are saved to a CSV file for later use, to avoid regenerating them
write.table(BiGramas,file="Bigramas.csv", sep=",", row.names=FALSE, col.names=TRUE)
TwoGram <- read.csv("Bigramas.csv", header=TRUE, sep=",")

# Bigrams are split into two words (Word1, Word2) to facilitate lookup
Two <- read.table(text = as.character(TwoGram$ngram), sep = " ", colClasses = "character")
Two<-cbind(Two,TwoGram$frequency)
names(Two) <- c("Word1","Word2","frequency")

2-gram Plot

g <- ggplot(TwoGram[1:40,], aes(x=reorder(ngram, frequency), y=frequency, fill=frequency)) +
     geom_bar(stat = "identity") +  coord_flip() +
     theme(legend.title=element_blank()) +
     theme(plot.title = element_text(size=9))+
     theme(axis.title=element_text(size=8)) +
     theme(axis.text=element_text(size=7))+
     xlab("2-gram") + ylab("Frequency") +
     labs(title = "Top 40 Bigrams")
print(g)

Generating 3-grams

TriGramas     <- nGramas(TriGram, low=300,high=Inf)
TriGramas     <- TriGramas[order(TriGramas$frequency, decreasing = TRUE), ]



# Trigrams are saved to a CSV file for later use, to avoid regenerating them
write.table(TriGramas,file="Trigramas.csv", sep=",", row.names=FALSE, col.names=TRUE)
ThreeGram <- read.csv("Trigramas.csv", header=TRUE, sep=",")

# Trigrams are split into three words (Word1, Word2, Word3) to facilitate lookup
Three <- read.table(text = as.character(ThreeGram$ngram), sep = " ", colClasses = "character")
Three<-cbind(Three,ThreeGram$frequency)
names(Three) <- c("Word1","Word2","Word3","frequency")

3-gram Plot

g <- ggplot(ThreeGram[1:40,], aes(x=reorder(ngram, frequency), y=frequency, fill=frequency)) +
     geom_bar(stat = "identity") +  coord_flip() +
     theme(legend.title=element_blank()) +
     theme(plot.title = element_text(size=9))+
     theme(axis.title=element_text(size=8)) +
     theme(axis.text=element_text(size=7))+
     xlab("3-gram") + ylab("Frequency") +
     labs(title = "Top 40 Trigrams")
print(g)

Session Info

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.12.3           ggplot2_2.0.0          RWeka_0.4-26          
##  [4] qdap_2.2.4             RColorBrewer_1.1-2     qdapTools_1.3.1       
##  [7] qdapRegex_0.6.0        qdapDictionaries_1.0.6 tm_0.6-2              
## [10] NLP_0.1-8              pander_0.5.2           stringi_1.0-1         
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0        wordcloud_2.5       venneuler_1.1-0    
##  [4] slam_0.1-32         reshape2_1.4.1      rJava_0.9-6        
##  [7] reports_0.1.4       colorspace_1.2-6    htmltools_0.3      
## [10] yaml_2.1.13         chron_2.3-45        XML_3.98-1.2       
## [13] DBI_0.3.1           plyr_1.8.3          stringr_1.0.0      
## [16] munsell_0.4.2       gtable_0.1.2        evaluate_0.8       
## [19] labeling_0.3        RWekajars_3.7.13-1  gender_0.5.1       
## [22] parallel_3.2.0      xlsxjars_0.6.1      Rcpp_0.12.2        
## [25] scales_0.3.0        formatR_1.2.1       gdata_2.17.0       
## [28] plotrix_3.6-1       xlsx_0.5.7          openNLPdata_1.5.3-2
## [31] gridExtra_2.0.0     digest_0.6.9        dplyr_0.4.3        
## [34] grid_3.2.0          tools_3.2.0         bitops_1.0-6       
## [37] magrittr_1.5        RCurl_1.95-4.7      data.table_1.9.6   
## [40] assertthat_0.1      rmarkdown_0.8       openNLP_0.2-6      
## [43] R6_2.1.1            igraph_1.0.1