Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain.
In this first milestone report of the Data Science Specialization capstone project, we explore the corpora and present insights that will guide the development of a predictive algorithm to assist typing.
The motivation for this project is to load and summarize the three English corpora, report interesting findings from the exploratory analysis, and outline the plan for the prediction algorithm and the Shiny app that will host it.
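The code below assumes the following packages are loaded; they all appear among the attached packages in the session info at the end of this report.
library(readr)            # read_lines() for fast file reading
library(microbenchmark)   # timing of the loading step
library(dplyr)            # pipes (%>%) and filter()
library(tm)               # corpus construction and cleaning
library(RWeka)            # NGramTokenizer() for n-gram tokenization
library(ggplot2)          # coverage plots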
microbenchmark(
blogs_lg <- read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE) %>%
length()
, times = 1)
## Unit: seconds
## expr
## blogs_lg <- read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE) %>% length()
## min lq mean median uq max neval
## 5.618941 5.618941 5.618941 5.618941 5.618941 5.618941 1
microbenchmark(
news_lg <- read_lines("./Corpora/en_US/en_US.news.txt", progress = FALSE) %>%
length()
, times = 1)
## Unit: seconds
## expr
## news_lg <- read_lines("./Corpora/en_US/en_US.news.txt", progress = FALSE) %>% length()
## min lq mean median uq max neval
## 4.144869 4.144869 4.144869 4.144869 4.144869 4.144869 1
microbenchmark(
twitter_lg <- read_lines("./Corpora/en_US/en_US.twitter.txt", progress = FALSE) %>%
length()
, times = 1)
## Unit: seconds
## expr
## twitter_lg <- read_lines("./Corpora/en_US/en_US.twitter.txt", progress = FALSE) %>% length()
## min lq mean median uq max neval
## 5.725651 5.725651 5.725651 5.725651 5.725651 5.725651 1
The blogs corpus contains 899,288 lines, the news corpus 1,010,242 lines, and the Twitter corpus 2,360,148 lines.
The original data sets are quite large and take a lot of memory, and each one takes a few seconds to load (see the session info at the end of this report). Using the readr package, the loading time is shorter than with the base readLines function.
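For reference, the equivalent base-R timing could be measured the same way; this is an illustrative sketch and was not run as part of this report.
microbenchmark(
  blogs_base <- readLines("./Corpora/en_US/en_US.blogs.txt", warn = FALSE) %>%
    length()
  , times = 1)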
In order to continue the exploratory analysis without spending too much time on processing, we will construct a dataset that combines samples from the three corpora.
For ease of construction, the sample is fixed at 0.1% of the lines of each file, drawn at random. This is not ideal, but it keeps the computation time reasonable while still allowing an interesting analysis.
For the purpose of modelling and prediction, this sampling step can later be repeated in a loop to improve the accuracy of the algorithm (a small wrapper is sketched after the sampling code below).
set.seed(123)
samp_prop = 0.001
corp <- c(sample(read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE), blogs_lg * samp_prop),
          sample(read_lines("./Corpora/en_US/en_US.news.txt", progress = FALSE), news_lg * samp_prop),
          sample(read_lines("./Corpora/en_US/en_US.twitter.txt", progress = FALSE), twitter_lg * samp_prop))
The following steps are applied to the corpus: conversion from UTF-8 to ASCII, construction of a tm corpus of plain-text documents, conversion to lower case, whitespace stripping, removal of punctuation and numbers, and removal of English stop words.
This is a standard procedure, easily implemented, that can be optimized later on.
corp <- iconv(corp, 'UTF-8', 'ASCII')  # lines that cannot be converted to ASCII become NA
# Cleaning function for the Corpus
clean_corpus <- function(corpus){
  corpus <- VCorpus(VectorSource(corpus))       # build a tm corpus from the character vector
  corpus <- tm_map(corpus, PlainTextDocument)   # coerce documents to plain text
  corpus <- tm_map(corpus, tolower)             # convert to lower case
  corpus <- tm_map(corpus, stripWhitespace)     # collapse repeated whitespace
  corpus <- tm_map(corpus, removePunctuation)   # remove punctuation
  corpus <- tm_map(corpus, removeNumbers)       # remove digits
  corpus <- tm_map(corpus, removeWords, stopwords("en"))  # remove English stop words
  return(corpus)
}
corp <- clean_corpus(corp)
corp[[20]][1]
## [1] "usha martin education solutions limited press release"
corp[[500]][1]
## [1] " art work stamp made "
corp[[4000]][1]
## [1] " outta loop idea whats going considering butts glued floor hours doi "
We now have a sample of the corpora. The examples above show that some artifacts (extra whitespace and truncated words) survived the cleaning process; this will need to be fine-tuned at a later stage for the modelling algorithm.
However, we can already start the exploratory analysis.
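As an example of the kind of fine-tuning that could be added later, a content_transformer step could strip URLs before the other cleaning steps. This is only a sketch with an illustrative pattern and is not part of the current pipeline.
# Sketch only: an extra step that could be added inside clean_corpus(), e.g.
# corpus <- tm_map(corpus, remove_urls), to drop URLs before tokenization.
remove_urls <- content_transformer(function(x) gsub("(http|https)\\S+|www\\.\\S+", "", x))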
To organize the sampled corpus into a data frame containing each unique word and its frequency, we perform 1-gram tokenization:
unigram <- NGramTokenizer(corp, control = Weka_control(min = 1, max = 1))  # extract single words
unigram <- data.frame(table(unigram)) %>%                    # count each unique token
  filter(unigram != "NA")                                    # drop the "NA" tokens left by iconv()
unigram <- unigram[order(unigram$Freq, decreasing = TRUE),]  # sort by decreasing frequency
names(unigram) <- c("Word","Freq")
saveRDS(unigram, file = "unigram.rds")
barplot( height = unigram$Freq[1:10], names.arg = unigram$Word[1:10], horiz = FALSE, col = heat.colors(10),
         main = "Top 10 Unigrams", ylab = "Frequency")
unigram <- readRDS("unigram.rds")
# cumulative share of all tokens covered by the most frequent words
unigram$ratio <- cumsum(unigram$Freq) / sum(unigram$Freq)
unigram$number <- 1:nrow(unigram)
library(scales)
unigram %>%
ggplot(aes(x = number, y = ratio)) +
geom_line(col = "steelblue", lwd = 2) +
geom_hline(yintercept = c(0.5, 0.75, 0.90, 0.99), col = "red", lty = 2) +
scale_y_continuous(labels = percent, breaks = seq(0, 1, 0.1)) +
labs(title = "Number of Unique Word vs. Their Coverage of Total Words",
x = "Number of Unique Word", y = "Coverage of Total Words (%)") +
theme_light()
We can see that about 12,000 unique words cover the complete sample, and that roughly 8,000 words already cover about 90% of it.
However, the linear trend that appears beyond about 4,000 words indicates that the sample corpus is too small: each word added past that point contributes an almost constant, single-occurrence increment to the coverage.
If this analysis is repeated on many samples of the original corpora, we would be able to refine the set of key words necessary to cover the whole corpus for each type of text. This is a valuable insight for the optimization of the prediction algorithm.
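The coverage thresholds quoted above can be read directly off the cumulative ratio; the following sketch shows one way to do so (the exact values depend on the random sample drawn).
# Index of the first word at which each coverage level is reached
coverage_points <- sapply(c(0.50, 0.75, 0.90, 0.99),
                          function(p) which(unigram$ratio >= p)[1])
names(coverage_points) <- c("50%", "75%", "90%", "99%")
coverage_points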
To go further, 2-gram tokenization organizes the data into a data frame containing each unique word pair and its frequency.
bigram <- NGramTokenizer(corp, control = Weka_control(min = 2, max = 2))
bigram <- data.frame(table(bigram)) %>%
  filter(bigram != "NA NA")
bigram <- bigram[order(bigram$Freq, decreasing = TRUE),]
names(bigram) <- c("Word","Freq")
saveRDS(bigram, file = "bigram.rds")
barplot(height = bigram$Freq[1:5], names.arg = bigram$Word[1:5], horiz = FALSE, col = heat.colors(5),
        main = "Top 5 Bigrams", ylab = "Frequency")
bigram <- readRDS("bigram.rds")
# cumulative share of all word pairs covered by the most frequent pairs
bigram$ratio <- cumsum(bigram$Freq) / sum(bigram$Freq)
bigram$number <- 1:nrow(bigram)
library(scales)
bigram %>%
ggplot(aes(x = number, y = ratio)) +
geom_line(col = "steelblue", lwd = 2) +
geom_hline(yintercept = c(0.5, 0.75, 0.90, 0.99), col = "red", lty = 2) +
scale_y_continuous(labels = percent, breaks = seq(0, 1, 0.1)) +
labs(title = "Number of Pair Word vs. Their Coverage of Total Words",
x = "Number of Pair Word", y = "Coverage of Total Pair Words (%)") +
theme_light()
We can see here that the sample size is insufficient to determine the minimum number of word pairs needed for an efficient model. The quasi-linear trend reflects how few pairs occur more than once among those found during tokenization, similar to the behaviour observed for the unigram data frame.
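One quick way to quantify this sparsity, sketched below, is the share of word pairs that occur exactly once in the sample; a value close to 1 is what produces the quasi-linear coverage curve.
# Proportion of bigrams seen exactly once in the sampled corpus
mean(bigram$Freq == 1)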
This concludes our exploratory analysis. The next steps of this capstone project are the development of the predictive algorithm, followed by its deployment as a Shiny app.
The predictive algorithm will be built on an n-gram model with frequency lookup, similar to the exploratory analysis conducted above. Accuracy can be improved by refining the unigram, bigram and trigram tables through repeated sampling and analysis of the corpora. The sample size must be chosen to ensure adequate coverage of the words in the corpora.
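As a rough illustration of such a frequency lookup, and not the final algorithm, the tables built above could be queried as follows (predict_next is a hypothetical helper).
# Hypothetical helper: suggest the next word from the most frequent bigrams
# starting with the last typed word, falling back to the top unigrams.
predict_next <- function(last_word, n = 3) {
  pairs <- as.character(bigram$Word)                      # already sorted by frequency
  hits  <- pairs[grepl(paste0("^", tolower(last_word), " "), pairs)]
  if (length(hits) > 0) {
    return(head(sub("^\\S+\\s+", "", hits), n))           # keep the second word of each pair
  }
  head(as.character(unigram$Word), n)                     # fallback: most frequent single words
}
predict_next("one")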
The model can also be split by the type of source the user is working with, so that predictions match whether the user is typing a blog post, a news article or a tweet.
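As a hedged sketch of how the source information could be preserved (the object names below are illustrative and not used elsewhere in this report), the sampling step could tag each line with its origin before the corpora are combined.
# Illustrative only: sample each corpus separately and keep a source label,
# so that per-source n-gram tables can be built later.
src_blogs   <- sample(read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE), blogs_lg * samp_prop)
src_news    <- sample(read_lines("./Corpora/en_US/en_US.news.txt", progress = FALSE), news_lg * samp_prop)
src_twitter <- sample(read_lines("./Corpora/en_US/en_US.twitter.txt", progress = FALSE), twitter_lg * samp_prop)
corp_tagged <- data.frame(
  text   = c(src_blogs, src_news, src_twitter),
  source = rep(c("blogs", "news", "twitter"),
               times = c(length(src_blogs), length(src_news), length(src_twitter))),
  stringsAsFactors = FALSE)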
The different combinations of source type and number of words to predict can later be offered in the Shiny app.
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_0.4.1 bindrcpp_0.2 RWeka_0.4-34
## [4] rJava_0.9-8 microbenchmark_1.4-2.1 stringr_1.2.0
## [7] dplyr_0.7.0 ggplot2_2.2.1.9000 SnowballC_0.5.1
## [10] tm_0.7-1 NLP_0.1-10 readr_1.1.1
## [13] rmarkdown_1.6 knitr_1.16 RDocumentation_0.8.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.11 bindr_0.1 plyr_1.8.4
## [4] RWekajars_3.9.1-3 tools_3.3.3 digest_0.6.12
## [7] jsonlite_1.5 evaluate_0.10 memoise_1.1.0
## [10] tibble_1.3.3 gtable_0.2.0 rlang_0.1.1
## [13] curl_2.6 yaml_2.1.14 parallel_3.3.3
## [16] proto_1.0.0 withr_1.0.2 httr_1.2.1
## [19] devtools_1.13.2 hms_0.3 rprojroot_1.2
## [22] grid_3.3.3 glue_1.1.0 data.table_1.10.4
## [25] R6_2.2.2 githubinstall_0.2.1 magrittr_1.5
## [28] backports_1.1.0 htmltools_0.3.6 assertthat_0.2.0
## [31] colorspace_1.3-2 labeling_0.3 stringi_1.1.5
## [34] lazyeval_0.2.0 munsell_0.4.3 slam_0.1-40
## [37] rjson_0.2.15