Synopsis

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain.

This first milestone report of the Data Science Specialization capstone project presents an exploration of the corpus and some initial insights toward the development of a predictive algorithm to assist user typing.

Review Criteria

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Exploratory Analysis

Basic Setup

Loading the data and basic information

# Packages used throughout the report (see also the Session Info at the end)
library(readr)           # read_lines
library(dplyr)           # %>%, filter
library(tm)              # corpus cleaning
library(RWeka)           # NGramTokenizer, Weka_control
library(ggplot2)         # plots
library(microbenchmark)  # timing

microbenchmark(
        blogs_lg <- read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE) %>%
        length()
, times = 1)
## Unit: seconds
##                                                                                           expr
##  blogs_lg <- read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE) %>%      length()
##       min       lq     mean   median       uq      max neval
##  5.618941 5.618941 5.618941 5.618941 5.618941 5.618941     1
microbenchmark(
        news_lg <- read_lines("./Corpora/en_US/en_US.news.txt", progress = FALSE) %>%
        length()
, times = 1)
## Unit: seconds
##                                                                                         expr
##  news_lg <- read_lines("./Corpora/en_US/en_US.news.txt", progress = FALSE) %>%      length()
##       min       lq     mean   median       uq      max neval
##  4.144869 4.144869 4.144869 4.144869 4.144869 4.144869     1
microbenchmark(
        twitter_lg <- read_lines("./Corpora/en_US/en_US.twitter.txt", progress = FALSE) %>%
        length()
, times = 1)
## Unit: seconds
##                                                                                               expr
##  twitter_lg <- read_lines("./Corpora/en_US/en_US.twitter.txt",      progress = FALSE) %>% length()
##       min       lq     mean   median       uq      max neval
##  5.725651 5.725651 5.725651 5.725651 5.725651 5.725651     1

The blogs corpus contains 899,288 lines, the Twitter corpus 2,360,148 lines, and the news corpus 1,010,242 lines.

The original data sets are quite large and take a lot of memory. Each of them takes a few seconds to load (see the benchmarks above). Using the readr package, the loading time is shorter than with the base readLines function.
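This claim can be checked directly on any machine; the sketch below is a minimal comparison (one run of each reader on the blogs file used above), so the exact timings will vary with the environment.

microbenchmark(
        base  = readLines("./Corpora/en_US/en_US.blogs.txt"),
        readr = read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE),
        times = 1
)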

In order to continue the exploratory analysis without spending too much time on processing, we will construct a data set combining samples from the three corpora.

For ease of creating this sampled corpus, the number of lines read from each corpus is fixed at 0.1% of each file, drawn at random. This is not ideal, but it keeps the computation time reasonable while still allowing an interesting analysis.

*Note:*

For the purpose of modeling and prediction, this sampling step can later be repeated in a loop to improve the accuracy of the algorithm; a sketch of this idea follows the sampling code below.

set.seed(123)
samp_prop <- 0.001  # 0.1% of each file

# Draw a random 0.1% sample of lines from each corpus and combine them
corp <- c(sample(read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE), blogs_lg * samp_prop),
          sample(read_lines("./Corpora/en_US/en_US.news.txt", progress = FALSE), news_lg * samp_prop),
          sample(read_lines("./Corpora/en_US/en_US.twitter.txt", progress = FALSE), twitter_lg * samp_prop))

Data cleaning and processing

The following steps are applied to the corpus:

  • Convert to ASCII (taking care with emoji and other non-ASCII characters),
  • transform to lower case,
  • remove punctuation,
  • remove numbers,
  • remove stop words,
  • strip extra white space.

This is a standard procedure that is easy to implement and can be optimized later on.

# Convert to ASCII; elements that cannot be mapped (e.g. emoji) become NA
corp <- iconv(corp, 'UTF-8', 'ASCII')

# Cleaning function for the Corpus
clean_corpus <- function(corpus){
        corpus <- VCorpus(VectorSource(corpus))
        corpus <- tm_map(corpus, PlainTextDocument)
        # Note: content_transformer(tolower) is the safer tm idiom for lower-casing
        corpus <- tm_map(corpus, tolower)
        corpus <- tm_map(corpus, stripWhitespace)
        corpus <- tm_map(corpus, removePunctuation)
        corpus <- tm_map(corpus, removeNumbers)
        # Removing stop words after stripping whitespace leaves double spaces
        # behind, as seen in the sample documents printed below
        corpus <- tm_map(corpus, removeWords, stopwords("en"))

        return(corpus)
}

corp <- clean_corpus(corp)

corp[[20]][1]
## [1] "usha martin education  solutions limited press release"
corp[[500]][1]
## [1] " art work   stamp made  "
corp[[4000]][1]
## [1] " outta  loop    idea whats going  considering  butts  glued   floor  hours doi "

We now have a cleaned sample of the corpora. The printed documents show that some artifacts remain after cleaning (for example runs of extra spaces and truncated words); this will need to be fine-tuned at a later stage for the modelling algorithm.
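As an illustration of the kind of fine-tuning that could be added later (a minimal sketch, not applied in this report), the leftover runs of spaces and any stray non-letter characters can be cleaned up with base R regular expressions:

# Hypothetical post-processing helper: keep lower-case letters and single spaces
tidy_text <- function(x) {
        x <- gsub("[^a-z ]", " ", x)  # drop anything that is not a letter or a space
        x <- gsub("\\s+", " ", x)     # squeeze repeated whitespace
        trimws(x)
}

tidy_text(" art work   stamp made  ")  # returns "art work stamp made"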

However, we can already start the exploratory analysis.

N-gram Analysis

Unigram

To organize the sample corpus into a data frame containing each unique word and its frequency, we perform 1-gram tokenization:

unigram <- NGramTokenizer(corp, control = Weka_control(min = 1, max = 1))
unigram <- data.frame(table(unigram)) %>%
        filter(unigram != "NA")
unigram <- unigram[order(unigram$Freq, decreasing = TRUE),]
names(unigram) <- c("Word","Freq")
saveRDS(unigram, file = "unigram.rds")

Visualization of the sample unigram

Top 10 words in the unigram

barplot(height = unigram$Freq[1:10], names.arg = unigram$Word[1:10], horiz = FALSE, col = heat.colors(10),
        main = "Top 10 Unigram Words", ylab = "Frequency")

Unigram coverage of the sample corpora

unigram <- readRDS("unigram.rds")
unigram$ratio  <- cumsum(unigram$Freq) / sum(unigram$Freq)
unigram$number <- seq_along(unigram$Word)

library(scales)

unigram %>% 
        ggplot(aes(x = number, y = ratio)) +
        geom_line(col = "steelblue", lwd = 2) +
        geom_hline(yintercept = c(0.5, 0.75, 0.90, 0.99), col = "red", lty = 2) +
        scale_y_continuous(labels = percent, breaks = seq(0, 1, 0.1)) +
        labs(title = "Number of Unique Word vs. Their Coverage of Total Words",
             x = "Number of Unique Word", y = "Coverage of Total Words (%)") +
        theme_light()

We can see that about 12,000 unique words cover the complete sample, and that about 8,000 words already cover roughly 90% of the corpus.

However, the nearly linear trend beyond roughly 4,000 words indicates that the sample corpus is too small: each word added past that point contributes about the same small amount to the coverage, which is typical of words seen only once or twice in the sample.

If repeated on many samples of the original corpora, this analysis would let us refine the set of key words needed to cover the whole corpus for each type of text. This is a valuable insight for the optimization of the prediction algorithm.
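For later use, the coverage curve can also be read off programmatically. The small helper below (hypothetical, not part of the analysis above) returns the number of unique words needed to reach a given coverage target, using the ratio column computed earlier:

words_for_coverage <- function(ngram_df, target) {
        # index of the first word at which the cumulative coverage reaches the target
        which(ngram_df$ratio >= target)[1]
}

words_for_coverage(unigram, 0.50)
words_for_coverage(unigram, 0.90)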

Bigram

To go further, 2-gram tokenization is used to build a data frame containing each unique word pair and its frequency:

bigram <- NGramTokenizer(corp, control = Weka_control(min = 2, max = 2))
bigram <- data.frame(table(bigram)) %>%
        filter(bigram != "NA NA")
bigram <- bigram[order(bigram$Freq, decreasing = TRUE),]
names(bigram) <- c("Word","Freq")
saveRDS(bigram, file = "bigram.rds")

Visualization of the sample bigram

Top 5 word pairs in the bigram

barplot(height = bigram$Freq[1:5], names.arg = bigram$Word[1:5], horiz = FALSE, col = heat.colors(5),
        main = "Top 5 Bigram Word Pairs", ylab = "Frequency")

Bigram coverage of the sample corpora

bigram <- readRDS("bigram.rds")
bigram$ratio  <- cumsum(bigram$Freq) / sum(bigram$Freq)
bigram$number <- seq_along(bigram$Word)

library(scales)

bigram %>% 
        ggplot(aes(x = number, y = ratio)) +
        geom_line(col = "steelblue", lwd = 2) +
        geom_hline(yintercept = c(0.5, 0.75, 0.90, 0.99), col = "red", lty = 2) +
        scale_y_continuous(labels = percent, breaks = seq(0, 1, 0.1)) +
        labs(title = "Number of Pair Word vs. Their Coverage of Total Words",
             x = "Number of Pair Word", y = "Coverage of Total Pair Words (%)") +
        theme_light()

We can see here that the sample size is insufficient to conclude on the minimum number of word pairs needed for an efficient model. The quasi-linear trend suggests that most of the pairs found during tokenization occur only once in the sample, similar to the behavior observed for the unigram data frame.

Conclusion and Planning

This concludes our exploratory analysis. The next steps of this capstone project are the development of the predictive algorithm, followed by its deployment as a Shiny app.

The predictive algorithm will be built on an n-gram model with frequency lookup, similar to the exploratory analysis conducted above. Accuracy can be improved by refining the unigram, bigram and trigram tables through repeated sampling and analysis of the corpus. Sample size must be taken into consideration to ensure adequate coverage of the words in the corpora.
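As a rough illustration of this frequency-lookup idea (an assumed design sketch, not the final implementation), the tables built above already allow a simple bigram lookup with a unigram fallback:

# Suggest up to n candidate words to follow `last_word`
predict_next <- function(last_word, bigram_df, unigram_df, n = 3) {
        # bigram_df$Word holds pairs such as "right now"; keep the pairs that
        # start with the last typed word (they are already ordered by frequency)
        hits <- bigram_df[grepl(paste0("^", last_word, " "), bigram_df$Word), ]
        if (nrow(hits) > 0) {
                # return the second word of the most frequent matching pairs
                sub(paste0("^", last_word, " "), "", head(hits$Word, n))
        } else {
                # back off to the overall most frequent single words
                as.character(head(unigram_df$Word, n))
        }
}

predict_next("love", bigram, unigram)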

The model could also be split by the type of source the user is working on, so that predictions are matched to whether the user is typing a blog post, a news article or a tweet.
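One simple way to support this (a sketch that assumes the three files are sampled separately) is to keep a source label next to each sampled line, so that separate n-gram tables can be built per source:

# Hypothetical labelled sample: one data frame holding the text and its source
blogs_s   <- sample(read_lines("./Corpora/en_US/en_US.blogs.txt", progress = FALSE), blogs_lg * samp_prop)
news_s    <- sample(read_lines("./Corpora/en_US/en_US.news.txt", progress = FALSE), news_lg * samp_prop)
twitter_s <- sample(read_lines("./Corpora/en_US/en_US.twitter.txt", progress = FALSE), twitter_lg * samp_prop)

corp_labeled <- data.frame(
        text   = c(blogs_s, news_s, twitter_s),
        source = rep(c("blogs", "news", "twitter"),
                     times = c(length(blogs_s), length(news_s), length(twitter_s))),
        stringsAsFactors = FALSE
)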

The different combinations of source type and number of words to predict can later be offered as options in the Shiny app.

Session Info

sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_0.4.1           bindrcpp_0.2           RWeka_0.4-34          
##  [4] rJava_0.9-8            microbenchmark_1.4-2.1 stringr_1.2.0         
##  [7] dplyr_0.7.0            ggplot2_2.2.1.9000     SnowballC_0.5.1       
## [10] tm_0.7-1               NLP_0.1-10             readr_1.1.1           
## [13] rmarkdown_1.6          knitr_1.16             RDocumentation_0.8.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11        bindr_0.1           plyr_1.8.4         
##  [4] RWekajars_3.9.1-3   tools_3.3.3         digest_0.6.12      
##  [7] jsonlite_1.5        evaluate_0.10       memoise_1.1.0      
## [10] tibble_1.3.3        gtable_0.2.0        rlang_0.1.1        
## [13] curl_2.6            yaml_2.1.14         parallel_3.3.3     
## [16] proto_1.0.0         withr_1.0.2         httr_1.2.1         
## [19] devtools_1.13.2     hms_0.3             rprojroot_1.2      
## [22] grid_3.3.3          glue_1.1.0          data.table_1.10.4  
## [25] R6_2.2.2            githubinstall_0.2.1 magrittr_1.5       
## [28] backports_1.1.0     htmltools_0.3.6     assertthat_0.2.0   
## [31] colorspace_1.3-2    labeling_0.3        stringi_1.1.5      
## [34] lazyeval_0.2.0      munsell_0.4.3       slam_0.1-40        
## [37] rjson_0.2.15