Milestone Report - Text Mining

1. Synopsis

As the Capstone Project of the Johns Hopkins Data Science Specialization, I intend to build a text mining application.

The application will:

  • Load a large volume of text from different sources into Corpora.
  • Process and clean the data, removing stop words as needed.
  • Create an algorithm that predicts the next word from previously typed text, similar to the suggestions some mobile writing applications provide to improve writing speed.

This project is based on data from SwiftKey, one of the first and most prominent companies providing such technology.

2. Data Processing

The data consists of three Corpus text files with different structures:

  • en_US.blogs.txt - A collection of blog paragraphs.
  • en_US.news.txt - A collection of news paragraphs.
  • en_US.twitter.txt - A collection of tweets.

Let’s display some raw rows of each Corpus:
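A minimal sketch of how the raw rows could be loaded and displayed; the `data/` path and the 50,000-line cap are assumptions for illustration, not the report's actual settings:

```r
# Hedged sketch: read a capped number of lines from each corpus file.
files <- c(blogs   = "data/en_US.blogs.txt",
           news    = "data/en_US.news.txt",
           twitter = "data/en_US.twitter.txt")

raw <- lapply(files, readLines, n = 50000, encoding = "UTF-8", skipNul = TRUE)

# Display two raw rows of each Corpus
lapply(raw, head, 2)
```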

3. Summary Statistics

Corpus Comparison Table

Let’s create a comparison table with some basic statistics for each Corpus.

This time we set the minimum allowable word length to one character and do not filter stop words.
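A sketch of how such a table could be built with `stringi`, reusing the `raw` list from the previous section; the column names are illustrative assumptions:

```r
library(stringi)

# Basic per-corpus statistics: line count, word count, character count,
# and the length of the longest line.
stats <- data.frame(
  corpus    = names(raw),
  lines     = sapply(raw, length),
  words     = sapply(raw, function(x) sum(stri_count_words(x))),
  chars     = sapply(raw, function(x) sum(nchar(x))),
  max_chars = sapply(raw, function(x) max(nchar(x)))
)
stats
```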

4. Word Frequency for each Corpus

Word frequency is an important asset for predicting upcoming words, as we intend to do.

This section displays a comparison of the Word Frequency of each Corpus.

In the following charts we set the minimum number of characters we allow for words to four and filter out stop words.
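A sketch of the word-frequency computation with `tidytext` (one of the attached packages), again assuming the `raw` list from Section 2; the function name and defaults are illustrative:

```r
library(dplyr)
library(tidytext)

# Word frequency for one corpus: tokenize, optionally drop stop words,
# and keep only words of at least `min_chars` characters.
word_freq <- function(lines, min_chars = 4, drop_stop_words = TRUE) {
  out <- tibble(text = lines) %>%
    unnest_tokens(word, text)
  if (drop_stop_words) out <- anti_join(out, stop_words, by = "word")
  out %>%
    filter(nchar(word) >= min_chars) %>%
    count(word, sort = TRUE)
}

head(word_freq(raw$news), 10)
```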

5. Estimating the Number of Lines to Load from the Corpora

An important issue is the size of the corpora: we might not need to use every single line.

The amount of data we need to sample to obtain meaningful predictions is an important factor if we want to use the prediction model under limited time and processing constraints.

What percent of the lines do we need to cover N% of the word instances in the corpora?

Some examples:
  • What percent is needed to cover 50% of the word instances of the news Corpus, with a minimum of 3 characters per word and keeping stop words?

[1] 9.39

  • What percent is needed to cover 90% of the word instances of the news Corpus, with a minimum of 1 character per word and with stop words removed?

[1] 36.65

To establish which percentage of the corpora lines gives the best trade-off between processing time and information gained, we will create a function that randomly selects a maximum number of lines to load from the corpora and a random target coverage: the percentage of the total word instances of the corpora that must be covered by the cumulative counts of the top N words.
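The original function is not reproduced here; the following is a hedged sketch of one plausible reading of the examples above, namely the share of the most frequent distinct words whose cumulative counts reach the target coverage. It reuses `word_freq()` from the previous section:

```r
# Percentage of distinct words (taken from most to least frequent) needed so
# that their cumulative counts cover `target` of all word instances.
coverage_percent <- function(lines, target = 0.5, min_chars = 1,
                             drop_stop_words = FALSE) {
  freq <- word_freq(lines, min_chars = min_chars,
                    drop_stop_words = drop_stop_words)
  cum   <- cumsum(freq$n) / sum(freq$n)
  n_top <- which(cum >= target)[1]        # top-N words reaching the target
  round(100 * n_top / nrow(freq), 2)
}

# E.g., coverage of 50% of word instances, 3+ characters, keeping stop words:
coverage_percent(raw$news, target = 0.5, min_chars = 3)
```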

Loading more lines from the corpus increases certainty, at the cost of additional time and computing resources.

What is the ratio of “% Lines of Corpus Needed” to “% Lines of Corpus Loaded”?

Some observations:
  • Increasing the number of loaded lines from the Corpora significantly increases the processing time.
  • If a greater share of words must be covered, more lines may need to be loaded, with the consequent time penalty.
  • When using the whole Corpora, loading 5% of the lines seems optimal in terms of time and efficiency.
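An illustrative sketch of that 5% sampling, applied to the `raw` list from Section 2; the seed is an arbitrary assumption:

```r
# Load a 5% random sample of the lines of each corpus.
set.seed(123)   # arbitrary seed, kept only for reproducibility
sampled <- lapply(raw, function(lines) {
  sample(lines, size = ceiling(0.05 * length(lines)))
})
sapply(sampled, length)
```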

6. Calculating N-Grams

Let’s create bigrams and trigrams from the Corpora.

What are the top n-grams in each Corpus, and how do they compare?

We can create higher-order n-grams as well.
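A minimal n-gram sketch with `tidytext`, assuming the same `raw` list; the function name is illustrative:

```r
library(dplyr)
library(tidytext)

# n-gram frequency table for one corpus; n = 2 gives bigrams, n = 3 trigrams.
ngram_freq <- function(lines, n = 2) {
  tibble(text = lines) %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram)) %>%       # lines shorter than n yield NA
    count(ngram, sort = TRUE)
}

head(ngram_freq(raw$twitter, n = 2), 5)   # top bigrams
head(ngram_freq(raw$twitter, n = 3), 5)   # top trigrams
```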

7. Future Steps for the Text Mining Algorithm

The next steps could be:

  1. Dictionaries

    • Create dictionaries for the top-frequency words.
    • Create 2-, 3-, and 4-gram dictionaries with their frequencies.
    • Use different combinations of parameters such as minimum word length, stop word removal, etc.
  2. Prepare Training Set

    • Create a training set by sampling random sentences from the Corpora and setting the last word aside as the ‘prediction’. Include in the set the corresponding n-grams minus the last word and their specific frequencies as weights; see the sketch after this list.
    • Create a testing set.
  3. Train the model

    • Using the word and n-gram frequencies corresponding to the training set sentences, fit a model that weights those frequencies in order to better predict the last word of each sentence or n-gram.
    • Create a prediction model and study it.
  4. Save dictionaries and weights in files

    • Pay special attention to dictionary sizes and memory constraints on mobile devices, the web, etc.
    • Save dictionaries for each n-gram case and configuration as well as the model weights.
  5. Prepare a Shiny app

    • Prepare a data folder and include the dictionaries and the model weights.
    • Prepare a ui.R file including a text box where the user can write text. The input is processed in server.R, which returns buttons or suggestions for what the next words could be.
    • Include a button that, when pressed, inserts random sentences and displays the suggested predictions.
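A hypothetical sketch of step 2, the training-set construction; the function and column names are assumptions, not the final design:

```r
library(dplyr)
library(stringi)

# Sample sentences and split off the last word as the 'prediction' target.
make_training_set <- function(lines, n_samples = 1000) {
  sampled <- sample(lines, n_samples)
  words   <- stri_split_boundaries(stri_trans_tolower(sampled),
                                   type = "word", skip_word_none = TRUE)
  keep    <- lengths(words) >= 2      # need context plus a target word
  tibble(
    context    = sapply(words[keep], function(w) paste(head(w, -1), collapse = " ")),
    prediction = sapply(words[keep], function(w) tail(w, 1))
  )
}

train <- make_training_set(raw$blogs)
head(train)
```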

8. Appendix

Session information for reproducibility

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
## [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
## [5] LC_TIME=Spanish_Spain.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] googleVis_0.6.4        wordcloud_2.6          RColorBrewer_1.1-2    
##  [4] Hmisc_4.4-0            Formula_1.2-3          survival_3.1-12       
##  [7] lattice_0.20-41        scales_1.1.1           ggplot2_3.3.0         
## [10] stringr_1.4.0          cowplot_1.0.0          stringi_1.4.6         
## [13] tm_0.7-7               NLP_0.2-0              quanteda_2.0.1        
## [16] dplyr_0.8.5            tidyr_1.1.0            tidytext_0.2.4        
## [19] stopwords_2.0          cld2_1.2               qdapDictionaries_1.0.7
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.6.1      splines_3.6.3       RcppParallel_5.0.1 
##  [4] assertthat_0.2.1    latticeExtra_0.6-29 yaml_2.2.1         
##  [7] slam_0.1-47         pillar_1.4.4        backports_1.1.7    
## [10] glue_1.4.1          digest_0.6.25       checkmate_2.0.0    
## [13] colorspace_1.4-1    htmltools_0.4.0     Matrix_1.2-18      
## [16] pkgconfig_2.0.3     purrr_0.3.4         jpeg_0.1-8.1       
## [19] htmlTable_1.13.3    tibble_3.0.1        mgcv_1.8-31        
## [22] farver_2.0.3        generics_0.0.2      usethis_1.6.1      
## [25] ellipsis_0.3.1      withr_2.2.0         nnet_7.3-14        
## [28] magrittr_1.5        crayon_1.3.4        evaluate_0.14      
## [31] tokenizers_0.2.1    janeaustenr_0.1.5   fs_1.4.1           
## [34] nlme_3.1-148        SnowballC_0.7.0     xml2_1.3.2         
## [37] foreign_0.8-75      tools_3.6.3         data.table_1.12.8  
## [40] lifecycle_0.2.0     munsell_0.5.0       cluster_2.1.0      
## [43] compiler_3.6.3      rlang_0.4.6         grid_3.6.3         
## [46] rstudioapi_0.11     htmlwidgets_1.5.1   base64enc_0.1-3    
## [49] labeling_0.3        rmarkdown_2.1       gtable_0.3.0       
## [52] codetools_0.2-16    R6_2.4.1            gridExtra_2.3      
## [55] knitr_1.28          fastmatch_1.1-0     parallel_3.6.3     
## [58] Rcpp_1.0.4.6        vctrs_0.3.0         rpart_4.1-15       
## [61] acepack_1.4.1       png_0.1-7           tidyselect_1.1.0   
## [64] xfun_0.14