As the Capstone Project for the Johns Hopkins Data Science Specialization, I intend to build a text mining application.
The application will:
This project is based on data from SwiftKey, one of the first and most important companies providing such technology.
The data consists of three corpus text files with different structures:
Let’s display some raw rows from each corpus:
Let’s create a comparison table with some basic statistics for each corpus.
This time, the minimum word length is one character and stop words are not filtered out.
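A comparison table of this kind can be sketched in base R as follows. This is only an illustration, not the code behind the report: the three small character vectors stand in for the real corpus files (which would be read with `readLines()`), and the helper names `tokenize` and `corpus_stats` are hypothetical.

```r
# Hypothetical stand-ins for the three corpus files (assumption: the real
# data would be loaded with readLines() from the SwiftKey text files).
corpora <- list(
  blogs   = c("The quick brown fox jumps over the lazy dog.", "I love text mining!"),
  news    = c("Markets closed higher on Friday."),
  twitter = c("so tired today", "coffee time", "Good morning everyone")
)

# Split lines into lowercase word tokens of at least `min_chars` characters.
tokenize <- function(lines, min_chars = 1) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words[nchar(words) >= min_chars]
}

# One row of basic statistics for a single corpus.
corpus_stats <- function(lines) {
  words <- tokenize(lines, min_chars = 1)  # one character minimum, no stop-word filter
  data.frame(
    lines          = length(lines),
    words          = length(words),
    unique_words   = length(unique(words)),
    max_line_chars = max(nchar(lines))
  )
}

# Stack the per-corpus rows into one comparison table.
stats <- do.call(rbind, lapply(corpora, corpus_stats))
print(stats)
```

The tokenizer here is deliberately crude (it keeps only the letters a to z and apostrophes), so counts on real data would differ from those of a dedicated tokenizer.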
Word frequency is an important input for predicting upcoming words, which is what we intend to do.
This section compares the word frequencies of each corpus.
In the following charts, we require words to have at least four characters and filter out stop words.
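The filtering described above can be sketched in base R like this. The stop-word list is a tiny illustrative one (an assumption; a real analysis would use a fuller list, such as the one from the stopwords package), and `word_frequency` is a hypothetical helper name.

```r
# A tiny illustrative stop-word list (assumption, not the list used in the report).
stop_words <- c("the", "and", "that", "this", "with", "over", "from", "have")

# Count words that have at least `min_chars` characters and are not stop words.
word_frequency <- function(lines, min_chars = 4, stop_words = character(0)) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[nchar(words) >= min_chars & !(words %in% stop_words)]
  freq <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(freq), n = as.integer(freq), stringsAsFactors = FALSE)
}

sample_lines <- c(
  "The quick brown fox jumps over the lazy dog",
  "The lazy dog sleeps and the quick fox runs",
  "This fox runs with that quick dog"
)

top_words <- word_frequency(sample_lines, min_chars = 4, stop_words = stop_words)
head(top_words, 5)
```

Note that short but frequent words such as "fox" and "dog" disappear with a four-character minimum, which is why the statistics table and the charts use different settings.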
An important issue is the size of the corpora. We might not need to use every single line.
The amount of data we need to sample to obtain meaningful predictions is an important factor if the prediction model is to run under limited time and processing constraints.
What percentage of lines do we need to cover N % of the word instances in the corpora?
[1] 9.39
[1] 36.65
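The text does not state the values of N or the exact calculation behind these numbers. One possible reading, sketched below under that assumption, is: load the lines in order and find the smallest share of lines whose vocabulary accounts for at least N % of all word instances in the full corpus. The function name `lines_needed_pct` and the toy data are hypothetical.

```r
# Smallest percentage of lines (taken in order) whose vocabulary covers at
# least `target_pct` percent of all word instances in the full corpus.
lines_needed_pct <- function(lines, target_pct) {
  tokens_by_line <- strsplit(tolower(lines), "[^a-z']+")
  tokens_by_line <- lapply(tokens_by_line, function(w) w[nchar(w) > 0])
  all_words <- unlist(tokens_by_line)
  freq <- table(all_words)

  # Index of the line in which each distinct word first appears.
  line_id <- rep(seq_along(tokens_by_line), lengths(tokens_by_line))
  first_line <- tapply(line_id, all_words, min)

  # Word instances that become covered when each line is loaded.
  gained <- tapply(
    as.numeric(freq[names(first_line)]),
    factor(first_line, levels = seq_along(lines)),
    sum
  )
  gained[is.na(gained)] <- 0  # lines that add no new words

  coverage <- 100 * cumsum(gained) / length(all_words)
  k <- which(coverage >= target_pct)[1]
  round(100 * k / length(lines), 2)
}

toy_lines <- c("the cat sat", "the dog sat", "a bird flew", "the cat flew")
lines_needed_pct(toy_lines, 50)  # first line alone covers 7 of 12 instances
lines_needed_pct(toy_lines, 90)  # three of the four lines are needed
```

On a real corpus the result depends on line order, so shuffling the lines first (or averaging over several shuffles) would give a more stable estimate.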
To establish which percentage of the corpus lines gives the best trade-off between processing time and information (entropy), we create a function that randomly selects a maximum number of lines to load from the corpora and a random target percentage: the share of all word instances in the corpora that the top N most frequent words must cover.
Loading more lines from the corpus increases certainty, at the cost of additional time and computing resources.
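The sampling experiment described above can be sketched roughly as follows. This is a simplified illustration on a toy corpus, not the function used for the report: it records elapsed time and the number of top words needed to reach the target coverage, and it does not compute entropy. The names `top_words_for_coverage` and `run_trial` are hypothetical.

```r
set.seed(42)  # make the random trials reproducible

# Number of most frequent words whose counts add up to at least
# `target_pct` percent of all word instances.
top_words_for_coverage <- function(words, target_pct) {
  freq <- sort(table(words), decreasing = TRUE)
  cum_pct <- 100 * cumsum(as.numeric(freq)) / length(words)
  which(cum_pct >= target_pct)[1]
}

# One trial: sample up to `max_lines` lines, then measure how many top words
# are needed for the target coverage and how long the computation takes.
run_trial <- function(lines, max_lines, target_pct) {
  sampled <- sample(lines, min(max_lines, length(lines)))
  words <- unlist(strsplit(tolower(sampled), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  elapsed <- system.time(
    n_top <- top_words_for_coverage(words, target_pct)
  )[["elapsed"]]
  data.frame(
    pct_lines_loaded = 100 * length(sampled) / length(lines),
    target_pct       = target_pct,
    top_words_needed = n_top,
    seconds          = elapsed
  )
}

# Toy corpus (assumption) standing in for the real text files.
toy_lines <- rep(
  c("the cat sat on the mat", "the dog ate the bone", "a bird sang"),
  times = 50
)

# Five trials with a random line budget and a random target percentage each.
trials <- do.call(rbind, lapply(1:5, function(i) {
  run_trial(toy_lines,
            max_lines  = sample(10:150, 1),
            target_pct = sample(50:90, 1))
}))
print(trials)
```

Plotting `top_words_needed` and `seconds` against `pct_lines_loaded` over many such trials is one way to see where loading more lines stops paying off.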
What is the ratio of “% Lines of Corpus Needed” to “% Lines of Corpus Loaded”?
Let’s create bigrams and trigrams from the corpora.
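As a minimal base R sketch of n-gram extraction (the report itself loads packages such as quanteda and tidytext, which provide their own n-gram tools; the helpers `make_ngrams` and `ngram_frequency` below are hypothetical and only illustrate the idea):

```r
# Build all n-grams of consecutive words within a single line.
make_ngrams <- function(line, n = 2) {
  words <- unlist(strsplit(tolower(line), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  if (length(words) < n) return(character(0))  # line too short for an n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of n-grams over many lines; n-grams never span two lines.
ngram_frequency <- function(lines, n = 2) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(freq), count = as.integer(freq),
             stringsAsFactors = FALSE)
}

sample_lines <- c("I love text mining", "I love R", "text mining is fun")

bigrams  <- ngram_frequency(sample_lines, n = 2)
trigrams <- ngram_frequency(sample_lines, n = 3)
head(bigrams, 3)
head(trigrams, 3)
```

Because `n` is a parameter, the same helper also yields four-grams and beyond, although the number of distinct n-grams grows quickly with `n`.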
How do the top n-grams relate across the different corpora?
We can create higher-order n-grams as well.
The next steps could be:
Build the dictionaries
Prepare the training set
Train the model
Save the dictionaries and weights to files
Prepare a Shiny app
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
## [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
## [5] LC_TIME=Spanish_Spain.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] googleVis_0.6.4 wordcloud_2.6 RColorBrewer_1.1-2
## [4] Hmisc_4.4-0 Formula_1.2-3 survival_3.1-12
## [7] lattice_0.20-41 scales_1.1.1 ggplot2_3.3.0
## [10] stringr_1.4.0 cowplot_1.0.0 stringi_1.4.6
## [13] tm_0.7-7 NLP_0.2-0 quanteda_2.0.1
## [16] dplyr_0.8.5 tidyr_1.1.0 tidytext_0.2.4
## [19] stopwords_2.0 cld2_1.2 qdapDictionaries_1.0.7
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.6.1 splines_3.6.3 RcppParallel_5.0.1
## [4] assertthat_0.2.1 latticeExtra_0.6-29 yaml_2.2.1
## [7] slam_0.1-47 pillar_1.4.4 backports_1.1.7
## [10] glue_1.4.1 digest_0.6.25 checkmate_2.0.0
## [13] colorspace_1.4-1 htmltools_0.4.0 Matrix_1.2-18
## [16] pkgconfig_2.0.3 purrr_0.3.4 jpeg_0.1-8.1
## [19] htmlTable_1.13.3 tibble_3.0.1 mgcv_1.8-31
## [22] farver_2.0.3 generics_0.0.2 usethis_1.6.1
## [25] ellipsis_0.3.1 withr_2.2.0 nnet_7.3-14
## [28] magrittr_1.5 crayon_1.3.4 evaluate_0.14
## [31] tokenizers_0.2.1 janeaustenr_0.1.5 fs_1.4.1
## [34] nlme_3.1-148 SnowballC_0.7.0 xml2_1.3.2
## [37] foreign_0.8-75 tools_3.6.3 data.table_1.12.8
## [40] lifecycle_0.2.0 munsell_0.5.0 cluster_2.1.0
## [43] compiler_3.6.3 rlang_0.4.6 grid_3.6.3
## [46] rstudioapi_0.11 htmlwidgets_1.5.1 base64enc_0.1-3
## [49] labeling_0.3 rmarkdown_2.1 gtable_0.3.0
## [52] codetools_0.2-16 R6_2.4.1 gridExtra_2.3
## [55] knitr_1.28 fastmatch_1.1-0 parallel_3.6.3
## [58] Rcpp_1.0.4.6 vctrs_0.3.0 rpart_4.1-15
## [61] acepack_1.4.1 png_0.1-7 tidyselect_1.1.0
## [64] xfun_0.14