library(tidytext)
library(tidyverse)
library(stopwords)
library(tm)
theme_set(theme_bw())
The goal of this exercise is to build and evaluate your first predictive model. You will use the n-gram and backoff models you built in previous tasks to build and evaluate your predictive model. The goal is to make the model efficient and accurate.
the_50s %>%
top_n(10, prop) %>%
mutate(word = reorder(word, prop)) %>%
ggplot(aes(word, prop)) +
geom_col(fill = 'lightblue', col= 'blue') +
ggtitle("Global top 10 words") +
xlab(NULL) +
coord_flip()
most_freq <- clean_tidy_data %>%
count(word, sort = T)
head(most_freq, 5)
# A tibble: 5 x 2
word n
<chr> <int>
1 im 24677
2 time 21119
3 dont 17715
4 day 17042
5 love 15957
Interpretation: Globally, the 5 most frequent words are : ‘im’, ‘time’, ‘dont’, ‘day’ and ‘love’.
freq %>%
group_by(source) %>%
top_n(3)
# A tibble: 9 x 4
# Groups: source [3]
word n source prop
<chr> <int> <chr> <dbl>
1 im 15902 Twitter 0.0159
2 love 10525 Twitter 0.0105
3 dont 9118 Twitter 0.00909
4 time 8656 Blogs 0.00743
5 im 6652 Blogs 0.00571
6 people 5704 Blogs 0.00489
7 time 5091 News 0.00400
8 people 4747 News 0.00373
9 city 3773 News 0.00297
Distributions were created for each set of n-grams, based on relative frequency.
#Distribution of bi-grams
bigram_90 %>%
top_n(10, prop) %>%
mutate(bigram = reorder(bigram, prop)) %>%
ggplot(aes(bigram, prop)) +
geom_col(fill = 'lightgreen', col = 'black') +
xlab(NULL) +
coord_flip() +
ggtitle("Bigrams that represent the 90%") +
labs(y = "Proportion")
head(bigram_90)
# A tibble: 6 x 4
bigram n prop coverage
<chr> <int> <dbl> <dbl>
1 of the 43825 0.00516 0.00516
2 in the 41748 0.00492 0.0101
3 to the 24894 0.00293 0.0130
4 for the 20596 0.00243 0.0154
5 on the 19731 0.00232 0.0178
6 to be 16201 0.00191 0.0197
print(paste('Number of bi-grams with 90% of coverage in the dataset: ', nrow(bigram_90)))
[1] "Number of bi-grams with 90% of coverage in the dataset: 1256942"
# A tibble: 6 x 4
trigram n prop coverage
<chr> <int> <dbl> <dbl>
1 one of the 3338 0.000413 0.00334
2 a lot of 3044 0.000376 0.00372
3 thanks for the 2474 0.000306 0.00403
4 to be a 1784 0.000221 0.00425
5 the of the 1686 0.000209 0.00446
6 going to be 1646 0.000204 0.00466
[1] "Number of tri-grams with 90% of coverage in the dataset: 4495716"
I use the tri-grams table as the basis for prediction.
trigrams_pred <- trigram_90 %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ")
knitr::kable(head(trigrams_pred))
| word1 | word2 | word3 | n | prop | coverage |
|---|---|---|---|---|---|
| one | of | the | 3338 | 0.0004128 | 0.0033439 |
| a | lot | of | 3044 | 0.0003764 | 0.0037203 |
| thanks | for | the | 2474 | 0.0003060 | 0.0040263 |
| to | be | a | 1784 | 0.0002206 | 0.0042469 |
| the | of | the | 1686 | 0.0002085 | 0.0044554 |
| going | to | be | 1646 | 0.0002036 | 0.0046590 |
Saving the trigram prediction table
# saveRDS(trigrams_pred, "./trigrams_pred.rds")
Notice in the tri-grams table, was separated by word and arranged by relative frequency. When the user inputs two words, the model matches those words and then finds the third word with the greatest relative frequency.
– END
References:
Supervised Machine Learning for Text Analysis in R
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Mexico.1252 LC_CTYPE=Spanish_Mexico.1252
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Mexico.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.7-8 NLP_0.2-1 stopwords_2.2 forcats_0.5.1
[5] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_1.4.0
[9] tidyr_1.1.3 tibble_3.1.2 ggplot2_3.3.5 tidyverse_1.3.1
[13] tidytext_0.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 lubridate_1.7.10 lattice_0.20-44 assertthat_0.2.1
[5] digest_0.6.27 utf8_1.2.1 slam_0.1-48 R6_2.5.0
[9] cellranger_1.1.0 backports_1.2.1 reprex_2.0.0 evaluate_0.14
[13] highr_0.9 httr_1.4.2 pillar_1.6.1 rlang_0.4.11
[17] readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.4 Matrix_1.3-3
[21] rmarkdown_2.9 labeling_0.4.2 munsell_0.5.0 broom_0.7.8
[25] compiler_4.1.0 janeaustenr_0.1.5 modelr_0.1.8 xfun_0.24
[29] pkgconfig_2.0.3 htmltools_0.5.1.1 tidyselect_1.1.1 viridisLite_0.4.0
[33] fansi_0.5.0 crayon_1.4.1 dbplyr_2.1.1 withr_2.4.2
[37] SnowballC_0.7.0 grid_4.1.0 jsonlite_1.7.2 gtable_0.3.0
[41] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1
[45] tokenizers_0.2.1 cli_3.0.0 stringi_1.6.2 farver_2.1.0
[49] fs_1.5.0 xml2_1.3.2 bslib_0.2.5.1 ellipsis_0.3.2
[53] generics_0.1.0 vctrs_0.3.8 tools_4.1.0 glue_1.4.2
[57] hms_1.1.0 parallel_4.1.0 yaml_2.2.1 colorspace_2.0-2
[61] rvest_1.0.0 knitr_1.33 haven_2.4.1 sass_0.4.0