library(tidytext)
library(tidyverse)
library(stopwords)
library(tm)

theme_set(theme_bw())

Background

The goal of this exercise is to build and evaluate your first predictive model. You will use the n-gram and backoff models you built in previous tasks to build and evaluate your predictive model. The goal is to make the model efficient and accurate.

Total word frequency

The most frequent words

the_50s %>%
  top_n(10, prop) %>%
  mutate(word = reorder(word, prop)) %>%
  ggplot(aes(word, prop)) +
  geom_col(fill = 'lightblue', col= 'blue') +
  ggtitle("Global top 10 words") +
  xlab(NULL) +
  coord_flip()

most_freq <- clean_tidy_data %>% 
  count(word, sort = T)

head(most_freq, 5)
# A tibble: 5 x 2
  word      n
  <chr> <int>
1 im    24677
2 time  21119
3 dont  17715
4 day   17042
5 love  15957

Interpretation: Globally, the 5 most frequent words are : ‘im’, ‘time’, ‘dont’, ‘day’ and ‘love’.

Word frequency by source

freq %>% 
  group_by(source) %>% 
  top_n(3)
# A tibble: 9 x 4
# Groups:   source [3]
  word       n source     prop
  <chr>  <int> <chr>     <dbl>
1 im     15902 Twitter 0.0159 
2 love   10525 Twitter 0.0105 
3 dont    9118 Twitter 0.00909
4 time    8656 Blogs   0.00743
5 im      6652 Blogs   0.00571
6 people  5704 Blogs   0.00489
7 time    5091 News    0.00400
8 people  4747 News    0.00373
9 city    3773 News    0.00297

Uni-gram Distribution

Distributions were created for each set of n-grams, based on relative frequency.

Bi-grams frequency

#Distribution of bi-grams
bigram_90 %>%
  top_n(10, prop) %>%
  mutate(bigram = reorder(bigram, prop)) %>%
  ggplot(aes(bigram, prop)) +
  geom_col(fill = 'lightgreen', col = 'black') +
  xlab(NULL) +
  coord_flip() + 
  ggtitle("Bigrams that represent the 90%") +
  labs(y = "Proportion")

head(bigram_90)
# A tibble: 6 x 4
  bigram      n    prop coverage
  <chr>   <int>   <dbl>    <dbl>
1 of the  43825 0.00516  0.00516
2 in the  41748 0.00492  0.0101 
3 to the  24894 0.00293  0.0130 
4 for the 20596 0.00243  0.0154 
5 on the  19731 0.00232  0.0178 
6 to be   16201 0.00191  0.0197 
print(paste('Number of bi-grams with 90% of coverage in the dataset: ', nrow(bigram_90))) 
[1] "Number of bi-grams with 90% of coverage in the dataset:  1256942"

Tri-grams frequency

# A tibble: 6 x 4
  trigram            n     prop coverage
  <chr>          <int>    <dbl>    <dbl>
1 one of the      3338 0.000413  0.00334
2 a lot of        3044 0.000376  0.00372
3 thanks for the  2474 0.000306  0.00403
4 to be a         1784 0.000221  0.00425
5 the of the      1686 0.000209  0.00446
6 going to be     1646 0.000204  0.00466
[1] "Number of tri-grams with 90% of coverage in the dataset:  4495716"

tri-gram prediction podel

I use the tri-grams table as the basis for prediction.

trigrams_pred <- trigram_90 %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")
knitr::kable(head(trigrams_pred))
word1 word2 word3 n prop coverage
one of the 3338 0.0004128 0.0033439
a lot of 3044 0.0003764 0.0037203
thanks for the 2474 0.0003060 0.0040263
to be a 1784 0.0002206 0.0042469
the of the 1686 0.0002085 0.0044554
going to be 1646 0.0002036 0.0046590

Saving the trigram prediction table

# saveRDS(trigrams_pred, "./trigrams_pred.rds")

Notice in the tri-grams table, was separated by word and arranged by relative frequency. When the user inputs two words, the model matches those words and then finds the third word with the greatest relative frequency.

– END

References:

Text Mining with R

Supervised Machine Learning for Text Analysis in R

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Mexico.1252  LC_CTYPE=Spanish_Mexico.1252   
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C                   
[5] LC_TIME=Spanish_Mexico.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tm_0.7-8        NLP_0.2-1       stopwords_2.2   forcats_0.5.1  
 [5] stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_1.4.0    
 [9] tidyr_1.1.3     tibble_3.1.2    ggplot2_3.3.5   tidyverse_1.3.1
[13] tidytext_0.3.1 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        lubridate_1.7.10  lattice_0.20-44   assertthat_0.2.1 
 [5] digest_0.6.27     utf8_1.2.1        slam_0.1-48       R6_2.5.0         
 [9] cellranger_1.1.0  backports_1.2.1   reprex_2.0.0      evaluate_0.14    
[13] highr_0.9         httr_1.4.2        pillar_1.6.1      rlang_0.4.11     
[17] readxl_1.3.1      rstudioapi_0.13   jquerylib_0.1.4   Matrix_1.3-3     
[21] rmarkdown_2.9     labeling_0.4.2    munsell_0.5.0     broom_0.7.8      
[25] compiler_4.1.0    janeaustenr_0.1.5 modelr_0.1.8      xfun_0.24        
[29] pkgconfig_2.0.3   htmltools_0.5.1.1 tidyselect_1.1.1  viridisLite_0.4.0
[33] fansi_0.5.0       crayon_1.4.1      dbplyr_2.1.1      withr_2.4.2      
[37] SnowballC_0.7.0   grid_4.1.0        jsonlite_1.7.2    gtable_0.3.0     
[41] lifecycle_1.0.0   DBI_1.1.1         magrittr_2.0.1    scales_1.1.1     
[45] tokenizers_0.2.1  cli_3.0.0         stringi_1.6.2     farver_2.1.0     
[49] fs_1.5.0          xml2_1.3.2        bslib_0.2.5.1     ellipsis_0.3.2   
[53] generics_0.1.0    vctrs_0.3.8       tools_4.1.0       glue_1.4.2       
[57] hms_1.1.0         parallel_4.1.0    yaml_2.2.1        colorspace_2.0-2 
[61] rvest_1.0.0       knitr_1.33        haven_2.4.1       sass_0.4.0