knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE)
For our text analysis project, the data consist of three large text files (collections of written English) that are accessible from here.
After unzipping, the files have the following characteristics:
| File | Lines | Max line length (characters) |
|---|---|---|
| blogs | 899,288 | 40,833 |
| news | 77,259 | 5,760 |
| twitter | 2,360,148 | 140 |
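A rough sketch of how this table can be reproduced (the en_US.* file names are assumptions based on the standard layout of the course dataset):

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# line count and longest line per file
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) c(lines = length(x), max_chars = max(nchar(x))))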
For our objective, we consider that including the news file increases the statistical variance unnecessarily: news articles are much longer than what is typed on a cell phone, and their lexicon and tone are very different from the other two files. We therefore work with two models, one built from all three files and a second that excludes the news.
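Under the same assumptions as above, the two corpora could be assembled as follows (corpus_all and corpus_red are hypothetical names):

corpus_all <- c(blogs, news, twitter) # model 1: all three files
corpus_red <- c(blogs, twitter)       # model 2: news excluded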
Following the work of [1], we apply a series of transformations to clean the data.
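As a minimal sketch of such a cleaning pipeline with tm (only the stemming step is confirmed by the text below; every other step is an assumption):

library(tm)
corpus <- VCorpus(VectorSource(corpus_red))            # or corpus_all for the three-file model
corpus <- tm_map(corpus, content_transformer(tolower)) # lowercase everything
corpus <- tm_map(corpus, removePunctuation)            # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                # drop digits
corpus <- tm_map(corpus, stripWhitespace)              # collapse repeated spaces
corpus_stemmed <- tm_map(corpus, stemDocument)         # reduce words to their roots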
Using the data before the stemming step (i.e., without reducing words to their roots), we obtain the following word cloud:
library(wordcloud2) # interactive word cloud from a word/frequency table
wordcloud2(data = cloud2)
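A rough sketch of how a word/frequency table such as cloud2 could be built; the top-300 cutoff and the conversion back to a character vector are assumptions:

texts  <- sapply(corpus_stemmed, as.character) # corpus back to a character vector
tokens <- unlist(strsplit(texts, "\\s+"))
cloud2 <- as.data.frame(table(word = tokens), stringsAsFactors = FALSE)
cloud2 <- head(cloud2[order(cloud2$Freq, decreasing = TRUE), ], 300) # keep the most frequent words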
So we decided to continue with the stemmed dataset and the n-gram algorithm.
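A minimal sketch of n-gram extraction in base R (make_ngrams is a hypothetical helper; note that the tables below keep each word in its own column, uno, dos, ..., while this sketch pastes the words into a single column):

make_ngrams <- function(texts, n) {
  tokens <- unlist(strsplit(texts, "\\s+"))
  if (length(tokens) < n) return(NULL)
  rows  <- embed(tokens, n)[, n:1, drop = FALSE] # sliding windows of n tokens (may span line boundaries)
  grams <- apply(rows, 1, paste, collapse = "_")
  as.data.frame(table(gram = grams), stringsAsFactors = FALSE)
}
n_2grams_sketch <- make_ngrams(texts, 2)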
ggplot(data = head(n_1grams,20), aes(x = factor(n_1grams), y = Freq, fill = Freq)) +
geom_bar(stat="identity") + # build the bars
scale_fill_distiller(palette = 'Accent') + # an easy way to pick the color for the 'continuous' or discrete scale
xlab('Top 20') + theme_minimal() + ggtitle('1-grams') + # remaining tweaks
guides(fill=FALSE) + # remove the legend
coord_flip()
remove("n_1grams")
gc()
ggplot(data = head(n_2grams,20), aes(x = paste(uno, dos ,sep='_'), y = as.numeric(Freq), fill = as.numeric(Freq))) +
geom_bar(stat="identity") + # build the bars
scale_fill_distiller(palette = 'Accent') + # an easy way to pick the color for the 'continuous' or discrete scale
xlab('Top 20') + theme_minimal() + ggtitle('2-grams') + # remaining tweaks
guides(fill=FALSE) + # remove the legend
coord_flip()
remove("n_2grams")
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 4915621 262.6 8273852 441.9 6861544 366.5
## Vcells 70597216 538.7 211337468 1612.4 160561495 1225.0
Up to this point the entire Corpus was used. From here on, for reasons of computational capacity, the Corpus was divided into 20 parts, the 3- and 4-grams were extracted from each part, and only the grams with frequency greater than one were saved.
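A hedged sketch of that chunked extraction, reusing the hypothetical make_ngrams helper above (counts of grams that appear in several chunks would still need to be aggregated afterwards):

chunks <- split(texts, cut(seq_along(texts), 20, labels = FALSE)) # 20 roughly equal parts
counts <- lapply(chunks, function(ch) {
  g <- make_ngrams(ch, 3)
  g[g$Freq > 1, ] # keep only grams seen more than once in the chunk
})
n_3grams <- do.call(rbind, counts)
saveRDS(n_3grams, file = "n_3grams_app.RDS")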
n_3grams <- readRDS( file = "C:\\Users\\fou-f\\Desktop\\R-Coursera\\NLP\\project\\n_3grams_app.RDS")
n_3grams <- n_3grams[order(as.numeric(n_3grams$Freq), decreasing = TRUE), ]
ggplot(data = head(n_3grams,20), aes(x = paste(uno,dos,tres, sep = '_') , y = as.numeric(Freq), fill = as.numeric(Freq))) +
geom_bar(stat="identity") + # build the bars
scale_fill_distiller(palette = 'Accent') + # an easy way to pick the color for the 'continuous' or discrete scale
xlab('Top 20') + theme_minimal() + ggtitle('3-grams') + # remaining tweaks
guides(fill=FALSE) + # remove the legend
coord_flip()
remove("n_3grams")
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 4934942 263.6 8273852 441.9 8273852 441.9
## Vcells 70660290 539.1 211337468 1612.4 160561495 1225.0
ggplot(data = head(n_4grams,20), aes(x = paste(uno, dos, tres, cuatro, sep='_'), y = as.numeric(Freq), fill = as.numeric(Freq))) +
geom_bar(stat="identity") + # build the bars
scale_fill_distiller(palette = 'Accent') + # an easy way to pick the color for the 'continuous' or discrete scale
xlab('Top 20') + theme_minimal() + ggtitle('4-grams') + # remaining tweaks
guides(fill=FALSE) + # remove the legend
coord_flip()
remove("n_4grams")
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 4939440 263.8 8273852 441.9 8273852 441.9
## Vcells 70671562 539.2 211337468 1612.4 160561495 1225.0
1. Pengda Qin, Weiran Xu and Jun Guo, ‘A Targeted Retraining Scheme of Unsupervised Word Embeddings for Specific Supervised Tasks,’ in Advances in Knowledge Discovery and Data Mining, Springer, 2017.↩