knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE)

Raw data

For our text analysis project, the data set consists of three large text files (collections of lines of words), which are accessible from here.

After unzipping, we have the following files:

File      Lines       Max line length (characters)
blogs     899,288     40,833
news      77,259      5,760
twitter   2,360,148   140
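The table above can be reproduced with a few lines of base R. This is a minimal sketch, assuming the unzipped files keep the standard names en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt and sit in the working directory:

files <- c(blogs = "en_US.blogs.txt", news = "en_US.news.txt", twitter = "en_US.twitter.txt")
stats <- sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)   # skipNul avoids stopping at embedded nulls
  c(lines = length(lines), max_chars = max(nchar(lines)))     # number of lines and longest line
})
t(stats)   # one row per file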

For our objective, we consider that including the news file increases the statistical variance unnecessarily, so we will work with two models: one built from the three files and a second one that leaves out the news (news lines are much longer than what is typically written on a cell phone, and their lexicon and tone are very different from the other two files).
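As a hedged illustration of the two models (not the exact code used here), assuming the three files have already been read with readLines() into the character vectors blogs, news and twitter:

corpus_full    <- c(blogs, news, twitter)   # model 1: all three sources
corpus_no_news <- c(blogs, twitter)         # model 2: blogs + twitter only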

Data wrangling

Following the work of [1], we perform a series of transformations to clean the data, including stemming to keep only the root of each word.
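As an illustration only (the exact pipeline used for this report is not reproduced here), a typical cleaning pass with the tm package could look like the sketch below; the corpus object names are hypothetical:

library(tm)
corp <- VCorpus(VectorSource(corpus_no_news))        # hypothetical corpus from the previous step
corp <- tm_map(corp, content_transformer(tolower))   # lower-case everything
corp <- tm_map(corp, removePunctuation)              # drop punctuation
corp <- tm_map(corp, removeNumbers)                  # drop digits
corp <- tm_map(corp, stripWhitespace)                # collapse repeated spaces
corp_stem <- tm_map(corp, stemDocument)              # stemmed version, kept alongside the unstemmed one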

What do our words look like?

Current results

Comparing the data before and after the stemming step (which keeps only the root of each word), we have:

  • Without stemming, we need only 1,015 words to cover 50% of the word occurrences and 148,076 to cover 90% (see the coverage sketch after this list)
wordcloud2(data = cloud2)
  • With stemming, we need only 666 words to cover 50% of the word occurrences and 117,747 to cover 90%
wordcloud2(data = cloud2)
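The coverage figures above can be derived from a term-frequency table with a cumulative sum. A minimal sketch, where freq_table is a hypothetical data frame with one row per word and a Freq column:

words_for_coverage <- function(freqs, p) {
  freqs <- sort(freqs, decreasing = TRUE)   # word frequencies, most frequent first
  cum   <- cumsum(freqs) / sum(freqs)       # cumulative share of all word occurrences
  which(cum >= p)[1]                        # smallest vocabulary size covering proportion p
}
# e.g. words_for_coverage(freq_table$Freq, 0.5); words_for_coverage(freq_table$Freq, 0.9)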

So we decided to continue with the stemmed dataset and an n-gram algorithm.
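How the n-grams could be extracted is sketched below in base R only; the helper name and the whitespace tokenisation are illustrative assumptions, not the code that produced the tables plotted next:

count_ngrams <- function(tokens, n) {
  # slide a window of length n over the token vector and join each window with "_"
  idx   <- seq_len(length(tokens) - n + 1)
  grams <- vapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = "_"), character(1))
  as.data.frame(sort(table(grams), decreasing = TRUE), stringsAsFactors = FALSE)
}
# e.g. n_2grams_demo <- count_ngrams(unlist(strsplit(corpus_no_news, "\\s+")), 2)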

ggplot(data = head(n_1grams,20), aes(x = factor(n_1grams), y = Freq, fill = Freq)) +
  geom_bar(stat="identity") +   # se construyen las barras
  scale_fill_distiller(palette = 'Accent')+ # la magia de poder escoger facÃ???l el color de la escala 'continua' o discreta
  xlab('Top 20') + theme_minimal() +  ggtitle('1-grams')+  #demas cosas
  guides(fill=FALSE) + #quitar leyenda 
  coord_flip()
remove("n_1grams")
gc()

Top 20 2-gram

ggplot(data = head(n_2grams,20), aes(x = paste(uno, dos ,sep='_'), y = as.numeric(Freq), fill = as.numeric(Freq))) +
  geom_bar(stat="identity") +   # se construyen las barras
  scale_fill_distiller(palette = 'Accent')+ # la magia de poder escoger facÃ???l el color de la escala 'continua' o discreta
  xlab('Top 20') + theme_minimal() +  ggtitle('2-grams')+  #demas cosas
  guides(fill=FALSE) + #quitar leyenda 
  coord_flip()

remove("n_2grams")
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4915621 262.6    8273852  441.9   6861544  366.5
## Vcells 70597216 538.7  211337468 1612.4 160561495 1225.0

Up to this point the whole corpus is used. From here onwards, due to computational constraints, the corpus was divided into 20 parts, the 3- and 4-grams were extracted from each part, and only the n-grams with frequency greater than one were saved.
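A hedged sketch of that splitting-and-filtering step, reusing the count_ngrams helper sketched earlier (object and file names are illustrative, not the ones used to build the saved RDS files):

chunks <- split(corpus_no_news, cut(seq_along(corpus_no_news), 20, labels = FALSE))   # 20 roughly equal parts
n_3grams_parts <- lapply(chunks, function(txt) {
  grams <- count_ngrams(unlist(strsplit(txt, "\\s+")), 3)   # extract 3-grams from this part
  grams[grams$Freq > 1, ]                                   # keep only n-grams seen more than once
})
n_3grams_app <- do.call(rbind, n_3grams_parts)
# saveRDS(n_3grams_app, "n_3grams_app.RDS")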

Top 20 3-gram

n_3grams <- readRDS( file = "C:\\Users\\fou-f\\Desktop\\R-Coursera\\NLP\\project\\n_3grams_app.RDS")
n_3grams <- n_3grams[order(as.numeric(n_3grams$Freq), decreasing = TRUE), ]
ggplot(data = head(n_3grams,20), aes(x = paste(uno,dos,tres, sep = '_') , y = as.numeric(Freq), fill = as.numeric(Freq))) +
  geom_bar(stat="identity") +   # se construyen las barras
  scale_fill_distiller(palette = 'Accent')+ # la magia de poder escoger facÃ???l el color de la escala 'continua' o discreta
  xlab('Top 20') + theme_minimal() +  ggtitle('3-grams')+  #demas cosas
  guides(fill=FALSE) + #quitar leyenda 
  coord_flip()

remove("n_3grams")
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4934942 263.6    8273852  441.9   8273852  441.9
## Vcells 70660290 539.1  211337468 1612.4 160561495 1225.0

Top 20 4-gram

ggplot(data = head(n_4grams,20), aes(x = paste(uno, dos, tres, cuatro, sep='_'), y = as.numeric(Freq), fill = as.numeric(Freq))) +
  geom_bar(stat="identity") +   # se construyen las barras
  scale_fill_distiller(palette = 'Accent')+ # la magia de poder escoger facÃ???l el color de la escala 'continua' o discreta
  xlab('Top 20') + theme_minimal() +  ggtitle('4-grams')+  #demas cosas
  guides(fill=FALSE) + #quitar leyenda 
  coord_flip()

remove("n_4grams")
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4939440 263.8    8273852  441.9   8273852  441.9
## Vcells 70671562 539.2  211337468 1612.4 160561495 1225.0

  [1] Pengda Qin, Weiran Xu and Jun Guo, 'A Targeted Retraining Scheme of Unsupervised Word Embeddings for Specific Supervised Tasks', in Advances in Knowledge Discovery and Data Mining, 2017, Springer.