knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE)

Raw data

For our text analysis project, the data set consists of three large text files (collections of lines of words), which are accessible from here.

After unzipping, we have the following files:

File      Lines       Max line length (characters)
blogs     899,288     40,833
news      77,259      5,760
twitter   2,360,148   140
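The table above can be reproduced with a few lines of base R. This is a minimal sketch, assuming the unzipped files keep the standard names en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt and sit in the working directory:

files <- c(blogs = "en_US.blogs.txt", news = "en_US.news.txt", twitter = "en_US.twitter.txt")
stats <- sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)   # skipNul avoids stopping at embedded nulls
  c(lines = length(lines), max_chars = max(nchar(lines)))     # number of lines and longest line
})
t(stats)   # one row per file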

For our objective, we consider that including the news file increases the statistical variance unnecessarily, so we will work with two models: one built from the three files and a second one that leaves out the news (news lines are much longer than what is typically written on a cell phone, and their lexicon and tone are very different from the other two files).
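As a hedged illustration of the two models (not the exact code used here), assuming the three files have already been read with readLines() into the character vectors blogs, news and twitter:

corpus_full    <- c(blogs, news, twitter)   # model 1: all three sources
corpus_no_news <- c(blogs, twitter)         # model 2: blogs + twitter only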

Data wrangling

Following the work of [1], we perform a series of transformations to clean the data, including stemming to keep only the root of each word.
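As an illustration only (the exact pipeline used for this report is not reproduced here), a typical cleaning pass with the tm package could look like the sketch below; the corpus object names are hypothetical:

library(tm)
corp <- VCorpus(VectorSource(corpus_no_news))        # hypothetical corpus from the previous step
corp <- tm_map(corp, content_transformer(tolower))   # lower-case everything
corp <- tm_map(corp, removePunctuation)              # drop punctuation
corp <- tm_map(corp, removeNumbers)                  # drop digits
corp <- tm_map(corp, stripWhitespace)                # collapse repeated spaces
corp_stem <- tm_map(corp, stemDocument)              # stemmed version, kept alongside the unstemmed one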

What do our words look like?

Current results

Comparing the data before and after the stemming step (which keeps only the root of each word), we have:

  • Without stemming, we need only 1,015 words to cover 50% of the word occurrences and 148,076 to cover 90% (see the coverage sketch after this list)
wordcloud2(data = cloud2)
  • With stemming, we need only 666 words to cover 50% of the word occurrences and 117,747 to cover 90%
wordcloud2(data = cloud2)
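The coverage figures above can be derived from a term-frequency table with a cumulative sum. A minimal sketch, where freq_table is a hypothetical data frame with one row per word and a Freq column:

words_for_coverage <- function(freqs, p) {
  freqs <- sort(freqs, decreasing = TRUE)   # word frequencies, most frequent first
  cum   <- cumsum(freqs) / sum(freqs)       # cumulative share of all word occurrences
  which(cum >= p)[1]                        # smallest vocabulary size covering proportion p
}
# e.g. words_for_coverage(freq_table$Freq, 0.5); words_for_coverage(freq_table$Freq, 0.9)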

So we decided to continue with the stemmed dataset and an n-gram algorithm.
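How the n-grams could be extracted is sketched below in base R only; the helper name and the whitespace tokenisation are illustrative assumptions, not the code that produced the tables plotted next:

count_ngrams <- function(tokens, n) {
  # slide a window of length n over the token vector and join each window with "_"
  idx   <- seq_len(length(tokens) - n + 1)
  grams <- vapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = "_"), character(1))
  as.data.frame(sort(table(grams), decreasing = TRUE), stringsAsFactors = FALSE)
}
# e.g. n_2grams_demo <- count_ngrams(unlist(strsplit(corpus_no_news, "\\s+")), 2)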

ggplot(data = head(n_1grams,20), aes(x = factor(n_1grams), y = Freq, fill = Freq)) +
  geom_bar(stat="identity") +   # se construyen las barras
  scale_fill_distiller(palette = 'Accent')+ # la magia de poder escoger facÃ???l el color de la escala 'continua' o discreta
  xlab('Top 20') + theme_minimal() +  ggtitle('1-grams')+  #demas cosas
  guides(fill=FALSE) + #quitar leyenda 
  coord_flip()
remove("n_1grams")
gc()

Top 20 2-gram

ggplot(data = head(n_2grams,20), aes(x = paste(uno, dos ,sep='_'), y = as.numeric(Freq), fill = as.numeric(Freq))) +
  geom_bar(stat="identity") +   # se construyen las barras
  scale_fill_distiller(palette = 'Accent')+ # la magia de poder escoger facÃ???l el color de la escala 'continua' o discreta
  xlab('Top 20') + theme_minimal() +  ggtitle('2-grams')+  #demas cosas
  guides(fill=FALSE) + #quitar leyenda 
  coord_flip()

remove("n_2grams")
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4915621 262.6    8273852  441.9   6861544  366.5
## Vcells 70597216 538.7  211337468 1612.4 160561495 1225.0

Up to this point the whole corpus is used. From here onwards, due to computational constraints, the corpus was divided into 20 parts, the 3- and 4-grams were extracted from each part, and only the n-grams with frequency greater than one were saved.
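A hedged sketch of that splitting-and-filtering step, reusing the count_ngrams helper sketched earlier (object and file names are illustrative, not the ones used to build the saved RDS files):

chunks <- split(corpus_no_news, cut(seq_along(corpus_no_news), 20, labels = FALSE))   # 20 roughly equal parts
n_3grams_parts <- lapply(chunks, function(txt) {
  grams <- count_ngrams(unlist(strsplit(txt, "\\s+")), 3)   # extract 3-grams from this part
  grams[grams$Freq > 1, ]                                   # keep only n-grams seen more than once
})
n_3grams_app <- do.call(rbind, n_3grams_parts)
# saveRDS(n_3grams_app, "n_3grams_app.RDS")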

Top 20 3-gram

n_3grams <- readRDS( file = "C:\\Users\\fou-f\\Desktop\\R-Coursera\\NLP\\project\\n_3grams_app.RDS")
n_3grams <- n_3grams[order(as.numeric(n_3grams$Freq), decreasing = TRUE), ]
ggplot(data = head(n_3grams,20), aes(x = paste(uno,dos,tres, sep = '_') , y = as.numeric(Freq), fill = as.numeric(Freq))) +
  geom_bar(stat="identity") +   # se construyen las barras
  scale_fill_distiller(palette = 'Accent')+ # la magia de poder escoger facÃ???l el color de la escala 'continua' o discreta
  xlab('Top 20') + theme_minimal() +  ggtitle('3-grams')+  #demas cosas
  guides(fill=FALSE) + #quitar leyenda 
  coord_flip()

remove("n_3grams")
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4934942 263.6    8273852  441.9   8273852  441.9
## Vcells 70660290 539.1  211337468 1612.4 160561495 1225.0

Top 20 4-gram

ggplot(data = head(n_4grams,20), aes(x = paste(uno, dos, tres, cuatro, sep='_'), y = as.numeric(Freq), fill = as.numeric(Freq))) +
  geom_bar(stat="identity") +   # se construyen las barras
  scale_fill_distiller(palette = 'Accent')+ # la magia de poder escoger facÃ???l el color de la escala 'continua' o discreta
  xlab('Top 20') + theme_minimal() +  ggtitle('4-grams')+  #demas cosas
  guides(fill=FALSE) + #quitar leyenda 
  coord_flip()

remove("n_4grams")
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4939440 263.8    8273852  441.9   8273852  441.9
## Vcells 70671562 539.2  211337468 1612.4 160561495 1225.0

  [1] Pengda Qin, Weiran Xu and Jun Guo, 'A Targeted Retraining Scheme of Unsupervised Word Embeddings for Specific Supervised Tasks', in Advances in Knowledge Discovery and Data Mining, 2017, Springer.