Exploratory Analysis of HC Corpora

Task 2 - Exploratory Data Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

Tasks to accomplish

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

First, we need to be clear that three fundamental things are required to start: + The corpora data in our working directory + Java installed (needed by the rJava and RWeka packages) + The required R libraries
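A quick sanity check that the data files are in place could look like the sketch below (the file names are taken from the code used later in this report):

# stop early if any required file is missing from the working directory
stopifnot(file.exists("badwords.txt", "en_US.blogs.txt",
                      "en_US.twitter.txt", "en_US.news.txt"))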

So, we load the required packages:

library(tm)
## Loading required package: NLP
library(SnowballC)
suppressMessages(library(gdata))
library(kableExtra)
suppressMessages(library(ggplot2))
library(gridExtra)
library(rJava)
library(XML)
library(wordcloud)
library(RColorBrewer)
library(caret)
library(NLP)
library(openNLP)
library(RWeka)
library(qdap)

Now we load the data and, from each corpus file, take a random sample with as many lines as there are entries in the "bad words" file, which will also be used in the next task.

# load the list of profane ("bad") words, also used in the next task
badEn <- readLines("badwords.txt")
## Warning in readLines("badwords.txt"): incomplete final line found on
## 'badwords.txt'
# draw from each corpus file a random sample of length(badEn) lines
blog    <- sample(suppressWarnings(readLines("en_US.blogs.txt")), length(badEn))
twitter <- sample(suppressWarnings(readLines("en_US.twitter.txt")), length(badEn))
news    <- sample(suppressWarnings(readLines("en_US.news.txt")), length(badEn))


class(blog)
## [1] "character"
class(news)
## [1] "character"
class(twitter)
## [1] "character"

Combining the three samples into a single character vector:

sample_d <- c(blog, news, twitter)

Now we clean the data: split it into sentences, remove numbers and punctuation, strip extra whitespace, convert to lower case, and drop empty lines.

txt <- sent_detect(sample_d)   # split the sample into sentences (qdap)
txt <- removeNumbers(txt)      # drop digits
txt <- removePunctuation(txt)  # drop punctuation marks
txt <- stripWhitespace(txt)    # collapse repeated whitespace
txt <- tolower(txt)            # normalize to lower case
txt <- txt[which(txt != "")]   # drop empty sentences
txt <- data.frame(txt, stringsAsFactors = FALSE)
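To illustrate what these steps do to a single line, here is a toy example (the sentence is made up, not taken from the corpus):

example <- "In 2024, prices rose 3%!  Really?"
tolower(stripWhitespace(removePunctuation(removeNumbers(example))))
# expected result: "in prices rose really"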

The next step is to create the n-grams (1, 2, and 3). For this purpose we tokenize the sample and then split the tokenizer output by n-gram length.

words <- WordTokenizer(txt)   # individual word tokens
grams <- NGramTokenizer(txt)  # 1- to 3-grams; longer n-grams come first in the output

# locate where the 2-grams start (i) and where the 1-grams start (j)
for (i in 1:length(grams)) {
  if (length(WordTokenizer(grams[i])) == 2) break
}
for (j in 1:length(grams)) {
  if (length(WordTokenizer(grams[j])) == 1) break
}

# Creating the grams(N)
onegram <- data.frame(table(words))
onegram <- onegram[order(onegram$Freq, decreasing = TRUE), ]

bigrams <- data.frame(table(grams[i:(j - 1)]))
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE), ]

trigrams <- data.frame(table(grams[1:(i - 1)]))
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE), ]
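An equivalent and somewhat more direct way to build these tables is to call NGramTokenizer with an explicit Weka_control for each n. This is only a sketch of an alternative; the tables built above are the ones used in the rest of the report:

# build an ordered frequency table of n-grams for a given n
make_ngram_table <- function(text, n) {
  tok <- NGramTokenizer(text, Weka_control(min = n, max = n))
  tab <- data.frame(table(tok))
  tab[order(tab$Freq, decreasing = TRUE), ]
}

bigrams_alt  <- make_ngram_table(txt$txt, 2)
trigrams_alt <- make_ngram_table(txt$txt, 3)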

Questions to consider

Some words are more frequent than others - what are the distributions of word frequencies?

There are several ways to show this, but here I focus on the wordcloud package, which gives a clearer representation and visualization of the data.

wordcloud(words, scale = c(3,0.1),max.words = 100, random.order = FALSE,
          rot.per = 0.5, use.r.layout = FALSE, colors = brewer.pal(8,"Accent"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

wordcloud(words, scale=c(5,0.1), max.words=100, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents

## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents

wordcloud(onegram$words, onegram$Freq, scale=c(5,0.5), max.words=300, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))

In the first graph (where stopwords are removed before plotting) we can see that "said" is the most common word in this sample; other frequent words are "one", "just" and "will". The raw unigram frequencies, which still include stopwords, are:

onegram[1:10,]
##       words Freq
## 14166   the 5613
## 14375    to 2862
## 753     and 2807
## 1         a 2617
## 9783     of 2413
## 7089     in 1792
## 6966      i 1443
## 14159  that 1179
## 7473     is 1079
## 5594    for 1012

What are the frequencies of 2-grams and 3-grams in the dataset?

For 2-grams the visualization and the frequencies are:

barplot(bigrams[1:20,2],col="blue",
        names.arg = bigrams$Var1[1:20],srt = 45,
        space=0.1, xlim=c(0,20),las=2)

bigrams[1:10,]
##          Var1 Freq
## 41558  of the  581
## 29916  in the  454
## 62961  to the  234
## 42312  on the  209
## 22595 for the  183
## 62108   to be  163
## 29406    in a  158
## 7577   at the  140
## 5620  and the  132
## 30919    is a  113

For 3-grams the visualization and the frequencies are:

barplot(trigrams[1:20,2],col="blue",
        names.arg = trigrams$Var1[1:20],srt = 45,
        space=0.1, xlim=c(0,20),las=2)

trigrams[1:10,]
##              Var1 Freq
## 58576  one of the   42
## 1157     a lot of   34
## 81373     the u s   26
## 32007 going to be   22
## 71531 some of the   21
## 84969     to be a   19
## 10640  as well as   17
## 12426  be able to   17
## 97207 you have to   16
## 60792 part of the   15
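As a preview of how these tables feed the prediction model, the most likely continuation of a two-word prefix can be read directly from the trigram table (the prefix here is just an illustrative example):

# most frequent trigrams that start with the prefix "one of"
prefix <- "one of"
head(trigrams[grepl(paste0("^", prefix, " "), trigrams$Var1), ], 3)
# "one of the" is the top continuation in this sample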

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

sumCover <- 0
for(i in 1:length(onegram$Freq)) {
  sumCover <- sumCover + onegram$Freq[i]
  if(sumCover >= 0.5*sum(onegram$Freq)){break}
}
print(i)
## [1] 142
sumCover <- 0
for(i in 1:length(onegram$Freq)) {
  sumCover <- sumCover + onegram$Freq[i]
  if(sumCover >= 0.9*sum(onegram$Freq)){break}
}
print(i)
## [1] 6136

So, as we see, we need 142 words to cover 50% of all word instances in this sample and 6136 words to cover 90%.
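The same counts can also be obtained without an explicit loop, using the cumulative sum of the sorted frequencies (a sketch equivalent to the loops above):

# number of top-ranked words needed to reach a coverage fraction p
coverage_words <- function(freq, p) {
  freq <- sort(freq, decreasing = TRUE)
  which(cumsum(freq) >= p * sum(freq))[1]
}

coverage_words(onegram$Freq, 0.5)  # 142 in this sample
coverage_words(onegram$Freq, 0.9)  # 6136 in this sample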

How do you evaluate how many of the words come from foreign languages?

In my view, we need to compare the text against a well-known English dictionary (many are available on the web); words that do not appear in the dictionary are candidates for foreign words. The same approach can be used to remove the "rude" or bad words that we loaded previously. In this sample, however, such words are too few and their impact too small to take into account, so we simply continue cleaning the data.
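A rough estimate could look like the sketch below, assuming a plain-text English word list is available; the file name english_words.txt is hypothetical, and the profanity removal reuses the bad-word list loaded earlier:

# share of unique words that do not appear in an English dictionary
dict <- tolower(readLines("english_words.txt"))   # hypothetical word list
foreign_share <- mean(!(onegram$words %in% dict))
foreign_share

# profanity can be removed the same way, using the bad-word list badEn
txt$txt <- removeWords(txt$txt, badEn)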

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Yes, for that we could take into account: + Prediction based on location (customs, traditions, holidays, names, places, etc.) + Learning the writing style of the author and non-colloquial words + Using an additional dictionary with n-grams: first remove low-frequency words from the dictionary, then use the remaining words for better prediction of n-grams (see the sketch below)
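A minimal sketch of the last idea, pruning low-frequency words before building the prediction tables (the threshold of one occurrence is only illustrative):

# keep only words seen more than once; rare words add little coverage
pruned_vocab <- onegram[onegram$Freq > 1, ]

# keep only bigrams whose first word is in the pruned vocabulary
first_word <- sapply(strsplit(as.character(bigrams$Var1), " "), `[`, 1)
bigrams_pruned <- bigrams[first_word %in% pruned_vocab$words, ]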