Context

The objective of this document is to summarize the exploratory analysis made on the corpus dataset, which is one of the first steps in building the Shiny application to predict words in a sentence. Finally, the next steps in building the Shiny app will be presented, answering the questions posed in the modeling task. Thus, this report won’t contain much of the code used in the exploratory analysis; it is available through the source code link in the final notes.

Preprocessing step

The code link is supplied because the preprocessing step is too large to include in this report. In summary, the preprocessing consists of sampling each corpus, tokenizing and stemming the text, and building the bag-of-words (document-feature) matrices; the details are in the linked source code.
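As an illustration of the sampling step only, here is a minimal sketch, not the exact code used: the file paths assume the usual en_US layout of the raw dataset, the helper name is illustrative, and the 2000-line sample size is taken from the table in the next section.

# Sketch: read each raw file and keep a random sample of 2000 lines per source.
set.seed(1234)
sample_corpus <- function(path, n_lines = 2000) {
      all_lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(all_lines, size = n_lines)
}

blog_sample <- sample_corpus("./final/en_US/en_US.blogs.txt")
news_sample <- sample_corpus("./final/en_US/en_US.news.txt")
twit_sample <- sample_corpus("./final/en_US/en_US.twitter.txt")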

Information about the files

The following table shows a summary of the files:

library(knitr)
kable(words_stats)   # summary of the three source files
| File    | Size in Megabytes | Number of lines | Sample size (MB) | Number of sample lines |
|---------|-------------------|-----------------|------------------|------------------------|
| blog    | 248.49350         | 899288          | 0.5576248        | 2000                   |
| news    | 19.17972          | 77259           | 0.4840775        | 2000                   |
| twitter | 301.39694         | 2360148         | 0.2588806        | 2000                   |

Word frequencies

First, we load the bag of words for a unigram model (N-gram with N = 1) and the tokenized words from the news, twitter and blog data.

load("./Milestone report/dfm_matrix.RData")
load("./Milestone report/tokens.RData")

Let’s first look at the distribution of the number of words per source.

load("./Milestone report/nchar_all.RData")
ggplot(nchar_all, aes(x = number_words, fill = var))+ 
      geom_histogram(alpha = 0.5, bins = 100) +
      labs(x = "Number of characters", y = "Count", title = "Number of character per document") 

As shown, the blog data is the source whose documents contain the most words. This is a bit surprising: the same number of documents was sampled from each source in the preprocessing code, and the news source was the one expected to have the most words. As expected, the twitter source has the fewest words per document, given the character limit imposed by the social network.

Word clouds are useful for looking at each dataset independently.

library(wordcloud)
library(RColorBrewer)   # for brewer.pal()
# word cloud of the blog tokens
wordcloud(unlist(blog_tokens), scale = c(3, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))

Some of the most frequent words are time-related, like week, year and day; this could be related to the nature of the blogs from which the words were taken.

# word cloud of the news tokens
wordcloud(unlist(news_tokens), scale = c(3, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))

In the news word cloud, “said” has a high frequency, which suggests that reporters mostly quote what other people said in relation to the news they are covering.

# word cloud of the twitter tokens
wordcloud(unlist(twit_tokens), scale = c(2.5, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))

The following table presents the 15 most frequent words from each dataset and from the merged dataset:

| Blog words | Blog freq | News words | News freq | Twitter words | Twitter freq | Merge words | Merge freq |
|------------|-----------|------------|-----------|---------------|--------------|-------------|------------|
| one        | 299       | said       | 506       | thank         | 127          | said        | 602        |
| like       | 288       | year       | 251       | like          | 124          | one         | 543        |
| get        | 241       | one        | 170       | get           | 115          | like        | 523        |
| just       | 240       | time       | 145       | just          | 114          | get         | 478        |
| time       | 234       | can        | 136       | go            | 112          | year        | 464        |
| can        | 215       | new        | 131       | day           | 106          | time        | 462        |
| make       | 206       | state      | 124       | love          | 106          | just        | 461        |
| go         | 205       | go         | 124       | good          | 86           | go          | 441        |
| year       | 175       | two        | 123       | time          | 83           | can         | 429        |
| love       | 166       | get        | 122       | follow        | 80           | make        | 377        |
| work       | 164       | say        | 122       | know          | 78           | day         | 357        |
| day        | 163       | first      | 118       | can           | 78           | work        | 331        |
| know       | 160       | like       | 111       | great         | 76           | new         | 316        |
| thing      | 159       | citi       | 109       | rt            | 74           | good        | 300        |
| think      | 159       | just       | 107       | one           | 74           | love        | 294        |
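These counts should be reproducible directly from the tokenized samples. Here is a minimal base-R sketch, assuming blog_tokens, news_tokens and twit_tokens are the token lists loaded earlier; the helper name is illustrative.

# Sketch: count unigram frequencies and keep the 15 most frequent words.
top_words <- function(tok, n = 15) {
      freq <- sort(table(unlist(tok)), decreasing = TRUE)
      head(freq, n)
}

top_words(blog_tokens)
top_words(news_tokens)
top_words(twit_tokens)
top_words(c(blog_tokens, news_tokens, twit_tokens))   # merged dataset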

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

The sample corpus contains 14016 unique words. To cover 50% of all word instances only 604 words are needed, which is about 4.31 percent of all unique words; to cover 90%, 5694 words are needed, about 40.63 percent of the total.
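These figures can be computed from a frequency-sorted word count; a minimal sketch, assuming the merged token lists loaded earlier (object names are illustrative):

# Sketch: cumulative coverage of word instances by a frequency-sorted vocabulary.
word_freq <- sort(table(unlist(c(blog_tokens, news_tokens, twit_tokens))),
                  decreasing = TRUE)
coverage  <- cumsum(word_freq) / sum(word_freq)

words_for_50 <- which(coverage >= 0.5)[1]   # unique words needed for 50% coverage
words_for_90 <- which(coverage >= 0.9)[1]   # unique words needed for 90% coverage

words_for_50 / length(word_freq) * 100      # share of the vocabulary, in percent
words_for_90 / length(word_freq) * 100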

Bi-gram and Tri-gram models

For this task, we will work only with the full merged dataset, built from the sample sets of each corpus; a sketch of how the n-gram counts can be obtained is shown below.
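This is not the exact code used, just a hedged base-R sketch; the helper count_ngrams and the merged_tokens object are illustrative names, and for simplicity the sketch ignores document boundaries when joining consecutive tokens.

# Sketch: count the most frequent n-grams from the merged token stream.
count_ngrams <- function(tok, n = 2, top = 15) {
      words <- unlist(tok)
      grams <- vapply(seq_len(length(words) - n + 1), function(i) {
            paste(words[i:(i + n - 1)], collapse = " ")
      }, character(1))
      head(sort(table(grams), decreasing = TRUE), top)
}

merged_tokens <- c(blog_tokens, news_tokens, twit_tokens)
count_ngrams(merged_tokens, n = 2)   # bi-grams
count_ngrams(merged_tokens, n = 3)   # tri-grams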

Bi-gram

| words       | freq |
|-------------|------|
| year old    | 50   |
| last year   | 43   |
| new york    | 36   |
| right now   | 33   |
| high school | 32   |
| last week   | 32   |
| year ago    | 31   |
| make sure   | 27   |
| even though | 26   |
| look like   | 26   |
| feel like   | 22   |
| two year    | 22   |
| first time  | 21   |
| everi day   | 20   |
| unit state  | 20   |

An interesting finding about the most used bi-grams, “year old” and “last year”, is that time-reference words are again heavily used.

Tri-gram

| words                | freq |
|----------------------|------|
| omg omg omg          | 14   |
| new york citi        | 5    |
| coupl year ago       | 4    |
| presid barack obama  | 4    |
| san diego state      | 4    |
| mum mum mum          | 4    |
| protect inform bill  | 3    |
| two year ago         | 3    |
| question whether can | 3    |
| 12th grade foothil   | 3    |
| grade foothil high   | 3    |
| foothil high school  | 3    |
| pleas let know       | 3    |
| fanni mae freddi     | 3    |
| mae freddi mac       | 3    |

One point of interest is that the most used trigram is an abbreviation of “oh my god”, followed by “new york city”.

Further points of interest

How do you evaluate how many of the words come from foreign languages?
A/ By crossing the bag of words with an English dictionary. It is not the most accurate technique, but it gives a rough idea of how many words could come from a foreign language.
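A minimal sketch of that idea; the word-list path is an assumption (any plain-text English word list with one word per line would do).

# Sketch: estimate the share of words that are not in an English word list.
# The path below is hypothetical; use any one-word-per-line English dictionary.
# Note: stemmed tokens (e.g. "citi") will also be flagged, so this overestimates.
english_words <- readLines("./data/english_wordlist.txt", encoding = "UTF-8")

vocab <- unique(unlist(c(blog_tokens, news_tokens, twit_tokens)))
not_in_dictionary <- setdiff(vocab, english_words)

length(not_in_dictionary) / length(vocab)   # rough share of possibly foreign words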

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

A/ There are two options, and they can be used simultaneously:
  1. Add new words to the dictionary.
  2. Add a placeholder word such as “Not_on_dictionary” (or whatever you want to call it), so that instead of producing NULLs or NaNs when crossing the corpora with the dictionary, every missing word is replaced with “Not_on_dictionary”, which increases the coverage; see the sketch below.
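A minimal sketch of option 2, reusing the hypothetical english_words vector from the previous sketch; the helper name is illustrative.

# Sketch: replace every out-of-vocabulary token with a single placeholder,
# so that dictionary lookups never return NULLs or NaNs.
replace_oov <- function(tok, dictionary, placeholder = "Not_on_dictionary") {
      ifelse(tok %in% dictionary, tok, placeholder)
}

merged_words <- unlist(c(blog_tokens, news_tokens, twit_tokens))
merged_words <- replace_oov(merged_words, english_words)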

Findings

  1. It’s important to take a small subset of the corpora, because the full corpus is too heavy to load on a single PC; the alternative is to use parallel processing across several machines.
  2. As the n in the n-gram model increases, it becomes easier to understand the context of the documents, or at least to get a rough idea of it, unlike the unigram model, which is mostly used for tasks like sentiment analysis.