The objective of this document is to summarize the exploratory analysis of the corpus dataset, one of the first steps in building the Shiny application that predicts words of a sentence. Lastly, the next steps in building the Shiny app will be presented by answering the questions posed in the modeling task. This report therefore does not contain most of the code used in the exploratory analysis; it is available through the source code link in the final notes.
The code link is supplied because the preprocessing step is too large to include in this report. In summary, the preprocessing samples each raw file, then cleans and tokenizes the text; the full details are in the source code.
The following table shows a summary of the corpus files:
kable(words_stats)
| File | Size in Megabytes | Number of lines | Sample size (MB) | Number of sample lines |
|---|---|---|---|---|
| blog | 248.49350 | 899288 | 0.5576248 | 2000 |
| news | 19.17972 | 77259 | 0.4840775 | 2000 |
| twitter | 301.39694 | 2360148 | 0.2588806 | 2000 |
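As a rough illustration, the size and line counts above could be gathered along the following lines; the file paths are assumptions about where the raw corpus files are stored.

```r
# A rough sketch of how the file statistics could be computed; the paths
# below are assumptions, adjust them to the location of the raw corpus files.
files <- c(blog    = "./final/en_US/en_US.blogs.txt",
           news    = "./final/en_US/en_US.news.txt",
           twitter = "./final/en_US/en_US.twitter.txt")

file_stats <- do.call(rbind, lapply(names(files), function(src) {
  lines <- readLines(files[src], encoding = "UTF-8", skipNul = TRUE)
  data.frame(File                = src,
             `Size in Megabytes` = file.size(files[src]) / 1024^2,
             `Number of lines`   = length(lines),
             check.names         = FALSE)
}))

knitr::kable(file_stats)
```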
First, we load the bag of words for an N-gram model with N = 1 and the tokenized documents from the news, Twitter, and blog data.
load("./Milestone report/dfm_matrix.RData")
load("./Milestone report/tokens.RData")
Let’s first look at the distribution of the number of words per document for each source.
load("./Milestone report/nchar_all.RData")
ggplot(nchar_all, aes(x = number_words, fill = var)) +
  geom_histogram(alpha = 0.5, bins = 100) +
  labs(x = "Number of words", y = "Count", fill = "Source",
       title = "Number of words per document")
As shown, the source with the most words per document is the blog data. This is a bit surprising: the same number of documents was sampled from each source in the preprocessing step, and the news source was the one expected to have the most words. The Twitter source has the fewest words per document, which was expected given the character limit of that social network.
Word clouds are useful for looking at each dataset independently.
wordcloud(unlist(blog_tokens), scale = c(3, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))
Many of the most frequent words are time related, such as week, year, and day; this may be explained by the nature of the blogs the words were taken from.
wordcloud(unlist(news_tokens), scale = c(3, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))
In the news word cloud, “said” has a high frequency, which suggests that most reporters quote what other people said about the news they are covering.
wordcloud(unlist(twit_tokens), scale = c(2.5, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))
The following table presents the 15 most frequent words from each dataset and from the merged dataset (a sketch of how these counts could be reproduced follows the table):
| Blog words | Blog freq | News words | News freq | Twitter words | Twitter freq | Merged words | Merged freq |
|---|---|---|---|---|---|---|---|
| one | 299 | said | 506 | thank | 127 | said | 602 |
| like | 288 | year | 251 | like | 124 | one | 543 |
| get | 241 | one | 170 | get | 115 | like | 523 |
| just | 240 | time | 145 | just | 114 | get | 478 |
| time | 234 | can | 136 | go | 112 | year | 464 |
| can | 215 | new | 131 | day | 106 | time | 462 |
| make | 206 | state | 124 | love | 106 | just | 461 |
| go | 205 | go | 124 | good | 86 | go | 441 |
| year | 175 | two | 123 | time | 83 | can | 429 |
| love | 166 | get | 122 | follow | 80 | make | 377 |
| work | 164 | say | 122 | know | 78 | day | 357 |
| day | 163 | first | 118 | can | 78 | work | 331 |
| know | 160 | like | 111 | great | 76 | new | 316 |
| thing | 159 | citi | 109 | rt | 74 | good | 300 |
| think | 159 | just | 107 | one | 74 | love | 294 |
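A minimal sketch of how these frequency columns could be obtained, assuming blog_tokens, news_tokens, and twit_tokens are quanteda tokens objects:

```r
library(quanteda)

# top-n word frequencies for a single source
top_words <- function(toks, n = 15) {
  freqs <- topfeatures(dfm(toks), n)            # named vector of word counts
  data.frame(words = names(freqs), freq = unname(freqs))
}

top_words(blog_tokens)    # blog columns of the table
top_words(news_tokens)    # news columns
top_words(twit_tokens)    # twitter columns

# the merged columns combine the three document-feature matrices
merged_dfm <- rbind(dfm(blog_tokens), dfm(news_tokens), dfm(twit_tokens))
topfeatures(merged_dfm, 15)
```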
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
The sample corpus contains 14016 unique words. To cover 50% of all word instances, only 604 unique words are needed, equivalent to about 4.31 percent of all unique words; to cover 90%, 5694 words are needed, equivalent to about 40.63 percent of the unique words.
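A minimal sketch of this coverage calculation, assuming dfm_matrix (loaded above) is the unigram document-feature matrix of the merged sample:

```r
library(quanteda)

# frequency-sorted dictionary and cumulative share of word instances covered
word_freq <- sort(colSums(dfm_matrix), decreasing = TRUE)
coverage  <- cumsum(word_freq) / sum(word_freq)

n_50 <- which(coverage >= 0.5)[1]   # unique words needed for 50% coverage
n_90 <- which(coverage >= 0.9)[1]   # unique words needed for 90% coverage

c(unique_words = length(word_freq),
  words_50 = n_50, pct_50 = round(100 * n_50 / length(word_freq), 2),
  words_90 = n_90, pct_90 = round(100 * n_90 / length(word_freq), 2))
```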
For this task, we will work only with the full merged dataset built from the sample sets of each corpus.
| words | freq |
|---|---|
| year old | 50 |
| last year | 43 |
| new york | 36 |
| right now | 33 |
| high school | 32 |
| last week | 32 |
| year ago | 31 |
| make sure | 27 |
| even though | 26 |
| look like | 26 |
| feel like | 22 |
| two year | 22 |
| first time | 21 |
| everi day | 20 |
| unit state | 20 |
An interesting finding from the most used bigrams, “year old” and “last year”, is the heavy use of time-reference words.
| words | freq |
|---|---|
| omg omg omg | 14 |
| new york citi | 5 |
| coupl year ago | 4 |
| presid barack obama | 4 |
| san diego state | 4 |
| mum mum mum | 4 |
| protect inform bill | 3 |
| two year ago | 3 |
| question whether can | 3 |
| 12th grade foothil | 3 |
| grade foothil high | 3 |
| foothil high school | 3 |
| pleas let know | 3 |
| fanni mae freddi | 3 |
| mae freddi mac | 3 |
One point of interest is the most used trigram, “omg omg omg”, an expression for “oh my god”, followed by “new york citi” (“new york city”).
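A sketch of how the bigram and trigram frequency tables could be built; merged_tokens is an assumed quanteda tokens object for the merged sample:

```r
library(quanteda)

# top-n n-gram frequencies from a tokens object
ngram_freq <- function(toks, n, top = 15) {
  ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
  freqs <- topfeatures(dfm(ng), top)
  data.frame(words = names(freqs), freq = unname(freqs))
}

ngram_freq(merged_tokens, n = 2)   # most frequent bigrams
ngram_freq(merged_tokens, n = 3)   # most frequent trigrams
```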
How do you evaluate how many of the words come from foreign languages?
A/ By crossing the bag of words with an English dictionary. It is not the most accurate technique, but it gives an idea of how many words could come from a foreign language.
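A sketch of this dictionary-crossing idea; english_words is a hypothetical character vector of dictionary entries (e.g. loaded from a word list file), and dfm_matrix is the document-feature matrix loaded above:

```r
library(quanteda)

vocab       <- featnames(dfm_matrix)          # unique words in the sample corpus
not_english <- setdiff(vocab, english_words)  # words absent from the dictionary

# rough share of words that could be foreign (or misspelled, slang, etc.)
length(not_english) / length(vocab)
```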
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
A/ There are two options, which could be used simultaneously:
1. Add new words to the dictionary.
2. Add a placeholder word such as “Not_on_dictionary” (or whatever you want to call it), so that instead of getting NULLs or NaNs when crossing the corpora with the dictionary, every unknown word is replaced with “Not_on_dictionary”, which increases the coverage (see the sketch after this list).
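A sketch of the second option, assuming merged_tokens is a quanteda tokens object for the merged sample and english_words is the hypothetical dictionary from above:

```r
library(quanteda)

# words in the corpus vocabulary that the dictionary does not know
unknown <- setdiff(featnames(dfm_matrix), english_words)

# replace every unknown word with a single placeholder token, so later
# dictionary lookups always match something instead of returning nothing
merged_tokens_unk <- tokens_replace(merged_tokens,
                                    pattern     = unknown,
                                    replacement = rep("Not_on_dictionary", length(unknown)))
```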