The goal of this report is to show my exploratory analysis and my goals for the eventual app and algorithm.
In this report I also explain which tools and approaches I have chosen, and why I want to stick with them for the rest of the Capstone project. My goals are:
To be able to create a corpus from 3 different sources of sentences
To maximize the volume of the corpus within the limited resources of my computer
To efficiently calculate the probability of the "next" word in an n-gram using the loaded corpus
My decisions taken during the exploratory analysis:
1. tm vs quanteda: create the corpus with tm, but build tokens and n-grams with quanteda (a code sketch of decisions 1 and 5 follows this list)
2. Take the idea of a Markov chain (but without using an R library for it) and create a table of probabilities of the "next" word in n-grams
3. Remove stopwords or not? If we analyze texts to understand what is trending, they need to be removed; if we build a keyboard, we must keep stopwords, since they are the most used in typing
4. Stemming or no stemming: same as 3. For a keyboard we want to show the full word, not just a stem
5. Corpus of paragraphs or lines? In the input data the "lines" are actually paragraphs. If we tokenize them as they are, we get weird n-grams. I decided to break the text into real sentences and tokenize those rather than the lines given in the text files.
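A minimal sketch of decisions 1 and 5 (not the exact code used for this report; the sample file path is one of those listed further below, and the sentence-splitting regex is only a simple heuristic):

library(tm)
library(quanteda)

# Read one of the sampled input files
rawLines <- readLines("final/en_US/en_US.blogs.txt.smpl.txt",
                      encoding = "UTF-8", skipNul = TRUE)

# Decision 1, first half: build the corpus with tm
tmCorp <- VCorpus(VectorSource(rawLines))

# Decision 5: each "line" is really a paragraph, so split it into sentences
sents <- unlist(strsplit(rawLines, "(?<=[.!?])\\s+", perl = TRUE))

# Decision 1, second half: tokens and n-grams with quanteda
toks <- tokens(sents, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
tri  <- tokens_ngrams(toks, n = 3)   # trigrams, words joined with "_"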
Goal: Perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Sizes of input files:
## | File name              | Size, MB | Num of lines | Num of words |
## |----------------------:|:--------:|:------------:|:------------:|
## | en_US.blogs.txt | 201 | 899288 | 37334117 |
## | en_US.news.txt | 197 | 1010242 | 34365936 |
## | en_US.twitter.txt | 160 | 2360148 | 30373559 |
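A hedged sketch of how the table above could be reproduced (summariseFile is a hypothetical helper; the counts in the report may have been produced differently, e.g. with wc):

library(stringi)

summariseFile <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(file    = basename(path),
             size_MB = round(file.size(path) / 1024^2),
             lines   = length(lines),
             words   = sum(stri_count_words(lines)))
}

files <- file.path("final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, summariseFile))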
On my computer I was not able to build a single corpus from more than about 5000 lines, so I used samples of 4500 lines from each of the 3 inputs.
After trying different optimizations I found that I can create smaller corpora and join them together.
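A minimal sketch of the sampling step (sampleFile is a hypothetical helper; the seed and the exact reading strategy are assumptions):

sampleFile <- function(path, n = 4500, seed = 1) {
  set.seed(seed)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  out   <- paste0(path, ".smpl.txt")
  writeLines(sample(lines, n), out)
  out
}

smplFiles <- sapply(file.path("final/en_US",
                              c("en_US.blogs.txt", "en_US.news.txt",
                                "en_US.twitter.txt")),
                    sampleFile)

The output below lists each sample file together with the number of sentences obtained from it, followed by the combined total.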
## [1] "final/en_US/en_US.blogs.txt.smpl.txt"
## [1] 13068
## [1] "final/en_US/en_US.news.txt.smpl.txt"
## [1] 9960
## [1] "final/en_US/en_US.twitter.txt.smpl.txt"
## [1] 8115
## [1] 31143
Exploring n-grams I realized that for analysis (e.g. sentiment or trends) we can remove stopwords and work with stems, but for the final prediction of the "next" word from the 3 previously typed words I should not remove stopwords from the corpus.
We will also see later that the n-grams are quite different for blogs, news and tweets. I think the final prediction will need to "sense" the context in which the typing happens and use the corresponding corpus.
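A hedged illustration of decisions 3 and 4, assuming quanteda and a character vector sents of sentences as in the earlier sketch:

library(quanteda)

toks <- tokens(sents, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Analysis variant (consistent with the stemmed, stopword-free listings below):
# drop stopwords and reduce words to their stems
toksAnalysis <- tokens_wordstem(tokens_remove(toks, stopwords("en")))

# Prediction/keyboard variant: keep stopwords and full word forms,
# and build 4-grams (3 typed words plus the "next" one)
toksPredict <- tokens_ngrams(toks, n = 4)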
Top 30 unigrams of blogs:
## one like time can will just get make go year know day thing
## 664 646 600 585 580 579 535 467 444 414 389 379 374
## think love use look work peopl even also now us want way new
## 361 354 352 348 344 343 338 324 319 317 305 300 296
## good see much first
## 291 289 285 284
Top 30 unigrams of news:
## said will year one new say state time can last
## 1191 589 577 438 336 335 326 315 314 306
## also two like get just go game make peopl citi
## 293 287 284 266 264 257 257 254 247 245
## us play first work includ team school want now percent
## 236 233 229 221 221 220 203 199 192 192
Top 30 unigrams of tweets:
## just get im like go thank day love good will rt
## 326 318 315 301 284 268 252 247 224 216 213
## can now one dont see time u know great make new
## 210 187 186 181 172 172 159 155 153 152 150
## follow work lol today back need peopl come
## 148 142 139 136 127 126 124 122
Top 30 unigrams of combined corpora:
## said will one like just get can time year go make day new
## 1433 1385 1288 1231 1169 1119 1109 1087 1071 985 873 816 782
## peopl work now know good say us love also want use think im
## 714 707 698 675 670 665 663 656 649 625 619 605 600
## look back last first
## 596 595 589 584
All contexts have different top words; I will need to take that into account for "next" word prediction.
Here are the top 20 trigrams for blogs (first block) vs the combined corpora (second block); a short sketch of how such listings can be produced follows the output.
## don_t_know don_t_think don_t_want
## 26 14 14
## south_carolina_mortgag carolina_mortgag_refinanc didn_t_know
## 13 13 12
## don_t_like didn_t_want don_t_get
## 8 6 6
## god_s_love realli_don_t realli_didn_t
## 6 5 5
## new_york_citi didn_t_realli art_hotel_florenc
## 5 5 5
## hotel_florenc_itali glitz_design_french design_french_kiss
## 5 5 5
## amazon_servic_llc don_t_let
## 4 4
## don_t_know don_t_want didn_t_know
## 28 18 17
## don_t_think south_carolina_mortgag carolina_mortgag_refinanc
## 17 13 13
## presid_barack_obama new_york_citi major_leagu_basebal
## 12 11 9
## don_t_like two_year_ago happi_mother_day
## 8 8 8
## cant_wait_see im_pretti_sure realli_didn_t
## 7 7 6
## didn_t_want don_t_get nation_weather_servic
## 6 6 6
## everi_singl_day just_make_sure
## 6 6
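A minimal sketch of how such top-n-gram listings can be produced with quanteda, assuming a trigram tokens object tri as in the first sketch:

library(quanteda)

triDfm <- dfm(tri)        # document-feature matrix of trigrams
topfeatures(triDfm, 20)   # named vector: trigram -> frequency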
Goal: Build figures and tables to understand variation in the frequencies of words and word pairs in the data.
In the previous section I showed how different the corpora from the different contexts are.
Here are a few graphs to show this visually:
# Frequency bar charts (wordsFreq) and word clouds (drawWC); both are
# user-defined helpers (a hypothetical sketch of them follows below)
wordsFreq(head(dfu.f, 12), "blog unigrams")      # top 12 blog unigrams
wordsFreq(head(dfua.f, 12), "all unigrams")      # top 12 unigrams of the combined corpora
drawWC(dfu.f)                                    # word cloud of blog unigrams
wordsFreq(head(dftwb.f, 12), "twitter bigrams")  # top 12 twitter bigrams
drawWC(dftwb.f)                                  # word cloud of twitter bigrams
wordsFreq(head(dfnewst.f, 12), "news trigrams")  # top 12 news trigrams
drawWC(dfnewst.f)                                # word cloud of news trigrams
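wordsFreq() and drawWC() are helper functions whose definitions are not shown in this section. A hypothetical sketch of what such helpers might look like, assuming the frequency data frames have word and freq columns (the column names are an assumption):

library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# Bar chart of term frequencies with a title
wordsFreq <- function(df, title) {
  ggplot(df, aes(x = reorder(word, -freq), y = freq)) +
    geom_col() +
    labs(title = title, x = NULL, y = "frequency") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Word cloud of the most frequent terms
drawWC <- function(df) {
  wordcloud(words = df$word, freq = df$freq, max.words = 100,
            colors = brewer.pal(8, "Dark2"))
}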
The trigrams largely repeat the bigrams, so the last 2 word clouds show the difference between the twitter and news contexts.
One can also sense the sentiment of the news from the period when the input texts were collected.
I have explored the data and am ready to build prediction models.
I am still not able to make the system fast while using the entire text; for now I can only work with random samples of up to about 5000 sentences.
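As a closing sketch of decision 2 (the Markov-chain style table of "next" word probabilities), assuming a quanteda tokens object toks built without stopword removal as in the earlier sketches; the helper predictNext() and the example prefix are hypothetical:

library(quanteda)

four  <- colSums(dfm(tokens_ngrams(toks, n = 4)))   # 4-gram counts
three <- colSums(dfm(tokens_ngrams(toks, n = 3)))   # 3-gram (prefix) counts

tbl <- data.frame(ngram = names(four), count = as.integer(four),
                  stringsAsFactors = FALSE)
tbl$prefix   <- sub("_[^_]+$", "", tbl$ngram)       # the first three words
tbl$nextWord <- sub("^.*_", "", tbl$ngram)          # the candidate "next" word
tbl$prob     <- tbl$count / three[tbl$prefix]       # MLE of P(next | prefix)

# Return the k most probable "next" words for a typed 3-word prefix
predictNext <- function(prefix, k = 3) {
  cand <- tbl[tbl$prefix == prefix, c("nextWord", "prob")]
  head(cand[order(-cand$prob), ], k)
}

predictNext("one_of_the")   # hypothetical example prefix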