The goal of this report is to show my exploratory analysis and my goals for the eventual app and algorithm.
In this report I also explain which tools and approaches I have chosen, and why I want to stick with them for the rest of the Capstone project. My goals are:
To be able to create a corpus from 3 different sources of sentences
To maximize the volume of the corpus within the limited resources of my computer
To efficiently calculate the probability of the "next" word in an n-gram using the loaded corpus
My decisions taken during the exploratory analysis:
1. tm vs quanteda: create the corpus with tm, but build tokens and n-grams with quanteda (a code sketch of decisions 1 and 5 follows this list)
2. Take the idea of a Markov chain (but without using an R library for it) and create a table of probabilities of the "next" word in n-grams
3. Remove stopwords or not? If we analyze texts to understand what is trending, they need to be removed; if we build a keyboard, we must keep stopwords, since they are the most used in typing
4. Stemming or no stemming: same as 3. For a keyboard we want to show the full word, not just a stem
5. Corpus of paragraphs or lines? In the input data the "lines" are actually paragraphs. If we tokenize them as they are, we get weird n-grams. I decided to break the text into real sentences and tokenize those rather than the lines given in the text files.
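A minimal sketch of decisions 1 and 5 (not the exact code used for this report; the sample file path is one of those listed further below, and the sentence-splitting regex is only a simple heuristic):

library(tm)
library(quanteda)

# Read one of the sampled input files
rawLines <- readLines("final/en_US/en_US.blogs.txt.smpl.txt",
                      encoding = "UTF-8", skipNul = TRUE)

# Decision 1, first half: build the corpus with tm
tmCorp <- VCorpus(VectorSource(rawLines))

# Decision 5: each "line" is really a paragraph, so split it into sentences
sents <- unlist(strsplit(rawLines, "(?<=[.!?])\\s+", perl = TRUE))

# Decision 1, second half: tokens and n-grams with quanteda
toks <- tokens(sents, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
tri  <- tokens_ngrams(toks, n = 3)   # trigrams, words joined with "_"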
Goal: Perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Sizes of input files:
## | File name              | Size, MB | Num of lines | Num of words |
## |----------------------:|:--------:|:------------:|:------------:|
## | en_US.blogs.txt | 201 | 899288 | 37334117 |
## | en_US.news.txt | 197 | 1010242 | 34365936 |
## | en_US.twitter.txt | 160 | 2360148 | 30373559 |
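A hedged sketch of how the table above could be reproduced (summariseFile is a hypothetical helper; the counts in the report may have been produced differently, e.g. with wc):

library(stringi)

summariseFile <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(file    = basename(path),
             size_MB = round(file.size(path) / 1024^2),
             lines   = length(lines),
             words   = sum(stri_count_words(lines)))
}

files <- file.path("final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, summariseFile))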
On my computer I was not able to build a single corpus from more than about 5000 lines, so I used samples of 4500 lines from each of the 3 inputs.
After trying different optimizations I found that I can create smaller corpora and join them together.
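A minimal sketch of the sampling step (sampleFile is a hypothetical helper; the seed and the exact reading strategy are assumptions):

sampleFile <- function(path, n = 4500, seed = 1) {
  set.seed(seed)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  out   <- paste0(path, ".smpl.txt")
  writeLines(sample(lines, n), out)
  out
}

smplFiles <- sapply(file.path("final/en_US",
                              c("en_US.blogs.txt", "en_US.news.txt",
                                "en_US.twitter.txt")),
                    sampleFile)

The output below lists each sample file together with the number of sentences obtained from it, followed by the combined total.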
## [1] "final/en_US/en_US.blogs.txt.smpl.txt"
## [1] 13068
## [1] "final/en_US/en_US.news.txt.smpl.txt"
## [1] 9960
## [1] "final/en_US/en_US.twitter.txt.smpl.txt"
## [1] 8115
## [1] 31143
Exploring n-grams I realized that for analysis (e.g. sentiment or trends) we can remove stopwords and work with stems, but for the final prediction of the "next" word from the 3 previously typed words I should not remove stopwords from the corpus.
We will also see later that the n-grams are quite different for blogs, news and tweets. I think the final prediction will need to "sense" the context in which the typing happens and use the corresponding corpus.
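A hedged illustration of decisions 3 and 4, assuming quanteda and a character vector sents of sentences as in the earlier sketch:

library(quanteda)

toks <- tokens(sents, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Analysis variant (consistent with the stemmed, stopword-free listings below):
# drop stopwords and reduce words to their stems
toksAnalysis <- tokens_wordstem(tokens_remove(toks, stopwords("en")))

# Prediction/keyboard variant: keep stopwords and full word forms,
# and build 4-grams (3 typed words plus the "next" one)
toksPredict <- tokens_ngrams(toks, n = 4)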
Top 30 unigrams of blogs:
## one like time can will just get make go year know day thing
## 664 646 600 585 580 579 535 467 444 414 389 379 374
## think love use look work peopl even also now us want way new
## 361 354 352 348 344 343 338 324 319 317 305 300 296
## good see much first
## 291 289 285 284
Top 30 unigrams of news:
## said will year one new say state time can last
## 1191 589 577 438 336 335 326 315 314 306
## also two like get just go game make peopl citi
## 293 287 284 266 264 257 257 254 247 245
## us play first work includ team school want now percent
## 236 233 229 221 221 220 203 199 192 192
Top 30 unigrams of tweets:
## just get im like go thank day love good will rt
## 326 318 315 301 284 268 252 247 224 216 213
## can now one dont see time u know great make new
## 210 187 186 181 172 172 159 155 153 152 150
## follow work lol today back need peopl come
## 148 142 139 136 127 126 124 122
Top 30 unigrams of combined corpora:
## said will one like just get can time year go make day new
## 1433 1385 1288 1231 1169 1119 1109 1087 1071 985 873 816 782
## peopl work now know good say us love also want use think im
## 714 707 698 675 670 665 663 656 649 625 619 605 600
## look back last first
## 596 595 589 584
All contexts have different top words; I will need to take that into account for "next" word prediction.
Here are the top 20 trigrams for blogs (first block) vs the combined corpora (second block); a short sketch of how such listings can be produced follows the output.
## don_t_know don_t_think don_t_want
## 26 14 14
## south_carolina_mortgag carolina_mortgag_refinanc didn_t_know
## 13 13 12
## don_t_like didn_t_want don_t_get
## 8 6 6
## god_s_love realli_don_t realli_didn_t
## 6 5 5
## new_york_citi didn_t_realli art_hotel_florenc
## 5 5 5
## hotel_florenc_itali glitz_design_french design_french_kiss
## 5 5 5
## amazon_servic_llc don_t_let
## 4 4
## don_t_know don_t_want didn_t_know
## 28 18 17
## don_t_think south_carolina_mortgag carolina_mortgag_refinanc
## 17 13 13
## presid_barack_obama new_york_citi major_leagu_basebal
## 12 11 9
## don_t_like two_year_ago happi_mother_day
## 8 8 8
## cant_wait_see im_pretti_sure realli_didn_t
## 7 7 6
## didn_t_want don_t_get nation_weather_servic
## 6 6 6
## everi_singl_day just_make_sure
## 6 6
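A minimal sketch of how such top-n-gram listings can be produced with quanteda, assuming a trigram tokens object tri as in the first sketch:

library(quanteda)

triDfm <- dfm(tri)        # document-feature matrix of trigrams
topfeatures(triDfm, 20)   # named vector: trigram -> frequency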
Goal: Build figures and tables to understand variation in the frequencies of words and word pairs in the data.
In the previous section I showed how different the corpora from the different contexts are.
Here are a few graphs to show this visually:
# Frequency bar charts (wordsFreq) and word clouds (drawWC); both are
# user-defined helpers (a hypothetical sketch of them follows below)
wordsFreq(head(dfu.f, 12), "blog unigrams")      # top 12 blog unigrams
wordsFreq(head(dfua.f, 12), "all unigrams")      # top 12 unigrams of the combined corpora
drawWC(dfu.f)                                    # word cloud of blog unigrams
wordsFreq(head(dftwb.f, 12), "twitter bigrams")  # top 12 twitter bigrams
drawWC(dftwb.f)                                  # word cloud of twitter bigrams
wordsFreq(head(dfnewst.f, 12), "news trigrams")  # top 12 news trigrams
drawWC(dfnewst.f)                                # word cloud of news trigrams
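wordsFreq() and drawWC() are helper functions whose definitions are not shown in this section. A hypothetical sketch of what such helpers might look like, assuming the frequency data frames have word and freq columns (the column names are an assumption):

library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# Bar chart of term frequencies with a title
wordsFreq <- function(df, title) {
  ggplot(df, aes(x = reorder(word, -freq), y = freq)) +
    geom_col() +
    labs(title = title, x = NULL, y = "frequency") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Word cloud of the most frequent terms
drawWC <- function(df) {
  wordcloud(words = df$word, freq = df$freq, max.words = 100,
            colors = brewer.pal(8, "Dark2"))
}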
The trigrams largely repeat the bigrams, so the last 2 word clouds show the difference between the twitter and news contexts.
One can also sense the sentiment of the news from the period when the input texts were collected.
I have explored the data and am ready to build prediction models.
I am still not able to make the system fast while using the entire text; for now I can only work with random samples of up to about 5000 sentences.
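As a closing sketch of decision 2 (the Markov-chain style table of "next" word probabilities), assuming a quanteda tokens object toks built without stopword removal as in the earlier sketches; the helper predictNext() and the example prefix are hypothetical:

library(quanteda)

four  <- colSums(dfm(tokens_ngrams(toks, n = 4)))   # 4-gram counts
three <- colSums(dfm(tokens_ngrams(toks, n = 3)))   # 3-gram (prefix) counts

tbl <- data.frame(ngram = names(four), count = as.integer(four),
                  stringsAsFactors = FALSE)
tbl$prefix   <- sub("_[^_]+$", "", tbl$ngram)       # the first three words
tbl$nextWord <- sub("^.*_", "", tbl$ngram)          # the candidate "next" word
tbl$prob     <- tbl$count / three[tbl$prefix]       # MLE of P(next | prefix)

# Return the k most probable "next" words for a typed 3-word prefix
predictNext <- function(prefix, k = 3) {
  cand <- tbl[tbl$prefix == prefix, c("nextWord", "prob")]
  head(cand[order(-cand$prob), ], k)
}

predictNext("one_of_the")   # hypothetical example prefix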