BASIC SUMMARIES

##   label    blog    twit    news
## 1  Word  298857  245992  382979
## 2  Line 2362935 1992553 3754212

We can observe that News is the biggest data set, followed by Twitter, with Blog the smallest. That is why I always test my code on Blog first.
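For reference, counts like those in the table above can be produced with the stringi package. This is a minimal sketch; it assumes the raw files have already been read into character vectors blog, twit and news, one element per line (those names are assumptions, not objects defined in this report):

library(stringi)

# Count the words and lines in one corpus; each element of x is one
# line of the raw file, so length(x) is the line count
summarise_corpus <- function(x) {
  c(Word = sum(stri_count_words(x)), Line = length(x))
}

sapply(list(blog = blog, twit = twit, news = news), summarise_corpus)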

For the tokenization I used the code below. I did not remove stopwords or apply stemming: since stopwords are very common, the predictor has to be able to suggest them. I also checked that all the English stopwords are present in my data, so that every one of them can be predicted.

library(quanteda)

# Tokenise the blog corpus: drop punctuation, numbers, symbols and URLs,
# and split hyphenated words into their parts
tkn_blog <- tokens(corpus_blog,
                   remove_punct   = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE,
                   split_hyphens  = TRUE,
                   remove_url     = TRUE)

# Lower-case every token so "The" and "the" are counted as the same word
tkn_blog <- tokens_tolower(tkn_blog)

The English stopwords are the most frequent words in all three documents.
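One way to produce a ranking like the one below is to build a document-feature matrix from the tokens and ask for its top features; a sketch for the blog tokens (the same calls would apply to the Twitter and News tokens):

# Document-feature matrix of the blog tokens; topfeatures() returns
# the n most frequent features as a named count vector
dfm_blog <- dfm(tkn_blog)
topfeatures(dfm_blog, 20)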

##    blog_word blog_freq twit_word twit_freq news_word news_freq
## 1        the   1860121       the    937309       the   1974362
## 2        and   1094363        to    788634        to    906145
## 3         to   1069412         i    723360       and    889510
## 4          a    900343         a    611315         a    878009
## 5         of    876772       you    548045        of    774501
## 6          i    775029       and    438533        in    679065
## 7         in    598516       for    385336       for    353899
## 8       that    460779        in    380310      that    347079
## 9         is    432699        of    359625        is    284240
## 10        it    403898        is    358771        on    269880
## 11       for    363819        it    295066      with    254813
## 12       you    298693        my    291883      said    250418
## 13      with    286718        on    278008        he    228996
## 14       was    278345      that    234646       was    228970
## 15        on    276506        me    202515        it    220091
## 16        my    270852        be    187834        at    214176
## 17      this    259004        at    186719        as    187560
## 18        as    223945      with    173469         i    158851
## 19      have    218928      your    171213       his    157671
## 20        be    209055      have    168665        be    152860

Here we can see some of the least frequent words (each of them appears only once). Some are in other languages and some still contain symbols, even though we filtered them. They do not matter, however, because, as I will explain later, they are going to be removed.
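A sketch of how these singletons can be pulled out, assuming the dfm_blog matrix built above (analogous objects for the other two corpora are assumed):

# Total count of every feature; colSums() works because a dfm is a
# sparse matrix of counts
blog_freq <- colSums(dfm_blog)

# Words that appear exactly once (hapax legomena)
head(names(blog_freq[blog_freq == 1]), 20)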

##                                          blog               twit           news
## 1                                       nyepi             juanny     noncapital
## 2                                  florecidas            stugats           jsut
## 3                                     stanvac            harvast         romace
## 4                                        pauk       #spuarespace         deinbo
## 5                                 monochromic              k229t           petn
## 6                                     georgio           flauging       podoloff
## 7                                 infrequency              osuna        gossman
## 8                                      kojaks       flamingos.we       d.c.once
## 9                                      muskie         eeeeepiiee         tiwari
## 10                                    κaθaίρω          sanginary        anjalie
## 11                                   kathairo           hellions     askdoglady
## 12                                  sapientia #shitmybrothersays askdoglady.com
## 13                                   sigurjón             beleib        payerne
## 14                                    draumur   elasticbeanbeard      giftcards
## 15                                 sogblettir                ocm       saunters
## 16                                      bunds            be_____     ratcatcher
## 17                                  megablond      nodaysoff_atc    blizzarding
## 18                              @internetweek         oooohhhhhh       imperuim
## 19                                    thesame            nandina        pejaver
## 20                                    dashkov         bailjumper       piromari

The question “How many words appear only 1 to 5 times?” is relevant because most of these words are in foreign languages, so they are not useful for the model.

As we can see, there are many words that appear only once, mainly in the Twitter data set.
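As a sketch, the question can be answered directly from the frequency vectors (blog_freq as above; twit_freq and news_freq are assumed to be built the same way):

# How many words occur 5 times or fewer, and what share of the
# vocabulary they represent, per corpus
sapply(list(blog = blog_freq, twit = twit_freq, news = news_freq),
       function(f) c(n_1_to_5 = sum(f <= 5), share = mean(f <= 5)))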

If we look at the frequencies, the largest share of words has a frequency between 1 and 10. We could therefore remove them to get a smaller, more accurate data set with fewer outliers.
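quanteda’s dfm_trim() can do exactly this; a sketch that keeps only features seen more than 10 times:

# min_termfreq is the smallest total count a feature needs to be kept,
# so min_termfreq = 11 drops everything that occurs 1 to 10 times
dfm_blog_trim <- dfm_trim(dfm_blog, min_termfreq = 11)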

PLANS

I know I have to build the training and test sets, create the n-grams, and look for a model that gives the user a suggestion for the next word, taking the frequencies into account.
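A rough sketch of those steps, with placeholder names (an 80/20 split and quanteda’s tokens_ngrams(); nothing here is final):

# 1. Split the raw lines into training and test sets (80/20)
set.seed(42)
idx   <- sample(length(blog), 0.8 * length(blog))
train <- blog[idx]
test  <- blog[-idx]

# 2. Build n-grams from the training tokens
tkn_train <- tokens_tolower(tokens(corpus(train), remove_punct = TRUE))
bigrams   <- tokens_ngrams(tkn_train, n = 2)
trigrams  <- tokens_ngrams(tkn_train, n = 3)

# 3. The model would then rank candidate next words by the frequency
#    of the n-grams that match the user's last one or two words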