## label blog twit news
## 1 Word 298857 245992 382979
## 2 Line 2362935 1992553 3754212
We can observe that the News data set is the largest, followed by Twitter, with Blog the last one. That is why I always test my code on the Blog data first.
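As a rough sketch, the counts above could be reproduced with base R and the stringi package. The file names below (en_US.blogs.txt, en_US.twitter.txt, en_US.news.txt) are only an assumption about how the raw files are stored.

library(stringi)  # stri_count_words()

# Assumed file names for the three raw data sets
files <- c(blog = "en_US.blogs.txt",
           twit = "en_US.twitter.txt",
           news = "en_US.news.txt")

counts <- sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(Word = sum(stri_count_words(lines)),  # total words in the file
    Line = length(lines))                 # total lines in the file
})
counts  # one column per data set, as in the table above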
For the tokenization I used the code below. I did not remove stopwords or apply stemming, because these very frequent words are exactly the ones the predictor has to produce. This also lets me check whether all the stopwords are present in my data, so that the model can predict all of them.
library(quanteda)  # tokens(), tokens_tolower()

# Tokenize the blog corpus, dropping punctuation, numbers, symbols and URLs
tkn_blog <- tokens(corpus_blog,
                   remove_punct = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE,
                   split_hyphens = TRUE,
                   remove_url = TRUE)
# Lower-case all tokens
tkn_blog <- tokens_tolower(tkn_blog)
The English stopwords are the most frequent words in the three data sets; a sketch of how this frequency table could be computed follows the table.
## blog_word blog_freq twit_word twit_freq news_word news_freq
## 1 the 1860121 the 937309 the 1974362
## 2 and 1094363 to 788634 to 906145
## 3 to 1069412 i 723360 and 889510
## 4 a 900343 a 611315 a 878009
## 5 of 876772 you 548045 of 774501
## 6 i 775029 and 438533 in 679065
## 7 in 598516 for 385336 for 353899
## 8 that 460779 in 380310 that 347079
## 9 is 432699 of 359625 is 284240
## 10 it 403898 is 358771 on 269880
## 11 for 363819 it 295066 with 254813
## 12 you 298693 my 291883 said 250418
## 13 with 286718 on 278008 he 228996
## 14 was 278345 that 234646 was 228970
## 15 on 276506 me 202515 it 220091
## 16 my 270852 be 187834 at 214176
## 17 this 259004 at 186719 as 187560
## 18 as 223945 with 173469 i 158851
## 19 have 218928 your 171213 his 157671
## 20 be 209055 have 168665 be 152860
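A minimal sketch of how these top-word frequencies could be obtained with quanteda, assuming tkn_twit and tkn_news were built the same way as tkn_blog above:

# Document-feature matrix from the blog tokens built earlier
dfm_blog <- dfm(tkn_blog)

# 20 most frequent words in the blog data (the blog_word / blog_freq columns)
top_blog <- topfeatures(dfm_blog, n = 20)
data.frame(blog_word = names(top_blog), blog_freq = unname(top_blog))

# The twit_* and news_* columns come from repeating this on dfm(tkn_twit)
# and dfm(tkn_news).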
Here we can see some of the least frequent words (each of them appeared only once). Some are in other languages and some still contain symbols, even though we filtered them. They do not matter because, as I will explain later, they are going to be removed. A sketch of how such rare words could be extracted follows the table.
## blog twit news
## 1 nyepi juanny noncapital
## 2 florecidas stugats jsut
## 3 stanvac harvast romace
## 4 pauk #spuarespace deinbo
## 5 monochromic k229t petn
## 6 georgio flauging podoloff
## 7 infrequency osuna gossman
## 8 kojaks flamingos.we d.c.once
## 9 muskie eeeeepiiee tiwari
## 10 κaθaίρω sanginary anjalie
## 11 kathairo hellions askdoglady
## 12 sapientia #shitmybrothersays askdoglady.com
## 13 sigurjón beleib payerne
## 14 draumur elasticbeanbeard giftcards
## 15 sogblettir ocm saunters
## 16 bunds be_____ ratcatcher
## 17 megablond nodaysoff_atc blizzarding
## 18 @internetweek oooohhhhhh imperuim
## 19 thesame nandina pejaver
## 20 dashkov bailjumper piromari
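One possible way to list such rare words is with textstat_frequency() from the quanteda.textstats package; this is only a sketch and assumes that package is installed and that tkn_blog is the tokens object created above.

library(quanteda.textstats)  # textstat_frequency()

freq_blog <- textstat_frequency(dfm(tkn_blog))

# Words that appear exactly once in the blog data, like those in the table
rare_blog <- subset(freq_blog, frequency == 1)
head(rare_blog$feature, 20)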
The question "How many words appear only 1 to 5 times?" is relevant because most of these words are in foreign languages, so they are not relevant for the model.
As we can see, there are many words that appear only once, mainly in the Twitter data set.
If we look at the frequencies, most word frequencies fall between 1 and 10, so we could remove those words to obtain a smaller data set that is more accurate and has fewer outliers, as in the sketch below.
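A sketch of this pruning with quanteda's dfm_trim(); the threshold of 6 is just an illustration of removing words that appear 1 to 5 times.

# Keep only words that appear at least 6 times in the blog data
dfm_blog_trim <- dfm_trim(dfm(tkn_blog), min_termfreq = 6)

# Number of word types removed by the cut
nfeat(dfm(tkn_blog)) - nfeat(dfm_blog_trim)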
The next steps are to build the training and test sets, create the n-grams, and look for a model that, taking the frequencies into account, gives the user a prediction for the next word.
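As a first step in that direction, here is a minimal sketch of how the n-grams could be built with quanteda from the blog tokens; the actual prediction model is not implemented here.

# Bigrams and trigrams from the cleaned blog tokens
bigrams  <- tokens_ngrams(tkn_blog, n = 2)
trigrams <- tokens_ngrams(tkn_blog, n = 3)

# n-gram frequency counts: a simple predictor could look up the n-grams
# that start with the user's last word(s) and suggest the most frequent
# continuation.
topfeatures(dfm(bigrams), n = 10)
topfeatures(dfm(trigrams), n = 10)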