This report details the current work on creating an app that predicts the next word based on what the user has written so far. It is based on a sample of 1/10 of the available data, drawn from Twitter posts, blogs and news articles written in English.
Below are tables of the most characteristic words for each source, sorted by tf-idf, a weighting that down-ranks the filler words shared by every source. A code sketch of this computation follows the tables.
## # A tibble: 96,689 x 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Twitter haha 2689 0.000894 0.405 0.000363
## 2 Twitter lmao 795 0.000264 1.10 0.000290
## 3 Twitter shit 1751 0.000582 0.405 0.000236
## 4 Twitter thx 518 0.000172 1.10 0.000189
## 5 Twitter dont 1260 0.000419 0.405 0.000170
## 6 Twitter fuck 1176 0.000391 0.405 0.000159
## 7 Twitter fucking 720 0.000239 0.405 0.0000971
## 8 Twitter thats 667 0.000222 0.405 0.0000899
## 9 Twitter hahaha 637 0.000212 0.405 0.0000859
## 10 Twitter niggas 209 0.0000695 1.10 0.0000764
## # … with 96,679 more rows
## # A tibble: 97,227 x 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Blogs shit 267 0.0000711 0.405 0.0000288
## 2 Blogs favourite 250 0.0000665 0.405 0.0000270
## 3 Blogs coloured 79 0.0000210 1.10 0.0000231
## 4 Blogs unschooling 75 0.0000200 1.10 0.0000219
## 5 Blogs fucking 188 0.0000500 0.405 0.0000203
## 6 Blogs whilst 188 0.0000500 0.405 0.0000203
## 7 Blogs stampin 65 0.0000173 1.10 0.0000190
## 8 Blogs embossing 60 0.0000160 1.10 0.0000175
## 9 Blogs fuck 151 0.0000402 0.405 0.0000163
## 10 Blogs cricut 55 0.0000146 1.10 0.0000161
## # … with 97,217 more rows
## # A tibble: 97,805 x 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 News kasich 118 0.0000340 1.10 0.0000373
## 2 News spokeswoman 297 0.0000855 0.405 0.0000347
## 3 News attorney's 80 0.0000230 1.10 0.0000253
## 4 News rebounds 208 0.0000599 0.405 0.0000243
## 5 News øthe 73 0.0000210 1.10 0.0000231
## 6 News trenton 188 0.0000541 0.405 0.0000219
## 7 News pleaded 180 0.0000518 0.405 0.0000210
## 8 News o'fallon 66 0.0000190 1.10 0.0000209
## 9 News dimora 156 0.0000449 0.405 0.0000182
## 10 News hunterdon 52 0.0000150 1.10 0.0000164
## # … with 97,795 more rows
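The tables above were produced with tf-idf weighting. A minimal sketch of that computation with the tidytext package, assuming the sampled text is held in a data frame `corpus_sample` with an `id` column (Twitter/Blogs/News) and a `text` column (both names are assumptions, not the actual objects used):

```r
library(dplyr)
library(tidytext)

# Tokenize into single words, count them per source, and weight by tf-idf
# so that words shared by every source (fillers) score low.
word_tfidf <- corpus_sample %>%
  unnest_tokens(word, text) %>%
  count(id, word, sort = TRUE) %>%
  bind_tf_idf(word, id, n) %>%
  arrange(desc(tf_idf))
```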
## # A tibble: 10,239,871 x 3
## language id bigram
## <chr> <chr> <chr>
## 1 en_US Blogs when sam
## 2 en_US Blogs sam and
## 3 en_US Blogs and i
## 4 en_US Blogs i saw
## 5 en_US Blogs saw these
## 6 en_US Blogs these at
## 7 en_US Blogs at christmas
## 8 en_US Blogs christmas we
## 9 en_US Blogs we both
## 10 en_US Blogs both remarked
## # … with 10,239,861 more rows
## # A tibble: 10,239,868 x 3
## language id trigram
## <chr> <chr> <chr>
## 1 en_US Blogs when sam and
## 2 en_US Blogs sam and i
## 3 en_US Blogs and i saw
## 4 en_US Blogs i saw these
## 5 en_US Blogs saw these at
## 6 en_US Blogs these at christmas
## 7 en_US Blogs at christmas we
## 8 en_US Blogs christmas we both
## 9 en_US Blogs we both remarked
## 10 en_US Blogs both remarked hey
## # … with 10,239,858 more rows
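The bigram and trigram tables come from tokenizing the same sample into overlapping word sequences. A sketch of that step, under the same assumption of a `corpus_sample` data frame with `language`, `id` and `text` columns:

```r
library(dplyr)
library(tidytext)

# Overlapping two-word sequences per line of text.
bigrams <- corpus_sample %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Overlapping three-word sequences, used for the trigram table.
trigrams <- corpus_sample %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
```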
Here we see the distribution of the words used: most words occur infrequently, while a small number of words occur very often. This works in our favour, since we want to predict words; by focusing on predicting the most common words we get a less resource-demanding algorithm, while still predicting the words people use most commonly. One way to quantify this is sketched just below.
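The check is simply how many distinct words are needed to cover, say, 90% of all word occurrences. A rough sketch, assuming the word counts live in a data frame `word_counts` with columns `word` and `n` (hypothetical names):

```r
library(dplyr)

# Order words from most to least frequent and compute cumulative coverage.
coverage <- word_counts %>%
  arrange(desc(n)) %>%
  mutate(share = cumsum(n) / sum(n))

# Number of distinct words needed to cover 90% of all occurrences.
sum(coverage$share < 0.9) + 1
```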
If we filter away filler words such as "the", "and", "or" and so on, we get these 15 most common words. Note that slurs have not been removed at this stage, nor have any foreign languages been dealt with so far. A sketch of the filtering step follows.
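The filler words can be removed with the stop-word list that ships with tidytext; a minimal sketch, again assuming a `word_counts` data frame with `word` and `n` columns:

```r
library(dplyr)
library(tidytext)

# Drop common English filler words ("the", "and", "or", ...) before ranking.
data(stop_words)

top_words <- word_counts %>%
  anti_join(stop_words, by = "word") %>%
  slice_max(n, n = 15)
```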
This gives a good basis for the future work. To develop an algorithm that predicts the next word, I will base it on the following:
- Most words are rare, and the most used words are used a lot.
- By using N-grams we can find the most common combinations of words. A simple algorithm can be built on the fact that the last word of an N-gram can be predicted from the words before it. So, if I write one word, the best prediction for the next word is the most frequent bigram starting with that word; if I write another, I can use the most frequent trigram starting with those two words, and so on. A first sketch of this lookup is shown after this list.
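A first version of that lookup could work as sketched below: split the trigram and bigram tables into a context and a final word, and back off from the trigram match to the bigram match when no trigram is found. The object and column names are assumptions carried over from the sketches above, not the final implementation.

```r
library(dplyr)
library(tidyr)

# For every context, keep how often each final word follows it.
trigram_counts <- trigrams %>%
  count(trigram, sort = TRUE) %>%
  filter(!is.na(trigram)) %>%
  separate(trigram, c("w1", "w2", "w3"), sep = " ")

bigram_counts <- bigrams %>%
  count(bigram, sort = TRUE) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, c("w1", "w2"), sep = " ")

predict_next <- function(text) {
  words <- tolower(strsplit(trimws(text), "\\s+")[[1]])
  len <- length(words)
  if (len == 0) return(NA_character_)

  # With at least two words written, try the trigram table first.
  if (len >= 2) {
    hit <- trigram_counts %>%
      filter(w1 == words[len - 1], w2 == words[len]) %>%
      slice_max(n, n = 1, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$w3)
  }

  # Otherwise (or if nothing matched) back off to the bigram table.
  hit <- bigram_counts %>%
    filter(w1 == words[len]) %>%
    slice_max(n, n = 1, with_ties = FALSE)
  if (nrow(hit) > 0) hit$w2 else NA_character_
}

predict_next("saw these")  # e.g. might return "at"
```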