Intro

This report details the current work on creating an app that predicts the next word based on what the user has written. It is based on a sample of 1/10 of the available data from Twitter, blogs, and news articles written in English.

Token Words

Below are tables of the most common words in each source, ranked by tf-idf, a weighting that down-ranks filler words shared across all sources.
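As a rough illustration, the tables below could be produced with a tidytext workflow along these lines. The `corpus` data frame (one row per line of text, with an `id` column naming the source) is an assumption, not the report's actual object name:

```r
library(dplyr)
library(tidytext)

# Split each line into words and count them per source.
word_counts <- corpus %>%                # assumed columns: id, text
  unnest_tokens(word, text) %>%
  count(id, word, sort = TRUE)

# Weight the counts with tf-idf; a word that appears in every source
# (a filler word) gets an idf of zero and drops down the ranking.
word_counts %>%
  bind_tf_idf(word, id, n) %>%
  arrange(desc(tf_idf))
```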

Twitter

## # A tibble: 96,689 x 6
##    id      word        n        tf   idf    tf_idf
##    <chr>   <chr>   <int>     <dbl> <dbl>     <dbl>
##  1 Twitter haha     2689 0.000894  0.405 0.000363 
##  2 Twitter lmao      795 0.000264  1.10  0.000290 
##  3 Twitter shit     1751 0.000582  0.405 0.000236 
##  4 Twitter thx       518 0.000172  1.10  0.000189 
##  5 Twitter dont     1260 0.000419  0.405 0.000170 
##  6 Twitter fuck     1176 0.000391  0.405 0.000159 
##  7 Twitter fucking   720 0.000239  0.405 0.0000971
##  8 Twitter thats     667 0.000222  0.405 0.0000899
##  9 Twitter hahaha    637 0.000212  0.405 0.0000859
## 10 Twitter niggas    209 0.0000695 1.10  0.0000764
## # … with 96,679 more rows

Blogs

## # A tibble: 97,227 x 6
##    id    word            n        tf   idf    tf_idf
##    <chr> <chr>       <int>     <dbl> <dbl>     <dbl>
##  1 Blogs shit          267 0.0000711 0.405 0.0000288
##  2 Blogs favourite     250 0.0000665 0.405 0.0000270
##  3 Blogs coloured       79 0.0000210 1.10  0.0000231
##  4 Blogs unschooling    75 0.0000200 1.10  0.0000219
##  5 Blogs fucking       188 0.0000500 0.405 0.0000203
##  6 Blogs whilst        188 0.0000500 0.405 0.0000203
##  7 Blogs stampin        65 0.0000173 1.10  0.0000190
##  8 Blogs embossing      60 0.0000160 1.10  0.0000175
##  9 Blogs fuck          151 0.0000402 0.405 0.0000163
## 10 Blogs cricut         55 0.0000146 1.10  0.0000161
## # … with 97,217 more rows

News

## # A tibble: 97,805 x 6
##    id    word            n        tf   idf    tf_idf
##    <chr> <chr>       <int>     <dbl> <dbl>     <dbl>
##  1 News  kasich        118 0.0000340 1.10  0.0000373
##  2 News  spokeswoman   297 0.0000855 0.405 0.0000347
##  3 News  attorney's     80 0.0000230 1.10  0.0000253
##  4 News  rebounds      208 0.0000599 0.405 0.0000243
##  5 News  øthe           73 0.0000210 1.10  0.0000231
##  6 News  trenton       188 0.0000541 0.405 0.0000219
##  7 News  pleaded       180 0.0000518 0.405 0.0000210
##  8 News  o'fallon       66 0.0000190 1.10  0.0000209
##  9 News  dimora        156 0.0000449 0.405 0.0000182
## 10 News  hunterdon      52 0.0000150 1.10  0.0000164
## # … with 97,795 more rows

Token N-grams
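The N-gram tables below can be generated with the same tokenizer, asking for word pairs and triples instead of single words; a minimal sketch, again assuming the `corpus` data frame from earlier (extra columns such as `language` are carried along by `unnest_tokens`):

```r
library(dplyr)
library(tidytext)

# Overlapping two- and three-word sequences from the same corpus.
bigrams  <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

trigrams <- corpus %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
```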

Bigram

## # A tibble: 10,239,871 x 3
##    language id    bigram       
##    <chr>    <chr> <chr>        
##  1 en_US    Blogs when sam     
##  2 en_US    Blogs sam and      
##  3 en_US    Blogs and i        
##  4 en_US    Blogs i saw        
##  5 en_US    Blogs saw these    
##  6 en_US    Blogs these at     
##  7 en_US    Blogs at christmas 
##  8 en_US    Blogs christmas we 
##  9 en_US    Blogs we both      
## 10 en_US    Blogs both remarked
## # … with 10,239,861 more rows

Trigram

## # A tibble: 10,239,868 x 3
##    language id    trigram
##    <chr>    <chr> <chr>             
##  1 en_US    Blogs when sam and      
##  2 en_US    Blogs sam and i         
##  3 en_US    Blogs and i saw         
##  4 en_US    Blogs i saw these       
##  5 en_US    Blogs saw these at      
##  6 en_US    Blogs these at christmas
##  7 en_US    Blogs at christmas we   
##  8 en_US    Blogs christmas we both 
##  9 en_US    Blogs we both remarked  
## 10 en_US    Blogs both remarked hey 
## # … with 10,239,858 more rows

Data Features

Here we see the distribution of the words used. Most words occur infrequently, while a few occur very often. This makes our task easier: by focusing on predicting the most common words, we get a less resource-demanding algorithm while still covering the words people use most often.
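The distribution can be inspected with a quick histogram; a minimal sketch, reusing the assumed `word_counts` table from above:

```r
library(ggplot2)

# Most words occur only a handful of times, while a few occur
# thousands of times, so a log scale on the x-axis is useful.
ggplot(word_counts, aes(n)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  labs(x = "occurrences per word (log scale)", y = "number of words")
```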

If we filter away filler words such as “the”, “and”, and “or”, we get the 15 most common words. Note that slurs have not been removed at this stage, nor have foreign languages been dealt with so far.
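For this plain frequency ranking, a simpler filter than tf-idf works; a sketch using tidytext's built-in `stop_words` list and the assumed `word_counts` table:

```r
library(dplyr)
library(tidytext)

# Remove stop words ("the", "and", "or", ...) and keep the 15 most
# frequent remaining words across all three sources.
word_counts %>%
  anti_join(stop_words, by = "word") %>%
  count(word, wt = n, sort = TRUE) %>%
  slice_head(n = 15)
```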

Moving on

This gives a good basis for the future. To develop an algorithm to predict the next word, I will base it on the following:

  1. Most words are rare, while the most frequently used words are used very often

  2. By using N-grams we can find the most common combinations of words, and a simple algorithm can be built around the fact that the last word of an N-gram is most likely determined by the words before it. So, if the user types one word, the best prediction is the final word of the most frequent bigram beginning with that word; after a second word, we can use the most frequent trigram beginning with those two words, and so on. A sketch of this lookup follows after this list.
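To make point 2 concrete, here is a minimal sketch of such a backoff lookup. The function name `predict_next_word` and the `bigrams`/`trigrams` tables (from the tokenization sketch earlier) are my own illustration, not the final app:

```r
library(dplyr)
library(stringr)

# Try the trigram table first (two words of context); if no trigram
# starts with the user's last two words, fall back to the bigram table.
predict_next_word <- function(input, bigrams, trigrams) {
  words <- unlist(strsplit(tolower(trimws(input)), "\\s+"))

  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hit <- trigrams %>%
      filter(startsWith(trigram, paste0(prefix, " "))) %>%
      count(trigram, sort = TRUE) %>%
      slice_head(n = 1)
    if (nrow(hit) > 0) return(word(hit$trigram, -1))  # last word of the trigram
  }

  prefix <- tail(words, 1)
  hit <- bigrams %>%
    filter(startsWith(bigram, paste0(prefix, " "))) %>%
    count(bigram, sort = TRUE) %>%
    slice_head(n = 1)
  if (nrow(hit) > 0) return(word(hit$bigram, -1))     # last word of the bigram
  NA_character_                                       # no match in either table
}
```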