Overview

The goal of this project is to build an NLP algorithm that determines which English word a user is most likely to type next, given "some" (a number still to be determined) previous words.
The starting point is a corpus dataset consisting of text strings from blogs, online newspapers and tweets. Those strings are in English, but data in other languages are also provided and may be used in later steps.
The algorithm is initially built and tested locally, but is expected to work in a Shiny app running on a Shiny server.
For my analysis, I used R 4.0.2 on a 64-bit Windows machine with 8 GB of RAM and the following libraries:

library(ggplot2)
library(plotly)
library(dplyr)
library(xtable)

NOTE: This report is intended to be concise and understandable by a non-data-scientist reader, so the vast majority of the code is not shown. If you're interested in it, you can find it on Github (Milestone.Rmd file). Knitting that file in RStudio (after removing every "eval=F" option present in the file) would reproduce the entire workflow, but it could take a very long time and require you to manually clear the workspace from time to time (that's why some files are saved to disk and then loaded again when needed). This is partly because of the nature of the analysis itself and the size of the data, and partly because I only discovered some tricks to improve efficiency along the way.

Data processing and exploration

Data were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

I used the bash command "wc -cmlwL *.txt" to gather basic information about each of the 3 en_US files in my working directory, summarized in the following table:
                     bytes  characters    lines     words  longest line
en_US.blogs      210160014   208623085   899288  37333958         40833
en_US.news       205811889   205243643  1010242  34365905         11384
en_US.twitter    167105338   166843164  2360148  30357171           173
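
For completeness, here is a rough R equivalent of those counts (a sketch only; the numbers above come from wc, and skipNul = TRUE simply skips any embedded nul characters):

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
stats <- sapply(files, function(f) {
  x <- readLines(f, skipNul = TRUE)
  c(lines        = length(x),
    words        = sum(lengths(strsplit(x, "\\s+"))),
    characters   = sum(nchar(x)),
    longest_line = max(nchar(x)))
})
t(stats)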


Then, I merged the 3 files and, to save RAM, divided the resulting object into 50 files that were subsequently loaded one at a time for profanity filtering and tokenisation. Before splitting, the strings were shuffled into random order, to ensure that each chunk contains strings from blogs, news and tweets in approximately the same proportions. I chose to convert all letters to uppercase to reduce the number of distinct words.
The profanity filter was based on a bad-words list found at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en. Profane words were replaced by the tag <BADWORD>.
I also searched for web links, email addresses, numbers, prices and dates and replaced them with tags, again to reduce the number of distinct tokens in the dataset. Those filters are probably not perfect, but they appear to catch the majority of the desired items.
Finally, lines were split on whitespace and punctuation. Again, a small percentage of words didn't split correctly, due to odd punctuation patterns or typos, but this doesn't affect the quality of the processed data. A condensed sketch of this per-chunk processing is shown below.
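
As an illustration, here is a condensed sketch of the per-chunk cleaning. The regular expressions are simplified stand-ins for the ones I actually used, dates are omitted, the <WEBLINK> and <EMAIL> tag names are guesses (only <BADWORD>, <NUMBER> and <PRICE> appear in the tables below), and badwords is assumed to be the character vector loaded from the list linked above:

process_chunk <- function(lines, badwords) {
  x <- toupper(lines)
  # profanity filter
  bad <- paste0("\\b(", paste(toupper(badwords), collapse = "|"), ")\\b")
  x <- gsub(bad, " <BADWORD> ", x)
  # replace web links, e-mail addresses, prices and numbers with tags
  x <- gsub("HTTPS?://\\S+", " <WEBLINK> ", x)
  x <- gsub("\\S+@\\S+\\.[A-Z]{2,}", " <EMAIL> ", x)
  x <- gsub("\\$[0-9][0-9.,]*", " <PRICE> ", x)
  x <- gsub("[0-9]+", " <NUMBER> ", x)
  # split on whitespace and punctuation, keeping apostrophes and the tag brackets
  tokens <- unlist(strsplit(x, "[^[:alnum:]'<>]+"))
  tokens[tokens != ""]
}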

Then, I assigned the first 35 chunks (70% of the total, as they all have similar sizes) to the training dataset and set the other 15 aside for testing purposes. Chunks 36 to 45 will be the test dataset for the final model. Chunks 46 to 50 will be the validation dataset, used to measure the accuracy of each of my future models.

Unigrams

Then, the 35 word lists were merged into a single vector with 70,842,273 elements and tabulated to determine the total number of distinct words (or tokens) and each word's frequency. The table was then sorted by frequency. As shown, the total number of distinct words and tokens in the training dataset (the so-called "unigrams") is 806,759.
As a comparison (code not provided), when I applied the same procedure to a sample containing only 10% of the blogs file, I obtained approximately 137,000 words, while the total number of distinct tokens in the entire dataset (training + validation + test) is 1,014,175. So, as expected, increasing the number of strings examined is useful, but as the size of the dataset grows the additional benefit shrinks. We should also consider that a large share of the "new" words are typos.
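
The unigram table itself was essentially built by tabulating the merged token vector; here is a minimal sketch of that step (assuming all_tokens is the merged 70-million-element vector and using the column names shown in the tables of this report):

tab <- sort(table(all_tokens), decreasing = TRUE)       # frequency of every distinct token
unigrams <- data.frame(ranking   = seq_along(tab),
                       word      = names(tab),
                       frequency = as.integer(tab),
                       stringsAsFactors = FALSE)
unigrams$Cumulative_frequency_Percent <-
  round(cumsum(unigrams$frequency) / sum(unigrams$frequency) * 100, 2)
write.csv(unigrams, "unigrams.csv", row.names = FALSE)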

unigrams<-read.csv("unigrams.csv")
nrow(unigrams)
## [1] 806759

Next, I calculated the cumulative percent frequency and found that only 140 words/tokens account for fifty percent of the entire training dataset. As shown in the following table, 3 of those are not actual words but tags summarizing a variety of possible strings (<NUMBER>, <BADWORD>, <PRICE>).

top50<-unigrams[unigrams$Cumulative_frequency_Percent<=50,]
nrow(top50)
## [1] 140


ranking word frequency Cumulative_frequency_Percent
1 THE 3327648 4.70
2 TO 1924098 7.41
3 AND 1683834 9.79
4 A 1663449 12.14
5 OF 1402050 14.12
6 IN 1149924 15.74
7 I 1149695 17.36
8 FOR 768408 18.45
9 IS 750501 19.51
10 THAT 727588 20.53
11 <NUMBER> 709743 21.54
12 YOU 654832 22.46
13 IT 639515 23.36
14 ON 571052 24.17
15 WITH 499779 24.88
16 WAS 437319 25.49
17 MY 421571 26.09
18 AT 398060 26.65
19 BE 383206 27.19
20 THIS 379566 27.73
21 HAVE 369964 28.25
22 ARE 342690 28.73
23 BUT 336983 29.21
24 AS 336454 29.68
25 HE 299288 30.11
26 WE 290701 30.52
27 NOT 286119 30.92
28 FROM 268219 31.30
29 SO 266981 31.67
30 ME 256465 32.04
31 ALL 230938 32.36
32 THEY 224613 32.68
33 WILL 219998 32.99
34 BY 219127 33.30
35 OR 215658 33.60
36 SAID 213581 33.91
37 JUST 212245 34.21
38 HIS 210949 34.50
39 YOUR 210591 34.80
40 AN 208651 35.09
41 ABOUT 206720 35.39
42 OUT 206057 35.68
43 UP 203489 35.96
44 ONE 202336 36.25
45 IF 194153 36.52
46 WHAT 193534 36.80
47 LIKE 188449 37.06
48 WHEN 184814 37.32
49 HAS 181658 37.58
50 WHO 174014 37.83
51 CAN 172052 38.07
52 MORE 170091 38.31
53 DO 168171 38.55
54 HAD 163156 38.78
55 GET 157916 39.00
56 TIME 150268 39.21
57 THERE 147488 39.42
58 HER 146178 39.63
59 WOULD 143511 39.83
60 THEIR 142970 40.03
61 SOME 141003 40.23
62 NO 138515 40.43
63 SHE 136821 40.62
64 NEW 135205 40.81
65 BEEN 131798 41.00
66 OUR 129951 41.18
67 I’M 128268 41.36
68 IT’S 126467 41.54
69 NOW 125512 41.72
70 GOOD 124811 41.89
71 WERE 124672 42.07
72 HOW 122105 42.24
73 DAY 117609 42.41
74 KNOW 114043 42.57
75 THEM 113054 42.73
76 LOVE 112103 42.89
77 PEOPLE 110392 43.04
78 <BADWORD> 105230 43.19
79 <PRICE> 101762 43.33
80 WHICH 100331 43.48
81 BACK 99042 43.61
82 THAN 98191 43.75
83 GO 97502 43.89
84 SEE 96668 44.03
85 FIRST 94368 44.16
86 INTO 93185 44.29
87 AFTER 92906 44.42
88 MAKE 90948 44.55
89 ALSO 90876 44.68
90 DON’T 89626 44.81
91 ITS 89344 44.93
92 ONLY 88738 45.06
93 THINK 88419 45.18
94 GOING 88411 45.31
95 OTHER 88379 45.43
96 LAST 87048 45.56
97 OVER 86932 45.68
98 THEN 86313 45.80
99 GREAT 86130 45.92
100 HIM 84413 46.04
101 MUCH 83989 46.16
102 BECAUSE 83896 46.28
103 US 82785 46.39
104 TOO 80262 46.51
105 TWO 80171 46.62
106 REALLY 79907 46.73
107 YEAR 79137 46.85
108 WAY 78301 46.96
109 COULD 78135 47.07
110 TODAY 77618 47.18
111 GOT 76058 47.28
112 WELL 75710 47.39
113 EVEN 75676 47.50
114 WANT 74811 47.60
115 WORK 73652 47.71
116 DID 73168 47.81
117 STILL 71805 47.91
118 RIGHT 71564 48.01
119 HERE 69057 48.11
120 THANKS 68862 48.21
121 OFF 68191 48.30
122 NEED 68124 48.40
123 WHERE 67947 48.50
124 AM 66933 48.59
125 VERY 65888 48.68
126 YEARS 64938 48.77
127 MOST 64873 48.87
128 ANY 64872 48.96
129 BEFORE 62347 49.05
130 THOSE 62226 49.13
131 MANY 62094 49.22
132 RT 61947 49.31
133 DOWN 61819 49.40
134 LIFE 61814 49.48
135 SAY 60447 49.57
136 SHOULD 60198 49.65
137 TAKE 59945 49.74
138 BEING 59206 49.82
139 THESE 58721 49.90
140 COME 58235 49.99


Now, let’s see how many words are needed to cover larger fractions of the dataset:

top90<-unigrams[unigrams$Cumulative_frequency_Percent<=90,]
nrow(top90)
## [1] 8445
top95<-unigrams[unigrams$Cumulative_frequency_Percent<=95,]
nrow(top95)
## [1] 22188
top99<-unigrams[unigrams$Cumulative_frequency_Percent<=99,]
nrow(top99)
## [1] 203398

Let's look at the last tokens in the top-95% list:

tail(top95)
##       ranking       word frequency Cumulative_frequency_Percent
## 22183   22183  HYPOCRITE       120                           95
## 22184   22184  INCARNATE       120                           95
## 22185   22185  INTENDING       120                           95
## 22186   22186 INTIMIDATE       120                           95
## 22187   22187      IRWIN       120                           95
## 22188   22188     JARGON       120                           95

They seem to be fairly uncommon words, but not particularly strange ones.

Here is a plot of the 95% coverage against the number of words used. You can zoom and hover over it to see each word with its absolute frequency in the dataset.
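
The interactive plot was produced with plotly; here is a sketch of how such a coverage plot can be built from the top95 dataframe (not necessarily the exact code I used; column names as in the tables above):

p <- ggplot(top95, aes(x = ranking,
                       y = Cumulative_frequency_Percent,
                       text = paste0(word, ": ", frequency))) +
  geom_point(size = 0.3) +
  labs(x = "Number of words (by rank)", y = "Cumulative frequency (%)")
ggplotly(p, tooltip = "text")    # hover shows each word with its absolute frequency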

Digrams

Next, I created a dataframe with all the digrams (pairs of consecutive words/tokens) found in the training dataset and their frequencies. Once again, I worked in chunks to avoid exceeding the memory limits of my laptop.
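
As an illustration, a sketch of how a chunk's digrams can be built and counted from its token vector (tokens is assumed to be the output of the tokenisation step; in the real pipeline the per-chunk results were then combined):

n <- length(tokens)
digrams_chunk <- data.frame(V1 = tokens[-n],        # first word of each pair
                            V2 = tokens[-1],        # word that follows it
                            stringsAsFactors = FALSE)
digram_counts <- digrams_chunk %>%
  count(V1, V2, name = "Frequency") %>%
  arrange(desc(Frequency)) %>%
  mutate(Cumulative_frequency_Percent =
           round(cumsum(Frequency) / sum(Frequency) * 100, 2))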

Let's explore the digram distribution as we did with unigrams:

digrams <- readRDS("digrams.RDS")
nrow(digrams)
## [1] 11459841
top50<-digrams[digrams$Cumulative_frequency_Percent<=50,]
nrow(top50)
## [1] 40516

We can see that there are more than 11 million unique digrams, and about 40,500 of them account for 50% of the total. Here are the top 100 2-grams:



ranking V1 V2 Frequency Cumulative_frequency_Percent
1 OF THE 300736 0.44
2 IN THE 284987 0.86
3 TO THE 148713 1.08
4 FOR THE 140432 1.29
5 ON THE 136884 1.49
6 TO BE 113296 1.66
7 AT THE 99540 1.80
8 AND THE 87669 1.93
9 IN A 83212 2.06
10 WITH THE 74029 2.17
11 IS A 70486 2.27
12 IT WAS 67212 2.37
13 FOR A 65806 2.46
14 FROM THE 60926 2.55
15 I HAVE 60117 2.64
16 I WAS 59928 2.73
17 IT IS 57461 2.82
18 WITH A 57232 2.90
19 AND I 57191 2.98
20 WILL BE 56720 3.07
21 GOING TO 55773 3.15
22 OF A 55480 3.23
23 I AM 53665 3.31
24 IS THE 51745 3.39
25 HAVE A 51398 3.46
26 IF YOU 50826 3.54
27 ONE OF 50796 3.61
28 IN <NUMBER> 49429 3.69
29 TO GET 49192 3.76
30 AS A 48333 3.83
31 WANT TO 44509 3.90
32 HAVE TO 43004 3.96
33 BY THE 42737 4.02
34 THAT THE 42561 4.08
35 THIS IS 41040 4.14
36 TO DO 40850 4.20
37 AND A 40693 4.26
38 I THINK 40671 4.32
39 THE FIRST 40051 4.38
40 WAS A 39569 4.44
41 OUT OF 39272 4.50
42 TO A 38735 4.56
43 THAT I 37897 4.61
44 TO SEE 37677 4.67
45 ON A 37461 4.72
46 ALL THE 35778 4.78
47 BUT I 35633 4.83
48 I LOVE 35185 4.88
49 THE SAME 34613 4.93
50 HAVE BEEN 33537 4.98
51 TO MAKE 33429 5.03
52 A LOT 33312 5.08
53 YOU CAN 33276 5.13
54 BE A 32775 5.18
55 HE WAS 31976 5.22
56 THANKS FOR 31384 5.27
57 OF MY 31226 5.32
58 NEED TO 30965 5.36
59 HAS BEEN 30842 5.41
60 A FEW 30541 5.45
61 WOULD BE 30441 5.50
62 YOU ARE 30439 5.54
63 I DON’T 30417 5.59
64 MORE THAN 29993 5.63
65 IN MY 29622 5.67
66 AS THE 29410 5.72
67 ABOUT THE 29296 5.76
68 WHEN I 29229 5.80
69 YOU HAVE 28714 5.85
70 A GREAT 28692 5.89
71 TO GO 28627 5.93
72 I CAN 28457 5.97
73 I HAD 28436 6.01
74 A LITTLE 28376 6.06
75 THE BEST 27857 6.10
76 TO HAVE 27729 6.14
77 HE SAID 27510 6.18
78 A GOOD 27431 6.22
79 THANK YOU 27375 6.26
80 I KNOW 27104 6.30
81 HAD A 26915 6.34
82 INTO THE 26858 6.38
83 THEY ARE 26581 6.42
84 WE ARE 26376 6.46
85 I JUST 25652 6.49
86 THERE IS 25623 6.53
87 <NUMBER> PERCENT 24704 6.57
88 IS NOT 24589 6.60
89 THAT IS 24318 6.64
90 A NEW 23452 6.68
91 THE NEW 23355 6.71
92 THERE ARE 23305 6.74
93 SO I 23254 6.78
94 THE MOST 23240 6.81
95 THE <NUMBER> 23074 6.85
96 OVER THE 23062 6.88
97 THE WORLD 22987 6.91
98 WE HAVE 22963 6.95
99 I WILL 22877 6.98
100 LIKE A 22644 7.02


And here is a plot of the top 1000 digrams, with cumulative frequency on the y axis:

Trigrams

The number of 3-grams is so large that my laptop couldn't retrieve and count the frequencies of all of them in a single file. Also, as shown in the Possible Models section, prediction is much faster when the data are split into one dataframe per letter.
So, after an initial processing in chunks, I divided each chunk into 27 dataframes: one for each initial letter of the first word in the 3-gram, plus one for trigrams beginning with "<", the character that marks my custom tags (while doing so, I discarded a small percentage of odd tokens beginning with other characters). As these data will be the starting point for building the model, I sorted them by decreasing frequency, to make querying the final dataframes more efficient.
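
A condensed sketch of that step, assuming tokens is a chunk's token vector:

n <- length(tokens)
trigrams_chunk <- data.frame(V1 = tokens[1:(n - 2)],
                             V2 = tokens[2:(n - 1)],
                             V3 = tokens[3:n],
                             stringsAsFactors = FALSE)
init <- substr(trigrams_chunk$V1, 1, 1)
keep <- init %in% c(LETTERS, "<")                  # discard odd leading characters
trigrams_split <- split(trigrams_chunk[keep, ], init[keep])
# each element of trigrams_split is then counted, merged across chunks
# and sorted by decreasing frequency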

Let’s look at this huge collection of 3-grams:

temp<-list.files("models/merged",full.names=T)
trigrams<-lapply(temp, readRDS)
sum(sapply(trigrams, nrow))
## [1] 33539777
There are nearly 34 million distinct 3-word sequences in the training dataset. Let's plot the 1000 most frequent against their frequency and table the first 100 of them.
V1 V2 V3 Frequency ranking
ONE OF THE 24251 1
A LOT OF 20971 2
THANKS FOR THE 16657 3
TO BE A 12635 4
GOING TO BE 12155 5
THE END OF 10425 6
OUT OF THE 10372 7
I WANT TO 10331 8
IT WAS A 9970 9
AS WELL AS 9611 10
SOME OF THE 9501 11
BE ABLE TO 9127 12
MORE THAN <NUMBER> 8681 13
PART OF THE 8614 14
I HAVE A 8234 15
THE REST OF 7892 16
I HAVE TO 7822 17
LOOKING FORWARD TO 7807 18
THE FIRST TIME 7199 19
THANK YOU FOR 7112 20
IS GOING TO 7081 21
A COUPLE OF 7005 22
THIS IS A 6841 23
I NEED TO 6616 24
THERE IS A 6550 25
END OF THE 6472 26
YOU WANT TO 6403 27
YOU HAVE TO 6396 28
I LOVE YOU 6378 29
THE FACT THAT 6352 30
<NUMBER> TO <NUMBER> 6245 31
<NUMBER> PERCENT OF 6061 32
IN THE WORLD 6032 33
ONE OF MY 6009 34
TO GO TO 5882 35
CAN’T WAIT TO 5880 36
IT WOULD BE 5866 37
THIS IS THE 5854 38
I DON’T KNOW 5802 39
AT THE END 5759 40
FOR THE FIRST 5720 41
IS ONE OF 5618 42
IT IS A 5567 43
TO HAVE A 5536 44
THERE IS NO 5499 45
FOR THE FOLLOW 5486 46
IN THE FIRST 5481 47
I’M GOING TO 5470 48
MOST OF THE 5400 49
ACCORDING TO THE 5278 50
YOU HAVE A 5250 51
ALL OF THE 5248 52
IN FRONT OF 5238 53
THE UNITED STATES 5029 54
TO BE THE 5019 55
OF THE YEAR 4967 56
I HAD A 4921 57
IF YOU ARE 4913 58
I HAD TO 4881 59
REST OF THE 4810 60
I THINK I 4808 61
OF THE DAY 4692 62
BACK TO THE 4664 63
I HAVE BEEN 4664 64
I WANTED TO 4645 65
TO MAKE A 4636 66
HAVE A GREAT 4635 67
<NUMBER> AND <NUMBER> 4562 68
IT WILL BE 4505 69
WANT TO BE 4465 70
IN ORDER TO 4440 71
WHEN I WAS 4365 72
TO SEE THE 4350 73
AS MUCH AS 4336 74
I FEEL LIKE 4320 75
IN <NUMBER> AND 4242 76
IF YOU HAVE 4230 77
IN THE PAST 4180 78
TO GET A 4157 79
AT THE SAME 4143 80
ARE GOING TO 4133 81
ONE OF THOSE 4132 82
TO DO WITH 4105 83
I DON’T THINK 4092 84
I WILL BE 4086 85
HAVE TO BE 4061 86
OF THE MOST 4037 87
AT THE TIME 4031 88
WAS GOING TO 4012 89
IF YOU WANT 4006 90
WE NEED TO 3986 91
THE SAME TIME 3979 92
TO SEE YOU 3972 93
A BIT OF 3955 94
THERE WAS A 3936 95
OF THE <NUMBER> 3920 96
I AM NOT 3862 97
LET ME KNOW 3860 98
WOULD LIKE TO 3844 99
IN THE MIDDLE 3822 100


Possible Models

For the modeling task I haven't, for the moment, used any special NLP package. Basically, I'm testing my idea that a good approach would be to associate, in one or more dataframes, every observed sequence of words (of a given length) with its most frequent next word. This dataframe (or these dataframes) will then be filtered for the input words, and the output will be the content of the last column.
Here are some lines of code that roughly estimate the speed and memory requirements of various approaches. Note that when I wrote this code I hadn't yet finished creating the dataframes, so I simply used some available dataframes with the same number of columns and (roughly) the same number of rows expected in the real ones. That's why the prediction outputs are meaningless.

Estimate of using ONE word as predictor

unigrams<-read.csv("unigrams.csv")
unigrams<-select(unigrams, words, Freq)
a<-Sys.time()
output<-filter(unigrams, words==toupper("friends")) %>% select(Freq)
b<-Sys.time()
b-a
## Time difference of 0.040766 secs
print(object.size(unigrams), units="Mb")
## 57 Mb

Estimate of using TWO words as predictors and a single dataframe

digrams <- readRDS("digrams.RDS")
a<-Sys.time()
output<-as.character(filter(digrams, V1==toupper("best") & V2==toupper("friends")))
b<-Sys.time()
b-a
## Time difference of 11.66452 secs
print(object.size(digrams), units="Mb") 
## 445.5 Mb

Estimate of using TWO words as predictors with multiple dataframes

a<-readRDS("digrams.RDS")
i<-grep("^[A-Z]", a$V1)                 # keep only digrams whose first word starts with a letter
a<-a[i,]
init<-strsplit(a$V1,"")                 # initial letter of the first word
init<-unlist(sapply(init, function(x) x[1]))
a$factor<-as.factor(init)
l<-split(a, a$factor)                   # one dataframe per initial letter
l<-lapply(l, function(x) x[,1:3])       # drop the helper column

a<-Sys.time()
v1=toupper("best")
v2=toupper("friends")
output<-as.character(filter(l[[substr(v1,1,1)]], V1==v1 & V2==v2))
b<-Sys.time()
b-a
## Time difference of 2.76389 secs
print(object.size(l), units="Mb")
## 503.5 Mb

The last solution seems to be the best one, assuming that prediction from 2 words is more accurate than prediction from 1 word.
For OOV (out-of-vocabulary) pairs of words, my algorithm will fall back to prediction from 1 word. If not even the previous word alone is in the dictionary, the algorithm will predict the most frequent word in the dataset ("THE").

First prediction model

For the dataframes used by the prediction function, I selected, for each pair of words, the most frequent trigram beginning with that pair. When 2 trigrams occurred the same number of times, I chose the one whose third word is more frequent in the training dataset, with the help of the list of unigrams created earlier. The same procedure was applied to the digrams dataframe; a sketch of this selection step is shown below.
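
A dplyr sketch of that selection step (trigram_counts is a hypothetical name for a dataframe with columns V1, V2, V3 and Frequency; unigrams provides the tie-breaking word frequencies):

prediction_trigrams <- trigram_counts %>%
  left_join(unigrams, by = c("V3" = "word")) %>%                   # adds the unigram 'frequency' of V3
  group_by(V1, V2) %>%
  arrange(desc(Frequency), desc(frequency), .by_group = TRUE) %>%  # trigram count first, unigram count as tie-break
  slice(1) %>%                                                     # keep the best next word for each pair
  ungroup() %>%
  select(V1, V2, V3)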

Dictionary creation

At this point, I discovered that these generated dataframes, even though they have very similar dimensions, take up far more MB than the ones I used in the previous simulations. This has a big impact on speed and would probably create problems with the Shiny server memory limits. For this reason, I created a dictionary of the unique tokens, associating each of them with a number, and then built numeric versions of the trigrams and digrams dataframes. Presumably because integer codes take up less memory than character strings, the new dataframes are much lighter (about 1/12 of the size in MB).
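
A sketch of the dictionary idea: every distinct token gets an integer code, and the character columns of the prediction dataframes are replaced by those codes (the real files use their own ordering and column names, e.g. code.x/code.y for the digrams):

tokens <- unique(unigrams$word)
dicvec <- seq_along(tokens)                 # one integer code per token
names(dicvec) <- tokens                     # so dicvec["SOMEWORD"] returns its code

code_columns <- function(df, cols) {
  for (col in cols) df[[col]] <- unname(dicvec[df[[col]]])   # integers are far smaller than strings
  df
}
trigrams_coded <- code_columns(prediction_trigrams, c("V1", "V2", "V3"))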

Prediction algorithm

Finally, I want to show the prediction function, which consists of a few lines and requires 3 loaded files (the dictionary, the 2-grams dataframe in its numeric version, and a list with the 27 3-grams dataframes in their numeric version), for a total size of 181 MB (the free Shiny server RAM limit is 1 GB):

dicvec<-readRDS("df/dicvec.RDS")
digrams<-readRDS("df/digrams_coded.RDS")
m<-readRDS("df/coded.RDS")
print(object.size(list(m, digrams, dicvec)), units="Mb")
## 181.3 Mb
word_predict<-function(a,b) {
  v2<-dicvec[toupper(b)]                          # code of the second (nearest) word
  v1<-toupper(a)
  if(!(substr(v1,1,1) %in% names(m))) {           # no trigram dataframe for this initial character
    w<-as.integer(filter(digrams, code.x==v2)$code.y)
  } else {
    l<-m[[substr(v1,1,1)]]                        # trigram dataframe for the initial character of the first word
    v1<-dicvec[v1]
    w<-as.integer(filter(l, V1==v1 & V2==v2)$V3)
  }
  if(length(w)==0) {                              # pair not found: back off to the digram prediction
    w<-as.integer(filter(digrams, code.x==v2)$code.y)
  }
  if(length(w)==0) {w<-743}                       # last resort: the most frequent word in the dataset ("THE")
  names(dicvec[w])
}

Model validation

The final step is to test the prediction algorithm on the validation dataset: even though we are in a field where there is no single "correct answer", measuring the probability of guessing the next word typed over a large collection of blog posts, newspaper articles and tweets will be useful for comparing this first model with other models I will build during the project, or with someone else's model.
For this task, I processed the validation dataset into trigrams, following the same steps I used for the training dataset.
Note that all lines in the validation dataset had already been subjected to profanity filtering, tagging and tokenisation, but those processing steps will have to be included in the final prediction model, so that the user will be able to input numbers, web links, etc. and get the expected output.

validation<-readRDS("validation.RDS")
nrow(validation)
## [1] 9266761

As the number of rows is too big to test in a reasonable amount of time, I only used a subsample of 2 million trigrams (21.6% of the total), and testing took about 19 hours. The first 2 words of each trigram were passed to the prediction function, and the output was compared with the third word of the trigram (a sketch of this scoring step is shown below).
I then calculated accuracy (the percentage of words correctly guessed) and the mean prediction time. In the table you can see a few lines of the testing output (column "V3" contains the expected word, column "prediction" contains the output of the algorithm).
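
A minimal sketch of that scoring step (column names as in the table below; in the real run, validation was first reduced to the 2-million-row subsample):

a <- Sys.time()
validation$prediction <- mapply(word_predict, validation$V1, validation$V2,
                                USE.NAMES = FALSE)            # predicted third word
b <- Sys.time()
validation$Correct_prediction <- validation$prediction == validation$V3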

round(sum(validation$Correct_prediction)/nrow(validation)*100,2) #Accuracy %
## [1] 16.44
round(difftime(b,a, units="sec")/nrow(validation),4) # Mean time for prediction
## Time difference of 0.0334 secs

row V1 V2 V3 prediction Correct_prediction
2439408 THE BCS IN NATIONAL FALSE
3586079 8TH PLACE EPL TO FALSE
706051 ACTUALLY GET MARRIED TO FALSE
5107415 IF SHE WANTS WAS FALSE
5023765 YIKES HOPE NO THEY FALSE
5621073 THOSE SELF-APPOINTED SELF-CENTERED JUDGES FALSE
5525529 NEW BLACK EYED PANTHER FALSE
346263 CHAMBERLAIN’S ALLEGED CRIME THAT FALSE
8222333 ARE HAND MILLED MADE FALSE
3668094 HAVE AN EVEN AWESOME FALSE
6042247 MAYBE CONFUSED FOR PEOPLE FALSE
1771400 LAST LONG PERIODS AND FALSE
6613627 SHOW IS THIS A FALSE
803419 IS THERE AT A FALSE
339601 AS I HAD WAS FALSE
1001783 NOON TIME CHALLENGE CHALLENGES FALSE
2165787 IS PUT AGAINST ON FALSE
2755506 PUBLIC ENGAGEMENT PHASE PANEL FALSE
149725 STATE EDUCATION DEPARTMENT COMMISSIONER FALSE
9020866 SATCHMO AND REDFORD THE FALSE
4383286 OUR FRIENDS PUMPED AND FALSE
9180713 BY DOREEN CRONIN VIRTUE FALSE
4582374 WHATS UP WITH WITH TRUE
3372028 BUCK UP LADIES AND FALSE
4414302 THE NEW YEAR YORK FALSE
1371458 DADA WHICH MAKES WAS FALSE
9174078 HAD THREE HITS HITS TRUE
9022970 THE TOP THE OF FALSE
8846340 OFF OUR WINDOWS NEW FALSE
1446043 IN SIX DIFFERENT GAMES FALSE
4287981 ME FIGURE OUT OUT TRUE
6618228 THE BATTER IS INTO FALSE
7426844 FROM ANYTHING LIKE THAT FALSE
1262357 CONVENTION CENTER THIS AND FALSE
3645871 YOU BETTER FOLLOW BE FALSE
256687 IN YOUR VOTE LIFE FALSE
4727858 IN #GIVEBACK VIA THE FALSE
7342637 SMILING EAR-TO-EAR MATHENY SMILE FALSE
8573918 RAVI AND WEI WEI TRUE
5852593 CAME TO SAMPLE THE FALSE
5639100 SOARING ARTISTIC AND DIRECTOR FALSE
5317045 DOG A LAXATIVE BATH FALSE
1403311 MAYOR STANLEY IACONO KOVACH FALSE
7564059 PARTY FOR YOUR THE FALSE
514358 BOOMERS SAY THEY THEY TRUE
4155317 EVER SEEN SOMETHING IN FALSE
1718265 ABOUT THE SAME SAME TRUE
2785097 ON THE SWEET OTHER FALSE
7300650 HES FREAKIN BEAST ANOYIN FALSE
5862186 RAISE EVEN MORE A FALSE
4307292 THINK SHE IS IS TRUE
3695298 OFF FROM FASHION THE FALSE
282230 OUT TOYS AND FOR FALSE
8992478 THE SEARCH WHO FOR FALSE
7671088 BREATHED THE ESSENCE SAME FALSE
47192 TOO I CAN THINK FALSE
7959570 THE MUSEUM MIGHT OF FALSE
8005220 A SPOT WHERE IN FALSE
3236906 IS SO CUTE MUCH FALSE
3799121 OPTIMISTIC I’M GOING NOT FALSE
9243181 BUT EVEN AFTER IF FALSE
1344830 WAS JUST PLAYING A FALSE
2427489 MY KNOWLEDGE OF OF TRUE
1960246 CITY MUSIC HALL HALL TRUE
2440545 BARRAULT LES PERLES PAUL FALSE
5228277 I USE MY TO FALSE
4913660 AND DON’T HAVE FORGET FALSE
2481072 HE SAID AS HE FALSE
6798036 HAVE TO CHANGE BE FALSE
1463593 LET ME SAY KNOW FALSE
6597700 OF PAGES HE OF FALSE
8309101 ESCAPE OUR DEMONS SELFISH FALSE
5047135 NCM TOMORROW JULY NIGHT FALSE
1782946 THAT I WOULD HAVE FALSE
7375785 THE TOP-RANKED AMERICAN DJOKOVIC FALSE
8808609 BRAND NEW HOMES AND FALSE
2179250 MEAGER LIST PLEASE OF FALSE
8864278 WITH AUSTRALIAN ACCENTS PRONUNCIATION FALSE
7495723 ARE OPPORTUNITIES FOR TO FALSE
7617104 UNJUST AND UNFAIR UNREASONABLE FALSE
1359806 AREN’T YOU ANSWERING GOING FALSE
5999857 THREE DIFFERENT CHARACTERS PEOPLE FALSE
9222662 WAS IN MAGNIFICENT THE FALSE
5955181 SPOT SERVING CHINESE-AMERICAN SPANISH FALSE
703186 SECOND HALF AND OF FALSE
2379400 REALLY GOOD SAID AT FALSE
951439 ACCESS TO THE THE TRUE
9179059 TO A FIRST NEW FALSE
7597514 UP BUT AT I FALSE
3761509 WAS A PAPER-THIN GOOD FALSE
3113355 TIME TO ALERT GET FALSE
2849006 LIKE THIS FEELING ONE FALSE
6315233 CHIEF STRATEGIST FOR FOR TRUE
5007950 LIGHT CRAMPING I I TRUE
8685323 IMPLEMENTED FROM THEIR THE FALSE
1455771 EVENTUALLY DECIDE TO TO TRUE
7591326 EXTEND THE PROJECT LIFE FALSE
1095610 THEY WENT OUT TO FALSE
2009789 UNICORN AND A SCHMENDRICK FALSE
7199250 DREAM WERE THE REALITY FALSE


Possible improvements

I am quite satisfied with my first attempt. With a runtime of 0.033 seconds per prediction, I think I'll be able to build a Shiny app that doesn't require a button to start the computation, but makes predictions continuously, just like a smartphone keyboard does. I think this will still be possible after adding input tokenisation, profanity filtering and hopefully some extra computation to improve accuracy.
Admittedly, the accuracy is not very high (16.44%). I don't know what a reasonable benchmark would be, but observing my mobile keyboard in my native language, it correctly guesses the word I want to type about once every 3 or 4 words. So, I want to try to raise the accuracy to at least 25%.
Some of my ideas:

I also thought about another approach, which is perhaps closer to the one suggested in the Capstone project (which proposes to handle OOV by smoothing probabilities, something quite different from my solution to OOV).
This approach would rely on a dictionary and on 1, 2, 3 or more dataframes (depending on the number of words used for prediction). Those dataframes should contain:

Then, the algorithm would sum up (possibly with weights giving more importance to the nearest words) the probabilities for each possible predicted word and choose the most probable one; a sketch of this idea is shown at the end of this section.
To handle OOV, for each possible word in the 2nd column a line should be included in each dataframe with an "OOV" tag in the 1st column and a small probability (I still need to think about how small) in the 3rd.
I would like to find out whether this solution is better or worse than mine in terms of memory, speed and accuracy, but that would mean restarting the work from just after tokenisation, so I'm not sure I will do it (I will decide after reading my peers' reports and looking at some natural language prediction resources).
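
To make the weighted-sum idea more concrete, here is a minimal sketch, assuming two hypothetical lookup functions p_trigram() and p_digram() that return the stored conditional probabilities; the weights are arbitrary placeholders, not tuned values:

# hypothetical interpolation of 3-gram and 2-gram probabilities
score_candidate <- function(w, w1, w2, lambda = c(0.7, 0.3)) {
  lambda[1] * p_trigram(w, w1, w2) + lambda[2] * p_digram(w, w2)
}
predict_interpolated <- function(w1, w2, candidates) {
  scores <- sapply(candidates, score_candidate, w1 = w1, w2 = w2)
  names(which.max(scores))                   # candidate with the highest weighted score
}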