The goal of this project is to build an NLP algorithm that predicts which English word a user is most likely to type next, given a certain number (to be determined) of previously typed words.
The starting point is a corpus of text strings taken from blogs, online newspapers and tweets. These strings are in English; data in other languages are also provided and may be used in later steps.
The algorithm is initially built and tested locally, but it is expected to run in a Shiny app hosted on a Shiny server.
For my analysis, I used R 4.0.2 on a Windows x64 machine with 8 GB of RAM and the following libraries:
library(ggplot2)
library(plotly)
library(dplyr)
library(xtable)
NOTE: This report is intended to be concise and understandable by a non-data-scientist reader, so the vast majority of the code is not shown. If you are interested in it, you can find it on GitHub (Milestone.Rmd file). Knitting that file in RStudio (after removing every “eval=F” option present in the file) would reproduce the entire workflow, but it could take a very long time and require you to manually clear the workspace from time to time (that is why some files are saved to disk and then loaded again when needed). This is partly due to the nature of the analysis itself and the size of the data, and partly because I discovered some tricks to improve efficiency along the way.
Data were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
I used the bash command "wc -cmlwL *.txt" to find basic information about each of the 3 en_US files in my working directory, summarized in the table below:

| File | bytes | Number.of.characters | Number.of.lines | Number.of.words | Longest.line |
|---|---|---|---|---|---|
| en_US.blogs | 210160014 | 208623085 | 899288 | 37333958 | 40833 |
| en_US.news | 205811889 | 205243643 | 1010242 | 34365905 | 11384 |
| en_US.twitter | 167105338 | 166843164 | 2360148 | 30357171 | 173 |
Then, I merged the 3 files and, to save RAM, split the resulting object into 50 chunks that were subsequently loaded one at a time for profanity filtering and tokenisation. Before splitting, the strings were shuffled into random order, to ensure that each chunk contained strings from blogs, news and tweets in approximately the same proportions. I chose to convert all letters to uppercase to reduce the number of distinct words.
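As an illustration, the shuffle-and-split step might look like this (a minimal sketch: the chunk directory and file names, the random seed and the exact reading options are assumptions, while the real code is in the Milestone.Rmd file):

# Read the three en_US files and merge them into one character vector
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
all_lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))

# Shuffle so that every chunk mixes blogs, news and tweets, then convert to uppercase
set.seed(1234)
all_lines <- toupper(sample(all_lines))

# Split into 50 chunks of (almost) equal size and save each one to disk
chunks <- split(all_lines, cut(seq_along(all_lines), 50, labels = FALSE))
for (i in seq_along(chunks)) {
  saveRDS(chunks[[i]], file = sprintf("chunks/chunk_%02d.RDS", i))
}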
The profanity filter was based on a list of bad words found at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en. Profane words were replaced by the tag <BADWORD>.
I also searched for web links, email addresses, numbers, prices and dates and replaced them with tags, again to reduce the number of distinct tokens in the dataset. These filters are probably not perfect, but they appear to catch the majority of the desired items.
Finally, lines were split on whitespace and punctuation. Again, a small percentage of words did not split correctly, due to unusual punctuation patterns or typos, but this does not noticeably affect the quality of the processed data.
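To give an idea of the kind of processing involved, here is a rough sketch of the per-chunk filtering and tokenisation. The regular expressions are simplified stand-ins for the ones actually used, and the tag names other than <NUMBER>, <PRICE> and <BADWORD> are guesses (the real code, including date handling, is in the Milestone.Rmd file):

# Local copy of the LDNOOBW bad-words list (one word per line)
badwords <- toupper(readLines("en_badwords.txt"))

process_chunk <- function(lines) {
  # Replace web links, email addresses, prices and numbers with tags
  lines <- gsub("http[s]?://[^[:space:]]+|www\\.[^[:space:]]+", " <WEBLINK> ", lines)
  lines <- gsub("[^[:space:]]+@[^[:space:]]+\\.[A-Za-z]+", " <EMAIL> ", lines)
  lines <- gsub("\\$[0-9.,]+", " <PRICE> ", lines)
  lines <- gsub("[0-9]+([.,][0-9]+)?", " <NUMBER> ", lines)

  # Split on whitespace and punctuation (apostrophes and hyphens stay inside words,
  # and the angle brackets of the tags are left untouched)
  tokens <- unlist(strsplit(lines, "[[:space:]]+|[\",.!?;:()]+"))
  tokens <- tokens[tokens != ""]

  # Profanity filter: replace bad words with the <BADWORD> tag
  tokens[tokens %in% badwords] <- "<BADWORD>"
  tokens
}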
I then used the first 35 chunks (70% of the total, as they all have similar sizes) as the training dataset, and kept the other 15 aside for testing purposes. Chunks 36 to 45 will be the test dataset for the final model. Chunks 46 to 50 will be the validation dataset, used to measure the accuracy of each of my future models.
Then, the 35 word lists were merged into a single vector with 70,842,273 elements, which was tabulated to determine the total number of distinct words (or tokens) and each word’s frequency. The table was then sorted by frequency. As shown below, the total number of distinct words and tokens in the training dataset (the so-called “unigrams”) is 806,759.
As a comparison (code not provided), when I applied the same procedure to a sample containing only 10% of the blogs file, I obtained approximately 137,000 distinct words, while the total number of distinct tokens in the entire dataset (training + validation + test) is 1,014,175. So, as expected, increasing the number of strings examined is useful, but the advantage shrinks as the size of the dataset increases. We should also consider that a large share of the “new” words are simply typos.
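For reference, here is a minimal sketch of how the unigram table written to unigrams.csv (and read back below) can be built. The object `all_tokens`, holding the merged vector of 70,842,273 tokens, and the exact column names are assumptions; the chunked version actually used is in the Milestone.Rmd file:

# Tabulate token frequencies and sort them in decreasing order
freq <- sort(table(all_tokens), decreasing = TRUE)

unigrams <- data.frame(ranking = seq_along(freq),
                       words   = names(freq),
                       Freq    = as.integer(freq))

# Cumulative percent frequency, used below to study dataset coverage
unigrams$Cumulative_frequency_Percent <-
  round(cumsum(unigrams$Freq) / sum(unigrams$Freq) * 100, 2)

write.csv(unigrams, "unigrams.csv", row.names = FALSE)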
unigrams<-read.csv("unigrams.csv")
nrow(unigrams)
## [1] 806759
Now, I calculated the cumulative percent frequency and found that only 140 words/tokens account for fifty percent of the entire training dataset. As shown in the following table, 3 of them are not actual words but tags summarizing a variety of possible strings (<NUMBER>, <BADWORD>, <PRICE>).
top50<-unigrams[unigrams$Cumulative_frequency_Percent<=50,]
nrow(top50)
## [1] 140
| ranking | word | frequency | Cumulative_frequency_Percent |
|---|---|---|---|
| 1 | THE | 3327648 | 4.70 |
| 2 | TO | 1924098 | 7.41 |
| 3 | AND | 1683834 | 9.79 |
| 4 | A | 1663449 | 12.14 |
| 5 | OF | 1402050 | 14.12 |
| 6 | IN | 1149924 | 15.74 |
| 7 | I | 1149695 | 17.36 |
| 8 | FOR | 768408 | 18.45 |
| 9 | IS | 750501 | 19.51 |
| 10 | THAT | 727588 | 20.53 |
| 11 | <NUMBER> | 709743 | 21.54 |
| 12 | YOU | 654832 | 22.46 |
| 13 | IT | 639515 | 23.36 |
| 14 | ON | 571052 | 24.17 |
| 15 | WITH | 499779 | 24.88 |
| 16 | WAS | 437319 | 25.49 |
| 17 | MY | 421571 | 26.09 |
| 18 | AT | 398060 | 26.65 |
| 19 | BE | 383206 | 27.19 |
| 20 | THIS | 379566 | 27.73 |
| 21 | HAVE | 369964 | 28.25 |
| 22 | ARE | 342690 | 28.73 |
| 23 | BUT | 336983 | 29.21 |
| 24 | AS | 336454 | 29.68 |
| 25 | HE | 299288 | 30.11 |
| 26 | WE | 290701 | 30.52 |
| 27 | NOT | 286119 | 30.92 |
| 28 | FROM | 268219 | 31.30 |
| 29 | SO | 266981 | 31.67 |
| 30 | ME | 256465 | 32.04 |
| 31 | ALL | 230938 | 32.36 |
| 32 | THEY | 224613 | 32.68 |
| 33 | WILL | 219998 | 32.99 |
| 34 | BY | 219127 | 33.30 |
| 35 | OR | 215658 | 33.60 |
| 36 | SAID | 213581 | 33.91 |
| 37 | JUST | 212245 | 34.21 |
| 38 | HIS | 210949 | 34.50 |
| 39 | YOUR | 210591 | 34.80 |
| 40 | AN | 208651 | 35.09 |
| 41 | ABOUT | 206720 | 35.39 |
| 42 | OUT | 206057 | 35.68 |
| 43 | UP | 203489 | 35.96 |
| 44 | ONE | 202336 | 36.25 |
| 45 | IF | 194153 | 36.52 |
| 46 | WHAT | 193534 | 36.80 |
| 47 | LIKE | 188449 | 37.06 |
| 48 | WHEN | 184814 | 37.32 |
| 49 | HAS | 181658 | 37.58 |
| 50 | WHO | 174014 | 37.83 |
| 51 | CAN | 172052 | 38.07 |
| 52 | MORE | 170091 | 38.31 |
| 53 | DO | 168171 | 38.55 |
| 54 | HAD | 163156 | 38.78 |
| 55 | GET | 157916 | 39.00 |
| 56 | TIME | 150268 | 39.21 |
| 57 | THERE | 147488 | 39.42 |
| 58 | HER | 146178 | 39.63 |
| 59 | WOULD | 143511 | 39.83 |
| 60 | THEIR | 142970 | 40.03 |
| 61 | SOME | 141003 | 40.23 |
| 62 | NO | 138515 | 40.43 |
| 63 | SHE | 136821 | 40.62 |
| 64 | NEW | 135205 | 40.81 |
| 65 | BEEN | 131798 | 41.00 |
| 66 | OUR | 129951 | 41.18 |
| 67 | I’M | 128268 | 41.36 |
| 68 | IT’S | 126467 | 41.54 |
| 69 | NOW | 125512 | 41.72 |
| 70 | GOOD | 124811 | 41.89 |
| 71 | WERE | 124672 | 42.07 |
| 72 | HOW | 122105 | 42.24 |
| 73 | DAY | 117609 | 42.41 |
| 74 | KNOW | 114043 | 42.57 |
| 75 | THEM | 113054 | 42.73 |
| 76 | LOVE | 112103 | 42.89 |
| 77 | PEOPLE | 110392 | 43.04 |
| 78 | <BADWORD> | 105230 | 43.19 |
| 79 | <PRICE> | 101762 | 43.33 |
| 80 | WHICH | 100331 | 43.48 |
| 81 | BACK | 99042 | 43.61 |
| 82 | THAN | 98191 | 43.75 |
| 83 | GO | 97502 | 43.89 |
| 84 | SEE | 96668 | 44.03 |
| 85 | FIRST | 94368 | 44.16 |
| 86 | INTO | 93185 | 44.29 |
| 87 | AFTER | 92906 | 44.42 |
| 88 | MAKE | 90948 | 44.55 |
| 89 | ALSO | 90876 | 44.68 |
| 90 | DON’T | 89626 | 44.81 |
| 91 | ITS | 89344 | 44.93 |
| 92 | ONLY | 88738 | 45.06 |
| 93 | THINK | 88419 | 45.18 |
| 94 | GOING | 88411 | 45.31 |
| 95 | OTHER | 88379 | 45.43 |
| 96 | LAST | 87048 | 45.56 |
| 97 | OVER | 86932 | 45.68 |
| 98 | THEN | 86313 | 45.80 |
| 99 | GREAT | 86130 | 45.92 |
| 100 | HIM | 84413 | 46.04 |
| 101 | MUCH | 83989 | 46.16 |
| 102 | BECAUSE | 83896 | 46.28 |
| 103 | US | 82785 | 46.39 |
| 104 | TOO | 80262 | 46.51 |
| 105 | TWO | 80171 | 46.62 |
| 106 | REALLY | 79907 | 46.73 |
| 107 | YEAR | 79137 | 46.85 |
| 108 | WAY | 78301 | 46.96 |
| 109 | COULD | 78135 | 47.07 |
| 110 | TODAY | 77618 | 47.18 |
| 111 | GOT | 76058 | 47.28 |
| 112 | WELL | 75710 | 47.39 |
| 113 | EVEN | 75676 | 47.50 |
| 114 | WANT | 74811 | 47.60 |
| 115 | WORK | 73652 | 47.71 |
| 116 | DID | 73168 | 47.81 |
| 117 | STILL | 71805 | 47.91 |
| 118 | RIGHT | 71564 | 48.01 |
| 119 | HERE | 69057 | 48.11 |
| 120 | THANKS | 68862 | 48.21 |
| 121 | OFF | 68191 | 48.30 |
| 122 | NEED | 68124 | 48.40 |
| 123 | WHERE | 67947 | 48.50 |
| 124 | AM | 66933 | 48.59 |
| 125 | VERY | 65888 | 48.68 |
| 126 | YEARS | 64938 | 48.77 |
| 127 | MOST | 64873 | 48.87 |
| 128 | ANY | 64872 | 48.96 |
| 129 | BEFORE | 62347 | 49.05 |
| 130 | THOSE | 62226 | 49.13 |
| 131 | MANY | 62094 | 49.22 |
| 132 | RT | 61947 | 49.31 |
| 133 | DOWN | 61819 | 49.40 |
| 134 | LIFE | 61814 | 49.48 |
| 135 | SAY | 60447 | 49.57 |
| 136 | SHOULD | 60198 | 49.65 |
| 137 | TAKE | 59945 | 49.74 |
| 138 | BEING | 59206 | 49.82 |
| 139 | THESE | 58721 | 49.90 |
| 140 | COME | 58235 | 49.99 |
Now, let’s see how many words are needed to cover larger fractions of the dataset:
top90<-unigrams[unigrams$Cumulative_frequency_Percent<=90,]
nrow(top90)
## [1] 8445
top95<-unigrams[unigrams$Cumulative_frequency_Percent<=95,]
nrow(top95)
## [1] 22188
top99<-unigrams[unigrams$Cumulative_frequency_Percent<=99,]
nrow(top99)
## [1] 203398
Let’s look at the last tokens in the top-95% list:
tail(top95)
## ranking word frequency Cumulative_frequency_Percent
## 22183 22183 HYPOCRITE 120 95
## 22184 22184 INCARNATE 120 95
## 22185 22185 INTENDING 120 95
## 22186 22186 INTIMIDATE 120 95
## 22187 22187 IRWIN 120 95
## 22188 22188 JARGON 120 95
They seem to be fairly uncommon words, but not particularly strange either.
Here is a plot of the 95% coverage against the number of words used. You can zoom and hover on it to see each word with its absolute frequency in the dataset.
Next, I created a dataframe with all the digrams (combinations of 2 words/tokens) that can be found in the training dataset, together with their frequencies. Once again, I worked in chunks to avoid exceeding my laptop’s memory limits.
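As an illustration, the digram counts for a single chunk can be obtained along these lines (a sketch assuming the chunk’s token vector is called `tokens`; the per-chunk tables are then merged and re-aggregated):

library(dplyr)

# Pair each token with the one that follows it, then count each distinct pair
digrams_chunk <- data.frame(V1 = head(tokens, -1),
                            V2 = tail(tokens, -1)) %>%
  count(V1, V2, name = "Frequency")

# After processing all chunks, the per-chunk counts are combined like this:
# digrams <- bind_rows(all_chunk_counts) %>%
#   group_by(V1, V2) %>%
#   summarise(Frequency = sum(Frequency), .groups = "drop") %>%
#   arrange(desc(Frequency))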
Let’s explore the digram distribution as we did with unigrams:
digrams <- readRDS("digrams.RDS")
nrow(digrams)
## [1] 11459841
top50<-digrams[digrams$Cumulative_frequency_Percent<=50,]
nrow(top50)
## [1] 40516
We can see that there are more than 11 million unique digrams, and about 41,000 of them account for 50% of the total occurrences. Here are the top 100 2-grams:
| ranking | V1 | V2 | Frequency | Cumulative_frequency_Percent |
|---|---|---|---|---|
| 1 | OF | THE | 300736 | 0.44 |
| 2 | IN | THE | 284987 | 0.86 |
| 3 | TO | THE | 148713 | 1.08 |
| 4 | FOR | THE | 140432 | 1.29 |
| 5 | ON | THE | 136884 | 1.49 |
| 6 | TO | BE | 113296 | 1.66 |
| 7 | AT | THE | 99540 | 1.80 |
| 8 | AND | THE | 87669 | 1.93 |
| 9 | IN | A | 83212 | 2.06 |
| 10 | WITH | THE | 74029 | 2.17 |
| 11 | IS | A | 70486 | 2.27 |
| 12 | IT | WAS | 67212 | 2.37 |
| 13 | FOR | A | 65806 | 2.46 |
| 14 | FROM | THE | 60926 | 2.55 |
| 15 | I | HAVE | 60117 | 2.64 |
| 16 | I | WAS | 59928 | 2.73 |
| 17 | IT | IS | 57461 | 2.82 |
| 18 | WITH | A | 57232 | 2.90 |
| 19 | AND | I | 57191 | 2.98 |
| 20 | WILL | BE | 56720 | 3.07 |
| 21 | GOING | TO | 55773 | 3.15 |
| 22 | OF | A | 55480 | 3.23 |
| 23 | I | AM | 53665 | 3.31 |
| 24 | IS | THE | 51745 | 3.39 |
| 25 | HAVE | A | 51398 | 3.46 |
| 26 | IF | YOU | 50826 | 3.54 |
| 27 | ONE | OF | 50796 | 3.61 |
| 28 | IN | <NUMBER> | 49429 | 3.69 |
| 29 | TO | GET | 49192 | 3.76 |
| 30 | AS | A | 48333 | 3.83 |
| 31 | WANT | TO | 44509 | 3.90 |
| 32 | HAVE | TO | 43004 | 3.96 |
| 33 | BY | THE | 42737 | 4.02 |
| 34 | THAT | THE | 42561 | 4.08 |
| 35 | THIS | IS | 41040 | 4.14 |
| 36 | TO | DO | 40850 | 4.20 |
| 37 | AND | A | 40693 | 4.26 |
| 38 | I | THINK | 40671 | 4.32 |
| 39 | THE | FIRST | 40051 | 4.38 |
| 40 | WAS | A | 39569 | 4.44 |
| 41 | OUT | OF | 39272 | 4.50 |
| 42 | TO | A | 38735 | 4.56 |
| 43 | THAT | I | 37897 | 4.61 |
| 44 | TO | SEE | 37677 | 4.67 |
| 45 | ON | A | 37461 | 4.72 |
| 46 | ALL | THE | 35778 | 4.78 |
| 47 | BUT | I | 35633 | 4.83 |
| 48 | I | LOVE | 35185 | 4.88 |
| 49 | THE | SAME | 34613 | 4.93 |
| 50 | HAVE | BEEN | 33537 | 4.98 |
| 51 | TO | MAKE | 33429 | 5.03 |
| 52 | A | LOT | 33312 | 5.08 |
| 53 | YOU | CAN | 33276 | 5.13 |
| 54 | BE | A | 32775 | 5.18 |
| 55 | HE | WAS | 31976 | 5.22 |
| 56 | THANKS | FOR | 31384 | 5.27 |
| 57 | OF | MY | 31226 | 5.32 |
| 58 | NEED | TO | 30965 | 5.36 |
| 59 | HAS | BEEN | 30842 | 5.41 |
| 60 | A | FEW | 30541 | 5.45 |
| 61 | WOULD | BE | 30441 | 5.50 |
| 62 | YOU | ARE | 30439 | 5.54 |
| 63 | I | DON’T | 30417 | 5.59 |
| 64 | MORE | THAN | 29993 | 5.63 |
| 65 | IN | MY | 29622 | 5.67 |
| 66 | AS | THE | 29410 | 5.72 |
| 67 | ABOUT | THE | 29296 | 5.76 |
| 68 | WHEN | I | 29229 | 5.80 |
| 69 | YOU | HAVE | 28714 | 5.85 |
| 70 | A | GREAT | 28692 | 5.89 |
| 71 | TO | GO | 28627 | 5.93 |
| 72 | I | CAN | 28457 | 5.97 |
| 73 | I | HAD | 28436 | 6.01 |
| 74 | A | LITTLE | 28376 | 6.06 |
| 75 | THE | BEST | 27857 | 6.10 |
| 76 | TO | HAVE | 27729 | 6.14 |
| 77 | HE | SAID | 27510 | 6.18 |
| 78 | A | GOOD | 27431 | 6.22 |
| 79 | THANK | YOU | 27375 | 6.26 |
| 80 | I | KNOW | 27104 | 6.30 |
| 81 | HAD | A | 26915 | 6.34 |
| 82 | INTO | THE | 26858 | 6.38 |
| 83 | THEY | ARE | 26581 | 6.42 |
| 84 | WE | ARE | 26376 | 6.46 |
| 85 | I | JUST | 25652 | 6.49 |
| 86 | THERE | IS | 25623 | 6.53 |
| 87 | <NUMBER> | PERCENT | 24704 | 6.57 |
| 88 | IS | NOT | 24589 | 6.60 |
| 89 | THAT | IS | 24318 | 6.64 |
| 90 | A | NEW | 23452 | 6.68 |
| 91 | THE | NEW | 23355 | 6.71 |
| 92 | THERE | ARE | 23305 | 6.74 |
| 93 | SO | I | 23254 | 6.78 |
| 94 | THE | MOST | 23240 | 6.81 |
| 95 | THE | <NUMBER> | 23074 | 6.85 |
| 96 | OVER | THE | 23062 | 6.88 |
| 97 | THE | WORLD | 22987 | 6.91 |
| 98 | WE | HAVE | 22963 | 6.95 |
| 99 | I | WILL | 22877 | 6.98 |
| 100 | LIKE | A | 22644 | 7.02 |
And here is a plot with the top 1000 and cumulative frequency on the y axis:
The number of 3-grams is so large that my laptop could not retrieve and count the frequencies of all of them in a single file. Also, as shown in the Possible Models section, prediction is much faster when the data are split into one dataframe per letter.
So, after an initial processing in chunks, I divided each chunk into 27 dataframes: one for each initial letter of the first word in the 3-gram, plus one for trigrams beginning with “<”, which denotes my custom tags (while doing so, I discarded a small percentage of unusual tokens beginning with other characters). As these data will be the starting point for building the model, I sorted them by decreasing frequency, to make queries on the final dataframes more efficient.
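Here is a rough sketch of the per-letter split (the object and file names are assumptions; the actual code, which also works chunk by chunk, is in the Milestone.Rmd file):

# 'trigrams_all' is assumed to hold the columns V1, V2, V3 and Frequency
first_char <- substr(trigrams_all$V1, 1, 1)

# Keep only trigrams whose first word starts with a letter or with "<" (the custom tags)
keep <- first_char %in% c(LETTERS, "<")
trigrams_all <- trigrams_all[keep, ]
first_char   <- first_char[keep]

# One dataframe per initial character (27 in total), sorted by decreasing frequency
by_letter <- split(trigrams_all, first_char)
by_letter <- lapply(by_letter, function(x) x[order(-x$Frequency), ])

for (i in seq_along(by_letter)) {
  saveRDS(by_letter[[i]], file.path("models/merged", paste0("trigrams_", i, ".RDS")))
}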
Let’s look at this huge collection of 3-grams:
temp<-list.files("models/merged",full.names=T)
trigrams<-lapply(temp, readRDS)
sum(sapply(trigrams, nrow))
## [1] 33539777
There are nearly 34 million different combinations of 3 words in the training dataset. Let’s plot the 1000 most frequent against their frequency and table the first 100 of them.
| V1 | V2 | V3 | Frequency | ranking |
|---|---|---|---|---|
| ONE | OF | THE | 24251 | 1 |
| A | LOT | OF | 20971 | 2 |
| THANKS | FOR | THE | 16657 | 3 |
| TO | BE | A | 12635 | 4 |
| GOING | TO | BE | 12155 | 5 |
| THE | END | OF | 10425 | 6 |
| OUT | OF | THE | 10372 | 7 |
| I | WANT | TO | 10331 | 8 |
| IT | WAS | A | 9970 | 9 |
| AS | WELL | AS | 9611 | 10 |
| SOME | OF | THE | 9501 | 11 |
| BE | ABLE | TO | 9127 | 12 |
| MORE | THAN | <NUMBER> | 8681 | 13 |
| PART | OF | THE | 8614 | 14 |
| I | HAVE | A | 8234 | 15 |
| THE | REST | OF | 7892 | 16 |
| I | HAVE | TO | 7822 | 17 |
| LOOKING | FORWARD | TO | 7807 | 18 |
| THE | FIRST | TIME | 7199 | 19 |
| THANK | YOU | FOR | 7112 | 20 |
| IS | GOING | TO | 7081 | 21 |
| A | COUPLE | OF | 7005 | 22 |
| THIS | IS | A | 6841 | 23 |
| I | NEED | TO | 6616 | 24 |
| THERE | IS | A | 6550 | 25 |
| END | OF | THE | 6472 | 26 |
| YOU | WANT | TO | 6403 | 27 |
| YOU | HAVE | TO | 6396 | 28 |
| I | LOVE | YOU | 6378 | 29 |
| THE | FACT | THAT | 6352 | 30 |
| <NUMBER> | TO | <NUMBER> | 6245 | 31 |
| <NUMBER> | PERCENT | OF | 6061 | 32 |
| IN | THE | WORLD | 6032 | 33 |
| ONE | OF | MY | 6009 | 34 |
| TO | GO | TO | 5882 | 35 |
| CAN’T | WAIT | TO | 5880 | 36 |
| IT | WOULD | BE | 5866 | 37 |
| THIS | IS | THE | 5854 | 38 |
| I | DON’T | KNOW | 5802 | 39 |
| AT | THE | END | 5759 | 40 |
| FOR | THE | FIRST | 5720 | 41 |
| IS | ONE | OF | 5618 | 42 |
| IT | IS | A | 5567 | 43 |
| TO | HAVE | A | 5536 | 44 |
| THERE | IS | NO | 5499 | 45 |
| FOR | THE | FOLLOW | 5486 | 46 |
| IN | THE | FIRST | 5481 | 47 |
| I’M | GOING | TO | 5470 | 48 |
| MOST | OF | THE | 5400 | 49 |
| ACCORDING | TO | THE | 5278 | 50 |
| YOU | HAVE | A | 5250 | 51 |
| ALL | OF | THE | 5248 | 52 |
| IN | FRONT | OF | 5238 | 53 |
| THE | UNITED | STATES | 5029 | 54 |
| TO | BE | THE | 5019 | 55 |
| OF | THE | YEAR | 4967 | 56 |
| I | HAD | A | 4921 | 57 |
| IF | YOU | ARE | 4913 | 58 |
| I | HAD | TO | 4881 | 59 |
| REST | OF | THE | 4810 | 60 |
| I | THINK | I | 4808 | 61 |
| OF | THE | DAY | 4692 | 62 |
| BACK | TO | THE | 4664 | 63 |
| I | HAVE | BEEN | 4664 | 64 |
| I | WANTED | TO | 4645 | 65 |
| TO | MAKE | A | 4636 | 66 |
| HAVE | A | GREAT | 4635 | 67 |
| <NUMBER> | AND | <NUMBER> | 4562 | 68 |
| IT | WILL | BE | 4505 | 69 |
| WANT | TO | BE | 4465 | 70 |
| IN | ORDER | TO | 4440 | 71 |
| WHEN | I | WAS | 4365 | 72 |
| TO | SEE | THE | 4350 | 73 |
| AS | MUCH | AS | 4336 | 74 |
| I | FEEL | LIKE | 4320 | 75 |
| IN | <NUMBER> | AND | 4242 | 76 |
| IF | YOU | HAVE | 4230 | 77 |
| IN | THE | PAST | 4180 | 78 |
| TO | GET | A | 4157 | 79 |
| AT | THE | SAME | 4143 | 80 |
| ARE | GOING | TO | 4133 | 81 |
| ONE | OF | THOSE | 4132 | 82 |
| TO | DO | WITH | 4105 | 83 |
| I | DON’T | THINK | 4092 | 84 |
| I | WILL | BE | 4086 | 85 |
| HAVE | TO | BE | 4061 | 86 |
| OF | THE | MOST | 4037 | 87 |
| AT | THE | TIME | 4031 | 88 |
| WAS | GOING | TO | 4012 | 89 |
| IF | YOU | WANT | 4006 | 90 |
| WE | NEED | TO | 3986 | 91 |
| THE | SAME | TIME | 3979 | 92 |
| TO | SEE | YOU | 3972 | 93 |
| A | BIT | OF | 3955 | 94 |
| THERE | WAS | A | 3936 | 95 |
| OF | THE | <NUMBER> | 3920 | 96 |
| I | AM | NOT | 3862 | 97 |
| LET | ME | KNOW | 3860 | 98 |
| WOULD | LIKE | TO | 3844 | 99 |
| IN | THE | MIDDLE | 3822 | 100 |
For the modeling task, I have not used any specialized NLP package for the moment. Basically, I am testing my idea that a good approach would be to associate, in one or more dataframes, all the observed sequences of words (of a given length) with the most frequent next word. These dataframes would then be filtered on the input words, and the output would be the content of the last column.
Here are some lines of code that roughly estimate the speed and memory requirements of various approaches. Note that when I wrote this code I had not yet finished creating the dataframes, so I simply used some available dataframes with the same number of columns and (roughly) the same number of rows expected in the real ones. That is why the prediction outputs are meaningless.
unigrams<-read.csv("unigrams.csv")
unigrams<-select(unigrams, words, Freq)
a<-Sys.time()
output<-filter(unigrams, words==toupper("friends")) %>% select(Freq)
b<-Sys.time()
b-a
## Time difference of 0.040766 secs
print(object.size(unigrams), units="Mb")
## 57 Mb
digrams <- readRDS("digrams.RDS")
a<-Sys.time()
output<-as.character(filter(digrams, V1==toupper("best") & V2==toupper("friends")))
b<-Sys.time()
b-a
## Time difference of 11.66452 secs
print(object.size(digrams), units="Mb")
## 445.5 Mb
a<-readRDS("digrams.RDS")
# Keep only digrams whose first word starts with a letter
i<-grep("^[A-Z]", a$V1)
a<-a[i,]
# Extract the first letter of the first word and use it to split
# the dataframe into one smaller dataframe per initial letter
init<-strsplit(a$V1,"")
init<-unlist(sapply(init, function(x) x[1]))
a$factor<-as.factor(init)
l<-split(a, a$factor)
l<-lapply(l, function(x) x[,1:3])
a<-Sys.time()
v1=toupper("best")
v2=toupper("friends")
output<-as.character(filter(l[[substr(v1,1,1)]], V1==v1 & V2==v2))
b<-Sys.time()
b-a
## Time difference of 2.76389 secs
print(object.size(l), units="Mb")
## 503.5 Mb
The last solution seems to be the best, assuming that prediction from 2 words is more accurate than prediction from 1 word.
For out-of-vocabulary (OOV) pairs of words, my algorithm will fall back to prediction from 1 word. If not even the previous word alone is in the dictionary, the algorithm will predict the most frequent word in the dataset (“THE”).
For the dataframes to be used by the prediction function, I selected, for each pair of words, the most frequent trigram beginning with that pair. When 2 trigrams occurred the same number of times, I chose the one whose third word is more frequent in the training dataset, with the help of the list of unigrams created earlier. The same procedure was applied to the digrams dataframe.
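In dplyr terms, the selection just described can be sketched as follows (assuming the full trigram counts are in `trigrams_df`, with columns V1, V2, V3 and Frequency, and the unigram table is loaded as `unigrams`):

library(dplyr)

prediction_trigrams <- trigrams_df %>%
  # unigram frequency of the candidate third word, used only to break ties
  left_join(select(unigrams, words, unigram_freq = Freq),
            by = c("V3" = "words")) %>%
  group_by(V1, V2) %>%
  arrange(desc(Frequency), desc(unigram_freq), .by_group = TRUE) %>%
  slice(1) %>%            # keep the single best continuation for each pair of words
  ungroup() %>%
  select(V1, V2, V3)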
At this point, I discovered that these generated dataframes, even though they have very similar dimensions, take up far more MB than the ones I used in the previous simulations. This has a big impact on speed, and would probably create problems with the Shiny server memory limits. For this reason, I created a dictionary associating each unique token with a number, and then created numeric versions of the trigram and digram dataframes. Probably because integer codes take up much less memory than character strings, these new dataframes are much lighter (about 1/12 of the size in MB).
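A minimal sketch of the numeric coding, under the assumption that the selected digrams are in a dataframe `digrams_pred` with columns V1 (previous word) and V2 (predicted word); the object name `dicvec` is the one used by the prediction function below:

# Dictionary: a named integer vector mapping every distinct token to a code
all_words <- unique(c(digrams_pred$V1, digrams_pred$V2))
dicvec <- seq_along(all_words)
names(dicvec) <- all_words

# Numeric version of the digram dataframe (the same coding is applied
# to the 27 per-letter trigram dataframes)
digrams_coded <- data.frame(code.x = dicvec[digrams_pred$V1],
                            code.y = dicvec[digrams_pred$V2])

saveRDS(dicvec, "df/dicvec.RDS")
saveRDS(digrams_coded, "df/digrams_coded.RDS")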
Finally, I want to show you the prediction function, which consists of a few lines and requires 3 files to be loaded: a dictionary, the 2-grams dataframe in its numerical version and a list with the 27 3-grams dataframes in their numerical version, for a total size of 181 MB (the free Shiny server RAM limit is 1 GB):
dicvec<-readRDS("df/dicvec.RDS")
digrams<-readRDS("df/digrams_coded.RDS")
m<-readRDS("df/coded.RDS")
print(object.size(list(m, digrams, dicvec)), units="Mb")
## 181.3 Mb
word_predict<-function(a,b) {
  # a = second-to-last word typed, b = last word typed
  v2<-dicvec[toupper(b)]   # numeric code of the last word
  v1<-toupper(a)
  if(!(substr(v1,1,1) %in% names(m))) {
    # no trigram dataframe for this initial letter: fall back to digrams
    w<-as.integer(filter(digrams, code.x==v2)$code.y)
  }
  else {
    l<-m[[substr(v1,1,1)]]  # trigram dataframe for the initial letter of the first word
    v1<-dicvec[v1]          # numeric code of the first word
    w<-as.integer(filter(l, V1==v1 & V2==v2)$V3)
  }
  if(length(w)==0) {
    # pair of words not found: fall back to prediction from the last word only
    w<-as.integer(filter(digrams, code.x==v2)$code.y)
  }
  if(length(w)==0) {w<-743}  # still nothing: code 743 corresponds to "THE", the most frequent word
  names(dicvec[w])
}
The final step is to test the prediction algorithm on the validation dataset: even though we are in a field where there is no single “correct answer”, measuring the probability of guessing the next word typed on a large collection of blogs, newspapers and tweets will be useful for comparing this first model with the other models I will build during the project, or with someone else’s model.
For this task, I processed the validation dataset into trigrams, following the same steps used for the training dataset.
Note that all lines in the validation dataset had already undergone profanity filtering, tagging and tokenisation, but these processing steps will have to be included in the final prediction model, so that the user will be able to input numbers, web links, etc. and get the expected output.
validation<-readRDS("validation.RDS")
nrow(validation)
## [1] 9266761
As the number of rows is too big to be tested in a reasonable amount of time, I only used a subsample of 2 million trigrams (21.6% of the total); testing completed in about 19 hours. The first 2 words of each trigram were passed to the prediction function, and the output was compared with the third word of the trigram.
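A sketch of the testing loop (the random seed and the exact sampling code are assumptions; `word_predict` is the function shown above):

# Draw the 2-million-trigram subsample and run the prediction function on it
set.seed(4321)
validation <- validation[sample(nrow(validation), 2e6), ]

a <- Sys.time()
validation$prediction <- mapply(word_predict, validation$V1, validation$V2,
                                USE.NAMES = FALSE)
b <- Sys.time()

# TRUE when the predicted word matches the word actually typed
validation$Correct_prediction <- validation$prediction == validation$V3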
I then calculated the accuracy (percentage of words correctly guessed) and the mean prediction time. In the table you can see a few lines of the testing output (column “V3” contains the expected word, column “prediction” contains the output of the algorithm).
round(sum(validation$Correct_prediction)/nrow(validation)*100,2) #Accuracy %
## [1] 16.44
round(difftime(b,a, units="sec")/nrow(validation),4) # Mean time for prediction
## Time difference of 0.0334 secs
| row | V1 | V2 | V3 | prediction | Correct_prediction |
|---|---|---|---|---|---|
| 2439408 | THE | BCS | IN | NATIONAL | FALSE |
| 3586079 | 8TH | PLACE | EPL | TO | FALSE |
| 706051 | ACTUALLY | GET | MARRIED | TO | FALSE |
| 5107415 | IF | SHE | WANTS | WAS | FALSE |
| 5023765 | YIKES | HOPE | NO | THEY | FALSE |
| 5621073 | THOSE | SELF-APPOINTED | SELF-CENTERED | JUDGES | FALSE |
| 5525529 | NEW | BLACK | EYED | PANTHER | FALSE |
| 346263 | CHAMBERLAIN’S | ALLEGED | CRIME | THAT | FALSE |
| 8222333 | ARE | HAND | MILLED | MADE | FALSE |
| 3668094 | HAVE | AN | EVEN | AWESOME | FALSE |
| 6042247 | MAYBE | CONFUSED | FOR | PEOPLE | FALSE |
| 1771400 | LAST | LONG | PERIODS | AND | FALSE |
| 6613627 | SHOW | IS | THIS | A | FALSE |
| 803419 | IS | THERE | AT | A | FALSE |
| 339601 | AS | I | HAD | WAS | FALSE |
| 1001783 | NOON | TIME | CHALLENGE | CHALLENGES | FALSE |
| 2165787 | IS | PUT | AGAINST | ON | FALSE |
| 2755506 | PUBLIC | ENGAGEMENT | PHASE | PANEL | FALSE |
| 149725 | STATE | EDUCATION | DEPARTMENT | COMMISSIONER | FALSE |
| 9020866 | SATCHMO | AND | REDFORD | THE | FALSE |
| 4383286 | OUR | FRIENDS | PUMPED | AND | FALSE |
| 9180713 | BY | DOREEN | CRONIN | VIRTUE | FALSE |
| 4582374 | WHATS | UP | WITH | WITH | TRUE |
| 3372028 | BUCK | UP | LADIES | AND | FALSE |
| 4414302 | THE | NEW | YEAR | YORK | FALSE |
| 1371458 | DADA | WHICH | MAKES | WAS | FALSE |
| 9174078 | HAD | THREE | HITS | HITS | TRUE |
| 9022970 | THE | TOP | THE | OF | FALSE |
| 8846340 | OFF | OUR | WINDOWS | NEW | FALSE |
| 1446043 | IN | SIX | DIFFERENT | GAMES | FALSE |
| 4287981 | ME | FIGURE | OUT | OUT | TRUE |
| 6618228 | THE | BATTER | IS | INTO | FALSE |
| 7426844 | FROM | ANYTHING | LIKE | THAT | FALSE |
| 1262357 | CONVENTION | CENTER | THIS | AND | FALSE |
| 3645871 | YOU | BETTER | FOLLOW | BE | FALSE |
| 256687 | IN | YOUR | VOTE | LIFE | FALSE |
| 4727858 | IN | #GIVEBACK | VIA | THE | FALSE |
| 7342637 | SMILING | EAR-TO-EAR | MATHENY | SMILE | FALSE |
| 8573918 | RAVI | AND | WEI | WEI | TRUE |
| 5852593 | CAME | TO | SAMPLE | THE | FALSE |
| 5639100 | SOARING | ARTISTIC | AND | DIRECTOR | FALSE |
| 5317045 | DOG | A | LAXATIVE | BATH | FALSE |
| 1403311 | MAYOR | STANLEY | IACONO | KOVACH | FALSE |
| 7564059 | PARTY | FOR | YOUR | THE | FALSE |
| 514358 | BOOMERS | SAY | THEY | THEY | TRUE |
| 4155317 | EVER | SEEN | SOMETHING | IN | FALSE |
| 1718265 | ABOUT | THE | SAME | SAME | TRUE |
| 2785097 | ON | THE | SWEET | OTHER | FALSE |
| 7300650 | HES | FREAKIN | BEAST | ANOYIN | FALSE |
| 5862186 | RAISE | EVEN | MORE | A | FALSE |
| 4307292 | THINK | SHE | IS | IS | TRUE |
| 3695298 | OFF | FROM | FASHION | THE | FALSE |
| 282230 | OUT | TOYS | AND | FOR | FALSE |
| 8992478 | THE | SEARCH | WHO | FOR | FALSE |
| 7671088 | BREATHED | THE | ESSENCE | SAME | FALSE |
| 47192 | TOO | I | CAN | THINK | FALSE |
| 7959570 | THE | MUSEUM | MIGHT | OF | FALSE |
| 8005220 | A | SPOT | WHERE | IN | FALSE |
| 3236906 | IS | SO | CUTE | MUCH | FALSE |
| 3799121 | OPTIMISTIC | I’M | GOING | NOT | FALSE |
| 9243181 | BUT | EVEN | AFTER | IF | FALSE |
| 1344830 | WAS | JUST | PLAYING | A | FALSE |
| 2427489 | MY | KNOWLEDGE | OF | OF | TRUE |
| 1960246 | CITY | MUSIC | HALL | HALL | TRUE |
| 2440545 | BARRAULT | LES | PERLES | PAUL | FALSE |
| 5228277 | I | USE | MY | TO | FALSE |
| 4913660 | AND | DON’T | HAVE | FORGET | FALSE |
| 2481072 | HE | SAID | AS | HE | FALSE |
| 6798036 | HAVE | TO | CHANGE | BE | FALSE |
| 1463593 | LET | ME | SAY | KNOW | FALSE |
| 6597700 | OF | PAGES | HE | OF | FALSE |
| 8309101 | ESCAPE | OUR | DEMONS | SELFISH | FALSE |
| 5047135 | NCM | TOMORROW | JULY | NIGHT | FALSE |
| 1782946 | THAT | I | WOULD | HAVE | FALSE |
| 7375785 | THE | TOP-RANKED | AMERICAN | DJOKOVIC | FALSE |
| 8808609 | BRAND | NEW | HOMES | AND | FALSE |
| 2179250 | MEAGER | LIST | PLEASE | OF | FALSE |
| 8864278 | WITH | AUSTRALIAN | ACCENTS | PRONUNCIATION | FALSE |
| 7495723 | ARE | OPPORTUNITIES | FOR | TO | FALSE |
| 7617104 | UNJUST | AND | UNFAIR | UNREASONABLE | FALSE |
| 1359806 | AREN’T | YOU | ANSWERING | GOING | FALSE |
| 5999857 | THREE | DIFFERENT | CHARACTERS | PEOPLE | FALSE |
| 9222662 | WAS | IN | MAGNIFICENT | THE | FALSE |
| 5955181 | SPOT | SERVING | CHINESE-AMERICAN | SPANISH | FALSE |
| 703186 | SECOND | HALF | AND | OF | FALSE |
| 2379400 | REALLY | GOOD | SAID | AT | FALSE |
| 951439 | ACCESS | TO | THE | THE | TRUE |
| 9179059 | TO | A | FIRST | NEW | FALSE |
| 7597514 | UP | BUT | AT | I | FALSE |
| 3761509 | WAS | A | PAPER-THIN | GOOD | FALSE |
| 3113355 | TIME | TO | ALERT | GET | FALSE |
| 2849006 | LIKE | THIS | FEELING | ONE | FALSE |
| 6315233 | CHIEF | STRATEGIST | FOR | FOR | TRUE |
| 5007950 | LIGHT | CRAMPING | I | I | TRUE |
| 8685323 | IMPLEMENTED | FROM | THEIR | THE | FALSE |
| 1455771 | EVENTUALLY | DECIDE | TO | TO | TRUE |
| 7591326 | EXTEND | THE | PROJECT | LIFE | FALSE |
| 1095610 | THEY | WENT | OUT | TO | FALSE |
| 2009789 | UNICORN | AND | A | SCHMENDRICK | FALSE |
| 7199250 | DREAM | WERE | THE | REALITY | FALSE |
I am quite satisfied with this first attempt. With a runtime of 0.033 seconds per prediction, I think I will be able to build a Shiny app that does not require a button to start the computation, but makes predictions continuously, just like a smartphone keyboard does. I think this will still be possible after adding input tokenisation, profanity filtering and hopefully some extra computation to improve accuracy.
On the other hand, the accuracy is not very high (16.44%). I do not know what a reasonable benchmark would be, but observing my mobile keyboard in my native language, it correctly guesses the word I want to type about once every 3 or 4 times. So, I want to try to raise the accuracy to at least 25%.
Some of my ideas:
I also thought about another approach, which is perhaps closer to the one suggested in the Capstone project (which proposes to manage OOV by smoothing probabilities, something quite different from my solution to OOV).
This approach would rely on a dictionary and on 1, 2, 3 or more dataframes (depending on the number of preceding words used for prediction), one per word position: each dataframe would contain a preceding word in the 1st column, a possible next word in the 2nd and the corresponding probability in the 3rd.
The algorithm would then sum up the probabilities for each possible predicted word (possibly with weights giving more importance to the nearest words) and choose the most probable one.
To manage OOV words, each dataframe should also include, for each possible word in the 2nd column, a line with an “OOV” tag in the 1st column and a small (I still have to decide how small) probability in the 3rd.
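Purely as an illustration of this alternative idea (nothing here has been implemented), prediction from the last two words might look like the following, where `df1` and `df2` are the hypothetical dataframes for the last and the second-to-last word, each with columns `previous`, `candidate` and `prob`:

library(dplyr)

predict_weighted <- function(last_word, previous_word, w_last = 2, w_prev = 1) {
  # Fall back to the "OOV" row when a word is not in the dataframe
  key1 <- ifelse(last_word %in% df1$previous, last_word, "OOV")
  key2 <- ifelse(previous_word %in% df2$previous, previous_word, "OOV")

  c1 <- filter(df1, previous == key1)
  c2 <- filter(df2, previous == key2)

  # Weighted sum of the probabilities for each candidate next word,
  # giving more importance to the nearest word
  full_join(c1, c2, by = "candidate") %>%
    mutate(score = w_last * coalesce(prob.x, 0) + w_prev * coalesce(prob.y, 0)) %>%
    slice_max(score, n = 1, with_ties = FALSE) %>%
    pull(candidate)
}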
I would like to find out whether this solution is better or worse than mine in terms of memory, speed and accuracy, but that would mean restarting the work from just after tokenisation, so I am not sure I will do it (I will decide after reading my peers’ reports and looking at some natural language prediction resources).