The goal of this project is to build an NLP algorithm that predicts which English word a user is most likely to type next, given a certain number (to be determined) of previously typed words.
The starting point is a corpus of text strings taken from blogs, online newspapers and tweets. These strings are in English; data in other languages are also provided and may be used in later steps.
The algorithm is initially built and tested locally, but it is expected to run in a Shiny app hosted on a Shiny server.
For my analysis, I used R 4.0.2 on a Windows x64 machine with 8 GB of RAM and the following libraries:
library(ggplot2)
library(plotly)
library(dplyr)
library(xtable)
NOTE: This report is intended to be concise and understandable by a non-data-scientist reader, so the vast majority of the code is not shown. If you are interested in it, you can find it on GitHub (Milestone.Rmd file). Knitting that file in RStudio (after removing every “eval=F” option present in the file) would reproduce the entire workflow, but it could take a very long time and require you to manually clear the workspace from time to time (that is why some files are saved to disk and then loaded again when needed). This is partly due to the nature of the analysis itself and the size of the data, and partly because I discovered some tricks to improve efficiency along the way.
Data were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
I used the bash command "wc -cmlwL *.txt" to find basic information about each of the 3 en_US files in my working directory, summarized in the table below:

| File | bytes | Number.of.characters | Number.of.lines | Number.of.words | Longest.line |
|---|---|---|---|---|---|
| en_US.blogs | 210160014 | 208623085 | 899288 | 37333958 | 40833 |
| en_US.news | 205811889 | 205243643 | 1010242 | 34365905 | 11384 |
| en_US.twitter | 167105338 | 166843164 | 2360148 | 30357171 | 173 |
Then, I merged the 3 files and, to save RAM, split the resulting object into 50 chunks that were subsequently loaded one at a time for profanity filtering and tokenisation. Before splitting, the strings were shuffled into random order, to ensure that each chunk contained strings from blogs, news and tweets in approximately the same proportions. I chose to convert all letters to uppercase to reduce the number of distinct words.
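As an illustration, the shuffle-and-split step might look like this (a minimal sketch: the chunk directory and file names, the random seed and the exact reading options are assumptions, while the real code is in the Milestone.Rmd file):

# Read the three en_US files and merge them into one character vector
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
all_lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))

# Shuffle so that every chunk mixes blogs, news and tweets, then convert to uppercase
set.seed(1234)
all_lines <- toupper(sample(all_lines))

# Split into 50 chunks of (almost) equal size and save each one to disk
chunks <- split(all_lines, cut(seq_along(all_lines), 50, labels = FALSE))
for (i in seq_along(chunks)) {
  saveRDS(chunks[[i]], file = sprintf("chunks/chunk_%02d.RDS", i))
}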
The profanity filter was based on a list of bad words found at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en. Profane words were replaced by the tag <BADWORD>.
I also searched for web links, email addresses, numbers, prices and dates and replaced them with tags, again to reduce the number of distinct tokens in the dataset. These filters are probably not perfect, but they appear to catch the majority of the desired items.
Finally, lines were split on whitespace and punctuation. Again, a small percentage of words did not split correctly, due to unusual punctuation patterns or typos, but this does not noticeably affect the quality of the processed data.
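To give an idea of the kind of processing involved, here is a rough sketch of the per-chunk filtering and tokenisation. The regular expressions are simplified stand-ins for the ones actually used, and the tag names other than <NUMBER>, <PRICE> and <BADWORD> are guesses (the real code, including date handling, is in the Milestone.Rmd file):

# Local copy of the LDNOOBW bad-words list (one word per line)
badwords <- toupper(readLines("en_badwords.txt"))

process_chunk <- function(lines) {
  # Replace web links, email addresses, prices and numbers with tags
  lines <- gsub("http[s]?://[^[:space:]]+|www\\.[^[:space:]]+", " <WEBLINK> ", lines)
  lines <- gsub("[^[:space:]]+@[^[:space:]]+\\.[A-Za-z]+", " <EMAIL> ", lines)
  lines <- gsub("\\$[0-9.,]+", " <PRICE> ", lines)
  lines <- gsub("[0-9]+([.,][0-9]+)?", " <NUMBER> ", lines)

  # Split on whitespace and punctuation (apostrophes and hyphens stay inside words,
  # and the angle brackets of the tags are left untouched)
  tokens <- unlist(strsplit(lines, "[[:space:]]+|[\",.!?;:()]+"))
  tokens <- tokens[tokens != ""]

  # Profanity filter: replace bad words with the <BADWORD> tag
  tokens[tokens %in% badwords] <- "<BADWORD>"
  tokens
}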
I then used the first 35 chunks (70% of the total, as they all have similar sizes) as the training dataset, and kept the other 15 aside for testing purposes. Chunks 36 to 45 will be the test dataset for the final model. Chunks 46 to 50 will be the validation dataset, used to measure the accuracy of each of my future models.
Then, the 35 word lists were merged into a single vector with 70,842,273 elements, which was tabulated to determine the total number of distinct words (or tokens) and each word’s frequency. The table was then sorted by frequency. As shown below, the total number of distinct words and tokens in the training dataset (the so-called “unigrams”) is 806,759.
As a comparison (code not provided), when I applied the same procedure to a sample containing only 10% of the blogs file, I obtained approximately 137,000 distinct words, while the total number of distinct tokens in the entire dataset (training + validation + test) is 1,014,175. So, as expected, increasing the number of strings examined is useful, but the advantage shrinks as the size of the dataset increases. We should also consider that a large share of the “new” words are simply typos.
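For reference, here is a minimal sketch of how the unigram table written to unigrams.csv (and read back below) can be built. The object `all_tokens`, holding the merged vector of 70,842,273 tokens, and the exact column names are assumptions; the chunked version actually used is in the Milestone.Rmd file:

# Tabulate token frequencies and sort them in decreasing order
freq <- sort(table(all_tokens), decreasing = TRUE)

unigrams <- data.frame(ranking = seq_along(freq),
                       words   = names(freq),
                       Freq    = as.integer(freq))

# Cumulative percent frequency, used below to study dataset coverage
unigrams$Cumulative_frequency_Percent <-
  round(cumsum(unigrams$Freq) / sum(unigrams$Freq) * 100, 2)

write.csv(unigrams, "unigrams.csv", row.names = FALSE)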
unigrams<-read.csv("unigrams.csv")
nrow(unigrams)
## [1] 806759
Now, I calculated the cumulative percent frequency and found that only 140 words/tokens account for fifty percent of the entire training dataset. As shown in the following table, 3 of them are not actual words but tags summarizing a variety of possible strings (<NUMBER>, <BADWORD>, <PRICE>).
top50<-unigrams[unigrams$Cumulative_frequency_Percent<=50,]
nrow(top50)
## [1] 140
| ranking | word | frequency | Cumulative_frequency_Percent |
|---|---|---|---|
| 1 | THE | 3327648 | 4.70 |
| 2 | TO | 1924098 | 7.41 |
| 3 | AND | 1683834 | 9.79 |
| 4 | A | 1663449 | 12.14 |
| 5 | OF | 1402050 | 14.12 |
| 6 | IN | 1149924 | 15.74 |
| 7 | I | 1149695 | 17.36 |
| 8 | FOR | 768408 | 18.45 |
| 9 | IS | 750501 | 19.51 |
| 10 | THAT | 727588 | 20.53 |
| 11 | <NUMBER> | 709743 | 21.54 |
| 12 | YOU | 654832 | 22.46 |
| 13 | IT | 639515 | 23.36 |
| 14 | ON | 571052 | 24.17 |
| 15 | WITH | 499779 | 24.88 |
| 16 | WAS | 437319 | 25.49 |
| 17 | MY | 421571 | 26.09 |
| 18 | AT | 398060 | 26.65 |
| 19 | BE | 383206 | 27.19 |
| 20 | THIS | 379566 | 27.73 |
| 21 | HAVE | 369964 | 28.25 |
| 22 | ARE | 342690 | 28.73 |
| 23 | BUT | 336983 | 29.21 |
| 24 | AS | 336454 | 29.68 |
| 25 | HE | 299288 | 30.11 |
| 26 | WE | 290701 | 30.52 |
| 27 | NOT | 286119 | 30.92 |
| 28 | FROM | 268219 | 31.30 |
| 29 | SO | 266981 | 31.67 |
| 30 | ME | 256465 | 32.04 |
| 31 | ALL | 230938 | 32.36 |
| 32 | THEY | 224613 | 32.68 |
| 33 | WILL | 219998 | 32.99 |
| 34 | BY | 219127 | 33.30 |
| 35 | OR | 215658 | 33.60 |
| 36 | SAID | 213581 | 33.91 |
| 37 | JUST | 212245 | 34.21 |
| 38 | HIS | 210949 | 34.50 |
| 39 | YOUR | 210591 | 34.80 |
| 40 | AN | 208651 | 35.09 |
| 41 | ABOUT | 206720 | 35.39 |
| 42 | OUT | 206057 | 35.68 |
| 43 | UP | 203489 | 35.96 |
| 44 | ONE | 202336 | 36.25 |
| 45 | IF | 194153 | 36.52 |
| 46 | WHAT | 193534 | 36.80 |
| 47 | LIKE | 188449 | 37.06 |
| 48 | WHEN | 184814 | 37.32 |
| 49 | HAS | 181658 | 37.58 |
| 50 | WHO | 174014 | 37.83 |
| 51 | CAN | 172052 | 38.07 |
| 52 | MORE | 170091 | 38.31 |
| 53 | DO | 168171 | 38.55 |
| 54 | HAD | 163156 | 38.78 |
| 55 | GET | 157916 | 39.00 |
| 56 | TIME | 150268 | 39.21 |
| 57 | THERE | 147488 | 39.42 |
| 58 | HER | 146178 | 39.63 |
| 59 | WOULD | 143511 | 39.83 |
| 60 | THEIR | 142970 | 40.03 |
| 61 | SOME | 141003 | 40.23 |
| 62 | NO | 138515 | 40.43 |
| 63 | SHE | 136821 | 40.62 |
| 64 | NEW | 135205 | 40.81 |
| 65 | BEEN | 131798 | 41.00 |
| 66 | OUR | 129951 | 41.18 |
| 67 | I’M | 128268 | 41.36 |
| 68 | IT’S | 126467 | 41.54 |
| 69 | NOW | 125512 | 41.72 |
| 70 | GOOD | 124811 | 41.89 |
| 71 | WERE | 124672 | 42.07 |
| 72 | HOW | 122105 | 42.24 |
| 73 | DAY | 117609 | 42.41 |
| 74 | KNOW | 114043 | 42.57 |
| 75 | THEM | 113054 | 42.73 |
| 76 | LOVE | 112103 | 42.89 |
| 77 | PEOPLE | 110392 | 43.04 |
| 78 | <BADWORD> | 105230 | 43.19 |
| 79 | <PRICE> | 101762 | 43.33 |
| 80 | WHICH | 100331 | 43.48 |
| 81 | BACK | 99042 | 43.61 |
| 82 | THAN | 98191 | 43.75 |
| 83 | GO | 97502 | 43.89 |
| 84 | SEE | 96668 | 44.03 |
| 85 | FIRST | 94368 | 44.16 |
| 86 | INTO | 93185 | 44.29 |
| 87 | AFTER | 92906 | 44.42 |
| 88 | MAKE | 90948 | 44.55 |
| 89 | ALSO | 90876 | 44.68 |
| 90 | DON’T | 89626 | 44.81 |
| 91 | ITS | 89344 | 44.93 |
| 92 | ONLY | 88738 | 45.06 |
| 93 | THINK | 88419 | 45.18 |
| 94 | GOING | 88411 | 45.31 |
| 95 | OTHER | 88379 | 45.43 |
| 96 | LAST | 87048 | 45.56 |
| 97 | OVER | 86932 | 45.68 |
| 98 | THEN | 86313 | 45.80 |
| 99 | GREAT | 86130 | 45.92 |
| 100 | HIM | 84413 | 46.04 |
| 101 | MUCH | 83989 | 46.16 |
| 102 | BECAUSE | 83896 | 46.28 |
| 103 | US | 82785 | 46.39 |
| 104 | TOO | 80262 | 46.51 |
| 105 | TWO | 80171 | 46.62 |
| 106 | REALLY | 79907 | 46.73 |
| 107 | YEAR | 79137 | 46.85 |
| 108 | WAY | 78301 | 46.96 |
| 109 | COULD | 78135 | 47.07 |
| 110 | TODAY | 77618 | 47.18 |
| 111 | GOT | 76058 | 47.28 |
| 112 | WELL | 75710 | 47.39 |
| 113 | EVEN | 75676 | 47.50 |
| 114 | WANT | 74811 | 47.60 |
| 115 | WORK | 73652 | 47.71 |
| 116 | DID | 73168 | 47.81 |
| 117 | STILL | 71805 | 47.91 |
| 118 | RIGHT | 71564 | 48.01 |
| 119 | HERE | 69057 | 48.11 |
| 120 | THANKS | 68862 | 48.21 |
| 121 | OFF | 68191 | 48.30 |
| 122 | NEED | 68124 | 48.40 |
| 123 | WHERE | 67947 | 48.50 |
| 124 | AM | 66933 | 48.59 |
| 125 | VERY | 65888 | 48.68 |
| 126 | YEARS | 64938 | 48.77 |
| 127 | MOST | 64873 | 48.87 |
| 128 | ANY | 64872 | 48.96 |
| 129 | BEFORE | 62347 | 49.05 |
| 130 | THOSE | 62226 | 49.13 |
| 131 | MANY | 62094 | 49.22 |
| 132 | RT | 61947 | 49.31 |
| 133 | DOWN | 61819 | 49.40 |
| 134 | LIFE | 61814 | 49.48 |
| 135 | SAY | 60447 | 49.57 |
| 136 | SHOULD | 60198 | 49.65 |
| 137 | TAKE | 59945 | 49.74 |
| 138 | BEING | 59206 | 49.82 |
| 139 | THESE | 58721 | 49.90 |
| 140 | COME | 58235 | 49.99 |
Now, let’s see how many words are needed to cover larger fractions of the dataset:
top90<-unigrams[unigrams$Cumulative_frequency_Percent<=90,]
nrow(top90)
## [1] 8445
top95<-unigrams[unigrams$Cumulative_frequency_Percent<=95,]
nrow(top95)
## [1] 22188
top99<-unigrams[unigrams$Cumulative_frequency_Percent<=99,]
nrow(top99)
## [1] 203398
Let’s look at the last tokens in the top-95% list:
tail(top95)
## ranking word frequency Cumulative_frequency_Percent
## 22183 22183 HYPOCRITE 120 95
## 22184 22184 INCARNATE 120 95
## 22185 22185 INTENDING 120 95
## 22186 22186 INTIMIDATE 120 95
## 22187 22187 IRWIN 120 95
## 22188 22188 JARGON 120 95
They seem to be fairly uncommon words, but not particularly strange either.
Here is a plot of the 95% coverage against the number of words used. You can zoom and hover on it to see each word with its absolute frequency in the dataset.
Next, I created a dataframe with all the digrams (combinations of 2 words/tokens) that can be found in the training dataset, together with their frequencies. Once again, I worked in chunks to avoid exceeding my laptop’s memory limits.
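As an illustration, the digram counts for a single chunk can be obtained along these lines (a sketch assuming the chunk’s token vector is called `tokens`; the per-chunk tables are then merged and re-aggregated):

library(dplyr)

# Pair each token with the one that follows it, then count each distinct pair
digrams_chunk <- data.frame(V1 = head(tokens, -1),
                            V2 = tail(tokens, -1)) %>%
  count(V1, V2, name = "Frequency")

# After processing all chunks, the per-chunk counts are combined like this:
# digrams <- bind_rows(all_chunk_counts) %>%
#   group_by(V1, V2) %>%
#   summarise(Frequency = sum(Frequency), .groups = "drop") %>%
#   arrange(desc(Frequency))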
Let’s explore the digram distribution as we did with unigrams:
digrams <- readRDS("digrams.RDS")
nrow(digrams)
## [1] 11459841
top50<-digrams[digrams$Cumulative_frequency_Percent<=50,]
nrow(top50)
## [1] 40516
We can see that there are more than 11 million unique digrams, and about 41,000 of them account for 50% of the total occurrences. Here are the top 100 2-grams:
| ranking | V1 | V2 | Frequency | Cumulative_frequency_Percent |
|---|---|---|---|---|
| 1 | OF | THE | 300736 | 0.44 |
| 2 | IN | THE | 284987 | 0.86 |
| 3 | TO | THE | 148713 | 1.08 |
| 4 | FOR | THE | 140432 | 1.29 |
| 5 | ON | THE | 136884 | 1.49 |
| 6 | TO | BE | 113296 | 1.66 |
| 7 | AT | THE | 99540 | 1.80 |
| 8 | AND | THE | 87669 | 1.93 |
| 9 | IN | A | 83212 | 2.06 |
| 10 | WITH | THE | 74029 | 2.17 |
| 11 | IS | A | 70486 | 2.27 |
| 12 | IT | WAS | 67212 | 2.37 |
| 13 | FOR | A | 65806 | 2.46 |
| 14 | FROM | THE | 60926 | 2.55 |
| 15 | I | HAVE | 60117 | 2.64 |
| 16 | I | WAS | 59928 | 2.73 |
| 17 | IT | IS | 57461 | 2.82 |
| 18 | WITH | A | 57232 | 2.90 |
| 19 | AND | I | 57191 | 2.98 |
| 20 | WILL | BE | 56720 | 3.07 |
| 21 | GOING | TO | 55773 | 3.15 |
| 22 | OF | A | 55480 | 3.23 |
| 23 | I | AM | 53665 | 3.31 |
| 24 | IS | THE | 51745 | 3.39 |
| 25 | HAVE | A | 51398 | 3.46 |
| 26 | IF | YOU | 50826 | 3.54 |
| 27 | ONE | OF | 50796 | 3.61 |
| 28 | IN | <NUMBER> | 49429 | 3.69 |
| 29 | TO | GET | 49192 | 3.76 |
| 30 | AS | A | 48333 | 3.83 |
| 31 | WANT | TO | 44509 | 3.90 |
| 32 | HAVE | TO | 43004 | 3.96 |
| 33 | BY | THE | 42737 | 4.02 |
| 34 | THAT | THE | 42561 | 4.08 |
| 35 | THIS | IS | 41040 | 4.14 |
| 36 | TO | DO | 40850 | 4.20 |
| 37 | AND | A | 40693 | 4.26 |
| 38 | I | THINK | 40671 | 4.32 |
| 39 | THE | FIRST | 40051 | 4.38 |
| 40 | WAS | A | 39569 | 4.44 |
| 41 | OUT | OF | 39272 | 4.50 |
| 42 | TO | A | 38735 | 4.56 |
| 43 | THAT | I | 37897 | 4.61 |
| 44 | TO | SEE | 37677 | 4.67 |
| 45 | ON | A | 37461 | 4.72 |
| 46 | ALL | THE | 35778 | 4.78 |
| 47 | BUT | I | 35633 | 4.83 |
| 48 | I | LOVE | 35185 | 4.88 |
| 49 | THE | SAME | 34613 | 4.93 |
| 50 | HAVE | BEEN | 33537 | 4.98 |
| 51 | TO | MAKE | 33429 | 5.03 |
| 52 | A | LOT | 33312 | 5.08 |
| 53 | YOU | CAN | 33276 | 5.13 |
| 54 | BE | A | 32775 | 5.18 |
| 55 | HE | WAS | 31976 | 5.22 |
| 56 | THANKS | FOR | 31384 | 5.27 |
| 57 | OF | MY | 31226 | 5.32 |
| 58 | NEED | TO | 30965 | 5.36 |
| 59 | HAS | BEEN | 30842 | 5.41 |
| 60 | A | FEW | 30541 | 5.45 |
| 61 | WOULD | BE | 30441 | 5.50 |
| 62 | YOU | ARE | 30439 | 5.54 |
| 63 | I | DON’T | 30417 | 5.59 |
| 64 | MORE | THAN | 29993 | 5.63 |
| 65 | IN | MY | 29622 | 5.67 |
| 66 | AS | THE | 29410 | 5.72 |
| 67 | ABOUT | THE | 29296 | 5.76 |
| 68 | WHEN | I | 29229 | 5.80 |
| 69 | YOU | HAVE | 28714 | 5.85 |
| 70 | A | GREAT | 28692 | 5.89 |
| 71 | TO | GO | 28627 | 5.93 |
| 72 | I | CAN | 28457 | 5.97 |
| 73 | I | HAD | 28436 | 6.01 |
| 74 | A | LITTLE | 28376 | 6.06 |
| 75 | THE | BEST | 27857 | 6.10 |
| 76 | TO | HAVE | 27729 | 6.14 |
| 77 | HE | SAID | 27510 | 6.18 |
| 78 | A | GOOD | 27431 | 6.22 |
| 79 | THANK | YOU | 27375 | 6.26 |
| 80 | I | KNOW | 27104 | 6.30 |
| 81 | HAD | A | 26915 | 6.34 |
| 82 | INTO | THE | 26858 | 6.38 |
| 83 | THEY | ARE | 26581 | 6.42 |
| 84 | WE | ARE | 26376 | 6.46 |
| 85 | I | JUST | 25652 | 6.49 |
| 86 | THERE | IS | 25623 | 6.53 |
| 87 | <NUMBER> | PERCENT | 24704 | 6.57 |
| 88 | IS | NOT | 24589 | 6.60 |
| 89 | THAT | IS | 24318 | 6.64 |
| 90 | A | NEW | 23452 | 6.68 |
| 91 | THE | NEW | 23355 | 6.71 |
| 92 | THERE | ARE | 23305 | 6.74 |
| 93 | SO | I | 23254 | 6.78 |
| 94 | THE | MOST | 23240 | 6.81 |
| 95 | THE | <NUMBER> | 23074 | 6.85 |
| 96 | OVER | THE | 23062 | 6.88 |
| 97 | THE | WORLD | 22987 | 6.91 |
| 98 | WE | HAVE | 22963 | 6.95 |
| 99 | I | WILL | 22877 | 6.98 |
| 100 | LIKE | A | 22644 | 7.02 |
And here is a plot with the top 1000 and cumulative frequency on the y axis:
The number of 3-grams is so large that my laptop could not retrieve and count the frequencies of all of them in a single file. Also, as shown in the Possible Models section, prediction is much faster when the data are split into one dataframe per letter.
So, after an initial processing in chunks, I divided each chunk into 27 dataframes: one for each initial letter of the first word in the 3-gram, plus one for trigrams beginning with “<”, which denotes my custom tags (while doing so, I discarded a small percentage of unusual tokens beginning with other characters). As these data will be the starting point for building the model, I sorted them by decreasing frequency, to make queries on the final dataframes more efficient.
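Here is a rough sketch of the per-letter split (the object and file names are assumptions; the actual code, which also works chunk by chunk, is in the Milestone.Rmd file):

# 'trigrams_all' is assumed to hold the columns V1, V2, V3 and Frequency
first_char <- substr(trigrams_all$V1, 1, 1)

# Keep only trigrams whose first word starts with a letter or with "<" (the custom tags)
keep <- first_char %in% c(LETTERS, "<")
trigrams_all <- trigrams_all[keep, ]
first_char   <- first_char[keep]

# One dataframe per initial character (27 in total), sorted by decreasing frequency
by_letter <- split(trigrams_all, first_char)
by_letter <- lapply(by_letter, function(x) x[order(-x$Frequency), ])

for (i in seq_along(by_letter)) {
  saveRDS(by_letter[[i]], file.path("models/merged", paste0("trigrams_", i, ".RDS")))
}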
Let’s look at this huge collection of 3-grams:
temp<-list.files("models/merged",full.names=T)
trigrams<-lapply(temp, readRDS)
sum(sapply(trigrams, nrow))
## [1] 33539777
There are nearly 34 million different combinations of 3 words in the training dataset. Let’s plot the 1000 most frequent against their frequency and table the first 100 of them.
| V1 | V2 | V3 | Frequency | ranking |
|---|---|---|---|---|
| ONE | OF | THE | 24251 | 1 |
| A | LOT | OF | 20971 | 2 |
| THANKS | FOR | THE | 16657 | 3 |
| TO | BE | A | 12635 | 4 |
| GOING | TO | BE | 12155 | 5 |
| THE | END | OF | 10425 | 6 |
| OUT | OF | THE | 10372 | 7 |
| I | WANT | TO | 10331 | 8 |
| IT | WAS | A | 9970 | 9 |
| AS | WELL | AS | 9611 | 10 |
| SOME | OF | THE | 9501 | 11 |
| BE | ABLE | TO | 9127 | 12 |
| MORE | THAN | <NUMBER> | 8681 | 13 |
| PART | OF | THE | 8614 | 14 |
| I | HAVE | A | 8234 | 15 |
| THE | REST | OF | 7892 | 16 |
| I | HAVE | TO | 7822 | 17 |
| LOOKING | FORWARD | TO | 7807 | 18 |
| THE | FIRST | TIME | 7199 | 19 |
| THANK | YOU | FOR | 7112 | 20 |
| IS | GOING | TO | 7081 | 21 |
| A | COUPLE | OF | 7005 | 22 |
| THIS | IS | A | 6841 | 23 |
| I | NEED | TO | 6616 | 24 |
| THERE | IS | A | 6550 | 25 |
| END | OF | THE | 6472 | 26 |
| YOU | WANT | TO | 6403 | 27 |
| YOU | HAVE | TO | 6396 | 28 |
| I | LOVE | YOU | 6378 | 29 |
| THE | FACT | THAT | 6352 | 30 |
| <NUMBER> | TO | <NUMBER> | 6245 | 31 |
| <NUMBER> | PERCENT | OF | 6061 | 32 |
| IN | THE | WORLD | 6032 | 33 |
| ONE | OF | MY | 6009 | 34 |
| TO | GO | TO | 5882 | 35 |
| CAN’T | WAIT | TO | 5880 | 36 |
| IT | WOULD | BE | 5866 | 37 |
| THIS | IS | THE | 5854 | 38 |
| I | DON’T | KNOW | 5802 | 39 |
| AT | THE | END | 5759 | 40 |
| FOR | THE | FIRST | 5720 | 41 |
| IS | ONE | OF | 5618 | 42 |
| IT | IS | A | 5567 | 43 |
| TO | HAVE | A | 5536 | 44 |
| THERE | IS | NO | 5499 | 45 |
| FOR | THE | FOLLOW | 5486 | 46 |
| IN | THE | FIRST | 5481 | 47 |
| I’M | GOING | TO | 5470 | 48 |
| MOST | OF | THE | 5400 | 49 |
| ACCORDING | TO | THE | 5278 | 50 |
| YOU | HAVE | A | 5250 | 51 |
| ALL | OF | THE | 5248 | 52 |
| IN | FRONT | OF | 5238 | 53 |
| THE | UNITED | STATES | 5029 | 54 |
| TO | BE | THE | 5019 | 55 |
| OF | THE | YEAR | 4967 | 56 |
| I | HAD | A | 4921 | 57 |
| IF | YOU | ARE | 4913 | 58 |
| I | HAD | TO | 4881 | 59 |
| REST | OF | THE | 4810 | 60 |
| I | THINK | I | 4808 | 61 |
| OF | THE | DAY | 4692 | 62 |
| BACK | TO | THE | 4664 | 63 |
| I | HAVE | BEEN | 4664 | 64 |
| I | WANTED | TO | 4645 | 65 |
| TO | MAKE | A | 4636 | 66 |
| HAVE | A | GREAT | 4635 | 67 |
| <NUMBER> | AND | <NUMBER> | 4562 | 68 |
| IT | WILL | BE | 4505 | 69 |
| WANT | TO | BE | 4465 | 70 |
| IN | ORDER | TO | 4440 | 71 |
| WHEN | I | WAS | 4365 | 72 |
| TO | SEE | THE | 4350 | 73 |
| AS | MUCH | AS | 4336 | 74 |
| I | FEEL | LIKE | 4320 | 75 |
| IN | <NUMBER> | AND | 4242 | 76 |
| IF | YOU | HAVE | 4230 | 77 |
| IN | THE | PAST | 4180 | 78 |
| TO | GET | A | 4157 | 79 |
| AT | THE | SAME | 4143 | 80 |
| ARE | GOING | TO | 4133 | 81 |
| ONE | OF | THOSE | 4132 | 82 |
| TO | DO | WITH | 4105 | 83 |
| I | DON’T | THINK | 4092 | 84 |
| I | WILL | BE | 4086 | 85 |
| HAVE | TO | BE | 4061 | 86 |
| OF | THE | MOST | 4037 | 87 |
| AT | THE | TIME | 4031 | 88 |
| WAS | GOING | TO | 4012 | 89 |
| IF | YOU | WANT | 4006 | 90 |
| WE | NEED | TO | 3986 | 91 |
| THE | SAME | TIME | 3979 | 92 |
| TO | SEE | YOU | 3972 | 93 |
| A | BIT | OF | 3955 | 94 |
| THERE | WAS | A | 3936 | 95 |
| OF | THE | <NUMBER> | 3920 | 96 |
| I | AM | NOT | 3862 | 97 |
| LET | ME | KNOW | 3860 | 98 |
| WOULD | LIKE | TO | 3844 | 99 |
| IN | THE | MIDDLE | 3822 | 100 |
For the modeling task, I have not used any specialized NLP package for the moment. Basically, I am testing my idea that a good approach would be to associate, in one or more dataframes, all the observed sequences of words (of a given length) with the most frequent next word. These dataframes would then be filtered on the input words, and the output would be the content of the last column.
Here are some lines of code that roughly estimate the speed and memory requirements of various approaches. Note that when I wrote this code I had not yet finished creating the dataframes, so I simply used some available dataframes with the same number of columns and (roughly) the same number of rows expected in the real ones. That is why the prediction outputs are meaningless.
unigrams<-read.csv("unigrams.csv")
unigrams<-select(unigrams, words, Freq)
a<-Sys.time()
output<-filter(unigrams, words==toupper("friends")) %>% select(Freq)
b<-Sys.time()
b-a
## Time difference of 0.040766 secs
print(object.size(unigrams), units="Mb")
## 57 Mb
digrams <- readRDS("digrams.RDS")
a<-Sys.time()
output<-as.character(filter(digrams, V1==toupper("best") & V2==toupper("friends")))
b<-Sys.time()
b-a
## Time difference of 11.66452 secs
print(object.size(digrams), units="Mb")
## 445.5 Mb
a<-readRDS("digrams.RDS")
# Keep only digrams whose first word starts with a letter
i<-grep("^[A-Z]", a$V1)
a<-a[i,]
# Extract the first letter of the first word and use it to split
# the dataframe into one smaller dataframe per initial letter
init<-strsplit(a$V1,"")
init<-unlist(sapply(init, function(x) x[1]))
a$factor<-as.factor(init)
l<-split(a, a$factor)
l<-lapply(l, function(x) x[,1:3])
a<-Sys.time()
v1=toupper("best")
v2=toupper("friends")
output<-as.character(filter(l[[substr(v1,1,1)]], V1==v1 & V2==v2))
b<-Sys.time()
b-a
## Time difference of 2.76389 secs
print(object.size(l), units="Mb")
## 503.5 Mb
The last solution seems to be the best, assuming that prediction from 2 words is more accurate than prediction from 1 word.
For out-of-vocabulary (OOV) pairs of words, my algorithm will fall back to prediction from 1 word. If not even the previous word alone is in the dictionary, the algorithm will predict the most frequent word in the dataset (“THE”).
For the dataframes to be used by the prediction function, I selected, for each pair of words, the most frequent trigram beginning with that pair. When 2 trigrams occurred the same number of times, I chose the one whose third word is more frequent in the training dataset, with the help of the list of unigrams created earlier. The same procedure was applied to the digrams dataframe.
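In dplyr terms, the selection just described can be sketched as follows (assuming the full trigram counts are in `trigrams_df`, with columns V1, V2, V3 and Frequency, and the unigram table is loaded as `unigrams`):

library(dplyr)

prediction_trigrams <- trigrams_df %>%
  # unigram frequency of the candidate third word, used only to break ties
  left_join(select(unigrams, words, unigram_freq = Freq),
            by = c("V3" = "words")) %>%
  group_by(V1, V2) %>%
  arrange(desc(Frequency), desc(unigram_freq), .by_group = TRUE) %>%
  slice(1) %>%            # keep the single best continuation for each pair of words
  ungroup() %>%
  select(V1, V2, V3)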
At this point, I discovered that these generated dataframes, even though they have very similar dimensions, take up far more MB than the ones I used in the previous simulations. This has a big impact on speed, and would probably create problems with the Shiny server memory limits. For this reason, I created a dictionary associating each unique token with a number, and then created numeric versions of the trigram and digram dataframes. Probably because integer codes take up much less memory than character strings, these new dataframes are much lighter (about 1/12 of the size in MB).
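A minimal sketch of the numeric coding, under the assumption that the selected digrams are in a dataframe `digrams_pred` with columns V1 (previous word) and V2 (predicted word); the object name `dicvec` is the one used by the prediction function below:

# Dictionary: a named integer vector mapping every distinct token to a code
all_words <- unique(c(digrams_pred$V1, digrams_pred$V2))
dicvec <- seq_along(all_words)
names(dicvec) <- all_words

# Numeric version of the digram dataframe (the same coding is applied
# to the 27 per-letter trigram dataframes)
digrams_coded <- data.frame(code.x = dicvec[digrams_pred$V1],
                            code.y = dicvec[digrams_pred$V2])

saveRDS(dicvec, "df/dicvec.RDS")
saveRDS(digrams_coded, "df/digrams_coded.RDS")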
Finally, I want to show you the prediction function, which consists of a few lines and requires 3 files to be loaded: a dictionary, the 2-grams dataframe in its numerical version and a list with the 27 3-grams dataframes in their numerical version, for a total size of 181 MB (the free Shiny server RAM limit is 1 GB):
dicvec<-readRDS("df/dicvec.RDS")
digrams<-readRDS("df/digrams_coded.RDS")
m<-readRDS("df/coded.RDS")
print(object.size(list(m, digrams, dicvec)), units="Mb")
## 181.3 Mb
word_predict<-function(a,b) {
  # a = second-to-last word typed, b = last word typed
  v2<-dicvec[toupper(b)]   # numeric code of the last word
  v1<-toupper(a)
  if(!(substr(v1,1,1) %in% names(m))) {
    # no trigram dataframe for this initial letter: fall back to digrams
    w<-as.integer(filter(digrams, code.x==v2)$code.y)
  }
  else {
    l<-m[[substr(v1,1,1)]]  # trigram dataframe for the initial letter of the first word
    v1<-dicvec[v1]          # numeric code of the first word
    w<-as.integer(filter(l, V1==v1 & V2==v2)$V3)
  }
  if(length(w)==0) {
    # pair of words not found: fall back to prediction from the last word only
    w<-as.integer(filter(digrams, code.x==v2)$code.y)
  }
  if(length(w)==0) {w<-743}  # still nothing: code 743 corresponds to "THE", the most frequent word
  names(dicvec[w])
}
The final step is to test the prediction algorithm on the validation dataset: even though we are in a field where there is no single “correct answer”, measuring the probability of guessing the next word typed on a large collection of blogs, newspapers and tweets will be useful for comparing this first model with the other models I will build during the project, or with someone else’s model.
For this task, I processed the validation dataset into trigrams, following the same steps used for the training dataset.
Note that all lines in the validation dataset had already undergone profanity filtering, tagging and tokenisation, but these processing steps will have to be included in the final prediction model, so that the user will be able to input numbers, web links, etc. and get the expected output.
validation<-readRDS("validation.RDS")
nrow(validation)
## [1] 9266761
As the number of rows is too big to be tested in a reasonable amount of time, I only used a subsample of 2 million trigrams (21.6% of the total); testing completed in about 19 hours. The first 2 words of each trigram were passed to the prediction function, and the output was compared with the third word of the trigram.
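A sketch of the testing loop (the random seed and the exact sampling code are assumptions; `word_predict` is the function shown above):

# Draw the 2-million-trigram subsample and run the prediction function on it
set.seed(4321)
validation <- validation[sample(nrow(validation), 2e6), ]

a <- Sys.time()
validation$prediction <- mapply(word_predict, validation$V1, validation$V2,
                                USE.NAMES = FALSE)
b <- Sys.time()

# TRUE when the predicted word matches the word actually typed
validation$Correct_prediction <- validation$prediction == validation$V3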
I then calculated the accuracy (percentage of words correctly guessed) and the mean prediction time. In the table you can see a few lines of the testing output (column “V3” contains the expected word, column “prediction” contains the output of the algorithm).
round(sum(validation$Correct_prediction)/nrow(validation)*100,2) #Accuracy %
## [1] 16.44
round(difftime(b,a, units="sec")/nrow(validation),4) # Mean time for prediction
## Time difference of 0.0334 secs
| row | V1 | V2 | V3 | prediction | Correct_prediction |
|---|---|---|---|---|---|
| 2439408 | THE | BCS | IN | NATIONAL | FALSE |
| 3586079 | 8TH | PLACE | EPL | TO | FALSE |
| 706051 | ACTUALLY | GET | MARRIED | TO | FALSE |
| 5107415 | IF | SHE | WANTS | WAS | FALSE |
| 5023765 | YIKES | HOPE | NO | THEY | FALSE |
| 5621073 | THOSE | SELF-APPOINTED | SELF-CENTERED | JUDGES | FALSE |
| 5525529 | NEW | BLACK | EYED | PANTHER | FALSE |
| 346263 | CHAMBERLAIN’S | ALLEGED | CRIME | THAT | FALSE |
| 8222333 | ARE | HAND | MILLED | MADE | FALSE |
| 3668094 | HAVE | AN | EVEN | AWESOME | FALSE |
| 6042247 | MAYBE | CONFUSED | FOR | PEOPLE | FALSE |
| 1771400 | LAST | LONG | PERIODS | AND | FALSE |
| 6613627 | SHOW | IS | THIS | A | FALSE |
| 803419 | IS | THERE | AT | A | FALSE |
| 339601 | AS | I | HAD | WAS | FALSE |
| 1001783 | NOON | TIME | CHALLENGE | CHALLENGES | FALSE |
| 2165787 | IS | PUT | AGAINST | ON | FALSE |
| 2755506 | PUBLIC | ENGAGEMENT | PHASE | PANEL | FALSE |
| 149725 | STATE | EDUCATION | DEPARTMENT | COMMISSIONER | FALSE |
| 9020866 | SATCHMO | AND | REDFORD | THE | FALSE |
| 4383286 | OUR | FRIENDS | PUMPED | AND | FALSE |
| 9180713 | BY | DOREEN | CRONIN | VIRTUE | FALSE |
| 4582374 | WHATS | UP | WITH | WITH | TRUE |
| 3372028 | BUCK | UP | LADIES | AND | FALSE |
| 4414302 | THE | NEW | YEAR | YORK | FALSE |
| 1371458 | DADA | WHICH | MAKES | WAS | FALSE |
| 9174078 | HAD | THREE | HITS | HITS | TRUE |
| 9022970 | THE | TOP | THE | OF | FALSE |
| 8846340 | OFF | OUR | WINDOWS | NEW | FALSE |
| 1446043 | IN | SIX | DIFFERENT | GAMES | FALSE |
| 4287981 | ME | FIGURE | OUT | OUT | TRUE |
| 6618228 | THE | BATTER | IS | INTO | FALSE |
| 7426844 | FROM | ANYTHING | LIKE | THAT | FALSE |
| 1262357 | CONVENTION | CENTER | THIS | AND | FALSE |
| 3645871 | YOU | BETTER | FOLLOW | BE | FALSE |
| 256687 | IN | YOUR | VOTE | LIFE | FALSE |
| 4727858 | IN | #GIVEBACK | VIA | THE | FALSE |
| 7342637 | SMILING | EAR-TO-EAR | MATHENY | SMILE | FALSE |
| 8573918 | RAVI | AND | WEI | WEI | TRUE |
| 5852593 | CAME | TO | SAMPLE | THE | FALSE |
| 5639100 | SOARING | ARTISTIC | AND | DIRECTOR | FALSE |
| 5317045 | DOG | A | LAXATIVE | BATH | FALSE |
| 1403311 | MAYOR | STANLEY | IACONO | KOVACH | FALSE |
| 7564059 | PARTY | FOR | YOUR | THE | FALSE |
| 514358 | BOOMERS | SAY | THEY | THEY | TRUE |
| 4155317 | EVER | SEEN | SOMETHING | IN | FALSE |
| 1718265 | ABOUT | THE | SAME | SAME | TRUE |
| 2785097 | ON | THE | SWEET | OTHER | FALSE |
| 7300650 | HES | FREAKIN | BEAST | ANOYIN | FALSE |
| 5862186 | RAISE | EVEN | MORE | A | FALSE |
| 4307292 | THINK | SHE | IS | IS | TRUE |
| 3695298 | OFF | FROM | FASHION | THE | FALSE |
| 282230 | OUT | TOYS | AND | FOR | FALSE |
| 8992478 | THE | SEARCH | WHO | FOR | FALSE |
| 7671088 | BREATHED | THE | ESSENCE | SAME | FALSE |
| 47192 | TOO | I | CAN | THINK | FALSE |
| 7959570 | THE | MUSEUM | MIGHT | OF | FALSE |
| 8005220 | A | SPOT | WHERE | IN | FALSE |
| 3236906 | IS | SO | CUTE | MUCH | FALSE |
| 3799121 | OPTIMISTIC | I’M | GOING | NOT | FALSE |
| 9243181 | BUT | EVEN | AFTER | IF | FALSE |
| 1344830 | WAS | JUST | PLAYING | A | FALSE |
| 2427489 | MY | KNOWLEDGE | OF | OF | TRUE |
| 1960246 | CITY | MUSIC | HALL | HALL | TRUE |
| 2440545 | BARRAULT | LES | PERLES | PAUL | FALSE |
| 5228277 | I | USE | MY | TO | FALSE |
| 4913660 | AND | DON’T | HAVE | FORGET | FALSE |
| 2481072 | HE | SAID | AS | HE | FALSE |
| 6798036 | HAVE | TO | CHANGE | BE | FALSE |
| 1463593 | LET | ME | SAY | KNOW | FALSE |
| 6597700 | OF | PAGES | HE | OF | FALSE |
| 8309101 | ESCAPE | OUR | DEMONS | SELFISH | FALSE |
| 5047135 | NCM | TOMORROW | JULY | NIGHT | FALSE |
| 1782946 | THAT | I | WOULD | HAVE | FALSE |
| 7375785 | THE | TOP-RANKED | AMERICAN | DJOKOVIC | FALSE |
| 8808609 | BRAND | NEW | HOMES | AND | FALSE |
| 2179250 | MEAGER | LIST | PLEASE | OF | FALSE |
| 8864278 | WITH | AUSTRALIAN | ACCENTS | PRONUNCIATION | FALSE |
| 7495723 | ARE | OPPORTUNITIES | FOR | TO | FALSE |
| 7617104 | UNJUST | AND | UNFAIR | UNREASONABLE | FALSE |
| 1359806 | AREN’T | YOU | ANSWERING | GOING | FALSE |
| 5999857 | THREE | DIFFERENT | CHARACTERS | PEOPLE | FALSE |
| 9222662 | WAS | IN | MAGNIFICENT | THE | FALSE |
| 5955181 | SPOT | SERVING | CHINESE-AMERICAN | SPANISH | FALSE |
| 703186 | SECOND | HALF | AND | OF | FALSE |
| 2379400 | REALLY | GOOD | SAID | AT | FALSE |
| 951439 | ACCESS | TO | THE | THE | TRUE |
| 9179059 | TO | A | FIRST | NEW | FALSE |
| 7597514 | UP | BUT | AT | I | FALSE |
| 3761509 | WAS | A | PAPER-THIN | GOOD | FALSE |
| 3113355 | TIME | TO | ALERT | GET | FALSE |
| 2849006 | LIKE | THIS | FEELING | ONE | FALSE |
| 6315233 | CHIEF | STRATEGIST | FOR | FOR | TRUE |
| 5007950 | LIGHT | CRAMPING | I | I | TRUE |
| 8685323 | IMPLEMENTED | FROM | THEIR | THE | FALSE |
| 1455771 | EVENTUALLY | DECIDE | TO | TO | TRUE |
| 7591326 | EXTEND | THE | PROJECT | LIFE | FALSE |
| 1095610 | THEY | WENT | OUT | TO | FALSE |
| 2009789 | UNICORN | AND | A | SCHMENDRICK | FALSE |
| 7199250 | DREAM | WERE | THE | REALITY | FALSE |
I am quite satisfied with this first attempt. With a runtime of 0.033 seconds per prediction, I think I will be able to build a Shiny app that does not require a button to start the computation, but makes predictions continuously, just like a smartphone keyboard does. I think this will still be possible after adding input tokenisation, profanity filtering and hopefully some extra computation to improve accuracy.
On the other hand, the accuracy is not very high (16.44%). I do not know what a reasonable benchmark would be, but observing my mobile keyboard in my native language, it correctly guesses the word I want to type about once every 3 or 4 times. So, I want to try to raise the accuracy to at least 25%.
Some of my ideas:
I also thought about another approach, which is perhaps closer to the one suggested in the Capstone project (which proposes to manage OOV by smoothing probabilities, something quite different from my solution to OOV).
This approach would rely on a dictionary and on 1, 2, 3 or more dataframes (depending on the number of preceding words used for prediction), one per word position: each dataframe would contain a preceding word in the 1st column, a possible next word in the 2nd and the corresponding probability in the 3rd.
The algorithm would then sum up the probabilities for each possible predicted word (possibly with weights giving more importance to the nearest words) and choose the most probable one.
To manage OOV words, each dataframe should also include, for each possible word in the 2nd column, a line with an “OOV” tag in the 1st column and a small (I still have to decide how small) probability in the 3rd.
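Purely as an illustration of this alternative idea (nothing here has been implemented), prediction from the last two words might look like the following, where `df1` and `df2` are the hypothetical dataframes for the last and the second-to-last word, each with columns `previous`, `candidate` and `prob`:

library(dplyr)

predict_weighted <- function(last_word, previous_word, w_last = 2, w_prev = 1) {
  # Fall back to the "OOV" row when a word is not in the dataframe
  key1 <- ifelse(last_word %in% df1$previous, last_word, "OOV")
  key2 <- ifelse(previous_word %in% df2$previous, previous_word, "OOV")

  c1 <- filter(df1, previous == key1)
  c2 <- filter(df2, previous == key2)

  # Weighted sum of the probabilities for each candidate next word,
  # giving more importance to the nearest word
  full_join(c1, c2, by = "candidate") %>%
    mutate(score = w_last * coalesce(prob.x, 0) + w_prev * coalesce(prob.y, 0)) %>%
    slice_max(score, n = 1, with_ties = FALSE) %>%
    pull(candidate)
}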
I would like to find out whether this solution is better or worse than mine in terms of memory, speed and accuracy, but that would mean restarting the work from just after tokenisation, so I am not sure I will do it (I will decide after reading my peers’ reports and looking at some natural language prediction resources).