Data Science Capstone Project
Jerome CHOLEWA
Sep 12th 2017
The app can be found on my Shiny server: https://jeromecholewa.shinyapps.io/Capstone_text_prediction/. It works in quite a simple way:
I used the prescribed corpus (corpora.heliohost.org), from which we were asked to consider the English blogs, news and Twitter feeds. I cleaned the data using the stringi package, which was fast enough that I could actually use the full corpus.
The data were partitioned with caret::createDataPartition.
The cleaning targeted non-alphanumeric characters ("[^[:alnum:]]") and words with triple letters ("[[:alpha:]]*([[:alpha:]])\\1{2,}[[:alpha:]]*", like weooo or aaann), which I considered unlikely to be intended words.
Periods ("[.]") were replaced by EOS. More on that on the next page…
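The cleaning rules above can be sketched in a few lines. This is only an illustration in base R, not the stringi code used for the app. The function name clean_text, the order of the steps, and the choice to replace (rather than delete) non-alphanumeric characters with a space are all assumptions.

```r
# Illustrative sketch only: base R stands in for stringi here,
# and the order of the steps is an assumption.
clean_text <- function(x) {
  # Drop words in which the same letter appears three or more times in a row.
  x <- gsub("[[:alpha:]]*([[:alpha:]])\\1{2,}[[:alpha:]]*", "", x, perl = TRUE)
  # Replace each period with the EOS (end of sentence) token.
  x <- gsub("[.]", " EOS ", x)
  # Replace every remaining non-alphanumeric character with a space.
  x <- gsub("[^[:alnum:]]", " ", x)
  # Collapse runs of whitespace and trim both ends.
  trimws(gsub("[[:space:]]+", " ", x))
}

clean_text("Weooo, that was aaann great game. See you!")
```

Handling the period before the generic non-alphanumeric rule matters in this sketch: in the opposite order the periods would already be gone when the EOS step runs.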
I realized that quanteda was much faster for my n-gram extraction needs than the tm package, which has other functions and advantages that I did not need or use.
If the user types 3 or more words, only the last 3 words are taken into account. Even if the user types fewer than 2 words, that is enough to start a prediction. I will describe my algorithm for a minimum of 3 words typed. The principle stays the same if fewer words are typed. All n-grams also have a count and hence a probability (percent) of occurrence.
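As a rough illustration of these two ideas, the base R sketch below keeps the last 3 typed words and turns n-gram counts into percentages. The helper names last_words and ngram_percent are hypothetical, and computing the percent as the count divided by the total count of n-grams of the same order is an assumption. The app itself extracted n-grams with quanteda, not with this code.

```r
# Keep only the last n words of the typed text (n = 3 in the app).
last_words <- function(text, n = 3) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  tail(words, n)
}

# Count the n-grams of a token vector and convert the counts to percentages.
# Assumption: percent = 100 * count / total number of n-grams of that order.
ngram_percent <- function(tokens, n) {
  if (length(tokens) < n) return(numeric(0))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  counts <- table(grams)
  pct <- 100 * as.numeric(counts) / sum(counts)
  names(pct) <- names(counts)
  sort(pct, decreasing = TRUE)
}

last_words("I would like to go")
ngram_percent(c("a", "b", "a", "b"), 2)
```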
The candidate next words, i.e. the last words of the n-grams found, are stored in possibles. If possibles is empty, the app gives the 6 most likely unigrams. If not, each of those possible words is ranked according to a score. I did not follow the KBO (Katz back-off) method, which I deemed (wrongly?) slow and hence impractical for 4-grams or 3-grams. The score formula is:
score(possibles[i]) = w4 x log(p4(possibles[i])) + w3 x log(p3(possibles[i])) + w2 x log(p2(possibles[i]))
where w4 is the weight for 4-grams, etc., and p4(possibles[i]) is the percent (probability) of the 4-gram that has been found with possibles[i] as its last word, etc.
In many cases possibles[i] will not be in one of the (usually rare) 4-grams found above. I arbitrarily assigned in that case:
p4(possibles[i]) = 10% * min(p4) (and the same rule for 3- and 2-grams)
I chose w4 = 0.5, w3 = 0.3 and w2 = 0.2, to give more weight to matched 4-grams.
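The scoring rule can be sketched as follows. This is a minimal base R illustration, not the app's code. The names score_candidates and lookup are hypothetical, the percentages in the example are invented, and the sketch assumes that each of p4, p3 and p2 is a named numeric vector (named by last word) containing at least one matched n-gram.

```r
# Score each candidate word from the percents of the matched n-grams.
# p4, p3, p2: named numeric vectors, names = last word of the matched n-gram.
score_candidates <- function(possibles, p4, p3, p2,
                             w = c(w4 = 0.5, w3 = 0.3, w2 = 0.2)) {
  # Candidates absent from a level get 10% of that level's smallest percent.
  lookup <- function(p) {
    found <- p[possibles]
    found[is.na(found)] <- 0.1 * min(p)
    unname(found)
  }
  scores <- w[["w4"]] * log(lookup(p4)) +
    w[["w3"]] * log(lookup(p3)) +
    w[["w2"]] * log(lookup(p2))
  names(scores) <- possibles
  sort(scores, decreasing = TRUE)
}

# Invented percentages, for illustration only.
p4 <- c(day = 2.0)
p3 <- c(day = 1.0, time = 0.5)
p2 <- c(day = 0.4, time = 0.8, deal = 0.2)
score_candidates(c("day", "time", "deal"), p4, p3, p2)
```

In this example "day" ranks first because it is the only candidate found in a 4-gram, and the 4-gram level carries the largest weight.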