Data Science Capstone Project

Text Prediction App Presentation

Jerome CHOLEWA
Sep 12th 2017

 


Brief description of the app

The app can be found on my Shiny server: https://jeromecholewa.shinyapps.io/Capstone_text_prediction/. It works in a fairly simple way:

  • At the beginning, give some time for the app to load (look at the progress bar)
  • Then start inputting text
  • The algorithm takes into account only the last 3 words typed (see the sketch after this list).
    • so context information is not taken into account
    • grammar is not taken into account
    • it uses a database of 4-grams, trigrams, bigrams and unigrams
    • the database construction principle is detailed in a subsequent slide
    • If fewer than 3 words are entered, it will not search through the 4-grams
    • If fewer than 2 words are entered, it will not search through the trigrams, etc…
  • Wait a few seconds (some proposals might appear in gray), then final proposals appear in black
  • the first proposal has the highest likelihood according to my algorithm
    • the calculation of the highest likelihood is explained in a subsequent slide.
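
As an illustration of the last-3-words rule, here is a minimal sketch of how the typed text can be reduced to the tokens that are actually looked up. The helper name last_words and the examples are mine, not necessarily what the app uses:

     library(stringi)

     # Hypothetical helper: keep only the last (up to) n cleaned words of the input.
     # Only these words are looked up in the 4-gram / trigram / bigram tables.
     last_words <- function(input, n = 3) {
       words <- stri_split_regex(stri_trans_tolower(input), "\\s+")[[1]]
       words <- words[words != ""]
       tail(words, n)
     }

     last_words("I would like to meet")   # "like" "to" "meet"  -> searched in the 4-grams
     last_words("hello there")            # "hello" "there"     -> trigrams at most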

Cleaning the data

I used the prescribed corpus from corpora.heliohost.org, of which we were asked to consider the English blogs, news and Twitter feeds. I cleaned the data with the stringi package, which is so fast that I could use the full corpus. A sketch of the cleaning steps appears after the list below.

  • Almost at the end of my work, I realized I had not set aside a portion of the corpus for testing. No big deal: any text from other sources can be used for testing, including the quizzes.
  • Only when I extracted the 4-grams did I split the corpus in 4 parts and set one of them aside (I used caret::createDataPartition)
  • Before removing punctuation, I removed tweet hashtags (#xxx), web site addresses (http or https:…) and symbols or smileys
  • put everything in lower case
  • removed profanity (after all, you don't want the algorithm to suggest a bad word while you are writing to your boss and you do not catch the unintended correction). I used two sources of bad words, bannedwordlist.com and Google, which I combined into one list.
  • I removed numbers (but kept “1st”, “2nd” and “3rd”), non-alphanumeric characters "[^[:alnum:]]" and words with triple letters "[[:alpha:]]*([[:alpha:]])\\1{2,}[[:alpha:]]*" (like weooo or aaann), which I considered unlikely to be intended words.
  • I realized that I should have kept the period [.] and replaced it by EOS. More on that on the next page…
  • In the cleaned corpus, I reconstructed contractions like I'll, we've, … using this list
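
Here is a condensed sketch of this stringi-based cleaning. The function name, the regular expressions and the badwords argument are illustrative; the actual cleaning script applies the same ideas but may differ in detail:

     library(stringi)

     clean_text <- function(x, badwords = character(0)) {
       x <- stri_trans_tolower(x)
       # drop URLs, hashtags and Twitter handles before touching punctuation
       x <- stri_replace_all_regex(x, "(https?://\\S+)|(www\\.\\S+)|#\\S+|@\\S+", " ")
       # drop words containing a letter repeated 3+ times (like "weooo" or "aaann")
       x <- stri_replace_all_regex(x, "[[:alpha:]]*([[:alpha:]])\\1{2,}[[:alpha:]]*", " ")
       # drop numbers, but keep the ordinals 1st, 2nd and 3rd
       x <- stri_replace_all_regex(x, "\\b(?!1st|2nd|3rd)\\d+\\w*\\b", " ")
       # drop profanity (a combined bannedwordlist.com + Google list in the real script)
       if (length(badwords) > 0)
         x <- stri_replace_all_regex(x,
                paste0("\\b(", paste(badwords, collapse = "|"), ")\\b"), " ")
       # keep only alphanumeric characters, apostrophes (for contractions) and spaces
       x <- stri_replace_all_regex(x, "[^[:alnum:]' ]", " ")
       stri_trim_both(stri_replace_all_regex(x, "\\s+", " "))
     }

     clean_text("Check http://example.com #wow I'll be there at 10 aaand 1st!!!")
     # "check i'll be there at 1st"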

Extraction of n-grams

I realized that quanteda was much faster for my n-gram extraction needs than the tm package, which has other functions and advantages that I did not need or use. A sketch of the extraction code appears after the list below.

  • I realized too late in my project that I should have replaced the period [.] by a token like EOS. Then, when extracting n-grams, I could have skipped those spanning EOS. So in my model, for example, “The sky is blue. The sun shines” leads to such n-grams as “is blue the sun” or “blue the sun shines”, which is nonsense. I decided to live with that mistake and did not re-clean the data nor re-extract the n-grams.
  • I extracted unigrams. 3.2% of them covered 95% of all occurrences in the corpus. Therefore I deleted singletons, doubletons and tripletons not only from those unigrams but also from the corpus itself - a radical choice that may be debatable. That operation took many hours; slicing the corpus into 7 chunks was necessary to prevent R from crashing.
  • Then I extracted bigrams, from which I kept only those occurring at least 4 times, that is around 2 million out of 13.7 million, covering more than 85% of all bigram occurrences in my cleaned corpus.
  • For trigrams, I similarly kept those occurring at least 4 times, that is 2.8 million out of 44.8 million, knowing that there were 41.5 million singleton trigrams. Keeping more trigrams would have slowed down my model.
  • For 4-grams, I sliced the corpus into 4 parts and extracted the 4-grams from 3 of them (leaving one part as a potential test set). Each part yielded around 21 million 4-grams. I combined the first two and removed the singletons from that combination, then added the third group and removed the 20 million singletons from the new combined group, leaving 3.3 million 4-grams, a manageable number.
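
A minimal sketch of the extraction step with quanteda, using tokens(), tokens_ngrams() and dfm(); the helper name and the frequency cut-off argument are illustrative:

     library(quanteda)

     # Count n-grams of order n in a vector of cleaned texts and keep only
     # those occurring at least min_count times (4 for the bigrams and trigrams above).
     count_ngrams <- function(texts, n, min_count = 1) {
       toks  <- tokens(texts, what = "word")
       ngrms <- tokens_ngrams(toks, n = n, concatenator = " ")
       freq  <- colSums(dfm(ngrms))
       sort(freq[freq >= min_count], decreasing = TRUE)
     }

     sample_corpus <- c("the sky is blue", "the sun shines", "the sky is clear")
     count_ngrams(sample_corpus, n = 2, min_count = 2)
     # the sky  sky is
     #       2       2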

Algorithm

If the user types 3 or more words, only the last 3 are taken into account; a single word is already enough to start a prediction. I will describe my algorithm for at least 3 words typed; the principle stays the same if fewer words are typed. All n-grams also have a count and hence a probability (percent) of occurrence.

  • find 4-grams starting with those 3 words typed, find trigrams starting with the last 2 words typed, find bigrams starting with the last word typed.
  • for each of those n-grams found above, store the last word in a vector called possibles.
  • If possibles is empty, the app proposes the 6 most likely unigrams. Otherwise, each possible word is ranked according to a score that does not follow the Katz back-off (KBO) method, which I deemed (perhaps wrongly) too slow and hence impractical for 4-grams or trigrams. The score formula is:

             score(possibles[i]) = w4 x log(p4(possibles[i])) + w3 x log(p3(possibles[i])) + w2 x log(p2(possibles[i]))

where w4 is the weight for 4-grams, etc., and p4(possibles[i]) is the percent (probability) of the matched 4-gram whose last word is possibles[i], etc.

In many cases possibles[i] will not appear in one of the (usually few) 4-grams found above. In that case I arbitrarily assigned:

     p4(possibles[i]) = 10% * min(p4)      (and the same rule for 3- and 2-grams)

I chose w4 = 0.5, w3 = 0.3 and w2 = 0.2, to give more weight to matched 4-grams.
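
To make the scoring concrete, here is a minimal sketch of the rule above, assuming the n-gram tables are data frames with a term column (the n-gram, words separated by spaces) and a pct column (its relative frequency). These structures and names are mine, not necessarily the app's:

     # last3: the last 3 words typed (the app skips the higher orders when fewer are available)
     # ng4, ng3, ng2: data frames with columns `term` and `pct`
     score_candidates <- function(last3, ng4, ng3, ng2,
                                  w = c(g4 = 0.5, g3 = 0.3, g2 = 0.2)) {
       prefix4 <- paste(last3, collapse = " ")
       prefix3 <- paste(tail(last3, 2), collapse = " ")
       prefix2 <- tail(last3, 1)

       hits4 <- ng4[startsWith(ng4$term, paste0(prefix4, " ")), ]
       hits3 <- ng3[startsWith(ng3$term, paste0(prefix3, " ")), ]
       hits2 <- ng2[startsWith(ng2$term, paste0(prefix2, " ")), ]

       last_word <- function(x) vapply(strsplit(x, " "), tail, character(1), 1)
       possibles <- unique(c(last_word(hits4$term), last_word(hits3$term), last_word(hits2$term)))
       if (length(possibles) == 0) return(NULL)   # the app then falls back to the top 6 unigrams

       # contribution of one n-gram order: w * log(p); if the candidate is not among
       # the matches of that order, use the arbitrary floor 10% * min(p) defined above
       contrib <- function(hits, words, weight) {
         if (nrow(hits) == 0) return(rep(0, length(words)))
         p <- hits$pct[match(words, last_word(hits$term))]
         p[is.na(p)] <- 0.1 * min(hits$pct)
         weight * log(p)
       }

       score <- contrib(hits4, possibles, w["g4"]) +
                contrib(hits3, possibles, w["g3"]) +
                contrib(hits2, possibles, w["g2"])
       sort(setNames(score, possibles), decreasing = TRUE)   # first name = word proposed first
     }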