Text prediction App

Filipe Lima
11-23-2020

We wish to determine which word is most likely to follow a sentence.
At our disposal is a large dataset of lines extracted from newspapers, Twitter and blogs.
Using Natural Language Processing, we build an n-gram model to predict the next word in a sentence.

The Text Prediction App is hosted on this site.
It was created using the quanteda package and the Kneser-Ney smoothing algorithm:
First, a 10% sample was drawn from the database and split 80-20 into two datasets, the training and testing datasets.
The dataset was cleaned: punctuation, numbers, symbols, hyphenation and URL indicators were removed. Then it was stemmed.
Three tables were created, for unigrams, bigrams and trigrams, holding their counts and their probabilities computed with the Kneser-Ney algorithm.
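
The cleaning and probability steps above can be illustrated with a minimal sketch. The app itself was built with quanteda in R, so this Python version is not the app's code. It assumes a fixed discount of 0.75, covers bigrams only (the app also uses trigrams) and skips stemming. The names clean and kn_bigram_model are hypothetical.

```python
import re
from collections import Counter, defaultdict

def clean(line):
    """Lowercase, drop URLs, keep only letters and apostrophes (no stemming here)."""
    line = re.sub(r"https?://\S+|www\.\S+", " ", line.lower())
    return re.findall(r"[a-z']+", line)

def kn_bigram_model(lines, discount=0.75):
    """Return p(word, prev): interpolated Kneser-Ney bigram probability.

    The discount value is an assumption, not taken from the app.
    """
    bigrams = Counter()
    for line in lines:
        tokens = clean(line)
        bigrams.update(zip(tokens, tokens[1:]))

    context_count = Counter()      # how often `prev` starts a bigram
    followers = defaultdict(set)   # distinct words seen after `prev`
    preceders = defaultdict(set)   # distinct words seen before `word`
    for (prev, word), count in bigrams.items():
        context_count[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigrams)

    def p(word, prev):
        if n_bigram_types == 0:
            return 0.0
        # Continuation probability: share of bigram types that end in `word`.
        p_cont = len(preceders.get(word, ())) / n_bigram_types
        if context_count[prev] == 0:
            return p_cont          # unseen context: rely on continuation only
        discounted = max(bigrams[(prev, word)] - discount, 0) / context_count[prev]
        backoff_weight = discount * len(followers[prev]) / context_count[prev]
        return discounted + backoff_weight * p_cont

    return p
```

The discount removes a little probability mass from every observed bigram and hands it to the continuation term, which favours words that appear after many different contexts over words that are merely frequent.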

Our app offers two kinds of predictions based on the sentence you write: a ranking of the n most probable words, and the next n words that complete your sentence.
You should provide the sentence, the type of prediction and the number of words you want.
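
The two prediction types can be sketched as follows. This is an illustration only: it assumes a lookup table keyed by the last word, whereas the app works from its n-gram tables, and it assumes the completion mode picks the single best word at each step. The names rank_next_words, complete_sentence and toy_scores are hypothetical.

```python
def rank_next_words(scores, sentence, n):
    """Return the n highest-scoring candidates for the word after `sentence`."""
    words = sentence.lower().split()
    last = words[-1] if words else ""
    candidates = scores.get(last, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:n]

def complete_sentence(scores, sentence, n):
    """Greedily append the single most probable word, up to n times."""
    words = sentence.lower().split()
    added = []
    for _ in range(n):
        best = rank_next_words(scores, " ".join(words + added), 1)
        if not best:
            break  # no known continuation for the current last word
        added.append(best[0])
    return added

# Hypothetical toy table: toy_scores[previous_word][candidate] = probability.
toy_scores = {
    "i": {"am": 0.5, "have": 0.3, "will": 0.2},
    "am": {"going": 0.6, "not": 0.4},
    "going": {"to": 0.9, "home": 0.1},
}

print(rank_next_words(toy_scores, "I", 2))    # ['am', 'have']
print(complete_sentence(toy_scores, "I", 3))  # ['am', 'going', 'to']
```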

We measured the accuracy of our app by choosing 1000 random lines from our testing datasets, predicting the last word of each line and comparing the prediction with the actual last word. We checked whether the actual word appeared among the top 1, 3 and 5 predicted words.
In the blogs.txt sample, we got 3.5%, 7.9% and 8.9% right answers with one, three and five words.
In the twitter.txt sample, we got 6.3%, 11.2% and 15% right answers with one, three and five words.
In the news.txt sample, we got 3.4%, 6% and 7.8% right answers with one, three and five words.
Although our accuracy looks low, our app is pretty fast.
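
The accuracy check described above could be sketched like this. It is an assumption about the procedure, not the original evaluation script: the function top_k_accuracy and the predict callback (which takes a context and a number of words and returns a ranked list) are hypothetical names.

```python
import random

def top_k_accuracy(lines, predict, ks=(1, 3, 5), sample_size=1000, seed=0):
    """Share of sampled lines whose last word is among the top-k predictions."""
    # Only lines with a context and a target word can be scored.
    usable = [line.split() for line in lines if len(line.split()) >= 2]
    if not usable:
        return {k: 0.0 for k in ks}
    rng = random.Random(seed)
    sample = rng.sample(usable, min(sample_size, len(usable)))
    hits = {k: 0 for k in ks}
    for words in sample:
        context, target = " ".join(words[:-1]), words[-1]
        ranked = predict(context, max(ks))
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(sample) for k in ks}
```

Because each larger k includes the predictions of the smaller ones, the reported share can only stay the same or grow from one to three to five words, which matches the pattern in the results above.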