Synopsis

SentenceCompleteR is a simple word prediction app that predicts the next word in a sentence based on the last five words typed in.

Its top prediction appears in bold at the top of the app, completing the sentence typed in, and its top five predictions are listed in a table below.

Scores for each prediction are also listed in this table, so that one can see how strong its five predictions are, relative to one another. Scores range from 0 to 100; the higher the score, the stronger the prediction.

Instructions for use

To use, simply type a sentence into the text box in the left sidebar of the app.

SentenceCompleteR generates its predictions as you type. However, as predictions take a second or two to compute, you will generally be able to finish typing your sentence before SentenceCompleteR displays its first prediction.

Under the hood (pt. 1)

Our prediction algorithm is based on a simple N-gram model.

In essence, an N-gram model works by breaking a corpus of text into strings of consecutive words (N-grams). From our original dataset of over 4,000,000 lines of blog posts, news stories, and tweets, we extracted every 1- to 6-word string that appeared and then counted the frequency of each unique string (i.e., how many times it occurred in the dataset).
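The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the app's actual code: the function name `count_ngrams`, the lowercasing, the whitespace tokenizer, and the two-line toy corpus are all assumptions made for the example.

```python
from collections import Counter

def count_ngrams(lines, max_n=6):
    """Count every 1- to max_n-word string in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        words = line.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of n words across the line.
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

# Toy corpus standing in for the real dataset.
toy_corpus = [
    "it was the best of times",
    "it was the worst of times",
]
counts = count_ngrams(toy_corpus)

print(counts[("it", "was", "the")])      # 3-word string shared by both lines
print(counts[("best", "of", "times")])   # 3-word string found in one line
```

N-grams are counted within each line only, so no string spans two separate blog posts, news stories, or tweets.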

SentenceCompleteR’s predictions are based on a corpus of blog posts, news stories, and tweets. The original dataset contained over 4,000,000 lines of text and took up over 800 MB of space. After data processing, SentenceCompleteR computes its predictions from a database that takes up less than 100 MB of space. (Less space = faster processing.)

Under the hood (pt. 2)

To make its predictions, our algorithm uses these frequency counts to determine: Which words are the most common completions of the last five words typed in?

For example, “times” is a very common completion of the phrase “It was the best of”. In our dataset, “times” completes this phrase 87.5% of the time. This number is calculated by dividing the frequency count of the phrase “It was the best of times” in the entire dataset by the frequency count of the phrase “It was the best of”. The result tells us what percentage of the time the phrase “It was the best of” is completed by “times”.
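As a rough sketch of this calculation in Python: the helper name `completion_share` is hypothetical, and the counts of 8 and 7 are made-up numbers chosen only so that the ratio comes out to the 87.5% quoted above. They are not the real frequencies from the dataset.

```python
from collections import Counter

def completion_share(counts, context, word):
    """Share of occurrences of `context` that are followed by `word`."""
    context_count = counts[context]
    if context_count == 0:
        return 0.0  # context never seen, so no estimate is possible
    return counts[context + (word,)] / context_count

# Hypothetical frequency counts (illustrative only).
counts = Counter({
    ("it", "was", "the", "best", "of"): 8,
    ("it", "was", "the", "best", "of", "times"): 7,
})

share = completion_share(counts, ("it", "was", "the", "best", "of"), "times")
print(f"{share:.1%}")  # 7 / 8 -> 87.5%
```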

Under the hood (pt. 3)

Sometimes the best completion is not based on the last five words typed in, but on the last four, three, two, or even one.

Our algorithm takes this into account automatically. In addition to determining the most common completions of the last five words typed in, it also determines the most common completions of the last four, three, two, and one words typed in. It then down-weights the completions that come from shorter contexts, multiplying each completion’s percentage by a factor of 0.4 for each level it backs off, and compares the resulting scores. The highest-scoring completion is the algorithm’s top prediction.
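One possible reading of this weighting scheme is sketched below in Python. Several details are assumptions rather than facts about the app: the names `build_index` and `predict` are hypothetical, the penalty is counted from the five-word context (so a two-word context is multiplied by 0.4 three times), a word found at several levels keeps its highest score, and percentages are scaled to the 0 to 100 range. The toy counts are invented for illustration.

```python
from collections import Counter, defaultdict

BACKOFF = 0.4      # per-level factor stated in the text
MAX_CONTEXT = 5    # longest context: the last five words typed in

def build_index(counts):
    """Map each context to a Counter of the words observed after it."""
    index = defaultdict(Counter)
    for ngram, freq in counts.items():
        if len(ngram) >= 2:
            index[ngram[:-1]][ngram[-1]] = freq
    return index

def predict(counts, index, typed_words, top_k=5):
    """Return up to top_k (word, score) pairs, highest score first."""
    scores = {}
    for level in range(MAX_CONTEXT):
        k = MAX_CONTEXT - level            # context length at this level
        if k > len(typed_words):
            continue
        context = tuple(typed_words[-k:])
        total = counts.get(context, 0)
        if total == 0:
            continue                       # context unseen: back off further
        weight = BACKOFF ** level
        for word, freq in index.get(context, {}).items():
            score = 100 * weight * freq / total
            # Assumption: keep the best score a word earns at any level.
            scores[word] = max(scores.get(word, 0.0), score)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Invented toy counts: only 1- to 3-word strings are present here.
counts = Counter({
    ("best", "of"): 10,
    ("best", "of", "times"): 6,
    ("best", "of", "luck"): 4,
    ("of",): 100,
    ("of", "the"): 50,
    ("of", "times"): 10,
})
index = build_index(counts)
result = predict(counts, index, "it was the best of".split())
print([word for word, _ in result])
```

In this toy run the five-, four-, and three-word contexts are never found, so the scores come from the two-word and one-word contexts only. That is why every score is small: each has already been multiplied by 0.4 at least three times.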