2023-07-15

Background

Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on the interaction between computers and human language. It encompasses the methods and techniques used to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP plays a crucial role in bridging the gap between human communication and computer systems, enabling machines to process and analyze vast amounts of textual data.

One of the key applications of NLP is text prediction, where algorithms generate probable next words or phrases based on the context of the input text. Text prediction relies on NLP techniques such as language modeling, statistical analysis, and machine learning. By training models on large amounts of text data, NLP systems can suggest the most likely next words or phrases, improving writing assistance and productivity tools.

The Sentence PredictR app

  • All you need to do is type a phrase into the box and the app does the rest!
  • Built on 4M+ lines of text
  • Fast with a low memory footprint

Corpus details

The corpus is built from news articles, blog posts, and Twitter posts, roughly 4.3 million lines in total, summarized in the table below.

Source      Number of rows   Max line length (chars)   Object size (MB)
Blogs              899,288                    40,833                255
News             1,010,242                     2,363                257
Twitter          2,360,148                       140                319
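
For reference, a minimal Python sketch of how a summary like this could be produced is shown below. The file names are placeholders for the actual corpus files, and the size figure is a rough byte count rather than the in-memory object size reported in the table.

    # Hedged sketch: reproduce a corpus summary similar to the table above.
    # The file names are placeholders; substitute the actual corpus files.
    files = {
        "Blogs": "blogs.txt",
        "News": "news.txt",
        "Twitter": "twitter.txt",
    }

    for name, path in files.items():
        with open(path, encoding="utf-8", errors="ignore") as f:
            lines = f.read().splitlines()
        n_rows = len(lines)                                  # number of rows
        max_len = max(len(line) for line in lines)           # max line length (chars)
        size_mb = sum(len(l.encode("utf-8")) for l in lines) / 1e6  # rough size in MB
        print(f"{name}: {n_rows} rows, max length {max_len}, ~{size_mb:.0f} MB")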

Algorithm details

A simple backoff algorithm was used to build the prediction model. The algorithm operates as follows (a code sketch follows the list):

  • The n-grams are separated into a query phrase and prediction. For example, “at_the_end_of_the” would be separated into a query phrase of “at_the_end_of” and a prediction of “the”
  • The scores for each prediction are computed by dividing the frequency of the full n-gram by the frequency of the query phrase
  • At most the last four words of the input text are combined into the query phrase
  • If no n-grams match the query, the algorithm backs off to a shorter query phrase by eliminating the first word in the query. For example, “at_the_end_of” would become “the_end_of”
  • The algorithm also backs off when the desired number of results has not yet been reached
  • A backoff weight of 0.4 is used per the “Stupid Backoff” algorithm used by Google
  • The results are sorted by score and the top n results are returned
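
A minimal Python sketch of this procedure is shown below. It assumes the n-gram counts are held in plain in-memory dictionaries and that the corpus has already been tokenized; the real app stores a pruned n-gram table, so this illustrates the scoring and backoff logic rather than the production code.

    from collections import Counter, defaultdict

    ALPHA = 0.4  # stupid-backoff weight applied at each backoff step

    def build_tables(tokens, max_n=5):
        """Count n-grams up to max_n and index them as query phrase -> {prediction: count}."""
        ngram_counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram_counts[tuple(tokens[i:i + n])] += 1
        table = defaultdict(dict)
        for gram, count in ngram_counts.items():
            if len(gram) >= 2:
                table[gram[:-1]][gram[-1]] = count  # split into query phrase and prediction
        return ngram_counts, table

    def predict(text, ngram_counts, table, top_n=3):
        """Score candidates as count(query + word) / count(query), backing off as needed."""
        query = tuple(text.lower().split())[-4:]  # use at most the last four words
        weight, scores = 1.0, {}
        while query:
            for word, count in table.get(query, {}).items():
                # keep the score from the longest matching n-gram
                scores.setdefault(word, weight * count / ngram_counts[query])
            if len(scores) >= top_n:
                break
            query, weight = query[1:], weight * ALPHA  # drop the first word and back off
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

    tokens = "at the end of the day at the end of the road".split()
    counts, table = build_tables(tokens)
    print(predict("at the end of the", counts, table))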

Performance

The app has a response time of under 30 ms and a memory footprint of around 100 MB on the Shiny server. The n-gram table was pruned to achieve this combination of speed and memory.

The model has been benchmarked on a set of blog posts and tweets comprising 28k+ predictions and achieves the following accuracy (a sketch of how these metrics are computed follows the list):

  • Top choice accuracy of 13.81%
  • Top three accuracy of 22.68%
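
A minimal sketch of how these two numbers could be computed is shown below; it reuses the predict function from the earlier sketch and assumes a list of (context, actual next word) test pairs standing in for the real benchmark set.

    def benchmark(test_pairs, ngram_counts, table):
        """Return (top-choice accuracy, top-three accuracy) over the test pairs."""
        top1 = top3 = 0
        for context, actual in test_pairs:
            preds = [w for w, _ in predict(context, ngram_counts, table, top_n=3)]
            if preds and preds[0] == actual:
                top1 += 1
            if actual in preds:
                top3 += 1
        n = len(test_pairs)
        return top1 / n, top3 / n

    pairs = [("at the end of the", "day"), ("the end of the", "road")]
    t1, t3 = benchmark(pairs, counts, table)
    print(f"Top choice: {t1:.2%}, top three: {t3:.2%}")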