2023-07-15

Background

Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on the interaction between computers and human language. It encompasses the methods and techniques used to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP plays a crucial role in bridging the gap between human communication and computer systems, enabling machines to process and analyze vast amounts of textual data.

One of the key applications of NLP is text prediction, where algorithms generate probable next words or phrases based on the context of the input text. Text prediction relies on NLP techniques such as language modeling, statistical analysis, and machine learning. By training models on large amounts of text data, NLP systems can suggest the most likely next words or phrases, improving writing assistance and productivity tools.

The Sentence PredictR app

  • All you need to do is type a phrase into the box and the app does the rest!
  • Built on 4M+ lines of text
  • Fast with a low memory footprint

Corpus details

The corpus is built from news articles, blog posts, and Twitter posts, roughly 4.3 million lines in total, summarized in the table below.

Source      Number of rows   Max line length (chars)   Object size (MB)
Blogs              899,288                    40,833                255
News             1,010,242                     2,363                257
Twitter          2,360,148                       140                319
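
For reference, a minimal Python sketch of how a summary like this could be produced is shown below. The file names are placeholders for the actual corpus files, and the size figure is a rough byte count rather than the in-memory object size reported in the table.

    # Hedged sketch: reproduce a corpus summary similar to the table above.
    # The file names are placeholders; substitute the actual corpus files.
    files = {
        "Blogs": "blogs.txt",
        "News": "news.txt",
        "Twitter": "twitter.txt",
    }

    for name, path in files.items():
        with open(path, encoding="utf-8", errors="ignore") as f:
            lines = f.read().splitlines()
        n_rows = len(lines)                                  # number of rows
        max_len = max(len(line) for line in lines)           # max line length (chars)
        size_mb = sum(len(l.encode("utf-8")) for l in lines) / 1e6  # rough size in MB
        print(f"{name}: {n_rows} rows, max length {max_len}, ~{size_mb:.0f} MB")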

Algorithm details

A simple backoff algorithm was used to build the prediction model. The algorithm operates as follows (a code sketch follows the list):

  • The n-grams are separated into a query phrase and prediction. For example, “at_the_end_of_the” would be separated into a query phrase of “at_the_end_of” and a prediction of “the”
  • The scores for each prediction are computed by dividing the frequency of the full n-gram by the frequency of the query phrase
  • At most the last four words of the input text are combined into the query phrase
  • If no n-grams match the query, the algorithm backs off to a shorter query phrase by eliminating the first word in the query. For example, “at_the_end_of” would become “the_end_of”
  • The algorithm also backs off when the desired number of results has not yet been reached
  • A backoff weight of 0.4 is used per the “Stupid Backoff” algorithm used by Google
  • The results are sorted by score and the top n results are returned
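
A minimal Python sketch of this procedure is shown below. It assumes the n-gram counts are held in plain in-memory dictionaries and that the corpus has already been tokenized; the real app stores a pruned n-gram table, so this illustrates the scoring and backoff logic rather than the production code.

    from collections import Counter, defaultdict

    ALPHA = 0.4  # stupid-backoff weight applied at each backoff step

    def build_tables(tokens, max_n=5):
        """Count n-grams up to max_n and index them as query phrase -> {prediction: count}."""
        ngram_counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram_counts[tuple(tokens[i:i + n])] += 1
        table = defaultdict(dict)
        for gram, count in ngram_counts.items():
            if len(gram) >= 2:
                table[gram[:-1]][gram[-1]] = count  # split into query phrase and prediction
        return ngram_counts, table

    def predict(text, ngram_counts, table, top_n=3):
        """Score candidates as count(query + word) / count(query), backing off as needed."""
        query = tuple(text.lower().split())[-4:]  # use at most the last four words
        weight, scores = 1.0, {}
        while query:
            for word, count in table.get(query, {}).items():
                # keep the score from the longest matching n-gram
                scores.setdefault(word, weight * count / ngram_counts[query])
            if len(scores) >= top_n:
                break
            query, weight = query[1:], weight * ALPHA  # drop the first word and back off
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

    tokens = "at the end of the day at the end of the road".split()
    counts, table = build_tables(tokens)
    print(predict("at the end of the", counts, table))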

Performance

The app has a response time of under 30 ms and a memory footprint of around 100 MB on the Shiny server. The n-gram table was pruned to achieve this combination of speed and memory.

The model has been benchmarked on a set of blog posts and tweets comprising 28k+ predictions and achieves the following accuracy (a sketch of how these metrics are computed follows the list):

  • Top choice accuracy of 13.81%
  • Top three accuracy of 22.68%
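
A minimal sketch of how these two numbers could be computed is shown below; it reuses the predict function from the earlier sketch and assumes a list of (context, actual next word) test pairs standing in for the real benchmark set.

    def benchmark(test_pairs, ngram_counts, table):
        """Return (top-choice accuracy, top-three accuracy) over the test pairs."""
        top1 = top3 = 0
        for context, actual in test_pairs:
            preds = [w for w, _ in predict(context, ngram_counts, table, top_n=3)]
            if preds and preds[0] == actual:
                top1 += 1
            if actual in preds:
                top3 += 1
        n = len(test_pairs)
        return top1 / n, top3 / n

    pairs = [("at the end of the", "day"), ("the end of the", "road")]
    t1, t3 = benchmark(pairs, counts, table)
    print(f"Top choice: {t1:.2%}, top three: {t3:.2%}")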