Product Description
The goal of this project is to create a predictive text model that predicts the next word following a sequence of 2-4 words, based on a corpus of text drawn from public social media and news sources.
- My predictive text model was trained on three large text datasets: tweets, blogs, and news articles.
- The data was fed into the model in three phases, one per source, and then parsed into four subsets of n-grams (n = 2, 3, 4, and 5); a sketch of this step follows the list.
- From there I added the option to filter out common stop words such as “the” and “a”, which are pervasive in natural language but not semantically dense, since on their own they add little meaning to the n-gram.
- I then filtered out the long tail of very rare n-grams, those that appeared only once in the entire corpus. This substantially reduced the computational load while sacrificing little in terms of utility to the user.
- I then calculated the probability of each n-gram within the dataset and applied smoothing so that the model generalizes better and remains useful when presented with new word combinations it was not trained on (see the second sketch after this list).
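
The sketch below shows, under simplifying assumptions, how the n-gram tables could be built: text is tokenized by whitespace, stop words are optionally dropped, and a sliding window of length n is counted. The stop-word list, file path, and function names here are illustrative, not the exact ones used in the model.

```python
from collections import Counter

# Illustrative stop-word list; the model's actual list may differ.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def extract_ngrams(lines, n, drop_stop_words=False):
    """Count all n-grams of order n across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenization
        if drop_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical usage: build the four n-gram tables (n = 2..5) for one source.
# corpus_lines = open("tweets.txt", encoding="utf-8").read().splitlines()
# tables = {n: extract_ngrams(corpus_lines, n) for n in range(2, 6)}
```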
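
The pruning and probability steps might look roughly like the following. The write-up does not specify which smoothing method was used, so add-k (Laplace-style) smoothing is shown purely as one common choice; the function names and the vocabulary size in the usage comment are assumptions.

```python
def prune_singletons(counts):
    """Drop n-grams that were seen only once in the corpus."""
    return {gram: c for gram, c in counts.items() if c > 1}

def smoothed_probability(counts, gram, vocab_size, k=1.0):
    """Add-k smoothed probability of `gram` given its (n-1)-word prefix.

    `counts` maps n-gram tuples to frequencies, `vocab_size` is the number
    of distinct words in the corpus, and k=1.0 gives Laplace (add-one)
    smoothing. The linear scan over `counts` keeps the sketch short; a real
    model would precompute prefix totals.
    """
    prefix = gram[:-1]
    prefix_total = sum(c for g, c in counts.items() if g[:-1] == prefix)
    return (counts.get(gram, 0) + k) / (prefix_total + k * vocab_size)

# Hypothetical usage with the trigram table from the previous sketch:
# trigrams = prune_singletons(tables[3])
# p = smoothed_probability(trigrams, ("thanks", "for", "the"), vocab_size=50_000)
```
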
Source N-gram Analysis

Most bodies of text generally conform to Zipf’s law, an empirical rule which states that in a given dataset of natural language, the frequency of any word is inversely proportional to its rank in the frequency table.
- This means the most common word occurs roughly twice as often as the second most common word, three times as often as the third most common word, and so on.
- The effect of this is that we should expect to see fewer unique word combinations for smaller n-grams; conversely, each individual n-gram should appear more often as n decreases. The model appears to support this, as the rank-frequency check sketched below illustrates.
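
As a quick illustration (not taken from the original analysis), the rank-frequency pattern Zipf's law predicts, f(r) ≈ f(1) / r, can be compared against the unigram counts. The reuse of `extract_ngrams` from the earlier sketch and the variable names here are assumptions.

```python
from collections import Counter

def zipf_check(word_counts, top=10):
    """Print observed frequencies of the top-ranked words next to the
    1/rank frequencies Zipf's law predicts from the most common word."""
    ranked = word_counts.most_common(top)
    f1 = ranked[0][1]  # frequency of the single most common word
    for rank, (word, freq) in enumerate(ranked, start=1):
        expected = f1 / rank  # Zipf prediction: f(r) ≈ f(1) / r
        print(f"{rank:>2}  {word:<15} observed={freq:>9}  zipf≈{expected:>11.0f}")

# Hypothetical usage, flattening 1-tuples from extract_ngrams into words:
# unigrams = extract_ngrams(corpus_lines, n=1)
# zipf_check(Counter({g[0]: c for g, c in unigrams.items()}))
```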