Product Description
The goal of this project is to create a predictive text model that predicts the next word following a sequence of 2-4 words, based on a corpus of text drawn from public social media and news sources.
- My predictive text model was trained on three large text datasets: tweets, blogs, and news articles.
- The data was fed into the model in three phases, one per source, and then parsed into four subsets of n-grams (n = 2, 3, 4, and 5); a sketch of this step follows the list.
- From there I added the option to filter out common stop words such as “the” and “a”, which are pervasive in natural language but not semantically dense, since on their own they add little meaning to the n-gram.
- I then filtered out the long tail of very rare n-grams, those that appeared only once in the entire corpus. This substantially reduced the computational load while sacrificing little in terms of utility to the user.
- I then calculated the probability of each n-gram within the dataset and applied smoothing so that the model generalizes better and remains useful when presented with new word combinations it was not trained on (see the second sketch after this list).
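
The sketch below shows, under simplifying assumptions, how the n-gram tables could be built: text is tokenized by whitespace, stop words are optionally dropped, and a sliding window of length n is counted. The stop-word list, file path, and function names here are illustrative, not the exact ones used in the model.

```python
from collections import Counter

# Illustrative stop-word list; the model's actual list may differ.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def extract_ngrams(lines, n, drop_stop_words=False):
    """Count all n-grams of order n across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenization
        if drop_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical usage: build the four n-gram tables (n = 2..5) for one source.
# corpus_lines = open("tweets.txt", encoding="utf-8").read().splitlines()
# tables = {n: extract_ngrams(corpus_lines, n) for n in range(2, 6)}
```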
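
The pruning and probability steps might look roughly like the following. The write-up does not specify which smoothing method was used, so add-k (Laplace-style) smoothing is shown purely as one common choice; the function names and the vocabulary size in the usage comment are assumptions.

```python
def prune_singletons(counts):
    """Drop n-grams that were seen only once in the corpus."""
    return {gram: c for gram, c in counts.items() if c > 1}

def smoothed_probability(counts, gram, vocab_size, k=1.0):
    """Add-k smoothed probability of `gram` given its (n-1)-word prefix.

    `counts` maps n-gram tuples to frequencies, `vocab_size` is the number
    of distinct words in the corpus, and k=1.0 gives Laplace (add-one)
    smoothing. The linear scan over `counts` keeps the sketch short; a real
    model would precompute prefix totals.
    """
    prefix = gram[:-1]
    prefix_total = sum(c for g, c in counts.items() if g[:-1] == prefix)
    return (counts.get(gram, 0) + k) / (prefix_total + k * vocab_size)

# Hypothetical usage with the trigram table from the previous sketch:
# trigrams = prune_singletons(tables[3])
# p = smoothed_probability(trigrams, ("thanks", "for", "the"), vocab_size=50_000)
```
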
Source N-gram Analysis

Most bodies of text generally conform to Zipf’s law, an empirical rule which states that in a given dataset of natural language, the frequency of any word is inversely proportional to its rank in the frequency table.
- This means the most common word occurs roughly twice as often as the second most common word, three times as often as the third most common word, and so on.
- The effect of this is that we should expect to see fewer unique word combinations for smaller n-grams; conversely, each individual n-gram should appear more often as n decreases. The model appears to support this, as the rank-frequency check sketched below illustrates.
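
As a quick illustration (not taken from the original analysis), the rank-frequency pattern Zipf's law predicts, f(r) ≈ f(1) / r, can be compared against the unigram counts. The reuse of `extract_ngrams` from the earlier sketch and the variable names here are assumptions.

```python
from collections import Counter

def zipf_check(word_counts, top=10):
    """Print observed frequencies of the top-ranked words next to the
    1/rank frequencies Zipf's law predicts from the most common word."""
    ranked = word_counts.most_common(top)
    f1 = ranked[0][1]  # frequency of the single most common word
    for rank, (word, freq) in enumerate(ranked, start=1):
        expected = f1 / rank  # Zipf prediction: f(r) ≈ f(1) / r
        print(f"{rank:>2}  {word:<15} observed={freq:>9}  zipf≈{expected:>11.0f}")

# Hypothetical usage, flattening 1-tuples from extract_ngrams into words:
# unigrams = extract_ngrams(corpus_lines, n=1)
# zipf_check(Counter({g[0]: c for g, c in unigrams.items()}))
```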