September 2, 2019

Introduction

A predictive text application was developed using corpora of English text from blog, news, and Twitter sources.

Using a 5-gram dictionary paired with a 'Stupid Backoff' model, the application predicts the next word of a user's input sentence, with a top-1 prediction rate of 11.51% and a top-3 rate of 21.31%.

The final application can be found here.

Below are simple descriptive statistics of the source files used.

          Size (MB)    Lines  Min Chars/Line  Mean Chars/Line  Max Chars/Line
blogs        200.42   899288               1          231.70           40835
news         196.28    77259               2          203.00            5760
twitter      159.36  2360148               2           68.80             213

Building n-gram Dictionary

At a few key sample sizes, the data was analyzed for unique terms, total term instances, and the percentage of total instances covered by the 1,000 most frequent terms.

              Unique Terms  Total Instances  Top-1000 Coverage (%)
Sample 1%            17342          1977979                  97.15
Sample 2.5%          29953          5037452                  93.91
Sample 5%            44112         10107684                  92.82
Sample 7%            53180         14204144                  92.48
Sample 10%           64647         20280038                  92.21
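
For illustration, the coverage figures at a given sample size can be computed as below (a sketch; 'tokens' stands in for the cleaned, sampled corpus, which is an assumption about the pipeline):

    from collections import Counter

    def coverage_stats(tokens, top_k=1000):
        """Unique terms, total instances, and the share of instances
        covered by the top_k most frequent terms."""
        counts = Counter(tokens)
        total = sum(counts.values())
        covered = sum(freq for _, freq in counts.most_common(top_k))
        return len(counts), total, 100 * covered / total

    # Toy input; the real input would be one of the sampled corpora.
    tokens = "the cat sat on the mat and the dog sat too".split()
    print(coverage_stats(tokens, top_k=3))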

A sample subset of 2.5% of the total corpora was cleaned and processed to build the final predictive model's 'dictionary' of 1- to 5-gram tokens.
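
A minimal sketch of the dictionary construction is below; the tuple-keyed frequency tables are an assumed layout, not the report's actual data structure:

    from collections import Counter

    def ngram_counts(tokens, max_n=5):
        """Build frequency tables for 1- through max_n-grams."""
        return {n: Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
                for n in range(1, max_n + 1)}

    tables = ngram_counts("i want to go to the store".split())
    print(tables[2].most_common(2))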

The final dictionary was then pruned (n-grams with frequency < 4 were removed), yielding substantial disk-space savings and faster prediction with minimal impact on prediction accuracy.
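
Continuing the sketch above, pruning reduces to a single dictionary comprehension per table:

    def prune(tables, min_freq=4):
        """Drop n-grams observed fewer than min_freq times; this shrinks
        the dictionary on disk and speeds up lookups."""
        return {n: {gram: count for gram, count in table.items()
                    if count >= min_freq}
                for n, table in tables.items()}

    tables = prune(tables)   # keep only n-grams with frequency >= 4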

Predictive Model

From the final dictionary, a search algorithm was created to perform the following (sketched in code after the list):

  1. Search for the input string in the n-gram file corresponding to the number of words 'n' in the input.
  2. If no match is found in the largest applicable n-gram file, drop the earliest word and search the last 'n-1' words in the '(n-1)'-gram file ('Stupid Backoff').
  3. Repeat the backoff until a match occurs.
  4. Once a match is found, subset the matching n-gram file on the matched words from user input, rank candidate 'next' words by term frequency, and return the top 1-3 predictions.
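
A minimal Python sketch of this search, using the tuple-keyed tables from the earlier sketches (the unigram fallback for inputs that never match is an assumption, as the report does not describe that case):

    def predict(tables, context_words, top=3, max_n=5):
        """Back off from the longest usable context until an n-gram
        table contains it, then rank next words by term frequency."""
        context = tuple(context_words[-(max_n - 1):])
        while context:
            matches = {gram[-1]: count
                       for gram, count in tables[len(context) + 1].items()
                       if gram[:-1] == context}
            if matches:
                return sorted(matches, key=matches.get, reverse=True)[:top]
            context = context[1:]       # 'Stupid Backoff': drop earliest word
        # No context matched at any level: return the overall top unigrams.
        ranked = sorted(tables[1].items(), key=lambda kv: kv[1], reverse=True)
        return [gram[-1] for gram, _ in ranked[:top]]

    # usage: predict(tables, "i want to go".split()) -> up to 3 candidates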

Predictive Text App

Results and Future Outlook

From the final model, a top-1 prediction accuracy of 11.5% was achieved (lowest at 8.61% for uni-gram inputs and highest at 15.8% for bi-gram inputs). In addition, a top-3 prediction accuracy of 21.3% was achieved, indicating a total 'miss' rate of 78.7%.
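
The report does not spell out the scoring harness; one plausible shape for it, reusing the predict() sketch above and assuming a held-out list of test sentences, is:

    def evaluate(tables, test_sentences):
        """Top-1 and top-3 next-word accuracy over held-out sentences."""
        top1 = top3 = total = 0
        for sentence in test_sentences:
            words = sentence.lower().split()
            for i in range(1, len(words)):
                predictions = predict(tables, words[:i], top=3)
                total += 1
                top1 += predictions[:1] == [words[i]]
                top3 += words[i] in predictions
        return 100 * top1 / total, 100 * top3 / total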

The low accuracy of this simple model is expected, as the 'Stupid Backoff' search algorithm does not take other important factors into account (such as the parts of speech of words and sentence context beyond the 5-gram length).

Moving forward, plans to further develop the model involve:

  1. the inclusion of higher-order n-grams (beyond 5-grams) to boost accuracy on longer phrase predictions, and
  2. stemming/lemmatization techniques to boost vocabulary coverage by reducing similar n-grams to their roots, allowing a larger sample subset to be used in the final dictionary (a sketch follows).
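
For the second item, a stemming pass over the n-gram keys might look like the following (a sketch using NLTK's Porter stemmer, which is an implementation choice not named above):

    from nltk.stem import PorterStemmer   # assumes the nltk package is installed

    stemmer = PorterStemmer()

    def stem_gram(gram):
        """Collapse inflected variants so similar n-grams share one entry,
        e.g. 'cats' -> 'cat' and 'running' -> 'run'."""
        return tuple(stemmer.stem(word) for word in gram)

    print(stem_gram(("the", "cats", "were", "running")))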