Milestone Report

KMH

2024-08-22


Synopsis

This report is an exploratory analysis to inform the development of our R Shiny Text Prediction app. The app will utilize a basic n-gram model to predict the next word based on the previous 1, 2, or 3 words input by a user.



Data

The text data used for this project comes from three internet sources: blogs, news, and Twitter. The number of lines of text and file size for each file are given below.

Table 1: Text Data Sources

Source   Number of Lines  File Size (MB)
Blogs             899288          210.16
News               77259          205.81
Twitter          2360148          167.11
Total            3336695          583.08

As seen in Table 1, there are over three million total lines of text.
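
As a rough illustration of how per-file figures like those in Table 1 can be obtained, the following is a minimal Python sketch (not the R code used for this report). The function name `file_stats` is hypothetical, and it assumes 1 MB = 10^6 bytes.

```python
import os

def file_stats(path):
    """Return (number_of_lines, size_in_MB) for a text file."""
    # Binary mode: count newline-delimited lines without decoding the text
    with open(path, "rb") as fh:
        n_lines = sum(1 for _ in fh)
    # Assumption: 1 MB = 10^6 bytes
    size_mb = os.path.getsize(path) / 1e6
    return n_lines, round(size_mb, 2)
```

Calling `file_stats` on each source file and summing the results would give the "Total" row.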


Training Data

We will randomly select 70% of the combined lines of text to train our text prediction algorithm and reserve the remaining 30% for testing.
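
A minimal Python sketch of such a random 70/30 split is shown here for illustration only; it is not the R code used for this project. The function name `train_test_split` and the fixed `seed` are assumptions, added so the split is reproducible.

```python
import random

def train_test_split(lines, train_fraction=0.7, seed=42):
    """Randomly split a list of text lines into training and test sets."""
    rng = random.Random(seed)          # fixed seed (assumed) for reproducibility
    indices = list(range(len(lines)))
    rng.shuffle(indices)               # random order of line indices
    n_train = int(round(train_fraction * len(lines)))
    train = [lines[i] for i in indices[:n_train]]
    test = [lines[i] for i in indices[n_train:]]
    return train, test
```

Every line ends up in exactly one of the two sets, so the test set stays unseen during training.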


Cleaning Data

The R package tm was used to clean the data. The following transformations were applied:

  • Conversion to lowercase
  • Removal of swear words
  • Removal of symbols
  • Removal of numbers
  • Removal of punctuation
  • Removal of non-ASCII characters
  • Removal of excess white space

The R package quanteda was used to tokenize the data.
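
To make the cleaning steps concrete, here is a minimal Python sketch of the same kinds of transformations followed by whitespace tokenization. It is an illustration only, not the tm/quanteda code used for this report: the names `clean_line`, `tokenize`, and `PROFANITY` are hypothetical, the two-word profanity set is a placeholder, and the order of the steps is an assumption.

```python
import re

# Hypothetical placeholder; a real profanity list would be loaded from a file
PROFANITY = {"darn", "heck"}

def clean_line(text, banned=PROFANITY):
    """Apply the cleaning transformations to one line of text."""
    text = text.lower()                                    # lowercase
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"[0-9]+", " ", text)                    # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)                  # remove punctuation and symbols
    tokens = [t for t in text.split() if t not in banned]  # remove swear words
    return " ".join(tokens)                                # collapses excess white space

def tokenize(text):
    """Clean a line and split it into word tokens."""
    return clean_line(text).split()
```

In this sketch punctuation is replaced by a space, so a contraction such as "it's" becomes the two tokens "it" and "s". Choices like this may be one source of very short tokens.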


N-grams

To build our prediction model we first need to construct n-grams, specifically unigrams, bigrams, and trigrams. Figures 1-3 provide the 25 most frequent unigrams, bigrams, and trigrams in the training set, respectively.
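
The construction of n-grams and their frequency counts can be sketched as follows. This is a minimal Python illustration, not the quanteda code used for this report; the names `ngrams` and `ngram_counts` are hypothetical, and it assumes n-grams are formed within each line rather than across line boundaries.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-word tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(token_lists, n):
    """Count n-gram frequencies over many tokenized lines."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts
```

With such a counter, `ngram_counts(token_lists, 2).most_common(25)` would list the 25 most frequent bigrams.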


N-gram Distributions

To visualize the distribution of n-gram frequencies we plot the cumulative distribution function (CDF) and a (log) frequency histogram for each n-gram distribution (Figures 4-6). The distributions are heavily right-skewed: a relatively small number of n-grams have extremely large frequencies.

To better quantify the skewness of the n-gram frequency distributions we calculate the number of distinct tokens, taken from most to least frequent, required to cover 50% and 90% of all token instances. Table 2 below provides these values as well as the number of distinct tokens and the total number of token instances for each n-gram.

Table 2: N-grams

N-gram   Distinct Tokens  Total Token Instances  Distinct Tokens to Cover 50%  Distinct Tokens to Cover 90%  % Distinct to Cover 50%  % Distinct to Cover 90%
Unigram           518816               48322005                           120                          6540                     0.02                     1.26
Bigram           8820215               48322004                         36032                       2416968                     0.41                    27.40
Trigram         26792095               48322003                       1930690                       3846965                     7.21                    14.36
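
The coverage calculation behind figures of this kind can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R code used for this report; the function name `types_to_cover` is hypothetical. It sorts the frequencies in decreasing order and accumulates them until the requested share of all instances is reached.

```python
from collections import Counter

def types_to_cover(counts, fraction):
    """Number of distinct n-grams, most frequent first, needed to
    cover the given fraction of all n-gram instances."""
    total = sum(counts.values())
    target = fraction * total
    running = 0
    for k, c in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += c
        if running >= target:
            return k
    return len(counts)
```

The percentage columns then follow as `100 * types_to_cover(counts, fraction) / len(counts)`.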

Next Steps

Several considerations will be taken into account when building our n-gram prediction model on the training set, including:

  • Frequent one- or two-character unigrams (e.g., “d”, “w”, “st”) may affect model accuracy, so removing such tokens may be warranted. It would be interesting to know how others have addressed this issue.

  • In some cases users will enter a combination of words that doesn’t appear in our training text corpus. Our model should be able to handle cases where a specific n-gram has not been observed.

  • Accuracy will be estimated using the test set.

  • Response time in the Shiny app may be an issue, so we may need to consider ways to improve performance, such as:

    • Removing tokens with very few instances
    • Reducing the size of the training data (e.g., 50% random sample instead of 70%)
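
One possible way to combine several of these considerations is sketched below in Python, purely as an illustration and not as the planned R implementation. It assumes a simple back-off rule (if the longest context is unseen, retry with a shorter one), which is only one option for handling unobserved n-grams. The names `build_model`, `predict`, and `accuracy`, and the parameters `max_n` and `min_count`, are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(token_lists, max_n=3, min_count=1):
    """Map each context (0 to max_n - 1 preceding words) to next-word counts."""
    model = defaultdict(Counter)
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    # Drop continuations seen fewer than min_count times to shrink the table
    pruned = {}
    for context, nxt in model.items():
        kept = Counter({w: c for w, c in nxt.items() if c >= min_count})
        if kept:
            pruned[context] = kept
    return pruned

def predict(model, previous_words, max_n=3):
    """Return the most frequent next word, backing off to shorter contexts."""
    context = tuple(previous_words[-(max_n - 1):]) if max_n > 1 else ()
    while True:
        if context in model:
            return model[context].most_common(1)[0][0]
        if not context:
            return None            # nothing left to back off to
        context = context[1:]      # drop the oldest word and retry

def accuracy(model, token_lists, max_n=3):
    """Share of next words in held-out lines that are predicted correctly."""
    hits = total = 0
    for tokens in token_lists:
        for i in range(1, len(tokens)):
            total += 1
            hits += predict(model, tokens[:i], max_n) == tokens[i]
    return hits / total if total else 0.0
```

Raising `min_count` trades some accuracy for a smaller lookup table, and `accuracy` on the test set gives a way to measure that trade-off.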

Any feedback is much appreciated. Thank you!