KMH
2024-08-22
This report is an exploratory analysis to inform the development of our R Shiny Text Prediction app. The app will use a basic n-gram model to predict the next word from the previous 1, 2, or 3 words entered by a user.
The text data used for this project comes from three internet sources: blogs, news, and Twitter. The number of lines of text and file size for each file are given below.
| Source | Number of Lines | File Size (MB) |
|---|---|---|
| Blogs | 899288 | 210.16 |
| News | 77259 | 205.81 |
| Twitter | 2360148 | 167.11 |
| Total | 3336695 | 583.08 |
As seen in Table 1, there are over three million total lines of text.
We will randomly select 70% of the combined lines of text to train our text prediction algorithm and reserve the remaining 30% for testing.
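A minimal sketch of such a 70/30 random split in base R is shown here. The vector `all_lines` and the seed value are placeholders for illustration, not the objects used in the actual analysis.

```r
# Hypothetical stand-in for the combined blogs, news, and Twitter lines.
all_lines <- sprintf("line %d", 1:1000)

set.seed(2024)  # assumed seed, used only so the split is reproducible

n_total <- length(all_lines)
n_train <- round(0.7 * n_total)                 # 70% of lines for training

# Sample line indices without replacement.
train_idx <- sample.int(n_total, size = n_train)

train_lines <- all_lines[train_idx]
test_lines  <- all_lines[-train_idx]            # remaining 30% for testing
```

Sampling indices rather than lines keeps the two sets disjoint by construction, so no line can leak from the training set into the test set.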
The R package tm was used to clean the data. The following transformations were applied:
The R package quanteda was used to tokenize the data.
To build our prediction model we first need to construct n-grams, specifically unigrams, bigrams, and trigrams. Figures 1-3 provide the 25 most frequent unigrams, bigrams, and trigrams in the training set, respectively.
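To illustrate what constructing and ranking n-grams involves, here is a small base R sketch. It is a stand-in for illustration only: the report itself uses quanteda for tokenization, and the helper names `make_ngrams` and `top_ngrams` and the toy sentence are assumptions, not part of the analysis.

```r
# Build all n-grams of order n from a character vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Return the k most frequent n-grams of order n, most frequent first.
top_ngrams <- function(tokens, n, k = 25) {
  counts <- sort(table(make_ngrams(tokens, n)), decreasing = TRUE)
  head(counts, k)
}

# Toy text standing in for the cleaned training set (hypothetical).
text   <- "the cat sat on the mat and the cat sat down"
tokens <- strsplit(tolower(text), "[[:space:]]+")[[1]]

unigram_top <- top_ngrams(tokens, 1)
bigram_top  <- top_ngrams(tokens, 2)
trigram_top <- top_ngrams(tokens, 3)
```

On a corpus of the size described in this report, a vectorized tokenizer such as the one in quanteda would be far faster than this element-by-element loop. The sketch is only meant to make the definition concrete.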
To visualize the distribution of n-gram frequencies, we plot the cumulative distribution function (CDF) and a log-frequency histogram for each n-gram order (Figures 4-6). The distributions are heavily skewed: a relatively small number of n-grams account for a very large share of all occurrences.
To better quantify the skewness of the n-gram frequency distributions we calculate the number of distinct tokens required to cover 50% and 90% of all token instances. Table 2 below provides these values as well as the number of distinct tokens and total number of token instances for each n-gram.
| N-gram | Number of Distinct Tokens | Total Number of Token Instances | Distinct Tokens to Cover 50% | Distinct Tokens to Cover 90% | Percent Distinct Tokens to Cover 50% | Percent Distinct Tokens to Cover 90% |
|---|---|---|---|---|---|---|
| Unigram | 518816 | 48322005 | 120 | 6540 | 0.02 | 1.26 |
| Bigram | 8820215 | 48322004 | 36032 | 2416968 | 0.41 | 27.40 |
| Trigram | 26792095 | 48322003 | 1930690 | 3846965 | 7.21 | 14.36 |
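The coverage figures can be computed by sorting the n-gram counts in decreasing order and finding the first position where the cumulative share reaches the target. The sketch below shows one way to do this in base R. The function name `coverage_count` and the toy counts are assumptions for illustration, not the report's data.

```r
# Number of distinct n-grams needed to cover a given share of all instances.
coverage_count <- function(freqs, share) {
  sorted   <- sort(as.numeric(freqs), decreasing = TRUE)
  cum_frac <- cumsum(sorted) / sum(sorted)   # cumulative share of instances
  which(cum_frac >= share)[1]                # first rank reaching the target
}

# Hypothetical frequency counts (100 instances in total).
freqs <- c(the = 50, of = 20, and = 15, cat = 10, mat = 3, zebra = 2)

n50 <- coverage_count(freqs, 0.50)   # distinct tokens to cover 50%
n90 <- coverage_count(freqs, 0.90)   # distinct tokens to cover 90%

# Express the counts as a percentage of all distinct tokens.
pct50 <- 100 * n50 / length(freqs)
pct90 <- 100 * n90 / length(freqs)
```

In this toy example one token already covers half of all instances, while four of the six are needed to reach 90%.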
Several considerations will be taken into account when building our n-gram prediction model on the training set, including:
- Frequent one- or two-character unigrams (e.g., “d”, “w”, “st”) may reduce model accuracy, so removing such tokens may be warranted. It would be interesting to know how others have addressed this issue.
- In some cases users will enter a combination of words that does not appear in our training corpus. Our model should be able to handle cases where a specific n-gram is not observed.
- Accuracy will be estimated using the test set.
- Timing in the Shiny app may be an issue, so we may need to consider ways to improve performance.
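On the unobserved n-gram consideration, one common option is a simple backoff: try the longest available context first and fall back to shorter contexts when no match is found. The sketch below is only an illustration under stated assumptions. The count vectors, the helper names `best_continuation` and `predict_next`, and the limit of two context words (matching a trigram model) are all hypothetical, and no smoothing or discounting is applied.

```r
# Hypothetical n-gram counts; names are space-separated n-grams.
trigram_counts <- c("thanks for the" = 5, "for the help" = 3, "for the follow" = 1)
bigram_counts  <- c("for the" = 9, "the help" = 3, "the same" = 4, "of the" = 7)
unigram_counts <- c("the" = 30, "and" = 12, "help" = 3)

# Most frequent word following `context` in `counts`, or NA if unseen.
best_continuation <- function(context, counts) {
  prefix  <- paste0(context, " ")
  matches <- counts[startsWith(names(counts), prefix)]
  if (length(matches) == 0) return(NA_character_)
  best <- names(matches)[which.max(matches)]
  substring(best, nchar(prefix) + 1)   # keep only the predicted word
}

# Predict the next word, backing off from trigrams to bigrams to unigrams.
predict_next <- function(words) {
  words <- tail(words, 2)              # use at most two words of context
  while (length(words) > 0) {
    counts <- if (length(words) == 2) trigram_counts else bigram_counts
    guess  <- best_continuation(paste(words, collapse = " "), counts)
    if (!is.na(guess)) return(guess)
    words <- words[-1]                 # drop the oldest word and back off
  }
  # No context matched: fall back to the most frequent unigram.
  names(unigram_counts)[which.max(unigram_counts)]
}

predict_next(c("for", "the"))       # context found among the trigrams
predict_next(c("of", "the"))        # unseen trigram context, backs off to bigrams
predict_next(c("purple", "zebra"))  # nothing matches, falls back to unigrams
```

Because every input eventually reaches the unigram fallback, the function always returns a word, which is the behavior the app needs for arbitrary user input.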
Any feedback is much appreciated. Thank you!