Milestone Report

KMH

2024-08-22

Synopsis

This report presents an exploratory analysis to inform the development of our R Shiny text prediction app. The app will use a basic n-gram model to predict the next word from the previous one, two, or three words entered by the user.


Data

The text data used for this project comes from three internet sources: blogs, news, and Twitter. The number of lines of text and file size for each file are given below.

**Table 1: Text Data Sources**

| Source | Number of Lines | File Size (MB) |
|---|---|---|
| Blogs | 899288 | 210.16 |
| News | 77259 | 205.81 |
| Twitter | 2360148 | 167.11 |
| Total | 3336695 | 583.08 |

As Table 1 shows, the three sources together contain over 3.3 million lines of text.

Training Data

Seventy percent of the combined lines of text will be randomly selected for training our text prediction algorithm, and the remaining 30% will be reserved for testing.
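A random 70/30 split of this kind can be sketched as follows. This is an illustration in Python rather than the R code behind this report; the function name `train_test_split`, the fixed seed, and the toy `lines` list are assumptions made for the example.

```python
import random

def train_test_split(lines, train_fraction=0.7, seed=42):
    """Randomly assign lines to a training set and a test set."""
    rng = random.Random(seed)              # fixed seed (assumed) so the split is reproducible
    indices = list(range(len(lines)))
    rng.shuffle(indices)                   # random order of line indices
    cut = round(len(lines) * train_fraction)
    train = [lines[i] for i in indices[:cut]]
    test = [lines[i] for i in indices[cut:]]
    return train, test

# Toy stand-in for the combined blogs, news, and Twitter lines.
lines = [f"line {i}" for i in range(10)]
train, test = train_test_split(lines)
print(len(train), len(test))
```

Splitting on shuffled indices keeps every line in exactly one of the two sets, so no test line leaks into training.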

Cleaning Data

The R package tm was used to clean the data. The following transformations were applied:

- Conversion to lowercase
- Removal of swear words
- Removal of symbols
- Removal of numbers
- Removal of punctuation
- Removal of non-ASCII characters
- Removal of excess white space

The R package quanteda was used to tokenize the data.
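The cleaning and tokenization steps above can be approximated in a few lines. The sketch below is in Python for illustration only and is not the tm or quanteda code used for this report. The `PROFANITY` set is a placeholder (the actual word list is not given here), the steps run in a slightly different order than listed, and tokenization is a plain whitespace split.

```python
import re

# Placeholder word list; the list actually used for the report is not shown.
PROFANITY = {"darn", "heck"}

def clean_line(text, profanity=PROFANITY):
    """Apply the listed transformations to one line of text."""
    text = text.lower()                                           # lowercase
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"[^a-z\s]", " ", text)                         # drop numbers, punctuation, symbols
    words = [w for w in text.split() if w not in profanity]       # drop swear words
    return " ".join(words)                                        # collapse excess white space

def tokenize(text):
    """Split a cleaned line into word tokens on white space."""
    return clean_line(text).split()

print(tokenize("Hello, WORLD!! It's 2024 darn caf\u00e9"))
```

Because punctuation is replaced by a space in this sketch, a contraction such as "it's" splits into "it" and "s", which is one way short one- or two-character tokens can end up in the unigram counts.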

N-grams

To build our prediction model we first need to construct n-grams, specifically unigrams, bigrams, and trigrams. Figures 1-3 provide the 25 most frequent unigrams, bigrams, and trigrams in the training set, respectively.
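As a rough illustration of how n-grams are built and counted from tokenized lines, consider the Python sketch below. It is not the quanteda code used for this report; the helper names `ngrams` and `count_ngrams` and the toy `docs` data are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(token_lists, n):
    """Count n-gram frequencies over a collection of tokenized lines."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))   # n-grams never span two lines
    return counts

# Toy tokenized lines standing in for the training set.
docs = [["the", "cat", "sat"], ["the", "cat", "ran"]]
bigrams = count_ngrams(docs, 2)
print(bigrams.most_common(1))
```

Calling `most_common(25)` on each counter would give rankings like those plotted in Figures 1-3.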

N-gram Distributions

To visualize the distribution of n-gram frequencies, we plot the cumulative distribution function (CDF) and a log-frequency histogram for each n-gram distribution (Figures 4-6). The distributions are heavily skewed: a relatively small number of n-grams account for a very large share of all occurrences.

To better quantify the skewness of the n-gram frequency distributions, we calculate the number of distinct tokens, taken in order of decreasing frequency, required to cover 50% and 90% of all token instances. Table 2 below provides these values as well as the number of distinct tokens and the total number of token instances for each n-gram.

**Table 2: N-grams**

| N-gram | Number of Distinct Tokens | Total Number of Token Instances | Distinct Tokens to Cover 50% | Distinct Tokens to Cover 90% | Percent of Distinct Tokens to Cover 50% | Percent of Distinct Tokens to Cover 90% |
|---|---|---|---|---|---|---|
| Unigram | 518816 | 48322005 | 120 | 6540 | 0.02 | 1.26 |
| Bigram | 8820215 | 48322004 | 36032 | 2416968 | 0.41 | 27.40 |
| Trigram | 26792095 | 48322003 | 1930690 | 3846965 | 7.21 | 14.36 |
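A coverage figure of the kind reported in Table 2 can be computed by sorting n-grams from most to least frequent and accumulating counts until the target share is reached. The Python sketch below illustrates the idea on made-up counts. The function name `tokens_to_cover` and the toy frequencies are assumptions, and this is not the code that produced the table.

```python
def tokens_to_cover(counts, fraction):
    """Smallest number of distinct n-grams, most frequent first, whose
    counts sum to at least `fraction` of all instances."""
    total = sum(counts.values())
    target = fraction * total
    running = 0
    for k, freq in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += freq
        if running >= target:
            return k
    return len(counts)

# Made-up unigram counts (100 instances in total), not the report's data.
counts = {"the": 50, "of": 30, "cat": 12, "dog": 5, "emu": 3}
k50 = tokens_to_cover(counts, 0.5)
k90 = tokens_to_cover(counts, 0.9)
pct90 = 100 * k90 / len(counts)   # percent of distinct tokens needed for 90% coverage
print(k50, k90, pct90)
```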

Next Steps

Several considerations will be taken into account when building our n-gram prediction model on the training set, including:

- Frequent single- or double-character unigrams (e.g., “d”, “w”, “st”) may reduce model accuracy, so removing such tokens may be warranted. It would be interesting to know how others have addressed this issue.
- Users will sometimes enter a combination of words that does not appear in our training corpus. Our model should be able to handle cases where a specific n-gram has not been observed.
- Accuracy will be estimated using the test set.
- Response time in the Shiny app may be an issue, so we may have to consider ways to improve performance, such as:
  - Removing tokens with very few instances
  - Reducing the size of the training data (e.g., a 50% random sample instead of 70%)
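One possible way to handle unobserved n-grams, prune rare tokens, and estimate accuracy is a simple backoff scheme: try the longest available context first and fall back to shorter ones. The Python sketch below is only an illustration under that assumption. The report has not committed to a backoff or smoothing method, and the names `build_model`, `predict`, `accuracy`, and `min_count`, as well as the toy data, are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(token_lists, max_n=3, min_count=1):
    """Map each context (a tuple of 0 to max_n-1 words) to next-word counts."""
    model = defaultdict(Counter)
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    # Prune continuations seen fewer than min_count times to shrink the model.
    pruned = {}
    for context, nxt in model.items():
        kept = Counter({w: c for w, c in nxt.items() if c >= min_count})
        if kept:
            pruned[context] = kept
    return pruned

def predict(model, history, max_n=3):
    """Back off from the longest usable context to the unigram counts."""
    for k in range(min(max_n - 1, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None   # only reached if the model is empty

def accuracy(model, token_lists, max_n=3):
    """Share of next words predicted correctly on held-out lines."""
    hits = total = 0
    for tokens in token_lists:
        for i in range(1, len(tokens)):
            total += 1
            hits += predict(model, tokens[:i], max_n) == tokens[i]
    return hits / total if total else 0.0

# Toy training lines; the real model would use the 70% training sample.
docs = [["i", "love", "cats"], ["i", "love", "cats"], ["you", "love", "dogs"],
        ["we", "love", "dogs"], ["we", "love", "dogs"]]
model = build_model(docs)
print(predict(model, ["i", "love"]))      # trigram context was observed
print(predict(model, ["they", "love"]))   # unseen trigram context, backs off to the bigram
```

Raising `min_count` trades some accuracy for a smaller lookup table, which is the kind of performance adjustment the last bullet has in mind.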

Any feedback is much appreciated, thank you!