# Load project helper functions and the packages used throughout the analysis
source('./0 - FUNCTIONS.R')   # project helper functions
library(qdap)                 # text cleaning utilities
library(data.table)           # fast tabular data handling
library(ngram)                # n-gram tokenization
library(stringi)              # string processing
The following is an exploratory analysis prepared for the Data Science Specialization Capstone project. The project’s objective was to develop a predictive text application using English corpora drawn from blog, Twitter, and news sources (found here).
File sizes, term frequencies, vocabulary size (unique terms), and word-coverage statistics were computed on the corpora to inform the approach for developing a predictive model.
First, we can analyze the size of each file, along with its length (number of lines) and a summary of the number of characters per line:
##         Size (MB) Num. of Lines Min Chars (per line) Mean Chars (per line) Max Chars (per line)
## blogs      200.42        899288                    1             231.69601                40835
## news       196.28         77259                    2             203.00243                 5760
## twitter    159.36       2360148                    2              68.80281                  213
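For reference, the sketch below shows one way these summary statistics can be computed; the file paths and the summarize_file helper are illustrative assumptions, not code taken from the project scripts.
# Hypothetical paths to the raw corpus files (adjust to the local layout)
files <- c(blogs   = './data/en_US.blogs.txt',
           news    = './data/en_US.news.txt',
           twitter = './data/en_US.twitter.txt')
summarize_file <- function(path) {
  lines <- readLines(path, encoding = 'UTF-8', skipNul = TRUE)
  chars <- nchar(lines)
  c(Size_MB    = round(file.size(path) / 1024^2, 2),
    Num_Lines  = length(lines),
    Min_Chars  = min(chars),
    Mean_Chars = mean(chars),
    Max_Chars  = max(chars))
}
t(sapply(files, summarize_file))   # one row of statistics per source file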
From the table above, we see that the files are very large. To analyze word coverage, we therefore sample the corpus at several subset percentages and track the number of unique terms at each sample size, to see whether the vocabulary tapers off at a level we can extrapolate to the full files.
## Unique.Terms Total.Instances
## Sample_0.01 17342 1977979
## Sample_0.025 29953 5037452
## Sample_0.05 44112 10107684
## Sample_0.07 53180 14204144
## Sample_0.10 64647 20280038
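The sampling step can be sketched as follows; sample_unique_terms is a hypothetical helper shown only to illustrate the approach, not the project’s actual code.
# Draw a fixed percentage of lines, tokenize them, and count unique terms vs. total instances
sample_unique_terms <- function(lines, pct) {
  set.seed(1234)                                   # fixed seed for reproducible samples
  sampled <- sample(lines, size = floor(length(lines) * pct))
  tokens  <- unlist(stri_extract_all_words(stri_trans_tolower(sampled)))
  tokens  <- tokens[!is.na(tokens)]
  c(Unique.Terms = length(unique(tokens)), Total.Instances = length(tokens))
}
# e.g. sample_unique_terms(all_lines, 0.025) for the 2.5% sample, where all_lines
# holds the combined blog, news, and twitter lines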
From the table and plot above, the number of unique terms shows no sign of leveling off within the range of computationally feasible sample sizes. It also isn’t practical to keep testing larger samples: the largest sample tested (10%) already corresponds to roughly a 55 MB subset (10% of the combined 200 + 196 + 159 MB blogs, news, and twitter files). However, closer inspection of the unique terms and their frequencies at the 10% level shows that many of these ‘unique’ occurrences are tokens containing special characters, or terms that are not typical English words at all. The number of such tokens can only be expected to grow as the sample size increases, inflating the estimated count of ‘unique’ terms at larger sample sizes.
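To illustrate the issue, a quick filter with stringi separates tokens made of plain letters and apostrophes from tokens carrying special characters; the toy terms vector below is only an example, the real terms come from the 10% sample.
# Toy vocabulary showing how special-character tokens inflate the unique-term count
terms <- c("hello", "it's", "café", "12th", ":-)", "naïve", "word2vec")
is_clean <- stri_detect_regex(terms, "^[a-z']+$")
terms[!is_clean]   # tokens counted as 'unique' terms that are not plain English words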
In addition, we can review how much of each sample is covered by its most frequent terms (many of them stop words) to get an idea of how skewed the frequency distribution is. The table below shows, for each sample size, the proportion of all word instances accounted for by the top 10,000 terms:
##   Sample (%) Coverage (top 10,000 terms)
## 1        1.0                   0.9715351
## 2        2.5                   0.9390708
## 3        5.0                   0.9282348
## 4        7.0                   0.9247853
## 5       10.0                   0.9221163
From the analysis above, we see that at every sample size the top 10,000 words cover more than 90% of all word instances in the subset (even at the 10% sample, with roughly 65,000 unique terms and 20 million total instances). Since nearly all instances fall within the top 10,000 words, we can have reasonable assurance that our model will capture most n-grams fed into the predictor.
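A minimal sketch of the coverage calculation is shown below; coverage_top_n is a hypothetical helper, assuming the sampled lines are available as a character vector.
coverage_top_n <- function(lines, n = 10000) {
  tokens <- unlist(stri_extract_all_words(stri_trans_tolower(lines)))
  tokens <- tokens[!is.na(tokens)]
  freq   <- sort(table(tokens), decreasing = TRUE)   # term frequencies, most common first
  sum(freq[seq_len(min(n, length(freq)))]) / length(tokens)
}
# e.g. coverage_top_n(sampled_lines) should return a value above 0.92, per the table above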
For simplicity, and due to computational and memory constraints, we will use the 2.5% subset to train the final model.
Accuracy was tested by running the final prediction function (found in the script ‘3 - Predictive Text App - Model.R’) on the testing set in the ‘./data/ngrams/testing’ folder. The test n-gram, target word, prediction, and accuracy result for each row were recorded and written to separate tables in ‘./data/Results’.
The script developed for this accuracy testing is found in the ‘4 - Predictive Text App - Testing.R’ file within the accompanying GitHub repository.
For simplicity, a prediction was scored 1 if the top predicted word was identical to the target word, and 2 or 3 if the target word appeared as the second or third prediction. If the target word was not among the top 3 predictions, ‘NA’ was returned. From these scores we can calculate the top-prediction hit rate (top_pct), the top-3 hit rate (top3_pct), and the miss rate (missed_pct); a minimal sketch of this scoring scheme follows.
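The sketch below illustrates the scoring rule just described; score_prediction and the toy scores vector are illustrative stand-ins, not the project’s actual testing code.
# Scoring rule: 1 = top prediction hit, 2-3 = within top 3, NA = miss
score_prediction <- function(target, top3) {
  hit <- match(target, top3)                 # position of the target in the top-3 predictions
  if (is.na(hit)) NA_integer_ else hit
}
scores <- c(1L, NA, 2L, NA, 3L, 1L)          # toy scores for illustration only
top_pct    <- round(100 * mean(scores %in% 1), 2)      # share of tests where the top prediction hit
top3_pct   <- round(100 * mean(scores %in% 1:3), 2)    # share of tests where the target was in the top 3
missed_pct <- round(100 * mean(is.na(scores)), 2)      # share of tests missed entirely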
The accuracy testing results are as follows:
## top_pct top3_pct missed_pct
## [1,] 11.51 21.31 78.69
We also review the accuracy of the prediction model by the length of the n-gram input (rows 1 through 4 below correspond to inputs of one to four words):
## top_pct top3_pct missed_pct
## 1 8.61 19.21 80.79
## 2 15.79 26.32 73.68
## 3 10.31 18.56 81.44
## 4 13.33 21.67 78.33
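As a sketch, this per-n-gram breakdown can be reproduced with a data.table aggregation; the results table and its column names below are assumptions about the stored testing output, not the project’s actual schema.
# Toy results table; the real rows would be read from './data/Results'
results <- data.table(ngram_size = c(1, 1, 2, 3, 4),
                      score      = c(1, NA, 2, 3, NA))
results[, .(top_pct    = round(100 * mean(score %in% 1), 2),
            top3_pct   = round(100 * mean(score %in% 1:3), 2),
            missed_pct = round(100 * mean(is.na(score)), 2)),
        by = ngram_size]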
From the exploratory analysis performed, a small subset (2.5% of the total corpora) was used to build the final dictionary for the predictive model.
A Stupid Backoff model was approximated with a function that looks up the end of the partially completed phrase in the largest available n-gram table and ‘backs off’ to (n-1)-gram tables when no match is found.
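A simplified sketch of this backoff lookup is shown below; the ngram_tables list (one data.table per prefix length, with ‘prefix’, ‘next_word’, and ‘count’ columns) is an assumed layout for illustration, not the project’s actual data structure.
predict_backoff <- function(phrase, ngram_tables, top_n = 3) {
  words <- tail(stri_extract_all_words(stri_trans_tolower(phrase))[[1]], length(ngram_tables))
  for (n in rev(seq_len(length(words)))) {                    # start with the longest usable prefix
    prefix_str <- paste(tail(words, n), collapse = " ")
    hits <- ngram_tables[[n]][.(prefix_str), on = "prefix", nomatch = 0L]
    if (nrow(hits) > 0)
      return(head(hits[order(-count), next_word], top_n))     # most frequent continuations first
  }
  c("the", "to", "and")                                       # fall back to common words if nothing matches
}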
From the final model, a top-prediction accuracy of 11.5% was achieved (lowest at 8.61% for unigram inputs and highest at 15.8% for bigram inputs). In addition, a top-3 accuracy of 21.3% was achieved, corresponding to a total ‘miss’ rate of 78.7%.
Moving forward, we plan to extend the model with higher-order n-grams to boost accuracy for longer phrase inputs, and to explore stemming/lemmatization techniques to reduce the total size of the final model’s dictionary.