Frank D. Evans
Data Science Specialization - Johns Hopkins University
Intro: The overall goal of this project is to create a text prediction system similar to the SwiftKey keyboard application for the Android mobile platform. This analysis is a Milestone Report on Data Handling, Exploratory Data Analysis, Model Design, and planned Application Architecture.
Exploratory Data Analysis
Load Data
news <- scan(file = './data/en_US.news.txt', what = 'character', sep = '\n')
blogs <- scan(file = './data/en_US.blogs.txt', what = 'character', sep = '\n')
twitter <- scan(file = './data/en_US.twitter.txt', what = 'character', sep = '\n', skipNul = TRUE)
Clean Corpus Contents and Calculate Word Counts
library(stringr)
clean_corpus <- function(corpus) {
  # Lowercase everything so counts are case-insensitive.
  output <- tolower(corpus)
  # Drop apostrophes so contractions stay together as single tokens ("don't" -> "dont").
  output <- str_replace_all(output, pattern = "'", replacement = "")
  # Remove non-printable characters.
  output <- str_replace_all(output, pattern = "[^[:print:]]", replacement = "")
  # Remove everything that is not a letter or whitespace (punctuation, digits, symbols).
  output <- str_replace_all(output, pattern = "[^[:alpha:][:space:]]", replacement = "")
  # Trim leading/trailing whitespace and collapse internal runs of spaces.
  output <- str_trim(output, side = 'both')
  output <- str_replace_all(output, pattern = " {2,}", replacement = " ")
  # Split each document into a character vector of words.
  output <- str_split(output, pattern = "[[:space:]]")
  output
}
news_clean <- clean_corpus(news)
blogs_clean <- clean_corpus(blogs)
twitter_clean <- clean_corpus(twitter)
doc_counts <- c(length(news), length(blogs), length(twitter))
word_counts <- c(length(unlist(news_clean)),
length(unlist(blogs_clean)),
length(unlist(twitter_clean)))
eda_df <- data.frame(corpus_name = c('news','blogs','twitter'),
doc_counts = doc_counts,
word_counts = word_counts)
eda_df$avg_doc_length <- word_counts / doc_counts
eda_df
## corpus_name doc_counts word_counts avg_doc_length
## 1 news 1010242 33535452 33.20
## 2 blogs 899288 36886284 41.02
## 3 twitter 2360148 29413068 12.46
Before statistics are calculated, the documents are cleaned: all characters are cast to lowercase, punctuation is removed (apostrophes are dropped rather than replaced, so contractions remain single tokens), excess whitespace is stripped out, and each document is split into words. The twitter corpus contains the most documents by a large margin, while the news and blogs corpora are roughly equivalent. However, average document length differs considerably across the three corpora, with blogs averaging the most words per document and tweets the fewest.
Overall Word Frequency
word_freq_all <- read.table(file = './data/int_word_freq_all.csv',
header = TRUE, sep = '|', stringsAsFactors = FALSE)
head(word_freq_all)
## v count probability
## 1 the 4749579 0.04758
## 2 to 2752221 0.02757
## 3 and 2402175 0.02406
## 4 a 2378401 0.02382
## 5 of 2005149 0.02009
## 6 in 1642766 0.01646
Load the word frequency data frame produced during earlier pre-processing. For each word in the combined corpus, it holds the integer count of occurrences and the blended probability of that word. The data frame is sorted by word frequency, so examining the first few records shows the structure of the word count data.
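The pre-processing script that produced this file is not reproduced in this report, but a table of the same shape can be sketched from the cleaned corpora along these lines (a minimal sketch; it pools all three corpora into a single probability rather than the blended probability used in the actual pre-processing):
# Sketch: rebuild a word frequency table from the cleaned corpora above.
all_words <- c(unlist(news_clean), unlist(blogs_clean), unlist(twitter_clean))
all_words <- all_words[all_words != ""]                    # drop empty tokens left by splitting
word_tab  <- sort(table(all_words), decreasing = TRUE)     # counts, most frequent first
word_freq <- data.frame(v           = names(word_tab),
                        count       = as.integer(word_tab),
                        probability = as.integer(word_tab) / length(all_words),
                        stringsAsFactors = FALSE)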
length(word_freq_all$probability)
## [1] 996922
sum(word_freq_all$probability[1:150])
## [1] 0.5059
150 / length(word_freq_all$probability)
## [1] 0.0001505
Although there are nearly one million unique words across the corpus documents, the 150 most common account for slightly more than half of all word occurrences. Those 150 words represent less than 0.02% of the unique vocabulary, yet account for more than half of word usage. Not surprisingly, they are heavily dominated by ‘stop words’, the connective words used between thoughts. While these words often do not carry much information on their own, they provide the context for how the information in a sentence fits together. And, because they are so common, predicting them accurately will be a major factor in making accurate text predictions overall.
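This coverage can be checked directly from the cumulative probabilities (a minimal sketch, assuming the sorted word_freq_all table loaded above):
# Cumulative share of word usage covered by the n most frequent words.
coverage <- cumsum(word_freq_all$probability)
min(which(coverage >= 0.5))     # vocabulary size needed to cover 50% of usage
min(which(coverage >= 0.9))     # vocabulary size needed to cover 90% of usage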
plot(word_freq_all$count, type = 'l', main = "Word Frequency Probability - All Words")
When all word frequencies are plotted, the long tail is so dominant that the falloff of the most common words appears to be a vertical line.
plot(word_freq_all$count[1:500], type = 'l', main = "Word Frequency Probability - Top 500 Words")
Reducing the plot from the full vocabulary of nearly one million words to just the top 500 keeps the steep falloff in scale and shows a bit more detail.
Word Tuple Frequency
word_tuple_n2 <- read.table(file = './data/int_ref_all_n2.csv',
header = TRUE, sep = '|', stringsAsFactors = FALSE)
head(word_tuple_n2)
## k1 k2 v count probability
## 1 one of the 34558 0.0007077
## 2 a lot of 30010 0.0006146
## 3 thanks for the 23761 0.0004866
## 4 to be a 18200 0.0003727
## 5 going to be 17425 0.0003569
## 6 i want to 14961 0.0003064
Tuple frequency counts were computed during pre-processing for 2-, 3-, and 4-word tuples, each defined as n key words followed by a single value word that serves as the prediction tail. For the 3-word version loaded here (2 key words, 1 prediction value), examining the data frame shows the structure. As part of the pre-processing, only tuples that appeared more than once across the corpus sets were kept.
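The tuple tables themselves come from pre-processing, but the structure can be sketched for the 2-key case as follows (a minimal sketch on the news corpus only; the real pre-processing, including the blended probability and full-corpus counts, is not shown here):
# Sketch: build (k1, k2, v) tuples with a sliding 3-word window within each document.
make_tuples_n2 <- function(words) {
  n <- length(words)
  if (n < 3) return(NULL)                          # need at least one full 3-word window
  data.frame(k1 = words[1:(n - 2)],
             k2 = words[2:(n - 1)],
             v  = words[3:n],
             stringsAsFactors = FALSE)
}
# Slow on the full corpus; shown for structure only.
tuples <- do.call(rbind, lapply(news_clean, make_tuples_n2))
tuple_counts <- aggregate(list(count = rep(1, nrow(tuples))),
                          by = tuples[, c("k1", "k2", "v")], FUN = sum)
tuple_counts <- tuple_counts[tuple_counts$count > 1, ]        # keep recurrent tuples only
tuple_counts$probability <- tuple_counts$count / sum(tuple_counts$count)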
length(word_tuple_n2$probability)
## [1] 6014412
sum(word_tuple_n2$probability[1:150])
## [1] 0.02283
sum(word_tuple_n2$probability[1:10000])
## [1] 0.1765
It is not surprising that there are many more unique tuples than unique words, even when only recurrent cases are kept, simply because of the combinatorial number of possible word combinations. As a result, even the top 10,000 tuples cover only about 18% of the total probability, and the top 150 cover only about 2%.
plot(word_tuple_n2$count[1:150], type = 'l',
     main = "Tuple n=3 Frequency Probability - Top 150 Tuples")
The comparative dropoff is much less pronounced on the same 150-item scale. When predictions can be made from recurring phrases, accuracy is expected to improve. However, this comes with a performance and memory tradeoff: the tuple tables require holding several orders of magnitude more data in memory and searching through it to return the most probable predictions.
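The difference in footprint between the single-word table and the 2-key tuple table already loaded can be checked directly (a minimal sketch):
# Compare in-memory sizes of the word table and the 2-key tuple table.
format(object.size(word_freq_all), units = "MB")
format(object.size(word_tuple_n2), units = "MB")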
Planned Application Architecture
The application will use a waterfall format. When given a phrase, the application will make 5 predictions as to the next word. The predictions will be drawn from the following component models in order, with candidates within each model ranked by frequency:
* Predictions from all matching n=4 tuples (3 keys, 1 value)
* Predictions from all matching n=3 tuples (2 keys, 1 value)
* Predictions from all matching n=2 tuples (1 key, 1 value)
* Predictions from the most commonly used words by flat frequency
All duplicate predictions will be removed so that 5 unique predictions are returned for every entry. The application will also skip any prediction sub-model in the waterfall that is not applicable (e.g. 3-key predictions when only 2 words have been typed so far), as sketched below.
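A minimal sketch of that waterfall follows. The table names word_tuple_n3 (3 keys) and word_tuple_n1 (1 key) are placeholders for the other tuple tables, and the function is illustrative rather than the final application code:
# Sketch of the waterfall: back off from 3-key to 2-key to 1-key tuples,
# then fill any remaining slots with the most frequent single words.
predict_next <- function(phrase, n_pred = 5) {
  words <- unlist(clean_corpus(phrase))
  k     <- length(words)
  preds <- character(0)
  if (k >= 3) {                                    # n = 4 tuples: 3 keys, 1 value
    hit   <- word_tuple_n3[word_tuple_n3$k1 == words[k - 2] &
                           word_tuple_n3$k2 == words[k - 1] &
                           word_tuple_n3$k3 == words[k], ]
    preds <- c(preds, hit$v[order(hit$count, decreasing = TRUE)])
  }
  if (k >= 2) {                                    # n = 3 tuples: 2 keys, 1 value
    hit   <- word_tuple_n2[word_tuple_n2$k1 == words[k - 1] &
                           word_tuple_n2$k2 == words[k], ]
    preds <- c(preds, hit$v[order(hit$count, decreasing = TRUE)])
  }
  if (k >= 1) {                                    # n = 2 tuples: 1 key, 1 value
    hit   <- word_tuple_n1[word_tuple_n1$k1 == words[k], ]
    preds <- c(preds, hit$v[order(hit$count, decreasing = TRUE)])
  }
  preds <- c(preds, word_freq_all$v)               # flat word frequency as the final fallback
  head(unique(preds), n_pred)                      # drop duplicates, keep the top n_pred
}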
Data Size: The main constraint on this architecture is data size and its integration into the Shiny application. To make the most accurate predictions in all scenarios, the application would ideally include tuple frequency tables as large as can be computed. However, tables that size would be too large for the Shiny environment to handle and would likely make predictions slow enough that end users wait too long. Additionally, the Shiny source code cannot natively bundle data at that scale, so the application will need to integrate with data stored in a cloud location (Dropbox, or the like) where application performance requires it. Thus, the application will be developed to rely primarily on outside data integrated from the cloud, but will remain operational if that data becomes unavailable.
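One way to implement that fallback is to attempt the cloud download first and fall back to a smaller bundled table if it fails (a minimal sketch; the URL and local file name below are placeholders, not the real locations):
# Try to fetch the full tuple table from cloud storage; if the download or read
# fails, fall back to a smaller table bundled with the Shiny app.
load_tuple_table <- function(remote_url = "https://example.com/int_ref_all_n2.csv",
                             local_file = "./data/int_ref_small_n2.csv") {
  tryCatch({
    tmp <- tempfile(fileext = ".csv")
    download.file(remote_url, destfile = tmp, quiet = TRUE)
    read.table(tmp, header = TRUE, sep = "|", stringsAsFactors = FALSE)
  }, error = function(e) {
    # Cloud copy unavailable: use the smaller local table so the app keeps working.
    read.table(local_file, header = TRUE, sep = "|", stringsAsFactors = FALSE)
  })
}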