Introduction

The model uses an n-gram frequency based approach for predictive text analysis. From the English (US) data set given in the course I will build a frequency table of every n-gram combination, covering 2-grams, 3-grams and so on. Based on the coverage percentage we will have to tune the minimum n-gram length needed.

Once we generate these n-gram tables, the prediction problem reduces to efficiently storing and searching the tables for the most likely next word the user may type.

The challenge with this approach is efficiency, in terms of both execution speed and memory usage, when storing and searching the n-gram database. A further challenge is updating the table dynamically at runtime to account for n-gram phrases typed by the user that were not part of the training set. By updating the n-gram table dynamically we can ensure a much higher success percentage.
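
As a rough sketch of the dynamic update idea (not the final implementation), the function below bumps the count of a user-typed phrase in a frequency table with the columns ngrams, freq and prop, like the tables built later in this report, and inserts a new row when the phrase was never seen in the training set.

## Sketch: increment the count of a user-typed n-gram, adding a new row
## if the phrase is not yet in the table. Assumes the table has the
## columns ngrams, freq and prop as returned by get.phrasetable().
update_ngram_table <- function(ngram_df, phrase) {
  idx <- which(trimws(ngram_df$ngrams) == trimws(phrase))
  if (length(idx) == 1) {
    ngram_df$freq[idx] <- ngram_df$freq[idx] + 1  ## phrase seen before
  } else {
    ngram_df <- rbind(ngram_df,
                      data.frame(ngrams = phrase, freq = 1, prop = NA))
  }
  ngram_df
}

## Example: ngram2df <- update_ngram_table(ngram2df, "predictive texting")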

Exploratory Analysis

Load Training Set

setwd('~/Downloads/final/en_US/')
size <- 10000

con <- file("en_US.twitter.txt", "r")
lines_tweets <- readLines(con, size) ## Read the first `size` lines of Twitter text
length_tweet_file <- length(lines_tweets)
close(con) ## Close the file connection

con <- file("en_US.blogs.txt", "r")
lines_blogs <- readLines(con, size) ## Read the first `size` lines of blog text
length_blog_file <- length(lines_blogs)
close(con)

con <- file("en_US.news.txt", "r")
lines_news <- readLines(con, size) ## Read the first `size` lines of news text
length_news_file <- length(lines_news)
close(con)

Currently only a sample of the training set is loaded in memory: 10^4 lines from each file, for a total of 30000 lines. Given the huge amount of data in these files, it would be impractical to work on the entire data set for exploratory analysis, so I'll only consider the first 10^4 lines of each data set for exploration. The assumption here is that the first 10^4 lines have a word probability distribution similar to that of the rest of the data set.
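
For reference, the full files are far larger than this sample. The total line count of a file can be obtained without holding the whole file in memory, for example by reading it in chunks; this is a sketch with an arbitrary chunk size, not part of the analysis above.

## Count the lines of a file chunk by chunk so the whole file never has
## to be loaded at once. The chunk size of 1e5 lines is arbitrary.
count_lines <- function(path, chunk = 1e5) {
  con <- file(path, "r")
  on.exit(close(con))
  total <- 0
  repeat {
    batch <- readLines(con, chunk)
    total <- total + length(batch)
    if (length(batch) < chunk) break  ## reached end of file
  }
  total
}

## Example: count_lines("en_US.twitter.txt")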

n-gram Package

The ngram package can be downloaded and installed from CRAN. It provides a built-in method to generate n-grams from a given string, as well as helper functions such as concatenate to combine strings. More details on the ngram package can be found in the reference link.

library(ngram)
training_set <- concatenate(lines_tweets, lines_blogs, lines_news) ## Concatenate all data into a single training_set string
## Generate the n-gram models. ngram() returns an object of class ngram
ngram2Model <- ngram(training_set, n=2)
ngram3Model <- ngram(training_set, n=3)

## Convert n-gram class variable to a frequency table data frame
ngram2df <- get.phrasetable(ngram2Model)
ngram3df <- get.phrasetable(ngram3Model)

total_word_num <- sapply(strsplit(training_set, " "), length) ## Total number of space-separated tokens in the training set

Word frequency plots

The total number of words in the training set is 884044. This count includes all punctuation marks and special characters.

histogram_size <- 10

## Most frequently occurring words
ngram1Model <- ngram(training_set, n=1)
ngram1df <- get.phrasetable(ngram1Model)
ngram1df <- ngram1df[with(ngram1df, order(-freq, ngrams)), ]
df <- ngram1df[1:histogram_size,]
barplot(df$freq, names.arg=df$ngrams, main="1-gram frequency distribution", xlab="Words")

## Frequency sort the n-gram database
ngram2df <- ngram2df[with(ngram2df, order(-freq, ngrams)), ]
ngram3df <- ngram3df[with(ngram3df, order(-freq, ngrams)), ]

## Top 10 most frequent 2-grams
df <- ngram2df[1:histogram_size,]
barplot(df$freq, names.arg=df$ngrams, main="2-gram frequency distribution", xlab="Words")

## Top 10 most frequent 3-grams
df <- ngram3df[1:histogram_size,]
barplot(df$freq, names.arg=df$ngrams, main="3-gram frequency distribution", xlab="Words")

Data Coverage

It is interesting to note the number of unique words needed to cover, say, 50% of all word instances. To get this number we accumulate the frequency column of the frequency-sorted 1-gram data, starting from row 1. The row number at which the cumulative frequency exceeds 50% of all word instances is the number of unique words needed.

coverage <- c(0.5, 0.9)
coverage_criteria_index <- c( 0, 0)
expected_sum = coverage*total_word_num

for (coverage_index in 1:length(coverage) ) {
  calculated_sum <- 0
  dummy_index <- 1
  repeat{
    calculated_sum <- calculated_sum + ngram1df[dummy_index,]$freq ## Accumulate sorted 1-gram counts
    if(calculated_sum>expected_sum[coverage_index]){
      break
    }
    if(dummy_index >= length(ngram1df$freq)){
      break
    }
    dummy_index <- dummy_index+1
  }
  coverage_criteria_index[coverage_index] <- dummy_index
}
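
The same cutoff indices can be computed more concisely with a cumulative sum over the sorted 1-gram frequencies; this vectorized sketch is an equivalent alternative, assuming ngram1df is already sorted by decreasing frequency as above.

## Vectorized equivalent: first row at which the cumulative 1-gram
## frequency exceeds each coverage target.
cum_freq <- cumsum(ngram1df$freq)
coverage_criteria_index <- sapply(coverage * total_word_num,
                                  function(target) which(cum_freq > target)[1])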

We need 5494 unique words to cover 50% of all word instances. We need approximately 5.009 × 10^4 unique words to cover 90% of all word instances.

A similar concept can be applied to the 2-gram data frame to understand the number of rows we need to keep in memory to effectively cover the majority of user scenarios. In my predictive tool, I'll aim to cover 90% of the words typed by the customer. This 90% threshold should guarantee a good customer experience in terms of accurately predicting most customer queries while at the same time saving memory.

coverage <- c(0.7, 0.8, 0.9, 0.95)
coverage_criteria_index <- c(0, 0, 0, 0)
expected_sum = coverage*total_word_num


for (coverage_index in 1:length(coverage) ) {
  calculated_sum <- 0
  dummy_index <- 1
  repeat{
    calculated_sum <- calculated_sum + ngram2df[dummy_index,]$freq*2 ## Each 2-gram occurrence spans two word tokens
    if(calculated_sum>expected_sum[coverage_index]){
      break
    }
    if(dummy_index >= length(ngram2df$freq)){
      break
    }
    dummy_index <- dummy_index+1
  }
  coverage_criteria_index[coverage_index] <- dummy_index
}

plot(coverage, coverage_criteria_index, main="Tradeoff between coverage and memory space", xlab="Actual Words Covered vs Total Words ratio", ylab="No. of unique n-grams")

Approach

My approach is to keep n-gram tables for 2-grams, 3-grams and 4-grams in memory. I'll only save enough rows to cover 90% of the overall word usage.
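
As a sketch of how a table could be trimmed to that target (assuming it is already sorted by decreasing frequency), the cutoff row can again be found with a cumulative sum; for simplicity this version measures coverage against the total number of n-gram occurrences in the table rather than against total word instances.

## Keep only the top rows of a frequency-sorted n-gram table needed to
## reach the desired share of all n-gram occurrences (90% here).
trim_to_coverage <- function(ngram_df, coverage = 0.9) {
  cum_freq <- cumsum(ngram_df$freq)
  cutoff <- which(cum_freq >= coverage * sum(ngram_df$freq))[1]
  ngram_df[seq_len(cutoff), ]
}

## Example:
## ngram2df_small <- trim_to_coverage(ngram2df, 0.9)
## ngram3df_small <- trim_to_coverage(ngram3df, 0.9)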

When the user types in a certain set of characters, the algorithm will run a search query against the applicable n-gram table to find the most frequently used matching n-gram. For example, when the user types ‘the end’, the algorithm should suggest ‘of’ as the next word the user might intend to type. If the user types ‘for the first’, the algorithm will suggest ‘time’. These suggestions are based on the frequency distribution of the n-gram table. The challenge in this approach will be ensuring the algorithm returns a suggestion within a second, so that the suggestion stays relevant to the user.
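
A minimal sketch of this lookup step follows, assuming the n-gram data frames are sorted by decreasing frequency and that the ngrams column holds space-separated words as returned by get.phrasetable(); the function name is illustrative and not part of the ngram package.

## Return the last word of the most frequent n-gram that starts with the
## user's text. ngram_df must be sorted by decreasing frequency.
predict_next_word <- function(prefix, ngram_df) {
  phrases <- trimws(ngram_df$ngrams)
  hits <- which(startsWith(phrases, paste0(trimws(prefix), " ")))
  if (length(hits) == 0) return(NA_character_)  ## back off to a shorter n-gram table
  best <- phrases[hits[1]]                      ## most frequent match
  tail(strsplit(best, " ")[[1]], 1)
}

## Example usage:
## predict_next_word("the end", ngram3df)  ## expected suggestion: "of"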

Conclusion

This document details some of the exploratory analysis I performed on the given US English data set. From the exploratory analysis I was able to identify the most commonly used words in English and the most frequently occurring n-grams. This helped me determine the threshold for the number of n-gram table entries I need to keep in memory. Going forward I'll work on more accuracy and performance related optimizations. I'll also add a way to update the n-gram table with new user-entered text, which will help personalize the algorithm's recommendations for every user.

References