Executive Summary:

Unlike desktops and laptops with traditional full-sized keyboards, portable devices are used more and more, and their input methods are a bottleneck. Word prediction speeds up data entry and plays an important role in handheld device input. Just as portable technology has advanced, personal computers can now process large amounts of data, allowing well over a million data points to be used in predictive analytics. This word prediction capstone reflects that ability.

This initial submission reports on the challenges and findings encountered during data acquisition, cleaning and exploratory analysis of the SwiftKey data set. Only the English files were analyzed. For this initial investigation the data sources were not merged; independent studies were done for the BLOG, TWITTER and NEWS sources. The methods used to obtain and clean the data and to create the corpus are described, along with the process used to build the various n-gram models that appear both in this report and in the final model. Finally, visuals are presented which show how the corpus differs across sources. This distinction matters when considering the medium and method used to trigger the word prediction algorithm.

Next steps and challenges are also discussed. Once the tokens used in the prediction are generated, a method to deploy them in an application is essential in order to add value to this study.

Stage ONE: Data Acquisition, Batch Creation and Session Setup:

In order to control the study and avoid processing and memory constraints, the data was processed in batches. The batches were broken down in the following manner:

  1. Load the required libraries for NLP, memory management, regular expressions and a progress-bar version of lapply (pbapply) to track the status of the analysis.
  2. Load one of the source data files (Blog, Twitter or News).
  3. Define batches for each of the file types.
    • The ‘runif’ random number generator was used to sample lines from the files randomly.
    • Blogs data - 50% of the 899,288 vectors are used.
    • Twitter and News data - 100% of the vectors are used.

Example of the code used for this phase during the Blog analysis:

rm(list=ls(all.names = T))  ## clear the environment
gc()  ## Clean up Environment
gc(reset=TRUE)
### Libraries -----------------------------------------------------------------------------------
library(tm)
library(snowfall)
library(data.table)
library(RWeka)
library(magrittr)
library(stringi)
library(pbapply)
# Set Environment -----------------------------------------------------------------------------
sfInit(parallel =TRUE, cpus = 4)
setwd("~/RWorkingDirectory/SwiftKeyCapstone/final/en_US")
set.seed(54433)
#Load Data For Analysis -------------------------------------------------------------------------
con <- file("en_US.blogs.txt", "r")
blogsData <- readLines(con, encoding="UTF-8",skipNul =TRUE)
close(con)
# Functions for various tokenizing methods -----------------------------------------------------
ngramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
# Master Batch Variables -----------------------------------------------------------------------
BatchNumbers <- 50000   # number of lines sampled from each batch
BatchMin <- c(1, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000)   # lower bound of each batch
BatchMax <- c(99999, 199999, 299999, 399999, 499999, 599999, 699999, 799999, 899288)   # upper bound of each batch
BatchStudy <- "BLOGS"   # data source for this study
nGramStudy <- "4gram"   # n-gram order for this study
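
Since n-grams of orders 1 through 4 are generated over the course of the study (see Stage TWO), the tokenizer above is redefined for each study. A minimal sketch of how the remaining orders could be expressed with the same RWeka call is shown below; the function names are illustrative:

# Sketch: one tokenizer per n-gram order, built on the same RWeka call as above
makeNgramTokenizer <- function(n) {
  function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
}
unigramTokenize <- makeNgramTokenizer(1)
bigramTokenize  <- makeNgramTokenizer(2)
trigramTokenize <- makeNgramTokenizer(3)
# ngramTokenize defined above already covers the 4-gram study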

Stage TWO: Corpus Creation and Aggregation:

Once the data is loaded and the batches are created, the analysis is run as a series of batch studies that are saved and ultimately aggregated. This smaller batch approach ensures the analysis will not falter or crash, as each n-gram batch took approximately one hour to process.

  1. Setting the results to NULL and then garbage collecting after each batch is instrumental in ensuring the corpus creation and tokenization can run without interruption or failure, given the large number of vectors to be processed.
  2. Looped through each batch and ensured RAM did not exceed 2 GB during the analysis for robust processing.
  3. Break each of the remaining lines into sentences.
    • Keeps the word order within sentence boundaries.
    • Increases the vector count, as a line of source data generally contains more than one sentence and is composed of paragraphs.
    • Helps neutralize the effect of short sentences and ensures longer ones are represented properly.
    • Keeps the n-grams accurate, as they do not pick up words that run across a sentence boundary.
    • The TWITTER data was not broken into sentences, as tweets generally do not follow proper grammar due to character limits and the nature of the communication.
  4. Cleaned the data vector using regular expressions to remove non-alphanumeric characters, plus basic tm package cleaning functions to remove punctuation, convert to lower case and remove numbers. Stop words were NOT removed, as they provide value during the prediction process; sentiment analysis, for example, could benefit from removing them, but not this use case.
  5. Tokenize into the specific n-gram for the study. n-grams 1 to 4 were generated for all data sources.
  6. Save the result of each batch study for future aggregation.
  7. Each batch for each data source is combined, recounted and the probabilities recalculated during aggregation (a sketch of this step follows the code example below). In the future this could be done with a map-reduce approach on even larger data sets, as multiple nodes could perform these calculations independently and return the results to the same aggregating method for model creation.

In order to provide some insight into these methods, the following code snippet (from the Blog analysis) shows how the n-grams were created; the same approach was used for each of the data sets.

Code Example for Blog Analysis - Cleaning Data, Sentence Subsets and Corpus Creation:

# Loop through each batch (currBatch indexes the batch bounds defined above)
for (currBatch in seq_along(BatchMin)) {
    # Select half of the batch using random number generation and the bounds defined in the batch definition
      randomNums <- runif(BatchNumbers, BatchMin[currBatch], BatchMax[currBatch])
      v <- blogsData[randomNums]   # non-integer indices are truncated to integer line positions
      v <- unlist(stri_split_boundaries (v, type = "sentence", skip_sentence_sep = TRUE))
      SampleSize <- length(v)
    #Remove NON Alpha Numeric Characters
      v <- gsub("[^[:alnum:] ]", "", v)
    # Break the text into a Corpus and transform to lower case and remove punctuation or numbers
      corpus <- Corpus(VectorSource(v))
      corpus <- tm_map(corpus, content_transformer(tolower))
      corpus <- tm_map(corpus, removePunctuation)
      corpus <- tm_map(corpus, removeNumbers)
    #Start the Tokenize Process - Use pblapply to show status and cycle through each tokenize process
      tokens <- unlist(pblapply(corpus$content, ngramTokenize))
      tokens <- pblapply(unlist(tokens), function(x) {strsplit(x, " ")})
    # Build a list of tokens, one n-gram per row with one word per column
      tokens <- rbindlist(pblapply(tokens, function(x) as.list(unlist(x))))
    # Flatten results from the multi-column table to a single string with spaces between words (four columns for the 4-gram study)
       tokens_FLAT <- as.data.frame(paste(tokens$V1, tokens$V2, tokens$V3, tokens$V4, sep = " "))
    # Build Freq and Probability Tables
      tokenNumbers <- nrow(tokens_FLAT)
      df_tokenResult <- as.data.frame(table(tokens_FLAT))
      df_tokenResult$prob <- df_tokenResult$Freq / SampleSize
      df_tokenResult$sample <- SampleSize
      df_tokenResult$sampleTokens <- tokenNumbers
    # Save the batch result for later aggregation (file name shown is illustrative)
      saveRDS(df_tokenResult, paste0(BatchStudy, "_", nGramStudy, "_batch_", currBatch, ".rds"))
    # Conserve memory and garbage collect before the next batch
      df_tokenResult = NULL
      gc(reset = TRUE)
      gc()
}
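
Step 7 above, the aggregation of the saved batch results, is not shown in the snippet. The following is a minimal sketch of how it could be done with data.table, assuming the batch frequency tables were saved to .rds files as in the snippet above (the file pattern is illustrative):

# Sketch: aggregate saved batch frequency tables into one table per data source and n-gram order
library(data.table)
batchFiles <- list.files(pattern = "^BLOGS_4gram_batch_.*\\.rds$")   # illustrative pattern
batchList  <- lapply(batchFiles, function(f) as.data.table(readRDS(f)))
# The sample size is constant within a batch, so take it once per file and sum across batches
totalSample <- sum(vapply(batchList, function(dt) dt$sample[1], numeric(1)))
# Stack the batches, rename the token column, and recount identical tokens across batches
allBatches <- rbindlist(batchList)
setnames(allBatches, 1, "token")
allBatches[, token := as.character(token)]   # table() produced factors; convert to character
aggregated <- allBatches[, .(Freq = sum(Freq)), by = token]
# Recalculate the probability over the combined sample and order by frequency
aggregated[, prob := Freq / totalSample]
setorder(aggregated, -Freq)

The same steps are repeated for each data source and each n-gram order.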

Data Cleaning and Tokenization Results:

BLOGS DATA SET

Summary of Analysis Vector Sizes:

Data_Set   nGram   Total_Vectors   Study_Tokens
BLOGS      1       1,072,910       17,570,858
BLOGS      2       2,145,820       34,079,806
BLOGS      3       1,072,910       15,470,033
BLOGS      4       2,145,820       29,925,007

Unigram to 4Gram Results of RWEKA tokenizing process:

As can be seen from the above plots, the long tail of the histogram is quite significant. The log scale on both the X and Y axes provides a clearer view of what is occurring in terms of token frequency. Clearly, the majority of the tokens are singletons, occurring only once.
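
A view of this kind can be recreated from a batch frequency table such as df_tokenResult built in the code example above (before it is cleared). The following is a minimal sketch using only the Freq column:

# Sketch: log-log view of how often each token frequency occurs
freqOfFreq <- as.data.frame(table(df_tokenResult$Freq))
names(freqOfFreq) <- c("tokenFreq", "nTokens")
plot(as.numeric(as.character(freqOfFreq$tokenFreq)), freqOfFreq$nTokens,
     log = "xy",
     xlab = "Token frequency (log scale)",
     ylab = "Number of tokens (log scale)",
     main = "BLOGS 4-gram token frequency distribution")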

TWITTER DATA SET

Summary of Analysis Vector Sizes:

Data_Set   nGram   Total_Vectors   Study_Tokens
TWITTER    1       1,650,000       20,532,002
TWITTER    2       3,300,000       39,414,004
TWITTER    3       4,950,000       56,654,771
TWITTER    4       5,950,000       66,126,860

Unigram to 4Gram Results of RWEKA tokenizing process:

NEWS DATA SET

Summary of Analysis Vector Sizes:

Data_Set   nGram   Total_Vectors   Study_Tokens
NEWS       1       115,442         1,976,317
NEWS       2       268,596         4,446,785
NEWS       3       421,750         6,768,137
NEWS       4       574,904         8,943,096

Unigram to 4Gram Results of RWEKA tokenizing process:

Stage THREE: Next Steps:

Now that the tokens for each model have been established, the focus shifts to the final prediction model and ultimately the Shiny app that will use it. The hurdles to address are as follows:

  1. Size of Aggregated Results - Currently, the data is still too sparse. Over 90% of the tokens have probabilities which are far too low and should therefore be removed in order to streamline the model.
  2. Model Selection - The nature of the data implies a fairly standard language model. Using the Markov assumption to simplify the model, conditional probabilities will be utilized (the probability of X given Y, where Y is the preceding n-gram). Implementing Stupid Backoff is the easiest method, as the search terminates once either a matching n-gram is found or nothing is found. The data is now in a format such that, if the words used in the prediction are not found at the highest n-gram order (i.e. the 4-gram), the next level (n-1) can be searched. If the words ultimately cannot be found after the 2-gram, the model either returns nothing or suggests a ‘most likely’ candidate based on the unigram table. A sketch of this backoff lookup follows this list.
  3. Overfitting - If model selection is not careful, overfitting is a risk. Due to the severe tail shown in the BLOG analysis above, the model contains tokens which have very little probability of ever being matched. Eliminating some of these extremely rare events reduces the risk of overfitting, but the model must accommodate the scenario where Stupid Backoff yields no prediction. Data smoothing using Laplace is an option, but it could increase the overall size of the model, so it may be best simply to remove these low-probability scenarios. The NEWS data set, for example, had many specific tokens which appear to fit a slew of news stories, and the TWITTER data contains standard auto-generated messages such as “thanks for the follow”; these should be removed as they are auto-generated rather than typed by the user.
  4. Source Data for Model - Overall this is a fundamental step. Balancing predictive accuracy and speed is going to be a major issue. The Shiny app file size constraint must be managed, or another method of model storage is necessary. Storing the model in a MySQL database hosted elsewhere is a potential solution, but it might be best to optimize the model first and store it within Shiny if possible.
  5. Final Solution is Driven by the Application - The differences between tweets, news and blogs are quite substantial, so the value of the prediction is a function of the method or application with which the data is entered. The final Shiny app will load the model depending on the ‘mode’ of the input. This should help keep the models smaller as well as increase the accuracy of the prediction within each application type.
  6. Profanity Filtering - Initial inquiries for profane words returned fairly significant results. Therefore, before the model is fully developed, the tokens need to be scrubbed and profane entries removed so they will not appear during prediction. Regular expressions applied to the input in the Shiny app can also be used to stop prediction when the user enters profane text.
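
To illustrate item 2, the following is a minimal sketch of a Stupid Backoff lookup over the aggregated n-gram tables. The token and prob columns follow the aggregation sketch earlier; the list of tables and the 0.4 backoff weight are illustrative assumptions, not the final implementation:

# Sketch: Stupid Backoff lookup over aggregated n-gram tables (data.table)
# Assumes each table has a 'token' column (space-separated words) and a 'prob' column.
library(data.table)

predictNextWord <- function(phrase, ngramTables, lambda = 0.4) {
  # ngramTables[[1]] = unigrams, ..., ngramTables[[4]] = 4-grams
  words <- tolower(strsplit(trimws(phrase), "\\s+")[[1]])
  for (n in seq(length(ngramTables), 2)) {
    if (length(words) < n - 1) next                      # not enough context for this order
    context <- paste(tail(words, n - 1), collapse = " ")
    hits <- ngramTables[[n]][startsWith(token, paste0(context, " "))]
    if (nrow(hits) > 0) {
      best <- hits[which.max(prob)]
      # Predict the final word of the best matching n-gram, discounted per backoff step
      nextWord <- tail(strsplit(best$token, " ")[[1]], 1)
      return(list(word = nextWord, score = lambda^(length(ngramTables) - n) * best$prob))
    }
  }
  # Nothing matched down to the 2-gram: fall back to the most likely unigram
  uni <- ngramTables[[1]][which.max(prob)]
  list(word = uni$token, score = lambda^(length(ngramTables) - 1) * uni$prob)
}

For example, predictNextWord("thanks for the", list(unigrams, bigrams, trigrams, quadgrams)) would search the 4-gram table first and only fall back to the unigram ‘most likely’ candidate if no higher-order match is found (the table object names here are illustrative).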