Unlike the full-sized keyboards used with desktops and laptops, the input methods on portable devices are a bottleneck, and those devices are used more and more. Word prediction assists the user by speeding up data entry and plays an important role in handheld device input. Just as portable technology has advanced, personal computers can now process large amounts of data, well over a million data points, and use them in predictive analytics. This word prediction capstone reflects that capability.
This initial submission reports on the challenges and findings discovered during the acquisition, cleaning and exploratory analysis of the SwiftKey data set. Only the English files were analyzed. For this initial investigation, the data sources were not merged; independent studies were done for the BLOGS, TWITTER and NEWS sources. The methods used to obtain and clean the data and to build the corpus are included, along with the process used to create the various n-gram models that appear both in this report and in the final model. Finally, visuals are presented which show how the corpus differs across the sources. This distinction is important when considering the medium and method used to trigger the word prediction algorithm.
Finally, next steps and challenges are discussed. Once the tokens used in the prediction have been generated, a method to deploy them into an application is essential for this study to add value.
In order to control the study and avoid processing and memory constraints, the data was processed in batches. The batches were broken down in the following manner:
Example of the code used for this phase during the blog analysis:
rm(list=ls(all.names = T)) ## clear the environment
gc() ## Clean up Environment
gc(reset=TRUE)
### Libraries -----------------------------------------------------------------------------------
library(tm)
library(snowfall)
library(data.table)
library(RWeka)
library(magrittr)
library(stringi)
library(pbapply)
# Set Environment -----------------------------------------------------------------------------
sfInit(parallel = TRUE, cpus = 4)
setwd("~/RWorkingDirectory/SwiftKeyCapstone/final/en_US")
set.seed(54433)
#Load Data For Analysis -------------------------------------------------------------------------
con <- file("en_US.blogs.txt", "r")
blogsData <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Functions for various tokenizing methods -----------------------------------------------------
ngramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
# Master Batch Variables -----------------------------------------------------------------------
BatchNumbers <- 50000  # number of lines sampled from each batch
BatchMin <- c(1, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000)  # lower line bound of each batch
BatchMax <- c(99999, 199999, 299999, 399999, 499999, 599999, 699999, 799999, 899288)  # upper line bound of each batch
BatchStudy <- "BLOGS"  # data source under study
nGramStudy <- "4gram"  # n-gram size for this study
Once the data is loaded and the batches are created, the analysis is run as a series of batched, saved studies which ultimately need to be aggregated. This smaller batch approach ensures the analysis will not falter or crash, as each n-gram batch took approximately one hour to process.
To provide some insight into these methods, the following code snippet was used to create the n-grams for each of the data sets.
Code Example for Blog Analysis - Cleaning Data, Sentence Subsets and Corpus Creation:
# Process each batch in turn (the closing brace below ends this loop)
for (currBatch in seq_along(BatchMin)) {
# Select half of each batch using random number generation and the bounds defined above
randomNums <- runif(BatchNumbers, BatchMin[currBatch], BatchMax[currBatch])
v <- blogsData[randomNums]
v <- unlist(stri_split_boundaries(v, type = "sentence", skip_sentence_sep = TRUE))
SampleSize <- length(v)
# Remove non-alphanumeric characters
v <- gsub("[^[:alnum:] ]", "", v)
# Break the text into a corpus, transform to lower case, and remove punctuation and numbers
corpus <- Corpus(VectorSource(v))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
#Start the Tokenize Process - Use pblapply to show status and cycle through each tokenize process
tokens <- unlist(pblapply(corpus$content, ngramTokenize))
tokens <- pblapply(unlist(tokens), function(x) {strsplit(x, " ")})
# Build a data table with one column per word of each token
tokens <- rbindlist(pblapply(tokens, function(x) as.list(unlist(x))))
# Flatten the multi-column data table back into a single string with spaces between words
tokens_FLAT <- as.data.frame(paste(tokens$V1, tokens$V2, tokens$V3, tokens$V4, sep = " "))
# Build Freq and Probability Tables
tokenNumbers <- nrow(tokens_FLAT)
df_tokenResult <- as.data.frame(table(tokens_FLAT))
df_tokenResult$prob <- df_tokenResult$Freq / SampleSize
df_tokenResult$sample <- SampleSize
df_tokenResult$sampleTokens <- tokenNumbers
# Save the batch result before clearing it (the file name here is illustrative), then conserve memory
saveRDS(df_tokenResult, paste0(BatchStudy, "_", nGramStudy, "_batch", currBatch, ".rds"))
df_tokenResult <- NULL
gc(reset = TRUE)
gc()
}
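As noted above, the per-batch frequency tables ultimately need to be aggregated into a single table per data set and n-gram size. The snippet below is a minimal sketch of that aggregation step, assuming each batch was saved to an .rds file using the illustrative naming shown in the loop above; the file pattern and the object names batchFiles, df_all and df_final are illustrative. The data.table grouping call sums the frequencies of n-grams that appear in more than one batch:
# Aggregate the saved batch frequency tables into one table (sketch; file names are illustrative)
library(data.table)
batchFiles <- list.files(pattern = paste0(BatchStudy, "_", nGramStudy, "_batch.*\\.rds"))
df_all <- rbindlist(lapply(batchFiles, function(f) {
  d <- as.data.table(readRDS(f))
  setnames(d, 1, "ngram")  # the first column holds the n-gram string
  d
}))
# Sum frequencies for n-grams that occur in more than one batch, then order by frequency
df_final <- df_all[, .(Freq = sum(Freq)), by = ngram]
setorder(df_final, -Freq)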
Summary of Analysis Vector Sizes:
| Data_Set | nGram | Total_Vectors | Study_Tokens |
|---|---|---|---|
| BLOGS | 1 | 1,072,910 | 17,570,858 |
| BLOGS | 2 | 2,145,820 | 34,079,806 |
| BLOGS | 3 | 1,072,910 | 15,470,033 |
| BLOGS | 4 | 2,145,820 | 29,925,007 |
Unigram to 4-Gram Results of the RWeka Tokenizing Process (BLOGS):
As you can see from the plots above, the long tail of the histogram is quite significant. The log scale on both the x and y axes provides a clearer view of what is occurring in terms of token frequency. Clearly, the majority of the tokens are singletons, appearing only once.
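For reference, the kind of frequency plot described above can be reproduced from the aggregated table. The snippet below is a minimal sketch, assuming an aggregated frequency table such as df_final from the earlier sketch (any per-batch df_tokenResult would work the same way); the object name freqDist is illustrative, and ggplot2 is assumed here even though it is not part of the library list in the setup code:
# Plot the token frequency distribution on log-log axes to expose the long tail (sketch)
library(ggplot2)
freqDist <- as.data.frame(table(df_final$Freq))  # how many tokens occur at each frequency
names(freqDist) <- c("Freq", "Tokens")
freqDist$Freq <- as.numeric(as.character(freqDist$Freq))
ggplot(freqDist, aes(x = Freq, y = Tokens)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Token Frequency (log scale)", y = "Number of Tokens (log scale)")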
| Data_Set | nGram | Total_Vectors | Study_Tokens |
|---|---|---|---|
| TWITTER | 1 | 1,650,000 | 20,532,002 |
| TWITTER | 2 | 3,300,000 | 39,414,004 |
| TWITTER | 3 | 4,950,000 | 56,654,771 |
| TWITTER | 4 | 5,950,000 | 66,126,860 |
Unigram to 4-Gram Results of the RWeka Tokenizing Process (TWITTER):
| Data_Set | nGram | Total_Vectors | Study_Tokens |
|---|---|---|---|
| NEWS | 1 | 115,442 | 1,976,317 |
| NEWS | 2 | 268,596 | 4,446,785 |
| NEWS | 3 | 421,750 | 6,768,137 |
| NEWS | 4 | 574,904 | 8,943,096 |
Unigram to 4-Gram Results of the RWeka Tokenizing Process (NEWS):
Now that the tokens for each model have been established, the focus shifts to the final prediction model and, ultimately, the Shiny app that will use these tokens. The hurdles to address are as follows: