Overview

The Typeahead Predictor is a feature that suggests the next possible word as a user types into a text area in an application, such as a search bar or any other text field, whether on a physical keyboard or a virtual one on a mobile device. Providing highly correlated next-word suggestions makes a user's typing interaction much faster. Smart keyboards are based on predictive text models, which are trained on a corpus of unstructured, free-form text documents.

As we enter the initial exploratory milestone phase, we analyze a specific corpus of text documents to discover the structure in the data and how words are put together. This phase includes the necessary preprocessing and cleaning of our dataset using a variety of text mining tools provided by R libraries/packages.

Numerous preprocessing techniques are drawn from this material: Text Mining Infrastructure in R by Ingo Feinerer, Kurt Hornik, and David Meyer (https://www.jstatsoft.org/article/view/v025i05).


Configuration

In order to proceed with our evaluation of the dataset files, some housekeeping is necessary:

set.seed("4789")
setwd("D:/Courses/Data Science JHU/Capstone")
usePackage<-function(p){
  # load a package if installed, else load after installation.
  # Args: p: package name in quotes
  if (!is.element(p, installed.packages()[,1])){
    #print(paste('Package:',p,'Not found, Installing Now...'))
    suppressMessages(install.packages(p, dep = TRUE))
  }
  #print(paste('Loading Package :',p))
  require(p, character.only = TRUE)  
}

Loading Libraries Here: tm, RWeka, ggplot2, dplyr, stringi, png, grid…
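A plausible way to load them is with the usePackage helper defined above; this chunk is a reconstruction on my part, and the package list simply mirrors the loading messages that follow.

# Hypothetical loading chunk (a sketch, not necessarily the exact code used);
# usePackage() installs a package if needed and then attaches it.
for (p in c("tm", "RWeka", "SnowballC", "ggplot2", "dplyr",
            "stringi", "png", "grid")) {
  usePackage(p)
}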

## Loading required package: tm
## Loading required package: NLP
## Loading required package: RWeka
## Loading required package: SnowballC
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: stringi
## Loading required package: png
## Loading required package: grid

Exploratory Analysis

The dataset archive used as the source for analyzing and making typeahead predictions was obtained from this resource: (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)

It has been downloaded and uncompressed. The following paths indicate how we access the various US English files for this initial assessment.

dataBaseFilePath   <- "D:/Courses/Data Science JHU/Capstone/Coursera-SwiftKey/"
enUSBaseFilePath    <- paste0( dataBaseFilePath, "final/en_US/" )
enUSBlogsFilename   <- "en_US.blogs.txt"
enUSBlogsFilepath   <- paste0( enUSBaseFilePath, enUSBlogsFilename ) 
enUSNewsFilename    <- "en_US.news.txt"
enUSNewsFilepath    <- paste0( enUSBaseFilePath, enUSNewsFilename )
enUSTwitterFilename <- "en_US.twitter.txt"
enUSTwitterFilepath <- paste0( enUSBaseFilePath, enUSTwitterFilename )

dBlogs    <- readLines( enUSBlogsFilepath, encoding = "UTF-8", skipNul = TRUE )
dNews     <- readLines( enUSNewsFilepath, encoding = "UTF-8", skipNul = TRUE )
## Warning in readLines(enUSNewsFilepath, encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'D:/Courses/Data Science JHU/Capstone/
## Coursera-SwiftKey/final/en_US/en_US.news.txt'
dTwitter  <- readLines( enUSTwitterFilepath, encoding = "UTF-8", skipNul = TRUE )

We build a table for reviewing some basic features of the documents:
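A sketch along these lines could produce the figures shown; the summarizeTextFile helper is a hypothetical name introduced here, and word counts rely on stringi's stri_count_words.

# Sketch (not necessarily the exact chunk used) of the per-file summary:
# size in MB, line count, word count and longest line length.
summarizeTextFile <- function( filepath, lines ) {
  data.frame(
    File        = basename(filepath),
    SizeMB      = round(file.size(filepath) / 1024^2, 1),
    Lines       = length(lines),
    Words       = sum(stri_count_words(lines)),
    LongestLine = max(nchar(lines))
  )
}
rbind( summarizeTextFile( enUSBlogsFilepath,   dBlogs ),
       summarizeTextFile( enUSNewsFilepath,    dNews ),
       summarizeTextFile( enUSTwitterFilepath, dTwitter ) )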

|Text File         | Size (MB)| No. of Lines| No. of Words| Longest Line Length|
|:-----------------|---------:|------------:|------------:|-------------------:|
|en_US.blogs.txt   |     200.4|      899,288|   37,570,839|               40833|
|en_US.news.txt    |     196.3|       77,259|    2,651,432|                5760|
|en_US.twitter.txt |     159.4|    2,360,148|   30,451,170|                 140|

Upon a visual scan of each file, we notice many issues we must deal with to obtain a clean subset for our prediction work. We notice profanity and complex, non-English character sequences; these will need to be removed. A brief demonstration of removing complex and non-English characters with the iconv function follows, and the same call is applied during preprocessing later in this report.
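As a small, illustrative example (not taken from the dataset files themselves):

# Illustrative example: converting to ASCII with sub = "" silently drops any
# character outside the ASCII range, such as accented letters or symbols.
iconv("naïve café – déjà vu", "UTF-8", "ASCII", sub = "")
# roughly: "nave caf  dj vu" -- the accented letters and the dash are dropped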

Creation of Corpus

To provide a uniform way of accessing the root TEXT of a document for text mining work, regardless of file location, internal annotations, and other features unique to the document, we load the text document(s) from various sources into a “CORPUS”.

# FUNCTION: LOAD A SINGLE DATASET FILE - AS OPPOSED TO DIRSOURCE
getSingleFileCorpus <- function( filename ) {
  VCorpus( VectorSource( paste( readLines( filename ), collapse = "\n" ) ),
           list( reader = readPlain ) )
}

Examples of loading a single file or the full corpus of data:

corpusBlogs <- getSingleFileCorpus( enUSBlogsFilepath )
inspect(corpusBlogs)
rm("corpusBlogs")
corpusFull  <- Corpus( DirSource(enUSBaseFilePath), readerControl = list(reader=readPlain, language="en_US") )
inspect(corpusFull)
rm("corpusFull")

LOAD DATA SUBSET - 3% SAMPLING OF EACH DATASET FILE FOR ENGLISH_US.

subsetPercentage <- 0.03
getDataSubset <- function( df, sp = subsetPercentage ) { sample(df, length(df) * sp )  }
dCombinedDataSubset <- c( getDataSubset( dBlogs ), getDataSubset( dNews ), getDataSubset( dTwitter ) )
filenameCombinedDataSubset <- paste0( dataBaseFilePath, "en_US.combined_subset.txt" )
if (file.exists(filenameCombinedDataSubset)) file.remove(filenameCombinedDataSubset)
## [1] TRUE
fileCombinedDataSubset <- file(filenameCombinedDataSubset)
writeLines(dCombinedDataSubset, fileCombinedDataSubset)
close(fileCombinedDataSubset)
corpusCombinedDataSubset <- getSingleFileCorpus( filenameCombinedDataSubset )

Without some optimizations (parallelization), processing the full set of text documents takes quite a bit of horsepower and time. Taking a subset of each training dataset file still allows us to gain insight into the distribution of the most-used words, to begin the analysis of term vectors/frequencies, and to look into the n-gram distribution.

The reduced corpusCombinedDataSubset will be used in further examples below. It contains a 3 percent sample of each dataset file, combined into one. When preparing the predictor, continued effort will be made to analyze as much of the provided data as possible (up to 100%).


Text Mining Preprocessing

The R package tm for “text mining” has useful functions for processing text. It can lowercase document text, making words equal from that perspective. It removes punctuation, and it can also remove stopwords, specifically eliminating words such as “in”, “the”, “a”, “of”, “and”, etc. It has a stemming feature, which performs somewhat like lemmatization; however, it simply drops what might be considered an affix in order to equate words with their roots.
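As a small illustration of this behaviour (an example added here, not part of the original analysis), using tm for stopword removal and SnowballC's wordStem, which tm's stemDocument relies on:

# Illustrative example of stopword removal and stemming on a sample sentence.
sampleText <- tolower("The cats are running in the gardens")
stripWhitespace(removeWords(sampleText, stopwords("english")))
# roughly: " cats running gardens"
wordStem(c("cats", "running", "gardens"))
# roughly: "cat" "run" "garden"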

LOAD BANNED SWEAR WORD LIST

Note: The banned swear word list was obtained from this resource: (http://www.bannedwordlist.com/). It has been augmented to 730+ profane words.

bannedWordList <- readLines( paste0( dataBaseFilePath, "swearWords.txt" ) )
## Warning in readLines(paste0(dataBaseFilePath, "swearWords.txt")):
## incomplete final line found on 'D:/Courses/Data Science JHU/Capstone/
## Coursera-SwiftKey/swearWords.txt'

FUNCTION: GET CORPUS PREPROCESSED

This getCorpusPreprocessed function encapsulates the tasks performed to preprocess our various corpora.

getCorpusPreprocessed <- function( dCorpus ) {
  
  # TRANSFORM TO LOWERCASE
  # (content_transformer KEEPS EACH DOCUMENT A PlainTextDocument)
  dCorpus <- tm_map( dCorpus, content_transformer(tolower) )
  
  # STRIP UNNECESSARY ITEMS
  dCorpus <- tm_map( dCorpus, stripWhitespace )
  dCorpus <- tm_map( dCorpus, removeNumbers )
  dCorpus <- tm_map( dCorpus, removePunctuation )
  
  # REMOVE HIGHLY COMPLEX / NON-ENGLISH CHARACTERS, BANNED WORDS AND STOPWORDS
  dCorpus <- tm_map( dCorpus, content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub="")) )
  dCorpus <- tm_map( dCorpus, removeWords, bannedWordList )  # PROFANITY FILTER
  dCorpus <- tm_map( dCorpus, removeWords, stopwords("english") ) 
  
  # STEMMING ( WORD ROOT )
  dCorpus <- tm_map( dCorpus, stemDocument )
  
  dCorpus # RETURN FILTERED CORPUS!
}

Perform the preprocessing operation:

corpusComboPP <- getCorpusPreprocessed( corpusCombinedDataSubset )
inspect( corpusComboPP )
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 6796854

Tokenization of Corpus

The next part of the analysis is to attempt tokenization in order to help with the production of n-grams. N-grams are contiguous sequences of n items (1 = unigram, 2 = bigram, 3 = trigram, etc.) pulled from our corpus.
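For instance (an illustrative snippet added here, not part of the original analysis), bigrams over a short token vector can be built with NLP's ngrams helper, the same function used by the tokenizer further below:

# Illustrative example: bigrams from a tokenized sentence via NLP::ngrams.
tokens  <- strsplit("thanks for the follow", " ")[[1]]
bigrams <- ngrams(tokens, 2)
unlist(lapply(bigrams, paste, collapse = " "), use.names = FALSE)
# roughly: "thanks for" "for the" "the follow"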

Term Matrix Processing

The following produces a term-document matrix that describes the frequency of terms occurring in our combined corpus.

corpusComboTDM  <- TermDocumentMatrix( corpusComboPP )
corpusComboTDMx <- as.matrix( corpusComboTDM )

frequencyCombo <- rowSums(corpusComboTDMx)
frequencyCombo <- sort(frequencyCombo, decreasing = TRUE)[1:100]
#names(frequencyCombo)

Plot of Frequent Single Terms from Combo Corpus:

barplot(head(frequencyCombo,25),main="Combo Corpus: Highest Word Frequency (Top 25)", ylab = "Frequency", col = "orange", las = 2)

N-Gram Processing

The issue working with this on my MacBook wasn't solvable. The following error occurred on a MacBook running Java 8:

  • java.lang.UnsupportedClassVersionError: weka/core/tokenizers/NGramTokenizer : Unsupported major.minor version 51.0
#bigram   <- NGramTokenizer(corpusComboPP, Weka_control(min = 2, max = 2))
#trigram  <- NGramTokenizer(corpusComboPP, Weka_control(min = 3, max = 3))

I tested an alternative workaround found on Stack Overflow and in discussion groups. Using the corpusComboPP created above, I tried this code:

# Build a data frame of terms and their frequencies from a term-document matrix.
freq_df <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_df <- data.frame(word=names(freq), freq=freq)
  return(freq_df)
}

# Tokenizer producing 3-grams using NLP's ngrams() and words() helpers.
TrigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

trigram <- removeSparseTerms(TermDocumentMatrix(corpusComboPP,
             control = list(tokenize = TrigramTokenizer)), 0.9999)
trigram_freq <- freq_df(trigram)

We can take a peek at the top 25 tri-grams produced using this functionality:

knitr::kable(trigram_freq[1:25,], format = "markdown")
|word               | freq|
|:------------------|----:|
|happi mother day   |  123|
|cant wait see      |  110|
|let us know        |   84|
|happi new year     |   60|
|im pretti sure     |   54|
|look forward see   |   47|
|feel like im       |   39|
|new york citi      |   36|
|dream come true    |   34|
|cinco de mayo      |   32|
|dont even know     |   29|
|cant wait get      |   27|
|dope dope dope     |   26|
|follow follow back |   26|
|blah blah blah     |   24|
|good morn everyon  |   23|
|im look forward    |   23|
|make feel like     |   23|
|cant wait hear     |   22|
|cant wait till     |   22|
|dont get wrong     |   22|
|didnt even know    |   21|
|dont know im       |   21|
|right now im       |   21|
|st patrick day     |   21|
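Presumably the same workaround extends to bigrams by changing the n passed to ngrams; a sketch (not executed for this report) would look like this:

# Sketch: the same workaround with n = 2, reusing freq_df() from above.
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

bigram <- removeSparseTerms(TermDocumentMatrix(corpusComboPP,
            control = list(tokenize = BigramTokenizer)), 0.9999)
bigram_freq <- freq_df(bigram)
head(bigram_freq, 10)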

Final Remarks & Next Steps

When using the full set of documents (the complete corpus), we find the pre-processing stage lengthy and compute-intensive. The corpus was reduced to subset samples, from 50% down to 3%, to generate a sample that could be processed in a reasonable time frame. Features of R multithreading and parallelization will need to be investigated further to speed up the processing.
We do see, even from a 3% subset sample of the data, that a tremendous set of terms is produced, from which we can derive the frequencies of the most-used words.
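One hypothetical direction (a sketch only, not something attempted in this report) is to spread the line-level cleaning across cores with the base parallel package; the chunk count and the cleaning steps below are placeholders rather than the actual pipeline:

# Hypothetical sketch: split the sampled lines into chunks and clean them on
# several worker processes using the base 'parallel' package.
library(parallel)
cl     <- makeCluster(max(1, detectCores() - 1))
chunks <- split(dCombinedDataSubset,
                cut(seq_along(dCombinedDataSubset), 8, labels = FALSE))
cleaned <- parLapply(cl, chunks, function(chunk) {
  tolower(iconv(chunk, "UTF-8", "ASCII", sub = ""))
})
stopCluster(cl)
cleanedLines <- unlist(cleaned, use.names = FALSE)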