This report summarizes progress on the predictive text algorithm development. The focus for the first two weeks of activity has been establishing the development environment, reviewing the available data, and becoming familiar with the computational routines available for working with the textual datasets.
For the development of the predictive text algorithm, the JHU web scraping team has provided several data collections in four languages and from three sources with varying levels of formality. For this project, development will be limited in scope to the English datasets.
The raw data may be retrieved from the Coursera site: Capstone Dataset
For brevity, the code chunk for downloading and extracting the data is not shown. From the extracted archive, the relevant English datasets are readily retrieved using functionality from the tm library.
# Set up dataset retrieval options
library(tm)
dd <- "./Coursera-SwiftKey/final/en_US"
dd.texts <- DirSource(directory=dd, encoding="UTF-8", mode="text")
# Read data files into a volatile (in-memory) corpus
corp <- VCorpus(dd.texts)
The raw data is summarized with the following metrics (file size and in-memory size are in bytes; average words are per line).
##               Files  FileSize   MemSize   Lines TotalWords AvgWords
## 1   en_US.blogs.txt 210160014 260567992  899288   37546246 41.75108
## 2    en_US.news.txt 205811889  20115064   77259    2674536 34.61779
## 3 en_US.twitter.txt 167105338 316041032 2360148   30093369 12.75063
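The code used to assemble these metrics is not shown; the sketch below is one way such a summary could be produced, assuming readLines for line counts and the stringi package for word counts (the helper name is illustrative).
# Sketch of assembling per-file metrics (illustrative helper, not
# necessarily the code used to produce the table above)
library(stringi)
file_metrics <- function(path){
    lines <- readLines(path, encoding="UTF-8", skipNul=TRUE, warn=FALSE)
    words <- stri_count_words(lines)
    data.frame(Files      = basename(path),
               FileSize   = file.info(path)$size,           # bytes on disk
               MemSize    = as.numeric(object.size(lines)), # bytes in memory
               Lines      = length(lines),
               TotalWords = sum(words),
               AvgWords   = mean(words))                    # words per line
}
do.call(rbind, lapply(list.files(dd, full.names=TRUE), file_metrics))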
The following example is a small extract from one of the datasets prior to cleaning activities.
# Set up a function for viewing corpus documents
# (acknowledgement to Graham@togaware for the function)
library(magrittr)
cor.view <- function(d, n){ d %>% extract2(n) %>% as.character() }
cor.view(corp, 1)[5]
## [1] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
As the intended audience of the application is the general public, the predictive algorithm shall not suggest socially inappropriate terms. Since the end users may span the full range of cultural sensitivities, it is best to err on the safe side and exclude any potentially offensive terms. There are no restrictions on an individual user's expressiveness; if a user wishes to type an obscene or offensive word, they may do so as usual.
From a quick internet search, Luis von Ahn has a fairly comprehensive list of potentially offensive words: List of bad words. This will be used as the starting point and may readily be expanded as further experience is gained.
The tm library command for filtering out profanity has been tested and is included in the code chunk below.
if(!file.exists("bad-words.txt")){
    download.file("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
                  "bad-words.txt")
}
BadWords <- readLines(con="bad-words.txt", warn=FALSE, encoding='UTF-8' )
# strip empty lines from BadWords
BadWords <- BadWords[BadWords!=""]
# remove bad words
corp <- tm_map(corp, removeWords, BadWords)
The dataset has been separated into training and test sets. In addition, small sample sets from the training data have been taken for use during code development.
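The sampling code itself is not shown; as a minimal sketch, a development sample could be drawn line by line as below (the file, sampling fraction, and seed are illustrative rather than the values used for this report).
# Draw a small random sample of lines for code development
# (sampling fraction and seed are illustrative)
set.seed(1234)
sample_lines <- function(lines, fraction=0.01){
    lines[rbinom(length(lines), size=1, prob=fraction) == 1]
}
blogs <- readLines(file.path(dd, "en_US.blogs.txt"), encoding="UTF-8", warn=FALSE)
blogs.sample <- sample_lines(blogs, fraction=0.01)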
The exploratory analysis included experimenting with the possible manipulations of the data using routines available within the tm and RWeka libraries; a representative sketch of these routines is shown below.
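Typical manipulations include case conversion, removal of punctuation and numbers, whitespace stripping, and n-gram tokenization. The following chunk illustrates routines of this kind and is not necessarily the exact sequence applied in this analysis.
# Representative cleaning transformations with tm
library(tm)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)
# Bigram tokenizer with RWeka, usable when building n-gram term document matrices
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))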
A variety of visuals are presented to illustrate aspects of the data set. The data is first tabulated as unigrams into a term document matrix, from which the visuals are derived.
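The unigram term document matrix news.tdm1 referenced in the next chunk is not constructed in the code shown; a minimal sketch, assuming a corpus corp.news built from the news sample, would be:
# Tabulate unigrams into a term document matrix
# (corp.news is an assumed corpus built from the news sample)
news.tdm1 <- TermDocumentMatrix(corp.news)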
# convert TDM to dataframe
corp.df <- as.data.frame(as.matrix(news.tdm1))
# sum rows
corp.df <- data.frame(word=rownames(corp.df), sum=rowSums(corp.df))
# sort by frequency
corp.df <- corp.df[order(-corp.df$sum),]
# fix factors so it plots in decreasing order
corp.df$word <- factor(corp.df$word,
levels=with(corp.df, word[order(sum, word, decreasing = TRUE)]))
# prepare histogram of most common words
library(ggplot2)
g1 <- ggplot(corp.df[1:15,], aes(x=word, y=sum)) +
    geom_bar(stat="identity")
g1
# add index and cumulative sum to the frequency data frame
corp.df$x <- seq(along.with=corp.df$sum)
corp.df$cs <- cumsum(corp.df$sum)
temp <- sum(corp.df$sum)
corp.df$cf <- corp.df$cs/temp
cf.v50 <- length(corp.df$cf[corp.df$cf<=0.5])/length(corp.df$cf)
cf.v90 <- length(corp.df$cf[corp.df$cf<=0.9])/length(corp.df$cf)
rm(temp)
g2 <- ggplot(corp.df, aes(x=(x/length(corp.df$sum)), y=cf)) +
geom_line() +
geom_vline(xintercept=cf.v50, color="blue", linetype="dashed") +
geom_vline(xintercept=cf.v90, color="red", linetype="dashed") +
labs(title="Word Usage Frequency", x="Terms in Corpus, %", y="Cumulative Terms in Text, %")
g2
corp.df$nlet <- nchar(as.character(corp.df$word))
g4 <- ggplot(corp.df, aes(x=nlet)) +
geom_histogram(binwidth = 1) +
geom_vline(xintercept=mean(corp.df$nlet), color="red", linetype="dashed") +
labs(x="Number of Letters in Word", y="Number of Words")
g4
From the exploratory review, the following observations are noted for consideration in the predictive algorithm.
In the Word Usage Frequency plot we find that 50% of the word occurrences in the text are covered by less than 5% of the distinct terms in the corpus, and 90% are covered by roughly 40% of the terms. This indicates that removal of the least common words will potentially have a minimal effect on prediction accuracy.
The text processing times can be very long; removal of the infrequently used words improves processing speed.
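One possible way to trim the vocabulary, reusing the unigram counts already computed above, is a simple frequency cut-off (the threshold here is illustrative and would need to be tuned against prediction accuracy).
# Drop terms seen fewer than 5 times (illustrative threshold)
min.count <- 5
corp.df.trimmed <- corp.df[corp.df$sum >= min.count, ]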