This report summarizes progress on the predictive text algorithm development. The focus for the first two weeks of activity has been establishing the development environment, reviewing the available data, and becoming familiar with the computational routines available for working with the textual datasets.
For the development of the predictive text algorithm, the JHU web scraping team has provided several data collections in four languages and from three sources with varying levels of formality. For this project, development will be limited in scope to the English datasets.
The raw data may be retrieved from the Coursera site: Capstone Dataset
For brevity, the code chunk for downloading and extracting the data is not shown. From the extracted archive, the relevant English datasets are readily retrieved using functionality from the tm library.
# Set up dataset retrieval options
library(tm)
dd <- "./Coursera-SwiftKey/final/en_US"
dd.texts <- DirSource(directory=dd, encoding="UTF-8", mode="text")
# Read data files into a volatile (in-memory) corpus
corp <- VCorpus(dd.texts)
The raw data is summarized with the following metrics (file size and in-memory size are in bytes; average words are per line).
##               Files  FileSize   MemSize   Lines TotalWords AvgWords
## 1   en_US.blogs.txt 210160014 260567992  899288   37546246 41.75108
## 2    en_US.news.txt 205811889  20115064   77259    2674536 34.61779
## 3 en_US.twitter.txt 167105338 316041032 2360148   30093369 12.75063
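The code used to assemble these metrics is not shown; the sketch below is one way such a summary could be produced, assuming readLines for line counts and the stringi package for word counts (the helper name is illustrative).
# Sketch of assembling per-file metrics (illustrative helper, not
# necessarily the code used to produce the table above)
library(stringi)
file_metrics <- function(path){
    lines <- readLines(path, encoding="UTF-8", skipNul=TRUE, warn=FALSE)
    words <- stri_count_words(lines)
    data.frame(Files      = basename(path),
               FileSize   = file.info(path)$size,           # bytes on disk
               MemSize    = as.numeric(object.size(lines)), # bytes in memory
               Lines      = length(lines),
               TotalWords = sum(words),
               AvgWords   = mean(words))                    # words per line
}
do.call(rbind, lapply(list.files(dd, full.names=TRUE), file_metrics))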
The following example is a small extract from one of the datasets prior to cleaning activities.
# Set up a function for viewing corpus documents
# (acknowledgement to Graham@togaware for the function)
library(magrittr)
cor.view <- function(d, n){ d %>% extract2(n) %>% as.character() }
cor.view(corp, 1)[5]
## [1] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
As the intended audience of the application is the general public, the predictive algorithm shall not suggest socially inappropriate terms. Since the end users may span the full range of cultural sensitivities, it is best to err on the safe side and exclude any potentially offensive terms. There are no restrictions on an individual user's expressiveness; if a user wishes to type an obscene or offensive word, they may do so as usual.
From a quick internet search, Luis von Ahn has a fairly comprehensive list of potentially offensive words: List of bad words. This will be used as the starting point and may readily be expanded as further experience is gained.
The tm library command for filtering out profanity has been tested and is included in the code chunk below.
if(!file.exists("bad-words.txt")){
    download.file("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
                  "bad-words.txt")
}
BadWords <- readLines(con="bad-words.txt", warn=FALSE, encoding='UTF-8' )
# strip empty lines from BadWords
BadWords <- BadWords[BadWords!=""]
# remove bad words
corp <- tm_map(corp, removeWords, BadWords)
The dataset has been separated into training and test sets. In addition, small sample sets from the training data have been taken for use during code development.
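The sampling code itself is not shown; as a minimal sketch, a development sample could be drawn line by line as below (the file, sampling fraction, and seed are illustrative rather than the values used for this report).
# Draw a small random sample of lines for code development
# (sampling fraction and seed are illustrative)
set.seed(1234)
sample_lines <- function(lines, fraction=0.01){
    lines[rbinom(length(lines), size=1, prob=fraction) == 1]
}
blogs <- readLines(file.path(dd, "en_US.blogs.txt"), encoding="UTF-8", warn=FALSE)
blogs.sample <- sample_lines(blogs, fraction=0.01)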
The exploratory analysis included experimenting with the possible manipulations of the data using routines available within the tm and RWeka libraries; a representative sketch of these routines is shown below.
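Typical manipulations include case conversion, removal of punctuation and numbers, whitespace stripping, and n-gram tokenization. The following chunk illustrates routines of this kind and is not necessarily the exact sequence applied in this analysis.
# Representative cleaning transformations with tm
library(tm)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)
# Bigram tokenizer with RWeka, usable when building n-gram term document matrices
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))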
A variety of visuals are presented to illustrate aspects of the data set. The data is first tabulated as unigrams into a term document matrix, from which the visuals are derived.
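The unigram term document matrix news.tdm1 referenced in the next chunk is not constructed in the code shown; a minimal sketch, assuming a corpus corp.news built from the news sample, would be:
# Tabulate unigrams into a term document matrix
# (corp.news is an assumed corpus built from the news sample)
news.tdm1 <- TermDocumentMatrix(corp.news)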
# convert TDM to dataframe
corp.df <- as.data.frame(as.matrix(news.tdm1))
# sum rows
corp.df <- data.frame(word=rownames(corp.df), sum=rowSums(corp.df))
# sort by frequency
corp.df <- corp.df[order(-corp.df$sum),]
# fix factors so it plots in decreasing order
corp.df$word <- factor(corp.df$word,
levels=with(corp.df, word[order(sum, word, decreasing = TRUE)]))
# prepare histogram of most common words
library(ggplot2)
g1 <- ggplot(corp.df[1:15,], aes(x=word, y=sum)) +
    geom_bar(stat="identity")
g1
# add index and cumulative sum to the frequency data frame
corp.df$x <- seq(along.with=corp.df$sum)
corp.df$cs <- cumsum(corp.df$sum)
temp <- sum(corp.df$sum)
corp.df$cf <- corp.df$cs/temp
cf.v50 <- length(corp.df$cf[corp.df$cf<=0.5])/length(corp.df$cf)
cf.v90 <- length(corp.df$cf[corp.df$cf<=0.9])/length(corp.df$cf)
rm(temp)
g2 <- ggplot(corp.df, aes(x=(x/length(corp.df$sum)), y=cf)) +
geom_line() +
geom_vline(xintercept=cf.v50, color="blue", linetype="dashed") +
geom_vline(xintercept=cf.v90, color="red", linetype="dashed") +
labs(title="Word Usage Frequency", x="Terms in Corpus, %", y="Cumulative Terms in Text, %")
g2
corp.df$nlet <- nchar(as.character(corp.df$word))
g4 <- ggplot(corp.df, aes(x=nlet)) +
geom_histogram(binwidth = 1) +
geom_vline(xintercept=mean(corp.df$nlet), color="red", linetype="dashed") +
labs(x="Number of Letters in Word", y="Number of Words")
g4
From the exploratory review, the following observations are noted for consideration in the predictive algorithm.
In the Word Usage Frequency plot we find that 50% of the word occurrences in the text are covered by less than 5% of the distinct terms in the corpus, and 90% are covered by roughly 40% of the terms. This indicates that removal of the least common words will potentially have a minimal effect on prediction accuracy.
The text processing times can be very long; removal of the infrequently used words improves processing speed.
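One possible way to trim the vocabulary, reusing the unigram counts already computed above, is a simple frequency cut-off (the threshold here is illustrative and would need to be tuned against prediction accuracy).
# Drop terms seen fewer than 5 times (illustrative threshold)
min.count <- 5
corp.df.trimmed <- corp.df[corp.df$sum >= min.count, ]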