This document reports on progress towards a text prediction algorithm and Shiny application developed for the Capstone project that concludes the nine-course Data Science Specialization offered through Coursera and the Johns Hopkins Department of Biostatistics. The project is offered in cooperation with SwiftKey, a company building smart prediction technology for easier mobile typing.
This report is also available as an RStudio Presenter presentation.
The data is provided for us, so the first step is to download it and get an idea of the file sizes. We know a priori that the data set consists of three large text corpora taken from blogs, news and Twitter. We report on the number of words in a subsample of the data after tokenization (see below).
suppressPackageStartupMessages(require("downloader")); suppressPackageStartupMessages(require("R.utils"))
# Download, unzip data and setwd()
url <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download(url, dest = "data.zip", mode = "wb")
unzip("data.zip", exdir = "./")
# Set working directory
setwd(paste(getwd(),"/final/en_US",sep=""))
list.files()
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
# Get an idea of corpora sizes
as.numeric(countLines("en_US.blogs.txt"))
## [1] 899288
as.numeric(countLines("en_US.news.txt"))
## [1] 1010242
as.numeric(countLines("en_US.twitter.txt"))
## [1] 2360148
# Read in data
blogs <- readLines("en_US.blogs.txt"); news <- readLines("en_US.news.txt"); tweets <- readLines("en_US.twitter.txt", skipNul = TRUE)
# Clean up
rm(url)
Seeing the size of the data set, we elect to take a random sample large enough to remain reasonably representative while keeping computation manageable. The following code chunk performs this sampling, reporting only the number of lines drawn from each of the three corpora.
## [1] 9345
## [1] 9345
## [1] 9345
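The sampling code itself is not echoed in this report. A minimal sketch of how such a sample might be drawn is given below; the per-corpus sample size of 9345 lines comes from the output above, while the seed and object names are illustrative assumptions.
# Sketch: draw an equal-sized random sample of lines from each corpus.
# The 9345-line sample size reflects the output above; the seed is illustrative.
set.seed(42)
n_samp <- 9345
blogs_samp  <- sample(blogs,  n_samp)
news_samp   <- sample(news,   n_samp)
tweets_samp <- sample(tweets, n_samp)
length(blogs_samp); length(news_samp); length(tweets_samp)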
With a more manageable sampled subset of the data, we briefly look at some of the more interesting and challenging features of the data for text prediction. To get an idea of such features, we look at the last 5 lines of the blogs sample.
tail(blogs_samp, n = 5)
## [1] "I'm begining to think that where my life seems to be lived, is in the spaces between plans. Like I make plans..they go all wonky..then I start to plan again..mostly never really reaching the point I had planned so hard for. It's a funny little cycle I find myself in."
## [2] "Translation:"
## [3] "Born in the unfathomable depths of Space, out of the homogeneous Element called the World-Soul, every nucleus of Cosmic matter, suddenly launched into being, begins life under the most hostile circumstances. Through a series of countless ages, it has to conquer for itself a place in the infinitudes. It circles round and round between denser and already fixed bodies, moving by jerks, and pulling towards some given point or centre that attracts it, trying to avoid, like a ship drawn into a channel dotted with reefs and sunken rocks, other bodies that draw and repel it in turn; many perish, their mass disintegrating through stronger masses, and, when born within a system, chiefly within the insatiable stomachs of various Suns. (See Comm. to Stanza IV). Those which move slower and are propelled into an elliptic course are doomed to annihilation sooner or later. Others moving in parabolic curves generally escape destruction, owing to their velocity."
## [4] "And then there was the burger. THE Burger. Here, the taste of the locally sourced, humanely raised, organic beef was clearly superior. I'll never go back."
## [5] "RUSTENBURG: Group of black mineworkers assault Anita Venter (24) and turns over her car."
For the purposes of text prediction, these lines reveal several types of characters that should be eliminated because they should not be predicted, including punctuation, quotation marks and digits. Furthermore, this and the other two samples are likely to contain non-English characters, emoticons, curse words and other content we would not want to predict. As such, we clean the samples. The following code chunk performs this data cleaning on the blogs sample only, displaying the first 5 lines of the cleaned blogs sample.
## [1] "but unlike pontiac which only ed away media dollars chevy is flirting with frittering away its whole culture on people who don t buy cars don t want cars and can t afford cars "
## [2] " the only commandment with a promise is to honor our fathers and mothers and the promise is so that it may be well with you and that you may live long on the earth see ephesians nowhere does it say we should only honor great fathers and mothers or even good ones but simply the ones we have been given good or bad almost all will be both good and bad just as some of the greatest heroes in the bible also made some of the greatest mistakes rick joyner"
## [3] "thinking about my kids upcoming birthdays and not thinking i want to really do very much for them wondering if that makes me the worst mother in the world "
## [4] " yes richard fine is he your case "
## [5] "i spent last night dipping in and out of some books to remind myself of how i chose to discipline because yesterday it didn t go far beyond screaming no and came across this in the womanly art of breastfeeding william g white md and experienced family practice doctor and father offers the following observation "
The next step is to tokenize the samples and build n-grams, which we will then use to estimate the conditional probability of a word given the preceding context. We have chosen R's stylo package for tokenization and n-gram construction. The following code does so for the blogs sample, reporting the total number of words in this sample. The code chunk then displays the 20 most common 1-, 2-, 3- and 4-grams in that sample.
## Warning: package 'stylo' was built under R version 3.1.2
## using current directory...
## loading blogs_sample.txt ...
## [1] 27546
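The tokenization code is also not echoed. A minimal sketch using stylo's txt.to.words() and make.ngrams() follows; writing the cleaned sample to blogs_sample.txt mirrors the "loading blogs_sample.txt" message above, but the exact loading and counting steps used in the report are assumptions.
# Sketch: tokenize the cleaned blogs sample and build 1- to 4-grams with stylo.
suppressPackageStartupMessages(require("stylo"))
writeLines(blogs_samp, "blogs_sample.txt")
raw_lines <- readLines("blogs_sample.txt")
tokens    <- txt.to.words(raw_lines)          # individual words (1-grams)
length(tokens)                                # number of words in the sample
bigrams   <- make.ngrams(tokens, ngram.size = 2)
trigrams  <- make.ngrams(tokens, ngram.size = 3)
fourgrams <- make.ngrams(tokens, ngram.size = 4)
# 20 most common n-grams of each order (used for the barplots)
head(sort(table(tokens),    decreasing = TRUE), 20)
head(sort(table(bigrams),   decreasing = TRUE), 20)
head(sort(table(trigrams),  decreasing = TRUE), 20)
head(sort(table(fourgrams), decreasing = TRUE), 20)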
This report brings up a few potentially thorny issues that have yet to be addressed. Chief among these is the treatment of contractions. As the barplots reveal, contractions currently appear in the data as two words (e.g., "don't" becomes "don" + "t"). Experimentation with the text prediction models should show whether this is the right choice.
Now that we have sorted lists of n-grams, the next step is to write a function that accepts a string of words and attempts to predict the next word. We anticipate doing so by searching our sorted lists for the longest matching string and then taking the word that most often occurs next. For example, if the user types "I am going", the function finds this trigram in the data and returns the word from the 4-gram table that most often completes that sequence in the observed data. If "I am going to" occurs, say, 76 times and "I am going for" occurs 52 times, the function returns "to" as the predicted word.
If the string does not occur in the data, we plan to back off to a smaller n-gram. Such a "backoff" mechanism completes a progressively shorter context, so that if the user enters "Colorless clouds float" and that trigram does not occur in the data, the function looks for "clouds float" and, if necessary, just "float", returning the most probable next word in each case.
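Neither the lookup nor the backoff has been implemented yet. The sketch below illustrates the intended approach, assuming the sorted n-gram frequency tables built above are gathered into a list; the names ngram_tables and predict_next_word are illustrative, not part of the report.
# Sketch of the planned prediction function with simple backoff.
# Assumes frequency tables whose names are space-separated n-grams, as above.
ngram_tables <- list(
  `2` = sort(table(bigrams),   decreasing = TRUE),
  `3` = sort(table(trigrams),  decreasing = TRUE),
  `4` = sort(table(fourgrams), decreasing = TRUE)
)
predict_next_word <- function(input, tables = ngram_tables) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  words <- words[words != ""]
  if (length(words) == 0) return(NA_character_)
  # try the longest available context first, then back off to shorter ones
  for (context_len in seq(min(3, length(words)), 1)) {
    context <- paste0(paste(tail(words, context_len), collapse = " "), " ")
    table_n <- tables[[as.character(context_len + 1)]]
    # keep the (context_len + 1)-grams that start with the context
    hits <- table_n[substr(names(table_n), 1, nchar(context)) == context]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best n-gram
    }
  }
  NA_character_  # context never observed, even as a single word
}
# e.g. predict_next_word("I am going") searches the 4-gram table for
# "i am going ..." and returns the most frequent completion, backing off
# to "am going ..." and "going ..." if needed.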
Once the function is written, we will begin training the model on increasingly large samples until we reach the limits of our computing capacity. At that point, we will explore avenues for making the model more efficient, for example through more efficient code and better data structures. We then plan to build these components into a Shiny application, remaining as faithful as possible to the structure and experience described herein.