Introduction

This report records the progress made on Johns Hopkins University/Coursera’s Data Science capstone project. The project goal is to create a mobile app (hosted on a web page) that, given a string of words, will predict the next word. This app cannot be too computationally intensive, as it must run on a web page or mobile device, yet it cannot be so inaccurate as to be useless - balancing accuracy and speed is the crux of this project. The processing and analysis described in this report use the following R packages: tm, quanteda, parallel, doParallel, dplyr, ngram.

The Data

The datasets available for building a text-prediction model were provided by SwiftKey, maker of the text-predictive keyboard available for iOS and Android devices. There are three large datasets in English: news articles, blog posts, and Twitter tweets. In language-processing lingo, a group of documents for analysis is called a corpus; it is also the name of a specific data type used by R text-mining packages. In this project, each file - news, blogs, or tweets - forms a corpus. Exploratory analysis and code-testing focused predominantly on the news dataset, but the referenced code works with any of the three corpora. Below is a basic summary - word counts and line counts - for each file, with no pre-processing applied.

library(ngram)   # provides wordcount()

con1 <- file("./final/en_US/en_US.news.txt")
con2 <- file("./final/en_US/en_US.blogs.txt")
con3 <- file("./final/en_US/en_US.twitter.txt")

# Read each file once, then report its line and word counts
news   <- readLines(con1, skipNul = TRUE)
blogs  <- readLines(con2, skipNul = TRUE)
tweets <- readLines(con3, skipNul = TRUE)

print(paste("News Lines:", length(news)))
print(paste("News Word Count:", wordcount(news)))

print(paste("Blogs Lines:", length(blogs)))
print(paste("Blogs Word Count:", wordcount(blogs)))

print(paste("Tweets Lines:", length(tweets)))
print(paste("Tweets Word Count:", wordcount(tweets)))

close(con1)
close(con2)
close(con3)
## [1] "News Lines: 1010242"
## [1] "News Word Count: 34372530"
## [1] "Blogs Lines: 899288"
## [1] "Blogs Word Count: 37334131"
## [1] "Tweets Lines: 2360148"
## [1] "Tweets Word Count: 30373583"

Each dataset is massive, so loading and organizing the files is the first challenge. I used readLines() and quanteda’s corpus() function to create corpus objects. All subsequent cleaning and analysis takes advantage of parallel processing, using functions such as parLapply() to drastically cut down on runtime.
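
As a rough illustration, the sketch below (object names are mine and not necessarily those used in the project scripts) builds a corpus from the news lines read above and shows the parallel pattern used for the later cleaning steps.

library(quanteda)
library(parallel)

# Build a quanteda corpus from the news lines read above
newsCorpus <- corpus(news)

# Set up a cluster, leaving one core free for the operating system
cl <- makeCluster(detectCores() - 1)

# Example of the parallel pattern: split the texts into five chunks and
# process each chunk on its own worker with parLapply()
chunks <- split(as.character(newsCorpus),
                cut(seq_along(newsCorpus), 5, labels = FALSE))
chunkChars <- parLapply(cl, chunks, function(x) sum(nchar(x)))

stopCluster(cl)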

Cleaning and Organizing

The news articles seem to be scraped from web pages, and often include phone numbers or email addresses from site footers. These are a few of the things I wrote a function, myFilter(), to clean up. The entire code for this filter is available on my github page (repository DSCapstone/myFilter.R) located here; a simplified sketch of the same steps follows the list below. Running myFilter(corpus) accomplishes the following:

  • Removes non-ASCII characters with the iconv() function
  • Removes URLs, email addresses, and numbers with the gsub() function
  • Converts contractions into two separate words with the gsub() function
  • Splits the corpus into five sections for parallel processing
  • Restructures the corpus chunks from article-level entries to sentence-level entries with the corpus_reshape() function and parallel processing
  • Returns a list of the five corpus chunks for further processing
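
The following is an illustrative sketch of these steps, not the actual myFilter() source (see the repository for that); it uses plain lapply() where the real filter uses parLapply(), and the regular expressions are simplified examples.

library(quanteda)

# Illustrative sketch only; the real myFilter() lives in the DSCapstone repo
myFilterSketch <- function(corp, nChunks = 5) {
  txt <- as.character(corp)

  txt <- iconv(txt, from = "UTF-8", to = "ASCII", sub = "")   # drop non-ASCII characters
  txt <- gsub("http\\S+|www\\.\\S+", "", txt)                 # remove URLs
  txt <- gsub("\\S+@\\S+", "", txt)                           # remove email addresses
  txt <- gsub("[0-9]+", "", txt)                              # remove numbers
  txt <- gsub("'re\\b", " are", txt)                          # expand simple contractions
  txt <- gsub("'ll\\b", " will", txt)

  # split into chunks, rebuild each chunk as a corpus, and reshape to sentences
  idx    <- cut(seq_along(txt), nChunks, labels = FALSE)
  chunks <- lapply(split(txt, idx), corpus)
  lapply(chunks, corpus_reshape, to = "sentences")
}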

This is a good starting point for the word-order analysis.

Exploratory Analysis

The quanteda function textstat_collocations() identifies multi-word expressions (in this case, word pairs) and scores them. Collocations are similar to n-grams; I chose the collocation function specifically because of the simplicity of its output, and I use its numeric ‘count’ variable as an easy-to-compare measure of n-gram (word pair) frequency. Running the collocation function on each corpus and stripping the extraneous columns leaves two variables: the two-word phrase and the number of times it appears. A sketch of this step is shown below, followed by the first few lines of the resulting data table.
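
This sketch assumes a cleaned corpus (here called newsCorpus) and stores the result in combs; note that in recent quanteda releases textstat_collocations() lives in the companion quanteda.textstats package.

library(quanteda)
library(quanteda.textstats)   # textstat_collocations() in newer quanteda versions
library(dplyr)

# Score two-word collocations, then keep only the phrase and its raw count
colls <- textstat_collocations(newsCorpus, size = 2)
combs <- select(colls, collocation, count)
head(combs)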

collocation    count
it is          34025
more than      12445
in the         78667
of the         84790
will be        10536
if you          9986

For an example use case of the collocation data, I will search for the word ‘rather’ and chart the most likely following words.

# Keep the rows where 'rather' is the first word of the pair, sorted by count
matches <- combs[grepl("^rather ", combs$collocation, ignore.case = TRUE), ]
matches <- arrange(matches, desc(count))

# Bar chart of the most frequent 'rather ...' pairs, with angled labels
x <- barplot(height = head(matches$count), ylim = c(0, 1400),
             main = "Word Pairs Example: 'Rather'")
labs <- head(matches$collocation)
text(cex = 0.9, x = x - .25, y = -200, labs, xpd = TRUE, srt = 45)

‘Rather than’ is clearly the most likely combination, so the algorithm in the final app should return ‘than’ to the user.

Plans for the App

The collocation data could be the backbone of a prediction app that uses regular expressions (regex) and a simple search function. Suppose the user inputs a word string; the final word becomes the search term (with a regex specifying that it must be the first word of a two-word pair). The search would return all rows in which that word appears as the first word, and the app would output to the user the second word from the row with the highest count. This design is user-oriented and makes sense from an outward perspective, but is it the most efficient? A dictionary-style Markov chain may be faster for the end product, since the user-input string would point directly to one specific data line. However, the processing needed to create a Markov-chain dictionary may be beyond the capabilities of my computers. The next two weeks of the project will be spent experimenting with and fine-tuning an algorithm to search through and organize the data, so I will try both n-gram collocations and Markov chains.
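
As a hypothetical illustration of the regex-lookup approach (the function name and details are mine, not the final app code):

# Hypothetical sketch of the lookup described above
predictNext <- function(word, combs) {
  pattern <- paste0("^", word, " ")                                   # word must start the pair
  hits <- combs[grepl(pattern, combs$collocation, ignore.case = TRUE), ]
  if (nrow(hits) == 0) return(NA_character_)                          # no matching pair found
  best <- hits[which.max(hits$count), "collocation"]                  # highest-count pair
  strsplit(best, " ")[[1]][2]                                         # return its second word
}

predictNext("rather", combs)   # should return "than" for this corpus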