Introduction

This project aims to create a text prediction application using data derived from Twitter, blog posts, and news articles. When the user enters a text string into the provided input area, the text will be analyzed, the next word will be predicted, and the prediction will be presented to the user.

Data Loading and Exploration

In order to predict what the user will type next, I need something to compare it to. First, I load a few Natural Language Processing packages and the source data.

library( tm )
## Loading required package: NLP
library( RWeka )

setwd("~/coursera/R Capstone/final/")
options( mc.cores = 1 )

twitter <- readLines("./en_US/en_US.twitter.txt", skipNul=TRUE )
blogs <- readLines("./en_US/en_US.blogs.txt", skipNul=TRUE )
news <- readLines("./en_US/en_US.news.txt", skipNul=TRUE )

length( twitter ) + length( blogs ) + length( news )
## [1] 4269678

The first thing I noticed about these data is their size: over 4.2 million lines in total. So, to start, I load only the first 10,000 lines from each file. Also, several of the RWeka and tm functions have issues with multiple cores on my workstation, so I've set the global option mc.cores to 1.

twitter <- readLines("./en_US/en_US.twitter.txt", skipNul=TRUE, n = 10000 )
blogs <- readLines("./en_US/en_US.blogs.txt", skipNul=TRUE, n = 10000 )
news <- readLines("./en_US/en_US.news.txt", skipNul=TRUE, n = 10000 )

Once the files are loaded, I combine them into a single block of text, create a Corpus using the tm package, and begin the cleaning process. I decided to keep 'bad words' and 'stop words' in the Corpus, as I felt they may be useful for prediction. 'Bad words' will not be returned to the user, but they may be used when processing user input.

raw <- paste( twitter, blogs, news, collapse = '\n' )

d <- Corpus( VectorSource( raw ) )

d <- tm_map( d, tolower, mc.cores = 1 )
d <- tm_map( d, removePunctuation, mc.cores = 1 )
d <- tm_map( d, removeNumbers, mc.cores = 1 )
d <- tm_map( d, stemDocument, mc.cores = 1 )
d <- tm_map( d, stripWhitespace, mc.cores = 1 )

d <- Corpus( VectorSource( d ) )

Next, I create a data.frame of bigrams and trigrams with their frequency counts using the RWeka package.

ng <- NGramTokenizer( d, Weka_control( min=2, max=3) )

ng <- as.data.frame( table( ng ) )

summary( ng )
##                  ng               Freq         
##  = 0              :      1   Min.   :   1.000  
##  0 datetimestamp  :      1   1st Qu.:   1.000  
##  0 datetimestamp =:      1   Median :   1.000  
##  = 0 description  :      1   Mean   :   1.488  
##  0 description    :      1   3rd Qu.:   1.000  
##  0 description =  :      1   Max.   :4105.000  
##  (Other)          :1160276

Let’s take a closer look at the transformed data. The histogram below shows the frequency of bi- and trigrams that individually appear more than 500 times in the source data.
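As a minimal sketch, the plot can be produced from the ng data frame built above by filtering on the 500 cutoff and passing the frequencies to hist (the break count and labels here are my own choices, not part of the original analysis):

# Sketch only: histogram of frequencies for ngrams seen more than 500 times.
common <- ng[ ng$Freq > 500, ]

hist( common$Freq,
      breaks = 50,
      main   = "Bi- and trigram frequencies (Freq > 500)",
      xlab   = "Frequency",
      ylab   = "Count of ngrams" )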

What’s next?

When the user enters text into the input area, the application will search the ngram data frame for similar ngrams using a regular expression. The results will be sorted by the Freq column, and the last words of the top 10 results will be returned to the user. The ngrams will be created from the complete dataset, processed in chunks of 10,000 lines. A rough sketch of the lookup is shown below.
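Assuming the ng data frame built above and a hypothetical badwords character vector for filtering, the lookup could look something like this (the predict_next function and its cleaning steps are illustrative, not the final implementation):

# Sketch of the lookup described above. 'ng' is the bigram/trigram
# frequency data frame built earlier; 'badwords' is a hypothetical
# character vector of words that should never be shown to the user.
predict_next <- function( input, ng, badwords = character( 0 ) ) {

    # Apply the same cleaning used on the corpus.
    input <- tolower( input )
    input <- gsub( "[[:punct:][:digit:]]", "", input )
    input <- gsub( "\\s+", " ", trimws( input ) )

    # Keep the last one or two words, since the table holds bi- and trigrams.
    words <- tail( strsplit( input, " " )[[ 1 ]], 2 )

    # Search the ngram column for entries that begin with those words.
    ngrams  <- as.character( ng$ng )
    pattern <- paste0( "^", paste( words, collapse = " " ), " " )
    hits    <- ng[ grepl( pattern, ngrams ), ]

    # Sort by frequency and keep the top 10 matches.
    hits <- head( hits[ order( -hits$Freq ), ], 10 )

    # Return the final word of each matching ngram, dropping any 'bad words'.
    preds <- sub( ".* ", "", as.character( hits$ng ) )
    setdiff( preds, badwords )
}

For example, predict_next( "Thanks for the", ng ) would clean the input to "thanks for the", match ngrams beginning with "for the", and return the distinct final words of the most frequent matches.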

Credit: xkcd, http://xkcd.com/208/