March 15, 2016

Overview

This is the milestone report for the Data Science Capstone Project.

The goal of this report was to build a simple model for the relationship between words, as a first step in creating a predictive text mining application.

The following sections describe my methods for analysing the datasets.

Libraries

Load the necessary libraries

library(tm)
library(knitr)
library(ggplot2)

Securing the Data and Preliminary Analyses

The dataset can be downloaded from here

Included are three data files containing text sampled from blogs, news articles, and Twitter feeds. Only the English versions were used.

## Read in the files
blog    <- readLines("data/final/en_US/en_US.blogs.txt")
news    <- readLines("data/final/en_US/en_US.news.txt")
twitter <- readLines("data/final/en_US/en_US.twitter.txt")
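
On some platforms, reading these files in text mode can stop early at embedded control characters or warn about embedded nulls, which would understate the line counts. A minimal sketch of a more defensive reader follows, assuming the files are UTF-8 encoded. The helper name read_lines_safely is hypothetical and was not part of the original analysis.

```r
# Hypothetical helper: read a text file through a binary connection so that
# embedded control characters do not end the read early.
read_lines_safely <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Usage with a small temporary file standing in for a real data file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_lines_safely(tmp)
length(demo_lines)
```

If the line counts from this reader differ from those of a plain readLines call on the same file, the plain call was probably truncated.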

Combined, the datasets contain more than 3.3 million lines of text and roughly 386 million characters.

kable(data.frame(
  "Data File"       = c("Blogs", "News", "Twitter"), 
  "Line Count"      = c(length(blog), length(news), length(twitter)),
  "Character Count" = c( sum(nchar(blog)), sum(nchar(news)), sum(nchar(twitter)))
  ))
Data.File   Line.Count   Character.Count
Blogs           899288         208361438
News             77259          15683765
Twitter        2360148         162384825

Clean and sample the data sets

To predict the next word accurately and reasonably efficiently, the data set needed to be cleaned. Numbers, punctuation, special characters, and stop words were removed. In addition, words were reduced to their stems.

The next sections use the tm package; you can find an introduction here.

Samples of the data sets are used to reduce the memory footprint.

set.seed(90210)

combined_set <- c(blog, news, twitter)
sample_set   <- sample(combined_set, size = 7500, replace = TRUE)

# encode
sample_set <- iconv(sample_set, "latin1", "ASCII", "") 

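# create the corpus (VCorpus, so that the custom n-gram tokenizers used later
# are honoured; in recent tm versions Corpus() may return a SimpleCorpus,
# which ignores them)
swiftkey <- VCorpus(VectorSource(sample_set))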

# clean up objects
rm(blog, news, twitter, sample_set, combined_set)
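
The sampling step above draws with replacement, so the same line can appear more than once in the sample. As a point of comparison, here is a minimal sketch of a reproducible draw without replacement. The helper name sample_lines and the toy input are assumptions for illustration only.

```r
# Hypothetical helper: draw n distinct lines, reproducibly.
sample_lines <- function(lines, n, seed = 90210) {
  set.seed(seed)
  n <- min(n, length(lines))  # never ask for more lines than exist
  lines[sample.int(length(lines), size = n, replace = FALSE)]
}

toy <- paste("line", 1:100)
s1 <- sample_lines(toy, 10)
s2 <- sample_lines(toy, 10)

identical(s1, s2)   # same seed, same sample
anyDuplicated(s1)   # 0 means no line was drawn twice
```

Whether duplicates matter depends on the goal: for frequency estimates over a small sample, distinct lines avoid counting the same text twice.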

Tidy up the data sets by removing elements and converting to a common case.

swiftkey <- tm_map(swiftkey, content_transformer(tolower))

# remove the most commonly used words in the english language
swiftkey <- tm_map(swiftkey, removeWords, stopwords("english"))

# reduce inflected (or sometimes derived) words to their word stem
swiftkey <- tm_map(swiftkey, stemDocument)

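# clean up; whitespace is stripped last so that the gaps left behind by
# removed punctuation and numbers are collapsed as well
swiftkey <- tm_map(swiftkey, removePunctuation)
swiftkey <- tm_map(swiftkey, removeNumbers)
swiftkey <- tm_map(swiftkey, stripWhitespace)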
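
To see what these transformations do, the same tm functions can be applied to a single character string instead of a corpus. This is a minimal sketch, not the pipeline used for the results in this report: the helper name clean_text and the sample sentence are made up, stemming is left out to avoid the extra SnowballC dependency, and whitespace is stripped last.

```r
library(tm)

# Hypothetical helper: apply the cleaning steps to a plain character vector.
clean_text <- function(x) {
  x <- tolower(x)
  x <- removeWords(x, stopwords("english"))
  x <- removePunctuation(x)
  x <- removeNumbers(x)
  x <- stripWhitespace(x)   # collapse runs of whitespace to a single space
  trimws(x)                 # drop leading and trailing spaces
}

raw     <- "The 3 quick brown foxes, JUMPING over 2 lazy dogs!!"
cleaned <- clean_text(raw)
print(cleaned)
```

Printing the intermediate value after each step is a quick way to check that the order of the transformations behaves as intended.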

Basic n-gram models

The final section deals with modeling n-grams and graphing the frequencies of the most common unigrams, bigrams, and trigrams.
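
An n-gram is a run of n consecutive words. The idea can be sketched in base R on a toy sentence. The helper name make_ngrams is hypothetical, and the tokenizers used for the actual analysis rely on the NLP package instead.

```r
# Hypothetical helper: build all n-grams from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens   <- strsplit("thank you very much", " ", fixed = TRUE)[[1]]
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

bigrams   # "thank you" "you very" "very much"
trigrams  # "thank you very" "you very much"
```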

For reference, see the following links if you are interested in learning more about text mining with R.

unigram <- function(x) unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
bigram  <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
trigram <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

tdm1 <- TermDocumentMatrix(swiftkey, control = list(tokenize = unigram))
tdm2 <- TermDocumentMatrix(swiftkey, control = list(tokenize = bigram))
tdm3 <- TermDocumentMatrix(swiftkey, control = list(tokenize = trigram))
rm(swiftkey)

freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing = TRUE)[1:15]
rm(tdm1)

freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)[1:15]
rm(tdm2)

freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)[1:15]
rm(tdm3)
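
Converting a term-document matrix with as.matrix creates a dense matrix, which can use a lot of memory for larger samples. A possible alternative, sketched here on a toy sparse matrix, is to sum the rows directly with the slam package, which tm uses for its sparse matrices. The helper name top_terms and the toy data are assumptions. Using head instead of [1:15] also avoids NA entries when there are fewer than 15 terms.

```r
library(slam)

# Hypothetical helper: top-k terms by total frequency, without densifying.
top_terms <- function(tdm, k = 15) {
  totals <- row_sums(tdm)
  head(sort(totals, decreasing = TRUE), k)
}

# Toy sparse matrix standing in for a term-document matrix
toy_tdm <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L),
  j = c(1L, 2L, 2L, 3L),
  v = c(2, 1, 4, 1),
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("good day", "thank you", "last night"),
                  Docs  = c("1", "2", "3"))
)

top <- top_terms(toy_tdm, k = 2)
top
```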

Unigram

Unigrams ordered by frequency

x1 <- data.frame(shingles = names(freq1), fq = freq1)

ggplot(x1, aes(x=reorder(shingles, fq), y=fq)) + 
  geom_bar(stat="identity") + xlab("Term(s)") + ylab("Frequency") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

Bigram

Bigrams ordered by frequency

x2 <- data.frame(shingles = names(freq2), fq = freq2)

ggplot(x2, aes(x=reorder(shingles, fq), y=fq)) + 
  geom_bar(stat="identity") + xlab("Term(s)") + ylab("Frequency") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

Trigram

Trigrams ordered by frequency

x3 <- data.frame(shingles = names(freq3), fq = freq3)

ggplot(x3, aes(x=reorder(shingles, fq), y=fq)) + 
  geom_bar(stat="identity") + xlab("Term(s)") + ylab("Frequency") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
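
The three plots above use the same code with different inputs, so they could be produced by one small function. This is a minimal sketch. The function name plot_top_ngrams, the added title, and the toy frequencies are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical helper: bar chart of a named frequency vector.
plot_top_ngrams <- function(freq, title) {
  df <- data.frame(shingles = names(freq), fq = as.numeric(freq))
  ggplot(df, aes(x = reorder(shingles, fq), y = fq)) +
    geom_bar(stat = "identity") +
    xlab("Term(s)") + ylab("Frequency") + ggtitle(title) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}

# Toy frequencies standing in for freq1, freq2, or freq3
toy_freq <- c("thank you" = 12, "last night" = 9, "good day" = 5)
p <- plot_top_ngrams(toy_freq, "Bigrams (toy data)")
```

Printing p draws the chart. The same call would work with each of the three frequency vectors computed earlier.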

Conclusion and next steps

This was a good first step towards the final goal of creating a prediction algorithm. I can see some large hurdles, such as the quality of the data, performance, and the choice of an overall prediction methodology. As next steps, I am going to continue to model and clean the data and look for ways to make it more useful.