Introduction

This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the project is to create a predictive text product using a large corpus of text documents as training data. Analysis of text data and natural language processing will be used to build the predictive model.

Obtaining the Data

The zip file containing the text files was downloaded and unzipped.

# download and unzip file
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}

The data sets consist of texts from three different sources: 1) news, 2) blogs and 3) Twitter feeds. The text data are provided in four languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian. In this project, only the English - United States data sets will be used.
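
As a quick check, the extracted English (en_US) files can be listed, assuming the archive was unzipped into the default final/ directory used above:

# list the extracted en_US text files
list.files("final/en_US", full.names = TRUE)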

# read data into R
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding =
## "UTF-8", skipNul = TRUE): incomplete final line found on 'final/en_US/
## en_US.news.txt'
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# remove invalid characters
twitter <- sapply(twitter, function(x) iconv(enc2utf8(x), sub = "byte"))
blogs <- sapply(blogs, function(x) iconv(enc2utf8(x), sub = "byte"))
news <- sapply(news, function(x) iconv(enc2utf8(x), sub = "byte"))

A summary of the three files (size in MB, number of lines, total words and mean words per line) is computed below.

library(stringi)

# get file sizes
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

# get words
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)

# create summary
summary <- data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))

summary
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  38153767       42.42664
## 2    news     196.2775     77259   2694417       34.87512
## 3 twitter     159.3641   2360148  30195719       12.79399

Data Cleaning

Since the data sets are quite large, a sample (1%) of the data was selected for demonstration.

library(tm)
## Loading required package: NLP
# Sample the data
set.seed(679)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))

A corpus was created from the combined sample. The data was then cleaned: URLs and Twitter handles were replaced with spaces, text was converted to lowercase, punctuation, numbers, extra white space and common English stopwords were removed, and stemming was applied to collapse inflected forms of the same word. A profanity filter could be added in the same way (see the sketch after the code below).

# create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
# convert to lowercase first so that capitalized stopwords are also caught
corpus <- tm_map(corpus, content_transformer(tolower))

# remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# remove numbers
corpus <- tm_map(corpus, removeNumbers)

# remove common English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("en"))

# remove extra white space
corpus <- tm_map(corpus, stripWhitespace)

# ensure documents are plain text
corpus <- tm_map(corpus, PlainTextDocument)

# stem words to their root form
corpus <- tm_map(corpus, stemDocument)
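
A profanity filter can reuse removeWords; the following is a minimal sketch, assuming a user-supplied word list in a file named profanity.txt (hypothetical file name):

# optional profanity filter; "profanity.txt" is a hypothetical, user-supplied word list
if (file.exists("profanity.txt")) {
  profanity <- readLines("profanity.txt", encoding = "UTF-8", skipNul = TRUE)
  corpus <- tm_map(corpus, removeWords, profanity)
}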

Exploratory Analysis

Frequencies of unigrams, bigrams and trigrams in the sampled corpus were computed and sorted from most to least frequent. The bar charts below show the 30 most frequent n-grams of each type.

library(RWeka)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
options(mc.cores=1)

# compute term frequencies from a term-document matrix, sorted from most to least frequent
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

# RWeka tokenizers for extracting bigrams and trigrams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# bar chart of the 30 most frequent n-grams in a frequency table
makePlot <- function(data, label) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
         labs(x = label, y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("grey50"))
}

# Get frequencies of most common n-grams in data sample
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))

The following are bar charts of the 30 most common:

  1. unigrams
makePlot(freq1, "30 Most Common Unigrams")

  2. bigrams
makePlot(freq2, "30 Most Common Bigrams")

  3. trigrams
makePlot(freq3, "30 Most Common Trigrams")

Plans for the Prediction Algorithm and Shiny App

Following this exploratory analysis, a predictive algorithm will be developed and then deployed as a Shiny app.

A prediction model will be built from a larger sample (perhaps 2%). The algorithm will use an n-gram model with a frequency lookup similar to the exploratory analysis above: given the last one or two words typed, the most frequent matching bigram or trigram supplies the suggested next word.
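
As an illustration of that lookup, the following is a minimal sketch built from the bigram table freq2 computed above; predictNext is a hypothetical helper, not the final model.

# sketch of a frequency-based next-word lookup using the bigram table (freq2)
predictNext <- function(word, bigram.freq = freq2, n = 3) {
  parts  <- strsplit(as.character(bigram.freq$word), " ")
  first  <- sapply(parts, `[`, 1)
  second <- sapply(parts, `[`, 2)
  # bigram.freq is already sorted by frequency, so earlier matches are more frequent
  head(unique(second[first == tolower(word)]), n)
}

predictNext("love")   # suggest up to three words observed after "love" in the sample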

The Shiny app will have a user interface consisting of a text input box. When the user enters a phrase, the app will use the algorithm to suggest the most likely next word.
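
A minimal sketch of that interface is shown below, assuming the hypothetical predictNext helper above (or its eventual replacement) is available to the app:

library(shiny)

# minimal sketch of the planned app: a text input box and the suggested next word(s)
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    words <- strsplit(trimws(tolower(input$phrase)), "\\s+")[[1]]
    if (length(words) == 0) return("")
    # look up likely next words for the last word typed (hypothetical helper)
    paste(predictNext(tail(words, 1)), collapse = ", ")
  })
}

shinyApp(ui, server)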