This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the project is to create a predictive text product using a large corpus of text documents as training data. Analysis of text data and natural language processing will be used to build the predictive model.
The zip file containing the text files was downloaded and unzipped.
# download and unzip file
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
The data sets consist of texts from three different sources: 1) news, 2) blogs and 3) Twitter feeds. The text data are provided in four languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian. In this project, only the English - United States data sets will be used.
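For reference, the English (US) files used below can be listed directly from the unzipped folder (the paths assume the default extraction location):
# list the English (US) text files in the unzipped folder
list.files("final/en_US", full.names = TRUE)
## e.g. "final/en_US/en_US.blogs.txt" "final/en_US/en_US.news.txt" "final/en_US/en_US.twitter.txt"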
# read data into R
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding =
## "UTF-8", skipNul = TRUE): incomplete final line found on 'final/en_US/
## en_US.news.txt'
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# remove invalid characters
twitter <- iconv(enc2utf8(twitter), sub = "byte")
blogs <- iconv(enc2utf8(blogs), sub = "byte")
news <- iconv(enc2utf8(news), sub = "byte")
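As a side note, the "incomplete final line" warning for en_US.news.txt is harmless; if desired, it can be avoided by reading that file through a binary connection, for example:
# optional alternative: read the news file via a binary connection to avoid
# the "incomplete final line" warning
con <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)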
A summary of the three data sets is shown below: file size, number of lines, total word count, and mean words per line.
library(stringi)
# get file sizes
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# get words
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# create summary
summary <- data.frame(source = c("blogs", "news", "twitter"),
file.size.MB = c(blogs.size, news.size, twitter.size),
num.lines = c(length(blogs), length(news), length(twitter)),
num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
summary
## source file.size.MB num.lines num.words mean.num.words
## 1 blogs 200.4242 899288 38153767 42.42664
## 2 news 196.2775 77259 2694417 34.87512
## 3 twitter 159.3641 2360148 30195719 12.79399
Since the data sets are quite large, a sample (1%) of the data was selected for demonstration.
library(tm)
## Loading required package: NLP
# Sample the data
set.seed(679)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
A corpus was created from the sampled data and then cleaned: URLs and Twitter handles were stripped out, punctuation, extra white space and numbers were removed, the text was converted to lowercase, English stopwords were dropped, and stemming was applied to collapse inflected forms of the same word.
# create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# remove white spaces
corpus <- tm_map(corpus, stripWhitespace)
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
# convert to lowercase before removing stopwords so that capitalised
# stopwords (e.g. "The", "And") are also matched
corpus <- tm_map(corpus, content_transformer(tolower))
# remove English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("en"))
# convert to plain text documents
corpus <- tm_map(corpus, PlainTextDocument)
# apply stemming to collapse inflected forms
corpus <- tm_map(corpus, stemDocument)
Term-document matrices were then built for unigrams, bigrams and trigrams, and the term frequencies were sorted from largest to smallest. The bar charts below show the 30 most frequent n-grams of each type in the sample.
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
# use a single core to avoid issues between RWeka and parallel tm_map
options(mc.cores = 1)
# helper: extract sorted term frequencies from a term-document matrix
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# RWeka tokenizers for bigrams and trigrams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# helper: bar chart of the 30 most frequent terms
makePlot <- function(data, label) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}
# Get frequencies of most common n-grams in data sample
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
The following plots show the 30 most common unigrams, bigrams and trigrams:
makePlot(freq1, "30 Most Common Unigrams")
makePlot(freq2, "30 Most Common Bigrams")
makePlot(freq3, "30 Most Common Trigrams")
Following this exploratory analysis, a predictive algorithm will be developed and then deployed as a Shiny app.
The prediction model will be built on a larger sample of the data (perhaps 2%). The predictive algorithm will use an n-gram model with a frequency lookup similar to the exploratory analysis above: given the last one or two words typed, the most frequent matching bigram or trigram determines the suggested next word.
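As an illustration of this frequency-lookup idea, the sketch below reuses the freq2 (bigram) and freq3 (trigram) tables built above; the predictNextWord() helper and its simple back-off from trigrams to bigrams are placeholders rather than the final implementation. Because the exploratory corpus was stemmed and stripped of stopwords, the final model would likely be trained without those steps so that natural next words can be suggested.
# illustrative only: simple back-off lookup over the n-gram tables above
# (predictNextWord is a hypothetical helper, not the final implementation)
predictNextWord <- function(phrase, bigrams = freq2, trigrams = freq3) {
  # tokenize the input roughly the way the training corpus was cleaned
  tokens <- gsub("[^a-z' ]", " ", tolower(phrase))
  tokens <- unlist(strsplit(tokens, "\\s+"))
  tokens <- tokens[tokens != ""]
  n <- length(tokens)
  tri <- as.character(trigrams$word)
  bi <- as.character(bigrams$word)
  if (n >= 2) {
    # prefer trigrams whose first two words match the last two words typed
    hits <- tri[grepl(paste0("^", tokens[n - 1], " ", tokens[n], " "), tri)]
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
  }
  if (n >= 1) {
    # back off to bigrams whose first word matches the last word typed
    hits <- bi[grepl(paste0("^", tokens[n], " "), bi)]
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
  }
  NA_character_  # no match in the sampled tables
}
# example (output depends on the sampled data):
# predictNextWord("thanks for the")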
The Shiny app will provide a user interface with a text input box. When the user enters a phrase, the app will apply the algorithm and suggest the most likely next word.
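A minimal sketch of such an interface, assuming a predictNextWord() helper like the one above is available in the app, might look as follows (names and layout are illustrative only):
library(shiny)
# minimal interface: one text box for the phrase and one text output for the suggestion
ui <- fluidPage(
  titlePanel("Next Word Prediction (sketch)"),
  textInput("phrase", "Enter a phrase:", value = ""),
  textOutput("suggestion")
)
server <- function(input, output) {
  output$suggestion <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    paste("Suggested next word:", predictNextWord(input$phrase))
  })
}
shinyApp(ui = ui, server = server)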