The Capstone project for the Coursera Data Science Specialization involves exploratory analysis of a large text dataset known as the HC Corpora dataset. The project is done in collaboration with SwiftKey, and its goal is to design a Shiny app with text prediction capabilities. This report outlines the exploratory analysis of the dataset and the current plans for implementing the text prediction algorithm.
The [HC Corpora][1] dataset comprises the output of crawls of news sites, blogs and Twitter. A readme file with more specific details on how the data was generated can be found [here][3]. The dataset contains three files for each of four languages (Russian, Finnish, German and English); this project focuses on the English-language files. The names of the data files are as follows:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
The datasets will be referred to as “Blogs”, “Twitter” and “News” for the remainder of this report.
setwd("/Users/subasishdas1/Copy/OTHERS/Coursera/data sc cap/prog")
blogs<-readLines("en_US/en_US.blogs.txt")
news<-readLines("en_US/en_US.news.txt")
# Loading the libraries
library(knitr)
library(RWeka)
library(qdap)
## Loading required package: ggplot2
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
##
## The following object is masked _by_ '.GlobalEnv':
##
## clean
##
## The following object is masked from 'package:base':
##
## Filter
library(SnowballC)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
##
##
## Attaching package: 'tm'
##
## The following objects are masked from 'package:qdap':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
library(stringr)
Each of the datasets (Blogs, Twitter and News) is large enough that processing time is a concern. To address this, a random sample of each dataset was drawn and used for the remainder of this analysis.
twitter <- readLines("en_US/en_US.twitter.txt", warn = FALSE)
dir.create("en_US/Sample/", showWarnings = FALSE)
set.seed(2000)
# Draw a random 0.5% sample of each dataset and write it out for reuse
blogs <- blogs[sample(length(blogs), round(length(blogs) * 0.005))]
write.csv(blogs, file = "en_US/Sample/blogs.csv", row.names = FALSE)
news <- news[sample(length(news), round(length(news) * 0.005))]
write.csv(news, file = "en_US/Sample/news.csv", row.names = FALSE)
twitter <- twitter[sample(length(twitter), round(length(twitter) * 0.005))]
write.csv(twitter, file = "en_US/Sample/twitter.csv", row.names = FALSE)
# Build a tm corpus from the sampled files
dat <- Corpus(DirSource("en_US/Sample"), readerControl = list(reader = readPlain, language = "en_US"))
# Remove extra whitespace
dat <- tm_map(dat, stripWhitespace)
# Transform all characters to lowercase
dat <- tm_map(dat, content_transformer(tolower))
# Remove numbers
dat <- tm_map(dat, removeNumbers)
# Remove punctuation
dat <- tm_map(dat, removePunctuation)
# Replace non-printable characters with spaces (keep alphanumerics, punctuation and spaces)
tw <- gsub("[^[:print:]]", " ", twitter)
bl <- gsub("[^[:print:]]", " ", blogs)
ns <- gsub("[^[:print:]]", " ", news)
# Sample subset to build and explore model
set.seed(1)
twitterSmpl <- tw[as.logical(rbinom(length(tw), 1, prob=.1))]
newsSmpl <- ns[as.logical(rbinom(length(ns), 1, prob=.2))]
blogSmpl <- bl[as.logical(rbinom(length(bl), 1, prob=.1))]
setwd("/Users/subasishdas1/Copy/Rpubs/Shiny/example")
# Collapse each sample into one document (to avoid numerous metadata entries)
library(qdap)
twitterOne <- paste2(twitterSmpl, " ")
newsOne <- paste2(newsSmpl, " ")
blogOne <- paste2(blogSmpl, " ")
library(tm)
options(mc.cores = 1)  # to make the tm package run more robustly
rawCorpus <- VCorpus(VectorSource(list(twitterOne, newsOne, blogOne)),
readerControl = list(language="english"))
# Clean the corpus: lowercase, stem, remove numbers, collapse whitespace
library(SnowballC)
clean <- function(y) {
  rawCorpus <- tm_map(y, content_transformer(tolower))
  rawCorpus <- tm_map(rawCorpus, stemDocument)
  rawCorpus <- tm_map(rawCorpus, removeNumbers)
  tm_map(rawCorpus, stripWhitespace)
}
cleanCorpus <- clean(rawCorpus)
rm(rawCorpus)
tdm <- TermDocumentMatrix(cleanCorpus)
inspect(tdm[1:10,1:3])
## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 10/20
## Sparsity : 67%
## Maximal term length: 11
## Weighting : term frequency (tf)
##
## Docs
## Terms 1 2 3
## -_- 1 0 0
## --- 1 0 0
## ---. 0 1 0
## ---> 1 0 0
## --, 0 2 0
## --. 0 2 0
## --consider, 0 2 0
## --lhp 0 1 0
## --streamlin 1 0 0
## -.- 1 0 0
findFreqTerms(tdm, lowfreq=2000)
## [1] "the"
sapply(list(twitter, blogs, news), length)
## [1] 11800 4496 5051
sapply(list(twitterSmpl, blogSmpl, newsSmpl), length)
## [1] 1232 446 1023
# Unigram frequencies: row sums give each term's total count across the three documents
uni <- rowSums(as.matrix(tdm))
barplot(tail(sort(uni), 10), las = 2, main = "Top 10 Unigrams", cex.main = 1, cex.axis = 0.75, horiz = TRUE)
library(RWeka)
# Convert the cleaned corpus back to a plain character vector for tokenization
datText <- unlist(lapply(seq_along(dat), function(i) content(dat[[i]])))
Bigram <- NGramTokenizer(datText, Weka_control(min = 2, max = 2))
freq.Bigram <- data.frame(table(Bigram))
sort.Bigram <- freq.Bigram[order(freq.Bigram$Freq, decreasing = TRUE), ]
Bigram_top10 <- head(sort.Bigram, 10)
barplot(Bigram_top10$Freq, names.arg = Bigram_top10$Bigram, border = NA,
        las = 2, main = "Top 10 Most Frequent Bigrams", cex.main = 1)
The following observations stand out from this exploration: 1. Foreign-language text exists in the dataset and will need to be handled effectively (one possible approach is sketched below). 2. The Twitter data is, as expected, made up of short phrases; I expect it to yield different results than the blog and news data, which are much more context rich.
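One possible way to handle the foreign-language text, shown here only as a sketch and not as a final decision, is to drop characters that cannot be converted to ASCII before tokenization (the *Ascii object names are illustrative):
# Sketch (assumption, not the final approach): strip non-ASCII characters so
# foreign-language fragments do not pollute the n-gram counts
blogsAscii <- iconv(blogs, from = "UTF-8", to = "ASCII", sub = "")
newsAscii <- iconv(news, from = "UTF-8", to = "ASCII", sub = "")
twitterAscii <- iconv(twitter, from = "UTF-8", to = "ASCII", sub = "")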
To develop the Shiny app, more rigorous work on the dataset is required. Here are a few things I will consider: 1) identify a more efficient method for tokenization, 2) find a way to deal with foreign-language text in the data, and 3) build several very small training/test sets so that models can be fitted and tested on new data efficiently (a minimal split is sketched below).
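As a starting point for item 3, a small held-out test set could be carved out of each sample. A minimal sketch, assuming a 90/10 split of the Twitter sample (the split proportion and object names are illustrative only):
# Sketch (illustrative): split the Twitter sample into small training and test sets
set.seed(123)
trainIdx <- sample(length(twitterSmpl), round(0.9 * length(twitterSmpl)))
twitterTrain <- twitterSmpl[trainIdx]
twitterTest <- twitterSmpl[-trainIdx]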