The Capstone project for the Coursera Data Science Specialization involves exploratory analysis of a large text dataset known as the HC Corpora dataset. The project is done in collaboration with SwiftKey, and its goal is to design a Shiny app with text prediction capabilities. This report outlines the exploratory analysis of the dataset and the current plans for implementing the text prediction algorithm.
The [HC Corpora][1] dataset comprises the output of crawls of news sites, blogs and Twitter. A readme file with more specific details on how the data was generated can be found [here][3]. The dataset contains three files for each of four languages (Russian, Finnish, German and English); this project focuses on the English-language files. The names of the data files are as follows:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
The datasets will be referred to as “Blogs”, “Twitter” and “News” for the remainder of this report.
setwd("/Users/subasishdas1/Copy/OTHERS/Coursera/data sc cap/prog")
blogs<-readLines("en_US/en_US.blogs.txt")
news<-readLines("en_US/en_US.news.txt")
# Loading the libraries
library(knitr)
library(RWeka)
library(qdap)
## Loading required package: ggplot2
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
##
## The following object is masked _by_ '.GlobalEnv':
##
## clean
##
## The following object is masked from 'package:base':
##
## Filter
library(SnowballC)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
##
##
## Attaching package: 'tm'
##
## The following objects are masked from 'package:qdap':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
library(stringr)
Each of the datasets (Blogs, Twitter and News) is large enough that processing time is a concern. To address this, a random sample of each dataset was drawn and used for the remainder of this analysis.
twitter <- readLines("en_US/en_US.twitter.txt", warn = FALSE)
dir.create("en_US/Sample/", showWarnings = FALSE)
set.seed(2000)
# Draw a random 0.5% sample of each dataset and write it out for reuse
blogs <- blogs[sample(length(blogs), round(length(blogs) * 0.005))]
write.csv(blogs, file = "en_US/Sample/blogs.csv", row.names = FALSE)
news <- news[sample(length(news), round(length(news) * 0.005))]
write.csv(news, file = "en_US/Sample/news.csv", row.names = FALSE)
twitter <- twitter[sample(length(twitter), round(length(twitter) * 0.005))]
write.csv(twitter, file = "en_US/Sample/twitter.csv", row.names = FALSE)
# Build a tm corpus from the sampled files
dat <- Corpus(DirSource("en_US/Sample"), readerControl = list(reader = readPlain, language = "en_US"))
# Remove extra whitespace
dat <- tm_map(dat, stripWhitespace)
# Transform all characters to lowercase
dat <- tm_map(dat, content_transformer(tolower))
# Remove numbers
dat <- tm_map(dat, removeNumbers)
# Remove punctuation
dat <- tm_map(dat, removePunctuation)
# Replace non-printable characters with spaces (keep alphanumerics, punctuation and spaces)
tw <- gsub("[^[:print:]]", " ", twitter)
bl <- gsub("[^[:print:]]", " ", blogs)
ns <- gsub("[^[:print:]]", " ", news)
# Sample subset to build and explore model
set.seed(1)
twitterSmpl <- tw[as.logical(rbinom(length(tw), 1, prob=.1))]
newsSmpl <- ns[as.logical(rbinom(length(ns), 1, prob=.2))]
blogSmpl <- bl[as.logical(rbinom(length(bl), 1, prob=.1))]
setwd("/Users/subasishdas1/Copy/Rpubs/Shiny/example")
# Collapse each sample into one document (to avoid numerous metadata entries)
library(qdap)
twitterOne <- paste2(twitterSmpl, " ")
newsOne <- paste2(newsSmpl, " ")
blogOne <- paste2(blogSmpl, " ")
library(tm)
options(mc.cores = 1)  # to make the tm package run more robustly
rawCorpus <- VCorpus(VectorSource(list(twitterOne, newsOne, blogOne)),
readerControl = list(language="english"))
# Clean the corpus: lowercase, stem, remove numbers, collapse whitespace
library(SnowballC)
clean <- function(y) {
  rawCorpus <- tm_map(y, content_transformer(tolower))
  rawCorpus <- tm_map(rawCorpus, stemDocument)
  rawCorpus <- tm_map(rawCorpus, removeNumbers)
  tm_map(rawCorpus, stripWhitespace)
}
cleanCorpus <- clean(rawCorpus)
rm(rawCorpus)
tdm <- TermDocumentMatrix(cleanCorpus)
inspect(tdm[1:10,1:3])
## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 10/20
## Sparsity : 67%
## Maximal term length: 11
## Weighting : term frequency (tf)
##
## Docs
## Terms 1 2 3
## -_- 1 0 0
## --- 1 0 0
## ---. 0 1 0
## ---> 1 0 0
## --, 0 2 0
## --. 0 2 0
## --consider, 0 2 0
## --lhp 0 1 0
## --streamlin 1 0 0
## -.- 1 0 0
findFreqTerms(tdm, lowfreq=2000)
## [1] "the"
sapply(list(twitter, blogs, news), length)
## [1] 11800 4496 5051
sapply(list(twitterSmpl, blogSmpl, newsSmpl), length)
## [1] 1232 446 1023
# Unigram frequencies: row sums give each term's total count across the three documents
uni <- rowSums(as.matrix(tdm))
barplot(tail(sort(uni), 10), las = 2, main = "Top 10 Unigrams", cex.main = 1, cex.axis = 0.75, horiz = TRUE)
library(RWeka)
# Convert the cleaned corpus back to a plain character vector for tokenization
datText <- unlist(lapply(seq_along(dat), function(i) content(dat[[i]])))
Bigram <- NGramTokenizer(datText, Weka_control(min = 2, max = 2))
freq.Bigram <- data.frame(table(Bigram))
sort.Bigram <- freq.Bigram[order(freq.Bigram$Freq, decreasing = TRUE), ]
Bigram_top10 <- head(sort.Bigram, 10)
barplot(Bigram_top10$Freq, names.arg = Bigram_top10$Bigram, border = NA,
        las = 2, main = "Top 10 Most Frequent Bigrams", cex.main = 1)
The following observations stand out from this exploration: 1. Foreign-language text exists in the dataset and will need to be handled effectively (one possible approach is sketched below). 2. The Twitter data is, as expected, made up of short phrases; I expect it to yield different results than the blog and news data, which are much more context rich.
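One possible way to handle the foreign-language text, shown here only as a sketch and not as a final decision, is to drop characters that cannot be converted to ASCII before tokenization (the *Ascii object names are illustrative):
# Sketch (assumption, not the final approach): strip non-ASCII characters so
# foreign-language fragments do not pollute the n-gram counts
blogsAscii <- iconv(blogs, from = "UTF-8", to = "ASCII", sub = "")
newsAscii <- iconv(news, from = "UTF-8", to = "ASCII", sub = "")
twitterAscii <- iconv(twitter, from = "UTF-8", to = "ASCII", sub = "")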
To develop the Shiny app, more rigorous work on the dataset is required. Here are a few things I will consider: 1) identify a more efficient method for tokenization, 2) find a way to deal with foreign-language text in the data, and 3) build several very small training/test sets so that models can be fitted and tested on new data efficiently (a minimal split is sketched below).
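As a starting point for item 3, a small held-out test set could be carved out of each sample. A minimal sketch, assuming a 90/10 split of the Twitter sample (the split proportion and object names are illustrative only):
# Sketch (illustrative): split the Twitter sample into small training and test sets
set.seed(123)
trainIdx <- sample(length(twitterSmpl), round(0.9 * length(twitterSmpl)))
twitterTrain <- twitterSmpl[trainIdx]
twitterTest <- twitterSmpl[-trainIdx]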