The Capstone Milestone Report is an initial exploratory analysis of the dataset, intended to show how the data can be used to build a predictive text application. The main goal of this report is to identify key features of the data that will guide the design of the prediction algorithm.
There is data from four languages (German, Russian, English, and Finnish). Each language contains text samples from three different sources: blogs, news, and Twitter. This exercise will use the English data set.
filename<-'~/R_programming/CapStone/final/en_US/'
library(RWeka)
library(ggplot2)
library(stringi)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
# Read the three English files; skipNul avoids errors from embedded NUL characters
blogs <- readLines(paste0(filename, "en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
news <- readLines(paste0(filename, "en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(paste0(filename, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
# File sizes in MB
blogs.size <- file.info(paste0(filename, "en_US.blogs.txt"))$size / 1024 ^ 2
news.size <- file.info(paste0(filename, "en_US.news.txt"))$size / 1024 ^ 2
twitter.size <- file.info(paste0(filename, "en_US.twitter.txt"))$size / 1024 ^ 2
# Count words per line in each file
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546246       41.75108
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093410       12.75065
The datasets are quite large and would take considerable time to process and analyze in full. For this initial analysis, a random 1% of the lines in each of the blogs, news, and Twitter files is sampled.
set.seed(007)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
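As a quick sanity check, the sample should contain about 1% of the roughly 4.27 million lines counted in the summary table above, i.e. around 42,700 lines (the exact number depends on the files as downloaded):
# The three files total 4,269,678 lines, so a 1% sample holds about 42,700 lines
length(data.sample)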
Now that we have a sample, it is important to clean the data. The cleansing removes URLs and Twitter handles, converts the text to lowercase, and strips punctuation, numbers, extra whitespace, and common English stopwords (and, the, or, etc.). Profanity is also removed, using the swear-word list published at bannedwordlist.com.
corpus <- VCorpus(VectorSource(data.sample))
# Helper that replaces every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # strip URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # strip Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
## Remove profanity using the swear-word list from bannedwordlist.com
file.bad <- "http://www.bannedwordlist.com/lists/swearWords.txt"
if (!file.exists("googlebadwords.txt")) {
  download.file(file.bad, "googlebadwords.txt", mode = "wb")
}
badword <- read.table(file = "googlebadwords.txt", header = FALSE, as.is = TRUE)
badword_new <- gsub("[*()]", "", badword[, 1])  # drop wildcard characters from the list
corpus <- tm_map(corpus, removeWords, badword_new)
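To confirm that the transformations behaved as intended, a document from the cleaned corpus can be inspected directly (the output varies with the random sample, so it is not shown here):
# Print the first cleaned document
writeLines(as.character(corpus[[1]]))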
# Create term-document matrices for 1-, 2-, and 3-grams
options(mc.cores = 1)  # avoid known RWeka/parallel tokenizer issues inside tm
# Return terms sorted by total frequency across the corpus
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# RWeka tokenizers for bigrams and trigrams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Plot the 30 most frequent n-grams as a bar chart
makePlot <- function(data, label) {
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}
# Drop very sparse terms: sparsity above 0.9999 means a term appears in
# fewer than roughly 0.01% of documents
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
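Before plotting, the frequency tables can also be inspected directly, for example:
head(freq2, 5)  # five most frequent bigrams and their counts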
makePlot(freq1, "Top 30 Common Unigrams")
makePlot(freq2, "Top 30 Common Bigrams")
makePlot(freq3, "Top 30 Common Trigrams")
Future work will involve expanding the n-gram analysis and developing the predictive model based on the most common n-grams. The Shiny app that allows the user to interact with the model will be developed last.
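As a rough preview of that model, the sketch below shows how the frequency tables computed above could already drive a simple backoff-style lookup. The predictNext function is a hypothetical helper written for illustration, not the final algorithm: it searches freq3 for trigrams beginning with the last two words of the input, backs off to freq2 when no match is found, and finally falls back to the most frequent unigrams.
# Hypothetical backoff lookup over the n-gram tables (illustration only).
# The tables are sorted by frequency, so the first matches are the best ones.
predictNext <- function(input, n = 3) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)
  # Try trigrams whose first two words match the end of the input
  if (length(words) == 2) {
    hits <- grep(paste0("^", words[1], " ", words[2], " "), freq3$word, value = TRUE)
    if (length(hits) > 0) return(head(sapply(strsplit(hits, " "), `[`, 3), n))
  }
  # Back off to bigrams whose first word matches the last input word
  hits <- grep(paste0("^", tail(words, 1), " "), freq2$word, value = TRUE)
  if (length(hits) > 0) return(head(sapply(strsplit(hits, " "), `[`, 2), n))
  as.character(head(freq1$word, n))  # final fallback: most frequent unigrams
}
predictNext("happy new")  # suggest up to 3 candidate next words
Because stopwords are stripped from the corpus, the final model will likely need to be trained on text that keeps them, but the lookup logic would be the same.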