The Capstone Milestone Report is an initial analysis of the dataset, intended to give an idea of how the data can be used to build a predictive text app. The main goal of this report is to identify key features of the data that will help with the creation of the predictive algorithm.

There is data from four different languages (German, Russian, English, and Finnish). Each language contains text samples from three different sources: blogs, news, and Twitter. This exercise will use the English data set.

Load libraries

# Path to the English data files
filename <- '~/R_programming/CapStone/final/en_US/'
library(RWeka)
library(ggplot2)
library(stringi)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

Read data from local machine
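The raw files are read into character vectors, one element per line. Below is a minimal sketch of this step, which the rest of the report assumes; the encoding and skipNul settings are assumptions made here to suppress embedded-NUL warnings in the Twitter file.

# Read the three English files into character vectors (one element per line)
blogs   <- readLines(paste0(filename, "en_US.blogs.txt"),   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(paste0(filename, "en_US.news.txt"),    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(paste0(filename, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)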

Basic findings

# Get file sizes in MB
blogs.size <- file.info(paste0(filename, "en_US.blogs.txt"))$size / 1024 ^ 2
news.size <- file.info(paste0(filename, "en_US.news.txt"))$size / 1024 ^ 2
twitter.size <- file.info(paste0(filename, "en_US.twitter.txt"))$size / 1024 ^ 2

# Count the words on each line of each file
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)

data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     159.3641    899288  37546246       41.75108
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093410       12.75065

Sample Data

The datasets are considerably large and would take a long time to process and analyze in full. For this initial analysis, a random sample of 1% of the lines from the blogs, news, and Twitter data is used.

set.seed(007)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))

Read Sample Data & Clean Data

Now that we have a sample, it is important to clean the data. The cleansing steps replace URLs and Twitter handles with spaces, convert the text to lowercase, and remove punctuation, numbers, extra whitespace, and common English stopwords (and, the, or, etc.). Profanity is also removed using a publicly available list of swear words (bannedwordlist.com).

## Read Sample
#setwd("C:/Users/Home/datasciencecoursera/Capstone/sample")
#sampleData <- readLines("sampleData.txt", encoding="UTF-8")
corpus <- VCorpus(VectorSource(data.sample))

# Replace URLs and Twitter handles with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
# Lowercase (wrapped in content_transformer so the corpus stays a corpus of PlainTextDocuments),
# then strip stopwords, punctuation, numbers, and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

## Remove profanity
file.bad <- "http://www.bannedwordlist.com/lists/swearWords.txt"

if (!file.exists("googlebadwords.txt")) {
    download.file(file.bad, "googlebadwords.txt", mode="wb")
}

# Read the banned-word list and strip asterisks/parentheses before removing the words
badword <- read.table(file = "googlebadwords.txt", header = FALSE, as.is = TRUE)
badword_new <- gsub("[*()]", "", badword[, 1])
corpus <- tm_map(corpus, removeWords, badword_new)
## Build n-gram tokenizers and term-document matrices
options(mc.cores = 1)  # avoid parallel issues between RWeka and tm


# Return a data frame of terms sorted by frequency from a term-document matrix
getFreq <- function(tdm) {
      freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
      return(data.frame(word = names(freq), freq = freq))
}
# RWeka tokenizers for bigrams and trigrams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Bar plot of the 30 most frequent terms
makePlot <- function(data, label) {
      ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
            labs(x = label, y = "Frequency") +
            theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
            geom_bar(stat = "identity", fill = I("grey50"))
}



# Most frequent unigrams, bigrams, and trigrams (very sparse terms removed)
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
makePlot(freq1, "Top 30 Common Unigrams")

makePlot(freq2, "Top 30 Common Bigrams")

makePlot(freq1, "Top 30 Common Trigrams")

Future Work

Future work will involve expanding the n-gram analysis and developing a predictive model based on the most common n-grams. A Shiny app that lets the user interact with the model will be developed last.
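As an illustration of the planned approach, below is a minimal sketch of a back-off style next-word lookup built on the freq2 and freq3 tables from above. predictWord is a hypothetical helper, not the final model; the actual app will likely use larger samples, better pattern handling, and smoothing.

# Sketch of next-word prediction from the n-gram tables above (hypothetical helper)
predictWord <- function(phrase, n = 3) {
      words <- unlist(strsplit(tolower(phrase), "\\s+"))
      last2 <- paste(tail(words, 2), collapse = " ")
      last1 <- tail(words, 1)
      # Look for trigrams starting with the last two words, then back off to bigrams
      hits <- freq3[grepl(paste0("^", last2, " "), freq3$word), ]
      if (nrow(hits) == 0) {
            hits <- freq2[grepl(paste0("^", last1, " "), freq2$word), ]
      }
      # Return the final word of the top n most frequent matching n-grams
      sapply(strsplit(head(as.character(hits$word), n), " "), tail, 1)
}
predictWord("thanks for the")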