Data Science Capstone: Exploratory Data Analysis

Scope of the Document

This document is submitted in partial completion of the Capstone Project in the Data Science Specialization from Johns Hopkins University, offered through Coursera.

The aim of this document is to:

  1. Demonstrate that the data has been downloaded and successfully loaded into R.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that I have gathered so far.
  4. Get feedback on my plans for creating a prediction algorithm and Shiny app.

Data

The data comprises three text files containing blog posts, news articles and Twitter feeds, respectively. It comes from a corpus called HC Corpora (www.corpora.heliohost.org); the readme file is available here. The corpus is available in four languages, of which we will use English.
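
For reference, a minimal sketch of obtaining the data is shown below. The download URL is the commonly used Coursera-SwiftKey mirror and is an assumption on my part, not something taken from this report.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" #assumed mirror URL
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile="Coursera-SwiftKey.zip", mode="wb") #download the zipped corpus
    unzip("Coursera-SwiftKey.zip") #extracts a final/ folder containing the en_US files
}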

Tasks Accomplished

  1. Loading Data and basic summary Statistics
  2. Cleaning Data
  3. Profanity Filtering
  4. Tokenization
  5. Exploratory Analysis
  6. Understanding the Frequencies of words and word pairs
  7. Idea for the Prediction Algorithm

Loading Data and Basic Summary Statistics

Let's set the working directory to where the files exist and check the list of files.

setwd("D:\\Data Science Track\\Notes\\capstone\\en_US")
list.files()
## [1] "en_US.blogs.txt"                "en_US.news.txt"                
## [3] "en_US.twitter.txt"              "exploratory_data_analysis.html"
## [5] "exploratory_data_analysis.Rmd"  "profanity.txt"

The files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt are the input files containing blog text, news feeds and Twitter feeds in English, respectively. The file profanity.txt is used for profanity filtering; it contains a list of profane words and was downloaded here.

Let's load the required libraries.

suppressMessages(library(R.utils))
library(tm)
suppressMessages(library(qdap))
library(RWeka)
library(stringi)
library(stringr)
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))

Let's create the file connections to read the data into R. The news file is opened in binary mode ("rb") so that an embedded control character in the file does not cut the read short.

news <- file("en_US.news.txt","rb")
blogs <- file("en_US.blogs.txt")
twitter <- file("en_US.twitter.txt")
profanity <- file("profanity.txt")

Let's look at the number of lines in each of the input files, using the countLines function from the R.utils package.

countLines("en_US.news.txt", con=news) 
## [1] 1010242
## attr(,"lastLineHasNewline")
## [1] TRUE
countLines("en_US.blogs.txt", con=blogs)
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE
countLines("en_US.twitter.txt", con=twitter)
## [1] 2360148
## attr(,"lastLineHasNewline")
## [1] TRUE

Summary statistics for the three files are given below:
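
The table below relies on the full file contents (blogs1, news1, twitter1) and their per-line word counts (words_blogs, words_news, words_twitter), which were computed separately and are not shown above. A minimal sketch of that step, assuming the word counts come from stringi (already loaded):

blogs1 <- readLines("en_US.blogs.txt", skipNul=TRUE, warn=FALSE) #full blogs file
news1 <- readLines("en_US.news.txt", skipNul=TRUE, warn=FALSE) #full news file
twitter1 <- readLines("en_US.twitter.txt", skipNul=TRUE, warn=FALSE) #full twitter file

words_blogs <- stri_count_words(blogs1) #words in each blog line
words_news <- stri_count_words(news1) #words in each news line
words_twitter <- stri_count_words(twitter1) #words in each tweet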

summary_table <- data.frame(filename = c("blogs","news","twitter"),
                            num_lines = c(length(blogs1),length(news1),length(twitter1)),
                            num_words = c(sum(words_blogs),sum(words_news),sum(words_twitter)),
                            mean_num_words = c(mean(words_blogs),mean(words_news),mean(words_twitter)))
summary_table
##   filename num_lines num_words mean_num_words
## 1    blogs    899288  37541795       41.74613
## 2     news   1010242  34762303       34.40988
## 3  twitter   2360148  30092866       12.75041

Since the files contain a lot of data (more than 4 million lines altogether), we'll read only small samples from each file for the analysis.

news.sm <- readLines(news,10000)
blogs.sm <- readLines(blogs,10000)
twit.sm <- readLines(twitter,10000)
close(news)
close(blogs)
close(twitter)

Let's combine all the samples to create a single input data set.

data <- c(news.sm, blogs.sm, twit.sm) #concatenating the three samples into one character vector

Cleaning Data

Let's do some basic cleaning of the data by doing the following:

  1. Breaking the text down into sentences
  2. Removing punctuation, numbers and extra white space
  3. Converting all the text to lower case

The qdap package is used for breaking the text down into sentences, and the tm package is used for the remaining cleaning steps.

data <- sent_detect(data, language="en", model=NULL) #breaking down into sentences

corpus <- VCorpus(VectorSource(data)) #Building main corpus
corpus <- tm_map(corpus, removeNumbers) #Removing Numbers
corpus <- tm_map(corpus, stripWhitespace) #stripping white spaces
corpus <- tm_map(corpus, removePunctuation) #Removing punctuation
corpus <- tm_map(corpus, content_transformer(tolower)) #Converting the text to lower case; content_transformer keeps the corpus structure intact
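
As a quick, optional sanity check (not shown in the original output), a few cleaned sentences can be inspected:

sapply(corpus[1:3], as.character) #first three cleaned sentences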

Profanity Filtering

Let's filter the profane words out of our main corpus.

profane <- readLines(profanity) #reading the list of profane words
corpus <- tm_map(corpus, removeWords, profane) #removing profane words from the main corpus; removeWords expects a plain character vector

Tokenization

Let's tokenize our corpus into uni-grams, bi-grams and tri-grams.

df <- data.frame(text=sapply(corpus, as.character), stringsAsFactors=FALSE) #Converting the corpus into a data frame

unigram <- NGramTokenizer(df$text, Weka_control(min = 1, max = 1)) #Uni-gram tokenization

bigram <- NGramTokenizer(df$text, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!")) #Bi-gram tokenization

trigram <- NGramTokenizer(df$text, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!")) #Tri-gram tokenization
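
As an illustrative check (output not shown here), we can look at how many tokens of each order were produced and at a few example tri-grams:

c(unigrams=length(unigram), bigrams=length(bigram), trigrams=length(trigram)) #token counts per n-gram order
head(trigram) #a few example tri-grams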

Exploratory Analysis

Let's find the frequencies of the uni-grams, bi-grams and tri-grams.

uni.df <- data.frame(table(unigram))
bi.df <- data.frame(table(bigram))
tri.df <- data.frame(table(trigram))

Top 20 uni-grams and their frequencies

uni.df <- uni.df[order(-uni.df$Freq),]
uni_top20 <- uni.df[1:20,]
ggplot(uni_top20, aes(x=reorder(unigram, -Freq), y=Freq)) +
    geom_bar(stat="identity", fill="green")+
    xlab("Unigrams") + ylab("Frequency")+
    ggtitle("Top 20 Unigrams") +
    geom_text(aes(label=Freq), vjust=-0.1)+
    theme(axis.text.x=element_text(angle=90, hjust=1))

Top 20 bi-grams and their frequencies

bi.df <- bi.df[order(-bi.df$Freq),]
bi_top20 <- bi.df[1:20,]
ggplot(bi_top20, aes(x=reorder(bigram, -Freq), y=Freq)) +
    geom_bar(stat="identity", fill="red")+
    xlab("Bi-grams") + ylab("Frequency")+
    ggtitle("Top 20 Bi-grams") +
    geom_text(aes(label=Freq), vjust=-0.1)+
    theme(axis.text.x=element_text(angle=90, hjust=1))

Top 20 tri-grams and their frequencies

tri.df <- tri.df[order(-tri.df$Freq),]
tri_top20 <- tri.df[1:20,]
ggplot(tri_top20, aes(x=reorder(trigram, -Freq), y=Freq)) +
    geom_bar(stat="identity", fill="brown")+
    xlab("Tri-grams") + ylab("Frequency")+
    ggtitle("Top 20 Tri-grams") +
    geom_text(aes(label=Freq), vjust=-0.1)+
    theme(axis.text.x=element_text(angle=90, hjust=1))

Prediction Idea

I plan to use hidden Markov models in combination with n-grams. Please give your feedback on this plan.
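
To make the n-gram part of this idea concrete, here is a minimal, hypothetical sketch of how the frequency tables built above could drive next-word suggestions with a simple back-off from tri-grams to bi-grams; predict_next is my own illustrative helper, and the hidden Markov model component is not sketched here.

predict_next <- function(w1, w2, n=3) {
    #look for tri-grams that start with the last two words typed
    hits <- tri.df[grepl(paste0("^", w1, " ", w2, " "), tri.df$trigram), ]
    if (nrow(hits) == 0) {
        #back off to bi-grams that start with the last word
        hits <- bi.df[grepl(paste0("^", w2, " "), bi.df$bigram), ]
    }
    hits <- hits[order(-hits$Freq), ]
    head(hits, n) #top candidate continuations with their frequencies
}
predict_next("one", "of")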

Conclusion

This document aims to present some simple explorations of the data and the basic relationships between words. It is evident that the most frequent uni-grams are English-language stopwords.
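
One quick, illustrative way to confirm this, using tm's built-in English stopword list:

intersect(as.character(uni_top20$unigram), stopwords("en")) #top-20 uni-grams that are standard English stopwords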

In general, stopwords are removed in most text mining problems. Here, however, we cannot remove them: when we look at the bi-grams and tri-grams, most of them are complete and make sense only with the stopwords included. So we will not remove stopwords in this case.