This document is submitted in partial completion of the Capstone Project in the Data Science Specialization from Johns Hopkins University, offered through Coursera.
The aim of this document is to explore the training data, perform basic cleaning, tokenize the text into n-grams, and report word and n-gram frequencies.
The data comprises three text files containing blog posts, news articles, and Twitter feeds respectively. The data comes from a corpus called HC Corpora (www.corpora.heliohost.org); its readme file is available here. The corpus is available in four languages, of which we will use English.
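For reference, here is a minimal sketch of downloading and unzipping the corpus from the course-provided link (the URL and the extracted folder layout are assumptions here; adjust them to your setup):
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" #assumed course download link
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb") #download the zip archive
  unzip("Coursera-SwiftKey.zip") #extracts the language folders, e.g. final/en_US
}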
Let's set the working directory where the files exist and check the list of files.
setwd("D:\\Data Science Track\\Notes\\capstone\\en_US")
list.files()
## [1] "en_US.blogs.txt" "en_US.news.txt"
## [3] "en_US.twitter.txt" "exploratory_data_analysis.html"
## [5] "exploratory_data_analysis.Rmd" "profanity.txt"
The files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt are the input files containing blog text, news feeds, and Twitter feeds in English, respectively. The file profanity.txt is used for profanity filtering; it contains a list of profane words and was downloaded here.
Let's load the required libraries.
suppressMessages(library(R.utils))
library(tm)
suppressMessages(library(qdap))
library(RWeka)
library(stringi)
library(stringr)
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
Let's create the file connections to read the data into R.
news <- file("en_US.news.txt","rb")
blogs <- file("en_US.blogs.txt", "r")
twitter <- file("en_US.twitter.txt", "r")
profanity <- file("profanity.txt", "r")
Let's look at the number of lines in each of the input files, using the countLines function from the R.utils package.
countLines("en_US.news.txt", con=news)
## [1] 1010242
## attr(,"lastLineHasNewline")
## [1] TRUE
countLines("en_US.blogs.txt", con=blogs)
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE
countLines("en_US.twitter.txt", con=twitter)
## [1] 2360148
## attr(,"lastLineHasNewline")
## [1] TRUE
Summary statistics for the three files are given below:
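The table below uses objects that are not created in the code shown above: blogs1, news1, and twitter1 hold the full files read into R, and words_blogs, words_news, and words_twitter hold the per-line word counts. A minimal sketch of how they could be built with the stringi package loaded earlier (an assumption, since the original code for this step is not shown):
blogs1 <- readLines("en_US.blogs.txt", skipNul = TRUE) #full blogs file
news1 <- readLines("en_US.news.txt", skipNul = TRUE) #full news file
twitter1 <- readLines("en_US.twitter.txt", skipNul = TRUE) #full twitter file
words_blogs <- stri_count_words(blogs1) #words per line, blogs
words_news <- stri_count_words(news1) #words per line, news
words_twitter <- stri_count_words(twitter1) #words per line, twitter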
summary_table <- data.frame(filename = c("blogs","news","twitter"),
num_lines = c(length(blogs1),length(news1),length(twitter1)),
num_words = c(sum(words_blogs),sum(words_news),sum(words_twitter)),
mean_num_words = c(mean(words_blogs),mean(words_news),mean(words_twitter)))
summary_table
## filename num_lines num_words mean_num_words
## 1 blogs 899288 37541795 41.74613
## 2 news 1010242 34762303 34.40988
## 3 twitter 2360148 30092866 12.75041
Since the files contain a lot of data (more than 4 million lines altogether), we'll read only small samples from each file for the analysis.
news.sm <- readLines(news,10000)
blogs.sm <- readLines(blogs,10000)
twit.sm <- readLines(twitter,10000)
close(news)
close(blogs)
close(twitter)
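Note that readLines above keeps only the first 10,000 lines of each file. If a more representative sample is wanted, one alternative sketch (not what the code above does) is to draw random lines with a fixed seed, using the full vectors blogs1, news1, and twitter1 from the summary step:
set.seed(1234) #for reproducibility
news.sm <- sample(news1, 10000) #random sample of news lines
blogs.sm <- sample(blogs1, 10000) #random sample of blog lines
twit.sm <- sample(twitter1, 10000) #random sample of tweets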
Let's combine all the data to create a single input data sample.
data <- paste(news.sm, blogs.sm, twit.sm)
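Note that paste() glues the three samples together element-wise, so each resulting line contains one news line, one blog line, and one tweet. If plain concatenation into separate lines is intended instead, a small variant would be:
data <- c(news.sm, blogs.sm, twit.sm) #stack the samples as separate lines instead of pasting them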
Let's do some basic cleaning of the data by:
1. Breaking the text down into sentences
2. Removing punctuation, numbers, and extra white space
3. Converting all the text to lower case
The qdap package is used for breaking the text into sentences, and the tm package is used for the other cleaning steps.
data <- sent_detect(data, language="en", model=NULL) #breaking down into sentences
corpus <- VCorpus(VectorSource(data)) #Building main corpus
corpus <- tm_map(corpus, removeNumbers) #Removing Numbers
corpus <- tm_map(corpus, stripWhitespace) #stripping white spaces
corpus <- tm_map(corpus, removePunctuation) #Removing punctuation
corpus <- tm_map(corpus, content_transformer(tolower)) #Converting the text to lower case
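As a quick sanity check of the cleaning, the first cleaned documents can be inspected, for example:
as.character(corpus[[1]]) #first cleaned sentence: should be lower case, with no numbers or punctuation
as.character(corpus[[2]]) #second cleaned sentence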
Let's filter the profanity words out of our main corpus.
profane <- readLines(profanity) #reading the profanity word list
close(profanity) #closing the connection
corpus <- tm_map(corpus, removeWords, profane) #removing profanity words from the main corpus
Let's tokenize our corpus into uni-grams, bi-grams, and tri-grams.
df <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE) #Converting the corpus into a data frame
unigram <- NGramTokenizer(df$text, Weka_control(min = 1, max = 1)) #Uni-gram tokenization
bigram <- NGramTokenizer(df$text, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!")) #Bi-gram tokenization
trigram <- NGramTokenizer(df$text, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!")) #Tri-gram tokenization
Let's find the frequencies of the uni-grams, bi-grams, and tri-grams.
uni.df <- data.frame(table(unigram))
bi.df <- data.frame(table(bigram))
tri.df <- data.frame(table(trigram))
Top 20 uni-grams and their frequencies
uni.df <- uni.df[order(-uni.df$Freq),]
uni_top20 <- uni.df[1:20,]
ggplot(uni_top20, aes(x=unigram, y=Freq)) +
geom_bar(stat="Identity", fill="green")+
xlab("Unigrams") + ylab("Frequency")+
ggtitle("Top 20 Unigrams") +
geom_text(aes(label=Freq, vjust=-0.1))+
theme(axis.text.x=element_text(angle=90, hjust=1))
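Note that with a plain factor on the x axis the bars are drawn in alphabetical order; if ordering by frequency is preferred, a small variant of the same plot (and likewise for the bi-gram and tri-gram plots below) would be:
ggplot(uni_top20, aes(x=reorder(unigram, -Freq), y=Freq)) +
geom_bar(stat="identity", fill="green")+
xlab("Unigrams") + ylab("Frequency")+
ggtitle("Top 20 Unigrams (ordered by frequency)")+
theme(axis.text.x=element_text(angle=90, hjust=1))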
Top 20 bi-grams and their frequencies
bi.df <- bi.df[order(-bi.df$Freq),]
bi_top20 <- bi.df[1:20,]
ggplot(bi_top20, aes(x=bigram, y=Freq)) +
geom_bar(stat="Identity", fill="red")+
xlab("Bi-grams") + ylab("Frequency")+
ggtitle("Top 20 Bi-grams") +
geom_text(aes(label=Freq, vjust=-0.1))+
theme(axis.text.x=element_text(angle=90, hjust=1))
Top 20 tri-grams and their frequencies
tri.df <- tri.df[order(-tri.df$Freq),]
tri_top20 <- tri.df[1:20,]
ggplot(tri_top20, aes(x=trigram, y=Freq)) +
geom_bar(stat="Identity", fill="brown")+
xlab("Tri-grams") + ylab("Frequency")+
ggtitle("Top 20 Tri-grams") +
geom_text(aes(label=Freq, vjust=-0.1))+
theme(axis.text.x=element_text(angle=90, hjust=1))
I plan to use Hidden Markov Models in combination with n-grams for the prediction model. Please give your feedback on this.
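As a rough illustration of how the n-gram tables could feed a next-word predictor, here is a simple frequency lookup on the tri-gram table (this is a plain n-gram lookup, not the Hidden Markov Model itself, and predict_next is a hypothetical helper introduced only for illustration):
predict_next <- function(w1, w2, tri.df) { #hypothetical helper, not part of the original analysis
  prefix <- paste(w1, w2)
  matches <- tri.df[startsWith(as.character(tri.df$trigram), paste0(prefix, " ")), ] #tri-grams starting with the two words
  if (nrow(matches) == 0) return(NA_character_) #no matching tri-gram in the sample
  best <- as.character(matches$trigram[which.max(matches$Freq)]) #most frequent matching tri-gram
  tail(strsplit(best, " ")[[1]], 1) #return its last word as the prediction
}
predict_next("one", "of", tri.df) #example call; the returned word depends on the sample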
This document aimed to complete simple explorations and examine the basic relationships between words. It is evident that the most frequent uni-grams are English-language stop words.
In general, stop words can be removed in most text mining problems. Here, however, we cannot remove them: when we look at the bi-grams and tri-grams, most of them are complete and make sense only with the stop words, so stop words are kept in this case.
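For example, the overlap with the standard English stop word list can be checked directly with the stopwords() function from the tm package loaded above:
sum(as.character(uni_top20$unigram) %in% stopwords("en")) #how many of the top 20 uni-grams are stop words
setdiff(as.character(uni_top20$unigram), stopwords("en")) #the top uni-grams that are not stop words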