In this assignment we explore the datasets and look at the major features of the data using plots and word clouds. Towards the end we also look at ways to tackle the unexplored aspects on the path towards our goal of creating a Shiny app.
We have three files containing English text taken from blogs, news articles and Twitter. We shall read the data and check the basic details of these files.
setwd("C:/Users/user/Desktop/Bhanu/Data Science/CoursEra/Capstone Project/final/en_US")
library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
library(NLP)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RColorBrewer)
library(RWeka)
library(SnowballC)
Tweet <- readLines("en_US.twitter.txt",encoding = "UTF-8", skipNul = TRUE)
News <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE,warn = FALSE)
Blogs <- readLines("en_US.blogs.txt",encoding = "UTF-8", skipNul = TRUE)
Checking the number of lines and characters in each file
# Checking for Tweet data
length(Tweet)
## [1] 2360148
sum(nchar(Tweet))
## [1] 162096241
# Checking for News data
length(News)
## [1] 77259
sum(nchar(News))
## [1] 15639409
# Checking for Blogs data
length(Blogs)
## [1] 899288
sum(nchar(Blogs))
## [1] 206824505
We see a huge number of lines and characters in each file. In order to continue with our exploratory analysis we shall use only the first 1,000 lines of each dataset.
Tweet_sample <- readLines("en_US.twitter.txt",1000,encoding = "UTF-8", skipNul = TRUE)
News_sample <- readLines("en_US.news.txt",1000, encoding = "UTF-8", skipNul = TRUE,warn = FALSE)
Blogs_sample <- readLines("en_US.blogs.txt",1000,encoding = "UTF-8", skipNul = TRUE)
comp_sample <- c(Tweet_sample,News_sample,Blogs_sample)
In order to explore further, we shall create a corpus and look at the frequency distribution of words as unigrams, bigrams and trigrams.
So we shall first process the data using the following steps:
1) Convert the text into a corpus
2) Remove non-ASCII characters, as we are sticking to American English text
3) Convert all words to lower case
4) Remove punctuation and numbers
5) Remove English stopwords, which are frequently used words that add no value to our analysis
6) Remove unnecessary whitespace
We then create a document-term matrix, in which each column corresponds to a word and each value gives the frequency of its usage. Each line of the sample has been treated as a document and is represented by a row.
comp_sample <- iconv(comp_sample,"UTF-8", "ASCII", "byte")
sample_corpus <- VCorpus(VectorSource(comp_sample))
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, removeNumbers)
sample_corpus <- tm_map(sample_corpus,removeWords,stopwords("english"))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
sample_dtm <- DocumentTermMatrix(sample_corpus)
inspect(sample_dtm)
## <<DocumentTermMatrix (documents: 3000, terms: 13943)>>
## Non-/sparse entries: 44025/41784975
## Sparsity : 100%
## Maximal term length: 95
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can get just know like new one said time will
## 2022 0 0 1 1 0 2 0 0 1 1
## 2050 0 0 0 0 0 1 1 0 0 1
## 2065 1 1 1 0 1 0 4 0 0 0
## 2180 2 0 0 1 0 0 2 0 2 0
## 2313 1 0 2 0 0 1 1 0 2 1
## 2376 2 0 0 0 0 0 1 0 1 4
## 2483 0 0 0 1 0 0 1 0 1 0
## 2558 0 0 1 0 2 0 0 0 1 0
## 2851 0 0 0 0 1 3 0 0 2 5
## 2951 0 0 0 0 0 0 0 0 0 0
Here we shall look at the trends in the words used in the sample corpus.
The word cloud shows that words like ‘said’, ‘will’ and ‘one’ are among the most frequent ones.
We also list the top 50 words with their respective frequencies, and plot the words that occur more than 100 times.
wordcloud(sample_corpus, max.words=100, random.order=FALSE, colors=brewer.pal(8,"Dark2"))
freq <- colSums(as.matrix(sample_dtm))
length(freq)
## [1] 13943
freq <- sort(freq,decreasing = TRUE)
head(freq,50)
## said will one just like can time new get
## 304 259 254 249 248 191 191 186 171
## know day now good first people much year make
## 144 141 137 131 128 124 122 120 112
## also two dont love last really see right think
## 110 106 104 102 99 99 99 97 97
## well going got back even home made want way
## 95 93 93 91 89 87 85 84 84
## little many say work great need years today another
## 83 83 80 80 78 78 75 74 69
## still life never lot next
## 69 65 65 64 64
df <- data.frame(word=names(freq), freq=freq)
p <- ggplot(subset(df,freq>100), aes(word, freq)) + geom_bar(stat="identity") + theme(axis.text.x=element_text(angle=45, hjust=1))
p
## tokenizer functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Unigrams
Unigrams <- TermDocumentMatrix(sample_corpus,control = list(tokenize = unigram))
freq_uni <- rowSums(as.matrix(Unigrams))
freq_uni <- sort(freq_uni,decreasing = TRUE)
freq_uni_df <- data.frame(word=names(freq_uni), freq=freq_uni)
ggplot(freq_uni_df[1:50,], aes(word, freq)) + labs(x = "Unigrams", y = "Frequency") + theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) + geom_bar(stat = "identity")
# Bigrams
Bigrams <- TermDocumentMatrix(sample_corpus,control = list(tokenize = bigram))
freq_Bi <- rowSums(as.matrix(Bigrams))
freq_Bi <- sort(freq_Bi,decreasing = TRUE)
freq_Bi_df <- data.frame(word=names(freq_Bi), freq=freq_Bi)
ggplot(freq_Bi_df[1:30,], aes(word, freq)) + labs(x = "Bigrams", y = "Frequency") + theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) + geom_bar(stat = "identity")
Trigrams <- TermDocumentMatrix(sample_corpus,control = list(tokenize = trigram))
freq_Tri <- rowSums(as.matrix(Trigrams))
freq_Tri <- sort(freq_Tri,decreasing = TRUE)
freq_Tri_df <- data.frame(word=names(freq_Tri), freq=freq_Tri)
ggplot(freq_Tri_df[1:30,], aes(word, freq)) + labs(x = "Trigrams", y = "Frequency") + theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) + geom_bar(stat = "identity")
Let us look at the number of unigrams, bigrams and trigrams required to cover 50% and 90% of the total word count.
For 50% coverage:
total_uni <- sum(as.matrix(Unigrams))
# Returns how many of the most frequent terms are needed to reach the
# given coverage of the total term count
RightNumber <- function(freqwords, coverage){
  GrandTotal <- sum(freqwords$freq)
  Total <- 0
  for(i in 1:nrow(freqwords)){
    Total <- Total + freqwords$freq[i]
    if((Total/GrandTotal) > coverage){break}
  }
  return(i)
}
Unicount_50 <- RightNumber(freq_uni_df,0.50)
Bicount_50 <- RightNumber(freq_Bi_df,0.50)
Tricount_50 <- RightNumber(freq_Tri_df,0.50)
For 90% coverage:
Unicount_90 <- RightNumber(freq_uni_df,0.90)
Bicount_90 <- RightNumber(freq_Bi_df,0.90)
Tricount_90 <- RightNumber(freq_Tri_df,0.90)
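The resulting counts can be put side by side for comparison (the actual values depend on the 1,000-line samples drawn above):
coverage_summary <- data.frame(ngram = c("Unigram", "Bigram", "Trigram"),
                               terms_for_50 = c(Unicount_50, Bicount_50, Tricount_50),
                               terms_for_90 = c(Unicount_90, Bicount_90, Tricount_90))
coverage_summary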
Till now we have looked at patterns of words based on their frequencies, but we have not dealt much with occurrences of foreign words. We might have to use a dictionary to eliminate the foreign words; one possible approach is sketched below.
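As a rough illustration only, we could compare the unigram vocabulary against a plain-text English word list (the file name english_words.txt below is hypothetical, one lower-case word per line) and flag anything not found in it:
# Hypothetical dictionary file: one lower-case English word per line
dictionary <- readLines("english_words.txt", encoding = "UTF-8")
# Keep only tokens that appear in the dictionary; the rest are candidate foreign words
english_tokens <- names(freq_uni)[names(freq_uni) %in% dictionary]
foreign_tokens <- setdiff(names(freq_uni), english_tokens)
The tokens flagged this way could then be removed from the corpus before rebuilding the n-gram tables.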
We also need to look at a multi n-gram model that makes use of all unigrams, bigrams and trigrams to predict the next word, which means working on a Markov chain model.
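As a first sketch of that idea (my own illustration, not a finished model), the bigram frequencies computed above can be turned into a lookup table in which the first word of each bigram is the current state and the most frequent second word is the predicted continuation:
# Split each bigram "w1 w2" into a prefix (w1) and the following word (w2)
bi_parts <- strsplit(as.character(freq_Bi_df$word), " ")
bigram_table <- data.frame(prefix   = sapply(bi_parts, `[`, 1),
                           nextword = sapply(bi_parts, `[`, 2),
                           freq     = freq_Bi_df$freq,
                           stringsAsFactors = FALSE)
# Predict the next word as the most frequent continuation of the given word
predict_bigram <- function(word){
  cand <- bigram_table[bigram_table$prefix == word, ]
  if(nrow(cand) == 0) return(NA_character_)
  cand$nextword[which.max(cand$freq)]
}
predict_bigram("right")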
For predicting a word after an unknown word we also need to explore back-off models; a rough sketch follows below. I expect that a bit of work on the above will help me produce a Shiny app that can predict the next word.
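A minimal sketch of the back-off idea, again only an illustration: build a trigram table the same way as the bigram table above (prefix = first two words), try it first, back off to the bigram table using the last word, and finally fall back to the most frequent unigram.
# Trigram lookup table: prefix = "w1 w2", nextword = w3
tri_parts <- strsplit(as.character(freq_Tri_df$word), " ")
trigram_table <- data.frame(prefix   = sapply(tri_parts, function(x) paste(x[1], x[2])),
                            nextword = sapply(tri_parts, `[`, 3),
                            freq     = freq_Tri_df$freq,
                            stringsAsFactors = FALSE)
predict_backoff <- function(last_two_words){
  # 1) try the trigram table on the last two words
  cand <- trigram_table[trigram_table$prefix == last_two_words, ]
  if(nrow(cand) > 0) return(cand$nextword[which.max(cand$freq)])
  # 2) back off to the bigram table using only the last word
  last_word <- tail(strsplit(last_two_words, " ")[[1]], 1)
  guess <- predict_bigram(last_word)
  if(!is.na(guess)) return(guess)
  # 3) final fallback: the single most frequent unigram
  names(freq_uni)[1]
}
predict_backoff("new york")
A function along these lines would sit behind the text input of the planned Shiny app.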