This is an exploratory analysis of the given dataset, carried out to understand the distribution of words and the relationships between words in the corpora. The goal of this analysis is to obtain a brief summary and the important features of the dataset that will be useful in the model-building process.
We have downloaded the dataset and established connections to its files.
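The code in this report relies on several packages, presumably loaded in a setup chunk that is not shown here. The following is a minimal setup sketch listing the libraries used below.
# Packages used in the rest of the report (setup sketch)
library(tm)            # VCorpus, tm_map, TermDocumentMatrix, findFreqTerms
library(RWeka)         # NGramTokenizer, Weka_control
library(ggplot2)       # frequency bar charts
library(wordcloud)     # word clouds
library(RColorBrewer)  # brewer.pal colour palettes
library(dplyr)         # mutate() and the %>% pipe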
# Establishing connections to the English versions of the Twitter, news and blogs datasets
con_twitter <- file("D:/Games/R Workspace/Capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")
con_news <- file("D:/Games/R Workspace/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
con_blogs <- file("D:/Games/R Workspace/Capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
# Reading lines from each dataset and closing the connections afterwards
lines_twitter <- readLines(con_twitter, warn=FALSE, encoding="UTF-8")
lines_news <- readLines(con_news, warn=FALSE, encoding="UTF-8")
lines_blogs <- readLines(con_blogs, warn=FALSE, encoding="UTF-8")
close(con_twitter); close(con_news); close(con_blogs)
A summary of the line lengths (in characters) of each dataset can be observed below:
summary(nchar(lines_twitter))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
summary(nchar(lines_blogs))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40833
summary(nchar(lines_news))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 111.0 186.0 202.4 270.0 5760.0
For this purpose we took a 5% sample from each file, because a larger sample cannot be processed comfortably: it would require more memory than is available on a typical personal computer. The samples from the three files are then combined, and we proceed with the data-cleaning process.
set.seed(1725)
twitter <- sample(lines_twitter, length(lines_twitter)*0.05)
blogs <- sample(lines_blogs, length(lines_blogs)*0.05)
news <- sample(lines_news, length(lines_news)*0.05)
corpus <- c(twitter, blogs, news)
corpus <- iconv(corpus, "UTF-8","ASCII", sub = "")
length(corpus)
## [1] 166833
Here we cleaned the data by taking the following measures:
1. Removed any extra white space
2. Converted all words to lower case to reduce ambiguity
3. Removed punctuation, as it carries no meaning for this analysis
4. Removed numbers
5. Converted the documents into plain text documents
6. Removed English stop words, as they carry little meaning and occur in bulk in the dataset
corpus <- VCorpus(VectorSource(corpus))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
Now we will create unigrams, bigrams and trigrams of the dataset, which represent the frequency of each set of words. To reduce the computation we keep only the most frequent terms for plotting: unigrams occurring at least 60 times, bigrams at least 40 times and trigrams at least 10 times.
memory.limit(size = 56000)
## [1] 56000
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
unigram_freqTerm <- findFreqTerms(unigram_tdm,lowfreq = 60)
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
bigram_freqTerm <- findFreqTerms(bigram_tdm,lowfreq=40)
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
trigram_freqTerm <- findFreqTerms(trigram_tdm,lowfreq=10)
Here we show the distributions of the unigrams, bigrams and trigrams to understand the frequency of each term. We have also created word clouds to clearly depict the terms that are used most often in the dataset.
unigram_freq <- rowSums(as.matrix(unigram_tdm[unigram_freqTerm,]))
unigram_ord <- order(unigram_freq, decreasing = TRUE)
unigram_freq <- data.frame(word=names(unigram_freq[unigram_ord]), frequency=unigram_freq[unigram_ord])
ggplot(unigram_freq[1:25,], aes(factor(word, levels = unique(word)), frequency)) +
geom_bar(stat = 'identity')+
theme(axis.text.x=element_text(angle=90))+
xlab('Unigram')+
ylab('Frequency')
wordcloud(unigram_freq$word, unigram_freq$frequency, max.words=40, colors=brewer.pal(8, "Set1"))
From the above distribution and word cloud we can see that “just”, “like”, “will”, “one” and “can” are the five most frequent words in the sample, each occurring roughly 9,500 to 13,000 times.
bigram_freq <- rowSums(as.matrix(bigram_tdm[bigram_freqTerm,]))
bigram_ord <- order(bigram_freq, decreasing = TRUE)
bigram_freq <- data.frame(word=names(bigram_freq[bigram_ord]), frequency=bigram_freq[bigram_ord])
ggplot(bigram_freq[1:20,], aes(factor(word, levels = unique(word)), frequency)) +
geom_bar(stat = 'identity')+
theme(axis.text.x=element_text(angle=90))+
xlab('Bigram')+
ylab('Frequency')
wordcloud(bigram_freq$word, bigram_freq$frequency, max.words=30, colors=brewer.pal(8, "Set1"))
## Warning in wordcloud(bigram_freq$word, bigram_freq$frequency, max.words = 30, :
## right now could not be fit on page. It will not be plotted.
## Warning in wordcloud(bigram_freq$word, bigram_freq$frequency, max.words = 30, :
## looking forward could not be fit on page. It will not be plotted.
From the above distribution and word cloud we can see that “right now”, “can’t wait”, “don’t know”, “last night” and “feel like” are the five most frequent bigrams in the dataset. This information will help us predict the next word in the future modelling process, as sketched below.
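As a rough illustration of that idea (a sketch, not part of the original analysis), the frequency-sorted bigram table can already serve as a naive next-word lookup: split each bigram into its tokens, keep the rows whose first token matches the current word, and return the most frequent continuations. The helper name predict_next is hypothetical.
# Sketch of a naive next-word lookup based on the bigram table
# (predict_next is a hypothetical helper, not from the original report)
predict_next <- function(current_word, bigrams = bigram_freq, n = 3) {
  terms <- as.character(bigrams$word)
  first <- sub(" .*$", "", terms)        # first token of each bigram
  second <- sub("^[^ ]+ ", "", terms)    # second token, i.e. the candidate next word
  # bigram_freq is already sorted by decreasing frequency, so the first matches are the best
  head(second[first == tolower(current_word)], n)
}
# e.g. predict_next("last") should return "night" first if "last night" is the
# most frequent bigram beginning with "last"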
trigram_freq <- rowSums(as.matrix(trigram_tdm[trigram_freqTerm,]))
trigram_ord <- order(trigram_freq, decreasing = TRUE)
trigram_freq <- data.frame(word=names(trigram_freq[trigram_ord]), frequency=trigram_freq[trigram_ord])
ggplot(trigram_freq[1:15,], aes(factor(word, levels = unique(word)), frequency)) +
geom_bar(stat = 'identity')+
theme(axis.text.x=element_text(angle=90))+
xlab('Trigram')+
ylab('Frequency')
wordcloud(trigram_freq$word, trigram_freq$frequency, max.words=15, colors=brewer.pal(8, "Set1"))
## Warning in wordcloud(trigram_freq$word, trigram_freq$frequency, max.words =
## 15, : happy mothers day could not be fit on page. It will not be plotted.
Above are the 15 most frequent trigrams in the dataset. This tells us which combinations of words are used most often, which can be helpful in predicting the next word.
In this analysis we will find out how many words from the frequency-sorted dictionary are needed to cover 50% of the word occurrences in the dataset. For this we calculate the proportion of each word in the dataset and use it to answer that question.
unigram_freq <- unigram_freq %>% mutate(prop = frequency/sum(frequency))
unigram_freq <- as.data.frame(unigram_freq)
head(unigram_freq)
## word frequency prop
## 1 just 12773 0.009090029
## 2 like 11082 0.007886613
## 3 will 10763 0.007659593
## 4 one 10446 0.007433997
## 5 can 9521 0.006775712
## 6 get 9236 0.006572889
n <- nrow(unigram_freq)
count <- 1
# Walk down the frequency-sorted words until the cumulative proportion reaches 50%
while(count <= n){
  if(round(sum(unigram_freq$prop[1:count]), 2) == 0.50){
    break
  }
  count <- count + 1
}
## Number of words needed to reach 50% coverage
count
## [1] 360
## Percentage of the frequent-term dictionary that these words represent
total.percentage <- (count/n)*100
total.percentage
## [1] 8.421053
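An equivalent and more direct way to locate the same coverage point is with cumsum(); this is a sketch rather than the method used above, and because it checks the exact cumulative proportion instead of the rounded value, the count it returns may differ from 360 by a few words.
# Sketch: first position at which the cumulative proportion reaches 50%
coverage_count <- which(cumsum(unigram_freq$prop) >= 0.50)[1]
coverage_count / nrow(unigram_freq) * 100   # share of the dictionary needed for 50% coverage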
From the above calculation we found that 360 words (roughly 8% of the frequent-term dictionary) cover 50% of the word occurrences in the dataset. We can therefore keep these words in a sorted dictionary, which the prediction model can refer to.
From the above analysis we found the distribution of each word in the dataset with the stop words removed. During cleaning we reduced the dataset size and removed tokens that carry no meaning, such as stop words and punctuation. We also examined the distributions of word combinations (bigrams and trigrams), and observed that as the number of words in a combination increases, the total frequency decreases. We did not remove any profanity from the dataset because we did not encounter any in this sample, which suggests that the amount of profanity in the dataset is comparatively small. The proportion analysis gave further insight: 50% of the word occurrences in the dataset are covered by only about 8% of the words. Identifying these words is very useful, as they are the ones used most often. This analysis will be helpful in the model-building process.