Introduction

In this assignment we explore the datasets and look at the major features of the data with the help of some plots and word clouds. Towards the end we also look at ways to tackle the unexplored aspects and the path towards our goal of creating a Shiny App.

Reading the Data & exploring basic details

We have three files containing language used on Blogs, News and Twitter. We shall read the data and check the basic details of these files.

setwd("C:/Users/user/Desktop/Bhanu/Data Science/CoursEra/Capstone Project/final/en_US")

library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
library(NLP)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RColorBrewer)
library(RWeka)
library(SnowballC)

Tweet <- readLines("en_US.twitter.txt",encoding = "UTF-8", skipNul = TRUE)
News <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE,warn = FALSE)
Blogs <- readLines("en_US.blogs.txt",encoding = "UTF-8", skipNul = TRUE)

Checking the number of lines & characters in each file

# Checking for Tweet data

length(Tweet)
## [1] 2360148
sum(nchar(Tweet))
## [1] 162096241
# Checking for News data
length(News)
## [1] 77259
sum(nchar(News))
## [1] 15639409
# Checking for Blogs data
length(Blogs)
## [1] 899288
sum(nchar(Blogs))
## [1] 206824505

We see a huge number of lines and characters in each file. In order to continue with our exploratory analysis we shall use only the first 1000 lines of each dataset.

Tweet_sample <- readLines("en_US.twitter.txt",1000,encoding = "UTF-8", skipNul = TRUE)
News_sample <- readLines("en_US.news.txt",1000, encoding = "UTF-8", skipNul = TRUE,warn = FALSE)
Blogs_sample <- readLines("en_US.blogs.txt",1000,encoding = "UTF-8", skipNul = TRUE)

comp_sample <- c(Tweet_sample,News_sample,Blogs_sample)

Creating a clean Corpus

In order to explore further we shall create a corpus and look at the frequency distribution of the words: unigrams, bigrams and trigrams.

So we shall first process the data using the following steps:

  1. Convert the text into a corpus
  2. Remove non-ASCII characters, as we are sticking to American English
  3. Convert all words to lower case
  4. Remove punctuation and numbers
  5. Remove English stopwords, which are frequently used English words that add no value to our analysis
  6. Remove unnecessary whitespace

We then create a document-term matrix, in which the columns are the words used and the values represent the frequency of their usage. Each line of the sample has been treated as a document and is represented by a row.

comp_sample <- iconv(comp_sample,"UTF-8", "ASCII", "byte")
sample_corpus <- VCorpus(VectorSource(comp_sample))
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, removeNumbers)
sample_corpus <- tm_map(sample_corpus,removeWords,stopwords("english"))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)

sample_dtm <- DocumentTermMatrix(sample_corpus)

inspect(sample_dtm)
## <<DocumentTermMatrix (documents: 3000, terms: 13943)>>
## Non-/sparse entries: 44025/41784975
## Sparsity           : 100%
## Maximal term length: 95
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   can get just know like new one said time will
##   2022   0   0    1    1    0   2   0    0    1    1
##   2050   0   0    0    0    0   1   1    0    0    1
##   2065   1   1    1    0    1   0   4    0    0    0
##   2180   2   0    0    1    0   0   2    0    2    0
##   2313   1   0    2    0    0   1   1    0    2    1
##   2376   2   0    0    0    0   0   1    0    1    4
##   2483   0   0    0    1    0   0   1    0    1    0
##   2558   0   0    1    0    2   0   0    0    1    0
##   2851   0   0    0    0    1   3   0    0    2    5
##   2951   0   0    0    0    0   0   0    0    0    0

Exploratory Data Analysis

Here we shall look at the trends in the words used in the sample corpus.

  1. Look at the top 50 single words (uni-grams) occurring
  2. Create a word cloud
  3. Check the top 50 bi-grams occurring
  4. Check the top 50 tri-grams occurring

Top 50 Words in the Corpus

If we look at the word cloud, we see that words like ‘said’, ‘will’ and ‘one’ are the most frequent ones.

We also look at the top 50 words and their respective frequencies.

We then plot only the words with a frequency of more than 100.

wordcloud(sample_corpus, max.words=100, random.order=FALSE, colors=brewer.pal(8,"Dark2"))

freq <- colSums(as.matrix(sample_dtm))
length(freq)
## [1] 13943
freq <- sort(freq,decreasing = TRUE)

head(freq,50)
##    said    will     one    just    like     can    time     new     get 
##     304     259     254     249     248     191     191     186     171 
##    know     day     now    good   first  people    much    year    make 
##     144     141     137     131     128     124     122     120     112 
##    also     two    dont    love    last  really     see   right   think 
##     110     106     104     102      99      99      99      97      97 
##    well   going     got    back    even    home    made    want     way 
##      95      93      93      91      89      87      85      84      84 
##  little    many     say    work   great    need   years   today another 
##      83      83      80      80      78      78      75      74      69 
##   still    life   never     lot    next 
##      69      65      65      64      64
df <- data.frame(word=names(freq), freq=freq)  

p <- ggplot(subset(df,freq>100), aes(word, freq)) + geom_bar(stat="identity") + theme(axis.text.x=element_text(angle=45, hjust=1))   
p

Checking for n-grams

## tokenizer functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Unigrams

Unigrams <- TermDocumentMatrix(sample_corpus,control = list(tokenize = unigram))

freq_uni <- rowSums(as.matrix(Unigrams))
freq_uni <- sort(freq_uni,decreasing = TRUE)
freq_uni_df <- data.frame(word=names(freq_uni), freq=freq_uni)  
ggplot(freq_uni_df[1:50,], aes(word, freq)) + labs(x = "Unigrams", y = "Frequency") + theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) + geom_bar(stat = "identity")

# Bigrams

Bigrams <- TermDocumentMatrix(sample_corpus,control = list(tokenize = bigram))

freq_Bi <- rowSums(as.matrix(Bigrams))
freq_Bi <- sort(freq_Bi,decreasing = TRUE)
freq_Bi_df <- data.frame(word=names(freq_Bi), freq=freq_Bi)  
ggplot(freq_Bi_df[1:30,], aes(word, freq)) + labs(x = "Bigrams", y = "Frequency") + theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) + geom_bar(stat = "identity")

Trigrams <- TermDocumentMatrix(sample_corpus,control = list(tokenize = trigram))
freq_Tri <- rowSums(as.matrix(Trigrams))
freq_Tri <- sort(freq_Tri,decreasing = TRUE)
freq_Tri_df <- data.frame(word=names(freq_Tri), freq=freq_Tri)  
ggplot(freq_Tri_df[1:30,], aes(word, freq)) + labs(x = "Trigrams", y = "Frequency") + theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) + geom_bar(stat = "identity")

Coverage of the Corpus by the most frequent words

Let us look at the number of uni-, bi- and tri-grams required to cover 50% and 90% of the total word count.

For 50% coverage, we see that (the percentages here are shares of the unique n-grams; see the conversion sketch after the coverage code below):

  1. For uni-grams we need 7% of the unique words to cover 50% of the text in the sample
  2. For bi-grams we need 47% of the unique bi-grams to cover 50% of the text in the sample
  3. For tri-grams we need 49% of the unique tri-grams to cover 50% of the text in the sample
total_uni <- sum(as.matrix(Unigrams))

# Walk down the frequency-sorted table, accumulating counts until the requested
# share of the total word count is covered, and return how many n-grams that took.
RightNumber <- function(freqwords,coverage){
        Total <- 0
        for(i in 1:nrow(freqwords)){
                Total <- Total + freqwords$freq[i]
                if((Total/sum(freqwords$freq)) > coverage){break}
        }
        return(i)
}

Unicount_50 <- RightNumber(freq_uni_df,0.50)
Bicount_50 <- RightNumber(freq_Bi_df,0.50)
Tricount_50 <- RightNumber(freq_Tri_df,0.50)

For 90% coverage:

  1. For uni-grams we need 66% of the unique words to cover 90% of the text in the sample
  2. For bi-grams we need 90% of the unique bi-grams to cover 90% of the text in the sample
  3. For tri-grams we need 88% of the unique tri-grams to cover 90% of the text in the sample
Unicount_90 <- RightNumber(freq_uni_df,0.90)
Bicount_90 <- RightNumber(freq_Bi_df,0.90)
Tricount_90 <- RightNumber(freq_Tri_df,0.90)
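
The percentages quoted above can be recovered from these counts by dividing each by the number of unique n-grams in the corresponding frequency table. A minimal sketch, assuming that interpretation (these lines are illustrative and were not part of the original run):

# Share of unique n-grams needed for 50% and 90% coverage (illustrative check,
# assuming the percentages above are counts divided by the number of unique n-grams)
round(100 * Unicount_50 / nrow(freq_uni_df))   # ~7%
round(100 * Bicount_50 / nrow(freq_Bi_df))     # ~47%
round(100 * Tricount_50 / nrow(freq_Tri_df))   # ~49%
round(100 * Unicount_90 / nrow(freq_uni_df))   # ~66%
round(100 * Bicount_90 / nrow(freq_Bi_df))     # ~90%
round(100 * Tricount_90 / nrow(freq_Tri_df))   # ~88%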

Way Ahead

So far we have looked at patterns of words based on their frequencies, but we have not dealt much with occurrences of foreign words. We might have to use a dictionary to eliminate the foreign words.
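
A minimal sketch of such a filter, assuming a plain-text English word list is available; english_words.txt and keep_english below are hypothetical and introduced only for illustration, and in practice this step would run after the punctuation and number removal above:

# Hypothetical sketch: keep only tokens that appear in an English word list.
# "english_words.txt" is a placeholder dictionary file with one word per line.
english_words <- tolower(readLines("english_words.txt", encoding = "UTF-8"))
keep_english <- function(line){
        tokens <- unlist(strsplit(line, "\\s+"))
        paste(tokens[tolower(tokens) %in% english_words], collapse = " ")
}
comp_sample_en <- vapply(comp_sample, keep_english, character(1), USE.NAMES = FALSE)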

We also need to look at a multi n-gram model, which makes use of all uni-, bi- and tri-grams to predict the next word; a Markov chain model needs to be worked out for that. A first, bigram-only version of that idea is sketched below.
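
As an illustrative first step (not the final model), the bigram frequency table built above can already drive a simple most-frequent-continuation lookup; predict_next is a hypothetical helper added for illustration:

# Illustrative sketch: suggest the most frequent continuations of a word,
# using the bigram frequency table (freq_Bi_df) computed earlier.
predict_next <- function(word, bigram_df = freq_Bi_df, n = 3){
        # bigrams are stored as "word1 word2"; keep those starting with `word`
        matches <- bigram_df[grepl(paste0("^", tolower(word), " "), bigram_df$word), ]
        matches <- matches[order(matches$freq, decreasing = TRUE), ]
        # return the second token of the top n matches
        head(sub("^\\S+\\s+", "", matches$word), n)
}

predict_next("new")   # e.g. likely words following "new"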

For predicting a word after an unknown word, we also need to explore back-off models, as roughly sketched below. A bit of work on the above should help produce a Shiny App that can predict the next word.
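
For example, a rough back-off sketch along those lines (predict_backoff is hypothetical, for illustration only): look up the last two words in the trigram table, fall back to the bigram table, and finally to the most frequent unigrams.

# Rough illustrative back-off: trigrams first, then bigrams, then top unigrams.
predict_backoff <- function(phrase, n = 3){
        tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
        last_two <- paste(tail(tokens, 2), collapse = " ")
        hit <- freq_Tri_df[grepl(paste0("^", last_two, " "), freq_Tri_df$word), ]
        if(nrow(hit) == 0){
                hit <- freq_Bi_df[grepl(paste0("^", tail(tokens, 1), " "), freq_Bi_df$word), ]
        }
        if(nrow(hit) == 0) return(head(as.character(freq_uni_df$word), n))
        hit <- hit[order(hit$freq, decreasing = TRUE), ]
        head(sub("^.*\\s+", "", hit$word), n)   # last token of each matching n-gram
}

predict_backoff("one of the")   # e.g. likely words following "one of the"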