We start by downloading the zip file containing the project data and unzipping the files into a directory “data” in the working directory.
download.file(url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "Coursera-SwiftKey.zip",quiet = TRUE )
unzip("Coursera-SwiftKey.zip",exdir = "data")
Read in the English versions of the blogs, news and twitter data using the readLines function.
blogs <- readLines("data/final/en_US/en_US.blogs.txt", encoding="UTF-8",skipNul = TRUE,warn = FALSE)
news <- readLines("data/final/en_US/en_US.news.txt", encoding="UTF-8",skipNul = TRUE, warn = FALSE)
twitter <-readLines("data/final/en_US/en_US.twitter.txt", encoding="UTF-8",skipNul = TRUE, warn = FALSE)
Summary of the three data sets: blogs.txt, news.txt and twitter.txt.
library(stringi)
corpus <- list(blogs,news,twitter)
linecharstat<- sapply(corpus,stri_stats_general)[c('Lines','Chars'),]
wordcount <- sapply(corpus,stri_stats_latex)[c('Words'),]
wordsummary <- sapply(corpus,function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
descstat <- as.data.frame(rbind(linecharstat,wordcount,wordsummary))
colnames(descstat)<- c("blogs.txt","news.txt","twitter.txt")
rownames(descstat) <- c("Number of Lines","Number of Characters","Number of Words","Min words per line","Mean words per line","Max.words per line")
rm(corpus)
format(descstat,scientific = FALSE,digits=0,big.mark=",")
## blogs.txt news.txt twitter.txt
## Number of Lines 899,288 77,259 2,360,148
## Number of Characters 206,824,382 15,639,408 162,096,241
## Number of Words 37,570,839 2,651,432 30,451,170
## Min words per line 0 1 1
## Mean words per line 42 35 13
## Max.words per line 6,726 1,123 47
We will now explore the words that appear most frequently in the three data sets. The analysis is done both with and without stop words. Stop words are very common words such as "the", "a" and "and"; an explanation of stop words is given at https://en.wikipedia.org/wiki/Stop_words.
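For reference, the stop word list removed in the analysis below can be inspected directly; a quick sketch using the same stopwords("english") call used later:
head(stopwords("english"), 10)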
Before exploring word frequencies, the data has to be cleaned and preprocessed to facilitate the exploratory analysis. Numbers, punctuation, symbols, separators, URLs, Twitter handles and hashtags, etc. are removed from the corpus. The quanteda package in R is used to create a document-feature matrix (dfm). In the dfm, each line in the corpus is treated as a document and each unique word in the line is treated as a feature, so the dfm has as many rows as there are lines and as many columns as there are unique words in the corpus. The dfm provides the line-wise frequency of each of these words. While creating the dfm, words or word combinations that occur in only one line are excluded to reduce the size of the dfm.
library(quanteda)
library(data.table)
# Build a document-feature matrix of n-grams, keeping stop words
createdfm <- function(x, n) {
  doc <- dfm(x, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE,
             remove_symbols = TRUE, remove_twitter = TRUE, remove_separators = TRUE,
             remove_hyphens = TRUE, remove_url = TRUE, ngrams = n, concatenator = " ")
  doc <- dfm_compress(doc)
  doc <- dfm_trim(doc, min_docfreq = 2)   # drop features that occur in only one line
  return(doc)
}
# Same as createdfm, but with English stop words removed
createstopdfm <- function(x, n) {
  docstop <- dfm(x, remove = stopwords("english"), tolower = TRUE, remove_numbers = TRUE,
                 remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE,
                 remove_separators = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
                 ngrams = n, concatenator = " ")
  docstop <- dfm_trim(dfm_compress(docstop), min_docfreq = 2)
  return(docstop)
}
# Summarise a dfm as a frequency table sorted by total count
createdatatable <- function(x) {
  table <- data.table(Words = featnames(x), TotalCount = colSums(x),
                      Linespresent = docfreq(x), stringsAsFactors = FALSE)
  table <- table[order(TotalCount, decreasing = TRUE)]
  table$percentoftotal <- (table$TotalCount / sum(table$TotalCount)) * 100
  table$cumpercent <- cumsum(table$percentoftotal)
  return(table)
}
Exploratory Analysis of Blogs Data
blogsunigram = createdfm(blogs,1)
blogsstopunigram = createstopdfm(blogs,1)
blogsunigramtab = createdatatable(blogsunigram)
blogsstopunigramtab =createdatatable(blogsstopunigram)
The top 10 most frequent words appearing in the blogs data, with and without stop words, are shown in the tables below; the table on the right is without stop words.
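These tables can be printed from the unigram frequency tables built above; a sketch of the calls, showing only the first three columns:
blogsunigramtab[1:10, .(Words, TotalCount, Linespresent)]       # with stop words
blogsstopunigramtab[1:10, .(Words, TotalCount, Linespresent)]   # without stop words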
A word cloud of the top 50 words, with and without stop words, in the blogs text is shown below; the figure on the right is without stop words.
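Such word clouds can be drawn from the same frequency tables; a minimal sketch using the wordcloud package (the package choice is an assumption, quanteda's textplot_wordcloud() is an alternative):
library(wordcloud)
par(mfrow = c(1, 2))   # with stop words (left) and without (right)
wordcloud(words = blogsunigramtab$Words[1:50], freq = blogsunigramtab$TotalCount[1:50])
wordcloud(words = blogsstopunigramtab$Words[1:50], freq = blogsstopunigramtab$TotalCount[1:50])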
Let us now examine the word pair (bigram) combinations in the blogs data.
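The bigram code for the blogs data is not echoed in this report; a sketch mirroring the news chunk further below:
blogsbigram = createdfm(blogs, 2)             # word pairs, stop words retained
blogsbigramtab = createdatatable(blogsbigram)
blogsbigramtab[1:10, .(Words, TotalCount, Linespresent)]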
Below are the top 10 frequent word pairs appearing in blogs text data.
## Words TotalCount Linespresent
## 1: of the 187136 139416
## 2: in the 154187 124230
## 3: to the 86031 74644
## 4: on the 75266 66050
## 5: to be 68104 59442
## 6: and the 58737 52726
## 7: for the 58128 52473
## 8: i was 49556 38672
## 9: and i 49486 43929
## 10: i have 48074 40333
A word cloud of the top 25 word pairs in the blogs data is shown below.
Exploratory Analysis of News data
newsunigram = createdfm(news,1)
newsstopunigram = createstopdfm(news,1)
newsunigramtab = createdatatable(newsunigram)
newsstopunigramtab =createdatatable(newsstopunigram)
The top 10 most frequent words appearing in the news data, with and without stop words, are shown in the tables below; the table on the right is without stop words.
A word cloud of the top 50 words, with and without stop words, in the news text is shown below; the figure on the right is without stop words.
Let us now examine the word pair combinations in the news data.
newsbigram = createdfm(news,2)
newsbigramtab =createdatatable(newsbigram)
rm(news)
Below are the top 10 frequent word pairs appearing in news data.
## Words TotalCount Linespresent
## 1: of the 14096 12097
## 2: in the 13708 11808
## 3: to the 6442 5981
## 4: on the 5537 5133
## 5: for the 5397 5079
## 6: at the 4516 4231
## 7: and the 4050 3857
## 8: in a 4041 3859
## 9: to be 3552 3314
## 10: with the 3323 3163
A word cloud of the top 25 word pairs in the news data is shown below.
Exploratory Analysis of Twitter Data
twitterunigram = createdfm(twitter,1)
twitterstopunigram = createstopdfm(twitter,1)
twitterunigramtab = createdatatable(twitterunigram)
twitterstopunigramtab =createdatatable(twitterstopunigram)
The top 10 most frequent words appearing in the twitter data, with and without stop words, are shown in the tables below; the table on the right is without stop words.
A word cloud of the top 50 words, with and without stop words, in the twitter text is shown below; the figure on the right is without stop words.
Let us now examine the word pair combinations in the twitter data.
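As with the blogs data, the twitter bigram chunk is not echoed; a sketch would be:
twitterbigram = createdfm(twitter, 2)
twitterbigramtab = createdatatable(twitterbigram)
twitterbigramtab[1:10, .(Words, TotalCount, Linespresent)]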
Below are the top 10 frequent word pairs appearing in twitter text data.
## Words TotalCount Linespresent
## 1: in the 78335 76159
## 2: for the 73955 72854
## 3: of the 56873 55032
## 4: on the 48469 47488
## 5: to be 46986 45596
## 6: to the 43400 42683
## 7: thanks for 42983 42778
## 8: at the 37229 36601
## 9: i love 35904 34947
## 10: going to 34270 33604
A word cloud of the top 25 word pairs in the twitter data is shown below.
We will now examine how many unique words are needed to cover 50% or more of the word instances in each data set.
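These coverage counts can be read off the cumpercent column of the unigram frequency tables; a small helper sketch:
# number of top-frequency words needed to reach a given cumulative coverage (%)
coverage <- function(pct, tab) which(tab$cumpercent >= pct)[1]
sapply(c(50, 80, 90), coverage, tab = blogsunigramtab)
sapply(c(50, 80, 90), coverage, tab = newsunigramtab)
sapply(c(50, 80, 90), coverage, tab = twitterunigramtab)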
There are 163660 unique words in the blogs data, including stop words. Interestingly, just 105 words cover 50% of the word usage in the data; 1953 words cover 80% and 6351 words cover 90% of the word usage in the blogs data. Out of the 45422 unique words in the news data, just 178, 2615 and 7071 words cover 50%, 80% and 90% of total word usage. Likewise, 120, 1499 and 4942 words cover 50%, 80% and 90% of word usage in the twitter data, which has 151046 unique words.
This shows that a much smaller set of words or word pairs can give a fair representation of the text corpus, and this forms the basis of the model for predicting the next word.
Based on the exploratory analysis it's evident that we don't need to use the complete corpus to build our n-grams. The following steps are proposed to develop the model.
Use about 30% of the entire corpus, which contains words and word pairs covering about 95% of the word usage in the entire corpus (see the sketch after this list).
Build n-grams up to 5-grams using the quanteda library.
Use the Kneser-Ney smoothing algorithm based on conditional probabilities.
Use a stupid backoff model to estimate the probability of unobserved n-grams.
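As a starting point for the first two steps, the sampling and n-gram construction could reuse the helper functions defined earlier. A minimal sketch (the names sampletext and ngramtabs are illustrative, and building 5-grams even on a 30% sample may need substantial memory):
set.seed(1234)
# assumes the raw blogs, news and twitter vectors are still (or again) in memory
sampletext <- c(sample(blogs, round(length(blogs) * 0.30)),
                sample(news, round(length(news) * 0.30)),
                sample(twitter, round(length(twitter) * 0.30)))
# 1-gram to 5-gram frequency tables, stop words retained for prediction
ngramtabs <- lapply(1:5, function(n) createdatatable(createdfm(sampletext, n)))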