Milestone Report for Coursera Data Science Specialization: Exploratory Analysis of Text data from Twitter, news and blogs

Synopsis

We will explore the training data set provided on the Coursera website with a view building a model that predicts the next word that a user will type in a sentence based on the previous one to three words input. The data consists of a large amount of text collected in three files from blogs, twitter and the news.

The exploratory analysis will make use of the following R libraries.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Note that the use of other libraries such as tm and RWeka were explored, but tokenizing and forming n-grams appeared to run much faster when done manually using basic R functions.

Data Input and Processing

The following will download and extract the training data.

# download and extract data from the Coursera website link
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip","data.zip")
unzip("data.zip")

We will then read in the three english language data files from Twitter, blogs and the news:

twit <- readLines("./final/en_US/en_US.twitter.txt")

## Warning in readLines("./final/en_US/en_US.twitter.txt"): line 167155
## appears to contain an embedded nul

## Warning in readLines("./final/en_US/en_US.twitter.txt"): line 268547
## appears to contain an embedded nul

## Warning in readLines("./final/en_US/en_US.twitter.txt"): line 1274086
## appears to contain an embedded nul

## Warning in readLines("./final/en_US/en_US.twitter.txt"): line 1759032
## appears to contain an embedded nul

blog <- readLines("./final/en_US/en_US.blogs.txt")
news <- readLines("./final/en_US/en_US.news.txt")

## Warning in readLines("./final/en_US/en_US.news.txt"): incomplete final line
## found on './final/en_US/en_US.news.txt'

The following aims to tokenize, by words, and split into sentences the text data from Twitter with a few assumptions:

“.”, “?”, “!” followed by a space indicate the end of sentences (of course there will be a few exceptions).
Punctuation is not critical to keep track of.
Whether a character is upper/lower case is not critical to keep track of.

twitwds <- strsplit(twit,"(\\.|\\?|\\!) ") # split text into sentences
twitwds <- unlist(twitwds) # produces a list of sentences
twitwds <- tolower(twitwds) # change all characters to  lower case
twitwds <- gsub("[[:punct:]]","",twitwds) # removes punctuation
twitwds <- gsub("  "," ",twitwds) # remove double spaces
twitwds <- strsplit(twitwds," ") # split sentences into words
twitwds <- twitwds[sapply(twitwds,length)>0] # remove any sentences with no words in

The same is then repeated for the blogs and news text data:

blogwds <- strsplit(blog,"(\\.|\\?|\\!) ") # split text into sentences
blogwds <- unlist(blogwds) # produces a list of sentences
blogwds <- tolower(blogwds) # change all characters to  lower case
blogwds <- gsub("[[:punct:]]","",blogwds) # removes punctuation
blogwds <- gsub("  "," ",blogwds) # remove double spaces
blogwds <- strsplit(blogwds," ") # split sentences into words
blogwds <- blogwds[sapply(blogwds,length)>0] # remove any sentences with no words in

newswds <- strsplit(news,"(\\.|\\?|\\!) ") # split text into sentences
newswds <- unlist(newswds) # produces a list of sentences
newswds <- tolower(newswds) # change all characters to  lower case
newswds <- gsub("[[:punct:]]","",newswds) # removes punctuation
newswds <- gsub("  "," ",newswds) # remove double spaces
newswds <- strsplit(newswds," ") # split sentences into words
newswds <- newswds[sapply(newswds,length)>0] # remove any sentences with no words in

Note that text data is now a list of sentences where each sentence is itself a list of words. The purpose of splitting up into sentences is so that the n-grams that we form later do not cross between sentences and so the predictive power of the eventual application will be limited to predicting the next word within the same sentence.

Exploratory Analysis

We define an n-gram of words as a sequence of n consecutive words found within a sentence. Let’s start by writing a function that can produce a list of the n-grams from a given sentence (list of words):

ngram <- function(lt, n){  # n will be the n in n-gram; lt is the list to work on
  out <- list()
  if(n<=length(lt)){    #there won't be any n-grams if n is bigger than the number of words in the sentence!
    for (j in 1:(length(lt)-n+1)){  #collect all the n-grams in a list
      out <- c(out,paste(lt[j:(j+n-1)],collapse=" "))
    }
  }
  return(out)
}

We now apply this function to the lists of sentences produced from the three data sources to produce a list of 1-grams for each data source. Due to memory constraints, we’ll do this separately for each data source.

Twitter

twit1g <- lapply(twitwds,function(x) {ngram(x,1)})
twit1g <- unlist(twit1g)

There are 29819134 1-grams in the twitter data. We can examine the distribution of the frquencies of words:

twittab <- as.data.frame(table(twit1g))
twittab <- twittab[order(-twittab$Freq),]  #place 1-grams in order of decreasing frequency
twittab <- mutate(twittab,cumFreq = cumsum(Freq))
twittab <- mutate(twittab,cumRelFreq = cumFreq/sum(Freq)) #compute cumulative relative frequencies

The 10 most frequent 1-grams:

twittab[1:10,]

##    twit1g   Freq cumFreq cumRelFreq
## 1     the 933427  933427 0.03130295
## 2      to 786378 1719805 0.05767455
## 3       i 712377 2432182 0.08156447
## 4       a 606542 3038724 0.10190517
## 5     you 542154 3580878 0.12008659
## 6     and 433527 4014405 0.13462514
## 7     for 384311 4398716 0.14751320
## 8      in 376725 4775441 0.16014687
## 9      of 358876 5134317 0.17218196
## 10     is 357433 5491750 0.18416866

We can truncate the the list of 1-grams such that only enough 1-grams are kept to cover 90% of occurences in the data:

twittab <- twittab[twittab$cumRelFreq<=0.9,]

Only 6224 unique 1-grams are needed to cover 90% of all 1-gram occurences in the data. Truncating like this could help us to keep the size and number of parameters of the model under control without sacrificing too much on prediction accuracy.

We now plot a histrogram to see the distribution of frequencies of 1-grams. The x-axis has been rescaled to be the log of the log of of the log of the number of occurences of each 1-gram in the data in order to produce a meaningful shape:

hist(log(log(log(twittab$Freq))))

There are a few extremely common 1-grams and frequencies drop off very sharply as you move to the less common 1-grams.

We’ll now do some similar analysis for the other two data sets:

Blogs

blog1g <- lapply(blogwds,function(x) {ngram(x,1)})
blog1g <- unlist(blog1g)

blogtab <- as.data.frame(table(blog1g))
blogtab <- blogtab[order(-blogtab$Freq),]  #place 1-grams in order of decreasing frequency
blogtab <- mutate(blogtab,cumFreq = cumsum(Freq))
blogtab <- mutate(blogtab,cumRelFreq = cumFreq/sum(Freq)) #compute cumulative relative frequencies

The 10 most frequent 1-grams:

blogtab[1:10,]

##    blog1g    Freq cumFreq cumRelFreq
## 1     the 1848597 1848597 0.04965738
## 2     and 1084582 2933179 0.07879163
## 3      to 1064762 3997941 0.10739348
## 4       a  894545 4892486 0.13142292
## 5      of  874700 5767186 0.15491929
## 6       i  762769 6529955 0.17540894
## 7      in  592243 7122198 0.19131789
## 8    that  458155 7580353 0.20362494
## 9      is  430735 8011088 0.21519542
## 10     it  397611 8408699 0.22587613

We can truncate the the list of 1-grams such that only enough 1-grams are kept to cover 90% of occurences in the data:

blogtab <- blogtab[blogtab$cumRelFreq<=0.9,]

Only 7805 unique 1-grams are needed to cover 90% of all 1-gram occurences in the data.

hist(log(log(log(blogtab$Freq))))

There are a few extremely common 1-grams and frequencies drop off very sharply as you move to the less common 1-grams. It looks quite similar to what we saw for the Twitter data.

We might expect the news data to be a little different because language used may be more formal. Let’s take a look:

News

news1g <- lapply(newswds,function(x) {ngram(x,1)})
news1g <- unlist(news1g)

newstab <- as.data.frame(table(news1g))
newstab <- newstab[order(-newstab$Freq),]  #place 1-grams in order of decreasing frequency
newstab <- mutate(newstab,cumFreq = cumsum(Freq))
newstab <- mutate(newstab,cumRelFreq = cumFreq/sum(Freq)) #compute cumulative relative frequencies

The 10 most frequent 1-grams:

newstab[1:10,]

##    news1g   Freq cumFreq cumRelFreq
## 1     the 151153  151153 0.05735175
## 2      to  69300  220453 0.08364614
## 3     and  68159  288612 0.10950760
## 4       a  67026  355638 0.13493918
## 5      of  59029  414667 0.15733646
## 6      in  51409  466076 0.17684250
## 7     for  26944  493020 0.18706582
## 8    that  26281  519301 0.19703757
## 9      is  21942  541243 0.20536299
## 10     on  20538  561781 0.21315570

We can truncate the the list of 1-grams such that only enough 1-grams are kept to cover 90% of occurences in the data:

newstab <- newstab[newstab$cumRelFreq<=0.9,]

Only 9631 unique 1-grams are needed to cover 90% of all 1-gram occurences in the data.

hist(log(log(log(newstab$Freq))))

The pattern here is again fairly similar to that of the blogs and Twitter data, but it does appear that there is a slightly more diverse vocabulary observed in the news data.

Plans for Creating Prediction Algorithm and Application

Model size

In order to keep the size of the model under control, it will be useful to replace words that are not commonly observed in the data (perhaps words that were not in the truncated lists of 1-grams above, although the cut-off of 90% can be tuned later to get the best trade-off between model size and prediciton accuracy) with a token such as “<rare>”. After doing this, lists of 1-grams, 2-grams and 3-grams can be produced with the aim of producing a backoff n-gram model.

Backoff n-gram model

Given an n-gram, the model will estimate the probability of each possible subsequent word by counting how many times in the training data that the word was observed after the n-gram and dividing by the total number of occurences of the n-gram. If the n-gram was not observed at all in the training data then we “backoff” to the (n-1)-gram proceeding the word being predicted and proceed in a similar manner. We “backoff” until we find a history that has non-zero counts.

In order to ensure the probability mass adds up to 1, a discount must be applied to the higher order n-grams in order that some mass is saved for when we are required to “backoff” to lower order n-grams.

We’ll start off by having the highest order of n-gram be the 3-gram but will consider exploring higher order n-grams if possible.

Evaluating and tuning the model

We will need to tune the parameters in the model (such as the discounting factor to be applied at each “backoff”) using a cross-validation set of data. To do this we will need to decide on a metric for evaluating the accuracy of our model - one such measure is known as perplexity which is the inverse probability that the model would produce for a test set occurring, normalised for the number of words in the test data.

The Application

The proposed application will be a shiny app into which a user types a word or words and the application responds automaticlly by suggesting one (or possibly a few choices) word which the model predicts is the most likely to come next. The model will have to be compact enough so that this can be computed in a minimal amount of time.