Section 1: Preparing for exploratory data analysis

In this section we perform a thorough exploratory analysis of the data to understand the distribution of words and the relationships between words in the corpora.

1-1 Get the data

We can download the data from:

Capstone Dataset

or we can download it in the following way.

## Once you have the download link, download and unzip the file with the following code.
download.file("url", destfile = "raw.zip")   ## replace "url" with the actual link
unzip("raw.zip")

Either way, we download the raw zip file and then unzip it. We get several files, but we focus on the English files “en_US.twitter.txt”, “en_US.blogs.txt” and “en_US.news.txt”.
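Assuming the archive unpacks into the usual final/en_US folder (the path below is an assumption; adjust it to your own layout), we can check that the three files are present:

## "final/en_US" is an assumed path for the unzipped English files
list.files("final/en_US")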

1-2 Clean the data

We will use three functions to deal with the data:

  • getWords: takes a list of sentences as input, returns a list of sentences containing only word characters and spaces.

  • subList: takes a list and an index vector, returns the sublist selected by the vector.

  • subsetData: takes a list and an integer m, returns a random sublist whose size is the length of the list divided by m.

The first thing we want to do is keep only the characters “a-z” and “A-Z”. The following function extracts words from running text.

## This function receives a list of character strings as input and returns a list of
## strings containing only a-z, A-Z, and the space character, so that only words remain.
getWords <- function(corpus){
            n <- length(corpus)  ## corpus is a list of sentences
            x <- corpus
            for (i in 1:n) {
                x[[i]] <- gsub("[^a-zA-Z ]", "", corpus[[i]])
            }
            x
}
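A quick check of getWords on a made-up input:

getWords(list("Hello, world!", "R is fun... isn't it?"))
## returns a list with the elements "Hello world" and "R is fun isnt it"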

We read the dataset and clean it. Dealing with the whole dataset is too time consuming, so we work with a smaller sample of each dataset.

con <- file("en_US.twitter.txt","r")
twitter <- readLines(con)
close(con)
## subset a list orilist with index vector x
subList <- function(x, orilist){
           y <- list()
           n <- length(x)
           for (i in 1:n){
               y[[i]] <- orilist[[x[i]]]
           }
           y
}

## randomly keep a 1/m sample of the elements of data
subsetData <- function(data, m = 100){
              n <- length(data)
              x <- sample(1:n, floor(n/m))   ## sample size rounded down to an integer
              subList(x, data)
}

set.seed(1)
twitterSamples <- getWords(subsetData(twitter, 1000))

Then the dataset twitterSamples is a sample of size 2360, with elementary cleaning, drawn from the twitter training data.

To save time, we sometimes work with the 2360-line sample twitterSamples instead of the whole twitter dataset twitter.
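The sample size can be checked directly:

length(twitterSamples)
## [1] 2360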

1-3 Lines, tokens, and size of each English dataset

Functions we are going to use:

  • countSpace: takes a sentence, returns the number of tokens in the sentence.
  • tokenCount: counts the tokens in a list of sentences.

Below we read the remaining files, count the tokens in samples of each, and give a table with the number of lines and the size of each English txt file.

con <- file("en_US.blogs.txt","r")
blogs <- readLines(con)
close(con)
con <- file("en_US.news.txt","r")
news <- readLines(con)
close(con)

countSpace <- function(x){
              nchar(gsub("[^ ]", "", x))+1
}


tokenCount <- function(corpus){
              n = length(corpus)
              m = 0
              for (i in 1:n) {
                  m = m + countSpace(corpus[[i]])
              }
              m
}
set.seed(1)
## tokens of the twitters
tokenCount(subsetData(twitter))
## [1] 302634
## tokens of the news
tokenCount(subsetData(news))
## [1] 27614
## tokens of the blogs
tokenCount(subsetData(blogs))
## [1] 368602
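Note that countSpace simply counts spaces and adds one, so the token counts above are approximate (repeated spaces inflate them slightly). A quick check on a made-up string:

countSpace("a quick sanity check")
## [1] 4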

The number of tokens in each full file is approximately 100 times the number above, since subsetData keeps roughly 1/100 of the lines by default. The following table shows the number of lines and the size of each dataset.

Statistic item       en_US.twitter.txt  en_US.blogs.txt  en_US.news.txt
Number of lines      2360148            899288           77259
Size of file (Mb)    301.4              248.5            19.2

From the table, we can see that the twitter file has by far the most lines, more than twice as many as blogs and roughly thirty times as many as news. The news file has both the fewest lines and the smallest size.
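The summaries in the table can be obtained as sketched below for the twitter data (repeat for blogs and news). Whether the reported sizes refer to the file on disk or to the object in memory is our assumption, so both are shown:

length(twitter)                                ## number of lines read from the file
file.size("en_US.twitter.txt") / 1024^2        ## size of the file on disk in Mb
format(object.size(twitter), units = "Mb")     ## size of the object in memory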

1-4 Tokenization of the corpus

A function we are going to use:

  • tokenize: takes a list of sentences, returns a list of token vectors.

The first step in calculating frequencies is to split the sentences of the corpus into tokens. The following function does this work.

tokenize <- function(corpus){
            sapply(corpus, function(x){strsplit(x, " ")})
}
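A quick check of tokenize on a made-up input:

tokenize(list("i love r", "hello world"))
## returns a list of two character vectors: c("i", "love", "r") and c("hello", "world")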

Code to apply the tokenize function to our data:

## split the sampled twitter data and news data
wordstwitter <- tokenize(twitterSamples)
wordsnews <- tokenize(getWords(news))

The tokenized news data wordsnews has a size of 154.9 Mb, which is much bigger than the original data of 19.2 Mb. This means we cannot tokenize the whole twitter dataset for statistical summaries, because it would take too much memory and time.
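The in-memory size quoted above can be checked with object.size:

format(object.size(wordsnews), units = "Mb")
## "154.9 Mb" in our case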

1-5 Basic summary functions for the tokenized corpus

Functions defined and to be used:

  • freqTable: takes a list of tokens, returns the frequency table of the tokens.
  • cumprobTable: shows the cumulative probability covered by the n most frequent tokens of a list.
twittertokens <- tokenize(twitterSamples)
## freqTable: count every token in a tokenized corpus and return a named
## frequency vector sorted in decreasing order of counts
freqTable <- function(inputList){
            counts <- table(unlist(inputList))  ## flatten the token lists and count each token
            sort(c(counts), decreasing = TRUE)  ## drop the table class and sort by frequency
}
freqTable(twittertokens)[1:30]
##  the   to    a    I  you  and   is  for   of   in   it   my   on that   be 
##  846  761  559  552  474  411  355  353  347  340  262  256  238  228  204 
##  are with   at   me your have this   Im like  out just   so  all    i  not 
##  181  170  168  150  147  139  134  133  122  118  109  108  107  101  100
## cumulative probability covered by the n most frequent tokens
cumprobTable <- function(n = 20, inputlist){
                v <- freqTable(inputlist)
                cumsum(v[1:n])/sum(v)
}

plot(cumprobTable(n= 500, twittertokens), type = "l", xlab = "Number of tokens", ylab = "Probability")
abline(h = 0.5, col = "red")

Section 2: Summaries for the data

The main difficulty in dealing with these corpora is that processing them is time consuming, and the way to save time is to sample from the whole dataset. In this section we focus on calculating the frequencies needed to build the n-gram models.

2-1 Frequency functions of tokens

  • freqToken: takes a specified token and a corpus, returns the probability of this token in the corpus (a usage sketch follows the function definitions).
freqToken <- function(token, corpus, m = 1000){
             li <- tolower(subsetData(corpus, m))
             tokenList <- tokenize(getWords(li))
             countToken <- function(token, x){
                        count <- 0
                        for (i in 1:length(x)){
                            if (identical(token, x[i])){
                                count = count + 1
                            }
                        }
                        count
                 }
             sum(sapply(tokenList, function(x){countToken(token,x)}))/tokenCount(li)
}
  • highFreqTokenTable: takes a corpus, a divisor m, and a number n of tokens to show; returns the probabilities of the n most frequent tokens.
highFreqTokenTable <- function(corpus, m = 1000, n = 20){
      li <- tolower(subsetData(corpus, m))
      tokenList <- tokenize(getWords(li))
      freqTable(tokenList)[1:n]/sum(freqTable(tokenList))
}
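As a usage sketch of freqToken (no output shown, since the estimate depends on the random sample drawn):

set.seed(1)
freqToken("i", news)    ## estimated probability of the token "i" in a news sample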

2-2 Summaries for each dataset

We did some basic summaries in Section 1; here we look at each dataset in more detail.

highFreqTokenTable(twitter, m = 5000, n = 20)
##         the          to           i         you           a         and 
## 0.029381265 0.026097477 0.025406153 0.020394055 0.018492914 0.015727618 
##         for          in          on          of          is          it 
## 0.012443830 0.012098168 0.011752506 0.011234013 0.011061182 0.010888351 
##          my        that          me          be        your         was 
## 0.010715520 0.008468718 0.007604563 0.007258901 0.006913239 0.006394746 
##        just          so 
## 0.006394746 0.006221915
highFreqTokenTable(blogs, m = 2000, n = 20)
##         the          to         and           a          of           i 
## 0.048134062 0.029771809 0.029236986 0.026681721 0.023116235 0.020204421 
##          in          is          it        that         for         you 
## 0.014380794 0.012300927 0.011647255 0.011468980 0.009805087 0.009567388 
##         was        with        this          as          on        have 
## 0.007130972 0.007130972 0.006536725 0.006358450 0.006180176 0.006061326 
##          my         not 
## 0.005942477 0.005823627
highFreqTokenTable(news, m = 1000, n = 20)
##         the         and           a          to          of          in 
## 0.056818182 0.031524927 0.030791789 0.028592375 0.022727273 0.021627566 
##         for        that          on          is        with          be 
## 0.012463343 0.011730205 0.010263930 0.008797654 0.008797654 0.007697947 
##        said          it         was        from          he        have 
## 0.007697947 0.006964809 0.006964809 0.006231672 0.005131965 0.004765396 
##         but         not 
## 0.004765396 0.004032258

From the results, we can see that the word “i” is not very frequent in the news dataset, but very frequent in twitter and blogs. The probability of “i” in news is 0.0033149, which is much lower than 0.0200847 and 0.0232843. This reflects the fact that news writing is expected to be objective.

We could show more differences between the three datasets with these two functions, reflecting characteristic features of twitter, blogs and news, but let us turn our focus to prediction models.

Section 3: Prediction models

3-1 Frequency table of tokens after a specified token

Given a corpus such as twitter and a token such as “the”, what is the frequency table of the tokens that appear after “the” in that corpus? We build a function to compute this frequency table.

tokenProducedFreqTable <- function(token, corpus){
        li = tolower(subsetData(corpus))
        ## if the token does not appear in the random sample, fall back to the full corpus
        if (!sum(grepl(token, li))) {
            li = tolower(corpus)
        }
        x = grep(token, li)                          ## lines containing the token
        subLi = tokenize(getWords(subList(x, li)))   ## clean and tokenize those lines
        newLi = subLi
        for (i in 1:length(x)) {
            ## keep the word that follows each match of the token within the line
            newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]]) + 1]
        }
        freqTable(newLi)[1:6]
}

Let’s see what we can do with this function.

To predict the word that comes after “you” using the twitter dataset, we simply run

tokenProducedFreqTable("you", twitter)
## have  are  can know  for dont 
##  219  216  171  135  112  101

to get the predictions “have”, “are”, “can”, “know”, “for”, “dont” in decreasing order of frequency.

However, if we predict the word after “you” with a different training set such as news, then

tokenProducedFreqTable("you", news)
##   can  have  will would  need  that 
##     5     4     3     3     3     2

the prediction is different.

3-2 2-Grams model

The prediction model above only uses the word immediately before the one we want to predict. The following model uses the word two positions back: given that word, it tabulates the tokens that appear two places after it.

tokenProducedFreqTable2 <- function(token, corpus){
        li = tolower(subsetData(corpus))
        ## if the token does not appear in the random sample, fall back to the full corpus
        if (!sum(grepl(token, li))) {
            li = tolower(corpus)
        }
        x = grep(token, li)                          ## lines containing the token
        subLi = tokenize(getWords(subList(x, li)))   ## clean and tokenize those lines
        newLi = subLi
        for (i in 1:length(x)) {
            ## keep the word two positions after each match of the token
            newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]]) + 2]
        }
        freqTable(newLi)[1:6]
}
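A usage sketch (output omitted, since it depends on the random sample):

set.seed(1)
tokenProducedFreqTable2("you", twitter)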

Section 4: Final model

Our main difficulty is that the corpora are large, so processing them takes a lot of time and memory; in many situations we can only work with a random sample. We could also store the tokens in a data frame whose first column is the token and whose second column is its frequency, as in the sketch below.
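A minimal sketch of such a data frame, built from the freqTable output of the twitter sample; the column names are our own choice:

tokenFreq <- freqTable(twittertokens)
freqDF <- data.frame(token = names(tokenFreq),
                     frequency = as.integer(tokenFreq),
                     stringsAsFactors = FALSE)
head(freqDF)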

Our prediction algorithm is:

  1. Take a random sample and check whether we can work with the sample or need the whole corpus
  2. Build the frequency table of the 2-grams model
  3. Make a prediction with the model
twoGramsPredict <- function(token, corpus){
        data <- tolower(subsetData(corpus))
        if (sum(grepl(token,data)) > 20) {
                x = grep(token, data)
                subLi = subList(x,data)
                subLi = tokenize(getWords(subLi))
                newLi = subLi
                for (i in 1:length(x)) {
                        newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]])+1]
                }
                freqTable(newLi)[1]
        }
        else {
                x = grep(token, corpus)
                subLi = tolower(subList(x,corpus))
                subLi = tokenize(getWords(subLi))
                newLi = subLi
                for (i in 1:length(x)) {
                        newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]])+1]
                }
                freqTable(newLi)[1]
        }
}

twoGramsPredict("the", news)
## same 
##   21
threeGramsPredict <- function(token2,token1, corpus){
        data <- tolower(subsetData(corpus))
        if (sum(grepl(token1,data)) > 20) {
                x = grep(token1, data)
                subLi = subList(x,data)
                subLi = tokenize(getWords(subLi))
                newLi1 = subLi
                for (i in 1:length(x)) {
                        newLi1[[i]] <- subLi[[i]][grep(token1, subLi[[i]])+1]
                }
                a = freqTable(newLi1)/ sum(freqTable(newLi1))
        }
        else {
                x = grep(token1, corpus)
                subLi = tolower(subList(x,corpus))
                subLi = tokenize(getWords(subLi))
                newLi1 = subLi
                for (i in 1:length(x)) {
                        newLi1[[i]] <- subLi[[i]][grep(token1, subLi[[i]])+1]
                }
                a = freqTable(newLi1)/ sum(freqTable(newLi1))
        }
        if (sum(grepl(paste(token2, token1),data)) > 20) {
                x = grep(paste(token2, token1), data)
                subLi = subList(x,data)
                subLi = tokenize(getWords(subLi))
                newLi2 = subLi
                for (i in 1:length(x)) {
                        newLi2[[i]] <- subLi[[i]][grep(token1, subLi[[i]])+1]
                }
                b = freqTable(newLi2)/ sum(freqTable(newLi2))
        }
        else {
                x = grep(paste(token2, token1), corpus)
                subLi = tolower(subList(x,corpus))
                subLi = tokenize(getWords(subLi))
                newLi2 = subLi
                for (i in 1:length(x)) {
                        newLi2[[i]] <- subLi[[i]][grep(token1, subLi[[i]])+1]
                }
                b = freqTable(newLi2)/ sum(freqTable(newLi2))
        }
        ## return whichever model gives the higher probability for its top prediction
        if (a[1] > b[1]) {a[1]}
        else {b[1]}
}

threeGramsPredict("in", "the", news)
##      first 
## 0.02291667
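As a final sketch, the two models can be compared on the same training set; outputs are not shown because they depend on the random sample:

set.seed(1)
twoGramsPredict("the", twitter)
threeGramsPredict("in", "the", twitter)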