Summary

A first-pass analysis of the Quora Duplicate Questions dataset problem. A basic model using a straightforward application of word vectors is able to discriminate duplicate question pairs, with a measured positive predictive value of about 51.77%.

This program uses the following packages: tidyverse, stringr, ggplot2, ww44ss/yo, stats, xtable, knitr, dplyr, purrr, broom, tidyr, and tidytext.
## Get The Data

Read the data from the .csv file. To keep processing time sane, only the first 500 lines (n.lines.read) are read.

## load required packages
library(tidyverse)
library(stringr)
library(tidytext)
library(yo)        ## ww44ss/yo, installed from GitHub

## define directory
data.directory <- "/Users/winstonsaunders/Documents/kaggle_quora/data/"
## number of lines to read (reduced size for prototyping)
n.lines.read <- 500
## read data into a data_frame
train.data <- read_csv(str_c(data.directory, "train.csv"), col_types = "dddccc", n_max = n.lines.read)
## shift the zero-based id to one-based
train.data <- train.data %>% mutate(id = id + 1)

## the test data is not read while prototyping
#test.data <- read_csv(str_c(data.directory, "test.csv"), col_types = "dcc")

Data Cleaning

The purpose here is to put the data into a tidy format. The tools of the tidytext package accomplish this quickly.

First we get the list of stop words from one of the lexicons included in the tidytext package.

## GET STOP WORDS

list_of_stop_words <- stop_words %>%
        filter(lexicon == "onix") %>%   ## alternative lexicon: "snowball"
        select(word) %>% 
        yo

# # A tibble: 174 × 1
#         word

Then we quickly process the text into tidy format (yes, it is that easy!).

## TRANSFORM TRAINING DATA TO TIDY FORMAT

## Takes the lightly processed train.data, reduces it to a single question column,
## tokenizes it into words, and removes stop words

## QUESTION1
    words_question1 <- 
            train.data %>%
            select(id, qid1, question1) %>%
            unnest_tokens(word, question1) %>%
            filter(!word %in% list_of_stop_words$word) %>% 
            yo

    xtable(words_question1[1:10, ])
## the data should look like this
# # A tibble: 1,700 × 3
#       id  qid1     word
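
The second question in each pair gets the same treatment; the symmetric transformation below (a sketch mirroring the question1 pipeline) produces the words_question2 data frame used in what follows.

## QUESTION2
    words_question2 <- 
            train.data %>%
            select(id, qid2, question2) %>%
            unnest_tokens(word, question2) %>%
            filter(!word %in% list_of_stop_words$word) %>% 
            yo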

Let’s compare a couple of randomly selected question pairs after this treatment.

## print the reduced words
id.select = sample(1:n.lines.read, 1)
print(words_question1 %>% filter(id == id.select))
## # A tibble: 3 × 3
##      id  qid1   word
##   <dbl> <dbl>  <chr>
## 1   451   898 travel
## 2   451   898 faster
## 3   451   898  light
print(words_question2 %>% filter(id == id.select))
## # A tibble: 3 × 3
##      id  qid2   word
##   <dbl> <dbl>  <chr>
## 1   451   899 travel
## 2   451   899 faster
## 3   451   899  light
## print the reduced words
id.select = sample(1:n.lines.read, 1)
print(words_question1 %>% filter(id == id.select))
## # A tibble: 3 × 3
##      id  qid1     word
##   <dbl> <dbl>    <chr>
## 1   445   886   answer
## 2   445   886 question
## 3   445   886    quora
print(words_question2 %>% filter(id == id.select))
## # A tibble: 3 × 3
##      id  qid2    word
##   <dbl> <dbl>   <chr>
## 1   445   887  you've
## 2   445   887 learned
## 3   445   887   quora

Vector Similarity

We can compute the similarity of two questions by summing the word vectors of each question and taking the normalized dot product (cosine similarity) of the two sums.

Getting the word vectors

The word vectors are predefined. We’ll use the 300-dimensional GloVe word vectors from Stanford. I store a copy locally to speed things up. A bit of processing is required to get the vectors into a simple-to-use data_frame format.

## Grab word vectors
number.of.words <- 9999
word.vector.file <- "glove.6B.300d.txt"
directory.wordvecs <- "/Users/winstonsaunders/Documents/presidential_debates_clinton_saunders/"

Processing the first 9999 GloVe word vectors from glove.6B.300d.txt took about 40 seconds to: 1. load the word vectors and 2. put them into usable form.
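
The loading code itself is not shown here; below is a minimal sketch (an assumption, not the original code) of one way to read the GloVe text file into a data_frame, with word.vector.df and word.list named to match how they are used in the utility functions below.

## read the first number.of.words GloVe vectors: each row is a word followed by 300 numbers
word.vector.df <- read_delim(str_c(directory.wordvecs, word.vector.file),
        delim = " ", col_names = FALSE, quote = "",
        n_max = number.of.words,
        col_types = cols(X1 = col_character(), .default = col_double()))

## character vector of words, used for look-ups
word.list <- word.vector.df$X1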

Using the Word Vectors

To use the word vectors, we define a couple of utility functions.

vectorize.word <- function(word.x = "test", w.list = word.list){
        ## look up the word vector for a single word
        ## inputs: 
        ##      word.x = sample word
        ##      w.list = character vector of words (word.list)
        ##  outputs:
        ##      vec = a word vector (one-row data_frame)
    
    vec <- 
        word.x %>% 
        paste0("^", ., "$") %>%
        grep(w.list) %>%
        slice(word.vector.df[ , -1], .) %>%
        as_data_frame %>%
        yo
    
    return(vec)          
}

After this it is a simple process to convert each question into a vector…
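
The question-to-vector function is not shown; here is a minimal sketch of how question.2.vec (the name is taken from its use in question.compare below) might sum the word vectors of one question, assuming vectorize.word and word.vector.df from above.

question.2.vec <- function(question.words){
        ## sum the word vectors of all words in one tidy-format question
        ## inputs:
        ##      question.words = data_frame with a "word" column
        ##  outputs:
        ##      vec = numeric vector (sum of the word vectors)
    
    vec <- rep(0, ncol(word.vector.df) - 1)
    for (w in question.words$word) {
        wv <- vectorize.word(w)
        if (nrow(wv) >= 1) vec <- vec + unlist(wv[1, ])
    }
    
    return(vec)
}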

And then do a vector product comparison.

question.compare <- function(x){
    ## compute the normalized dot product (cosine similarity) between the summed
    ## word vectors of question1 and question2 for question-pair id x
    
    dot.prod <- 0
    if(x %in% 1:max(words_question2$id)){

    question1 <- words_question1 %>% filter(id == x) %>% select(word) 
    question2 <- words_question2 %>% filter(id == x) %>% select(word) 
    
    v1 <- question.2.vec(question1)
    v2 <- question.2.vec(question2)
    
    ## normalized dot product (the small constant guards against division by zero)
    dot.prod <- sum(v1*v2)/sqrt((sum(v1*v1)*sum(v2*v2) + 0.001))
    }
    
    return(dot.prod)
}

Results

Visualization

The first way to look at the results is simply to plot the vector product of the two questions by question-pair id and color-code the data based on the “ground truth” known from the training data.
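
The plotting code is not shown; below is a sketch (assumed, not the original code) of computing the similarity for every pair with purrr and plotting it against the is_duplicate ground truth.

## compute the similarity score for each question-pair id
results <- train.data %>%
        select(id, is_duplicate) %>%
        mutate(similarity = map_dbl(id, question.compare))

## scatter plot of similarity by id, colored by ground truth
ggplot(results, aes(x = id, y = similarity, color = factor(is_duplicate))) +
        geom_point(alpha = 0.7)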

Distribution

We can get a little better idea by taking a histogram of the data, again grouping by the ground truth. This confirms what may be hard to see in the data above, namely that the model does indeed discriminate between duplicate and non-duplicate question pairs (though the distributions have considerable overlap).
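
A sketch of the histogram, again grouped by ground truth (using the results data frame assumed in the previous sketch):

ggplot(results, aes(x = similarity, fill = factor(is_duplicate))) +
        geom_histogram(position = "identity", alpha = 0.5, bins = 30)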

ROC Curve

Based on the observation above, we can forge ahead and compute an ROC curve.

## [1] 254
## [1] 172
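
The ROC computation is not shown; here is a sketch of one way to build the curve by sweeping a threshold over the similarity score (again assuming the results data frame from the earlier sketch).

## true and false positive rates over a range of thresholds
roc.data <- data_frame(threshold = seq(-1, 1, by = 0.01)) %>%
        mutate(tpr = map_dbl(threshold, ~ mean(results$similarity[results$is_duplicate == "1"] >= .x)),
               fpr = map_dbl(threshold, ~ mean(results$similarity[results$is_duplicate == "0"] >= .x)))

ggplot(roc.data, aes(x = fpr, y = tpr)) +
        geom_line() +
        geom_abline(intercept = 0, slope = 1, linetype = "dashed")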

Here we can see some reasonable discrimination between true and false positives.

The positive predictive value of this model is roughly 52%.
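
For reference, the positive predictive value is the fraction of predicted duplicates that are actually duplicates; a sketch of the calculation at a cutoff (the threshold value here is illustrative, not the one used above):

threshold <- 0.9   ## illustrative value only
predicted.dup <- results$similarity >= threshold
ppv <- sum(predicted.dup & results$is_duplicate == "1") / sum(predicted.dup)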

Conclusion

For a first pass, the word vector analysis shows some promise of having predictive value, especially if it can be combined with other predictors. The next steps are to explore refinements of, and additions to, the model.