Data 606 Final Assignment

Kai Lukowiak

2017-11-25

Final Assignment:

1. Introduction:

What is your research question? Why do you care? Why should others care?

Can basic machine learning techniques outperform a simple logistic regression?

Can we better predict the probability of an insurance customer filing a claim? This question is interesting because better predictions allow finer price discrimination: riskier and less risky individuals can be charged different premiums.

DATA 606 is a statistics course. While statistics covers many things, inference is a central objective of many statistical analyses. This differs from the objective of the company, which simply wants a model that predicts claims well. I want to understand whether the interpretability of a logistic regression can compensate for possibly poorer predictive performance.

Loading Packages:

library(dplyr)
library(tidyr)
library(data.table)
library(ggplot2)
library(ggthemes)
library(tibble)
library(knitr)
library(corrr)
library(corrplot)
library(caret)
library(xgboost)
library(MLmetrics)
library(ROCR)
library(lattice)
test <- as.tibble(fread("/Users/kailukowiak/Data606_Proposal/test.csv", na.strings = c("-1","-1.0")))
## 
Read 17.9% of 892816 rows
Read 58.2% of 892816 rows
Read 98.6% of 892816 rows
Read 892816 rows and 58 (of 58) columns from 0.160 GB file in 00:00:05
train <- as.tibble(fread("/Users/kailukowiak/Data606_Proposal/train.csv", na.strings = c("-1","-1.0")))
## 
Read 94.1% of 595212 rows
Read 595212 rows and 59 (of 59) columns from 0.108 GB file in 00:00:03
glimpse(train)
## Observations: 595,212
## Variables: 59
## $ id             <int> 7, 9, 13, 16, 17, 19, 20, 22, 26, 28, 34, 35, 3...
## $ target         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_01      <int> 2, 1, 5, 0, 0, 5, 2, 5, 5, 1, 5, 2, 2, 1, 5, 5,...
## $ ps_ind_02_cat  <int> 2, 1, 4, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,...
## $ ps_ind_03      <int> 5, 7, 9, 2, 0, 4, 3, 4, 3, 2, 2, 3, 1, 3, 11, 3...
## $ ps_ind_04_cat  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,...
## $ ps_ind_05_cat  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_06_bin  <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_07_bin  <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,...
## $ ps_ind_08_bin  <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
## $ ps_ind_09_bin  <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ ps_ind_10_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15      <int> 11, 3, 12, 8, 9, 6, 8, 13, 6, 4, 3, 9, 10, 12, ...
## $ ps_ind_16_bin  <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ ps_ind_17_bin  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ ps_reg_01      <dbl> 0.7, 0.8, 0.0, 0.9, 0.7, 0.9, 0.6, 0.7, 0.9, 0....
## $ ps_reg_02      <dbl> 0.2, 0.4, 0.0, 0.2, 0.6, 1.8, 0.1, 0.4, 0.7, 1....
## $ ps_reg_03      <dbl> 0.7180703, 0.7660777, NA, 0.5809475, 0.8407586,...
## $ ps_car_01_cat  <int> 10, 11, 7, 7, 11, 10, 6, 11, 10, 11, 11, 11, 6,...
## $ ps_car_02_cat  <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
## $ ps_car_03_cat  <int> NA, NA, NA, 0, NA, NA, NA, 0, NA, 0, NA, NA, NA...
## $ ps_car_04_cat  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8, 0, 0, 0, 0, 9,...
## $ ps_car_05_cat  <int> 1, NA, NA, 1, NA, 0, 1, 0, 1, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat  <int> 4, 11, 14, 11, 14, 14, 11, 11, 14, 14, 13, 11, ...
## $ ps_car_07_cat  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_08_cat  <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,...
## $ ps_car_09_cat  <int> 0, 2, 2, 3, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 2, 0,...
## $ ps_car_10_cat  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat  <int> 12, 19, 60, 104, 82, 104, 99, 30, 68, 104, 20, ...
## $ ps_car_11      <int> 2, 3, 1, 1, 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 1, 2,...
## $ ps_car_12      <dbl> 0.4000000, 0.3162278, 0.3162278, 0.3741657, 0.3...
## $ ps_car_13      <dbl> 0.8836789, 0.6188165, 0.6415857, 0.5429488, 0.5...
## $ ps_car_14      <dbl> 0.3708099, 0.3887158, 0.3472751, 0.2949576, 0.3...
## $ ps_car_15      <dbl> 3.605551, 2.449490, 3.316625, 2.000000, 2.00000...
## $ ps_calc_01     <dbl> 0.6, 0.3, 0.5, 0.6, 0.4, 0.7, 0.2, 0.1, 0.9, 0....
## $ ps_calc_02     <dbl> 0.5, 0.1, 0.7, 0.9, 0.6, 0.8, 0.6, 0.5, 0.8, 0....
## $ ps_calc_03     <dbl> 0.2, 0.3, 0.1, 0.1, 0.0, 0.4, 0.5, 0.1, 0.6, 0....
## $ ps_calc_04     <int> 3, 2, 2, 2, 2, 3, 2, 1, 3, 2, 2, 2, 4, 2, 3, 2,...
## $ ps_calc_05     <int> 1, 1, 2, 4, 2, 1, 2, 2, 1, 2, 3, 2, 1, 1, 1, 1,...
## $ ps_calc_06     <int> 10, 9, 9, 7, 6, 8, 8, 7, 7, 8, 8, 8, 8, 10, 8, ...
## $ ps_calc_07     <int> 1, 5, 1, 1, 3, 2, 1, 1, 3, 2, 2, 2, 4, 1, 2, 5,...
## $ ps_calc_08     <int> 10, 8, 8, 8, 10, 11, 8, 6, 9, 9, 9, 10, 11, 8, ...
## $ ps_calc_09     <int> 1, 1, 2, 4, 2, 3, 3, 1, 4, 1, 4, 1, 1, 3, 3, 2,...
## $ ps_calc_10     <int> 5, 7, 7, 2, 12, 8, 10, 13, 11, 11, 7, 8, 9, 8, ...
## $ ps_calc_11     <int> 9, 3, 4, 2, 3, 4, 3, 7, 4, 3, 6, 9, 6, 2, 4, 5,...
## $ ps_calc_12     <int> 1, 1, 2, 2, 1, 2, 0, 1, 2, 5, 3, 2, 3, 0, 1, 2,...
## $ ps_calc_13     <int> 5, 1, 7, 4, 1, 0, 0, 3, 1, 0, 3, 1, 3, 4, 3, 6,...
## $ ps_calc_14     <int> 8, 9, 7, 9, 3, 9, 10, 6, 5, 6, 6, 10, 8, 3, 9, ...
## $ ps_calc_15_bin <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,...
## $ ps_calc_17_bin <int> 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,...
## $ ps_calc_18_bin <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ ps_calc_19_bin <int> 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,...
## $ ps_calc_20_bin <int> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...

We must convert the categorical variables to factors before building the model matrix.

train <- train %>% 
  mutate_at(vars(contains('_cat')), .funs = as.factor) # Convert every *_cat column to a factor.
train <- model.matrix(~ . -1, data = train) # Expand factors into dummy columns (no intercept term).

2. Data:

Write about the data from your proposal in text form. Address the following points:

  • Data collection: Describe how the data were collected.

The data were downloaded as two zip files (train and test) from Kaggle. The website can be seen here.

  • Cases: What are the cases? (Remember: case = units of observation or units of experiment)

The cases are individual people who bought insurance and either did or did not file a claim.

  • Variables: What are the two variables you will be studying? State the type of each variable.

The target variable is categorical with two states: \[\text{target} \in \{0, 1\}\] As predictors I am going to use all of the other variables, which include both dummy (categorical/binary) and continuous variables.
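As a quick sanity check on the variable mix, we can count how many columns fall into each group from the column-name suffixes. This is only a rough sketch: the _cat/_bin suffixes are the only type information Kaggle provides, and it uses the untouched test tibble because train was already expanded by model.matrix above.

# Rough sketch: tally column types from the name suffixes.
tibble(column = names(test)) %>% 
  mutate(type = case_when(
    grepl("_cat$", column) ~ "categorical",
    grepl("_bin$", column) ~ "binary",
    column == "id"         ~ "id",
    TRUE                   ~ "continuous / ordinal"
  )) %>% 
  count(type)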

  • Type of study: What is the type of study, observational or an experiment? Explain how you’ve arrived at your conclusion using information on the sampling and/or experimental design.

Since there is no control group and no attempt to exploit natural variation (such as instrumental variables or regression discontinuity), this is an observational study that tries to predict, based on certain characteristics, which people are most likely to file an insurance claim.

  • Scope of inference - generalization: Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalization.

The population of interest is people in Brazil who purchase auto insurance. The results might be interpretable outside that population, but since there is no attempt to control for biases that lie outside it, the findings should only be interpreted with that population in mind.

Potential bias could easily arise from omitted variable bias (OVB): any omitted variable that correlates both with filing a claim and with the probability that an individual buys insurance would bias the results. There could also be selection bias, if conditions led different kinds of people to purchase insurance. Finally, given the likely differences in driving conditions, it will be difficult to translate findings from Brazil to dissimilar countries.

  • Scope of inference - causality: Can these data be used to establish causal links between the variables of interest?

No. It would be difficult to draw a causal conclusion between, say, age and claim filing because the data do not come from an experiment.

  • Explain why or why not.

Causality concerns the counterfactual: what would have happened otherwise. This study does not examine counterfactuals; instead it tries to predict which insurance customers will file a claim.

Ceteris paribus, would an individual have a different outcome if they were a different age? This is an impossible question to answer, since age is part of who a person is and (presumably) how they drive.

Further, issues like multicollinearity are less damaging in a classification problem than in a causal-inference problem. For example, suppose age and car type are highly correlated: young people both crash more often and buy used sports cars more often. If the correlation is very high, the claim probability that should be attributed mostly to age might instead be split between age and owning a used sports car. This matters much less for classification, because the new cases we predict on would mostly follow the same correlation, so the two variables are in effect summed back together; classification stays accurate even when inference about individual coefficients does not.

3. Exploratory data analysis:

Perform relevant descriptive statistics, including summary statistics and visualization of the data. Also address what the exploratory data analysis suggests about your research question.

Unsurprisingly, there are many more no-claims than claims.

train <- data.frame(train)
train %>% 
  select(target) %>% 
  group_by(target) %>% 
  summarise(ratio = n() / nrow(train) * 100) %>% 
  ggplot( aes(x = target, y = ratio))+
  geom_bar(stat = 'identity', fill = 'light blue') +
  ggtitle("Count of Claims vs No Claims") +
  xlab('Claim or No Claim') + ylab('Ratio (%)') + 
  geom_text(aes(label=round(ratio, 2)), position=position_dodge(width=0.9), vjust=-0.25)

This class imbalance means that, for some classification techniques at least, we would need to rebalance the data so that claims and no-claims are represented more equally.
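One simple option is to down-sample the majority class with caret's downSample() so that both classes are equally represented. This is only a sketch of the idea; the models below keep the raw class ratio.

# Hedged sketch: down-sample the no-claim class so both classes have equal counts.
# Not used for the models fit below.
balanced <- downSample(
  x = train[, setdiff(names(train), "target")],
  y = factor(train$target),
  yname = "target"
)
table(balanced$target)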

test = data.table(test)
naVals <- test %>%  
  select(which(colMeans(is.na(.)) > 0)) %>% 
  summarise_all(funs(sum(is.na(.))/n())) %>%
  gather(key = "Variable", value = "missingPercent")
ggplot(naVals, aes(x = reorder( Variable, missingPercent), y = missingPercent)) +
  geom_bar(stat = "identity", fill = 'light blue') +
  ylim(0,1) +
  ggtitle("Percentage of Non- Missing Values") +
  coord_flip() 

We can see that most values of ps_car_03_cat and ps_car_05_cat are missing. Other than that, most variables contain few if any NAs.
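One reasonable way to handle such columns (a sketch only; the xgboost model below instead passes the NAs straight to the algorithm) is to drop anything above a missingness threshold:

# Hedged sketch: drop columns with more than 40% missing values.
# The 40% threshold and the use of `test` here are illustrative.
naShare <- colMeans(is.na(test))
highNA  <- names(naShare[naShare > 0.4])
highNA  # should pick up ps_car_03_cat and ps_car_05_cat
testReduced <- test %>% select(-one_of(highNA))
dim(testReduced)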

corrDF <- train %>% 
  correlate() %>% 
  focus(target)
## Warning in stats::cor(x = x, y = y, use = use, method = method): the
## standard deviation is zero
ggplot(corrDF, aes(x = reorder(rowname, abs(target)), y = target)) + 
  geom_bar(stat = 'identity', fill = 'light blue') +
  coord_flip() +
  xlab('Variable') +
  ylab('Correlation with Target') +
  ggtitle('Correlation of the Dependent Variable with all Other Variables')
## Warning: Removed 3 rows containing missing values (position_stack).

We can see from this graph that no single variable is strongly correlated with the target.
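The earlier discussion of multicollinearity also suggests checking how correlated the predictors are with each other. A minimal sketch using the corrplot package loaded above; the column selection is illustrative and restricted to a few continuous features.

# Hedged sketch: pairwise correlations among selected continuous features,
# to eyeball potential multicollinearity. Column choice is illustrative.
numVars <- train %>% 
  select(starts_with("ps_reg"), ps_car_12, ps_car_13, ps_car_14, ps_car_15)
corMat <- cor(numVars, use = "pairwise.complete.obs")
corrplot(corMat, method = "color", type = "upper", tl.cex = 0.7)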

Inference:

If your data fails some conditions and you can’t use a theoretical method, then you should use simulation. If you can use both methods, then you should use both methods. It is your responsibility to figure out the appropriate methodology.

While we can make some inference from the logistic regression (although we must be careful not to imply causality), we cannot make statistical inference from the xgboost algorithm.

Train test split

trainIndex <- createDataPartition(train$target, p = 0.7, list = F, times = 1)
trainTrain <- train[trainIndex,]
trainTest <- train[-trainIndex,]
mod <- glm(formula = target ~ ., family = binomial(link = "logit"), 
    data = trainTrain)
mod
## 
## Call:  glm(formula = target ~ ., family = binomial(link = "logit"), 
##     data = trainTrain)
## 
## Coefficients:
##      (Intercept)                id         ps_ind_01    ps_ind_02_cat1  
##       -3.806e+00         3.312e-08         6.220e-03        -1.282e-01  
##   ps_ind_02_cat2    ps_ind_02_cat3    ps_ind_02_cat4         ps_ind_03  
##       -1.113e-01        -2.696e-01                NA         3.279e-02  
##   ps_ind_04_cat1    ps_ind_05_cat1    ps_ind_05_cat2    ps_ind_05_cat3  
##        5.665e-02         3.248e-01         6.443e-01         3.522e-01  
##   ps_ind_05_cat4    ps_ind_05_cat5    ps_ind_05_cat6     ps_ind_06_bin  
##        5.725e-01         4.838e-01         5.753e-01         1.148e-02  
##    ps_ind_07_bin     ps_ind_08_bin     ps_ind_09_bin     ps_ind_10_bin  
##        2.637e-01         2.661e-01                NA        -7.478e-01  
##    ps_ind_11_bin     ps_ind_12_bin     ps_ind_13_bin         ps_ind_14  
##       -1.288e-01         9.714e-02         1.904e-01                NA  
##        ps_ind_15     ps_ind_16_bin     ps_ind_17_bin     ps_ind_18_bin  
##       -2.217e-02        -1.112e-01         2.644e-01        -2.202e-01  
##        ps_reg_01         ps_reg_02         ps_reg_03    ps_car_01_cat1  
##       -3.444e-02         6.157e-02         1.263e-01        -3.865e-01  
##   ps_car_01_cat2    ps_car_01_cat3    ps_car_01_cat4    ps_car_01_cat5  
##        9.661e-04        -1.141e-01         1.260e-01        -3.548e-01  
##   ps_car_01_cat6    ps_car_01_cat7    ps_car_01_cat8    ps_car_01_cat9  
##       -2.414e-01        -4.331e-01        -1.799e-01        -1.827e-01  
##  ps_car_01_cat10   ps_car_01_cat11    ps_car_02_cat1    ps_car_03_cat1  
##       -2.172e-01        -3.128e-01        -4.783e-02         1.026e-01  
##   ps_car_04_cat1    ps_car_04_cat2    ps_car_04_cat3    ps_car_04_cat4  
##       -3.535e-01         1.490e-01        -5.769e-02        -7.347e-01  
##   ps_car_04_cat5    ps_car_04_cat6    ps_car_04_cat7    ps_car_04_cat8  
##       -7.973e-01        -3.797e-01        -5.909e-01        -1.264e-01  
##   ps_car_04_cat9    ps_car_05_cat1    ps_car_06_cat1    ps_car_06_cat2  
##       -5.946e-01        -6.216e-03        -1.102e-01        -4.305e-01  
##   ps_car_06_cat3    ps_car_06_cat4    ps_car_06_cat5    ps_car_06_cat6  
##       -2.376e-01        -8.722e-02         5.912e-01        -1.566e-01  
##   ps_car_06_cat7    ps_car_06_cat8    ps_car_06_cat9   ps_car_06_cat10  
##       -1.604e-02         1.007e-01         1.552e-01         4.971e-01  
##  ps_car_06_cat11   ps_car_06_cat12   ps_car_06_cat13   ps_car_06_cat14  
##        1.014e-01         7.529e-01        -3.408e-02        -6.141e-02  
##  ps_car_06_cat15   ps_car_06_cat16   ps_car_06_cat17    ps_car_07_cat1  
##        3.776e-01         2.844e-01         3.782e-01        -2.475e-01  
##   ps_car_08_cat1    ps_car_09_cat1    ps_car_09_cat2    ps_car_09_cat3  
##        6.581e-02         3.310e-01         1.111e-01         8.125e-02  
##   ps_car_09_cat4    ps_car_10_cat1    ps_car_10_cat2    ps_car_11_cat2  
##        7.597e-01        -1.410e-01        -3.719e-01                NA  
##   ps_car_11_cat3    ps_car_11_cat4    ps_car_11_cat5    ps_car_11_cat6  
##       -2.247e-01        -5.275e-01        -7.489e-01        -4.942e-01  
##   ps_car_11_cat7    ps_car_11_cat8    ps_car_11_cat9   ps_car_11_cat10  
##       -8.208e-01        -2.190e-01        -2.467e-01        -1.104e-01  
##  ps_car_11_cat11   ps_car_11_cat12   ps_car_11_cat13   ps_car_11_cat14  
##       -4.429e-01        -1.621e-01        -3.392e-01        -7.807e-01  
##  ps_car_11_cat15   ps_car_11_cat16   ps_car_11_cat17   ps_car_11_cat18  
##       -3.228e-01        -5.317e-01        -4.014e-01         5.331e-02  
##  ps_car_11_cat19   ps_car_11_cat20   ps_car_11_cat21   ps_car_11_cat22  
##       -5.022e-01        -2.208e-01         1.548e-01        -3.003e-01  
##  ps_car_11_cat23   ps_car_11_cat24   ps_car_11_cat25   ps_car_11_cat26  
##       -3.874e-01        -1.682e-02                NA         2.185e-01  
##  ps_car_11_cat27   ps_car_11_cat28   ps_car_11_cat29   ps_car_11_cat30  
##       -1.527e-01        -6.873e-01        -3.651e-01        -8.361e-01  
##  ps_car_11_cat31   ps_car_11_cat32   ps_car_11_cat33   ps_car_11_cat34  
##        5.086e-02        -4.128e-01        -1.325e-01        -1.062e-01  
##  ps_car_11_cat35   ps_car_11_cat36   ps_car_11_cat37   ps_car_11_cat38  
##        1.792e-01        -5.515e-01         3.507e-02         3.132e-02  
##  ps_car_11_cat39   ps_car_11_cat40   ps_car_11_cat41   ps_car_11_cat42  
##       -5.643e-01        -5.616e-01         1.447e-01        -3.302e-01  
##  ps_car_11_cat43   ps_car_11_cat44   ps_car_11_cat45   ps_car_11_cat46  
##       -4.487e-01        -5.290e-01         3.147e-01        -2.789e-01  
##  ps_car_11_cat47   ps_car_11_cat48   ps_car_11_cat49   ps_car_11_cat50  
##       -1.190e-01         1.569e-01        -1.161e+00        -4.642e-02  
##  ps_car_11_cat51   ps_car_11_cat52   ps_car_11_cat53   ps_car_11_cat54  
##       -3.362e-02        -4.491e-01        -6.950e-01        -3.264e-01  
##  ps_car_11_cat55   ps_car_11_cat56   ps_car_11_cat57   ps_car_11_cat58  
##       -6.994e-01         1.211e-01        -5.289e-01        -8.958e-02  
##  ps_car_11_cat59   ps_car_11_cat60   ps_car_11_cat61   ps_car_11_cat62  
##        4.719e-01        -5.023e-02         5.942e-02        -5.955e-01  
##  ps_car_11_cat63   ps_car_11_cat64   ps_car_11_cat65   ps_car_11_cat66  
##       -2.641e-01        -2.881e-01        -4.388e-02        -1.000e+00  
##  ps_car_11_cat67   ps_car_11_cat68   ps_car_11_cat69   ps_car_11_cat70  
##       -2.888e-01        -2.896e-01        -4.034e-01        -2.850e-01  
##  ps_car_11_cat71   ps_car_11_cat72   ps_car_11_cat73   ps_car_11_cat74  
##       -5.965e-01        -2.762e-01        -7.765e-01        -5.291e-02  
##  ps_car_11_cat75   ps_car_11_cat76   ps_car_11_cat77   ps_car_11_cat78  
##        9.830e-03        -5.934e-02        -7.501e-01        -5.279e-01  
##  ps_car_11_cat79   ps_car_11_cat80   ps_car_11_cat81   ps_car_11_cat82  
##       -3.073e-02                NA        -2.517e-01        -1.446e-01  
##  ps_car_11_cat83   ps_car_11_cat84   ps_car_11_cat85   ps_car_11_cat86  
##       -6.870e-01        -9.014e-02        -6.007e-01        -1.805e-01  
##  ps_car_11_cat87   ps_car_11_cat88   ps_car_11_cat89   ps_car_11_cat90  
##       -2.445e-01        -1.325e-01        -1.151e+00         1.260e-01  
##  ps_car_11_cat91   ps_car_11_cat92   ps_car_11_cat93   ps_car_11_cat94  
##       -3.520e-02        -1.241e-01        -1.520e-01        -3.124e-01  
##  ps_car_11_cat95   ps_car_11_cat96   ps_car_11_cat97   ps_car_11_cat98  
##       -9.880e-01        -4.831e-01        -4.869e-02        -3.161e-01  
##  ps_car_11_cat99  ps_car_11_cat100  ps_car_11_cat101  ps_car_11_cat102  
##       -4.018e-01        -1.286e-01        -4.501e-01         6.285e-02  
## ps_car_11_cat103  ps_car_11_cat104         ps_car_11         ps_car_12  
##       -4.157e-01        -2.443e-01         1.434e-02         1.259e+00  
##        ps_car_13         ps_car_14         ps_car_15        ps_calc_01  
##        4.893e-01         1.841e-01         7.361e-03         7.703e-02  
##       ps_calc_02        ps_calc_03        ps_calc_04        ps_calc_05  
##        3.833e-02         6.312e-02         1.917e-02         1.301e-02  
##       ps_calc_06        ps_calc_07        ps_calc_08        ps_calc_09  
##        1.205e-02         5.781e-03         5.890e-03         6.986e-03  
##       ps_calc_10        ps_calc_11        ps_calc_12        ps_calc_13  
##        2.511e-03        -4.183e-03        -1.257e-02        -1.345e-02  
##       ps_calc_14    ps_calc_15_bin    ps_calc_16_bin    ps_calc_17_bin  
##        3.771e-03        -6.713e-02         2.869e-02        -4.733e-02  
##   ps_calc_18_bin    ps_calc_19_bin    ps_calc_20_bin  
##        2.710e-02        -5.414e-02        -6.883e-02  
## 
## Degrees of Freedom: 87451 Total (i.e. Null);  87251 Residual
## Null Deviance:       32360 
## Residual Deviance: 31410     AIC: 31810
p <- predict(mod, newdata = trainTest[, -which(names(trainTest) == "target")], type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pr <- prediction(p, trainTest$target)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]

auc
## [1] 0.6133971

That AUC doesn’t look very good.

Let’s see if we can improve that with a better model.

XGBoost:

XGBoost is a gradient-boosted tree package that tends to perform very well on tabular prediction problems like this one.

# xgb_normalizedgini <- function(preds, dtrain){
#   actual <- getinfo(dtrain, "label")
#   score <- NormalizedGini(preds,actual)
#   return(list(metric = "NormalizedGini", value = score))
# }
# 
# param <- list(booster="gbtree",
#               objective="binary:logistic",
#               eta = 0.02,
#               gamma = 1,
#               max_depth = 6,
#               min_child_weight = 1,
#               subsample = 0.8,
#               colsample_bytree = 0.8
# )


set.seed(101)
param = list(
  objective = "binary:logistic", # Because there are only two categories
  eval_metric = "auc",           # Metric used by the competition
  subsample = 0.8,
  gamma = 1,
  colsample_bytree = 0.8,
  max_depth = 6,
  min_child_weight = 1,
  tree_method = "auto",
  eta = 0.02,
  nthread = 8
)

# Design matrices for xgboost; column 2 of the training data is the target,
# which is dropped from the features and supplied as the label instead.
x_train <- xgb.DMatrix(
    as.matrix(trainTrain[,-2]),
    label = trainTrain$target, 
    missing = NaN)

x_val <- xgb.DMatrix(
    as.matrix(trainTest[,-2]), 
    label = trainTest$target,
    missing = NaN)
x_test <- xgb.DMatrix(as.matrix(test), missing = NaN)


model <- xgb.train(
    data = x_train,
    nrounds = 400, 
    params = param,  
    maximize = TRUE,
    watchlist = list(val = x_val),
    print_every_n = 10
  )
## [1]  val-auc:0.581973 
## [11] val-auc:0.603465 
## [21] val-auc:0.604384 
## [31] val-auc:0.604154 
## [41] val-auc:0.604657 
## [51] val-auc:0.607109 
## [61] val-auc:0.608757 
## [71] val-auc:0.609682 
## [81] val-auc:0.610687 
## [91] val-auc:0.611023 
## [101]    val-auc:0.613474 
## [111]    val-auc:0.614051 
## [121]    val-auc:0.614805 
## [131]    val-auc:0.615449 
## [141]    val-auc:0.615297 
## [151]    val-auc:0.615624 
## [161]    val-auc:0.617181 
## [171]    val-auc:0.617997 
## [181]    val-auc:0.618855 
## [191]    val-auc:0.619456 
## [201]    val-auc:0.619822 
## [211]    val-auc:0.620351 
## [221]    val-auc:0.620734 
## [231]    val-auc:0.620938 
## [241]    val-auc:0.621467 
## [251]    val-auc:0.622909 
## [261]    val-auc:0.623384 
## [271]    val-auc:0.623323 
## [281]    val-auc:0.623746 
## [291]    val-auc:0.623751 
## [301]    val-auc:0.623942 
## [311]    val-auc:0.624052 
## [321]    val-auc:0.624465 
## [331]    val-auc:0.624078 
## [341]    val-auc:0.623697 
## [351]    val-auc:0.623504 
## [361]    val-auc:0.623350 
## [371]    val-auc:0.622776 
## [381]    val-auc:0.622678 
## [391]    val-auc:0.622568 
## [400]    val-auc:0.622236
pred_3_e <- predict(model, x_val)
pred_3_t <- predict(model, x_test)
bet  <- Gini(pred_3_e, trainTest$target) # Gini for the xgboost validation predictions
bet1 <- auc * 2 - 1                      # Gini = 2 * AUC - 1 for the logistic regression

We achieved our goal of improving on the baseline: the xgboost model reaches a Gini coefficient of 0.2444722 versus 0.2267942 for the basic logistic regression. For context, the winning Kaggle result was 0.29698. An improvement of 0.0176781 may not look impressive, but given that even the winner sits only 0.0525078 above our boosted model, any improvement over the baseline is meaningful. In relative terms, basic xgboost gives a 7.79% increase in Gini over logistic regression (not to be confused with accuracy per se).

Note that after roughly 100 boosting rounds the validation AUC is still only around 0.613–0.616, essentially the same as the simple logistic regression; the boosted model needs several hundred more rounds to pull clearly ahead.
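The validation log above also shows the AUC peaking around round 320 and then slowly declining, so on a re-run it would make sense to let xgboost stop itself rather than fixing 400 rounds. A minimal sketch, reusing the param object defined above:

# Hedged sketch: same setup, but stop once val-auc has not improved for 50 rounds.
model_es <- xgb.train(
  data = x_train,
  params = param,
  nrounds = 1000,
  watchlist = list(val = x_val),
  maximize = TRUE,
  early_stopping_rounds = 50,
  print_every_n = 10
)
model_es$best_iteration # round with the best validation AUC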

  • Check conditions
qqnorm(mod$residuals)
qqline(mod$residuals)

Wooo, that did not pass the normality check. However, since we used a logistic regression, normally distributed residuals are not expected; in fact, the pattern almost looks like a logistic curve, which is comforting.
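For a logistic regression, a binned residual plot is usually more informative than a normal QQ plot. A rough base-R sketch (the number of bins is arbitrary):

# Hedged sketch: average response residuals within bins of fitted probability;
# for a well-calibrated model the bin means should hover around zero.
fitted_p <- fitted(mod)
resid_r  <- residuals(mod, type = "response")
bins     <- cut(fitted_p,
                breaks = unique(quantile(fitted_p, probs = seq(0, 1, 0.05))),
                include.lowest = TRUE)
binned_p <- tapply(fitted_p, bins, mean)
binned_r <- tapply(resid_r, bins, mean)
plot(binned_p, binned_r, xlab = "Mean fitted probability",
     ylab = "Mean residual", main = "Binned residual plot")
abline(h = 0, lty = 2)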

  • Theoretical inference (if possible) - hypothesis test and confidence interval

Theoretical inference is possible from the logistic regression results; however, in order to maintain privacy, the competition did not give variable names that correspond to real-world attributes. So while confidence intervals exist, we don't actually know what they correspond to.
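That said, the intervals themselves are easy to pull from the fitted model. A sketch using Wald intervals via confint.default(), since profile-likelihood confint() is slow on a model with this many coefficients:

# Hedged sketch: Wald 95% confidence intervals, exponentiated to the odds-ratio scale.
# Coefficients that were aliased (NA) in the rank-deficient fit stay NA here.
ci <- exp(cbind(OR = coef(mod), confint.default(mod)))
head(ci, 10)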

  • Brief description of methodology that reflects your conceptual understanding

We fit two models: a logistic regression and a boosted-tree (xgboost) model. Their results were compared using the Gini coefficient on a held-out test set.

Conclusion:

While the ML algorithm had a better Gini score, it was much more expensive to compute and offers little in the way of interpretability. Considering how close the logistic regression came to the winning Kaggle Gini score, it does not appear that those who value interpretability (e.g., policy makers) should be too concerned about using the "inferior" algorithm. However, since the company only cares about prediction, it should use the most cutting-edge algorithm available, such as XGBoost or a neural network.