Final Assignment:
1. Introduction:
What is your research question? Why do you care? Why should others care?
Can basic machine learning techniques outperform simple logistic regression?
Can we better predict the probability of an insurance customer filing a claim? This is an interesting question because better predictions allow finer price discrimination: risky and less risky individuals can be charged different prices.
DATA 606 is a statistics course. While statistics means many things, statistical inference is a very important objective of many statistical analyses. This differs from the objective of the company, which simply wants a model that predicts claims well. I want to understand whether the interpretability of a logistic regression can compensate for possibly poorer predictive performance.
Loading Packages:
library(dplyr)
library(tidyr)
library(data.table)
library(ggplot2)
library(ggthemes)
library(tibble)
library(knitr)
library(corrr)
library(corrplot)
library(caret)
library(xgboost)
library(MLmetrics)
library(ROCR)
library(lattice)
test <- as.tibble(fread("/Users/kailukowiak/Data606_Proposal/test.csv", na.strings = c("-1","-1.0")))
## Read 892816 rows and 58 (of 58) columns from 0.160 GB file in 00:00:05
train <- as.tibble(fread("/Users/kailukowiak/Data606_Proposal/train.csv", na.strings = c("-1","-1.0")))
## Read 595212 rows and 59 (of 59) columns from 0.108 GB file in 00:00:03
glimpse(train)
## Observations: 595,212
## Variables: 59
## $ id <int> 7, 9, 13, 16, 17, 19, 20, 22, 26, 28, 34, 35, 3...
## $ target <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_01 <int> 2, 1, 5, 0, 0, 5, 2, 5, 5, 1, 5, 2, 2, 1, 5, 5,...
## $ ps_ind_02_cat <int> 2, 1, 4, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,...
## $ ps_ind_03 <int> 5, 7, 9, 2, 0, 4, 3, 4, 3, 2, 2, 3, 1, 3, 11, 3...
## $ ps_ind_04_cat <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,...
## $ ps_ind_05_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_06_bin <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_07_bin <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,...
## $ ps_ind_08_bin <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
## $ ps_ind_09_bin <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ ps_ind_10_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15 <int> 11, 3, 12, 8, 9, 6, 8, 13, 6, 4, 3, 9, 10, 12, ...
## $ ps_ind_16_bin <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ ps_ind_17_bin <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ ps_reg_01 <dbl> 0.7, 0.8, 0.0, 0.9, 0.7, 0.9, 0.6, 0.7, 0.9, 0....
## $ ps_reg_02 <dbl> 0.2, 0.4, 0.0, 0.2, 0.6, 1.8, 0.1, 0.4, 0.7, 1....
## $ ps_reg_03 <dbl> 0.7180703, 0.7660777, NA, 0.5809475, 0.8407586,...
## $ ps_car_01_cat <int> 10, 11, 7, 7, 11, 10, 6, 11, 10, 11, 11, 11, 6,...
## $ ps_car_02_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
## $ ps_car_03_cat <int> NA, NA, NA, 0, NA, NA, NA, 0, NA, 0, NA, NA, NA...
## $ ps_car_04_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8, 0, 0, 0, 0, 9,...
## $ ps_car_05_cat <int> 1, NA, NA, 1, NA, 0, 1, 0, 1, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat <int> 4, 11, 14, 11, 14, 14, 11, 11, 14, 14, 13, 11, ...
## $ ps_car_07_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_08_cat <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,...
## $ ps_car_09_cat <int> 0, 2, 2, 3, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 2, 0,...
## $ ps_car_10_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat <int> 12, 19, 60, 104, 82, 104, 99, 30, 68, 104, 20, ...
## $ ps_car_11 <int> 2, 3, 1, 1, 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 1, 2,...
## $ ps_car_12 <dbl> 0.4000000, 0.3162278, 0.3162278, 0.3741657, 0.3...
## $ ps_car_13 <dbl> 0.8836789, 0.6188165, 0.6415857, 0.5429488, 0.5...
## $ ps_car_14 <dbl> 0.3708099, 0.3887158, 0.3472751, 0.2949576, 0.3...
## $ ps_car_15 <dbl> 3.605551, 2.449490, 3.316625, 2.000000, 2.00000...
## $ ps_calc_01 <dbl> 0.6, 0.3, 0.5, 0.6, 0.4, 0.7, 0.2, 0.1, 0.9, 0....
## $ ps_calc_02 <dbl> 0.5, 0.1, 0.7, 0.9, 0.6, 0.8, 0.6, 0.5, 0.8, 0....
## $ ps_calc_03 <dbl> 0.2, 0.3, 0.1, 0.1, 0.0, 0.4, 0.5, 0.1, 0.6, 0....
## $ ps_calc_04 <int> 3, 2, 2, 2, 2, 3, 2, 1, 3, 2, 2, 2, 4, 2, 3, 2,...
## $ ps_calc_05 <int> 1, 1, 2, 4, 2, 1, 2, 2, 1, 2, 3, 2, 1, 1, 1, 1,...
## $ ps_calc_06 <int> 10, 9, 9, 7, 6, 8, 8, 7, 7, 8, 8, 8, 8, 10, 8, ...
## $ ps_calc_07 <int> 1, 5, 1, 1, 3, 2, 1, 1, 3, 2, 2, 2, 4, 1, 2, 5,...
## $ ps_calc_08 <int> 10, 8, 8, 8, 10, 11, 8, 6, 9, 9, 9, 10, 11, 8, ...
## $ ps_calc_09 <int> 1, 1, 2, 4, 2, 3, 3, 1, 4, 1, 4, 1, 1, 3, 3, 2,...
## $ ps_calc_10 <int> 5, 7, 7, 2, 12, 8, 10, 13, 11, 11, 7, 8, 9, 8, ...
## $ ps_calc_11 <int> 9, 3, 4, 2, 3, 4, 3, 7, 4, 3, 6, 9, 6, 2, 4, 5,...
## $ ps_calc_12 <int> 1, 1, 2, 2, 1, 2, 0, 1, 2, 5, 3, 2, 3, 0, 1, 2,...
## $ ps_calc_13 <int> 5, 1, 7, 4, 1, 0, 0, 3, 1, 0, 3, 1, 3, 4, 3, 6,...
## $ ps_calc_14 <int> 8, 9, 7, 9, 3, 9, 10, 6, 5, 6, 6, 10, 8, 3, 9, ...
## $ ps_calc_15_bin <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,...
## $ ps_calc_17_bin <int> 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,...
## $ ps_calc_18_bin <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ ps_calc_19_bin <int> 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,...
## $ ps_calc_20_bin <int> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
We must convert the categorical variables (the _cat columns) to factors.
train <- train %>%
mutate_at(vars(contains('_cat')), .funs = as.factor) #Sets all categories to factor.
train <- model.matrix(~ . -1, data = train)
2. Data:
Write about the data from your proposal in text form. Address the following points:
- Data collection: Describe how the data were collected.
The data were easily collected: they were downloaded as two zip files from the Kaggle competition website.
- Cases: What are the cases? (Remember: case = units of observation or units of experiment)
The cases are individual people who bought insurance and either did or did not file a claim.
- Variables: What are the two variables you will be studying? State the type of each variable.
The target variable is a categorical variable with two states, \[\text{target} \in \{0, 1\}\] As predictors I will use all of the other variables, which include dummy (binary), categorical, and continuous variables.
- Type of study: What is the type of study, observational or an experiment? Explain how you’ve arrived at your conclusion using information on the sampling and/or experimental design.
Since there is no control group and no attempt to exploit natural variation (for example, instrumental variables or regression discontinuity), this is an observational study that tries to predict, based on certain characteristics, which people are most likely to file an insurance claim.
- Scope of inference - generalization: Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalization.
The population of interest is people in Brazil who need auto insurance. There may be some room to interpret the results outside of this population; however, since there is no attempt to control for biases that lie outside the population of interest, the results should be interpreted only with that population in mind.
Potential bias could easily arise from an omitted variable (omitted variable bias, OVB). OVB could stem from any variable that correlates with both filing a claim and the probability that an individual buys insurance. Alternatively, there could be selection bias: results would be biased if conditions led different kinds of people to purchase insurance. Also, given the probable differences in driving conditions in Brazil, it will be difficult to translate the findings to countries that are not similar.
- Scope of inference - causality: Can these data be used to establish causal links between the variables of interest?
No. It would be difficult to draw a causal conclusion between, say, age and claim filing because the data do not come from an experiment.
- Explain why or why not.
Causality asks what would happen to the counterfactual. This study does not examine counterfactuals; instead, it tries to predict which insurance customers will make a claim.
Ceteris paribus, would an individual have a different outcome if they were a different age? This is an impossible question to answer, since age makes up part of who a person is and (presumably) how they drive.
Further, issues like multicollinearity have less harmful effects on a classification problem than on a causal-inference problem. For example, if age and car type are highly correlated, there can be errors of attribution: young people get into more car crashes and also buy more used sports cars. If this correlation is very high, a claim probability that is mostly driven by age might be split between age and used sports cars. In a classification setting this matters far less, because most of the people we predict on follow the same correlation, so the two variables are in effect summed, leading to accurate classification even when the individual attributions would be unreliable for inference.
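To make this concrete, here is a small simulated sketch (hypothetical variables age and sports_car, not the competition data): when two predictors are nearly collinear, the coefficient attribution is split between them, yet the fitted probabilities barely change.
# Simulated illustration: `age` drives claims; `sports_car` is almost a copy of age.
set.seed(1)
n          <- 5000
age        <- rnorm(n)
sports_car <- age + rnorm(n, sd = 0.1)        # nearly collinear with age
y          <- rbinom(n, 1, plogis(-2 + age))  # the true effect comes from age only
m_age  <- glm(y ~ age,              family = binomial)
m_both <- glm(y ~ age + sports_car, family = binomial)
coef(m_age)                          # effect attributed entirely to age
coef(m_both)                         # effect split (noisily) across both predictors
cor(fitted(m_age), fitted(m_both))   # yet the fitted probabilities barely change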
3. Exploratory data analysis:
Perform relevant descriptive statistics, including summary statistics and visualization of the data. Also address what the exploratory data analysis suggests about your research question.
Unsurprisingly, there are many more no-claims than claims.
train <- data.frame(train)
train %>%
select(target) %>%
group_by(target) %>%
summarise(ratio = n() / nrow(train) * 100) %>%
ggplot( aes(x = target, y = ratio))+
geom_bar(stat = 'identity', fill = 'light blue') +
ggtitle("Count of Claims vs No Claims") +
xlab('Claim or No Claim') + ylab('Ratio (%)') +
geom_text(aes(label = round(ratio, 2)), position = position_dodge(width = 0.9), vjust = -0.25)
This means we may have to rebalance the data so that the number of claims matches the number of no-claims (at least for some classification techniques), as sketched below.
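One possible way to rebalance, sketched here with caret's downSample() but not applied to the models later in this report, is to down-sample the majority class so that claims and no-claims appear in equal numbers.
# Sketch only: down-sample no-claims so both classes are the same size.
balanced <- downSample(x = select(train, -target),
                       y = factor(train$target),
                       yname = "target")
table(balanced$target)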
test = data.table(test)
naVals <- test %>%
select(which(colMeans(is.na(.)) > 0)) %>%
summarise_all(funs(sum(is.na(.))/n())) %>%
gather(key = "Variable", value = "missingPercent")
ggplot(naVals, aes(x = reorder(Variable, missingPercent), y = missingPercent)) +
geom_bar(stat = "identity", fill = 'light blue') +
ylim(0,1) +
ggtitle("Percentage of Non- Missing Values") +
coord_flip()
We can see that most of ps_car_03_cat and ps_car_05_cat is missing. Other than that, most variables contain few if any NAs.
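A minimal sketch of one way to handle this, not applied in the analysis above: drop the two mostly-missing columns from the raw data and median-impute the remaining NAs.
# Sketch only: drop the heavily missing columns, then median-impute what is left.
test_imputed <- test %>%
  select(-ps_car_03_cat, -ps_car_05_cat) %>%
  mutate_all(funs(ifelse(is.na(.), median(., na.rm = TRUE), .)))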
corrDF <- train %>%
correlate() %>%
focus(target)
## Warning in stats::cor(x = x, y = y, use = use, method = method): the
## standard deviation is zero
ggplot(corrDF, aes(x =reorder(rowname, abs(target)), y = target)) +
geom_bar(stat = 'identity', fill = 'light blue') +
coord_flip()+
ylab('Correlation with Target') +
xlab('Variable') +
ggtitle('Correlation of the Dependent Variable with all Other Variables')
## Warning: Removed 3 rows containing missing values (position_stack).
We can see from this graph that no single variable is strongly correlated with the target.
Inference:
If your data fails some conditions and you can’t use a theoretical method, then you should use simulation. If you can use both methods, then you should use both methods. It is your responsibility to figure out the appropriate methodology.
While we can make some inferences from the logistic regression (although we must be careful not to imply causality), we cannot make statistical inferences from the XGBoost algorithm.
Train/test split:
trainIndex <- createDataPartition(train$target, p = 0.7, list = F, times = 1)
trainTrain <- train[trainIndex,]
trainTest <- train[-trainIndex,]
mod <- glm(formula = target ~ ., family = binomial(link = "logit"),
           data = trainTrain)
mod
##
## Call: glm(formula = target ~ ., family = binomial(link = "logit"),
## data = trainTrain)
##
## Coefficients:
## (Intercept) id ps_ind_01 ps_ind_02_cat1
## -3.806e+00 3.312e-08 6.220e-03 -1.282e-01
## ps_ind_02_cat2 ps_ind_02_cat3 ps_ind_02_cat4 ps_ind_03
## -1.113e-01 -2.696e-01 NA 3.279e-02
## ps_ind_04_cat1 ps_ind_05_cat1 ps_ind_05_cat2 ps_ind_05_cat3
## 5.665e-02 3.248e-01 6.443e-01 3.522e-01
## ps_ind_05_cat4 ps_ind_05_cat5 ps_ind_05_cat6 ps_ind_06_bin
## 5.725e-01 4.838e-01 5.753e-01 1.148e-02
## ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin
## 2.637e-01 2.661e-01 NA -7.478e-01
## ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14
## -1.288e-01 9.714e-02 1.904e-01 NA
## ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin
## -2.217e-02 -1.112e-01 2.644e-01 -2.202e-01
## ps_reg_01 ps_reg_02 ps_reg_03 ps_car_01_cat1
## -3.444e-02 6.157e-02 1.263e-01 -3.865e-01
## ps_car_01_cat2 ps_car_01_cat3 ps_car_01_cat4 ps_car_01_cat5
## 9.661e-04 -1.141e-01 1.260e-01 -3.548e-01
## ps_car_01_cat6 ps_car_01_cat7 ps_car_01_cat8 ps_car_01_cat9
## -2.414e-01 -4.331e-01 -1.799e-01 -1.827e-01
## ps_car_01_cat10 ps_car_01_cat11 ps_car_02_cat1 ps_car_03_cat1
## -2.172e-01 -3.128e-01 -4.783e-02 1.026e-01
## ps_car_04_cat1 ps_car_04_cat2 ps_car_04_cat3 ps_car_04_cat4
## -3.535e-01 1.490e-01 -5.769e-02 -7.347e-01
## ps_car_04_cat5 ps_car_04_cat6 ps_car_04_cat7 ps_car_04_cat8
## -7.973e-01 -3.797e-01 -5.909e-01 -1.264e-01
## ps_car_04_cat9 ps_car_05_cat1 ps_car_06_cat1 ps_car_06_cat2
## -5.946e-01 -6.216e-03 -1.102e-01 -4.305e-01
## ps_car_06_cat3 ps_car_06_cat4 ps_car_06_cat5 ps_car_06_cat6
## -2.376e-01 -8.722e-02 5.912e-01 -1.566e-01
## ps_car_06_cat7 ps_car_06_cat8 ps_car_06_cat9 ps_car_06_cat10
## -1.604e-02 1.007e-01 1.552e-01 4.971e-01
## ps_car_06_cat11 ps_car_06_cat12 ps_car_06_cat13 ps_car_06_cat14
## 1.014e-01 7.529e-01 -3.408e-02 -6.141e-02
## ps_car_06_cat15 ps_car_06_cat16 ps_car_06_cat17 ps_car_07_cat1
## 3.776e-01 2.844e-01 3.782e-01 -2.475e-01
## ps_car_08_cat1 ps_car_09_cat1 ps_car_09_cat2 ps_car_09_cat3
## 6.581e-02 3.310e-01 1.111e-01 8.125e-02
## ps_car_09_cat4 ps_car_10_cat1 ps_car_10_cat2 ps_car_11_cat2
## 7.597e-01 -1.410e-01 -3.719e-01 NA
## ps_car_11_cat3 ps_car_11_cat4 ps_car_11_cat5 ps_car_11_cat6
## -2.247e-01 -5.275e-01 -7.489e-01 -4.942e-01
## ps_car_11_cat7 ps_car_11_cat8 ps_car_11_cat9 ps_car_11_cat10
## -8.208e-01 -2.190e-01 -2.467e-01 -1.104e-01
## ps_car_11_cat11 ps_car_11_cat12 ps_car_11_cat13 ps_car_11_cat14
## -4.429e-01 -1.621e-01 -3.392e-01 -7.807e-01
## ps_car_11_cat15 ps_car_11_cat16 ps_car_11_cat17 ps_car_11_cat18
## -3.228e-01 -5.317e-01 -4.014e-01 5.331e-02
## ps_car_11_cat19 ps_car_11_cat20 ps_car_11_cat21 ps_car_11_cat22
## -5.022e-01 -2.208e-01 1.548e-01 -3.003e-01
## ps_car_11_cat23 ps_car_11_cat24 ps_car_11_cat25 ps_car_11_cat26
## -3.874e-01 -1.682e-02 NA 2.185e-01
## ps_car_11_cat27 ps_car_11_cat28 ps_car_11_cat29 ps_car_11_cat30
## -1.527e-01 -6.873e-01 -3.651e-01 -8.361e-01
## ps_car_11_cat31 ps_car_11_cat32 ps_car_11_cat33 ps_car_11_cat34
## 5.086e-02 -4.128e-01 -1.325e-01 -1.062e-01
## ps_car_11_cat35 ps_car_11_cat36 ps_car_11_cat37 ps_car_11_cat38
## 1.792e-01 -5.515e-01 3.507e-02 3.132e-02
## ps_car_11_cat39 ps_car_11_cat40 ps_car_11_cat41 ps_car_11_cat42
## -5.643e-01 -5.616e-01 1.447e-01 -3.302e-01
## ps_car_11_cat43 ps_car_11_cat44 ps_car_11_cat45 ps_car_11_cat46
## -4.487e-01 -5.290e-01 3.147e-01 -2.789e-01
## ps_car_11_cat47 ps_car_11_cat48 ps_car_11_cat49 ps_car_11_cat50
## -1.190e-01 1.569e-01 -1.161e+00 -4.642e-02
## ps_car_11_cat51 ps_car_11_cat52 ps_car_11_cat53 ps_car_11_cat54
## -3.362e-02 -4.491e-01 -6.950e-01 -3.264e-01
## ps_car_11_cat55 ps_car_11_cat56 ps_car_11_cat57 ps_car_11_cat58
## -6.994e-01 1.211e-01 -5.289e-01 -8.958e-02
## ps_car_11_cat59 ps_car_11_cat60 ps_car_11_cat61 ps_car_11_cat62
## 4.719e-01 -5.023e-02 5.942e-02 -5.955e-01
## ps_car_11_cat63 ps_car_11_cat64 ps_car_11_cat65 ps_car_11_cat66
## -2.641e-01 -2.881e-01 -4.388e-02 -1.000e+00
## ps_car_11_cat67 ps_car_11_cat68 ps_car_11_cat69 ps_car_11_cat70
## -2.888e-01 -2.896e-01 -4.034e-01 -2.850e-01
## ps_car_11_cat71 ps_car_11_cat72 ps_car_11_cat73 ps_car_11_cat74
## -5.965e-01 -2.762e-01 -7.765e-01 -5.291e-02
## ps_car_11_cat75 ps_car_11_cat76 ps_car_11_cat77 ps_car_11_cat78
## 9.830e-03 -5.934e-02 -7.501e-01 -5.279e-01
## ps_car_11_cat79 ps_car_11_cat80 ps_car_11_cat81 ps_car_11_cat82
## -3.073e-02 NA -2.517e-01 -1.446e-01
## ps_car_11_cat83 ps_car_11_cat84 ps_car_11_cat85 ps_car_11_cat86
## -6.870e-01 -9.014e-02 -6.007e-01 -1.805e-01
## ps_car_11_cat87 ps_car_11_cat88 ps_car_11_cat89 ps_car_11_cat90
## -2.445e-01 -1.325e-01 -1.151e+00 1.260e-01
## ps_car_11_cat91 ps_car_11_cat92 ps_car_11_cat93 ps_car_11_cat94
## -3.520e-02 -1.241e-01 -1.520e-01 -3.124e-01
## ps_car_11_cat95 ps_car_11_cat96 ps_car_11_cat97 ps_car_11_cat98
## -9.880e-01 -4.831e-01 -4.869e-02 -3.161e-01
## ps_car_11_cat99 ps_car_11_cat100 ps_car_11_cat101 ps_car_11_cat102
## -4.018e-01 -1.286e-01 -4.501e-01 6.285e-02
## ps_car_11_cat103 ps_car_11_cat104 ps_car_11 ps_car_12
## -4.157e-01 -2.443e-01 1.434e-02 1.259e+00
## ps_car_13 ps_car_14 ps_car_15 ps_calc_01
## 4.893e-01 1.841e-01 7.361e-03 7.703e-02
## ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05
## 3.833e-02 6.312e-02 1.917e-02 1.301e-02
## ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09
## 1.205e-02 5.781e-03 5.890e-03 6.986e-03
## ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13
## 2.511e-03 -4.183e-03 -1.257e-02 -1.345e-02
## ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin
## 3.771e-03 -6.713e-02 2.869e-02 -4.733e-02
## ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
## 2.710e-02 -5.414e-02 -6.883e-02
##
## Degrees of Freedom: 87451 Total (i.e. Null); 87251 Residual
## Null Deviance: 32360
## Residual Deviance: 31410 AIC: 31810
p <- predict(mod, newdata = trainTest[, -which(names(trainTest) == "target")], type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pr <- prediction(p, trainTest$target)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.6133971
That AUC doesn’t look very good.
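For comparison with the competition metric used later, note that for a binary outcome the Gini coefficient relates to AUC as Gini = 2 × AUC − 1, so the logistic regression's baseline Gini follows directly (gini_logit is just a name I introduce here):
# The competition's Gini coefficient relates to AUC as Gini = 2 * AUC - 1.
gini_logit <- 2 * auc - 1
gini_logit   # 0.2267942, the baseline Gini referenced in the comparison below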
Let’s see if we can improve that with a better model.
XGBoost:
This is a great package: XGBoost fits gradient-boosted decision trees and typically performs very well on tabular classification problems such as this one.
# xgb_normalizedgini <- function(preds, dtrain){
# actual <- getinfo(dtrain, "label")
# score <- NormalizedGini(preds,actual)
# return(list(metric = "NormalizedGini", value = score))
# }
#
# param <- list(booster="gbtree",
# objective="binary:logistic",
# eta = 0.02,
# gamma = 1,
# max_depth = 6,
# min_child_weight = 1,
# subsample = 0.8,
# colsample_bytree = 0.8
# )
set.seed(101)
param = list(
  objective = "binary:logistic",  # because there are only two categories
  eval_metric = "auc",            # metric used by the competition
  subsample = 0.8,
  gamma = 1,
  colsample_bytree = 0.8,
  max_depth = 6,
  min_child_weight = 1,
  tree_method = "auto",
  eta = 0.02,
  nthread = 8
)
x_train <- xgb.DMatrix(
as.matrix(trainTrain[,-2]),
label = trainTrain$target,
missing = NaN)
x_val <- xgb.DMatrix(
as.matrix(trainTest[,-2]),
label = trainTest$target,
missing = NaN)
x_test <- xgb.DMatrix(as.matrix(test), missing = NaN)
model <- xgb.train(
data = x_train,
nrounds = 400,
params = param,
maximize = TRUE,
watchlist = list(val = x_val),
print_every_n = 10
)
## [1] val-auc:0.581973
## [11] val-auc:0.603465
## [21] val-auc:0.604384
## [31] val-auc:0.604154
## [41] val-auc:0.604657
## [51] val-auc:0.607109
## [61] val-auc:0.608757
## [71] val-auc:0.609682
## [81] val-auc:0.610687
## [91] val-auc:0.611023
## [101] val-auc:0.613474
## [111] val-auc:0.614051
## [121] val-auc:0.614805
## [131] val-auc:0.615449
## [141] val-auc:0.615297
## [151] val-auc:0.615624
## [161] val-auc:0.617181
## [171] val-auc:0.617997
## [181] val-auc:0.618855
## [191] val-auc:0.619456
## [201] val-auc:0.619822
## [211] val-auc:0.620351
## [221] val-auc:0.620734
## [231] val-auc:0.620938
## [241] val-auc:0.621467
## [251] val-auc:0.622909
## [261] val-auc:0.623384
## [271] val-auc:0.623323
## [281] val-auc:0.623746
## [291] val-auc:0.623751
## [301] val-auc:0.623942
## [311] val-auc:0.624052
## [321] val-auc:0.624465
## [331] val-auc:0.624078
## [341] val-auc:0.623697
## [351] val-auc:0.623504
## [361] val-auc:0.623350
## [371] val-auc:0.622776
## [381] val-auc:0.622678
## [391] val-auc:0.622568
## [400] val-auc:0.622236
pred_3_e <- predict(model, x_val)
pred_3_t <- predict(model, x_test)
bet  <- Gini(pred_3_e, trainTest$target)
bet1 <- auc * 2 - 1
We have achieved our goal of improving on the baseline: the XGBoost model reaches a Gini coefficient of 0.2444722, while the basic logistic regression has a Gini of 0.2267942. While this is an improvement, the winning result was 0.29698. An improvement of 0.0176781 might not look impressive, but considering that even the winning score is only 0.0525078 higher than our result, any improvement over the baseline is meaningful. For example, using basic XGBoost instead of logistic regression gives a 7.79% relative increase in Gini (not to be confused with accuracy per se).
Note that after roughly 100 boosting rounds the validation AUC is only around 0.613–0.616, essentially on par with the simple logistic regression; the gains come from the later rounds.
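In fact, the validation AUC peaks around round 320 (about 0.6245) and slowly declines afterwards, which suggests the later rounds begin to overfit. Below is a hedged sketch of how early stopping could be used instead of a fixed 400 rounds; it reuses the x_train, x_val and param objects defined above, and model_es is just a new name for this alternative fit.
# Sketch only: stop training once validation AUC has not improved for 50 rounds.
model_es <- xgb.train(
  data                  = x_train,
  params                = param,
  nrounds               = 400,
  watchlist             = list(val = x_val),
  maximize              = TRUE,
  early_stopping_rounds = 50,
  print_every_n         = 10
)
model_es$best_iteration   # boosting round with the best validation AUC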
- Check conditions
qqnorm(mod$residuals)
qqline(mod$residuals)
That clearly did not pass a normality check; however, since we fit a logistic regression, normally distributed residuals are not expected. In fact, the residuals almost trace out a logistic curve, which is comforting.
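A more informative diagnostic for a logistic regression is a binned residual plot. Here is a minimal base-R sketch (the 5% binning scheme is an arbitrary choice of mine):
# Sketch: average response residuals within bins of fitted probability.
fit  <- fitted(mod)
res  <- residuals(mod, type = "response")
bins <- cut(fit, breaks = unique(quantile(fit, probs = seq(0, 1, 0.05))),
            include.lowest = TRUE)
plot(tapply(fit, bins, mean), tapply(res, bins, mean),
     xlab = "Mean fitted probability", ylab = "Mean residual",
     main = "Binned residual plot")
abline(h = 0, lty = 2)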
- Theoretical inference (if possible) - hypothesis test and confidence interval
Theoretical inference is possible based on the results of the logistic regression; however, to maintain privacy, the competition did not give variable names that correspond to real-world attributes. As such, while we can compute confidence intervals, we don't actually know what they correspond to.
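For reference, here is a sketch of how those confidence intervals can be extracted from the fitted model (Wald intervals, shown on the odds-ratio scale); the anonymized variable names still limit what we can say about them.
# Wald confidence intervals for the logistic-regression coefficients,
# exponentiated to the odds-ratio scale (aliased terms are dropped).
ci <- confint.default(mod)
head(exp(cbind(OR = coef(mod)[rownames(ci)], ci)))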
- Brief description of methodology that reflects your conceptual understanding
We fit two models: a logistic regression and a boosted-tree (XGBoost) model. Their results were compared using the Gini coefficient on a held-out test set.
Conclusion:
While the ML algorithm achieved a better Gini score, it was much more expensive to compute and offered little interpretability. Also, considering how close the logistic regression came to the winning Kaggle Gini score, it does not appear that analysts or policy makers who need interpretability should be concerned about using an 'inferior' algorithm. However, since the company only cares about prediction, it should use a cutting-edge algorithm such as XGBoost or a neural network.
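One partial mitigation for that lack of intelligibility, sketched below rather than used in the analysis above, is to inspect XGBoost's feature importances for the fitted booster; this is not statistical inference, but it does indicate which variables drive the predictions.
# Sketch: gain-based feature importance from the fitted booster.
imp <- xgb.importance(model = model)
head(imp, 10)
xgb.plot.importance(imp[1:10, ])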