1. Overview

Welcome to the first installment of the classification series. In this presentation we’ll be exploring logistic regression.

Recall that a logistic regression (or logit) is similar to a linear regression: it uses other variables to predict an output in an equation-type format. The main difference between the two is that a linear regression predicts a continuous output, while a logistic regression estimates the probability of a binary outcome.
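In equation form, logistic regression models the probability of the outcome as

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}},$$

so the linear combination of predictors lives on the log-odds scale rather than directly on the outcome scale.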

One reason we might want to use a logistic regression is that it is fairly easy to interpret. Exponentiating a coefficient gives an odds ratio, and the model’s predictions are probabilities, so you can describe each variable in your model in terms of how it changes the odds (or probability) that your chosen outcome will occur. Another advantage is that logistic regression’s run time is low relative to many other classification methods, making it useful for large datasets or limited computing resources.
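As a quick illustration of that interpretability (using R’s built-in mtcars data as a stand-in, not the Titanic data loaded below): exponentiated coefficients are odds ratios, and plogis() converts the model’s log-odds predictions into probabilities.

# Illustration only: interpreting logistic regression output
m <- glm(am ~ wt + hp, data = mtcars, family = binomial)

exp(coef(m))        # exponentiated coefficients are odds ratios per one-unit increase
plogis(predict(m))  # plogis() maps the linear predictor (log odds) to probabilities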


2. Data Formatting

2.1 Setup

This is the step where the data is imported and the required packages are loaded. Typically I would load all libraries here, but to make it clear which packages are needed for which section, and to allow copying and pasting of individual sections, I will load only the packages used in each code chunk.

# Loading Packages for Data Cleaning Sections
library(data.table) # reads data quickly
library(dplyr) # handy for data manipulation
library(plotly) # nice data viz package

# Loading Data
dat <- fread("C:/Users/AK055125/Cerner Corporation/Sallee,Andrew - Analytics Learning Lab Files/General/Shared Training Materials/Lightning Talks/Classification Series/train.csv")

2.2 Data Cleaning

Here missing/inconsistent data is dealt with. An example of inconsistent data is having negative ages or a person having a title “Miss” but marked as male. In addition, if there are columns that are the wrong data type (e.g. numeric data stored as a character), that would be dealt with here.

Fortunately for us, the only data cleaning needed is handling the missing values in the age column. Because we may later need to predict on external data that is also missing ages, we will create a rule for imputation (age is rather important in determining the probability of surviving the Titanic). Fortunately there is some correlation between title and age: “Master” is a term for a male child, those with “Miss” in their title have a lower average age than “Mrs,” and those with “Officer” titles tend to be the oldest group. Thus, we can impute missing ages using the average age for each title and get a better approximation than we would otherwise.

# Creating Title Data (Extracts Title From Full Name)
dat$title <-  gsub("^.*, (.*?)\\..*$", "\\1", dat$Name)

# Categorizing Titles (Replaces String on Left with String on Right)
dat$title[dat$title == 'Mlle'] <- 'Miss' 
dat$title[dat$title == 'Ms'] <- 'Miss'
dat$title[dat$title == 'Mme'] <- 'Mrs' 
dat$title[dat$title == 'Lady'] <- 'Miss'
dat$title[dat$title == 'Dona'] <- 'Miss'
dat$title[dat$title == 'Capt'] <- 'Officer' 
dat$title[dat$title == 'Col'] <- 'Officer' 
dat$title[dat$title == 'Major'] <- 'Officer'
dat$title[dat$title == 'Dr'] <- 'Officer'
dat$title[dat$title == 'Rev'] <- 'Officer'
dat$title[dat$title == 'Don'] <- 'Officer'
dat$title[dat$title == 'Sir'] <- 'Officer'
dat$title[dat$title == 'the Countess'] <- 'Officer'
dat$title[dat$title == 'Jonkheer'] <- 'Officer'

# Checking and Plotting Missing Values
colMeans(is.na(dat))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.1986532 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
##       title 
##   0.0000000
plot_ly(
  x = names(colMeans(is.na(dat))),           # choose unique bar category
  y = as.numeric(colMeans(is.na(dat))),      # choose height of bar
  type = "bar"                               # specify bar chart
) %>%
  layout(title = "Titanic Missing Values")
# Imputing NAs (Needed in case we have to predict on missing data)
# Since age is the only missing variable let's use the mean age for a given title to impute
dat$Age = ifelse(is.na(dat$Age), 
                 ifelse(dat$title == "Mr", mean(dat[dat$title == "Mr", ]$Age, na.rm=TRUE),
                        ifelse(dat$title == "Mrs", mean(dat[dat$title == "Mrs",]$Age, na.rm = TRUE), 
                        ifelse(dat$title == "Miss", mean(dat[dat$title == "Miss",]$Age, na.rm = TRUE),
                        ifelse(dat$title == "Master", mean(dat[dat$title == "Master",]$Age, na.rm = TRUE),
                        ifelse(dat$title == "Officer", mean(dat[dat$title == "Officer",]$Age, na.rm = TRUE), 
                        dat$Age))))), dat$Age)
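As an aside, the same grouped imputation can be written more compactly with data.table syntax; this is just a sketch that should be equivalent to the nested ifelse above (it replaces each missing age with the mean age for that passenger's title).

# Equivalent grouped imputation in data.table syntax (sketch)
dat[, Age := ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age), by = title]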

2.3 Feature Engineering

Here we create additional variables that will be useful in our model. The features created below include:

  • Title dummies
  • Family size
  • Class dummies
  • Gender
  • Embarked location
# Title Dummies
dat$mr <- ifelse(dat$title == "Mr", 1, 0)
dat$miss <- ifelse(dat$title == "Miss", 1, 0)
dat$mrs <- ifelse(dat$title == "Mrs", 1, 0)
dat$master <- ifelse(dat$title == "Master", 1, 0)
dat$officer <- ifelse(dat$title == "Officer", 1, 0)

# Plotting Titles
plot_ly(
  x =  ordered(c("Mr", "Miss", "Mrs", "Master", "Officer"), levels = c("Mr", "Miss", "Mrs", "Master", "Officer")),# choose unique bar category
  y = c(sum(dat$mr), sum(dat$miss), sum(dat$mrs), sum(dat$master), sum(dat$officer)),  # choose height of bar
  type = "bar"                                                                         # specify bar chart
) %>%
  layout(title = "Title Information")
# Creating Family Size
dat$FamilySize <-dat$SibSp + dat$Parch + 1 

# Plotting Family Size
plot_ly(x = density(dat$FamilySize)$x, 
        y = density(dat$FamilySize)$y, 
        mode = "lines", fill = "tozeroy", name = "Density")%>%

  layout(title = "Family Size")
# Class
dat$Class1 <- ifelse(dat$Pclass == 1, 1, 0)
dat$Class2 <- ifelse(dat$Pclass == 2, 1, 0)
dat$Class3 <- ifelse(dat$Pclass == 3, 1, 0)

# Plotting Class Sizes
plot_ly(
  x =  ordered(c("Class 1", "Class 2", "Class 3"), levels = c("Class 1", "Class 2", "Class 3")),# choose unique bar category
  y = c(sum(dat$Class1), sum(dat$Class2), sum(dat$Class3)),  # choose height of bar
  type = "bar"                                                                         # specify bar chart
) %>%
  layout(title = "Class Sizes")
# Gender
dat$female <- ifelse(dat$Sex == "male", 0, 1)

# Plotting Gender
plot_ly(
  x =  ordered(c("Male", "Female"), levels = c("Male", "Female")),# choose unique bar category
  y = c(length(dat$female) - sum(dat$female), sum(dat$female)),  # choose height of bar
  type = "bar"                                                                         # specify bar chart
) %>%
  layout(title = "Gender Breakdown")
# Embarked Location
dat$Southampton <- ifelse(dat$Embarked == "S", 1, 0)
dat$Queenstown <- ifelse(dat$Embarked == "Q", 1, 0)
dat$Cherbourg <- ifelse(dat$Embarked == "C", 1, 0)

# Plotting Embarked Location
plot_ly(
  x =  ordered(c("Southampton", "Cherbourg", "Queenstown"), levels = c("Southampton", "Cherbourg", "Queenstown")),# choose unique bar category
  y = c(sum(dat$Southampton), sum(dat$Cherbourg), sum(dat$Queenstown)),  # choose height of bar
  type = "bar"                                                                         # specify bar chart
) %>%
  layout(title = "Embarked Location")

2.4 Readying Data for Model Selection

Here we make a data frame of our numeric data, create a design matrix and specify our response. This formatting will be needed in our model selection process.

# Subsetting Data to Numeric Features
vars <- c("Survived", "Age", "Fare", "mr", "miss", "mrs", "master", "officer", "FamilySize", "Class1", "Class2", "Class3", "female", "Southampton", "Queenstown", "Cherbourg")
newdat <- subset(dat, select = vars)

# Creating Target Variable and Design Matrix
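# Note: model.matrix() adds its own "(Intercept)" column, and glmnet fits an
# intercept as well; that is why the LASSO coefficient output in section 3.1
# shows two "(Intercept)" rows (the redundant one is simply shrunk to ".").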
x = model.matrix(Survived ~ ., data = newdat)
y = dat$Survived

3. Model Selection

Now that we have all of these variables created and cleaned, how do we know which ones will be good predictors of survival? We can choose a model by using different model selection processes. In this document we will cover LASSO, Stepwise, and Information Criteria Optimization methods. At the end of the document we will evaluate which of these selection processes yielded the best prediction of surviving the Titanic.


3.1 LASSO

In statistics and machine learning, LASSO (least absolute shrinkage and selection operator) is a regression method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting model. It adds a penalty on the absolute size of the coefficients, so coefficients that contribute little to the fit are “shrunk” all the way to 0; the variables left with nonzero coefficients make up the selected model.
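For reference, with alpha = 1 and a binomial family, glmnet chooses coefficients by minimizing a penalized criterion of the form

$$\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i\,(\beta_0 + x_i^\top\beta) - \log\!\big(1 + e^{\beta_0 + x_i^\top\beta}\big)\Big] \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,$$

where larger values of the tuning parameter $\lambda$ shrink more coefficients exactly to zero.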

# Loading Packages
library(glmnet)

# vector of values from 10^10 to 10^-2
grid = 10^seq(10, -2, length = 100) 

# fits one model for each value in "grid" to test a range of lambdas
lasso.mod=glmnet(x, y, alpha = 1, lambda = grid, type.logistic = "Newton") 

# Use K-fold cross-validation to identify the best lambda
cv.out = cv.glmnet(x, y, alpha = 1, type.logistic = "Newton")

# Storing best lambda
bestlam = cv.out$lambda.min

# Predicting Coefficients Using Best Lambda
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam, type.logistic = "Newton")

# Reports Lasso Selected Variables
lasso.coef
## 17 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  0.5243356024
## (Intercept)  .           
## Age         -0.0026227733
## Fare         0.0003535121
## mr          -0.1293191629
## miss         .           
## mrs          0.0780520034
## master       0.3404865631
## officer     -0.0728694238
## FamilySize  -0.0469547288
## Class1       0.1317158393
## Class2       .           
## Class3      -0.1497072268
## female       0.3853642281
## Southampton -0.0310960348
## Queenstown   .           
## Cherbourg    0.0178599362
# Chosen formula:
# glm(formula = Survived ~ Age + Fare + mr + mrs + master + officer + 
#     FamilySize + Class1 + Class3 + female + Southampton + Cherbourg, 
#     family = binomial, data = dat)

This tells us our LASSO model will include the following variables (entries shown as “.” in the output above were shrunk to exactly zero and dropped):

  • Age
  • Fare
  • Mr
  • Mrs
  • Master
  • Officer
  • Family Size
  • Class 1
  • Class 3
  • Female
  • Southampton
  • Cherbourg


3.2 Stepwise

Stepwise selection is a method that allows moves in either direction, dropping or adding variables at each step. For example, backward stepwise selection starts from the full model and removes variables, but can later add a variable back in if it appears useful again. Here we use AIC (Akaike Information Criterion) to decide whether a variable should be added or dropped. Stepwise selection is computationally faster than exhaustive methods, but it may not produce the optimal prediction model.
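For reference, the criterion being compared at each step is

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}),$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model; lower AIC is better.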

# Full and Null Models
null = glm(Survived ~ 1, data = newdat, family = binomial)
full = glm(Survived ~ ., data = newdat, family = binomial)

# Running Selection
step(null, 
     scope = list(lower = Survived ~ 1, 
                  upper = full), 
     data = newdat, 
     direction = "both", 
     trace=FALSE)
## 
## Call:  glm(formula = Survived ~ mr + Class3 + officer + FamilySize + 
##     Class1 + Age + mrs + Southampton, family = binomial, data = newdat)
## 
## Coefficients:
## (Intercept)           mr       Class3      officer   FamilySize  
##     3.30627     -2.91409     -1.14305     -2.66655     -0.41482  
##      Class1          Age          mrs  Southampton  
##     1.31397     -0.03268      0.88853     -0.41218  
## 
## Degrees of Freedom: 890 Total (i.e. Null);  882 Residual
## Null Deviance:       1187 
## Residual Deviance: 731.5     AIC: 749.5
# Chosen Equation
# glm(formula = Survived ~ mr + Class3 + officer + FamilySize + 
#     Class1 + Age + mrs + Southampton, family = binomial, data = dat)

Our Stepwise variables include:

  • Age
  • Mr
  • Mrs
  • Officer
  • Family Size
  • Class 1
  • Class 3
  • Southampton


3.3 Information Criteria Optimization (ICO)

This method is similar to stepwise selection in the sense that we are optimizing an information criterion (here we will again use AIC). Instead of checking at each step whether adding or dropping a single variable improves the AIC, it fits every combination of variables and keeps the specification with the best (lowest) AIC. With 15 candidate predictors that is 2^15 = 32,768 possible main-effects models, so this method can be quite time consuming, but it tends to lead to better models than stepwise selection.

# Loading Packages
library(glmulti)

# Testing Models
ICO.out <-
    glmulti(Survived ~ ., data = newdat,
            level = 1,               # No interaction considered
            method = "h",            # Exhaustive approach
            crit = "aic",            # AIC as criteria
            confsetsize = 1,         # Keep 1 best model
            plotty = F, report = F,  # No plot or interim reports
            fitfunction = "glm",     # glm function
            family = binomial)       # binomial family for logistic regression

# Reporting Best Model
ICO.out@formulas

# Chosen Variables:
# glm(formula = Survived ~ Age + Fare + miss + master + FamilySize + Class1 + 
#     Class2 + female + Southampton, family = binomial, data = dat)

Selected ICO Variables:

  • Age
  • Fare
  • Miss
  • Master
  • Family Size
  • Class 1
  • Class 2
  • Female
  • Southampton


4. Implementation

It’s finally time to actually run each of the models!


4.1 LASSO

# LASSO Model Estimation
LASSOmodel <- glm(formula = Survived ~ Age + Fare + mr + mrs + master + officer + 
                  FamilySize + Class1 + Class3 + female + Southampton + Cherbourg, 
                  family = binomial, data = dat)
  
  
# LASSO Model Prediction
predictLASSO <- predict(LASSOmodel, type = 'response')

# Model Summary
summary(LASSOmodel)
## 
## Call:
## glm(formula = Survived ~ Age + Fare + mr + mrs + master + officer + 
##     FamilySize + Class1 + Class3 + female + Southampton + Cherbourg, 
##     family = binomial, data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3599  -0.5646  -0.3801   0.5515   2.5335  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.014214 617.671519  -0.019 0.984481    
## Age          -0.029251   0.009633  -3.037 0.002393 ** 
## Fare          0.003526   0.002648   1.331 0.183025    
## mr           12.201136 617.671394   0.020 0.984240    
## mrs           0.905283   0.357156   2.535 0.011254 *  
## master       15.527602 617.671517   0.025 0.979944    
## officer      12.194020 617.671167   0.020 0.984249    
## FamilySize   -0.465555   0.084654  -5.499 3.81e-08 ***
## Class1        1.077704   0.322664   3.340 0.000838 ***
## Class3       -1.093537   0.253650  -4.311 1.62e-05 ***
## female       15.109448 617.671331   0.024 0.980484    
## Southampton  -0.298251   0.342645  -0.870 0.384062    
## Cherbourg     0.143393   0.393988   0.364 0.715894    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  725.06  on 878  degrees of freedom
## AIC: 751.06
## 
## Number of Fisher Scoring iterations: 13


4.2 Stepwise

# Stepwise Model Estimation
Stepwisemodel <- glm(formula = Survived ~ mr + Class3 + officer + FamilySize + 
                     Class1 + Age + mrs + Southampton, family = binomial, data = dat)

# Stepwise Optimization Prediction
predictStepwise <- predict(Stepwisemodel, type = 'response')

# Model Summary
summary(Stepwisemodel)
## 
## Call:
## glm(formula = Survived ~ mr + Class3 + officer + FamilySize + 
##     Class1 + Age + mrs + Southampton, family = binomial, data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3197  -0.5768  -0.3721   0.5643   2.4897  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.306272   0.415724   7.953 1.82e-15 ***
## mr          -2.914086   0.258061 -11.292  < 2e-16 ***
## Class3      -1.143049   0.248793  -4.594 4.34e-06 ***
## officer     -2.666549   0.566423  -4.708 2.51e-06 ***
## FamilySize  -0.414822   0.074143  -5.595 2.21e-08 ***
## Class1       1.313974   0.291287   4.511 6.45e-06 ***
## Age         -0.032679   0.009292  -3.517 0.000437 ***
## mrs          0.888525   0.346086   2.567 0.010248 *  
## Southampton -0.412183   0.211796  -1.946 0.051639 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  731.46  on 882  degrees of freedom
## AIC: 749.46
## 
## Number of Fisher Scoring iterations: 5


4.3 Information Criteria Optimization

# IC Optimization Estimation
ICOModel <- glm(formula = Survived ~ Age + Fare + miss + master + FamilySize + Class1 + 
                Class2 + female + Southampton, family = binomial, data = dat)

# IC Optimization Prediction
predictICO <- predict(ICOModel, type = 'response')

# Model Summary
summary(ICOModel)
## 
## Call:
## glm(formula = Survived ~ Age + Fare + miss + master + FamilySize + 
##     Class1 + Class2 + female + Southampton, family = binomial, 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3759  -0.5661  -0.3785   0.5496   2.4999  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.806927   0.380821  -2.119  0.03410 *  
## Age         -0.029672   0.009489  -3.127  0.00177 ** 
## Fare         0.003584   0.002630   1.362  0.17308    
## miss        -0.931180   0.351726  -2.647  0.00811 ** 
## master       3.310772   0.529903   6.248 4.16e-10 ***
## FamilySize  -0.465276   0.084397  -5.513 3.53e-08 ***
## Class1       2.198100   0.310200   7.086 1.38e-12 ***
## Class2       1.104488   0.250793   4.404 1.06e-05 ***
## female       3.823441   0.323060  11.835  < 2e-16 ***
## Southampton -0.393635   0.214701  -1.833  0.06674 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  725.34  on 881  degrees of freedom
## AIC: 745.34
## 
## Number of Fisher Scoring iterations: 5


5. Assumptions

5.1 Description

Good news! There are fewer assumptions for a logistic regression than for a linear regression! Things we no longer have to assume are:

  1. Logistic regression does not require a linear relationship between the dependent and independent variables.
  2. The error terms (residuals) do not need to be normally distributed.
  3. Homoskedasticity is not required.

Some assumptions we still have are:

  1. Binary logistic regression requires the dependent variable to be binary (taking the value 1 or 0).
  2. Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
  3. Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
  4. Logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds.

The first assumption is not covered in the testing/resolution sections. If the dependent variable is not binary, then logistic regression is not the appropriate algorithm for analysis.

Additionally, the second assumption is not covered, since the only way to assess it is with a good understanding of how the data were collected.


5.2 Testing

5.2.1 LASSO

Are the variables linear with log odds?

Focusing on the numeric variables (since it’s clearer what linear looks like for those variables), it looks like the only concern is with fare and family size. A more formal method of testing whether a term is linear in the log odds is the Box-Tidwell test, sketched after the plots below.

# Loading Libraries
library(tidyverse)
library(broom)
theme_set(theme_classic())

# Linear Relationship between logit of outcome and each predictor
predictors <- names(newdat)
newdat0 <- newdat %>%
  mutate(logit = log(predictLASSO/(1-predictLASSO))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(newdat0, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")
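The plots above are a graphical check. A sketch of the Box-Tidwell idea mentioned earlier (my own illustration; the original analysis relies on the plots) is to add x * log(x) terms for the continuous predictors and see whether they are significant, which would flag nonlinearity in the log odds. Fare contains zeros, so log(Fare + 1) is used in its term.

# Box-Tidwell-style check (sketch): significant I(x * log(x)) terms suggest
# the predictor is not linear in the log odds
bt <- glm(Survived ~ Age + I(Age * log(Age)) +
            Fare + I(Fare * log(Fare + 1)) +
            FamilySize + I(FamilySize * log(FamilySize)),
          family = binomial, data = dat)
summary(bt)  # inspect the p-values on the I(x * log(x)) terms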

Is multicollinearity a problem?

It looks like the VIFs are extremely high for our master, mr, female, and officer variables.

# Checking Multicollinearity (VIF)
car::vif(LASSOmodel)
##          Age         Fare           mr          mrs       master 
## 2.016598e+00 1.615716e+00 1.084492e+07 1.721582e+00 2.443033e+06 
##      officer   FamilySize       Class1       Class3       female 
## 1.267000e+06 1.810141e+00 2.344179e+00 1.829860e+00 1.039705e+07 
##  Southampton    Cherbourg 
## 2.787987e+00 2.808219e+00

5.2.2 Stepwise

Are the variables linear with log odds?

Focusing on the numeric variables (since it’s clearer what linear looks like for those variables), it again looks like the only concern is with fare and family size. A more formal check is the Box-Tidwell approach sketched in section 5.2.1.

# Loading Libraries
library(tidyverse)
library(broom)
theme_set(theme_classic())

# Linear Relationship between logit of outcome and each predictor
predictors <- names(newdat)
newdat1 <- newdat %>%
  mutate(logit = log(predictStepwise/(1-predictStepwise))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(newdat1, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

Is multicollinearity a problem?

All of our variance inflation factors (VIFs) are below 2, which indicates that multicollinearity is not a big problem for this model. As a rule of thumb, a VIF that exceeds 5 or 10 indicates a problematic amount of collinearity.

# Checking Multicollinearity (VIF)
car::vif(Stepwisemodel)
##          mr      Class3     officer  FamilySize      Class1         Age 
##    1.915878    1.780854    1.312263    1.423359    1.976319    1.916808 
##         mrs Southampton 
##    1.636775    1.074973

5.2.3 Information Criteria Optimization

Are the variables linear with log odds?

Focusing on the numeric variables (since it’s clearer what linear looks like for those variables), it again looks like the only concern is with fare and family size. A more formal check is the Box-Tidwell approach sketched in section 5.2.1.

# Loading Libraries
library(tidyverse)
library(broom)
theme_set(theme_classic())

# Linear Relationship between logit of outcome and each predictor
predictors <- names(newdat)
newdat2 <- newdat %>%
  mutate(logit = log(predictICO/(1-predictICO))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(newdat2, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

Is multicollinearity a problem?

All of our variance inflation factors (VIFs) are below 3, which indicates that multicollinearity is not a big problem for this model. As a rule of thumb, a VIF that exceeds 5 or 10 indicates a problematic amount of collinearity.

# Checking Multicollinearity (VIF)
car::vif(ICOModel)
##         Age        Fare        miss      master  FamilySize      Class1 
##    1.960000    1.597190    2.705066    1.806067    1.797638    2.171761 
##      Class2      female Southampton 
##    1.244879    2.843844    1.094326


5.3 Resolving Issues

Linearity: If a variable is not linear in the log odds, then the variable should be transformed to account for its nonlinearity. One solution is to log transform the variable, or to use a Box-Cox transformation if a log transform is not appropriate. Sometimes adding a squared or cubed term (e.g. Age^2 or Age^3) of the nonlinear feature can resolve the problem.
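For example, a quadratic term can be added directly in the model formula. This is purely an illustration with a small hypothetical specification, not one of the models compared elsewhere in this document:

# Illustration: adding a squared Age term to capture nonlinearity
quadmodel <- glm(Survived ~ Age + I(Age^2) + female + Class3,
                 family = binomial, data = dat)
summary(quadmodel)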

Multicollinearity: Removing one of a set of variables that are highly correlated with each other is one way to resolve a high VIF.

5.3.1 LASSO

Let’s apply a log transformation to the two variables that look the least linear (fare and family size). The result looks quite a bit better.

# LASSO Model Re-Estimation
LASSOmodel <- glm(formula = Survived ~ Age + log(Fare +1) + mr + mrs + master + officer + 
                  log(FamilySize) + Class1 + Class3 + female + Southampton + Cherbourg, 
                  family = binomial, data = dat)

# LASSO Model Prediction
predictLASSO <- predict(LASSOmodel, type = 'response')  
  
# Testing Linearity
predictors <- names(newdat)
newdat0 <- newdat %>%
  mutate(logit = log(predictLASSO/(1-predictLASSO))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(newdat0, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

Let’s try removing our gender variable, since we can determine the gender of most passengers from the title information alone. The gender variable also has the highest VIF, which is a good indicator of where to start.

# LASSO Estimation Part 3
LASSOmodel <- glm(formula = Survived ~ Age + log(Fare +1) + mr + mrs + master + officer + 
                  log(FamilySize) + Class1 + Class3 + Southampton + Cherbourg, 
                  family = binomial, data = dat)

# Checking Multicollinearity (VIF)
car::vif(LASSOmodel)
##             Age   log(Fare + 1)              mr             mrs 
##        2.126480        2.769800        2.084358        1.781944 
##          master         officer log(FamilySize)          Class1 
##        1.476277        1.324973        2.551103        2.667290 
##          Class3     Southampton       Cherbourg 
##        1.965657        2.789384        2.825473

It looks like that was enough to fix our multicollinearity problems!

5.3.2 Stepwise

Let’s apply a log transformation to the variable that looks the least linear (family size). The result looks quite a bit better.

# Stepwise Model Re-Estimation
Stepwisemodel <- glm(formula = Survived ~ mr + Class3 + officer + log(FamilySize) + 
                     Class1 + Age + mrs + Southampton, family = binomial, data = dat)

# Stepwise Optimization Prediction
predictStepwise <- predict(Stepwisemodel, type = 'response')

# Testing Linearity
predictors <- names(newdat)
newdat1 <- newdat %>%
  mutate(logit = log(predictStepwise/(1-predictStepwise))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(newdat1, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

5.3.3 Information Criteria Optimization

Let’s apply a log transformation to the two variables that look the least linear (fare and family size). The result looks quite a bit better.

# IC Optimization Re-Estimation
ICOModel <- glm(formula = Survived ~ Age + log(Fare + 1) + miss + master + log(FamilySize) + Class1 + 
                Class2 + female + Southampton, family = binomial, data = dat)

# IC Optimization Prediction
predictICO <- predict(ICOModel, type = 'response')

# Testing Linearity
predictors <- names(newdat)
newdat2 <- newdat %>%
  mutate(logit = log(predictICO/(1-predictICO))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(newdat2, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")


6. Evaluation

6.1 LASSO Model Evaluation

6.1.1 Determining Cutoff

The output of a logit is a probability: in our application, we can say a person has an X% chance of surviving. Often, though, we need to turn that probability into a binary prediction. Since there is an imbalance between those who survived (38%) and those who didn’t (62%), a 50% cutoff may not be appropriate for what we want to do. Sometimes a false positive is more costly to us than a false negative (or vice versa); this is where we can weigh those costs and adjust the cutoff.

# Loading Packages
library(InformationValue)
library(caret)
library(pROC)
library(Epi)

# Setting Up K-fold Validation
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

# Estimating Model
LASSOmodel <- train(Survived ~ Age + log(Fare +1) + mr + mrs + master + officer + 
                   log(FamilySize) + Class1 + Class3 + female + Southampton + Cherbourg, 
                   data = dat, 
                   method = "glm", 
                   family = "binomial",
                   trControl = ctrl, 
                   tuneLength = 5)

# Identifying Ideal Cutoff
LASSOprob = plogis(predict(LASSOmodel, dat))
LASSOoptcutoff <- optimalCutoff(dat$Survived, LASSOprob)[1]
LASSOoptcutoff
## [1] 0.6210585
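If one type of error is costlier than the other, the cutoff search itself can be pointed at a different objective. A sketch using InformationValue’s optimalCutoff() and its optimiseFor argument (as I understand that interface; check ?optimalCutoff): "Ones" favours catching survivors, "Zeros" favours catching deaths, and the default minimises overall misclassification error.

# Sketch: cutoffs tuned for different objectives
optimalCutoff(dat$Survived, LASSOprob, optimiseFor = "Ones")   # favour true positives
optimalCutoff(dat$Survived, LASSOprob, optimiseFor = "Zeros")  # favour true negatives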

6.1.2 Performance Metrics

The first performance metric we’ll use is accuracy. This measure is the number of observations our model classified correctly over the total number of observations.

# Loading Packages for kable Tables
library(kableExtra)
library(knitr)

# Calculating Accuracy
accuracy1 <- 1 - misClassError(dat$Survived, LASSOprob, threshold = LASSOoptcutoff)
paste("Accuracy rate:", paste(round(accuracy1, digits = 4)*100, "%", sep = ""))
## [1] "Accuracy rate: 83.61%"


The next metric is sensitivity. As computed here, it is the number of true positives (those that survived that we predicted would survive) divided by the total number of predicted positives (the number we predicted would survive), i.e. how often a “survived” prediction is correct. (Strictly speaking this ratio is the positive predictive value, or precision; textbook sensitivity divides by the number of actual survivors instead.)

# Calculating Sensitivity
sensitivity1 <- sum((as.numeric(LASSOprob) >= LASSOoptcutoff) & dat$Survived == 1)/sum(as.numeric(LASSOprob) > LASSOoptcutoff)
paste("Sensitivity:", paste(round(sensitivity1, digits = 4)*100, "%", sep = ""))
## [1] "Sensitivity: 80.43%"


Similar to sensitivity is specificity. As computed here, it is the number of true negatives (those that did not survive that we predicted wouldn’t survive) divided by the total number of predicted negatives (the number we predicted would not survive), i.e. how often a “died” prediction is correct. (In textbook terms this is the negative predictive value.)

# Calculating Specificity
specificity1 <- sum((as.numeric(LASSOprob) < LASSOoptcutoff) & dat$Survived == 0)/sum(as.numeric(LASSOprob) < LASSOoptcutoff)
paste("Specificity:", paste(round(specificity1, digits = 4)*100, "%", sep = ""))
## [1] "Specificity: 85.41%"


A confusion matrix is a good way to check for imbalances in our classifier. Ideally, we want most of our counts in the top-left and bottom-right cells; the observations outside those cells are the ones we misclassified. We also want to make sure we don’t have an overwhelming number of false positives compared to false negatives, or vice versa, which could indicate problems with the model’s usefulness.

# Displaying Confusion Matrix
as.data.table(cbind(c("", " Predicted Survived", "Predicted Died"), 
                    c("Actually Survived", sum((as.numeric(LASSOprob) >= LASSOoptcutoff) & dat$Survived == 1), sum((as.numeric(LASSOprob) < LASSOoptcutoff) & dat$Survived == 1)),
                    c("Actually Died", sum((as.numeric(LASSOprob) >= LASSOoptcutoff) & dat$Survived == 0), sum((as.numeric(LASSOprob) < LASSOoptcutoff) & dat$Survived == 0)))) %>%

kable("html", col.names = NULL, align = "c") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
                     Actually Survived   Actually Died
Predicted Survived                 259              63
Predicted Died                      83             486
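As a cross-check, caret’s confusionMatrix() builds the same table and reports several of the rates below in one call (a sketch; caret is already loaded above, and the caret:: prefix avoids the clash with InformationValue’s function of the same name).

# Sketch: same confusion matrix via caret
caret::confusionMatrix(factor(as.numeric(LASSOprob >= LASSOoptcutoff), levels = c(0, 1)),
                       factor(dat$Survived, levels = c(0, 1)),
                       positive = "1")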


AUC stands for “Area Under the Receiver Operating Characteristic Curve”. It is calculated by plotting the true positive rate against the false positive rate across all possible cutoffs. At one extreme we could predict that everyone survived: we would have a 100% chance of predicting that someone survived if they survived, but a 0% chance of predicting someone died if they died. At the other extreme, we could predict that everyone died: now we would have a 100% chance of predicting someone died if they died and a 0% chance of predicting they survived if they survived. Somewhere between these two extremes we would like to predict a high number of true positives while keeping the number of false positives low. How well our model achieves that goal is measured by AUC on a scale of 0 to 1, where 0.5 is flipping a coin, 0.8 is good, 0.9 is great, and 1 is perfect.

# Area Under the Receiver Operating Characteristic Curve
auc1 <- auc(dat$Survived, LASSOprob)
auc1
## Area under the curve: 0.8767
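To see the whole curve rather than just the summary number, pROC (loaded above) can plot it; a minimal sketch:

# Sketch: plot the ROC curve and print the AUC on it
plot(roc(dat$Survived, LASSOprob), print.auc = TRUE)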


6.2 Stepwise Model Evaluation

6.2.1 Determining Cutoff

The output of a logit is a probability: in our application, we can say a person has an X% chance of surviving. Often, though, we need to turn that probability into a binary prediction. Since there is an imbalance between those who survived (38%) and those who didn’t (62%), a 50% cutoff may not be appropriate for what we want to do. Sometimes a false positive is more costly to us than a false negative (or vice versa); this is where we can weigh those costs and adjust the cutoff.

# Loading Packages
library(InformationValue)
library(caret)
library(pROC)
library(Epi)

# Setting Up K-fold Validation
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

# Estimating Model
Stepwisemodel <- train(Survived ~ mr + Class3 + officer + log(FamilySize) + 
                     Class1 + Age + mrs + Southampton, 
                   data = dat, 
                   method = "glm", 
                   family = "binomial",
                   trControl = ctrl, 
                   tuneLength = 5)

# Identifying Ideal Cutoff
Stepwiseprob = plogis(predict(Stepwisemodel, dat))
Stepwiseoptcutoff <- optimalCutoff(dat$Survived, Stepwiseprob)[1]
Stepwiseoptcutoff
## [1] 0.618924

6.2.2 Performance Metrics

The first performance metric we’ll use is accuracy. This measure is the number of observations our model classified correctly over the total number of observations.

# Calculating Accuracy
accuracy2 <- 1 - misClassError(dat$Survived, Stepwiseprob, threshold = Stepwiseoptcutoff)
paste("Accuracy rate:", paste(round(accuracy2, digits = 4)*100, "%", sep = ""))
## [1] "Accuracy rate: 83.5%"


The next metric is sensitivity. As in section 6.1.2, we compute the number of true positives (those that survived that we predicted would survive) divided by the total number of predicted positives (the number we predicted would survive).

# Calculating Sensitivity
sensitivity2 <- sum((as.numeric(Stepwiseprob) >= Stepwiseoptcutoff) & dat$Survived == 1)/sum(as.numeric(Stepwiseprob) > Stepwiseoptcutoff)
paste("Sensitivity:", paste(round(sensitivity2, digits = 4)*100, "%", sep = ""))
## [1] "Sensitivity: 80%"


Similar to sensitivity is specificity. As in section 6.1.2, we compute the number of true negatives (those that did not survive that we predicted wouldn’t survive) divided by the total number of predicted negatives (the number we predicted would not survive).

# Calculating Specificity
specificity2 <- sum((as.numeric(Stepwiseprob) < Stepwiseoptcutoff) & dat$Survived == 0)/sum(as.numeric(Stepwiseprob) < Stepwiseoptcutoff)
paste("Specificity:", paste(round(specificity2, digits = 4)*100, "%", sep = ""))
## [1] "Specificity: 85.51%"


A confusion matrix is a good way to check for imbalances in our classifier. Ideally, we want most of our counts in the top-left and bottom-right cells; the observations outside those cells are the ones we misclassified. We also want to make sure we don’t have an overwhelming number of false positives compared to false negatives, or vice versa, which could indicate problems with the model’s usefulness.

# Displaying Confusion Matrix
as.data.table(cbind(c("", " Predicted Survived", "Predicted Died"), 
                    c("Actually Survived", sum((as.numeric(Stepwiseprob) >= Stepwiseoptcutoff) & dat$Survived == 1), sum((as.numeric(Stepwiseprob) < Stepwiseoptcutoff) & dat$Survived == 1)),
                    c("Actually Died", sum((as.numeric(Stepwiseprob) >= Stepwiseoptcutoff) & dat$Survived == 0), sum((as.numeric(Stepwiseprob) < Stepwiseoptcutoff) & dat$Survived == 0)))) %>%

kable("html", col.names = NULL, align = "c") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
                     Actually Survived   Actually Died
Predicted Survived                 260              65
Predicted Died                      82             484


AUC stands for “Area Under the Receiver Operating Characteristic Curve”. It is calculated by plotting the true positive rate against the false positive rate across all possible cutoffs. At one extreme we could predict that everyone survived: we would have a 100% chance of predicting that someone survived if they survived, but a 0% chance of predicting someone died if they died. At the other extreme, we could predict that everyone died: now we would have a 100% chance of predicting someone died if they died and a 0% chance of predicting they survived if they survived. Somewhere between these two extremes we would like to predict a high number of true positives while keeping the number of false positives low. How well our model achieves that goal is measured by AUC on a scale of 0 to 1, where 0.5 is flipping a coin, 0.8 is good, 0.9 is great, and 1 is perfect.

# Area Under the Receiver Operating Characteristic Curve
auc2 <- auc(dat$Survived, Stepwiseprob)
auc2
## Area under the curve: 0.8734


6.3 ICO Model Evaluation

6.3.1 Determining Cutoff

The output of a logit is a probability: in our application, we can say a person has an X% chance of surviving. Often, though, we need to turn that probability into a binary prediction. Since there is an imbalance between those who survived (38%) and those who didn’t (62%), a 50% cutoff may not be appropriate for what we want to do. Sometimes a false positive is more costly to us than a false negative (or vice versa); this is where we can weigh those costs and adjust the cutoff.

# Loading Packages
library(InformationValue)
library(caret)
library(pROC)
library(Epi)

# Setting Up K-fold Validation
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

# Estimating Model
ICOmodel <- train(Survived ~ Age + log(Fare + 1) + miss + master + log(FamilySize) + Class1 + 
                   Class2 + female + Southampton, 
                   data = dat, 
                   method = "glm", 
                   family = "binomial",
                   trControl = ctrl, 
                   tuneLength = 5)

# Identifying Ideal Cutoff
ICOprob = plogis(predict(ICOmodel, dat))
ICOoptcutoff <- optimalCutoff(dat$Survived, ICOprob)[1]
ICOoptcutoff
## [1] 0.6192521

6.3.2 Performance Metrics

The first performance metric we’ll use is accuracy. This measure is the number of observations our model classified correctly over the total number of observations.

# Calculating Accuracy
accuracy3 <- 1 - misClassError(dat$Survived, ICOprob, threshold = ICOoptcutoff)
paste("Accuracy rate:", paste(round(accuracy3, digits = 4)*100, "%", sep = ""))
## [1] "Accuracy rate: 83.5%"


The next metric is sensitivity. As in section 6.1.2, we compute the number of true positives (those that survived that we predicted would survive) divided by the total number of predicted positives (the number we predicted would survive).

# Calculating Sensitivity
sensitivity3 <- sum((as.numeric(ICOprob) >= ICOoptcutoff) & dat$Survived == 1)/sum(as.numeric(ICOprob) > ICOoptcutoff)
paste("Sensitivity:", paste(round(sensitivity3, digits = 4)*100, "%", sep = ""))
## [1] "Sensitivity: 79.64%"


Similar to sensitivity is specificity. As in section 6.1.2, we compute the number of true negatives (those that did not survive that we predicted wouldn’t survive) divided by the total number of predicted negatives (the number we predicted would not survive).

# Calculating Specificity
specificity3 <- sum((as.numeric(ICOprob) < ICOoptcutoff) & dat$Survived == 0)/sum(as.numeric(ICOprob) < ICOoptcutoff)
paste("Specificity:", paste(round(specificity3, digits = 4)*100, "%", sep = ""))
## [1] "Specificity: 85.77%"


A confusion matrix is a good way to check for imbalances in our classifier. Ideally, we want most of our counts in the top-left and bottom-right cells; the observations outside those cells are the ones we misclassified. We also want to make sure we don’t have an overwhelming number of false positives compared to false negatives, or vice versa, which could indicate problems with the model’s usefulness.

# Displaying Confusion Matrix
as.data.table(cbind(c("", " Predicted Survived", "Predicted Died"), 
                    c("Actually Survived", sum((as.numeric(ICOprob) >= ICOoptcutoff) & dat$Survived == 1), sum((as.numeric(ICOprob) < ICOoptcutoff) & dat$Survived == 1)),
                    c("Actually Died", sum((as.numeric(ICOprob) >= ICOoptcutoff) & dat$Survived == 0), sum((as.numeric(ICOprob) < ICOoptcutoff) & dat$Survived == 0)))) %>%

kable("html", col.names = NULL, align = "c") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
                     Actually Survived   Actually Died
Predicted Survived                 262              67
Predicted Died                      80             482


AUC stands for “Area Under the Receiver Operating Characteristic Curve”. It is calculated by plotting the true positive rate against the false positive rate across all possible cutoffs. At one extreme we could predict that everyone survived: we would have a 100% chance of predicting that someone survived if they survived, but a 0% chance of predicting someone died if they died. At the other extreme, we could predict that everyone died: now we would have a 100% chance of predicting someone died if they died and a 0% chance of predicting they survived if they survived. Somewhere between these two extremes we would like to predict a high number of true positives while keeping the number of false positives low. How well our model achieves that goal is measured by AUC on a scale of 0 to 1, where 0.5 is flipping a coin, 0.8 is good, 0.9 is great, and 1 is perfect.

# Area Under the Receiver Operating Characteristic Curve
auc3 <- auc(dat$Survived, ICOprob)
auc3
## Area under the curve: 0.877


6.4 Final Decision

Overall, I’m pretty happy with the accuracy of our models. The best logistic regression I’ve seen on Kaggle for this dataset was about 82%, and all three of our models beat that mark. Choosing among the three models depends on what is important to us. If we care more about correctly predicting survivors (with few false positives), then we would prefer the LASSO specification. If we care more about correctly predicting those who died (with few false negatives), we would want the ICO model. If we prefer overall accuracy, the LASSO model is our best choice.

# Loading Packages
library(kableExtra)
library(knitr)

# Making Table
rbind(
  c("Model Name", "Accuracy", "Sensitivity", "Specificity", "AUC"),
  c("LASSO Model", round(accuracy1, 4), round(sensitivity1, 4), round(specificity1, 4), round(auc1, 4)),
  c("Stepwise Model", round(accuracy2, 4), round(sensitivity2, 4), round(specificity2, 4), round(auc2, 4)),
  c("ICO Model", round(accuracy3, 4), round(sensitivity3, 4), round(specificity3, 4), round(auc3, 4))
) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Model Name       Accuracy   Sensitivity   Specificity      AUC
LASSO Model        0.8361        0.8043        0.8541   0.8767
Stepwise Model     0.835         0.8           0.8551   0.8734
ICO Model          0.835         0.7964        0.8577   0.877