Quick Recap of Logistic Regression

In linear regression (ordinary least squares) the goal is to relate a response to a set of predictors. We do this by setting the mean response (the expected value) to be a linear function of the predictors. This linear function is called the "linear predictor" in generalized linear models (Poisson, logistic, etc.), and in many texts it is denoted by the Greek letter eta (pronounced "ay-tuh").

The word "generalized" in generalized linear models refers to relating the mean response to the linear predictor through a general link function, rather than only the identity link used in ordinary linear regression. Logistic regression is set apart by the response being binary: an event happens or it doesn't, yes/no, develop heart disease or not. In ordinary linear regression we assume the response is normally distributed; in logistic regression we assume it follows a Bernoulli distribution. The mean of a Bernoulli is p, the probability that y = 1 given x. What's the problem with setting p equal to the linear predictor directly? The linear predictor isn't bounded between 0 and 1. Instead of modeling this mean response directly, we model the logit, or log odds, of p. Since p lies between 0 and 1, the odds p/(1 - p) can take any positive value, and the log odds log(p/(1 - p)) can take any real value, so it can sensibly be equated to an unbounded linear predictor.

Given all of this, we write down a likelihood function, maximize it to estimate the coefficients of the model, and then plug in new data to get predictions. These predictions are on the logit scale, and we can back-transform them to get probabilities. In any generalized linear model, the link function g(p) must be invertible so we can return to the probability scale after the fact.
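
As a concrete illustration (not part of the original notes), base R's qlogis() and plogis() implement the logit and its inverse (the logistic function), so we can move back and forth between the probability and log-odds scales:

# qlogis() is the logit (log-odds) function and plogis() is its inverse, the
# logistic function
p <- 0.8
(eta <- qlogis(p))   # log(p / (1 - p)) = log(0.8 / 0.2), roughly 1.386
plogis(eta)          # back-transform the logit to recover p = 0.8
1 / (1 + exp(-eta))  # the same back-transformation written out by hand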

# Install any packages from the list below that aren't already on this machine
pkgs <- c("corrplot", "faraway", "glmnet", "ggplot2", "plotmo", "pdp", "vip", "earth", "rms")
lib <- installed.packages()[, "Package"]  # packages currently installed
install.packages(setdiff(pkgs, lib))
# Load western collaborative group study set from faraway package
data(wcgs, package = "faraway")

head(wcgs)  # print first few records
##      age height weight sdp dbp chol behave cigs dibep chd  typechd timechd
## 2001  49     73    150 110  76  225     A2   25     B  no     none    1664
## 2002  42     70    160 154  84  177     A2   20     B  no     none    3071
## 2003  42     69    160 110  78  181     B3    0     A  no     none    3071
## 2004  41     68    152 124  78  132     B4   20     A  no     none    3064
## 2005  59     70    150 144  86  255     B3   20     A yes infdeath    1885
## 2006  44     72    204 150  90  182     B4    0     A  no     none    3102
##        arcus
## 2001  absent
## 2002 present
## 2003  absent
## 2004  absent
## 2005 present
## 2006  absent

This is a study of middle-aged men in the San Francisco area; it recorded a variety of measurements on each man and tracked whether or not he developed heart disease.

We'll do some exploratory analysis of the data and then examine the model we create in order to understand why it behaves the way it does and how it applies to our data.

# Print general structure of object
str(wcgs)
## 'data.frame':    3154 obs. of  13 variables:
##  $ age    : int  49 42 42 41 59 44 44 40 43 42 ...
##  $ height : int  73 70 69 68 70 72 72 71 72 70 ...
##  $ weight : int  150 160 160 152 150 204 164 150 190 175 ...
##  $ sdp    : int  110 154 110 124 144 150 130 138 146 132 ...
##  $ dbp    : int  76 84 78 78 86 90 84 60 76 90 ...
##  $ chol   : int  225 177 181 132 255 182 155 140 149 325 ...
##  $ behave : Factor w/ 4 levels "A1","A2","B3",..: 2 2 3 4 3 4 4 2 3 2 ...
##  $ cigs   : int  25 20 0 20 20 0 0 0 25 0 ...
##  $ dibep  : Factor w/ 2 levels "A","B": 2 2 1 1 1 1 1 2 1 2 ...
##  $ chd    : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ typechd: Factor w/ 4 levels "angina","infdeath",..: 3 3 3 3 2 3 3 3 3 3 ...
##  $ timechd: int  1664 3071 3071 3064 1885 3102 3074 3071 3064 1032 ...
##  $ arcus  : Factor w/ 2 levels "absent","present": 1 2 1 1 2 1 1 1 1 2 ...
# Extract the three columns of interest and print a summary of each
summary(wcgs[, c("chd", "height", "cigs")])
##   chd           height           cigs     
##  no :2897   Min.   :60.00   Min.   : 0.0  
##  yes: 257   1st Qu.:68.00   1st Qu.: 0.0  
##             Median :70.00   Median : 0.0  
##             Mean   :69.78   Mean   :11.6  
##             3rd Qu.:72.00   3rd Qu.:20.0  
##             Max.   :78.00   Max.   :99.0
# Fit an additive logistic regression model
lr.fit <- glm(chd ~ height + cigs, family = binomial(link = "logit"), data = wcgs)

# Print a summary of the fitted model
summary(lr.fit)
## 
## Call:
## glm(formula = chd ~ height + cigs, family = binomial(link = "logit"), 
##     data = wcgs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0041  -0.4425  -0.3630  -0.3499   2.4357  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.50161    1.84186  -2.444   0.0145 *  
## height       0.02521    0.02633   0.957   0.3383    
## cigs         0.02313    0.00404   5.724 1.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1781.2  on 3153  degrees of freedom
## Residual deviance: 1749.0  on 3151  degrees of freedom
## AIC: 1755
## 
## Number of Fisher Scoring iterations: 5

A universal way to interpret the effects in a model: plots. The benefit is that your boss probably doesn't care about odds ratios, but they do care about the effects of your variables.

# Fit a reduced model by dropping the term for height
lr.fit.reduced <- glm(chd ~ cigs, family = binomial(link = "logit"), data = wcgs)

# Compare full model to reduced model; what null hypothesis is implied here?
anova(lr.fit.reduced, lr.fit, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: chd ~ cigs
## Model 2: chd ~ height + cigs
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      3152       1750                     
## 2      3151       1749  1  0.92025   0.3374
# I don't think drop1() or add1() are very useful; see ?drop1 for further details
drop1(lr.fit, test = "Chi")
## Single term deletions
## 
## Model:
## chd ~ height + cigs
##        Df Deviance    AIC     LRT Pr(>Chi)    
## <none>      1749.0 1755.0                     
## height  1   1750.0 1754.0  0.9202   0.3374    
## cigs    1   1780.1 1784.1 31.0695 2.49e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The plotmo package
plotmo::plotmo(lr.fit)
##  plotmo grid:    height cigs
##                      70    0

This takes your fitted model and visualizes the effect of each variable included in the model. We did this manually last week, but plotmo is a way of visualizing the effect of a variable on the output with everything else held constant. Unfortunately it doesn't let you choose the values at which the other variables are held, but it will save you a lot of work.

This is useful for several reasons. We can see that as height increases the prediction goes up, which shows us the direction of the relationship as well as the magnitude of the effect. The cigs variable is clearly more impactful on the outcome, since its effect is visibly larger, and we can see that it increases the likelihood of developing heart disease.

Why is the cigs plot curved when the model itself is linear?

Even though the linear predictor is a linear function of cigs, the inverse-logit transformation back to the probability scale is nonlinear, so the curve bends.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(pdp)
## Warning: package 'pdp' was built under R version 4.0.5
# Set theme for ggplot2 graphics
theme_set(theme_bw())

# These plots will be linear on the logit scale (why?) but nonlinear on the probability scale 
partial(lr.fit, pred.var = "cigs", prob = TRUE, plot = TRUE, rug = TRUE,
        plot.engine = "ggplot2") +
  geom_rug(data = wcgs, aes(x = cigs), alpha = 0.2, inherit.aes = FALSE)  # why add this?
## Warning: Use of `object[[1L]]` is discouraged. Use `.data[[1L]]` instead.
## Warning: Use of `object[["yhat"]]` is discouraged. Use `.data[["yhat"]]`
## instead.
## Warning: Use of `x.rug[[1L]]` is discouraged. Use `.data[[1L]]` instead.

The above is a partial dependence plot, introduced by Jerome Friedman in his gradient boosting paper as a way of interpreting gradient boosted models. The main difference between a partial dependence plot and the earlier plots is that rather than holding height constant at a single value, a partial dependence plot averages the model's predictions over the observed values of the other variables at each value of cigs. The earlier charts are sometimes called "poor man's" partial dependence plots: they use less information, but they're a quick and dirty way of assessing variables.
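
A rough sketch (not the notes' code) of what's going on under the hood: for each value of cigs, predict for every row of the data with its own height but cigs set to that value, then average. This version averages on the probability scale, which may differ slightly from pdp's internal computation (pdp can work on the link scale and back-transform).

# Hand-rolled partial dependence of the predicted probability on cigs
cigs.grid <- 0:60
pd.cigs <- sapply(cigs.grid, FUN = function(x) {
  tmp <- wcgs
  tmp$cigs <- x  # set every observation's cigs to the grid value
  mean(predict(lr.fit, newdata = tmp, type = "response"))  # average prediction
})
plot(cigs.grid, pd.cigs, type = "l", xlab = "cigs", ylab = "Partial dependence")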

What's happening on the x-axis? That's a rug plot: the tick marks along the bottom show how much data there is at each value of the variable. We can make some inferences about the variable's distribution from it. There is much more data between 0 and 50 cigarettes per day than above 50, so be careful not to extrapolate in regions where you have little data. Eyeball these plots for direction and magnitude, and if you plot the PDP for every variable in the model, put them all on the same y-axis scale so the effects are comparable.

# Easy enough to do by hand; here, we're plotting the predicted probability of 
# developing CHD as a function of cigs while holding height constant at its 
# median value (i.e., 70 inches)
newd <- data.frame("cigs" = 0:99, height = 70)
prob <- predict(lr.fit, newdata = newd, type = "response")
plot(newd$cigs, prob, type = "l", lty = 1, las = 1, 
     xlab = "Number of cigarettes smoked per day", 
     ylab = "Conditional probabiluty of CHD")

This is a marginal effect plot: it holds height constant at 70 inches and plots what happens to the predicted probability as the number of cigarettes smoked per day increases. Something to keep in mind.

Model Selection

lr.fit.all <- glm(chd ~ ., family = binomial(link = "logit"), data = wcgs)#, maxit = 9999)
## Warning: glm.fit: algorithm did not converge
# What (potentially) happened?

What does “algorithm did not converge” mean?

A related R warning is "glm.fit: fitted probabilities numerically 0 or 1 occurred". This is worrying because you're getting predicted probabilities of exactly 0 or 1, which is a bit too certain. This is somewhat similar to one of the subjects of the homework. It's a red flag for your model: it's doing too well. What's happening? Your model may actually be that good, but more often something is wrong (for example, perfect separation or an outcome variable leaking into the predictors).

# Let's inspect the data a bit more; we'll start with a SPLOM
y <- ifelse(wcgs$chd == "yes", 1, 0)
palette("Okabe-Ito")
pairs(wcgs, col = adjustcolor(y + 1, alpha.f = 0.1))

palette("default")  # back to default color pal

We're encoding our "yes"/"no" outcome as {0, 1} so we can color the points. There are a couple of variables in the full model that are essentially perfectly correlated with the outcome variable: typechd (type of heart disease) and timechd (time to heart disease).
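
We can confirm the leakage with a quick cross-tabulation (a small check, not part of the original code): typechd is "none" exactly when chd is "no", so it encodes the outcome.

# typechd encodes the outcome: every chd = "no" corresponds to typechd = "none",
# so including it (or timechd, the follow-up time) leaks the answer into the model
xtabs(~ chd + typechd, data = wcgs)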

# Which columns contain missing values?
sapply(wcgs, FUN = anyNA)
##     age  height  weight     sdp     dbp    chol  behave    cigs   dibep     chd 
##   FALSE   FALSE   FALSE   FALSE   FALSE    TRUE   FALSE   FALSE   FALSE   FALSE 
## typechd timechd   arcus 
##   FALSE   FALSE    TRUE
# What proportion of each column is missing?
sapply(wcgs, FUN = function(column) mean(is.na(column)))
##          age       height       weight          sdp          dbp         chol 
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0038046925 
##       behave         cigs        dibep          chd      typechd      timechd 
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 
##        arcus 
## 0.0006341154
# Look at correlations between numeric features
num <- sapply(wcgs, FUN = is.numeric)  # identify numeric columns
(corx <- cor(wcgs[, num], use = "pairwise.complete.obs"))  # simple correlation matrix
##                  age       height       weight         sdp         dbp
## age      1.000000000 -0.095375682 -0.034404537  0.16574640  0.13919776
## height  -0.095375682  1.000000000  0.532935466  0.01837357  0.01027555
## weight  -0.034404537  0.532935466  1.000000000  0.25324962  0.29592019
## sdp      0.165746397  0.018373573  0.253249623  1.00000000  0.77290641
## dbp      0.139197757  0.010275549  0.295920186  0.77290641  1.00000000
## chol     0.089188510 -0.088937779  0.008537442  0.12306130  0.12959711
## cigs    -0.005033852  0.014911292 -0.081747507  0.02997753 -0.05934232
## timechd -0.070919630 -0.009895169 -0.065350046 -0.10788420 -0.11069397
##                 chol         cigs      timechd
## age      0.089188510 -0.005033852 -0.070919630
## height  -0.088937779  0.014911292 -0.009895169
## weight   0.008537442 -0.081747507 -0.065350046
## sdp      0.123061297  0.029977529 -0.107884203
## dbp      0.129597108 -0.059342317 -0.110693969
## chol     1.000000000  0.096031834 -0.095390054
## cigs     0.096031834  1.000000000 -0.093933141
## timechd -0.095390054 -0.093933141  1.000000000
# Visualize correlations; can be useful if you have a lot of features
corrplot::corrplot(corx, method = "square", order = "FPC", type = "lower", diag = TRUE)

We can see some high degrees of linear association: the two blood pressure variables (sdp and dbp) are more strongly linearly related than height and weight are. We could assess whether these variables pose a multicollinearity problem using variance inflation factors, but in this case we're not really at that point.
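
If we did want to check, variance inflation factors are easy to compute by hand; this sketch (not part of the original analysis) regresses each numeric predictor on the others and reports 1 / (1 - R^2), where values much above roughly 5-10 would be a warning sign.

# Hand-rolled VIFs for the numeric predictors: regress each one on the others and
# compute 1 / (1 - R^2); values near 1 indicate little linear redundancy
num.x <- c("age", "height", "weight", "sdp", "dbp", "chol", "cigs")
sapply(num.x, FUN = function(v) {
  r2 <- summary(lm(reformulate(setdiff(num.x, v), response = v), data = wcgs))$r.squared
  1 / (1 - r2)
})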

# What about categorical features?
xtabs(~ behave + dibep + chd, data = wcgs)  # perfect correlation?
## , , chd = no
## 
##       dibep
## behave    A    B
##     A1    0  234
##     A2    0 1177
##     B3 1155    0
##     B4  331    0
## 
## , , chd = yes
## 
##       dibep
## behave    A    B
##     A1    0   30
##     A2    0  148
##     B3   61    0
##     B4   18    0

The cross-tabulation shows that the two categorical behavior variables are perfectly associated with each other: behavior types A1 and A2 always correspond to dibep B, and B3 and B4 to dibep A, regardless of chd. Keeping both in the model would be redundant.

# Refit model without leakage or redundant features
summary(lr.fit.all <- glm(chd ~ . - typechd - timechd - behave, family = binomial(link = "logit"), data = wcgs))
## 
## Call:
## glm(formula = chd ~ . - typechd - timechd - behave, family = binomial(link = "logit"), 
##     data = wcgs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3568  -0.4350  -0.3115  -0.2204   2.8482  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -12.904225   2.334253  -5.528 3.24e-08 ***
## age            0.061569   0.012397   4.967 6.81e-07 ***
## height         0.007068   0.033317   0.212  0.83198    
## weight         0.008578   0.003887   2.207  0.02734 *  
## sdp            0.018312   0.006408   2.858  0.00427 ** 
## dbp           -0.001175   0.010887  -0.108  0.91402    
## chol           0.010708   0.001530   6.997 2.61e-12 ***
## cigs           0.020966   0.004292   4.885 1.04e-06 ***
## dibepB         0.658194   0.145951   4.510 6.49e-06 ***
## arcuspresent   0.210539   0.143864   1.463  0.14334    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1769.2  on 3139  degrees of freedom
## Residual deviance: 1569.5  on 3130  degrees of freedom
##   (14 observations deleted due to missingness)
## AIC: 1589.5
## 
## Number of Fisher Scoring iterations: 6
# How to measure relative influence/importance of each feature?
vip::vi(lr.fit.all)  # see ?vip::vi for details
## # A tibble: 9 x 3
##   Variable     Importance Sign 
##   <chr>             <dbl> <chr>
## 1 chol              7.00  POS  
## 2 age               4.97  POS  
## 3 cigs              4.88  POS  
## 4 dibepB            4.51  POS  
## 5 sdp               2.86  POS  
## 6 weight            2.21  POS  
## 7 arcuspresent      1.46  POS  
## 8 height            0.212 POS  
## 9 dbp               0.108 NEG

How do you define the importance of a variable?

You could write a whole book on this, but essentially you need to consider the impact of each variable on the model. In regression, one way is to look at the regression output: the z- or t-statistic, which is the coefficient divided by its standard error. That third column of the regression output is essentially unitless, which makes it better for comparisons than the raw coefficients. You could also standardize the predictors before fitting the regression.
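
A quick sketch of the standardization idea (not part of the original notes): rescale the numeric predictors to mean zero and unit variance before refitting, so each coefficient is a change in log odds per one standard deviation of the predictor.

# Standardize the numeric columns so coefficients are on a "per standard deviation"
# scale; wcgs.std is a throwaway copy of the complete cases
wcgs.std <- na.omit(wcgs)
num.cols <- sapply(wcgs.std, FUN = is.numeric)
wcgs.std[num.cols] <- scale(wcgs.std[num.cols])
round(coef(glm(chd ~ . - typechd - timechd - behave,
               family = binomial(link = "logit"), data = wcgs.std)), 3)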

The default for vip() with a (generalized) linear model is to take the absolute value of that statistic and record the sign of the coefficient as the direction. There are model-specific ways of measuring variable importance; the absolute value of a standardized coefficient doesn't apply to trees, for example. Permutation importance (available in scikit-learn, among other places) measures importance more generally for any model. If we want to ask "how much does a variable impact the predictive accuracy of the model?" for a small, inexpensive model like logistic regression, one option is "leave one covariate out", or LOCO, importance: refit the model with a covariate left out, recompute the predictive accuracy, and compare it to the full model's accuracy to see how important that variable is for prediction.
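
Here's a rough sketch of the LOCO idea (not the notes' code, and only an in-sample approximation; a proper LOCO comparison would use cross-validated or held-out accuracy): drop each feature in turn, refit on the same complete cases, and record how much the deviance increases.

# LOCO-style sketch: increase in residual deviance when each feature is dropped;
# larger increases suggest the feature matters more to the model's fit
feats <- c("age", "height", "weight", "sdp", "dbp", "chol", "cigs", "dibep", "arcus")
loco <- sapply(feats, FUN = function(v) {
  reduced <- update(lr.fit.all, formula. = as.formula(paste(". ~ . -", v)),
                    data = na.omit(wcgs))
  reduced$deviance - lr.fit.all$deviance
})
sort(loco, decreasing = TRUE)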

# Variable selection using backward elimination with AIC
(lr.fit.back <- MASS::stepAIC(lr.fit.all, direction = "backward", trace = 0))
## 
## Call:  glm(formula = chd ~ age + weight + sdp + chol + cigs + dibep + 
##     arcus, family = binomial(link = "logit"), data = wcgs)
## 
## Coefficients:
##  (Intercept)           age        weight           sdp          chol  
##   -12.470720      0.061360      0.008952      0.017665      0.010664  
##         cigs        dibepB  arcuspresent  
##     0.021117      0.658178      0.213072  
## 
## Degrees of Freedom: 3139 Total (i.e. Null);  3132 Residual
##   (14 observations deleted due to missingness)
## Null Deviance:       1769 
## Residual Deviance: 1570  AIC: 1586

Here we're assessing models with AIC. AIC, BIC, and related criteria penalize your model when the additional complexity isn't worth the improvement in fit. They work for any generalized linear model and are better suited to this job than p-values. Exactly how you decide whether a variable stays or goes has changed over time (p-values, information criteria, cross-validation, and so on).
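
For a binary logistic regression like this one (where the saturated model's log-likelihood is zero), AIC is just the residual deviance plus twice the number of estimated coefficients, so the value reported in the summary above is easy to reproduce by hand:

# AIC = residual deviance + 2 * (number of estimated coefficients); both lines
# should reproduce the AIC of roughly 1589.5 reported for lr.fit.all above
lr.fit.all$deviance + 2 * length(coef(lr.fit.all))
AIC(lr.fit.all)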

If you were to repeat this selection process 20 times (say, on 20 resamples of the data) you would likely get 20 different models. The probability of selecting exactly the right subset of features from the data is essentially zero. All subsets of variables are wrong; some are useful.
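
A small illustration of that instability (a sketch, not the original notes' code): run the same backward elimination on a few bootstrap resamples and compare which predictors survive.

# Backward elimination on bootstrap resamples of the complete cases; the selected
# subsets typically differ from resample to resample
set.seed(101)  # arbitrary seed, for reproducibility
wcgs.cc <- na.omit(wcgs)
replicate(5, {
  boot <- wcgs.cc[sample(nrow(wcgs.cc), replace = TRUE), ]
  fit <- glm(chd ~ . - typechd - timechd - behave, family = binomial, data = boot)
  names(coef(MASS::stepAIC(fit, direction = "backward", trace = 0)))[-1]
}, simplify = FALSE)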

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
# Fit a penalized logistic regression (LASSO by default; elastic net if you set
# alpha) using 5-fold cross-validation
dim(wcgs.complete <- na.omit(wcgs))
## [1] 3140   13
X <- model.matrix(~ . - chd - typechd - timechd - behave - 1, data = wcgs.complete)
lr.enet <- cv.glmnet(X, y = ifelse(wcgs.complete$chd == "yes", 1, 0), family = "binomial",
                     nfolds = 5, keep = TRUE)
plot(lr.enet)

# Compute cross-validated performance measures
perf <- assess.glmnet(lr.enet$fit.preval, 
                      newy = ifelse(wcgs.complete$chd == "yes", 1, 0),
                      family = "binomial")

# Plot results
par(mfrow = c(3, 1))
plot(perf$mse, type = "b", ylab = "MSE (i.e., Brier score)", col = 2)
plot(perf$deviance, type = "b", ylab = "Binomial deviance", col = 3)
plot(perf$auc, type = "b", ylab = "AUROC", col = 4)

In this instance the MSE is the Brier score, and it's arguably much more useful than the AUROC: the Brier score tells you how accurate your predicted probabilities are.

# Inspect coefficients from three fits
coef(lr.fit.back)
##   (Intercept)           age        weight           sdp          chol 
## -12.470719514   0.061360188   0.008952417   0.017664936   0.010664486 
##          cigs        dibepB  arcuspresent 
##   0.021116530   0.658178223   0.213071542
coef(lr.enet, s = "lambda.1se")  # what does this mean?
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                         s1
## (Intercept)  -6.556874e+00
## age           2.589456e-02
## height        .           
## weight        .           
## sdp           1.084071e-02
## dbp           .           
## chol          6.598124e-03
## cigs          7.461684e-03
## dibepA       -2.580720e-01
## dibepB        1.139322e-14
## arcuspresent  .
coef(lr.enet, s = "lambda.min")  # what does this mean?
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                         s1
## (Intercept)  -1.190852e+01
## age           6.027102e-02
## height        4.618375e-03
## weight        8.301126e-03
## sdp           1.750377e-02
## dbp           .           
## chol          1.053760e-02
## cigs          2.059928e-02
## dibepA       -6.432646e-01
## dibepB        2.845976e-14
## arcuspresent  1.998339e-01
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
# Fit a degree-2 MARS model (i.e., allow up to 2-way interaction effects)
lr.mars <- earth(chd ~ . - typechd - timechd - behave, data = na.omit(wcgs), 
                 degree = 2, glm = list(family = binomial))

# Print summary of model fit
summary(lr.mars)
## Call: earth(formula=chd~.-typechd-timechd-behave, data=na.omit(wcgs),
##             glm=list(family=binomial), degree=2)
## 
## GLM coefficients
##                                     yes
## (Intercept)                 -0.27304152
## dibepB                       0.86348050
## h(sdp-116)                   0.02866052
## h(364-chol)                 -0.01460771
## h(chol-364)                  0.07055064
## h(60-cigs)                  -0.02692803
## h(age-42) * dibepB           0.05236166
## h(185-weight) * dibepB      -0.02360928
## h(weight-185) * dibepB      -0.02880801
## h(49-age) * h(sdp-116)      -0.00283210
## h(weight-215) * h(364-chol)  0.00028556
## h(sdp-116) * h(chol-308)    -0.00131312
## h(sdp-116) * h(308-chol)    -0.00002802
## h(308-chol) * h(60-cigs)     0.00006406
## 
## GLM (family binomial, link logit):
##  nulldev   df       dev   df   devratio     AIC iters converged
##  1769.17 3139   1545.23 3126      0.127    1573     6         1
## 
## Earth selected 14 of 20 terms, and 6 of 9 predictors
## Termination condition: Reached nk 21
## Importance: age, dibepB, chol, sdp, cigs, weight, height-unused, ...
## Number of terms at each degree of interaction: 1 5 8
## Earth GCV 0.06953471    RSS 213.705    GRSq 0.06868138    RSq 0.08786658
plot(lr.mars)  # not terribly useful

library(vip)
## Warning: package 'vip' was built under R version 4.0.5
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
# Variable importance plots
vip1 <- vip(lr.fit.all, geom = "point", include_type = TRUE)
vip2 <- vip(lr.mars, geom = "point", include_type = TRUE)  # see ?vip::vi_model for details
gridExtra::grid.arrange(vip1, vip2, nrow = 1)

# Look at a PD plot of age
pd1 <- partial(lr.mars, pred.var = "age", plot = TRUE, plot.engine = "ggplot2") + ggtitle("Logit scale")
pd2 <- partial(lr.mars, pred.var = "age", plot = TRUE, prob = TRUE, plot.engine = "ggplot2") + ggtitle("Probability scale")
gridExtra::grid.arrange(pd1, pd2, nrow = 1)
## Warning: Use of `object[[1L]]` is discouraged. Use `.data[[1L]]` instead.
## Warning: Use of `object[["yhat"]]` is discouraged. Use `.data[["yhat"]]`
## instead.
## Warning: Use of `object[[1L]]` is discouraged. Use `.data[[1L]]` instead.
## Warning: Use of `object[["yhat"]]` is discouraged. Use `.data[["yhat"]]`
## instead.

y <- na.omit(wcgs)$chd  # observed classes
prob <- predict(lr.fit.back, type = "response")  # predicted probabilities
classes <- ifelse(prob > 0.5, "yes", "no")  # classification based on 0.5 threshold
(cm <- table("actual" = y, "predicted" = classes))  # confusion matrix
##       predicted
## actual   no  yes
##    no  2882    3
##    yes  253    2

This model doesn't predict "yes" for heart disease well at the default 0.5 cutoff; we can change the confusion matrix by changing the threshold!

y <- na.omit(wcgs)$chd  # observed classes
prob <- predict(lr.fit.back, type = "response")  # predicted probabilities
classes <- ifelse(prob > 0.07, "yes", "no")  # classification based on a 0.07 threshold
(cm <- table("actual" = y, "predicted" = classes))  # confusion matrix
##       predicted
## actual   no  yes
##    no  1800 1085
##    yes   60  195
tp <- sum(classes == "yes" & y == "yes")  # true positives
tn <- sum(classes == "no"  & y == "no")  # true negatives
fp <- sum(classes == "yes" & y == "no")  # false positives
fn <- sum(classes == "no"  & y == "yes")  # false negatives
(tpr <- tp / (tp + fn))  # sensitivity
## [1] 0.7647059
(tnr <- tn / (tn + fp))  # specificity
## [1] 0.6239168
threshold <- seq(from = 0, to = 1, length = 999)
tp <- tn <- fp <- fn <- numeric(length(threshold))
for (i in seq_len(length(threshold))) {
  classes <- ifelse(prob > threshold[i], "yes", "no")
  tp[i] <- sum(classes == "yes" & y == "yes")  # true positives
  tn[i] <- sum(classes == "no"  & y == "no")  # true negatives
  fp[i] <- sum(classes == "yes" & y == "no")  # false positives
  fn[i] <- sum(classes == "no"  & y == "yes")  # false negatives
}
tpr <- tp / (tp + fn)  # sensitivity
tnr <- tn / (tn + fp)  # specificity
# Plot ROC curve
plot(1 - tnr, y = tpr, type = "l", col = 2, lwd = 2, xlab = "")
abline(0, 1, lty = 2)

This ROC curve measures the model's ability to separate the positives from the negatives across all possible thresholds. It doesn't tell you anything about how accurate (well calibrated) your probabilities are.
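
A quick way to see this (not part of the original notes): squaring the predicted probabilities destroys their calibration but preserves their ordering, and the AUC doesn't change.

# AUC depends only on how the predictions rank the cases, not on calibration:
# squaring the probabilities makes them far too small but leaves the order intact
pROC::roc(y, predictor = prob)$auc    # AUC for the fitted probabilities
pROC::roc(y, predictor = prob^2)$auc  # same AUC for badly miscalibrated "probabilities"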

# Can be useful to use a package sometimes (e.g., for computing the area under the ROC curve; AKA AUROC or AUC)
plot(roc1 <- pROC::roc(y, predictor = prob))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases

roc1
## 
## Call:
## roc.default(response = y, predictor = prob)
## 
## Data: prob in 2885 controls (y no) < 255 cases (y yes).
## Area under the curve: 0.7551
# Compare to ROC curve from MARS fit
prob2 <- predict(lr.mars, newdata = na.omit(wcgs), type = "response")
lines(roc2 <- pROC::roc(y, predictor = prob2), col = 2)
## Setting levels: control = no, case = yes
## Warning in roc.default(y, predictor = prob2): Deprecated use a matrix as
## predictor. Unexpected results may be produced, please pass a numeric vector.
## Setting direction: controls < cases

roc2
## 
## Call:
## roc.default(response = y, predictor = prob2)
## 
## Data: prob2 in 2885 controls (y no) < 255 cases (y yes).
## Area under the curve: 0.7667
# Function to compute lift and cumulative gain charts
lift <- function(prob, y, pos.class = NULL, cumulative = TRUE) {
  # Recode y to a 0/1 outcome if it isn't one already
  if (!all(sort(unique(y)) == c(0, 1))) {
    if (is.null(pos.class)) {
      stop("A value for `pos.class` is required whenever `y` is not a 0/1 ",
           "outcome.", call. = FALSE)
    }
    y <- ifelse(y == pos.class, 1, 0)
  }
  # Sort observations from highest to lowest predicted probability
  ord <- order(prob, decreasing = TRUE)
  prob <- prob[ord]
  y <- y[ord]
  prop <- seq_along(y) / length(y)  # proportion of the sample targeted so far
  lift <- if (isTRUE(cumulative)) {
    cumsum(y)  # cumulative gain: number of positives captured so far
  } else {
    (cumsum(y) / seq_along(y)) / mean(y)  # lift: precision so far relative to the base rate
  }
  structure(list("lift" = lift, "prop" = prop, "cumulative" = cumulative,
                 "y" = y), class = "lift")
}
# Cumulative gain chart; what does this plot tell us?
l <- lift(prob, y = y, pos.class = "yes")
plot(l[["prop"]], l[["lift"]], type = "l", 
     xlab = "Proportion of sample", ylab = "Cumulative lift", 
     las = 1, lwd = 2, col = 2)
abline(0, sum(y == "yes"), lty = 2)

# Compare to MARS fit
l2 <- lift(prob2, y = y, pos.class = "yes")
lines(l2[["prop"]], l2[["lift"]], col = 3)

bs <- function(y, prob) {  # this is a relative measure, like log loss/binomial deviance, SSE/RMSE, AIC/BIC, etc.
  mean((prob - y) ^ 2)  # y should be in {0, 1}
}
# Compute Brier score for previous models
y01 <- ifelse(y == "yes", 1, 0)
fit0 <- glm(chd ~ 1, data = na.omit(wcgs), family = binomial(link = "logit"))
bs(y01, prob = predict(fit0, type = "response"))
## [1] 0.0746151
bs(y01, prob = predict(lr.fit.back, type = "response"))
## [1] 0.0694546
bs(y01, prob = predict(lr.enet, newx = X, type = "response", s = "lambda.min")[, 1L, drop = TRUE])
## [1] 0.06942375
library(rms)
## Warning: package 'rms' was built under R version 4.0.5
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.0.5
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.0.5
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:TeachingDemos':
## 
##     cnvrt.coords, subplot
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## Loading required package: SparseM
## Warning: package 'SparseM' was built under R version 4.0.4
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
rms::val.prob(prob, y = y01)

##           Dxy       C (ROC)            R2             D      D:Chi-sq 
##  5.102498e-01  7.551249e-01  1.430074e-01  6.326050e-02  1.996380e+02 
##           D:p             U      U:Chi-sq           U:p             Q 
##            NA -6.369427e-04 -2.273737e-13  1.000000e+00  6.389745e-02 
##         Brier     Intercept         Slope          Emax           E90 
##  6.945460e-02  4.924095e-14  1.000000e+00  2.570397e-01  1.338980e-02 
##          Eavg           S:z           S:p 
##  7.210305e-03  2.242094e-01  8.225944e-01
# Check calibration of MARS model
rms::val.prob(prob2, y = y01)

##           Dxy       C (ROC)            R2             D      D:Chi-sq 
##  5.332953e-01  7.666476e-01  1.598064e-01  7.100152e-02  2.239448e+02 
##           D:p             U      U:Chi-sq           U:p             Q 
##            NA -6.369427e-04 -1.591616e-12  1.000000e+00  7.163846e-02 
##         Brier     Intercept         Slope          Emax           E90 
##  6.824578e-02  1.298223e-12  1.000000e+00  1.444031e-01  1.939908e-02 
##          Eavg           S:z           S:p 
##  9.621852e-03  2.976546e-01  7.659668e-01
library(rms)

# Refit the model using the rms package
lr.fit <- lrm(chd ~ age + weight + sdp + chol + cigs + dibep + arcus, 
              data = na.omit(wcgs), x = TRUE, y = TRUE)

# Resampling-based validation of the model
set.seed(1948)  # for reproducibility
validate(lr.fit, method = "boot", B = 40)
##           index.orig training    test optimism index.corrected  n
## Dxy           0.5102   0.5211  0.5030   0.0181          0.4922 40
## R2            0.1430   0.1509  0.1392   0.0118          0.1312 40
## Intercept     0.0000   0.0000 -0.1102   0.1102         -0.1102 40
## Slope         1.0000   1.0000  0.9600   0.0400          0.9600 40
## Emax          0.0000   0.0000  0.0313   0.0313          0.0313 40
## D             0.0633   0.0676  0.0615   0.0061          0.0571 40
## U            -0.0006  -0.0006  0.0001  -0.0008          0.0001 40
## Q             0.0639   0.0683  0.0614   0.0069          0.0570 40
## B             0.0695   0.0704  0.0697   0.0007          0.0688 40
## g             1.0651   1.0972  1.0489   0.0484          1.0167 40
## gp            0.0737   0.0766  0.0728   0.0039          0.0698 40