In linear regression (ordinary least squares) the goal is to relate a response to a set of predictors. We do this by setting the mean response (the expected value) to be a linear function of the predictors. In generalized linear models (Poisson, logistic, etc.) this linear function is called the “linear predictor,” and many texts denote it by “eta” (the Greek letter η).
The word “generalized” in generalized linear models refers to relating the mean response to the linear predictor through a general link function, rather than only the identity link of ordinary linear regression. Logistic regression is the case where the response is binary: yes/no events, such as developing heart disease or not. In normal linear regression we assume the response is normally distributed; in logistic regression we assume a Bernoulli response. The mean of a Bernoulli is p, the probability that y = 1 given x. What’s the problem with setting that probability equal to the linear predictor directly? The coefficients, and hence the linear predictor, aren’t bounded between 0 and 1. So instead of modeling this mean response directly, we model the logit, or log odds, of p: since p lies between 0 and 1, the odds p/(1 - p) expand its range to (0, infinity), and the log of the odds, logit(p) = log(p/(1 - p)), can take any real value, matching the unbounded linear predictor.
Given all of this, we write down a likelihood function, maximize it to estimate the coefficients of the model, and then plug in new data to get predictions. These predictions come out on the logit scale, and we can back-transform them to probabilities. In any generalized linear model, the link function g(p) must be invertible so we can recover the probabilities after the fact.
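A tiny check of this back-and-forth in base R (not part of the original notes): qlogis() is the logit and plogis() is its inverse.
# Sketch: the logit link and its inverse in base R
p <- 0.8
(eta <- qlogis(p))   # log(p / (1 - p)) = log(4), about 1.386
plogis(eta)          # the inverse logit maps eta back to 0.8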
pkgs <- c("corrplot", "faraway", "glmnet", "ggplot2", "plotmo", "pdp", "vip", "earth", "rms")
lib <- installed.packages()[, "Package"]
install.packages(setdiff(pkgs, lib))
# Load western collaborative group study set from faraway package
data(wcgs, package = "faraway")
head(wcgs) # print first few records
## age height weight sdp dbp chol behave cigs dibep chd typechd timechd
## 2001 49 73 150 110 76 225 A2 25 B no none 1664
## 2002 42 70 160 154 84 177 A2 20 B no none 3071
## 2003 42 69 160 110 78 181 B3 0 A no none 3071
## 2004 41 68 152 124 78 132 B4 20 A no none 3064
## 2005 59 70 150 144 86 255 B3 20 A yes infdeath 1885
## 2006 44 72 204 150 90 182 B4 0 A no none 3102
## arcus
## 2001 absent
## 2002 present
## 2003 absent
## 2004 absent
## 2005 present
## 2006 absent
We have a study of middle-aged men in San Francisco that records a number of measurements on each subject and tracks whether or not he developed coronary heart disease.
We’ll do some exploratory analysis of the data and of the model we fit, in order to understand the model, why it behaves the way it does, and how it applies to our data.
# Print general structure of object
str(wcgs)
## 'data.frame': 3154 obs. of 13 variables:
## $ age : int 49 42 42 41 59 44 44 40 43 42 ...
## $ height : int 73 70 69 68 70 72 72 71 72 70 ...
## $ weight : int 150 160 160 152 150 204 164 150 190 175 ...
## $ sdp : int 110 154 110 124 144 150 130 138 146 132 ...
## $ dbp : int 76 84 78 78 86 90 84 60 76 90 ...
## $ chol : int 225 177 181 132 255 182 155 140 149 325 ...
## $ behave : Factor w/ 4 levels "A1","A2","B3",..: 2 2 3 4 3 4 4 2 3 2 ...
## $ cigs : int 25 20 0 20 20 0 0 0 25 0 ...
## $ dibep : Factor w/ 2 levels "A","B": 2 2 1 1 1 1 1 2 1 2 ...
## $ chd : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ typechd: Factor w/ 4 levels "angina","infdeath",..: 3 3 3 3 2 3 3 3 3 3 ...
## $ timechd: int 1664 3071 3071 3064 1885 3102 3074 3071 3064 1032 ...
## $ arcus : Factor w/ 2 levels "absent","present": 1 2 1 1 2 1 1 1 1 2 ...
# Extract the three columns of interest and print a summary of each
summary(wcgs[, c("chd", "height", "cigs")])
## chd height cigs
## no :2897 Min. :60.00 Min. : 0.0
## yes: 257 1st Qu.:68.00 1st Qu.: 0.0
## Median :70.00 Median : 0.0
## Mean :69.78 Mean :11.6
## 3rd Qu.:72.00 3rd Qu.:20.0
## Max. :78.00 Max. :99.0
# Fit an additive logistic regression model
lr.fit <- glm(chd ~ height + cigs, family = binomial(link = "logit"), data = wcgs)
# Print a summary of the fitted model
summary(lr.fit)
##
## Call:
## glm(formula = chd ~ height + cigs, family = binomial(link = "logit"),
## data = wcgs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0041 -0.4425 -0.3630 -0.3499 2.4357
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.50161 1.84186 -2.444 0.0145 *
## height 0.02521 0.02633 0.957 0.3383
## cigs 0.02313 0.00404 5.724 1.04e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1781.2 on 3153 degrees of freedom
## Residual deviance: 1749.0 on 3151 degrees of freedom
## AIC: 1755
##
## Number of Fisher Scoring iterations: 5
A universal way to interpret effects in a model is with plots. The benefit is that your boss probably doesn’t care about odds, but they do care about the effects of your variables.
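That said, if someone does want effects on the odds scale, exponentiating the coefficients of the fit above gives odds ratios; a quick sketch (confint.default() gives Wald-type intervals):
# Odds ratios from the height + cigs fit above
exp(coef(lr.fit))             # e.g., each extra cigarette per day multiplies the odds of CHD by exp(0.0231), about 1.02
exp(confint.default(lr.fit))  # Wald-type confidence intervals, also on the odds-ratio scale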
# Fit a reduced model by dropping the term for height
lr.fit.reduced <- glm(chd ~ cigs, family = binomial(link = "logit"), data = wcgs)
# Compare full model to reduced model; what null hypothesis is implied here?
anova(lr.fit.reduced, lr.fit, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: chd ~ cigs
## Model 2: chd ~ height + cigs
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 3152 1750
## 2 3151 1749 1 0.92025 0.3374
# I don't think drop1() or add1() are very useful; see ?drop1 for further details
drop1(lr.fit, test = "Chi")
## Single term deletions
##
## Model:
## chd ~ height + cigs
## Df Deviance AIC LRT Pr(>Chi)
## <none> 1749.0 1755.0
## height 1 1750.0 1754.0 0.9202 0.3374
## cigs 1 1780.1 1784.1 31.0695 2.49e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The plotmo package
plotmo::plotmo(lr.fit)
## plotmo grid: height cigs
## 70 0
This takes your fitted model and visualizes the effect of each variable included in the model. We did this manually last week, but this is a way of visualizing the effect of a variable on the output, all else held constant. Unfortunately it doesn’t let you choose the values at which the other variables are held, but it will save you a lot of work.
This is useful for several reasons. We can see that as height increases the predicted response goes up, which shows us the direction of the relationship as well as its magnitude. The cigs variable is clearly more impactful: its effect on the outcome is much larger, and we can see that smoking more cigarettes increases the predicted probability of developing heart disease.
Why is this cigs plot curved when the function is linear? The linear predictor is indeed linear in cigs, but it gets passed through the inverse logit, which is nonlinear, so the curve bends on the probability scale; a quick check is sketched below.
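A minimal sketch of that claim using the fit above, holding height at 70: compare predictions on the link (logit) scale with predictions on the response (probability) scale.
# Same grid of cigs values, two prediction scales
newd <- data.frame(cigs = 0:99, height = 70)
eta  <- predict(lr.fit, newdata = newd, type = "link")      # linear in cigs
p    <- predict(lr.fit, newdata = newd, type = "response")  # plogis(eta); curved
par(mfrow = c(1, 2))
plot(newd$cigs, eta, type = "l", las = 1, xlab = "cigs", ylab = "Logit scale (linear)")
plot(newd$cigs, p, type = "l", las = 1, xlab = "cigs", ylab = "Probability scale (curved)")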
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(pdp)
## Warning: package 'pdp' was built under R version 4.0.5
# Set theme for ggplot2 graphics
theme_set(theme_bw())
# These plots will be linear on the logit scale (why?) but nonlinear on the probability scale
partial(lr.fit, pred.var = "cigs", prob = TRUE, plot = TRUE, rug = TRUE,
plot.engine = "ggplot2") +
geom_rug(data = wcgs, aes(x = cigs), alpha = 0.2, inherit.aes = FALSE) # why add this?
## Warning: Use of `object[[1L]]` is discouraged. Use `.data[[1L]]` instead.
## Warning: Use of `object[["yhat"]]` is discouraged. Use `.data[["yhat"]]`
## instead.
## Warning: Use of `x.rug[[1L]]` is discouraged. Use `.data[[1L]]` instead.
The above is a partial dependence plot (PDP), introduced by Jerome Friedman in his gradient boosting paper as a way of interpreting gradient boosted models. The main difference between a partial dependence plot and the earlier plots is that rather than holding height constant at a single value, a partial dependence plot measures the effect of cigs while averaging over the values of the other variables. The charts from above are a “poor man’s” partial dependence plot: they carry less information, but they’re a quick and dirty way of assessing variables.
What’s happening on the x-axis? The rug along the bottom shows how much data there is at each level of the variable. We can make some inferences about the variable’s distribution from it: there is far more data between 0 and 50 cigarettes a day than above 50, so be careful not to extrapolate from the region where you have little data. Eyeball these plots for direction and magnitude, and if you plot the PDP for every variable in the model, put them all on the same y scale. (A by-hand version of the PDP computation is sketched below.)
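Here is a sketch of what partial() is doing for cigs, done by hand: set every row’s cigs to each grid value, predict, and average the predictions. (Contrast this with the fixed-height plot in the next chunk, which skips the averaging.)
# By-hand partial dependence of the predicted CHD probability on cigs
cigs.grid <- 0:60
pd <- sapply(cigs.grid, FUN = function(v) {
  tmp <- wcgs
  tmp$cigs <- v                      # overwrite cigs, keep everyone's height
  mean(predict(lr.fit, newdata = tmp, type = "response"))
})
plot(cigs.grid, pd, type = "l", las = 1, xlab = "cigs",
     ylab = "Average predicted probability of CHD")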
# Easy enough to do by hand; here, we're plotting the predicted probability of
# developing CHD as a function of cigs while holding height constant at its
# median value (i.e., 70 inches)
newd <- data.frame("cigs" = 0:99, height = 70)
prob <- predict(lr.fit, newdata = newd, type = "response")
plot(newd$cigs, prob, type = "l", lty = 1, las = 1,
xlab = "Number of cigarettes smoked per day",
ylab = "Conditional probabiluty of CHD")
This is a marginal-effect plot: it holds height constant at 70 and plots what happens to the predicted probability as the number of cigarettes smoked per day increases. Keep in mind that it conditions on a single height rather than averaging over heights, unlike the partial dependence plot above.
lr.fit.all <- glm(chd ~ ., family = binomial(link = "logit"), data = wcgs)#, maxit = 9999)
## Warning: glm.fit: algorithm did not converge
# What (potentially) happened?
What does “algorithm did not converge” mean?
A related warning from glm.fit in R is “glm.fit: fitted probabilities numerically 0 or 1 occurred”. This is worrying because you’re getting fitted probabilities of exactly 0 and 1, which is a bit too certain. This example is similar to one of the subjects of the homework. It’s a red flag for your model: it’s doing too well. What’s happening? Your model may actually be that good, but probably not.
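One concrete check (a quick sketch): cross-tabulate the response against typechd. Because typechd is “none” exactly for the men who never developed CHD, it predicts the outcome perfectly, which is what pushes the fitted probabilities to 0 or 1.
# typechd perfectly determines chd, which explains the 0/1 fitted probabilities
with(wcgs, table(chd, typechd))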
# Let's inspect the data a bit more; we'll start with a SPLOM
y <- ifelse(wcgs$chd == "yes", 1, 0)
palette("Okabe-Ito")
pairs(wcgs, col = adjustcolor(y + 1, alpha.f = 0.1))
palette("default") # back to default color pal
We’re encoding our “yes/no” variable as {0, 1} for coloring. There are a couple of variables in the full model that are perfectly correlated with the outcome variable: “type of heart disease” (typechd) and “time of heart disease” (timechd).
# Which columns contain missing values?
sapply(wcgs, FUN = anyNA)
## age height weight sdp dbp chol behave cigs dibep chd
## FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## typechd timechd arcus
## FALSE FALSE TRUE
# What proportion of values in each column is missing?
sapply(wcgs, FUN = function(column) mean(is.na(column)))
## age height weight sdp dbp chol
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0038046925
## behave cigs dibep chd typechd timechd
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## arcus
## 0.0006341154
# Look at correlations between numeric features
num <- sapply(wcgs, FUN = is.numeric) # identify numeric columns
(corx <- cor(wcgs[, num], use = "pairwise.complete.obs")) # simple correlation matrix
## age height weight sdp dbp
## age 1.000000000 -0.095375682 -0.034404537 0.16574640 0.13919776
## height -0.095375682 1.000000000 0.532935466 0.01837357 0.01027555
## weight -0.034404537 0.532935466 1.000000000 0.25324962 0.29592019
## sdp 0.165746397 0.018373573 0.253249623 1.00000000 0.77290641
## dbp 0.139197757 0.010275549 0.295920186 0.77290641 1.00000000
## chol 0.089188510 -0.088937779 0.008537442 0.12306130 0.12959711
## cigs -0.005033852 0.014911292 -0.081747507 0.02997753 -0.05934232
## timechd -0.070919630 -0.009895169 -0.065350046 -0.10788420 -0.11069397
## chol cigs timechd
## age 0.089188510 -0.005033852 -0.070919630
## height -0.088937779 0.014911292 -0.009895169
## weight 0.008537442 -0.081747507 -0.065350046
## sdp 0.123061297 0.029977529 -0.107884203
## dbp 0.129597108 -0.059342317 -0.110693969
## chol 1.000000000 0.096031834 -0.095390054
## cigs 0.096031834 1.000000000 -0.093933141
## timechd -0.095390054 -0.093933141 1.000000000
# Visualize correlations; can be useful if you have a lot of features
corrplot::corrplot(corx, method = "square", order = "FPC", type = "lower", diag = TRUE)
We can see some high degrees of linear association: the two blood pressure variables are more strongly correlated with each other than height and weight are. We could assess whether these variables pose a multicollinearity problem using variance inflation factors, but in this case we’re not really at that point.
# What about categorical features?
xtabs(~ behave + dibep + chd, data = wcgs) # perfect correlation?
## , , chd = no
##
## dibep
## behave A B
## A1 0 234
## A2 0 1177
## B3 1155 0
## B4 331 0
##
## , , chd = yes
##
## dibep
## behave A B
## A1 0 30
## A2 0 148
## B3 61 0
## B4 18 0
The cross-tabulation shows that the two categorical behavior variables are redundant: behave levels A1/A2 always correspond to dibep B and levels B3/B4 to dibep A, regardless of heart disease status, so the two variables carry the same information and we only need one of them.
# Refit model without leakage or redundant features
summary(lr.fit.all <- glm(chd ~ . - typechd - timechd - behave, family = binomial(link = "logit"), data = wcgs))
##
## Call:
## glm(formula = chd ~ . - typechd - timechd - behave, family = binomial(link = "logit"),
## data = wcgs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3568 -0.4350 -0.3115 -0.2204 2.8482
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.904225 2.334253 -5.528 3.24e-08 ***
## age 0.061569 0.012397 4.967 6.81e-07 ***
## height 0.007068 0.033317 0.212 0.83198
## weight 0.008578 0.003887 2.207 0.02734 *
## sdp 0.018312 0.006408 2.858 0.00427 **
## dbp -0.001175 0.010887 -0.108 0.91402
## chol 0.010708 0.001530 6.997 2.61e-12 ***
## cigs 0.020966 0.004292 4.885 1.04e-06 ***
## dibepB 0.658194 0.145951 4.510 6.49e-06 ***
## arcuspresent 0.210539 0.143864 1.463 0.14334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1769.2 on 3139 degrees of freedom
## Residual deviance: 1569.5 on 3130 degrees of freedom
## (14 observations deleted due to missingness)
## AIC: 1589.5
##
## Number of Fisher Scoring iterations: 6
# How to measure relative influence/importance of each feature?
vip::vi(lr.fit.all) # see ?vip::vi for details
## # A tibble: 9 x 3
## Variable Importance Sign
## <chr> <dbl> <chr>
## 1 chol 7.00 POS
## 2 age 4.97 POS
## 3 cigs 4.88 POS
## 4 dibepB 4.51 POS
## 5 sdp 2.86 POS
## 6 weight 2.21 POS
## 7 arcuspresent 1.46 POS
## 8 height 0.212 POS
## 9 dbp 0.108 NEG
How do you define the importance of a variable?
You could write a whole book on this, but you essentially need to consider the impact of each variable on the model. In regression, one way is to look at the regression output: the z- or t-statistic, which equals the coefficient divided by its standard error. That third column of the regression output is essentially unitless, which makes it better for comparisons than the raw coefficients. You could also standardize the predictors before fitting the regression.
The default for vi() on a regression model is to take the absolute value of that statistic and record the sign of the coefficient for the direction. There are model-specific ways to measure variable importance: the absolute value of standardized coefficients doesn’t apply to trees, for example, while permutation importance (as in scikit-learn) measures importance for essentially any model. If we want to ask “how much does a variable contribute to the predictive accuracy of the model?” for an inexpensive model like logistic regression, one option is “leave one covariate out” (LOCO) importance: leave a covariate out, refit, recompute the predictive accuracy, and measure how much worse the predictions get.
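A minimal sketch of LOCO importance for the refit logistic model above (illustration only; a proper version would score on held-out data or with cross-validation rather than on the training set):
# Leave-one-covariate-out (LOCO) importance via change in in-sample Brier score
vars <- c("age", "height", "weight", "sdp", "dbp", "chol", "cigs", "dibep", "arcus")
wcgs.cc <- na.omit(wcgs[, c("chd", vars)])
y.cc <- ifelse(wcgs.cc$chd == "yes", 1, 0)
brier <- function(fit) mean((predict(fit, type = "response") - y.cc) ^ 2)
full <- glm(chd ~ ., family = binomial(link = "logit"), data = wcgs.cc)
loco <- sapply(vars, FUN = function(v) {
  reduced <- update(full, as.formula(paste(". ~ . -", v)))
  brier(reduced) - brier(full)  # larger increase in Brier score = more important
})
sort(loco, decreasing = TRUE)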
# Variable selection using backward elimination with AIC
(lr.fit.back <- MASS::stepAIC(lr.fit.all, direction = "backward", trace = 0))
##
## Call: glm(formula = chd ~ age + weight + sdp + chol + cigs + dibep +
## arcus, family = binomial(link = "logit"), data = wcgs)
##
## Coefficients:
## (Intercept) age weight sdp chol
## -12.470720 0.061360 0.008952 0.017665 0.010664
## cigs dibepB arcuspresent
## 0.021117 0.658178 0.213072
##
## Degrees of Freedom: 3139 Total (i.e. Null); 3132 Residual
## (14 observations deleted due to missingness)
## Null Deviance: 1769
## Residual Deviance: 1570 AIC: 1586
Here we’re assessing the model with AIC. AIC, BIC, and similar criteria penalize your model when the added complexity isn’t worth the gain in fit. They work for any generalized linear model and are better suited to this job than p-values. Exactly how you decide whether a variable stays or gets dropped is a convention that depends on the criterion you choose and has changed over time.
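For a binomial GLM with a 0/1 response, the AIC is just the residual deviance plus twice the number of estimated coefficients, which you can verify against the summary of the height + cigs fit above:
# Quick sanity check: AIC = residual deviance + 2 * number of coefficients
AIC(lr.fit)                                    # 1755 in the summary above
deviance(lr.fit) + 2 * length(coef(lr.fit))    # 1749.0 + 2 * 3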
If you were to rerun the selection procedure on 20 different samples of the data, you would likely get 20 different models. The probability of selecting exactly the right subset of features from the data is essentially zero. All subsets of variables are wrong; some are useful. (A quick bootstrap illustration is sketched below.)
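A quick illustration of that instability (a sketch, not part of the original analysis): rerun backward elimination on a handful of bootstrap resamples and watch the selected subset change.
# Backward elimination on bootstrap resamples rarely picks the same subset twice
set.seed(101)  # arbitrary seed for reproducibility
cc <- na.omit(wcgs)
for (b in 1:5) {
  boot <- cc[sample(nrow(cc), replace = TRUE), ]
  fit <- glm(chd ~ . - typechd - timechd - behave,
             family = binomial(link = "logit"), data = boot)
  sel <- MASS::stepAIC(fit, direction = "backward", trace = 0)
  cat("Resample", b, ":", attr(terms(sel), "term.labels"), "\n")
}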
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
# Fit an elastic net model (i.e., LASSO and ridge penalties) using 5-fold CV
dim(wcgs.complete <- na.omit(wcgs))
## [1] 3140 13
X <- model.matrix(~ . - chd - typechd - timechd - behave - 1, data = wcgs.complete)
lr.enet <- cv.glmnet(X, y = ifelse(wcgs.complete$chd == "yes", 1, 0),
                     family = "binomial", nfolds = 5, keep = TRUE)
plot(lr.enet)
glmnet fits elastic net (ENET) model paths for some generalized linear models (GLMs)
Essentially, ENET = ridge regression + LASSO (i.e., an L2 penalty plus an L1 penalty)
See Section 6.2 of ISL book: https://hastie.su.domains/ISLR2/ISLRv2_website.pdf (FREE!!)
From my HOMLR book with Brad: https://bradleyboehmke.github.io/HOML/regularized-regression.html (FREE!!)
Intro videos: https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
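One note on the fit above: cv.glmnet() defaults to alpha = 1, which is the pure LASSO penalty. To get a genuine elastic-net blend of the L1 and L2 penalties, pass an alpha between 0 and 1; a sketch (alpha = 0.5 is an arbitrary choice for illustration):
# Elastic-net blend of the LASSO and ridge penalties
lr.enet50 <- cv.glmnet(X, y = ifelse(wcgs.complete$chd == "yes", 1, 0),
                       family = "binomial", alpha = 0.5, nfolds = 5)
coef(lr.enet50, s = "lambda.1se")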
# Compute cross-validated performance measures
perf <- assess.glmnet(lr.enet$fit.preval,
newy = ifelse(wcgs.complete$chd == "yes", 1, 0),
family = "binomial")
# Plot results
par(mfrow = c(3, 1))
plot(perf$mse, type = "b", ylab = "MSE (i.e., Brier score)", col = 2)
plot(perf$deviance, type = "b", ylab = "Binomial deviance", col = 3)
plot(perf$auc, type = "b", ylab = "AUROC", col = 4)
In this instance, the MSE is the Brier score, and it’s much more useful than the ROC curve: the Brier score tells you how accurate your predicted probabilities are.
# Inspect coefficients from three fits
coef(lr.fit.back)
## (Intercept) age weight sdp chol
## -12.470719514 0.061360188 0.008952417 0.017664936 0.010664486
## cigs dibepB arcuspresent
## 0.021116530 0.658178223 0.213071542
coef(lr.enet, s = "lambda.1se") # what does this mean?
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -6.556874e+00
## age 2.589456e-02
## height .
## weight .
## sdp 1.084071e-02
## dbp .
## chol 6.598124e-03
## cigs 7.461684e-03
## dibepA -2.580720e-01
## dibepB 1.139322e-14
## arcuspresent .
coef(lr.enet, s = "lambda.min") # what does this mean?
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.190852e+01
## age 6.027102e-02
## height 4.618375e-03
## weight 8.301126e-03
## sdp 1.750377e-02
## dbp .
## chol 1.053760e-02
## cigs 2.059928e-02
## dibepA -6.432646e-01
## dibepB 2.845976e-14
## arcuspresent 1.998339e-01
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
# Fit a degree-2 MARS model (i.e., allow up to 2-way interaction effects)
lr.mars <- earth(chd ~ . - typechd - timechd - behave, data = na.omit(wcgs),
degree = 2, glm = list(family = binomial))
# Print summary of model fit
summary(lr.mars)
## Call: earth(formula=chd~.-typechd-timechd-behave, data=na.omit(wcgs),
## glm=list(family=binomial), degree=2)
##
## GLM coefficients
## yes
## (Intercept) -0.27304152
## dibepB 0.86348050
## h(sdp-116) 0.02866052
## h(364-chol) -0.01460771
## h(chol-364) 0.07055064
## h(60-cigs) -0.02692803
## h(age-42) * dibepB 0.05236166
## h(185-weight) * dibepB -0.02360928
## h(weight-185) * dibepB -0.02880801
## h(49-age) * h(sdp-116) -0.00283210
## h(weight-215) * h(364-chol) 0.00028556
## h(sdp-116) * h(chol-308) -0.00131312
## h(sdp-116) * h(308-chol) -0.00002802
## h(308-chol) * h(60-cigs) 0.00006406
##
## GLM (family binomial, link logit):
## nulldev df dev df devratio AIC iters converged
## 1769.17 3139 1545.23 3126 0.127 1573 6 1
##
## Earth selected 14 of 20 terms, and 6 of 9 predictors
## Termination condition: Reached nk 21
## Importance: age, dibepB, chol, sdp, cigs, weight, height-unused, ...
## Number of terms at each degree of interaction: 1 5 8
## Earth GCV 0.06953471 RSS 213.705 GRSq 0.06868138 RSq 0.08786658
plot(lr.mars) # not terribly useful
library(vip)
## Warning: package 'vip' was built under R version 4.0.5
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
# Variable importance plots
vip1 <- vip(lr.fit.all, geom = "point", include_type = TRUE)
vip2 <- vip(lr.mars, geom = "point", include_type = TRUE) # see ?vip::vi_model for details
gridExtra::grid.arrange(vip1, vip2, nrow = 1)
# Look at a PD plot of age
pd1 <- partial(lr.mars, pred.var = "age", plot = TRUE, plot.engine = "ggplot2") + ggtitle("Logit scale")
pd2 <- partial(lr.mars, pred.var = "age", plot = TRUE, prob = TRUE, plot.engine = "ggplot2") + ggtitle("Probability scale")
gridExtra::grid.arrange(pd1, pd2, nrow = 1)
## Warning: Use of `object[[1L]]` is discouraged. Use `.data[[1L]]` instead.
## Warning: Use of `object[["yhat"]]` is discouraged. Use `.data[["yhat"]]`
## instead.
## Warning: Use of `object[[1L]]` is discouraged. Use `.data[[1L]]` instead.
## Warning: Use of `object[["yhat"]]` is discouraged. Use `.data[["yhat"]]`
## instead.
##
y <- na.omit(wcgs)$chd # observed classes
prob <- predict(lr.fit.back, type = "response") # predicted probabilities
classes <- ifelse(prob > 0.5, "yes", "no") # classification based on 0.5 threshold
(cm <- table("actual" = y, "predicted" = classes)) # confusion matrix
## predicted
## actual no yes
## no 2882 3
## yes 253 2
This model doesn’t predict a “yes” for heart disease well at the default 0.5 cutoff; we can change the confusion matrix by changing the threshold!
y <- na.omit(wcgs)$chd # observed classes
prob <- predict(lr.fit.back, type = "response") # predicted probabilities
classes <- ifelse(prob > 0.07, "yes", "no") # classification based on a 0.07 threshold
(cm <- table("actual" = y, "predicted" = classes)) # confusion matrix
## predicted
## actual no yes
## no 1800 1085
## yes 60 195
tp <- sum(classes == "yes" & y == "yes") # true positives
tn <- sum(classes == "no" & y == "no") # true negatives
fp <- sum(classes == "yes" & y == "no") # false positives
fn <- sum(classes == "no" & y == "yes") # false negatives
(tpr <- tp / (tp + fn)) # sensitivity
## [1] 0.7647059
(tnr <- tn / (tn + fp)) # specificity
## [1] 0.6239168
threshold <- seq(from = 0, to = 1, length = 999)
tp <- tn <- fp <- fn <- numeric(length(threshold))
for (i in seq_len(length(threshold))) {
classes <- ifelse(prob > threshold[i], "yes", "no")
tp[i] <- sum(classes == "yes" & y == "yes") # true positives
tn[i] <- sum(classes == "no" & y == "no") # true negatives
fp[i] <- sum(classes == "yes" & y == "no") # false positives
fn[i] <- sum(classes == "no" & y == "yes") # false negatives
}
tpr <- tp / (tp + fn) # sensitivity
tnr <- tn / (tn + fp) # specificity
# Plot ROC curve
plot(1 - tnr, y = tpr, type = "l", col = 2, lwd = 2, xlab = "")
abline(0, 1, lty = 2)
This ROC curve measures the model’s ability to rank positives above negatives (its discrimination) across all possible thresholds. It doesn’t tell you anything about how accurate, or well calibrated, your probabilities are.
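As a rough check on the hand-rolled curve, you can approximate the area under it with the trapezoidal rule using the tpr/tnr vectors from the loop above; it should land close to the AUC that pROC reports below.
# Trapezoidal approximation to the area under the hand-rolled ROC curve
fpr <- 1 - tnr
ord <- order(fpr, tpr)  # put the points in increasing order of false positive rate
sum(diff(fpr[ord]) * (head(tpr[ord], -1) + tail(tpr[ord], -1)) / 2)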
# Can be useful to use a package sometimes (e.g., for computing area under the ROC curve; AKA AUROC or AUC)
plot(roc1 <- pROC::roc(y, predictor = prob))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
roc1
##
## Call:
## roc.default(response = y, predictor = prob)
##
## Data: prob in 2885 controls (y no) < 255 cases (y yes).
## Area under the curve: 0.7551
# Compare to ROC curve from MARS fit
prob2 <- predict(lr.mars, newdata = na.omit(wcgs), type = "response")
lines(roc2 <- pROC::roc(y, predictor = prob2), col = 2)
## Setting levels: control = no, case = yes
## Warning in roc.default(y, predictor = prob2): Deprecated use a matrix as
## predictor. Unexpected results may be produced, please pass a numeric vector.
## Setting direction: controls < cases
roc2
##
## Call:
## roc.default(response = y, predictor = prob2)
##
## Data: prob2 in 2885 controls (y no) < 255 cases (y yes).
## Area under the curve: 0.7667
# Function to compute lift and cumulative gain charts
lift <- function(prob, y, pos.class = NULL, cumulative = TRUE) {
if (!all(sort(unique(y)) == c(0, 1))) {
if (is.null(pos.class)) {
stop("A value for `pos.class` is required whenever `y` is not a 0/1 ",
"outcome.", call. = FALSE)
}
y <- ifelse(y == pos.class, 1, 0)
}
ord <- order(prob, decreasing = TRUE)
prob <- prob[ord]
y <- y[ord]
prop <- seq_along(y) / length(y)
lift <- if (isTRUE(cumulative)) {
cumsum(y)
} else {
(cumsum(y) / seq_along(y)) / mean(y)
}
structure(list("lift" = lift, "prop" = prop, "cumulative" = cumulative,
"y" = y), class = "lift")
}
# Cumulative gain chart; what does this plot tell us?
l <- lift(prob, y = y, pos.class = "yes")
plot(l[["prop"]], l[["lift"]], type = "l",
xlab = "Proportion of sample", ylab = "Cumulative lift",
las = 1, lwd = 2, col = 2)
abline(0, sum(y == "yes"), lty = 2)
# Compare to MARS fit
l2 <- lift(prob2, y = y, pos.class = "yes")
lines(l2[["prop"]], l2[["lift"]], col = 3)
bs <- function(y, prob) { # this is a relative measure, like log loss/binomial deviance, SSE/RMSE, AIC/BIC, etc.
mean((prob - y) ^ 2) # y should be in {0, 1}
}
# Compute Brier score for previous models
y01 <- ifelse(y == "yes", 1, 0)
fit0 <- glm(chd ~ 1, data = na.omit(wcgs), family = binomial(link = "logit"))
bs(y01, prob = predict(fit0, type = "response"))
## [1] 0.0746151
bs(y01, prob = predict(lr.fit.back, type = "response"))
## [1] 0.0694546
bs(y01, prob = predict(lr.enet, newx = X, type = "response", s = "lambda.min")[, 1L, drop = TRUE])
## [1] 0.06942375
library(rms)
## Warning: package 'rms' was built under R version 4.0.5
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.0.5
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.0.5
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:TeachingDemos':
##
## cnvrt.coords, subplot
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: SparseM
## Warning: package 'SparseM' was built under R version 4.0.4
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
rms::val.prob(prob, y = y01)
## Dxy C (ROC) R2 D D:Chi-sq
## 5.102498e-01 7.551249e-01 1.430074e-01 6.326050e-02 1.996380e+02
## D:p U U:Chi-sq U:p Q
## NA -6.369427e-04 -2.273737e-13 1.000000e+00 6.389745e-02
## Brier Intercept Slope Emax E90
## 6.945460e-02 4.924095e-14 1.000000e+00 2.570397e-01 1.338980e-02
## Eavg S:z S:p
## 7.210305e-03 2.242094e-01 8.225944e-01
# Check calibration of MARS model
rms::val.prob(prob2, y = y01)
## Dxy C (ROC) R2 D D:Chi-sq
## 5.332953e-01 7.666476e-01 1.598064e-01 7.100152e-02 2.239448e+02
## D:p U U:Chi-sq U:p Q
## NA -6.369427e-04 -1.591616e-12 1.000000e+00 7.163846e-02
## Brier Intercept Slope Emax E90
## 6.824578e-02 1.298223e-12 1.000000e+00 1.444031e-01 1.939908e-02
## Eavg S:z S:p
## 9.621852e-03 2.976546e-01 7.659668e-01
library(rms)
# Refit the model using the rms package
lr.fit <- lrm(chd ~ age + weight + sdp + chol + cigs + dibep + arcus,
data = na.omit(wcgs), x = TRUE, y = TRUE)
# Resampling-based validation of the model
set.seed(1948) # for reproducibility
validate(lr.fit, method = "boot", B = 40)
## index.orig training test optimism index.corrected n
## Dxy 0.5102 0.5211 0.5030 0.0181 0.4922 40
## R2 0.1430 0.1509 0.1392 0.0118 0.1312 40
## Intercept 0.0000 0.0000 -0.1102 0.1102 -0.1102 40
## Slope 1.0000 1.0000 0.9600 0.0400 0.9600 40
## Emax 0.0000 0.0000 0.0313 0.0313 0.0313 40
## D 0.0633 0.0676 0.0615 0.0061 0.0571 40
## U -0.0006 -0.0006 0.0001 -0.0008 0.0001 40
## Q 0.0639 0.0683 0.0614 0.0069 0.0570 40
## B 0.0695 0.0704 0.0697 0.0007 0.0688 40
## g 1.0651 1.0972 1.0489 0.0484 1.0167 40
## gp 0.0737 0.0766 0.0728 0.0039 0.0698 40