Anna
01 November, 2020
See: https://thestatsgeek.com/2014/02/08/r-squared-in-logistic-regression/
Binary logistic regression assumptions:
Maximum likelihood means that “the parameter estimates are those values which maximize the likelihood of the data which have been observed” (see link below)
Maximum likelihood estimation requires a larger sample size: logit models need more cases than OLS regression because they are estimated by maximum likelihood rather than OLS (a common recommendation is n >= 400, for both the training and the test data)
When the outcome is rare (<10%), it can be difficult to estimate a logit model even if the overall dataset is large (consider discriminant analysis instead)
No multicollinearity among predictors (if predictors are strongly correlated, keep only one of them)
For categorical predictors, look at the crosstabs to make sure there are no empty cells (if there are cells with 0-4 observations, collapse the category with the closest one)
No perfect predictors (if a single variable predicts the outcome perfectly, you do not need a model, and the estimates will not converge)
No omitted variables (they bias coefficients)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
## vars n mean sd median trimmed mad min max range skew
## admit 1 400 0.32 0.47 0.0 0.27 0.00 0.00 1 1.00 0.78
## gre 2 400 587.70 115.52 580.0 589.06 118.61 220.00 800 580.00 -0.14
## gpa 3 400 3.39 0.38 3.4 3.40 0.40 2.26 4 1.74 -0.21
## rank 4 400 2.48 0.94 2.0 2.48 1.48 1.00 4 3.00 0.10
## kurtosis se
## admit -1.39 0.02
## gre -0.36 5.78
## gpa -0.60 0.02
## rank -0.91 0.05
The ‘rank’ variable is read as quantitative – convert it to a factor and describe the data again:
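The conversion itself is hidden in the output below; a minimal sketch of what likely produced it, assuming the data frame is called mydata and the descriptives come from the psych package:
class(mydata$rank)          # "integer"
table(mydata$rank)          # check the cell counts before converting
mydata$rank <- factor(mydata$rank)
class(mydata$rank)          # now "factor"
psych::describe(mydata)     # describe the data again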
## [1] "integer"
##
## 1 2 3 4
## 61 151 121 67
## [1] "factor"
## vars n mean sd median trimmed mad min max range skew
## admit* 1 400 1.32 0.47 1.0 1.27 0.00 1.00 2 1.00 0.78
## gre 2 400 587.70 115.52 580.0 589.06 118.61 220.00 800 580.00 -0.14
## gpa 3 400 3.39 0.38 3.4 3.40 0.40 2.26 4 1.74 -0.21
## rank* 4 400 2.48 0.94 2.0 2.48 1.48 1.00 4 3.00 0.10
## kurtosis se
## admit* -1.39 0.02
## gre -0.36 5.78
## gpa -0.60 0.02
## rank* -0.91 0.05
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max range skew
## admit* 1 273 1.00 0.00 1.00 1.00 0.00 1.00 1 0.00 NaN
## gre 2 273 573.19 115.83 580.00 574.25 118.61 220.00 800 580.00 -0.10
## gpa 3 273 3.34 0.38 3.34 3.35 0.40 2.26 4 1.74 -0.07
## rank* 4 273 2.64 0.92 3.00 2.68 1.48 1.00 4 3.00 -0.03
## kurtosis se
## admit* NaN 0.00
## gre -0.38 7.01
## gpa -0.55 0.02
## rank* -0.89 0.06
## ------------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max range skew
## admit* 1 127 2.00 0.00 2.00 2.00 0.00 2.00 2 0.00 NaN
## gre 2 127 618.90 108.88 620.00 620.39 118.61 300.00 800 500.00 -0.17
## gpa 3 127 3.49 0.37 3.54 3.51 0.40 2.42 4 1.58 -0.53
## rank* 4 127 2.15 0.92 2.00 2.07 1.48 1.00 4 3.00 0.44
## kurtosis se
## admit* NaN 0.00
## gre -0.35 9.66
## gpa -0.38 0.03
## rank* -0.64 0.08
The share of admit = 1 is 0.32. This means that, without any other information, there is a 0.68 probability that admit = 0.
Check the share of missing data (if <= 5%, it can usually be imputed). If a lot of data are missing, consider the possible biases this introduces, or use the variable with many missing values only in robustness-check models rather than in the main model.
library(Amelia)
library(mlbench)
missmap(mydata, col = c("blue", "red"), legend = T, main = "Missing values vs observed")
There are no blue cells, which means there is no missing data in this dataset.
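For the categorical-predictor assumption above, a cross-tabulation of the outcome against rank can be produced, for example, with xtabs() (a sketch consistent with the table below):
xtabs(~ admit + rank, data = mydata)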
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
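The model is fitted with glm() using the binomial family; judging by the Call shown in the output below, it was likely:
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)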
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
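The sequential analysis-of-deviance table that follows can be obtained with anova() on the fitted model (a sketch):
anova(mylogit, test = "Chisq")  # terms added sequentially, chi-squared tests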
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: admit
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 399 499.98
## gre 1 13.9204 398 486.06 0.0001907 ***
## gpa 1 5.7122 397 480.34 0.0168478 *
## rank 3 21.8265 394 458.52 7.088e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ \begin{aligned} \log\left[ \frac { P( \operatorname{admit} = \operatorname{1} ) }{ 1 - P( \operatorname{admit} = \operatorname{1} ) } \right] &= \alpha + \beta_{1}(\operatorname{gre}) + \beta_{2}(\operatorname{gpa}) + \beta_{3}(\operatorname{rank}_{\operatorname{2}})\ + \\ &\quad \beta_{4}(\operatorname{rank}_{\operatorname{3}}) + \beta_{5}(\operatorname{rank}_{\operatorname{4}}) \end{aligned} \]
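The coefficient, observation-level, and model-level tibbles below look like broom output (an assumption); a sketch:
library(broom)
tidy(mylogit)     # coefficient table
augment(mylogit)  # fitted values, residuals and influence measures per observation
glance(mylogit)   # model-level fit statistics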
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.99 1.14 -3.50 0.000465
## 2 gre 0.00226 0.00109 2.07 0.0385
## 3 gpa 0.804 0.332 2.42 0.0154
## 4 rank2 -0.675 0.316 -2.13 0.0328
## 5 rank3 -1.34 0.345 -3.88 0.000104
## 6 rank4 -1.55 0.418 -3.71 0.000205
## # A tibble: 400 x 10
## admit gre gpa rank .fitted .resid .std.resid .hat .sigma .cooksd
## <fct> <int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 380 3.61 3 -1.57 -0.616 -0.621 0.0161 1.08 0.000579
## 2 1 660 3.67 3 -0.885 1.57 1.58 0.0109 1.08 0.00450
## 3 1 800 4 1 1.04 0.779 0.788 0.0234 1.08 0.00145
## 4 1 640 3.19 4 -1.53 1.86 1.87 0.0167 1.08 0.0132
## 5 0 520 2.93 4 -2.01 -0.502 -0.505 0.0132 1.08 0.000302
## 6 1 760 3 2 -0.532 1.41 1.43 0.0213 1.08 0.00631
## 7 1 560 2.98 1 -0.326 1.32 1.33 0.0221 1.08 0.00535
## 8 0 400 3.08 2 -1.28 -0.699 -0.704 0.0129 1.08 0.000610
## 9 1 540 3.39 3 -1.38 1.79 1.80 0.00848 1.08 0.00573
## 10 0 700 3.92 2 0.0715 -1.21 -1.22 0.0145 1.08 0.00268
## # ... with 390 more rows
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 500. 399 -229. 471. 494. 459. 394 400
This model output shows that all predictors are significant; therefore, we look at the estimates.
Positive estimates indicate that, as the predictor changes from 0 to 1 (dummy) or increases by one unit (continuous), the log odds of admission increase; a negative estimate indicates a decrease. To interpret the coefficients more intuitively, we convert them to odds ratios or, better, to predicted probabilities (see the next slide).
Change the reference level of a categorical predictor:
mydatarank2 <- within(mydata, rank <- relevel(rank, ref = 2))
mylogitrank2 <- update(mylogit, data = mydatarank2)
summary(mylogitrank2)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = mydatarank2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.665422 1.109370 -4.205 2.61e-05 ***
## gre 0.002264 0.001094 2.070 0.0385 *
## gpa 0.804038 0.331819 2.423 0.0154 *
## rank1 0.675443 0.316490 2.134 0.0328 *
## rank3 -0.664761 0.283319 -2.346 0.0190 *
## rank4 -0.876021 0.366735 -2.389 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
First, let’s recall how the logit, log(p/(1-p)), relates to p, the probability of the event:
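A quick numerical reminder of the link between the two scales, using base R only:
p <- c(0.1, 0.5, 0.9)
qlogis(p)          # log(p / (1 - p)), the logit
plogis(qlogis(p))  # the inverse logistic transformation returns the probabilities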
Warm-up: What is the odds ratio of drowning when wearing a lab coat, as compared to not wearing one?
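The odds ratios with confidence intervals shown below can be obtained by exponentiating the coefficients and their profile-likelihood CIs (a sketch consistent with the output that follows):
exp(cbind(OR = coef(mylogit), confint(mylogit)))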
## OR 2.5 % 97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre 1.0022670 1.000137602 1.0044457
## gpa 2.2345448 1.173858216 4.3238349
## rank2 0.5089310 0.272289674 0.9448343
## rank3 0.2617923 0.131641717 0.5115181
## rank4 0.2119375 0.090715546 0.4706961
When gpa increases by one unit, the odds of admission change by a factor of 2.23, i.e. they are multiplied by 2.23, i.e. they increase by 123 per cent (95% CI [1.17; 4.32]).
Students who graduate from a rank 2 university/college have odds of being admitted to graduate school that change by a factor of 0.51, i.e. are roughly halved, i.e. decrease by 49 per cent, compared to graduates of a rank 1 university/college (95% CI [0.27; 0.94]).
Create a new dataset with mean values of gre and gpa and all four ranks of previous schools:
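A sketch of how such a data frame is likely created (an assumption; the values below match the means of gre and gpa):
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))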
## gre gpa rank
## 1 587.7 3.3899 1
## 2 587.7 3.3899 2
## 3 587.7 3.3899 3
## 4 587.7 3.3899 4
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1 # predicted probabilities for different ranks at the mean levels of gre and gpa
## gre gpa rank rankP
## 1 587.7 3.3899 1 0.5166016
## 2 587.7 3.3899 2 0.3522846
## 3 587.7 3.3899 3 0.2186120
## 4 587.7 3.3899 4 0.1846684
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100), 4),
gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
#Tip! If you ever need to predict on a combination of factor variables, use tidyr::crossing() for creating the new data
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "response", se = T))
Plot the predicted values with confidence intervals (rank * gre):
newdata3 <- within(newdata3, {
PredictedProb <- fit
LL <- fit - (1.96 * se.fit)
UL <- fit + (1.96 * se.fit)
})
head(newdata3)
## gre gpa rank fit se.fit residual.scale UL LL
## 1 200.0000 3.3899 1 0.3075737 0.1096320 1 0.5224523 0.09269508
## 2 206.0606 3.3899 1 0.3105042 0.1089936 1 0.5241316 0.09687675
## 3 212.1212 3.3899 1 0.3134499 0.1083418 1 0.5257999 0.10110004
## 4 218.1818 3.3899 1 0.3164108 0.1076768 1 0.5274574 0.10536425
## 5 224.2424 3.3899 1 0.3193867 0.1069990 1 0.5291047 0.10966861
## 6 230.3030 3.3899 1 0.3223773 0.1063086 1 0.5307422 0.11401236
## PredictedProb
## 1 0.3075737
## 2 0.3105042
## 3 0.3134499
## 4 0.3164108
## 5 0.3193867
## 6 0.3223773
library(ggplot2)
ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
geom_line(aes(colour = rank), size = 0.75) +
ylim(0, 1)
For gre <= 380, there is no difference between the graduates of rank 1 to rank 4 universities/colleges. With higher gre scores, rank 1 school graduates have a higher probability of being admitted than rank 3 and rank 4 graduates. At the same time, rank 2 school graduates are not statistically significantly different from rank 1, rank 3, or rank 4 graduates.
#newdata4 <- with(mydata, data.frame(gre = mean(gre),
# gpa = rep(seq(from = 2.2, to = 4.0, length.out = 18), 4),
# rank = factor(rep(1:4, each = 18))))
# compare the old routine with 'crossing' from tidyr package:
library(tidyr)
newdata4 <- crossing(gre = mean(mydata$gre),
gpa = seq(from = 2.2, to = 4.0, length.out = 18),
rank = factor(1:4))
newdata5 <- cbind(newdata4, predict(mylogit, newdata = newdata4, type = "link", se = T))
newdata5 <- within(newdata5, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
ggplot(newdata5, aes(x = gpa, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
geom_line(aes(colour = rank), size = 0.75) +
ylim(0, 1)
There is a similar story here: if gpa is <= 2.8, there is no difference across university/college ranks. For higher gpa levels, rank 1 graduates have higher predicted probabilities than rank 3 and rank 4 school graduates.
Recall that:
In linear regression, there are the R-squared and the F-test to evaluate the overall model quality.
In linear regression, the R-squared shows the explained share of the outcome’s variance, while the F-test says whether the model is better than no model (= the intercept = the mean).
In a binary logistic model, the overall goodness-of-fit measures are based on the log-likelihood and quantities derived from it (deviance, AIC, BIC, pseudo-R-squared), and on goodness-of-fit tests such as Hosmer-Lemeshow.
See https://www.r-bloggers.com/evaluating-logistic-regression-models/
Let’s look at them, one by one.
The log-likelihood (LL) is a measure of the discrepancy between the model and the data.
The closer the LL is to zero, the better.
Based on the LL are the deviance, the likelihood-ratio test, AIC/BIC, and the pseudo-R-squared measures.
We can compare them across nested models.
library(lmtest)
mymodel2 <- glm(admit ~ gre + rank, data = mydata, family = "binomial")
mymodel3 <- glm(admit ~ rank, data = mydata, family = "binomial")
lrtest(mymodel3, mymodel2, mylogit) # you want LogLik closer to 0
## Likelihood ratio test
##
## Model 1: admit ~ rank
## Model 2: admit ~ gre + rank
## Model 3: admit ~ gre + gpa + rank
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -237.48
## 2 5 -232.27 1 10.4349 0.001237 **
## 3 6 -229.26 1 6.0143 0.014190 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
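The equivalent deviance-based comparison can be done with anova() (a sketch matching the table that follows):
anova(mymodel3, mymodel2, mylogit, test = "Chisq")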
## Analysis of Deviance Table
##
## Model 1: admit ~ rank
## Model 2: admit ~ gre + rank
## Model 3: admit ~ gre + gpa + rank
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 396 474.97
## 2 395 464.53 1 10.4349 0.001237 **
## 3 394 458.52 1 6.0143 0.014190 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These are two equivalent ways to compare nested models (lrtest() from lmtest and anova() with a chi-squared test). Both outputs say that the 2nd model is significantly better than the 1st one, and that the 3rd model is significantly better than the 2nd one.
library(sjPlot)
tab_model(mymodel3, mymodel2, mylogit, show.aic = T, show.loglik = T, bootstrap = T,
          dv.labels = c('mymodel3', 'mymodel2', 'mylogit'))

|                | mymodel3    |             |        | mymodel2    |             |        | mylogit     |             |        |
|----------------|-------------|-------------|--------|-------------|-------------|--------|-------------|-------------|--------|
| Predictors     | Odds Ratios | CI          | p      | Odds Ratios | CI          | p      | Odds Ratios | CI          | p      |
| (Intercept)    | 1.19        | 0.69 – 2.00 | 0.454  | 0.16        | 0.04 – 0.59 | 0.004  | 0.02        | 0.00 – 0.14 | <0.001 |
| rank [2]       | 0.47        | 0.24 – 0.87 | 0.018  | 0.50        | 0.25 – 0.93 | 0.026  | 0.50        | 0.28 – 0.93 | 0.030  |
| rank [3]       | 0.25        | 0.13 – 0.49 | <0.001 | 0.28        | 0.14 – 0.52 | <0.001 | 0.26        | 0.13 – 0.53 | <0.001 |
| rank [4]       | 0.18        | 0.07 – 0.39 | <0.001 | 0.20        | 0.08 – 0.42 | <0.001 | 0.21        | 0.08 – 0.45 | 0.002  |
| gre            |             |             |        | 1.00        | 1.00 – 1.01 | 0.002  | 1.00        | 1.00 – 1.00 | 0.038  |
| gpa            |             |             |        |             |             |        | 2.28        | 1.18 – 4.61 | 0.016  |
| Observations   | 400         |             |        | 400         |             |        | 400         |             |        |
| R2 Tjur        | 0.063       |             |        | 0.087       |             |        | 0.102       |             |        |
| AIC            | 482.967     |             |        | 474.532     |             |        | 470.517     |             |        |
| log-Likelihood | -237.483    |             |        | -232.266    |             |        | -229.259    |             |        |
We choose the best model (mylogit) and interpret it.
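The pseudo-R-squared measures below are what pR2() from the pscl package reports (a sketch):
library(pscl)
pR2(mylogit)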
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -229.25874624 -249.98825878 41.45902508 0.08292194 0.09845702
## r2CU
## 0.13799580
See: https://www.rdocumentation.org/packages/pscl/versions/1.5.2/topics/pR2
See: https://is.muni.cz/el/1423/podzim2010/SPP456/Regression_Models_For_Categorical_Dependent_Variables_USING_STATA.pdf
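The model comparison table below resembles the output of compareGLM() from the rcompanion package (an assumption):
library(rcompanion)
compareGLM(mymodel3, mymodel2, mylogit)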
## $Models
## Formula
## 1 "admit ~ rank"
## 2 "admit ~ gre + rank"
## 3 "admit ~ gre + gpa + rank"
##
## $Fit.criteria
## Rank Df.res AIC AICc BIC McFadden Cox.and.Snell Nagelkerke p.value
## 1 4 396 485.0 485.1 504.9 0.05002 0.06061 0.08495 7.399e-06
## 2 5 395 476.5 476.7 500.5 0.07089 0.08480 0.11890 1.781e-07
## 3 6 394 472.5 472.8 500.5 0.08292 0.09846 0.13800 3.528e-08
A technical note: if you use grouped data, pseudo-R-squared measures are likely to be overestimated several-fold. To deal with this, recalculate the model fit after expanding the data back to individual observations.
In addition, “R-squared says nothing about prediction error.” See: https://data.library.virginia.edu/is-r-squared-useless/
A well-fitting model shows no significant difference between the model and the observed data
In this test, a well-fitting model should report a p-value greater than 0.05
“If the p-value is small, this is indicative of poor fit. A large p-value does not mean the model fits well, since lack of evidence against a null hypothesis is not equivalent to evidence in favour of the alternative hypothesis. In particular, if our sample size is small, a high p-value from the test may simply be a consequence of the test having lower power to detect mis-specification, rather than being indicative of good fit.”
See: http://thestatsgeek.com/2014/02/16/the-hosmer-lemeshow-goodness-of-fit-test-for-logistic-regression/
#install.packages("generalhoslem")
library(generalhoslem)
logitgof(mydata$admit, fitted(mylogit), g = 10) # g should be larger than the number of predictors; df = g - 2
##
## Hosmer and Lemeshow test (binary model)
##
## data: mydata$admit, fitted(mylogit)
## X-squared = 11.085, df = 8, p-value = 0.1969
Due to its low power, alternatives to the Hosmer-Lemeshow test have been suggested: the Osius-Rojek test and the Stukel test (see functions to calculate them here: http://www.chrisbilder.com/categorical/Chapter5/AllGOFTests.R ).
For comparison, let’s try them as well:
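A sketch, assuming the helper functions in the sourced file are named o.r.test() and stukel.test() (check the file itself for the exact names):
source("http://www.chrisbilder.com/categorical/Chapter5/AllGOFTests.R")
o.r.test(mylogit)    # Osius-Rojek test (assumed function name)
stukel.test(mylogit) # Stukel test (assumed function name)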
## z = 0.4920821 with p-value = 0.6226613
## Stukel Test Stat = 0.1663363 with p-value = 0.9201964
The results are similar: since the p-values are above the 0.05 threshold, there is no statistical evidence of poor fit, i.e. no evidence that the occurrence of the event differs between the fitted and the observed values.
A technical note: different fields use different criteria of classification success (sensitivity and specificity vs. precision and recall); see, e.g., https://uberpython.wordpress.com/2012/01/01/precision-recall-sensitivity-and-specificity/
First, we need to create training and test subsamples and fit a model on the training data:
bound <- floor((nrow(mydata)/4)*3) #define 75% of training and test set (could be 50%, 60%)
set.seed(123)
df <- mydata[sample(nrow(mydata)), ] #sample 400 random rows out of the data
df.train <- df[1:bound, ] #get training set
df.test <- df[(bound+1):nrow(df), ] #get test set
mylogit1 <- glm(admit ~ gre + gpa + rank, data = df.train, family = "binomial",
                na.action = na.omit)
Then we count the proportion of outcomes accurately predicted on the test data:
pred <- format(round(predict(mylogit1, newdata = df.test, type = "response")))
accuracy <- table(pred, df.test[,"admit"])
sum(diag(accuracy))/sum(accuracy)
## [1] 0.7
## df.test$admit
## pred 0 1
## 0 63 19
## 1 11 7
This may return the error message: “Error: data and reference should be factors with the same levels.”
If this happens, try the following:
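A minimal sketch of one possible fix, assuming pred and df.test are as defined above: pass a cross-table to confusionMatrix() from caret instead of the raw vectors.
library(caret)
confusionMatrix(table(pred, df.test[, "admit"]))  # passing a table avoids the factor-levels error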
## Confusion Matrix and Statistics
##
##
## pred 0 1
## 0 63 19
## 1 11 7
##
## Accuracy : 0.7
## 95% CI : (0.6002, 0.7876)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.8474
##
## Kappa : 0.1339
##
## Mcnemar's Test P-Value : 0.2012
##
## Sensitivity : 0.8514
## Specificity : 0.2692
## Pos Pred Value : 0.7683
## Neg Pred Value : 0.3889
## Prevalence : 0.7400
## Detection Rate : 0.6300
## Detection Prevalence : 0.8200
## Balanced Accuracy : 0.5603
##
## 'Positive' Class : 0
##
On Kappa (the improvement of the observed accuracy over the accuracy expected by chance), see: https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english
Conclusion: our model’s accuracy is not statistically significantly better than that of no model at all (the No Information Rate).
But there are only 300 + 100 observations. We need more observations and, probably, more predictors.
First, let’s learn by looking at a very well-fitting and a poorly fitting model.
Here is an ROC curve with a strong predictor.
By contrast, this one is a very poor predictor.
library(pROC)
# Compute AUC for predicting admit with continuous variables:
f1 <- roc(admit ~ gre + gpa, data = df.train)
ggroc(f1) + theme_bw() + geom_abline(intercept = 1, slope = 1)
##
## Call:
## roc.formula(formula = admit ~ gre, data = df.train)
##
## Data: gre in 199 controls (admit 0) < 101 cases (admit 1).
## Area under the curve: 0.5868
“A model with high discrimination ability will have high sensitivity and specificity simultaneously, leading to an ROC curve which goes close to the top left corner of the plot. A model with no discrimination ability will have an ROC curve which is the 45 degree diagonal line.” See: https://thestatsgeek.com/2014/05/05/area-under-the-roc-curve-assessing-discrimination-in-logistic-regression/
gre is a poor discriminator (AUROC = 0.59, well below 0.80).
gpa is a rather poor discriminator as well (AUROC < 0.80).
See: https://cran.r-project.org/web/packages/margins/vignettes/Introduction.html
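The AME table below matches what summary() of a margins object reports (a sketch):
library(margins)
summary(margins(mylogit))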
## factor AME SE z p lower upper
## gpa 0.1564 0.0630 2.4846 0.0130 0.0330 0.2798
## gre 0.0004 0.0002 2.1068 0.0351 0.0000 0.0009
## rank2 -0.1566 0.0736 -2.1272 0.0334 -0.3009 -0.0123
## rank3 -0.2872 0.0733 -3.9170 0.0001 -0.4309 -0.1435
## rank4 -0.3212 0.0802 -4.0052 0.0001 -0.4784 -0.1640
Interpretation:
For two hypothetical individuals with average values on gpa (3.4) and gre (588), the predicted probability of admission is 0.16 lower for the individual from a rank 2 school than for one from a rank 1 school (the reference category).
AMEs for continuous variables measure the instantaneous rate of change, which may or may not be close to the effect on P(Y=1) of a one-unit increase in X. In plain words, it is correct to use small steps of X for interpretation: if, say, gpa increased by some very small amount (e.g. 0.01), then P(Y=1) would increase by about 0.01 * 0.156 = 0.00156. (There is no guarantee that a bigger increase in X, e.g. 1, would produce an increase of 1 * 0.156 = 0.156.)
Overall: “For categorical variables with more than two possible values, e.g. religion, the marginal effects show you the difference in the predicted probabilities for cases in one category relative to the reference category.”
See: https://www3.nd.edu/~rwilliam/stats3/Margins02.pdf
In a logit model, the units of marginal effects are changes in the predicted probability of Y; in a linear (OLS) model, they are in the natural units of Y.
A technical note: AME is, in essence, “the slope of multi-dimensional surface with respect to one dimension of that surface. Partial derivatives can translate the coefficients to the marginal (in % of probability) contribution of X to the outcome.”
“For the non-numeric classes, discrete differences rather than partial derivatives are reported as the partial derivative of a discrete variable is undefined” See: https://cran.r-project.org/web/packages/margins/vignettes/TechnicalDetails.pdf
Now, if you want to calculate AMEs at more than just the mean values, you can try a “five-number” approach by estimating the AME at the minimum, the 25th, 50th, and 75th percentiles, and the maximum of a continuous predictor (here, gre), as sketched below:
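A sketch: fivenum() returns exactly the five gre values that appear in the table below (minimum, lower hinge, median, upper hinge, maximum).
library(margins)
margins(mylogit, at = list(gre = fivenum(mydata$gre)))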
## at(gre) gre gpa rank2 rank3 rank4
## 220 0.0003054 0.1084 -0.1253 -0.2090 -0.2282
## 520 0.0004254 0.1510 -0.1571 -0.2810 -0.3120
## 580 0.0004465 0.1585 -0.1606 -0.2920 -0.3257
## 660 0.0004716 0.1675 -0.1632 -0.3039 -0.3411
## 800 0.0005043 0.1791 -0.1620 -0.3155 -0.3585
## xvals yvals upper lower
## 1 220.0000 0.1912913 0.3293332 0.05324935
## 2 244.1667 0.1999003 0.3349924 0.06480810
## 3 268.3333 0.2087967 0.3405397 0.07705355
## 4 292.5000 0.2179811 0.3459866 0.08997559
## 5 316.6667 0.2274534 0.3513519 0.10355486
## 6 340.8333 0.2372125 0.3566644 0.11776057
## 7 365.0000 0.2472562 0.3619647 0.13254773
## 8 389.1667 0.2575817 0.3673097 0.14785363
## 9 413.3333 0.2681847 0.3727760 0.16359328
## 10 437.5000 0.2790601 0.3784663 0.17965382
## 11 461.6667 0.2902016 0.3845153 0.19588791
## 12 485.8333 0.3016019 0.3910967 0.21210700
## 13 510.0000 0.3132523 0.3984280 0.22807664
## 14 534.1667 0.3251433 0.4067686 0.24351805
## 15 558.3333 0.3372640 0.4164053 0.25812272
## 16 582.5000 0.3496025 0.4276189 0.27158598
## 17 606.6667 0.3621457 0.4406338 0.28365750
## 18 630.8333 0.3748795 0.4555665 0.29419252
## 19 655.0000 0.3877889 0.4723996 0.30317809
## 20 679.1667 0.4008577 0.4909949 0.31072049
options(scipen = 999)
cplot(mylogit, "gre", what = "effect", main = "Average Marginal Effect of gre")
The two plots above have different Y-axes: the upper one shows predicted probabilities (fitted values from predict()), while the lower one shows the AME on the outcome (calculated with margins()).
## at(gpa) gre gpa rank2 rank3 rank4
## 2.260 0.0002921 0.1037 -0.1209 -0.2001 -0.2181
## 3.130 0.0004167 0.1479 -0.1561 -0.2768 -0.3068
## 3.395 0.0004501 0.1598 -0.1619 -0.2948 -0.3289
## 3.670 0.0004795 0.1703 -0.1645 -0.3087 -0.3470
## 4.000 0.0005052 0.1794 -0.1627 -0.3174 -0.3606
## xvals yvals upper lower
## 1 2.2600 0.1798308 0.2984691 0.06119251
## 2 2.3325 0.1885895 0.3051858 0.07199319
## 3 2.4050 0.1976720 0.3118808 0.08346312
## 4 2.4775 0.2070802 0.3185673 0.09559314
## 5 2.5500 0.2168152 0.3252662 0.10836423
## 6 2.6225 0.2268769 0.3320085 0.12174533
## 7 2.6950 0.2372641 0.3388377 0.13569055
## 8 2.7675 0.2479744 0.3458129 0.15013578
## 9 2.8400 0.2590039 0.3530130 0.16499472
## 10 2.9125 0.2703476 0.3605409 0.18015436
## 11 2.9850 0.2819991 0.3685276 0.19547065
## 12 3.0575 0.2939505 0.3771355 0.21076543
## 13 3.1300 0.3061923 0.3865574 0.22582718
## 14 3.2025 0.3187139 0.3970088 0.24041895
## 15 3.2750 0.3315028 0.4087086 0.25429694
## 16 3.3475 0.3445454 0.4218506 0.26724013
## 17 3.4200 0.3578264 0.4365681 0.27908480
## 18 3.4925 0.3713294 0.4529077 0.28975110
## 19 3.5650 0.3850364 0.4708228 0.29925010
## 20 3.6375 0.3989283 0.4901876 0.30766903
See: https://data.princeton.edu/wws509/notes/c3s8
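The lack-of-fit (curvature) tests below resemble the output of residualPlots() from the car package (an assumption):
library(car)
residualPlots(mylogit)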
## Test stat Pr(>|Test stat|)
## gre 0.1082 0.7422
## gpa 0.0835 0.7726
## rank
If the model is properly specified, no additional predictors that are statistically significant can be found.
If some continuous predictor’s stat in this test is significant, test a new model where you include its quadratic term.
Marginal model plots draw the response variable against each predictor and against the linear predictor; the fitted values are compared to the observed data (see the sketch below).
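A sketch, assuming the car package and the mylogit model from above; the GVIF table that follows is the kind of output vif() returns:
library(car)
marginalModelPlots(mylogit)  # fitted vs. observed against each predictor and the linear predictor
vif(mylogit)                 # generalized variance inflation factors (table below)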
## GVIF Df GVIF^(1/(2*Df))
## gre 1.134377 1 1.065071
## gpa 1.155902 1 1.075129
## rank 1.025759 3 1.004248
This looks OK: the GVIF values are all close to 1, so multicollinearity is not a concern.
See: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4885900/
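The Bonferroni outlier test below matches the output of outlierTest() from car (a sketch):
library(car)
outlierTest(mylogit)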
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 198 2.106376 0.035172 NA
Observation 198 has the largest residual, but it is not significant, so there is no problem here.
We also want to see the most “outlying” observations (as sketched below):
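A sketch, assuming car::influencePlot(); note that the id argument differs across car versions (older versions use id.n = 3):
library(car)
influencePlot(mylogit, id = list(n = 3))  # studentized residuals, hat values and Cook's distances for the flagged points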
## StudRes Hat CookD
## 40 2.083822 0.01235387 0.015563562
## 156 2.057274 0.01487512 0.017574208
## 198 2.106376 0.01472214 0.019411921
## 356 1.199803 0.04133761 0.007569415
## 373 1.428058 0.04921401 0.015025521
model2 <- update(mylogit, subset = c(-373,-198))
compareCoefs(mylogit, model2) # no coefficients change significantly, none of them changes significance level
## Calls:
## 1: glm(formula = admit ~ gre + gpa + rank, family = "binomial", data =
## mydata)
## 2: glm(formula = admit ~ gre + gpa + rank, family = "binomial", data =
## mydata, subset = c(-373, -198))
##
## Model 1 Model 2
## (Intercept) -3.99 -4.33
## SE 1.14 1.17
##
## gre 0.00226 0.00232
## SE 0.00109 0.00111
##
## gpa 0.804 0.879
## SE 0.332 0.340
##
## rank2 -0.675 -0.626
## SE 0.316 0.319
##
## rank3 -1.340 -1.300
## SE 0.345 0.347
##
## rank4 -1.551 -1.598
## SE 0.418 0.429
##
There are no substantial changes in the sign or magnitude of the coefficients, which means there is no need to exclude any observations from mylogit.
Are train=200 and test=100 good sample sizes for logistic regression?
What is the odds ratio of 1.7 in probability terms?
Which of these measures can be compared across any two models with the same DV: AIC, pseudo-R-squared, BIC, -2LL, AME?
Based on: https://www.datacamp.com/community/tutorials/decision-trees-R
See the text for interpretation of results.
CHAID, CRT, QUEST
Pros: more predictors in the model, non-linear relationships, any variable types, robust to outliers, internal imputation algorithms
CHAID (Chi-squared Automatic Interaction Detector): several descendants, short and ‘trivial’ trees, missing values treated as a category
CRT (Classification and Regression Tree): categorical outcome, tall trees, pruning, univariate splitting, imputation with surrogates
QUEST (Quick, Unbiased, Efficient Statistical Tree): categorical outcome, univariate splitting, surrogates for imputation
Cons: no equation to explain the model; prone to overfitting (CRT can be pruned, but the results are unstable); results are sensitive to the data
Random forests: random samples of the training-set size are drawn (with replacement), each gets a decision tree built on a random subset of predictors, and the results are averaged; the resulting model is harder to interpret, though.
library(tree)
mydata$admit <- as.factor(mydata$admit)
tree_admit <- tree(admit ~ ., data = mydata)
summary(tree_admit)
##
## Classification tree:
## tree(formula = admit ~ ., data = mydata)
## Number of terminal nodes: 9
## Residual mean deviance: 1.08 = 422.2 / 391
## Misclassification error rate: 0.2675 = 107 / 400
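The index vector train is created before this point, but the code is not shown. A sketch (an assumption: a training index of 250 rows would leave the 150 test cases used below; the original seed is unknown, so the exact rows will differ):
set.seed(2020)                        # hypothetical seed
train <- sample(1:nrow(mydata), 250)  # 250 training rows, 150 left for testing
head(train)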
## [1] 179 14 195 306 118 299
tree_pred = predict(tree_admit, mydata[-train,], type = "class")
with(mydata[-train,], table(tree_pred, admit))
## admit
## tree_pred 0 1
## 0 88 26
## 1 20 16
pred <- mydata[-train, ]
#install.packages("caret")
library(caret)
#install.packages("e1071")
library(e1071)
confusionMatrix(table(tree_pred, pred$admit))
## Confusion Matrix and Statistics
##
##
## tree_pred 0 1
## 0 88 26
## 1 20 16
##
## Accuracy : 0.6933
## 95% CI : (0.6129, 0.7659)
## No Information Rate : 0.72
## P-Value [Acc > NIR] : 0.7947
##
## Kappa : 0.2047
##
## Mcnemar's Test P-Value : 0.4610
##
## Sensitivity : 0.8148
## Specificity : 0.3810
## Pos Pred Value : 0.7719
## Neg Pred Value : 0.4444
## Prevalence : 0.7200
## Detection Rate : 0.5867
## Detection Prevalence : 0.7600
## Balanced Accuracy : 0.5979
##
## 'Positive' Class : 0
##