Logistic Regression

Anna

27 January 2020

‘Look, it’s binary!’

Recall the logistic regression assumptions: a binary (dichotomous) outcome, independent observations, linearity of the logit in continuous predictors, no severe multicollinearity, and no overly influential outliers.

See: https://thestatsgeek.com/2014/02/08/r-squared-in-logistic-regression/

Learning objectives for this practice session:

  1. Discuss the test in LMS

  2. Go through the estimation of a logistic regression model (without diagnostics)

  3. Skim the results of the 2019 stackoverflow survey to discuss the target variable for Project 1: https://insights.stackoverflow.com/survey/2019

Project 1’s deadline is February 11. See the full description in LMS tonight.

Load the data and have a look at it

mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(mydata)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
library(psych)
describe(mydata)
##       vars   n   mean     sd median trimmed    mad    min max  range  skew
## admit    1 400   0.32   0.47    0.0    0.27   0.00   0.00   1   1.00  0.78
## gre      2 400 587.70 115.52  580.0  589.06 118.61 220.00 800 580.00 -0.14
## gpa      3 400   3.39   0.38    3.4    3.40   0.40   2.26   4   1.74 -0.21
## rank     4 400   2.48   0.94    2.0    2.48   1.48   1.00   4   3.00  0.10
##       kurtosis   se
## admit    -1.39 0.02
## gre      -0.36 5.78
## gpa      -0.60 0.02
## rank     -0.91 0.05

Check variable scales

The ‘rank’ variable is read as quantitative; correct it and describe the data again:

class(mydata$rank)
## [1] "integer"
table(mydata$rank)
## 
##   1   2   3   4 
##  61 151 121  67
mydata$rank <- factor(mydata$rank)
class(mydata$rank)
## [1] "factor"
describe(mydata)
##       vars   n   mean     sd median trimmed    mad    min max  range  skew
## admit    1 400   0.32   0.47    0.0    0.27   0.00   0.00   1   1.00  0.78
## gre      2 400 587.70 115.52  580.0  589.06 118.61 220.00 800 580.00 -0.14
## gpa      3 400   3.39   0.38    3.4    3.40   0.40   2.26   4   1.74 -0.21
## rank*    4 400   2.48   0.94    2.0    2.48   1.48   1.00   4   3.00  0.10
##       kurtosis   se
## admit    -1.39 0.02
## gre      -0.36 5.78
## gpa      -0.60 0.02
## rank*    -0.91 0.05
describeBy(mydata, mydata$admit)
## 
##  Descriptive statistics by group 
## group: 0
##       vars   n   mean     sd median trimmed    mad    min max  range  skew
## admit    1 273   0.00   0.00   0.00    0.00   0.00   0.00   0   0.00   NaN
## gre      2 273 573.19 115.83 580.00  574.25 118.61 220.00 800 580.00 -0.10
## gpa      3 273   3.34   0.38   3.34    3.35   0.40   2.26   4   1.74 -0.07
## rank*    4 273   2.64   0.92   3.00    2.68   1.48   1.00   4   3.00 -0.03
##       kurtosis   se
## admit      NaN 0.00
## gre      -0.38 7.01
## gpa      -0.55 0.02
## rank*    -0.89 0.06
## -------------------------------------------------------- 
## group: 1
##       vars   n   mean     sd median trimmed    mad    min max  range  skew
## admit    1 127   1.00   0.00   1.00    1.00   0.00   1.00   1   0.00   NaN
## gre      2 127 618.90 108.88 620.00  620.39 118.61 300.00 800 500.00 -0.17
## gpa      3 127   3.49   0.37   3.54    3.51   0.40   2.42   4   1.58 -0.53
## rank*    4 127   2.15   0.92   2.00    2.07   1.48   1.00   4   3.00  0.44
##       kurtosis   se
## admit      NaN 0.00
## gre      -0.35 9.66
## gpa      -0.38 0.03
## rank*    -0.64 0.08

The share of admit = 1 is 0.32. This means that, without any other information, there is a 0.68 probability that admit = 0.
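As a quick check:

prop.table(table(mydata$admit)) # 0.6825 for admit = 0 and 0.3175 for admit = 1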

Missing data

Check the share of missing data (if it is <= 5%, it can usually be imputed). If a large share is missing, consider possible biases in the results, or use the variable with many missing values in robustness-check models only, not in the main model.

library(Amelia)
library(mlbench)
missmap(mydata, col=c("blue", "red"), legend = T, main = "Missing values vs observed")

There are no blue cells, which means there is no missing data in this dataset.

Check the cross-tabs of interest:

xtabs(~ admit + rank, data = mydata)
##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12

Estimate the logit model and check the significance of the individual predictors: can any of them be dropped?

mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4
anova(mylogit, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: admit
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   399     499.98              
## gre   1  13.9204       398     486.06 0.0001907 ***
## gpa   1   5.7122       397     480.34 0.0168478 *  
## rank  3  21.8265       394     458.52 7.088e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model output shows that all predictors are significant; we therefore turn to the estimates.

Positive estimates indicate that, as the predictor changes from 0 to 1 (dummy) or increases by one unit (continuous), the log-odds of admission increase; a negative estimate indicates a decrease. To interpret the coefficients more intuitively, we convert them to odds ratios or, better, to predicted probabilities (see next slide).

Some extra things:

1. Change the reference level of a categorical predictor:

mydatarank2 <- within(mydata, rank <- relevel(rank, ref = 2))
mylogitrank2 <- update(mylogit, data = mydatarank2)
summary(mylogitrank2)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = mydatarank2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.665422   1.109370  -4.205 2.61e-05 ***
## gre          0.002264   0.001094   2.070   0.0385 *  
## gpa          0.804038   0.331819   2.423   0.0154 *  
## rank1        0.675443   0.316490   2.134   0.0328 *  
## rank3       -0.664761   0.283319  -2.346   0.0190 *  
## rank4       -0.876021   0.366735  -2.389   0.0169 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

2. Check the overall significance of the ‘rank’ variable (all its dummies jointly):

#install.packages("aod")
library(aod)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011
anova(mylogit, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: admit
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   399     499.98              
## gre   1  13.9204       398     486.06 0.0001907 ***
## gpa   1   5.7122       397     480.34 0.0168478 *  
## rank  3  21.8265       394     458.52 7.088e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. Check the significance of the difference between ranks 2 and 3:

l <- cbind(0, 0, 0, 1, -1, 0) # contrast between the rank2 and rank3 coefficients
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 5.5, df = 1, P(> X2) = 0.019

Start by converting the coefficients to odds ratios (plus their confidence intervals):

exp(cbind(OR = coef(mylogit), confint(mylogit)))
##                    OR       2.5 %    97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre         1.0022670 1.000137602 1.0044457
## gpa         2.2345448 1.173858216 4.3238349
## rank2       0.5089310 0.272289674 0.9448343
## rank3       0.2617923 0.131641717 0.5115181
## rank4       0.2119375 0.090715546 0.4706961
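A worked reading of one row: the gpa coefficient of 0.804 (log-odds scale) exponentiates to the odds ratio reported above.

exp(coef(mylogit)["gpa"]) # 2.2345: each extra GPA point multiplies the odds of admission by ~2.2, other predictors held fixed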

Continue with calculating the predicted probability of admission at each value of rank, holding gre and gpa at their means:

Create a new dataset with mean values of gre and gpa and all four ranks of previous schools:

newdata1 <- with(mydata, data.frame(gre = mean(gre), 
                        gpa = mean(gpa), 
                        rank = factor(1:4)))
newdata1 
##     gre    gpa rank
## 1 587.7 3.3899    1
## 2 587.7 3.3899    2
## 3 587.7 3.3899    3
## 4 587.7 3.3899    4
  1. for rank:
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1 ## predicted probabilities for different ranks at the mean levels of gre and gpa.
##     gre    gpa rank     rankP
## 1 587.7 3.3899    1 0.5166016
## 2 587.7 3.3899    2 0.3522846
## 3 587.7 3.3899    3 0.2186120
## 4 587.7 3.3899    4 0.1846684
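As a sanity check, the first row can be reproduced by hand: plug the means into the fitted equation for a rank 1 school (the reference level) and apply the inverse logit.

plogis(-3.989979 + 0.002264 * 587.7 + 0.804038 * 3.3899) # ≈ 0.5166, as in row 1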
  2. for gre levels (we pick gre because it has a larger variance):
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100), 4), 
                                    gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link", se = T))
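Note that type = "link" returns fitted values and standard errors on the log-odds scale; plogis(), used below, is the inverse logit that maps them back to probabilities:

plogis(0)          # 0.5
1 / (1 + exp(-0))  # the same: plogis(x) = 1/(1 + exp(-x))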

Plot the predicted values with confidence intervals (rank * gre):

newdata3 <- within(newdata3, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})
head(newdata3)
##        gre    gpa rank        fit    se.fit residual.scale        UL
## 1 200.0000 3.3899    1 -0.8114870 0.5147714              1 0.5492064
## 2 206.0606 3.3899    1 -0.7977632 0.5090986              1 0.5498513
## 3 212.1212 3.3899    1 -0.7840394 0.5034491              1 0.5505074
## 4 218.1818 3.3899    1 -0.7703156 0.4978239              1 0.5511750
## 5 224.2424 3.3899    1 -0.7565919 0.4922237              1 0.5518545
## 6 230.3030 3.3899    1 -0.7428681 0.4866494              1 0.5525464
##          LL PredictedProb
## 1 0.1393812     0.3075737
## 2 0.1423880     0.3105042
## 3 0.1454429     0.3134499
## 4 0.1485460     0.3164108
## 5 0.1516973     0.3193867
## 6 0.1548966     0.3223773
library(ggplot2)
ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
  geom_line(aes(colour = rank), size = 0.75) +
  ylim(0, 1)

For gre <= 380, there is no difference between graduates of rank 1 through rank 4 schools (universities/colleges). At higher gre scores, rank 1 school graduates have a higher probability of being admitted than rank 3 and rank 4 graduates. At the same time, rank 2 school graduates are not statistically significantly different from rank 1, rank 3, or rank 4 graduates.

  3. for gpa levels:
newdata4 <- with(mydata, data.frame(gre = mean(gre), 
                                    gpa = rep(seq(from = 2.2, to = 4.0, length.out = 18), 4),
                                    rank = factor(rep(1:4, each = 18))))
newdata5 <- cbind(newdata4, predict(mylogit, newdata = newdata4, type = "link", se = T))
newdata5 <- within(newdata5, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})
ggplot(newdata5, aes(x = gpa, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
  geom_line(aes(colour = rank), size = 0.75) +
  ylim(0, 1)

The story is similar here: for gpa <= 2.8, there is no difference across (university/college) ranks. At higher gpa levels, rank 1 graduates have higher admission probabilities than rank 3 and rank 4 school graduates.

Overall goodness-of-fit measures

Recall that:

  1. In linear regression, the R-squared and the F-test evaluate the overall model quality.

  2. The R-squared shows the explained share of the outcome’s variance, while the F-test says whether the model is better than no model (= the intercept = the mean).

In a binary logistic model, the overall goodness-of-fit measures are based instead on the likelihood (likelihood-ratio tests, pseudo-R-squared measures), on grouped observed-vs-expected comparisons (Hosmer-Lemeshow and its alternatives), and on predictive performance (classification tables, ROC/AUC).

See https://www.r-bloggers.com/evaluating-logistic-regression-models/

Let’s look at them, one by one.

Likelihood Ratio for the overall model fit

The log-likelihood (LL) measures the discrepancy between the model and the data.

The closer the LL is to zero, the better.

Based on the LL are the deviance (-2LL), the likelihood-ratio test, and the AIC and BIC information criteria.

We can compare them across nested models:

library(lmtest)
mymodel2 <- glm(admit ~ gre + rank, data = mydata, family = "binomial")
mymodel3 <- glm(admit ~ rank, data = mydata, family = "binomial")
lrtest(mymodel3, mymodel2, mylogit) #you want LogLik closer to 0.
## Likelihood ratio test
## 
## Model 1: admit ~ rank
## Model 2: admit ~ gre + rank
## Model 3: admit ~ gre + gpa + rank
##   #Df  LogLik Df   Chisq Pr(>Chisq)   
## 1   4 -237.48                         
## 2   5 -232.27  1 10.4349   0.001237 **
## 3   6 -229.26  1  6.0143   0.014190 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mymodel3, mymodel2, mylogit, test = "Chisq") # the same test (compare the p-values), reported on the deviance (-2LL) scale
## Analysis of Deviance Table
## 
## Model 1: admit ~ rank
## Model 2: admit ~ gre + rank
## Model 3: admit ~ gre + gpa + rank
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       396     474.97                        
## 2       395     464.53  1  10.4349 0.001237 **
## 3       394     458.52  1   6.0143 0.014190 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These two functions are equivalent. Both outputs say that the 2nd model is significantly better than the 1st, and that the 3rd model is significantly better than the 2nd. We choose the best model (mylogit) and interpret it.
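The likelihood-ratio statistic can also be computed by hand, as a check: twice the gap in log-likelihoods, referred to a chi-squared distribution with df equal to the difference in the number of estimated parameters.

LR <- as.numeric(2 * (logLik(mylogit) - logLik(mymodel2)))
LR                                     # 6.0143, as in both tables above
pchisq(LR, df = 1, lower.tail = FALSE) # 0.01419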

Pseudo-R-squared measures:

#install.packages("pscl")
library(pscl)
pR2(mylogit)
##           llh       llhNull            G2      McFadden          r2ML 
## -229.25874624 -249.98825878   41.45902508    0.08292194    0.09845702 
##          r2CU 
##    0.13799580
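As a cross-check, McFadden’s measure comes straight from the log-likelihoods printed above (llh and llhNull):

1 - as.numeric(logLik(mylogit)) / as.numeric(logLik(update(mylogit, . ~ 1))) # ≈ 0.0829, matching pR2()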

See: https://www.rdocumentation.org/packages/pscl/versions/1.5.2/topics/pR2

See: https://is.muni.cz/el/1423/podzim2010/SPP456/Regression_Models_For_Categorical_Dependent_Variables_USING_STATA.pdf

library(rcompanion)
compareGLM(mymodel3, mymodel2, mylogit)
## $Models
##   Formula                   
## 1 "admit ~ rank"            
## 2 "admit ~ gre + rank"      
## 3 "admit ~ gre + gpa + rank"
## 
## $Fit.criteria
##   Rank Df.res   AIC  AICc   BIC McFadden Cox.and.Snell Nagelkerke
## 1    4    396 485.0 485.1 504.9  0.05002       0.06061    0.08495
## 2    5    395 476.5 476.7 500.5  0.07089       0.08480    0.11890
## 3    6    394 472.5 472.8 500.5  0.08292       0.09846    0.13800
##     p.value
## 1 7.399e-06
## 2 1.781e-07
## 3 3.528e-08

A technical note: if you use grouped data, pseudo-R-squared measures are likely to be overestimated severalfold. This is a known phenomenon. How to deal with it? Recalculate the model fit after expanding your data back to individual observations.

See: https://stats.stackexchange.com/questions/183699/how-to-calculate-pseudo-r2-when-using-logistic-regression-on-aggregated-data-fil

In addition, “R-squared says nothing about prediction error” See: https://data.library.virginia.edu/is-r-squared-useless/

Hosmer-Lemeshow Test: search for evidence of poor fit

A well-fitting model shows no significant difference between the model and the observed data

In this test, a well-fitting model should report a p-value greater than 0.05

“If the p-value is small, this is indicative of poor fit. A large p-value does not mean the model fits well, since lack of evidence against a null hypothesis is not equivalent to evidence in favour of the alternative hypothesis. In particular, if our sample size is small, a high p-value from the test may simply be a consequence of the test having lower power to detect mis-specification, rather than being indicative of good fit.”

See: http://thestatsgeek.com/2014/02/16/the-hosmer-lemeshow-goodness-of-fit-test-for-logistic-regression/

#install.packages("generalhoslem")
library(generalhoslem)
logitgof(mydata$admit, fitted(mylogit), g = 10) #g should be larger than the number of predictors; df = g - 2
## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  mydata$admit, fitted(mylogit)
## X-squared = 11.085, df = 8, p-value = 0.1969

Alternatively:

library(sjstats)
hoslem_gof(mylogit)
## 
## # Hosmer-Lemeshow Goodness-of-Fit Test
## 
##   Chi-squared: 11.085
##            df:  8    
##       p-value:  0.197
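To see the mechanics behind the statistic, here is a minimal sketch of the grouping step the test performs (logitgof() then builds a chi-squared statistic from such counts):

# bin cases into g = 10 groups by fitted probability, then compare
# observed and expected event counts per group
grp <- cut(fitted(mylogit),
           breaks = quantile(fitted(mylogit), probs = seq(0, 1, 0.1)),
           include.lowest = TRUE)
cbind(observed = tapply(mydata$admit, grp, sum),
      expected = round(tapply(fitted(mylogit), grp, sum), 1))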

Because of this low power, alternatives to the Hosmer-Lemeshow test have been suggested: the Osius-Rojek test and the Stukel test (see functions to calculate them here: http://www.chrisbilder.com/categorical/Chapter5/AllGOFTests.R )

For comparison, let’s try them as well:

# o.r.test() and stukel.test() are defined in the script linked above:
source("http://www.chrisbilder.com/categorical/Chapter5/AllGOFTests.R")
o.r.test(mylogit)
## z =  0.4920821 with p-value =  0.6226613
stukel.test(mylogit)
## Stukel Test Stat =  0.1663363 with p-value =  0.9201964

The results are similar: since the p-values are above the 0.05 threshold, there is no statistical evidence of poor fit, i.e., of a discrepancy in the occurrence of the event between the fitted and the observed values.

Classification tables / Hit ratio

See: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf

A technical note: different fields rely on different success criteria (sensitivity and specificity vs. precision and recall); see, e.g., https://uberpython.wordpress.com/2012/01/01/precision-recall-sensitivity-and-specificity/

First, we need to create training and test subsamples and fit a model on the training data:

bound <- floor((nrow(mydata)/4) * 3)  # use 75% for training (could be 50%, 60%, etc.)
set.seed(123)
df <- mydata[sample(nrow(mydata)), ]  # shuffle the 400 rows
df.train <- df[1:bound, ]             # training set
df.test <- df[(bound + 1):nrow(df), ] # test set

mylogit1 <- glm(admit ~ gre + gpa + rank, data = df.train, family = "binomial", 
    na.action = na.omit)

Then we count the proportion of outcomes accurately predicted on the test data:

pred <- format(round(predict(mylogit1, newdata = df.test, type = "response"))) # round() classifies at a 0.5 cutoff
accuracy <- table(pred, df.test[,"admit"])
sum(diag(accuracy))/sum(accuracy)
## [1] 0.69
xtabs(~ pred + df.test$admit)
##     df.test$admit
## pred  0  1
##    0 65 27
##    1  4  4
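The same hit ratio can be computed with an explicit 0.5 probability cutoff, which makes the classification rule visible:

prob <- predict(mylogit1, newdata = df.test, type = "response")
pred2 <- ifelse(prob > 0.5, 1, 0) # classify as admit = 1 when p > 0.5
mean(pred2 == df.test$admit)      # 0.69, the same hit ratio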
library(caret)
## Loading required package: lattice
#confusionMatrix(data=pred, df.test$admit) 

This may give back the error message: “Error: data and reference should be factors with the same levels.”

If this happens, try the following:

confusionMatrix(table(pred, df.test$admit))
## Confusion Matrix and Statistics
## 
##     
## pred  0  1
##    0 65 27
##    1  4  4
##                                           
##                Accuracy : 0.69            
##                  95% CI : (0.5897, 0.7787)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.5484          
##                                           
##                   Kappa : 0.0893          
##                                           
##  Mcnemar's Test P-Value : 7.772e-05       
##                                           
##             Sensitivity : 0.9420          
##             Specificity : 0.1290          
##          Pos Pred Value : 0.7065          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.6900          
##          Detection Rate : 0.6500          
##    Detection Prevalence : 0.9200          
##       Balanced Accuracy : 0.5355          
##                                           
##        'Positive' Class : 0               
## 
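The headline numbers can be recovered by hand from the table (remember that the ‘positive’ class here is admit = 0):

65 / (65 + 4)  # sensitivity = 0.9420
4 / (27 + 4)   # specificity = 0.1290
(65 + 4) / 100 # accuracy = 0.69, equal to the No Information Rate (69 of the 100 test cases have admit = 0)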

On Kappa (the improvement in accuracy over chance agreement), see: https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english

Conclusion: our model’s accuracy is not statistically significantly better than no model at all (the No Information Rate).

But there are only 300 + 100 observations. We need more observations and, probably, more predictors.

See: https://www.hranalytics101.com/how-to-assess-model-accuracy-the-basics/#confusion-matrix-and-the-no-information-rate

Let’s ROC! (‘receiver operating characteristic’)

First, let’s learn by looking at a very well-fitting and a poorly fitting model.

[Figures omitted: an ROC curve for a strong predictor, rising toward the top-left corner, and one for a very poor predictor, lying close to the diagonal.]

library(pROC)
# Compute AUC for predicting admit with continuous variables:
f1 <- roc(admit ~ gre + gpa, data = df.train) 
ggroc(f1) + theme_bw() + geom_abline(intercept = 1, slope = 1)

f1$gre
## 
## Call:
## roc.formula(formula = admit ~ gre, data = df.train)
## 
## Data: gre in 204 controls (admit 0) < 96 cases (admit 1).
## Area under the curve: 0.6255

“A model with high discrimination ability will have high sensitivity and specificity simultaneously, leading to an ROC curve which goes close to the top left corner of the plot. A model with no discrimination ability will have an ROC curve which is the 45 degree diagonal line.” See: https://thestatsgeek.com/2014/05/05/area-under-the-roc-curve-assessing-discrimination-in-logistic-regression/

gre is a poor discriminator (AUC = 0.63, well below the conventional 0.80 benchmark).

gpa discriminates rather poorly as well (AUROC < 0.80).
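For completeness (this AUC is not part of the handout output), the discrimination of the full model can be assessed the same way, using its fitted probabilities as the predictor:

roc_full <- roc(df.train$admit, fitted(mylogit1)) # full-model ROC on the training data
auc(roc_full)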

To report average marginal effects that are comparable across models:

See: https://cran.r-project.org/web/packages/margins/vignettes/Introduction.html

#install.packages("margins")
library(margins)
m <- margins(mylogit, type = "response")
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!
summary(m)
##  factor     AME     SE       z      p   lower   upper
##     gpa  0.1564 0.0630  2.4846 0.0130  0.0330  0.2798
##     gre  0.0004 0.0002  2.1068 0.0351  0.0000  0.0009
##   rank2 -0.1566 0.0736 -2.1272 0.0334 -0.3009 -0.0123
##   rank3 -0.2872 0.0733 -3.9170 0.0001 -0.4309 -0.1435
##   rank4 -0.3212 0.0802 -4.0052 0.0001 -0.4784 -0.1640

Interpretation:

Overall: “For categorical variables with more than two possible values, e.g. religion, the marginal effects show you the difference in the predicted probabilities for cases in one category relative to the reference category.”

See: https://www3.nd.edu/~rwilliam/stats3/Margins02.pdf

plot(m) # gre and gpa seem to be mixed up :( Give preference to the table values.

Units are changes in predicted probability of Y.

plot(m, type = "link")
## Warning in plot.xy(xy, type, ...): plot type 'link' will be truncated to
## first character

Units are on the scale of the linear predictor (log-odds).

A technical note: AME is, in essence, “the slope of multi-dimensional surface with respect to one dimension of that surface. Partial derivatives can translate the coefficients to the marginal (in % of probability) contribution of X to the outcome.”

“For the non-numeric classes, discrete differences rather than partial derivatives are reported as the partial derivative of a discrete variable is undefined” See: https://cran.r-project.org/web/packages/margins/vignettes/TechnicalDetails.pdf
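The derivative interpretation can be checked numerically: nudge gre by a tiny amount h for every observation and average the resulting change in predicted probability (a finite-difference sketch).

h <- 1e-4
bumped <- mydata
bumped$gre <- bumped$gre + h
mean((predict(mylogit, bumped, type = "response") -
      predict(mylogit, mydata, type = "response")) / h) # ≈ 0.0004, the AME of gre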

Now, if you want to calculate AMEs at more than just the mean values, you can try a “five-number” approach, estimating the AME at the minimum, the 25th, 50th, and 75th percentiles, and the maximum of a continuous predictor (here, gre):

margins(mylogit, at = list(gre = fivenum(mydata$gre))) 
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!
## Average marginal effects at specified values
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",     data = mydata)
##  at(gre)       gre    gpa   rank2   rank3   rank4
##      220 0.0003054 0.1084 -0.1253 -0.2090 -0.2282
##      520 0.0004254 0.1510 -0.1571 -0.2810 -0.3120
##      580 0.0004465 0.1585 -0.1606 -0.2920 -0.3257
##      660 0.0004716 0.1675 -0.1632 -0.3039 -0.3411
##      800 0.0005043 0.1791 -0.1620 -0.3155 -0.3585
cplot(mylogit, "gre", what = "prediction", main = "Predicted admit, given gre")

options(scipen = 999)
cplot(mylogit, "gre", what = "effect", main = "Average Marginal Effect of gre")
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!

In the two plots above, the Y-axes are different. The upper one shows predicted probabilities (fitted values from predict()); the lower one shows the AME in the outcome (calculated with margins()).

margins(mylogit, at = list(gpa = fivenum(mydata$gpa))) #fivenum are min, 25%, 50%(median), 75%, and maximum.
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!
## Average marginal effects at specified values
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",     data = mydata)
##  at(gpa)       gre    gpa   rank2   rank3   rank4
##    2.260 0.0002921 0.1037 -0.1209 -0.2001 -0.2181
##    3.130 0.0004167 0.1479 -0.1561 -0.2768 -0.3068
##    3.395 0.0004501 0.1598 -0.1619 -0.2948 -0.3289
##    3.670 0.0004795 0.1703 -0.1645 -0.3087 -0.3470
##    4.000 0.0005052 0.1794 -0.1627 -0.3174 -0.3606
cplot(mylogit, "gpa", what = "prediction", main = "Predicted admit, given gpa")

cplot(mylogit, "gpa", what = "effect", main = "Average Marginal Effect of gpa")
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!

Diagnostics

See: https://data.princeton.edu/wws509/notes/c3s8

Binned plots group observations into bins of a predictor and show the average outcome per bin, which helps reveal non-linear patterns:

arm::binnedplot(mydata$gre, mydata$admit)

arm::binnedplot(mydata$gpa, mydata$admit)

Linearity of the logit in the covariates: plotting residuals against individual predictors (or against the linear predictor) helps identify non-linearity.

car::residualPlots(mylogit)

##      Test stat Pr(>|Test stat|)
## gre     0.1082           0.7422
## gpa     0.0835           0.7726
## rank

If the model is properly specified, no additional statistically significant predictors can be found.

If some continuous predictor’s test statistic here is significant, estimate a new model that includes its quadratic term, as illustrated below.
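A hypothetical illustration of that follow-up (not needed here, since neither curvature test is significant):

mylogit_sq <- update(mylogit, . ~ . + I(gre^2)) # add a quadratic term for gre
summary(mylogit_sq)$coefficients["I(gre^2)", ]  # inspect its estimate and p-value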

car::marginalModelPlots(mylogit)
## Warning in mmps(...): Interactions and/or factors skipped

Marginal model plots draw the response variable against each predictor and against the linear predictor, comparing fitted values with the observed data.

No multicollinearity (GVIF > 10 could be a problem)

library(car)
vif(mylogit)
##          GVIF Df GVIF^(1/(2*Df))
## gre  1.134377  1        1.065071
## gpa  1.155902  1        1.075129
## rank 1.025759  3        1.004248

Looks OK: the GVIFs are ~1.

Outliers, high leverage, and influential observations

See: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4885900/

outlierTest(mylogit)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferonni p
## 198 2.106376           0.035172           NA
influenceIndexPlot(mylogit, id.n = 3)

Observation 198 has the largest residual, but it is not significant, so there is no problem here.

We ask for the 3 “most outlying” observations to be labelled (id.n = 3).

Cook’s distance

influencePlot(mylogit, col = "red", id.n = 3) # obs. 373 has the largest leverage; obs. 198 is the most influential

##      StudRes        Hat       CookD
## 40  2.083822 0.01235387 0.015563562
## 156 2.057274 0.01487512 0.017574208
## 198 2.106376 0.01472214 0.019411921
## 356 1.199803 0.04133761 0.007569415
## 373 1.428058 0.04921401 0.015025521

To check the effect of these outliers, you can re-estimate the model without them and compare the results:

model2 <- update(mylogit, subset = c(-373,-198))
compareCoefs(mylogit, model2) # no coefficient changes substantially; none changes significance level
## Calls:
## 1: glm(formula = admit ~ gre + gpa + rank, family = "binomial", data = 
##   mydata)
## 2: glm(formula = admit ~ gre + gpa + rank, family = "binomial", data = 
##   mydata, subset = c(-373, -198))
## 
##             Model 1 Model 2
## (Intercept)   -3.99   -4.33
## SE             1.14    1.17
##                            
## gre         0.00226 0.00232
## SE          0.00109 0.00111
##                            
## gpa           0.804   0.879
## SE            0.332   0.340
##                            
## rank2        -0.675  -0.626
## SE            0.316   0.319
##                            
## rank3        -1.340  -1.300
## SE            0.345   0.347
##                            
## rank4        -1.551  -1.598
## SE            0.418   0.429
## 

There are no meaningful changes in the sign or magnitude of the coefficients, so there is no need to exclude any observations from mylogit.

Summary

Useful resources with examples and discussion: