Introduction
Logistic regression, also called a logit model, is used to model dichotomous outcome variables.
In the logit model the log odds of the outcome is modeled as a linear combination of the predictor
variables.
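Written out in symbols, the model has the following form. The notation is introduced here for illustration and is not the author's: $p$ is the probability of the outcome, $x_1, \dots, x_k$ are the predictors, and $\beta_0, \dots, \beta_k$ are the coefficients.

```latex
\[
\operatorname{logit}(p) = \log\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\]
\[
\text{equivalently}\qquad
p = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)\bigr)}
\]
```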
Problem Statement
I am interested in how variables such as GRE (Graduate Record Exam) scores,
GPA (grade point average) and the prestige of the undergraduate institution affect admission into graduate
school.
The response variable, admit/don't admit, is a binary variable.
Description of the data
For this assignment, I expand on the problem statement above about admission
into graduate school.
I will use a hypothetical data set, which can be downloaded from:
http://www.ats.ucla.edu/stat/data/binary.csv
1. This dataset has a binary response (outcome, dependent) variable called 'admit' (1: admitted, 0: not admitted).
2. There are three predictor variables: 'gre', 'gpa' and 'rank'. I will treat the variables 'gre'
and 'gpa' as continuous.
3. The variable 'rank' takes on the values 1 through 4. Institutions with a rank of 1 have the highest
prestige, while those with a rank of 4 have the lowest.
4. I will also show basic descriptive statistics for the entire data set in the summary section.
Viewing a few observations in the dataset
##    admit gre  gpa rank
## 1      0 380 3.61    3
## 2      1 660 3.67    3
## 3      1 800 4.00    1
## 4      1 640 3.19    4
## 5      0 520 2.93    4
## 6      1 760 3.00    2
## 7      1 560 2.98    1
## 8      0 400 3.08    2
## 9      1 540 3.39    3
## 10     0 700 3.92    2
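As a minimal sketch, the ten rows printed above can be typed in by hand to experiment with the data structure without downloading anything. The name `first10` is hypothetical. The full data set would presumably be read with `read.csv()` from the URL given earlier, which is an assumption about how the report was produced.

```r
# The ten observations shown above, entered manually for illustration.
# (Hypothetical object name; the full file has more rows than this.)
first10 <- data.frame(
  admit = c(0, 1, 1, 1, 0, 1, 1, 0, 1, 0),
  gre   = c(380, 660, 800, 640, 520, 760, 560, 400, 540, 700),
  gpa   = c(3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3.92),
  rank  = c(3, 3, 1, 4, 4, 2, 1, 2, 3, 2)
)

head(first10, n = 10)  # print the rows
str(first10)           # all four columns are numeric at this stage
```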
Summary of Data
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
Standard Deviation of Data
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
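For a binary variable the sample standard deviation is fully determined by the mean, which gives a quick consistency check on the 'admit' column. The sketch below rebuilds 'admit' from figures in this report: a mean of 0.3175 over 400 observations implies 127 admitted and 273 not admitted. The object names are hypothetical.

```r
# Rebuild 'admit' from the reported counts (273 zeros, 127 ones).
admit <- rep(c(0, 1), times = c(273, 127))

n <- length(admit)
p <- mean(admit)                              # proportion admitted

# Sample SD of a 0/1 variable: sqrt(p * (1 - p) * n / (n - 1))
sd_formula <- sqrt(p * (1 - p) * n / (n - 1))
sd_direct  <- sd(admit)                       # same quantity via sd()

c(mean = p, sd_formula = sd_formula, sd_direct = sd_direct)
```

Both values agree with the 0.4660867 reported for 'admit' above.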
Two-way contingency table of categorical outcome and predictors
##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12
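The counts above are easier to read as admission proportions within each rank. The following sketch re-enters the table by hand and computes them. With the full data frame, a call such as `xtabs(~ admit + rank, data = mydata)` would probably produce the same table, though the exact call used in the report is an assumption.

```r
# Contingency table re-entered from the counts shown above
# (matrix() fills column by column: admit = 0 then admit = 1 for each rank).
tab <- as.table(matrix(
  c(28, 33, 97, 54, 93, 28, 55, 12),
  nrow = 2,
  dimnames = list(admit = c("0", "1"), rank = c("1", "2", "3", "4"))
))

# Column proportions: share of applicants admitted within each rank.
admit_rate <- prop.table(tab, margin = 2)["1", ]
round(admit_rate, 3)
```

The proportion admitted falls steadily from rank 1 to rank 4, which matches the negative 'rank' coefficients in the fitted model.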
Building Logistic Regression Model
First, we convert 'rank' to a factor to indicate that 'rank' should be treated as a categorical variable.
Below is the conversion result.
## Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
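A small sketch of what the conversion does, using only the first ten 'rank' values shown earlier (the names `rank_raw` and `rank_f` are hypothetical). With R's default treatment coding, the first level, rank 1, becomes the reference category and the other levels get one indicator column each.

```r
# First ten 'rank' values from the data preview.
rank_raw <- c(3, 3, 1, 4, 4, 2, 1, 2, 3, 2)

# Convert to a factor so the model treats rank as categorical.
rank_f <- factor(rank_raw)
str(rank_f)
levels(rank_f)

# Design matrix: an intercept plus indicators for ranks 2, 3 and 4.
# Rank 1 has no column of its own because it is the reference level.
model.matrix(~ rank_f)
```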
We now build the logistic model.
Here 'admit' is the outcome variable and 'gre', 'gpa', and 'rank' are the predictors.
A summary of the fitted model is shown below:
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
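The `glm()` call shown in the output can be tried without the original file. The sketch below runs the same formula on simulated data: the predictors are drawn at random within the ranges reported in the data summary, and the outcome is generated from the coefficient estimates above. The data frame `sim`, the seed, and the uniform distributions are all assumptions made for illustration, so the resulting estimates will differ from those in the report.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 400

# Simulated predictors (NOT the real data), within the reported ranges.
sim <- data.frame(
  gre  = round(runif(n, 220, 800)),
  gpa  = round(runif(n, 2.26, 4.00), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)

# Linear predictor built from the estimates in the summary above;
# rank 1 is the reference level, so its effect is 0.
rank_effect <- c(0, -0.675443, -1.340204, -1.551464)[as.integer(sim$rank)]
eta <- -3.989979 + 0.002264 * sim$gre + 0.804038 * sim$gpa + rank_effect

# Draw a binary outcome with probability plogis(eta) = 1 / (1 + exp(-eta)).
sim$admit <- rbinom(n, 1, plogis(eta))

# Same model formula and family as in the report.
fit <- glm(admit ~ gre + gpa + rank, family = "binomial", data = sim)
summary(fit)
```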
In the model summary for the admissions data, we can see that 'gre', 'gpa', and the 'rank' levels 2, 3 and 4
(each compared with rank 1) are statistically significant at the 5% level.
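The z values and p-values in the coefficient table follow directly from the estimates and standard errors. As a worked check for 'gre' and 'gpa', using the rounded figures printed in the summary (so the last digits may differ slightly):

```r
# Estimates and standard errors copied from the coefficient table.
est <- c(gre = 0.002264, gpa = 0.804038)
se  <- c(gre = 0.001094, gpa = 0.331819)

z <- est / se               # Wald z statistic
p <- 2 * pnorm(-abs(z))     # two-sided p-value from the standard normal

round(cbind(z = z, p = p), 4)
```

Both p-values fall below 0.05, which is what the single star next to each coefficient indicates.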
Below are the 95% confidence intervals for the coefficient estimates in the above logit model, based on the profiled log-likelihood.
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre 0.0001375921 0.004435874
## gpa 0.1602959439 1.464142727
## rank2 -1.3008888002 -0.056745722
## rank3 -2.0276713127 -0.670372346
## rank4 -2.4000265384 -0.753542605
Below are the 95% confidence intervals for the same logit model based on the standard errors (Wald intervals).
## 2.5 % 97.5 %
## (Intercept) -6.2242418514 -1.755716295
## gre 0.0001202298 0.004408622
## gpa 0.1536836760 1.454391423
## rank2 -1.2957512650 -0.055134591
## rank3 -2.0169920597 -0.663415773
## rank4 -2.3703986294 -0.732528724
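The standard-error-based interval is simply the estimate plus or minus about 1.96 standard errors. A short check for 'gpa' using the rounded values from the coefficient table is sketched below. With a fitted model object, `confint.default()` computes intervals of this kind, although whether the report used that function is an assumption.

```r
# 'gpa' estimate and standard error from the coefficient table.
est <- 0.804038
se  <- 0.331819

# 95% Wald interval: estimate +/- qnorm(0.975) * SE
wald_ci <- est + c(-1, 1) * qnorm(0.975) * se
names(wald_ci) <- c("2.5 %", "97.5 %")
wald_ci
```

This agrees with the 'gpa' row of the table above up to rounding of the inputs. The profile-likelihood interval shown earlier is slightly different because it does not assume a symmetric shape around the estimate.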
Odds Ratio
## (Intercept) gre gpa rank2 rank3 rank4
## 0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
Odds Ratios and 95% CI:
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre 1.0022670 1.000137602 1.0044457
## gpa 2.2345448 1.173858216 4.3238349
## rank2 0.5089310 0.272289674 0.9448343
## rank3 0.2617923 0.131641717 0.5115181
## rank4 0.2119375 0.090715546 0.4706961
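Odds ratios are the exponentiated logit coefficients, and the interval limits are exponentiated in the same way. The sketch below reproduces the values from the numbers printed earlier. With a fitted model object (hypothetically named `mylogit`), something like `exp(cbind(OR = coef(mylogit), confint(mylogit)))` would give the whole table at once, though the exact call used here is an assumption.

```r
# Coefficient estimates copied from the model summary.
coefs <- c(gre = 0.002264, gpa = 0.804038,
           rank2 = -0.675443, rank3 = -1.340204, rank4 = -1.551464)

# Odds ratio = exp(coefficient)
or <- exp(coefs)
round(or, 4)

# Profile-likelihood limits for 'gpa' on the log-odds scale,
# exponentiated to give the interval on the odds-ratio scale.
ci_gpa <- exp(c(0.1602959439, 1.464142727))
round(ci_gpa, 4)
```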
Odds Ratio interpretation
From the above results, we can say that for every one-unit increase in 'gpa', holding 'gre' and 'rank' constant, the odds of being admitted to
graduate school (versus not being admitted) increase by a factor of 2.23.
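A one-unit change is a large step for 'gpa' and a tiny one for 'gre', so it can help to rescale the odds ratio to a more natural increment. Because the model is linear on the log-odds scale, the odds ratio for a change of `d` units is `exp(d * coefficient)`. The increments chosen below (half a grade point, 100 GRE points) are illustrative assumptions.

```r
# Coefficients from the full model summary.
b_gpa <- 0.804038
b_gre <- 0.002264

or_scaled <- c(
  gpa_1    = exp(1   * b_gpa),   # one full grade point (the 2.23 quoted above)
  gpa_half = exp(0.5 * b_gpa),   # half a grade point
  gre_1    = exp(1   * b_gre),   # a single GRE point
  gre_100  = exp(100 * b_gre)    # 100 GRE points
)
round(or_scaled, 3)
```

The per-point odds ratio for 'gre' looks negligible only because of the unit. Over 100 points the odds of admission rise by roughly a quarter.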
Confounding for the Association
To verify whether there is evidence of confounding in the association between the primary
explanatory variable and the response variable, we will remove one variable from the earlier logistic
regression model and then compare the results.
Let's remove 'gpa' and rebuild the model.
##
## Call:
## glm(formula = admit ~ gre + rank, family = "binomial", data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5199 -0.8715 -0.6588 1.1775 2.1113
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.802365 0.672982 -2.678 0.007402 **
## gre 0.003224 0.001019 3.163 0.001562 **
## rank2 -0.721737 0.313033 -2.306 0.021132 *
## rank3 -1.291305 0.340775 -3.789 0.000151 ***
## rank4 -1.602054 0.414932 -3.861 0.000113 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 464.53 on 395 degrees of freedom
## AIC: 474.53
##
## Number of Fisher Scoring iterations: 4
We can see from the summary that 'gre' and 'rank' (levels 2 through 4) are still statistically
significant at the 5% level.
We will now also calculate the odds ratios and 95% CIs for the new model.
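As a supplementary check that is not part of the original analysis, the two model summaries can also be compared through their residual deviances. The sketch below uses only the printed figures. With the fitted objects (hypothetically `full` and `reduced`), `anova(reduced, full, test = "Chisq")` would perform the same likelihood-ratio comparison.

```r
# Residual deviances and degrees of freedom from the two summaries.
dev_full    <- 458.52; df_full    <- 394   # admit ~ gre + gpa + rank
dev_reduced <- 464.53; df_reduced <- 395   # admit ~ gre + rank

# Likelihood-ratio statistic for dropping 'gpa', and its chi-square p-value.
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = lr_p)

# For a 0/1 outcome the reported AIC equals deviance + 2 * (number of coefficients).
aic_full    <- dev_full    + 2 * 6
aic_reduced <- dev_reduced + 2 * 5
c(full = aic_full, reduced = aic_reduced)
```

The lower AIC of the full model and the small p-value are consistent with 'gpa' carrying information beyond 'gre' and 'rank'.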
## (Intercept) gre rank2 rank3 rank4
## 0.1649085 1.0032291 0.4859076 0.2749117 0.2014823
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.1649085 0.04314803 0.6074323
## gre 1.0032291 1.00125509 1.0052723
## rank2 0.4859076 0.26162841 0.8955744
## rank3 0.2749117 0.13958428 0.5327929
## rank4 0.2014823 0.08667895 0.4446645
Confounding effect result interpretation
In the earlier results, for every one-unit increase in 'gre', the odds of
being admitted to graduate school (versus not being admitted) increase by a factor of 1.0022670,
which remains approximately the same (1.0032291) after 'gpa' is removed.
On the per-point odds-ratio scale, this suggests that 'gpa' does not strongly confound the association between 'gre' and admission.
Because a single GRE point is a very small unit, the difference is more visible in the coefficient itself, which changes from 0.002264 to 0.003224.
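The size of the change can be quantified on both scales. The sketch below computes the relative change in the 'gre' estimate between the two models, first as a per-point odds ratio and then as a log-odds coefficient. Which scale to judge by, and how large a change should count as confounding, are choices the text does not specify. The variable names are hypothetical.

```r
# 'gre' coefficient with and without 'gpa' in the model.
b_full    <- 0.002264   # admit ~ gre + gpa + rank
b_reduced <- 0.003224   # admit ~ gre + rank

# Relative change in the per-point odds ratio (in percent).
or_full    <- exp(b_full)
or_reduced <- exp(b_reduced)
pct_change_or <- 100 * (or_reduced - or_full) / or_full

# Relative change in the log-odds coefficient (in percent).
pct_change_coef <- 100 * (b_reduced - b_full) / b_full

round(c(odds_ratio = pct_change_or, coefficient = pct_change_coef), 2)
```

The per-point odds ratio moves by less than 0.1%, while the coefficient moves by about 42%, so the conclusion depends on the scale used for the comparison.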
Logistic Regression Results
After adjusting for potential confounding factors ('gre', 'rank'), the odds of being admitted to
graduate school (versus not being admitted) increase by a factor of more than two for
each one-unit increase in 'gpa'.
(OR = 2.23, 95% CI = 1.17-4.32, p < 0.05)