R Example: Data Explanations (probit (=binary).sav)
A researcher is interested in how variables such as GRE (Graduate Record Exam score), GPA (grade point average), and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable (binary.sav).
This dataset has a binary response (outcome, dependent) variable called admit, which is equal to 1 if the individual was admitted to graduate school, and 0 otherwise.
For this study we will use the recoded binary response variable admit_re, which is equal to 0 if the individual was admitted to graduate school, and 1 otherwise.
There are three predictor variables: GRE, GPA, and rank. We will treat the variables GRE and GPA as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
# Import the dataset into R
library(haven)
probit_binary_ <- read_sav("probit (=binary).sav")
# Prepare a copy for analysis
prob_data <- probit_binary_
# Take a glance at the data
summary(prob_data)
## admit admit_re gre gpa
## Min. :0.0000 Min. :0.0000 Min. :220.0 Min. :2.260
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130
## Median :0.0000 Median :1.0000 Median :580.0 Median :3.395
## Mean :0.3175 Mean :0.6825 Mean :587.7 Mean :3.390
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670
## Max. :1.0000 Max. :1.0000 Max. :800.0 Max. :4.000
## rank
## Min. :1.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.485
## 3rd Qu.:3.000
## Max. :4.000
str(prob_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 400 obs. of 5 variables:
## $ admit : 'haven_labelled' num 0 1 1 1 0 1 1 0 1 0 ...
## ..- attr(*, "format.spss")= chr "F4.0"
## ..- attr(*, "display_width")= int 9
## ..- attr(*, "labels")= Named num 0 1
## .. ..- attr(*, "names")= chr "No=not admit" "Yes=admit"
## $ admit_re: 'haven_labelled' num 1 0 0 0 1 0 0 1 0 1 ...
## ..- attr(*, "format.spss")= chr "F4.0"
## ..- attr(*, "display_width")= int 9
## ..- attr(*, "labels")= Named num 0 1
## .. ..- attr(*, "names")= chr "Yes=admit" "No=not admit"
## $ gre : num 380 660 800 640 520 760 560 400 540 700 ...
## ..- attr(*, "format.spss")= chr "F3.0"
## ..- attr(*, "display_width")= int 9
## $ gpa : num 3.61 3.67 4 3.19 2.93 ...
## ..- attr(*, "format.spss")= chr "F4.2"
## ..- attr(*, "display_width")= int 9
## $ rank : 'haven_labelled' num 3 3 1 4 4 2 1 2 3 2 ...
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 9
## ..- attr(*, "labels")= Named num 1 2 3 4
## .. ..- attr(*, "names")= chr "highest" "high" "low" "lowest"
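The file already contains admit_re, but as a minimal sketch (assuming the coding described above), admit could be reverse-coded to produce it; running this simply recreates the existing column.
# Sketch only: reverse-code admit (1 = admitted) into admit_re (0 = admitted, 1 = not admitted)
prob_data$admit_re <- ifelse(prob_data$admit == 1, 0, 1)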
# Convert rank and admit_re into labelled factors
prob_data$rank <- factor(prob_data$rank,levels=c(1,2,3,4),labels=c("highest", "high", "low", "lowest"))
prob_data$admit_re <- factor(prob_data$admit_re,levels=c(0,1),labels=c("admitted","not-admitted"))
# Build a frequency table
xtabs(~rank + admit_re, data = prob_data)
## admit_re
## rank admitted not-admitted
## highest 33 28
## high 54 97
## low 28 93
## lowest 12 55
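To read the same table as row proportions (the share admitted within each rank), prop.table() can be applied to the xtabs result; this is only a convenience view, not part of the original analysis.
# Optional: express the frequency table as row proportions within each rank
round(prop.table(xtabs(~rank + admit_re, data = prob_data), margin = 1), 3)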
# In order to run the probit regression model, we need the glm function from the stats package
library(stats)
# Build the probit regression model
prob_model <- glm(admit_re ~ gre + gpa + rank, family = binomial(link = "probit"),
data = prob_data)
## model summary
summary(prob_model)
##
## Call:
## glm(formula = admit_re ~ gre + gpa + rank, family = binomial(link = "probit"),
## data = prob_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1035 -1.1560 0.6389 0.8710 1.6163
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.386836 0.673946 3.542 0.000398 ***
## gre -0.001376 0.000650 -2.116 0.034329 *
## gpa -0.477730 0.197197 -2.423 0.015410 *
## rankhigh 0.415399 0.194977 2.131 0.033130 *
## ranklow 0.812138 0.208358 3.898 9.71e-05 ***
## ranklowest 0.935899 0.245272 3.816 0.000136 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.41 on 394 degrees of freedom
## AIC: 470.41
##
## Number of Fisher Scoring iterations: 4
# Run an intercept-only model
OIM <- glm(admit_re ~ 1,family = binomial(link = "probit"), data = prob_data)
summary(OIM)
##
## Call:
## glm(formula = admit_re ~ 1, family = binomial(link = "probit"),
## data = prob_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5148 -1.5148 0.8741 0.8741 0.8741
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4747 0.0653 7.27 3.61e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 499.98 on 399 degrees of freedom
## AIC: 501.98
##
## Number of Fisher Scoring iterations: 4
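As a quick sanity check, the intercept of this null model is just the probit (inverse normal CDF) of the overall proportion of cases with admit_re = 1 (0.6825 in the data summary above), so applying pnorm() to it should recover that proportion.
# The null-model intercept transforms back to the overall proportion of admit_re = 1
pnorm(coef(OIM))
# Compare with the raw proportion ("not-admitted" is the level that was coded 1 before the factor conversion)
mean(prob_data$admit_re == "not-admitted")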
# Compare the intercept-only model with our model
anova(OIM,prob_model,test="F")
## Warning: using F test with a 'binomial' family is inappropriate
## Analysis of Deviance Table
##
## Model 1: admit_re ~ 1
## Model 2: admit_re ~ gre + gpa + rank
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 399 499.98
## 2 394 458.41 5 41.563 8.3127 7.219e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Before proceeding to examine the individual coefficients, you want to look at an overall test of the null hypothesis that the location coefficients (βs) for all of the variables in the model are 0.
From the table, you see that the chi-square is 41.563 and p < .001. This means that you can reject the null hypothesis that the model without predictors is as good as the model with the predictors.
H0 : All of the predictor coefficients are 0 (the full model fits no better than the intercept-only model)
H1 : At least one predictor coefficient is not 0 (i.e. the predictors have a significant effect)
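Because the F test produces the warning above, a likelihood-ratio chi-squared test (test = "Chisq") is the more conventional choice for comparing binomial models; this sketch runs the same comparison and should reproduce the deviance of 41.563 on 5 degrees of freedom.
# Likelihood-ratio chi-squared test of the fitted model against the intercept-only model
anova(OIM, prob_model, test = "Chisq")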
# Check the predicted probability for each applicant
head(prob_model$fitted.values,20)
## 1 2 3 4 5 6 7
## 0.8293613 0.7046477 0.2661311 0.8207949 0.8864147 0.6268783 0.5764696
## 8 9 10 11 12 13 14
## 0.7824784 0.7986055 0.4866859 0.6222300 0.5961079 0.2844973 0.6435312
## 15 16 17 18 19 20
## 0.3131301 0.8146865 0.6557750 0.9306664 0.4642531 0.4300943
# We can get the predicted values on the link (probit) scale using the predict function
head(predict(prob_model),20)
## 1 2 3 4 5 6
## 0.95164448 0.53781521 -0.62455646 0.91839858 1.20767928 0.32359668
## 7 8 9 10 11 12
## 0.19287000 0.78059094 0.83665051 -0.03337962 0.31134272 0.24328564
## 13 14 15 16 17 18
## -0.56953283 0.36791375 -0.48699740 0.89529946 0.40095950 1.48077289
## 19 20
## -0.08972452 -0.17613415
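By default, predict() on a glm object returns values on the link (probit) scale; applying the standard normal CDF pnorm() to these linear predictors recovers the fitted probabilities shown earlier, and predict(..., type = "response") gives them directly.
# The fitted probabilities are the standard normal CDF of the linear predictors
head(pnorm(predict(prob_model)), 20)
# Equivalently, ask predict() for the response scale
head(predict(prob_model, type = "response"), 20)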
# Test the goodness of fit
chisq.test(prob_data$admit_re,predict(prob_model))
## Warning in chisq.test(prob_data$admit_re, predict(prob_model)): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: prob_data$admit_re and predict(prob_model)
## X-squared = 390, df = 390, p-value = 0.4905
From the observed and expected frequencies, you can compute the usual Pearson and deviance goodness-of-fit measures; either of the two statistics can be reported as the goodness of fit.
Both of the goodness-of-fit statistics should be used only for models that have reasonably large expected values in each cell. If you have a continuous independent variable or many categorical predictors or some predictors with many values, you may have many cells with small expected values.
If your model fits well (p > .05), the observed and expected cell counts are similar. In other words, the model fits the data well.
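Here is a sketch of how those two statistics can be obtained directly from the fitted glm object; with continuous predictors such as gre and gpa the chi-squared approximation is rough, so treat the resulting p-values as descriptive.
# Deviance goodness-of-fit statistic and its p-value
dev    <- deviance(prob_model)
res_df <- df.residual(prob_model)
pchisq(dev, res_df, lower.tail = FALSE)
# Pearson goodness-of-fit statistic and its p-value
pearson <- sum(residuals(prob_model, type = "pearson")^2)
pchisq(pearson, res_df, lower.tail = FALSE)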
# Load the DescTools package to calculate pseudo R-squared measures
library("DescTools")
# Calculate the pseudo R-squared values
PseudoR2(prob_model, which = c("CoxSnell","Nagelkerke","McFadden"))
## CoxSnell Nagelkerke McFadden
## 0.09869212 0.13832531 0.08313060
There are several R2-like statistics that can be used to measure the strength of the association between the dependent variable and the predictor variables.
They are not as useful as the R2 statistic in multiple regression, since their interpretation is not as straightforward.
For this example, Nagelkerke's R2 of 0.138 indicates a fairly weak association between the predictors and the response.
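As a check on what this statistic measures, McFadden's R2 can be computed directly from the log-likelihoods of the fitted model and the intercept-only model (OIM); the result should match the 0.083 reported by PseudoR2() above.
# McFadden's pseudo R-squared: 1 minus the ratio of the two log-likelihoods
1 - as.numeric(logLik(prob_model) / logLik(OIM))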
# Check our model summary again to see the estimates
summary(prob_model)
##
## Call:
## glm(formula = admit_re ~ gre + gpa + rank, family = binomial(link = "probit"),
## data = prob_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1035 -1.1560 0.6389 0.8710 1.6163
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.386836 0.673946 3.542 0.000398 ***
## gre -0.001376 0.000650 -2.116 0.034329 *
## gpa -0.477730 0.197197 -2.423 0.015410 *
## rankhigh 0.415399 0.194977 2.131 0.033130 *
## ranklow 0.812138 0.208358 3.898 9.71e-05 ***
## ranklowest 0.935899 0.245272 3.816 0.000136 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.41 on 394 degrees of freedom
## AIC: 470.41
##
## Number of Fisher Scoring iterations: 4
# We can use coef() on the model summary to extract the parameter estimates
(Prob_estimates <- coef(summary(prob_model)))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.386836491 0.6739460869 3.541584 3.977326e-04
## gre -0.001375591 0.0006500337 -2.116184 3.432919e-02
## gpa -0.477730110 0.1971968582 -2.422605 1.540967e-02
## rankhigh 0.415399408 0.1949766644 2.130508 3.312967e-02
## ranklow 0.812138084 0.2083576222 3.897808 9.706718e-05
## ranklowest 0.935899179 0.2452719184 3.815762 1.357635e-04
# Note that the intercept and rank estimates differ between R and SPSS because the two programs use different reference groups; by default R treats the first factor level (rank = 1, "highest") as the reference.
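If you want the R parameterisation to match SPSS, the reference level of rank can be changed with relevel() before refitting; this sketch uses "lowest" purely for illustration, which may or may not be the reference group your SPSS output uses.
# Sketch: refit with a different reference level for rank (here "lowest", for illustration only)
prob_data2 <- prob_data
prob_data2$rank <- relevel(prob_data2$rank, ref = "lowest")
prob_model2 <- glm(admit_re ~ gre + gpa + rank,
                   family = binomial(link = "probit"), data = prob_data2)
coef(summary(prob_model2))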
In the summary(prob_model) output above, the first thing we see is the call; this is R reminding us what model we ran and what options we specified.
Next we see the deviance residuals, which are a measure of model fit. This part of the output shows the distribution of the deviance residuals for the individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit.
The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. gre, gpa, and the three terms for rank are all statistically significant. The probit regression coefficients give the change in the z-score (probit index) for a one-unit change in the predictor.
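Since a change in the probit index is hard to picture, it can help to translate a coefficient into a change in predicted probability at representative values. The sketch below compares two hypothetical applicants who differ only in gpa, holding gre at its mean and rank at "high"; the specific values are illustrative only, and note that the model predicts the probability of the second factor level, "not-admitted".
# Sketch: effect of gpa on the predicted probability of "not-admitted",
# holding gre at its mean and rank fixed at "high" (illustrative values)
new_dat <- data.frame(gre  = mean(prob_data$gre),
                      gpa  = c(3.0, 3.5),
                      rank = factor("high", levels = levels(prob_data$rank)))
predict(prob_model, newdata = new_dat, type = "response")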