Class_11 Probit Regression

Prepare the Data

R Example: Data Explanations (probit (=binary).sav)

  • A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, effect admission into graduate school. The response variable, admit/don’t admit, is a binary variable (binary.sav).

  • this dataset has a binary response (outcome, dependent) variable called admit, which is equal to 1 if the individual was admitted to graduate school, and 0 otherwise.

  • We will use recoded binary response variable called admit_re for this study. which is equal to 0 if the individual was admitted to graduate school, and 1 otherwise.

  • There are three predictor variables: GRE, GPA, and rank. We will treat the variables GRE and GPA as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.

# Import the dataset into R
library(haven)
probit_binary_ <- read_sav("probit (=binary).sav")
# Prepare a copy for analysis
prob_data <- probit_binary_
# Take a glance of the data
summary(prob_data)
##      admit           admit_re           gre             gpa       
##  Min.   :0.0000   Min.   :0.0000   Min.   :220.0   Min.   :2.260  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130  
##  Median :0.0000   Median :1.0000   Median :580.0   Median :3.395  
##  Mean   :0.3175   Mean   :0.6825   Mean   :587.7   Mean   :3.390  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670  
##  Max.   :1.0000   Max.   :1.0000   Max.   :800.0   Max.   :4.000  
##       rank      
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.485  
##  3rd Qu.:3.000  
##  Max.   :4.000
str(prob_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    400 obs. of  5 variables:
##  $ admit   : 'haven_labelled' num  0 1 1 1 0 1 1 0 1 0 ...
##   ..- attr(*, "format.spss")= chr "F4.0"
##   ..- attr(*, "display_width")= int 9
##   ..- attr(*, "labels")= Named num  0 1
##   .. ..- attr(*, "names")= chr  "No=not admit" "Yes=admit"
##  $ admit_re: 'haven_labelled' num  1 0 0 0 1 0 0 1 0 1 ...
##   ..- attr(*, "format.spss")= chr "F4.0"
##   ..- attr(*, "display_width")= int 9
##   ..- attr(*, "labels")= Named num  0 1
##   .. ..- attr(*, "names")= chr  "Yes=admit" "No=not admit"
##  $ gre     : num  380 660 800 640 520 760 560 400 540 700 ...
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "display_width")= int 9
##  $ gpa     : num  3.61 3.67 4 3.19 2.93 ...
##   ..- attr(*, "format.spss")= chr "F4.2"
##   ..- attr(*, "display_width")= int 9
##  $ rank    : 'haven_labelled' num  3 3 1 4 4 2 1 2 3 2 ...
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "display_width")= int 9
##   ..- attr(*, "labels")= Named num  1 2 3 4
##   .. ..- attr(*, "names")= chr  "highest" "high" "low" "lowest"
# Re-level the data
prob_data$rank <- factor(prob_data$rank,levels=c(1,2,3,4),labels=c("highest", "high", "low", "lowest"))
prob_data$admit_re <- factor(prob_data$admit_re,levels=c(0,1),labels=c("admitted","not-admitted"))
# Build a frequency table
xtabs(~rank + admit_re, data = prob_data)
##          admit_re
## rank      admitted not-admitted
##   highest       33           28
##   high          54           97
##   low           28           93
##   lowest        12           55

Run the Probit logistic Regression model using stats package

# In order to run the Probit logistic regression model, we need the glm function from the stats package
library(stats)
# Build the Probit logistic Regression model
prob_model <- glm(admit_re ~ gre + gpa + rank, family = binomial(link = "probit"), 
    data = prob_data)
## model summary
summary(prob_model)
## 
## Call:
## glm(formula = admit_re ~ gre + gpa + rank, family = binomial(link = "probit"), 
##     data = prob_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1035  -1.1560   0.6389   0.8710   1.6163  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.386836   0.673946   3.542 0.000398 ***
## gre         -0.001376   0.000650  -2.116 0.034329 *  
## gpa         -0.477730   0.197197  -2.423 0.015410 *  
## rankhigh     0.415399   0.194977   2.131 0.033130 *  
## ranklow      0.812138   0.208358   3.898 9.71e-05 ***
## ranklowest   0.935899   0.245272   3.816 0.000136 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.41  on 394  degrees of freedom
## AIC: 470.41
## 
## Number of Fisher Scoring iterations: 4

Compare the overall model fit

# Run a "only intercept" model
OIM <- glm(admit_re ~ 1,family = binomial(link = "probit"), data = prob_data)
summary(OIM)
## 
## Call:
## glm(formula = admit_re ~ 1, family = binomial(link = "probit"), 
##     data = prob_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5148  -1.5148   0.8741   0.8741   0.8741  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4747     0.0653    7.27 3.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 499.98  on 399  degrees of freedom
## AIC: 501.98
## 
## Number of Fisher Scoring iterations: 4
# Compare the Intercept only model with our model
anova(OIM,prob_model,test="F")
## Warning: using F test with a 'binomial' family is inappropriate
## Analysis of Deviance Table
## 
## Model 1: admit_re ~ 1
## Model 2: admit_re ~ gre + gpa + rank
##   Resid. Df Resid. Dev Df Deviance      F    Pr(>F)    
## 1       399     499.98                                 
## 2       394     458.41  5   41.563 8.3127 7.219e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Before proceeding to examine the individual coefficients, you want to look at an overall test of the null hypothesis that the location coefficients (βs) for all of the variables in the model are 0.

  • From the table, you see that the chi-square is 41.563 and p < .001. This means that you can reject the null hypothesis that the model without predictors is as good as the model with the predictors.

  • H0 : The model is a good fitting to the null model

  • H1 : The model is not a good fitting to the null model (i.e. the predictors have a significant effect)

Check the model fit information

# Check the predicted probability for each program
head(prob_model$fitted.values,20)
##         1         2         3         4         5         6         7 
## 0.8293613 0.7046477 0.2661311 0.8207949 0.8864147 0.6268783 0.5764696 
##         8         9        10        11        12        13        14 
## 0.7824784 0.7986055 0.4866859 0.6222300 0.5961079 0.2844973 0.6435312 
##        15        16        17        18        19        20 
## 0.3131301 0.8146865 0.6557750 0.9306664 0.4642531 0.4300943
# We can get the predicted result by use predict functio
head(predict(prob_model),20)
##           1           2           3           4           5           6 
##  0.95164448  0.53781521 -0.62455646  0.91839858  1.20767928  0.32359668 
##           7           8           9          10          11          12 
##  0.19287000  0.78059094  0.83665051 -0.03337962  0.31134272  0.24328564 
##          13          14          15          16          17          18 
## -0.56953283  0.36791375 -0.48699740  0.89529946  0.40095950  1.48077289 
##          19          20 
## -0.08972452 -0.17613415
# Test the goodness of fit
chisq.test(prob_data$admit_re,predict(prob_model))
## Warning in chisq.test(prob_data$admit_re, predict(prob_model)): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  prob_data$admit_re and predict(prob_model)
## X-squared = 390, df = 390, p-value = 0.4905
  • From the observed and expected frequencies, you can compute the usual Pearson and Deviance goodness-of-fit measures. You can chose either A or B options to report the goodness-of-fit.

  • Both of the goodness-of-fit statistics should be used only for models that have reasonably large expected values in each cell. If you have a continuous independent variable or many categorical predictors or some predictors with many values, you may have many cells with small expected values.

  • If your model fits well (p > .05), the observed and expected cell counts are similar. In other words, the model fits the data well.

Measuring Strength of Association (Calculating the Pseudo R-Square)

# Load the DescTools package for calculate the R square
library("DescTools")
# Calculate the R Square
PseudoR2(prob_model, which = c("CoxSnell","Nagelkerke","McFadden"))
##   CoxSnell Nagelkerke   McFadden 
## 0.09869212 0.13832531 0.08313060
  • There are several R2-like statistics that can be used to measure the strength of the association between the dependent variable and the predictor variables.

  • They are not as useful as the statistic in multiple regression, since their interpretation is not straightforward.

  • For this example, there is the relationship of 13.8% between independent variables and dependent variable based on Nagelkerke’s R2.

Parameter Estimates

# Check our model summary again to see the esitmates
summary(prob_model)
## 
## Call:
## glm(formula = admit_re ~ gre + gpa + rank, family = binomial(link = "probit"), 
##     data = prob_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1035  -1.1560   0.6389   0.8710   1.6163  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.386836   0.673946   3.542 0.000398 ***
## gre         -0.001376   0.000650  -2.116 0.034329 *  
## gpa         -0.477730   0.197197  -2.423 0.015410 *  
## rankhigh     0.415399   0.194977   2.131 0.033130 *  
## ranklow      0.812138   0.208358   3.898 9.71e-05 ***
## ranklowest   0.935899   0.245272   3.816 0.000136 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.41  on 394  degrees of freedom
## AIC: 470.41
## 
## Number of Fisher Scoring iterations: 4
# We can use the coef() function to check the parameter estimates
(Prob_estimates <- coef(summary(prob_model)))
##                 Estimate   Std. Error   z value     Pr(>|z|)
## (Intercept)  2.386836491 0.6739460869  3.541584 3.977326e-04
## gre         -0.001375591 0.0006500337 -2.116184 3.432919e-02
## gpa         -0.477730110 0.1971968582 -2.422605 1.540967e-02
## rankhigh     0.415399408 0.1949766644  2.130508 3.312967e-02
## ranklow      0.812138084 0.2083576222  3.897808 9.706718e-05
## ranklowest   0.935899179 0.2452719184  3.815762 1.357635e-04
# Please note the intercept estimate and the rank estimates are different between R and SPSS, that is becuase in R, it takes the different reference group. The default setting is use 1 as the reference group.
  • In the output above, the first thing we see is the call, this is R reminding us what the model we ran was, what options we specified, etc.

  • Next we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to asses model fit.

  • The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre, gpa, and the three terms for rank are statistically significant. The probit regression coefficients give the change in the z-score or probit index for a one unit change in the predictor.

    • For a one unit increase in gre, the z-score decreases by 0.001.
    • For each one unit increase in gpa, the z-score decreases by 0.478.
    • The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution of rank of 2, versus an institution with a rank of 1 (the reference group), increases the z-score by 0.415.
    • Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC.