Class 10: Ordinal Logistic Regression

Prepare the Data

Data Information (Agresti (1996), Table 6.9, p. 186)

  • Data set (impairment.sav) comes from a study of mental health for a random sample of adult residents of Alachua County, Florida. Mental impairment is ordinal, with categories (1 = well, 2 = mild symptom formation, 3 = moderate symptom formation, 4 = impaired).

  • The study related Y = mental impairment to two explanatory variables. The life events is a composite measure of the number and severity of important life events such as birth of child, new job, divorce, or death in family that occurred to the subject within the past three years. In this sample it has a mean of 4.3 and standard deviation of 2.7. Socioeconomic status (SES) is measured here as binary (0 = low & 1 = high).

library(haven)
impairment <- read_sav("impairment.sav")
head(impairment)
## # A tibble: 6 x 4
##   subject  Impairment        ses lifeevents
##     <dbl>   <dbl+lbl>  <dbl+lbl>      <dbl>
## 1       1 1 [1= Well] 1 [1=high]          1
## 2       2 1 [1= Well] 1 [1=high]          9
## 3       3 1 [1= Well] 1 [1=high]          4
## 4       4 1 [1= Well] 1 [1=high]          3
## 5       5 1 [1= Well] 0 [0=low]           2
## 6       6 1 [1= Well] 1 [1=high]          0
# Prepare a copy for exercise
OLR_data <- impairment
# Ordering the dependent variable
OLR_data$Impairment <- factor(OLR_data$Impairment,levels=c(1,2,3,4),labels=c("well","mild","moderate","impaired"))
# Ordering the independent variable
OLR_data$ses <- factor(OLR_data$ses,levels=c(0,1),labels=c("low","high"))

Descriptive Analysis

# Exploratory data analysis 
# Summarizing the data
summary(OLR_data)
##     subject         Impairment   ses       lifeevents   
##  Min.   : 1.00   well    :12   low :18   Min.   :0.000  
##  1st Qu.:10.75   mild    :12   high:22   1st Qu.:2.000  
##  Median :20.50   moderate: 7             Median :4.000  
##  Mean   :20.50   impaired: 9             Mean   :4.275  
##  3rd Qu.:30.25                           3rd Qu.:6.250  
##  Max.   :40.00                           Max.   :9.000
# Making frequency table
table(OLR_data$Impairment, OLR_data$ses)
##           
##            low high
##   well       4    8
##   mild       4    8
##   moderate   5    2
##   impaired   5    4
table(OLR_data$Impairment, OLR_data$lifeevents)
##           
##            0 1 2 3 4 5 6 7 8 9
##   well     1 3 2 3 1 0 0 1 0 1
##   mild     0 2 1 3 0 3 1 0 1 1
##   moderate 1 0 0 2 2 0 1 0 0 1
##   impaired 0 0 1 0 2 1 0 1 3 1

Run the ordinal logistic Regression model using MASS package

# In order to run the ordinal logistic regression model, we need the polr function from the MASS package
library(MASS)
# Build ordinal logistic regression model
OLRmodel <- polr(Impairment ~ ses + lifeevents , data = OLR_data, Hess = TRUE)
summary(OLRmodel)
## Call:
## polr(formula = Impairment ~ ses + lifeevents, data = OLR_data, 
##     Hess = TRUE)
## 
## Coefficients:
##              Value Std. Error t value
## seshigh    -1.1112     0.6109  -1.819
## lifeevents  0.3189     0.1210   2.635
## 
## Intercepts:
##                   Value   Std. Error t value
## well|mild         -0.2819  0.6423    -0.4389
## mild|moderate      1.2128  0.6607     1.8357
## moderate|impaired  2.2094  0.7210     3.0644
## 
## Residual Deviance: 99.0979 
## AIC: 109.0979
# Run a "only intercept" model
OIM <- polr(Impairment ~ 1, data = OLR_data)
summary(OIM)
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = Impairment ~ 1, data = OLR_data)
## 
## No coefficients
## 
## Intercepts:
##                   Value   Std. Error t value
## well|mild         -0.8473  0.3450    -2.4557
## mild|moderate      0.4055  0.3227     1.2562
## moderate|impaired  1.2368  0.3786     3.2663
## 
## Residual Deviance: 109.0421 
## AIC: 115.0421

Check the Overall Model Fit

# Compare the our test model with the "Only intercept" model
anova(OIM,OLRmodel)
## Likelihood ratio tests of ordinal regression models
## 
## Response: Impairment
##              Model Resid. df Resid. Dev   Test    Df LR stat.     Pr(Chi)
## 1                1        37   109.0421                                  
## 2 ses + lifeevents        35    99.0979 1 vs 2     2 9.944156 0.006928734
  • Before proceeding to examine the individual coefficients, you want to look at an overall test of the null hypothesis that the location coefficients for all of the variables in the model are 0.

  • You can base this on the change in log-likelihood when the variables are added to a model that contains only the intercept. The change in likelihood function has a chi-square distribution even when there are cells with small observed and predicted counts.

  • From the table, you see that the chi-square is 9.944 and p = .007. This means that you can reject the null hypothesis that the model without predictors is as good as the model with the predictors.

Check the model fit information

# Check the predicted probability for each program
OLRmodel$fitted.values
##          well      mild   moderate   impaired
## 1  0.62491585 0.2564222 0.07131474 0.04734725
## 2  0.11502358 0.2518325 0.24398557 0.38915837
## 3  0.39028430 0.3502175 0.14495563 0.11454256
## 4  0.46822886 0.3287373 0.11707594 0.08595788
## 5  0.28503416 0.3548980 0.18808642 0.17198137
## 6  0.69621280 0.2146344 0.05428168 0.03487110
## 7  0.35416883 0.3555319 0.15911298 0.13118624
## 8  0.46822886 0.3287373 0.11707594 0.08595788
## 9  0.46822886 0.3287373 0.11707594 0.08595788
## 10 0.19738781 0.3255946 0.22513070 0.25188691
## 11 0.35416883 0.3555319 0.15911298 0.13118624
## 12 0.28503416 0.3548980 0.18808642 0.17198137
## 13 0.31756633 0.3571768 0.17419493 0.15106190
## 14 0.10019399 0.2315350 0.24178533 0.42648565
## 15 0.46822886 0.3287373 0.11707594 0.08595788
## 16 0.35416883 0.3555319 0.15911298 0.13118624
## 17 0.15167000 0.2918553 0.23993344 0.31654127
## 18 0.54775531 0.2959818 0.09227179 0.06399113
## 19 0.13282485 0.2729369 0.24333367 0.35090454
## 20 0.31756633 0.3571768 0.17419493 0.15106190
## 21 0.11502358 0.2518325 0.24398557 0.38915837
## 22 0.22469945 0.3390046 0.21407808 0.22221786
## 23 0.46822886 0.3287373 0.11707594 0.08595788
## 24 0.62491585 0.2564222 0.07131474 0.04734725
## 25 0.42998727 0.3408054 0.13029529 0.09891202
## 26 0.39028430 0.3502175 0.14495563 0.11454256
## 27 0.22469945 0.3390046 0.21407808 0.22221786
## 28 0.04102612 0.1191445 0.18048372 0.65934567
## 29 0.25278003 0.3485130 0.20206806 0.19663890
## 30 0.17402747 0.3103145 0.23352933 0.28212871
## 31 0.22469945 0.3390046 0.21407808 0.22221786
## 32 0.15167000 0.2918553 0.23993344 0.31654127
## 33 0.54775531 0.2959818 0.09227179 0.06399113
## 34 0.19738781 0.3255946 0.22513070 0.25188691
## 35 0.13282485 0.2729369 0.24333367 0.35090454
## 36 0.17402747 0.3103145 0.23352933 0.28212871
## 37 0.17402747 0.3103145 0.23352933 0.28212871
## 38 0.15167000 0.2918553 0.23993344 0.31654127
## 39 0.05557758 0.1522454 0.20761767 0.58455931
## 40 0.04102612 0.1191445 0.18048372 0.65934567
# We can get the predicted result by use predict functio
predict(OLRmodel)
##  [1] well     impaired well     well     mild     well     mild    
##  [8] well     well     mild     mild     mild     mild     impaired
## [15] well     mild     impaired well     impaired mild     impaired
## [22] mild     well     well     well     well     mild     impaired
## [29] mild     mild     mild     impaired well     mild     impaired
## [36] mild     mild     impaired impaired impaired
## Levels: well mild moderate impaired
# Test the goodness of fit
chisq.test(OLR_data$Impairment,predict(OLRmodel))
## Warning in chisq.test(OLR_data$Impairment, predict(OLRmodel)): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  OLR_data$Impairment and predict(OLRmodel)
## X-squared = 7.9614, df = 6, p-value = 0.2409

Compute a confusion table and misclassification error (R exclusive)

#Compute confusion table and misclassification error
predictOLR <-  predict(OLRmodel,OLR_data)
cTab <- table(OLR_data$Impairment, predictOLR)
mean(as.character(OLR_data$Impairment) != as.character(predictOLR))
## [1] 0.625
(CCR <- sum(diag(cTab)) / sum(cTab)) # Calculate the classification rate
## [1] 0.375
  • The confusion matrix shows the performance of the ordinal logistic regression model. For example, it shows that, in the test dataset, 6 times category “well” is identified correctly. Similarly, 4 times category “mild” and 5 times category "" is identified correctly.

  • Using the confusion matrix, we can find that the misclassification error for our model is 62.5%.

Measuring Strength of Association (Calculating the Pseudo R-Square)

# Load the DescTools package for calculate the R square
library("DescTools")
# Calculate the R Square
PseudoR2(OLRmodel, which = c("CoxSnell","Nagelkerke","McFadden"))
##   CoxSnell Nagelkerke   McFadden 
## 0.22011118 0.23553327 0.09119561
  • There are several R^2-like statistics that can be used to measure the strength of the association between the dependent variable and the predictor variables.

  • They are not as useful as the statistic in multiple regression, since their interpretation is not straightforward.

  • For this example, there is the relationship of 23.6% between independent variables and dependent variable based on Nagelkerke’s R^2.

Parameter Estimates

# We can use the coef() function to check the parameter estimates
(OLRestimates <- coef(summary(OLRmodel)))
##                        Value Std. Error   t value
## seshigh           -1.1112310  0.6108786 -1.819070
## lifeevents         0.3188613  0.1209922  2.635387
## well|mild         -0.2819031  0.6422667 -0.438919
## mild|moderate      1.2127926  0.6606543  1.835745
## moderate|impaired  2.2093721  0.7209703  3.064443
# Add the p-value to our parameter estimates table
p <- pnorm(abs(OLRestimates[, "t value"]), lower.tail = FALSE) * 2
(OLRestimates_P <- cbind(OLRestimates, "p value" = p))
##                        Value Std. Error   t value     p value
## seshigh           -1.1112310  0.6108786 -1.819070 0.068900746
## lifeevents         0.3188613  0.1209922  2.635387 0.008404133
## well|mild         -0.2819031  0.6422667 -0.438919 0.660720194
## mild|moderate      1.2127926  0.6606543  1.835745 0.066395449
## moderate|impaired  2.2093721  0.7209703  3.064443 0.002180760
  • Please check the Slides 22 to Slides 26 to learn how to interpret the estimates

Calculating Expected Values

# Bulding the newpatient entry to test our model
newpatient <- data.frame("subject" = 1, "Impairment" = NA, "ses" = "high", "lifeevents"=3)
# Use predict function to predict the patient's situation
predict(OLRmodel,newpatient)
## [1] well
## Levels: well mild moderate impaired