The data set (hsbdemo.sav) contains variables on 200 students. The outcome variable is prog, program type (1 = general, 2 = academic, 3 = vocational). The predictor variables are ses, socioeconomic status (1 = low, 2 = middle, 3 = high), and two continuous variables: math (mathematics score) and science (science score).
(Research Question): When high school students choose a program (general, vocational, or academic), how do their math and science scores and their socioeconomic status (SES) affect their decision?
# Start our example by importing the data into R
library(haven)
hsbdemo <- read_sav("hsbdemo.sav")
hsb <- hsbdemo
summary(hsb)
## id female ses schtyp
## Min. : 1.00 Min. :0.000 Min. :1.000 Min. :1.00
## 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:2.000 1st Qu.:1.00
## Median :100.50 Median :1.000 Median :2.000 Median :1.00
## Mean :100.50 Mean :0.545 Mean :2.055 Mean :1.16
## 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:1.00
## Max. :200.00 Max. :1.000 Max. :3.000 Max. :2.00
## prog read write math
## Min. :1.000 Min. :28.00 Min. :31.00 Min. :33.00
## 1st Qu.:2.000 1st Qu.:44.00 1st Qu.:45.75 1st Qu.:45.00
## Median :2.000 Median :50.00 Median :54.00 Median :52.00
## Mean :2.025 Mean :52.23 Mean :52.77 Mean :52.65
## 3rd Qu.:2.250 3rd Qu.:60.00 3rd Qu.:60.00 3rd Qu.:59.00
## Max. :3.000 Max. :76.00 Max. :67.00 Max. :75.00
## science socst honors awards
## Min. :26.00 Min. :26.00 Min. :0.000 Min. :0.00
## 1st Qu.:44.00 1st Qu.:46.00 1st Qu.:0.000 1st Qu.:0.00
## Median :53.00 Median :52.00 Median :0.000 Median :1.00
## Mean :51.85 Mean :52.41 Mean :0.265 Mean :1.67
## 3rd Qu.:58.00 3rd Qu.:61.00 3rd Qu.:1.000 3rd Qu.:2.00
## Max. :74.00 Max. :71.00 Max. :1.000 Max. :7.00
## cid
## Min. : 1.00
## 1st Qu.: 5.00
## Median :10.50
## Mean :10.43
## 3rd Qu.:15.00
## Max. :20.00
# Load the jmv package for frequency tables
library(jmv)
##
## Attaching package: 'jmv'
## The following object is masked from 'package:stats':
##
## anova
# Use the descriptives function to get the descriptive statistics
descriptives(hsb, vars = vars(ses, prog, math, science), freq = TRUE)
##
## DESCRIPTIVES
##
## Descriptives
## ──────────────────────────────────────────────
## ses prog math science
## ──────────────────────────────────────────────
## N 200 200 200 200
## Missing 0 0 0 0
## Mean 2.06 2.02 52.6 51.9
## Median 2.00 2.00 52.0 53.0
## Minimum 1.00 1.00 33.0 26.0
## Maximum 3.00 3.00 75.0 74.0
## ──────────────────────────────────────────────
##
##
## FREQUENCIES
##
## Frequencies of ses
## ──────────────────────────────────────────────────
## Levels Counts % of Total Cumulative %
## ──────────────────────────────────────────────────
## 1 47 23.5 23.5
## 2 95 47.5 71.0
## 3 58 29.0 100.0
## ──────────────────────────────────────────────────
##
##
## Frequencies of prog
## ──────────────────────────────────────────────────
## Levels Counts % of Total Cumulative %
## ──────────────────────────────────────────────────
## 1 45 22.5 22.5
## 2 105 52.5 75.0
## 3 50 25.0 100.0
## ──────────────────────────────────────────────────
# To see the crosstable, we need the CrossTable function from the gmodels package
library(gmodels)
# Build a crosstable between ses and prog
CrossTable(hsb$ses, hsb$prog)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 200
##
##
## | hsb$prog
## hsb$ses | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## 1 | 16 | 19 | 12 | 47 |
## | 2.783 | 1.305 | 0.005 | |
## | 0.340 | 0.404 | 0.255 | 0.235 |
## | 0.356 | 0.181 | 0.240 | |
## | 0.080 | 0.095 | 0.060 | |
## -------------|-----------|-----------|-----------|-----------|
## 2 | 20 | 44 | 31 | 95 |
## | 0.088 | 0.692 | 2.213 | |
## | 0.211 | 0.463 | 0.326 | 0.475 |
## | 0.444 | 0.419 | 0.620 | |
## | 0.100 | 0.220 | 0.155 | |
## -------------|-----------|-----------|-----------|-----------|
## 3 | 9 | 42 | 7 | 58 |
## | 1.257 | 4.381 | 3.879 | |
## | 0.155 | 0.724 | 0.121 | 0.290 |
## | 0.200 | 0.400 | 0.140 | |
## | 0.045 | 0.210 | 0.035 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 45 | 105 | 50 | 200 |
## | 0.225 | 0.525 | 0.250 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
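The cell chi-square contributions printed above sum to the overall Pearson statistic; if you want that test directly, here is a quick check with base chisq.test (the value is rounded from the contributions in the table above):
# Overall Pearson chi-square test of association between ses and prog
chisq.test(table(hsb$ses, hsb$prog))  # X-squared is about 16.6 on 4 df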
Below we use the multinom function from the nnet package to estimate a multinomial logistic regression model. Other R packages can also fit multinomial regression; we chose multinom because it does not require the data to be reshaped (as the mlogit package does) and because it mirrors the example code in Hilbe's Logistic Regression Models.
First, we choose the level of the outcome that we wish to use as the baseline and specify it with the relevel function. Then we run the model using multinom. The multinom function does not report p-values for the regression coefficients, so we calculate them ourselves using Wald tests (here z-tests).
# Load the nnet package, which provides the multinom function
library(nnet)
# Since we are going to use academic (prog = 2) as the reference group, we need to relevel the outcome.
hsb$prog2 <- relevel(as.factor(hsb$prog), ref = 2)
hsb$ses <- as.factor(hsb$ses)
levels(hsb$prog2)
## [1] "2" "1" "3"
# Give a descriptive name to each level
levels(hsb$prog2) <- c("academic","general","vocational")
# Run an intercept-only model
OIM <- multinom(prog2 ~ 1, data = hsb)
## # weights: 6 (2 variable)
## initial value 219.722458
## final value 204.096674
## converged
summary(OIM)
## Call:
## multinom(formula = prog2 ~ 1, data = hsb)
##
## Coefficients:
## (Intercept)
## general -0.8472980
## vocational -0.7419374
##
## Std. Errors:
## (Intercept)
## general 0.1781742
## vocational 0.1718249
##
## Residual Deviance: 408.1933
## AIC: 412.1933
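In an intercept-only multinomial model, each intercept is simply the log of the ratio of that category's count to the reference category's count, so we can check the coefficients above against the frequency table:
# Null-model intercepts equal log count ratios vs. the reference (academic, n = 105)
log(45 / 105)  # general vs. academic, about -0.8473
log(50 / 105)  # vocational vs. academic, about -0.7419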
# Run a multinomial model
test <- multinom(prog2 ~ ses + math + science + math*science, data = hsb)
## # weights: 21 (12 variable)
## initial value 219.722458
## iter 10 value 173.831002
## iter 20 value 167.382760
## final value 166.951813
## converged
summary(test)
## Call:
## multinom(formula = prog2 ~ ses + math + science + math * science,
## data = hsb)
##
## Coefficients:
## (Intercept) ses2 ses3 math science
## general 5.897618 -0.4081497 -1.1254491 -0.1852220 0.01323626
## vocational 22.728283 0.8402168 -0.5605656 -0.5036705 -0.28297703
## math:science
## general 0.001025283
## vocational 0.006185571
##
## Std. Errors:
## (Intercept) ses2 ses3 math science
## general 0.002304064 0.2613732 0.2134308 0.02694593 0.02953364
## vocational 0.003856861 0.2959741 0.1984775 0.02681947 0.03142872
## math:science
## general 0.0004761369
## vocational 0.0004760567
##
## Residual Deviance: 333.9036
## AIC: 357.9036
# Check the Wald z-scores for the model coefficients
z <- summary(test)$coefficients/summary(test)$standard.errors
z
## (Intercept) ses2 ses3 math science
## general 2559.659 -1.561559 -5.273132 -6.873837 0.4481759
## vocational 5892.948 2.838819 -2.824328 -18.780028 -9.0037711
## math:science
## general 2.153336
## vocational 12.993348
# 2-tailed z test
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
## (Intercept) ses2 ses3 math science
## general 0 0.118391942 1.341147e-07 6.249667e-12 0.6540263
## vocational 0 0.004528089 4.737981e-03 0.000000e+00 0.0000000
## math:science
## general 0.03129229
## vocational 0.00000000
These are the logit coefficients relative to the reference category. For example, under 'math', the -0.185 suggests that for a one-unit increase in math score, the log odds of being in the general program (relative to the academic program) decrease by 0.185. Note that because the model contains a math × science interaction, this is the effect of math when science = 0; at other science scores the per-unit effect also involves the interaction coefficient.
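To make that caveat concrete, here is a small sketch (the science score of 50 is an arbitrary illustrative choice) that evaluates the per-unit effect of math on the general-versus-academic log odds at a given science score:
# Per-unit effect of math on the log odds of general vs. academic,
# evaluated at an (arbitrary) science score of 50
b <- coef(test)
b["general", "math"] + b["general", "math:science"] * 50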
# The jmv package masks stats::anova (see the message above), so we detach jmv before calling anova().
detach("package:jmv", unload=TRUE)
# Compare our test model with the intercept-only model
anova(OIM,test)
## Likelihood ratio tests of Multinomial Models
##
## Response: prog2
## Model Resid. df Resid. Dev Test Df
## 1 1 398 408.1933
## 2 ses + math + science + math * science 388 333.9036 1 vs 2 10
## LR stat. Pr(Chi)
## 1
## 2 74.28972 6.539991e-12
Interpretation of the Model Fit Information
The deviance (-2 × log-likelihood) measures how much unexplained variability there is in the data, so the difference or change in deviance indicates how much variability the predictors have explained.
The chi-square test evaluates the decrease in unexplained variability from the baseline model (deviance = 408.1933) to the final model (deviance = 333.9036), a difference of 408.1933 - 333.9036 = 74.29 on 10 degrees of freedom. This change is significant, which means that our final model explains a significant amount of the original variability.
The likelihood ratio chi-square of 74.29 with a p-value < 0.001 tells us that our model as a whole fits significantly better than an empty or null model (i.e., a model with no predictors).
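As a sanity check, the same test can be reproduced by hand from the two fitted objects (a sketch relying on the deviance and edf components that multinom stores):
# Reproduce the likelihood ratio test by hand
lr_stat <- deviance(OIM) - deviance(test)          # 408.1933 - 333.9036 = 74.29
df_diff <- test$edf - OIM$edf                      # 12 - 2 = 10 parameters
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)  # matches Pr(Chi) above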
# Check the predicted probability for each program
head(test$fitted.values,30)
## academic general vocational
## 1 0.18801940 0.17122451 0.6407561
## 2 0.12019189 0.10715542 0.7726527
## 3 0.52212681 0.08123771 0.3966355
## 4 0.23683979 0.23125435 0.5319059
## 5 0.10132130 0.12329032 0.7753884
## 6 0.38079544 0.10780793 0.5113966
## 7 0.32321815 0.16454057 0.5122413
## 8 0.09033932 0.08381233 0.8258484
## 9 0.02336687 0.09050704 0.8861261
## 10 0.32321815 0.16454057 0.5122413
## 11 0.16304678 0.29839918 0.5385540
## 12 0.22842326 0.27539161 0.4961851
## 13 0.32747927 0.28141483 0.3911059
## 14 0.05717483 0.12540921 0.8174160
## 15 0.15003741 0.52649953 0.3234631
## 16 0.12638004 0.20495962 0.6686603
## 17 0.25269654 0.39841475 0.3488887
## 18 0.05771613 0.08029142 0.8619924
## 19 0.27404420 0.16131436 0.5646414
## 20 0.25197679 0.20299587 0.5450273
## 21 0.15870561 0.17100945 0.6702849
## 22 0.27404420 0.16131436 0.5646414
## 23 0.16304678 0.29839918 0.5385540
## 24 0.16340250 0.31572103 0.5208765
## 25 0.26080538 0.36665885 0.3725358
## 26 0.74715288 0.12007209 0.1327750
## 27 0.33135572 0.26860688 0.4000374
## 28 0.32321815 0.16454057 0.5122413
## 29 0.40025162 0.32553566 0.2742127
## 30 0.19518342 0.30892507 0.4958915
# We can get the predicted categories by using the predict function
head(predict(test),30)
## [1] vocational vocational academic vocational vocational vocational
## [7] vocational vocational vocational vocational vocational vocational
## [13] vocational vocational general vocational general vocational
## [19] vocational vocational vocational vocational vocational vocational
## [25] vocational academic vocational vocational academic vocational
## Levels: academic general vocational
# Test the association between observed and predicted programs
chisq.test(hsb$prog2,predict(test))
## Warning in chisq.test(hsb$prog2, predict(test)): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: hsb$prog2 and predict(test)
## X-squared = 47.841, df = 4, p-value = 1.019e-09
# Load the DescTools package to calculate pseudo R-squared values
library("DescTools")
## Registered S3 method overwritten by 'DescTools':
## method from
## reorder.factor gdata
# Calculate three pseudo R-squared values
PseudoR2(test, which = c("CoxSnell","Nagelkerke","McFadden"))
## CoxSnell Nagelkerke McFadden
## 0.3102655 0.3565873 0.1819964
Interpretation of the R-Square
These are three pseudo R-squared values. Logistic regression does not have an equivalent to the R-squared found in OLS regression; however, many people have tried to come up with one. Because these statistics do not mean exactly what R-squared means in OLS regression (the proportion of variance of the response variable explained by the predictors), we suggest interpreting them with great caution.
Cox and Snell's R-squared imitates the multiple R-squared using the likelihood, but its maximum can be (and usually is) less than 1.0, which makes it difficult to interpret. Here it suggests that roughly 31% of the variation in the dependent variable is accounted for by the model.
The Nagelkerke modification rescales the Cox and Snell measure so that it does range from 0 to 1, and it will normally be higher than the Cox and Snell value. In our case it is 0.357, indicating a relationship of 35.7% between the predictors and the prediction.
McFadden's R-squared is {LL(null) - LL(full)} / LL(null), i.e., the proportional reduction in the log-likelihood. In our case it is 0.182, indicating a relationship of 18.2% between the predictors and the prediction.
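Because both fitted models are still in the workspace, these formulas can be verified by hand; a minimal sketch using logLik:
# Verify the pseudo R-squared values from the log-likelihoods
ll0 <- logLik(OIM)   # intercept-only (null) model
ll1 <- logLik(test)  # full model
1 - as.numeric(ll1 / ll0)                         # McFadden, ~0.182
1 - exp(as.numeric(2 * (ll0 - ll1)) / nrow(hsb))  # Cox & Snell, ~0.310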
# Use the lmtest package to run Likelihood Ratio Tests
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
lrtest(test, "ses") # Chi-Square=12.922,p=0.01166*
## # weights: 15 (8 variable)
## initial value 219.722458
## iter 10 value 176.645309
## iter 20 value 173.670728
## iter 30 value 173.430200
## final value 173.412997
## converged
## Likelihood ratio test
##
## Model 1: prog2 ~ ses + math + science + math * science
## Model 2: prog2 ~ math + science + math:science
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 12 -166.95
## 2 8 -173.41 -4 12.922 0.01166 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(test, "math") # Chi-Square=10.613,p=0.004959*
## # weights: 18 (10 variable)
## initial value 219.722458
## iter 10 value 172.387155
## final value 172.258318
## converged
## Likelihood ratio test
##
## Model 1: prog2 ~ ses + math + science + math * science
## Model 2: prog2 ~ ses + science + math:science
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 12 -166.95
## 2 10 -172.26 -2 10.613 0.004959 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(test, "science") # Chi-Square=5.3874,p=0.06763
## # weights: 18 (10 variable)
## initial value 219.722458
## iter 10 value 169.972735
## final value 169.645522
## converged
## Likelihood ratio test
##
## Model 1: prog2 ~ ses + math + science + math * science
## Model 2: prog2 ~ ses + math + math:science
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 12 -166.95
## 2 10 -169.65 -2 5.3874 0.06763 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(test, "math:science") # Chi-Square=5.249,p=0.072
## # weights: 18 (10 variable)
## initial value 219.722458
## iter 10 value 170.088834
## final value 169.576379
## converged
## Likelihood ratio test
##
## Model 1: prog2 ~ ses + math + science + math * science
## Model 2: prog2 ~ ses + math + science
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 12 -166.95
## 2 10 -169.58 -2 5.2491 0.07247 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of the Likelihood Ratio Tests
The results of the likelihood ratio tests can be used to ascertain the significance of each predictor in the model. They tell us that SES and math score had significant effects on program selection: χ2(4) = 12.922, p = .012 for SES and χ2(2) = 10.613, p = .005 for math score.
These likelihood ratio statistics are overall tests that tell us which predictors significantly improve our ability to predict the outcome category, but they do not tell us specifically what the effect is. To see that, we have to look at the individual parameter estimates.
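Each lrtest() call above can equivalently be carried out by refitting the reduced model ourselves and comparing it with anova(); a sketch for the ses term (update() refits the model from its stored call):
# By-hand likelihood ratio test for ses
reduced <- update(test, . ~ . - ses)
anova(reduced, test)  # same chi-square (12.922) and p-value as lrtest(test, "ses")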
# Let's check our model again
summary(test)
## Call:
## multinom(formula = prog2 ~ ses + math + science + math * science,
## data = hsb)
##
## Coefficients:
## (Intercept) ses2 ses3 math science
## general 5.897618 -0.4081497 -1.1254491 -0.1852220 0.01323626
## vocational 22.728283 0.8402168 -0.5605656 -0.5036705 -0.28297703
## math:science
## general 0.001025283
## vocational 0.006185571
##
## Std. Errors:
## (Intercept) ses2 ses3 math science
## general 0.002304064 0.2613732 0.2134308 0.02694593 0.02953364
## vocational 0.003856861 0.2959741 0.1984775 0.02681947 0.03142872
## math:science
## general 0.0004761369
## vocational 0.0004760567
##
## Residual Deviance: 333.9036
## AIC: 357.9036
# Check the Wald z-scores again
z <- summary(test)$coefficients/summary(test)$standard.errors
z
## (Intercept) ses2 ses3 math science
## general 2559.659 -1.561559 -5.273132 -6.873837 0.4481759
## vocational 5892.948 2.838819 -2.824328 -18.780028 -9.0037711
## math:science
## general 2.153336
## vocational 12.993348
# 2-tailed z test
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
## (Intercept) ses2 ses3 math science
## general 0 0.118391942 1.341147e-07 6.249667e-12 0.6540263
## vocational 0 0.004528089 4.737981e-03 0.000000e+00 0.0000000
## math:science
## general 0.03129229
## vocational 0.00000000
Note that the coefficient table is split into two rows because these parameters compare pairs of outcome categories.
We specified the second category (2 = academic) as the reference; therefore, the row labelled general compares the general program against the academic program, and the row labelled vocational compares the vocational program against the academic program.
Because each row compares just two categories, the interpretation is the same as for binary logistic regression:
# Extract the coefficients from the model and exponentiate them
exp(coef(test))
## (Intercept) ses2 ses3 math science
## general 3.641690e+02 0.6648794 0.3245067 0.8309198 1.0133243
## vocational 7.426219e+09 2.3168692 0.5708861 0.6043085 0.7535371
## math:science
## general 1.001026
## vocational 1.006205
The relative log odds of being in the general program versus the academic program decrease by 1.125 when moving from the lowest level of SES (SES = 1, the reference) to the highest level (SES = 3): b = -1.125, Wald z = -5.27, p < .001.
Exp(-1.1254491) = 0.3245067 means that for students at the highest level of SES (SES = 3), the odds of choosing the general program over the academic program are 0.325 times the odds for students at the lowest level of SES (SES = 1); in other words, students with the lowest SES choose the general program over the academic program more often than students with the highest SES.
Likewise, the relative log odds of being in the vocational program versus the academic program decrease by 0.56 when moving from the lowest level of SES (SES = 1) to the highest level (SES = 3): b = -0.56, Wald z = -2.82, p < .01.
Exp(-0.56) = 0.57 means that for students at the highest level of SES, the odds of choosing the vocational program over the academic program are 0.57 times the odds for students at the lowest level of SES; again, students with the lowest SES choose the vocational program over the academic program more often than students with the highest SES.
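A common companion to these odds ratios is an approximate 95% Wald confidence interval, which can be built from the coefficient and standard-error matrices shown above (a sketch; 1.96 is the two-sided 5% normal quantile):
# Approximate 95% Wald confidence intervals for the odds ratios
b  <- summary(test)$coefficients
se <- summary(test)$standard.errors
exp(b - 1.96 * se)  # lower limits
exp(b + 1.96 * se)  # upper limits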
Please check your slides for more detail; all of these values can be found in the R output above.
# Load the summarytools package (note: the classification table below is built with base table())
library(summarytools)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
# Build a classification table of observed versus predicted programs
ctable <- table(hsb$prog2,predict(test))
ctable
##
## academic general vocational
## academic 88 4 13
## general 24 12 9
## vocational 21 4 25
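From this table, the overall classification accuracy is the proportion of cases on the diagonal; a one-line check:
# Overall classification accuracy: correct predictions / total
sum(diag(ctable)) / sum(ctable)  # (88 + 12 + 25) / 200 = 0.625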