This hsb data was collected as a subset of the High School and Beyond study conducted by the National Education Longitudinal Studies program of the National Center for Education Statistics. The variables are gender, race, socioeconomic status, school type, chosen high school program type: scores on reading, writing, math, science, and social studies. We want to determine which factors are related to the choice of the type of program - academic, vocational, or general - that the students pursue in high school. The response is multinomial with three levels.
a). Fit a trinomial response model with the other relevant variables as predictors (untransformed).
The “prog” response variable is a nominal variable with no natural order to the categories. We see from plots that the level of enrollment in high school program across all three type differs by gender (Men tended to be less enrolled in general rather than vocational programs). Similarly, enrollment varies by race (Asians tended to have the highest enrollment in general high school programs while Hispanics had the lowest enrollment for both general and vocational programs and Whites had a high enrollment in both general and vocational programs). Students in the lowest socioeconomic status had the highest enrollment in vocational programs whereas students in middle socioeconomic status had the highest enrollment in general high school programs. Similarly, public school students had the highest enrollment in vocational programs while non-public school students had the highest enrollment in general high school programs. Finally, while reading and writing scores were low for students in vocational programs, math, science and social study scores were higher for these students.
b). Use backward elimination to reduce the model to one where all predictors are statistically significant. Give an interpretation of the resulting model.
I used the AIC criterion deploying a stepwise search method to select which variables need to be included in the model. At the first stage of the search, we see that omitting “race” would be the best option to reduce the AIC criterion. At the next step, “gender” is removed, in the third step “write” is removed and in the final step “read” is removed.
We can also use the standard likelihood methods to derive a test to compare nested models. We first remove “gender” and then compare the deviance of the resulting model.
c). For the student with id 99, compute the predicted probabilities of the three possible choices.
library(faraway)
library(nnet)
## Warning: package 'nnet' was built under R version 4.0.5
sprog <- hsb$prog
levels(sprog) <- c("academic", "vocational", "general")
summary(sprog)
## academic vocational general
## 105 45 50
par(mfrow=c(3,3))
matplot(prop.table(table(hsb$gender, sprog)), type = "l", xlab = "Gender", ylab = "High school program type", lty = c(1, 2, 5))
matplot(prop.table(table(hsb$race, sprog)), type = "l", xlab = "Race", ylab = "High school program type", lty = c(1, 2, 5))
matplot(prop.table(table(hsb$ses, sprog)), type = "l", xlab = "Socioeconomic status", ylab = "High school program type", lty = c(1, 2, 5))
matplot(prop.table(table(hsb$schtyp, sprog)), type = "l", xlab = "School type", ylab = "High school program type", lty = c(1, 2, 5))
matplot(prop.table(table(hsb$read, sprog)), type = "l", xlab = "Reading scores", ylab = "High school program type", lty = c(1, 2, 5))
matplot(prop.table(table(hsb$write, sprog)), type = "l", xlab = "Writing scores", ylab = "High school program type", lty = c(1, 2, 5))
matplot(prop.table(table(hsb$math, sprog)), type = "l", xlab = "Math scores", ylab = "High school program type", lty = c(1, 2, 5))
matplot(prop.table(table(hsb$science, sprog)), type = "l", xlab = "Science scores", ylab = "High school program type", lty = c(1, 2, 5))
matplot(prop.table(table(hsb$socst, sprog)), type = "l", xlab = "Social studies scores", ylab = "High school program type", lty = c(1, 2, 5))
mmod <- multinom(sprog ~ gender + race + ses + schtyp + read + write + math + science + socst, hsb)
## # weights: 42 (26 variable)
## initial value 219.722458
## iter 10 value 171.814970
## iter 20 value 153.793692
## iter 30 value 152.935260
## final value 152.935256
## converged
summary(mmod)
## Call:
## multinom(formula = sprog ~ gender + race + ses + schtyp + read +
## write + math + science + socst, data = hsb)
##
## Coefficients:
## (Intercept) gendermale raceasian racehispanic racewhite seslow
## vocational 3.631901 -0.09264717 1.352739 -0.6322019 0.2965156 1.09864111
## general 7.481381 -0.32104341 -0.700070 -0.1993556 0.3358881 0.04747323
## sesmiddle schtyppublic read write math science
## vocational 0.7029621 0.5845405 -0.04418353 -0.03627381 -0.1092888 0.10193746
## general 1.1815808 2.0553336 -0.03481202 -0.03166001 -0.1139877 0.05229938
## socst
## vocational -0.01976995
## general -0.08040129
##
## Std. Errors:
## (Intercept) gendermale raceasian racehispanic racewhite seslow
## vocational 1.823452 0.4548778 1.058754 0.8935504 0.7354829 0.6066763
## general 2.104698 0.5021132 1.470176 0.8393676 0.7480573 0.7045772
## sesmiddle schtyppublic read write math science
## vocational 0.5045938 0.5642925 0.03103707 0.03381324 0.03522441 0.03274038
## general 0.5700833 0.8348229 0.03422409 0.03585729 0.03885131 0.03424763
## socst
## vocational 0.02712589
## general 0.02938212
##
## Residual Deviance: 305.8705
## AIC: 357.8705
mmod1 <- step(mmod)
## Start: AIC=357.87
## sprog ~ gender + race + ses + schtyp + read + write + math +
## science + socst
##
## trying - gender
## # weights: 39 (24 variable)
## initial value 219.722458
## iter 10 value 171.468391
## iter 20 value 153.592758
## final value 153.142827
## converged
## trying - race
## # weights: 33 (20 variable)
## initial value 219.722458
## iter 10 value 172.925326
## iter 20 value 156.065379
## final value 155.776076
## converged
## trying - ses
## # weights: 36 (22 variable)
## initial value 219.722458
## iter 10 value 173.415148
## iter 20 value 159.270550
## final value 159.018173
## converged
## trying - schtyp
## # weights: 39 (24 variable)
## initial value 219.722458
## iter 10 value 179.272154
## iter 20 value 157.543382
## final value 157.125206
## converged
## trying - read
## # weights: 39 (24 variable)
## initial value 219.722458
## iter 10 value 183.830532
## iter 20 value 154.286859
## final value 154.065250
## converged
## trying - write
## # weights: 39 (24 variable)
## initial value 219.722458
## iter 10 value 176.101070
## iter 20 value 153.975940
## final value 153.626207
## converged
## trying - math
## # weights: 39 (24 variable)
## initial value 219.722458
## iter 10 value 182.534864
## iter 20 value 160.202057
## final value 159.929539
## converged
## trying - science
## # weights: 39 (24 variable)
## initial value 219.722458
## iter 10 value 184.326108
## iter 20 value 158.520506
## final value 158.243167
## converged
## trying - socst
## # weights: 39 (24 variable)
## initial value 219.722458
## iter 10 value 179.967335
## iter 20 value 157.376702
## final value 157.146736
## converged
## Df AIC
## - race 20 351.5522
## - gender 24 354.2857
## - write 24 355.2524
## - read 24 356.1305
## <none> 26 357.8705
## - ses 22 362.0363
## - schtyp 24 362.2504
## - socst 24 362.2935
## - science 24 364.4863
## - math 24 367.8591
## # weights: 33 (20 variable)
## initial value 219.722458
## iter 10 value 172.925326
## iter 20 value 156.065379
## final value 155.776076
## converged
##
## Step: AIC=351.55
## sprog ~ gender + ses + schtyp + read + write + math + science +
## socst
##
## trying - gender
## # weights: 30 (18 variable)
## initial value 219.722458
## iter 10 value 172.662548
## iter 20 value 156.063823
## final value 156.032828
## converged
## trying - ses
## # weights: 27 (16 variable)
## initial value 219.722458
## iter 10 value 174.614066
## iter 20 value 161.475590
## final value 161.472216
## converged
## trying - schtyp
## # weights: 30 (18 variable)
## initial value 219.722458
## iter 10 value 180.749264
## iter 20 value 159.825179
## final value 159.649518
## converged
## trying - read
## # weights: 30 (18 variable)
## initial value 219.722458
## iter 10 value 183.217967
## iter 20 value 156.956223
## final value 156.905034
## converged
## trying - write
## # weights: 30 (18 variable)
## initial value 219.722458
## iter 10 value 176.860078
## iter 20 value 156.634024
## final value 156.325078
## converged
## trying - math
## # weights: 30 (18 variable)
## initial value 219.722458
## iter 10 value 183.819884
## iter 20 value 162.568961
## final value 162.533639
## converged
## trying - science
## # weights: 30 (18 variable)
## initial value 219.722458
## iter 10 value 185.688554
## iter 20 value 161.852139
## final value 161.818793
## converged
## trying - socst
## # weights: 30 (18 variable)
## initial value 219.722458
## iter 10 value 180.870401
## iter 20 value 159.648982
## final value 159.589251
## converged
## Df AIC
## - gender 18 348.0657
## - write 18 348.6502
## - read 18 349.8101
## <none> 20 351.5522
## - ses 16 354.9444
## - socst 18 355.1785
## - schtyp 18 355.2990
## - science 18 359.6376
## - math 18 361.0673
## # weights: 30 (18 variable)
## initial value 219.722458
## iter 10 value 172.662548
## iter 20 value 156.063823
## final value 156.032828
## converged
##
## Step: AIC=348.07
## sprog ~ ses + schtyp + read + write + math + science + socst
##
## trying - ses
## # weights: 24 (14 variable)
## initial value 219.722458
## iter 10 value 174.143433
## iter 20 value 161.751125
## iter 20 value 161.751124
## iter 20 value 161.751124
## final value 161.751124
## converged
## trying - schtyp
## # weights: 27 (16 variable)
## initial value 219.722458
## iter 10 value 180.674669
## iter 20 value 159.902074
## final value 159.901300
## converged
## trying - read
## # weights: 27 (16 variable)
## initial value 219.722458
## iter 10 value 182.139891
## iter 20 value 157.256553
## final value 157.255365
## converged
## trying - write
## # weights: 27 (16 variable)
## initial value 219.722458
## iter 10 value 176.827677
## iter 20 value 156.410686
## final value 156.406678
## converged
## trying - math
## # weights: 27 (16 variable)
## initial value 219.722458
## iter 10 value 183.645245
## iter 20 value 162.999887
## final value 162.998232
## converged
## trying - science
## # weights: 27 (16 variable)
## initial value 219.722458
## iter 10 value 185.984215
## iter 20 value 162.121250
## final value 162.117077
## converged
## trying - socst
## # weights: 27 (16 variable)
## initial value 219.722458
## iter 10 value 180.818124
## iter 20 value 159.843366
## final value 159.841915
## converged
## Df AIC
## - write 16 344.8134
## - read 16 346.5107
## <none> 18 348.0657
## - ses 14 351.5022
## - socst 16 351.6838
## - schtyp 16 351.8026
## - science 16 356.2342
## - math 16 357.9965
## # weights: 27 (16 variable)
## initial value 219.722458
## iter 10 value 176.827677
## iter 20 value 156.410686
## final value 156.406678
## converged
##
## Step: AIC=344.81
## sprog ~ ses + schtyp + read + math + science + socst
##
## trying - ses
## # weights: 21 (12 variable)
## initial value 219.722458
## iter 10 value 175.697433
## final value 162.312774
## converged
## trying - schtyp
## # weights: 24 (14 variable)
## initial value 219.722458
## iter 10 value 183.996922
## iter 20 value 160.624557
## final value 160.624514
## converged
## trying - read
## # weights: 24 (14 variable)
## initial value 219.722458
## iter 10 value 171.169761
## iter 20 value 157.775586
## final value 157.775540
## converged
## trying - math
## # weights: 24 (14 variable)
## initial value 219.722458
## iter 10 value 175.971820
## iter 20 value 164.774187
## final value 164.774168
## converged
## trying - science
## # weights: 24 (14 variable)
## initial value 219.722458
## iter 10 value 175.048445
## iter 20 value 162.188199
## final value 162.188190
## converged
## trying - socst
## # weights: 24 (14 variable)
## initial value 219.722458
## iter 10 value 170.620432
## iter 20 value 161.495973
## final value 161.495961
## converged
## Df AIC
## - read 14 343.5511
## <none> 16 344.8134
## - ses 12 348.6255
## - schtyp 14 349.2490
## - socst 14 350.9919
## - science 14 352.3764
## - math 14 357.5483
## # weights: 24 (14 variable)
## initial value 219.722458
## iter 10 value 171.169761
## iter 20 value 157.775586
## final value 157.775540
## converged
##
## Step: AIC=343.55
## sprog ~ ses + schtyp + math + science + socst
##
## trying - ses
## # weights: 18 (10 variable)
## initial value 219.722458
## iter 10 value 166.035430
## final value 163.818866
## converged
## trying - schtyp
## # weights: 21 (12 variable)
## initial value 219.722458
## iter 10 value 171.145160
## iter 20 value 162.019070
## iter 20 value 162.019070
## iter 20 value 162.019070
## final value 162.019070
## converged
## trying - math
## # weights: 21 (12 variable)
## initial value 219.722458
## iter 10 value 171.943351
## final value 169.805302
## converged
## trying - science
## # weights: 21 (12 variable)
## initial value 219.722458
## iter 10 value 165.438528
## final value 162.495650
## converged
## trying - socst
## # weights: 21 (12 variable)
## initial value 219.722458
## iter 10 value 167.071471
## final value 165.099779
## converged
## Df AIC
## <none> 14 343.5511
## - ses 10 347.6377
## - schtyp 12 348.0381
## - science 12 348.9913
## - socst 12 354.1996
## - math 12 363.6106
summary(mmod1)
## Call:
## multinom(formula = sprog ~ ses + schtyp + math + science + socst,
## data = hsb)
##
## Coefficients:
## (Intercept) seslow sesmiddle schtyppublic math science
## vocational 2.587029 0.87607389 0.6978995 0.6468812 -0.1212242 0.08209791
## general 6.687272 -0.01569301 1.2065000 1.9955504 -0.1369641 0.03941237
## socst
## vocational -0.04441228
## general -0.09363417
##
## Std. Errors:
## (Intercept) seslow sesmiddle schtyppublic math science
## vocational 1.686492 0.5758781 0.4930330 0.545598 0.03213345 0.02787694
## general 1.945363 0.6690861 0.5571202 0.812881 0.03591701 0.02864929
## socst
## vocational 0.02344856
## general 0.02586717
##
## Residual Deviance: 315.5511
## AIC: 343.5511
mmodg <- multinom(sprog ~ race + ses + schtyp + read + write + math + science + socst, hsb)
## # weights: 39 (24 variable)
## initial value 219.722458
## iter 10 value 171.468391
## iter 20 value 153.592758
## final value 153.142827
## converged
deviance(mmodg) - deviance(mmod)
## [1] 0.415142
pchisq(0.415142, mmod$edf - mmodg$edf, lower = F)
## [1] 0.8125555
#predict(mmod1, data.frame(id = 99), type = "probs")