I will be using *Extending the Linear Model with R* (Faraway) to explore a problem with multinomial data from chapter 7.
The hsb data was collected as a subset of the “High School and Beyond” study conducted by the National Education Longitudinal Studies (NELS) program of the National Center for Education Statistics.
One purpose of the study was to determine which factors are related to the choice of the type of program (academic, vocational, or general) that students pursue in high school.
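As a minimal sketch, the data can be loaded from the faraway package, which ships the hsb data set; the copy is named df here because that is the data frame name that appears in the model calls later.

```r
# Load the hsb data that accompanies Faraway's text and take a first look
library(faraway)

data(hsb, package = "faraway")
df <- hsb        # work on a copy named df, matching the model calls below
str(df)
summary(df)
```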
## 'data.frame': 200 obs. of 11 variables:
## $ id : int 70 121 86 141 172 113 50 11 84 48 ...
## $ gender : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
## $ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
## $ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
## $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
## $ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
## $ read : int 57 68 44 63 47 44 50 34 63 57 ...
## $ write : int 52 59 33 44 52 52 59 46 57 55 ...
## $ math : int 41 53 54 47 57 51 42 45 54 52 ...
## $ science: int 47 63 58 53 53 63 53 39 58 50 ...
## $ socst : int 57 61 31 56 61 61 61 36 51 51 ...
## id gender race ses schtyp
## Min. : 1.00 female:109 african-amer: 20 high :58 private: 32
## 1st Qu.: 50.75 male : 91 asian : 11 low :47 public :168
## Median :100.50 hispanic : 24 middle:95
## Mean :100.50 white :145
## 3rd Qu.:150.25
## Max. :200.00
## prog read write math science
## academic:105 Min. :28.00 Min. :31.00 Min. :33.00 Min. :26.00
## general : 45 1st Qu.:44.00 1st Qu.:45.75 1st Qu.:45.00 1st Qu.:44.00
## vocation: 50 Median :50.00 Median :54.00 Median :52.00 Median :53.00
## Mean :52.23 Mean :52.77 Mean :52.65 Mean :51.85
## 3rd Qu.:60.00 3rd Qu.:60.00 3rd Qu.:59.00 3rd Qu.:58.00
## Max. :76.00 Max. :67.00 Max. :75.00 Max. :74.00
## socst
## Min. :26.00
## 1st Qu.:46.00
## Median :52.00
## Mean :52.41
## 3rd Qu.:61.00
## Max. :71.00
We can see from the data that we have quite a few factors:

- gender: 2 levels, female and male
- race: 4 levels, african-amer, asian, hispanic, and white
- ses (socioeconomic class): 3 levels, high, low, and middle
- schtyp (school type): 2 levels, private and public
- prog (choice of high school program): 3 levels, academic, general, and vocation; this will be our response variable

The remaining variables are student subject scores, broken down into read (reading), write (writing), math, science, and socst (social science).
The exercise asks us to:

1. Make a table showing the proportion of males and females choosing the three different programs, comment on the difference, and then repeat the comparison for ses rather than gender.
2. Construct a plot like the right panel of Figure 7.1 that shows the relationship between program choice and reading score, comment on the plot, and repeat for math in place of reading.
3. Compute the correlation matrix for the five subject scores.
4. Fit a multinomial response model for the program choice and examine the fitted coefficients. Of the five subjects, one gives unexpected coefficients; identify this subject and suggest an explanation for this behavior.
5. Construct a derived variable that is the sum of the five subject scores, fit a multinomial model as before with this one sum variable in place of the five separate subjects, and compare the two models to decide which should be preferred.
6. Use a stepwise method to reduce the model. Which variables are in the selected model?
7. Construct a plot of predicted probabilities from the selected model where the math score varies over the observed range, with the other predictors set at the most common level or mean value as appropriate. The plot should be similar to Figure 7.2; comment on the relationship.
## gender
## prog female male
## academic 58 47
## general 24 21
## vocation 27 23
## gender
## prog female male Total
## academic 55.24 44.76 100.00
## general 53.33 46.67 100.00
## vocation 54.00 46.00 100.00
## gender
## prog female male
## academic 53.21 51.65
## general 22.02 23.08
## vocation 24.77 25.27
## Total 100.00 100.00
We can see from the three tables above that, proportion-wise, the distribution of program choice is nearly the same for male and female students; the only real difference is a slightly higher number of female students overall. For both genders the preference order is the same: academic is the favorite, vocation is second, and general is the least preferred.
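The tables above could be produced with something like the following sketch, using xtabs() and prop.table(); the same approach is repeated with ses in place of gender for the tables that follow.

```r
# Counts of program choice by gender
tab <- xtabs(~ prog + gender, data = df)
tab

# Row percentages (each program sums to 100), with a Total column
addmargins(round(100 * prop.table(tab, margin = 1), 2), margin = 2)

# Column percentages (each gender sums to 100), with a Total row
addmargins(round(100 * prop.table(tab, margin = 2), 2), margin = 1)

# Repeat the comparison with ses in place of gender
tab.ses <- xtabs(~ prog + ses, data = df)
tab.ses
addmargins(round(100 * prop.table(tab.ses, margin = 1), 2), margin = 2)
addmargins(round(100 * prop.table(tab.ses, margin = 2), 2), margin = 1)
```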
## ses
## prog high low middle
## academic 42 19 44
## general 9 16 20
## vocation 7 12 31
## ses
## prog high low middle Total
## academic 40.00 18.10 41.90 100.00
## general 20.00 35.56 44.44 100.00
## vocation 14.00 24.00 62.00 100.00
## ses
## prog high low middle
## academic 72.41 40.43 46.32
## general 15.52 34.04 21.05
## vocation 12.07 25.53 32.63
## Total 100.00 100.00 100.00
With ses we actually see quite a disparity in program choice. About 72% of students in the high group choose the academic program, while for the low and middle groups it is a lot less, only about 40% and 46%. For the low group the close second favorite is general (34%), while for the middle group the more distant second is vocation (33%).
Higher reading scores tend to favor academic programs to the point of exclusivity at the highest end. Reading scores that are on the lower end will favor vocation and general programs.
Students with low math scores tend to favor general and vocational programs. The choice for an academic path seems to trend upwards as the math score increases and remains relatively high when it comes to exceptionally high scores. Interestingly enough, general as a program choice is not even present when math scores go into the higher range.
In both cases the reading and math scores exhibit similar trends with respect to program choice: low scores lean toward the general and vocation programs, while higher scores produce far more picks in favor of academic.
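The score-versus-program plots described above are not reproduced here, but one way to draw them is to bin the scores and plot the proportion choosing each program within each bin. The sketch below assumes the dplyr and ggplot2 packages, which may differ from how the original figures were drawn.

```r
library(dplyr)
library(ggplot2)

# Proportion of each program choice within bins of a given subject score
plot_prog_by_score <- function(data, score, nbins = 7) {
  data %>%
    mutate(bin = cut(.data[[score]], breaks = nbins)) %>%
    group_by(bin) %>%
    mutate(bin_mid = mean(.data[[score]])) %>%   # use the bin's mean score as the x position
    group_by(bin_mid, prog) %>%
    summarise(n = n(), .groups = "drop_last") %>%
    mutate(proportion = n / sum(n)) %>%
    ggplot(aes(bin_mid, proportion, colour = prog)) +
    geom_line() +
    labs(x = score, y = "proportion of students")
}

plot_prog_by_score(df, "read")   # program choice versus reading score
plot_prog_by_score(df, "math")   # repeat for math
```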
|         | read   | write  | math   | science | socst  |
|---------|--------|--------|--------|---------|--------|
| read | 1 | 0.5968 | 0.6623 | 0.6302 | 0.6215 |
| write | 0.5968 | 1 | 0.6174 | 0.5704 | 0.6048 |
| math | 0.6623 | 0.6174 | 1 | 0.6307 | 0.5445 |
| science | 0.6302 | 0.5704 | 0.6307 | 1 | 0.4651 |
| socst | 0.6215 | 0.6048 | 0.5445 | 0.4651 | 1 |
The correlation matrix above shows fairly strong correlation among all of the subject-score variables. Later, when we build our model, we will have to keep in mind that these features may influence each other and make the coefficient estimates more susceptible to error. Feature engineering, perhaps by aggregating the scores, is an option we will explore.
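The correlation matrix shown above can be computed directly with cor():

```r
# Correlation matrix for the five subject scores
subjects <- c("read", "write", "math", "science", "socst")
round(cor(df[, subjects]), 4)
```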
## # weights: 45 (28 variable)
## initial value 219.722458
## iter 10 value 181.098338
## iter 20 value 154.577078
## iter 30 value 152.478856
## final value 152.478368
## converged
## Call:
## multinom(formula = prog ~ ., data = df)
##
## Coefficients:
## (Intercept) id gendermale raceasian racehispanic racewhite
## general 4.263658 -0.007332836 -0.04666403 1.2170225 -0.8702109 0.8609754
## vocation 7.845921 -0.003680462 -0.29724832 -0.7863428 -0.3236628 0.6223190
## seslow sesmiddle schtyppublic read write math
## general 1.1547399 0.7430976 0.1384853 -0.05445264 -0.03716360 -0.1037470
## vocation 0.0728241 1.1897765 1.8285649 -0.04078359 -0.03220268 -0.1099712
## science socst
## general 0.1065258 -0.01786542
## vocation 0.0537472 -0.07959798
##
## Std. Errors:
## (Intercept) id gendermale raceasian racehispanic racewhite
## general 1.960941 0.007678009 0.4587870 1.064969 0.9286986 0.9438010
## vocation 2.288984 0.008408855 0.5048241 1.476435 0.8924359 0.9519097
## seslow sesmiddle schtyppublic read write math
## general 0.6134530 0.5096129 0.7338284 0.03300204 0.03398842 0.03556357
## vocation 0.7067682 0.5739217 0.9981540 0.03583547 0.03597627 0.03885464
## science socst
## general 0.03331314 0.02737227
## vocation 0.03445137 0.02963317
##
## Residual Deviance: 304.9567
## AIC: 360.9567
The multinomial log-linear model, fit via the neural network (nnet) package, has an interesting result: science has coefficients with the opposite sign relative to all the other scores. Multicollinearity can have this kind of impact, even flipping the direction of a coefficient. When we feature engineer we will see whether the standard errors can be lowered.
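The printed call confirms the model is multinom(prog ~ ., data = df) from the nnet package; a minimal sketch of the fit follows, with the object name multinomial.fit taken from the prediction code further below.

```r
library(nnet)

# Multinomial logit with "academic" (the first factor level) as the baseline
# category and every other column, including id, as a predictor
multinomial.fit <- multinom(prog ~ ., data = df)
summary(multinomial.fit)
```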
## # weights: 33 (20 variable)
## initial value 219.722458
## iter 10 value 167.158173
## iter 20 value 164.141699
## final value 164.130704
## converged
## Call:
## multinom(formula = prog ~ id + gender + race + ses + schtyp +
## sum.subject, data = df.reduced)
##
## Coefficients:
## (Intercept) id gendermale raceasian racehispanic racewhite
## general 3.227335 -0.003708235 0.24883040 1.0243408 -0.5484976 1.060033
## vocation 7.112010 -0.003220142 -0.09614882 -0.6015843 -0.1937564 1.098265
## seslow sesmiddle schtyppublic sum.subject
## general 1.0593830 0.6350558 0.3875245 -0.02052599
## vocation 0.2517821 1.1874930 1.8098161 -0.04125543
##
## Std. Errors:
## (Intercept) id gendermale raceasian racehispanic racewhite
## general 1.798815 0.006823237 0.3941480 0.9439661 0.8799224 0.8740777
## vocation 2.157426 0.007659938 0.4364287 1.3769618 0.8411264 0.8970833
## seslow sesmiddle schtyppublic sum.subject
## general 0.5664146 0.4789630 0.6826598 0.005976099
## vocation 0.6797684 0.5566371 0.9568939 0.007225491
##
## Residual Deviance: 328.2614
## AIC: 368.2614
We can see the standard error for the combined subject-score variable is much lower than those of the separate subject scores in the previous model. This can most likely be explained by the collinearity among the individual scores.
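A sketch of how the derived sum variable and the second model could be built; the names df.reduced and sum.subject match the printed call, while sum.fit and the deviance/AIC comparison at the end are assumptions added for illustration.

```r
# Derived variable: the sum of the five subject scores
df.reduced <- df
df.reduced$sum.subject <- with(df, read + write + math + science + socst)
df.reduced <- subset(df.reduced, select = -c(read, write, math, science, socst))

sum.fit <- multinom(prog ~ id + gender + race + ses + schtyp + sum.subject,
                    data = df.reduced)
summary(sum.fit)

# Compare the two fits: chi-squared test on the deviance difference, plus AIC
dev.diff <- deviance(sum.fit) - deviance(multinomial.fit)
df.diff  <- multinomial.fit$edf - sum.fit$edf
pchisq(dev.diff, df = df.diff, lower.tail = FALSE)
AIC(multinomial.fit, sum.fit)
```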
## Call:
## multinom(formula = prog ~ ses + schtyp + sum.subject, data = df.reduced)
##
## Coefficients:
## (Intercept) seslow sesmiddle schtyppublic sum.subject
## general 2.593944 0.8078324 0.5808536 0.5594952 -0.01635887
## vocation 6.372051 0.1330839 1.1517240 1.8490860 -0.03681150
##
## Std. Errors:
## (Intercept) seslow sesmiddle schtyppublic sum.subject
## general 1.587502 0.5386033 0.4720925 0.5219044 0.005422494
## vocation 1.877764 0.6468558 0.5465572 0.7974692 0.006553295
##
## Residual Deviance: 336.0554
## AIC: 356.0554
The stepwise method concludes that the following explanatory variables are the best: ses, schtyp, and sum.subject.

We construct a plot of predicted probabilities from our selected model where:

- the math score varies over the observed range
- the other predictors are set at the most common level or mean value, as appropriate
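A sketch of the stepwise reduction using step(), which drops terms by AIC (MASS::stepAIC would be an equivalent alternative); sum.fit is the hypothetical object from the previous sketch.

```r
# AIC-based stepwise reduction starting from the sum-of-scores model
selected.fit <- step(sum.fit, trace = FALSE)
summary(selected.fit)
```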
## id write read math science
## Min. :100.5 Min. :52.77 Min. :52.23 Min. :33.0 Min. :51.85
## 1st Qu.:100.5 1st Qu.:52.77 1st Qu.:52.23 1st Qu.:43.5 1st Qu.:51.85
## Median :100.5 Median :52.77 Median :52.23 Median :54.0 Median :51.85
## Mean :100.5 Mean :52.77 Mean :52.23 Mean :54.0 Mean :51.85
## 3rd Qu.:100.5 3rd Qu.:52.77 3rd Qu.:52.23 3rd Qu.:64.5 3rd Qu.:51.85
## Max. :100.5 Max. :52.77 Max. :52.23 Max. :75.0 Max. :51.85
## socst gender race schtyp ses
## Min. :52.41 female:43 white:43 public:43 middle:43
## 1st Qu.:52.41
## Median :52.41
## Mean :52.41
## 3rd Qu.:52.41
## Max. :52.41
## math academic general vocation
## Min. :33.0 Min. :0.06328 Min. :0.05617 Min. :0.08253
## 1st Qu.:43.5 1st Qu.:0.17333 1st Qu.:0.12945 1st Qu.:0.20305
## Median :54.0 Median :0.39356 Median :0.22677 Median :0.37968
## Mean :54.0 Mean :0.42368 Mean :0.21086 Mean :0.36545
## 3rd Qu.:64.5 3rd Qu.:0.66750 3rd Qu.:0.29657 3rd Qu.:0.53010
## Max. :75.0 Max. :0.86130 Max. :0.32211 Max. :0.61462
## df$prog
## predict(multinomial.fit) academic general vocation
## academic 92 22 15
## general 6 12 5
## vocation 7 11 30
## [1] "Accuracy: 0.69"
We can see the above model had 69% accuracy in predicting the correct program choice. The number of general predictions is low, which makes sense since we saw, for example in the math-score plot, that general was not picked at all by students who scored in the higher range.
The plot does highlight the positive trend toward academic as the math score goes up, which we saw previously in our analysis of the score-versus-program plots. You can also see that low math scores tend to favor the general and vocation program choices; this was also apparent in our initial analysis.
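Finally, a sketch of how the prediction grid, the predicted-probability plot, and the accuracy figure could be produced. The grid varies math over its observed range with the other predictors held at their means or most common levels, matching the grid summary shown earlier, and the predictions use the full multinomial.fit since the grid carries all of its predictors; this is an assumption about the original workflow, so the exact numbers may differ slightly from the printed output.

```r
# Prediction grid: math varies over its observed range, every other predictor
# is held at its mean (numeric) or most common level (factor)
grid <- data.frame(
  id      = mean(df$id),
  write   = mean(df$write),
  read    = mean(df$read),
  math    = seq(min(df$math), max(df$math)),
  science = mean(df$science),
  socst   = mean(df$socst),
  gender  = factor("female", levels = levels(df$gender)),
  race    = factor("white",  levels = levels(df$race)),
  schtyp  = factor("public", levels = levels(df$schtyp)),
  ses     = factor("middle", levels = levels(df$ses))
)
summary(grid)

# Predicted probability of each program across the grid
probs <- predict(multinomial.fit, newdata = grid, type = "probs")
summary(cbind(math = grid$math, probs))

# Plot predicted probabilities against the math score
matplot(grid$math, probs, type = "l", lty = 1, col = 1:3,
        xlab = "math score", ylab = "predicted probability")
legend("topleft", legend = colnames(probs), col = 1:3, lty = 1)

# Confusion matrix and accuracy on the training data
conf <- table(predict(multinomial.fit), df$prog)
conf
paste("Accuracy:", round(sum(diag(conf)) / sum(conf), 2))
```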