Abstract

This report uses Extending the Linear Model with R (Faraway) to work through a multinomial data problem from chapter 7.

The Data

The hsb data was collected as a subset of the “High School and Beyond” study conducted by the National Education Longitudinal Studies (NELS) program of the National Center for Education Statistics.

One purpose of the study was to determine which factors are related to the type of program (academic, vocational, or general) that students pursue in high school.

## 'data.frame':    200 obs. of  11 variables:
##  $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
##  $ gender : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
##  $ race   : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
##  $ ses    : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
##  $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
##  $ prog   : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
##  $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
##  $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
##  $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
##  $ science: int  47 63 58 53 53 63 53 39 58 50 ...
##  $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...
##        id            gender              race         ses         schtyp   
##  Min.   :  1.00   female:109   african-amer: 20   high  :58   private: 32  
##  1st Qu.: 50.75   male  : 91   asian       : 11   low   :47   public :168  
##  Median :100.50                hispanic    : 24   middle:95                
##  Mean   :100.50                white       :145                            
##  3rd Qu.:150.25                                                            
##  Max.   :200.00                                                            
##        prog          read           write            math          science     
##  academic:105   Min.   :28.00   Min.   :31.00   Min.   :33.00   Min.   :26.00  
##  general : 45   1st Qu.:44.00   1st Qu.:45.75   1st Qu.:45.00   1st Qu.:44.00  
##  vocation: 50   Median :50.00   Median :54.00   Median :52.00   Median :53.00  
##                 Mean   :52.23   Mean   :52.77   Mean   :52.65   Mean   :51.85  
##                 3rd Qu.:60.00   3rd Qu.:60.00   3rd Qu.:59.00   3rd Qu.:58.00  
##                 Max.   :76.00   Max.   :67.00   Max.   :75.00   Max.   :74.00  
##      socst      
##  Min.   :26.00  
##  1st Qu.:46.00  
##  Median :52.00  
##  Mean   :52.41  
##  3rd Qu.:61.00  
##  Max.   :71.00
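
The structure and summary output above can be reproduced with something like the following sketch. It assumes that the hsb data frame is the one shipped with the faraway package that accompanies the book; the `hsb_available` guard is an illustrative addition and not part of the original analysis.

```r
# Assumption: hsb comes from the 'faraway' package (companion to the book).
hsb_available <- requireNamespace("faraway", quietly = TRUE)

if (hsb_available) {
  data(hsb, package = "faraway")  # loads the data frame 'hsb' into the workspace
  str(hsb)                        # variable types and factor levels
  print(summary(hsb))             # per-variable summaries
} else {
  message("Package 'faraway' is not installed; try install.packages('faraway').")
}
```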

The data contain several factors:

  • gender, 2 levels, female and male
  • race, 4 levels, african-amer, asian, hispanic, and white
  • ses (socioeconomic class), 3 levels, high, low, and middle
  • schtyp (school type), 2 levels, private and public
  • prog (choice of high school program), 3 levels, academic, general, and vocation; this will be our response variable

The remaining variables are student subject scores: read (reading), write (writing), math, science, and socst (social science).

Problem Statement

  1. Make a table showing the proportion of males and females choosing the three different programs. Comment on the difference. Repeat this comparison but for SES rather than gender.

  2. Construct a plot like the right panel of Figure 7.1 that shows the relationship between program choice and reading score. Comment on the plot. Repeat for math in place of reading.

  3. Compute the correlation matrix for the five subject scores.

  4. Fit a multinomial response model for the program choice and examine the fitted coefficients. Of the five subjects, one gives unexpected coefficients. Identify this subject and suggest an explanation for this behavior.

  5. Construct a derived variable that is the sum of the five subject scores. Fit a multinomial model as before except with this one sum variable in place of the five subjects separately. Compare the two models to decide which should be preferred.

  6. Use a stepwise method to reduce the model. Which variables are in your selected model?

  7. Construct a plot of predicted probabilities from your selected model where the math score varies over the observed range. Other predictors should be set at the most common level or mean value as appropriate. Your plot should be similar to Figure 7.2. Comment on the relationship.

Program Choice Versus

Gender

##           gender
## prog       female male
##   academic     58   47
##   general      24   21
##   vocation     27   23
##           gender
## prog       female   male  Total
##   academic  55.24  44.76 100.00
##   general   53.33  46.67 100.00
##   vocation  54.00  46.00 100.00
##           gender
## prog       female   male
##   academic  53.21  51.65
##   general   22.02  23.08
##   vocation  24.77  25.27
##   Total    100.00 100.00

The three tables above (counts, row percentages, and column percentages) show that male and female students are distributed across the programs in nearly the same proportions; the only visible difference is that the sample contains slightly more female students (109 versus 91). For both genders, academic is the most popular program, vocation is second, and general is the least preferred.
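
As a minimal sketch of how the two percentage tables follow from the counts, the block below rebuilds the count table by hand from the numbers printed above and applies `prop.table`. The object names are illustrative; in the original analysis the counts would come from a cross-tabulation of the hsb data rather than a typed-in matrix.

```r
# Counts copied from the first table above (prog by gender).
gender_counts <- matrix(
  c(58, 47,
    24, 21,
    27, 23),
  nrow = 3, byrow = TRUE,
  dimnames = list(prog = c("academic", "general", "vocation"),
                  gender = c("female", "male"))
)

# Row percentages: gender split within each program.
row_pct <- round(100 * prop.table(gender_counts, margin = 1), 2)

# Column percentages: program split within each gender.
col_pct <- round(100 * prop.table(gender_counts, margin = 2), 2)

print(row_pct)
print(col_pct)
```

The column percentages are the ones that answer the question in the problem statement, since they compare program choice within each gender.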

Socioeconomic Class

##           ses
## prog       high low middle
##   academic   42  19     44
##   general     9  16     20
##   vocation    7  12     31
##           ses
## prog         high    low middle  Total
##   academic  40.00  18.10  41.90 100.00
##   general   20.00  35.56  44.44 100.00
##   vocation  14.00  24.00  62.00 100.00
##           ses
## prog         high    low middle
##   academic  72.41  40.43  46.32
##   general   15.52  34.04  21.05
##   vocation  12.07  25.53  32.63
##   Total    100.00 100.00 100.00

With ses we see a clear disparity in program choice. About 72% of students in the high class choose the academic program, compared with only 40% in the low class and 46% in the middle class. In the low class, general is a close second choice (34%); in the middle class, vocation is a more distant second (33%).
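
One way to put a number on the visual impression from the tables is a chi-squared test of independence. This is a supplementary sketch and not part of the original analysis; the count matrices are typed in from the tables above and the object names are illustrative.

```r
progs <- c("academic", "general", "vocation")

# Counts copied from the ses and gender tables above.
ses_counts <- matrix(
  c(42, 19, 44,
     9, 16, 20,
     7, 12, 31),
  nrow = 3, byrow = TRUE,
  dimnames = list(prog = progs, ses = c("high", "low", "middle"))
)
gender_counts <- matrix(
  c(58, 47,
    24, 21,
    27, 23),
  nrow = 3, byrow = TRUE,
  dimnames = list(prog = progs, gender = c("female", "male"))
)

# Pearson chi-squared tests of independence between prog and each factor.
ses_test <- chisq.test(ses_counts)
gender_test <- chisq.test(gender_counts)

print(ses_test)
print(gender_test)
```

A small p-value for ses together with a large one for gender would be consistent with what the percentage tables suggest.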

Plot for Reading Score

Students with higher reading scores tend to favor the academic program, to the point of exclusivity at the highest scores. Students with lower reading scores tend to favor the vocation and general programs.

Plot for Math Score

Students with low math scores tend to favor the general and vocation programs. The share choosing the academic program rises with the math score and remains high at exceptionally high scores. Notably, the general program is not chosen at all in the highest math score range.

Summary of Scores

Reading and math scores show similar trends with respect to program choice: low scores lean toward the general and vocation programs, while higher scores lean toward the academic program.
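
The kind of plot discussed above can be built by binning the score and computing the proportion of each program within each bin. The sketch below uses a hypothetical helper, `prop_by_bin`, and a handful of made-up toy values purely to show the mechanics; with the real data one would pass the read or math column and the prog column instead.

```r
# Proportion of each program within bins of a score (hypothetical helper).
prop_by_bin <- function(score, prog, breaks) {
  bins <- cut(score, breaks = breaks, include.lowest = TRUE)
  prop.table(table(bin = bins, prog = prog), margin = 1)
}

# Toy values for illustration only; these are NOT the hsb data.
toy_score <- c(30, 35, 45, 48, 55, 58, 65, 70)
toy_prog <- factor(
  c("vocation", "general", "general", "academic",
    "academic", "vocation", "academic", "academic"),
  levels = c("academic", "general", "vocation")
)

props <- prop_by_bin(toy_score, toy_prog, breaks = c(28, 40, 50, 60, 76))
print(round(props, 2))

# One line per program across the score bins (drawn only in interactive sessions).
if (interactive()) {
  matplot(unclass(props), type = "b", lty = 1:3, pch = 1:3, col = 1:3,
          xaxt = "n", xlab = "score bin", ylab = "proportion")
  axis(1, at = seq_len(nrow(props)), labels = rownames(props))
  legend("top", legend = colnames(props), lty = 1:3, pch = 1:3, col = 1:3)
}
```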

Correlation Matrix: 5 Subjects

          read   write    math  science   socst
read    1.0000  0.5968  0.6623   0.6302  0.6215
write   0.5968  1.0000  0.6174   0.5704  0.6048
math    0.6623  0.6174  1.0000   0.6307  0.5445
science 0.6302  0.5704  0.6307   1.0000  0.4651
socst   0.6215  0.6048  0.5445   0.4651  1.0000

The correlation matrix above shows moderate to strong positive correlation (0.47 to 0.66) between all five subject scores. We need to keep this in mind when we build our model: correlated predictors carry overlapping information, which can inflate coefficient standard errors and make individual coefficients unstable. Feature engineering, perhaps by aggregating the scores, is an option we will explore.
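
To get a rough sense of how much this correlation could inflate standard errors, the sketch below computes variance inflation factors (VIFs) as the diagonal of the inverse correlation matrix. The matrix is typed in from the rounded values above, and the VIFs only account for the five scores relative to each other, not the other predictors in the model, so treat the result as indicative.

```r
subjects <- c("read", "write", "math", "science", "socst")

# Correlation matrix copied from the table above (rounded to 4 decimals).
R <- matrix(
  c(1.0000, 0.5968, 0.6623, 0.6302, 0.6215,
    0.5968, 1.0000, 0.6174, 0.5704, 0.6048,
    0.6623, 0.6174, 1.0000, 0.6307, 0.5445,
    0.6302, 0.5704, 0.6307, 1.0000, 0.4651,
    0.6215, 0.6048, 0.5445, 0.4651, 1.0000),
  nrow = 5, byrow = TRUE,
  dimnames = list(subjects, subjects)
)

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif <- diag(solve(R))
print(round(vif, 2))
```

A VIF of 1 means a score is uncorrelated with the others; larger values mean the variance of its coefficient is inflated by that factor.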

Multinomial Model

## # weights:  45 (28 variable)
## initial  value 219.722458 
## iter  10 value 181.098338
## iter  20 value 154.577078
## iter  30 value 152.478856
## final  value 152.478368 
## converged
## Call:
## multinom(formula = prog ~ ., data = df)
## 
## Coefficients:
##          (Intercept)           id  gendermale  raceasian racehispanic racewhite
## general     4.263658 -0.007332836 -0.04666403  1.2170225   -0.8702109 0.8609754
## vocation    7.845921 -0.003680462 -0.29724832 -0.7863428   -0.3236628 0.6223190
##             seslow sesmiddle schtyppublic        read       write       math
## general  1.1547399 0.7430976    0.1384853 -0.05445264 -0.03716360 -0.1037470
## vocation 0.0728241 1.1897765    1.8285649 -0.04078359 -0.03220268 -0.1099712
##            science       socst
## general  0.1065258 -0.01786542
## vocation 0.0537472 -0.07959798
## 
## Std. Errors:
##          (Intercept)          id gendermale raceasian racehispanic racewhite
## general     1.960941 0.007678009  0.4587870  1.064969    0.9286986 0.9438010
## vocation    2.288984 0.008408855  0.5048241  1.476435    0.8924359 0.9519097
##             seslow sesmiddle schtyppublic       read      write       math
## general  0.6134530 0.5096129    0.7338284 0.03300204 0.03398842 0.03556357
## vocation 0.7067682 0.5739217    0.9981540 0.03583547 0.03597627 0.03885464
##             science      socst
## general  0.03331314 0.02737227
## vocation 0.03445137 0.02963317
## 
## Residual Deviance: 304.9567 
## AIC: 360.9567

The model is fit with multinom, the multinomial log-linear model from the nnet (neural network) package, with academic as the baseline category. The result is interesting: science has positive coefficients for both general and vocation, the opposite sign to all the other subject scores. Multicollinearity can produce this kind of behavior, including reversing the sign of a coefficient. When we engineer a combined feature we will see whether the standard error can be lowered.
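
To see how strongly the data support the unexpected science coefficients, a Wald z-statistic can be computed by hand from the coefficients and standard errors printed above. This is a sketch using only the reported numbers; the object names are illustrative.

```r
# Science coefficients and standard errors copied from the output above.
coef_science <- c(general = 0.1065258, vocation = 0.0537472)
se_science   <- c(general = 0.03331314, vocation = 0.03445137)

# Wald z-statistics and two-sided p-values under a normal approximation.
z <- coef_science / se_science
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Multiplicative change in the odds (versus academic) per one-point increase.
odds_mult <- exp(coef_science)

print(round(cbind(z = z, p.value = p, odds.mult = odds_mult), 4))
```

Each coefficient is a log-odds change relative to the academic baseline, so a positive science coefficient means higher science scores raise the odds of the other program versus academic, holding the remaining predictors fixed.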

Feature Engineering

## # weights:  33 (20 variable)
## initial  value 219.722458 
## iter  10 value 167.158173
## iter  20 value 164.141699
## final  value 164.130704 
## converged
## Call:
## multinom(formula = prog ~ id + gender + race + ses + schtyp + 
##     sum.subject, data = df.reduced)
## 
## Coefficients:
##          (Intercept)           id  gendermale  raceasian racehispanic racewhite
## general     3.227335 -0.003708235  0.24883040  1.0243408   -0.5484976  1.060033
## vocation    7.112010 -0.003220142 -0.09614882 -0.6015843   -0.1937564  1.098265
##             seslow sesmiddle schtyppublic sum.subject
## general  1.0593830 0.6350558    0.3875245 -0.02052599
## vocation 0.2517821 1.1874930    1.8098161 -0.04125543
## 
## Std. Errors:
##          (Intercept)          id gendermale raceasian racehispanic racewhite
## general     1.798815 0.006823237  0.3941480 0.9439661    0.8799224 0.8740777
## vocation    2.157426 0.007659938  0.4364287 1.3769618    0.8411264 0.8970833
##             seslow sesmiddle schtyppublic sum.subject
## general  0.5664146 0.4789630    0.6826598 0.005976099
## vocation 0.6797684 0.5566371    0.9568939 0.007225491
## 
## Residual Deviance: 328.2614 
## AIC: 368.2614

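The standard error for the combined subject score is lower than the standard errors of the individual subject scores in the previous model, which is most likely explained by the collinearity among the scores. Note, however, that the combined model has the higher AIC (368.26 versus 360.96) and the higher residual deviance (328.26 versus 304.96), so the lower standard error alone does not show that it is the better model.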
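
Because the sum-score model is a restricted version of the five-score model (it forces the five score coefficients to be equal within each program), the two can be compared with a likelihood ratio test. The sketch below works only from the residual deviances and parameter counts printed in the two outputs above; the object names are illustrative.

```r
# Residual deviance and number of estimated parameters from the two outputs.
dev_separate <- 304.9567  # five separate scores, "28 variable"
k_separate   <- 28
dev_sum      <- 328.2614  # single sum.subject score, "20 variable"
k_sum        <- 20

# Likelihood ratio statistic: the increase in deviance from the restriction.
lr_stat <- dev_sum - dev_separate
lr_df   <- k_separate - k_sum

# Upper-tail chi-squared p-value.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df, p-value:", signif(p_value, 3), "\n")
```

A small p-value here would indicate that collapsing the five scores into one sum loses a meaningful amount of fit.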

Stepwise Method

## Call:
## multinom(formula = prog ~ ses + schtyp + sum.subject, data = df.reduced)
## 
## Coefficients:
##          (Intercept)    seslow sesmiddle schtyppublic sum.subject
## general     2.593944 0.8078324 0.5808536    0.5594952 -0.01635887
## vocation    6.372051 0.1330839 1.1517240    1.8490860 -0.03681150
## 
## Std. Errors:
##          (Intercept)    seslow sesmiddle schtyppublic sum.subject
## general     1.587502 0.5386033 0.4720925    0.5219044 0.005422494
## vocation    1.877764 0.6468558 0.5465572    0.7974692 0.006553295
## 
## Residual Deviance: 336.0554 
## AIC: 356.0554

The stepwise method selects the following explanatory variables:

  1. ses
  2. schtyp
  3. sum.subject
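
Stepwise selection of this kind is typically driven by AIC, which is the residual deviance plus twice the number of estimated parameters. As a quick check, the sketch below recomputes the AIC of the three models from the deviances reported above. The parameter count of 10 for the stepwise model is inferred from its coefficient table (2 rows by 5 columns) rather than printed in the output.

```r
models <- data.frame(
  model    = c("five separate scores", "sum.subject", "stepwise"),
  deviance = c(304.9567, 328.2614, 336.0554),
  n_params = c(28, 20, 10)  # 10 is inferred from the coefficient table
)

# AIC = deviance + 2 * number of parameters.
models$AIC <- models$deviance + 2 * models$n_params

# Lowest AIC first.
print(models[order(models$AIC), ])
```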

Final Model

We construct a plot of predicted probabilities from our selected model where:

  • math score varies over the observed range

  • all other predictors are set at their most common level or mean value

##        id            write            read            math         science     
##  Min.   :100.5   Min.   :52.77   Min.   :52.23   Min.   :33.0   Min.   :51.85  
##  1st Qu.:100.5   1st Qu.:52.77   1st Qu.:52.23   1st Qu.:43.5   1st Qu.:51.85  
##  Median :100.5   Median :52.77   Median :52.23   Median :54.0   Median :51.85  
##  Mean   :100.5   Mean   :52.77   Mean   :52.23   Mean   :54.0   Mean   :51.85  
##  3rd Qu.:100.5   3rd Qu.:52.77   3rd Qu.:52.23   3rd Qu.:64.5   3rd Qu.:51.85  
##  Max.   :100.5   Max.   :52.77   Max.   :52.23   Max.   :75.0   Max.   :51.85  
##      socst          gender      race       schtyp       ses    
##  Min.   :52.41   female:43   white:43   public:43   middle:43  
##  1st Qu.:52.41                                                 
##  Median :52.41                                                 
##  Mean   :52.41                                                 
##  3rd Qu.:52.41                                                 
##  Max.   :52.41
##       math         academic          general           vocation      
##  Min.   :33.0   Min.   :0.06328   Min.   :0.05617   Min.   :0.08253  
##  1st Qu.:43.5   1st Qu.:0.17333   1st Qu.:0.12945   1st Qu.:0.20305  
##  Median :54.0   Median :0.39356   Median :0.22677   Median :0.37968  
##  Mean   :54.0   Mean   :0.42368   Mean   :0.21086   Mean   :0.36545  
##  3rd Qu.:64.5   3rd Qu.:0.66750   3rd Qu.:0.29657   3rd Qu.:0.53010  
##  Max.   :75.0   Max.   :0.86130   Max.   :0.32211   Max.   :0.61462
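
The predicted probabilities summarized above can be reproduced by hand, which also shows how a multinomial logit turns linear predictors into probabilities. The sketch below uses the coefficients reported earlier for the model with all predictors (prog ~ .), since the prediction grid contains id and the five separate scores; it assumes the grid values shown above (female, white, middle, public, and the rounded means). The helper name `predict_probs` is hypothetical.

```r
# Coefficients reported for the full model; academic is the baseline category.
# gendermale is omitted because the grid uses the reference level (female).
b_general <- c(intercept = 4.263658, id = -0.007332836, racewhite = 0.8609754,
               sesmiddle = 0.7430976, schtyppublic = 0.1384853,
               read = -0.05445264, write = -0.03716360, math = -0.1037470,
               science = 0.1065258, socst = -0.01786542)
b_vocation <- c(intercept = 7.845921, id = -0.003680462, racewhite = 0.6223190,
                sesmiddle = 1.1897765, schtyppublic = 1.8285649,
                read = -0.04078359, write = -0.03220268, math = -0.1099712,
                science = 0.0537472, socst = -0.07959798)

predict_probs <- function(math) {
  # Covariate values from the prediction grid above (rounded means).
  x <- c(intercept = 1, id = 100.5, racewhite = 1, sesmiddle = 1,
         schtyppublic = 1, read = 52.23, write = 52.77, math = math,
         science = 51.85, socst = 52.41)
  # Linear predictors; the baseline category has a linear predictor of 0.
  eta <- c(academic = 0,
           general  = sum(b_general * x[names(b_general)]),
           vocation = sum(b_vocation * x[names(b_vocation)]))
  # Softmax: exponentiate and normalize so the probabilities sum to 1.
  exp(eta) / sum(exp(eta))
}

math_values <- c(33, 54, 75)
probs <- t(sapply(math_values, predict_probs))
rownames(probs) <- math_values
print(round(probs, 4))
```

The rows for math scores of 33 and 75 should land close to the extremes in the summary above, with small differences due to rounding of the means.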

##                         df$prog
## predict(multinomial.fit) academic general vocation
##                 academic       92      22       15
##                 general         6      12        5
##                 vocation        7      11       30
## [1] "Accuracy: 0.69"

The output above reports an accuracy of 69% in predicting the correct program choice (the diagonal of the table shown, 92 + 12 + 30 = 134 of 200, works out to 67%). The model rarely predicts the general program and classifies only 12 of the 45 general students correctly. This makes sense: we saw, for example, in the math score plot that the general program was not chosen at all by students who scored in the higher range.
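
Overall accuracy hides how unevenly the model performs across programs. As a sketch, the block below rebuilds the confusion table from the counts printed above and computes the overall accuracy along with the share of each actual program that is predicted correctly; the object names are illustrative.

```r
progs <- c("academic", "general", "vocation")

# Counts copied from the table above: rows are predictions, columns are actual.
confusion <- matrix(
  c(92, 22, 15,
     6, 12,  5,
     7, 11, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = progs, actual = progs)
)

# Overall accuracy: correctly classified students over all students.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-program recall: share of each actual program that is predicted correctly.
recall <- diag(confusion) / colSums(confusion)

print(round(accuracy, 2))
print(round(recall, 3))
```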

The plot highlights the positive trend for the academic program as the math score goes up, which we saw previously in our plots of scores versus program choice. It also shows that low math scores tend to favor the general and vocation programs, which was also apparent in our initial analysis.