Multiclass Logistic Regression

STAT 371: Applied Categorical Data Analysis

Cover Sections 6.1-6.3 of the Recommended Textbook

Learning Learning Outcomes

After finishing this chapter, students should be able to

Introduction

-In this chapter, we focus on generalizing logistic regression from binary response variable to nominal and ordinal response variables with more than two levels.

The procedure of Building Baseline Model for Multi-class Nominal Data

  1. Fit \(c-1\) binary logit regression \[\begin{eqnarray} && \ \ \ \ \ln \frac{P(Y=j)}{P(Y=1)}=\beta_0^{(j)}+\beta_1^{(j)}x_1+\cdots+\beta_p^{(j)}x_p =\mathbf{x}\beta^{(j)} \nonumber\\ &&\Longrightarrow P(Y=j)=P(Y=1) e^{\mathbf{x} \beta^{(j)}}, j=2, 3, \cdots, c. \end{eqnarray}\] That is \[ \begin{aligned} P(Y=1)&=P(Y=1)\\ P(Y=2)&=P(Y=1)\times e^{\mathbf{x} \beta^{(2)}}\\ P(Y=3)&=P(Y=1)\times e^{\mathbf{x} \beta^{(3)}}\\ \vdots&=\vdots\\ P(Y=c)&=P(Y=1)\times e^{\mathbf{x} \beta^{(c)}}\\ \end{aligned} \] Given the fact that \(\sum_{j=1}^c P(Y=j)=1\), we have \[\begin{eqnarray} && \ \ \ \ P(Y=1)(1+e^{\mathbf{x} \beta^{(2)}}+e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}})=1 \nonumber\\ &&\Longrightarrow \nonumber\\ && \ \ \ \ P(Y=1)=\frac{1}{1+e^{\mathbf{x} \beta^{(2)}+}e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}}}. \end{eqnarray}\] Therefore, we can solve for \[ P(Y=j)=\frac{e^{\mathbf{x} \beta^{(j)}}}{1+e^{\mathbf{x} \beta^{(2)}+}e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}}}, j=2, \cdots, c. \]

  2. Classify the observation to the class with the largest probability.

Example

gdf=read.csv("glass_3type.csv")
head(gdf)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

We fit a model with three ingredients Ba, Mg and Al.

library(nnet)
mlm=multinom(Type~Ba+Mg+Al, trace=F, data=gdf)
summary(mlm)
## Call:
## multinom(formula = Type ~ Ba + Mg + Al, data = gdf, trace = F)
## 
## Coefficients:
##   (Intercept)         Ba        Mg       Al
## 2  -0.8628106 -0.2563371 -1.280758 4.045984
## 7  -5.8562697  1.3506022 -2.035040 6.904372
## 
## Std. Errors:
##   (Intercept)       Ba        Mg        Al
## 2    1.459145 1.532541 0.3581700 0.9153595
## 7    2.354668 1.602768 0.4561505 1.4073782
## 
## Residual Deviance: 199.7249 
## AIC: 215.7249
  1. Based on the computer output of mlm, write down the fitted baseline model.









  1. Predict the glass type of the first observation which has Ba=0, Mg=4.49, and Al=1.10,









Likelihood Ratio Test for Model Selection

Example

Recall the glass data set.

  1. Based on the computer output below, test at the 5% significance level whether the reduced model with Ba, Mg and Al is as effective as the full model with all nine ingredients.
mlm0=multinom(Type~., trace=F, data=gdf) #full model
summary(mlm0)
## Call:
## multinom(formula = Type ~ ., data = gdf, trace = F)
## 
## Coefficients:
##   (Intercept)        RI        Na        Mg        Al        Si         K
## 2   202.38031 285.92105 -5.654031 -7.952748 -1.591137 -6.503354 -5.246471
## 7   -15.99688  67.36962  1.384642 -3.856447  4.993300 -1.310103  1.949833
##           Ca        Ba         Fe
## 2 -6.5879425 -8.430123   1.889733
## 7 -0.9702565  1.112616 -36.129595
## 
## Std. Errors:
##   (Intercept)         RI        Na        Mg       Al        Si        K
## 2  0.03690012 0.05816511 0.5314143 0.5956419 1.287586 0.1342823 1.794421
## 7  0.04623695 0.07687846 1.2848145 0.8610747 2.324660 0.2555301 2.299190
##          Ca       Ba        Fe
## 2 0.4407552 2.829943 2.1911100
## 7 0.7878997 3.184999 0.1211051
## 
## Residual Deviance: 157.4798 
## AIC: 197.4798
anova(mlm0,mlm) #likelihood ratio test
##                                       Model Resid. df Resid. Dev   Test    Df
## 1                              Ba + Mg + Al       342   199.7249           NA
## 2 RI + Na + Mg + Al + Si + K + Ca + Ba + Fe       330   157.4798 1 vs 2    12
##   LR stat.      Pr(Chi)
## 1       NA           NA
## 2 42.24507 3.028625e-05
c(logLik(mlm0),logLik(mlm)) #log-likelihood of two models
## [1] -78.73992 -99.86245
















We can run the likelihood ratio test using the built-in function .

library(lmtest)
lrtest(mlm,mlm0) #likelihood ratio test
## Likelihood ratio test
## 
## Model 1: Type ~ Ba + Mg + Al
## Model 2: Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -99.862                         
## 2  20 -78.740 12 42.245  3.029e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  1. We can also compare the accuracy of the two fitted baseline models (with all ingredients) and (with three ingredients).

Based on the confusion tables, confirm that the accuracy of the reduced model is 0.7143 and the accuracy of the full model is 0.7829.

fit.mat=mlm$fitted.values #n by 3 matrix
fit.mat[1:3,]
##           1         2            7
## 1 0.8963698 0.1030816 0.0005485729
## 2 0.4873383 0.5016772 0.0109845772
## 3 0.2976229 0.6766409 0.0257362348
pred=c(1,2,7)[apply(fit.mat,1,which.max)] #predict class label
pred[1:3]
## [1] 1 2 2
fit.mat0=mlm0$fitted.values
pred0=c(1,2,7)[apply(fit.mat0,1,which.max)]
(tab=table(True=gdf$Type,Predict=pred)) #confusion table
##     Predict
## True  1  2  7
##    1 47 23  0
##    2 21 53  2
##    7  1  3 25
(tab0=table(True=gdf$Type,Predict=pred0))
##     Predict
## True  1  2  7
##    1 52 18  0
##    2 18 57  1
##    7  0  1 28
c(sum(diag(tab))/nrow(gdf),sum(diag(tab0))/nrow(gdf)) #accuracy
## [1] 0.7142857 0.7828571

Cumulative Logit Model for Multi-class Ordinal Data

Example Recall the heart disease data has a variable - output : heart disease (0-4, 0=no presence)

For the variable output, the cumulative logits are:

Cumulative Logit Models with Proportional Odds

The parameter \(\beta\) quantifies the effect of \(x\) on the log odds of resulting in category \(j\) or below; in the model the effect is assumed to be the same for all \(j=1, 2, \cdots, c-1\).

The interpretation of the coefficients of the fitted proportional odds regression model is as follows:

In general, logistic regression for ordinal response is given by \[ \mbox{logit}[P(Y \le j)]=\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k, j=1, 2, \cdots, c-1, \]

Model Probability of Each Category

Example: Proportional Odds Model for Ordinal Response

Recall the heart disease data with variables

Fit a proportional odds model on the response variable output which has five levels (0-4) using sex and number of major vessels with blockage ca as the predictor variables.

We can use the polr function in R package MASS to fit a proportional odds logic regression. Note that the polr function reports negative of the slope.

hdf=read.csv("heart_disease.csv")
library(MASS)
pofit=polr(as.factor(output)~as.factor(sex)+ca,data=hdf,Hess=T)
summary(pofit)
## Call:
## polr(formula = as.factor(output) ~ as.factor(sex) + ca, data = hdf, 
##     Hess = T)
## 
## Coefficients:
##                 Value Std. Error t value
## as.factor(sex)1 1.079     0.2782   3.877
## ca              1.151     0.1359   8.466
## 
## Intercepts:
##     Value   Std. Error t value
## 0|1  1.6434  0.2614     6.2858
## 1|2  2.7185  0.2943     9.2386
## 2|3  3.6417  0.3267    11.1475
## 3|4  5.3086  0.4260    12.4608
## 
## Residual Deviance: 665.1239 
## AIC: 677.1239 
## (4 observations deleted due to missingness)

The label 0∣1 represents the odds ratio \(\frac{P(Y\le 0)}{P(Y>0)}\), which is equivalent to \(\frac{P(Y\le 0)}{P(Y\ge 1)}\). Similarly, the label 1|2 corresponds to \(\frac{P(Y\le 1)}{P(Y>1)}=\frac{P(Y\le 1)}{P(Y\ge 2)}\). The same rule applies to all remaining labels.

  1. Based on the computer output, write down the fitted proportional odds model.














We can create the table of coefficients including p-value based on the fitted proportional odds model pofit.

ctable=coef(summary(pofit))
## calculate and store p values
p=pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable=cbind(ctable, "p value" = p))
##                    Value Std. Error   t value      p value
## as.factor(sex)1 1.078707  0.2782487  3.876774 1.058504e-04
## ca              1.150559  0.1359032  8.466024 2.539118e-17
## 0|1             1.643417  0.2614488  6.285808 3.261539e-10
## 1|2             2.718474  0.2942508  9.238629 2.496760e-20
## 2|3             3.641733  0.3266851 11.147533 7.361919e-29
## 3|4             5.308555  0.4260188 12.460847 1.220677e-35

  1. Interpret the coefficients of the fitted proportional odds model.

  1. Probability of Each Category. We choose several subjects to predict their categories of heart disease status (0-4).
(testdf=hdf[c(1,2,5,7),c("sex","ca","output")])
##   sex ca output
## 1   1  0      0
## 2   1  3      2
## 5   0  0      0
## 7   0  2      3

For the first subject who is a male (\(x_1=1\)) and has no major vessel with blockage (\(x_2=0\)), his probability of heart disease status is \[ \begin{aligned} P(Y\le 0)&=\frac{e^{1.6434-1.079 x_1-1.151 x_2}}{1+e^{1.6434-1.079 x_1-1.151 x_2}}=\frac{e^{1.6434-1.079\times 1-1.151\times 0}}{1+e^{1.6434-1.079\times 1-1.151\times 0}}=0.6375\\ P(Y\le 1)&=\frac{e^{2.7185-1.079 x_1-1.151 x_2}}{1+e^{2.7185-1.079 x_1-1.151 x_2}}=\frac{e^{2.7185-1.079 \times 1-1.151 \times 0}}{1+e^{2.7185-1.079 \times 1-1.151 \times 0}}=0.8375\\ P(Y\le 2)&=\frac{e^{3.6417-1.079 x_1-1.151 x_2}}{1+e^{3.6417-1.079 x_1-1.151 x_2}}=\frac{e^{3.6417-1.079 \times 1-1.151 \times 0}}{1+e^{3.6417-1.079 \times 1-1.151 \times 0}}=0.9284\\ P(Y\le 3)&=\frac{e^{5.3086-1.079 x_1-1.151 x_2}}{1+e^{5.3086-1.079 x_1-1.151 x_2}}=\frac{e^{5.3086-1.079 \times 1-1.151 \times 0}}{1+e^{5.3086-1.079 \times 1-1.151 \times 0}}=0.9857\\ \end{aligned} \] which gives \[ \begin{aligned} P(Y=0)&=P(Y\le 0)=0.6375\\ P(Y=1)&=P(Y\le 1)-P(Y\le 0)=0.8375-0.6375=0.2\\ P(Y=2)&=P(Y\le 2)-P(Y\le 1)=0.9284-0.8375=0.0909\\ P(Y=3)&=P(Y\le 3)-P(Y\le 2)=0.9857-0.9284=0.0573\\ P(Y=4)&=P(Y\le 4)-P(Y\le 3)=1-0.9857=0.0143\\ \end{aligned} \]

Since \(P(Y=0)\) has the largest value, we predict the first subject’s disease status as 0, which is a correct prediction.

Similarly, for the 4th subject who is a female (\(x_1=0\)) with 2 major vessels blocked (\(x_2=2\)), her probability of each disease status can be calculated as \[ \begin{aligned} P(Y\le 0)&=\frac{e^{1.6434-1.079 x_1-1.151 x_2}}{1+e^{1.6434-1.079 x_1-1.151 x_2}}=\frac{e^{1.6434-1.079\times 0-1.151\times 2}}{1+e^{1.6434-1.079\times 0-1.151\times 2}}=0.3411\\ P(Y\le 1)&=\frac{e^{2.7185-1.079 x_1-1.151 x_2}}{1+e^{2.7185-1.079 x_1-1.151 x_2}}=\frac{e^{2.7185-1.079 \times 0-1.151 \times 2}}{1+e^{2.7185-1.079 \times 0-1.151 \times 2}}=0.6026\\ P(Y\le 2)&=\frac{e^{3.6417-1.079 x_1-1.151 x_2}}{1+e^{3.6417-1.079 x_1-1.151 x_2}}=\frac{e^{3.6417-1.079 \times 0-1.151 \times 2}}{1+e^{3.6417-1.079 \times 0-1.151 \times 2}}=0.7924\\ P(Y\le 3)&=\frac{e^{5.3086-1.079 x_1-1.151 x_2}}{1+e^{5.3086-1.079 x_1-1.151 x_2}}=\frac{e^{5.3086-1.079 \times 0-1.151 \times 2}}{1+e^{5.3086-1.079 \times 0-1.151 \times 2}}=0.9529\\ \end{aligned} \] which gives















Since \(P(Y=0)\) has the largest value, we predict the 4th subject’s disease status as 0 which turns out to be an error, the true disease status is 3. This might be due to the fact that only sex and ca are used to fit the model.

We can also use the built-in predict function to obtain the probabilities.

predict(pofit,testdf,type="p") #probability of each category
##           0          1          2          3          4
## 1 0.6375414 0.19996170 0.09094055 0.05721049 0.01434581
## 2 0.0528015 0.08760571 0.15098063 0.39389758 0.31471458
## 5 0.8379993 0.10010868 0.03635438 0.02061294 0.00492470
## 7 0.3412560 0.26159419 0.18974070 0.16031809 0.04709106
predict(pofit,testdf) #return the category label
## [1] 0 3 0 0
## Levels: 0 1 2 3 4

We can also use vglm function in VGAM package to fit a cumulative proportional odds model. Note that the vglm function reports the positive of the slope. We have the identical results as the output of pofit.

library(VGAM)
pofit1=vglm(output~sex+ca,family=cumulative(parallel=TRUE),data=hdf)
pofit1
## 
## Call:
## vglm(formula = output ~ sex + ca, family = cumulative(parallel = TRUE), 
##     data = hdf)
## 
## 
## Coefficients:
## (Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4           sex 
##      1.643443      2.718512      3.641761      5.308637     -1.078700 
##            ca 
##     -1.150571 
## 
## Degrees of Freedom: 1196 Total; 1190 Residual
## Residual deviance: 665.1239 
## Log-likelihood: -332.562
predictvglm(pofit1,newdata=testdf,type="response") #probability of each category
##            0          1          2          3           4
## 1 0.63754933 0.19996011 0.09093667 0.05720935 0.014344544
## 2 0.05280143 0.08760708 0.15097950 0.39390908 0.314702905
## 5 0.83800287 0.10010734 0.03635286 0.02061263 0.004924299
## 7 0.34125655 0.26159713 0.18973796 0.16031991 0.047088450

Proportional Odds Model Versus Baseline Model

We use forward selection to fit the proportional odds model and baseline model.

chdf=hdf[complete.cases(hdf),] #complete cases
chdf$sex=as.factor(chdf$sex)
chdf$cp=as.factor(chdf$cp)
chdf$fbs=as.factor(chdf$fbs)
chdf$restecg=as.factor(chdf$restecg)
chdf$exang=as.factor(chdf$exang)
chdf$slp=as.factor(chdf$slp)
chdf$thal=as.factor(chdf$thal)

m0=polr(as.factor(output)~1,data=chdf)
mf=polr(as.factor(output)~.,data=chdf)
mf.po= step(m0, scope=formula(mf), direction="forward",trace = F, test="Chisq") #final po model

mm0=multinom(as.factor(output)~1, trace=F, data=chdf)
mmf=multinom(as.factor(output)~., trace=F, data=chdf)


#mf.b= step(mm0, scope=formula(mmf), direction="forward",trace=F,test="Chisq")
#summary(mf.b)

#fit the final model mannually
mf.b= multinom(as.factor(output) ~ thal + ca + oldpeak + cp + thalachh + fbs, data = chdf, trace=F) 

mf.po #final proportional odds model
## Call:
## polr(formula = as.factor(output) ~ thal + ca + cp + slp + sex + 
##     oldpeak + restecg + thalachh + trtbps, data = chdf)
## 
## Coefficients:
##       thal6       thal7          ca         cp1         cp2         cp3 
##  0.34898989  1.36311200  0.87626924 -1.62110826 -1.08404676 -1.82304970 
##        slp2        slp3        sex1     oldpeak    restecg1    restecg2 
##  1.00299143  0.92753487  1.01167619  0.26029082  1.84526450  0.44062969 
##    thalachh      trtbps 
## -0.01228890  0.01285316 
## 
## Intercepts:
##      0|1      1|2      2|3      3|4 
## 2.079397 3.810718 5.083658 7.074360 
## 
## Residual Deviance: 520.0181 
## AIC: 556.0181
mf.b #final baseline model
## Call:
## multinom(formula = as.factor(output) ~ thal + ca + oldpeak + 
##     cp + thalachh + fbs, data = chdf, trace = F)
## 
## Coefficients:
##   (Intercept)      thal6    thal7        ca   oldpeak       cp1         cp2
## 1   0.7146982  0.3272016 1.649825 0.8843728 0.3547678 -1.590706  -1.0164124
## 2   0.2407046  1.5594422 2.290530 1.2777661 0.8970209 -2.225817  -1.8198575
## 3   0.8051665 -0.2434778 2.760109 1.4632858 1.0171285 -2.094614  -0.4655464
## 4  -4.1287656  2.0967411 2.739979 1.7812933 1.0833166 -2.827671 -14.6139522
##          cp3    thalachh       fbs1
## 1  -1.120156 -0.01457584 -0.8366657
## 2  -2.518059 -0.02599813  0.8194541
## 3 -15.020256 -0.03590242  0.9856303
## 4  -1.397088 -0.01067928 -0.6510075
## 
## Residual Deviance: 504.7278 
## AIC: 584.7278
c(AIC(mf.po),AIC(mf.b))
## [1] 556.0181 584.7278
c(BIC(mf.po),BIC(mf.b))
## [1] 622.5053 732.4771
c(logLik(mf.po),logLik(mf.b))
## [1] -260.0091 -252.3639

predpo=c(0:4)[apply(mf.po$fitted.values,1,which.max)]
predb=c(0:4)[apply(mf.b$fitted.values,1,which.max)]

(tabpo=table(True=chdf$output,Preditct=predpo))
##     Preditct
## True   0   1   2   3   4
##    0 150   7   1   2   0
##    1  26  16   3   8   1
##    2   5  10   4  16   0
##    3   1   9   7  16   2
##    4   1   2   2   6   2
(tabb=table(True=chdf$output,Preditct=predb))
##     Preditct
## True   0   1   2   3   4
##    0 149   7   2   2   0
##    1  28  13   3  10   0
##    2   3  10   5  17   0
##    3   2   9   3  20   1
##    4   1   3   2   5   2
accuracy.po=sum(diag(tabpo))/nrow(chdf)
accuracy.b=sum(diag(tabb))/nrow(chdf)
c(accuracy.po,accuracy.b)
## [1] 0.6329966 0.6363636