Cover Sections 6.1-6.3 of the Recommended Textbook
After finishing this chapter, students should be able to
Explain the main idea of modelling multi-class categorical nominal variable using baseline models.
Calculate the probability belonging to each categorical based on the fitted baseline model.
Conduct a likelihood ratio test to compare nested baseline models based on the computer output.
Explain the main idea of modelling multi-class categorical ordinal variable using proportional odds models.
Calculate the probability belonging to each categorical based on the fitted proportional odds model.
Conduct a likelihood ratio test to compare nested proportional odds models based on the computer output.
Explain the relationship between proportional odds model and cumulative link model.
Calculate the accuracy of a fitted baseline/proportional odds model based on the confusion matrix.
Explain the main idea of a probit regression.
-In this chapter, we focus on generalizing logistic regression from binary response variable to nominal and ordinal response variables with more than two levels.
Baseline Model for Multi-class Nominal Data
Fit \(c-1\) binary logit regression \[\begin{eqnarray} && \ \ \ \ \ln \frac{P(Y=j)}{P(Y=1)}=\beta_0^{(j)}+\beta_1^{(j)}x_1+\cdots+\beta_p^{(j)}x_p =\mathbf{x}\beta^{(j)} \nonumber\\ &&\Longrightarrow P(Y=j)=P(Y=1) e^{\mathbf{x} \beta^{(j)}}, j=2, 3, \cdots, c. \end{eqnarray}\] That is \[ \begin{aligned} P(Y=1)&=P(Y=1)\\ P(Y=2)&=P(Y=1)\times e^{\mathbf{x} \beta^{(2)}}\\ P(Y=3)&=P(Y=1)\times e^{\mathbf{x} \beta^{(3)}}\\ \vdots&=\vdots\\ P(Y=c)&=P(Y=1)\times e^{\mathbf{x} \beta^{(c)}}\\ \end{aligned} \] Given the fact that \(\sum_{j=1}^c P(Y=j)=1\), we have \[\begin{eqnarray} && \ \ \ \ P(Y=1)(1+e^{\mathbf{x} \beta^{(2)}}+e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}})=1 \nonumber\\ &&\Longrightarrow \nonumber\\ && \ \ \ \ P(Y=1)=\frac{1}{1+e^{\mathbf{x} \beta^{(2)}+}e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}}}. \end{eqnarray}\] Therefore, we can solve for \[ P(Y=j)=\frac{e^{\mathbf{x} \beta^{(j)}}}{1+e^{\mathbf{x} \beta^{(2)}+}e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}}}, j=2, \cdots, c. \]
Classify the observation to the class with the largest probability.
We use the Glass data to illustrate how logistic regression can be fitted for nominal response with more than two levels.
The detailed information about the “Glass” data set can be found https://archive.ics.uci.edu/ml/datasets/Glass+Identification.
The original data set has 214 observations, 6 types of glasses (response variable), and 10 explanatory variables (ID plus 9 ingredients such as Na, Mg, AL, Ca, Ba, Fe and so on).
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
To simplify the problem to show the main idea of logistic regression with multi-class response, we only pick the three types of glass with relatively large frequencies:
We only consider the three ingredients (Ba, Mg and Al) as predictors.
We fit a model with three ingredients Ba, Mg and Al.
## Call:
## multinom(formula = Type ~ Ba + Mg + Al, data = gdf, trace = F)
##
## Coefficients:
## (Intercept) Ba Mg Al
## 2 -0.8628106 -0.2563371 -1.280758 4.045984
## 7 -5.8562697 1.3506022 -2.035040 6.904372
##
## Std. Errors:
## (Intercept) Ba Mg Al
## 2 1.459145 1.532541 0.3581700 0.9153595
## 7 2.354668 1.602768 0.4561505 1.4073782
##
## Residual Deviance: 199.7249
## AIC: 215.7249
Likelihood ratio test can be used to compare two nested models.
Suppose the nominal response variable has \(c\) levels and the logit model has \(k\) predictor variables, the baseline logit model has \((c-1)(k+1)\) number of parameters.
Example
Recall the glass data set.
## Call:
## multinom(formula = Type ~ ., data = gdf, trace = F)
##
## Coefficients:
## (Intercept) RI Na Mg Al Si K
## 2 202.38031 285.92105 -5.654031 -7.952748 -1.591137 -6.503354 -5.246471
## 7 -15.99688 67.36962 1.384642 -3.856447 4.993300 -1.310103 1.949833
## Ca Ba Fe
## 2 -6.5879425 -8.430123 1.889733
## 7 -0.9702565 1.112616 -36.129595
##
## Std. Errors:
## (Intercept) RI Na Mg Al Si K
## 2 0.03690012 0.05816511 0.5314143 0.5956419 1.287586 0.1342823 1.794421
## 7 0.04623695 0.07687846 1.2848145 0.8610747 2.324660 0.2555301 2.299190
## Ca Ba Fe
## 2 0.4407552 2.829943 2.1911100
## 7 0.7878997 3.184999 0.1211051
##
## Residual Deviance: 157.4798
## AIC: 197.4798
## Model Resid. df Resid. Dev Test Df
## 1 Ba + Mg + Al 342 199.7249 NA
## 2 RI + Na + Mg + Al + Si + K + Ca + Ba + Fe 330 157.4798 1 vs 2 12
## LR stat. Pr(Chi)
## 1 NA NA
## 2 42.24507 3.028625e-05
## [1] -78.73992 -99.86245
We can run the likelihood ratio test using the built-in function .
## Likelihood ratio test
##
## Model 1: Type ~ Ba + Mg + Al
## Model 2: Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -99.862
## 2 20 -78.740 12 42.245 3.029e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the confusion tables, confirm that the accuracy of the reduced model is 0.7143 and the accuracy of the full model is 0.7829.
## 1 2 7
## 1 0.8963698 0.1030816 0.0005485729
## 2 0.4873383 0.5016772 0.0109845772
## 3 0.2976229 0.6766409 0.0257362348
## [1] 1 2 2
fit.mat0=mlm0$fitted.values
pred0=c(1,2,7)[apply(fit.mat0,1,which.max)]
(tab=table(True=gdf$Type,Predict=pred)) #confusion table## Predict
## True 1 2 7
## 1 47 23 0
## 2 21 53 2
## 7 1 3 25
## Predict
## True 1 2 7
## 1 52 18 0
## 2 18 57 1
## 7 0 1 28
## [1] 0.7142857 0.7828571
When the response variable is ordinal with \(c\) levels, we have \[ P(Y\le j)=P(Y=1)+P(Y=2)+\cdots+P(Y=j)=p_1+p_2+\cdots+p_j, j=1, 2, \cdots, c. \]
The cumulative probabilities reflect the ordering: \[ P(Y\le 1)\le P(Y\le 2)\le \cdots \le P(Y\le c)=1. \]
The cumulative logits are given by \[ \mbox{logit}[P(Y\le j)]=\ln\frac{P(Y\le j)}{1-P(Y\le j)}=\ln \frac{p_1+p_2+\cdots+p_j}{p_{j+1}+\cdots+p_c}, j=1, 2, \cdots, c-1. \]
For cumulative logit model, we don’t need to model \(P(Y\le c)\) since \(P(Y\le c)=1\).
Example Recall the heart disease data has a variable - output : heart disease (0-4, 0=no presence)
For the variable output, the cumulative logits are:
The parameter \(\beta\) quantifies the effect of \(x\) on the log odds of resulting in category \(j\) or below; in the model the effect is assumed to be the same for all \(j=1, 2, \cdots, c-1\).
The interpretation of the coefficients of the fitted proportional odds regression model is as follows:
Interpretation of \(\alpha_j\). For a fixed value of \(x\), \[ \mbox{logit}P(Y\le j|x=a)]-\mbox{logit}P(Y \le i|x=a)=(\alpha_j+\beta a)-(\alpha_i+\beta a)=\alpha_j-\alpha_i, j>i \] which implies \[ \frac{P(Y\le j)/P(Y>j)}{P(Y\le i)/P(Y>i)}=e^{a_j-a_i}. \] That is, the odds of resulting in category \(j\) or below is \(e^{\alpha_j-\alpha_i}\) times of the odds of resulting in category \(i\) or below given that \(x\) is the same.
Interpretation of \(\beta\). With one unit increase in \(x\), \[ \mbox{logit}P(Y\le j|x=a+1)-\mbox{logit}P(Y\le j|x=a)=[\alpha_j+\beta (a+1)]-[\alpha_j+\beta a]=\beta \] which implies \[ \frac{P(Y\le j|x=a+1)/P(Y>j|x=a+1)}{P(Y\le j|x=a)/P(Y>j|x=a)}=e^{\beta}. \] This means the odds of resulting in category \(j\) or below increases by \((e^{\beta}-1)\times 100\%\) when \(x\) increases by 1 unit. The change is the same for each category \(j\); this property is called proportional odds.
If \(\beta>0\) then the probability to fall into a lower category increases with increasing \(x\).
This is against the usual interpretation, positive slope is associated with a positive correlation.
For this reason, some statistical software models \(\mbox{logit}(P(Y\le j))=\alpha_j-\beta x\) and reports the negative of the slope.
Make sure that you read the documentation of the built-in functions. For example in R, polr function in package MASS reports the negative of the slope. However, the vglm function in package VGAM reports the positive of the slope.
In general, logistic regression for ordinal response is given by \[ \mbox{logit}[P(Y \le j)]=\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k, j=1, 2, \cdots, c-1, \]
The intercept \(\alpha\) varies but the slopes \(\beta_i\)s are the same for all categories \(j=1, 2, \cdots c-1\).
This model has \((c-1)+k\) parameters, much smaller than the number of parameters of the baseline logistic regression for nominal response which uses different intercept and slopes \(\beta_i\)s for each categories \(j=2, \cdots c\) given that \(c=1\) is the reference level.
The number of parameters in a baseline logit regression model is \((c-1)(k+1)\).
Given that \[ \mbox{logit}[P(Y \le j)]=\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k, j=1, 2, \cdots, c-1, \] we have \[ P(Y \le j)=\frac{e^{\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k}}{1+e^{\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k}}, j=1, 2, \cdots, c-1. \]
Therefore, the probability of each category can be calculated as \[ P(Y=j)=P(Y\le j)-P(Y\le j-1), j=2, 3, \cdots, c. \]
Note that \(P(Y=1)=P(Y\le 1)\).
We assign the observation to the category with the largest probability, i.e., \[ \hat{y}=\mbox{argmax}_j P(Y=j). \]
Recall the heart disease data with variables
Fit a proportional odds model on the response variable output which has five levels (0-4) using sex and number of major vessels with blockage ca as the predictor variables.
We can use the polr function in R package MASS to fit a proportional odds logic regression. Note that the polr function reports negative of the slope.
hdf=read.csv("heart_disease.csv")
library(MASS)
pofit=polr(as.factor(output)~as.factor(sex)+ca,data=hdf,Hess=T)
summary(pofit)## Call:
## polr(formula = as.factor(output) ~ as.factor(sex) + ca, data = hdf,
## Hess = T)
##
## Coefficients:
## Value Std. Error t value
## as.factor(sex)1 1.079 0.2782 3.877
## ca 1.151 0.1359 8.466
##
## Intercepts:
## Value Std. Error t value
## 0|1 1.6434 0.2614 6.2858
## 1|2 2.7185 0.2943 9.2386
## 2|3 3.6417 0.3267 11.1475
## 3|4 5.3086 0.4260 12.4608
##
## Residual Deviance: 665.1239
## AIC: 677.1239
## (4 observations deleted due to missingness)
The label 0∣1 represents the odds ratio \(\frac{P(Y\le 0)}{P(Y>0)}\), which is equivalent to \(\frac{P(Y\le 0)}{P(Y\ge 1)}\). Similarly, the label 1|2 corresponds to \(\frac{P(Y\le 1)}{P(Y>1)}=\frac{P(Y\le 1)}{P(Y\ge 2)}\). The same rule applies to all remaining labels.
We can create the table of coefficients including p-value based on the fitted proportional odds model pofit.
ctable=coef(summary(pofit))
## calculate and store p values
p=pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable=cbind(ctable, "p value" = p))## Value Std. Error t value p value
## as.factor(sex)1 1.078707 0.2782487 3.876774 1.058504e-04
## ca 1.150559 0.1359032 8.466024 2.539118e-17
## 0|1 1.643417 0.2614488 6.285808 3.261539e-10
## 1|2 2.718474 0.2942508 9.238629 2.496760e-20
## 2|3 3.641733 0.3266851 11.147533 7.361919e-29
## 3|4 5.308555 0.4260188 12.460847 1.220677e-35
\(\hat{\beta}_1=1.079\):
\(\hat{\beta}_2=1.151\):
## sex ca output
## 1 1 0 0
## 2 1 3 2
## 5 0 0 0
## 7 0 2 3
For the first subject who is a male (\(x_1=1\)) and has no major vessel with blockage (\(x_2=0\)), his probability of heart disease status is \[ \begin{aligned} P(Y\le 0)&=\frac{e^{1.6434-1.079 x_1-1.151 x_2}}{1+e^{1.6434-1.079 x_1-1.151 x_2}}=\frac{e^{1.6434-1.079\times 1-1.151\times 0}}{1+e^{1.6434-1.079\times 1-1.151\times 0}}=0.6375\\ P(Y\le 1)&=\frac{e^{2.7185-1.079 x_1-1.151 x_2}}{1+e^{2.7185-1.079 x_1-1.151 x_2}}=\frac{e^{2.7185-1.079 \times 1-1.151 \times 0}}{1+e^{2.7185-1.079 \times 1-1.151 \times 0}}=0.8375\\ P(Y\le 2)&=\frac{e^{3.6417-1.079 x_1-1.151 x_2}}{1+e^{3.6417-1.079 x_1-1.151 x_2}}=\frac{e^{3.6417-1.079 \times 1-1.151 \times 0}}{1+e^{3.6417-1.079 \times 1-1.151 \times 0}}=0.9284\\ P(Y\le 3)&=\frac{e^{5.3086-1.079 x_1-1.151 x_2}}{1+e^{5.3086-1.079 x_1-1.151 x_2}}=\frac{e^{5.3086-1.079 \times 1-1.151 \times 0}}{1+e^{5.3086-1.079 \times 1-1.151 \times 0}}=0.9857\\ \end{aligned} \] which gives \[ \begin{aligned} P(Y=0)&=P(Y\le 0)=0.6375\\ P(Y=1)&=P(Y\le 1)-P(Y\le 0)=0.8375-0.6375=0.2\\ P(Y=2)&=P(Y\le 2)-P(Y\le 1)=0.9284-0.8375=0.0909\\ P(Y=3)&=P(Y\le 3)-P(Y\le 2)=0.9857-0.9284=0.0573\\ P(Y=4)&=P(Y\le 4)-P(Y\le 3)=1-0.9857=0.0143\\ \end{aligned} \]
Since \(P(Y=0)\) has the largest value, we predict the first subject’s disease status as 0, which is a correct prediction.
Similarly, for the 4th subject who is a female (\(x_1=0\)) with 2 major vessels blocked
(\(x_2=2\)), her probability of each
disease status can be calculated as \[
\begin{aligned}
P(Y\le 0)&=\frac{e^{1.6434-1.079 x_1-1.151 x_2}}{1+e^{1.6434-1.079
x_1-1.151 x_2}}=\frac{e^{1.6434-1.079\times 0-1.151\times
2}}{1+e^{1.6434-1.079\times 0-1.151\times 2}}=0.3411\\
P(Y\le 1)&=\frac{e^{2.7185-1.079 x_1-1.151 x_2}}{1+e^{2.7185-1.079
x_1-1.151 x_2}}=\frac{e^{2.7185-1.079 \times 0-1.151 \times
2}}{1+e^{2.7185-1.079 \times 0-1.151 \times 2}}=0.6026\\
P(Y\le 2)&=\frac{e^{3.6417-1.079 x_1-1.151 x_2}}{1+e^{3.6417-1.079
x_1-1.151 x_2}}=\frac{e^{3.6417-1.079 \times 0-1.151 \times
2}}{1+e^{3.6417-1.079 \times 0-1.151 \times 2}}=0.7924\\
P(Y\le 3)&=\frac{e^{5.3086-1.079 x_1-1.151 x_2}}{1+e^{5.3086-1.079
x_1-1.151 x_2}}=\frac{e^{5.3086-1.079 \times 0-1.151 \times
2}}{1+e^{5.3086-1.079 \times 0-1.151 \times 2}}=0.9529\\
\end{aligned}
\] which gives
Since \(P(Y=0)\) has the largest value, we predict the 4th subject’s disease status as 0 which turns out to be an error, the true disease status is 3. This might be due to the fact that only sex and ca are used to fit the model.
We can also use the built-in predict function to obtain the probabilities.
## 0 1 2 3 4
## 1 0.6375414 0.19996170 0.09094055 0.05721049 0.01434581
## 2 0.0528015 0.08760571 0.15098063 0.39389758 0.31471458
## 5 0.8379993 0.10010868 0.03635438 0.02061294 0.00492470
## 7 0.3412560 0.26159419 0.18974070 0.16031809 0.04709106
## [1] 0 3 0 0
## Levels: 0 1 2 3 4
We can also use vglm function in VGAM package to fit a cumulative proportional odds model. Note that the vglm function reports the positive of the slope. We have the identical results as the output of pofit.
##
## Call:
## vglm(formula = output ~ sex + ca, family = cumulative(parallel = TRUE),
## data = hdf)
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4 sex
## 1.643443 2.718512 3.641761 5.308637 -1.078700
## ca
## -1.150571
##
## Degrees of Freedom: 1196 Total; 1190 Residual
## Residual deviance: 665.1239
## Log-likelihood: -332.562
## 0 1 2 3 4
## 1 0.63754933 0.19996011 0.09093667 0.05720935 0.014344544
## 2 0.05280143 0.08760708 0.15097950 0.39390908 0.314702905
## 5 0.83800287 0.10010734 0.03635286 0.02061263 0.004924299
## 7 0.34125655 0.26159713 0.18973796 0.16031991 0.047088450
We use forward selection to fit the proportional odds model and baseline model.
chdf=hdf[complete.cases(hdf),] #complete cases
chdf$sex=as.factor(chdf$sex)
chdf$cp=as.factor(chdf$cp)
chdf$fbs=as.factor(chdf$fbs)
chdf$restecg=as.factor(chdf$restecg)
chdf$exang=as.factor(chdf$exang)
chdf$slp=as.factor(chdf$slp)
chdf$thal=as.factor(chdf$thal)
m0=polr(as.factor(output)~1,data=chdf)
mf=polr(as.factor(output)~.,data=chdf)
mf.po= step(m0, scope=formula(mf), direction="forward",trace = F, test="Chisq") #final po model
mm0=multinom(as.factor(output)~1, trace=F, data=chdf)
mmf=multinom(as.factor(output)~., trace=F, data=chdf)
#mf.b= step(mm0, scope=formula(mmf), direction="forward",trace=F,test="Chisq")
#summary(mf.b)
#fit the final model mannually
mf.b= multinom(as.factor(output) ~ thal + ca + oldpeak + cp + thalachh + fbs, data = chdf, trace=F) ## Call:
## polr(formula = as.factor(output) ~ thal + ca + cp + slp + sex +
## oldpeak + restecg + thalachh + trtbps, data = chdf)
##
## Coefficients:
## thal6 thal7 ca cp1 cp2 cp3
## 0.34898989 1.36311200 0.87626924 -1.62110826 -1.08404676 -1.82304970
## slp2 slp3 sex1 oldpeak restecg1 restecg2
## 1.00299143 0.92753487 1.01167619 0.26029082 1.84526450 0.44062969
## thalachh trtbps
## -0.01228890 0.01285316
##
## Intercepts:
## 0|1 1|2 2|3 3|4
## 2.079397 3.810718 5.083658 7.074360
##
## Residual Deviance: 520.0181
## AIC: 556.0181
## Call:
## multinom(formula = as.factor(output) ~ thal + ca + oldpeak +
## cp + thalachh + fbs, data = chdf, trace = F)
##
## Coefficients:
## (Intercept) thal6 thal7 ca oldpeak cp1 cp2
## 1 0.7146982 0.3272016 1.649825 0.8843728 0.3547678 -1.590706 -1.0164124
## 2 0.2407046 1.5594422 2.290530 1.2777661 0.8970209 -2.225817 -1.8198575
## 3 0.8051665 -0.2434778 2.760109 1.4632858 1.0171285 -2.094614 -0.4655464
## 4 -4.1287656 2.0967411 2.739979 1.7812933 1.0833166 -2.827671 -14.6139522
## cp3 thalachh fbs1
## 1 -1.120156 -0.01457584 -0.8366657
## 2 -2.518059 -0.02599813 0.8194541
## 3 -15.020256 -0.03590242 0.9856303
## 4 -1.397088 -0.01067928 -0.6510075
##
## Residual Deviance: 504.7278
## AIC: 584.7278
## [1] 556.0181 584.7278
## [1] 622.5053 732.4771
## [1] -260.0091 -252.3639
The final proportional odds model mf.po includes variables thal (three levels), ca, cp (four levels), slp (three levels), sex (two levels), oldpeak, restecg (three levels), thalachh, trtbps.
The final baseline mf.b model includes variables thal (three levels), ca, cp (four levels), oldpeak, thalachh, fbs (two levels). The fitted final model has \(4\times 10=40\) parameters.
We can also compare the accuracy of the cumulative proportional odds model and the baseline model.
predpo=c(0:4)[apply(mf.po$fitted.values,1,which.max)]
predb=c(0:4)[apply(mf.b$fitted.values,1,which.max)]
(tabpo=table(True=chdf$output,Preditct=predpo))## Preditct
## True 0 1 2 3 4
## 0 150 7 1 2 0
## 1 26 16 3 8 1
## 2 5 10 4 16 0
## 3 1 9 7 16 2
## 4 1 2 2 6 2
## Preditct
## True 0 1 2 3 4
## 0 149 7 2 2 0
## 1 28 13 3 10 0
## 2 3 10 5 17 0
## 3 2 9 3 20 1
## 4 1 3 2 5 2
accuracy.po=sum(diag(tabpo))/nrow(chdf)
accuracy.b=sum(diag(tabb))/nrow(chdf)
c(accuracy.po,accuracy.b)## [1] 0.6329966 0.6363636