Learning Learning Outcomes

After finishing this chapter, students should be able to

Explain the main idea of modelling multi-class categorical nominal variable using baseline models.
Calculate the probability belonging to each categorical based on the fitted baseline model.
Conduct a likelihood ratio test to compare nested baseline models based on the computer output.
Explain the main idea of modelling multi-class categorical ordinal variable using proportional odds models.
Calculate the probability belonging to each categorical based on the fitted proportional odds model.
Conduct a likelihood ratio test to compare nested proportional odds models based on the computer output.
Explain the relationship between proportional odds model and cumulative link model.
Calculate the accuracy of a fitted baseline/proportional odds model based on the confusion matrix.
Explain the main idea of a probit regression.

Introduction

-In this chapter, we focus on generalizing logistic regression from binary response variable to nominal and ordinal response variables with more than two levels.

Baseline Model for Multi-class Nominal Data
- Suppose the categorical response variable has \(c\) levels, and arbitrarily let category 1 be the reference level.
- Logit regression can be generalized to multi-class.

The procedure of Building Baseline Model for Multi-class Nominal Data

Fit \(c-1\) binary logit regression \[\begin{eqnarray} && \ \ \ \ \ln \frac{P(Y=j)}{P(Y=1)}=\beta_0^{(j)}+\beta_1^{(j)}x_1+\cdots+\beta_p^{(j)}x_p =\mathbf{x}\beta^{(j)} \nonumber\\ &&\Longrightarrow P(Y=j)=P(Y=1) e^{\mathbf{x} \beta^{(j)}}, j=2, 3, \cdots, c. \end{eqnarray}\] That is \[ \begin{aligned} P(Y=1)&=P(Y=1)\\ P(Y=2)&=P(Y=1)\times e^{\mathbf{x} \beta^{(2)}}\\ P(Y=3)&=P(Y=1)\times e^{\mathbf{x} \beta^{(3)}}\\ \vdots&=\vdots\\ P(Y=c)&=P(Y=1)\times e^{\mathbf{x} \beta^{(c)}}\\ \end{aligned} \] Given the fact that \(\sum_{j=1}^c P(Y=j)=1\), we have \[\begin{eqnarray} && \ \ \ \ P(Y=1)(1+e^{\mathbf{x} \beta^{(2)}}+e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}})=1 \nonumber\\ &&\Longrightarrow \nonumber\\ && \ \ \ \ P(Y=1)=\frac{1}{1+e^{\mathbf{x} \beta^{(2)}+}e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}}}. \end{eqnarray}\] Therefore, we can solve for \[ P(Y=j)=\frac{e^{\mathbf{x} \beta^{(j)}}}{1+e^{\mathbf{x} \beta^{(2)}+}e^{\mathbf{x} \beta^{(3)}}+\cdots+e^{\mathbf{x} \beta^{(c)}}}, j=2, \cdots, c. \]
Classify the observation to the class with the largest probability.

Example

We use the Glass data to illustrate how logistic regression can be fitted for nominal response with more than two levels.
The detailed information about the “Glass” data set can be found https://archive.ics.uci.edu/ml/datasets/Glass+Identification.
The original data set has 214 observations, 6 types of glasses (response variable), and 10 explanatory variables (ID plus 9 ingredients such as Na, Mg, AL, Ca, Ba, Fe and so on).

gdf=read.csv("glass_3type.csv")
head(gdf)

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

To simplify the problem to show the main idea of logistic regression with multi-class response, we only pick the three types of glass with relatively large frequencies:
- type 1 (building_windows_float_processed)
- type 2 (building_windows_non_float_processed) and
- type 7 (headlamps).
We only consider the three ingredients (Ba, Mg and Al) as predictors.

We fit a model with three ingredients Ba, Mg and Al.

library(nnet)
mlm=multinom(Type~Ba+Mg+Al, trace=F, data=gdf)
summary(mlm)

## Call:
## multinom(formula = Type ~ Ba + Mg + Al, data = gdf, trace = F)
## 
## Coefficients:
##   (Intercept)         Ba        Mg       Al
## 2  -0.8628106 -0.2563371 -1.280758 4.045984
## 7  -5.8562697  1.3506022 -2.035040 6.904372
## 
## Std. Errors:
##   (Intercept)       Ba        Mg        Al
## 2    1.459145 1.532541 0.3581700 0.9153595
## 7    2.354668 1.602768 0.4561505 1.4073782
## 
## Residual Deviance: 199.7249 
## AIC: 215.7249

Based on the computer output of mlm, write down the fitted baseline model.

Predict the glass type of the first observation which has Ba=0, Mg=4.49, and Al=1.10,

Likelihood Ratio Test for Model Selection

Likelihood ratio test can be used to compare two nested models.
Suppose the nominal response variable has \(c\) levels and the logit model has \(k\) predictor variables, the baseline logit model has \((c-1)(k+1)\) number of parameters.

Example

Recall the glass data set.

Based on the computer output below, test at the 5% significance level whether the reduced model with Ba, Mg and Al is as effective as the full model with all nine ingredients.

mlm0=multinom(Type~., trace=F, data=gdf) #full model
summary(mlm0)

## Call:
## multinom(formula = Type ~ ., data = gdf, trace = F)
## 
## Coefficients:
##   (Intercept)        RI        Na        Mg        Al        Si         K
## 2   202.38031 285.92105 -5.654031 -7.952748 -1.591137 -6.503354 -5.246471
## 7   -15.99688  67.36962  1.384642 -3.856447  4.993300 -1.310103  1.949833
##           Ca        Ba         Fe
## 2 -6.5879425 -8.430123   1.889733
## 7 -0.9702565  1.112616 -36.129595
## 
## Std. Errors:
##   (Intercept)         RI        Na        Mg       Al        Si        K
## 2  0.03690012 0.05816511 0.5314143 0.5956419 1.287586 0.1342823 1.794421
## 7  0.04623695 0.07687846 1.2848145 0.8610747 2.324660 0.2555301 2.299190
##          Ca       Ba        Fe
## 2 0.4407552 2.829943 2.1911100
## 7 0.7878997 3.184999 0.1211051
## 
## Residual Deviance: 157.4798 
## AIC: 197.4798

anova(mlm0,mlm) #likelihood ratio test

##                                       Model Resid. df Resid. Dev   Test    Df
## 1                              Ba + Mg + Al       342   199.7249           NA
## 2 RI + Na + Mg + Al + Si + K + Ca + Ba + Fe       330   157.4798 1 vs 2    12
##   LR stat.      Pr(Chi)
## 1       NA           NA
## 2 42.24507 3.028625e-05

c(logLik(mlm0),logLik(mlm)) #log-likelihood of two models

## [1] -78.73992 -99.86245

We can run the likelihood ratio test using the built-in function .

library(lmtest)
lrtest(mlm,mlm0) #likelihood ratio test

## Likelihood ratio test
## 
## Model 1: Type ~ Ba + Mg + Al
## Model 2: Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -99.862                         
## 2  20 -78.740 12 42.245  3.029e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also compare the accuracy of the two fitted baseline models (with all ingredients) and (with three ingredients).

Based on the confusion tables, confirm that the accuracy of the reduced model is 0.7143 and the accuracy of the full model is 0.7829.

fit.mat=mlm$fitted.values #n by 3 matrix
fit.mat[1:3,]

##           1         2            7
## 1 0.8963698 0.1030816 0.0005485729
## 2 0.4873383 0.5016772 0.0109845772
## 3 0.2976229 0.6766409 0.0257362348

pred=c(1,2,7)[apply(fit.mat,1,which.max)] #predict class label
pred[1:3]

## [1] 1 2 2

fit.mat0=mlm0$fitted.values
pred0=c(1,2,7)[apply(fit.mat0,1,which.max)]
(tab=table(True=gdf$Type,Predict=pred)) #confusion table

##     Predict
## True  1  2  7
##    1 47 23  0
##    2 21 53  2
##    7  1  3 25

(tab0=table(True=gdf$Type,Predict=pred0))

##     Predict
## True  1  2  7
##    1 52 18  0
##    2 18 57  1
##    7  0  1 28

c(sum(diag(tab))/nrow(gdf),sum(diag(tab0))/nrow(gdf)) #accuracy

## [1] 0.7142857 0.7828571

Cumulative Logit Model for Multi-class Ordinal Data

When the response variable is ordinal with \(c\) levels, we have \[ P(Y\le j)=P(Y=1)+P(Y=2)+\cdots+P(Y=j)=p_1+p_2+\cdots+p_j, j=1, 2, \cdots, c. \]
The cumulative probabilities reflect the ordering: \[ P(Y\le 1)\le P(Y\le 2)\le \cdots \le P(Y\le c)=1. \]
The cumulative logits are given by \[ \mbox{logit}[P(Y\le j)]=\ln\frac{P(Y\le j)}{1-P(Y\le j)}=\ln \frac{p_1+p_2+\cdots+p_j}{p_{j+1}+\cdots+p_c}, j=1, 2, \cdots, c-1. \]
For cumulative logit model, we don’t need to model \(P(Y\le c)\) since \(P(Y\le c)=1\).

Example Recall the heart disease data has a variable - output : heart disease (0-4, 0=no presence)

For the variable output, the cumulative logits are:

Cumulative Logit Models with Proportional Odds

The cumulative logit model with only one explanatory variable \(x\) is given by \[ \mbox{logit}[P(Y \le j)]=\alpha_j+\beta x, j=1, 2, \cdots, c-1. \]

The parameter \(\beta\) quantifies the effect of \(x\) on the log odds of resulting in category \(j\) or below; in the model the effect is assumed to be the same for all \(j=1, 2, \cdots, c-1\).

The interpretation of the coefficients of the fitted proportional odds regression model is as follows:

Interpretation of \(\alpha_j\). For a fixed value of \(x\), \[ \mbox{logit}P(Y\le j|x=a)]-\mbox{logit}P(Y \le i|x=a)=(\alpha_j+\beta a)-(\alpha_i+\beta a)=\alpha_j-\alpha_i, j>i \] which implies \[ \frac{P(Y\le j)/P(Y>j)}{P(Y\le i)/P(Y>i)}=e^{a_j-a_i}. \] That is, the odds of resulting in category \(j\) or below is \(e^{\alpha_j-\alpha_i}\) times of the odds of resulting in category \(i\) or below given that \(x\) is the same.
Interpretation of \(\beta\). With one unit increase in \(x\), \[ \mbox{logit}P(Y\le j|x=a+1)-\mbox{logit}P(Y\le j|x=a)=[\alpha_j+\beta (a+1)]-[\alpha_j+\beta a]=\beta \] which implies \[ \frac{P(Y\le j|x=a+1)/P(Y>j|x=a+1)}{P(Y\le j|x=a)/P(Y>j|x=a)}=e^{\beta}. \] This means the odds of resulting in category \(j\) or below increases by \((e^{\beta}-1)\times 100\%\) when \(x\) increases by 1 unit. The change is the same for each category \(j\); this property is called proportional odds.

If \(\beta>0\) then the probability to fall into a lower category increases with increasing \(x\).
- This is against the usual interpretation, positive slope is associated with a positive correlation.
- For this reason, some statistical software models \(\mbox{logit}(P(Y\le j))=\alpha_j-\beta x\) and reports the negative of the slope.
- Make sure that you read the documentation of the built-in functions. For example in R, polr function in package MASS reports the negative of the slope. However, the vglm function in package VGAM reports the positive of the slope.

In general, logistic regression for ordinal response is given by \[ \mbox{logit}[P(Y \le j)]=\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k, j=1, 2, \cdots, c-1, \]

The intercept \(\alpha\) varies but the slopes \(\beta_i\)s are the same for all categories \(j=1, 2, \cdots c-1\).
This model has \((c-1)+k\) parameters, much smaller than the number of parameters of the baseline logistic regression for nominal response which uses different intercept and slopes \(\beta_i\)s for each categories \(j=2, \cdots c\) given that \(c=1\) is the reference level.
The number of parameters in a baseline logit regression model is \((c-1)(k+1)\).

Model Probability of Each Category

Given that \[ \mbox{logit}[P(Y \le j)]=\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k, j=1, 2, \cdots, c-1, \] we have \[ P(Y \le j)=\frac{e^{\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k}}{1+e^{\alpha_j+\beta_1 x_1+\beta_2x_2+\cdots+\beta_kx_k}}, j=1, 2, \cdots, c-1. \]
Therefore, the probability of each category can be calculated as \[ P(Y=j)=P(Y\le j)-P(Y\le j-1), j=2, 3, \cdots, c. \]
Note that \(P(Y=1)=P(Y\le 1)\).
We assign the observation to the category with the largest probability, i.e., \[ \hat{y}=\mbox{argmax}_j P(Y=j). \]

Example: Proportional Odds Model for Ordinal Response

Recall the heart disease data with variables

age : Age of the patient
sex : Sex of the patient (1 = male; 0 = female)
cp : Chest Pain type
- Value 0: asymptomatic
- Value 1: non-anginal pain
- Value 2: atypical angina
- Value 3: typical angina
trtbps : resting blood pressure (in mm Hg)
chol : cholesterol in mg/dl fetched via BMI sensor
fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
restecg : resting electrocardiographic results
- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
thalach : maximum heart rate achieved
exang: exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest. ST relates to positions on the ECG plot.
slp: the slope of the peak exercise ST segment
- Value 1: upsloping
- Value 2: flat
- Value 3: downsloping
ca: number of major vessels (0-3)
thal: A blood disorder called thalassemia
- Value 3: normal blood flow
- Value 6: fixed defect (no blood flow in some part of the heart)
- Value 7: reversible defect (a blood flow is observed but it is not normal)
output : heart disease (0-4, 0=no presence)

Fit a proportional odds model on the response variable output which has five levels (0-4) using sex and number of major vessels with blockage ca as the predictor variables.

We can use the polr function in R package MASS to fit a proportional odds logic regression. Note that the polr function reports negative of the slope.

hdf=read.csv("heart_disease.csv")
library(MASS)
pofit=polr(as.factor(output)~as.factor(sex)+ca,data=hdf,Hess=T)
summary(pofit)

## Call:
## polr(formula = as.factor(output) ~ as.factor(sex) + ca, data = hdf, 
##     Hess = T)
## 
## Coefficients:
##                 Value Std. Error t value
## as.factor(sex)1 1.079     0.2782   3.877
## ca              1.151     0.1359   8.466
## 
## Intercepts:
##     Value   Std. Error t value
## 0|1  1.6434  0.2614     6.2858
## 1|2  2.7185  0.2943     9.2386
## 2|3  3.6417  0.3267    11.1475
## 3|4  5.3086  0.4260    12.4608
## 
## Residual Deviance: 665.1239 
## AIC: 677.1239 
## (4 observations deleted due to missingness)

The label 0∣1 represents the odds ratio \(\frac{P(Y\le 0)}{P(Y>0)}\), which is equivalent to \(\frac{P(Y\le 0)}{P(Y\ge 1)}\). Similarly, the label 1|2 corresponds to \(\frac{P(Y\le 1)}{P(Y>1)}=\frac{P(Y\le 1)}{P(Y\ge 2)}\). The same rule applies to all remaining labels.

Based on the computer output, write down the fitted proportional odds model.

We can create the table of coefficients including p-value based on the fitted proportional odds model pofit.

ctable=coef(summary(pofit))
## calculate and store p values
p=pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable=cbind(ctable, "p value" = p))

##                    Value Std. Error   t value      p value
## as.factor(sex)1 1.078707  0.2782487  3.876774 1.058504e-04
## ca              1.150559  0.1359032  8.466024 2.539118e-17
## 0|1             1.643417  0.2614488  6.285808 3.261539e-10
## 1|2             2.718474  0.2942508  9.238629 2.496760e-20
## 2|3             3.641733  0.3266851 11.147533 7.361919e-29
## 3|4             5.308555  0.4260188 12.460847 1.220677e-35

Interpret the coefficients of the fitted proportional odds model.

\(\hat{\beta}_1=1.079\):
\(\hat{\beta}_2=1.151\):

Probability of Each Category. We choose several subjects to predict their categories of heart disease status (0-4).

(testdf=hdf[c(1,2,5,7),c("sex","ca","output")])

##   sex ca output
## 1   1  0      0
## 2   1  3      2
## 5   0  0      0
## 7   0  2      3

For the first subject who is a male (\(x_1=1\)) and has no major vessel with blockage (\(x_2=0\)), his probability of heart disease status is \[ \begin{aligned} P(Y\le 0)&=\frac{e^{1.6434-1.079 x_1-1.151 x_2}}{1+e^{1.6434-1.079 x_1-1.151 x_2}}=\frac{e^{1.6434-1.079\times 1-1.151\times 0}}{1+e^{1.6434-1.079\times 1-1.151\times 0}}=0.6375\\ P(Y\le 1)&=\frac{e^{2.7185-1.079 x_1-1.151 x_2}}{1+e^{2.7185-1.079 x_1-1.151 x_2}}=\frac{e^{2.7185-1.079 \times 1-1.151 \times 0}}{1+e^{2.7185-1.079 \times 1-1.151 \times 0}}=0.8375\\ P(Y\le 2)&=\frac{e^{3.6417-1.079 x_1-1.151 x_2}}{1+e^{3.6417-1.079 x_1-1.151 x_2}}=\frac{e^{3.6417-1.079 \times 1-1.151 \times 0}}{1+e^{3.6417-1.079 \times 1-1.151 \times 0}}=0.9284\\ P(Y\le 3)&=\frac{e^{5.3086-1.079 x_1-1.151 x_2}}{1+e^{5.3086-1.079 x_1-1.151 x_2}}=\frac{e^{5.3086-1.079 \times 1-1.151 \times 0}}{1+e^{5.3086-1.079 \times 1-1.151 \times 0}}=0.9857\\ \end{aligned} \] which gives \[ \begin{aligned} P(Y=0)&=P(Y\le 0)=0.6375\\ P(Y=1)&=P(Y\le 1)-P(Y\le 0)=0.8375-0.6375=0.2\\ P(Y=2)&=P(Y\le 2)-P(Y\le 1)=0.9284-0.8375=0.0909\\ P(Y=3)&=P(Y\le 3)-P(Y\le 2)=0.9857-0.9284=0.0573\\ P(Y=4)&=P(Y\le 4)-P(Y\le 3)=1-0.9857=0.0143\\ \end{aligned} \]

Since \(P(Y=0)\) has the largest value, we predict the first subject’s disease status as 0, which is a correct prediction.

Similarly, for the 4th subject who is a female (\(x_1=0\)) with 2 major vessels blocked (\(x_2=2\)), her probability of each disease status can be calculated as \[ \begin{aligned} P(Y\le 0)&=\frac{e^{1.6434-1.079 x_1-1.151 x_2}}{1+e^{1.6434-1.079 x_1-1.151 x_2}}=\frac{e^{1.6434-1.079\times 0-1.151\times 2}}{1+e^{1.6434-1.079\times 0-1.151\times 2}}=0.3411\\ P(Y\le 1)&=\frac{e^{2.7185-1.079 x_1-1.151 x_2}}{1+e^{2.7185-1.079 x_1-1.151 x_2}}=\frac{e^{2.7185-1.079 \times 0-1.151 \times 2}}{1+e^{2.7185-1.079 \times 0-1.151 \times 2}}=0.6026\\ P(Y\le 2)&=\frac{e^{3.6417-1.079 x_1-1.151 x_2}}{1+e^{3.6417-1.079 x_1-1.151 x_2}}=\frac{e^{3.6417-1.079 \times 0-1.151 \times 2}}{1+e^{3.6417-1.079 \times 0-1.151 \times 2}}=0.7924\\ P(Y\le 3)&=\frac{e^{5.3086-1.079 x_1-1.151 x_2}}{1+e^{5.3086-1.079 x_1-1.151 x_2}}=\frac{e^{5.3086-1.079 \times 0-1.151 \times 2}}{1+e^{5.3086-1.079 \times 0-1.151 \times 2}}=0.9529\\ \end{aligned} \] which gives

Since \(P(Y=0)\) has the largest value, we predict the 4th subject’s disease status as 0 which turns out to be an error, the true disease status is 3. This might be due to the fact that only sex and ca are used to fit the model.

We can also use the built-in predict function to obtain the probabilities.

predict(pofit,testdf,type="p") #probability of each category

##           0          1          2          3          4
## 1 0.6375414 0.19996170 0.09094055 0.05721049 0.01434581
## 2 0.0528015 0.08760571 0.15098063 0.39389758 0.31471458
## 5 0.8379993 0.10010868 0.03635438 0.02061294 0.00492470
## 7 0.3412560 0.26159419 0.18974070 0.16031809 0.04709106

predict(pofit,testdf) #return the category label

## [1] 0 3 0 0
## Levels: 0 1 2 3 4

We can also use vglm function in VGAM package to fit a cumulative proportional odds model. Note that the vglm function reports the positive of the slope. We have the identical results as the output of pofit.

library(VGAM)
pofit1=vglm(output~sex+ca,family=cumulative(parallel=TRUE),data=hdf)
pofit1

## 
## Call:
## vglm(formula = output ~ sex + ca, family = cumulative(parallel = TRUE), 
##     data = hdf)
## 
## 
## Coefficients:
## (Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4           sex 
##      1.643443      2.718512      3.641761      5.308637     -1.078700 
##            ca 
##     -1.150571 
## 
## Degrees of Freedom: 1196 Total; 1190 Residual
## Residual deviance: 665.1239 
## Log-likelihood: -332.562

predictvglm(pofit1,newdata=testdf,type="response") #probability of each category

##            0          1          2          3           4
## 1 0.63754933 0.19996011 0.09093667 0.05720935 0.014344544
## 2 0.05280143 0.08760708 0.15097950 0.39390908 0.314702905
## 5 0.83800287 0.10010734 0.03635286 0.02061263 0.004924299
## 7 0.34125655 0.26159713 0.18973796 0.16031991 0.047088450

Proportional Odds Model Versus Baseline Model

We use forward selection to fit the proportional odds model and baseline model.

chdf=hdf[complete.cases(hdf),] #complete cases
chdf$sex=as.factor(chdf$sex)
chdf$cp=as.factor(chdf$cp)
chdf$fbs=as.factor(chdf$fbs)
chdf$restecg=as.factor(chdf$restecg)
chdf$exang=as.factor(chdf$exang)
chdf$slp=as.factor(chdf$slp)
chdf$thal=as.factor(chdf$thal)

m0=polr(as.factor(output)~1,data=chdf)
mf=polr(as.factor(output)~.,data=chdf)
mf.po= step(m0, scope=formula(mf), direction="forward",trace = F, test="Chisq") #final po model

mm0=multinom(as.factor(output)~1, trace=F, data=chdf)
mmf=multinom(as.factor(output)~., trace=F, data=chdf)


#mf.b= step(mm0, scope=formula(mmf), direction="forward",trace=F,test="Chisq")
#summary(mf.b)

#fit the final model mannually
mf.b= multinom(as.factor(output) ~ thal + ca + oldpeak + cp + thalachh + fbs, data = chdf, trace=F)

mf.po #final proportional odds model

## Call:
## polr(formula = as.factor(output) ~ thal + ca + cp + slp + sex + 
##     oldpeak + restecg + thalachh + trtbps, data = chdf)
## 
## Coefficients:
##       thal6       thal7          ca         cp1         cp2         cp3 
##  0.34898989  1.36311200  0.87626924 -1.62110826 -1.08404676 -1.82304970 
##        slp2        slp3        sex1     oldpeak    restecg1    restecg2 
##  1.00299143  0.92753487  1.01167619  0.26029082  1.84526450  0.44062969 
##    thalachh      trtbps 
## -0.01228890  0.01285316 
## 
## Intercepts:
##      0|1      1|2      2|3      3|4 
## 2.079397 3.810718 5.083658 7.074360 
## 
## Residual Deviance: 520.0181 
## AIC: 556.0181

mf.b #final baseline model

## Call:
## multinom(formula = as.factor(output) ~ thal + ca + oldpeak + 
##     cp + thalachh + fbs, data = chdf, trace = F)
## 
## Coefficients:
##   (Intercept)      thal6    thal7        ca   oldpeak       cp1         cp2
## 1   0.7146982  0.3272016 1.649825 0.8843728 0.3547678 -1.590706  -1.0164124
## 2   0.2407046  1.5594422 2.290530 1.2777661 0.8970209 -2.225817  -1.8198575
## 3   0.8051665 -0.2434778 2.760109 1.4632858 1.0171285 -2.094614  -0.4655464
## 4  -4.1287656  2.0967411 2.739979 1.7812933 1.0833166 -2.827671 -14.6139522
##          cp3    thalachh       fbs1
## 1  -1.120156 -0.01457584 -0.8366657
## 2  -2.518059 -0.02599813  0.8194541
## 3 -15.020256 -0.03590242  0.9856303
## 4  -1.397088 -0.01067928 -0.6510075
## 
## Residual Deviance: 504.7278 
## AIC: 584.7278

c(AIC(mf.po),AIC(mf.b))

## [1] 556.0181 584.7278

c(BIC(mf.po),BIC(mf.b))

## [1] 622.5053 732.4771

c(logLik(mf.po),logLik(mf.b))

## [1] -260.0091 -252.3639

The final proportional odds model mf.po includes variables thal (three levels), ca, cp (four levels), slp (three levels), sex (two levels), oldpeak, restecg (three levels), thalachh, trtbps.
- The fitted final model has \(4+14=18\) parameters.
The final baseline mf.b model includes variables thal (three levels), ca, cp (four levels), oldpeak, thalachh, fbs (two levels). The fitted final model has \(4\times 10=40\) parameters.
- The final proportional odds model has a smaller AIC and BIC, and a slightly smaller log likelihood.
We can also compare the accuracy of the cumulative proportional odds model and the baseline model.
- The accuracies of the two models are almost identical, 0.633 for proportional odds model and 0.636 for baseline model.

predpo=c(0:4)[apply(mf.po$fitted.values,1,which.max)]
predb=c(0:4)[apply(mf.b$fitted.values,1,which.max)]

(tabpo=table(True=chdf$output,Preditct=predpo))

##     Preditct
## True   0   1   2   3   4
##    0 150   7   1   2   0
##    1  26  16   3   8   1
##    2   5  10   4  16   0
##    3   1   9   7  16   2
##    4   1   2   2   6   2

(tabb=table(True=chdf$output,Preditct=predb))

##     Preditct
## True   0   1   2   3   4
##    0 149   7   2   2   0
##    1  28  13   3  10   0
##    2   3  10   5  17   0
##    3   2   9   3  20   1
##    4   1   3   2   5   2

accuracy.po=sum(diag(tabpo))/nrow(chdf)
accuracy.b=sum(diag(tabb))/nrow(chdf)
c(accuracy.po,accuracy.b)

## [1] 0.6329966 0.6363636

Multiclass Logistic Regression

STAT 371: Applied Categorical Data Analysis