Polytomous Logistic Regression

## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE, 
    echo = T, fig.width = 5, fig.height = 5)
options(width = 125, scipen = 5, digits = 5)

setwd("~/statistics/self.logistic/")

References

Nominal variables

Nominal variables simply indicate different categories. An example used here is histological subtype of cancer.

Polytomous logistic regression

One of the categories of the outcome variable is designated as the reference category, and each of the other levels is compared with this reference level. In this example the outcome level Adenocarcinoma is chosen as the reference. So the comparison will be between Adenosquamous vs Adenocarcinoma and between Other vs Adenocarcinoma.

Prepare data

cancer.dta file availabe at http://www.sph.emory.edu/~dkleinb/logreg2.htm.

library(foreign)
cancer <- read.dta("http://www.sph.emory.edu/~dkleinb/datasets/cancer.dta")

cancer <- within(cancer, {
    grade    <- factor(grade,    0:2, c("well differentiated","moderately differentiated","poorly differentiated"),
                       ordered = TRUE)
    race     <- factor(race,     0:1, c("white","black"))
    estrogen <- factor(estrogen, 0:1, c("never used","ever used"))
    subtype  <- factor(subtype,  0:2, c("Adenocarcinoma","Adenosquamous","Other"))
    age      <- factor(age,      0:1, c("50-64","65-79"))
    smoking  <- factor(smoking,  0:1, c("non-smoker","current smoker"))
})

Check data

Subtype is the polytomous outcome variable.

summary(cancer)
       id                               grade        race           estrogen             subtype       age     
 Min.   : 10009   well differentiated      :130   white:207   never used:135   Adenocarcinoma:186   50-64:106  
 1st Qu.: 11590   moderately differentiated:105   black: 81   ever used :151   Adenosquamous : 45   65-79:182  
 Median : 21424   poorly differentiated    : 53               NA's      :  2   Other         : 57              
 Mean   : 25752                                                                                                
 3rd Qu.: 30820                                                                                                
 Max.   :301017                                                                                                
           smoking   
 non-smoker    :260  
 current smoker: 28  




Univariable polytomous logistic regression with nnet:multinom()

Modeling

One line of code for each model specification, but actually two models are constructed. The one that compares the reference outcome level Adenocarcinoma to Adenosquamous, and the other that compares the refence outcome level Adenocarcinoma to Others.

library(nnet)
multinom.age  <- multinom(subtype ~ age, data = cancer)
# weights:  9 (4 variable)
initial  value 316.400339 
iter  10 value 254.447180
final  value 254.445123 
converged
multinom.null <- multinom(subtype ~ 1, data = cancer)
# weights:  6 (2 variable)
initial  value 316.400339 
iter  10 value 257.190003
iter  10 value 257.190003
iter  10 value 257.190003
final  value 257.190003 
converged

Results (annotated)

(summary.multinom.age <- summary(multinom.age, Wald = TRUE))
Call:
multinom(formula = subtype ~ age, data = cancer)

Coefficients:                           # coefficients
              (Intercept)  age65-79
Adenosquamous   -1.945909 0.7809218     # Adenosquamous vs Adenocarcinoma
Other           -1.453434 0.4256484     # Other vs Adenocarcinoma

Std. Errors:                            # SE
              (Intercept)  age65-79
Adenosquamous   0.3223291 0.3774680     # Adenosquamous vs Adenocarcinoma
Other           0.2618064 0.3214933     # Other vs Adenocarcinoma

Value/SE (Wald statistics):             # Wald statistics
              (Intercept) age65-79
Adenosquamous   -6.037027 2.068842      # Adenosquamous vs Adenocarcinoma
Other           -5.551561 1.323973      # Other vs Adenocarcinoma

Residual Deviance: 508.8902
AIC: 516.8902

Adenosquamous vs Adenocarcinoma model

Estimated log odds of Adenosquamous in comparison to Adenocarcinoma among those who are aged 65-79.

ln\( [\frac {\hat{P}(D = AdenoSq|X_1)} {\hat{P}(D = AdenoCa|X_1)}] = -1.9459 + (0.7809) \)

Other vs Adenocarcinoma model

Estimated log odds of Other in comparison to Adenocarcinoma among those who are aged 65-79.

ln\( [\frac {\hat{P}(D = Other|X_1)} {\hat{P}(D = AdenoCa|X_1)}] = -1.4534 + (0.4256) \)

Odds ratios

exp(coef(multinom.age))
              (Intercept) age65-79
Adenosquamous     0.14286   2.1835
Other             0.23377   1.5306

Adenosquamous vs Adenocarcinoma model

OR of Adenosquamous in comparison to Adenocarcinoma among those aged 65-79 in comparison to those aged 50-64.

\( \hat{OR_1} \) = exp[\( \hat{\beta}_{11} \)] = exp(0.7809) = 2.18

Other vs Adenocarcinoma model

OR of Other in comparison to Adenocarcinoma among those aged 65-79 in comparison to those aged 50-64.

\( \hat{OR_2} \) = exp[\( \hat{\beta}_{21} \)] = exp(0.4256) = 1.53

In this special case where there is only one dichotomous predictor, the odds ratios from the polytomous model are the same as crude odds ratios.

Crude odss ratios

tab.res <- xtabs(~ subtype + age, data = cancer)
library(epiR)
## AdenoCa vs AdenoSq comparison
epi.2by2(tab.res[-3,], units = 1)
             Disease +    Disease -      Total        Inc risk *        Odds
Exposed +           77          109        186             0.414       0.706
Exposed -           11           34         45             0.244       0.324
Total               88          143        231             0.381       0.615

Point estimates and 95 % CIs:
---------------------------------------------------------
Inc risk ratio                         1.69 (0.99, 2.91)
Odds ratio                             2.18 (1, 5.07)
Attrib risk *                          0.17 (0.03, 0.31)
Attrib risk in population *            0.14 (0, 0.28)
Attrib fraction in exposed (%)         40.95 (-1.47, 65.64)
Attrib fraction in population (%)      35.83 (-3.1, 60.07)
---------------------------------------------------------
 * Cases per population unit 
## AdenoCa vs Other comparison
epi.2by2(tab.res[-2,], units = 1)
             Disease +    Disease -      Total        Inc risk *        Odds
Exposed +           77          109        186             0.414       0.706
Exposed -           18           39         57             0.316       0.462
Total               95          148        243             0.391       0.642

Point estimates and 95 % CIs:
---------------------------------------------------------
Inc risk ratio                         1.31 (0.86, 1.99)
Odds ratio                             1.53 (0.78, 3.06)
Attrib risk *                          0.1 (-0.04, 0.24)
Attrib risk in population *            0.08 (-0.06, 0.21)
Attrib fraction in exposed (%)         23.72 (-15.94, 49.81)
Attrib fraction in population (%)      19.22 (-13.44, 42.48)
---------------------------------------------------------
 * Cases per population unit 

Confidence intervals for odds ratios

exp(confint(multinom.age))
, , Adenosquamous

               2.5 % 97.5 %
(Intercept) 0.075951 0.2687
age65-79    1.041954 4.5756

, , Other

              2.5 %  97.5 %
(Intercept) 0.13994 0.39051
age65-79    0.81508 2.87417

Interpretation

For women diagnosed with primary endometrial cancer, older subjects (aged 65 - 79) relative to younger subjects (aged 50 - 64) were more likely to have their tumors categorized as Other than as Adenocarcinoma (\( \hat{OR_2} = 1.53 \)) and were even more likely to have their tumors classified as Adenosquamous than as Adenocarcinoma (\( \hat{OR_1} = 2.18 \)).

Predicted probabilities

In a binary logistic regression model, the predicted probability is the probability of having outcome = 1, in the case of a polytomous logistic regression model, the predicted probabilities are probabilities of having each outcome level (here adenocarcinoma, adenosquamous, and other).

library(effects)
plot(allEffects(multinom.age), style = "stacked")

plot of chunk unnamed-chunk-9

Hypothesis testing: Likelihood ratio test

library(car)
## Likelihood ratio test for significance of betas (two betas for age)
Anova(multinom.age, type = "III")
Analysis of Deviance Table (Type III tests)

Response: subtype
    LR Chisq Df Pr(>Chisq)  
age     5.49  2      0.064 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Deviance
list(null = deviance(multinom.null),
     age  = deviance(multinom.age))
$null
[1] 514.38

$age
[1] 508.89

Hypothesis testing: Wald tests

summary.multinom.age$Wald.ratios[, 2, drop = F]
              age65-79
Adenosquamous 2.068842  # Wald statistic for beta of age in AdenoCa vs AdenoSq
Other         1.323973  # Wald statistic for beta of age in AdenoCa vs Other

For p-values square the Wald statistics and use Chi-square statistic with one degree of freedom.

pchisq(summary.multinom.age$Wald.ratios^2, df = 1, low = F)
                  (Intercept) age65-79
Adenosquamous 0.0000000015698 0.038561
Other         0.0000000283129 0.185512

Multivariable polytomous logistic regression with nnet:multinom()

multinom.estrogen.smoking  <-
    multinom(subtype ~ age + estrogen + smoking, data = cancer)
# weights:  15 (8 variable)
initial  value 314.203115 
iter  10 value 247.380882
final  value 247.202541 
converged

summary.multinom.estrogen.smoking <-
    summary(multinom.estrogen.smoking, Wald = T)

summary.multinom.estrogen.smoking
Call:
multinom(formula = subtype ~ age + estrogen + smoking, data = cancer)

Coefficients:
              (Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous     -1.8822  0.98698          -0.64384               0.88942
Other             -1.2032  0.28225          -0.10705              -1.79133

Std. Errors:
              (Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous     0.40248  0.41179           0.34356               0.52535
Other             0.31897  0.32796           0.30674               1.04645

Value/SE (Wald statistics):
              (Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous     -4.6764  2.39680            -1.874                1.6930
Other             -3.7720  0.86061            -0.349               -1.7118

Residual Deviance: 494.41 
AIC: 510.41 

Odds ratios

exp(coef(multinom.estrogen.smoking))
              (Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous     0.15226   2.6831           0.52527               2.43373
Other             0.30024   1.3261           0.89848               0.16674

Confidence intervals

exp(confint(multinom.estrogen.smoking))
, , Adenosquamous

                         2.5 % 97.5 %
(Intercept)           0.069181 0.3351
age65-79              1.197078 6.0139
estrogenever used     0.267880 1.0300
smokingcurrent smoker 0.869149 6.8148

, , Other

                         2.5 %  97.5 %
(Intercept)           0.160680 0.56102
age65-79              0.697296 2.52196
estrogenever used     0.492509 1.63909
smokingcurrent smoker 0.021443 1.29651

Likelihood ratio test

Anova(multinom.estrogen.smoking, type = "III")
Analysis of Deviance Table (Type III tests)

Response: subtype
         LR Chisq Df Pr(>Chisq)  
age          6.56  2      0.038 *
estrogen     3.59  2      0.166  
smoking      9.01  2      0.011 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Wald statistics

summary.multinom.estrogen.smoking$Wald.ratios
              (Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous     -4.6764  2.39680            -1.874                1.6930
Other             -3.7720  0.86061            -0.349               -1.7118
## p-value
pchisq(summary.multinom.estrogen.smoking$Wald.ratios^2, 1, low = F)
               (Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous 0.0000029192 0.016539          0.060929              0.090452
Other         0.0001619271 0.389454          0.727086              0.086931

Predicted probabilities

## By each variable controlling for others
plot(allEffects(multinom.estrogen.smoking), style = "stacked")

plot of chunk unnamed-chunk-18

## Grouped by all variables
plot(effect("age*estrogen*smoking", multinom.estrogen.smoking), style = "stacked")

plot of chunk unnamed-chunk-18

Interpretation Age is statistially significant for the Adenosquamous vs Adenocarcinoma comparison, but not for the Other vs Adenoarcinoma comparison after adjustment for estrogen use and smoking status. People aged 65-79 (compared to people aged 50-64) were at 2.68 times more odds of having Adenosquamous (compared to Adenocarcinoma).

Fit using VGAM::vglm()

The VGAM::vglm() function can fit multinomial logistic regression as a special case of a vector generalized linear models.

## Load VGAM
library(VGAM)
## Using the first level as the reference level for all comparison
vglm.age.est.smk <- vglm(subtype ~ age + estrogen + smoking, data = cancer,
                         family = multinomial(parallel = FALSE, refLevel = 1))
summary(vglm.age.est.smk)

Call:
vglm(formula = subtype ~ age + estrogen + smoking, family = multinomial(parallel = FALSE, 
    refLevel = 1), data = cancer)

Pearson Residuals:
                      Min     1Q Median     3Q  Max
log(mu[,2]/mu[,1]) -0.981 -0.472 -0.398 -0.180 3.97
log(mu[,3]/mu[,1]) -0.632 -0.592 -0.514 -0.218 5.14

Coefficients:
                        Estimate Std. Error z value
(Intercept):1             -1.882      0.402  -4.676
(Intercept):2             -1.203      0.319  -3.772
age65-79:1                 0.987      0.412   2.397
age65-79:2                 0.282      0.328   0.861
estrogenever used:1       -0.644      0.344  -1.874
estrogenever used:2       -0.107      0.307  -0.349
smokingcurrent smoker:1    0.889      0.525   1.693
smokingcurrent smoker:2   -1.791      1.046  -1.713

Number of linear predictors:  2 

Names of linear predictors: log(mu[,2]/mu[,1]), log(mu[,3]/mu[,1])

Dispersion Parameter for multinomial family:   1

Residual deviance: 494.4 on 564 degrees of freedom

Log-likelihood: -247.2 on 564 degrees of freedom

Number of iterations: 5 

The lines marked “:1” are the ones for the first comparison, in this case level 2 (Adenosquamous) vs level 1 (Adenocarcinoma). The lines marked “:2” are the ones for the second comparison, in this case level 3 (Other) vs level 1 (Adenocarcinoma).