## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE,
echo = T, fig.width = 5, fig.height = 5)
options(width = 125, scipen = 5, digits = 5)
setwd("~/statistics/self.logistic/")
Nominal variables simply indicate different categories. An example used here is histological subtype of cancer.
One of the categories of the outcome variable is designated as the reference category, and each of the other levels is compared with this reference level. In this example the outcome level Adenocarcinoma is chosen as the reference. So the comparison will be between Adenosquamous vs Adenocarcinoma and between Other vs Adenocarcinoma.
cancer.dta file availabe at http://www.sph.emory.edu/~dkleinb/logreg2.htm.
library(foreign)
cancer <- read.dta("http://www.sph.emory.edu/~dkleinb/datasets/cancer.dta")
cancer <- within(cancer, {
grade <- factor(grade, 0:2, c("well differentiated","moderately differentiated","poorly differentiated"),
ordered = TRUE)
race <- factor(race, 0:1, c("white","black"))
estrogen <- factor(estrogen, 0:1, c("never used","ever used"))
subtype <- factor(subtype, 0:2, c("Adenocarcinoma","Adenosquamous","Other"))
age <- factor(age, 0:1, c("50-64","65-79"))
smoking <- factor(smoking, 0:1, c("non-smoker","current smoker"))
})
Subtype is the polytomous outcome variable.
summary(cancer)
id grade race estrogen subtype age
Min. : 10009 well differentiated :130 white:207 never used:135 Adenocarcinoma:186 50-64:106
1st Qu.: 11590 moderately differentiated:105 black: 81 ever used :151 Adenosquamous : 45 65-79:182
Median : 21424 poorly differentiated : 53 NA's : 2 Other : 57
Mean : 25752
3rd Qu.: 30820
Max. :301017
smoking
non-smoker :260
current smoker: 28
Modeling
One line of code for each model specification, but actually two models are constructed. The one that compares the reference outcome level Adenocarcinoma to Adenosquamous, and the other that compares the refence outcome level Adenocarcinoma to Others.
library(nnet)
multinom.age <- multinom(subtype ~ age, data = cancer)
# weights: 9 (4 variable)
initial value 316.400339
iter 10 value 254.447180
final value 254.445123
converged
multinom.null <- multinom(subtype ~ 1, data = cancer)
# weights: 6 (2 variable)
initial value 316.400339
iter 10 value 257.190003
iter 10 value 257.190003
iter 10 value 257.190003
final value 257.190003
converged
Results (annotated)
(summary.multinom.age <- summary(multinom.age, Wald = TRUE))
Call:
multinom(formula = subtype ~ age, data = cancer)
Coefficients: # coefficients
(Intercept) age65-79
Adenosquamous -1.945909 0.7809218 # Adenosquamous vs Adenocarcinoma
Other -1.453434 0.4256484 # Other vs Adenocarcinoma
Std. Errors: # SE
(Intercept) age65-79
Adenosquamous 0.3223291 0.3774680 # Adenosquamous vs Adenocarcinoma
Other 0.2618064 0.3214933 # Other vs Adenocarcinoma
Value/SE (Wald statistics): # Wald statistics
(Intercept) age65-79
Adenosquamous -6.037027 2.068842 # Adenosquamous vs Adenocarcinoma
Other -5.551561 1.323973 # Other vs Adenocarcinoma
Residual Deviance: 508.8902
AIC: 516.8902
Adenosquamous vs Adenocarcinoma model
Estimated log odds of Adenosquamous in comparison to Adenocarcinoma among those who are aged 65-79.
ln\( [\frac {\hat{P}(D = AdenoSq|X_1)} {\hat{P}(D = AdenoCa|X_1)}] = -1.9459 + (0.7809) \)
Other vs Adenocarcinoma model
Estimated log odds of Other in comparison to Adenocarcinoma among those who are aged 65-79.
ln\( [\frac {\hat{P}(D = Other|X_1)} {\hat{P}(D = AdenoCa|X_1)}] = -1.4534 + (0.4256) \)
Odds ratios
exp(coef(multinom.age))
(Intercept) age65-79
Adenosquamous 0.14286 2.1835
Other 0.23377 1.5306
Adenosquamous vs Adenocarcinoma model
OR of Adenosquamous in comparison to Adenocarcinoma among those aged 65-79 in comparison to those aged 50-64.
\( \hat{OR_1} \) = exp[\( \hat{\beta}_{11} \)] = exp(0.7809) = 2.18
Other vs Adenocarcinoma model
OR of Other in comparison to Adenocarcinoma among those aged 65-79 in comparison to those aged 50-64.
\( \hat{OR_2} \) = exp[\( \hat{\beta}_{21} \)] = exp(0.4256) = 1.53
In this special case where there is only one dichotomous predictor, the odds ratios from the polytomous model are the same as crude odds ratios.
Crude odss ratios
tab.res <- xtabs(~ subtype + age, data = cancer)
library(epiR)
## AdenoCa vs AdenoSq comparison
epi.2by2(tab.res[-3,], units = 1)
Disease + Disease - Total Inc risk * Odds
Exposed + 77 109 186 0.414 0.706
Exposed - 11 34 45 0.244 0.324
Total 88 143 231 0.381 0.615
Point estimates and 95 % CIs:
---------------------------------------------------------
Inc risk ratio 1.69 (0.99, 2.91)
Odds ratio 2.18 (1, 5.07)
Attrib risk * 0.17 (0.03, 0.31)
Attrib risk in population * 0.14 (0, 0.28)
Attrib fraction in exposed (%) 40.95 (-1.47, 65.64)
Attrib fraction in population (%) 35.83 (-3.1, 60.07)
---------------------------------------------------------
* Cases per population unit
## AdenoCa vs Other comparison
epi.2by2(tab.res[-2,], units = 1)
Disease + Disease - Total Inc risk * Odds
Exposed + 77 109 186 0.414 0.706
Exposed - 18 39 57 0.316 0.462
Total 95 148 243 0.391 0.642
Point estimates and 95 % CIs:
---------------------------------------------------------
Inc risk ratio 1.31 (0.86, 1.99)
Odds ratio 1.53 (0.78, 3.06)
Attrib risk * 0.1 (-0.04, 0.24)
Attrib risk in population * 0.08 (-0.06, 0.21)
Attrib fraction in exposed (%) 23.72 (-15.94, 49.81)
Attrib fraction in population (%) 19.22 (-13.44, 42.48)
---------------------------------------------------------
* Cases per population unit
Confidence intervals for odds ratios
exp(confint(multinom.age))
, , Adenosquamous
2.5 % 97.5 %
(Intercept) 0.075951 0.2687
age65-79 1.041954 4.5756
, , Other
2.5 % 97.5 %
(Intercept) 0.13994 0.39051
age65-79 0.81508 2.87417
Interpretation
For women diagnosed with primary endometrial cancer, older subjects (aged 65 - 79) relative to younger subjects (aged 50 - 64) were more likely to have their tumors categorized as Other than as Adenocarcinoma (\( \hat{OR_2} = 1.53 \)) and were even more likely to have their tumors classified as Adenosquamous than as Adenocarcinoma (\( \hat{OR_1} = 2.18 \)).
Predicted probabilities
In a binary logistic regression model, the predicted probability is the probability of having outcome = 1, in the case of a polytomous logistic regression model, the predicted probabilities are probabilities of having each outcome level (here adenocarcinoma, adenosquamous, and other).
library(effects)
plot(allEffects(multinom.age), style = "stacked")
Hypothesis testing: Likelihood ratio test
library(car)
## Likelihood ratio test for significance of betas (two betas for age)
Anova(multinom.age, type = "III")
Analysis of Deviance Table (Type III tests)
Response: subtype
LR Chisq Df Pr(>Chisq)
age 5.49 2 0.064 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Deviance
list(null = deviance(multinom.null),
age = deviance(multinom.age))
$null
[1] 514.38
$age
[1] 508.89
Hypothesis testing: Wald tests
summary.multinom.age$Wald.ratios[, 2, drop = F]
age65-79
Adenosquamous 2.068842 # Wald statistic for beta of age in AdenoCa vs AdenoSq
Other 1.323973 # Wald statistic for beta of age in AdenoCa vs Other
For p-values square the Wald statistics and use Chi-square statistic with one degree of freedom.
pchisq(summary.multinom.age$Wald.ratios^2, df = 1, low = F)
(Intercept) age65-79
Adenosquamous 0.0000000015698 0.038561
Other 0.0000000283129 0.185512
multinom.estrogen.smoking <-
multinom(subtype ~ age + estrogen + smoking, data = cancer)
# weights: 15 (8 variable)
initial value 314.203115
iter 10 value 247.380882
final value 247.202541
converged
summary.multinom.estrogen.smoking <-
summary(multinom.estrogen.smoking, Wald = T)
summary.multinom.estrogen.smoking
Call:
multinom(formula = subtype ~ age + estrogen + smoking, data = cancer)
Coefficients:
(Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous -1.8822 0.98698 -0.64384 0.88942
Other -1.2032 0.28225 -0.10705 -1.79133
Std. Errors:
(Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous 0.40248 0.41179 0.34356 0.52535
Other 0.31897 0.32796 0.30674 1.04645
Value/SE (Wald statistics):
(Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous -4.6764 2.39680 -1.874 1.6930
Other -3.7720 0.86061 -0.349 -1.7118
Residual Deviance: 494.41
AIC: 510.41
Odds ratios
exp(coef(multinom.estrogen.smoking))
(Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous 0.15226 2.6831 0.52527 2.43373
Other 0.30024 1.3261 0.89848 0.16674
Confidence intervals
exp(confint(multinom.estrogen.smoking))
, , Adenosquamous
2.5 % 97.5 %
(Intercept) 0.069181 0.3351
age65-79 1.197078 6.0139
estrogenever used 0.267880 1.0300
smokingcurrent smoker 0.869149 6.8148
, , Other
2.5 % 97.5 %
(Intercept) 0.160680 0.56102
age65-79 0.697296 2.52196
estrogenever used 0.492509 1.63909
smokingcurrent smoker 0.021443 1.29651
Likelihood ratio test
Anova(multinom.estrogen.smoking, type = "III")
Analysis of Deviance Table (Type III tests)
Response: subtype
LR Chisq Df Pr(>Chisq)
age 6.56 2 0.038 *
estrogen 3.59 2 0.166
smoking 9.01 2 0.011 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Wald statistics
summary.multinom.estrogen.smoking$Wald.ratios
(Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous -4.6764 2.39680 -1.874 1.6930
Other -3.7720 0.86061 -0.349 -1.7118
## p-value
pchisq(summary.multinom.estrogen.smoking$Wald.ratios^2, 1, low = F)
(Intercept) age65-79 estrogenever used smokingcurrent smoker
Adenosquamous 0.0000029192 0.016539 0.060929 0.090452
Other 0.0001619271 0.389454 0.727086 0.086931
Predicted probabilities
## By each variable controlling for others
plot(allEffects(multinom.estrogen.smoking), style = "stacked")
## Grouped by all variables
plot(effect("age*estrogen*smoking", multinom.estrogen.smoking), style = "stacked")
Interpretation Age is statistially significant for the Adenosquamous vs Adenocarcinoma comparison, but not for the Other vs Adenoarcinoma comparison after adjustment for estrogen use and smoking status. People aged 65-79 (compared to people aged 50-64) were at 2.68 times more odds of having Adenosquamous (compared to Adenocarcinoma).
The VGAM::vglm() function can fit multinomial logistic regression as a special case of a vector generalized linear models.
## Load VGAM
library(VGAM)
## Using the first level as the reference level for all comparison
vglm.age.est.smk <- vglm(subtype ~ age + estrogen + smoking, data = cancer,
family = multinomial(parallel = FALSE, refLevel = 1))
summary(vglm.age.est.smk)
Call:
vglm(formula = subtype ~ age + estrogen + smoking, family = multinomial(parallel = FALSE,
refLevel = 1), data = cancer)
Pearson Residuals:
Min 1Q Median 3Q Max
log(mu[,2]/mu[,1]) -0.981 -0.472 -0.398 -0.180 3.97
log(mu[,3]/mu[,1]) -0.632 -0.592 -0.514 -0.218 5.14
Coefficients:
Estimate Std. Error z value
(Intercept):1 -1.882 0.402 -4.676
(Intercept):2 -1.203 0.319 -3.772
age65-79:1 0.987 0.412 2.397
age65-79:2 0.282 0.328 0.861
estrogenever used:1 -0.644 0.344 -1.874
estrogenever used:2 -0.107 0.307 -0.349
smokingcurrent smoker:1 0.889 0.525 1.693
smokingcurrent smoker:2 -1.791 1.046 -1.713
Number of linear predictors: 2
Names of linear predictors: log(mu[,2]/mu[,1]), log(mu[,3]/mu[,1])
Dispersion Parameter for multinomial family: 1
Residual deviance: 494.4 on 564 degrees of freedom
Log-likelihood: -247.2 on 564 degrees of freedom
Number of iterations: 5
The lines marked “:1” are the ones for the first comparison, in this case level 2 (Adenosquamous) vs level 1 (Adenocarcinoma). The lines marked “:2” are the ones for the second comparison, in this case level 3 (Other) vs level 1 (Adenocarcinoma).