(from Extending the Linear Model with R [ELMR], p.124)
The pneumo data gives the number of coal miners classified by radiological examination into one of three categories of pneumonoconiosis and by the number of years spent working at the coal face divided into eight categories.
library(nnet) # for multinom: Fit Multinomial Log-linear Models
library(MASS) # for polr: Ordered Logistic or Probit Regression
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(arm) # ARM: Data Analysis Using Regression and Multilevel/Hierarchical Models
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.12-2, built: 2021-10-15)
## Working directory is /Users/dpong/Data 621/Discussion12
data(pneumo, package="faraway")
head(pneumo)
After examining the dataset, I realized that the response variable is Status.
nominal_fit <- multinom(status~year,weights=Freq,data=pneumo, Hess = TRUE)
## # weights: 9 (4 variable)
## initial value 407.585159
## iter 10 value 208.724810
## final value 208.724782
## converged
summary(nominal_fit)
## Call:
## multinom(formula = status ~ year, data = pneumo, weights = Freq,
## Hess = TRUE)
##
## Coefficients:
## (Intercept) year
## normal 4.2916723 -0.08356506
## severe -0.7681706 0.02572027
##
## Std. Errors:
## (Intercept) year
## normal 0.5214110 0.01528044
## severe 0.7377192 0.01976662
##
## Residual Deviance: 417.4496
## AIC: 425.4496
predict.nominal = predict(nominal_fit,newdata=data.frame(year=25),type = "probs")
predict.nominal
## mild normal severe
## 0.09148821 0.82778696 0.08072483
The nominal model predicts that the status of a miner with 25 years of service will most likely be normal, with a probability of 82.8%.
ordinal_fit <- polr(status~year,weights=Freq,data=pneumo,Hess=TRUE)
summary(ordinal_fit)
## Call:
## polr(formula = status ~ year, data = pneumo, weights = Freq,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## year 0.01566 0.009057 1.73
##
## Intercepts:
## Value Std. Error t value
## mild|normal -1.8449 0.2492 -7.4039
## normal|severe 2.3676 0.2709 8.7411
##
## Residual Deviance: 502.1551
## AIC: 508.1551
predict(ordinal_fit,newdata=data.frame(year=25),type = "probs")
## mild normal severe
## 0.09652357 0.78172799 0.12174844
So with the ordinal model, normal status is the predicted outcome for a miner with 25 years of service, albeit the probability is slightly lower at 78.2% than the nominal model seen in part a.
data_normal<-pneumo[pneumo$status=="normal",]
data_mild<-pneumo[pneumo$status=="mild",]
data_severe<-pneumo[pneumo$status=="severe",]
data_abnormal<-rbind(data_mild,data_severe)
# str(data_abnormal)
# Making a copy of data_abnormal for use with the 2nd binomial model down the line
data_abnormal.2 <- copy(data_abnormal)
data_abnormal$status<-rep("abnormal",dim(data_abnormal)[1])
data_hier <- rbind(data_normal, data_abnormal)
data_hier$status <- factor(data_hier$status)
str(data_hier)
## 'data.frame': 24 obs. of 3 variables:
## $ Freq : num 98 51 34 35 32 23 12 4 0 2 ...
## $ status: Factor w/ 2 levels "normal","abnormal": 1 1 1 1 1 1 1 1 2 2 ...
## $ year : num 5.8 15 21.5 27.5 33.5 39.5 46 51.5 5.8 15 ...
data_abnormal.2$status <- factor(data_abnormal.2$status)
str(data_abnormal.2)
## 'data.frame': 16 obs. of 3 variables:
## $ Freq : num 0 2 6 5 10 7 6 2 0 1 ...
## $ status: Factor w/ 2 levels "mild","severe": 1 1 1 1 1 1 1 1 2 2 ...
## $ year : num 5.8 15 21.5 27.5 33.5 39.5 46 51.5 5.8 15 ...
binomial_fit = glm( status ~ year, data = data_hier,
family = binomial(link = "logit"), weights = Freq)
binomial_abnormal_fit = glm( status ~ year, data = data_abnormal.2,
family = binomial(link = "logit"), weights = Freq)
# summary(binomial_abnormal_fit)
top_level_pred <- predict(binomial_fit,newdata=data.frame(year=25),type="response")
top_level_pred
## 1
## 0.173701
second_level_pred <- predict(binomial_abnormal_fit,newdata=data.frame(year=25),type="response")
second_level_pred
## 1
## 0.4435842
So the interpretation is as follows,
The predicted probability of a miner with 25 years of service for being abnormal is 17.3% whereas the predicted probability of the same miner for being normal is \(1- top\_level\_pred = 82.7\%\)
The predicted probability of a miner with 25 years of service, given he’s a disease, for having a severe status is \(second\_level\_pred * top\_level\_pred = 7.7\%\)
The predicted probability of a miner with 25 years of service, given he’s a disease, for having a mild status is \((1-second\_level\_pred) * top\_level\_pred = 9.7\%\)
#residual plot1
pneumo_new<-reshape2::dcast(pneumo, year ~ status, value.var = "Freq")
pneumo_new<-pneumo_new %>%
mutate(total=apply(pneumo_new[,2:4],1,sum))
pneumo_new[,2:4]<-round(pneumo_new[,2:4]/pneumo_new[,"total"],2)
pred1<-predict(nominal_fit,newdata=pneumo_new,type="p")
resid1<-pneumo_new[,2:4]-pred1
par(mfrow=c(1,3))
for(i in 1:3)
binnedplot(pred1[,i],resid1[,i], main="Nominal Fit")
#residual plot2
pred2<-predict(ordinal_fit,newdata=pneumo_new,type="p")
resid2<-pneumo_new[,2:4]-pred2
par(mfrow=c(1,3))
for(i in 1:3)
binnedplot(pred2[,i],resid2[,i], main="Ordinal Fit")
#residual plot3
p_abnormal<-predict(binomial_fit,pneumo_new,type="r")
p_normal<-1-p_abnormal
p_severe<-p_abnormal*predict(binomial_abnormal_fit,pneumo_new,type="r")
p_mild<-p_abnormal*(1-predict(binomial_abnormal_fit,pneumo_new,type="r"))
pred3<-cbind(p_mild,p_normal,p_severe)
resid3<-pneumo_new[,2:4]-pred3
par(mfrow=c(1,3))
for(i in 1:3)
binnedplot(pred3[,i],resid3[,i], main="Hierarchical Fit")
Observations: The nominal fit and the hierarchical fit have some similar binned residual plots for all 3 statuses of outcome. The ordinal fit appear to have more extreme residuals in all 3 statuses of outcome as you can tell the residuals don’t fall within the boundaries of the CIs. Thus analysis 2, namely, the ordinal fit, is deemed as the one that we should not opt for. And I’ll be indifferent to analysis 1 & 3, namely the nominal fit & hierarchical fit.