Question

(from Extending the Linear Model with R [ELMR], p.124)

The pneumo data gives the number of coal miners classified by radiological examination into one of three categories of pneumonoconiosis and by the number of years spent working at the coal face divided into eight categories.

Loading of libraries

library(nnet) # for multinom: Fit Multinomial Log-linear Models
library(MASS) # for polr: Ordered Logistic or Probit Regression
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(arm) # ARM: Data Analysis Using Regression and Multilevel/Hierarchical Models
## Loading required package: Matrix
## Loading required package: lme4
## 
## arm (Version 1.12-2, built: 2021-10-15)
## Working directory is /Users/dpong/Data 621/Discussion12

(a) Treating the pneumonoconiosis status as response variable as nominal, build a model for predicting the frequency of the three outcomes in terms of length of service and use it to predict the outcome for a miner with 25 years of service.

data(pneumo, package="faraway")
head(pneumo)

After examining the dataset, I realized that the response variable is Status.

nominal_fit <- multinom(status~year,weights=Freq,data=pneumo, Hess = TRUE)
## # weights:  9 (4 variable)
## initial  value 407.585159 
## iter  10 value 208.724810
## final  value 208.724782 
## converged
summary(nominal_fit)
## Call:
## multinom(formula = status ~ year, data = pneumo, weights = Freq, 
##     Hess = TRUE)
## 
## Coefficients:
##        (Intercept)        year
## normal   4.2916723 -0.08356506
## severe  -0.7681706  0.02572027
## 
## Std. Errors:
##        (Intercept)       year
## normal   0.5214110 0.01528044
## severe   0.7377192 0.01976662
## 
## Residual Deviance: 417.4496 
## AIC: 425.4496
predict.nominal = predict(nominal_fit,newdata=data.frame(year=25),type = "probs")
predict.nominal
##       mild     normal     severe 
## 0.09148821 0.82778696 0.08072483

The nominal model predicts that the status of a miner with 25 years of service will most likely be normal, with a probability of 82.8%.

(b) Repeat the analysis with the pneumonoconiosis status being treated as ordinal.

ordinal_fit <- polr(status~year,weights=Freq,data=pneumo,Hess=TRUE) 
summary(ordinal_fit)
## Call:
## polr(formula = status ~ year, data = pneumo, weights = Freq, 
##     Hess = TRUE)
## 
## Coefficients:
##        Value Std. Error t value
## year 0.01566   0.009057    1.73
## 
## Intercepts:
##               Value   Std. Error t value
## mild|normal   -1.8449  0.2492    -7.4039
## normal|severe  2.3676  0.2709     8.7411
## 
## Residual Deviance: 502.1551 
## AIC: 508.1551
predict(ordinal_fit,newdata=data.frame(year=25),type = "probs")
##       mild     normal     severe 
## 0.09652357 0.78172799 0.12174844

So with the ordinal model, normal status is the predicted outcome for a miner with 25 years of service, albeit the probability is slightly lower at 78.2% than the nominal model seen in part a.

(c) Now treat the response variable as hierarchical with top level indicating whether the miner has the disease and the second level indicating, given they have the disease, whether they have a moderate or severe case.

data_normal<-pneumo[pneumo$status=="normal",]
data_mild<-pneumo[pneumo$status=="mild",]
data_severe<-pneumo[pneumo$status=="severe",]
data_abnormal<-rbind(data_mild,data_severe)
# str(data_abnormal)
# Making a copy of data_abnormal for use with the 2nd binomial model down the line
data_abnormal.2 <- copy(data_abnormal)

data_abnormal$status<-rep("abnormal",dim(data_abnormal)[1])

data_hier <- rbind(data_normal, data_abnormal)
data_hier$status <- factor(data_hier$status)
str(data_hier)
## 'data.frame':    24 obs. of  3 variables:
##  $ Freq  : num  98 51 34 35 32 23 12 4 0 2 ...
##  $ status: Factor w/ 2 levels "normal","abnormal": 1 1 1 1 1 1 1 1 2 2 ...
##  $ year  : num  5.8 15 21.5 27.5 33.5 39.5 46 51.5 5.8 15 ...
data_abnormal.2$status <- factor(data_abnormal.2$status)
str(data_abnormal.2)
## 'data.frame':    16 obs. of  3 variables:
##  $ Freq  : num  0 2 6 5 10 7 6 2 0 1 ...
##  $ status: Factor w/ 2 levels "mild","severe": 1 1 1 1 1 1 1 1 2 2 ...
##  $ year  : num  5.8 15 21.5 27.5 33.5 39.5 46 51.5 5.8 15 ...
binomial_fit = glm( status ~ year, data = data_hier, 
                   family = binomial(link = "logit"), weights = Freq)
binomial_abnormal_fit = glm( status ~ year, data = data_abnormal.2, 
                     family = binomial(link = "logit"), weights = Freq)


# summary(binomial_abnormal_fit)
top_level_pred <- predict(binomial_fit,newdata=data.frame(year=25),type="response")
top_level_pred
##        1 
## 0.173701
second_level_pred <- predict(binomial_abnormal_fit,newdata=data.frame(year=25),type="response")
second_level_pred
##         1 
## 0.4435842

So the interpretation is as follows,

The predicted probability of a miner with 25 years of service for being abnormal is 17.3% whereas the predicted probability of the same miner for being normal is \(1- top\_level\_pred = 82.7\%\)

The predicted probability of a miner with 25 years of service, given he’s a disease, for having a severe status is \(second\_level\_pred * top\_level\_pred = 7.7\%\)

The predicted probability of a miner with 25 years of service, given he’s a disease, for having a mild status is \((1-second\_level\_pred) * top\_level\_pred = 9.7\%\)

(d) Compare the three analyses

#residual plot1
pneumo_new<-reshape2::dcast(pneumo, year ~ status, value.var = "Freq")
pneumo_new<-pneumo_new %>%
  mutate(total=apply(pneumo_new[,2:4],1,sum))
pneumo_new[,2:4]<-round(pneumo_new[,2:4]/pneumo_new[,"total"],2)
pred1<-predict(nominal_fit,newdata=pneumo_new,type="p")
resid1<-pneumo_new[,2:4]-pred1
par(mfrow=c(1,3))
for(i in 1:3)
  binnedplot(pred1[,i],resid1[,i], main="Nominal Fit")

#residual plot2
pred2<-predict(ordinal_fit,newdata=pneumo_new,type="p")
resid2<-pneumo_new[,2:4]-pred2
par(mfrow=c(1,3))
for(i in 1:3)
  binnedplot(pred2[,i],resid2[,i], main="Ordinal Fit")

#residual plot3
p_abnormal<-predict(binomial_fit,pneumo_new,type="r")
p_normal<-1-p_abnormal
p_severe<-p_abnormal*predict(binomial_abnormal_fit,pneumo_new,type="r")
p_mild<-p_abnormal*(1-predict(binomial_abnormal_fit,pneumo_new,type="r"))
pred3<-cbind(p_mild,p_normal,p_severe)
resid3<-pneumo_new[,2:4]-pred3
par(mfrow=c(1,3))
for(i in 1:3)
  binnedplot(pred3[,i],resid3[,i], main="Hierarchical Fit")

Observations: The nominal fit and the hierarchical fit have some similar binned residual plots for all 3 statuses of outcome. The ordinal fit appear to have more extreme residuals in all 3 statuses of outcome as you can tell the residuals don’t fall within the boundaries of the CIs. Thus analysis 2, namely, the ordinal fit, is deemed as the one that we should not opt for. And I’ll be indifferent to analysis 1 & 3, namely the nominal fit & hierarchical fit.