2.3 Extending the Linear Model with R

A study was conducted on children who had corrective spinal surgery. We are interested in factors that might result in kyphosis (a type of spinal deformity) after surgery.

2.3.a

  1. Make plots of the response as it relates to each of the three predictors. You may find a jittered scatterplot more effective than the interleaved histogram for a dataset of this size. Comment on how the predictors appear to be related to the response.
data(kyphosis, package = "rpart")
library(ggplot2)
library(faraway)
#2.3.a
# Convert the outcome to a 0/1 indicator (1 = kyphosis present)
kyphosis$test <- ifelse(kyphosis$Kyphosis == "absent", 0, 1)
ggplot(kyphosis, aes(x = Age, y = test)) + geom_jitter(width = 0.1, height = 0.1)

ggplot(kyphosis, aes(x = Number, y = test)) + geom_jitter(width = 0.1, height = 0.1)

ggplot(kyphosis, aes(x = Start, y = test)) + geom_jitter(width = 0.1, height = 0.1)

Commentary: Age does not appear strongly related to the response, although higher ages look slightly more likely to be followed by kyphosis. Higher values of Number appear to increase the likelihood of kyphosis, while lower values of Start appear to increase it.
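As a quick numeric check on these impressions, here is a sketch comparing the mean of each predictor across the two outcome groups; it points the same way as the plots.

# Sketch: group means of each predictor by outcome, a rough numeric
# check on the jittered plots above.
aggregate(cbind(Age, Number, Start) ~ Kyphosis, data = kyphosis, mean)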

2.3.b

Fit a GLM with the kyphosis indicator as the response and the other three variables as predictors. Plot the deviance residuals against the fitted values. What can be concluded from this plot?

kData <- kyphosis[, 2:5]   # Age, Number, Start, and the 0/1 indicator
kMod <- glm(test ~ ., data = kData, family = "binomial")
summary(kMod)
## 
## Call:
## glm(formula = test ~ ., family = "binomial", data = kData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3124  -0.5484  -0.3632  -0.1659   2.1613  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -2.036934   1.449575  -1.405  0.15996   
## Age          0.010930   0.006446   1.696  0.08996 . 
## Number       0.410601   0.224861   1.826  0.06785 . 
## Start       -0.206510   0.067699  -3.050  0.00229 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83.234  on 80  degrees of freedom
## Residual deviance: 61.380  on 77  degrees of freedom
## AIC: 69.38
## 
## Number of Fisher Scoring iterations: 5
sumary(kMod)   # faraway's compact version of summary()
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0369335  1.4495745 -1.4052 0.159964
## Age          0.0109305  0.0064463  1.6956 0.089955
## Number       0.4106012  0.2248608  1.8260 0.067847
## Start       -0.2065100  0.0676989 -3.0504 0.002285
## 
## n = 81 p = 4
## Deviance = 61.37993 Null Deviance = 83.23447 (Difference = 21.85455)
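The sumary output above reports the drop in deviance from the null model (Difference = 21.85455 on 3 degrees of freedom). A quick sketch comparing that drop to a chi-squared distribution tests whether the three predictors are jointly significant:

# Sketch: likelihood-ratio test that Age, Number and Start matter jointly,
# using the null and residual deviances reported above.
pchisq(kMod$null.deviance - deviance(kMod), df = 3, lower.tail = FALSE)  # p is roughly 7e-05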
linPred <- predict(kMod)    # linear predictor
devRes <- residuals(kMod)   # deviance residuals (the default type)

plot(devRes ~ linPred, xlab = "Linear predictor", ylab = "Deviance residuals")

Because the response is binary, the residuals fall on two distinct curves, one traced by the observations with kyphosis absent and one by those with it present, so this plot conveys minimal information. That is why in 2.3.c we bin the residuals and plot the bin means instead.
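To see why, a small sketch: coloring the same points by observed outcome shows that each outcome group traces its own smooth curve.

# Sketch: the same residuals, colored by observed outcome; the y = 0 and
# y = 1 cases each form their own curve, leaving little to diagnose.
plot(devRes ~ linPred, col = ifelse(kData$test == 1, "red", "blue"),
     xlab = "Linear predictor", ylab = "Deviance residuals")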

2.3.c

Produce a binned residual plot as described in the text. You will need to select an appropriate amount of binning. Comment on the plot.

library(dplyr)
kData <- mutate(kData, Residuals=residuals(kMod), linPred=predict(kMod))

# include.lowest = TRUE keeps the minimum observation in the first bin,
# avoiding an implicit-NA group
gDf <- group_by(kData, cut(linPred, breaks = unique(quantile(linPred, (0:12)/12)),
                           include.lowest = TRUE))
diagDf <- summarise(gDf, Residuals = mean(Residuals), linPred=mean(linPred))

plot(Residuals ~ linPred, data = diagDf, xlab = "Linear Predictor")

We requested 12 bins. Binning and plotting the bin means gives a much clearer picture, and the residuals look fairly random, which is a good sign for our model, although the first few bins (before -2 on the x-axis) show something of a downward pattern.
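As a cross-check, assuming the arm package is installed (it is not used elsewhere in this write-up), arm::binnedplot draws the same kind of diagnostic with plus/minus 2 standard-error bounds chosen automatically.

# Sketch: alternative binned-residual plot, response-scale residuals
# against fitted probabilities.
library(arm)
binnedplot(fitted(kMod), residuals(kMod, type = "response"))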

2.3.d

Plot the residuals against the Start predictor, using binning as appropriate. Comment on the plot.

gDf <- group_by(kData, Start)
diagDf <- summarise(gDf, Residuals=mean(Residuals))
ggplot(diagDf, aes(x = Start, y = Residuals)) + geom_point()

This plot looks fairly random, which is what we want, although the residuals above zero tend to be larger in magnitude than those below.
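Start takes only a limited set of values and some carry very few observations, so a sketch worth adding sizes each point by its count to avoid over-reading the sparse values (diagDf2 is just an illustrative name):

# Sketch: same plot, point size proportional to the number of
# observations at each Start value.
diagDf2 <- summarise(gDf, Residuals = mean(Residuals), count = n())
ggplot(diagDf2, aes(x = Start, y = Residuals, size = count)) + geom_point()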

2.3.e

Produce a normal QQ plot for the residuals. Interpret the plot.

qqnorm(residuals(kMod))

The QQ plot is not linear, but as noted in the ELM text we should not expect it to be: deviance residuals from a binary response are not expected to be normally distributed, so the curvature is not itself a cause for concern.
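A small simulation sketch backs this up: generating a response exactly according to the fitted model and refitting produces the same curved QQ shape, so the pattern above is expected even when the model is correct (the seed and object names here are illustrative).

# Sketch: QQ plot of deviance residuals when the model holds by construction.
set.seed(123)
simY <- rbinom(nrow(kData), 1, fitted(kMod))
simMod <- glm(simY ~ Age + Number + Start, data = kData, family = "binomial")
qqnorm(residuals(simMod))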

2.3.f

Make a plot of the leverages. Interpret the plot.

halfnorm(hatvalues(kMod))

There are no strong outliers here. Points 53 and 24 have been labeled automatically (halfnorm labels the two most extreme values by default), but they follow the line reasonably well, so they are not unusually high-leverage cases and are unlikely to be influential points.
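To confirm, we can inspect the two labeled cases directly (the row numbers 24 and 53 come from the halfnorm labels above):

# Sketch: look at the two highest-leverage cases flagged by halfnorm.
kData[c(24, 53), c("Age", "Number", "Start", "test")]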

2.3.g

Check the goodness of fit for this model. Create a plot like Figure 2.9. Compute the Hosmer-Lemeshow statistic and associated p-value. What do you conclude?

linPred <- predict(kMod)
kDataM <- mutate(kData, predProb = predict(kMod, type = "response"))
gDf <- group_by(kDataM, cut(linPred, breaks = unique(quantile(linPred, (0:12)/12)),
                            include.lowest = TRUE))
hlDf <- summarise(gDf, y = sum(test), pPred = mean(predProb), count = n())

# Binomial standard error of each bin's observed proportion
hlDf <- mutate(hlDf, se.fit = sqrt(pPred * (1 - pPred) / count))


ggplot(hlDf, aes(x = pPred, y = y/count,
                 ymin = y/count - 2*se.fit, ymax = y/count + 2*se.fit)) +
     geom_point() + geom_linerange(color = grey(0.75)) +
     geom_abline(intercept = 0, slope = 1) +
     xlab("Predicted Probability") +
     ylab("Observed Proportion")

hlStat <- with(hlDf, sum((y - count*pPred)^2 / (count*pPred*(1 - pPred))))

hlStat
## [1] 9.887868
1 - pchisq(hlStat, nrow(hlDf) - 1)
## [1] 0.6257973

The p-value is large, so there is no evidence of lack of fit; by this diagnostic, the model is adequate. The plot shows approximate 95% confidence intervals for each bin's observed proportion, and the y = x line passes through all of them.
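As a cross-check, assuming the ResourceSelection package is installed, its hoslem.test function computes a Hosmer-Lemeshow statistic directly. It uses its own equal-size grouping, so the statistic will differ somewhat from the hand computation above, but the conclusion should be the same.

# Sketch: packaged Hosmer-Lemeshow test, observed 0/1 outcomes vs
# fitted probabilities, with a comparable number of groups.
library(ResourceSelection)
hoslem.test(kData$test, fitted(kMod), g = 12)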

2.3.h

Use the model to classify the subjects into predicted outcomes using a 0.5 cutoff. Produce a cross-tabulation of these predicted outcomes with the actual outcomes. When kyphosis is actually present, what is the probability that this model would predict a present outcome? What is the name for this characteristic of the test?

kDataM <- mutate(kDataM, predout=ifelse(predProb < 0.5, "no", "yes"))
xtabs( ~ test + predout, kDataM)
##     predout
## test no yes
##    0 61   3
##    1 10   7
(61 + 7)/(61+7+3+10)
## [1] 0.8395062
7/(10+7)
## [1] 0.4117647

The model is about 84% accurate overall.

When kyphosis is actually present, the model predicts a present outcome with probability 7/17, about 41%. This characteristic of the test is called its sensitivity (the true positive rate).
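For completeness, here is a sketch computing the usual classification summaries directly from the table (the object name confMat is just illustrative):

# Sketch: accuracy, sensitivity and specificity from the cross-tabulation.
confMat <- xtabs(~ test + predout, kDataM)
sum(diag(confMat)) / sum(confMat)           # accuracy, about 0.84
confMat["1", "yes"] / sum(confMat["1", ])   # sensitivity, about 0.41
confMat["0", "no"]  / sum(confMat["0", ])   # specificity, 61/64, about 0.95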