A study was conducted on children who had corrective spinal surgery. We are interested in factors that might result in kyphosis (a type of spinal deformity) after surgery.
data(kyphosis, package = "rpart")
library(ggplot2)
library(faraway)
#2.3.a
kyphosis$test <- ifelse(kyphosis$Kyphosis == "absent", 0, 1)  # 0/1 indicator: 1 = kyphosis present
# Jitter the 0/1 response so overlapping points are visible
# (geom_jitter already draws the points, so a separate geom_point layer is redundant)
ggplot(kyphosis, aes(x = Age, y = test)) + geom_jitter(width = 0.1, height = 0.1)
ggplot(kyphosis, aes(x = Number, y = test)) + geom_jitter(width = 0.1, height = 0.1)
ggplot(kyphosis, aes(x = Start, y = test)) + geom_jitter(width = 0.1, height = 0.1)
Commentary: The Age predictor suggests that higher ages increase the likelihood of kyphosis, though the relationship does not look strong. The Number predictor suggests that a larger number of vertebrae involved increases the likelihood, while the Start predictor suggests that lower starting vertebrae increase the likelihood.
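These visual impressions can be checked numerically by comparing the mean of each predictor across the two outcome groups; a minimal sketch in base R:
# Mean of Age, Number and Start within the absent/present groups
aggregate(cbind(Age, Number, Start) ~ Kyphosis, data = kyphosis, FUN = mean)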
Fit a GLM with the kyphosis indicator as the response and the other three variables as predictors. Plot the deviance residuals against the fitted values. What can be concluded from this plot?
kData <- kyphosis[, 2:5]  # keep Age, Number, Start and the 0/1 indicator test
kMod <- glm(test ~ ., data = kData, family = "binomial")
summary(kMod)
##
## Call:
## glm(formula = test ~ ., family = "binomial", data = kData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3124 -0.5484 -0.3632 -0.1659 2.1613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.036934 1.449575 -1.405 0.15996
## Age 0.010930 0.006446 1.696 0.08996 .
## Number 0.410601 0.224861 1.826 0.06785 .
## Start -0.206510 0.067699 -3.050 0.00229 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 83.234 on 80 degrees of freedom
## Residual deviance: 61.380 on 77 degrees of freedom
## AIC: 69.38
##
## Number of Fisher Scoring iterations: 5
sumary(kMod)  # faraway's compact version of summary()
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0369335 1.4495745 -1.4052 0.159964
## Age 0.0109305 0.0064463 1.6956 0.089955
## Number 0.4106012 0.2248608 1.8260 0.067847
## Start -0.2065100 0.0676989 -3.0504 0.002285
##
## n = 81 p = 4
## Deviance = 61.37993 Null Deviance = 83.23447 (Difference = 21.85455)
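The coefficients above are on the log-odds scale; exponentiating them converts each to an odds ratio, which is often easier to interpret. A minimal sketch:
exp(coef(kMod))  # multiplicative change in the odds of kyphosis per unit increase in each predictor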
linPred <- predict(kMod)                      # linear predictor (log-odds scale)
predProb <- predict(kMod, type = "response")  # fitted probabilities
rawres <- kData$test - predProb               # raw (response) residuals
plot(rawres ~ linPred, xlab = "linear predictor", ylab = "raw residuals")
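The question asks specifically for deviance residuals against fitted values; these can be plotted directly, since residuals() on a glm returns deviance residuals by default. A minimal sketch:
devRes <- residuals(kMod)                     # deviance residuals (the glm default)
fitProb <- predict(kMod, type = "response")   # fitted probabilities
plot(devRes ~ fitProb, xlab = "fitted probability", ylab = "deviance residuals")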
Either version of this plot conveys little information, because with a binary response the residuals fall into two bands; this is why in part 2.3.c we bin the residuals and plot the bin means instead.
Produce a binned residual plot as described in the text. You will need to select an appropriate amount of binning. Comment on the plot.
library(dplyr)
kData <- mutate(kData, Residuals = residuals(kMod), linPred = predict(kMod))  # deviance residuals and linear predictor
gDf <- group_by(kData, cut(linPred, breaks = unique(quantile(linPred, (0:12)/12))))  # ~12 equal-count bins
## Warning: Factor `cut(linPred, breaks = unique(quantile(linPred, (0:12)/12)))` contains implicit NA, consider using `forcats::fct_explicit_na`
diagDf <- summarise(gDf, Residuals = mean(Residuals), linPred=mean(linPred))
plot(Residuals ~ linPred, data = diagDf, xlab = "Linear Predictor")
We created 12 bins; binning the residuals and plotting the bin means gives a much clearer picture. The residuals look fairly random, which is a good sign for our model, although the first few bins (before -2 on the x-axis) do show a somewhat downward pattern.
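The implicit-NA warning above arises because cut() excludes the lowest break by default, so the case with the minimum linear predictor falls into no bin. If we want every case binned, one fix, sketched here (the bin means would shift slightly), is to pass include.lowest = TRUE:
gDf <- group_by(kData, bin = cut(linPred, breaks = unique(quantile(linPred, (0:12)/12)), include.lowest = TRUE))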
Plot the residuals against the Start predictor, using binning as appropriate. Comment on the plot.
gDf <- group_by(kData, Start)
diagDf <- summarise(gDf, Residuals=mean(Residuals))
ggplot(diagDf, aes(x= Start, y= Residuals)) + geom_point()
This plot looks fairly random, which is what we want, although the positive residuals tend to be somewhat larger in magnitude than the negative ones.
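Some Start values occur only a handful of times, so a few of these bin means rest on very few observations. A quick check of the bin sizes, as a minimal sketch:
summarise(gDf, Residuals = mean(Residuals), count = n())  # how many cases back each bin mean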
Produce a normal QQ plot for the residuals. Interpret the plot.
qqnorm(residuals(kMod))
The QQ plot is not linear, but as noted in the ELM text, we are not expecting linearity here: with a binary response we do not anticipate normally distributed residuals.
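A reference line makes the departure from normality easier to judge; a minimal sketch using base graphics:
qqnorm(residuals(kMod))
qqline(residuals(kMod))  # line through the first and third quartiles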
Make a plot of the leverages. Interpret the plot.
halfnorm(hatvalues(kMod))
There are no strong outliers here. Points 53 and 24 have been automatically labeled, but they follow the line reasonably well, so they are not really outlying, nor are they likely to be influential leverage points pulling the fitted curve up or down.
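To see whether the two labeled cases look unusual in the data itself, we can inspect their rows directly (assuming the default row order of the kyphosis data):
kData[c(24, 53), ]  # the two cases flagged in the half-normal plot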
Check the goodness of fit for this model. Create a plot like Figure 2.9. Compute the Hosmer-Lemeshow statistic and associated p-value. What do you conclude?
linPred <- predict(kMod)
kDataM <- mutate(kData, predProb = predict(kMod, type = "response"))
gDf <- group_by(kDataM, cut(linPred, breaks = unique(quantile(linPred, (0:12)/12))))
## Warning: Factor `cut(linPred, breaks = unique(quantile(linPred, (0:12)/12)))` contains implicit NA, consider using `forcats::fct_explicit_na`
hlDf <- summarise(gDf, y = sum(test), pPred = mean(predProb), count = n())
hlDf <- mutate(hlDf, se.fit = sqrt(pPred * (1 - pPred) / count))  # binomial SE of a proportion
ggplot(hlDf, aes(x = pPred, y = y/count, ymin = y/count - 2*se.fit, ymax = y/count + 2*se.fit)) +
  geom_point() + geom_linerange(color = grey(0.75)) + geom_abline(intercept = 0, slope = 1) +
  xlab("Predicted Probability") +
  ylab("Observed Proportion")
hlStat <- with(hlDf, sum((y - count * pPred)^2 / (count * pPred * (1 - pPred))))  # Hosmer-Lemeshow statistic
hlStat
## [1] 9.887868
1 - pchisq(hlStat, nrow(hlDf) - 1)  # p-value on (number of bins - 1) degrees of freedom
## [1] 0.6257973
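The p-value is large (about 0.63), so there is no evidence of lack of fit: the model appears adequate. As a cross-check, the ResourceSelection package (assuming it is installed) provides a packaged Hosmer-Lemeshow test; its binning differs slightly from ours, so the statistic will not match exactly:
library(ResourceSelection)
hoslem.test(kDataM$test, kDataM$predProb, g = 10)  # observed outcomes vs fitted probabilities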
Use the model to classify the subjects into predicted outcomes using a 0.5 cutoff. Produce cross-tabulation of these predicted outcomes with the actual outcomes. When kyphosis is actually present, what is the probability that this model would predict a present outcome? What is the name for this characteristic of the test?
kDataM <- mutate(kDataM, predout = ifelse(predProb < 0.5, "no", "yes"))  # classify with a 0.5 cutoff
xtabs( ~ test + predout, kDataM)
## predout
## test no yes
## 0 61 3
## 1 10 7
(61 + 7)/(61 + 7 + 3 + 10)  # overall accuracy
## [1] 0.8395062
7/(10 + 7)  # predicted present among the 17 actually present
## [1] 0.4117647
The model is about 84% accurate overall. When kyphosis is actually present, the probability that the model predicts a present outcome is 7/17, about 41%; this characteristic of the test is called its sensitivity.
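For completeness, the complementary characteristic, the probability of predicting an absent outcome when kyphosis is actually absent, is the specificity, computed the same way from the cross-tabulation:
61/(61 + 3)  # specificity: about 95%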