Preface: these questions were adapted by Dr. Steven Miller from Faraway (2016). Many of the answers use code derived from this textbook, or from other sources.
data(prostate)
#xyplot(data = prostate, lweight ~ age, main = "Prostate: log (prostate weight) vs age of patient", xlab = "") - lattice
p <- plot_ly(prostate, x = ~age, y = ~lweight,
marker = list(size = 5,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',
width = 1),opacity = 0.5)) %>%
layout(title = 'Prostate: log (prostate weight) vs age of patient',
yaxis = list(zeroline = FALSE,
title = "log (prostate weight)"),
xaxis = list(zeroline = FALSE),
title = "age (years)") %>%
config(displayModeBar = F)
Two implementations are written below. One that manually optimises using a very straightforward bruteforce search (I would prefer the function to reduce the step size once a minimum change in error is reached, but we are keeping things simple). The second uses the optimise function built into R.
The second method actually produced highly inconsistent results. Adjusting the upper bound of the search interval produced considerably different estimates. My own manual adjusting (and plotting), showed a tight (low) bandwidth of lambda =< 0.6 minimised error (as can be seen below, banwidths of less than 0.6 result in identical fit). The only use of optimise() that found the global CV error minimum (12.36312) was when bandwidth was already restricted to an interval near this range, hence, we can spectulate that the function may be “overshooting” local minima. Allowing the optimiser a reasonable range of bandwidths (0,100), produced a reasonable estimate of bandwidth via the first cross-validation implementation (gcv_ksmooth) with the need for brute force (gcv_ksmooth_optim).
The first method (gcv_ksmooth) produces the expected curve with low bandiwth. A graph of the errror vs bandwidth is shown. The CV used is simple “leave one out CV” (LOOCV). It is certainly possible to use a more generalised version of CV (k-fold CV), might produce a more reasonable fit. Indeed, we can see quite a clear overfitting result at low bandwidths, followed by a more reasonable response of bandwidth and error, from which we might suggest a bandwidth of 13.
We then repeat this process with the outlier removed. We see that the discontinuity of the change in error at low bandwidth remains, but we also see a major improvement in the minimum LOOCV error, down to ~8.5 from a previous minimum around ~13. This suggests this single data point caused a considerable loss in fit. We see that visually, there is not a great deal of difference in the fit, with or without interpolation.
For reference, we plot both datasets with a more reasonable bandwidth (by ignoring this discontinuity). However, simply by visual examination we can clearly tell this is far too rough. It’s possible we would achieve a smoother fit by predicting on more points (i.e. not predict only on the given x points but also interpolating) with the N-W smoother, and this is demonstrated.
It’s worth noting this overfitting at low bandwidths occur because we did not penalise cases where the NW estimator is unable to predict a value, due to a lack of data within the bandwidth. For example, we are unable to apply this interpolation technique to visualise very low bandwidths, as the vast majority of interpolated values are unable to be estimated.
plot(lweight ~ age,data = prostate, main = "NW estimator: similarity of low bandwidths (fitted on original data)")
lines(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.1)$y,col = 'red')
lines(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.2)$y,col = 'red')
lines(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.3)$y,col = 'red')
lines(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.4)$y,col = 'red')
lines(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.5)$y,col = 'red')
lines(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.6)$y,col = 'red')
sum(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.2)$y)
## [1] 354.3108
sum(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.4)$y)
## [1] 354.3108
sum(ksmooth(y = prostate$lweight,x = prostate$age,n.points = length(prostate$age),x.points = prostate$age,"normal",0.6)$y)
## [1] 354.3108
gcv_ksmooth <- function(lambda,x_vec,y_vec) {
error_i <- 0
error_vec <- c()
for (i in 1:length(x_vec)) {
error_j <- (ksmooth(x = x_vec[-i],y = y_vec[-i],n.points = 1,x.points = x_vec[i],range.x = range(x_vec),
#range.x = c(min(x_vec)*max_extrap,max(x_vec)*max_extrap),
kernel = "normal",bandwidth = lambda)$y - y_vec[i])^2
if (!is.na(error_j)) {
error_i <- error_i + error_j
error_vec <- c(error_vec,error_i)
}
}
if (length(error_vec) == 0) {
lambda_error_2 <- Inf
} else {
lambda_error_2 <- sum(error_vec/length(error_vec)) # average error
}
return(lambda_error_2)
}
# function takes in ksmooth, calculate CV error, then iterates using a linear search
# manually optimising (without reducing step size)
gcv_ksmooth_optim <- function(step = 0.01,x_vec,y_vec,upper_limit = 10) {#min_error_change = 0.01) {
lambda_i <- 0
error_best <- Inf
lambda_error_vec <- c()
#i <- 0
#error_change <- Inf
#while (error_change > min_error_change) {
lambda_i_vec <- c()
while (lambda_i <= upper_limit-step) {
#i <- i + 1
#name <- paste('lambda', i,'plot.png', sep='_')
#png(name)
#plot(y_vec ~ x_vec, main = "GIF of CrossValidation select of best \n Normal Nadaraya-Watson Kernel Estimate",
# ylab = "log (prostate weight)",xlab = "Age")
lambda_i <- lambda_i + step # stepping through bandwidth
error_i <- 0
lambda_error <- Inf
error_vec <- c()
for (i in 1:length(x_vec)) {
error_j <- (ksmooth(x = x_vec[-i],
y = y_vec[-i],
n.points = 1,
x.points = x_vec[i],
#range.x = c(min(x_vec)*max_extrap,max(x_vec)*max_extrap),
kernel = "normal",
bandwidth = lambda_i)$y - y_vec[i])^2
if (!is.na(error_j)) {
error_i <- error_i + error_j
error_vec <- c(error_vec,error_i)
}
}
#lines(ksmooth(y = y_vec,x = x_vec,n.points = length(x_vec),x.points = x_vec,"normal",lambda_i)$y,col = 'black')
#dev.off()
if (length(error_vec) != 0) {
lambda_error <- sum(error_vec/length(error_vec)) # average error
}
lambda_error_vec <- c(lambda_error_vec,lambda_error)
lambda_i_vec <- c(lambda_i_vec,lambda_i)
#error_change = lambda_error_1 / lambda_error_2
if (lambda_error < error_best) {
lambda_f <- lambda_i
error_best <- lambda_error
}
}
#name <- paste('lambda', 9999,'plot.png', sep='_')
#png(name)
#plot(y_vec ~ x_vec, main = "GIF of CrossValidation select of best \n Normal Nadaraya-Watson Kernel Estimate",
# ylab = "log (prostate weight)",xlab = "Age")
#lines(ksmooth(y = y_vec,x = x_vec,n.points = length(x_vec),x.points = x_vec,"normal",lambda_f)$y,col = 'red')
#dev.off()
return(list('final_lambda' = lambda_f,'fitted_y' = ksmooth(x = x_vec,
y = y_vec,
n.points = length(x_vec),
x.points = x_vec,
#range.x = c(min(x_vec)*max_extrap,max(x_vec)*max_extrap),
kernel = "normal",
bandwidth = lambda_f)$y,'cv_errors' = lambda_error_vec,'lambdas' = lambda_i_vec))
}
x_vec <- prostate$age
y_vec <- prostate$lweight
paste("Optimal Bandwidth selection via optimise: ",optimise(f = gcv_ksmooth,interval = c(0,100),x_vec = prostate$age,y_vec = prostate$lweight,maximum = F,tol = 9)$minimum)
## [1] "Optimal Bandwidth selection via optimise: 14.5898033750315"
paste("Optimal LOOCV via optimise: ",optimise(f = gcv_ksmooth,interval = c(0,100),x_vec = prostate$age,y_vec = prostate$lweight,maximum = F,tol = 9)$objective)
## [1] "Optimal LOOCV via optimise: 12.7886576691188"
cv_best.list.99 <- gcv_ksmooth_optim(x_vec = x_vec,y_vec = y_vec,upper_limit = 99)
par(mfrow = c(2,2))
plot(cv_best.list.99$lambdas,cv_best.list.99$cv_errors,main = "CV Error vs Bandwidth",ylab = "CV Error (LOOCV)",xlab = "Bandwidth",type = 'l',sub = "Bandwidth < 100")
plot(cv_best.list.99$lambdas[1:2000],cv_best.list.99$cv_errors[1:2000],main = "CV Error vs Bandwidth",ylab = "CV Error (LOOCV)",xlab = "Bandwidth",type = 'l',lwd = 2,sub = "Bandwidth < 20")
plot(cv_best.list.99$lambdas[1:500],cv_best.list.99$cv_errors[1:500],main = "CV Error vs Bandwidth",ylab = "CV Error (LOOCV)",xlab = "Bandwidth",type = 'l',lwd = 2,sub = "Bandwidth < 5")
pred.df <- as.data.frame(list(prostate$age,cv_best.list.99$fitted_y,prostate$lweight))
colnames(pred.df) <- c("Age","Pred_lweight","lweight")
which(y_vec > 5) # identifying outlier
## [1] 32
x_vec_nout <- prostate$age[-32]
y_vec_nout <- prostate$lweight[-32]
optimise(f = gcv_ksmooth,interval = c(0,100),x_vec = x_vec_nout,y_vec = y_vec_nout,
maximum = F,tol = 9)
## $minimum
## [1] 14.5898
##
## $objective
## [1] 8.407509
cv_best.list.99.no <- gcv_ksmooth_optim(x_vec = x_vec_nout,y_vec = y_vec_nout,upper_limit = 99)
par(mfrow = c(1,3))
plot(cv_best.list.99.no$lambdas,cv_best.list.99.no$cv_errors,
main = "CV Error vs Bandwidth",ylab = "CV Error (LOOCV)",
xlab = "Bandwidth",type = 'l',sub = "Bandwidth < 1000 | Outlier removed")
plot(cv_best.list.99.no$lambdas[1:2000],cv_best.list.99.no$cv_errors[1:2000],
main = "CV Error vs Bandwidth",ylab = "CV Error (LOOCV)",
xlab = "Bandwidth",type = 'l',lwd = 2,sub = "Bandwidth < 20 | Outlier removed")
plot(cv_best.list.99.no$lambdas[1:500],cv_best.list.99.no$cv_errors[1:500],
main = "CV Error vs Bandwidth",ylab = "CV Error (LOOCV)",
xlab = "Bandwidth",type = 'l',lwd = 2,sub = "Bandwidth < 5 | Outlier removed")
pred.no.df <- as.data.frame(list(prostate$age[-32],cv_best.list.99.no$fitted_y,prostate$lweight[-32]))
colnames(pred.no.df) <- c("Age","Pred_lweight","lweight")
g1 <- ggplot(data = pred.df) +
geom_point(mapping = aes(x = Age,y= lweight)) +
geom_line(mapping = aes(x = Age,y = Pred_lweight),col = 'red') +
theme_classic() +
labs(title = "Age vs log (prostate weight)",subtitle = "F = 0.01",y = "log (prostate weight)", x = "Age")
g2 <- ggplot(data = pred.no.df) +
geom_point(mapping = aes(x = Age,y= lweight)) +
geom_line(mapping = aes(x = Age,y = Pred_lweight),col = 'red') +
theme_classic() +
labs(title = "Age vs log (prostate weight)",subtitle = "F = 0.01",y = "log (prostate weight)", x = "Age", caption = "Outlier removed")
grid.arrange(g1, g2, ncol = 2)
min(cv_best.list.99$cv_errors)
## [1] 12.36312
min(cv_best.list.99.no$cv_errors)
## [1] 8.300866
# ignoring initial overfit
min(cv_best.list.99$cv_errors[100:9900])
## [1] NA
min(cv_best.list.99.no$cv_errors[100:9900])
## [1] NA
# selecting lambdas
median(cv_best.list.99$lambdas[which(cv_best.list.99$cv_errors < 12.78740 & cv_best.list.99$cv_errors > 12.78738)])
## [1] 13.97
median(cv_best.list.99.no$lambdas[which(cv_best.list.99.no$cv_errors < 8.407509 & cv_best.list.99.no$cv_errors > 8.407507)])
## [1] 14.57
pred.df <- cbind(pred.df,'best_fit' = ksmooth(pred.df$Age,pred.df$lweight,kernel = "normal",bandwidth = 14.57,x.points = pred.df$Age)$y)
pred.no.df <- cbind(pred.no.df,'best_fit' = ksmooth(pred.no.df$Age,pred.no.df$lweight,kernel = "normal",bandwidth = 14.57,x.points = pred.no.df$Age)$y)
g1 <- ggplot(data = pred.df) +
geom_point(mapping = aes(x = Age,y= lweight)) +
geom_line(mapping = aes(x = Age,y = best_fit),col = 'red') +
theme_classic() +
labs(title = "Age vs log (prostate weight)",subtitle = "N-W normal smooth | F = 13.97",y = "log (prostate weight)", x = "Age", caption = "Reasonable minimum selected")
g2 <- ggplot(data = pred.no.df) +
geom_point(mapping = aes(x = Age,y= lweight)) +
geom_line(mapping = aes(x = Age,y = best_fit),col = 'red') +
theme_classic() +
labs(title = "Age vs log (prostate weight)",subtitle = "N-W normal smooth | F = 14.57",y = "log (prostate weight)", x = "Age",caption = "Outlier removed | Reasonable minimum selected ")
grid.arrange(g1, g2, ncol = 2)
smooth_data <- ksmooth(x = pred.df$Age,y = pred.df$lweight,n.points = 600,range.x = range(pred.df$Age),kernel = "normal",bandwidth = 14.57)$y
smooth_data.df <- as.data.frame(smooth_data);colnames(smooth_data.df) <- "smooth_y"
smooth_data.df$x <- seq(from = 41, to = 79, length.out = 600)
overfit_data <- ksmooth(x = pred.df$Age,y = pred.df$lweight,n.points = 600,range.x = range(pred.df$Age),kernel = "normal",bandwidth = 0.01)$y
overfit_data.df <- as.data.frame(overfit_data);colnames(overfit_data.df) <- "overfit_y"
overfit_data.df$x <- seq(from = 41, to = 79, length.out = 600)
# rm(overfit_data,overfit_data.df)
smooth_data_no <- ksmooth(x = pred.no.df$Age,y = pred.no.df$lweight,n.points = 600,range.x = range(pred.no.df$Age),kernel = "normal",bandwidth = 14.57)$y
smooth_data_no.df <- as.data.frame(smooth_data_no);colnames(smooth_data_no.df) <- "smooth_y"
smooth_data_no.df$x <- seq(from = 41, to = 79, length.out = 600)
g1 <- ggplot(data = pred.df) +
geom_point(mapping = aes(x = Age,y= lweight)) +
geom_line(data = smooth_data.df,mapping = aes(x = x,y = smooth_y),col = 'red') +
theme_classic() +
labs(title = "Age vs log (prostate weight)",subtitle = "N-W normal smooth | F = 13.97",y = "log (prostate weight)", x = "Age", caption = "Reasonable minimum selected \n Interpolated smooth")
g2 <- ggplot(data = pred.no.df) +
geom_point(mapping = aes(x = Age,y= lweight)) +
geom_line(data = smooth_data_no.df,mapping = aes(x = x,y = smooth_y),col = 'red') +
theme_classic() +
labs(title = "Age vs log (prostate weight)",subtitle = "N-W normal smooth | F = 14.57",y = "log (prostate weight)", x = "Age", caption = "Outlier removed | Reasonable minimum selected \n Interpolated smooth")
grid.arrange(g1, g2 ,ncol = 2)
The curve is almost completely linear, showing little curvature. It is difficult to compare the error reported by the smoothing function with the (non-GCV) error reported by the function in Q1 b. However, the curves are broadly similar, with the difference mostly at the edges of the data, where the more limited pool of data from which to calculate the smooth has produced divergent effects. This follows logically, as both smooths are calculated using CV (NW using LOOCV and splines using GCV).
It is interesting to note that on calculating the smooth via LOOCV, a warning message informs us that CV with non-unique x values is not recommended. This probably suggests that we should view our NW smooth with extra skepticism.
Note: interpolating the smooth spline made no difference to the shape of the final fit.
smoother <- smooth.spline(x = prostate$age,y = prostate$lweight)
smoother
## Call:
## smooth.spline(x = prostate$age, y = prostate$lweight)
##
## Smoothing Parameter spar= 1.499836 lambda= 4341.236 (28 iterations)
## Equivalent Degrees of Freedom (Df): 2.000035
## Penalized Criterion (RSS): 7.547608
## GCV: 0.2304165
pred.df$smoothing_splines <- predict(smoother,prostate$age)$y
smooth_data.df$smoothing_splines <- predict(smoother,smooth_data.df$x)$y
ggplot(data = pred.df) +
geom_point(mapping = aes(x = Age, y = lweight)) +
geom_line(data = smooth_data.df,mapping = aes(x = x,y = smooth_y,colour = 'N-W (BW = 14)')) +
geom_line(data = smooth_data.df, mapping = aes(x = x, y = smoothing_splines,colour = 'Smoothing Spline')) +
geom_line(mapping = aes(x = Age, y = smoothing_splines,colour = 'orange')) +
theme_classic() +
labs(title = "Age vs log (prostate weight)",y = "log (prostate weight)", x = "Age", subtitle = "Smooths added (Nadaraya-Watson, Smoothing Splines)",caption = "For N-W, a reasonable minimum selected, smooth interpolated (n = 600)") +
scale_y_continuous(limits = c(2,6.5)) +
scale_colour_manual("Smooths", breaks = c("N-W (BW = 14)", "Smoothing Spline"),
values = c("red", "orange",'orange'))
Quite a few warnings were generated, as there are many similar age values in the data, which complicates the smoothing process (as age is behaving partially like a factor - source. Unfortunately, in order to fully understand these difficults and whether jittering is an appropriate response, we would need to delve into the LOWESS algorithm (not described in detail by faraway). However, as a visualisation aid, using a neglible amount of jitter could be an appropriate response for this. As the curves are able to be fit here, we will move past this without worrying too much.
More detail here and original FORTRAN code used by R (!) here.
A linear (straight line) fit does not seems entirely plausible, given that the regression shows a fair amount of curvature even when span = 0.99. However, it is worth noting that the confidence bounds (default 95%) shown about the smooth easily would allow for a range of linear fits. If the true function of the relation between age and log (prostate weight) is indeed non-linear, it will likely be relatively close to a linear fit (perhaps a weakly exponential function). The non-linearity showns with span = 0.33 certainly doesn’t seem plausible.
We might, on the basis of background knowledge about the dataset, expect the cancer in older men to be more advanced (purely due to their age) and therefore the prostates that are to be removed to be slightly heavier. We could also reason that cancer symptoms are less likely to be caught in ‘younger’ men (relatively speaking, i.e. 40-50) due to reduced screening probabilities and perhaps as their otherwise good health might mask the symptoms for sometime. On the flip side, we might expect older men to recieve more regular health checkups and have their cancer caught earlier. Ultimately, this is speculation on the basis on the two variables we are considering alone and, without bringing in an analysis of the rest of the data present in the set (or without doing a literature review), it is difficult to comment on the likelihood of these ideas.
table(prostate$age)
##
## 41 43 44 47 49 50 52 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 1 1 1 2 1 2 1 1 1 2 4 3 5 5 3 7 7 7 6 3 12 5 4 1 4
## 73 74 76 77 78 79
## 2 1 1 2 1 1
# code directly from faraway
g1 <- ggplot(prostate, aes(x=age,y=lweight)) +
geom_point(alpha=0.25) +
geom_smooth(method="loess", span=0.99,aes(fill = '0.99',colour = "0.99")) +
theme_classic() +
labs(title = "Age vs log (prostate weight)",y = "log (prostate weight)",x = "Age", subtitle = "Lowess smooths fitted") +
scale_fill_manual("Smooth Span", breaks = c("0.99"), values = c("red")) +
scale_colour_manual("Smooth Span", breaks = c("0.99"), values = c("firebrick"))
g2 <- ggplot(prostate, aes(x=age,y=lweight)) +
geom_point(alpha=0.25) +
geom_smooth(method="loess", span=0.66,aes(fill = '0.66',colour = "0.66")) +
theme_classic() +
labs(title = "Age vs log (prostate weight)",y = "log (prostate weight)",x = "Age", subtitle = "Lowess smooths fitted") +
scale_fill_manual("Smooth Span", breaks = c("0.66"), values = c("orange")) +
scale_colour_manual("Smooth Span", breaks = c("0.66"), values = c("chocolate4"))
g3 <- ggplot(prostate, aes(x=age,y=lweight)) +
geom_point(alpha=0.25) +
geom_smooth(method="loess", span=0.33,aes(fill = '0.33',colour = "0.33")) +
theme_classic() +
labs(title = "Age vs log (prostate weight)",y = "log (prostate weight)",x = "Age", subtitle = "Lowess smooths fitted") +
scale_fill_manual("Smooth Span", breaks = c("0.33"),values = c('yellow')) +
scale_colour_manual("Smooth Span", breaks = c("0.33"),values = c('gold4'))
g4 <- ggplot(prostate, aes(x=age,y=lweight)) +
geom_point(alpha=0.25) +
geom_smooth(method="loess", span=0.2,aes(fill = '0.1',colour = '0.1')) +
theme_classic() +
labs(title = "Age vs log (prostate weight)",y = "log (prostate weight)",x = "Age",subtitle = "Lowess smooths fitted") +
scale_fill_manual("Smooth Span", breaks = c("0.1"), values = c("blue")) +
scale_colour_manual("Smooth Span", breaks = c("0.1"), values = c("darkblue"))
grid.arrange(g1,g2,g3,g4,ncol = 2,nrow = 2)
Occam’s razor seems like a guiding principle here. The smoothing spline fit is feasible via the LOWESS confidence bounds, and is not dissimilar from the NW smoothing fit, with the exception of not being affected unduly by the lack of data at the tails. Hence, from all three methods, the smoothing splines fit bes represents the relationship between log (prostate weight) and age. The LOWESS fits with smaller spans can be mostly discounted due to unusual roughness that appears to be overfitting, the LOWESS (span = 0.99) fit appears to mimic the NW fit and the LOWESS (span = 0.66) fit appears to be highly compatible with the preferred smoothing spline approach (i.e. it is relatively similar to a linear fit).
As shown below, there appears to be no interaction between the terms, and a directly linear relationship on both variables.
library(mgcv)
amod <- gam(lweight ~ s(age,lpsa), data=prostate)
vis.gam(amod, color = "topo", ticktype="detailed",theta=-30,phi = 30,n.grid = 60, main = "Bivariate fit of log (prostate weight) \n with Age and log (prostate specific antigen)",zlab = "log-pw",xlab = "Age",ylab = "log-psa")
vis.gam(amod, color = "topo", ticktype="simple",theta=0,phi = 0,n.grid = 60, main = "Bivariate fit of log (prostate weight) \n with Age and log (prostate specific antigen)",zlab = "log-pw",xlab = "Age",ylab = "log-psa")
vis.gam(amod, color = "topo", ticktype="simple",theta=90,phi = 0,n.grid = 60, main = "Bivariate fit of log (prostate weight) \n with Age and log (prostate specific antigen)",zlab = "log-pw",xlab = "Age",ylab = "log-psa")
vis.gam(amod,ticktype="detailed",theta=-30,phi = 30,n.grid = 60, main = "Bivariate fit of log (prostate weight) \n with Age and log (prostate specific antigen)",zlab = "log-pw",xlab = "Age",ylab = "log-psa", se = 5)
vis.gam(amod, ticktype="simple",theta=0,phi = 0,n.grid = 60, main = "Bivariate fit of log (prostate weight) \n with Age and log (prostate specific antigen)",zlab = "log-pw",xlab = "Age",ylab = "log-psa",se = 5)
vis.gam(amod, ticktype="simple",theta=90,phi = 0,n.grid = 60, main = "Bivariate fit of log (prostate weight) \n with Age and log (prostate specific antigen)",zlab = "log-pw",xlab = "Age",ylab = "log-psa", se = 5)
The variables agesq, chcond1, freepoor,age,income,nonpresc seem insignificant. freerepa is borderline. The documentation for GAM selection recommends maximum likelihood smoothness selection if p-values are to be used to aid variable selection.
nondocco_gam <- gam(nondocco ~
sex+freepoor+freerepa+chcond1+chcond2+agesq+
s(age)+s(income)+s(illness, k = 6)+s(actdays)+s(hscore)+s(prescrib, k = 6)+s(nonpresc, k = 6),
data= dvisits,family = "poisson",scale = -1,method = "ML")
summary(nondocco_gam)
##
## Family: poisson
## Link function: log
##
## Formula:
## nondocco ~ sex + freepoor + freerepa + chcond1 + chcond2 + agesq +
## s(age) + s(income) + s(illness, k = 6) + s(actdays) + s(hscore) +
## s(prescrib, k = 6) + s(nonpresc, k = 6)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.38616 2.38843 -0.580 0.561670
## sex 0.28764 0.07004 4.107 4.01e-05 ***
## freepoor -0.12911 0.20350 -0.634 0.525791
## freerepa 0.22882 0.07934 2.884 0.003926 **
## chcond1 0.33185 0.08880 3.737 0.000186 ***
## chcond2 0.85153 0.09930 8.575 < 2e-16 ***
## agesq -5.14661 11.52781 -0.446 0.655271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 3.603 4.590 17.004 0.00244 **
## s(income) 1.001 1.001 1.110 0.29208
## s(illness) 4.326 4.795 46.504 1.08e-08 ***
## s(actdays) 1.000 1.000 213.481 < 2e-16 ***
## s(hscore) 3.593 4.429 55.218 1.01e-10 ***
## s(prescrib) 3.355 3.959 48.752 7.47e-10 ***
## s(nonpresc) 1.000 1.000 0.154 0.69492
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.082 Deviance explained = 20.6%
## -ML = 3050.3 Scale est. = 1 n = 5190
A quick search brings up a discussion that suggests using diagnostics plots as opposed to backward elmination to avoid introducing bias into coefficient estimation. The documentation for step.gam also suggests using GCV, AIC, Mallow’s Cp or likelihood based approaches. However, as requested, manual backward elimination will be performend. In fairness, the documentation for gam.selection states:
It is perfectly possible to perform backwards selection using p-values in the usual way: that is by sequentially dropping the single term with the highest non-significant p-value from the model and re-fitting, until all terms are significant.
Using backward elimination, agesq and freepoor were removed from the model.
nondocco_gam_1 <- gam(nondocco ~
sex+freepoor+freerepa+chcond1+chcond2+
s(age)+s(income)+s(illness, k = 6)+s(actdays)+s(hscore)+s(prescrib, k = 6)+s(nonpresc, k = 6),
data= dvisits,family = "poisson",scale = -1,method = "ML")
summary(nondocco_gam_1)
##
## Family: poisson
## Link function: log
##
## Formula:
## nondocco ~ sex + freepoor + freerepa + chcond1 + chcond2 + s(age) +
## s(income) + s(illness, k = 6) + s(actdays) + s(hscore) +
## s(prescrib, k = 6) + s(nonpresc, k = 6)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.45171 0.08091 -30.303 < 2e-16 ***
## sex 0.28717 0.07003 4.101 4.12e-05 ***
## freepoor -0.12930 0.20348 -0.635 0.525157
## freerepa 0.22986 0.07934 2.897 0.003765 **
## chcond1 0.33283 0.08879 3.748 0.000178 ***
## chcond2 0.85209 0.09930 8.581 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 4.446 5.431 33.820 3.94e-06 ***
## s(income) 1.000 1.001 1.034 0.309
## s(illness) 4.325 4.794 46.455 1.11e-08 ***
## s(actdays) 1.000 1.000 213.405 < 2e-16 ***
## s(hscore) 3.592 4.427 55.190 1.02e-10 ***
## s(prescrib) 3.356 3.960 48.775 7.37e-10 ***
## s(nonpresc) 1.000 1.000 0.159 0.690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0819 Deviance explained = 20.6%
## -ML = 3050.4 Scale est. = 1 n = 5190
nondocco_gam_2 <- gam(nondocco ~
sex+freerepa+chcond1+chcond2+
s(age)+s(income)+s(illness, k = 6)+s(actdays)+s(hscore)+s(prescrib, k = 6)+s(nonpresc, k = 6),
data= dvisits,family = "poisson",scale = -1,method = "ML")
summary(nondocco_gam_2)
##
## Family: poisson
## Link function: log
##
## Formula:
## nondocco ~ sex + freerepa + chcond1 + chcond2 + s(age) + s(income) +
## s(illness, k = 6) + s(actdays) + s(hscore) + s(prescrib,
## k = 6) + s(nonpresc, k = 6)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.45971 0.08002 -30.737 < 2e-16 ***
## sex 0.29080 0.06982 4.165 3.12e-05 ***
## freerepa 0.23486 0.07903 2.972 0.002961 **
## chcond1 0.33193 0.08879 3.738 0.000185 ***
## chcond2 0.85091 0.09926 8.572 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 4.443 5.428 34.780 2.59e-06 ***
## s(income) 1.000 1.001 0.828 0.363
## s(illness) 4.328 4.796 46.630 1.03e-08 ***
## s(actdays) 1.000 1.001 213.531 < 2e-16 ***
## s(hscore) 3.576 4.408 54.900 1.12e-10 ***
## s(prescrib) 3.365 3.969 49.038 6.38e-10 ***
## s(nonpresc) 1.000 1.000 0.155 0.694
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0822 Deviance explained = 20.6%
## -ML = 3050.6 Scale est. = 1 n = 5190
The smooth terms income and nonpresc were removed from the model.
nondocco_gam_4 <- gam(nondocco ~
sex+freerepa+chcond1+chcond2+
s(age)+s(income)+s(illness, k = 6)+s(actdays)+s(hscore)+s(prescrib, k = 6),
data = dvisits,family = "poisson",scale = -1,method = "ML")
summary(nondocco_gam_4)
##
## Family: poisson
## Link function: log
##
## Formula:
## nondocco ~ sex + freerepa + chcond1 + chcond2 + s(age) + s(income) +
## s(illness, k = 6) + s(actdays) + s(hscore) + s(prescrib,
## k = 6)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.46084 0.07998 -30.767 < 2e-16 ***
## sex 0.29251 0.06970 4.197 2.71e-05 ***
## freerepa 0.23315 0.07891 2.955 0.003132 **
## chcond1 0.33360 0.08869 3.761 0.000169 ***
## chcond2 0.85206 0.09922 8.587 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 4.458 5.445 34.762 2.67e-06 ***
## s(income) 1.001 1.001 0.828 0.363
## s(illness) 4.332 4.798 46.813 9.42e-09 ***
## s(actdays) 1.000 1.000 213.839 < 2e-16 ***
## s(hscore) 3.565 4.395 54.960 1.06e-10 ***
## s(prescrib) 3.368 3.972 48.891 6.79e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0828 Deviance explained = 20.6%
## -ML = 3050.7 Scale est. = 1 n = 5190
A warning message recommends that F testing is not appropriate for poisson distributed families. To check the p-values from using this test, I followed the method described here (similar to the chapter on random effects). This is similar to the chi-sq approximations used in the count (i.e. poisson) regression sections of Faraway (2016). It’s important to note the p-values here are only approximate (Faraway,2016).
Relatively similar p-values result from either method. Both forms of tests consistently found the change in terms to be significant or non-significant. actdays was replaced with a linear equivalent. One interesting point to note is that, in the call of the function, if the s() term is replaced with a linear term between other s() terms, the model will interpret this as a request to weight the model by that term (i.e. the smooth term will be removed and the model will instead be weighted by that variable specified in the term).
#age
nondocco_gam_4a <- gam(nondocco ~
sex+freerepa+chcond1+chcond2+age+s(illness, k = 6)+s(actdays)+s(hscore)+s(prescrib, k = 6),
data = dvisits,family = "poisson",scale = -1,method = "ML")
#illness
nondocco_gam_4b <- gam(nondocco ~
sex+freerepa+chcond1+chcond2+illness+
s(age)+s(actdays)+s(hscore)+s(prescrib, k = 6),
data = dvisits,family = "poisson",scale = -1,method = "ML")
#presib
nondocco_gam_4c <- gam(nondocco ~
sex+freerepa+chcond1+chcond2+prescrib+
s(age)+s(illness, k = 6)+s(actdays)+s(hscore),
data = dvisits,family = "poisson",scale = -1,method = "ML")
#hscore
nondocco_gam_4d <- gam(nondocco ~
sex+freerepa+chcond1+chcond2+hscore+
s(age)+s(illness, k = 6)+s(actdays)+s(prescrib, k = 6),
data = dvisits,family = "poisson",scale = -1,method = "ML")
#actdays
nondocco_gam_4e <- gam(nondocco ~
sex+freerepa+chcond1+chcond2+actdays+
s(age)+s(illness, k = 6)+s(hscore)+s(prescrib, k = 6),
data = dvisits,family = "poisson",scale = -1,method = "ML")
anova(nondocco_gam_4,nondocco_gam_4a,test = "F")
## Analysis of Deviance Table
##
## Model 1: nondocco ~ sex + freerepa + chcond1 + chcond2 + s(age) + s(income) +
## s(illness, k = 6) + s(actdays) + s(hscore) + s(prescrib,
## k = 6)
## Model 2: nondocco ~ sex + freerepa + chcond1 + chcond2 + age + s(illness,
## k = 6) + s(actdays) + s(hscore) + s(prescrib, k = 6)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 5161.5 4864.4
## 2 5168.1 4892.1 -6.5696 -27.678 4.213 0.0001767 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(nondocco_gam_4,nondocco_gam_4b,test = "F")
## Analysis of Deviance Table
##
## Model 1: nondocco ~ sex + freerepa + chcond1 + chcond2 + s(age) + s(income) +
## s(illness, k = 6) + s(actdays) + s(hscore) + s(prescrib,
## k = 6)
## Model 2: nondocco ~ sex + freerepa + chcond1 + chcond2 + illness + s(age) +
## s(actdays) + s(hscore) + s(prescrib, k = 6)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 5161.5 4864.4
## 2 5166.5 4917.7 -4.9888 -53.209 10.666 3.003e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(nondocco_gam_4,nondocco_gam_4c,test = "F")
## Analysis of Deviance Table
##
## Model 1: nondocco ~ sex + freerepa + chcond1 + chcond2 + s(age) + s(income) +
## s(illness, k = 6) + s(actdays) + s(hscore) + s(prescrib,
## k = 6)
## Model 2: nondocco ~ sex + freerepa + chcond1 + chcond2 + prescrib + s(age) +
## s(illness, k = 6) + s(actdays) + s(hscore)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 5161.5 4864.4
## 2 5166.3 4908.4 -4.7734 -43.921 9.2011 1.819e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(nondocco_gam_4,nondocco_gam_4d,test = "F")
## Analysis of Deviance Table
##
## Model 1: nondocco ~ sex + freerepa + chcond1 + chcond2 + s(age) + s(income) +
## s(illness, k = 6) + s(actdays) + s(hscore) + s(prescrib,
## k = 6)
## Model 2: nondocco ~ sex + freerepa + chcond1 + chcond2 + hscore + s(age) +
## s(illness, k = 6) + s(actdays) + s(prescrib, k = 6)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 5161.5 4864.4
## 2 5167.2 4914.7 -5.7444 -50.262 8.7498 3.077e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(nondocco_gam_4,nondocco_gam_4e,test = "F")
## Analysis of Deviance Table
##
## Model 1: nondocco ~ sex + freerepa + chcond1 + chcond2 + s(age) + s(income) +
## s(illness, k = 6) + s(actdays) + s(hscore) + s(prescrib,
## k = 6)
## Model 2: nondocco ~ sex + freerepa + chcond1 + chcond2 + actdays + s(age) +
## s(illness, k = 6) + s(hscore) + s(prescrib, k = 6)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 5161.5 4864.4
## 2 5162.9 4866.6 -1.3635 -2.1678 1.5899 0.2091
pchisq(deviance(nondocco_gam_4a)-deviance(nondocco_gam_4),
df.residual(nondocco_gam_4a)-df.residual(nondocco_gam_4),
lower.tail=FALSE)
## [1] 2.295512e-05
pchisq(deviance(nondocco_gam_4b)-deviance(nondocco_gam_4),
df.residual(nondocco_gam_4b)-df.residual(nondocco_gam_4),
lower.tail=FALSE)
## [1] 8.391845e-11
pchisq(deviance(nondocco_gam_4c)-deviance(nondocco_gam_4),
df.residual(nondocco_gam_4c)-df.residual(nondocco_gam_4),
lower.tail=FALSE)
## [1] 3.163223e-09
pchisq(deviance(nondocco_gam_4d)-deviance(nondocco_gam_4),
df.residual(nondocco_gam_4d)-df.residual(nondocco_gam_4),
lower.tail=FALSE)
## [1] 2.895425e-10
pchisq(deviance(nondocco_gam_4e)-deviance(nondocco_gam_4),
df.residual(nondocco_gam_4e)-df.residual(nondocco_gam_4),
lower.tail=FALSE)
## [1] 0.1889665
As the tests below indicate, the model is does not have statistically significantly different fit from the initial full GAM model. However, the forumulation is considerably different, as we have removed several linear and smooth terms, and changed a smooth term to a linear term.
summary(nondocco_gam_4e)
##
## Family: poisson
## Link function: log
##
## Formula:
## nondocco ~ sex + freerepa + chcond1 + chcond2 + actdays + s(age) +
## s(illness, k = 6) + s(hscore) + s(prescrib, k = 6)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.546196 0.079867 -31.880 < 2e-16 ***
## sex 0.301794 0.068807 4.386 1.15e-05 ***
## freerepa 0.254320 0.075684 3.360 0.000779 ***
## chcond1 0.334805 0.088684 3.775 0.000160 ***
## chcond2 0.856167 0.099091 8.640 < 2e-16 ***
## actdays 0.088760 0.006074 14.614 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 4.194 5.125 34.66 1.92e-06 ***
## s(illness) 4.328 4.796 46.81 9.45e-09 ***
## s(hscore) 3.581 4.414 55.60 8.19e-11 ***
## s(prescrib) 3.360 3.964 48.75 7.40e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0829 Deviance explained = 20.6%
## -ML = 3051.1 Scale est. = 1 n = 5190
anova(nondocco_gam,nondocco_gam_4,test = "F")
## Analysis of Deviance Table
##
## Model 1: nondocco ~ sex + freepoor + freerepa + chcond1 + chcond2 + agesq +
## s(age) + s(income) + s(illness, k = 6) + s(actdays) + s(hscore) +
## s(prescrib, k = 6) + s(nonpresc, k = 6)
## Model 2: nondocco ~ sex + freerepa + chcond1 + chcond2 + s(age) + s(income) +
## s(illness, k = 6) + s(actdays) + s(hscore) + s(prescrib,
## k = 6)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 5159.3 4863.2
## 2 5161.5 4864.4 -2.1711 -1.2798 0.5895 0.5681
pchisq(deviance(nondocco_gam_4)-deviance(nondocco_gam),
df.residual(nondocco_gam_4)-df.residual(nondocco_gam),
lower.tail=FALSE)
## [1] 0.5645231
nondocco_gam_4e$call
## gam(formula = nondocco ~ sex + freerepa + chcond1 + chcond2 +
## actdays + s(age) + s(illness, k = 6) + s(hscore) + s(prescrib,
## k = 6), family = "poisson", data = dvisits, method = "ML",
## scale = -1)
nondocco_gam$call
## gam(formula = nondocco ~ sex + freepoor + freerepa + chcond1 +
## chcond2 + agesq + s(age) + s(income) + s(illness, k = 6) +
## s(actdays) + s(hscore) + s(prescrib, k = 6) + s(nonpresc,
## k = 6), family = "poisson", data = dvisits, method = "ML",
## scale = -1)
names(nondocco_gam$coefficients)
## [1] "(Intercept)" "sex" "freepoor" "freerepa"
## [5] "chcond1" "chcond2" "agesq" "s(age).1"
## [9] "s(age).2" "s(age).3" "s(age).4" "s(age).5"
## [13] "s(age).6" "s(age).7" "s(age).8" "s(age).9"
## [17] "s(income).1" "s(income).2" "s(income).3" "s(income).4"
## [21] "s(income).5" "s(income).6" "s(income).7" "s(income).8"
## [25] "s(income).9" "s(illness).1" "s(illness).2" "s(illness).3"
## [29] "s(illness).4" "s(illness).5" "s(actdays).1" "s(actdays).2"
## [33] "s(actdays).3" "s(actdays).4" "s(actdays).5" "s(actdays).6"
## [37] "s(actdays).7" "s(actdays).8" "s(actdays).9" "s(hscore).1"
## [41] "s(hscore).2" "s(hscore).3" "s(hscore).4" "s(hscore).5"
## [45] "s(hscore).6" "s(hscore).7" "s(hscore).8" "s(hscore).9"
## [49] "s(prescrib).1" "s(prescrib).2" "s(prescrib).3" "s(prescrib).4"
## [53] "s(prescrib).5" "s(nonpresc).1" "s(nonpresc).2" "s(nonpresc).3"
## [57] "s(nonpresc).4" "s(nonpresc).5"
names(nondocco_gam_4e$coefficients)
## [1] "(Intercept)" "sex" "freerepa" "chcond1"
## [5] "chcond2" "actdays" "s(age).1" "s(age).2"
## [9] "s(age).3" "s(age).4" "s(age).5" "s(age).6"
## [13] "s(age).7" "s(age).8" "s(age).9" "s(illness).1"
## [17] "s(illness).2" "s(illness).3" "s(illness).4" "s(illness).5"
## [21] "s(hscore).1" "s(hscore).2" "s(hscore).3" "s(hscore).4"
## [25] "s(hscore).5" "s(hscore).6" "s(hscore).7" "s(hscore).8"
## [29] "s(hscore).9" "s(prescrib).1" "s(prescrib).2" "s(prescrib).3"
## [33] "s(prescrib).4" "s(prescrib).5"
The model fit using a GLM the same conditions (read: including the same variables and specifying the same family) is significantly different from the GAM. The manual chi-sq approximated deviance test does not seem to be appropriate for comparing a GLM and GAM.
nondocco_reduced_glm <- glm(nondocco ~ sex+freerepa+chcond1+chcond2+age+illness+actdays+hscore+prescrib,data = dvisits,family = "poisson")
anova(nondocco_gam,nondocco_reduced_glm,test = "F")
## Analysis of Deviance Table
##
## Model 1: nondocco ~ sex + freepoor + freerepa + chcond1 + chcond2 + agesq +
## s(age) + s(income) + s(illness, k = 6) + s(actdays) + s(hscore) +
## s(prescrib, k = 6) + s(nonpresc, k = 6)
## Model 2: nondocco ~ sex + freerepa + chcond1 + chcond2 + age + illness +
## actdays + hscore + prescrib
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 5159.3 4863.2
## 2 5190.0 5055.7 -30.67 -192.49 6.276 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#pchisq(deviance(nondocco_gam_4)-deviance(nondocco_reduced_glm),
# df.residual(nondocco_gam_4)-df.residual(nondocco_gam),
# lower.tail=FALSE)
(ignoring possible interactions…) Someone who is:
options(scipen = 10)
summary(nondocco_gam_4e)[1]
## $p.coeff
## (Intercept) sex freerepa chcond1 chcond2 actdays
## -2.54619613 0.30179405 0.25432047 0.33480507 0.85616735 0.08875951
Conditional plots of the effect of the smooth variables on the response are shown. Uncertainly of the mean, as well as the smooth, are shown in the shaded regions. The second set of plots show the conditional residuals in addition to the above information. The rugs indicate that the data is highly grouped! (Something I did not initially pick up…always do an EDA). The tables below show this.
All possible 2d combinations of smooths (6) are also shown below. Age and illness appear to have a complicated relationship.
The models indicate some interesting univariate conditional trends;
Age: does not have the directly linear relationship we might intuitively expect, although it is true that young people have less visits and old more. There is a considerable trough around 0.4 - 0.6 where visits are low. Between age 0.3 and 0.6 there does not appear to be a significant trend present
Illness: we can see that the having no illness is associated with a lowered chance of visiting (as would be expected). Having several days maximises the (conditional) chance, whilst 2-5 days has an increased (but lower than 2 days) chance of being associated with a visit. Past 2 days of illness there does not appear to be a significant trend present.
HScore: Suprising, this appears to have an almost quadratic fit, with the number of non-doctor visits maximised at poor, but not very bad health, lowered with good health and minimised with very bad health. A plausible explanation for this is that those with very poor health may be in the primary care phase of their treatment, i.e. you are unlikely to need physio when your legs are broken (a flippant example).
Prescrib: those with many prescribed medications (n>6) recently used and those with non or few medications (n<3) have a lowered chance of non-doctors visits. Those with some, but not a plethora of medications (3
plot(nondocco_gam_4e,pages=1,scheme = 1,cex = 0.25,col = '#FF8000',shade.col = 'gray90',n = 100000,main = "Residuals of GAM",residuals = T)
par(mfrow = c(2,2),mar = c(0.5,0.5,0.5,0.5))
vis.gam(nondocco_gam_4e,theta=-45,color="heat",view = c("illness","prescrib"),nCol = 5,ticktype = "simple",zlab = "Non-Doctors Visits",type = 'response')
vis.gam(nondocco_gam_4e,theta=-45,color="heat",view = c("illness","age"),nCol = 5,ticktype = "simple",zlab = "Non-Doctors Visits",type = 'response')
vis.gam(nondocco_gam_4e,theta=-45,color="heat",view = c("illness","hscore"),nCol = 5,ticktype = "simple",zlab = "Non-Doctors Visits",type = 'response')
vis.gam(nondocco_gam_4e,theta=-45,color="heat",view = c("prescrib","age"),nCol = 5,ticktype = "simple",zlab = "Non-Doctors Visits",type = 'response')
vis.gam(nondocco_gam_4e,theta=-45,color="heat",view = c("prescrib","hscore"),nCol = 5,ticktype = "simple",zlab = "Non-Doctors Visits",type = 'response')
vis.gam(nondocco_gam_4e,theta=-45,color="heat",view = c("age","hscore"),nCol = 5,ticktype = "simple",zlab = "Non-Doctors Visits",type = 'response')
table(dvisits$age)
##
## 0.19 0.22 0.27 0.32 0.37 0.42 0.47 0.52 0.57 0.62 0.67 0.72
## 752 1213 523 301 146 126 181 222 273 316 315 822
table(dvisits$income)
##
## 0 0.01 0.06 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.9 1.1 1.3 1.5
## 79 35 80 249 1195 462 400 467 455 441 589 361 162 215
table(dvisits$illness)
##
## 0 1 2 3 4 5
## 1554 1638 946 542 274 236
table(dvisits$hscore)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 3026 823 446 273 187 132 104 61 42 32 21 24 19
table(dvisits$prescrib)
##
## 0 1 2 3 4 5 6 7 8
## 3085 994 509 276 157 80 40 23 26
Let’s call him Dave (he is male after all). The gist of this procedure (from a slightly more complicated prediction scenario) comes from here.
(Dave <- dvisits[nrow(dvisits),])
## sex age agesq income levyplus freepoor freerepa illness actdays
## 5190 0 0.72 0.5184 0.25 0 0 1 0 0
## hscore chcond1 chcond2 doctorco nondocco hospadmi hospdays medicine
## 5190 0 0 0 0 0 0 0 0
## prescrib nonpresc
## 5190 0 0
predict.glm(nondocco_reduced_glm,Dave,type = "response",se.fit = T)
## $fit
## 5190
## 0.1089584
##
## $se.fit
## 5190
## 0.0119574
##
## $residual.scale
## [1] 1
hist(rpois(100000,rnorm(100000,0.06627531,0.009154955)),main = "Predicted Posterior Distribution: GLM",xlab = "Number of Non-Doctor Health Professional Visits",col = 'red')
table(rpois(100000,rnorm(100000,0.06627531,0.009154955)))
##
## 0 1 2 3
## 93729 6071 195 5
predict.gam(nondocco_gam_4e,Dave,type = "response",se.fit = T,unconditional = T)
## $fit
## 5190
## 0.06627531
##
## $se.fit
## 5190
## 0.009154955
hist(rpois(100000,rnorm(100000,0.1089584,0.0119574)),main = "Predicted Posterior Distribution: GAM",xlab = "Number of Non-Doctor Health Professional Visits",col = 'red')
table(rpois(100000,rnorm(100000,0.1089584,0.0119574)))
##
## 0 1 2 3 4
## 89713 9729 533 23 2
The errors on the predictions for the GLM are much higher than the GAM model. The GLM model is more dispersed, having a higher mean and error on that mean, than the GLM model. Hence, the GLM model predicts a higher probability that Dave will visit the doctor once, or even twice. In both models. These differences are perhaps illustrated better in tabular form, where we see the GLM model predicts Dave is only 64% as likely to go for one non-doctor visit (and much less likely to go for 2,3 or 4 visits).