Introduction
This report examines the effectiveness of several machine learning methods in modeling diabetes disease progression using data collected from 442 patients. The data comes from Efron et al. (2003). Specifically, we examine whether age, sex, body mass index, average blood pressure, and six blood serum measurements are useful in explaining a quantitative measure of disease progression one year after a baseline measurement. The first section of the report consists of a very brief exploratory data analysis and data quality check. The second section covers model building. The third section gives our results. The report concludes with a brief commentary on our results.
The data frame contains 442 observations and 10 predictor variables. Based on the summary of the data, the predictors have been standardized to a common scale with mean 0. There are no records with missing values, so we proceed to model building.
Analysis
Before model building, we randomly split the data into a training data set containing 75% of the rows and a testing data set containing 25% of the rows. The training data will be used to “train” our machine learning methods. The testing data will be used to evaluate the models.
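The split takes only a few lines of R (this is a condensed version of the full listing in the Code section, where data.all holds the predictors and the response and n is the number of rows):
set.seed(1306)                 # reproducible split
test <- sample(n, round(n/4))  # indices of the 25% held out for testing
data.train <- data.all[-test, ]
data.test  <- data.all[test, ]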
We begin with a baseline model by fitting an ordinary least squares multiple linear regression model using all the variables. The test MSE is calculated by averaging the squared differences between the model's predictions and the observed values on the test set.
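In code, with lm.full denoting the fitted model (see the Code section), the test MSE is:
pred <- predict(lm.full, data.test)  # predictions on the held-out rows
mean((data.test$y - pred)^2)         # test mean squared error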
The coefficients for the model are shown along with their standard errors, t-values, and p-values. Interestingly, age has a negative coefficient, which is unexpected. The coefficient is -66.76, which can be interpreted as a 66.76-point reduction in the diabetes progression metric for every one-unit increase in the (standardized) age variable. Several variables have p-values above 0.05, indicating that their true parameters may be zero and their relationship to the response variable possibly insignificant. These are AGE, TC, LDL, HDL, TCH (just barely), and GLU. We offer a very brief model validation using the plot of residuals vs. fitted values; it does not indicate any severe violations of the assumption of randomly distributed residuals.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 149.92 2.98 50.38 0.00
age -66.76 68.95 -0.97 0.33
sex -304.65 69.85 -4.36 0.00
bmi 518.66 76.57 6.77 0.00
map 388.11 72.75 5.33 0.00
tc -815.27 537.55 -1.52 0.13
ldl 387.60 439.16 0.88 0.38
hdl 162.90 269.12 0.61 0.55
tch 323.83 186.80 1.73 0.08
ltg 673.62 206.89 3.26 0.00
glu 94.22 79.59 1.18 0.24
The test MSE for the full linear regression model is: 3111.26
Now we apply the best subset selection algorithm to attempt to find a subset of predictors that works well. The algorithm performs an exhaustive search, fitting a least squares regression for each combination of predictors. For each size k=1,2,…,p, the algorithm picks the model with the lowest residual sum of squares. From these models, we select the one that produces the lowest BIC. Graphs are included to visualize how we make this selection.
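A condensed sketch of this search using the leaps package, as in the Code section:
library(leaps)
regfit.full <- regsubsets(y ~ ., data = data.train, nvmax = 10)  # exhaustive search
reg.summary <- summary(regfit.full)
best.size   <- which.min(reg.summary$bic)  # model size minimizing BIC
coef(regfit.full, id = best.size)          # coefficients of that model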
The number of predictors selected by best subset selection is 6 with a BIC of -201.12
(Intercept) sex bmi map tc tch ltg
Estimate Std. Error t value Pr(>|t|)
(Intercept) 150.12 2.96 50.68 0
sex -306.04 69.04 -4.43 0
bmi 538.83 74.22 7.26 0
map 389.07 69.92 5.56 0
tc -379.04 82.27 -4.61 0
tch 332.67 88.97 3.74 0
ltg 527.57 87.07 6.06 0
The best model with respect to BIC contains 6 variables, as indicated by the red dot on the plot above. The variables and coefficient estimates are reported above. Interestingly, the 6-variable model now has significant p-values for every predictor. AGE, LDL, HDL, and GLU were removed. The BIC statistic places a heavier penalty on models with many variables: its per-variable penalty grows with log(n), which exceeds, for example, AIC's constant factor of 2 once n > e² ≈ 7.4. Below we give the test MSE for the model selected by best subset selection.
The test MSE for the best subsets model is: 3095.48
The test MSE is 3095.48, compared to 3111.26 for the full linear regression model. This is a decrease of 15.78, or only about 0.5%.
Next, we apply 10-fold cross-validation combined with the best subsets approach. This approach divides the training data into 10 roughly equal parts and iteratively uses each part as a validation set, fitting the models on the remaining 9 parts, until each part has been used as a validation set once. The mean squared error is calculated on each held-out part and then averaged across the folds. The model size selected is the one that produces the lowest cross-validation error.
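A condensed sketch of the fold loop (predict.regsubsets is a small helper defined in the Code section, since regsubsets provides no predict method):
k <- 10
folds <- sample(1:k, nrow(data.train), replace = TRUE)  # random fold labels
cv.errors <- matrix(NA, k, 10)
for (j in 1:k) {                                        # hold out fold j
  fit <- regsubsets(y ~ ., data = data.train[folds != j, ], nvmax = 10)
  for (i in 1:10)                                       # subset sizes 1..10
    cv.errors[j, i] <- mean((data.train$y[folds == j] -
      predict.regsubsets(fit, data.train[folds == j, ], id = i))^2)
}
mean.cv.errors <- apply(cv.errors, 2, mean)             # average over folds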
(Intercept)       sex       bmi       map        tc       tch       ltg
   150.1166 -306.0420  538.8274  389.0673 -379.0379  332.6735  527.5658
The test MSE for the best subsets model using CV is: 3095.48
Cross-validation also selects a 6-variable model. We re-trained the 6-variable model on the full training data set to get the test MSE above. The model is identical to the one selected by best subsets in the previous step, so the test MSE remains 3095.48, indicating no improvement.
Next, we apply ridge regression using 10-fold cross-validation. Ridge regression is very similar to least squares regression except that the coefficients are estimated by minimizing the residual sum of squares plus a penalty proportional to the sum of the squared coefficients. In short, the method can sometimes produce coefficients that reduce the variance of the model with a small concession in bias. Our goal is to minimize the model's test error. Expected test MSE decomposes into variance plus squared bias (plus irreducible error), so if we reduce variance significantly we may achieve higher prediction accuracy.
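A condensed sketch with glmnet, where alpha = 0 selects the ridge penalty and lambda.1se is the one-standard-error choice of the tuning parameter (the full listing in the Code section refits glmnet over an explicit lambda grid before predicting):
library(glmnet)
set.seed(1306)
cv.out <- cv.glmnet(x.train, y.train, alpha = 0)  # 10-fold CV over lambda
ridge.pred <- predict(cv.out, s = cv.out$lambda.1se, newx = x.test)
mean((y.test - ridge.pred)^2)                     # test MSE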
The test MSE for the ridge regression model using CV is: 3070.64
Coefficient
age -11.28346
sex -156.90257
bmi 374.45359
map 264.87631
tc -32.00363
ldl -66.94094
hdl -173.93914
tch 123.98677
ltg 307.69357
glu 134.49577
The test MSE for ridge regression is 3070.64, compared to 3095.48 and 3111.26 for the best subsets model and the linear regression model respectively. This is a 1.3% overall decrease, which still seems quite low. The coefficients are reported above. As expected, they are shrunken in comparison to those given by the linear regression model: age had a coefficient of -66.76 in the linear regression model, while the ridge regression model produces a coefficient of -11.28 for the age variable. Additionally, we can observe that all variables are included in the model. This might be a disadvantage, as we noticed previously that several variables were insignificant.
Now we fit a model using the lasso procedure. The lasso's advantage over ridge regression is that it performs variable selection and therefore might produce a model that is more interpretable. As we did with ridge regression, we use 10-fold cross-validation to identify the model that produces the lowest test error.
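The call is the same as for ridge except alpha = 1, which switches glmnet to the lasso penalty:
set.seed(1306)
cv.out <- cv.glmnet(x.train, y.train, alpha = 1)  # lasso via 10-fold CV
lasso.pred <- predict(cv.out, s = cv.out$lambda.1se, newx = x.test)
mean((y.test - lasso.pred)^2)                     # test MSE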
The test MSE for the lasso model using CV is: 2920.04
Coefficient
age 0.00000
sex -119.62208
bmi 501.56473
map 270.92614
tc 0.00000
ldl 0.00000
hdl -180.29437
tch 0.00000
ltg 390.55001
glu 16.58881
The test error for the lasso model is 2920.04, a decrease of 150.60 (about 4.9%) relative to the ridge regression model and the largest improvement so far. As we can see from the coefficients above, the lasso set several predictors (age, tc, ldl, tch) exactly to zero. Interestingly, all of these predictors had insignificant p-values in the original full least squares model.
Results
Our results are summarized in the table below:
Model Test Error
Full Least Squares Model 3111.26
Best Subsets Model 3095.48
Ridge Regression Model 3070.64
Lasso Model 2920.04
The mean prediction error for the lasso model is: 8.76
20th, 50th, and 80th quantiles of the errors:
20% 50% 80%
-39.755441 4.984314 57.448337
Excess kurtosis of the error distribution: -0.58
Conclusion
The best performing model under our simple validation set approach was the lasso model with 6 predictors. It produced a test MSE of 2920.04 and a mean prediction error of 8.76 (reported above), meaning that on average its predictions fell 8.76 below the actual value. This looks small, but upon further investigation of the errors we observed that the 20th quantile of the errors is -39.75, indicating that for 20% of patients the prediction exceeded the actual value by 39.75 or more. This means that the model would severely overpredict diabetes progression in 1 of 5 patients. This level of error is unacceptable for the purpose of predicting diabetes progression. Despite this, the analysis does indicate that diabetes progression is predictable.
Code
###LOADING THE DATA & DATA CHECKING###
rm(list = ls())
library(lars)
data("diabetes")
# Combine the standardized predictors and the response into one data frame
# (attach() is avoided since it can mask the x and y objects created below)
data.all = data.frame(cbind(diabetes$x, y = diabetes$y))
str(data.all)
summary(data.all)
anyNA(data.all)  # verify there are no missing values
###FORMATTING FOR MODEL BUILDING###
n <- dim(data.all)[1]          # number of observations (442)
set.seed(1306)                 # reproducible split
test <- sample(n, round(n/4))  # hold out a random 25% as the test set
data.train <- data.all[-test,]
data.test <- data.all[test,]
x <- model.matrix(y~ ., data = data.all)[,-1]
x.train <- x[-test,]
x.test <- x[test,]
y <- data.all$y
y.train <- y[-test]
y.test <- y[test]
n.train <- dim(data.train)[1]
n.test <- dim(data.test)[1]
###LEAST SQUARES LINEAR REGRESSION###
lm.full = lm(y~. , data = data.train)
round(coef(summary(lm.full)),2)
pred = predict(lm.full, data.test)
cat(paste0("The test MSE for the full linear regression model is: ",round(mean((y.test - pred)^2),2)))
plot(lm.full$fitted.values,resid(lm.full), xlab="Fitted Values", ylab="Residuals", main="Residual Plot", pch=21, col="blue")
abline(h=0.0, lwd=2, lty=2)
###BEST SUBSETS USING BIC###
library(leaps)
regfit.full = regsubsets(y~. , data = data.train , nvmax = 10)
reg.summary = summary(regfit.full)
plot(reg.summary$bic, xlab="Number of Variables", ylab="BIC", type="l")
bic.min.x = which.min(reg.summary$bic)  # best size; avoid overwriting x and y
bic.min.y = reg.summary$bic[bic.min.x]
points(bic.min.x, bic.min.y, col="red", cex=1, pch=19)
cat(paste0("The number of predictors selected by best subset selection is ",
           bic.min.x, " with a BIC of ", round(bic.min.y, 2)))
cat(names(coef(regfit.full,6)))
round(coef(summary(lm(y~sex+bmi+map+tc+tch+ltg, data = data.train))),2)
coefi = coef(regfit.full,id=6)
pred = coefi[1] + (x.test[, names(coefi[-1])]%*%coefi[-1])
cat(paste0("The test MSE for the best subsets model is: ",round(mean((y.test - pred)^2),2)))
###BEST SUBSETS USING 10-FOLD CROSS VALIDATION###
# Prediction helper for regsubsets objects (leaps provides no predict method)
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])  # recover the formula from the original call
  mat = model.matrix(form, newdata)    # design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the size-id model
  xvars = names(coefi)
  mat[, xvars] %*% coefi
}
k=10
set.seed(1306)
folds = sample(1:k, nrow(data.train),replace = TRUE)
cv.errors = matrix(NA,k,10, dimnames = list(NULL, paste(1:10)))
for (j in 1:k){
best.fit = regsubsets(y~. ,data=data.train[folds!=j,],nvmax = 10)
for (i in 1:10){
pred = predict.regsubsets(best.fit, data.train[folds==j,], id=i)
cv.errors[j,i] = mean((data.train$y[folds==j]-pred)^2)
}
}
mean.cv.errors = apply(cv.errors, 2, mean)
plot(mean.cv.errors, type="b", main = "Mean CV errors",xlab = "Number of Predictors",
ylab="Mean CV Errors")
cv.min.y = min(mean.cv.errors)   # again avoid overwriting x and y
cv.min.x = which.min(mean.cv.errors)
points(cv.min.x, cv.min.y, col="red", cex=1, pch=19)
regfit.cv = regsubsets(y~. , data = data.train , nvmax = 10)
coefi = coef(regfit.cv, id = cv.min.x)  # size chosen by CV (6 here)
pred = coefi[1] + (x.test[, names(coefi[-1])]%*%coefi[-1])
coefi
cat(paste0("The test MSE for the best subsets model using CV is: ",round(mean((y.test - pred)^2),2)))
###RIDGE REGRESSION USING CROSS VALIDATION###
library(glmnet)
grid = 10^seq(10,-2,length=100)
set.seed(1306)
cv.out = cv.glmnet(x.train, y.train, alpha = 0)
largelam = cv.out$lambda.1se
ridge.mod = glmnet(x.train, y.train, alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred = predict(ridge.mod, s=largelam, newx=x.test)
cat(paste0("The test MSE for the ridge regression model using CV is: ",round(mean((y.test - ridge.pred)^2),2)))
coef = glmnet(x.train, y.train, alpha = 0, lambda = largelam, thresh = 1e-12)$beta
coef.mat = as.matrix(coef)       # dense copy of the sparse coefficient vector
colnames(coef.mat) = "Coefficient"
coef.mat
###LASSO MODEL USING CROSS VALIDATION###
set.seed(1306)
cv.out = cv.glmnet(x.train, y.train, alpha = 1)
largelam = cv.out$lambda.1se
lasso.mod = glmnet(x.train, y.train, alpha = 1, lambda = largelam)
lasso.pred = predict(lasso.mod, s=largelam, newx = x.test)
cat(paste0("The test MSE for the lasso model using CV is: ",round(mean((y.test - lasso.pred)^2),2)))
coef = glmnet(x.train, y.train, alpha = 1, lambda = largelam)$beta
coef.mat = as.matrix(coef)       # dense copy for printing
colnames(coef.mat) = "Coefficient"
coef.mat
###RESULTS###
results_df = data.frame(Model = c("Full Least Squares Model", "Best Subsets Model",
                                  "Ridge Regression Model", "Lasso Model"),
                        "Test Error" = c(3111.26, 3095.48, 3070.64, 2920.04),
                        check.names = FALSE)
print(results_df, row.names=FALSE)
cat(paste0("The test MAE for the lasso model using CV is: ",round(mean((y.test - abs(lasso.pred))),2)))
cat(paste0("20th, 50th, and 80th quantiles of the errors:"))
quantile(y.test-lasso.pred,probs=c(.20,.50,.80))
library(e1071)
k = kurtosis(y.test - lasso.pred)  # e1071 reports excess kurtosis (0 for a normal)
k = round(k, 2)
cat(paste0("Excess kurtosis of the error distribution: ", k))