To build predictive models for recommending strength training, I used several variables related to biometrics and exercise information as predictors (independent variables). The predicted variable (dependent variable) indicates whether strength training is recommended. These variables are recorded in the BRFSS survey conducted by the CDC. The data I am using includes 15 years of surveys, spanning 2000-2015.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(126)
#data = readRDS(file = 'exer_trainD.rds')
mydata <- readRDS(file = 'exer_data1.rds')
F1 <- createDataPartition(mydata$X_PAINDX1_F, p = 0.5, groups = 2)
data_sub <- mydata[F1$Resample1,]
F2 <- createDataPartition(data_sub$X_PAINDX1_F, p = 0.8, groups = 2)
#### Note: using EXRACT11 instead of EXRACT11_N ####
select <- c('X_PASTRNG_F_logic', 'X_HCVU651_F',
'cutPADUR', 'cutPAFRE', 'X_AGE_G',
'SEX', 'EXRACT11',
'X_BMI5CAT')
# indices of the selected columns
select_col <- which(names(mydata) %in% select)
train <- data_sub[F2$Resample1,select_col]
test <- data_sub[-F2$Resample1,select_col]
y <- 'X_PASTRNG_F_logic'
x <- select[2:8]
fmla <- paste(y, paste(x, collapse = "+"), sep = '~')
#model_glm <- glm(fmla, data = train, family = binomial(link = 'logit'))
model_glm <- readRDS('model_glm.rds')
By looking at the coefficients of the glm model, I found several variables positively correlated with a strength training recommendation (TRUE):

* No health care coverage
* Almost all age groups
* BMI categories 3 and 4 (BMI > 25)
* Several cardio exercise types, which are clearly not good indicators of existing strength training

Other variables are negatively correlated with the strength training recommendation.
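Since the glm coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to compare. A minimal sketch using the fitted model loaded above:
# Odds ratios: values above 1 indicate a positive association with a
# TRUE strength training recommendation
odds_ratios <- sort(exp(coef(model_glm)), decreasing = TRUE)
head(odds_ratios, 10)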
model <- model_glm
model$coefficients
## (Intercept) X_HCVU651_FTRUE cutPADUR(30,45]
## 1.48167176 -0.15002566 -0.30634262
## cutPADUR(45,60] cutPADUR(60,270] cutPAFRE(2e+03,3e+03]
## -0.26809749 -0.32566458 -0.92230742
## cutPAFRE(3e+03,5e+03] cutPAFRE(5e+03,9.8e+04] X_AGE_G2
## -1.26211744 -1.29531766 0.18586173
## X_AGE_G3 X_AGE_G4 X_AGE_G5
## 0.26556824 0.24197937 0.30243729
## SEX2 EXRACT112 EXRACT113
## 0.20178762 -1.22118783 -0.14708597
## EXRACT114 EXRACT115 EXRACT116
## 0.44692269 -0.29068407 -0.77485529
## EXRACT117 EXRACT118 EXRACT119
## -0.04825295 -0.14665405 1.01552114
## EXRACT1110 EXRACT1111 EXRACT1112
## -1.49579103 -1.66453414 -0.03863771
## EXRACT1113 EXRACT1114 EXRACT1115
## 1.22489944 -0.14768059 -1.12828286
## EXRACT1116 EXRACT1117 EXRACT1118
## 0.57044948 0.28778215 1.01513294
## EXRACT1119 EXRACT1120 EXRACT1121
## 0.33457943 0.02988609 -0.62073571
## EXRACT1122 EXRACT1123 EXRACT1124
## 0.07538013 -0.12407425 0.36221737
## EXRACT1125 EXRACT1126 EXRACT1127
## 0.96537989 0.62590486 0.25774489
## EXRACT1128 EXRACT1129 EXRACT1130
## -0.72941574 -1.32199570 -0.34584174
## EXRACT1131 EXRACT1132 EXRACT1133
## 0.72229220 -1.24625707 0.49249585
## EXRACT1134 EXRACT1135 EXRACT1136
## -1.70386577 -0.04242175 0.44706298
## EXRACT1137 EXRACT1138 EXRACT1139
## -0.91846455 -0.98736391 -1.13590254
## EXRACT1140 EXRACT1141 EXRACT1142
## -1.28713517 -1.41344556 1.63777786
## EXRACT1143 EXRACT1144 EXRACT1145
## -0.02564600 0.09356304 0.67597457
## EXRACT1146 EXRACT1147 EXRACT1148
## 0.24016301 0.16594798 0.58556310
## EXRACT1149 EXRACT1150 EXRACT1151
## -0.17631662 -0.06111520 -0.06142010
## EXRACT1152 EXRACT1153 EXRACT1154
## 0.20517486 -0.65665443 -0.67875623
## EXRACT1155 EXRACT1156 EXRACT1157
## 8.83769595 -0.35053753 0.13324127
## EXRACT1158 EXRACT1159 EXRACT1160
## -0.24092182 0.26786524 -0.85083321
## EXRACT1161 EXRACT1162 EXRACT1163
## -0.07747809 -0.35215165 0.01795210
## EXRACT1164 EXRACT1165 EXRACT1166
## 0.40595664 -10.46996793 -0.04971758
## EXRACT1167 EXRACT1168 EXRACT1169
## -2.56966504 -1.20406596 -1.98536150
## EXRACT1170 EXRACT1171 EXRACT1172
## -0.70653477 0.99577761 0.89811125
## EXRACT1173 EXRACT1174 EXRACT1175
## 1.19190719 -1.47703143 -1.38959372
## EXRACT1176 EXRACT1198 X_BMI5CAT2
## 0.91044393 -1.10688729 -0.15660556
## X_BMI5CAT3 X_BMI5CAT4
## 0.03565510 0.29541320
train$pred <- predict(model, newdata = train, type = "response")
test$pred <- predict(model, newdata = test, type = "response")
The density plot shows that the positive and negative predictions are not well separated. Picking a threshold seems tricky: a low threshold will introduce many false positives, and a high threshold will introduce many false negatives.
library(ggplot2)
ggplot(train, aes(x= pred, color = X_PASTRNG_F_logic, linetype=X_PASTRNG_F_logic)) +
geom_density()
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## Warning: Removed 1 rows containing missing values (geom_path).
The above figure illustrates the trade-off between precision and recall. In the dataset, which includes 15 years of surveys, the 'prevalence' of a strength training recommendation is 33.4%. It might be rational to use 0.334 or slightly higher as the threshold when interpreting the model output. Even so, the precision of the glm model is pretty low (a quick check below makes this concrete), so I will try another type of classification model to see whether I can get higher precision.
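As a quick check, precision and recall at the prevalence-based cutoff can be computed from the training predictions (a sketch; 0.334 is the threshold suggested above):
thresh <- 0.334
pred_class <- train$pred > thresh
ctab <- table(pred = pred_class, actual = train$X_PASTRNG_F_logic)
precision_glm <- ctab[2, 2] / sum(ctab[2, ])  # TP / (TP + FP)
recall_glm <- ctab[2, 2] / sum(ctab[, 2])     # TP / (TP + FN)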
Building a random forest model in R can be very slow. Here I use tuneRF() to find the best parameter, specifically mtry, which controls how many features are considered when building each individual tree in the random forest. I also use the MET value as a proxy for the exercise type feature (EXRACT11), since there are more than 60 categories of exercise type and randomForest cannot handle categorical predictors with more than 53 categories (there would be far too many dummy variables).
## use grid search with cross-validation to tune the mtry parameter
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(999)
# mtry = sqrt(length(select))
# tunegrid <- expand.grid(.mtry=mtry)
trCtrl <- trainControl(method = 'cv', number = 3, summaryFunction=defaultSummary)
Grid <- expand.grid(.mtry=c(1:7))
select <- c('X_PASTRNG_F_logic', 'X_HCVU651_F',
'cutPADUR', 'cutPAFRE', 'X_AGE_G',
'SEX', 'met',
'X_BMI5CAT')
select_col <- which(names(mydata) %in% select)
train <- data_sub[F2$Resample1,select_col]
test <- data_sub[-F2$Resample1,select_col]
train <- train[complete.cases(train), ]
test <- test[complete.cases(test),]
select2 <- c('X_HCVU651_F',
'cutPADUR', 'cutPAFRE', 'X_AGE_G',
'SEX', 'met', 'X_BMI5CAT')
train$X_PASTRNG_F_logic <- as.factor(train$X_PASTRNG_F_logic)
test$X_PASTRNG_F_logic <- as.factor(test$X_PASTRNG_F_logic)
#fit <- train(X_PASTRNG_F_logic ~ ., data = train, method = 'rf', trControl = trCtrl, tuneGrid = Grid)
tune.rf <- readRDS('tunerf.rds')
plot(tune.rf)
#tune.rf <- tuneRF( train[,select2], train$X_PASTRNG_F_logic, stepFactor=2)
best.rf <- readRDS('rf-model.rds')
# fit <- randomForest(X_PASTRNG_F_logic ~ .,
#                     data = train,
#                     mtry = 2)
# best.rf <- fit
tuneRF() reports the best parameter (mtry = 2), the value at which the out-of-bag error is smallest (mtry is the number of variables sampled as split candidates at each node). This error rate is still pretty high, but we may want to keep the model for further comparison.
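The out-of-bag error can also be read off the fitted forest directly (assuming best.rf is a randomForest object, as saved above):
# Last row of err.rate: OOB and per-class error after all trees are grown
tail(best.rf$err.rate, 1)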
varImpPlot(best.rf)
pred <- predict(best.rf, newdata = test[,select2])
ctab.test = table(pred, test$X_PASTRNG_F_logic)
precision <- ctab.test[2,2]/sum(ctab.test[2,])
print(precision)
## [1] 0.7339967
This random forest model achieves about 73% precision on the test set.
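Precision alone can hide poor recall, so it is worth reading both off the same confusion table:
# Recall (sensitivity) from the confusion table computed above
recall <- ctab.test[2, 2] / sum(ctab.test[, 2])
print(recall)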
select <- c('X_PASTRNG_F_logic', 'X_HCVU651_F',
'cutPADUR', 'cutPAFRE', 'X_AGE_G',
'SEX', 'met',
'X_BMI5CAT')
select_col <- which(names(mydata) %in% select)
train <- data_sub[F2$Resample1, select_col]
test <- data_sub[-F2$Resample1, select_col]
train <- train[complete.cases(train), ]
test <- test[complete.cases(test),]
train$X_PASTRNG_F_logic <- as.factor(train$X_PASTRNG_F_logic)
test$X_PASTRNG_F_logic <- as.factor(test$X_PASTRNG_F_logic)
set.seed(999)
trCtrl <- trainControl(method = 'cv',
number = 3, summaryFunction=defaultSummary)
Grid <- expand.grid(cp = seq(0, 0.03, 0.01))
# fit.cp <- train(X_PASTRNG_F_logic~., data = train,
# method = 'rpart', trControl = trCtrl,
# tuneGrid = Grid)
fit.cp <- readRDS('rpart-cp.rds')
plot(fit.cp)
pred <- predict(fit.cp, newdata = test)
## Loading required package: rpart
ctab.test <- table(pred, test$X_PASTRNG_F_logic)
precision <- ctab.test[2,2]/sum(ctab.test[2,])
print(precision)
## [1] 0.735177
The plot of fit.cp shows that the model reaches its highest accuracy when the complexity parameter = 0.
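The selected value can be confirmed on the caret object itself, which stores the winning tuning parameters:
# Winning complexity parameter from cross-validation
fit.cp$bestTune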
set.seed(999)
Grid2 <- expand.grid(.maxdepth = seq(0, 10, 5))
# fit.rpart2 <- train(X_PASTRNG_F_logic~., data = train,
# method = 'rpart2', trControl = trCtrl,
# tuneGrid = Grid2)
fit.rpart2 <- readRDS('rpart-depth.rds')
plot(fit.rpart2)
pred <- predict(fit.rpart2, newdata = test)
ctab.test <- table(pred, test$X_PASTRNG_F_logic)
precision <- ctab.test[2,2]/sum(ctab.test[2,])
print(precision)
## [1] 0.7394846
The model has the best accuracy when the maximum tree depth = 10.
The next model is the Gradient Boosting Machine (GBM), which has won several Kaggle challenges. (“If linear regression was a Toyota Camry, then gradient boosting would be a UH-60 Blackhawk Helicopter”; see this article.) The cross-validation shows that the model reaches its highest accuracy when n.trees = 85.
trCtl <- trainControl(method = 'cv', number = 3,
summaryFunction=defaultSummary)
Grid <- expand.grid( n.trees = seq(5, 100, 20),
interaction.depth = c(10),
shrinkage = c(0.1), n.minobsinnode = 20)
# fit.gbm <- train(X_PASTRNG_F_logic~., data = train,
# method = 'gbm', trControl = trCtl, tuneGrid = Grid)
fit.gbm <- readRDS('rpart-gbm.rds')
plot(fit.gbm)
pred <- predict(fit.gbm, newdata = test)
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
ctab.test <- table(pred, test$X_PASTRNG_F_logic)
precision <- ctab.test[2,2]/sum(ctab.test[2,])
print(precision)
## [1] 0.7365648
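caret's confusionMatrix() bundles precision with the other standard metrics in one call (a sketch; assumes a caret version that supports mode = 'prec_recall'):
# Treat 'TRUE' (recommend strength training) as the positive class
confusionMatrix(pred, test$X_PASTRNG_F_logic,
                positive = 'TRUE', mode = 'prec_recall')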
compare <- resamples(list(ClassificationTree1 = fit.cp,
ClassificationTree2 = fit.rpart2,
GradientBoosting = fit.gbm))
bwplot(compare)
The plot shows that GBM performs slightly better than the other two classification tree models.
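A numeric summary of the same resampling results complements the boxplot:
# Distribution of cross-validated Accuracy/Kappa for each model
summary(compare)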
Now I choose GBM as the final model, and I use createFolds() to generate cross-validation samples for evaluating the out-of-sample error.
pred1 <- predict(fit.gbm, test, n.trees = 85)
set.seed(888)
fld <- createFolds(train$X_PASTRNG_F_logic, k = 6)
Est.error <- rep(0, 6)
for (k in 1:6) {
  # hold out one fold; use a new name so the real test set is not clobbered
  fold <- train[unlist(fld[k]), ]
  a.pred <- predict(fit.gbm, fold, n.trees = 85)
  accur <- sum(as.numeric(a.pred) == as.numeric(fold$X_PASTRNG_F_logic)) / length(a.pred)
  Est.error[k] <- 1 - accur  # misclassification rate on this fold
}
Estimation <- mean(Est.error)*100
paste0("Out of sample error estimation: ", round(Estimation, digits = 2), "%")
## [1] "Out of sample error estimation: 27.52%"
library(knitr)
library(dplyr)  # provides %>%, mutate(), and arrange()
imp <- varImp(fit.gbm)
col_index <- imp$importance %>%
  mutate(var_name = row.names(.)) %>%
  arrange(-Overall)
kable(col_index[1:10,])
| Overall | var_name |
|---|---|
| 100.000000 | met35 |
| 34.789323 | cutPAFRE(3e+03,5e+03] |
| 34.673279 | cutPAFRE(2e+03,3e+03] |
| 30.324468 | cutPAFRE(5e+03,9.8e+04] |
| 27.148363 | met50 |
| 8.448563 | met68 |
| 7.842101 | X_BMI5CAT4 |
| 5.566063 | met40 |
| 4.735028 | SEX2 |
| 4.267126 | X_AGE_G5 |
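The same top-10 importances can also be plotted directly from col_index (a sketch using ggplot2, which is already loaded):
ggplot(col_index[1:10, ], aes(x = reorder(var_name, Overall), y = Overall)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  labs(x = NULL, y = 'Relative importance (GBM)')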