To build predictive models for the recommendation of strength training, I utilized several variables related to biometrics and exercise information (independent variables). The predicted variable (dependent variable) is a variable about recommendation on strength training. These varaibles are recorded in BRFSS survey by CDC. The data I am using includes 15 years of surveys during 2000-2015.

Building a logistic regression model

Using biometrics and exercise information to build a naive logistic model
    library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
    set.seed(126)
    
    #data = readRDS(file = 'exer_trainD.rds')
    mydata <- readRDS(file = 'exer_data1.rds')
    F1 <- createDataPartition(mydata$X_PAINDX1_F, p = 0.5, groups = 2)
    data_sub <- mydata[F1$Resample1,]
    F2 <- createDataPartition(data_sub$X_PAINDX1_F, p = 0.8, groups = 2)
    
    ####I changed EXRACT11_N ###
    select <- c('X_PASTRNG_F_logic', 'X_HCVU651_F', 
               'cutPADUR', 'cutPAFRE', 'X_AGE_G',
               'SEX',  'EXRACT11',
               'X_BMI5CAT')
    
    select_col <- c()
    for (i in 1:length(names(mydata))) {
        if (names(mydata)[i] %in% select) {
            select_col = c(select_col, i)
        }
    }
    
    train <- data_sub[F2$Resample1,select_col]
    test <- data_sub[-F2$Resample1,select_col]
    y <- 'X_PASTRNG_F_logic'
    x <- select[2:8]
    fmla <- paste(y, paste(x, collapse = "+"), sep = '~')
    #model_glm <- glm(fmla, data = train, family = binomial(link = 'logit'))
    model_glm <- readRDS('model_glm.rds')
Extracting advice from logistic model

By looking at coefficients of the glm model, I found several variabes positively correlated with the strength training recommendation (TRUE). These variables include: * No health care coverage * Almost all age groups * BMI category 3 and 4: BMI > 25 * Cardio exercise: they are clearly not good indication of strength training.

Other variables are negatively correlated with the cardio recommendation.

model <- model_glm
model$coefficients
##             (Intercept)         X_HCVU651_FTRUE         cutPADUR(30,45] 
##              1.48167176             -0.15002566             -0.30634262 
##         cutPADUR(45,60]        cutPADUR(60,270]   cutPAFRE(2e+03,3e+03] 
##             -0.26809749             -0.32566458             -0.92230742 
##   cutPAFRE(3e+03,5e+03] cutPAFRE(5e+03,9.8e+04]                X_AGE_G2 
##             -1.26211744             -1.29531766              0.18586173 
##                X_AGE_G3                X_AGE_G4                X_AGE_G5 
##              0.26556824              0.24197937              0.30243729 
##                    SEX2               EXRACT112               EXRACT113 
##              0.20178762             -1.22118783             -0.14708597 
##               EXRACT114               EXRACT115               EXRACT116 
##              0.44692269             -0.29068407             -0.77485529 
##               EXRACT117               EXRACT118               EXRACT119 
##             -0.04825295             -0.14665405              1.01552114 
##              EXRACT1110              EXRACT1111              EXRACT1112 
##             -1.49579103             -1.66453414             -0.03863771 
##              EXRACT1113              EXRACT1114              EXRACT1115 
##              1.22489944             -0.14768059             -1.12828286 
##              EXRACT1116              EXRACT1117              EXRACT1118 
##              0.57044948              0.28778215              1.01513294 
##              EXRACT1119              EXRACT1120              EXRACT1121 
##              0.33457943              0.02988609             -0.62073571 
##              EXRACT1122              EXRACT1123              EXRACT1124 
##              0.07538013             -0.12407425              0.36221737 
##              EXRACT1125              EXRACT1126              EXRACT1127 
##              0.96537989              0.62590486              0.25774489 
##              EXRACT1128              EXRACT1129              EXRACT1130 
##             -0.72941574             -1.32199570             -0.34584174 
##              EXRACT1131              EXRACT1132              EXRACT1133 
##              0.72229220             -1.24625707              0.49249585 
##              EXRACT1134              EXRACT1135              EXRACT1136 
##             -1.70386577             -0.04242175              0.44706298 
##              EXRACT1137              EXRACT1138              EXRACT1139 
##             -0.91846455             -0.98736391             -1.13590254 
##              EXRACT1140              EXRACT1141              EXRACT1142 
##             -1.28713517             -1.41344556              1.63777786 
##              EXRACT1143              EXRACT1144              EXRACT1145 
##             -0.02564600              0.09356304              0.67597457 
##              EXRACT1146              EXRACT1147              EXRACT1148 
##              0.24016301              0.16594798              0.58556310 
##              EXRACT1149              EXRACT1150              EXRACT1151 
##             -0.17631662             -0.06111520             -0.06142010 
##              EXRACT1152              EXRACT1153              EXRACT1154 
##              0.20517486             -0.65665443             -0.67875623 
##              EXRACT1155              EXRACT1156              EXRACT1157 
##              8.83769595             -0.35053753              0.13324127 
##              EXRACT1158              EXRACT1159              EXRACT1160 
##             -0.24092182              0.26786524             -0.85083321 
##              EXRACT1161              EXRACT1162              EXRACT1163 
##             -0.07747809             -0.35215165              0.01795210 
##              EXRACT1164              EXRACT1165              EXRACT1166 
##              0.40595664            -10.46996793             -0.04971758 
##              EXRACT1167              EXRACT1168              EXRACT1169 
##             -2.56966504             -1.20406596             -1.98536150 
##              EXRACT1170              EXRACT1171              EXRACT1172 
##             -0.70653477              0.99577761              0.89811125 
##              EXRACT1173              EXRACT1174              EXRACT1175 
##              1.19190719             -1.47703143             -1.38959372 
##              EXRACT1176              EXRACT1198              X_BMI5CAT2 
##              0.91044393             -1.10688729             -0.15660556 
##              X_BMI5CAT3              X_BMI5CAT4 
##              0.03565510              0.29541320

Make predictions and evaluate predictions

predict
train$pred <- predict(model, newdata = train, type = "response")
test$pred <- predict(model, newdata = test, type = "response")
characterize prediction quality

The density plot shows that the positive and negative prediction are not well-separated. Picking up a thresold seems tricky. A low thresould will introduce lots of false positive and a high thresold will introduce lots of false negative.

library(ggplot2)
ggplot(train, aes(x= pred, color = X_PASTRNG_F_logic, linetype=X_PASTRNG_F_logic)) +
    geom_density()

Model trade-offs
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
## Warning: Removed 1 rows containing missing values (geom_path).

Pick a threshold

The above figre illustrates the trade-off between precision and recall. In the dataset which incldues 15 years surveys, the ‘prevalence’ of giving cardio exercise recommendation is 33.4%. It might be rational to use 0.334 or a bit higher as the threshold to interpret the model outcome. The precision of the glm model is pretty low. Now I choose to try another type of classification model to see if I can get a higher precision.

Building random forest model

Building RF model in r can be very slow. Here I use turnrf() to find the best parameter, specifcially mtry. mtry is the parameter specify how many features to use when building each individual tree in the random forest. I also use MET value as a proxy to replace the type of exercise feature (EXRACT11), since there are more than 60 categories of exercise type and random forest cannot handle variables with more than 50 categories (there are way too many dummy variables).

## use cvgrid search to tune the mtry paramter
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(999)
# mtry = sqrt(length(select))
# tunegrid <- expand.grid(.mtry=mtry)
trCtrl <- trainControl(method = 'cv', number = 3, summaryFunction=defaultSummary)
Grid <- expand.grid(.mtry=c(1:7))

select <- c('X_PASTRNG_F_logic', 'X_HCVU651_F', 
               'cutPADUR', 'cutPAFRE', 'X_AGE_G',
               'SEX',  'met',
               'X_BMI5CAT')
    
    select_col <- c()
    for (i in 1:length(names(mydata))) {
        if (names(mydata)[i] %in% select) {
            select_col = c(select_col, i)
        }
    }
    
    train <- data_sub[F2$Resample1,select_col]
    test <- data_sub[-F2$Resample1,select_col]

train <- train[complete.cases(train), ]
test <- test[complete.cases(test),]

select2 <- c('X_HCVU651_F', 
               'cutPADUR', 'cutPAFRE', 'X_AGE_G',
               'SEX',  'met', 'X_BMI5CAT')

train$X_PASTRNG_F_logic <- as.factor(train$X_PASTRNG_F_logic)
test$X_PASTRNG_F_logic <- as.factor(test$X_PASTRNG_F_logic)    

#fit <- train(train$X_PASTRNG_F_logic~., data = train, method = 'rf', trControl = trCtrl,tuneGrid = Grid)
# plot(fit.cp)
tune.rf <- readRDS('tunerf.rds')
plot(tune.rf)

#tune.rf <- tuneRF( train[,select2], train$X_PASTRNG_F_logic, stepFactor=2)
best.rf <- readRDS('rf-model.rds')
# fit <- randomForest(train$X_PASTRNG_F_logic~., 
#              data = train, 
#             mtry = 2)
# best.rf <- fit

turnRF() tells me about the best paramter (mtry = 2) when the out of bag error is the smallest. (mtry is the number of variables selected each time for building the tree). This error rate is still pretty high, but we may want to keep the model for further comparison.

varImpPlot(best.rf)

pred <- predict(best.rf, newdata = test[,select2])
ctab.test = table(pred, test$X_PASTRNG_F_logic)
precision <- ctab.test[2,2]/sum(ctab.test[2,])
print(precision)
## [1] 0.7339967

This random forest model can achieve 73% precision.

Classfication tree model

Tuning complexity parameter
select = c('X_PASTRNG_F_logic', 'X_HCVU651_F', 
               'cutPADUR', 'cutPAFRE', 'X_AGE_G',
               'SEX',  'met',
               'X_BMI5CAT')
    
    select_col = c()
    for (i in 1:length(names(mydata))) {
        if (names(mydata)[i] %in% select) {
            select_col = c(select_col, i)
        }
    }
    
    train = data_sub[F2$Resample1,select_col]
    test = data_sub[-F2$Resample1,select_col]

train <- train[complete.cases(train), ]
test <- test[complete.cases(test),]
train$X_PASTRNG_F_logic <- as.factor(train$X_PASTRNG_F_logic)
test$X_PASTRNG_F_logic <- as.factor(test$X_PASTRNG_F_logic)   

set.seed(999)
trCtrl <- trainControl(method = 'cv', 
                       number = 3, summaryFunction=defaultSummary)
Grid <- expand.grid(cp = seq(0, 0.03, 0.01))
# fit.cp <- train(X_PASTRNG_F_logic~., data = train,
#                 method = 'rpart', trControl = trCtrl,
#                 tuneGrid = Grid)
fit.cp <- readRDS('rpart-cp.rds')
plot(fit.cp)

pred <- predict(fit.cp, newdata = test)
## Loading required package: rpart
ctab.test <- table(pred, test$X_PASTRNG_F_logic)
precision <- ctab.test[2,2]/sum(ctab.test[2,])
print(precision)
## [1] 0.735177

The plot on the fit.cp shows that the model reaches the highest accuracy when complexity paramter = 0

Tuning tree depth paramter
set.seed(999) 
Grid2 <- expand.grid(.maxdepth = seq(0, 10, 5)) 
# fit.rpart2 <- train(X_PASTRNG_F_logic~., data = train,
#                     method = 'rpart2', trControl = trCtrl,
#                     tuneGrid = Grid2)
fit.rpart2 <- readRDS('rpart-depth.rds')
plot(fit.rpart2)

pred <- predict(fit.rpart2, newdata = test)
ctab.test <- table(pred, test$X_PASTRNG_F_logic)
precision <- ctab.test[2,2]/sum(ctab.test[2,])
print(precision)
## [1] 0.7394846

The model has the best accuracy when max tree depth = 10

GBM

The next model is Gradient Boosting Model (GBM). GBM won several kaggle challenges. (“If linear regression was a Toyota Camry, then gradient boosting would be a UH-60 Blackhawk Helicopter” see this article) The cross validation shows that the model reaches its highest accuracy when n.trees = 85.

trCtl <- trainControl(method = 'cv', number = 3, 
                      summaryFunction=defaultSummary)
Grid <- expand.grid( n.trees = seq(5, 100, 20), 
                     interaction.depth = c(10), 
                     shrinkage = c(0.1), n.minobsinnode = 20)
# fit.gbm <- train(X_PASTRNG_F_logic~., data = train,
#                  method = 'gbm', trControl = trCtl, tuneGrid = Grid)
fit.gbm <- readRDS('rpart-gbm.rds')
plot(fit.gbm)

pred <- predict(fit.gbm, newdata = test)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
ctab.test <- table(pred, test$X_PASTRNG_F_logic)
precision <- ctab.test[2,2]/sum(ctab.test[2,])
print(precision)
## [1] 0.7365648

Evaluate on different models

compare <- resamples(list(ClassificationTree1 = fit.cp,
                         ClassificationTree2 = fit.rpart2,
                         GradientBoosting = fit.gbm))
bwplot(compare)

The plot shows that gbm performs slighly better than other two classification tree models.

Make predictions

Now I choose GBM as the final model, and I want to use createFolds() to generate cross-validation samples for the evalution of out-of-sample errors.

pred1 <- predict(fit.gbm, test, n.trees = 85)

set.seed(888)


fld <- createFolds(train$X_PASTRNG_F_logic, k = 6)
Est.error <- rep(0, 6)
##outOfSampleError.accuracy <- sum(predictions == testValidateData$classe)/length(predictions)
for (k in 1:6) {
    test = train[unlist(fld[k]), ]
    a.pred = predict(fit.gbm, n.trees = 85, test)
    accur = sum(as.numeric(a.pred) == as.numeric(test$X_PASTRNG_F_logic))/length(a.pred)
    error = 1 - accur
    Est.error[k] = error
}
Estimation <- mean(Est.error)*100
paste0("Out of sample error estimation: ", round(Estimation, digits = 2), "%")
## [1] "Out of sample error estimation: 27.52%"
library(knitr)
imp <- varImp(fit.gbm)
col_index <- imp$importance %>% 
  mutate(var_name=row.names(.)) %>%
  arrange(-Overall)
kable(col_index[1:10,])
Overall var_name
100.000000 met35
34.789323 cutPAFRE(3e+03,5e+03]
34.673279 cutPAFRE(2e+03,3e+03]
30.324468 cutPAFRE(5e+03,9.8e+04]
27.148363 met50
8.448563 met68
7.842101 X_BMI5CAT4
5.566063 met40
4.735028 SEX2
4.267126 X_AGE_G5