Model Building 2

Harold Nelson

11/5/2020

A Model Building Exercise

Can we predict how much weight people want to lose or gain?

The Data

We’ll use the cleaned version of the cdc dataset, cdc2.

Load the data.

load("cdc2.Rdata")

Packages

Make a few packages available.

library(broom)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(formula.tools)

Workflow

We’ll build a number of models, saving the results in an organized way. Finally, we’ll combine the key results in a dataframe and compare them.

Task 1

Splitting the data

We want to split our data into train and test. We will build the model on train and evaluate its performance on test. Randomly select about 75% of the data for train and 25% for test.

Follow the process in the Mount-Zumel course in Datacamp to create train and test.

Answer

print(N <- nrow(cdc2))
## [1] 19997
print(target <- round(.75 * N))
## [1] 14998
set.seed(123)
gp <- runif(N)
train = cdc2[gp < .75,]
test = cdc2[gp >= .75,]

nrow(train)
## [1] 15064
nrow(test)
## [1] 4933

Task 2

Model 1

Focus on predicting DesActRatio. For our first model, we will use only BMI as a predictor. Create lm1 using train. Then use tidy() and glance() to examine it. Store the formula in the glance dataframe, which you should name glance1.

Use lm1 to create the variable pred1 in the dataframe test.

Answer

fml1 = DesActRatio ~ BMI
lm1 = lm(fml1, data = train)
tidy1 = tidy(lm1)
tidy1$formula = as.character(fml1)
tidy1
## # A tibble: 2 × 6
##   term        estimate std.error statistic p.value formula          
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>            
## 1 (Intercept)   1.32    0.00324       408.       0 DesActRatio ~ BMI
## 2 BMI          -0.0151  0.000121     -125.       0 DesActRatio ~ BMI
glance1 = glance(lm1)
glance1$formula = as.character(fml1)
glance1
## # A tibble: 1 × 13
##   r.squared adj.r.squared  sigma statistic p.value    df logLik     AIC     BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1     0.509         0.509 0.0771    15590.       0     1 17230. -34454. -34431.
## # … with 4 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## #   formula <chr>
test$pred1 = predict(lm1,newdata = test)

Task 3

Create an R function RMSE() with parameters predicted and actual to compute the root mean squared error. We will use this to compare the performance of our models on the test data. This is the value sigma from the glance() output.

Use the function to compare the actual and predicted values from the test dataframe. save the result as test_RMSE in glance1.

Answer

RMSE = function(actual, predicted){
  
  Residual = actual - predicted
  result = sqrt(mean(Residual^2))
  return(result)
}

glance1$test_RMSE = RMSE(test$DesActRatio,test$pred1)
glance1
## # A tibble: 1 × 14
##   r.squared adj.r.squared  sigma statistic p.value    df logLik     AIC     BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1     0.509         0.509 0.0771    15590.       0     1 17230. -34454. -34431.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## #   formula <chr>, test_RMSE <dbl>

Our results for lm1 are very similar for train and test. Overfitted models show a significant drop in performance from train to test.

Task 4

Create a model lm2 by adding the categorical variable gender to lm1. Store the formula in glance2 as you did with Model 1

Answer

fml2 = DesActRatio ~ BMI + gender
lm2 = lm(fml2, data = train)
tidy2 = tidy(lm2)
tidy2$formula = as.character(fml2)
tidy2
## # A tibble: 3 × 6
##   term        estimate std.error statistic p.value formula                   
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>                     
## 1 (Intercept)   1.38    0.00300      461.        0 DesActRatio ~ BMI + gender
## 2 BMI          -0.0159  0.000108    -148.        0 DesActRatio ~ BMI + gender
## 3 genderf      -0.0730  0.00112      -65.3       0 DesActRatio ~ BMI + gender
glance2 = glance(lm2)
glance2$formula = as.character(fml2)
glance2
## # A tibble: 1 × 13
##   r.squared adj.r.squared  sigma statistic p.value    df logLik     AIC     BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1     0.617         0.617 0.0681    12139.       0     2 19110. -38211. -38181.
## # … with 4 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## #   formula <chr>

Task 5

Examine the performance of Model 2 on the test data using the function we created and add the results to glance2.

Answer

test$pred2 = predict(lm2,newdata = test)

glance2$test_RMSE = RMSE(test$DesActRatio,test$pred2)

Task 6

Repeat the entire exercise for a third model by adding genhlth. Do everything in one step.

Answer

fml3 = DesActRatio ~ BMI + gender + genhlth
lm3 = lm(fml3, data = train)
tidy3 = tidy(lm3)
tidy3$formula = as.character(fml3)
tidy3
## # A tibble: 7 × 6
##   term              estimate std.error statistic p.value formula                
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl> <chr>                  
## 1 (Intercept)       1.39      0.00308    450.    0       DesActRatio ~ BMI + ge…
## 2 BMI              -0.0160    0.000110  -146.    0       DesActRatio ~ BMI + ge…
## 3 genderf          -0.0733    0.00112    -65.5   0       DesActRatio ~ BMI + ge…
## 4 genhlthvery good -0.00444   0.00149     -2.98  0.00289 DesActRatio ~ BMI + ge…
## 5 genhlthgood      -0.000954  0.00156     -0.610 0.542   DesActRatio ~ BMI + ge…
## 6 genhlthfair       0.00552   0.00212      2.60  0.00935 DesActRatio ~ BMI + ge…
## 7 genhlthpoor       0.00753   0.00324      2.33  0.0201  DesActRatio ~ BMI + ge…
glance3 = glance(lm3)
glance3$formula = as.character(fml3)
test$pred3 = predict(lm3,newdata = test)

glance3$test_RMSE = RMSE(test$DesActRatio,test$pred3)



glance3
## # A tibble: 1 × 14
##   r.squared adj.r.squared  sigma statistic p.value    df logLik     AIC     BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1     0.618         0.618 0.0680     4061.       0     6 19127. -38239. -38178.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## #   formula <chr>, test_RMSE <dbl>

Task 7

Model 4 is similar to Model 3, but the third variable is ageCat instead of genhlth.

Repeat everything and produce glance4 for comparison.

Answer

fml4 = DesActRatio ~ BMI + gender + ageCat
lm4 = lm(fml4, data = train)
tidy4 = tidy(lm4)
tidy4$formula = as.character(fml4)
tidy4
## # A tibble: 6 × 6
##   term        estimate std.error statistic  p.value formula                     
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl> <chr>                       
## 1 (Intercept)  1.38     0.00303     456.   0        DesActRatio ~ BMI + gender …
## 2 BMI         -0.0158   0.000108   -146.   0        DesActRatio ~ BMI + gender …
## 3 genderf     -0.0733   0.00111     -65.9  0        DesActRatio ~ BMI + gender …
## 4 ageCat32-43 -0.00889  0.00154      -5.76 8.44e- 9 DesActRatio ~ BMI + gender …
## 5 ageCat44-57 -0.0118   0.00159      -7.43 1.13e-13 DesActRatio ~ BMI + gender …
## 6 ageCat58-99  0.00557  0.00157       3.54 3.94e- 4 DesActRatio ~ BMI + gender …
glance4 = glance(lm4)
glance4$formula = as.character(fml4)
test$pred4 = predict(lm4,newdata = test)

glance4$test_RMSE = RMSE(test$DesActRatio,test$pred4)



glance4
## # A tibble: 1 × 14
##   r.squared adj.r.squared  sigma statistic p.value    df logLik     AIC     BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1     0.621         0.621 0.0677     4935.       0     5 19186. -38358. -38305.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## #   formula <chr>, test_RMSE <dbl>

Task 8 - Model 5

Include gender, genhlth, and ageCat in the model. Rerun everything to produce glance5.

Answer

fml5 = DesActRatio ~ BMI + gender + ageCat + genhlth
lm5 = lm(fml5, data = train)
tidy5 = tidy(lm5)
tidy5$formula = as.character(fml5)
tidy5
## # A tibble: 10 × 6
##    term             estimate std.error statistic  p.value formula               
##    <chr>               <dbl>     <dbl>     <dbl>    <dbl> <chr>                 
##  1 (Intercept)       1.39     0.00311     446.   0        DesActRatio ~ BMI + g…
##  2 BMI              -0.0158   0.000110   -144.   0        DesActRatio ~ BMI + g…
##  3 genderf          -0.0735   0.00111     -65.9  0        DesActRatio ~ BMI + g…
##  4 ageCat32-43      -0.00890  0.00154      -5.77 8.04e- 9 DesActRatio ~ BMI + g…
##  5 ageCat44-57      -0.0121   0.00159      -7.63 2.53e-14 DesActRatio ~ BMI + g…
##  6 ageCat58-99       0.00470  0.00160       2.93 3.37e- 3 DesActRatio ~ BMI + g…
##  7 genhlthvery good -0.00502  0.00148      -3.38 7.31e- 4 DesActRatio ~ BMI + g…
##  8 genhlthgood      -0.00242  0.00157      -1.54 1.23e- 1 DesActRatio ~ BMI + g…
##  9 genhlthfair       0.00275  0.00214       1.28 1.99e- 1 DesActRatio ~ BMI + g…
## 10 genhlthpoor       0.00400  0.00327       1.23 2.20e- 1 DesActRatio ~ BMI + g…
glance5 = glance(lm5)
glance5$formula = as.character(fml5)
test$pred5 = predict(lm5,newdata = test)

glance5$test_RMSE = RMSE(test$DesActRatio,test$pred5)



glance5
## # A tibble: 1 × 14
##   r.squared adj.r.squared  sigma statistic p.value    df logLik     AIC     BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1     0.622         0.621 0.0677     2748.       0     9 19199. -38375. -38291.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## #   formula <chr>, test_RMSE <dbl>

Task 10 - Model 6

So far, we have always included gender as one of the categorical variables. In Model 6, just include genhlth and ageCat. Repeat all of the steps and produce glance6.

Answer

fml6 = DesActRatio ~ BMI + ageCat + genhlth
lm6= lm(fml6, data = train)
tidy6 = tidy(lm6)
tidy6$formula = as.character(fml6)
tidy6
## # A tibble: 9 × 6
##   term              estimate std.error statistic  p.value formula               
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl> <chr>                 
## 1 (Intercept)       1.33      0.00339    392.    0        DesActRatio ~ BMI + a…
## 2 BMI              -0.0149    0.000124  -120.    0        DesActRatio ~ BMI + a…
## 3 ageCat32-43      -0.00984   0.00175     -5.62  1.92e- 8 DesActRatio ~ BMI + a…
## 4 ageCat44-57      -0.0145    0.00180     -8.01  1.26e-15 DesActRatio ~ BMI + a…
## 5 ageCat58-99       0.000548  0.00182      0.301 7.63e- 1 DesActRatio ~ BMI + a…
## 6 genhlthvery good -0.00654   0.00169     -3.88  1.06e- 4 DesActRatio ~ BMI + a…
## 7 genhlthgood      -0.00533   0.00178     -3.00  2.75e- 3 DesActRatio ~ BMI + a…
## 8 genhlthfair      -0.00358   0.00243     -1.47  1.41e- 1 DesActRatio ~ BMI + a…
## 9 genhlthpoor      -0.00430   0.00371     -1.16  2.46e- 1 DesActRatio ~ BMI + a…
glance6 = glance(lm6)
glance6$formula = as.character(fml6)
test$pred6 = predict(lm6,newdata = test)

glance6$test_RMSE = RMSE(test$DesActRatio,test$pred6)



glance6
## # A tibble: 1 × 14
##   r.squared adj.r.squared  sigma statistic p.value    df logLik     AIC     BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1     0.512         0.512 0.0768     1977.       0     8 17288. -34555. -34479.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## #   formula <chr>, test_RMSE <dbl>

Task 11 - Compare All 6

Use rbind() to combine glance1,…, glance6 in one dataframe. Keep only the formula and the test perormance statistics. use arrange() to order the dataframe by test_RMSE. Which model is best?

Answer

all6 = rbind(glance1, glance2, glance3, glance4, glance5, glance6) %>% 
  select(formula, test_RMSE) %>% 
  arrange(test_RMSE)

all6
## # A tibble: 6 × 2
##   formula                                       test_RMSE
##   <chr>                                             <dbl>
## 1 DesActRatio ~ BMI + gender + ageCat + genhlth    0.0662
## 2 DesActRatio ~ BMI + gender + ageCat              0.0663
## 3 DesActRatio ~ BMI + gender + genhlth             0.0667
## 4 DesActRatio ~ BMI + gender                       0.0668
## 5 DesActRatio ~ BMI + ageCat + genhlth             0.0766
## 6 DesActRatio ~ BMI                                0.0770

Task 12 - Graphics

Graphical Evaluation Look at the relationship between predicted and actual values using ggplot2. Use the dataframe test with the predicted values from the best model. Use ggplot2 to make a scatterplot of the actual and predicted values. Map the predicted values to x.

Set alpha to a low value like .1 in your geom_point().

Add the line y = x using geom_abline(). Color it red.

Add a default smoother using geom_smooth(). Color it blue.

Note what you see.

Answer

test %>% 
  ggplot(aes(x = pred5, y = DesActRatio)) +
      geom_point(alpha=.1) +
      geom_abline(color = "red") +
      geom_smooth(color = "blue") + 
      ggtitle("Performance of Model 5")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'