Harold Nelson
11/5/2020
Can we predict how much weight people want to lose or gain?
We’ll use the cleaned version of the cdc dataset, cdc2.
Make a few packages available.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We’ll build a number of models, saving the results in an organized way. Finally, we’ll combine the key results in a dataframe and compare them.
Splitting the data
We want to split our data into train and test. We will build the model on train and evaluate its performance on test. Randomly select about 75% of the data for train and 25% for test.
Follow the process in the Mount-Zumel course in Datacamp to create train and test.
## [1] 19997
## [1] 14998
## [1] 15064
## [1] 4933
Model 1
Focus on predicting DesActRatio. For our first model, we will use only BMI as a predictor. Create lm1 using train. Then use tidy() and glance() to examine it. Store the formula in the glance dataframe, which you should name glance1.
Use lm1 to create the variable pred1 in the dataframe test.
fml1 = DesActRatio ~ BMI
lm1 = lm(fml1, data = train)
tidy1 = tidy(lm1)
tidy1$formula = as.character(fml1)
tidy1
## # A tibble: 2 × 6
## term estimate std.error statistic p.value formula
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 1.32 0.00324 408. 0 DesActRatio ~ BMI
## 2 BMI -0.0151 0.000121 -125. 0 DesActRatio ~ BMI
## # A tibble: 1 × 13
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.509 0.509 0.0771 15590. 0 1 17230. -34454. -34431.
## # … with 4 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # formula <chr>
Create an R function RMSE() with parameters predicted and actual to compute the root mean squared error. We will use this to compare the performance of our models on the test data. This is the value sigma from the glance() output.
Use the function to compare the actual and predicted values from the test dataframe. save the result as test_RMSE in glance1.
RMSE = function(actual, predicted){
Residual = actual - predicted
result = sqrt(mean(Residual^2))
return(result)
}
glance1$test_RMSE = RMSE(test$DesActRatio,test$pred1)
glance1
## # A tibble: 1 × 14
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.509 0.509 0.0771 15590. 0 1 17230. -34454. -34431.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # formula <chr>, test_RMSE <dbl>
Our results for lm1 are very similar for train and test. Overfitted models show a significant drop in performance from train to test.
Create a model lm2 by adding the categorical variable gender to lm1. Store the formula in glance2 as you did with Model 1
fml2 = DesActRatio ~ BMI + gender
lm2 = lm(fml2, data = train)
tidy2 = tidy(lm2)
tidy2$formula = as.character(fml2)
tidy2
## # A tibble: 3 × 6
## term estimate std.error statistic p.value formula
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 1.38 0.00300 461. 0 DesActRatio ~ BMI + gender
## 2 BMI -0.0159 0.000108 -148. 0 DesActRatio ~ BMI + gender
## 3 genderf -0.0730 0.00112 -65.3 0 DesActRatio ~ BMI + gender
## # A tibble: 1 × 13
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.617 0.617 0.0681 12139. 0 2 19110. -38211. -38181.
## # … with 4 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # formula <chr>
Examine the performance of Model 2 on the test data using the function we created and add the results to glance2.
Repeat the entire exercise for a third model by adding genhlth. Do everything in one step.
fml3 = DesActRatio ~ BMI + gender + genhlth
lm3 = lm(fml3, data = train)
tidy3 = tidy(lm3)
tidy3$formula = as.character(fml3)
tidy3
## # A tibble: 7 × 6
## term estimate std.error statistic p.value formula
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 1.39 0.00308 450. 0 DesActRatio ~ BMI + ge…
## 2 BMI -0.0160 0.000110 -146. 0 DesActRatio ~ BMI + ge…
## 3 genderf -0.0733 0.00112 -65.5 0 DesActRatio ~ BMI + ge…
## 4 genhlthvery good -0.00444 0.00149 -2.98 0.00289 DesActRatio ~ BMI + ge…
## 5 genhlthgood -0.000954 0.00156 -0.610 0.542 DesActRatio ~ BMI + ge…
## 6 genhlthfair 0.00552 0.00212 2.60 0.00935 DesActRatio ~ BMI + ge…
## 7 genhlthpoor 0.00753 0.00324 2.33 0.0201 DesActRatio ~ BMI + ge…
glance3 = glance(lm3)
glance3$formula = as.character(fml3)
test$pred3 = predict(lm3,newdata = test)
glance3$test_RMSE = RMSE(test$DesActRatio,test$pred3)
glance3
## # A tibble: 1 × 14
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.618 0.618 0.0680 4061. 0 6 19127. -38239. -38178.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # formula <chr>, test_RMSE <dbl>
Model 4 is similar to Model 3, but the third variable is ageCat instead of genhlth.
Repeat everything and produce glance4 for comparison.
fml4 = DesActRatio ~ BMI + gender + ageCat
lm4 = lm(fml4, data = train)
tidy4 = tidy(lm4)
tidy4$formula = as.character(fml4)
tidy4
## # A tibble: 6 × 6
## term estimate std.error statistic p.value formula
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 1.38 0.00303 456. 0 DesActRatio ~ BMI + gender …
## 2 BMI -0.0158 0.000108 -146. 0 DesActRatio ~ BMI + gender …
## 3 genderf -0.0733 0.00111 -65.9 0 DesActRatio ~ BMI + gender …
## 4 ageCat32-43 -0.00889 0.00154 -5.76 8.44e- 9 DesActRatio ~ BMI + gender …
## 5 ageCat44-57 -0.0118 0.00159 -7.43 1.13e-13 DesActRatio ~ BMI + gender …
## 6 ageCat58-99 0.00557 0.00157 3.54 3.94e- 4 DesActRatio ~ BMI + gender …
glance4 = glance(lm4)
glance4$formula = as.character(fml4)
test$pred4 = predict(lm4,newdata = test)
glance4$test_RMSE = RMSE(test$DesActRatio,test$pred4)
glance4
## # A tibble: 1 × 14
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.621 0.621 0.0677 4935. 0 5 19186. -38358. -38305.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # formula <chr>, test_RMSE <dbl>
Include gender, genhlth, and ageCat in the model. Rerun everything to produce glance5.
fml5 = DesActRatio ~ BMI + gender + ageCat + genhlth
lm5 = lm(fml5, data = train)
tidy5 = tidy(lm5)
tidy5$formula = as.character(fml5)
tidy5
## # A tibble: 10 × 6
## term estimate std.error statistic p.value formula
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 1.39 0.00311 446. 0 DesActRatio ~ BMI + g…
## 2 BMI -0.0158 0.000110 -144. 0 DesActRatio ~ BMI + g…
## 3 genderf -0.0735 0.00111 -65.9 0 DesActRatio ~ BMI + g…
## 4 ageCat32-43 -0.00890 0.00154 -5.77 8.04e- 9 DesActRatio ~ BMI + g…
## 5 ageCat44-57 -0.0121 0.00159 -7.63 2.53e-14 DesActRatio ~ BMI + g…
## 6 ageCat58-99 0.00470 0.00160 2.93 3.37e- 3 DesActRatio ~ BMI + g…
## 7 genhlthvery good -0.00502 0.00148 -3.38 7.31e- 4 DesActRatio ~ BMI + g…
## 8 genhlthgood -0.00242 0.00157 -1.54 1.23e- 1 DesActRatio ~ BMI + g…
## 9 genhlthfair 0.00275 0.00214 1.28 1.99e- 1 DesActRatio ~ BMI + g…
## 10 genhlthpoor 0.00400 0.00327 1.23 2.20e- 1 DesActRatio ~ BMI + g…
glance5 = glance(lm5)
glance5$formula = as.character(fml5)
test$pred5 = predict(lm5,newdata = test)
glance5$test_RMSE = RMSE(test$DesActRatio,test$pred5)
glance5
## # A tibble: 1 × 14
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.622 0.621 0.0677 2748. 0 9 19199. -38375. -38291.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # formula <chr>, test_RMSE <dbl>
So far, we have always included gender as one of the categorical variables. In Model 6, just include genhlth and ageCat. Repeat all of the steps and produce glance6.
fml6 = DesActRatio ~ BMI + ageCat + genhlth
lm6= lm(fml6, data = train)
tidy6 = tidy(lm6)
tidy6$formula = as.character(fml6)
tidy6
## # A tibble: 9 × 6
## term estimate std.error statistic p.value formula
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 1.33 0.00339 392. 0 DesActRatio ~ BMI + a…
## 2 BMI -0.0149 0.000124 -120. 0 DesActRatio ~ BMI + a…
## 3 ageCat32-43 -0.00984 0.00175 -5.62 1.92e- 8 DesActRatio ~ BMI + a…
## 4 ageCat44-57 -0.0145 0.00180 -8.01 1.26e-15 DesActRatio ~ BMI + a…
## 5 ageCat58-99 0.000548 0.00182 0.301 7.63e- 1 DesActRatio ~ BMI + a…
## 6 genhlthvery good -0.00654 0.00169 -3.88 1.06e- 4 DesActRatio ~ BMI + a…
## 7 genhlthgood -0.00533 0.00178 -3.00 2.75e- 3 DesActRatio ~ BMI + a…
## 8 genhlthfair -0.00358 0.00243 -1.47 1.41e- 1 DesActRatio ~ BMI + a…
## 9 genhlthpoor -0.00430 0.00371 -1.16 2.46e- 1 DesActRatio ~ BMI + a…
glance6 = glance(lm6)
glance6$formula = as.character(fml6)
test$pred6 = predict(lm6,newdata = test)
glance6$test_RMSE = RMSE(test$DesActRatio,test$pred6)
glance6
## # A tibble: 1 × 14
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.512 0.512 0.0768 1977. 0 8 17288. -34555. -34479.
## # … with 5 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # formula <chr>, test_RMSE <dbl>
Use rbind() to combine glance1,…, glance6 in one dataframe. Keep only the formula and the test perormance statistics. use arrange() to order the dataframe by test_RMSE. Which model is best?
all6 = rbind(glance1, glance2, glance3, glance4, glance5, glance6) %>%
select(formula, test_RMSE) %>%
arrange(test_RMSE)
all6
## # A tibble: 6 × 2
## formula test_RMSE
## <chr> <dbl>
## 1 DesActRatio ~ BMI + gender + ageCat + genhlth 0.0662
## 2 DesActRatio ~ BMI + gender + ageCat 0.0663
## 3 DesActRatio ~ BMI + gender + genhlth 0.0667
## 4 DesActRatio ~ BMI + gender 0.0668
## 5 DesActRatio ~ BMI + ageCat + genhlth 0.0766
## 6 DesActRatio ~ BMI 0.0770
Graphical Evaluation Look at the relationship between predicted and actual values using ggplot2. Use the dataframe test with the predicted values from the best model. Use ggplot2 to make a scatterplot of the actual and predicted values. Map the predicted values to x.
Set alpha to a low value like .1 in your geom_point().
Add the line y = x using geom_abline(). Color it red.
Add a default smoother using geom_smooth(). Color it blue.
Note what you see.