In this project we will develop a model that predicts blood pressure, based on anonymized data from patients. The data is derived from National Health and Nutrition Examination Survey
First we import the library and data
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(corrplot)
## corrplot 0.92 loaded
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
health <- read.csv("C:/Users/User/Downloads/Full Book/Chapter 4 Code/Student/Data/health.csv")
Take a glimpse at our data
glimpse(health)
## Rows: 1,475
## Columns: 9
## $ systolic <int> 100, 112, 134, 108, 128, 102, 126, 124, 166, 138, 118, 124, 9~
## $ weight   <dbl> 98.6, 96.9, 108.2, 84.8, 97.0, 102.4, 99.4, 53.6, 78.6, 135.5~
## $ height   <dbl> 172.0, 186.0, 154.4, 168.9, 175.3, 150.5, 157.8, 162.4, 156.9~
## $ bmi      <dbl> 33.3, 28.0, 45.4, 29.7, 31.6, 45.2, 39.9, 20.3, 31.9, 41.7, 2~
## $ waist    <dbl> 120.4, 107.8, 120.3, 109.0, 111.1, 130.7, 113.2, 74.6, 102.8,~
## $ age      <int> 43, 57, 38, 75, 42, 63, 58, 26, 51, 61, 47, 52, 64, 55, 72, 8~
## $ diabetes <int> 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
## $ smoker   <int> 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1~
## $ fastfood <int> 5, 0, 2, 1, 1, 3, 6, 5, 0, 1, 0, 3, 0, 1, 0, 5, 0, 2, 1, 3, 2~
Systolic will be the value we aim to predict, and the rest will be predictors. From the data above we can derive that we need to convert the categorical variables as factors
health <- health %>%
  mutate(diabetes=as.factor(diabetes)) %>%
  mutate(smoker=as.factor(smoker))
Explore the data further by looking at the summary stats
summary(health)
##     systolic         weight           height           bmi       
##  Min.   : 80.0   Min.   : 29.10   Min.   :141.2   Min.   :13.40  
##  1st Qu.:114.0   1st Qu.: 69.15   1st Qu.:163.8   1st Qu.:24.10  
##  Median :122.0   Median : 81.00   Median :170.3   Median :27.90  
##  Mean   :124.7   Mean   : 83.56   Mean   :170.2   Mean   :28.79  
##  3rd Qu.:134.0   3rd Qu.: 94.50   3rd Qu.:176.8   3rd Qu.:32.10  
##  Max.   :224.0   Max.   :203.50   Max.   :200.4   Max.   :62.00  
##      waist            age        diabetes smoker     fastfood    
##  Min.   : 56.2   Min.   :20.00   0:1265   0:770   Min.   : 0.00  
##  1st Qu.: 88.4   1st Qu.:34.00   1: 210   1:705   1st Qu.: 0.00  
##  Median : 98.9   Median :49.00                    Median : 1.00  
##  Mean   :100.0   Mean   :48.89                    Mean   : 2.14  
##  3rd Qu.:109.5   3rd Qu.:62.00                    3rd Qu.: 3.00  
##  Max.   :176.0   Max.   :80.00                    Max.   :22.00
health %>%
  ggplot() +
  geom_histogram(mapping = aes(x=systolic), fill="lightblue", color="black") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the visualized histogram we can visualize a roughly normal distribution for the output value. Lets review the distribution for the rest of the variables - namely the numeric predictors
health %>%
  select(-systolic) %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot() +
  geom_histogram(mapping = aes(x=value,fill=key), color="black") + facet_wrap(~key, scales = "free") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now lets look at the correlation between variables
corrplot(cor(health[, c("systolic","weight","height","bmi","waist","age","fastfood")]))

For the predictor value (systolic), the highest correlation seems to be aligned with age.
Lets run a linear regression model and see if the age predictor is significant
health_mod1 <- lm(data = health, systolic~age)
summary(health_mod1)
## 
## Call:
## lm(formula = systolic ~ age, data = health)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.028 -10.109  -1.101   8.223  98.806 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104.34474    1.28169   81.41   <2e-16 ***
## age           0.41698    0.02477   16.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.14 on 1473 degrees of freedom
## Multiple R-squared:  0.1614, Adjusted R-squared:  0.1608 
## F-statistic: 283.4 on 1 and 1473 DF,  p-value: < 2.2e-16
From the age coefficient we can tell that for every 0.4 increase in patients age, blood pressure increases by a point.
F-stat is significant and residual error is low, which are indicators of a strong predictor. However, Multiple R Squared is at 0.1614, meaning that roughly only 16% explains the variablility in the response.
Lets run a Multiple Linear Regression Model and see if we can improve the MRS and find other significant predictors.
health_mod2 <- lm(data = health, systolic ~ .)
summary(health_mod2)
## 
## Call:
## lm(formula = systolic ~ ., data = health)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.463 -10.105  -0.765   8.148 100.398 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 163.30026   33.52545   4.871 1.23e-06 ***
## weight        0.55135    0.19835   2.780  0.00551 ** 
## height       -0.39201    0.19553  -2.005  0.04516 *  
## bmi          -1.36839    0.57574  -2.377  0.01759 *  
## waist        -0.00955    0.08358  -0.114  0.90905    
## age           0.43345    0.03199  13.549  < 2e-16 ***
## diabetes1     2.20636    1.26536   1.744  0.08143 .  
## smoker1       1.13983    0.90964   1.253  0.21039    
## fastfood      0.17638    0.15322   1.151  0.24985    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.99 on 1466 degrees of freedom
## Multiple R-squared:  0.1808, Adjusted R-squared:  0.1763 
## F-statistic: 40.44 on 8 and 1466 DF,  p-value: < 2.2e-16
We can discern from the summary table above that age, height, bmi, weight and diabetes are significant for the model. MRS has been improved to 0.1808, showing an overall better fit model.

Now lets run some diagnostics for our model

mean(health_mod2$residuals)
## [1] -1.121831e-15
The residual mean is close enough to zero, therefore it should be sufficient to pass the test
ols_plot_resid_hist(health_mod2)

##### The residual plot is almost normally distributed, so it satisfies our test.

Lets check for the heteroscedasticity in the distribution of residuals, versus fitted values
ols_plot_resid_fit(health_mod2)

The distribution of points seems even around the origin line. So no heteroscedasticity of residuals vs fitted values.

For practice try to check for multicollinearity among variables, and how it would change accordingly