Data 605 Discussion 12

library(knitr)
library(rmdformats)

## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=31)

library(readr)
library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)

A primer on multiple regression

Loading the data from a CSV file

heart_data <- read_csv("heart.data.csv", col_types = cols(X1 = col_skip()))

Take a look at the summary of the heart data

summary(heart_data)
     biking          smoking        heart.disease    
 Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519  
 1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137  
 Median :35.824   Median :15.8146   Median :10.3853  
 Mean   :37.788   Mean   :15.4350   Mean   :10.1745  
 3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240  
 Max.   :74.907   Max.   :29.9467   Max.   :20.4535  

1. Independence of observations (aka no autocorrelation)

  1. Use the cor() function to test the relationship between the independent variables and make sure they are not highly correlated, in order to avoid collinearity. Alternatively, you can run collinearity diagnostics on the linear model (lm) after it is built: the function ols_coll_diag() checks Tolerance, the Variance Inflation Factor (VIF), the Eigenvalues, and the Condition Index (see the Appendix; a quick VIF check is also sketched below).
cor(heart_data$biking, heart_data$smoking)
[1] 0.01513618

The correlation between biking and smoking, the two independent variables, is small at roughly 0.015 (about 1.5%), so we can safely include both variables in our model.
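
As a quick numeric check of the same point, the variance inflation factors can also be computed with car::vif() once the model is fitted further down (the car package is already used below for the heteroscedasticity test); a minimal sketch:

# Quick VIF check on the fitted model heart.disease.lm (built below).
# VIF values near 1 mean the predictors are essentially uncorrelated;
# Tolerance is simply 1 / VIF. Full olsrr diagnostics are in the Appendix.
car::vif(heart.disease.lm)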

2. Normality

Use the hist() function to plot the distribution of the dependent variable and see whether it roughly follows a normal distribution.

hist(heart_data$heart.disease)

The distribution of observations is roughly bell-shaped so we can proceed with checking linearity.
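
A normal Q-Q plot gives a second visual check of the same assumption; a minimal sketch using base R:

# Normal Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(heart_data$heart.disease)
qqline(heart_data$heart.disease, col = "red")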

3. Linearity

plot(heart.disease ~ biking, data = heart_data)

plot(heart.disease ~ smoking, data = heart_data)

The relationship between heart.disease and biking is clearly linear. The relationship between heart.disease and smoking is less obvious, but it still looks roughly linear, so we can proceed with the next check.
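
To make the linearity easier to judge by eye, a least-squares line can be overlaid on each scatter plot; a minimal sketch for the smoking plot, using the ggplot2 package loaded above:

# Scatter plot of heart.disease against smoking with a fitted least-squares line
ggplot(heart_data, aes(x = smoking, y = heart.disease)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)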

4. Homoscedasticity

We can check for this after we build the model.

Plot scatter plots of the variables against one another.

# scatter plot matrix of all pairs of variables
plot(heart_data)

Building the linear model.

heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart_data)

summary(heart.disease.lm)

Call:
lm(formula = heart.disease ~ biking + smoking, data = heart_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1789 -0.4463  0.0362  0.4422  1.9331 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.984658   0.080137  186.99   <2e-16 ***
biking      -0.200133   0.001366 -146.53   <2e-16 ***
smoking      0.178334   0.003539   50.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.654 on 495 degrees of freedom
Multiple R-squared:  0.9796,    Adjusted R-squared:  0.9795 
F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16

The estimated effect of biking on heart disease is -0.20, while the estimated effect of smoking is 0.18.

The standard errors for these regression coefficients are very small, resulting in very small p-values. This means there is almost zero probability that these relationships occurred by chance.
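
The same coefficient table can be pulled into a tidy data frame with the broom package loaded above, which also makes it easy to attach confidence intervals; a minimal sketch:

# Coefficients, standard errors, t statistics, p-values, and 95% confidence intervals
broom::tidy(heart.disease.lm, conf.int = TRUE)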

Let’s go back to checking homoscedasticity.

par(mfrow = c(2, 2))
plot(heart.disease.lm)

par(mfrow = c(1, 1))
car::ncvTest(heart.disease.lm)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 5.814046, Df = 1, p = 0.015899

This test gives a p-value of about 0.016, which allows us to reject the null hypothesis of constant error variance (homoscedasticity). This tells us that the error variance changes with the level of the fitted response.
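
One common way to account for this without changing the model is to report heteroscedasticity-consistent (robust) standard errors; a minimal sketch, assuming the sandwich and lmtest packages are installed (they are not loaded elsewhere in this document):

# Coefficient tests with HC3 robust standard errors
lmtest::coeftest(heart.disease.lm, vcov = sandwich::vcovHC(heart.disease.lm, type = "HC3"))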

Plotting the model

  • Create a sequence from the lowest to the highest value of your observed biking data;

  • Choose the minimum, mean, and maximum values of smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.

plotting.data <- expand.grid(biking = seq(min(heart_data$biking), max(heart_data$biking), 
    length.out = 30), smoking = c(min(heart_data$smoking), mean(heart_data$smoking), 
    max(heart_data$smoking)))

Predict the values of heart disease based on your linear model

plotting.data$predicted <- predict.lm(heart.disease.lm, newdata = plotting.data)

# round the smoking numbers into 2 decimals
plotting.data$smoking <- round(plotting.data$smoking, digits = 2)

# Change the smoking variable into a factor
plotting.data$smoking <- as.factor(plotting.data$smoking)

Plot the original data

heart.plot <- ggplot(heart_data, aes(x = biking, y = heart.disease)) + geom_point()

heart.plot

Add the regression line

heart.plot <- heart.plot + geom_line(data = plotting.data, aes(x = biking, y = predicted, 
    color = smoking), size = 1.35)

heart.plot

Add title for the graph and better descriptors to the x-labels and y-labels

heart.plot <- heart.plot + theme_bw() + labs(title = "Rates of Heart Disease (% of population)", 
    x = "Biking to Work (% of pop.)", y = "Heart Disease (% of pop.)", color = "Smoking \n (% of pop.)")

heart.plot

heart.plot + annotate(geom = "text", x = 45, y = 20, label = "Heart Disease = 14.98 + (-0.20 x biking) + (0.18 x smoking)")
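
As a quick sanity check on the annotated equation, the fitted model can be used to predict the heart disease rate for a pair of hypothetical predictor values (biking = 30 and smoking = 10 are made-up numbers chosen purely for illustration):

# Hypothetical predictor values, used only to illustrate the fitted equation
new_obs <- data.frame(biking = 30, smoking = 10)
predict(heart.disease.lm, newdata = new_obs)
# should come out close to 14.98 + (-0.20 * 30) + (0.18 * 10) = 10.78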

In conclusion, we found a 0.2% decrease (± 0.0014 standard error) in the frequency of heart disease for every 1% increase in biking, and a 0.18% increase (± 0.0035 standard error) in the frequency of heart disease for every 1% increase in smoking.

Based on the results of the residual analysis, although the model did not pass the homoscedasticity test, I would still give it a pass on meeting the conditions of a linear model.

Appendix on Collinearity Diagnostics

olsrr::ols_coll_diag(heart.disease.lm)
Tolerance and Variance Inflation Factor
---------------------------------------
  Variables Tolerance      VIF
1    biking 0.9997709 1.000229
2   smoking 0.9997709 1.000229


Eigenvalue and Condition Index
------------------------------
  Eigenvalue Condition Index    intercept     biking    smoking
1 2.68130373        1.000000 0.0175371071 0.02933512 0.02715632
2 0.23052794        3.410446 0.0007629378 0.55891182 0.45583598
3 0.08816833        5.514634 0.9816999551 0.41175306 0.51700770