Data 605 Discussion 12
library(knitr)
library(rmdformats)
## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=31)
library(readr)
library(ggplot2)
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
A primer on multiple regression
Loading the data from a csv
Take a look of the summary of the Heart.data
biking smoking heart.disease
Min. : 1.119 Min. : 0.5259 Min. : 0.5519
1st Qu.:20.205 1st Qu.: 8.2798 1st Qu.: 6.5137
Median :35.824 Median :15.8146 Median :10.3853
Mean :37.788 Mean :15.4350 Mean :10.1745
3rd Qu.:57.853 3rd Qu.:22.5689 3rd Qu.:13.7240
Max. :74.907 Max. :29.9467 Max. :20.4535
1. Independence of observations (aka no autocorrelation)
- Use cor( ) function to test the relationship between the independent variables just to make sure they are not highly correlated to avoid collinarity. Or, you can do a collinarity diagnostics on the linear model (lm) after the model is built. Use the function ols_coll_diag( ) to do the check on Tolerance and Variance Inflation Factor (VIF) and Eigenvalue and Condition Index
[1] 0.01513618
The correlation between biking and smoking, the 2 independent variables, is small at merely 1.5%. So we include both variables in our model.
2. Normality
Use hist( ) function to plot out the distribution of the dependent variable and see if it follows a normal distribution
The distribution of observations is roughly bell-shaped so we can proceed with checking linearity.
3. Linearity
The relationship between heart.disease and biking is apparently linear. The relationship between heart.disease and smoking is not as apparent but it is still linear. We can then proceed with the next check.
4. Homoscedasticity
We can check for this after we built the model.
Plotting scatter plots of the variables against each other.
Building the linear model.
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart_data)
summary(heart.disease.lm)
Call:
lm(formula = heart.disease ~ biking + smoking, data = heart_data)
Residuals:
Min 1Q Median 3Q Max
-2.1789 -0.4463 0.0362 0.4422 1.9331
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.984658 0.080137 186.99 <2e-16 ***
biking -0.200133 0.001366 -146.53 <2e-16 ***
smoking 0.178334 0.003539 50.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.654 on 495 degrees of freedom
Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
The estimated effect of biking on heart disease is –0.2 while the estimated effect of smoking is .18.
The standard errors for these regression variables are very small resulting in small p-values. Meaning there is almost zero probability that these relationship occurred by chance.
Let’s go back to checking Homoscedasticity
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 5.814046, Df = 1, p = 0.015899
This test gives me a p-value of .02, which allows us to reject the null hypothesis of homoscedasticity. This tells me that the error variance changes with the level of the response.
Plotting the model
Create a sequence from the lowest to the highest value of your observed biking data;
Choose the minimum, mean, and maximum values of smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.
Predict the values of heart disease based on your linear model
Plot the original data
Add the regression line
heart.plot <- heart.plot + geom_line(data = plotting.data, aes(x = biking, y = predicted,
color = smoking), size = 1.35)
heart.plotAdd title for the graph and better descriptors to the x-labels and y-labels
heart.plot <- heart.plot + theme_bw() + labs(title = "Rates of Heart Disease (% of population)",
x = "Biking to Work (% of pop.)", y = "Heart Disease (% of pop.)", color = "Smoking \n (% of pop.)")
heart.plotheart.plot + annotate(geom = "text", x = 45, y = 20, label = "Heart Disease = 14.98 + (-0.20 x biking) + (0.18 x smoking)")In conclusion, we found a 0.2% decrease (± 0.0014) in the frequency of heart disease for every 1% increase in biking, and a 0.18% increase (± 0.0035) in the frequency of heart disease for every 1% increase in smoking.
Based on the results of the residual analysis, although we failed to pass the Homoscedasticity test, I do think I would give it a pass on meeting the conditions of a linear model.
Appendix on Collinearity Diagnostics
Tolerance and Variance Inflation Factor
---------------------------------------
Variables Tolerance VIF
1 biking 0.9997709 1.000229
2 smoking 0.9997709 1.000229
Eigenvalue and Condition Index
------------------------------
Eigenvalue Condition Index intercept biking smoking
1 2.68130373 1.000000 0.0175371071 0.02933512 0.02715632
2 0.23052794 3.410446 0.0007629378 0.55891182 0.45583598
3 0.08816833 5.514634 0.9816999551 0.41175306 0.51700770