From https://www.scribbr.com/statistics/linear-regression-in-r/ Linear Regression in R | An Easy Step-by-Step Guide published 2/25/2020 by Rebecca Bevens
Given a dataset of 500 observations of rates of smoking, biking, and heart disease in 500 imaginary small town, run a linear regression to check for correlation between predictive variables of smoking and bikin and dependent variable of heart disease.
# install packages only once
#install.packages("ggplot2")
# install.packages("dplyr")
# install.packages("broom")
#install.packages("ggpubr")
# load packages
library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
## Warning: package 'broom' was built under R version 4.1.2
library(ggpubr)
heart.data <- read.csv("~/StMarys/stats 2/heart.data.csv")
View(heart.data)
summary(heart.data)
## X biking smoking heart.disease
## Min. : 1.0 Min. : 1.119 Min. : 0.5259 Min. : 0.5519
## 1st Qu.:125.2 1st Qu.:20.205 1st Qu.: 8.2798 1st Qu.: 6.5137
## Median :249.5 Median :35.824 Median :15.8146 Median :10.3853
## Mean :249.5 Mean :37.788 Mean :15.4350 Mean :10.1745
## 3rd Qu.:373.8 3rd Qu.:57.853 3rd Qu.:22.5689 3rd Qu.:13.7240
## Max. :498.0 Max. :74.907 Max. :29.9467 Max. :20.4535
cor(heart.data$biking, heart.data$smoking)
## [1] 0.01513618
The correlation of 1.5% is small enough that we can include both variables in the model.
hist(heart.data$heart.disease)
The distribution of observations is roughly bell shaped so we can proceed.
We’ll check this with two scatterplots: for biking and heart disease, and for smoking and heart disease.
plot(heart.disease ~ biking, data=heart.data)
plot(heart.disease ~ smoking, data=heart.data)
Both look roughly linear– let’s go for it!
heart.disease.lm<-lm(heart.disease ~ biking + smoking, data=heart.data)
summary(heart.disease.lm)
##
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1789 -0.4463 0.0362 0.4422 1.9331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.984658 0.080137 186.99 <2e-16 ***
## biking -0.200133 0.001366 -146.53 <2e-16 ***
## smoking 0.178334 0.003539 50.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
What does this mean?
* Estimate for model parameters: value of y-intercept, estimated effect of biking on heart disease, estimated effect of smoking on heart disease. * Standard error of the estimated values * Test statistic aka t value aka t statistic * P value (P >|t|) aka the probability of finding the given t-statistic if the null hypothesis of no relationship were true (we want this to be low!!!)
The final 3 lines are model diagnostics– the most important is the p-value– 2.2e-16 is very close to zero– which indicates that the model fits the data well.
Interpretation: * For every 1% increase in biking to work, there is a correlated 0.2% decrease in heart disease. * For every 1% increase in smoking, there is a correlated 0.178% increase in heart disease.
The standard errors for these regression coefficients are very small, and the t-statistics are very large (-147 and 50.4, respectively). The p-values reflect these small errors and large t-statistics. For both parameters, there is almost zero probability that this effect is due to chance.
Look at the residuals!
par(mfrow=c(2,2))
plot(heart.disease.lm)
par(mfrow=c(1,1))
Residuals are the unexplained variance. They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error.
The most important thing to look for is that the red lines representing the mean of the residuals are all basically horizontal and centered around zero. This means there are no outliers or biases in the data that would make a linear regression invalid.
In the Normal Q-Qplot in the top right, we can see that the real residuals from our model form an almost perfectly one-to-one line with the theoretical residuals from a perfect model.
Based on these residuals, we can say that our model meets the assumption of homoscedasticity.
The visualization step for multiple regression is more difficult than for simple regression, because we now have two predictors. One option is to plot a plane, but these are difficult to read and not often published.
We will try a different method: plotting the relationship between biking and heart disease at different levels of smoking. In this example, smoking will be treated as a factor with three levels, just for the purposes of displaying the relationships in our data.
Use the function expand.grid() to create a dataframe that will: * create a sequence from the lowest to highest value of observed biking data * choose minimum, mean, and maximum of smoking to create 3 levels of smoking over which to predict heart disease
plotting.data <- expand.grid(
biking=seq(min(heart.data$biking), max(heart.data$biking), length.out=30),
smoking=c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking))
)
View(plotting.data)
plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata=plotting.data)
View(plotting.data)
plotting.data$smoking <-round(plotting.data$smoking, digits=2)
View(plotting.data)
plotting.data$smoking <- as.factor(plotting.data$smoking)
View(plotting.data)
heart.plot <- ggplot(heart.data, aes(x=biking, y=heart.disease)) +
geom_point()
heart.plot
heart.plot <- heart.plot +
geom_line(data=plotting.data, aes(x=biking, y=predicted.y, color=smoking), size=1.25)
heart.plot
heart.plot <-
heart.plot +
theme_bw() +
labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
x = "Biking to work (% of population)",
y = "Heart disease (% of population)",
color = "Smoking \n (% of population)")
heart.plot
heart.plot + annotate(geom="text", x=30, y=1.75, label=" = 15 + (-0.2*biking) + (0.178*smoking)")
In our survey of 500 towns, we found significant relationships between the frequency of biking to work and the frequency of heart disease and the frequency of smoking and frequency of heart disease (p < 0 and p<0.001, respectively).
Specifically we found a 0.2% decrease (± 0.0014) in the frequency of heart disease for every 1% increase in biking, and a 0.178% increase (± 0.0035) in the frequency of heart disease for every 1% increase in smoking.
(Note that we pulled the margin of error from the std error for our coefficients!)