Multiple Linear Regression in R

In this dataset, it contains observations on the percentage of people biking to work each day, the percentage of people smoking, and the percentage of people with heart disease in an imaginary sample of 500 towns.

Linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model. ## Step 1: Load the data into R 1. In RStudio, go to File > Import dataset > From Text (base). 2. Choose the data file you have downloaded heart.data, and an Import Dataset window pops up. 3.In the Data Frame window, you should see an X (index) column and columns listing the data for each of the variables (income and happiness or biking, smoking, and heart.disease). 4.Click on the Import button and the file should appear in your Environment tab on the upper right side of the RStudio screen. After you’ve loaded the data, check that it has been read in correctly using summary(). ### Summary because the variables are quantitative, running the code produces a numeric summary of the data for the independent variables (smoking and biking) and the dependent variable (heart disease):

heart_data <- read_csv("~/R_Folder/heart.data.csv")
## New names:
## Rows: 498 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (4): ...1, biking, smoking, heart.disease
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
summary(heart_data)
##       ...1           biking          smoking        heart.disease    
##  Min.   :  1.0   Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519  
##  1st Qu.:125.2   1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137  
##  Median :249.5   Median :35.824   Median :15.8146   Median :10.3853  
##  Mean   :249.5   Mean   :37.788   Mean   :15.4350   Mean   :10.1745  
##  3rd Qu.:373.8   3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240  
##  Max.   :498.0   Max.   :74.907   Max.   :29.9467   Max.   :20.4535

Normality Test

You can also embed plots, histogram for example:

Plot scatter diagram

plot(heart.disease ~ biking, data=heart_data)

plot(heart.disease ~ smoking, data=heart_data)

Create linear regression model

Check linear model statistical summary

heart.disease.lm<-lm(heart.disease ~ biking + smoking, data = heart_data)
summary(heart.disease.lm)
## 
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1789 -0.4463  0.0362  0.4422  1.9331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.984658   0.080137  186.99   <2e-16 ***
## biking      -0.200133   0.001366 -146.53   <2e-16 ***
## smoking      0.178334   0.003539   50.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared:  0.9796, Adjusted R-squared:  0.9795 
## F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16

Run diagnostic test

par(mfrow=c(2,2))
plot(heart.disease.lm)

par(mfrow=c(1,1))

Generate Final detailed plots

plotting.data<-expand.grid(
  biking = seq(min(heart_data$biking), max(heart_data$biking), length.out=30),
  smoking=c(min(heart_data$smoking), mean(heart_data$smoking), max(heart_data$smoking)))

plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata=plotting.data)

plotting.data$smoking <- round(plotting.data$smoking, digits = 2)

plotting.data$smoking <- as.factor(plotting.data$smoking)

heart.plot <- ggplot(heart_data, aes(x=biking, y=heart.disease)) +
  geom_point()

heart.plot

heart.plot <- heart.plot +
  geom_line(data=plotting.data, aes(x=biking, y=predicted.y, color=smoking), size=1.25)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
heart.plot

Label final plot

heart.plot <-
  heart.plot +
  theme_bw() +
  labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
       x = "Biking to work (% of population)",
       y = "Heart disease (% of population)",
       color = "Smoking \n (% of population)")

heart.plot