In this dataset, it contains observations on the percentage of people biking to work each day, the percentage of people smoking, and the percentage of people with heart disease in an imaginary sample of 500 towns.
Linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model. ## Step 1: Load the data into R 1. In RStudio, go to File > Import dataset > From Text (base). 2. Choose the data file you have downloaded heart.data, and an Import Dataset window pops up. 3.In the Data Frame window, you should see an X (index) column and columns listing the data for each of the variables (income and happiness or biking, smoking, and heart.disease). 4.Click on the Import button and the file should appear in your Environment tab on the upper right side of the RStudio screen. After you’ve loaded the data, check that it has been read in correctly using summary(). ### Summary because the variables are quantitative, running the code produces a numeric summary of the data for the independent variables (smoking and biking) and the dependent variable (heart disease):
heart_data <- read_csv("~/R_Folder/heart.data.csv")
## New names:
## Rows: 498 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (4): ...1, biking, smoking, heart.disease
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
summary(heart_data)
## ...1 biking smoking heart.disease
## Min. : 1.0 Min. : 1.119 Min. : 0.5259 Min. : 0.5519
## 1st Qu.:125.2 1st Qu.:20.205 1st Qu.: 8.2798 1st Qu.: 6.5137
## Median :249.5 Median :35.824 Median :15.8146 Median :10.3853
## Mean :249.5 Mean :37.788 Mean :15.4350 Mean :10.1745
## 3rd Qu.:373.8 3rd Qu.:57.853 3rd Qu.:22.5689 3rd Qu.:13.7240
## Max. :498.0 Max. :74.907 Max. :29.9467 Max. :20.4535
You can also embed plots, histogram for example:
plot(heart.disease ~ biking, data=heart_data)
plot(heart.disease ~ smoking, data=heart_data)
Check linear model statistical summary
heart.disease.lm<-lm(heart.disease ~ biking + smoking, data = heart_data)
summary(heart.disease.lm)
##
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1789 -0.4463 0.0362 0.4422 1.9331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.984658 0.080137 186.99 <2e-16 ***
## biking -0.200133 0.001366 -146.53 <2e-16 ***
## smoking 0.178334 0.003539 50.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(heart.disease.lm)
par(mfrow=c(1,1))
plotting.data<-expand.grid(
biking = seq(min(heart_data$biking), max(heart_data$biking), length.out=30),
smoking=c(min(heart_data$smoking), mean(heart_data$smoking), max(heart_data$smoking)))
plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata=plotting.data)
plotting.data$smoking <- round(plotting.data$smoking, digits = 2)
plotting.data$smoking <- as.factor(plotting.data$smoking)
heart.plot <- ggplot(heart_data, aes(x=biking, y=heart.disease)) +
geom_point()
heart.plot
heart.plot <- heart.plot +
geom_line(data=plotting.data, aes(x=biking, y=predicted.y, color=smoking), size=1.25)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
heart.plot
heart.plot <-
heart.plot +
theme_bw() +
labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
x = "Biking to work (% of population)",
y = "Heart disease (% of population)",
color = "Smoking \n (% of population)")
heart.plot