library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
First I like to run some summary stats and plots to explore the data.
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
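# speed and dist are continuous, so histograms show their distributions better than bar charts
cars %>%
ggplot(aes(x=speed)) + geom_histogram(binwidth = 2)
cars %>%
ggplot(aes(x=dist)) + geom_histogram(binwidth = 10)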
A scatter plot shows a fairly clear positive linear relationship between the two variables. The next step is to fit a linear model.
cars %>%
ggplot(aes(x=speed, y=dist)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
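The strength of the linear relationship seen in the scatter plot can also be put into a number. This is a minimal sketch using base R's `cor()` on the built-in `cars` data; the variable name `r` is only illustrative.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
r

# With a single predictor, the squared correlation equals the
# Multiple R-squared that summary() reports for lm(dist ~ speed)
r^2
```

A value of `r` close to 1 supports reading the scatter plot as a strong positive linear trend.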
model <- lm(dist ~ speed, cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
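The model results give us the fitted equation:
\[ \text{Distance} = 3.93 \times \text{Speed} - 17.58 \]
The Pr(>|t|) value for speed is very small (1.49e-12), so the slope is statistically significant, and the model explains about 65% of the variance in distance (Multiple R-squared = 0.6511).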
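To see the fitted equation in use, here is a small sketch that predicts stopping distance with `predict()`. The speeds in `new_speeds` are hypothetical values chosen for illustration, not part of the original analysis, and the model is refit so the block runs on its own.

```r
# Refit the model from the text so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Hypothetical speeds to predict for (illustrative only)
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions plus 95% prediction intervals (columns fit, lwr, upr)
preds <- predict(model, newdata = new_speeds,
                 interval = "prediction", level = 0.95)
cbind(new_speeds, preds)
```

The `fit` column is what the equation gives by hand, for example roughly 3.93 × 10 − 17.58 for a speed of 10. The `lwr` and `upr` columns show how wide the uncertainty is for a single new observation.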
Now let's check the residuals to see whether a linear regression is appropriate for this data set.
qqnorm(model$residuals)
qqline(model$residuals)
Here the points in the Q-Q plot fall close to the reference line, which is what we want: it suggests the residuals are approximately normally distributed.
hist(model$residuals)
The histogram of the residuals looks roughly normal, which is consistent with the normality condition.
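The Q-Q plot and histogram are visual checks. As an optional complement, a formal normality test can be run on the same residuals. This sketch uses base R's `shapiro.test()`; it was not part of the original analysis, and with only 50 observations the result is best treated as a rough guide alongside the plots.

```r
# Refit the model from the text so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value would point to a departure from normality
sw <- shapiro.test(resid(model))
sw
```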
plot(x=model$residuals)
Lastly, the residuals appear randomly scattered around zero with no obvious pattern. A linear regression model is reasonable for this data set.
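The plot above shows the residuals in row order. A common complementary view is residuals against fitted values, which makes curvature or changing spread easier to spot. This is a sketch under that assumption, using base R's `fitted()` and `resid()`; the axis labels are my own.

```r
# Refit the model from the text so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Residuals against fitted values; look for no trend and roughly constant spread
plot(fitted(model), resid(model),
     xlab = "Fitted distance", ylab = "Residual")

# Dashed reference line at zero
abline(h = 0, lty = 2)
```

If the points fan out or bend away from the zero line, that would suggest non-constant variance or a non-linear relationship.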