Regression Model

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

First, I like to run some summary statistics and plots to explore the data.

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
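# Distribution of speed (mph); a histogram bins the continuous values
cars %>%
  ggplot(aes(x = speed)) + geom_histogram(binwidth = 2)

# Distribution of stopping distance (ft)
cars %>%
  ggplot(aes(x = dist)) + geom_histogram(binwidth = 10)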

A scatter plot shows a fairly clear positive linear relationship between the two variables: stopping distance increases with speed. The next step is to fit a linear model.

cars %>%
  ggplot(aes(x=speed, y=dist)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
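
To put a number on the relationship seen in the scatter plot, one option is the Pearson correlation coefficient. The sketch below uses base R's `cor()` and `cor.test()` on the built-in `cars` data. The variable names `r` and `ct` are only illustrative and are not part of the original analysis.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
print(r)  # roughly 0.81

# cor.test() adds a confidence interval and a p-value for the correlation
ct <- cor.test(cars$speed, cars$dist)
print(ct$conf.int)
print(ct$p.value)
```

For a regression with a single predictor, the square of this correlation equals the Multiple R-squared that `summary()` reports for the model.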

model <- lm(dist ~ speed, cars)
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The results from the model give us the function:

\[ \text{Distance} = 3.93 \times \text{Speed} - 17.58 \]
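
As a quick check on how to read this equation, the sketch below plugs a speed of 20 mph into the fitted line by hand and compares the result with `predict()`. The value 20 and the names `new_speed`, `by_hand`, and `by_predict` are illustrative choices, not part of the original analysis.

```r
# Refit the model so this snippet runs on its own
model <- lm(dist ~ speed, data = cars)

new_speed <- 20  # illustrative value, in mph

# Plug the speed into the fitted line using the stored coefficients
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["speed"]] * new_speed

# predict() does the same calculation for any new data frame
by_predict <- predict(model, newdata = data.frame(speed = new_speed))

print(by_hand)     # about 61.07 (feet)
print(by_predict)
```

Using `coef(model)` instead of the rounded values 3.93 and -17.58 avoids small rounding differences between the two results.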

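The Pr(>|t|) value for speed is very low (1.49e-12), so the slope is statistically significant. The Multiple R-squared of 0.6511 means the model explains about 65% of the variance in stopping distance.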
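
The same numbers can be pulled out of the model object instead of being read off the printed summary. This is a minimal sketch; the names `coefs` and `ci` are illustrative, and the 95% level is an assumed choice.

```r
# Refit the model so this snippet runs on its own
model <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix: estimate, std. error, t value, p-value
coefs <- coef(summary(model))
print(coefs["speed", "Pr(>|t|)"])  # about 1.49e-12

# 95% confidence intervals for the intercept and the slope
ci <- confint(model, level = 0.95)
print(ci)
```

If the interval for `speed` excludes zero, that agrees with the small p-value in the summary table.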

Residuals

Now let's check the residuals to see whether a linear regression is appropriate for this data set.

qqnorm(model$residuals)
qqline(model$residuals)

Here the points on the Q-Q plot fall close to the reference line, which is what we want: it suggests the residuals are approximately normally distributed.

hist(model$residuals)

The histogram of the residuals looks roughly normal, which satisfies the normality condition.
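
One way to make this visual comparison easier is to draw a normal curve over the histogram. The sketch below is an optional addition, not part of the original analysis. It assumes a normal curve with the same mean and standard deviation as the residuals, and the name `res` is illustrative.

```r
# Refit the model so this snippet runs on its own
model <- lm(dist ~ speed, data = cars)
res <- resid(model)

# Histogram on the density scale so a curve can be drawn on top of it
hist(res, freq = FALSE, main = "Residuals", xlab = "Residual")

# Normal curve with the same mean and standard deviation as the residuals
curve(dnorm(x, mean = mean(res), sd = sd(res)), add = TRUE)
```

The closer the bars follow the curve, the more plausible the normality assumption is.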

plot(x=model$residuals)

Lastly, the residuals look randomly scattered around zero with no obvious pattern. A linear regression model is reasonable for this data set.
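
A common variant of this last check plots the residuals against the fitted values instead of the observation index. The sketch below shows one way to do that in base R. It is an optional addition under the same model, not the plot used above.

```r
# Refit the model so this snippet runs on its own
model <- lm(dist ~ speed, data = cars)

# Residuals against fitted values; a patternless band around zero is the goal
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed reference line at zero

# Base R's built-in version of the same diagnostic
plot(model, which = 1)
```

A curve or a widening spread in this plot would suggest that a straight line, or the constant-variance assumption, does not fit the data well.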