library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
The cars dataset has only two variables: speed (in mph) and dist (stopping distance, in ft). We'll model distance as a function of speed.
Distance is our response (y) variable, since it depends on speed, the explanatory (x) variable.
cars %>%
  ggplot(aes(x = speed, y = dist, color = dist)) +
  geom_point() +
  labs(title = "Distance based on speed", x = "Speed", y = "Distance", colour = "Distance")
## Regression using linear model
cars_linear <- lm(dist ~ speed,data=cars)
coef(cars_linear)
## (Intercept)       speed
##  -17.579095    3.932409
summary(cars_linear)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
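The model’s intercept is negative (-17.58), which is not physically meaningful: a car travelling at speed 0 cannot have a negative stopping distance, so the value should be 0. This suggests the model could be modified, for example by transforming the variables or constraining the intercept, so that it gives sensible values near zero speed. The slope indicates that for every 1-unit (1 mph) increase in speed, the predicted distance increases by 3.93 units (ft).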
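To make the reading of the coefficients concrete, the sketch below predicts stopping distance at a few speeds and then, as one possible modification (an assumption for illustration, not something fitted in this analysis), forces the line through the origin with `dist ~ 0 + speed`. The names `fit`, `fit_origin`, `new_speeds`, and `pred` are introduced here only for the example.

```r
# Refit the model so this block runs on its own (same formula as cars_linear).
fit <- lm(dist ~ speed, data = cars)

# Predict at speed 0 (which returns the intercept) and at two speeds one unit apart.
new_speeds <- data.frame(speed = c(0, 10, 11))
pred <- unname(predict(fit, newdata = new_speeds))
pred

# The gap between the predictions at 11 and 10 equals the slope.
pred[3] - pred[2]
unname(coef(fit)["speed"])

# Hypothetical alternative: regression through the origin, so that
# a speed of 0 corresponds to a predicted distance of exactly 0.
fit_origin <- lm(dist ~ 0 + speed, data = cars)
coef(fit_origin)
pred_origin_zero <- unname(predict(fit_origin, newdata = data.frame(speed = 0)))
pred_origin_zero
```

Note that the R-squared reported by `summary()` for a no-intercept model is computed against an uncentered total sum of squares, so it is not directly comparable with the 0.6511 reported for the original model.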
The p-value for the speed coefficient, which in a one-predictor model equals the p-value of the overall F-test (1.49e-12), is well below the 0.05 significance level, so speed is a statistically significant predictor of distance. The model explains about 65% of the variation in distance (R-squared = 0.6511), which is consistent with the strong correlation between the variables. The residuals have a median near 0 (-2.27) and nearly symmetric quartiles (-9.53 and 9.22), but the maximum (43.20) lies farther from 0 than the minimum (-29.07), which hints at some right skew. Overall the residuals look roughly, though not perfectly, normal, which supports using the linear model.
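The residual standard error of 15.38 is somewhat higher than 1.5 times the residual quartiles (1.5 × 9.215 ≈ 13.8), which is roughly what normally distributed residuals would give. The residuals are therefore a bit more spread out than desired.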
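The 1.5x comparison can be checked directly. Under normally distributed residuals the quartiles sit at about ±0.674 standard deviations, so the standard deviation is about 1.48 times the upper quartile. The sketch below computes both numbers. The names `fit`, `rse`, `q3`, `scale_factor`, and `implied_sd` are illustrative, and only the upper quartile is used for simplicity.

```r
# Refit the model so this block runs on its own.
fit <- lm(dist ~ speed, data = cars)

rse <- sigma(fit)                          # residual standard error
q3 <- unname(quantile(resid(fit), 0.75))   # upper quartile of the residuals
scale_factor <- 1 / qnorm(0.75)            # about 1.48 for a normal distribution
implied_sd <- scale_factor * q3            # SD implied by the quartile under normality

c(rse = rse, q3 = q3, implied_sd = implied_sd)
```

If `rse` is noticeably larger than `implied_sd`, the residuals have more mass in the tails than a normal distribution with the same quartiles would.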
plot(cars_linear)
cars %>%
  ggplot(aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
cor(cars$speed,cars$dist)
## [1] 0.8068949
The correlation is 0.81, which is close to 1 and suggests a strong positive linear relationship between speed and distance.
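In a simple linear regression with one predictor and an intercept, R-squared equals the square of the Pearson correlation, so the correlation and the model summary tell the same story. A minimal sketch of that check follows; the names `fit`, `r`, and `r_squared` are illustrative.

```r
# Refit the model so this block runs on its own.
fit <- lm(dist ~ speed, data = cars)

r <- cor(cars$speed, cars$dist)        # Pearson correlation
r_squared <- summary(fit)$r.squared    # Multiple R-squared from the model summary

c(r = r, r_squared_from_cor = r^2, r_squared_from_lm = r_squared)

# TRUE when the two values agree up to numerical tolerance.
isTRUE(all.equal(r^2, r_squared))
```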
plot(fitted(cars_linear), resid(cars_linear))
abline(0,0)
The residuals appear to be spread fairly evenly above and below 0,
which supports the linearity and constant-variance assumptions. Normality of the residuals is assessed separately with the Q-Q plot that follows.
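A rough numeric companion to the residual plot is sketched below: it counts residuals on each side of 0 and compares their spread in the lower and upper halves of the fitted values. This is an informal check added for illustration, with hypothetical names (`fit`, `res`, `n_above`, `n_below`, `upper_half`), and it does not replace reading the plot.

```r
# Refit the model so this block runs on its own.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# How many residuals fall above and below zero.
n_above <- sum(res > 0)
n_below <- sum(res < 0)
c(above = n_above, below = n_below)

# With an intercept in the model, least squares forces the residuals
# to average to zero (up to floating-point error).
mean(res)

# Crude look at constant variance: residual SD in the lower (FALSE)
# and upper (TRUE) half of the fitted values.
upper_half <- fitted(fit) > median(fitted(fit))
tapply(res, upper_half, sd)
```

Because the mean of the residuals is zero by construction, balance around 0 alone says little about normality. The comparison of spreads is the more informative part.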
qqnorm(resid(cars_linear))
qqline(resid(cars_linear))
Between theoretical quantiles 1 and 2, the points deviate more noticeably from the reference line, indicating a heavier upper tail. Aside from this, the residuals follow the line closely, suggesting they are approximately normal and that the model is a reasonable predictor of distance as a function of speed.
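A formal test can complement the visual Q-Q check. The sketch below applies the Shapiro-Wilk test (`shapiro.test()` from base R's stats package) to the residuals. The test is an addition for illustration, not part of the analysis above, and `fit` and `sw` are illustrative names.

```r
# Refit the model so this block runs on its own.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal.
sw <- shapiro.test(resid(fit))

sw$statistic   # W statistic; values near 1 are consistent with normality
sw$p.value     # a small p-value is evidence against normality
```

With only 50 observations, the test result and the Q-Q plot are best read together rather than either one alone.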