library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(graphics)
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
Consider some data on the speed and stopping distances of cars in the 1920s. We plot the data and fit a linear model:
data(cars)
dim(cars)
## [1] 50 2
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
The cars dataset gives the speed and stopping distance of cars. It is a data frame with 50 rows and 2 variables. Each row is one observation, and the variables are speed (numeric, in mph) and dist (numeric stopping distance, in ft). As the summary output above shows, speed ranges from 4 mph to 25 mph, and stopping distance ranges from 2 feet to 120 feet.
plot(dist ~ speed, data = cars, ylab = "distance")
g <- lm(dist ~ speed, data = cars)
summary(g)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Residuals are the differences between the observed response values and the values the model predicted. When assessing how well the model fits the data, we look for residuals distributed symmetrically around zero. Here the distribution is not strongly symmetrical: the median is -2.272 rather than 0, and the largest positive residual (43.201) is much larger in magnitude than the largest negative one (-29.069). This means that for some points the model's prediction falls far from the observed value, and the largest misses are underestimates of the stopping distance.
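As a minimal sketch of the definition above, the residuals can be recomputed by hand and compared with what resid() returns. The name manual_resid is introduced here for illustration only, and the model is refit so the snippet runs on its own.

```r
# Refit the model so this snippet is self-contained
g <- lm(dist ~ speed, data = cars)

# Residual = observed response minus fitted (predicted) response
manual_resid <- cars$dist - fitted(g)

# Should agree with the residuals stored in the model object
all.equal(unname(manual_resid), unname(resid(g)))

# Quantiles comparable to the "Residuals" block of summary(g)
quantile(resid(g))

# With an intercept in the model, least-squares residuals average to zero
mean(resid(g))
```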
The Estimate column of the Coefficients table has two rows. The first is the intercept, the expected stopping distance when speed is 0 mph. Here it is -17.5791 feet, which has no physical meaning (a stopping distance cannot be negative) because a speed of 0 lies outside the observed range of 4 to 25 mph; the intercept mainly anchors the line. The second row is the slope: for every 1 mph increase in the speed of a car, the expected stopping distance goes up by 3.9324 feet.
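One way to check this reading of the two coefficients is to ask the model for predictions. The sketch below assumes nothing beyond the fitted model; the speeds 15 and 16 mph are arbitrary illustrative values.

```r
# Refit the model so this snippet is self-contained
g <- lm(dist ~ speed, data = cars)
b <- coef(g)
b

# The prediction at speed 0 is the intercept (an extrapolation outside the data)
at_zero <- predict(g, newdata = data.frame(speed = 0))
at_zero

# Predictions one mph apart differ by exactly the slope
p <- predict(g, newdata = data.frame(speed = c(15, 16)))
diff(p)
```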
The Std. Error column estimates how much each coefficient estimate would vary if we refit the model on repeated samples. Ideally it is small relative to the coefficient itself. For the slope, the standard error is 0.4155 feet per mph, about a tenth of the estimate of 3.9324.
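The standard errors can be pulled out of the summary table and turned into confidence intervals, which is often an easier way to read them. This is a sketch; the 95% level is an assumption chosen for illustration.

```r
# Refit the model so this snippet is self-contained
g <- lm(dist ~ speed, data = cars)

# Standard errors from the coefficient table
se <- coef(summary(g))[, "Std. Error"]
se

# 95% confidence intervals for both coefficients
ci <- confint(g, level = 0.95)
ci

# The same interval for the slope, built by hand: estimate +/- t quantile * SE
manual_ci <- coef(g)["speed"] + c(-1, 1) * qt(0.975, df.residual(g)) * se["speed"]
manual_ci
```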
The t value is the coefficient estimate divided by its standard error, so it measures how many standard errors the estimate lies from 0. For the slope, 3.9324 / 0.4155 gives 9.464, which is far from zero and indicates that a relationship between speed and stopping distance exists. The t values are also used to compute the p-values.
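The ratio described above can be reproduced directly from the coefficient table. In this sketch, ct and t_manual are illustrative names.

```r
# Refit the model so this snippet is self-contained
g <- lm(dist ~ speed, data = cars)
ct <- coef(summary(g))

# t value = estimate divided by its standard error
t_manual <- ct[, "Estimate"] / ct[, "Std. Error"]
t_manual

# Should match the "t value" column reported by summary(g)
all.equal(t_manual, ct[, "t value"])
```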
Pr(>|t|) is the probability of observing a t value at least as large in absolute value as the one computed, assuming the true coefficient is zero. A small p-value indicates that the observed relationship between the predictor (speed) and response (dist) variables is unlikely to be due to chance alone. A p-value of 5% or less is a conventional cut-off point. In our model, the p-value for the slope is very close to zero (1.49e-12), and the p-value for the intercept (0.0123) is also below 5%.
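The two-sided p-values can be recomputed from the t values and the residual degrees of freedom. This sketch uses the t distribution function pt(); p_manual is an illustrative name.

```r
# Refit the model so this snippet is self-contained
g <- lm(dist ~ speed, data = cars)
ct <- coef(summary(g))

# Two-sided p-value: twice the upper-tail probability of |t| on 48 degrees of freedom
p_manual <- 2 * pt(abs(ct[, "t value"]), df = df.residual(g), lower.tail = FALSE)
p_manual

# Should match the "Pr(>|t|)" column reported by summary(g)
all.equal(p_manual, ct[, "Pr(>|t|)"])
```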
The Residual Standard Error estimates the typical amount by which the response (dist) deviates from the fitted regression line. Here the observed stopping distance deviates from the line by approximately 15.38 feet, computed on 48 degrees of freedom (50 observations minus 2 estimated coefficients).
Multiple R-squared, Adjusted R-squared
The R-squared (R2) statistic measures how well the model fits the data. For a linear model with an intercept it lies between 0 and 1. In our example R2 is 0.6511, so roughly 65% of the variance in the response variable (dist) is explained by the predictor variable (speed). The adjusted R-squared, 0.6438, penalizes for the number of predictors and is nearly the same here because the model has only one.
This means that our model is a good fit but not an excellent fit for the data provided.
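The residual standard error and both R-squared figures all come from the same two sums of squares, so they can be rebuilt by hand as a check. This is a sketch using the usual textbook formulas; rss, tss, n and p are illustrative names.

```r
# Refit the model so this snippet is self-contained
g <- lm(dist ~ speed, data = cars)

rss <- sum(resid(g)^2)                       # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
n <- nrow(cars)                              # number of observations
p <- length(coef(g))                         # number of estimated coefficients

rse    <- sqrt(rss / (n - p))                    # residual standard error
r2     <- 1 - rss / tss                          # multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)       # adjusted R-squared

c(rse = rse, r2 = r2, adj_r2 = adj_r2)
```

These should agree with summary(g)$sigma, summary(g)$r.squared and summary(g)$adj.r.squared.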
# Scatter plot with the fitted regression line (data = cars avoids attach())
plot(dist ~ speed, data = cars)
abline(g)
# Residual Analysis
plot(fitted(g), resid(g))
Plotting the residuals against the fitted values shows that most of the points are clustered near the zero line, so the model is a reasonable fit. However, a few large positive residuals (around +40) stand out, and these are points where the model underestimates the actual stopping distance. The spread of the residuals also appears to widen as the fitted value, and therefore speed, increases. So the model seems to predict stopping distance better at lower speeds than at higher speeds.
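A zero reference line makes the residual plot easier to read, and a rough numeric check of the spread can accompany it. The sketch below splits the data at the median speed, which is an arbitrary choice made here for illustration and not a formal test of constant variance.

```r
# Refit the model so this snippet is self-contained
g <- lm(dist ~ speed, data = cars)

# Residuals against fitted values, with a dashed reference line at zero
plot(fitted(g), resid(g),
     xlab = "Fitted stopping distance (ft)", ylab = "Residual (ft)")
abline(h = 0, lty = 2)

# Rough check: standard deviation of residuals for slower vs faster cars
faster <- cars$speed > median(cars$speed)
sd_by_half <- tapply(resid(g), faster, sd)
sd_by_half

# The observations with the largest residuals, alongside their speeds
worst <- order(abs(resid(g)), decreasing = TRUE)[1:3]
cbind(cars[worst, ], residual = resid(g)[worst])
```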
qqnorm(resid(g))
qqline(resid(g))
Another use of residuals is a quantile-quantile (Q-Q) plot, which plots the sample quantiles of the residuals against the theoretical quantiles of a normal distribution. If the residuals are normally distributed, the points fall along the reference line. Here most points follow the line, so the residuals look approximately normal, but they diverge from it at the higher positive quantiles.
Like the residual plot, the Q-Q plot suggests that the model represents the observed data well except for the largest values. For speeds below about 20 mph (the third quartile of speed is 19 mph), the model appears to be a good predictor of stopping distance.
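The visual impression from the Q-Q plot can be supplemented with a numeric check. The sketch below uses the Shapiro-Wilk test from base R, which is not part of the analysis above and is offered only as one possible follow-up; a small p-value would suggest the residuals depart from normality.

```r
# Refit the model so this snippet is self-contained
g <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test of normality applied to the residuals
sw <- shapiro.test(resid(g))
sw

# Standardized residuals: values beyond about +/-2 are worth a closer look
std_res <- rstandard(g)
head(sort(abs(std_res), decreasing = TRUE), 3)
```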