#Load the packages needed
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
## Warning: package 'broom' was built under R version 4.2.2
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.2
library(carData)
cars.data = cars
head(cars.data)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
plot(cars.data[,"speed"],cars.data[,"dist"], main="Cars",
xlab="Speed", ylab="Distance")
The figure above shows that Distance tends to increase as Speed increases. If we superimpose a smoothed (loess) curve on this scatter plot (see below), we see that the relationship between the predictor (Speed) and the output (Distance) is roughly linear.
It is not perfectly linear, however. As the Speed increases, we see a larger spread in Distance.
scatter.smooth(cars.data[,"speed"],cars.data[,"dist"], main="Cars",
xlab="Speed", ylab="Distance")
cars.lm <- lm(dist ~ speed, data=cars.data)
cars.lm
##
## Call:
## lm(formula = dist ~ speed, data = cars.data)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
The y-intercept is a0 = -17.579 and the slope is a1 = 3.932. Thus, the final regression model is: dist = -17.579 + 3.932 * speed.
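As a quick check of the fitted equation, the coefficients can be pulled out of the model object with `coef()` and plugged into the formula by hand. The sketch below refits the same model so that it runs on its own; the speed of 15 mph is an arbitrary illustrative value, not one taken from the analysis above.

```r
# Refit the model on the built-in cars dataset so this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Extract the intercept (a0) and the slope (a1) from the fitted model
a0 <- coef(cars.lm)[["(Intercept)"]]
a1 <- coef(cars.lm)[["speed"]]

# Evaluate dist = a0 + a1 * speed at an illustrative speed (15 mph is an assumption)
speed.example <- 15
dist.by.hand <- a0 + a1 * speed.example
dist.by.hand
```

Because the intercept is negative, the same arithmetic gives negative distances for very low speeds, which is one reason the equation should only be used within the observed range of speeds.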
plot(dist ~ speed, data=cars.data)
abline(cars.lm)
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars.data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
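For this model, the residuals are reasonably well behaved: the median (-2.272) is close to zero and the first and third quartiles (-9.525 and 9.215) are roughly symmetric. The maximum (43.201) is larger in magnitude than the minimum (-29.069), however, which hints at some right skew.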
If we wanted to predict the distance required for a car to stop given its speed, we would fit the model on a training set to estimate the coefficients and then plug a new speed into the model formula.
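A minimal sketch of that prediction step uses `predict()` on the fitted model. The new speeds (10 and 20 mph) are illustrative values chosen here, and `interval = "prediction"` asks for a 95% prediction interval around each estimate.

```r
# Refit the model on the built-in cars dataset so this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Hypothetical new observations: the column name must match the predictor
new.speeds <- data.frame(speed = c(10, 20))

# Point predictions with lower and upper 95% prediction limits
cars.pred <- predict(cars.lm, newdata = new.speeds, interval = "prediction")
cars.pred
```

The returned matrix has one row per new speed and the columns `fit`, `lwr` and `upr`. The intervals are wide because they account for the scatter of individual cars around the line, not just the uncertainty in the coefficients.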
The intercept estimate is the expected stopping distance when the speed is zero, not at the average speed of the cars in the dataset. Here it is -17.579 feet.
Taken literally, this would mean that a car going 0 mph takes -17.58 feet to stop. A negative stopping distance is impossible, so the y-intercept has no meaningful physical interpretation in this instance. A speed of 0 mph also lies outside the observed range of speeds (4 to 25 mph).
The second row of the Coefficients table is the slope, the effect of speed on the distance required for a car to stop. The slope term in our model says that for every 1 mph increase in the speed of a car, the expected stopping distance goes up by about 3.93 feet.
The slope estimate for speed is 9.464 times larger than its standard error (3.9324 / 0.4155, which is the t value). This large ratio means that there is relatively little variability in the slope estimate, a1. The standard error for the intercept, a0, is 6.7584, so the estimate of -17.5791 is only about 2.6 times its standard error in magnitude. This suggests that there is less certainty in the estimate of the intercept than in the estimate of the slope for this model.
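The ratio of each estimate to its standard error is the t value reported in the summary table. The short check below recomputes it from the coefficient matrix returned by `summary()`, and adds 95% confidence intervals from `confint()` as another way to read the same uncertainty; the variable names are illustrative.

```r
# Refit the model on the built-in cars dataset so this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef.table <- summary(cars.lm)$coefficients

# t value by hand: estimate divided by its standard error
t.by.hand <- coef.table[, "Estimate"] / coef.table[, "Std. Error"]
cbind(coef.table[, c("Estimate", "Std. Error")], t.by.hand)

# 95% confidence intervals for the intercept and the slope
confint(cars.lm)
```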
The Pr(>|t|) for speed is 1.49e-12, so there is strong evidence of a linear relationship between speed and distance.
The residual standard error is 15.38 on 48 degrees of freedom: the actual distance required to stop deviates from the fitted regression line by approximately 15.38 feet, on average.
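The residual standard error can be reproduced from the residuals themselves: it is the square root of the residual sum of squares divided by the residual degrees of freedom (50 observations minus 2 estimated coefficients). A small sketch, with illustrative variable names:

```r
# Refit the model on the built-in cars dataset so this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Residual sum of squares
rss <- sum(resid(cars.lm)^2)

# Divide by the residual degrees of freedom (n - 2 = 48) and take the square root
rse.by.hand <- sqrt(rss / df.residual(cars.lm))
rse.by.hand

# The same quantity as reported by the model object
sigma(cars.lm)
```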
The multiple R-squared is 0.6511, which means that this model explains 65.11% of the variance in stopping distance.
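To make the meaning of R-squared concrete, it can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. With a single predictor it also equals the squared correlation between speed and distance. The sketch below uses illustrative variable names.

```r
# Refit the model on the built-in cars dataset so this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model
rss <- sum(resid(cars.lm)^2)

# Total variation of dist around its mean
tss <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the total variation explained by the model
r2.by.hand <- 1 - rss / tss
r2.by.hand

# For simple linear regression this matches the squared correlation
cor(cars$speed, cars$dist)^2
```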
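The F-statistic is 89.57 on 1 and 48 degrees of freedom, far larger than 1, with a p-value of 1.49e-12, which indicates that there is a relationship between the predictor and the response variable. With a single predictor, the F-statistic is the square of the t value for the slope.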
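The link between the F-statistic and the slope's t value can be checked directly from the summary object. The sketch below also recomputes the p-value from the F distribution with `pf()`; the variable names are illustrative.

```r
# Refit the model on the built-in cars dataset so this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)
cars.summary <- summary(cars.lm)

# fstatistic is a named vector: value, numdf, dendf
f.stat <- cars.summary$fstatistic

# With one predictor, the squared t value of the slope equals the F-statistic
t.speed <- cars.summary$coefficients["speed", "t value"]
t.speed^2
f.stat[["value"]]

# Upper-tail probability of the F distribution gives the overall p-value
f.pvalue <- pf(f.stat[["value"]], f.stat[["numdf"]], f.stat[["dendf"]],
               lower.tail = FALSE)
f.pvalue
```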
plot(fitted(cars.lm),resid(cars.lm))
The plot above shows the residuals scattered roughly evenly above and below zero, with no obvious trend.
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
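The points in the Q-Q plot follow the reference line closely over most of the range, which indicates that the residuals are approximately normally distributed, with some departure in the upper tail. This supports the normality assumption behind the t and F tests above. It does not by itself show that speed alone is sufficient to explain the data.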
par(mfrow=c(2,2))
plot(cars.lm)
In the Scale-Location plot, the red line is approximately horizontal, so the average magnitude of the standardized residuals does not change much as a function of the fitted values. Whether the spread around the red line is also constant across the fitted values is less clear.
The Residuals vs Leverage plot is mainly used to find influential observations, but it can also hint at heteroskedasticity and non-linearity. The spread of the standardized residuals shouldn't change as a function of leverage: here it appears to decrease, indicating heteroskedasticity. Also, points with high leverage may be influential: that is, deleting them would change the model a lot.
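The quantities behind the Residuals vs Leverage plot can also be inspected numerically. As a sketch, `hatvalues()` returns the leverage of each observation and `cooks.distance()` measures how much the fit would change if that observation were deleted; the table and column names below are illustrative.

```r
# Refit the model on the built-in cars dataset so this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Leverage (diagonal of the hat matrix) and Cook's distance per observation
lev <- hatvalues(cars.lm)
cook <- cooks.distance(cars.lm)

# Combine with the raw data and list the three most influential observations
influence.table <- data.frame(speed = cars$speed, dist = cars$dist,
                              leverage = lev, cooks.d = cook)
head(influence.table[order(-influence.table$cooks.d), ], 3)
```

Observations that combine high leverage with a large residual rise to the top of this table, and they are the ones worth checking before trusting the fitted line.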
From our one-factor linear model of the cars dataset, we see a number of promising regularities indicating a strong association between the explanatory and response variables (speed and stopping distance). The residuals are centered near zero and roughly evenly spread, and the model shows a highly significant relationship between the variables (low p-values). Despite this, the model is not a perfect fit: it explains only about 65% of the variance in stopping distance, and the residuals show some divergence indicating skew and overestimation, among other factors yet to be discovered.
It is likely that speed would play a significant role in a multivariate model, given the strong association indicated by the statistics and the nearly normal properties of the residuals. It is also likely that a larger number of observations would improve our model through its influence on normality.