Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed, and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
The simplest linear regression model finds the relationship between one input variable, called the predictor (speed), and the output, called the system's response (dist). We want to examine the relationship between the dependent variable “dist” and the independent variable “speed” in the cars dataset in R.
library(ggplot2)    # plotting
library(tidyverse)  # data manipulation helpers
library(gtsummary)  # formatted summary tables
library(jtools)     # tools for summarizing regression models
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
View(cars)  # open the dataset in an interactive spreadsheet-style viewer
Once the data is loaded, we can explore it using summary(). For each numeric variable, this reports the minimum, first quartile, median, mean, third quartile, and maximum:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
First, I will run the code below to check whether there are any missing values in the data.
print(colSums(is.na(cars)))
## speed dist
## 0 0
We can see that there are no missing values in the data, so no further cleanup is needed.
plot(cars[,"speed"],cars[,"dist"], main="cars",
xlab="speed", ylab="dist")
The figure above shows that the Distance tends to increase as the Speed increases. If we superimpose a smoothed trend line on this scatter plot (see below), we see that the relationship between the predictor (Speed) and the output (Distance) is roughly linear.
It is not perfectly linear, however. As the Speed increases, we see a larger spread in Distance.
scatter.smooth(cars[,"speed"],cars[,"dist"], main="Cars",
xlab="Speed", ylab="Distance")
The simple linear model has the mathematical form: ŷ = a0 + a1x1
car_lm <- lm(dist ~ speed, data=cars)
car_lm
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
In this case, the y-intercept is a0 = -17.579 and the slope is a1 = 3.932. Thus, the final regression model is: ŷ = a0 + a1x1 = -17.579 + 3.932*speed,
where x1 is the input to the system, a0 is the y-intercept of the line, a1 is the slope, and ŷ is the output value the model predicts. The hat (^) indicates a predicted or estimated value, not the actual observed value.
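As a quick sanity check (a hypothetical aside, not part of the textbook analysis), we can evaluate this equation at a chosen speed, say 21 mph, both directly from the fitted coefficients and with R's predict() function; both give roughly 65 feet:
# Predicted stopping distance at speed = 21, computed two ways.
coef(car_lm)[1] + coef(car_lm)[2] * 21              # -17.579 + 3.932*21, about 65
predict(car_lm, newdata = data.frame(speed = 21))   # the same prediction via predict()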
plot(dist ~ speed, data=cars)  # dist on the y-axis, speed on the x-axis
abline(car_lm)                 # overlay the fitted regression line
summary(car_lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
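As an optional aside (not part of the original write-up), the fit statistics quoted above can also be extracted from the summary object programmatically:
s <- summary(car_lm)
s$r.squared        # multiple R-squared, 0.6511
s$adj.r.squared    # adjusted R-squared, 0.6438
s$sigma            # residual standard error, 15.38
confint(car_lm)    # 95% confidence intervals for the coefficients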
The Residual Analysis examines the residual values to see what they can tell us about the model’s quality. Residual values greater than zero mean that the regression model predicted a value that was too small compared to the actual measured value, and negative values indicate that the regression model predicted a value that was too large. A model that fits the data well would tend to over-predict as often as it under-predicts. Thus, if we plot the residual values, we would expect to see them distributed normally around zero for a well-fitted model.
Also, the multiple R-squared value is a number between 0 and 1. It is a statistical measure of how well the model describes the measured data. We compute it by dividing the variation that the model explains by the data's total variation. Multiplying this value by 100 gives a value we can interpret as a percentage between 0 and 100. The reported multiple R-squared of 0.6511 for this model means that 65.11% of the variability in stopping distance is explained by the variation in speed (the adjusted R-squared of 0.6438 additionally accounts for the number of predictors). Random chance and measurement errors creep in, so the model will never explain all data variation; consequently, you should never expect an R-squared value of exactly one. In general, values of R-squared closer to one indicate a better-fitting model. However, a good model does not require a large R-squared value: it may still accurately predict future observations, even with a small R-squared value.
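To make that definition concrete, here is a small check (not in the original analysis) that computes R-squared directly as one minus the ratio of the residual sum of squares to the total sum of squares:
rss <- sum(resid(car_lm)^2)                   # residual (unexplained) variation
tss <- sum((cars$dist - mean(cars$dist))^2)   # total variation in dist
1 - rss / tss                                 # reproduces the multiple R-squared, 0.6511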
plot(fitted(car_lm),resid(car_lm))
In the plot above, the residuals appear to be scattered evenly above and below zero, with no obvious pattern as a function of the fitted values.
qqnorm(resid(car_lm))
qqline(resid(car_lm))
The points falling close to the line indicate that the residuals are approximately normally distributed. This further suggests that using only speed as a predictor in the model is sufficient to explain the data.
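As an optional numeric complement to the Q-Q plot (an aside, not part of the textbook analysis), the Shapiro-Wilk test provides a formal check of normality; a small p-value would suggest a departure from normality:
# Formal normality test on the residuals; p-values below 0.05
# would indicate the residuals deviate from a normal distribution.
shapiro.test(resid(car_lm))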
The above two diagnostic plots, plus two additional ones, can be obtained by calling the plot() function with the linear model as the argument. The “Scale-Location” plot is an alternate way of visualizing the residuals versus the fitted values from the linear regression model; however, the residuals are standardized and then square-root transformed. This essentially folds the residuals and can aid in finding patterns in them.
The Residuals vs Leverage plot can be used to identify possible outliers.
par(mfrow=c(2,2))
plot(car_lm)
In the Scale-Location plot, the red line is approximately horizontal, which means the average magnitude of the standardized residuals is not changing much as a function of the fitted values. Whether the spread around the red line varies with the fitted values (that is, whether the variability of the magnitudes changes) is less clear.
The Residuals vs Leverage plot can be used to detect heteroskedasticity, non-linearity, and influential points. The spread of standardized residuals should not change as a function of leverage; here it appears to decrease, indicating some heteroskedasticity. Also, points with high leverage may be influential: that is, deleting them would change the model a lot.
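To follow up on that plot numerically (a hypothetical aside), leverage and influence can be inspected directly with hatvalues() and cooks.distance():
# Leverage of each observation; large values can pull the fit toward themselves.
head(sort(hatvalues(car_lm), decreasing = TRUE))
# Cook's distance; large values flag points whose removal would change the fit most.
head(sort(cooks.distance(car_lm), decreasing = TRUE))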
We see a number of promising regularities in the residuals, indicating a strong relationship between the explanatory and response variables (speed and stopping distance): the residuals are distributed roughly evenly about zero, and the model shows a highly significant relationship between the variables (low p-values).
Despite this, the model is not a perfect fit: it explains only about 65% of the variation in the data, and the residuals show some divergence from normality (a right skew in the residuals).