head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
tail(cars)
##    speed dist
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85
The cars data set has 50 observations and 2 variables: speed and dist (stopping distance).
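The observation and variable counts can be checked directly rather than read off the head and tail output. This is a minimal sketch; the names `cars_dim` and `cars_names` are introduced here only for illustration.

```r
# `cars` ships with base R's datasets package, so nothing needs to be loaded.
cars_dim <- dim(cars)      # rows (observations), then columns (variables)
cars_names <- names(cars)  # column names as stored in the data frame

cars_dim
cars_names
str(cars)                  # column types and the first few values
```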
# statistical summary for the data
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
# Check for missing values
sum(is.na(cars))
## [1] 0
The cars data appears clean and has no missing values, so we can proceed with the analysis.
We need to build a linear model of “stopping distance as a function of speed”. Based on this wording, the independent variable \((x)\) is speed (also known as the explanatory variable or predictor), and the dependent variable \((y)\) is stopping distance (also known as the response variable).
plot(cars[,"speed"], cars[,"dist"], main="Stopping Distance as a Function of Speed",
xlab="Speed", ylab="Stopping Distance")
cars_lm <- lm(dist ~ speed, data = cars)
cars_lm
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
Based on the output above, the linear model function is: \(\widehat{dist} = -17.579 + 3.932 \cdot speed\)
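To see the fitted equation in use, the sketch below predicts the stopping distance at one speed, once by plugging into the equation and once with `predict()`. The speed of 20 is a hypothetical value chosen only for illustration (it lies inside the observed range of 4 to 25), and the names `new_speed`, `by_hand`, and `by_predict` are not from the original analysis.

```r
# Refit the model so this block runs on its own.
cars_lm <- lm(dist ~ speed, data = cars)

# Hypothetical speed at which to predict the stopping distance.
new_speed <- data.frame(speed = 20)

# Plug the speed into intercept + slope * speed using the stored coefficients.
b <- coef(cars_lm)
by_hand <- unname(b["(Intercept)"] + b["speed"] * new_speed$speed)

# Let predict() do the same computation.
by_predict <- unname(predict(cars_lm, newdata = new_speed))

by_hand
by_predict
all.equal(by_hand, by_predict)  # the two approaches should agree
```

With the rounded coefficients shown above, the same calculation is \(-17.579 + 3.932 \cdot 20 = 61.061\); the small difference from the unrounded result comes from rounding the coefficients.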
Now let’s plot the line of best fit of the data:
plot(dist ~ speed, data=cars)
abline(cars_lm, col = "deeppink", lwd = 2)
Both the scatter plot and the line of best fit show a positive correlation between speed and stopping distance. The y-intercept is negative, as the equation of the line indicates; it has no physical meaning, because a speed of 0 lies outside the observed range (4 to 25) and a stopping distance cannot be negative. The data points are scattered around the fitted line, but is linear regression a good model? Let’s evaluate its quality.
Let’s start by extracting more information about the linear model:
summary(cars_lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The median of the residuals is \(-2.272\), which is close to zero relative to their spread. The minimum and the maximum are of roughly the same magnitude (\(-29.069\) and \(43.201\)), as are the first and third quartiles (\(-9.525\) and \(9.215\)). This suggests that the residuals are roughly symmetric about zero, which is consistent with a reasonable linear fit. In addition, the standard error of the speed coefficient is over nine times smaller than the coefficient itself: \(\frac{3.9324}{0.4155} = 9.464\), which is the t value reported in the table. The p-value is \(1.49 \times 10^{-12}\), which is very small; this is strong evidence of a linear relationship between speed and stopping distance.
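The ratio and the p-value discussed above can be reproduced from the coefficient table. The following is a minimal sketch: `coef_table`, `t_speed`, and `p_speed` are names introduced here, and the 95% confidence level is an assumption made for illustration, not something the original analysis specifies.

```r
# Refit the model so this block runs on its own.
cars_lm <- lm(dist ~ speed, data = cars)

# Coefficient table as a numeric matrix (Estimate, Std. Error, t value, p-value).
coef_table <- coef(summary(cars_lm))

# t value for speed: estimate divided by its standard error.
t_speed <- coef_table["speed", "Estimate"] / coef_table["speed", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom.
p_speed <- 2 * pt(abs(t_speed), df = df.residual(cars_lm), lower.tail = FALSE)

t_speed
p_speed

# Interval estimate for the slope (95% level assumed here).
confint(cars_lm, "speed", level = 0.95)
```

A confidence interval for the slope that excludes zero tells the same story as the small p-value, while also showing the plausible size of the effect.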
Now let’s plot the residuals, the Q-Q and the other diagnostic plots:
par(mfrow=c(2,2))
plot(cars_lm)
I am going to plot the residual plot and the Q-Q plot separately for a clearer view:
plot(fitted(cars_lm), resid(cars_lm))
The residual plot shows a slight irregularity at the right-hand end of the plot, so a different model might give better predictions.
Let’s look at the Q-Q plot:
qqnorm(resid(cars_lm))
qqline(resid(cars_lm))
The Q-Q plot confirms the slight divergence at the right-hand end of the points. This suggests that speed alone may not be sufficient to explain the stopping distance; other predictors may influence it. Also, in the summary(cars_lm) output, the Multiple R-squared of 0.6511 shows that \(65.11\%\) of the variability in stopping distance is explained by speed.
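To make the R-squared interpretation concrete, the sketch below recomputes it from the residuals as one minus the ratio of the residual sum of squares to the total sum of squares. For a simple regression with one predictor it also equals the squared correlation between speed and dist. The variable names are introduced here for illustration only.

```r
# Refit the model so this block runs on its own.
cars_lm <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation left unexplained by the model.
ss_res <- sum(resid(cars_lm)^2)

# Total sum of squares: variation of dist around its mean.
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

r2_by_hand  <- 1 - ss_res / ss_tot
r2_from_cor <- cor(cars$speed, cars$dist)^2   # valid for a single predictor
r2_summary  <- summary(cars_lm)$r.squared     # value reported by summary()

c(by_hand = r2_by_hand, from_cor = r2_from_cor, summary = r2_summary)
```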
The cars data has only two variables, speed and distance, so fitting a multiple regression model would require gathering more data about these cars on other variables. In conclusion, the linear regression model assumes a linear relationship between the predictor variable (speed) and the response variable (stopping distance); the summary and diagnostic plots suggest this assumption is reasonable here, with some departure at the upper end.