Assignment: Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
In this assignment, I use the ‘cars’ dataset that comes with R and build a linear regression model to predict stopping distance from car speed. Here, stopping distance is the response variable and car speed is the explanatory variable.
library(ggplot2)
library(dplyr)
data(cars)
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
The dataset contains 50 observations of speed and stopping distance.
car_data <- cars %>% rename(stopping_distance = dist)
head(car_data)
## speed stopping_distance
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
ggplot(car_data, aes(x = speed, y = stopping_distance)) + geom_point()
From the scatter plot above, it can be seen that a linear trend exists between the two variables.
box_plot <- boxplot(car_data$stopping_distance)
outliers <- box_plot$out
outliers
## [1] 120
An outlier is found at a stopping distance of 120.
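To see which row of the data this outlier corresponds to, it can be looked up directly; this is a small additional check, not part of the textbook analysis.
# Look up the observation(s) flagged by the boxplot
car_data %>% filter(stopping_distance %in% outliers)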
car_data %>%
  summarise(cor(speed, stopping_distance))
## cor(speed, stopping_distance)
## 1 0.8068949
The correlation coefficient of 0.81 indicates a strong positive linear relationship between the two variables. This means that the predictor variable, ‘speed’, is a good candidate for predicting the response variable, ‘stopping distance’.
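As a supplementary check (not part of the textbook replication), the significance of this correlation can be confirmed with base R’s cor.test(), which tests whether the correlation differs from zero.
# Test whether the correlation between speed and stopping distance is significantly different from zero
cor.test(car_data$speed, car_data$stopping_distance)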
ggplot(car_data, aes(x = speed, y = stopping_distance)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
model <- lm(stopping_distance ~ speed, data = car_data)
summary(model)
##
## Call:
## lm(formula = stopping_distance ~ speed, data = car_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The linear regression model:
stopping_distance = -17.5791 + 3.9324 * speed
The regression coefficient for ‘speed’ is 3.9324, which means that for every one-unit increase in speed, the predicted stopping distance increases by 3.9324 units. Similarly, a one-unit decrease in speed results in a decrease of 3.9324 units in the predicted stopping distance.
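To illustrate this interpretation, predicted stopping distances at two speeds one unit apart can be compared with predict(); the speeds 15 and 16 below are arbitrary example values.
# Predictions at speeds of 15 and 16; the difference equals the slope estimate (about 3.93)
new_speeds <- data.frame(speed = c(15, 16))
predict(model, newdata = new_speeds)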
Here, the multiple R-squared value is 0.6511, which means that the regression line explains 65.11% of the variation in the response variable, stopping distance. Hence, the model is a reasonably good fit for the data. Also, the p-values for both the intercept and the slope are less than 0.05, indicating that both coefficients are statistically significant. Therefore, I can conclude that the speed of the car is a significant predictor of the stopping distance.
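The uncertainty around these coefficient estimates can also be summarised with confidence intervals; a quick sketch using base R’s confint():
# 95% confidence intervals for the intercept and slope
confint(model, level = 0.95)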
ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
hist(model$residuals)
ggplot(data = model, aes(sample = .resid)) +
  stat_qq()
In the residual analysis, the assumptions of linearity, nearly normal residuals, and constant variability (homoscedasticity) of the residuals have been checked to assess whether the linear model is reliable. The results are given below:
Residuals vs. fitted values: the residuals appear to be randomly scattered around zero, with no clear pattern or trend, indicating that the assumptions of linearity and homoscedasticity are satisfied.
Histogram of residuals: the histogram of residuals is approximately normal in shape, with a slight right skew, which is another good sign.
Normality assumption: the normal probability (Q-Q) plot of the residuals appears fairly linear, indicating that the residuals are approximately normally distributed, so this assumption is also satisfied (a formal test is sketched at the end of this section).
Based on this residual analysis, the model appears to be a good fit for the data, with no major violations of the assumptions. Overall, the linear model is appropriate.
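As a formal complement to the visual normality check above, the Shapiro-Wilk test can be applied to the residuals; this is an additional sketch rather than part of the textbook replication.
# Shapiro-Wilk normality test; a large p-value is consistent with normally distributed residuals
shapiro.test(model$residuals)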