Assignment: Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

In this assignment, I use the ‘cars’ dataset that comes with R and build a linear regression model to predict stopping distance from car speed. Note that stopping distance is the response variable and car speed is the explanatory variable.

Load libraries

library(ggplot2)
library(dplyr)

Load the dataset

data(cars)
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…

The dataset contains 50 observations of speed and stopping distance.
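As a quick check on the variable ranges (an optional step, not part of the chapter replication), base R's summary() reports the five-number summary and mean of both columns:

# Five-number summary plus mean for speed and dist
summary(cars)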

Rename the column

car_data <- cars %>% rename(stopping_distance = dist)
head(car_data)
##   speed stopping_distance
## 1     4                 2
## 2     4                10
## 3     7                 4
## 4     7                22
## 5     8                16
## 6     9                10

Data visualization:

Scatter Plot of the relationship between the speed and stopping distance

ggplot(car_data, aes(x = speed, y = stopping_distance)) + geom_point()

From the scatter plot above, it is seen that a linear trend exists between the two variables.

Boxplot of response variable

box_plot <- boxplot(car_data$stopping_distance)

Identify outlier

outliers <- box_plot$out
outliers
## [1] 120

An outlier is found at a stopping distance of 120.
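As a hypothetical follow-up (not required by the assignment), the flagged observation can be located with a dplyr filter:

# Show the row(s) whose stopping distance matches a boxplot outlier
car_data %>% filter(stopping_distance %in% outliers)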

Find correlation coefficient

car_data %>% 
  summarise(cor(speed, stopping_distance))
##   cor(speed, stopping_distance)
## 1                     0.8068949

The correlation coefficient of 0.81 suggests that there is a strong positive linear relationship between the two variables. This means that the predictor variable, ‘speed’, is a good candidate for predicting the response variable, ‘stopping_distance’.
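As a one-line check (an addition to the original analysis): in simple linear regression, the squared correlation equals the model's R-squared, so squaring 0.8069 gives about 0.651, previewing the fit quality reported by the model summary below.

# r^2 should match the Multiple R-squared from summary(model)
car_data %>% summarise(r_squared = cor(speed, stopping_distance)^2)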

Plotting the relationship between the stopping distance and the speed with the least squares line

ggplot(car_data, aes(x = speed, y = stopping_distance)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Building the linear regression model to fit the data

model <- lm(stopping_distance ~ speed, data = car_data)

Evaluate the model quality:

Summary statistics of the model

summary(model)
## 
## Call:
## lm(formula = stopping_distance ~ speed, data = car_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The linear regression model:

stopping_distance = -17.5791 + 3.9324 * speed

The regression coefficient for ‘speed’ is 3.9324, which means that for every one-unit increase in speed, the predicted stopping distance increases by 3.9324 units; likewise, a one-unit decrease in speed decreases the predicted stopping distance by 3.9324 units.
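As a worked illustration of the fitted equation (the speed value of 21 is arbitrary, not taken from the assignment), -17.5791 + 3.9324 * 21 ≈ 65.0, which predict() confirms:

# Predicted stopping distance at a hypothetical speed of 21
predict(model, newdata = data.frame(speed = 21))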

Here, the multiple R-squared value is 0.6511, which means that the regression line explains 65.11% of the variation in the response variable, i.e., stopping distance. Hence, I can say that the model is a reasonably good fit for the data. Also, the p-values for both the intercept and the slope are less than 0.05, indicating that both coefficients are statistically significant. Therefore, I can conclude that the speed of the car is a significant predictor of the stopping distance.
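To complement the p-values (an optional check beyond the textbook replication), confidence intervals for the coefficients can be obtained with confint(); an interval that excludes zero is consistent with a significant coefficient:

# 95% confidence intervals for the intercept and slope
confint(model)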

Residual analysis:

Residuals vs fitted value (predicted value) plot:

ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

Histogram of the residuals

hist(model$residuals)
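As a more formal complement to the visual normality checks (an addition beyond the original analysis), the Shapiro-Wilk test from base R can be applied to the residuals:

# Shapiro-Wilk normality test; a large p-value is consistent with normal residuals
shapiro.test(model$residuals)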

Normal probability plot of the residual

ggplot(data = model, aes(sample = .resid)) +
  stat_qq()
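A reference line makes departures from normality easier to judge; as an optional variant (assuming a ggplot2 version that provides stat_qq_line(), i.e. 3.0.0 or later):

# Q-Q plot with a reference line through the theoretical quantiles
ggplot(data = model, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()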

In the residual analysis, the assumptions of linearity, nearly normal residuals, and constant variability (homoscedasticity) have been checked to see whether the linear model is reliable. The results are given below:

  1. Residuals vs fitted plot: The residuals appear to be randomly scattered around zero, with no clear pattern or trend, indicating that the assumptions of linearity and homoscedasticity are satisfied.

  2. Histogram of residuals: The histogram of residuals is approximately normal with a slight right skew, which is another good sign.

  3. Normality assumption: The normal probability plot (or q-q plot) of residuals appears to be fairly linear, indicating that the residuals are approximately normally distributed. This assumption is also satisfied.

Based on the above residual analysis, the model appears to be a good fit for the data, with no major violations of the assumptions. Overall, it can be said that the linear model was appropriate.
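Finally, as an optional cross-check (not part of the original assignment), base R's built-in lm diagnostics reproduce these checks in one step:

# Residuals vs fitted, Q-Q, scale-location, and residuals vs leverage panels
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # restore the default plotting layout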