Cars Regression Analysis

We will run a regression analysis on the cars data set.

# load packages and inspect the cars data set (cars ships with base R's datasets package)
require(carData)
## Loading required package: carData
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
dim(cars)
## [1] 50  2
require(ResourceSelection)
## Loading required package: ResourceSelection
## ResourceSelection 0.3-6   2023-06-27
kdepairs(cars)
## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

There are 50 observations and 2 variables in the data set. Stopping distance (dist) tends to increase with speed. The speed variable is roughly symmetric around its center, while the dist variable is right-skewed, with most values concentrated at the lower end. The two variables are strongly correlated, with a correlation of 0.807.
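The reported correlation can be checked directly. The snippet below is a minimal sketch that assumes the built-in cars data frame; the variable name r is used only for illustration.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)

# Rounded to three decimals for comparison with the value quoted in the text
round(r, 3)
```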

plot(cars[,"speed"],cars[,"dist"], main= "speed vs distance",
     xlab="speed", ylab="Distance")

# see how data is skewed 
ggplot(cars, aes(x=speed)) +
  geom_histogram(binwidth = 5, fill="grey", color="black") +
  labs(title="Histogram of Car Speeds", x="Speed (mph)", y="Frequency") +
  theme_minimal()

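We will use linear regression to predict stopping distance (dist) from speed. The correlation already shows a strong relationship between the two variables. Based on the summary below, the intercept is -17.5791 feet, which is the fitted stopping distance at 0 mph. A negative distance is not physically meaningful, and 0 mph lies outside the observed speeds (4 to 25 mph), so the intercept should not be interpreted literally. The slope shows that each additional 1 mph of speed adds about 3.9324 feet of stopping distance.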

cars_model <- lm(dist ~ speed, data = cars)
summary(cars_model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
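
As a rough illustration of how the fitted coefficients translate into predictions, the sketch below refits the same model and predicts the stopping distance at a few speeds. The speeds chosen and the names new_speeds, preds, and manual are assumptions made for illustration only.

```r
# Refit the model so the example is self-contained
cars_model <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph), all within the observed range of 4 to 25 mph
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals
preds <- predict(cars_model, newdata = new_speeds, interval = "prediction")
cbind(new_speeds, preds)

# The same point prediction at 10 mph, computed by hand: intercept + slope * speed
manual <- coef(cars_model)[["(Intercept)"]] + coef(cars_model)[["speed"]] * 10
manual
```

At 10 mph this works out to roughly -17.58 + 3.93 * 10, or about 21.7 feet. The prediction interval is much wider than the point estimate alone suggests, which is consistent with the residual standard error of 15.38.
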
par(ask=FALSE)
par(mfrow=c(2,2))
plot(cars_model)

hist(cars_model$residuals)

The residuals vs. fitted plot shows the residuals scattered around zero across the range of fitted values. Only three observations are flagged as potential outliers: 23, 35, and 49. The data appear to be well modeled by a linear relationship between speed and dist, and the points appear to be randomly spread out about the line, with no discernible non-linear trends or indications of non-constant variance.
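
To look at the flagged observations more closely, one option is to list the rows with the largest absolute residuals. This is a minimal sketch that refits the same model; the names res and top3 are illustrative and not part of the original analysis.

```r
# Refit the model so the example is self-contained
cars_model <- lm(dist ~ speed, data = cars)

# Raw residuals (observed dist minus fitted dist)
res <- resid(cars_model)

# Row indices of the three largest residuals in absolute value
top3 <- order(abs(res), decreasing = TRUE)[1:3]

# Show speed, dist, fitted value, and residual for those rows
data.frame(cars[top3, ],
           fitted = fitted(cars_model)[top3],
           residual = res[top3])
```

All three of these observations have positive residuals, meaning the model underestimates their stopping distance.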

par(ask=FALSE)
par(mfrow=c(2,2))
plot(cars_model)

# residuals against fitted values (resid() must be the second argument to plot())
plot(fitted(cars_model), resid(cars_model))
qqnorm(resid(cars_model))
qqline(resid(cars_model))

The two variables have a strong relationship, but other modeling techniques could be explored to improve the \(R^2\) beyond 0.6511.
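
One possible direction, shown here only as a sketch and not as the method used in this analysis, is to add a quadratic term in speed and compare the fit against the linear model. The model names linear_model and quad_model are assumptions for illustration.

```r
# Baseline: the same linear model as above
linear_model <- lm(dist ~ speed, data = cars)

# Candidate: add a squared speed term
quad_model <- lm(dist ~ speed + I(speed^2), data = cars)

# Compare R-squared and adjusted R-squared for both models
r2_linear <- summary(linear_model)$r.squared
r2_quad   <- summary(quad_model)$r.squared
c(linear = r2_linear, quadratic = r2_quad)
c(linear = summary(linear_model)$adj.r.squared,
  quadratic = summary(quad_model)$adj.r.squared)

# F-test for whether the extra term improves the fit significantly
anova(linear_model, quad_model)
```

Adding a term can never lower \(R^2\), so the adjusted \(R^2\) and the F-test are the more informative comparisons here.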