Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
scatter.smooth(x=cars$speed, y=cars$dist, main="Distance ~ Speed")
# The scatter plot suggests a roughly linear, increasing relationship between speed and stopping distance.
# Next we check both variables for outliers.
par(mfrow=c(1, 2))
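# boxplot.stats()$out returns the outlying values (not row numbers); collapse them into one subtitle string.
boxplot(cars$speed, main="Speed", sub=paste("Outlier values:", paste(boxplot.stats(cars$speed)$out, collapse=", ")))
boxplot(cars$dist, main="Distance", sub=paste("Outlier values:", paste(boxplot.stats(cars$dist)$out, collapse=", ")))
# Speed has no outliers; Distance has one (120 ft).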
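The boxplot subtitle reports outlying values, not their positions in the data. A minimal sketch of how the corresponding rows could be looked up is shown here; the names `outlier_values` and `outlier_rows` are illustrative and not part of the original analysis.

```r
# Values that boxplot.stats() flags as outliers (beyond 1.5 * IQR from the hinges)
outlier_values <- boxplot.stats(cars$dist)$out

# Row indices of the observations that carry those values
outlier_rows <- which(cars$dist %in% outlier_values)

# Show the full records so speed can be inspected alongside distance
cars[outlier_rows, ]
```

Seeing the whole row helps judge whether the point is a data-entry error or simply a long stop at high speed.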
# Next, density plots show whether each variable is close to normally distributed.
par(mfrow=c(1, 2))
plot(density(cars$speed), main="Density Plot: Speed", ylab="Density") # Speed looks roughly normal.
plot(density(cars$dist), main="Density Plot: Distance", ylab="Density") # Distance is skewed to the right.
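The visual impression of skew can be backed by a number. The sketch below computes the moment-based sample skewness in base R; the helper `skewness()` is a hypothetical function written for this illustration, not a base R function or part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_speed <- skewness(cars$speed)  # near zero for a roughly symmetric variable
skew_dist  <- skewness(cars$dist)   # positive for a right-skewed variable

c(speed = skew_speed, dist = skew_dist)
```

A value close to zero indicates approximate symmetry, while a clearly positive value matches the long right tail seen in the distance density plot.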
# Next we fit a simple linear model of distance on speed.
linearmodel = lm(cars$dist ~ cars$speed)
print(linearmodel)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932  
# The fitted equation is:
# Distance = -17.579 + 3.932 * Speed
# That is, for each increase in speed of 1 mile per hour,
# the expected stopping distance increases by about 3.9 feet.
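The fitted equation can be applied to new speeds with `predict()`. This is a sketch under one assumption: the model is refit with the `data = cars` form of the formula (stored here under the illustrative name `fit`), because `predict()` cannot match `newdata` columns to a formula written as `cars$dist ~ cars$speed`.

```r
# Same model as above, written so that predict() can use newdata
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) at which to estimate stopping distance
new_speeds <- data.frame(speed = c(10, 20))

# Predicted stopping distances in feet
predicted <- predict(fit, newdata = new_speeds)
predicted

# The same value by hand for 20 mph: intercept + slope * 20, about 61.07 ft
coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * 20
```

Predictions are only meaningful inside the observed speed range (4 to 25 mph); the negative intercept shows that extrapolating toward zero speed gives impossible distances.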
summary(linearmodel)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
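The quality measures printed by `summary()` can also be pulled out as numbers for further use. The sketch below refits the same model with the `data = cars` formula form; the object names (`fit`, `fit_summary`, `r_squared`, `rse`, `slope_ci`) are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit)

# Share of the variance in distance explained by speed (about 0.65)
r_squared <- fit_summary$r.squared

# Residual standard error: typical size of a residual, in feet (about 15.4)
rse <- fit_summary$sigma

# 95% confidence interval for the slope
slope_ci <- confint(fit, "speed", level = 0.95)

r_squared
rse
slope_ci
```

A confidence interval for the slope that excludes zero is consistent with the very small p-value reported for speed.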
# Next we examine the residuals.
# First, a partial residual (component + residual) plot shows the relationship between the predictor and the response.
car::crPlots(linearmodel)
# The plot shows some deviation from a strictly linear relationship.
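One rough way to check whether that curvature matters is to compare the straight-line fit with a fit that adds a squared speed term. This is only a sketch and not part of the original analysis; the names `linear_fit`, `quadratic_fit`, and `comparison` are illustrative.

```r
# Straight-line model and a model with an added quadratic term
linear_fit    <- lm(dist ~ speed, data = cars)
quadratic_fit <- lm(dist ~ speed + I(speed^2), data = cars)

# F-test for whether the quadratic term reduces the residual sum of squares
# by more than would be expected by chance
comparison <- anova(linear_fit, quadratic_fit)
comparison
```

A small p-value in the second row would suggest that the curvature is real; a large one would suggest the straight line is adequate for these data.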
mean(linearmodel$residuals)
## [1] 8.65974e-17
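# The mean of the residuals is effectively zero (the tiny value is floating-point noise), as expected for a least-squares fit with an intercept.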
par(mfrow=c(2, 2))
hist(linearmodel$residuals)
# The histogram of residuals shows some right skewness.
plot(linearmodel, which = 1)
qqnorm(linearmodel$residuals)
qqline(linearmodel$residuals)
# The residuals also do not look normal.
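From the Q-Q plot, the points in the center sit slightly off the reference line, and several points at the lower and upper ends deviate more noticeably, so the residuals are not quite normally distributed. This does not mean there is no linear relationship between stopping distance and speed: the slope is highly significant (p = 1.49e-12) and the model explains about 65% of the variance in distance. Rather, the diagnostics suggest that a straight line captures the main trend but does not fully satisfy the linear-model assumptions.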
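The visual reading of the Q-Q plot can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals; the model is refit here so the block runs on its own, and the names `fit` and `sw` are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(residuals(fit))

sw$statistic  # W statistic; values near 1 are consistent with normality
sw$p.value    # a small p-value is evidence against normality
```

With only 50 observations the test has limited power, so it is best read together with the histogram and the Q-Q plot rather than on its own.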