Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
For this section I will load the ‘cars’ dataset in R and use glimpse() to look at the data. After that I will use summary() to get a summary of the dataset, which shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each variable.
library(dplyr)  # provides glimpse()
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
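To make explicit what summary() reports, the same six statistics can be recomputed for dist with base R functions. This is only a minimal sketch; the name `dist_stats` is my own and does not come from the textbook.

```r
# Recompute the six numbers that summary() prints for dist.
# `dist_stats` is an illustrative name, not part of the original analysis.
dist_stats <- c(
  min    = min(cars$dist),
  q1     = unname(quantile(cars$dist, 0.25)),
  median = median(cars$dist),
  mean   = mean(cars$dist),
  q3     = unname(quantile(cars$dist, 0.75)),
  max    = max(cars$dist)
)
print(dist_stats)
```

The values should agree with the dist column of the summary() output above.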
The plot below shows stopping distance (dist) against speed, with a fitted regression line.
library(ggplot2)
# Scatter plot with a least-squares line; no fill aesthetic is needed for points
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Speed VS. Dist",
    x = "Speed", y = "Dist"
  )
## `geom_smooth()` using formula = 'y ~ x'
The blue line shows that there is a positive linear relationship between
the speed and the stopping distance of the cars. Based on Chapter 3 of the textbook, I will
now fit a linear model to the cars dataset.
lm_cars <- lm(dist ~ speed, data = cars)
summary(lm_cars)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
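What does the summary of lm_cars mean? The fitted model is dist = -17.58 + 3.93 × speed, so each additional unit of speed is associated with about 3.93 more units of stopping distance. The p-value for speed (1.49e-12) is far below 0.05, so speed is a significant predictor. The multiple R-squared of 0.6511 means the model explains about 65% of the variation in stopping distance, and the residual standard error is 15.38 on 48 degrees of freedom.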
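The numbers in the summary can also be pulled out of the model object and used directly. The sketch below refits the same model so that it runs on its own, extracts the coefficients and R-squared, checks that R-squared equals the squared correlation between speed and dist (which holds for a simple linear regression with an intercept), and predicts the stopping distance at speed = 20, a value I picked only for illustration.

```r
# Refit the model so this chunk is self-contained.
lm_cars <- lm(dist ~ speed, data = cars)

# Intercept and slope as a named numeric vector.
b <- coef(lm_cars)
print(b)

# 95% confidence intervals for the coefficients.
print(confint(lm_cars, level = 0.95))

# In simple linear regression, R-squared is the squared Pearson correlation.
r_squared <- summary(lm_cars)$r.squared
r <- cor(cars$speed, cars$dist)
print(c(r = r, r_squared = r_squared))

# Predicted stopping distance at speed = 20 (an arbitrary example value).
pred_20 <- predict(lm_cars, newdata = data.frame(speed = 20))
print(pred_20)
```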
This plot of the linear model is similar to the previous one, but here I draw the line with geom_abline(), using the intercept and slope taken from lm_cars.
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_abline(slope = coef(lm_cars)[[2]], intercept = coef(lm_cars)[[1]])
From the residuals vs. fitted plot below, I can see that the variance appears to be roughly constant and there is no obvious curvature.
ggplot(data = cars, aes(x = fitted(lm_cars), y = resid(lm_cars))) +
  geom_point() +
  geom_abline(slope = 0, intercept = 0)
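A quick numerical companion to the residuals vs. fitted plot is sketched below. It is a rough check I am adding under my own assumptions, not a formal test and not a method from the textbook: it confirms the residuals average to zero and compares their spread for small and large fitted values.

```r
# Refit the model so this chunk is self-contained.
lm_cars <- lm(dist ~ speed, data = cars)
res <- resid(lm_cars)
fit <- fitted(lm_cars)

# With an intercept in the model, least-squares residuals average to zero
# (up to floating-point rounding).
print(mean(res))

# Rough check of constant variance: compare the residual standard deviation
# below and above the median fitted value. Similar values are consistent
# with constant variance; a large gap hints at heteroscedasticity.
lower_half <- fit <= median(fit)
spread <- c(low_fitted = sd(res[lower_half]), high_fitted = sd(res[!lower_half]))
print(spread)
```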
In the Q-Q plot below, the standardized residuals fall close to the reference line, which suggests they are approximately normally distributed.
# Q-Q plot of the standardized residuals; geom_qq_line() draws the reference line
ggplot(data.frame(stdresid = rstandard(lm_cars)), aes(sample = stdresid)) +
  geom_qq() +
  geom_qq_line() +
  xlab("Quantiles") +
  ylab("Residuals")
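A Q-Q plot is judged by eye, so it can help to pair it with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the model residuals; this test is my addition and not part of the textbook analysis, and I leave the result to be read off when the code is run.

```r
# Refit the model so this chunk is self-contained.
lm_cars <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(resid(lm_cars))
print(sw)

# TRUE here would be evidence against normality at the 5% level.
print(sw$p.value < 0.05)
```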
The gvlma package runs a global test of the linear model assumptions, along with separate tests for skewness, kurtosis, link function, and heteroscedasticity, and reports the test statistic, p-value, and decision for each. The results are below.
library(gvlma)
gvlma(lm_cars)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = lm_cars)
##
## Value p-value Decision
## Global Stat 15.801 0.003298 Assumptions NOT satisfied!
## Skewness 6.528 0.010621 Assumptions NOT satisfied!
## Kurtosis 1.661 0.197449 Assumptions acceptable.
## Link Function 2.329 0.126998 Assumptions acceptable.
## Heteroscedasticity 5.283 0.021530 Assumptions NOT satisfied!
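Since the heteroscedasticity row is flagged, a second opinion can be computed by hand in base R. The sketch below is Koenker's studentized version of the Breusch-Pagan test: regress the squared residuals on speed and compare n × R-squared to a chi-squared distribution with 1 degree of freedom. This is a different statistic from the one gvlma reports, so the two need not agree, and the names `aux_data`, `bp_stat`, and `bp_p` are mine.

```r
# Refit the model so this chunk is self-contained.
lm_cars <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor.
aux_data <- data.frame(sq_resid = resid(lm_cars)^2, speed = cars$speed)
aux <- lm(sq_resid ~ speed, data = aux_data)

# Test statistic: n times the R-squared of the auxiliary regression.
bp_stat <- nrow(aux_data) * summary(aux)$r.squared

# Under constant variance, bp_stat is approximately chi-squared with
# 1 degree of freedom (one predictor in the auxiliary regression).
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```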
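Now I will plot lm_cars to see what it shows. Calling plot() on the model produces four diagnostic graphs: residuals vs. fitted values, a normal Q-Q plot, a scale-location plot, and residuals vs. leverage. The red line in the graphs is a smoothed trend through the points, not a set of outliers; the points labeled with their row numbers are the observations with the most extreme residuals, and those are the possible outliers.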
plot(lm_cars)
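By default plot(lm_cars) shows the four graphs one at a time. A minimal sketch for seeing them together, and for listing which observations have the largest residuals, is given below; the names `old_par` and `extreme` are my own.

```r
# Refit the model so this chunk is self-contained.
lm_cars <- lm(dist ~ speed, data = cars)

# Show the four diagnostic plots in a 2 x 2 grid, then restore the old layout.
old_par <- par(mfrow = c(2, 2))
plot(lm_cars)
par(old_par)

# Row names of the three observations with the largest absolute residuals.
extreme <- names(sort(abs(resid(lm_cars)), decreasing = TRUE))[1:3]
print(extreme)
```

The rows listed in `extreme` can then be inspected with `cars[extreme, ]` to see their speed and dist values.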