Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Dataset load

In this section I load the ‘cars’ dataset in R and use glimpse() to preview the data. I then use summary() to get the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each variable.

library(dplyr)  # provides glimpse()
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Data plot

The plot below shows stopping distance (dist) against speed, with a fitted linear regression line.

library(ggplot2)

# Scatter plot with a least-squares line; no fill mapping is needed for points
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Speed vs. Dist",
    x = "Speed", y = "Dist"
  )
## `geom_smooth()` using formula = 'y ~ x'

The blue line shows a positive linear relationship between speed and stopping distance. Following chapter 3 of the textbook, I will fit a linear model to the cars dataset.
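To put a number on the positive relationship seen in the plot, here is a minimal sketch that computes the Pearson correlation with base R's cor(). The variable name r is my own choice for illustration.

```r
# Pearson correlation between speed (mph) and stopping distance (ft)
r <- cor(cars$speed, cars$dist)
r

# With a single predictor, r squared equals the R-squared of lm(dist ~ speed)
r^2
```

A value of r close to 1 supports fitting a straight line, although it says nothing about whether the residual assumptions hold.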

Linear Model

lm_cars <- lm(dist ~ speed, data = cars)

summary(lm_cars)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

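What does the summary of lm_cars mean?

The slope for speed is 3.9324, so each additional 1 mph of speed is associated with about 3.93 ft of extra stopping distance on average. The intercept of -17.5791 has no physical meaning on its own, because a speed of 0 is outside the observed range (4 to 25). Both coefficients are significant at the 0.05 level, and the p-value for speed (1.49e-12) is very small. The multiple R-squared of 0.6511 means the model explains about 65% of the variance in stopping distance, and the residual standard error is 15.38 on 48 degrees of freedom.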
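The individual numbers in the summary can also be pulled out of the fitted object, which helps when reading them one at a time. This is a small sketch using base R accessors; the names fit_summary, coefs, r_squared, sigma_hat, and ci are mine.

```r
# Refit the model so this block runs on its own
lm_cars <- lm(dist ~ speed, data = cars)
fit_summary <- summary(lm_cars)

coefs     <- coef(lm_cars)          # intercept and slope estimates
r_squared <- fit_summary$r.squared  # share of variance in dist explained by speed
sigma_hat <- fit_summary$sigma      # residual standard error

# 95% confidence intervals for the intercept and slope
ci <- confint(lm_cars, level = 0.95)

coefs
r_squared
sigma_hat
ci
```

If the confidence interval for speed excludes 0, that agrees with the small p-value reported for the slope.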

Linear model plot

This plot of the linear model is similar to the previous one, but here I draw the line with geom_abline() using the intercept and slope estimated by lm_cars.

ggplot(data = cars, aes(x = speed , y = dist)) + geom_point() +
  geom_abline(slope = coef(lm_cars)[[2]], intercept = coef(lm_cars)[[1]])
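The line drawn from the intercept and slope is the fitted equation dist = intercept + slope * speed. As a rough illustration, the sketch below evaluates that equation by hand and compares it with predict(). The speed of 20 is a hypothetical value I picked, and b0, b1, and speed_new are my own names.

```r
# Refit the model so this block runs on its own
lm_cars <- lm(dist ~ speed, data = cars)

b0 <- coef(lm_cars)[["(Intercept)"]]
b1 <- coef(lm_cars)[["speed"]]

speed_new <- 20  # hypothetical speed in mph, chosen only for illustration

# Fitted stopping distance computed directly from the line
by_hand <- b0 + b1 * speed_new

# The same value from predict()
by_predict <- predict(lm_cars, newdata = data.frame(speed = speed_new))

by_hand
all.equal(by_hand, unname(by_predict))
```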

Residuals VS Fitted

In the residuals vs. fitted plot, the variance looks roughly constant by eye and there is no obvious curvature.

ggplot(data = cars, aes(x = fitted(lm_cars), y = resid(lm_cars))) +
  geom_point() +
  geom_abline(slope = 0, intercept = 0)
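A residual is just the observed distance minus the fitted distance. The sketch below recomputes the residuals by hand and then compares their spread for the lower and upper half of the fitted values, as a crude numeric companion to the plot. The median split is my own informal check, not a formal test, and the names res, low, sd_low, and sd_high are mine.

```r
# Refit the model so this block runs on its own
lm_cars <- lm(dist ~ speed, data = cars)

# Residual = observed - fitted; this should match resid(lm_cars)
res <- cars$dist - fitted(lm_cars)
all.equal(unname(res), unname(resid(lm_cars)))

# With an intercept in the model, least-squares residuals average to zero
mean(res)

# Informal spread comparison: split at the median fitted value
low <- fitted(lm_cars) <= median(fitted(lm_cars))
sd_low  <- sd(res[low])
sd_high <- sd(res[!low])
c(sd_low = sd_low, sd_high = sd_high)
```

If the two standard deviations differ a lot, the constant-variance reading of the plot deserves a second look.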

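In the graph below, the standardized residuals are plotted against normal quantiles. Points close to the reference line suggest that the residuals are approximately normally distributed.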

# Normal Q-Q plot of the standardized residuals; stat_qq() computes the
# theoretical quantiles, so no base-graphics qqnorm()/qqline() calls are needed
ggplot(data = lm_cars, aes(sample = .stdresid)) +
  stat_qq(na.rm = TRUE) +
  geom_abline(slope = 1, intercept = 0) +
  xlab("Theoretical quantiles") +
  ylab("Standardized residuals")
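A Q-Q plot is a visual check, so it can help to pair it with a numeric one. The sketch below runs a Shapiro-Wilk test on the standardized residuals with base R's shapiro.test(). This test is my own addition and is not part of the textbook analysis; the names std_res and sw are mine.

```r
# Refit the model so this block runs on its own
lm_cars <- lm(dist ~ speed, data = cars)

# Standardized residuals, the same quantity the Q-Q plot uses
std_res <- rstandard(lm_cars)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
sw <- shapiro.test(std_res)
sw$statistic
sw$p.value
```

A p-value below 0.05 would be evidence against normality, while a larger p-value only means the test did not detect a departure.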

GVLMA

The gvlma package runs a global test of the linear model assumptions and reports the statistic, p-value, and decision for the global statistic, skewness, kurtosis, link function, and heteroscedasticity. The results are below.

library(gvlma)
gvlma(lm_cars)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = lm_cars) 
## 
##                     Value  p-value                   Decision
## Global Stat        15.801 0.003298 Assumptions NOT satisfied!
## Skewness            6.528 0.010621 Assumptions NOT satisfied!
## Kurtosis            1.661 0.197449    Assumptions acceptable.
## Link Function       2.329 0.126998    Assumptions acceptable.
## Heteroscedasticity  5.283 0.021530 Assumptions NOT satisfied!
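The Decision column appears to follow from comparing each p-value with the 0.05 level of significance printed above. The sketch below re-applies that rule to the p-values copied by hand from the output; it does not call gvlma itself, and the data frame gvlma_results is my own construction.

```r
# p-values copied from the gvlma output above
gvlma_results <- data.frame(
  test = c("Global Stat", "Skewness", "Kurtosis",
           "Link Function", "Heteroscedasticity"),
  p_value = c(0.003298, 0.010621, 0.197449, 0.126998, 0.021530)
)

alpha <- 0.05  # level of significance reported by gvlma

# An assumption is flagged when its p-value falls below alpha
gvlma_results$rejected <- gvlma_results$p_value < alpha
gvlma_results
```

Under this rule the global statistic, skewness, and heteroscedasticity are flagged, while kurtosis and the link function are acceptable, which matches the printed decisions.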

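Now I will plot lm_cars to see what it shows. Plotting lm_cars produces four diagnostic graphs: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The red line in the residual plots is a smoothed trend through the points, not a marker of outliers. The possible outliers are the points labeled with their row numbers.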

plot(lm_cars)
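To see all four diagnostic graphs on one page, the plotting area can be split into a 2 x 2 grid with par(mfrow = ...). The sketch below does that and then lists the three observations with the largest absolute standardized residuals as candidates for a closer look. The choice of three and the name extreme are mine.

```r
# Refit the model so this block runs on its own
lm_cars <- lm(dist ~ speed, data = cars)

# Show the four diagnostic plots in a 2 x 2 grid, then restore the old layout
old_par <- par(mfrow = c(2, 2))
plot(lm_cars)
par(old_par)

# Row numbers of the three largest absolute standardized residuals
extreme <- head(order(abs(rstandard(lm_cars)), decreasing = TRUE), 3)
cars[extreme, ]
```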