Assignment 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Dataset load

In this section I load the ‘cars’ dataset in R and use glimpse() to look at the data. I then use summary() to get the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each variable.

library(dplyr)  # provides glimpse()
glimpse(cars)

## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Data plot

The plot below shows stopping distance (dist) against speed, with a fitted regression line.

library(ggplot2)

# The fill aesthetic is dropped: it has no effect on the default points
# and triggers a "dropped aesthetics" warning in geom_smooth()
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Stopping distance vs. speed",
    x = "Speed (mph)", y = "Stopping distance (ft)"
  )

## `geom_smooth()` using formula = 'y ~ x'

The blue line shows a positive linear relationship between speed and stopping distance. Following chapter 3 of the textbook, I will fit a linear model to the cars dataset.
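
The strength of the linear association can also be put into a number with the Pearson correlation. The following is a small sketch using base R's cor() and cor.test(); it is added for illustration and is not part of the textbook's workflow.

```r
# Pearson correlation between speed and stopping distance (built-in cars data)
r <- cor(cars$speed, cars$dist)
r

# Test of the null hypothesis that the true correlation is zero
cor.test(cars$speed, cars$dist)

# In a simple linear regression with one predictor,
# the squared correlation equals the model's R-squared
r_squared_from_cor <- r^2
r_squared_from_cor
```

A value of r close to 1 would indicate a strong positive linear relationship; squaring it gives the share of variance that a straight-line fit can explain.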

Linear Model

lm_cars <- lm(dist ~ speed, data = cars)

summary(lm_cars)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
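
The goodness-of-fit figures at the bottom of this output can be reproduced by hand, which makes clear what each one measures. This is a minimal sketch in base R; the variable names (ss_res, ss_tot, and so on) are my own and chosen for illustration.

```r
lm_cars <- lm(dist ~ speed, data = cars)

ss_res <- sum(resid(lm_cars)^2)                  # residual sum of squares
ss_tot <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
n <- nrow(cars)                                  # number of observations
p <- 1                                           # number of predictors

r2     <- 1 - ss_res / ss_tot                    # Multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # Adjusted R-squared
rse    <- sqrt(ss_res / (n - p - 1))             # Residual standard error

c(r2 = r2, adj_r2 = adj_r2, rse = rse)
```

The three values should agree with the Multiple R-squared, Adjusted R-squared, and residual standard error printed by summary(lm_cars).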

What does the summary of lm_cars mean?

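The model can be expressed as \(\widehat{dist} = -17.5791 + 3.9324 \cdot speed\).
This can be interpreted as follows: for each increase in speed of one mile per hour, the estimated stopping distance increases by 3.9324 feet.
The median of the residuals (-2.272) is close to 0, and the first and third quartiles (-9.525 and 9.215) are close in magnitude, so the middle of the residual distribution is roughly symmetric. The maximum (43.201) is larger in magnitude than the minimum (-29.069), which hints at a longer right tail.
The multiple R-squared of 0.6511 means that the model accounts for 65.11% of the variability in stopping distance (the adjusted R-squared is 0.6438).
The p-values are small (0.0123 for the intercept and 1.49e-12 for speed), as flagged by the * and *** next to them. Both estimates are therefore statistically significant at the 0.05 level: if the true values were zero, estimates this far from zero would be unlikely.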
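
To make the interpretation concrete, the fitted model can be used to predict stopping distances and to attach intervals to the estimates. The sketch below uses base R's confint() and predict(); the speeds 10, 15, and 20 mph are arbitrary values picked for illustration.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the intercept and slope
confint(lm_cars, level = 0.95)

# Predicted stopping distance, with 95% prediction intervals,
# at a few illustrative speeds (mph)
new_speeds <- data.frame(speed = c(10, 15, 20))
preds <- predict(lm_cars, newdata = new_speeds, interval = "prediction")
cbind(new_speeds, preds)
```

As a check by hand, the rounded coefficients give -17.5791 + 3.9324 × 15 ≈ 41.41 feet at 15 mph. The negative intercept has no physical meaning; it only reflects that the line is fitted to speeds between 4 and 25 mph and should not be extrapolated to 0.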

Linear model plot

This plot of the linear model is similar to the previous one, but here the line is drawn from the intercept and slope estimated by lm_cars.

ggplot(data = cars, aes(x = speed , y = dist)) + geom_point() +
  geom_abline(slope = coef(lm_cars)[[2]], intercept = coef(lm_cars)[[1]])

Residuals VS Fitted

In the residuals vs. fitted plot, the variance appears roughly constant by eye and there is no obvious curvature, although a visual check alone is not conclusive.

ggplot(data = cars, aes(x = fitted(lm_cars), y = resid(lm_cars))) +
  geom_point() +
  geom_hline(yintercept = 0)
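
A visual impression of constant variance can be backed up with a numeric check. The sketch below computes a Breusch-Pagan-style statistic by hand (the studentized, or Koenker, form: n times the R-squared of an auxiliary regression of the squared residuals on the fitted values). This is my own addition, not a step from the textbook, and the names res_sq, fit, and aux are illustrative.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the fitted values
res_sq <- resid(lm_cars)^2
fit    <- fitted(lm_cars)
aux    <- lm(res_sq ~ fit)

n       <- nrow(cars)
bp_stat <- n * summary(aux)$r.squared                    # LM statistic
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)   # chi-squared, 1 df

c(statistic = bp_stat, p_value = bp_p)
```

A small p-value would suggest that the spread of the residuals changes with the fitted values; a large one is consistent with constant variance.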

The Q-Q plot below suggests that the standardized residuals are approximately normally distributed.

# geom_qq() computes the theoretical quantiles itself, so qqnorm() and
# qqline() (base graphics functions) are not needed inside aes()
ggplot(data = lm_cars, aes(sample = .stdresid)) +
  geom_qq(na.rm = TRUE) +
  geom_abline(slope = 1, intercept = 0) +
  xlab("Theoretical quantiles") +
  ylab("Standardized residuals")
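
A Q-Q plot is judged by eye, so it can help to pair it with a formal normality test. The following sketch applies base R's shapiro.test() to the standardized residuals; it is an optional extra check that I am adding here, not something the chapter requires.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# Standardized residuals, the same quantity plotted in the Q-Q plot
std_res <- rstandard(lm_cars)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
sw <- shapiro.test(std_res)
sw
```

A p-value below 0.05 would indicate a detectable departure from normality, even if the Q-Q plot looks close to a straight line.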

GVLMA

The gvlma package performs a global validation of the linear model assumptions. It reports a test statistic and a p-value for a global test and for four component tests: skewness, kurtosis, link function, and heteroscedasticity. The results are below.

library(gvlma)  # provides gvlma()
gvlma(lm_cars)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = lm_cars) 
## 
##                     Value  p-value                   Decision
## Global Stat        15.801 0.003298 Assumptions NOT satisfied!
## Skewness            6.528 0.010621 Assumptions NOT satisfied!
## Kurtosis            1.661 0.197449    Assumptions acceptable.
## Link Function       2.329 0.126998    Assumptions acceptable.
## Heteroscedasticity  5.283 0.021530 Assumptions NOT satisfied!

At the 0.05 significance level, the global test, the skewness test, and the heteroscedasticity test are not satisfied, while kurtosis and the link function are acceptable. This suggests the residuals are somewhat skewed and their variance may not be constant, so the visual checks above should be read with caution.

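Finally, I plot lm_cars to see the standard diagnostics. plot(lm_cars) produces four graphs: Residuals vs Fitted, the normal Q-Q plot, Scale-Location, and Residuals vs Leverage. The red line in these graphs is a smoothed trend through the points, not a marker for outliers. Possible outliers are the points labeled with their row numbers.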

plot(lm_cars)
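
The four diagnostic plots are easier to compare side by side, and the labeled points can be listed directly. The sketch below is one possible way to do both in base R; the cutoff of 2 for the standardized residuals is a common rule of thumb that I am assuming here, not a value taken from the textbook.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# Show the four default diagnostic plots in a 2 x 2 grid,
# then restore the previous graphics settings
old_par <- par(mfrow = c(2, 2))
plot(lm_cars)
par(old_par)

# Rows whose standardized residual exceeds 2 in absolute value
std_res <- rstandard(lm_cars)
flagged <- which(abs(std_res) > 2)
cars[flagged, ]

# Cook's distance for the flagged rows, as a measure of their influence
cooks.distance(lm_cars)[flagged]
```

Rows listed here are candidates for a closer look; a large residual alone does not mean the observation is wrong or should be removed.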