Regression Models

Problem Statement

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis in chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

Load Packages

library(ggplot2)
library(tidyverse)
library(knitr)

Answer:

First, let’s lay out some basic information about the cars dataset:

head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

So, it has two columns: speed (in mph) and dist (stopping distance, in ft).

Some basic statistics about cars:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Dimensions of cars:

dim(cars)

## [1] 50  2

The dataset has 50 rows and 2 columns.

Now, we’ll produce a scatter plot of the cars data, with speed as the explanatory variable and dist as the response variable:

cars %>% 
  ggplot(aes(speed, dist)) + geom_point() +
  labs(title = "Stopping Distance vs Speed", x = "Speed", y = "Stopping Distance")

The points trend upward to the right: stopping distance tends to increase with speed.
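The strength of this trend can be quantified with the Pearson correlation coefficient. The following is a minimal sketch using only base R and the built-in cars data; the variable name speed_dist_cor is chosen here for illustration.

```r
# Pearson correlation between speed and stopping distance
speed_dist_cor <- cor(cars$speed, cars$dist)
speed_dist_cor

# For a simple linear regression with one predictor, the squared
# correlation equals the model's Multiple R-squared
speed_dist_cor^2
```

A value close to +1 indicates a strong positive linear relationship. The squared value should match the Multiple R-squared of 0.6511 reported by summary() later in this analysis.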

Linear model:

linear_model <- lm(cars$dist ~ cars$speed)
linear_model

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932

We observe that the intercept is -17.579 and the slope is 3.932.
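These two coefficients define the fitted line dist = -17.579 + 3.932 × speed. As a rough sketch of how the line is used, the code below evaluates it at an illustrative speed of 15 mph (the median speed in the summary above) and compares the hand calculation with predict(). The names model_df, example_speed, by_hand, and by_model are assumptions introduced for this example. The model is refit with the data argument because predict() needs the predictor to be a plain column name in order to accept new data.

```r
# Refit with the data argument so that predict() can accept new data
# (model_df is an illustrative name, not part of the original analysis)
model_df <- lm(dist ~ speed, data = cars)

# Hand calculation with the rounded coefficients reported above
example_speed <- 15  # illustrative speed in mph
by_hand <- -17.579 + 3.932 * example_speed
by_hand

# The same prediction from the fitted model object
by_model <- predict(model_df, newdata = data.frame(speed = example_speed))
by_model
```

Both approaches give a predicted stopping distance of about 41.4 ft; the small difference comes from rounding the coefficients in the hand calculation.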

Now, let’s overlay the fitted line on the scatter plot and label it with the model equation:

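# Build a plotmath label showing the fitted equation and R-squared.
# Coefficients are formatted so the label does not print every digit.
linear_model_fn <- function(m){
    eq <- substitute(italic(y) == a + b %.% italic(x) * "," ~~ italic(r)^2 ~ "=" ~ r2, 
         list(a = format(unname(coef(m)[1]), digits = 4),
              b = format(unname(coef(m)[2]), digits = 4),
              r2 = format(summary(m)$r.squared, digits = 3)))
    as.character(as.expression(eq))
}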

cars %>% ggplot(aes(speed, dist)) + geom_point() +
  geom_abline(slope = linear_model$coefficients[2], intercept = linear_model$coefficients[1], color="blue") +
  labs(title = "Stopping Distance vs Speed", x = "Speed", y = "Stopping Distance") +
  geom_label(x = 10, y = 115, label.size = NA, label = linear_model_fn(linear_model), parse = TRUE)

Now, we’ll draw some conclusions from the summary of linear_model:

summary(linear_model)

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Conclusions (this analysis is based on Art 3.3, page 20, of the textbook, “Linear Regression Using R”):
1) The residuals have a minimum, median, and maximum of -29.069, -2.272, and 43.201 respectively. The 1st and 3rd quartiles are -9.525 and 9.215 respectively. The quartiles are roughly symmetric about a median near zero, although the maximum is larger in magnitude than the minimum. So, the residuals appear to be approximately normal, which we’ll verify below.
2) The estimate and standard error for speed are 3.9324 and 0.4155 respectively. Therefore, the standard error is 3.9324/0.4155 = 9.46 times smaller than the estimate. Based on the book’s recommendation, this indicates a good model.
3) The p-value for speed is 1.49e-12. If speed had no effect on stopping distance, a t value this extreme would be very unlikely. Therefore, speed is a significant predictor of stopping distance.
4) The intercept’s p-value is 0.0123. Therefore, the intercept is significant at the 0.05 level.
5) R-squared is 0.6511. Therefore, the model explains about 65% of the variation in stopping distance.
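The quantities quoted in these conclusions can be recomputed from the model object. The sketch below is one way to do so; the model is refit so that the snippet is self-contained, and the names m, coef_table, ratio, rss, tss, and r_squared are illustrative.

```r
# Refit the model so this snippet is self-contained
m <- lm(dist ~ speed, data = cars)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(m))

# Ratio of each estimate to its standard error (this is the t value)
ratio <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]
ratio

# R-squared as 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(m)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - rss / tss
r_squared
```

The ratio for speed reproduces the 9.46 computed in conclusion 2, and r_squared reproduces the Multiple R-squared of 0.6511. The ratio for the intercept is about 2.6 in magnitude, matching its t value in the summary.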

So, overall, I would consider this a good model.

Verification of the normality of Residuals:

A visualization of the residuals suggests that their distribution is approximately normal.
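The plotting code for this step is not shown in this write-up. One way such a check could be done is sketched below, with a residuals-versus-fitted plot and a normal Q-Q plot. This is an assumption about a reasonable approach, not necessarily the plots originally used; the names m, resid_df, p_resid, and p_qq are illustrative.

```r
library(ggplot2)

# Refit the model so this snippet is self-contained
m <- lm(dist ~ speed, data = cars)
resid_df <- data.frame(fitted = fitted(m), resid = residuals(m))

# Residuals vs fitted values: points should scatter evenly around zero
p_resid <- ggplot(resid_df, aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted Stopping Distance", y = "Residual")

# Normal Q-Q plot: points close to the reference line suggest normal residuals
p_qq <- ggplot(resid_df, aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot of Residuals", x = "Theoretical Quantiles", y = "Sample Quantiles")

p_resid
p_qq
```

In the Q-Q plot, departures from the reference line at either end would indicate tails heavier or lighter than those of a normal distribution.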

Marker: 605-11