Data 605 HW11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

#Load the libraries used below and the cars dataset into a variable as a data frame
library(psych)    #provides describe()
library(ggplot2)  #used for the plots
cars <- datasets::cars


#Let's investigate the data
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
describe(cars)
##       vars  n  mean    sd median trimmed   mad min max range  skew kurtosis
## speed    1 50 15.40  5.29     15   15.47  5.93   4  25    21 -0.11    -0.67
## dist     2 50 42.98 25.77     36   40.88 23.72   2 120   118  0.76     0.12
##         se
## speed 0.75
## dist  3.64
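
Before modeling, it is also worth confirming the structure and units of the data (per the ?cars help page, speed is in mph and dist is in ft):

#Quick structural check of the data frame: 50 observations of 2 numeric variables
str(cars)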
#Using the lm formula to build the linear regression model
cars.lm <- lm(dist ~ speed, data=cars)

#The summary gives us the coefficients (intercept and slope) along with their standard errors and p-values, which describe the linear relationship between stopping distance and speed.
summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

What’s important to note is the p-value for speed, which is well below 0.05, so the relationship between speed and stopping distance is statistically significant. The intercept is -17.5791 and the slope is 3.9324, meaning the predicted stopping distance increases by about 3.93 feet for each 1 mph increase in speed.

The adjusted R^2 is 0.6438, so speed explains roughly 64% of the variability in stopping distance.
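
As a quick sanity check on these coefficients, the fitted equation can be applied to a specific speed; the 20 mph value below is chosen purely for illustration:

#Predicted stopping distance at 20 mph, computed two equivalent ways
coef(cars.lm)[1] + coef(cars.lm)[2] * 20
predict(cars.lm, newdata = data.frame(speed = 20))

Both give roughly -17.5791 + 3.9324 * 20 ≈ 61.1 feet.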

Visualizing the Data

#ggplot without getting the slope and intercept manually
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

#ggplot with the slope and intercept taken from cars.lm

intercept <- coef(cars.lm)[1]
slope <- coef(cars.lm)[2]

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_abline(slope = slope, intercept = intercept)
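
For comparison, the same fit can also be drawn with base R graphics; this is just an alternative sketch of the plot above:

#Base R scatterplot with the fitted regression line added
plot(dist ~ speed, data = cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(cars.lm)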

Model Diagnostics

In order to evaluate whether the linear model is reliable, three conditions need to be checked: a) linearity, b) nearly normal residuals, and c) constant variability of the residuals.
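
A compact way to look at these conditions at once is R's built-in diagnostic plots for lm objects (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage):

#Display the four standard diagnostic plots in a 2x2 grid, then reset the layout
par(mfrow = c(2, 2))
plot(cars.lm)
par(mfrow = c(1, 1))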

Let’s check for linearity and constant variability with a residuals vs. fitted plot:

ggplot(data = cars.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

There is no clear pattern in the residuals vs. fitted plot, and the spread of the residuals stays roughly constant across the fitted values, which supports both the linearity and constant variability conditions.
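
If a more formal check of constant variance is desired, the Breusch-Pagan test can supplement the plot; this assumes the lmtest package (not used elsewhere in this assignment) is installed:

#Breusch-Pagan test; the null hypothesis is constant residual variance
library(lmtest)
bptest(cars.lm)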

Check for Normal Residuals:

ggplot(data = cars.lm, aes(x = .resid)) +
  geom_histogram(binwidth = 10) +
  xlab("Residuals")

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

The residuals appear close to normally distributed, even though the histogram and Q-Q plot show a few large positive residuals at the right tail. The residuals don’t have to be perfectly normal, only nearly normal, so I would conclude that the linear model is a viable fit for these data.
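
As an optional supplement to the visual checks, base R's Shapiro-Wilk test can be applied to the residuals; a large p-value would be consistent with the near-normality seen in the plots:

#Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
shapiro.test(resid(cars.lm))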