605_homework

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Loading data

data(cars)

print(sum(is.na(cars$speed)))

## [1] 0

print(sum(is.na(cars$dist)))

## [1] 0

There are no missing values.

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

sum(abs((cars$speed - mean(cars$speed)) / sd(cars$speed)) > 2.5)

## [1] 0

sum(abs((cars$dist - mean(cars$dist)) / sd(cars$dist)) > 2.5)

## [1] 1

Using a cutoff of 2.5 standard deviations from the mean, there are no outliers in speed and one potential outlier in stopping distance.

cars[abs((cars$dist - mean(cars$dist)) / sd(cars$dist)) > 2.5,]

##    speed dist
## 49    24  120
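The two z-score checks above can be combined into a single step. The following is a minimal sketch using base R's scale(), which centers each column by its mean and divides by its sample standard deviation. The variable names z and outlier_counts are illustrative and do not appear in the original analysis.

```r
# Standardize both columns at once; scale() uses the column mean and
# the sample standard deviation, matching the manual z-scores above.
data(cars)
z <- scale(cars)

# Count observations beyond the 2.5 standard deviation cutoff per column.
outlier_counts <- colSums(abs(z) > 2.5)
print(outlier_counts)
```

The counts should agree with the two separate checks above (0 for speed, 1 for dist).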

library(ggplot2)

ggplot(cars, aes(x=speed, y=dist)) +
  geom_point(position="jitter") +
  geom_smooth(method="lm")

## `geom_smooth()` using formula 'y ~ x'

Based on the scatterplot above, we can see that there is a strong positive correlation between speed and stopping distance of a vehicle.

cor(cars$speed, cars$dist)

## [1] 0.8068949

The correlation coefficient of about 0.81 confirms a strong, though not perfect, positive linear relationship.
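To check whether a correlation of this size could plausibly arise by chance, cor.test() from base R reports a confidence interval and a p-value alongside the estimate. This is a small sketch, and the object name ct is illustrative.

```r
# Pearson correlation test (the default method) between speed and distance.
data(cars)
ct <- cor.test(cars$speed, cars$dist)

print(ct$estimate)   # sample correlation coefficient
print(ct$conf.int)   # 95% confidence interval for the correlation
print(ct$p.value)    # p-value for the null hypothesis of zero correlation
```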

Building a model

model <- lm(dist ~ speed, data=cars)

summary(model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

We can see from the above summary that:

Speed is a good predictor of stopping distance. With a p-value so close to 0, it is highly unlikely that we would see a t-statistic this large if the two variables were not linearly related.
The model accounts for about 65% of the variation in stopping distance (multiple R-squared of 0.6511, adjusted R-squared of 0.6438).
The model can be described as: "stopping_distance = -17.5791 + 3.9324*speed"
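The fitted equation can be applied to new speeds either by hand or with predict(). The sketch below does both and checks that they agree. The speeds 10 and 20 are arbitrary illustrative inputs, and the names new_speeds, predicted, and manual are not part of the original analysis.

```r
# Refit the model so this example is self-contained.
data(cars)
model <- lm(dist ~ speed, data = cars)

# Illustrative speeds at which to predict stopping distance.
new_speeds <- data.frame(speed = c(10, 20))

# Prediction via predict() and via the fitted equation written out by hand.
predicted <- predict(model, newdata = new_speeds)
manual <- coef(model)[["(Intercept)"]] + coef(model)[["speed"]] * new_speeds$speed

print(predicted)
print(all.equal(unname(predicted), manual))

# 95% confidence intervals for the intercept and slope.
print(confint(model))
```

The confidence interval for the slope gives a sense of how precisely the effect of speed is estimated, which complements the p-value in the summary.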

Analyzing Residuals

ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

From the residuals plot above we see that, for the most part, the residuals are scattered around zero. However, their spread does not look constant across the fitted values, which suggests slight heteroscedasticity.
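The visual impression of heteroscedasticity can be checked with a formal test. The following is a minimal base R sketch of the studentized (Koenker) form of the Breusch-Pagan test, which is not part of the original analysis: the squared residuals are regressed on speed, and n times the R-squared of that auxiliary regression is compared against a chi-squared distribution with one degree of freedom. The names sq_resid, aux, bp_stat, and bp_p are illustrative.

```r
# Refit the model so this example is self-contained.
data(cars)
model <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor.
sq_resid <- resid(model)^2
aux <- lm(sq_resid ~ speed, data = cars)

# Test statistic: n * R^2 of the auxiliary regression, chi-squared with
# 1 degree of freedom (one predictor) under constant variance.
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value would support the presence of non-constant variance, while a larger one would suggest the pattern in the plot is mild.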

qqnorm(resid(model))
qqline(resid(model))

Using the Q-Q plot, we can see that the residuals deviate from a normal distribution. In this case, the distribution of the residuals has a heavier right tail.
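The Q-Q plot can be complemented with a numerical check. As a sketch, the Shapiro-Wilk test from base R tests the null hypothesis that the residuals come from a normal distribution. The object name sw is illustrative, and the test is an addition to the original analysis.

```r
# Refit the model so this example is self-contained.
data(cars)
model <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test of normality on the model residuals.
sw <- shapiro.test(resid(model))
print(sw)
```

A small p-value would agree with the departure from normality seen in the plot.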

ggplot(data = model, aes(x = .resid)) +
  geom_histogram(binwidth=8)

605_homework_11

Alec

4/10/2022
