605 assignment 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

# what is stopping distance as a function of speed?
library(tibble)
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…

Visualization

# step 1: check visually whether a linear relationship looks plausible; here it does
plot(cars[, "speed"], cars[, "dist"], main = "Car Data", xlab = "Speed", ylab = "Distance")

#or can use 'plot(dist ~ speed, data = cars)'
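
As a numeric complement to the scatterplot, the strength of the linear relationship can be summarized with the Pearson correlation. The following is a minimal sketch using only base R and the built-in cars data; the variable name `r` is just an illustrative choice.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
print(r)

# cor.test() adds a significance test and a confidence interval
print(cor.test(cars$speed, cars$dist))
```

A value close to 1 supports the visual impression of a positive linear trend, though it does not by itself show that a straight line is the best functional form.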

Linear Model function

cars.lm <- lm(dist ~ speed, data=cars)
cars.lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
# final regression model: predicted distance = -17.58 + 3.93 * speed
# plot the original data along with the fitted line
plot(dist ~ speed, data=cars)
abline(cars.lm)
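
To make the fitted equation concrete, the sketch below uses it to predict stopping distance at a few speeds, once with `predict()` and once by plugging the coefficients into the equation by hand. The speeds in `new_speeds` are hypothetical values chosen for illustration, not part of the assignment.

```r
# Refit the model so this block runs on its own
cars.lm <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) at which to predict stopping distance (ft)
new_speeds <- data.frame(speed = c(10, 15, 20))

# Prediction via predict()
pred <- predict(cars.lm, newdata = new_speeds)
print(cbind(new_speeds, predicted_dist = pred))

# Same prediction by hand: intercept + slope * speed
b <- coef(cars.lm)
by_hand <- b["(Intercept)"] + b["speed"] * new_speeds$speed
print(by_hand)
```

Both approaches should agree; at 15 mph the predicted distance is roughly 41.4 ft.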

Quality evaluation of the model.

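a. The median residual is -2.27, which is close to 0, and the first and third quartiles (-9.53 and 9.22) are of similar magnitude, so the residuals look roughly centered and symmetric, as a good fit would produce. The maximum (43.20) is noticeably larger in magnitude than the minimum (-29.07), which hints at a longer right tail.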

b. The standard error of the speed coefficient is 0.4155. We want the standard error to be at least 5 to 10 times smaller than the coefficient, which indicates low variability in the slope estimate. Here the ratio is 3.9324 / 0.4155 = 9.46.

c. t value: the estimate divided by its standard error (for speed, the 9.46 ratio from b). Does it have any importance beyond being used to compute the next column, Pr(>|t|)?

d. Pr(>|t|) is the probability of observing a t value at least as extreme as the one observed (|t| of 9.464 or more), assuming there is no linear relationship. That probability is very small (1.49e-12), so there is strong evidence of a linear relationship between speed and distance.
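
Points b through d can be reproduced by hand from the coefficient table. The sketch below divides the slope estimate by its standard error to get the t value, then computes the two-sided p-value from the t distribution with the residual degrees of freedom. The variable names (`coefs`, `t_val`, `p_val`) are illustrative.

```r
# Refit the model so this block runs on its own
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(cars.lm))
est <- coefs["speed", "Estimate"]
se  <- coefs["speed", "Std. Error"]

# t value is the estimate divided by its standard error
t_val <- est / se

# Two-sided p-value using the residual degrees of freedom (n - 2)
df <- df.residual(cars.lm)
p_val <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)

print(c(t_value = t_val, df = df, p_value = p_val))
```

The hand-computed values should match the `t value` and `Pr(>|t|)` columns reported by `summary(cars.lm)`.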

summary(cars.lm) #tells us more about quality of model
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
# Multiple R-squared: 65.11% of the variability in distance is explained by variation in speed
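
The R-squared, adjusted R-squared, and residual standard error lines of the summary can be recomputed from the residuals, which may help show what "variability explained" means. This is a minimal sketch in base R; the names `ss_res`, `ss_tot`, `r2`, `adj_r2`, and `rse` are illustrative.

```r
# Refit the model so this block runs on its own
cars.lm <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares
ss_res <- sum(resid(cars.lm)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of total variability not left in the residuals
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared with n observations and p = 1 predictor
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Residual standard error on n - 2 degrees of freedom
rse <- sqrt(ss_res / df.residual(cars.lm))

print(c(r_squared = r2, adj_r_squared = adj_r2, resid_std_error = rse))
```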

Residual Analysis

hist(resid(cars.lm), main = "Histogram of Residuals", xlab = "Residuals")

plot(fitted(cars.lm),resid(cars.lm))

qqnorm(resid(cars.lm))
qqline(resid(cars.lm)) # manual way to plot this and the one above; it can be done differently as well
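
The histogram and Q-Q plot are visual checks of normality. As a possible numeric complement, the sketch below applies the Shapiro-Wilk test from base R to the residuals; this test is an addition for illustration and is not part of the original analysis.

```r
# Refit the model so this block runs on its own
cars.lm <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(cars.lm))
print(sw)
```

A small p-value would suggest a departure from normality, which should be weighed together with the plots rather than read in isolation.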

par(mfrow=c(2,2))
plot(cars.lm)

Conclusion - linear model might not be a good fit

Top left (Residuals vs Fitted): checks for linearity. We do not want to see a pattern, since a pattern would mean the relationship is not linear; I do not believe there is a pattern.

Top right (Normal Q-Q): checks for normality of the residuals. We want the points to fall on the line; I believe they do for the most part, although some points at both ends do not.

Bottom left (Scale-Location): checks for constant variance of the residuals. We want to see a straight, roughly horizontal line; it is somewhat straight, but it increases around x = 50, which makes me question the homoscedasticity.

Bottom right (Residuals vs Leverage): checks whether outliers influence the model. There seem to be some points that have both high leverage and large residuals, meaning those outliers affect the regression model.
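
The two bottom panels can be backed up with numbers. The sketch below is one possible approach in base R, not the textbook's method: it runs a studentized Breusch-Pagan-style check by hand (regressing the squared residuals on speed and comparing n times R-squared to a chi-square with 1 degree of freedom), and it lists observations whose Cook's distance exceeds 4/n. The 4/n cutoff is a common rule of thumb and is an assumption here, not a strict threshold.

```r
# Refit the model so this block runs on its own
cars.lm <- lm(dist ~ speed, data = cars)
n <- nrow(cars)

# Constant variance: auxiliary regression of squared residuals on speed
e2 <- resid(cars.lm)^2
aux <- lm(e2 ~ speed, data = cars)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(bp_statistic = bp_stat, p_value = bp_p))

# Influence: Cook's distance and leverage (hat values) for each observation
cd <- cooks.distance(cars.lm)
lev <- hatvalues(cars.lm)

# Flag observations above the rule-of-thumb cutoff 4/n (an assumption)
flagged <- which(cd > 4 / n)
print(data.frame(obs = flagged, cooks_d = cd[flagged], leverage = lev[flagged]))
```

A small p-value in the first check would point toward non-constant variance, and the flagged rows are the observations worth inspecting when judging whether outliers drive the fit.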