Basic exploration

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
## [1] 0

## [1] 0.8068949
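  • No NA values (the NA count is 0)
  • Speed looks roughly symmetric (mean 15.4 vs. median 15.0); distance is skewed to the right (mean 42.98 vs. median 36.00)
    • Distance appears to have one outlier at the high end (max 120)
  • High correlation between speed and distance (r ≈ 0.81)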
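
The calls that produced the output above are not shown, so here is a minimal sketch of how those numbers could be reproduced in base R with the built-in cars data set. The variable names are hypothetical, and the boxplot.stats() check is an assumption about how the outlier in distance might be confirmed, not necessarily how it was spotted originally.

```r
# Summary statistics for both columns
summary(cars)

# Count missing values across the whole data frame
na_count <- sum(is.na(cars))
na_count

# Pearson correlation between speed and stopping distance
speed_dist_cor <- cor(cars$speed, cars$dist)
speed_dist_cor

# Assumption: use the 1.5 * IQR boxplot rule to flag outlying distances
dist_outliers <- boxplot.stats(cars$dist)$out
dist_outliers
```

The boxplot rule is only one convention for flagging outliers; a histogram or Q-Q plot of each column would be a more direct check on the shape of the distributions.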

Visualization

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
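library(ggplot2)
my_cars <- cars
# method = "lm" fits a straight line; span only applies to loess smoothers, so it is dropped
ggplot(my_cars, aes(x = speed, y = dist)) +
    geom_point() +
    geom_smooth(method = "lm")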

  • The straight geom_smooth line suggests our residuals are going to look a little off
    • The points seem to spread further from the fitted line toward the ends of the speed range
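
One rough way to put a number on that visual impression is to compare the spread of the residuals in different parts of the speed range. The sketch below assumes the same straight-line fit that geom_smooth draws (dist on speed); plot_fit, speed_bin, and resid_spread are hypothetical names, and three equal-width bins is an arbitrary choice.

```r
# Straight-line fit matching the geom_smooth(method = "lm") line (dist on speed)
plot_fit <- lm(dist ~ speed, data = cars)

# Split speed into three equal-width bins (low, middle, high)
speed_bin <- cut(cars$speed, breaks = 3)

# Standard deviation of the residuals within each bin;
# clearly unequal values would point to non-constant spread
resid_spread <- tapply(residuals(plot_fit), speed_bin, sd)
resid_spread
```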

Run models

Model 1

my_fit <- lm(speed ~ dist, data = my_cars)
layout(matrix(c(1, 2, 3, 4), 2, 2))
summary(my_fit)
## 
## Call:
## lm(formula = speed ~ dist, data = my_cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Summary from model 1

  • About 65% of the variance in speed is explained by distance (multiple R-squared 0.6511, adjusted 0.6438)
  • The median of our residuals (0.3615) is a little off center
  • The quartiles are somewhat asymmetric as well (1Q -2.1550 vs. 3Q 2.4377)
  • Let's take a look at the plots of the residuals
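
The figures quoted in these bullets can also be pulled out of the fitted model directly instead of being read off the printed summary. This is a minimal sketch that re-fits model 1 on the built-in cars data so it runs on its own; fit_summary, r_squared, adj_r_squared, and resid_quartiles are hypothetical names.

```r
# Re-fit model 1 so this snippet is self-contained
my_fit <- lm(speed ~ dist, data = cars)
fit_summary <- summary(my_fit)

# Share of variance in speed explained by distance
r_squared <- fit_summary$r.squared
adj_r_squared <- fit_summary$adj.r.squared
c(r_squared = r_squared, adj_r_squared = adj_r_squared)

# Five-number view of the residuals (min, quartiles, max)
resid_quartiles <- quantile(residuals(my_fit))
resid_quartiles
```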

Plot residuals model 1

layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)

  • The residuals are not evenly scattered around zero: there is curvature in the Residuals vs Fitted plot
    • Perhaps the relationship needs a 2nd-degree polynomial term
    • The model might be affected by index 49: it's the lone dot on the right side of the Residuals vs Fitted plot, and it shows up near the Cook's distance contour in the bottom-right plot
  • The plots flag several other outliers as well; let's explore those outliers
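
The points flagged in the diagnostic plots can also be listed numerically. The sketch below re-fits model 1 and ranks observations by Cook's distance and standardized residual. The 4 / n cutoff and the ±2 threshold are common rules of thumb assumed here for illustration, not formal tests and not necessarily the criteria used to pick the rows removed in the next section.

```r
# Re-fit model 1 so this snippet is self-contained
my_fit <- lm(speed ~ dist, data = cars)

# Cook's distance for every observation
cooks_d <- cooks.distance(my_fit)

# Assumption: 4 / n as a rule-of-thumb cutoff for influential points
cutoff <- 4 / nrow(cars)
influential <- cooks_d[cooks_d > cutoff]
sort(influential, decreasing = TRUE)

# Assumption: standardized residuals beyond +/- 2 count as large
std_resid <- rstandard(my_fit)
large_resid <- std_resid[abs(std_resid) > 2]
large_resid
```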

Remove all outliers and plot model 1

  • When we remove all the outliers, we still see some curvature in the residuals
minus_outliers <- my_cars[-c(2,23,39,49),]
my_fit2 <- lm(speed ~ dist, data = minus_outliers)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit2)

Remove high-leverage outlier and plot model 1

  • The Cook's distance contours suggest that the outlier at index 49 may have high leverage, so I also tried removing just that observation
  • There is still some curvature in the residuals
minus_outliers <- my_cars[-c(49),]
my_fit2 <- lm(speed ~ dist, data = minus_outliers)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit2)
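
Besides looking at the plots, it can help to see how much the estimates move when rows are dropped. The sketch below puts the three fits side by side; full_fit, no_49_fit, no_outliers_fit, and comparison are hypothetical names, and the row indices are the ones used above.

```r
# Model 1 on all rows, without row 49, and without all four flagged rows
full_fit        <- lm(speed ~ dist, data = cars)
no_49_fit       <- lm(speed ~ dist, data = cars[-49, ])
no_outliers_fit <- lm(speed ~ dist, data = cars[-c(2, 23, 39, 49), ])

# One row per fit: intercept, slope, and R-squared
comparison <- rbind(
  full        = c(coef(full_fit),        r_squared = summary(full_fit)$r.squared),
  no_49       = c(coef(no_49_fit),       r_squared = summary(no_49_fit)$r.squared),
  no_outliers = c(coef(no_outliers_fit), r_squared = summary(no_outliers_fit)$r.squared)
)
comparison
```

If the slope barely changes between rows of this table, the dropped observations were not driving the fit, which would be consistent with the curvature remaining after they are removed.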

Model 2: attempt a higher-degree polynomial

  • With a squared distance term added, the residuals look better
# Quadratic model: I() makes dist^2 a literal squared term inside the formula
my_fit2 <- lm(speed ~ I(dist^2) + dist, data = my_cars)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit2)

summary(my_fit2)
## 
## Call:
## lm(formula = speed ~ I(dist^2) + dist, data = my_cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.559 -1.722  0.473  1.932  5.942 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.1439610  1.2954573   3.971 0.000244 ***
## I(dist^2)   -0.0015284  0.0004939  -3.095 0.003316 ** 
## dist         0.3274544  0.0547392   5.982 2.86e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.907 on 47 degrees of freedom
## Multiple R-squared:  0.7101, Adjusted R-squared:  0.6978 
## F-statistic: 57.57 on 2 and 47 DF,  p-value: 2.299e-13
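
A more formal way to compare the two models is an F-test on the nested fits with anova(). The sketch below re-fits both models on the built-in cars data; linear_fit, quadratic_fit, model_test, and quad_p_value are hypothetical names. Because only one term is added, this F-test should agree with the t-test on I(dist^2) in the summary above.

```r
# Model 1 (straight line) and model 2 (adds a squared distance term)
linear_fit    <- lm(speed ~ dist, data = cars)
quadratic_fit <- lm(speed ~ I(dist^2) + dist, data = cars)

# F-test: does the squared term reduce the residual sum of squares
# by more than chance alone would explain?
model_test <- anova(linear_fit, quadratic_fit)
model_test

# p-value for the added term (second row of the table)
quad_p_value <- model_test[["Pr(>F)"]][2]
quad_p_value
```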

Summary

  • The model seems to need a higher-degree polynomial term: adding I(dist^2) raised adjusted R-squared from 0.6438 to 0.6978 and lowered the residual standard error from 3.156 to 2.907
  • Clearly there is a strong relationship between stopping distance and speed
    • More observations are likely needed, though, to build a stronger simple linear regression model without having to resort to a multiple regression model