Basic exploration

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

## [1] 0

## [1] 0.8068949

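No NA values
Speed looks roughly normally distributed (mean 15.4, median 15.0); distance is somewhat right-skewed (mean 42.98, median 36.00)
- Distance appears to have one outlier, the maximum of 120
High correlation between speed and distance, about 0.81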
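
The chunk that produced the output above was not echoed in the report. A minimal sketch of what it presumably looked like, assuming the built-in cars data set, is given here; the boxplot.stats check at the end is an added illustration of the outlier note, not something taken from the original chunk.

```r
# Assumed reconstruction of the unechoed exploration chunk (built-in cars data)
summary(cars)                  # min, quartiles, mean and max for each column
sum(is.na(cars))               # total number of missing values
cor(cars$speed, cars$dist)     # Pearson correlation between speed and dist

# Added illustration: dist values beyond 1.5 * IQR from the hinges
dist_outliers <- boxplot.stats(cars$dist)$out
dist_outliers
which(cars$dist %in% dist_outliers)   # row index of each flagged value
```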

Visualization

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

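library(ggplot2)
my_cars <- cars
# Scatter plot with a straight-line fit; span is dropped because it only applies to loess smoothers
ggplot(my_cars, aes(x = speed, y = dist)) +
    geom_point() +
    geom_smooth(method = "lm")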

The geom_smooth line suggests that our residuals are going to look a little wacky
- We can see that the residuals seem to spread out at both ends of the speed range
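
One way to check that impression numerically is to look at the residuals of the plotted line directly. The sketch below is an illustration only: it fits dist on speed (the orientation of the scatter plot, not of the models that follow), and the three speed bands are arbitrary cut points chosen for the example.

```r
# Straight-line fit in the same orientation as the scatter plot (dist on the y axis)
line_fit <- lm(dist ~ speed, data = cars)
line_res <- resid(line_fit)

# Residuals against speed; a funnel or curve here hints at trouble for a linear model
plot(cars$speed, line_res, xlab = "speed", ylab = "residual")
abline(h = 0, lty = 2)

# Standard deviation of the residuals within three illustrative speed bands
speed_band <- cut(cars$speed, breaks = c(0, 12, 19, 25))
res_spread <- tapply(line_res, speed_band, sd)
res_spread
```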

Run models

Model 1

# Fit speed as a linear function of stopping distance
my_fit <- lm(speed ~ dist, data = my_cars)
summary(my_fit)

## 
## Call:
## lm(formula = speed ~ dist, data = my_cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Summary From model 1

About 65% of the variance in speed can be explained by distance (R-squared 0.6511, adjusted 0.6438)
The median of our residuals (0.36) is a little off center
The quartiles (-2.16 and 2.44) seem somewhat skewed as well
Let's take a look at the plots of the residuals
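
The figures quoted in these notes can be pulled out of the fitted model instead of being read off the printed summary. A small sketch, assuming the same model as above (the names fit_summary, r2, adj_r2 and res_quartiles are introduced here for illustration):

```r
my_cars <- cars
my_fit <- lm(speed ~ dist, data = my_cars)

fit_summary <- summary(my_fit)
r2 <- fit_summary$r.squared            # share of variance in speed explained
adj_r2 <- fit_summary$adj.r.squared    # same, penalised for the number of terms
res_quartiles <- quantile(resid(my_fit))  # min, quartiles and max of the residuals

c(r2 = r2, adj_r2 = adj_r2)
res_quartiles
```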

Plot residuals model 1

layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)

You can see that the spread of the residuals is not constant, and there is curvature in the residuals vs fitted graph
- Perhaps the relationship requires a 2nd degree polynomial term
- The model might be affected by index 49; it's the lone dot on the right side of the residuals vs fitted graph, and it shows up near the Cook's distance line in the bottom right plot
There are several outliers in the graphs as well, so let's explore those outliers
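
The points that stand out in the diagnostic plots can also be listed numerically. This is a sketch, not the method used in the report: the 4/n cutoff for Cook's distance and the cutoff of 2 for standardized residuals are common rules of thumb assumed here for illustration.

```r
my_fit <- lm(speed ~ dist, data = cars)

# Cook's distance for every observation
cd <- cooks.distance(my_fit)
n <- nrow(cars)
influential <- which(cd > 4 / n)   # 4/n is a rule of thumb, not a formal test
influential

# Observations with large standardized residuals (assumed cutoff of 2)
std_res <- rstandard(my_fit)
which(abs(std_res) > 2)
```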

Remove all outliers and plot model 1

When we remove all the outliers, we still see residuals have some curvature

minus_outliers <- my_cars[-c(2,23,39,49),]
my_fit2 <- lm(speed ~ dist, data = minus_outliers)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit2)

Remove high leverage outlier and plot model 1

The Cook's distance plot suggests that observation 49 may have high leverage, so I also refit the model with only that point removed
There is still some curvature in the residuals

minus_outliers <- my_cars[-c(49),]
my_fit2 <- lm(speed ~ dist, data = minus_outliers)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit2)
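
To see how much the removed rows move the fit, the coefficients of the three versions of model 1 can be put side by side. A minimal sketch, using the same row indices as above (the list name fits and the table name comparison are introduced here for illustration):

```r
my_cars <- cars

# Model 1 on all rows, without the four flagged rows, and without row 49 only
fits <- list(
  all_rows   = lm(speed ~ dist, data = my_cars),
  minus_four = lm(speed ~ dist, data = my_cars[-c(2, 23, 39, 49), ]),
  minus_49   = lm(speed ~ dist, data = my_cars[-49, ])
)

# One row per fit: intercept, slope, adjusted R-squared and number of rows used
comparison <- t(sapply(fits, function(f) {
  c(coef(f), adj_r2 = summary(f)$adj.r.squared, n = nobs(f))
}))
comparison
```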

Model 2 - Attempt higher degree polynomial

With a squared distance term added, the residuals look better

# Quadratic model: add a squared distance term (stats is attached by default, so no library call is needed)
my_fit2 <- lm(speed ~ I(dist^2) + dist, data = my_cars)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit2)

summary(my_fit2)

## 
## Call:
## lm(formula = speed ~ I(dist^2) + dist, data = my_cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.559 -1.722  0.473  1.932  5.942 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.1439610  1.2954573   3.971 0.000244 ***
## I(dist^2)   -0.0015284  0.0004939  -3.095 0.003316 ** 
## dist         0.3274544  0.0547392   5.982 2.86e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.907 on 47 degrees of freedom
## Multiple R-squared:  0.7101, Adjusted R-squared:  0.6978 
## F-statistic: 57.57 on 2 and 47 DF,  p-value: 2.299e-13
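
Because model 1 is nested inside model 2, the two can also be compared with an F-test. A sketch using anova(); the names fit_linear, fit_quad and f_test are introduced here for illustration. With a single added term, this test is equivalent to the t-test on the I(dist^2) coefficient shown above.

```r
fit_linear <- lm(speed ~ dist, data = cars)
fit_quad   <- lm(speed ~ dist + I(dist^2), data = cars)

# F-test of the nested models: does the squared term reduce the residual sum of squares?
f_test <- anova(fit_linear, fit_quad)
f_test

p_value <- f_test[["Pr(>F)"]][2]   # p-value for adding the squared term
p_value
```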

Summary

The model seems to need a higher degree polynomial: adding the squared distance term raised adjusted R-squared from 0.6438 to 0.6978
Clearly there is a strong relationship between distance to stop and speed
- More observations are likely needed, though, to create a stronger linear regression model without having to resort to a multiple regression model.
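
Whether a still higher degree would help can be checked by fitting a few polynomial degrees and comparing adjusted R-squared. This is a sketch under assumptions: the report only fits degrees 1 and 2, so the cubic fit is an added illustration, and poly() is used here in place of the I(dist^2) notation (it gives the same fitted values through an orthogonal basis).

```r
degrees <- 1:3

# Adjusted R-squared for polynomial fits of increasing degree in dist
adj_r2 <- sapply(degrees, function(d) {
  fit <- lm(speed ~ poly(dist, d), data = cars)
  summary(fit)$adj.r.squared
})

data.frame(degree = degrees, adj_r2 = adj_r2)
```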

Untitled

Justin Herman

November 10, 2018
