DATA 605 - A11

Michael Muller

November 14, 2017


library(psych)  # provides describe()
describe(cars)
##       vars  n  mean    sd median trimmed   mad min max range  skew
## speed    1 50 15.40  5.29     15   15.47  5.93   4  25    21 -0.11
## dist     2 50 42.98 25.77     36   40.88 23.72   2 120   118  0.76
##       kurtosis   se
## speed    -0.67 0.75
## dist      0.12 3.64
library(ggplot2)
cars$time = cars$dist / cars$speed  # distance divided by speed
ggplot(cars, aes(dist, speed)) + geom_point(aes(colour = time)) +
  scale_colour_gradientn(colours = terrain.colors(20))

The graph above plots each observation in the cars data set, with distance on the x-axis and speed on the y-axis; the color of each point shows time, computed as dist / speed.

Although it follows from how time is defined, we can see that as distance goes down, so does the amount of time needed to cover that distance.

Modeling

lmf = lm(cars$dist~cars$speed)
lmf
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932

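An intercept of -17.579 means the model predicts a negative distance at a speed of 0. The fitted line only reaches a distance of 0 at a speed of about 17.579 / 3.932 = 4.47, so any prediction below that speed is meaningless, which goes to show that linear regression isn’t perfect.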
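The speed at which the fitted line predicts zero distance can be computed straight from the coefficients. This is a minimal sketch: `fit`, `b`, and `zero_speed` are names introduced here for illustration, and the model is refit with the `data =` form of `lm()`, which should give the same estimates as `lmf`.

```r
# Refit the same regression (illustrative name `fit`, not from the analysis above)
fit <- lm(dist ~ speed, data = cars)

b <- unname(coef(fit))       # b[1] is the intercept, b[2] is the slope
zero_speed <- -b[1] / b[2]   # speed at which the predicted distance is 0
zero_speed
```

Below this speed the model returns negative distances, so it should not be used there.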

Our formula for distance becomes \(dist = -17.579 + 3.932 \times speed\). Let’s compare the model’s predictions with the actual data.

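library(reshape2)  # provides melt()
# Predicted distance from the rounded coefficients
cars$regressed = -17.579 + (3.932 * cars$speed)
comparing = melt(cars, id.vars = 1)  # long format; not used by the plot below
ggplot() +
  geom_point(data = cars, aes(speed, dist, color = 'Actual')) +
  geom_point(data = cars, aes(speed, regressed, color = 'Predicted'))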
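Typing the rounded coefficients by hand works, but the fitted values can also be taken from the model object. The sketch below compares the two; `fit`, `pred_hand`, `pred_fit`, and `max_gap` are illustrative names, and the model is refit with the `data =` form on the assumption that it matches `lmf`.

```r
# Refit the same regression (illustrative name `fit`)
fit <- lm(dist ~ speed, data = cars)

pred_hand <- -17.579 + 3.932 * cars$speed   # rounded coefficients typed by hand
pred_fit  <- unname(fitted(fit))            # full-precision fitted values

# Largest gap between the two sets of predictions (rounding error only)
max_gap <- max(abs(pred_hand - pred_fit))
max_gap
```

Using `fitted()` or `predict()` avoids retyping the coefficients if the model changes.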

Model statistics

summary(lmf)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

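I’d say a residual standard error of 15.38 is actually pretty bad: it is about 36% of the mean distance of 42.98, and the model explains only about 65% of the variance (R-squared of 0.6511). The speed coefficient is clearly significant, but I would not use this model for prediction.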
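To judge whether a residual standard error of 15.38 is large, it helps to compare it with the scale of the response. A minimal sketch, with `fit`, `rse`, `rel_rse`, and `r2` as illustrative names and the model refit with the `data =` form:

```r
# Refit the same regression (illustrative name `fit`)
fit <- lm(dist ~ speed, data = cars)

rse     <- sigma(fit)               # residual standard error
rel_rse <- rse / mean(cars$dist)    # error relative to the mean distance
r2      <- summary(fit)$r.squared   # share of variance explained

c(rse = rse, rel_rse = rel_rse, r2 = r2)
```

The ratio is only a rough yardstick; how much error is acceptable depends on what the predictions are used for.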

Standard linear regression plots

plot(fitted(lmf),resid(lmf))

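Least-squares residuals average to zero, so the model cannot undervalue the dependent variable across the board. The median residual of -2.272 shows it slightly overvalues at least half of the observations, while its largest misses are undervalued points (maximum residual 43.201 against a minimum of -29.069). This is probably due to the large negative intercept and the limited range of the data.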
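The direction of the errors can be checked by counting the signs of the residuals. In this sketch a negative residual means the fitted value is above the actual distance; `fit`, `res`, `mean_res`, `n_over`, and `n_under` are illustrative names, and the model is refit with the `data =` form.

```r
# Refit the same regression (illustrative name `fit`)
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)          # actual minus fitted

mean_res <- mean(res)      # zero up to floating-point error for a model with an intercept
n_over   <- sum(res < 0)   # observations where the model predicts too high
n_under  <- sum(res > 0)   # observations where the model predicts too low

c(mean = mean_res, over = n_over, under = n_under)
```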

qqnorm(resid(lmf))
qqline(resid(lmf))

The heavy tails indicate that we have some extreme values distorting the fit.
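One way to see which observations sit in the tails is to flag large standardized residuals. This is a sketch, not part of the original analysis: the cutoff of 2 is a common rule of thumb assumed here, and `fit`, `std_res`, and `extreme` are illustrative names.

```r
# Refit the same regression (illustrative name `fit`)
fit <- lm(dist ~ speed, data = cars)

std_res <- rstandard(fit)             # residuals scaled by their estimated standard deviation
extreme <- which(abs(std_res) > 2)    # assumed rule-of-thumb cutoff

# Speed and distance for the flagged rows
cars[extreme, c("speed", "dist")]
```

Refitting without the flagged rows would show how much they move the coefficients.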