Assignment 11

Week 11, Regression 1

Fundamentals of Computational Mathematics

CUNY MSDS DATA 605, Fall 2018

Rose Koh

11/6/2018

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Visualization

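library(ggplot2)

# Base-graphics alternative: plot(cars$speed, cars$dist)
# Scatter plot of stopping distance (ft) against speed (mph)
p1 <- ggplot(cars, aes(speed, dist)) +
  geom_point() +
  labs(title = "Stopping Distance vs. Speed",
       x = "Speed (mph)",
       y = "Stopping distance (ft)")
p1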

Quality evaluation of the model

# Pass the data frame explicitly instead of attach()-ing it
cars.lm <- lm(dist ~ speed, data = cars)
summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The least-squares regression line for the model is: \[ \begin{aligned} \widehat{stoppingDistance} &= -17.5791 + 3.9324 \times speed \end{aligned} \]
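As a quick illustration of how the fitted line is used, the sketch below refits the same model and evaluates it at a speed of 10 mph. The names `fit`, `new_speed`, `pred`, and `by_hand` are illustrative and not part of the original analysis; the point is only that `predict()` and the equation above should give the same number.

```r
# Refit the model on the built-in cars data (speed in mph, dist in ft)
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speed at which to evaluate the fitted line
new_speed <- data.frame(speed = 10)

# Prediction taken from the model object
pred <- predict(fit, newdata = new_speed)

# The same prediction computed by hand from the two coefficients
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["speed"] * 10)

print(pred)
print(by_hand)
```

Both values should match the hand calculation \(-17.5791 + 3.9324 \times 10\) up to rounding of the printed coefficients.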

The median residual (-2.27) is fairly close to zero, and the first and third quartiles (-9.53 and 9.22) are of similar magnitude. The maximum (43.20) is somewhat larger in magnitude than the minimum (-29.07), which hints at a mild right skew. The standard error for speed, 0.4155, is about 9.46 times smaller than the corresponding coefficient, 3.9324; that ratio is the t-value. The p-value for speed is 1.49e-12 (\(1.49 \times 10^{-12}\)). It is the probability of observing a t-value at least this extreme if speed had no linear relationship with stopping distance, and it is very small, so speed is clearly relevant to the model. The p-value for the intercept is 0.0123 (\(\approx 1.2\)%), so the intercept is also significant at the 5% level. \(R^2 = 65.11\)% means that speed explains 65.11% of the variability in stopping distance.
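The figures quoted from the summary can be recomputed directly, which makes their meaning concrete. The following is a minimal sketch, assuming only base R and the built-in `cars` data; the variable names (`fit`, `coefs`, `t_speed`, `p_speed`, `r2`) are illustrative.

```r
# Refit the model and pull out the coefficient table
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

# t-value = estimate divided by its standard error
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_speed <- 2 * pt(abs(t_speed), df = df.residual(fit), lower.tail = FALSE)

# For a one-predictor model, R^2 equals the squared correlation of x and y
r2 <- cor(cars$speed, cars$dist)^2

print(c(t_value = t_speed, p_value = p_speed, r_squared = r2))
```

The three printed values should agree with the `t value`, `Pr(>|t|)`, and `Multiple R-squared` entries in the summary output.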

#plot(cars.lm)
#plot(speed,dist)
#abline(cars.lm)

p1 + geom_smooth(method='lm', se = FALSE, color = 'blue')

#plot(fitted(cars.lm), resid(cars.lm))
#
#plot(cars.lm$fitted.values, cars.lm$residuals, xlab='Fitted Values', ylab='Residuals')
#abline(0,0)
#
#plot_ss(cars.lm$fitted.values, cars.lm$residuals, data = cars)
plot_ss(cars.lm$fitted.values, cars.lm$residuals, data = cars, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -3.672e-15    8.543e-17  
## 
## Sum of Squares:  11353.52
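The sum of squares reported by `plot_ss` can be cross-checked without the interactive plot. This is a small sketch using base R only, with illustrative names `fit`, `rss`, and `rse`; it also shows how the residual standard error in the model summary follows from the same quantity.

```r
# Refit the model on the built-in cars data
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares: the total area of the squares drawn by plot_ss
rss <- sum(resid(fit)^2)

# Residual standard error = sqrt(RSS / residual degrees of freedom)
rse <- sqrt(rss / df.residual(fit))

print(c(rss = rss, rse = rse))
```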
hist(cars.lm$residuals)

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

We can spot a few outliers, but the residuals look reasonably close to a normal distribution. The fit is not perfect, but it is a good starting point for prediction. I would say that a linear model seems appropriate.
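To see which observations sit furthest from the line, the residuals can be ranked by absolute size. The sketch below is one simple way to do that with base R; the names `fit`, `res`, `idx`, and `largest`, and the choice of showing three rows, are assumptions made for illustration.

```r
# Refit the model and extract the residuals
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Row indices of the three largest residuals in absolute value
idx <- order(abs(res), decreasing = TRUE)[1:3]

# Show those observations next to their fitted values and residuals
largest <- data.frame(cars[idx, ],
                      fitted = fitted(fit)[idx],
                      residual = res[idx])
print(largest)
```

The top row corresponds to the maximum residual shown in the model summary, and it is a natural candidate for the points that stray from the line in the upper tail of the Q-Q plot.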