Linear regression inference

Exercises:

Exercise 1:

data(cars)  # load the built-in cars data set

Exercise 2:

plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping Distance (ft)",
     main = "Stopping Distance vs Speed",
     pch = 19)

cor(cars$speed, cars$dist)

## [1] 0.8068949

Answer: The relationship is strong, positive, and approximately linear (r ≈ 0.81): as speed increases, stopping distance tends to increase. One point sits noticeably away from the overall trend and may be an outlier.

Exercise 3:

model <- lm(dist ~ speed, data = cars)
summary(model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Exercise 4:

ANSWER: The least squares regression line is ŷ = −17.5791 + 3.9324x, where x is speed (mph) and ŷ is the predicted stopping distance (ft).
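
As a cross-check, the same line can be rebuilt from summary statistics using the standard simple-regression identities: slope = r × (sd of y) / (sd of x), and intercept = mean of y − slope × mean of x. The sketch below is not part of the original exercise, and the names b0 and b1 are illustrative only.

```r
# Rebuild the least squares line from summary statistics (illustrative sketch).
# b0 and b1 are hypothetical names chosen for this example.
r  <- cor(cars$speed, cars$dist)
b1 <- r * sd(cars$dist) / sd(cars$speed)        # slope
b0 <- mean(cars$dist) - b1 * mean(cars$speed)   # intercept

# Compare with the coefficients reported by lm()
model <- lm(dist ~ speed, data = cars)
rbind(by_hand = c(b0, b1), lm = coef(model))
```

Both rows should agree, which confirms that lm() is reporting the ordinary least squares fit.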

Exercise 5:

ANSWER: R² = 0.6511. About 65.11% of the variability in stopping distance is explained by the linear relationship with speed.
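
The value of R² can be reproduced in two ways, which may help make the "variability explained" reading concrete. The following is a minimal sketch; the variable names (r2_cor, sse, sst, r2_ss) are illustrative and do not come from the original exercise.

```r
model <- lm(dist ~ speed, data = cars)

# Route 1: square of the sample correlation
# (this shortcut holds for simple linear regression with one predictor)
r2_cor <- cor(cars$speed, cars$dist)^2

# Route 2: one minus the ratio of unexplained to total variability
sse <- sum(resid(model)^2)                    # sum of squared residuals
sst <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2_ss <- 1 - sse / sst

# All three values should match the "Multiple R-squared" line of summary()
c(r2_cor, r2_ss, summary(model)$r.squared)
```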

Exercise 6:

model <- lm(dist ~ speed, data = cars)
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping Distance (ft)",
     main = "Stopping Distance vs Speed",
     pch = 19)

#least squares regression line
abline(model, col = "blue", lwd = 2)

Exercise 7:

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability. Plot the residuals vs. speed. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship?

model <- lm(dist ~ speed, data = cars)
#residuals vs speed
plot(cars$speed, resid(model),
     xlab = "Speed (mph)",
     ylab = "Residuals",
     main = "Residuals vs Speed",
     pch = 19)

abline(h = 0, col = "red", lwd = 2)

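Answer: The residuals fall on both sides of zero, which is expected for any least squares fit, so the question is whether they follow a systematic pattern. There is no strong curve, but the spread of the residuals appears to widen at higher speeds. The linear form looks roughly reasonable, although the relationship is not perfectly linear.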
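
One optional way to make a pattern easier to judge is to overlay a smooth curve on the residual plot. The sketch below uses lowess() from base R; it goes beyond what the exercise asks for, and the names res and smooth are illustrative. A smooth that stays close to the zero line is consistent with linearity, while a marked bend would suggest curvature.

```r
model <- lm(dist ~ speed, data = cars)
res <- resid(model)

plot(cars$speed, res,
     xlab = "Speed (mph)",
     ylab = "Residuals",
     main = "Residuals vs Speed with Smooth",
     pch = 19)
abline(h = 0, col = "red", lwd = 2)

# lowess() returns a list with components x and y for a locally weighted smooth
smooth <- lowess(cars$speed, res)
lines(smooth, col = "blue", lwd = 2)
```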

Exercise 8:

Plot a histogram and normal probability plot of the residuals. Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met? Based on these plots (or others), does the constant variability condition appear to be met?

# Histogram of residuals
hist(resid(model),
     main = "Histogram of Residuals",
     xlab = "Residuals",
     col = "gray",
     border = "black")

# Normal probability (Q-Q) plot
qqnorm(resid(model), main = "Normal Q-Q Plot of Residuals")
qqline(resid(model), col = "red", lwd = 2)

ANSWER: The histogram and the Q-Q plot suggest that the residuals are approximately normal. The constant variability condition does not appear to be met: the spread of the residuals is not the same across the range of speeds.
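
The constant variability condition can also be examined more directly. The sketch below plots the absolute residuals against speed and compares the residual standard deviation for slower and faster cars. The split at the median speed is an arbitrary choice made for illustration, and the names res, faster, and spread are not from the original exercise.

```r
model <- lm(dist ~ speed, data = cars)
res <- resid(model)

# Absolute residuals against speed: an upward drift suggests growing spread
plot(cars$speed, abs(res),
     xlab = "Speed (mph)",
     ylab = "|Residual|",
     main = "Absolute Residuals vs Speed",
     pch = 19)

# Rough numeric check: residual standard deviation for slower vs faster cars,
# split at the median speed (an arbitrary, illustrative cut point)
faster <- cars$speed > median(cars$speed)
spread <- tapply(res, faster, sd)
spread
```

Clearly different values for the two groups would support the conclusion that the variability is not constant.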

Exercise 9:

Now focus on the second row of summary output to test whether the slope β of the regression line equals 0. What is the estimate for β? What is the associated standard error? Explain how the t-statistic is calculated and use the pt function to verify the p-value presented from the summary output. Finally, explain the result from this test in the context of whether the regression line is useful for prediction.

model <- lm(dist ~ speed, data = cars)
summary(model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

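# pt() returns the lower-tail area P(T <= q) by default, so the two-sided
# p-value is twice the upper-tail area beyond the t-statistic
2 * pt(q = 9.464, df = 48, lower.tail = FALSE)   # about 1.49e-12, as in summary()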

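Answer: The estimate of β is 3.9324 with a standard error of 0.4155. The t-statistic is the estimate divided by its standard error: t = 3.9324 / 0.4155 = 9.464, on 48 degrees of freedom. The two-sided p-value is twice the area in the upper tail beyond 9.464, about 1.49e-12, as in the summary output. Because this p-value is below 0.05, we reject the null hypothesis that β = 0. Speed is a statistically significant linear predictor of stopping distance, so the regression line is useful for prediction.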
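
The same calculation can be carried out with the unrounded values stored in the coefficient table, which avoids retyping numbers from the printed output. This is a sketch under the usual assumption of a two-sided test of H0: β = 0; the names coefs, est, se, t_stat, and p_val are illustrative.

```r
model <- lm(dist ~ speed, data = cars)

# Coefficient table: columns are Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(model)$coefficients

est <- coefs["speed", "Estimate"]
se  <- coefs["speed", "Std. Error"]

t_stat <- est / se               # test statistic for H0: beta = 0
df     <- df.residual(model)     # n - 2 = 48 for the cars data

# Two-sided p-value: twice the upper-tail area beyond |t|
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

c(t = t_stat, p = p_val)
```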

Exercise 10:

Construct a 95% confidence interval for β. Does this interval contain 0? Explain how this connects to the hypothesis test in Exercise 9.

confint(model, level = 0.95)

##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853

Answer: The 95% confidence interval for β is (3.097, 4.768), which does not contain 0. This agrees with the hypothesis test: at the 5% significance level we reject the null hypothesis that β = 0.
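
The interval reported by confint() follows the usual form estimate ± t* × standard error, where t* is the 97.5th percentile of the t distribution with 48 degrees of freedom. A minimal sketch of that calculation follows; the names est, se, t_crit, and ci are illustrative.

```r
model <- lm(dist ~ speed, data = cars)
coefs <- summary(model)$coefficients

est <- coefs["speed", "Estimate"]
se  <- coefs["speed", "Std. Error"]

# Critical value for a 95% interval with n - 2 degrees of freedom
t_crit <- qt(0.975, df = df.residual(model))

# estimate plus or minus the margin of error
ci <- est + c(-1, 1) * t_crit * se
ci
```

The result should match the speed row of the confint() output.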

Exercise 11:

Find a 95% confidence interval for the population mean for speed=20 and interpret in the context of the problem. Find a 95% prediction interval for a single observation of speed=20 and interpret in the context of the problem. Explain the difference between these two intervals.

new <- data.frame(speed = 20)
predict(model, new, interval = "confidence", level = 0.95)

##        fit      lwr      upr
## 1 61.06908 55.24729 66.89088

predict(model, new, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 61.06908 29.60309 92.53507

ANSWER: For cars traveling at 20 mph, we are 95% confident that the mean stopping distance is between 55.25 and 66.89 feet. We are also 95% confident that the stopping distance of an individual car traveling at 20 mph will fall between 29.60 and 92.54 feet. The prediction interval is wider because it accounts for the variability of individual cars around the mean in addition to the uncertainty in estimating the mean itself.
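
The difference between the two intervals shows up in their standard errors: the prediction interval adds an extra 1 under the square root to account for the scatter of a single observation. The sketch below applies the textbook formulas for simple linear regression; all variable names (x0, sxx, se_mean, se_pred, and so on) are illustrative and not part of the original exercise.

```r
model <- lm(dist ~ speed, data = cars)

x0   <- 20                                  # speed of interest (mph)
n    <- nrow(cars)
s    <- sigma(model)                        # residual standard error
xbar <- mean(cars$speed)
sxx  <- sum((cars$speed - xbar)^2)          # sum of squares of speed

fit    <- sum(coef(model) * c(1, x0))       # fitted value at x0
t_crit <- qt(0.975, df = n - 2)

# Standard error for the mean response vs. for a single new observation
se_mean <- s * sqrt(1 / n + (x0 - xbar)^2 / sxx)
se_pred <- s * sqrt(1 + 1 / n + (x0 - xbar)^2 / sxx)

ci <- fit + c(-1, 1) * t_crit * se_mean     # confidence interval
pi <- fit + c(-1, 1) * t_crit * se_pred     # prediction interval

rbind(confidence = ci, prediction = pi)
```

Both rows should reproduce the lwr and upr columns returned by predict().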