Assignment 11

Solution

Creating and visualizing the model

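# Packages used below: ggplot2 for plotting, magrittr for the %>% pipe
library(ggplot2)
library(magrittr)

# Creating the model
data(cars)
cars_model <- lm(dist ~ speed, data = cars)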

# Visualizing the model
cars %>% 
  ggplot(mapping = aes(x = speed, y = dist)) + 
  geom_point(color = 'darkgreen') + 
  geom_smooth(method = 'lm', se = FALSE, color = 'red') + 
  labs(title = 'Linear Model', 
       subtitle = 'stopping distance as function of speed', 
       x = 'speed', 
       y = 'stopping distance') + 
  theme(
    plot.title=element_text(hjust=0.5), 
    plot.subtitle=element_text(hjust=0.5))

# Model information
summary(cars_model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As the summary() output shows, the y-intercept is -17.5790949 and the slope is 3.9324088 (both printed rounded to four decimals in the coefficient table). So, the fitted regression model is:

\[\hat{dist}=3.9324{\times}speed-17.5791\]
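As a quick sanity check, the fitted equation can be applied to a single speed value. The sketch below refits the same model and compares a hand computation with `predict()`. The speed of 20 mph is an arbitrary value chosen for illustration, and the names `b`, `new_speed`, `by_hand`, and `by_predict` are introduced here rather than taken from the assignment.

```r
# Refit the model so the block is self-contained
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Extract the intercept and slope
b <- coef(cars_model)

# Illustrative speed in mph (an arbitrary choice, not part of the assignment)
new_speed <- 20

# Prediction by hand from the regression equation
by_hand <- unname(b["(Intercept)"] + b["speed"] * new_speed)

# The same prediction via predict()
by_predict <- unname(predict(cars_model, newdata = data.frame(speed = new_speed)))

by_hand      # about 61.07 ft
by_predict   # same value
```

Both routes should agree, which confirms that the equation above is just a rounded restatement of the fitted coefficients.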

Quality evaluation of the model

Residuals: The median is -2.272, which is fairly close to 0, and the magnitudes of the first and third quartiles are very close (9.525 vs. 9.215). Both suggest that the residuals are roughly symmetric around 0, which is what a good linear fit should produce. The extremes are less balanced, however: the minimum is -29.069 while the maximum is 43.201, which hints at a longer right tail.
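The five-number summary of the residuals can also be pulled out directly instead of being read off the printed output. This is a minimal sketch that refits the model; the names `res`, `res_quartiles`, and `res_mean` are introduced here for illustration.

```r
# Refit the model so the block is self-contained
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Residuals of the fitted model
res <- resid(cars_model)

# Min, first quartile, median, third quartile, max
res_quartiles <- quantile(res)

# With an intercept in the model, least squares forces the mean residual to 0
res_mean <- mean(res)

res_quartiles
res_mean
```

Because the mean residual is zero by construction, the median and quartiles are the more informative numbers when judging symmetry.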

T and p values of the slope: The t value of the slope is 9.464, which is large (as a rule of thumb, 5 to 10 or more is good). Meanwhile, the p value of the slope is really small (1.49e-12). That means that if there were no linear relationship between the variables (a true slope of 0), the probability of observing a t value at least as extreme as 9.464 in magnitude would be about 1.49e-12. Thus, there is strong evidence of a linear relationship between stopping distance and speed.
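The t and p values reported for the slope can be reproduced from the coefficient table. The sketch below refits the model and recomputes both by hand; the names `coefs`, `t_slope`, `df_res`, and `p_slope` are introduced here for illustration.

```r
# Refit the model so the block is self-contained
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Coefficient table: estimate, standard error, t value, p value
coefs <- summary(cars_model)$coefficients

# t value of the slope = estimate / standard error
t_slope <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p value from the t distribution with the residual degrees of freedom
df_res <- df.residual(cars_model)
p_slope <- 2 * pt(-abs(t_slope), df = df_res)

t_slope   # about 9.464
p_slope   # about 1.49e-12
```

The factor of 2 reflects that summary() reports a two-sided test, so both very large positive and very large negative t values count as evidence against a zero slope.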

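Residual standard error and the quartiles: If the residuals are normally distributed, the first and third quartiles should lie at about ±0.67 times the residual standard error, since the quartiles of a normal distribution sit 0.6745 standard deviations from the mean. Here that is about 0.6745 × 15.38 ≈ 10.4. The observed quartile magnitudes (9.525 and 9.215) are slightly smaller than this but reasonably close, so this check is broadly consistent with normally distributed residuals.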

\(R^2\) values: The multiple and adjusted \(R^2\) values are 0.6511 and 0.6438. So, around 65% of the variation in stopping distance is explained by the variation in speed.
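Both \(R^2\) values can be recomputed from the residuals, which makes the "variation explained" reading concrete. This is a minimal sketch that refits the model; the names `ss_res`, `ss_tot`, `r2`, `n`, `p`, and `adj_r2` are introduced here for illustration.

```r
# Refit the model so the block is self-contained
data(cars)
cars_model <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares
ss_res <- sum(resid(cars_model)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Multiple R-squared: share of the variation in dist explained by the model
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes for the number of predictors (p = 1 here)
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2       # about 0.6511
adj_r2   # about 0.6438
```

With a single predictor and 50 observations the adjustment is small, which is why the two values are nearly identical.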

Residual analysis

If the residuals show no systematic pattern and are approximately normally distributed around 0, we can say that the linear model is a good fit for the data. Let’s check this using two plots: the residuals plot and the quantile-quantile plot.

# The residuals plot (ggplot() replaces qplot(), which is deprecated in recent ggplot2)
ggplot(mapping = aes(x = fitted(cars_model), y = resid(cars_model))) + 
  geom_point() + geom_smooth(method = 'lm', se = FALSE)

# The Q-Q plot
qqnorm(resid(cars_model))
qqline(resid(cars_model))

The residuals plot looks to be free of any obvious pattern, although the width of the band of points around the line is not totally consistent, so the spread of the residuals may vary somewhat across the fitted values. Overall it looks pretty good. The Q-Q plot makes us more confident that the residuals are approximately normally distributed: the points are mostly on the line, and the shape and density of the two tails are similar.

Prinon Mahdi
