library(tidyverse)

1 Question 1

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

1.1 Load Dataset

data <- datasets::cars

data

1.2 Visualization

The scatter plot shows a roughly linear, positive relationship between stopping distance and speed: distance tends to increase as speed increases.

plot(data, 
     xlab = 'Speed', 
     ylab = 'Distance', 
     main = 'Scatter Plot: Distance vs Speed')
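As a rough numeric companion to the scatter plot, the strength of the linear relationship can be summarized with the Pearson correlation. This is a minimal sketch added for illustration; the names `cars_df` and `r_speed_dist` are assumptions introduced here and do not appear in the analysis.

```r
# Load the built-in cars data (50 observations of speed and dist)
cars_df <- datasets::cars

# Pearson correlation between speed and stopping distance
r_speed_dist <- cor(cars_df$speed, cars_df$dist)
r_speed_dist

# For a one-predictor linear model, the squared correlation equals
# the Multiple R-squared reported by summary(lm(...))
r_speed_dist^2
```

A value near 1 supports a strong positive linear association, but it does not replace the residual checks.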

1.3 Build Simple Linear Model

A baseline model is fit using all 50 observations in the data set.

data.lm <- lm(dist ~ speed, data)

data.lm

## 
## Call:
## lm(formula = dist ~ speed, data = data)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

plot(data, 
     xlab = 'Speed', 
     ylab = 'Distance', 
     main = 'Linear Regression: Distance vs Speed')
abline(data.lm)
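The fitted coefficients can be read as a prediction rule, roughly dist = -17.579 + 3.932 * speed. The sketch below shows one way to apply it with `predict()`. The speed of 15 is a hypothetical value chosen only for illustration, and the names `cars_df`, `fit`, `new_obs`, `pred`, and `by_hand` are assumptions introduced here.

```r
cars_df <- datasets::cars
fit <- lm(dist ~ speed, data = cars_df)

# Hypothetical new observation; 15 is an illustrative value
new_obs <- data.frame(speed = 15)

# Point prediction from the fitted line
pred <- predict(fit, newdata = new_obs)

# Same value computed by hand from the coefficients
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * 15

# TRUE when both approaches agree
all.equal(unname(pred), by_hand)
```

The slope can be read as the estimated change in stopping distance for each one-unit increase in speed.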

1.4 Quality Evaluation of Model

According to the model summary:

The residuals have a median of -2.272, which is close to 0, and the 1Q (-9.525) and 3Q (9.215) are roughly balanced around it. However, the maximum (43.201) is noticeably larger in magnitude than the minimum (-29.069), which suggests a longer right tail.
Both coefficients are statistically significant: the slope strongly so (p = 1.49e-12), the intercept at the 0.05 level (p = 0.0123).
The Multiple R-squared of 0.6511 means that the model explains 65.11% of the variation in stopping distance.
The Adjusted R-squared (0.6438) penalizes the number of predictors used in the model. Because this model has only one predictor, it adds little information beyond the Multiple R-squared.
The p-value of the F-statistic (1.49e-12) is very small, which means that the model fits the data significantly better than an intercept-only model with no independent variables.

summary(data.lm)

## 
## Call:
## lm(formula = dist ~ speed, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
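To make the quality measures in the summary more concrete, they can be recomputed from the residuals using the standard formulas. This is a minimal sketch; the variable names (`cars_df`, `fit`, `sse`, `sst`, `r2`, `adj_r2`, `rse`) are assumptions introduced for illustration.

```r
cars_df <- datasets::cars
fit <- lm(dist ~ speed, data = cars_df)

n <- nrow(cars_df)   # number of observations
p <- 1               # number of predictors (speed only)

sse <- sum(resid(fit)^2)                           # residual sum of squares
sst <- sum((cars_df$dist - mean(cars_df$dist))^2)  # total sum of squares

r2     <- 1 - sse / sst                            # Multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)     # Adjusted R-squared
rse    <- sqrt(sse / (n - p - 1))                  # residual standard error

c(r2 = r2, adj_r2 = adj_r2, rse = rse)
```

With a single predictor the adjustment factor (n - 1) / (n - p - 1) is 49 / 48, which is why the two R-squared values are so close.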

1.5 Residual Analysis

The residual plot does not show clear patterns of non-normality, curvature, or violation of the constant-variance assumption. However, a few outliers with residual magnitudes above 30 might adversely affect the fitted regression model.

plot(fitted(data.lm), 
     resid(data.lm),
     xlab = 'Predicted Y Value',
     ylab = 'Residual',
     main = 'Residual Plot')
abline(h = 0, lty = 2)
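The observations with residual magnitudes above 30 can also be listed directly rather than read off the plot. This is a minimal sketch in base R; the names `cars_df`, `fit`, `res`, and `flagged` are assumptions introduced for illustration.

```r
cars_df <- datasets::cars
fit <- lm(dist ~ speed, data = cars_df)

# Residuals of the baseline model
res <- resid(fit)

# Rows whose residual magnitude exceeds the threshold of 30
flagged <- cars_df[abs(res) > 30, ]
flagged$resid <- res[abs(res) > 30]
flagged
```

Listing the rows makes it easier to judge whether the flagged points are plausible measurements before deciding how to treat them.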

The QQ-plot shows that the residuals are close to normally distributed, except for a few outliers.

qqnorm(resid(data.lm))
qqline(resid(data.lm))
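A formal test can complement the visual QQ-plot check. The sketch below uses the Shapiro-Wilk test from base R's stats package. This test is an addition for illustration and is not part of the analysis above; the names `cars_df`, `fit`, and `sw` are assumptions.

```r
cars_df <- datasets::cars
fit <- lm(dist ~ speed, data = cars_df)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution
sw <- shapiro.test(resid(fit))

sw$statistic  # W statistic; values near 1 are consistent with normality
sw$p.value    # a small p-value is evidence against normality
```

The test is sensitive to a few extreme values, so it is best read alongside the QQ-plot rather than in place of it.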

1.6 Rebuild a Model Without Outliers

Data points with a residual magnitude above 30 were removed (3 of the 50 observations), and the model was rebuilt.

The residuals are closer to a normal distribution than in the baseline model: the minimum (-25.032) and maximum (26.185) are now nearly symmetric.
The p-values of the coefficients show stronger significance (intercept: 0.00652; slope: 3e-14).
The Multiple R-squared is 0.7263, which means that the model explains 72.63% of the variation in stopping distance, higher than the baseline model's 65.11%.

# Attach the baseline residuals, then keep rows with |residual| <= 30
data.mod <- cbind(data, resid = resid(data.lm)) %>%
  filter(abs(resid) <= 30)

data.mod.lm <- lm(dist ~ speed, data.mod)

summary(data.mod.lm)

## 
## Call:
## lm(formula = dist ~ speed, data = data.mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.032  -7.686  -1.032   6.576  26.185 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.1371     5.3053  -2.853  0.00652 ** 
## speed         3.6085     0.3302  10.928    3e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.84 on 45 degrees of freedom
## Multiple R-squared:  0.7263, Adjusted R-squared:  0.7202 
## F-statistic: 119.4 on 1 and 45 DF,  p-value: 3.003e-14
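The two summaries can be placed side by side to make the comparison easier to read. This is a minimal sketch in base R under the same threshold of 30; the helper `compare()` and the names `cars_df`, `fit_all`, `fit_trim`, `keep`, and `comparison` are hypothetical and introduced only for illustration.

```r
cars_df <- datasets::cars
fit_all <- lm(dist ~ speed, data = cars_df)

# Keep observations whose baseline residual magnitude is at most 30
keep <- abs(resid(fit_all)) <= 30
fit_trim <- lm(dist ~ speed, data = cars_df[keep, ])

# Collect the headline numbers from one fitted model
compare <- function(fit) {
  s <- summary(fit)
  c(intercept = coef(fit)[["(Intercept)"]],
    slope     = coef(fit)[["speed"]],
    r_squared = s$r.squared,
    sigma     = s$sigma,
    n         = nobs(fit))
}

# One row per model
comparison <- rbind(baseline = compare(fit_all), trimmed = compare(fit_trim))
round(comparison, 4)
```

Because the removed points were selected using the baseline residuals, some improvement in R-squared and residual standard error is expected by construction.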

plot(data, 
     xlab = 'Speed', 
     ylab = 'Distance', 
     main = 'Linear Regression 2: Distance vs Speed')
abline(data.mod.lm)

The residual plot of the second model shows residuals that are more symmetric around zero, consistent with a distribution closer to normal.

plot(fitted(data.mod.lm), 
     resid(data.mod.lm),
     xlab = 'Predicted Y Value',
     ylab = 'Residual',
     main = 'Residual Plot 2')
abline(h = 0, lty = 2)

The QQ-plot of the second model also indicates residuals closer to a normal distribution.

qqnorm(resid(data.mod.lm))
qqline(resid(data.mod.lm))

DATA605 ASSIGNMENT 11