DATA605 ASSIGNMENT 11
1 Question 1
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis in chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).
1.2 Visualization
A scatter plot of the data suggests a roughly linear, positive relationship between stopping distance (dist) and speed.
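As a quick numeric check on that visual impression, the sample correlation between the two variables can be computed. This is a minimal sketch that assumes `data` is simply a copy of R's built-in `cars` data frame; the report does not show how `data` was created, so that assignment is an assumption.

```r
# Assumption: `data` is a copy of the built-in `cars` data set
# (50 observations of `speed` and `dist`).
data <- cars

# Scatter plot of stopping distance against speed
plot(data$speed, data$dist,
     xlab = 'Speed',
     ylab = 'Distance',
     main = 'Stopping Distance vs Speed')

# Pearson correlation; a value near +1 supports a positive linear trend
r <- cor(data$speed, data$dist)
print(r)
```

For a simple linear regression with one predictor, the square of this correlation equals the multiple R-squared reported in section 1.4.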
1.3 Build Simple Linear Model
A baseline model is fit using all 50 observations in the data set.
##
## Call:
## lm(formula = dist ~ speed, data = data)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
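The coefficients above give the fitted line dist ≈ -17.579 + 3.932 × speed. The following sketch shows one way the model might be fit and then used for prediction. It assumes `data` is a copy of the built-in `cars` data frame, and the speed of 20 is only an illustrative input, not a value taken from the assignment.

```r
# Assumption: `data` is a copy of the built-in `cars` data set
data <- cars

# Fit stopping distance as a linear function of speed
data.lm <- lm(dist ~ speed, data = data)

# Predict the stopping distance at an illustrative speed of 20
new.speed <- data.frame(speed = 20)
pred <- predict(data.lm, newdata = new.speed)

# The same value computed by hand from the coefficients
b <- coef(data.lm)
by.hand <- b[['(Intercept)']] + b[['speed']] * 20

print(pred)
print(by.hand)
```

With the rounded coefficients, the hand calculation is -17.579 + 3.932 × 20 ≈ 61.06, which should agree with `predict()` up to rounding.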
plot(data,
     xlab = 'Speed',
     ylab = 'Distance',
     main = 'Linear Regression: Distance vs Speed')
abline(data.lm)
1.4 Quality Evaluation of Model
According to the model summary:
The residuals have a median of -2.272, which is reasonably close to 0 given a residual standard error of 15.38, and the first and third quartiles (-9.525 and 9.215) are roughly balanced around it. However, the maximum (43.201) is considerably larger in magnitude than the minimum (-29.069), which points to a longer right tail.
Both coefficients are statistically significant: the slope strongly so (p = 1.49e-12), and the intercept at the 0.05 level (p = 0.0123).
The multiple R-squared of 0.6511 means that the model explains 65.11% of the variation in stopping distance.
The adjusted R-squared corrects the multiple R-squared for the number of predictors in the model. Because this model has only one predictor, the two values are nearly identical (0.6438 versus 0.6511), so the adjustment adds little information here.
The p-value of the F-statistic is very small (1.49e-12), which means the model fits the data significantly better than an intercept-only model with no independent variables.
##
## Call:
## lm(formula = dist ~ speed, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
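The quantities discussed in this section can also be pulled out of the summary object directly instead of being read off the printed output. The sketch below assumes `data` is a copy of the built-in `cars` data frame. The names `s`, `adj.by.hand`, `null.lm` and `f.test` are introduced here for illustration only.

```r
# Assumption: `data` is a copy of the built-in `cars` data set
data <- cars
data.lm <- lm(dist ~ speed, data = data)
s <- summary(data.lm)

# Coefficient table: estimates, standard errors, t values, p-values
print(coef(s))

# Multiple and adjusted R-squared
r2 <- s$r.squared
adj.r2 <- s$adj.r.squared

# Adjusted R-squared by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# with n observations and p predictors (p = 1 here)
n <- nrow(data)
p <- 1
adj.by.hand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F-test of the fitted model against the intercept-only model
null.lm <- lm(dist ~ 1, data = data)
f.test <- anova(null.lm, data.lm)
print(f.test)
```

With n = 50 and p = 1 the correction factor is only 49/48, which is why the adjusted and multiple R-squared values are so close for this model.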
1.5 Residual Analysis
The residual plot does not show obvious patterns of non-normality, curvature, or any clear violation of the constant-variance assumption. However, a few outliers with residual magnitudes above 30 may adversely affect the fitted regression model.
plot(fitted(data.lm),
     resid(data.lm),
     xlab = 'Predicted Y Value',
     ylab = 'Residual',
     main = 'Residual Plot')
abline(h = 0, lty = 2)
The Q-Q plot shows that the residuals are close to normally distributed, except for a few outliers.
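The report does not show the code for the Q-Q plot, so the following is only a sketch of how such a plot might be produced with base R. It assumes `data` is a copy of the built-in `cars` data frame. The Shapiro-Wilk test at the end is an optional extra check and is not part of the original analysis.

```r
# Assumption: `data` is a copy of the built-in `cars` data set
data <- cars
data.lm <- lm(dist ~ speed, data = data)
res <- resid(data.lm)

# Normal Q-Q plot of the residuals with a dashed reference line
qq <- qqnorm(res, main = 'Normal Q-Q Plot of Residuals')
qqline(res, lty = 2)

# Optional formal check (not in the original analysis):
# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(res)
print(sw)
```

Points that follow the reference line indicate approximate normality, and points that pull away from it at either end correspond to the outliers mentioned above.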
1.6 Rebuild a Model Without Outliers
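Data points whose residual magnitude exceeds 30 are removed (three observations, leaving 47) and the model is refit.
The residuals of the new model are closer to a normal distribution than those of the baseline model.
The p-values of the coefficients show stronger significance (0.00652 for the intercept and 3e-14 for the slope).
The multiple R-squared is 0.7263, meaning the model explains 72.63% of the variation in stopping distance, which is higher than the baseline model's 65.11%. Some of this improvement is expected, because the removed points are the ones the baseline model fit worst.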
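library(dplyr)  # provides %>% and filter()

# Attach the baseline residuals, then keep rows with |residual| <= 30
data.mod <- cbind(data, resid = resid(data.lm)) %>%
  filter(abs(resid) <= 30)
data.mod.lm <- lm(dist ~ speed, data.mod)
summary(data.mod.lm)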
##
## Call:
## lm(formula = dist ~ speed, data = data.mod)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -25.032  -7.686  -1.032   6.576  26.185
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.1371     5.3053  -2.853  0.00652 **
## speed         3.6085     0.3302  10.928    3e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.84 on 45 degrees of freedom
## Multiple R-squared: 0.7263, Adjusted R-squared: 0.7202
## F-statistic: 119.4 on 1 and 45 DF, p-value: 3.003e-14
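To see which observations the residual threshold removes, the rows can be listed explicitly. The sketch below uses base R only, as an alternative to the dplyr pipeline, and assumes `data` is a copy of the built-in `cars` data frame. The names `res` and `dropped` are introduced here for illustration.

```r
# Assumption: `data` is a copy of the built-in `cars` data set
data <- cars
data.lm <- lm(dist ~ speed, data = data)

# Row indices whose residual magnitude exceeds the threshold of 30
res <- resid(data.lm)
dropped <- unname(which(abs(res) > 30))

# The dropped rows alongside their baseline residuals
print(cbind(data[dropped, ], resid = res[dropped]))

# Base-R equivalent of the filter step, followed by the refit
data.mod <- data[abs(res) <= 30, ]
data.mod.lm <- lm(dist ~ speed, data = data.mod)
print(nrow(data.mod))
```

Listing the dropped rows makes it easier to judge whether they are plausible measurements or genuine anomalies before excluding them from the model.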
# Plot the full data set (removed points included) with the refit line
plot(data,
     xlab = 'Speed',
     ylab = 'Distance',
     main = 'Linear Regression 2: Distance vs Speed')
abline(data.mod.lm)
The residual plot of the second model shows residuals scattered more evenly around zero, consistent with a distribution closer to normal.
plot(fitted(data.mod.lm),
     resid(data.mod.lm),
     xlab = 'Predicted Y Value',
     ylab = 'Residual',
     main = 'Residual Plot 2')
abline(h = 0, lty = 2)
The Q-Q plot of the second model also shows residuals that are closer to a normal distribution.
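One way to make this comparison concrete is to draw the Q-Q plots of both models side by side. The sketch below is an illustration rather than the code behind the original figures. It assumes `data` is a copy of the built-in `cars` data frame and rebuilds the second model with a base-R subset instead of the dplyr pipeline.

```r
# Assumption: `data` is a copy of the built-in `cars` data set
data <- cars
data.lm <- lm(dist ~ speed, data = data)

# Second model: drop rows whose baseline residual magnitude exceeds 30
data.mod <- data[abs(resid(data.lm)) <= 30, ]
data.mod.lm <- lm(dist ~ speed, data = data.mod)

# Two panels in one row; restore the previous layout afterwards
old.par <- par(mfrow = c(1, 2))
qqnorm(resid(data.lm), main = 'Q-Q Plot: Baseline Model')
qqline(resid(data.lm), lty = 2)
qqnorm(resid(data.mod.lm), main = 'Q-Q Plot: Second Model')
qqline(resid(data.mod.lm), lty = 2)
par(old.par)
```

Placing the two plots next to each other makes it easier to see how far the upper tail moves toward the reference line once the large residuals are excluded.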