This work is part of my effort to become a well-versed data analyst. For now, and for the immediate future, I am a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. With more practice, the quality and depth of my work will improve (that is the whole point!). I welcome you to comment on and critique my work to help me improve.
Produce a scatterplot matrix which includes all of the variables in the data set.
library("ISLR")
pairs(Auto)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[, names(Auto) !="name"])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
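The full matrix is easier to read one row at a time. As a small sketch (the variable names `cor_mat`, `mpg_cor` and `mpg_ranked` are mine, not from the book), the mpg row can be pulled out and sorted by absolute strength:

```r
library(ISLR)  # provides the Auto data set

# Correlations among the quantitative variables (name is excluded, as above)
cor_mat <- cor(Auto[, names(Auto) != "name"])

# Take the mpg row, drop mpg's correlation with itself,
# and sort the rest by absolute strength
mpg_cor <- cor_mat["mpg", colnames(cor_mat) != "mpg"]
mpg_ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_ranked, 3))
```

In the matrix above, weight, displacement, horsepower and cylinders all correlate with mpg at roughly -0.78 to -0.83, and they are also strongly correlated with one another (for example, 0.95 between cylinders and displacement).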
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
model = lm(mpg ~. -name, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response? Yes. The F-statistic of 252.4 (p-value < 2.2e-16) is strong evidence that at least one predictor is related to mpg, although not every individual predictor is statistically significant. The R-squared value indicates that about 82% of the variance in the response is explained by the predictors in this regression model.
ii. Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year and origin (p-values below 0.05). cylinders, horsepower and acceleration are not significant in this model.
iii. What does the coefficient for the year variable suggest? With every other predictor held constant, mpg increases with each model year. Specifically, the coefficient of 0.75 means that mpg increases by about 0.75 per year.
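The three answers can also be read off the fitted model object rather than the printed summary. This is a minimal sketch; the names `fit`, `fit_summary`, `coef_table` and `significant` are my own, and the 0.05 cutoff is an assumption about what counts as significant:

```r
library(ISLR)

fit <- lm(mpg ~ . - name, data = Auto)
fit_summary <- summary(fit)

# (i) Overall F-test and share of variance explained
print(fit_summary$fstatistic)   # F value, numerator df, denominator df
print(fit_summary$r.squared)

# (ii) Terms whose p-value is below 0.05 (the cutoff is an assumption)
coef_table <- coef(fit_summary)
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
print(significant)

# (iii) Estimated change in mpg per model year, with a 95% interval
print(coef(fit)["year"])
print(confint(fit, "year", level = 0.95))
```

The confidence interval is an addition to what the exercise asks for; it shows how precisely the year effect is estimated.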
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(model)
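The plots can be complemented with a numeric check. The sketch below flags candidate outliers and high-leverage observations; the cutoffs (an absolute studentized residual above 3, and leverage above three times the average) are common rules of thumb that I am assuming here, not fixed tests, and the names `stud_res` and `leverage` are mine:

```r
library(ISLR)

fit <- lm(mpg ~ . - name, data = Auto)

stud_res <- rstudent(fit)    # studentized residuals
leverage <- hatvalues(fit)   # diagonal of the hat matrix

n <- nrow(Auto)
p <- length(coef(fit))       # number of coefficients, intercept included

# Candidate outliers: |studentized residual| > 3 (rule of thumb)
print(which(abs(stud_res) > 3))

# Candidate high-leverage points: more than 3x the average leverage p / n
print(which(leverage > 3 * p / n))
```

Any observation flagged by either check is worth comparing with the labelled points in the residual and leverage plots.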
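Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
model = lm(mpg ~.-name+displacement:weight, data = Auto)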
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
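One way to check whether the displacement:weight term earns its place is a partial F-test against the model without it. This is a sketch with my own object names (`base_fit`, `int_fit`, `star_fit`, `cmp`); because only one term is added, the F value should match the square of that term's t value in the summary above. The last lines illustrate that `a * b` is shorthand for `a + b + a:b`:

```r
library(ISLR)

base_fit <- lm(mpg ~ . - name, data = Auto)
int_fit  <- lm(mpg ~ . - name + displacement:weight, data = Auto)

# Partial F-test: does adding displacement:weight reduce the residual
# sum of squares by more than chance would explain?
cmp <- anova(base_fit, int_fit)
print(cmp)

# a * b is shorthand for a + b + a:b, so this formula yields the same fit
star_fit <- lm(mpg ~ . - name + displacement * weight, data = Auto)
print(all.equal(fitted(int_fit), fitted(star_fit)))
```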
model = lm(mpg ~.-name+displacement:cylinders+displacement:weight+acceleration:horsepower, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight +
## acceleration:horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3344 -1.6333 0.0188 1.4740 11.9723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.725e+01 5.328e+00 -3.237 0.00131 **
## cylinders 6.354e-01 6.106e-01 1.041 0.29870
## displacement -6.805e-02 1.337e-02 -5.088 5.68e-07 ***
## horsepower 6.026e-02 2.601e-02 2.317 0.02105 *
## weight -8.864e-03 1.097e-03 -8.084 8.43e-15 ***
## acceleration 6.257e-01 1.592e-01 3.931 0.00010 ***
## year 7.845e-01 4.470e-02 17.549 < 2e-16 ***
## origin 4.668e-01 2.595e-01 1.799 0.07284 .
## cylinders:displacement -1.337e-03 2.726e-03 -0.490 0.62415
## displacement:weight 2.071e-05 3.638e-06 5.694 2.49e-08 ***
## horsepower:acceleration -7.467e-03 1.784e-03 -4.185 3.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.905 on 381 degrees of freedom
## Multiple R-squared: 0.865, Adjusted R-squared: 0.8615
## F-statistic: 244.2 on 10 and 381 DF, p-value: < 2.2e-16
model = lm(mpg ~.-name+displacement:cylinders+displacement:weight+year:origin+acceleration:horsepower, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight +
## year:origin + acceleration:horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6504 -1.6476 0.0381 1.4254 12.7893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.287e+00 9.074e+00 0.583 0.560429
## cylinders 4.249e-01 6.079e-01 0.699 0.485011
## displacement -7.322e-02 1.334e-02 -5.490 7.38e-08 ***
## horsepower 5.252e-02 2.586e-02 2.031 0.042913 *
## weight -8.689e-03 1.086e-03 -7.998 1.54e-14 ***
## acceleration 5.796e-01 1.582e-01 3.665 0.000283 ***
## year 5.116e-01 9.976e-02 5.129 4.66e-07 ***
## origin -1.220e+01 4.161e+00 -2.933 0.003560 **
## cylinders:displacement -4.368e-04 2.712e-03 -0.161 0.872156
## displacement:weight 1.992e-05 3.608e-06 5.522 6.21e-08 ***
## year:origin 1.630e-01 5.341e-02 3.051 0.002440 **
## horsepower:acceleration -6.735e-03 1.781e-03 -3.781 0.000181 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.874 on 380 degrees of freedom
## Multiple R-squared: 0.8683, Adjusted R-squared: 0.8644
## F-statistic: 227.7 on 11 and 380 DF, p-value: < 2.2e-16
model = lm(mpg ~.-name-cylinders-acceleration+year:origin+displacement:weight+
displacement:weight+acceleration:horsepower+acceleration:weight, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name - cylinders - acceleration + year:origin +
## displacement:weight + displacement:weight + acceleration:horsepower +
## acceleration:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5074 -1.6324 0.0599 1.4577 12.7376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.868e+01 7.796e+00 2.396 0.017051 *
## displacement -7.794e-02 9.026e-03 -8.636 < 2e-16 ***
## horsepower 8.719e-02 3.167e-02 2.753 0.006183 **
## weight -1.350e-02 1.287e-03 -10.490 < 2e-16 ***
## year 4.911e-01 9.825e-02 4.998 8.83e-07 ***
## origin -1.262e+01 4.109e+00 -3.071 0.002288 **
## year:origin 1.686e-01 5.277e-02 3.195 0.001516 **
## displacement:weight 2.253e-05 2.184e-06 10.312 < 2e-16 ***
## horsepower:acceleration -9.164e-03 2.222e-03 -4.125 4.56e-05 ***
## weight:acceleration 2.784e-04 7.087e-05 3.929 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.861 on 382 degrees of freedom
## Multiple R-squared: 0.8687, Adjusted R-squared: 0.8656
## F-statistic: 280.8 on 9 and 382 DF, p-value: < 2.2e-16
Of the four interaction models, the last one is the only one in which every term is statistically significant. Based on results from a few trials not shown here, it appears to be the best combination of predictors and interaction terms that I tried. Its R-squared indicates that about 87% of the variance in the response is explained by this particular set of predictors (main effects and interactions). None of the other trials produced a higher value.
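Plain R-squared never decreases when a term is added to a model, so adjusted R-squared is a fairer basis for comparing these fits. The sketch below refits the five models from this write-up and tabulates adjusted R-squared alongside AIC; AIC (lower is better) is my addition and not something the exercise asks for, and the list names are my own labels. Both are in-sample measures, so they do not show how the models would predict new data.

```r
library(ISLR)

# The five formulas fitted in this write-up. In the last one the repeated
# displacement:weight term is written once; R treats repeats as one term.
formulas <- list(
  base  = mpg ~ . - name,
  int_1 = mpg ~ . - name + displacement:weight,
  int_2 = mpg ~ . - name + displacement:cylinders + displacement:weight +
            acceleration:horsepower,
  int_3 = mpg ~ . - name + displacement:cylinders + displacement:weight +
            year:origin + acceleration:horsepower,
  int_4 = mpg ~ . - name - cylinders - acceleration + year:origin +
            displacement:weight + acceleration:horsepower +
            acceleration:weight
)

fits <- lapply(formulas, function(f) lm(f, data = Auto))

comparison <- data.frame(
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  aic    = sapply(fits, AIC)
)
print(round(comparison, 4))
```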
Ahmed TADDE