This work is part of my effort to become a well-versed data analyst. For now, and for the immediate future, I am a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. With more practice, the quality and depth of my work will improve (that is the whole point!). I welcome you to comment on and critique my work to help me improve.

Question-a

Produce a scatterplot matrix which includes all of the variables in the data set.

library("ISLR")
pairs(Auto)

Question-b

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[, names(Auto) != "name"])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
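
The mpg row of this matrix is easier to read once it is sorted. The following is a small sketch of how that could be done; the names `num_auto`, `cor_mat`, and `mpg_cor` are my own and are not part of the exercise.

```r
library(ISLR)

# Drop the qualitative name column before computing correlations
num_auto <- Auto[, names(Auto) != "name"]
cor_mat <- cor(num_auto)

# Correlation of every other variable with mpg, most negative first
mpg_cor <- sort(cor_mat["mpg", colnames(cor_mat) != "mpg"])
round(mpg_cor, 3)
```

Consistent with the matrix above, weight has the strongest negative correlation with mpg and year has the strongest positive one.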

Question-c

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

model = lm(mpg ~. -name, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response? Yes, there is. The F-statistic of 252.4 (p-value < 2.2e-16) rejects the hypothesis that all coefficients are zero. However, some individual predictors do not have a statistically significant effect on the response. The R-squared value implies that about 82% of the variance in the response is explained by the predictors in this regression model.

ii. Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin (each has a p-value below 0.05).

iii. What does the coefficient for the year variable suggest? With every other predictor held constant, the mpg value increases with each model year. Specifically, mpg increases by about 0.75 per year.
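
The three answers above can also be read directly from the fitted model rather than from the printed summary. This is only a sketch: the object names (`fit_all`, `fit_summary`, `significant`, and so on) are hypothetical, and the 0.05 cutoff is the usual convention rather than anything the exercise prescribes.

```r
library(ISLR)

fit_all <- lm(mpg ~ . - name, data = Auto)
fit_summary <- summary(fit_all)

# i. Overall relationship: R-squared and the p-value of the F-test
r_squared <- fit_summary$r.squared
f_stat <- fit_summary$fstatistic  # named vector: value, numdf, dendf
f_p_value <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                lower.tail = FALSE)

# ii. Predictors whose t-test p-value is below 0.05 (intercept excluded)
coef_table <- coef(fit_summary)
p_values <- coef_table[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]

# iii. Estimated change in mpg per model year, other predictors held fixed
year_effect <- coef(fit_all)["year"]

print(round(r_squared, 4))
print(significant)
print(round(year_effect, 3))
```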

Question-d

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2,2))
plot(model)

  1. The first graph (residuals vs. fitted) shows a curved pattern, which suggests a non-linear relationship between the response and the predictors;
  2. The second graph (normal Q-Q) shows that the residuals are roughly normally distributed, with some right skew;
  3. The third graph (scale-location) shows that the constant-variance assumption for the errors does not hold for this model;
  4. The fourth graph (residuals vs. leverage) shows no clearly problematic leverage points. However, there is one observation that stands out as a potential high-leverage point (labeled 14 on the graph).
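
The visual impressions above can be checked numerically. The sketch below uses studentized residuals and hat values; the cutoffs (absolute studentized residual above 3, leverage above three times the average) are common rules of thumb that I am assuming here, not fixed standards, and the object names are hypothetical.

```r
library(ISLR)

fit_all <- lm(mpg ~ . - name, data = Auto)

# Studentized residuals: values beyond about +/- 3 are often flagged
# as possible outliers
stud_res <- rstudent(fit_all)
possible_outliers <- names(stud_res)[abs(stud_res) > 3]

# Leverage: the average hat value is (p + 1) / n, so flag points far above it
lev <- hatvalues(fit_all)
avg_leverage <- length(coef(fit_all)) / nrow(Auto)
high_leverage <- names(lev)[lev > 3 * avg_leverage]

possible_outliers          # row labels of candidate outliers
high_leverage              # row labels of candidate high-leverage points
names(which.max(lev))      # row label of the single highest-leverage point
```

The row labels returned here are the same labels that plot() prints next to points in the diagnostic graphs, so they can be compared with the graphs directly.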

Question-e

  1. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
model = lm(mpg ~.-name+displacement:weight, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9027 -1.8092 -0.0946  1.5549 12.1687 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.389e+00  4.301e+00  -1.253   0.2109    
## cylinders            1.175e-01  2.943e-01   0.399   0.6899    
## displacement        -6.837e-02  1.104e-02  -6.193 1.52e-09 ***
## horsepower          -3.280e-02  1.238e-02  -2.649   0.0084 ** 
## weight              -1.064e-02  7.136e-04 -14.915  < 2e-16 ***
## acceleration         6.724e-02  8.805e-02   0.764   0.4455    
## year                 7.852e-01  4.553e-02  17.246  < 2e-16 ***
## origin               5.610e-01  2.622e-01   2.139   0.0331 *  
## displacement:weight  2.269e-05  2.257e-06  10.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8558 
## F-statistic: 291.1 on 8 and 383 DF,  p-value: < 2.2e-16
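
The models in this section use only the : symbol. For completeness, here is a small sketch of how * relates to it: a * b in a formula is shorthand for a + b + a:b. The two-predictor model and the names `fit_star` and `fit_colon` are only for illustration.

```r
library(ISLR)

# `displacement * weight` expands to both main effects plus the interaction
fit_star <- lm(mpg ~ displacement * weight, data = Auto)
fit_colon <- lm(mpg ~ displacement + weight + displacement:weight, data = Auto)

names(coef(fit_star))
all.equal(coef(fit_star), coef(fit_colon))
```

Both calls fit the same model, so the choice between them is a matter of which reads more clearly.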
model = lm(mpg ~.-name+displacement:cylinders+displacement:weight+acceleration:horsepower, data=Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight + 
##     acceleration:horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3344 -1.6333  0.0188  1.4740 11.9723 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.725e+01  5.328e+00  -3.237  0.00131 ** 
## cylinders                6.354e-01  6.106e-01   1.041  0.29870    
## displacement            -6.805e-02  1.337e-02  -5.088 5.68e-07 ***
## horsepower               6.026e-02  2.601e-02   2.317  0.02105 *  
## weight                  -8.864e-03  1.097e-03  -8.084 8.43e-15 ***
## acceleration             6.257e-01  1.592e-01   3.931  0.00010 ***
## year                     7.845e-01  4.470e-02  17.549  < 2e-16 ***
## origin                   4.668e-01  2.595e-01   1.799  0.07284 .  
## cylinders:displacement  -1.337e-03  2.726e-03  -0.490  0.62415    
## displacement:weight      2.071e-05  3.638e-06   5.694 2.49e-08 ***
## horsepower:acceleration -7.467e-03  1.784e-03  -4.185 3.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.905 on 381 degrees of freedom
## Multiple R-squared:  0.865,  Adjusted R-squared:  0.8615 
## F-statistic: 244.2 on 10 and 381 DF,  p-value: < 2.2e-16
model = lm(mpg ~.-name+displacement:cylinders+displacement:weight+year:origin+acceleration:horsepower, data=Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight + 
##     year:origin + acceleration:horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6504 -1.6476  0.0381  1.4254 12.7893 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.287e+00  9.074e+00   0.583 0.560429    
## cylinders                4.249e-01  6.079e-01   0.699 0.485011    
## displacement            -7.322e-02  1.334e-02  -5.490 7.38e-08 ***
## horsepower               5.252e-02  2.586e-02   2.031 0.042913 *  
## weight                  -8.689e-03  1.086e-03  -7.998 1.54e-14 ***
## acceleration             5.796e-01  1.582e-01   3.665 0.000283 ***
## year                     5.116e-01  9.976e-02   5.129 4.66e-07 ***
## origin                  -1.220e+01  4.161e+00  -2.933 0.003560 ** 
## cylinders:displacement  -4.368e-04  2.712e-03  -0.161 0.872156    
## displacement:weight      1.992e-05  3.608e-06   5.522 6.21e-08 ***
## year:origin              1.630e-01  5.341e-02   3.051 0.002440 ** 
## horsepower:acceleration -6.735e-03  1.781e-03  -3.781 0.000181 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.874 on 380 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8644 
## F-statistic: 227.7 on 11 and 380 DF,  p-value: < 2.2e-16
model = lm(mpg ~.-name-cylinders-acceleration+year:origin+displacement:weight+
                  displacement:weight+acceleration:horsepower+acceleration:weight, data=Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name - cylinders - acceleration + year:origin + 
##     displacement:weight + displacement:weight + acceleration:horsepower + 
##     acceleration:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5074 -1.6324  0.0599  1.4577 12.7376 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.868e+01  7.796e+00   2.396 0.017051 *  
## displacement            -7.794e-02  9.026e-03  -8.636  < 2e-16 ***
## horsepower               8.719e-02  3.167e-02   2.753 0.006183 ** 
## weight                  -1.350e-02  1.287e-03 -10.490  < 2e-16 ***
## year                     4.911e-01  9.825e-02   4.998 8.83e-07 ***
## origin                  -1.262e+01  4.109e+00  -3.071 0.002288 ** 
## year:origin              1.686e-01  5.277e-02   3.195 0.001516 ** 
## displacement:weight      2.253e-05  2.184e-06  10.312  < 2e-16 ***
## horsepower:acceleration -9.164e-03  2.222e-03  -4.125 4.56e-05 ***
## weight:acceleration      2.784e-04  7.087e-05   3.929 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.861 on 382 degrees of freedom
## Multiple R-squared:  0.8687, Adjusted R-squared:  0.8656 
## F-statistic: 280.8 on 9 and 382 DF,  p-value: < 2.2e-16

Of the 4 models, the last is the only one in which every term is statistically significant. Based on results from a few trials not shown here, it is likely the best combination of predictors and interaction terms among those I tried. The R-squared statistic indicates that about 87% of the variance in the response is explained by this particular set of predictors (main effects and interactions). None of the other trials gave a higher value.
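
Because R-squared never decreases when terms are added, adjusted R-squared and AIC give a fairer side-by-side comparison of the four models. The sketch below refits them under the hypothetical names `m1` to `m4`; it is one possible way to tabulate the comparison, not a definitive model-selection procedure.

```r
library(ISLR)

# The four interaction models from above, refit under hypothetical names
fits <- list(
  m1 = lm(mpg ~ . - name + displacement:weight, data = Auto),
  m2 = lm(mpg ~ . - name + displacement:cylinders + displacement:weight +
            acceleration:horsepower, data = Auto),
  m3 = lm(mpg ~ . - name + displacement:cylinders + displacement:weight +
            year:origin + acceleration:horsepower, data = Auto),
  m4 = lm(mpg ~ . - name - cylinders - acceleration + year:origin +
            displacement:weight + acceleration:horsepower +
            acceleration:weight, data = Auto)
)

# One row per model: number of terms, adjusted R-squared, and AIC
comparison <- data.frame(
  terms  = sapply(fits, function(m) length(coef(m)) - 1),
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  aic    = sapply(fits, AIC)
)
comparison[order(-comparison$adj_r2), ]
```

The adjusted R-squared column reproduces the values printed in the summaries above, where the last model has the highest value with fewer terms than the third model.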


Ahmed TADDE