Question: This question involves the use of simple linear regression on the Auto data set. (a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results.

library(ISLR)
library(MASS)
data("Auto")
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
lm.fit<-lm(mpg~horsepower,data=Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

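Comment on the output.

  i. Is there a relationship between the predictor and the response? Yes. The p-value of the F-statistic is below 2.2e-16 (and the t-test on the horsepower coefficient gives p < 2e-16), so the null hypothesis of no relationship between horsepower and mpg is rejected.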

  ii. How strong is the relationship between the predictor and the response? The R^{2} value of 0.6059 indicates that about 61% of the variation in the response (mpg) is explained by the predictor (horsepower). The residual standard error is 4.906 mpg.

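  iii. Is the relationship between the predictor and the response positive or negative? Negative. The horsepower coefficient is -0.158, so each additional unit of horsepower is associated with a decrease of about 0.158 mpg.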

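  iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals? From the output below, the predicted mpg is 24.47, with a 95% prediction interval of (14.81, 34.12) and a 95% confidence interval of (23.97, 24.96).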

predict(lm.fit,data.frame(horsepower=c(98)),interval="prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
predict(lm.fit,data.frame(horsepower=c(98)),interval="confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
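
The two intervals above share the same point estimate and differ only in their standard errors. As a minimal sketch (the names fit, x0, se_mean, se_new and manual are illustrative and not part of the original solution), the same numbers can be rebuilt by hand from the standard simple-regression formulas:

```r
library(ISLR)  # provides the Auto data set

fit  <- lm(mpg ~ horsepower, data = Auto)
x0   <- 98                                  # horsepower value of interest
n    <- nrow(Auto)
xbar <- mean(Auto$horsepower)
sxx  <- sum((Auto$horsepower - xbar)^2)

# Point prediction: intercept + slope * x0
b    <- coef(fit)
yhat <- unname(b["(Intercept)"] + b["horsepower"] * x0)

# Residual standard error and the t quantile on n - 2 degrees of freedom
s     <- summary(fit)$sigma
tcrit <- qt(0.975, df = n - 2)

# Standard errors: mean response (confidence) vs. a single new car (prediction)
se_mean <- s * sqrt(1 / n + (x0 - xbar)^2 / sxx)
se_new  <- s * sqrt(1 + 1 / n + (x0 - xbar)^2 / sxx)

manual <- rbind(
  confidence = c(fit = yhat, lwr = yhat - tcrit * se_mean, upr = yhat + tcrit * se_mean),
  prediction = c(fit = yhat, lwr = yhat - tcrit * se_new,  upr = yhat + tcrit * se_new)
)
print(manual)
```

The extra 1 under the square root in se_new accounts for the irreducible error of an individual observation, which is why the prediction interval is much wider than the confidence interval.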
(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.
attach(Auto)
plot(horsepower,mpg)
abline(lm.fit,lwd=5,col="blue")

(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
which.max(hatvalues(lm.fit))
## 117 
## 116
par(mfrow = c(2,2))
plot(lm.fit)

Question: This question involves the use of multiple linear regression on the Auto data set.
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
Auto$name<-NULL
cor(Auto,method = c("pearson"))
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
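
Setting Auto$name to NULL changes the data frame for everything that follows. A possible alternative, sketched here under the assumption that every column other than name is numeric (num_cols, cor_mat and mpg_cor are illustrative names), selects the numeric columns without modifying Auto:

```r
library(ISLR)  # provides the Auto data set

# Keep only numeric columns; this drops the qualitative `name` factor
# without modifying Auto itself.
num_cols <- sapply(Auto, is.numeric)
cor_mat  <- cor(Auto[, num_cols])

# Correlations of each remaining variable with mpg, ordered by absolute size
mpg_cor <- cor_mat["mpg", colnames(cor_mat) != "mpg"]
print(round(mpg_cor[order(-abs(mpg_cor))], 3))
```

Sorting by absolute value makes it easier to see which variables are most strongly related to mpg than scanning the full matrix.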
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
lm.fit<-lm(mpg~.,data=Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ ., data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

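Comment on the output. For instance:

  i. Is there a relationship between the predictors and the response? Yes. The F-statistic is 252.4 with a p-value below 2.2e-16, so at least one predictor is related to mpg.

  ii. Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin, each with a p-value below 0.05. cylinders, horsepower, and acceleration are not significant once the other predictors are in the model.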

  iii. What does the coefficient for the year variable suggest? With every other predictor held constant, mpg increases by about 0.75 for each additional model year, so newer cars tend to be more fuel efficient.
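
The answers above can also be read off the fitted model programmatically rather than from the printed table. This is a sketch only: fit_all, coefs, pvals and significant are illustrative names, and the 0.05 cutoff is the conventional threshold assumed here.

```r
library(ISLR)  # provides the Auto data set

fit_all <- lm(mpg ~ . - name, data = Auto)   # all predictors except name
coefs   <- summary(fit_all)$coefficients     # Estimate, Std. Error, t value, Pr(>|t|)

# Predictors (intercept excluded) with a p-value below 0.05
pvals       <- coefs[-1, "Pr(>|t|)"]
significant <- names(pvals)[pvals < 0.05]
print(significant)

# Estimated change in mpg per model year, other predictors held fixed,
# together with its 95% confidence interval
print(coefs["year", "Estimate"])
print(confint(fit_all, "year"))
```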

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.
which.max(hatvalues(lm.fit))
## 14 
## 14
par(mfrow = c(2,2))
plot(lm.fit)

Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage? The first plot (residuals vs. fitted) suggests a non-linear relationship between the response and the predictors. The second plot (normal Q-Q) shows residuals that are roughly normal but right-skewed. The third plot (scale-location) suggests that the constant error variance assumption does not hold for this model. The fourth plot (residuals vs. leverage) shows that most observations have low leverage, but one observation, labeled 14 on the graph, stands out as a potential high-leverage point. This agrees with the result of which.max(hatvalues(lm.fit)) above.
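
Reading outliers and leverage points off the plots can be backed up with a numeric check. The sketch below uses two common rules of thumb, which are assumptions here and not thresholds used in the original solution: leverage above three times its average, and studentized residuals beyond ±3.

```r
library(ISLR)  # provides the Auto data set

fit_all <- lm(mpg ~ . - name, data = Auto)

# Leverage: the average hat value equals (number of coefficients) / n.
h      <- hatvalues(fit_all)
h_mean <- length(coef(fit_all)) / nrow(Auto)
high_leverage <- names(h)[h > 3 * h_mean]   # rule-of-thumb cutoff (assumed)

# Outliers: studentized residuals beyond +/- 3 are usually worth a look.
r        <- rstudent(fit_all)
outliers <- names(r)[abs(r) > 3]            # rule-of-thumb cutoff (assumed)

print(high_leverage)
print(outliers)
```

The names printed are the row labels of Auto, the same labels that plot() uses to annotate points in the diagnostic plots.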

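(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant? Yes. In the fit below, the displacement:weight interaction is highly significant (t = 10.054, p < 2e-16).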
lm.fit = lm(mpg ~.-name+displacement:weight, data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9027 -1.8092 -0.0946  1.5549 12.1687 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.389e+00  4.301e+00  -1.253   0.2109    
## cylinders            1.175e-01  2.943e-01   0.399   0.6899    
## displacement        -6.837e-02  1.104e-02  -6.193 1.52e-09 ***
## horsepower          -3.280e-02  1.238e-02  -2.649   0.0084 ** 
## weight              -1.064e-02  7.136e-04 -14.915  < 2e-16 ***
## acceleration         6.724e-02  8.805e-02   0.764   0.4455    
## year                 7.852e-01  4.553e-02  17.246  < 2e-16 ***
## origin               5.610e-01  2.622e-01   2.139   0.0331 *  
## displacement:weight  2.269e-05  2.257e-06  10.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8558 
## F-statistic: 291.1 on 8 and 383 DF,  p-value: < 2.2e-16
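
The exercise mentions both * and :. As a brief sketch (the fit_* names are illustrative), a * b in an R formula is shorthand for a + b + a:b, and an added interaction can be tested against the main-effects model with a nested-model F-test:

```r
library(ISLR)  # provides the Auto data set

# `a * b` expands to `a + b + a:b`, so these two formulas describe the same model.
fit_star  <- lm(mpg ~ displacement * weight, data = Auto)
fit_colon <- lm(mpg ~ displacement + weight + displacement:weight, data = Auto)
print(all.equal(coef(fit_star), coef(fit_colon)))

# Nested-model F-test: does adding displacement:weight improve the full model?
fit_main <- lm(mpg ~ . - name, data = Auto)
fit_int  <- lm(mpg ~ . - name + displacement:weight, data = Auto)
print(anova(fit_main, fit_int))
```

Because only one term is added, the F-test here is equivalent to the t-test on the displacement:weight coefficient in the summary output.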
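(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings. In the fit below, the squared displacement term is significant at the 5% level (p = 0.017) while log(displacement) is not (p = 0.072). Adjusted R^{2} rises only slightly, from 0.8558 to 0.8572, so these transformations add little beyond the interaction model.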
lm.fit = lm(mpg ~.-name+I((displacement)^2)+log(displacement)+displacement:weight, data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name + I((displacement)^2) + log(displacement) + 
##     displacement:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7453 -1.8071  0.0077  1.5523 12.2398 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.372e+01  2.127e+01  -2.056 0.040508 *  
## cylinders            6.809e-01  3.756e-01   1.813 0.070618 .  
## displacement        -1.965e-01  6.336e-02  -3.101 0.002073 ** 
## horsepower          -4.658e-02  1.390e-02  -3.351 0.000886 ***
## weight              -9.389e-03  1.415e-03  -6.633 1.13e-10 ***
## acceleration         4.618e-02  8.993e-02   0.514 0.607885    
## year                 7.673e-01  4.596e-02  16.696  < 2e-16 ***
## origin               5.165e-01  2.713e-01   1.904 0.057702 .  
## I((displacement)^2)  1.737e-04  7.263e-05   2.391 0.017291 *  
## log(displacement)    1.046e+01  5.796e+00   1.805 0.071801 .  
## displacement:weight  1.889e-05  4.645e-06   4.067 5.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.949 on 381 degrees of freedom
## Multiple R-squared:  0.8609, Adjusted R-squared:  0.8572 
## F-statistic: 235.7 on 10 and 381 DF,  p-value: < 2.2e-16
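
One way to judge whether the two transformed terms are worth keeping is to compare the model with and without them directly. This is a sketch with illustrative names (fit_int, fit_trans), not part of the original solution:

```r
library(ISLR)  # provides the Auto data set

# Interaction model from part (e) and the same model with the two
# displacement transformations added.
fit_int   <- lm(mpg ~ . - name + displacement:weight, data = Auto)
fit_trans <- lm(mpg ~ . - name + displacement:weight +
                  I(displacement^2) + log(displacement), data = Auto)

# Adjusted R-squared of each model
adj_r2 <- c(interaction = summary(fit_int)$adj.r.squared,
            transformed = summary(fit_trans)$adj.r.squared)
print(round(adj_r2, 4))

# Joint F-test of the two added terms
print(anova(fit_int, fit_trans))
```

The anova() call tests both added terms at once, which is more informative than reading their individual t-tests when the terms are strongly correlated with each other.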