Applied

Auto dataset

a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.

  1. Is there a relationship between the predictor and the response?
  2. How strong is the relationship between the predictor and the response?
attach(Auto)
## The following object is masked from package:ggplot2:
## 
##     mpg
eq <- lm(data = Auto, formula = mpg ~ horsepower)
S <- summary(eq)
S$coefficients; S$adj.r.squared
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81
## [1] 0.6049379

Given the p.value for the coeffieicent of horsepower is much less than 0.05. And adjusted r.squared value for this equation is more than 0.6. So we can conclude that the relationship is not just significant, but also highly strong.

  1. Is the relationship between the predictor and the response positive or negative?
cor(mpg, horsepower)
## [1] -0.7784268

Cor function would help us verify the direction of correlation. As you see above, the direction is minus

  1. What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?

We can varify this simply with confint function.

kable(confint(eq,level = 0.95))
2.5 % 97.5 %
(Intercept) 38.525212 41.3465103
horsepower -0.170517 -0.1451725
kable(confint(eq,level = 0.98))
1 % 99 %
(Intercept) 38.2598220 41.6119001
horsepower -0.1729011 -0.1427884

b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

attach(Auto)
## The following objects are masked from Auto (pos = 3):
## 
##     acceleration, cylinders, displacement, horsepower, mpg, name,
##     origin, weight, year
## The following object is masked from package:ggplot2:
## 
##     mpg
eq <- lm(data = Auto, formula = mpg ~ horsepower)
Pred.mpg.h <- predict(eq)
newdt <- data.frame(cbind(mpg,horsepower,Pred.mpg.h))
head(newdt)
##   mpg horsepower Pred.mpg.h
## 1  18        130  19.416046
## 2  15        165  13.891480
## 3  18        150  16.259151
## 4  16        150  16.259151
## 5  17        140  17.837598
## 6  15        198   8.682604
plot1 <- ggplot(newdt, aes(x = horsepower, y = mpg)) + geom_point() + geom_line(aes(y=Pred.mpg.h,colour = "red"))
plot1

Polynomial Version - I used the log

attach(Auto)
## The following objects are masked from Auto (pos = 3):
## 
##     acceleration, cylinders, displacement, horsepower, mpg, name,
##     origin, weight, year
## The following objects are masked from Auto (pos = 4):
## 
##     acceleration, cylinders, displacement, horsepower, mpg, name,
##     origin, weight, year
## The following object is masked from package:ggplot2:
## 
##     mpg
eq1 <- lm(data = Auto, formula = mpg ~ log(horsepower))
Pred.mpg.h1 <- predict(eq1)
newdt1 <- data.frame(cbind(mpg,horsepower,Pred.mpg.h1))
plot2 <- ggplot(newdt1, aes(x = horsepower, y = mpg)) + geom_point() + geom_line(aes(y=Pred.mpg.h1,colour = "red"))
plot2

c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

As you see in the graph, the residule shows normal distribution. And The leverage is quite low. Meaning there is no outlier in the residule. You can’t see this kind of plot in the real world actually…

par(mfrow = c(2,2))
plot(eq)

Multiple linear regression.

a) Produce a scatterplot matrix which includes all of the variables in the data set.

Note that three explanatory variables, displacement, horsepower, weight, shows non-linear relationship with mpg.

pairs(Auto)

b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, name.

For the variables with non-linear relationship, each has strong correlation. Um…

cor(Auto[,1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

  1. Is there a relationship between the predictors and the response?
  2. Which predictors appear to have a statistically significant relationship to the response?

As you see in the model, the r.square value is high, and most of variable has low p.values. But horsepower, unlike the first analysis we took, has high p.value, 0.21963. It might mean there is some interaction between some variables and horsepower.

model <- lm(data = Auto[,1:8], mpg ~.)
Sm <- summary(model)
Sm$coefficients
##                   Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)  -17.218434622 4.6442941494 -3.707438 2.401841e-04
## cylinders     -0.493376319 0.3232823146 -1.526147 1.277965e-01
## displacement   0.019895644 0.0075150792  2.647430 8.444649e-03
## horsepower    -0.016951144 0.0137868914 -1.229512 2.196328e-01
## weight        -0.006474043 0.0006520478 -9.928787 7.874953e-21
## acceleration   0.080575838 0.0988449567  0.815174 4.154780e-01
## year           0.750772678 0.0509731223 14.728795 3.055983e-39
## origin         1.426140495 0.2781360924  5.127492 4.665681e-07
  1. What does the coefficient for the year variable suggest?

The coefficients of year is 0.750773. It means as the year of Model increases, Miles per gallon(mpg) increases. That means there has been technical improvement on mpg as the time passed.

d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

It shows high leverage as you see in the right-bottom of graph.

par(mfrow = c(2,2))
plot(model)

e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Let’s try with displacement, horsepower weight ones which show the corlinearity along with higher correlation with mpg.

model2 <- lm(data = Auto[,1:8], mpg ~ horsepower*weight*displacement)
model3 <- update(model2, mpg ~.- horsepower:weight)
model4 <- update(model3, mpg ~.- horsepower:weight:displacement)
model5 <- update(model4, mpg ~.- weight:displacement)
summary(model5)
## 
## Call:
## lm(formula = mpg ~ horsepower + weight + displacement + horsepower:displacement, 
##     data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8438  -2.2068  -0.2347   2.0116  16.4152 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.446e+01  1.544e+00  35.276  < 2e-16 ***
## horsepower              -1.989e-01  2.134e-02  -9.319  < 2e-16 ***
## weight                  -2.749e-03  7.148e-04  -3.846  0.00014 ***
## displacement            -7.129e-02  9.565e-03  -7.454    6e-13 ***
## horsepower:displacement  4.937e-04  5.604e-05   8.811  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.876 on 387 degrees of freedom
## Multiple R-squared:  0.7559, Adjusted R-squared:  0.7534 
## F-statistic: 299.6 on 4 and 387 DF,  p-value: < 2.2e-16

As you see, the interaction between horsepower and displacement has been detacted.

f) Try a few different transformations of the variables, such as log(X), √X, X 2 . Comment on your findings.

Then, how about log one? All parameters are significant, and Adjusted R squre is quite higher than the previous model.

model2 <- lm(data = Auto[,1:8], mpg ~ log(horsepower)*log(weight)*log(displacement))
summary(model2)
## 
## Call:
## lm(formula = mpg ~ log(horsepower) * log(weight) * log(displacement), 
##     data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3760  -2.0492  -0.3067   1.8102  16.2392 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   -2513.045   1007.463  -2.494
## log(horsepower)                                 554.944    225.352   2.463
## log(weight)                                     359.082    127.785   2.810
## log(displacement)                               445.825    195.671   2.278
## log(horsepower):log(weight)                     -78.126     28.579  -2.734
## log(horsepower):log(displacement)               -93.910     42.413  -2.214
## log(weight):log(displacement)                   -62.672     24.329  -2.576
## log(horsepower):log(weight):log(displacement)    13.168      5.278   2.495
##                                               Pr(>|t|)   
## (Intercept)                                    0.01304 * 
## log(horsepower)                                0.01423 * 
## log(weight)                                    0.00521 **
## log(displacement)                              0.02325 * 
## log(horsepower):log(weight)                    0.00655 **
## log(horsepower):log(displacement)              0.02740 * 
## log(weight):log(displacement)                  0.01037 * 
## log(horsepower):log(weight):log(displacement)  0.01302 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.85 on 384 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.7566 
## F-statistic: 174.7 on 7 and 384 DF,  p-value: < 2.2e-16