Problem 2

(a) Use the lm() function to perform a simple linear regression with mpg as response and horsepower as predictor. Use the summary() function to print the results.

auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data", 
                   header=TRUE,
                   na.strings = "?")
auto <- na.omit(auto)  # drop rows with missing values
attach(auto)
mod <- lm(mpg ~ horsepower)
summary(mod)
## 
## Call:
## lm(formula = mpg ~ horsepower)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  1. Is there a relationship between the predictor and the response?

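Yes. The F-statistic is 599.7 (p-value < 2.2e-16) and the t-statistic for horsepower is -24.49, so the null hypothesis that the horsepower coefficient is zero is rejected.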

  2. How strong is the relationship between the predictor and response?

The relationship is fairly strong: the R^2 value is 0.6059, so about 60% of the variance in mpg is explained by horsepower. The residual standard error is 4.906 mpg.
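
In a simple linear regression, R^2 equals the squared sample correlation between the predictor and the response, which gives a quick cross-check on the summary() output. The sketch below illustrates this on R's built-in mtcars data, used here only as a stand-in so the block runs without downloading Auto.data; the names fit, r2 and r are illustrative.

```r
# Stand-in data: mtcars ships with R and has mpg and hp (horsepower) columns.
fit <- lm(mpg ~ hp, data = mtcars)

r2 <- summary(fit)$r.squared      # R^2 as reported by summary()
r  <- cor(mtcars$mpg, mtcars$hp)  # sample correlation

# With a single predictor, R^2 is the squared correlation.
c(r.squared = r2, cor.squared = r^2)
```

For the Auto fit, the same check would compare summary(mod)$r.squared with cor(mpg, horsepower)^2. The correlation of -0.7784 reported in Problem 3(b) squares to roughly 0.6059, which agrees with the summary above.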

  3. Is the relationship between the predictor and the response positive or negative?

It is negative, as indicated by the negative sign on the estimate for horsepower. For every additional unit of horsepower, predicted mpg decreases by about 0.158.

  4. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

newdata<-data.frame(horsepower=c(98))
predict(mod, newdata, interval="confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(mod, newdata, interval="prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

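The predicted mpg at a horsepower of 98 is 24.47. The 95% confidence interval for the mean response is (23.97, 24.96), and the 95% prediction interval for an individual car is (14.81, 34.12). The prediction interval is wider because it also accounts for the error of a single new observation.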
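
To make the difference between the two intervals concrete, here is a minimal sketch that rebuilds both from the standard simple-regression formulas and can be compared against predict(). It uses the built-in mtcars data as a stand-in (hp in place of horsepower), and all variable names are illustrative rather than taken from the original solution.

```r
# Stand-in data: mtcars (built into R); hp plays the role of horsepower.
fit <- lm(mpg ~ hp, data = mtcars)
x   <- mtcars$hp
n   <- length(x)
x0  <- 98                               # predictor value to predict at

y0    <- sum(coef(fit) * c(1, x0))      # point prediction b0 + b1 * x0
s     <- summary(fit)$sigma             # residual standard error
tcrit <- qt(0.975, df = n - 2)          # two-sided 95% critical value

# Standard error of the estimated mean response at x0
se_mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
# A new observation adds the error variance s^2 on top of that
se_pred <- sqrt(s^2 + se_mean^2)

manual <- rbind(
  confidence = c(fit = y0, lwr = y0 - tcrit * se_mean, upr = y0 + tcrit * se_mean),
  prediction = c(fit = y0, lwr = y0 - tcrit * se_pred, upr = y0 + tcrit * se_pred)
)
manual
```

Both intervals are centred on the same fitted value; only the standard error differs, which is why the prediction interval is always the wider of the two.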

(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(mpg ~ horsepower)
abline(mod)

(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

plot(mpg, mod$residuals)
abline(h=0)

qqnorm(mod$residuals)
qqline(mod$residuals)

hist(mod$residuals)

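The Q-Q plot and histogram indicate that the residuals are right-skewed. The spread of the residuals also changes with the level of mpg, which suggests non-constant variance. The residuals are largely negative at low mpg and largely positive at high mpg. Part of that trend is expected, because residuals are always positively correlated with the observed response; plotting them against the fitted values is a cleaner check.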
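
The following sketch shows why a residuals-versus-response plot trends upward by construction, and how to draw the residuals-versus-fitted plot instead. For a least squares fit with an intercept, the correlation between the residuals and the observed response equals sqrt(1 - R^2), while the correlation with the fitted values is zero. The built-in mtcars data is a stand-in here, and the object names are illustrative.

```r
# Stand-in data: mtcars (built into R); hp plays the role of horsepower.
fit <- lm(mpg ~ hp, data = mtcars)
res <- resid(fit)

cor(res, mtcars$mpg)              # positive by construction
sqrt(1 - summary(fit)$r.squared)  # same value
cor(res, fitted(fit))             # zero up to rounding error

# Residuals against fitted values: curvature or a funnel shape here
# points to non-linearity or non-constant variance.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# plot() on an lm object draws the standard diagnostic plots.
par(mfrow = c(2, 2))
plot(fit)
```

For the Auto fit, the equivalent calls would be plot(mod$fitted.values, mod$residuals) and plot(mod).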

Problem 3

(a) Produce a scatter plot matrix

# Drop origin (column 8) and name (column 9), keeping the seven numeric variables used below
auto <- auto[, -c(8:9)]
pairs(auto)

(b) Compute the matrix of correlations between the variables using cor()

cor(auto) 
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables as predictors. Use the summary() function to print the results. Comment on the output.

mlr_mod <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year)
summary(mlr_mod)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6927 -2.3864 -0.0801  2.0291 14.3607 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.454e+01  4.764e+00  -3.051  0.00244 ** 
## cylinders    -3.299e-01  3.321e-01  -0.993  0.32122    
## displacement  7.678e-03  7.358e-03   1.044  0.29733    
## horsepower   -3.914e-04  1.384e-02  -0.028  0.97745    
## weight       -6.795e-03  6.700e-04 -10.141  < 2e-16 ***
## acceleration  8.527e-02  1.020e-01   0.836  0.40383    
## year          7.534e-01  5.262e-02  14.318  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 385 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8063 
## F-statistic: 272.2 on 6 and 385 DF,  p-value: < 2.2e-16
  1. There is a relationship between the predictors and the response: the F-statistic is 272.2 on 6 and 385 degrees of freedom (p-value < 2.2e-16).
  2. weight and year are statistically significant (p < 2e-16 for both). The other predictors are not significant at the 0.05 level.
  3. The coefficient for year suggests that, with all other variables held constant, each additional year increases predicted mpg by about 0.75.

(d) Write the code using matrix algebra to produce the summary output

Y <- as.matrix(mpg)
n<-dim(Y)[1]

X <- matrix(c(rep(1, n),
                  cylinders,
                  displacement,
                  horsepower,
                  weight,
                  acceleration,
                  year), 
                  ncol = 7,
                  byrow = FALSE)
betaHat<-solve(t(X)%*%X)%*%t(X)%*%Y
betaHat
##               [,1]
## [1,] -1.453525e+01
## [2,] -3.298591e-01
## [3,]  7.678430e-03
## [4,] -3.913556e-04
## [5,] -6.794618e-03
## [6,]  8.527325e-02
## [7,]  7.533672e-01
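
The code above reproduces the coefficient estimates only. The rest of the summary() output (standard errors, t values, p-values, residual standard error, R^2 and the F-statistic) can be obtained from the same matrices. Below is a minimal sketch under the usual least squares formulas, run on the built-in mtcars data as a stand-in with three predictors; every object name is illustrative.

```r
# Stand-in data: mtcars (built into R), with three predictors for brevity.
y <- mtcars$mpg
X <- cbind(Intercept = 1, hp = mtcars$hp, wt = mtcars$wt, qsec = mtcars$qsec)
n <- nrow(X)
p <- ncol(X)                             # number of coefficients, incl. intercept

XtX_inv <- solve(crossprod(X))           # (X'X)^{-1}
beta    <- XtX_inv %*% crossprod(X, y)   # least squares estimates
res     <- y - X %*% beta                # residuals
rss     <- sum(res^2)                    # residual sum of squares
sigma2  <- rss / (n - p)                 # estimate of the error variance

se   <- sqrt(diag(sigma2 * XtX_inv))     # coefficient standard errors
tval <- as.vector(beta) / se
pval <- 2 * pt(abs(tval), df = n - p, lower.tail = FALSE)

tss   <- sum((y - mean(y))^2)            # total sum of squares
r2    <- 1 - rss / tss
adjr2 <- 1 - (1 - r2) * (n - 1) / (n - p)
fstat <- ((tss - rss) / (p - 1)) / sigma2

coef_table <- cbind(Estimate = as.vector(beta), `Std. Error` = se,
                    `t value` = tval, `Pr(>|t|)` = pval)
coef_table
c(sigma = sqrt(sigma2), r.squared = r2, adj.r.squared = adjr2, fstatistic = fstat)
```

With the X and Y built from the Auto data above, the same steps should line up with summary(mlr_mod), using n - 7 residual degrees of freedom.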

(e) Produce diagnostic plots of linear regression fit. Comment on any problems with the fit. Do the residual plots suggest any unusually large outliers?

plot(mpg, mlr_mod$residuals)
abline(h=0)

qqnorm(mlr_mod$residuals)
qqline(mlr_mod$residuals)

hist(mlr_mod$residuals)

The fit has problems similar to those of the simple linear regression fit. The residuals are right-skewed, and the residual plot is curved with a spread that is not constant, which suggests both non-linearity and non-constant variance. There are some unusually large residuals where the observed mpg is very high.
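
One way to make the outlier question less dependent on reading plots by eye is to look at studentized residuals and leverage values. The sketch below uses the base R functions rstudent() and hatvalues() on a stand-in fit to the built-in mtcars data. The cutoffs (an absolute studentized residual above 3, a leverage above twice the mean) are common rules of thumb assumed here, not hard limits.

```r
# Stand-in data: mtcars (built into R), with three predictors for brevity.
fit <- lm(mpg ~ hp + wt + qsec, data = mtcars)

rstud <- rstudent(fit)     # externally studentized residuals
lev   <- hatvalues(fit)    # leverage: diagonal of the hat matrix

# Rule-of-thumb cutoffs (assumptions, not hard limits)
outliers      <- names(rstud)[abs(rstud) > 3]
high_leverage <- names(lev)[lev > 2 * mean(lev)]

outliers
high_leverage

# Studentized residuals against fitted values, with the cutoffs marked
plot(fitted(fit), rstud, xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = c(-3, 0, 3), lty = c(2, 1, 2))
```

For the Auto fit, the same calls would be rstudent(mlr_mod) and hatvalues(mlr_mod).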