Data Explanation
Origin: The dataset was used in the 1983 American Statistical Association Exposition.
Description: Gas mileage, horsepower, and other information for 392 vehicles.
Format: A data frame with 392 observations on the following 9 variables.
Q8(a) Simple Linear Regression on the “Auto” data set
library(MASS); library(ISLR)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
attach(Auto)
(i) Is there a relationship between the predictor and the response?
lm.fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Yes, there is a relationship between horsepower and mpg as determined by testing the null hypothesis of all regression coefficients equal to zero. Since the F-statistic (599.7) is far larger than 1 and the p-value (2.2e-16) of the F-statistic is close to zero we can reject the null hypothesis and state there is a statistically significant relationship between horsepower and mpg.
(ii) How strong is the relationship between the predictor and the response ?
mean(mpg)
## [1] 23.44592
(4.906/23.446)*100
## [1] 20.92468
To calculate the residual error relative to the response we use the mean of the response and the RSE. The mean of mpg is 23.4459. The RSE of the lm.fit was 4.906 which indicates a percentage error of 20.9248%. The R-squared of the lm.fit was about 0.6059, meaning 60.5948% of the variance in mpg is explained by horsepower.
(iii) Is the relationship between the predictor and the response positive or negative ?
The relationship between mpg and horsepower is negative. The more horsepower an automobile has the linear regression indicates the less mpg fuel efficiency the automobile will have.
(iv) What is the predicted mpg associated with a “horsepower” of 98 ? What are the associated 95% confidence and prediction intervals ?
predict(lm.fit, data.frame(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm.fit, data.frame(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
The predicted mpg associated with a “horsepower” of 98 is about 24.47.
We are 95% confident that the average mpg of a car with horsepower of 98 is between 23.97 to 24.96.
We are 95% confident that the mpg of a car with horsepower of 98 is between 14.81 to 34.12.
Q8(b) Plot the response and the predictor, display the least squares regression line.
plot(horsepower, mpg, main = "Scatterplot of mpg vs. horsepower", xlab = "horsepower", ylab = "mpg", col = "blue")
abline(lm.fit, lwd = 3, col = "magenta")
Q8(c) Produce diagnostic plots of the least squares regression fit. Comment.
par(mfrow = c(2,2))
plot(lm.fit)
The plot of residuals versus fitted values indicates the presence of non linearity in the data. The plot of standardized residuals versus leverage indicates the presence of a few outliers (higher than 2 or lower than -2) and a few high leverage points.
Q9(a) Multiple Linear Regression on the Auto data set
pairs(Auto)
Q9(b) Compute matrix of correlations between the variables. Exclude “name” variable which is qualitative.
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Q9(c) Perform a multiple linear regression with “mpg” as the response and all other variables except “name” as the predictors.
(i) Is there a relationship between the predictors and the response ?
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Testing the hypothesis H0:βi=0 ∀i. The p-value corresponding to the F-statistic is 2.037105910^{-139} this indicates a clear evidence of a relationship between “mpg” and the other predictors.
(ii) Which predictors appear to have a statistically significant relationship to the response ?
By checking the p-values associated with each predictor’s t-statistic. We may conclude that “displacement”, “weight”, “year”, and “origin” have a statistically significant relationship while “cylinders”, “horsepower” and “acceleration” do not.
(iii) What does the coefficient for the “year” variable suggest ?
The coefficient ot the “year” variable suggests that the average effect of an increase of 1 year is an increase of 0.7507727 in “mpg” (all other predictors remaining constant). In other words, cars become more fuel efficient every year by almost 1 mpg / year.
Q9(d) Produce diagnostic plots of the linear regression fit.
par(mfrow = c(2, 2))
plot(lm.fit2)
As before, the plot of residuals versus fitted values indicates the presence of mild non linearity in the data. The plot of standardized residuals versus leverage indicates the presence of a few outliers (higher than 2 or lower than -2) and one high leverage point (point 14).
Q9(e) Fit linear regression models with interaction effects. Do any interactions appear to be statistically significant ?
lm.fit3 <- lm(mpg ~ cylinders * displacement+displacement * weight, data = Auto[, 1:8])
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2934 -2.5184 -0.3476 1.8399 17.7723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262e+01 2.237e+00 23.519 < 2e-16 ***
## cylinders 7.606e-01 7.669e-01 0.992 0.322
## displacement -7.351e-02 1.669e-02 -4.403 1.38e-05 ***
## weight -9.888e-03 1.329e-03 -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03 3.426e-03 -0.872 0.384
## displacement:weight 2.128e-05 5.002e-06 4.254 2.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
From the correlation matrix, I obtained the two highest correlated pairs and used them in picking my interaction effects. From the p-values, we can see that the interaction between displacement and weight is statistically signifcant, while the interactiion between cylinders and displacement is not.
Q9(f) Try a few different transformations of the variables, such as logX, √X, X2. Comment on your findings.
lm.fit3 = lm(mpg~log(weight)+sqrt(horsepower)+acceleration+I(acceleration^2))
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ log(weight) + sqrt(horsepower) + acceleration +
## I(acceleration^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2932 -2.5082 -0.2237 2.0237 15.7650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.30303 10.80451 16.503 < 2e-16 ***
## log(weight) -14.74259 1.73994 -8.473 5.06e-16 ***
## sqrt(horsepower) -1.85192 0.36005 -5.144 4.29e-07 ***
## acceleration -2.19890 0.63903 -3.441 0.000643 ***
## I(acceleration^2) 0.06139 0.01857 3.305 0.001037 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.99 on 387 degrees of freedom
## Multiple R-squared: 0.7414, Adjusted R-squared: 0.7387
## F-statistic: 277.3 on 4 and 387 DF, p-value: < 2.2e-16
Apparently, from the p-values, the log(weight), sqrt(horsepower), and acceleration^2 all have statistical significance of some sort.
par(mfrow=c(2,2))
plot(lm.fit3)
par(mfrow=c(1,1))
plot(predict(lm.fit3), rstudent(lm.fit3), col = "navy")
The residuals plot has less of a discernible pattern than the plot of all linear regression terms. The studentized residuals displays potential outliers (>3). The leverage plot indicates more than three points with high leverage.
However, 2 problems are observed from the above plots:
So, a better transformation need to be applied to our model. From the correlation matrix in 9a., displacement, horsepower and weight show a similar nonlinear pattern against our response mpg. This nonlinear pattern is very close to a log form. So in the next attempt, we use log(mpg)
as our response variable.
lm.fit4<-lm(log(mpg)~cylinders+displacement+horsepower+weight+acceleration+year+origin,data=Auto)
summary(lm.fit4)
##
## Call:
## lm(formula = log(mpg) ~ cylinders + displacement + horsepower +
## weight + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40955 -0.06533 0.00079 0.06785 0.33925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.751e+00 1.662e-01 10.533 < 2e-16 ***
## cylinders -2.795e-02 1.157e-02 -2.415 0.01619 *
## displacement 6.362e-04 2.690e-04 2.365 0.01852 *
## horsepower -1.475e-03 4.935e-04 -2.989 0.00298 **
## weight -2.551e-04 2.334e-05 -10.931 < 2e-16 ***
## acceleration -1.348e-03 3.538e-03 -0.381 0.70339
## year 2.958e-02 1.824e-03 16.211 < 2e-16 ***
## origin 4.071e-02 9.955e-03 4.089 5.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8773
## F-statistic: 400.4 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.fit4)
par(mfrow=c(1,1))
plot(predict(lm.fit4),rstudent(lm.fit4), col = "navy")
Q10(a) Fit MLR on the “Carseats” data set to predict “Sales” Using “Price”, “Urban”, “US”
library(ISLR)
attach(Carseats)
lm.fit = lm(Sales ~ Price + Urban + US)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
The model may be written as [Sales = 178.3030263 + (-14.7425892)Price + (-1.8519214)Urban + (-2.1988951)US + ] with Urban=1 if the store is in an urban location and 0 if not, and US=1 if the store is in the US and 0 if not.