auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data",
header=TRUE,
na.strings = "?")
auto <- na.omit(auto)   # drop the rows with missing horsepower values
auto <- auto[, -c(8:9)] # drop the origin and name columns
attach(auto)
str(auto)
## 'data.frame': 392 obs. of 7 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
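If the faculty.marshall.usc.edu URL ever becomes unreachable, the same observations ship with the ISLR package (a sketch, assuming ISLR is installed; the packaged Auto data already has the rows with missing horsepower removed):
library(ISLR)
auto_alt <- Auto[, 1:7] # mpg through year; origin and name dropped, as above
str(auto_alt)           # should match the 392 x 7 structure shown above (column types may differ slightly)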
int_mod1 <- lm(mpg ~ cylinders*weight)
summary(int_mod1)
##
## Call:
## lm(formula = mpg ~ cylinders * weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4916 -2.6225 -0.3927 1.7794 16.7087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.3864559 3.7333137 17.514 < 2e-16 ***
## cylinders -4.2097950 0.7238315 -5.816 1.26e-08 ***
## weight -0.0128348 0.0013628 -9.418 < 2e-16 ***
## cylinders:weight 0.0010979 0.0002101 5.226 2.83e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.165 on 388 degrees of freedom
## Multiple R-squared: 0.7174, Adjusted R-squared: 0.7152
## F-statistic: 328.3 on 3 and 388 DF, p-value: < 2.2e-16
plot(fitted(int_mod1), resid(int_mod1))
int_mod2 <- lm(horsepower ~ cylinders + year + cylinders:year)
summary(int_mod2)
##
## Call:
## lm(formula = horsepower ~ cylinders + year + cylinders:year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.778 -12.188 -2.439 11.428 71.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -342.7626 69.3281 -4.944 1.14e-06 ***
## cylinders 106.2874 12.5642 8.460 5.54e-16 ***
## year 4.6485 0.9139 5.086 5.69e-07 ***
## cylinders:year -1.1788 0.1674 -7.042 8.69e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.96 on 388 degrees of freedom
## Multiple R-squared: 0.7591, Adjusted R-squared: 0.7573
## F-statistic: 407.6 on 3 and 388 DF, p-value: < 2.2e-16
plot(fitted(int_mod2), resid(int_mod2))
Yes, the interaction between cylinders and weight is a significant predictor of mpg, and the interaction between cylinders and year is a significant predictor of horsepower. All main effects in both models are significant as well. The residual plot for int_mod1 shows a fan-shaped pattern (residual spread increasing with the fitted values), indicating heteroscedasticity.
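As a supplementary check on the fan shape, a Breusch-Pagan test can be run on int_mod1; this is a sketch assuming the lmtest package is installed (it is not loaded anywhere else in this document).
library(lmtest)
bptest(int_mod1) # small p-value is consistent with the heteroscedasticity seen in the residual plot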
int_mod3 <- lm(mpg ~ cylinders*log10(weight))
summary(int_mod3)
##
## Call:
## lm(formula = mpg ~ cylinders * log10(weight))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9668 -2.6279 -0.4572 1.9242 16.6468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 244.663 29.498 8.294 1.82e-15 ***
## cylinders -12.870 5.634 -2.284 0.0229 *
## log10(weight) -62.657 8.553 -7.326 1.38e-12 ***
## cylinders:log10(weight) 3.445 1.587 2.171 0.0306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.142 on 388 degrees of freedom
## Multiple R-squared: 0.7205, Adjusted R-squared: 0.7183
## F-statistic: 333.4 on 3 and 388 DF, p-value: < 2.2e-16
plot(fitted(int_mod3), resid(int_mod3))
int_mod4 <- lm(horsepower ~ I(cylinders^2)*(year))
summary(int_mod4)
##
## Call:
## lm(formula = horsepower ~ I(cylinders^2) * (year))
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.686 -11.456 -2.571 10.665 69.372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.01703 37.07645 -1.511 0.1316
## I(cylinders^2) 8.73001 1.02780 8.494 4.32e-16 ***
## year 1.45909 0.48505 3.008 0.0028 **
## I(cylinders^2):year -0.09602 0.01372 -6.999 1.14e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.2 on 388 degrees of freedom
## Multiple R-squared: 0.7781, Adjusted R-squared: 0.7764
## F-statistic: 453.6 on 3 and 388 DF, p-value: < 2.2e-16
plot(fitted(int_mod4), resid(int_mod4))
int_mod5 <- lm(horsepower ~ sqrt(cylinders)*year)
summary(int_mod5)
##
## Call:
## lm(formula = horsepower ~ sqrt(cylinders) * year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.697 -12.092 -2.264 11.240 72.908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -946.1913 141.0563 -6.708 7.00e-11 ***
## sqrt(cylinders) 514.9939 61.4599 8.379 9.88e-16 ***
## year 11.3925 1.8681 6.098 2.59e-09 ***
## sqrt(cylinders):year -5.7391 0.8182 -7.015 1.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.48 on 388 degrees of freedom
## Multiple R-squared: 0.7457, Adjusted R-squared: 0.7438
## F-statistic: 379.3 on 3 and 388 DF, p-value: < 2.2e-16
plot(fitted(int_mod5), resid(int_mod5))
Each of the transformations resulted in a higher R^2 value than its untransformed counterpart, except for int_mod5. int_mod4, which squares the cylinders variable, led to the largest increase in explained variance. However, int_mod3 still shows heteroscedasticity of the same kind observed in int_mod1.
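To make the comparison explicit, here is a small sketch that collects R^2 and adjusted R^2 from the fits above (int_mod3 is compared against int_mod1 for mpg; int_mod4 and int_mod5 against int_mod2 for horsepower):
mods <- list(int_mod1 = int_mod1, int_mod2 = int_mod2, int_mod3 = int_mod3,
             int_mod4 = int_mod4, int_mod5 = int_mod5)
sapply(mods, function(m) c(r.squared     = summary(m)$r.squared,
                           adj.r.squared = summary(m)$adj.r.squared))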
library(MASS)
library(ISLR)
attach(Carseats)
mlr_mod <- lm(Sales ~ Price + Urban + US)
summary(mlr_mod)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The coefficient for Price suggests that each additional unit of Price is associated with a predicted decrease of 0.054 units in Sales, holding all other variables constant. The coefficient for Urban suggests that a store in an urban location is associated with a 0.022 unit decrease in Sales relative to a non-urban store, again holding all other variables constant; however, this coefficient is not significant, so the apparent difference is likely due to chance. The coefficient for US suggests that a store in the US is associated with a 1.2 unit increase in Sales relative to a store outside the US, holding all other variables constant.
Sales = 13.043469 - 0.054459(Price) - 0.021916(Urban) + 1.200573(US)
Urban: 1 if urban, 0 if not
US: 1 if US, 0 if not
You can reject the null hypothesis that the coefficient equals 0 for Price and for US (the US versus non-US difference in Sales); you cannot reject it for Urban.
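To make the fitted equation concrete, here is a small sketch that predicts Sales for one hypothetical store (the Price, Urban, and US values are made up for illustration) and checks the result against the coefficients by hand:
new_store <- data.frame(Price = 120, Urban = "Yes", US = "Yes")
predict(mlr_mod, newdata = new_store)
# by hand from the coefficient table above:
13.043469 - 0.054459 * 120 - 0.021916 * 1 + 1.200573 * 1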
mlr_mod_small <- lm(Sales ~ Price + US)
summary(mlr_mod_small)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
They fit the data equally well, showing that Urban is not a necessary predictor: both models explain 23.93% of the variance in Sales. However, the model in (e) has a higher adjusted R-squared because it predicts equally well with fewer predictors.
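The same conclusion can be reached with a nested-model F test (a sketch using the two fits above); because the models differ only by Urban, this is equivalent to the t test on UrbanYes:
anova(mlr_mod_small, mlr_mod) # F test for adding Urban to the smaller model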
confint(mlr_mod_small)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
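For reference, a short sketch reproducing the Price interval above from the coefficient table, as estimate plus or minus the 97.5% t quantile times the standard error:
est <- coef(summary(mlr_mod_small))["Price", "Estimate"]
se  <- coef(summary(mlr_mod_small))["Price", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(mlr_mod_small)) * se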