Q10 (a) Fit a multiple linear regression model on the “Carseats” data set to predict “Sales” using “Price”, “Urban”, and “US”.
library(ISLR)
attach(Carseats)
lm.fit = lm(Sales ~ Price + Urban + US)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.9206 -1.6220 -0.0564  1.5786  7.0581
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model.
Reading the coefficients off the summary in (a): holding the other predictors fixed, a one-dollar increase in “Price” is associated with an average decrease of about 0.0545 in “Sales”, which is roughly 54 units, since Carseats records “Sales” in thousands of units. The “UrbanYes” coefficient says that urban stores sell about 0.022 (thousand units) less than rural stores on average, but this difference is not statistically distinguishable from zero (p = 0.936). The “USYes” coefficient says that stores in the US sell about 1.2 (thousand units) more than stores outside the US on average. The intercept, 13.04, is the predicted “Sales” for a non-urban, non-US store at a price of zero, which is an extrapolation rather than a meaningful quantity.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
The model may be written as \[Sales = 13.0434689 + (-0.0544588)\times Price + (-0.0219162)\times Urban + (1.2005727)\times US + \varepsilon\] with \(Urban = 1\) if the store is in an urban location and \(0\) if not, and \(US = 1\) if the store is in the US and \(0\) if not.
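To make the handling of the qualitative variables concrete, the four combinations of \(Urban\) and \(US\) can be substituted into the fitted equation. This is only a worked restatement of the coefficients above (the intercepts are sums of those estimates, and \(\widehat{Sales}\) is notation introduced here for the fitted value): the dummy variables shift the intercept, while the slope on \(Price\) is the same in every group.
\[\widehat{Sales} = \begin{cases} 13.0434689 - 0.0544588\times Price & \text{if } Urban = 0,\ US = 0 \\ 13.0215527 - 0.0544588\times Price & \text{if } Urban = 1,\ US = 0 \\ 14.2440416 - 0.0544588\times Price & \text{if } Urban = 0,\ US = 1 \\ 14.2221254 - 0.0544588\times Price & \text{if } Urban = 1,\ US = 1 \end{cases}\]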
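(d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?
At the 5% level we can reject the null hypothesis for the “Price” and “US” variables, whose p-values are both far below 0.001. We cannot reject it for “Urban” (p = 0.936).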
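The same conclusion can be read programmatically from the coefficient table. The following is a minimal sketch, assuming the ISLR package is installed; the names `p.values`, `alpha`, and `significant` are illustrative, and the 5% level is an assumption rather than something fixed by the exercise.

```r
library(ISLR)  # provides the Carseats data frame

# Refit the full model; data = Carseats avoids relying on attach()
lm.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)

# Pull the p-value column out of the coefficient table
p.values <- coef(summary(lm.fit))[, "Pr(>|t|)"]

# Keep the predictors (intercept dropped) whose p-value is below alpha
alpha <- 0.05  # assumed significance level
predictors <- p.values[-1]
significant <- names(predictors)[predictors < alpha]
print(significant)
```

Given the summary in (a), this should list the terms for “Price” and “US” only.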
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit2 = lm(Sales ~ Price + US)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.9269 -1.6286 -0.0574  1.5766  7.0515
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
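(f) How well do the models in (a) and (e) fit the data?
Judging by the RSE and \(R^2\), the two models fit the data almost identically: each explains about 23.9% of the variability in “Sales” (\(R^2 = 0.2393\)). The smaller model from (e) is marginally better, with a lower RSE (2.469 versus 2.472) and a higher adjusted \(R^2\) (0.2354 versus 0.2335), so dropping “Urban” costs nothing. Neither model fits especially well, since roughly three quarters of the variability remains unexplained.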
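The fit statistics can be placed side by side rather than read off two separate summaries. This is a sketch under the assumption that ISLR is installed; `full`, `reduced`, `fit.stats`, and `comparison` are hypothetical names, and the partial F-test via `anova()` is an extra check that the exercise itself does not ask for.

```r
library(ISLR)  # provides the Carseats data frame

# Refit both models with an explicit data argument
full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
reduced <- lm(Sales ~ Price + US, data = Carseats)

# Collect RSE, R-squared and adjusted R-squared from a fitted model
fit.stats <- function(fit) {
  s <- summary(fit)
  c(RSE = s$sigma, R2 = s$r.squared, adjR2 = s$adj.r.squared)
}

comparison <- rbind(full = fit.stats(full), reduced = fit.stats(reduced))
print(round(comparison, 4))

# Partial F-test: does adding Urban improve on the reduced model?
print(anova(reduced, full))
```

Because the models are nested, the F-test here is equivalent to the t-test on “UrbanYes” in (a), so a large p-value is expected.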
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lm.fit2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
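These intervals are the usual estimate plus or minus a t quantile times the standard error. As a rough check, the sketch below rebuilds them by hand from the rounded estimates and standard errors printed in the summary for (e), so it agrees with `confint()` only to about four decimal places; `est`, `se`, `t.crit`, and `ci` are illustrative names.

```r
# Estimates and standard errors copied from the summary(lm.fit2) output
est <- c("(Intercept)" = 13.03079, Price = -0.05448, USYes = 1.19964)
se  <- c("(Intercept)" = 0.63098,  Price = 0.00523,  USYes = 0.25846)

# Residual degrees of freedom reported in the same summary
df.resid <- 397

# Two-sided 95% interval uses the 97.5% quantile of the t distribution
t.crit <- qt(0.975, df = df.resid)

ci <- cbind(lower = est - t.crit * se, upper = est + t.crit * se)
print(round(ci, 4))
```

None of the three intervals contains zero, which is consistent with the small p-values in (e).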
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
plot(predict(lm.fit2), rstudent(lm.fit2))
All studentized residuals appear to lie between -3 and 3, so the regression does not suggest any potential outliers.
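The visual impression from the plot can be backed by a direct count. This is a small sketch, assuming ISLR is installed; the cutoff of 3 in absolute value is the rule of thumb used in the text, and `stud.res` and `outliers` are hypothetical names.

```r
library(ISLR)  # provides the Carseats data frame

# Refit the reduced model with an explicit data argument
lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)

# Studentized (leave-one-out) residuals
stud.res <- rstudent(lm.fit2)
print(range(stud.res))

# Observations beyond the +/- 3 rule of thumb
outliers <- which(abs(stud.res) > 3)
print(length(outliers))
```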
par(mfrow = c(2,2))
plot(lm.fit2)
However, some observations have leverage statistics well above the average leverage \((p + 1)/n\) (0.0075), which suggests that the corresponding points have high leverage.
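The leverage statistics can also be inspected numerically instead of only through the diagnostic plots. The sketch below assumes ISLR is installed; flagging points above three times the average leverage is a common rule of thumb chosen here for illustration, not a threshold taken from the exercise, and `lev`, `avg.lev`, `cutoff`, and `high.lev` are hypothetical names.

```r
library(ISLR)  # provides the Carseats data frame

# Refit the reduced model with an explicit data argument
lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)

# Leverage statistic (diagonal of the hat matrix) for each observation
lev <- hatvalues(lm.fit2)

p <- length(coef(lm.fit2)) - 1  # number of predictors, intercept excluded
n <- nobs(lm.fit2)              # number of observations
avg.lev <- (p + 1) / n          # average leverage, 0.0075 here
print(avg.lev)

# Assumed rule of thumb: flag leverage above three times the average
cutoff <- 3 * avg.lev
high.lev <- which(lev > cutoff)
print(sort(lev[high.lev], decreasing = TRUE))
```

The leverage values always average to exactly \((p + 1)/n\), so some points necessarily lie above it; a multiple of the average gives a more useful screen.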