The Carseats data set tracks sales information for car seats. It has 400 observations (each at a different store) and 11 variables:
Sales: unit sales in thousands
CompPrice: price charged by competitor at each location
Income: community income level in 1000s of dollars
Advertising: local ad budget at each location in 1000s of dollars
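Population: regional population in thousands
Price: price charged for car seats at each site
ShelveLoc: Bad, Good or Medium, indicating the quality of the shelving location
Age: average age of the local population
Education: education level at each location
Urban: Yes/No, whether the store is in an urban location
US: Yes/No, whether the store is in the US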
library(ISLR)
attach(Carseats)
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
Let’s build a model using all the predictors.
# we could spell out every predictor like this:
#lm1 = lm(Sales~CompPrice+Income+... but there are a lot of predictors
# instead, use . which stands for every other column in the data frame
lm1 = lm(Sales~., data=Carseats)
summary(lm1)
##
## Call:
## lm(formula = Sales ~ ., data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8692 -0.6908 0.0211 0.6636 3.4115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6606231 0.6034487 9.380 < 2e-16 ***
## CompPrice 0.0928153 0.0041477 22.378 < 2e-16 ***
## Income 0.0158028 0.0018451 8.565 2.58e-16 ***
## Advertising 0.1230951 0.0111237 11.066 < 2e-16 ***
## Population 0.0002079 0.0003705 0.561 0.575
## Price -0.0953579 0.0026711 -35.700 < 2e-16 ***
## ShelveLocGood 4.8501827 0.1531100 31.678 < 2e-16 ***
## ShelveLocMedium 1.9567148 0.1261056 15.516 < 2e-16 ***
## Age -0.0460452 0.0031817 -14.472 < 2e-16 ***
## Education -0.0211018 0.0197205 -1.070 0.285
## UrbanYes 0.1228864 0.1129761 1.088 0.277
## USYes -0.1840928 0.1498423 -1.229 0.220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.019 on 388 degrees of freedom
## Multiple R-squared: 0.8734, Adjusted R-squared: 0.8698
## F-statistic: 243.4 on 11 and 388 DF, p-value: < 2.2e-16
Why do we have variables ShelveLocGood, ShelveLocMedium, USYes and UrbanYes when those variables don’t exist in the data set?
R generates dummy variables for us from qualitative variables. The contrasts() function returns the coding that R uses.
contrasts(ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
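To see how this coding turns into regression columns, here is a minimal sketch using a small made-up factor (the `shelf` vector is hypothetical and not taken from Carseats). `model.matrix()` shows the 0/1 columns that `lm()` would build, and `relevel()` changes which level acts as the baseline.

```r
# Hypothetical toy factor with the same three levels as ShelveLoc
shelf <- factor(c("Bad", "Good", "Medium", "Good", "Bad"))

# Levels are sorted alphabetically, so "Bad" is the baseline by default
levels(shelf)

# Design matrix: an intercept plus one 0/1 column per non-baseline level
mm <- model.matrix(~ shelf)
mm

# Make "Medium" the baseline instead; the dummy columns change accordingly
shelf2 <- relevel(shelf, ref = "Medium")
colnames(model.matrix(~ shelf2))
```

A row whose dummy columns are all 0 is a Bad location, which is why the ShelveLocGood and ShelveLocMedium coefficients are read as differences from Bad.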
cor(subset(Carseats, select=-c(ShelveLoc,Urban,US))) # omit qualitative data
## Sales CompPrice Income Advertising Population
## Sales 1.00000000 0.06407873 0.151950979 0.269506781 0.050470984
## CompPrice 0.06407873 1.00000000 -0.080653423 -0.024198788 -0.094706516
## Income 0.15195098 -0.08065342 1.000000000 0.058994706 -0.007876994
## Advertising 0.26950678 -0.02419879 0.058994706 1.000000000 0.265652145
## Population 0.05047098 -0.09470652 -0.007876994 0.265652145 1.000000000
## Price -0.44495073 0.58484777 -0.056698202 0.044536874 -0.012143620
## Age -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355
## Education -0.05195524 0.02519705 -0.056855422 -0.033594307 -0.106378231
## Price Age Education
## Sales -0.44495073 -0.231815440 -0.051955242
## CompPrice 0.58484777 -0.100238817 0.025197050
## Income -0.05669820 -0.004670094 -0.056855422
## Advertising 0.04453687 -0.004557497 -0.033594307
## Population -0.01214362 -0.042663355 -0.106378231
## Price 1.00000000 -0.102176839 0.011746599
## Age -0.10217684 1.000000000 0.006488032
## Education 0.01174660 0.006488032 1.000000000
Next, consider a smaller model with only Price, Urban and US as predictors (fit as lm2 below).
The linear regression suggests a relationship between Price and Sales, given the low p-value of the t-statistic. The coefficient is negative: holding the other predictors fixed, a one-unit increase in Price is associated with a decrease of about 54 units in Sales (Sales is measured in thousands).
The linear regression suggests no relationship between Sales and whether the store is in an urban location, based on the high p-value of the t-statistic for UrbanYes.
The linear regression suggests a relationship between Sales and whether the store is in the US. The coefficient for USYes is positive: holding the other predictors fixed, a store in the US sells approximately 1,201 more units on average than a store outside the US.
Sales = 13.04 - 0.05 Price - 0.02 UrbanYes + 1.20 USYes
lm2 = lm(Sales~Price+Urban+US)
summary(lm2)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
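As a rough check on how to read these coefficients, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the output above and applies them to a hypothetical store (Price of 100, urban, in the US); the store's values are assumptions chosen only for illustration.

```r
# Coefficients copied from the summary(lm2) output above
b0      <- 13.043469
b_price <- -0.054459
b_urban <- -0.021916
b_us    <-  1.200573

# Hypothetical store: Price = 100, urban, in the US
price     <- 100
urban_yes <- 1   # dummy variable: 1 = Yes, 0 = No
us_yes    <- 1   # dummy variable: 1 = Yes, 0 = No

pred <- b0 + b_price * price + b_urban * urban_yes + b_us * us_yes
pred          # predicted Sales, in thousands of units
pred * 1000   # the same prediction in units
```

With the model object available, predict(lm2, newdata = data.frame(Price = 100, Urban = "Yes", US = "Yes")) should give the same value, up to rounding of the printed coefficients.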
Since the Urban variable had a high p-value, let’s build a model without it.
lm3 = lm(Sales ~ Price + US)
summary(lm3)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
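Based on the RSE and R^2 of the two linear regressions, both fit the data similarly. The smaller model from (e), lm3, fits slightly better: its RSE is lower (2.469 vs. 2.472) and its adjusted R^2 is higher (0.2354 vs. 0.2335), while the multiple R^2 is the same (0.2393) to four decimals. Both models explain only about 24% of the variance in Sales.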
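The adjusted R^2 can differ between the two models even though the multiple R^2 prints the same, because it penalizes the number of predictors. The sketch below applies the standard formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), to the rounded R^2 from the outputs above; `adj_r2` is a helper name introduced here for illustration. Since the input is rounded, the last digit may differ slightly from what summary() reports.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n  <- 400      # observations in Carseats
r2 <- 0.2393   # Multiple R-squared as printed (rounded) for both models

adj_lm2 <- adj_r2(r2, n, p = 3)  # Price, UrbanYes, USYes
adj_lm3 <- adj_r2(r2, n, p = 2)  # Price, USYes
round(c(lm2 = adj_lm2, lm3 = adj_lm3), 4)
```

Dropping Urban removes one predictor without lowering R^2 noticeably, so the penalty shrinks and the adjusted value goes up.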
confint(lm3)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
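These intervals are the usual estimate plus or minus a t critical value times the standard error. As a sketch, the block below rebuilds the Price and USYes intervals from the estimates and standard errors printed in the summary(lm3) output above; because those printed values are rounded, the results should agree with confint() only to a few decimal places.

```r
# Values copied from the summary(lm3) output above
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)
df  <- 397                      # residual degrees of freedom

t_crit <- qt(0.975, df)         # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
ci
```

Neither interval contains 0, which matches the small p-values for Price and USYes.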
All studentized residuals fall between -3 and 3, so the regression suggests no obvious outliers.
plot(predict(lm3), rstudent(lm3))
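The same check can be done numerically rather than by eye. This is a minimal sketch that refits the model with an explicit data argument (under the name `fit`, introduced here so the block stands alone) and counts studentized residuals beyond the common rule-of-thumb cutoff of 3 in absolute value.

```r
library(ISLR)

# Refit Sales ~ Price + US with an explicit data argument
fit <- lm(Sales ~ Price + US, data = Carseats)

# Studentized residuals, one per observation
rs <- rstudent(fit)

# Range of the residuals and the number beyond +/-3
range(rs)
n_outliers <- sum(abs(rs) > 3)
n_outliers
```

The cutoff of 3 is a convention, not a formal test; points just inside it can still be worth a look.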
Let’s look at the standard diagnostic plots for the model.
par(mfrow=c(2,2))
plot(lm3)
Here is an explanation of these plots: http://data.library.virginia.edu/diagnostic-plots/