Multiple Linear Regression with the Carseats Data Set

Linear regression with more than one predictor

Karen Mazidi

Using the Carseats data set from the ISLR package

The Carseats data set tracks sales information for car seats. It has 400 observations (each at a different store) and 11 variables:

Sales: unit sales in thousands
CompPrice: price charged by competitor at each location
Income: community income level in 1000s of dollars
Advertising: local ad budget at each location in 1000s of dollars
Population: regional pop in thousands
Price: price for car seats at each site
ShelveLoc: Bad, Good or Medium indicates quality of shelving location
Age: age level of the population
Education: ed level at location
Urban: Yes/No
US: Yes/No

library(ISLR)
attach(Carseats)
names(Carseats)

##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"

Multiple linear model

Let’s build a model using all the predictors.

# we could list every predictor explicitly:
# lm1 = lm(Sales~CompPrice+Income+... but there are a lot of predictors
# instead, use . to stand for all the other columns in the data
lm1 = lm(Sales~., data=Carseats)
summary(lm1)

## 
## Call:
## lm(formula = Sales ~ ., data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8692 -0.6908  0.0211  0.6636  3.4115 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.6606231  0.6034487   9.380  < 2e-16 ***
## CompPrice        0.0928153  0.0041477  22.378  < 2e-16 ***
## Income           0.0158028  0.0018451   8.565 2.58e-16 ***
## Advertising      0.1230951  0.0111237  11.066  < 2e-16 ***
## Population       0.0002079  0.0003705   0.561    0.575    
## Price           -0.0953579  0.0026711 -35.700  < 2e-16 ***
## ShelveLocGood    4.8501827  0.1531100  31.678  < 2e-16 ***
## ShelveLocMedium  1.9567148  0.1261056  15.516  < 2e-16 ***
## Age             -0.0460452  0.0031817 -14.472  < 2e-16 ***
## Education       -0.0211018  0.0197205  -1.070    0.285    
## UrbanYes         0.1228864  0.1129761   1.088    0.277    
## USYes           -0.1840928  0.1498423  -1.229    0.220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 388 degrees of freedom
## Multiple R-squared:  0.8734, Adjusted R-squared:  0.8698 
## F-statistic: 243.4 on 11 and 388 DF,  p-value: < 2.2e-16
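
The dot in the formula is expanded by R against the columns of the data argument. As a minimal sketch, assuming a small made-up data frame (toy, with columns y, x1, x2 and grp, none of which come from Carseats), terms() shows which predictors the formula ends up with, and a minus sign shows one way to drop a predictor from the expanded list:

```r
# Hypothetical toy data frame, used only to illustrate formula expansion
toy <- data.frame(
  y   = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.4),
  x1  = c(1, 2, 3, 4, 5, 6),
  x2  = c(0.5, 0.1, 0.9, 0.3, 0.7, 0.2),
  grp = factor(c("a", "b", "a", "b", "a", "b"))
)

# The dot stands for every column except the response
all_terms <- attr(terms(y ~ ., data = toy), "term.labels")
print(all_terms)

# A minus sign removes a predictor from the expanded list
fewer_terms <- attr(terms(y ~ . - x2, data = toy), "term.labels")
print(fewer_terms)
```

Applied to this notebook, a formula such as Sales ~ . - Population would refit the full model without Population, which has one of the high p-values above.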

Dummy variables

Why do we have variables ShelveLocGood, ShelveLocMedium, USYes and UrbanYes when those variables don’t exist in the data set?

R generates dummy variables for us from qualitative variables. The contrasts() function returns the coding that R uses.

contrasts(ShelveLoc)

##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1
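
In this coding, Bad is the baseline level (the row of all zeros), so the ShelveLocGood and ShelveLocMedium coefficients are differences relative to Bad. The sketch below illustrates the same mechanism on a made-up factor (shelf is a hypothetical stand-in for ShelveLoc, not the real column): model.matrix() shows the dummy columns that lm() would build, and relevel() shows one way to choose a different baseline.

```r
# Hypothetical factor with the same three levels as ShelveLoc
shelf <- factor(c("Bad", "Good", "Medium", "Good"))

# Design matrix that lm() would build: an intercept plus one
# 0/1 column for each non-baseline level
dummies <- model.matrix(~ shelf)
print(dummies)

# Make Medium the baseline instead of Bad
shelf_med <- relevel(shelf, ref = "Medium")
print(contrasts(shelf_med))
```

Changing the baseline changes how the coefficients are read, not how well the model fits.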

Correlations with quantitative data

cor(subset(Carseats, select=-c(ShelveLoc,Urban,US)))  # omit qualitative data

##                   Sales   CompPrice       Income  Advertising   Population
## Sales        1.00000000  0.06407873  0.151950979  0.269506781  0.050470984
## CompPrice    0.06407873  1.00000000 -0.080653423 -0.024198788 -0.094706516
## Income       0.15195098 -0.08065342  1.000000000  0.058994706 -0.007876994
## Advertising  0.26950678 -0.02419879  0.058994706  1.000000000  0.265652145
## Population   0.05047098 -0.09470652 -0.007876994  0.265652145  1.000000000
## Price       -0.44495073  0.58484777 -0.056698202  0.044536874 -0.012143620
## Age         -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355
## Education   -0.05195524  0.02519705 -0.056855422 -0.033594307 -0.106378231
##                   Price          Age    Education
## Sales       -0.44495073 -0.231815440 -0.051955242
## CompPrice    0.58484777 -0.100238817  0.025197050
## Income      -0.05669820 -0.004670094 -0.056855422
## Advertising  0.04453687 -0.004557497 -0.033594307
## Population  -0.01214362 -0.042663355 -0.106378231
## Price        1.00000000 -0.102176839  0.011746599
## Age         -0.10217684  1.000000000  0.006488032
## Education    0.01174660  0.006488032  1.000000000
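
A quick way to read the matrix is to rank the predictors by the absolute size of their correlation with Sales. The sketch below does this with the values copied from the Sales row of the output above; the vector name sales_cor is only for illustration.

```r
# Correlations with Sales, copied from the Sales row of the matrix above
sales_cor <- c(
  CompPrice   =  0.06407873,
  Income      =  0.151950979,
  Advertising =  0.269506781,
  Population  =  0.050470984,
  Price       = -0.44495073,
  Age         = -0.231815440,
  Education   = -0.051955242
)

# Sort from strongest to weakest, ignoring the sign
ranked <- sales_cor[order(abs(sales_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Pairwise correlations ignore the other predictors. CompPrice is only weakly correlated with Sales here, yet it is highly significant in the full model, where Price (correlated 0.58 with CompPrice) is held fixed.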

Model with selected predictors: Price, Urban, US

Price

The very low p-value of the t-statistic for Price suggests a relationship between Price and Sales. The coefficient is negative: holding Urban and US fixed, each one-unit increase in Price is associated with a decrease in Sales of about 0.054 thousand units (roughly 54 car seats).

UrbanYes

The high p-value of the t-statistic (0.936) gives no evidence of a relationship between whether the store is in an urban location and the number of sales.

USYes

The low p-value of the t-statistic suggests a relationship between whether the store is in the US and the amount of sales. The coefficient is positive: holding Price and Urban fixed, a store in the US sells on average approximately 1,201 more units than a store outside the US (Sales is measured in thousands, and the coefficient is about 1.2006).

Model in equation form

Sales = 13.04 - 0.05 Price - 0.02 UrbanYes + 1.20 USYes
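
To see the equation at work, the sketch below plugs in the unrounded coefficients from the summary(lm2) output. The helper predict_sales and the price of 100 are hypothetical choices for illustration; the dummy variables are entered as 1 for Yes and 0 for No.

```r
# Coefficients copied from the summary(lm2) output in this notebook
b <- c(intercept = 13.043469, price = -0.054459,
       urban = -0.021916, us = 1.200573)

# Predicted Sales in thousands of units; urban and us are 1 for Yes, 0 for No
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price +
           b["urban"] * urban + b["us"] * us)
}

# Two urban stores at the same (hypothetical) price, one in the US, one not
in_us  <- predict_sales(price = 100, urban = 1, us = 1)
not_us <- predict_sales(price = 100, urban = 1, us = 0)

print(in_us)
print((in_us - not_us) * 1000)  # US effect expressed in units
```

The difference between the two predictions is the USYes coefficient, which is where the figure of roughly 1,201 units comes from.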

lm2 = lm(Sales~Price+Urban+US)
summary(lm2)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Fitting a smaller model

Since the Urban variable had a high p-value, let’s build a model without it.

lm3 = lm(Sales ~ Price + US)
summary(lm3)

## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Comparing the models

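Based on the RSE and R^2 of the two regressions, both models fit the data similarly. The smaller model (lm3, without Urban) fits slightly better: its RSE is lower (2.469 vs. 2.472) and its adjusted R^2 is higher (0.2354 vs. 0.2335), while the multiple R^2 is the same (0.2393) for both.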
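
Adjusted R^2 penalizes each extra predictor, which is why it can differ between two models with the same multiple R^2. As a rough check, the sketch below recomputes it from the printed values; adj_r2 is a hypothetical helper, and because the inputs are already rounded to four decimals the result can differ from the printed output in the last digit.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n  <- 400      # observations in Carseats
r2 <- 0.2393   # multiple R^2 printed for both lm2 and lm3 (rounded)

adj_lm2 <- adj_r2(r2, n, p = 3)  # Price, UrbanYes, USYes
adj_lm3 <- adj_r2(r2, n, p = 2)  # Price, USYes
print(round(c(lm2 = adj_lm2, lm3 = adj_lm3), 4))
```

With the fitted objects from this notebook, anova(lm3, lm2) would run the corresponding F-test of whether adding Urban improves the fit.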

Confidence intervals

confint(lm3)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
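
By default, confint() on a linear model returns 95% intervals of the form estimate plus or minus a t quantile times the standard error. The sketch below reproduces the USYes interval by hand from the rounded values in the summary(lm3) output, so it should agree with the output above to about four decimal places.

```r
# Values copied from the summary(lm3) output in this notebook
estimate  <- 1.19964   # USYes coefficient
std_error <- 0.25846   # its standard error
df_resid  <- 397       # residual degrees of freedom

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 4))
```

The interval excludes zero, which is consistent with the small p-value reported for USYes.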

Looking for outliers

All studentized residuals appear to lie between -3 and 3, so the plot below does not suggest any outliers.

plot(predict(lm3), rstudent(lm3))
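
The same check can be done numerically instead of by eye. The sketch below uses simulated data (not Carseats) with one outlier planted on purpose, and flags observations whose studentized residual exceeds 3 in absolute value. The cutoff of 3 is a common rule of thumb, not a formal test.

```r
set.seed(1)  # simulated data, not Carseats

# Simple linear relationship with unit-variance noise
x <- runif(100, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(100)

# Plant one gross outlier so there is something to find
y[50] <- y[50] + 10

fit <- lm(y ~ x)

# Indices of observations whose studentized residual exceeds 3 in absolute value
flagged <- which(abs(rstudent(fit)) > 3)
print(flagged)
```

For the model in this notebook, which(abs(rstudent(lm3)) > 3) would list any such observations; an empty result would agree with the reading of the plot.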

Let’s plot the residuals.

par(mfrow=c(2,2))
plot(lm3)

Here is an explanation of these plots: http://data.library.virginia.edu/diagnostic-plots/