This discussion analyzes the Carseats data from the ISLR R package, which accompanies the textbook “An Introduction to Statistical Learning: With Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. The objective of this article is to fit and interpret a multiple linear regression on these data.
The article begins with an overview of the data and basic summary statistics, then fits the model, analyzes the residuals and ends with a discussion of our findings.
The code below loads the ISLR library and displays basic facts about the Carseats data set.
library(ISLR)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1 9.50 138 73 11 276 120 Bad 42
## 2 11.22 111 48 16 260 83 Good 65
## 3 10.06 113 35 10 269 80 Medium 59
## 4 7.40 117 100 4 466 97 Medium 55
## 5 4.15 141 64 3 340 128 Bad 38
## 6 10.81 124 113 13 501 72 Bad 78
## Education Urban US
## 1 17 Yes Yes
## 2 10 Yes Yes
## 3 12 Yes Yes
## 4 14 Yes Yes
## 5 13 Yes No
## 6 16 No Yes
We make several observations:
There are \(n=400\) observations of 11 variables in the dataframe Carseats.
The data set contains store-level sales data for car seats sold at stores both inside and outside the US.
The textbook notes that the dataset contains simulated data.
We next summarize the variables that are used in the models below.
summary(Carseats[,c(1:4,6:8,11)])
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Price ShelveLoc Age US
## Min. : 24.0 Bad : 96 Min. :25.00 No :142
## 1st Qu.:100.0 Good : 85 1st Qu.:39.75 Yes:258
## Median :117.0 Medium:219 Median :54.50
## Mean :115.8 Mean :53.32
## 3rd Qu.:131.0 3rd Qu.:66.00
## Max. :191.0 Max. :80.00
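The positional index c(1:4,6:8,11) in the summary call is compact but hard to read. As a minimal sketch, the same columns can be selected by name; the helper variables cols and same_columns are introduced here for illustration only.

```r
library(ISLR)  # provides the Carseats data frame

# Columns used in the summary, selected by name rather than by position
cols <- c("Sales", "CompPrice", "Income", "Advertising",
          "Price", "ShelveLoc", "Age", "US")

# Confirm that the names match the positional selection c(1:4, 6:8, 11)
same_columns <- identical(names(Carseats)[c(1:4, 6:8, 11)], cols)
print(same_columns)

# Same summary as before, now robust to a change in column order
summary(Carseats[, cols])
```

Selecting by name leaves out Population, Education and Urban explicitly, which are the three variables not used in the models.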
library(ggplot2)
ggplot(data=Carseats, aes(x=Price)) + geom_histogram(color="orange", fill="orange", bins=30)
ggplot(data=Carseats, aes(x=Sales)) + geom_histogram(color="red", fill="red", bins=30)
ggplot(data=Carseats, aes(x=Income)) + geom_histogram(color="blue", fill="blue", bins=30)
ggplot(data=Carseats, aes(x=Advertising)) + geom_histogram(color="green", fill="green", bins = 30)
ggplot(data=Carseats, aes(x=Age)) + geom_histogram(color="green", fill="tan", bins = 30)
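We draw basic conclusions from the histograms above together with the summary statistics. For example, the first quartile of Advertising is 0, so at least a quarter of the stores spend nothing on advertising.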
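The five histograms can also be drawn as one faceted figure, which makes the distributions easier to compare side by side. This is only a sketch: the names num_cols, long and hist_panel, and the colors, are illustrative choices, and base R's stack() is used for the reshape to avoid extra packages.

```r
library(ISLR)     # Carseats data
library(ggplot2)  # plotting

num_cols <- c("Price", "Sales", "Income", "Advertising", "Age")

# stack() reshapes the selected columns to long form:
# `values` holds the numbers, `ind` holds the source column name
long <- stack(Carseats[, num_cols])

hist_panel <- ggplot(long, aes(x = values)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~ ind, scales = "free") +  # separate axes for each variable
  labs(x = NULL, y = "Number of stores")

print(hist_panel)
```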
We will seek to explain each store’s carseat price as a function of the other variables. Since store prices vary, we can expect the economic laws of supply and demand to play a role. Perhaps the other variables also affect those economic laws.
The following linear model plm1 explains Price as a function of Sales, Income, Income squared, competitor price (CompPrice), Advertising, shelf location (ShelveLoc), Age, US and the interaction of US with Income.
plm1 = lm( Price ~ Sales + Income + CompPrice +
Advertising + ShelveLoc + Age + I(Income^2) +
US + US * Income, data=Carseats)
summary(plm1)
##
## Call:
## lm(formula = Price ~ Sales + Income + CompPrice + Advertising +
## ShelveLoc + Age + I(Income^2) + US + US * Income, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.7634 -6.6244 0.5941 6.3158 26.8659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.9355437 5.2715410 9.662 <2e-16 ***
## Sales -8.0153407 0.2248809 -35.643 <2e-16 ***
## Income -0.0732989 0.0899600 -0.815 0.4157
## CompPrice 0.9630368 0.0309733 31.092 <2e-16 ***
## Advertising 1.0420181 0.0994714 10.476 <2e-16 ***
## ShelveLocGood 39.1599867 1.7494366 22.384 <2e-16 ***
## ShelveLocMedium 15.6245470 1.2280668 12.723 <2e-16 ***
## Age -0.3820463 0.0303585 -12.584 <2e-16 ***
## I(Income^2) 0.0012687 0.0006336 2.002 0.0459 *
## USYes -3.6626652 2.7302177 -1.342 0.1805
## Income:USYes 0.0350630 0.0354840 0.988 0.3237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.312 on 389 degrees of freedom
## Multiple R-squared: 0.8492, Adjusted R-squared: 0.8453
## F-statistic: 219.1 on 10 and 389 DF, p-value: < 2.2e-16
\[\begin{align} Price &= 50.94 - 8.02\, Sales - 0.07\, Income + 0.96\, CompPrice + 1.04\, Advertising \\ & + 39.16\, Ind\{ShelveLoc = Good\} + 15.62\, Ind\{ShelveLoc = Medium\} \\ & - 0.38\, Age + 0.0013\, Income^2 - 3.66\, Ind\{US = Yes\} + 0.035\, Income \times Ind\{US = Yes\} \end{align}\]
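To see how the fitted equation turns a store's data into a predicted price, the sketch below rebuilds the prediction for the first store by hand and compares it with predict(). The variable names x1, by_hand and from_predict are illustrative. Note that the factor levels enter as 0/1 indicator columns, with ShelveLoc = Bad and US = No as the baselines.

```r
library(ISLR)

plm1 <- lm(Price ~ Sales + Income + CompPrice + Advertising + ShelveLoc +
             Age + I(Income^2) + US + US * Income, data = Carseats)

# Design-matrix row for the first store: an intercept column, the numeric
# predictors, and 0/1 indicator columns for the non-baseline factor levels
x1 <- model.matrix(plm1)[1, ]
print(x1)

# Fitted price by hand: sum of (coefficient * predictor value)
by_hand <- sum(coef(plm1) * x1)

# The same fitted value from predict()
from_predict <- unname(predict(plm1, newdata = Carseats[1, ]))

print(c(by_hand = by_hand, from_predict = from_predict))
```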
The interaction between US and Income is not statistically significant (p = 0.32) and, I believe, not relevant to carseat pricing. The quadratic term Income^2 is only marginally significant (p = 0.046) with a very small coefficient, so it is also of limited value. Thus we consider a smaller alternative model below, which drops both terms:
plm2 = lm( Price ~ Sales + Income + CompPrice +
Advertising + ShelveLoc + Age +
US , data=Carseats)
summary(plm2)
##
## Call:
## lm(formula = Price ~ Sales + Income + CompPrice + Advertising +
## ShelveLoc + Age + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.3650 -6.5868 0.1072 6.4287 27.2797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.52254 4.76374 9.556 < 2e-16 ***
## Sales -8.03250 0.22498 -35.703 < 2e-16 ***
## Income 0.12540 0.01733 7.237 2.45e-12 ***
## CompPrice 0.95502 0.03091 30.892 < 2e-16 ***
## Advertising 1.04248 0.09992 10.433 < 2e-16 ***
## ShelveLocGood 39.28903 1.74909 22.463 < 2e-16 ***
## ShelveLocMedium 15.72316 1.22859 12.798 < 2e-16 ***
## Age -0.38474 0.03047 -12.628 < 2e-16 ***
## USYes -1.29448 1.35093 -0.958 0.339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.354 on 391 degrees of freedom
## Multiple R-squared: 0.8471, Adjusted R-squared: 0.8439
## F-statistic: 270.7 on 8 and 391 DF, p-value: < 2.2e-16
The above second model (plm2) will be discussed in the next section as a suitable candidate model to explain pricing.
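Because plm2 is nested inside plm1, the two can also be compared formally. The sketch below refits both models (named full and reduced here for illustration) and runs a partial F-test with anova(), which tests whether the two dropped terms jointly improve the fit, followed by an AIC comparison. These checks are not in the original analysis and are offered only as a complement to reading the individual p-values.

```r
library(ISLR)

# Full model (plm1 in the text) and reduced model (plm2 in the text)
full <- lm(Price ~ Sales + Income + CompPrice + Advertising + ShelveLoc +
             Age + I(Income^2) + US + US * Income, data = Carseats)
reduced <- lm(Price ~ Sales + Income + CompPrice + Advertising + ShelveLoc +
                Age + US, data = Carseats)

# Partial F-test: do I(Income^2) and Income:US jointly add explanatory power?
cmp <- anova(reduced, full)
print(cmp)

# Information criterion as a second view; lower AIC is preferred
print(AIC(reduced, full))
```

A large p-value in the anova table would support dropping both terms together, while a small one would argue for keeping at least one of them.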
The quality of the model fit is reasonable in both cases.
plm1 (the full model) has regression coefficients with reasonable signs and magnitudes.
Moreover, 7 of its 11 coefficients (including the intercept) are significant at the 0.001 level. The fit is strong: the multiple R-squared is 84.92% and the adjusted R-squared is 84.53%, so the model explains over 84% of the variance of the price.
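The adjusted R-squared penalizes the multiple R-squared for the number of predictors. As a check, the standard formula with \(n = 400\) observations and \(p = 10\) estimated slopes reproduces the value reported for plm1:

\[\begin{align} R^2_{adj} &= 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \\ &= 1 - (1 - 0.8492)\,\frac{399}{389} \approx 0.8453 \end{align}\]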
We observe the following for plm1, noting the plm2 estimate where it differs materially:
\[\beta_{Sales} = -8.02\] means higher sales go with lower prices (consistent with the economic law of supply and demand).
\[\beta_{Income} = -0.07\] is not statistically significant in plm1, where Income^2 and the interaction also carry part of the income effect. In plm2, \(\beta_{Income} = 0.13\) is highly significant and means higher local income drives up prices (also reasonable).
\[\beta_{CompPrice} = 0.96\] means higher competitor prices drive up the store price (consistent with microeconomics).
\[\beta_{Advertising} = 1.04\] means higher advertising goes with higher prices.
\[\beta_{ShelveLoc=Good} = 39.16\] means a good shelf location adds 39.16 dollars to the price relative to a bad one. This is probably the single biggest factor in commanding a higher price.
\[\beta_{ShelveLoc=Medium} = 15.62\] means a mediocre shelf location adds 15.62 dollars to the price over a bad one.
\[\beta_{Age} = -0.38\] means an older local population tends to drive down the demand for and price of carseats.
\[\beta_{Income^2} = 0.0013\] means the effect of income squared is small. Its p-value (0.046) is just under the 5 percent significance level.
\[\beta_{US} = -3.66\] means a US store commands a lower price than a foreign store, but the effect is not statistically significant (p = 0.18).
\[\beta_{US*Income} = 0.035\] means that in the US each additional $1000 of income raises the carseat price by about 3.5 cents more than it does outside the US. However, this interaction effect is not statistically significant (p = 0.32).
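To see how the three income-related terms in plm1 combine, one can differentiate the fitted equation with respect to Income. The sketch below uses the plm1 estimates from the summary output and, as an illustrative choice, evaluates the slope at the median income of 69:

\[\begin{align} \frac{\partial Price}{\partial Income} &= \beta_{Income} + 2\,\beta_{Income^2}\, Income + \beta_{US*Income}\, Ind\{US = Yes\} \\ \text{outside the US:}\quad & -0.0733 + 2(0.00127)(69) \approx 0.102 \\ \text{in the US:}\quad & -0.0733 + 2(0.00127)(69) + 0.0351 \approx 0.137 \end{align}\]

Both values are close to the single income slope of 0.125 in plm2, which suggests the two models imply similar income effects near the middle of the data.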
par(mfrow=c(2,2))
plot(plm1)
In the above panel of residual plots, the full model shows no obvious issues with regard to residuals or influential outliers.
par(mfrow=c(2,2))
plot(plm2)
Looking at the second, alternative model above, we see similar results.
Both models are acceptable based on residual plots.
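The visual check can be complemented with a few numeric diagnostics. The sketch below, for plm2, counts observations flagged by common rule-of-thumb cutoffs for standardized residuals, leverage and Cook's distance. The cutoffs (3, 2p/n and 4/n) are conventions assumed here for illustration, not hard limits, and the flags data frame is a hypothetical helper.

```r
library(ISLR)

plm2 <- lm(Price ~ Sales + Income + CompPrice + Advertising + ShelveLoc +
             Age + US, data = Carseats)

n <- nrow(Carseats)
p <- length(coef(plm2))        # number of estimated coefficients

std_res <- rstandard(plm2)     # standardized residuals
lev     <- hatvalues(plm2)     # leverage of each observation
cook    <- cooks.distance(plm2)

# Rule-of-thumb flags (conventions, not hard cutoffs)
flags <- data.frame(
  large_residual = abs(std_res) > 3,
  high_leverage  = lev > 2 * p / n,
  large_cook     = cook > 4 / n
)

# Number of stores flagged by each criterion
print(colSums(flags))

# The single most influential store by Cook's distance
print(Carseats[which.max(cook), ])
```

Flagged observations are candidates for a closer look rather than automatic removal.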
We conclude that price is a function of other variables such as sales volume, income and shelf location. I believe the parsimonious plm2 model (i.e. the one with no quadratic or interaction term) is preferable. While those terms may make the model look more advanced, they have little economic or intuitive support to justify their inclusion.
The plm2 model has a slightly lower adjusted R-squared (0.8439 versus 0.8453) but still explains 84 percent of the price variation. Moreover, all coefficients except US are significant, and excluding Income^2 makes Income statistically significant again. Lower sales, higher income, higher advertising and a good shelf location are the key factors associated with higher carseat prices.