Data 605 Discussion Week 12

Alexander Ng

11/15/2019

Introduction

This discussion analyzes the Carseats data from the ISLR R package, which accompanies the textbook “An Introduction to Statistical Learning: with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. The objective of this article is to fit a multiple linear regression that includes a quadratic term, a dichotomous (two-level factor) term and an interaction between a dichotomous and a quantitative variable.

The article begins with an overview of the data and a summary of basic statistics, then fits the model, analyzes the residuals and ends with a discussion of the findings.

Overview of the Data

The code below loads the ISLR library and displays basic facts about the Carseats data set.

library(ISLR)

str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1  9.50       138     73          11        276   120       Bad  42
## 2 11.22       111     48          16        260    83      Good  65
## 3 10.06       113     35          10        269    80    Medium  59
## 4  7.40       117    100           4        466    97    Medium  55
## 5  4.15       141     64           3        340   128       Bad  38
## 6 10.81       124    113          13        501    72       Bad  78
##   Education Urban  US
## 1        17   Yes Yes
## 2        10   Yes Yes
## 3        12   Yes Yes
## 4        14   Yes Yes
## 5        13   Yes  No
## 6        16    No Yes

We make several observations:

  1. There are \(n=400\) observations of 11 variables in the dataframe Carseats.

  2. The data set is store-level sales data for child car seats, from stores both inside and outside the US.

  3. There are 3 factor variables:
    • ShelveLoc, a factor with 3 levels (Bad, Good and Medium) indicating the quality of the shelving location
    • Urban, a factor with levels Yes and No indicating whether the store is in an urban or rural location
    • US, a factor with levels Yes and No indicating whether the store is in the US
  4. The numeric variables used in this analysis are:
    • Sales – unit sales (in thousands) at each location
    • Income – community income level (in thousands of dollars)
    • Price – price the company charges for car seats at each site
    • CompPrice – price charged by the competitor at each location
    • Advertising – local advertising budget at each location (in thousands of dollars)
    • Age – average age of the local population

The textbook notes that the dataset contains simulated data.
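Since the factor variables enter the regression later as indicator columns, it may help to see how R encodes a three-level factor such as ShelveLoc. The sketch below uses a stand-in factor (`shelf`, a hypothetical name) with the same three levels rather than the Carseats column itself, and it assumes R's default treatment contrasts are in effect.

```r
# Stand-in factor with the same levels as Carseats$ShelveLoc (illustration only)
shelf <- factor(c("Bad", "Good", "Medium"))

# levels() shows the alphabetical ordering; the first level is the baseline
levels(shelf)

# contrasts() shows the 0/1 indicator columns a linear model would use
# under the default treatment coding
shelf_coding <- contrasts(shelf)
print(shelf_coding)
```

Under this coding Bad gets no column of its own, so the coefficients reported for Good and Medium are differences relative to a Bad shelf location.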

Summarizing the Data

We are going to summarize the individual data elements of the data set.

summary(Carseats[,c(1:4,6:8,11)])
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##      Price        ShelveLoc        Age          US     
##  Min.   : 24.0   Bad   : 96   Min.   :25.00   No :142  
##  1st Qu.:100.0   Good  : 85   1st Qu.:39.75   Yes:258  
##  Median :117.0   Medium:219   Median :54.50            
##  Mean   :115.8                Mean   :53.32            
##  3rd Qu.:131.0                3rd Qu.:66.00            
##  Max.   :191.0                Max.   :80.00
# ggplot2 must be loaded for the histograms below
library(ggplot2)

ggplot(data=Carseats, aes(x=Price)) + geom_histogram(color="orange", fill="orange", bins=30)

ggplot(data=Carseats, aes(x=Sales)) + geom_histogram(color="red", fill="red", bins=30) 

ggplot(data=Carseats, aes(x=Income)) + geom_histogram(color="blue", fill="blue", bins=30) 

ggplot(data=Carseats, aes(x=Advertising)) + geom_histogram(color="green", fill="green", bins = 30)

ggplot(data=Carseats, aes(x=Age)) + geom_histogram(color="green", fill="tan", bins = 30)

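The histograms show the marginal distribution of each variable. The summary statistics already point to a few basic features: Sales and Price have nearly equal means and medians, which suggests roughly symmetric distributions, while Advertising has a first quartile of 0, so at least a quarter of the stores report no advertising budget.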

Fitting the Model

We will seek to explain each store’s car seat price as a function of the other variables. Since store prices vary, we can expect the economic laws of supply and demand to play a role. The other variables may shift supply or demand as well.

The following linear model plm1 explains Price as a function of Sales, Income, Income squared, CompPrice, Advertising, ShelveLoc, Age, US and the interaction of US with Income.

plm1 = lm( Price ~ Sales + Income + CompPrice + 
             Advertising + ShelveLoc + Age + I(Income^2) + 
             US + US * Income, data=Carseats)

summary(plm1)
## 
## Call:
## lm(formula = Price ~ Sales + Income + CompPrice + Advertising + 
##     ShelveLoc + Age + I(Income^2) + US + US * Income, data = Carseats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.7634  -6.6244   0.5941   6.3158  26.8659 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     50.9355437  5.2715410   9.662   <2e-16 ***
## Sales           -8.0153407  0.2248809 -35.643   <2e-16 ***
## Income          -0.0732989  0.0899600  -0.815   0.4157    
## CompPrice        0.9630368  0.0309733  31.092   <2e-16 ***
## Advertising      1.0420181  0.0994714  10.476   <2e-16 ***
## ShelveLocGood   39.1599867  1.7494366  22.384   <2e-16 ***
## ShelveLocMedium 15.6245470  1.2280668  12.723   <2e-16 ***
## Age             -0.3820463  0.0303585 -12.584   <2e-16 ***
## I(Income^2)      0.0012687  0.0006336   2.002   0.0459 *  
## USYes           -3.6626652  2.7302177  -1.342   0.1805    
## Income:USYes     0.0350630  0.0354840   0.988   0.3237    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.312 on 389 degrees of freedom
## Multiple R-squared:  0.8492, Adjusted R-squared:  0.8453 
## F-statistic: 219.1 on 10 and 389 DF,  p-value: < 2.2e-16

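Rounding the coefficients, the fitted plm1 model is:

\[\begin{align} Price &= 50.94 - 8.02 \, Sales - 0.07 \, Income + 0.96 \, CompPrice + 1.04 \, Advertising \\ & + 39.16 \, Ind\{ ShelveLoc = Good \} + 15.62 \, Ind\{ ShelveLoc = Medium \} \\ & - 0.38 \, Age + 0.0013 \, Income^2 - 3.66 \, Ind\{ US = Yes \} + 0.035 \, Income \cdot Ind\{ US = Yes \} \end{align} \]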
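The plm1 formula lists `US` and `Income` on their own and again inside `US * Income`. As a small check on how R reads this, the sketch below expands the formula with `terms()`. No data is needed for the expansion, and the variable names `plm1_formula` and `plm1_terms` are introduced here only for illustration.

```r
# The plm1 formula, copied from the model fit above
plm1_formula <- Price ~ Sales + Income + CompPrice +
  Advertising + ShelveLoc + Age + I(Income^2) +
  US + US * Income

# terms() expands the formula; "term.labels" lists each distinct term once
plm1_terms <- attr(terms(plm1_formula), "term.labels")
print(plm1_terms)
```

The duplicated main effects collapse, so `US * Income` contributes only one extra term, the interaction, which is why the coefficient table has a single Income:USYes row.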

The interaction between US and Income is not statistically significant (p = 0.32), and I believe it has little relevance to carseat pricing. The quadratic term Income^2 is only marginally significant (p = 0.046) with a very small coefficient, so it is also of limited value. Thus we consider an alternate smaller model below which drops those terms:

plm2 = lm( Price ~ Sales + Income + CompPrice + 
             Advertising + ShelveLoc + Age  + 
             US , data=Carseats)

summary(plm2)
## 
## Call:
## lm(formula = Price ~ Sales + Income + CompPrice + Advertising + 
##     ShelveLoc + Age + US, data = Carseats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.3650  -6.5868   0.1072   6.4287  27.2797 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     45.52254    4.76374   9.556  < 2e-16 ***
## Sales           -8.03250    0.22498 -35.703  < 2e-16 ***
## Income           0.12540    0.01733   7.237 2.45e-12 ***
## CompPrice        0.95502    0.03091  30.892  < 2e-16 ***
## Advertising      1.04248    0.09992  10.433  < 2e-16 ***
## ShelveLocGood   39.28903    1.74909  22.463  < 2e-16 ***
## ShelveLocMedium 15.72316    1.22859  12.798  < 2e-16 ***
## Age             -0.38474    0.03047 -12.628  < 2e-16 ***
## USYes           -1.29448    1.35093  -0.958    0.339    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.354 on 391 degrees of freedom
## Multiple R-squared:  0.8471, Adjusted R-squared:  0.8439 
## F-statistic: 270.7 on 8 and 391 DF,  p-value: < 2.2e-16

The second model (plm2) is discussed in the sections below as a suitable candidate model to explain pricing.
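Because plm2 is nested in plm1, the two fits can also be compared with a partial F-test on the two dropped terms. The sketch below is one way to do that using only the rounded residual standard errors and degrees of freedom printed in the two summaries, so its result is approximate. All variable names are introduced here for illustration.

```r
# Residual standard errors and residual degrees of freedom,
# copied from the two summary() outputs above (rounded values)
rse_full <- 9.312      # plm1
df_full <- 389
rse_reduced <- 9.354   # plm2
df_reduced <- 391

# Residual sum of squares: RSS = RSE^2 * residual df
rss_full <- rse_full^2 * df_full
rss_reduced <- rse_reduced^2 * df_reduced

# Partial F statistic for the terms dropped from plm1
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) /
  (rss_full / df_full)
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

With the fitted objects in the session, `anova(plm2, plm1)` performs the same comparison from the unrounded values and is the more reliable route.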

Quality of Fit

The quality of the model fit is reasonable in both cases.

plm1 (the first, full model) has regression coefficients of reasonable sign and magnitude.
Moreover, 7 of the 11 coefficients (including the intercept) are significant at the 0.001 level. The adjusted R-squared is high at 84.53% (multiple R-squared 84.92%), so this model explains over 84% of the variance of the price.
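The gap between the multiple and the adjusted R-squared comes from the penalty for the number of predictors. As a rough check, the sketch below recomputes the adjusted values from the multiple R-squared figures in the two summaries, using the standard formula \(1 - (1 - R^2)(n-1)/(n-p-1)\). The helper `adj_r2` is a hypothetical name, and because the inputs are rounded the last digit may differ from the printed output.

```r
n <- 400  # observations in Carseats

# Adjusted R-squared from multiple R-squared and p predictors
# (p excludes the intercept)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_plm1 <- adj_r2(0.8492, n, p = 10)  # plm1: 10 slope coefficients
adj_plm2 <- adj_r2(0.8471, n, p = 8)   # plm2: 8 slope coefficients
print(round(c(plm1 = adj_plm1, plm2 = adj_plm2), 4))
```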

For plm1 we observe that:

  • \(\beta_{Sales} = -8.02\) means higher sales go with lower prices (consistent with the economic law of supply and demand).
  • \(\beta_{Income} = -0.07\) is not statistically significant in plm1, where the income effect is shared with the quadratic and interaction terms. In plm2, which drops those terms, \(\beta_{Income} = 0.13\) means higher local income drives up prices (also reasonable).
  • \(\beta_{CompPrice} = 0.96\) means higher competitor prices drive up the store price (consistent with microeconomics).
  • \(\beta_{Advertising} = 1.04\) means higher advertising goes with higher prices.
  • \(\beta_{ShelveLoc=Good} = 39.16\) means a good shelf location adds 39.16 dollars to the price relative to a bad one. This is probably the single biggest factor in commanding a higher price.
  • \(\beta_{ShelveLoc=Medium} = 15.62\) means a mediocre shelf location adds 15.62 dollars to the price over a bad one.
  • \(\beta_{Age} = -0.38\) means an older local population tends to drive down the demand for and price of carseats.
  • \(\beta_{Income^2} = 0.0013\) means the effect of income squared is small. Its p-value (0.046) is just under the 5 percent significance level.
  • \(\beta_{US} = -3.66\) means a US store commands a lower price than a foreign store, but the coefficient is not statistically significant.
  • \(\beta_{US*Income} = 0.035\) means that in the US an additional 1,000 dollars of local income raises the carseat price by about 3.5 cents more than it does outside the US. However, this interaction effect is not statistically significant.
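In plm1 the Income coefficient cannot be read on its own, because Income also appears in the quadratic and interaction terms. The slope of Price with respect to Income is \(\beta_{Income} + 2 \beta_{Income^2} Income + \beta_{US*Income} Ind\{US\}\). The sketch below evaluates that slope at the sample mean income, using the coefficients printed in the plm1 summary. The function and variable names are introduced here for illustration.

```r
# plm1 coefficients copied from the summary output above
b_income <- -0.0732989
b_income_sq <- 0.0012687
b_income_us <- 0.0350630

# Slope of Price with respect to Income at a given income level;
# us = 1 for a US store, 0 otherwise
income_slope <- function(income, us) {
  b_income + 2 * b_income_sq * income + b_income_us * us
}

mean_income <- 68.66  # sample mean of Income from summary(Carseats)
slope_non_us <- income_slope(mean_income, us = 0)
slope_us <- income_slope(mean_income, us = 1)
print(c(non_US = slope_non_us, US = slope_us))
```

At the mean income both slopes are positive, roughly 0.10 outside the US and 0.14 inside it, which is in line with the single Income coefficient of 0.13 in plm2.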

Residual Analysis

par(mfrow=c(2,2))
plot(plm1)

In the above panel of residual plots, the main model shows no evident problems with its residuals or with influential outliers.

par(mfrow=c(2,2))
plot(plm2)

Looking at the second, alternative model above, we see similar results.

Both models are acceptable based on residual plots.
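The visual reading of the residual plots can be complemented with a few numbers. The sketch below defines a hypothetical helper, `residual_checks`, that reports a Shapiro-Wilk p-value for the residuals, the largest Cook's distance, and the largest absolute standardized residual. The 4/n cutoff for Cook's distance is a common rule of thumb, not a formal test. So that the block runs on its own, it is demonstrated on simulated stand-in data rather than on Carseats.

```r
# Hypothetical helper: numeric companions to the plot(lm) panels
residual_checks <- function(fit) {
  res <- residuals(fit)
  cook <- cooks.distance(fit)
  n <- length(res)
  list(
    shapiro_p = shapiro.test(res)$p.value,        # normality of residuals
    max_cook = max(cook),                         # most influential point
    n_high_cook = sum(cook > 4 / n),              # rule-of-thumb cutoff 4/n
    max_abs_rstandard = max(abs(rstandard(fit)))  # largest standardized residual
  )
}

# Simulated stand-in data (not the Carseats data)
set.seed(1)
demo <- data.frame(x = rnorm(100))
demo$y <- 2 + 3 * demo$x + rnorm(100)
demo_fit <- lm(y ~ x, data = demo)

checks <- residual_checks(demo_fit)
print(checks)
```

With the article's models in the session, `residual_checks(plm1)` and `residual_checks(plm2)` would give the corresponding figures for the two fits.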

Discussion

We conclude that price is a function of other variables such as Sales volume, Income and Shelf Location. I believe the parsimonious plm2 model (i.e. the one with no quadratic or interaction term) is preferable. While these terms may seem more advanced, they have too little economic or intuitive support to justify their inclusion.

The plm2 model has a slightly lower adjusted R-squared (0.8439 versus 0.8453) but still explains 84 percent of the price variation. Moreover, all coefficients except US are significant. Excluding Income^2 and the interaction also makes Income statistically significant again. Lower sales, higher income, higher advertising and a good shelf location are the key drivers of higher carseat prices.