Car Seats Analysis

Sameer Mathur (2018)

Regression Analysis

Read the Car Seats Data

  Sales CompPrice Income Advertising Population Price ShelveLoc Age
1  9.50       138     73          11        276   120     0-Bad  42
2  4.15       141     64           3        340   128     0-Bad  38
3 10.81       124    113          13        501    72     0-Bad  78
4  9.01       121     78           9        150   100     0-Bad  26
5 10.14       145    119          16        294   113     0-Bad  42
6  2.99       103     74           0        359    97     0-Bad  55
  Education Urban  US Revenue Profit
1        17   Yes Yes 1140.00 228.00
2        13   Yes  No  531.20 106.24
3        16    No Yes  778.32 155.66
4        10    No Yes  901.00 180.20
5        12   Yes Yes 1145.82 229.16
6        11   Yes Yes  290.03  58.01

Structure of the data frame

# data structure of the dataframe
str(carSeats.df)
'data.frame':   400 obs. of  13 variables:
 $ Sales      : num  9.5 4.15 10.81 9.01 10.14 ...
 $ CompPrice  : int  138 141 124 121 145 103 104 130 119 157 ...
 $ Income     : int  73 64 113 78 119 74 99 60 98 53 ...
 $ Advertising: int  11 3 13 9 16 0 15 0 0 0 ...
 $ Population : int  276 340 501 150 294 359 226 144 18 403 ...
 $ Price      : int  120 128 72 100 113 97 102 138 126 124 ...
 $ ShelveLoc  : Factor w/ 3 levels "0-Bad","1-Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Age        : int  42 38 78 26 42 55 58 38 73 58 ...
 $ Education  : int  17 13 16 10 12 11 17 10 17 16 ...
 $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 2 1 1 2 ...
 $ US         : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 1 1 1 ...
 $ Revenue    : num  1140 531 778 901 1146 ...
 $ Profit     : num  228 106 156 180 229 ...

MODEL 0: Regress Profit on Advertising


Call:
lm(formula = p0, data = carSeats.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-171.85  -34.38   -3.78   35.97  168.85 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 147.3290     4.0320  36.540  < 2e-16 ***
Advertising   3.0660     0.4295   7.139 4.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.05 on 398 degrees of freedom
Multiple R-squared:  0.1135,    Adjusted R-squared:  0.1113 
F-statistic: 50.97 on 1 and 398 DF,  p-value: 4.493e-12

IMPACT OF ADVERTISING ON PROFIT

  • An increase in advertising by $1000, increases the profit by $3066

MODEL 1: Regress Profit on ShelveLoc


Call:
lm(formula = p1, data = carSeats.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-163.350  -33.330   -1.365   31.033  153.050 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        120.041      4.808   24.97  < 2e-16 ***
ShelveLoc1-Medium   43.309      5.767    7.51 3.93e-13 ***
ShelveLoc2-Good    112.558      7.016   16.04  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.11 on 397 degrees of freedom
Multiple R-squared:  0.3971,    Adjusted R-squared:  0.394 
F-statistic: 130.7 on 2 and 397 DF,  p-value: < 2.2e-16

IMPACT OF SHELF LOCATION ON PROFIT

  • A change in Shelf Location from Bad to Medium, increases the profit by $43,309
  • A change in Shelf Location from Bad to Good, increases the profit by $112,558

MODEL 2: Regress Profit on both Advertising and ShelveLoc


Call:
lm(formula = p2, data = carSeats.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-145.446  -25.160    0.039   24.796  104.054 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        103.014      4.886  21.085  < 2e-16 ***
Advertising          2.738      0.328   8.347 1.18e-15 ***
ShelveLoc1-Medium   42.433      5.325   7.968 1.73e-14 ***
ShelveLoc2-Good    109.453      6.489  16.867  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.5 on 396 degrees of freedom
Multiple R-squared:  0.4873,    Adjusted R-squared:  0.4834 
F-statistic: 125.4 on 3 and 396 DF,  p-value: < 2.2e-16

IMPACT OF ADVERTISING ON PROFIT

  • An increase in advertising by $1, increases the profit by $2.74


IMPACT OF SHELF LOCATION ON PROFIT

  • A change in Shelf Location from Bad to Medium, increases the profit by $42,433
  • A change in Shelf Location from Bad to Good, increases the profit by $109,453

  • Notice that when we jointly consider the role of Advertising and Shelf Location, the 'answers' are different.

  • We were 'over-predicting' the impact of Advertising when we ignored the impact of Shelf Location (See Model 0)

  • We were 'over-predicting' the impact of Shelf Location when we ignored the impact of Advertising (See Model 1)

MODEL 3: Is there an INTERACTION between Advertising and Shelf Location?

Regress Profit on both Advertising and ShelveLoc


Call:
lm(formula = p3, data = carSeats.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-143.299  -24.554    1.032   24.480  106.201 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   108.6051     6.1743  17.590  < 2e-16 ***
Advertising                     1.8390     0.6903   2.664  0.00803 ** 
ShelveLoc1-Medium              34.6937     7.4206   4.675 4.04e-06 ***
ShelveLoc2-Good               103.1563     9.3093  11.081  < 2e-16 ***
Advertising:ShelveLoc1-Medium   1.2276     0.8190   1.499  0.13472    
Advertising:ShelveLoc2-Good     0.9950     0.9812   1.014  0.31118    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.48 on 394 degrees of freedom
Multiple R-squared:  0.4902,    Adjusted R-squared:  0.4837 
F-statistic: 75.77 on 5 and 394 DF,  p-value: < 2.2e-16

MODEL 4:

p4 <- Profit ~ Advertising + ShelveLoc + CompPrice + Population + Income + Age + Education + Urban + US
fitp4 = lm(p4 ,data=carSeats.df )
summary (fitp4)

MODEL 4:


Call:
lm(formula = p4, data = carSeats.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-159.654  -17.856    2.075   20.401   68.740 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -60.136334  18.171281  -3.309  0.00102 ** 
Advertising         2.959638   0.334864   8.838  < 2e-16 ***
ShelveLoc1-Medium  44.625673   3.797403  11.752  < 2e-16 ***
ShelveLoc2-Good   109.212841   4.608355  23.699  < 2e-16 ***
CompPrice           1.563453   0.101946  15.336  < 2e-16 ***
Population          0.005803   0.011149   0.521  0.60299    
Income              0.370657   0.055560   6.671 8.73e-11 ***
Age                -0.952238   0.095685  -9.952  < 2e-16 ***
Education          -0.679343   0.593883  -1.144  0.25337    
UrbanYes            2.981043   3.402003   0.876  0.38143    
USYes              -5.783463   4.511614  -1.282  0.20064    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.69 on 389 degrees of freedom
Multiple R-squared:  0.7493,    Adjusted R-squared:  0.7429 
F-statistic: 116.3 on 10 and 389 DF,  p-value: < 2.2e-16

NOTES

  • The Adjusted R-squared is higher (74% compared to 48%)
  • The impact of Advertising and Shelf Location after controling for the competition and the consumer demographics is higher than the predicted impact when we ignored these control variables
  • The control factors that statistically significant are {CompPrice, Income, Age}

Comparing MODEL 2, MODEL 4, : How Is Model 4 'better' than Model 2?

# Print the summary of the two regression models
summary(fitp2)

Call:
lm(formula = p2, data = carSeats.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-145.446  -25.160    0.039   24.796  104.054 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        103.014      4.886  21.085  < 2e-16 ***
Advertising          2.738      0.328   8.347 1.18e-15 ***
ShelveLoc1-Medium   42.433      5.325   7.968 1.73e-14 ***
ShelveLoc2-Good    109.453      6.489  16.867  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.5 on 396 degrees of freedom
Multiple R-squared:  0.4873,    Adjusted R-squared:  0.4834 
F-statistic: 125.4 on 3 and 396 DF,  p-value: < 2.2e-16
#summary (fitp4)

Comparing MODEL 2, MODEL 4, : How Is Model 4 'better' than Model 2?

# Print the summary of the two regression models
#summary(fitp2)
summary (fitp4)

Comparing MODEL 2, MODEL 4, : How Is Model 4 'better' than Model 2?


Call:
lm(formula = p4, data = carSeats.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-159.654  -17.856    2.075   20.401   68.740 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -60.136334  18.171281  -3.309  0.00102 ** 
Advertising         2.959638   0.334864   8.838  < 2e-16 ***
ShelveLoc1-Medium  44.625673   3.797403  11.752  < 2e-16 ***
ShelveLoc2-Good   109.212841   4.608355  23.699  < 2e-16 ***
CompPrice           1.563453   0.101946  15.336  < 2e-16 ***
Population          0.005803   0.011149   0.521  0.60299    
Income              0.370657   0.055560   6.671 8.73e-11 ***
Age                -0.952238   0.095685  -9.952  < 2e-16 ***
Education          -0.679343   0.593883  -1.144  0.25337    
UrbanYes            2.981043   3.402003   0.876  0.38143    
USYes              -5.783463   4.511614  -1.282  0.20064    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.69 on 389 degrees of freedom
Multiple R-squared:  0.7493,    Adjusted R-squared:  0.7429 
F-statistic: 116.3 on 10 and 389 DF,  p-value: < 2.2e-16

Conceptually: Model 4 studies the joint effect of many more factors than Model 2

Method 1: Compare the Adjusted R Squared

(summary(fitp2))$adj.r.squared
[1] 0.483383
(summary(fitp4))$adj.r.squared
[1] 0.7428985

Method 2: Compare the AIC

AIC(fitp2)
[1] 4159.3
AIC(fitp4)
[1] 3887.034

Method 3: ANOVA (since Model 2 is nested within Model 4)

anova(fitp2, fitp4)
Analysis of Variance Table

Model 1: Profit ~ Advertising + ShelveLoc
Model 2: Profit ~ Advertising + ShelveLoc + CompPrice + Population + Income + 
    Age + Education + Urban + US
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    396 749257                                  
2    389 366287  7    382970 58.103 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

MODEL 5:

p5 <- Profit ~ Advertising*Income + ShelveLoc + CompPrice + Population + Age + Education + Urban + US
fitp5 = lm(p5 ,data=carSeats.df )
summary (fitp5)

Call:
lm(formula = p5, data = carSeats.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-159.421  -17.994    1.991   19.878   73.950 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -53.675367  18.315500  -2.931  0.00358 ** 
Advertising          1.650310   0.679306   2.429  0.01558 *  
Income               0.248046   0.078290   3.168  0.00165 ** 
ShelveLoc1-Medium   44.793177   3.779306  11.852  < 2e-16 ***
ShelveLoc2-Good    109.503315   4.587353  23.871  < 2e-16 ***
CompPrice            1.572174   0.101517  15.487  < 2e-16 ***
Population           0.004853   0.011102   0.437  0.66225    
Age                 -0.937659   0.095438  -9.825  < 2e-16 ***
Education           -0.700463   0.591011  -1.185  0.23667    
UrbanYes             3.243896   3.387197   0.958  0.33881    
USYes               -5.243940   4.495835  -1.166  0.24417    
Advertising:Income   0.018512   0.008370   2.212  0.02756 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.53 on 388 degrees of freedom
Multiple R-squared:  0.7525,    Adjusted R-squared:  0.7454 
F-statistic: 107.2 on 11 and 388 DF,  p-value: < 2.2e-16

Conceptually: Model 5 is 'better than Model 4

Method 1: Compare the Adjusted R Squared

(summary(fitp4))$adj.r.squared
[1] 0.7428985
(summary(fitp5))$adj.r.squared
[1] 0.7454454

Method 2: Compare the AIC

AIC(fitp4)
[1] 3887.034
AIC(fitp5)
[1] 3884.022

Method 3: ANOVA (since Model 4 is nested within Model 5)

anova(fitp4, fitp5)
Analysis of Variance Table

Model 1: Profit ~ Advertising + ShelveLoc + CompPrice + Population + Income + 
    Age + Education + Urban + US
Model 2: Profit ~ Advertising * Income + ShelveLoc + CompPrice + Population + 
    Age + Education + Urban + US
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    389 366287                              
2    388 361726  1    4560.8 4.8921 0.02756 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1