Multiple Regression: Svedka Case Study

Summary of variables used for this exercise

setwd("~/Documents/ESCP/Consumer Insight Analytics/Berlin Semester")
svedka <- read.csv("data_svedka.csv", sep=";")
svedka <- svedka[,c("LnDiff", "LnLPrice","Tier1", "Tier2", "LnPrint", "LnOut", "LnBroad", "LagTotalMinusSales", "Firstintro")]
svedka$Tier1 <- factor(svedka$Tier1)
svedka$Tier2 <- factor(svedka$Tier2)
svedka$Firstintro <- factor(svedka$Firstintro)
skimr::skim(svedka)

Data summary
Name	svedka
Number of rows	263
Number of columns	9
_______________________
Column type frequency:
factor	3
numeric	6
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
Tier1	1	FALSE	2	0: 202, 1: 61
Tier2	1	FALSE	2	0: 198, 1: 65
Firstintro	1	FALSE	2	0: 259, 1: 4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
LnDiff	0	1.00	0.06	0.18	-0.56	-0.02	0.04	0.10	1.15	▁▇▂▁▁
LnLPrice	13	0.95	4.22	0.79	3.31	3.63	3.93	4.76	9.94	▇▃▁▁▁
LnPrint	0	1.00	3.34	4.22	0.00	0.00	0.00	8.31	10.45	▇▁▁▂▃
LnOut	0	1.00	1.35	2.53	0.00	0.00	0.00	0.00	8.09	▇▁▁▁▁
LnBroad	0	1.00	0.98	2.43	-0.51	0.00	0.00	0.00	8.97	▇▁▁▁▁
LagTotalMinusSales	0	1.00	62673.94	1548.35	55687.00	62459.00	63204.00	63616.00	64131.00	▁▁▁▂▇

Standardize data

for(i in 2:ncol(svedka) ){
  if(!is.factor(svedka[,i])){
    svedka[,i] = BBmisc::normalize(svedka[,i], method="standardize")
  }
}

Question 1

model1 <- lm(LnDiff~LnLPrice, data= svedka)
summary(model1)

## 
## Call:
## lm(formula = LnDiff ~ LnLPrice, data = svedka)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64941 -0.06471 -0.01342  0.03412  1.08970 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.05606    0.01090   5.144 5.46e-07 ***
## LnLPrice     0.03296    0.01092   3.018  0.00281 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1723 on 248 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.03543,    Adjusted R-squared:  0.03154 
## F-statistic:  9.11 on 1 and 248 DF,  p-value: 0.002807

The estimated coefficient on “LnLPrice” is positive, meaning that a 1% change in lagged price is accompanied by an increase in the change of Sales (i.e. an 3.30% increase). This is counterintuitive since we would expected the change of Sales to decrease when Price rises. This might the case because if the item is a Geffen good

Question 2

svedka$Tier1 = factor(svedka$Tier1)
svedka$Tier2 = factor(svedka$Tier2)
model2 <- lm(LnDiff~LnLPrice + Tier1+Tier2, data= svedka)
summary(model2)

## 
## Call:
## lm(formula = LnDiff ~ LnLPrice + Tier1 + Tier2, data = svedka)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69098 -0.07163 -0.01123  0.04441  1.02773 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01213    0.01896  -0.640 0.523047    
## LnLPrice    -0.02521    0.01762  -1.430 0.153922    
## Tier11       0.16245    0.04274   3.800 0.000182 ***
## Tier21       0.13202    0.03271   4.037 7.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1667 on 246 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.1045, Adjusted R-squared:  0.0936 
## F-statistic: 9.571 on 3 and 246 DF,  p-value: 5.316e-06

After adding Tier1 and Tier2, both of which are significant and positively related to “lnDiff”, “LnLPrice” looses its significance

Question 3

model3 <- lm(LnDiff~LnLPrice + Tier1+Tier2+LnPrint+LnOut+ LnBroad, data= svedka)
summary(model3)

## 
## Call:
## lm(formula = LnDiff ~ LnLPrice + Tier1 + Tier2 + LnPrint + LnOut + 
##     LnBroad, data = svedka)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74956 -0.05876 -0.00335  0.04323  1.03341 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.007096   0.024365  -0.291  0.77112   
## LnLPrice    -0.029882   0.017634  -1.695  0.09144 . 
## Tier11       0.145045   0.057519   2.522  0.01232 * 
## Tier21       0.130454   0.043547   2.996  0.00302 **
## LnPrint      0.039301   0.021414   1.835  0.06769 . 
## LnOut       -0.032326   0.015352  -2.106  0.03626 * 
## LnBroad     -0.012019   0.013017  -0.923  0.35676   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1651 on 243 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.1329, Adjusted R-squared:  0.1115 
## F-statistic: 6.207 on 6 and 243 DF,  p-value: 4.441e-06

Results remain unchanged, “Tier1” and “Tier2” are still significant at a 5% lvl while “LnLPrice” is not. The only type of marketing expenditure to be significant is “LnOut” with a negative effect on change in sales (coeff = -0.03).

Question 4

model4 <- lm(LnDiff~LnLPrice + Tier1+Tier2+LnPrint+LnOut+ LnBroad+LagTotalMinusSales, data= svedka)
summary(model4)

## 
## Call:
## lm(formula = LnDiff ~ LnLPrice + Tier1 + Tier2 + LnPrint + LnOut + 
##     LnBroad + LagTotalMinusSales, data = svedka)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76515 -0.05951 -0.00818  0.04514  1.01927 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.0005737  0.0238519  -0.024 0.980830    
## LnLPrice           -0.0453686  0.0177397  -2.557 0.011155 *  
## Tier11              0.1292767  0.0563161   2.296 0.022557 *  
## Tier21              0.1210729  0.0425868   2.843 0.004851 ** 
## LnPrint             0.0485755  0.0210608   2.306 0.021931 *  
## LnOut              -0.0151311  0.0157246  -0.962 0.336880    
## LnBroad             0.0086629  0.0139391   0.621 0.534867    
## LagTotalMinusSales  0.0492794  0.0136533   3.609 0.000373 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1611 on 242 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.1772, Adjusted R-squared:  0.1534 
## F-statistic: 7.445 on 7 and 242 DF,  p-value: 4.119e-08

“LagTotalMinusSales” is significant with a positive coefficient equal to 0.05. However, now “LnLPrice” is significant and has a negative effect (coeff = -0.05) on change in Sales, which is consistent with our intuition. Also, “LnPrint” has become significant (coeff = 0.05) while “LnOut” has lost its significance.

Question 5

svedka$Firstintro = factor(svedka$Firstintro)
model5 <- lm(LnDiff~LnLPrice + Tier1+Tier2+LnPrint+LnOut+ LnBroad+LagTotalMinusSales+Firstintro, data= svedka)
summary(model5)

## 
## Call:
## lm(formula = LnDiff ~ LnLPrice + Tier1 + Tier2 + LnPrint + LnOut + 
##     LnBroad + LagTotalMinusSales + Firstintro, data = svedka)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73740 -0.05241 -0.00542  0.04732  0.87169 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.01065    0.02187  -0.487 0.626748    
## LnLPrice           -0.06858    0.01657  -4.139 4.82e-05 ***
## Tier11              0.12732    0.05152   2.471 0.014147 *  
## Tier21              0.12873    0.03897   3.303 0.001102 ** 
## LnPrint             0.05447    0.01928   2.824 0.005134 ** 
## LnOut              -0.01987    0.01440  -1.380 0.168835    
## LnBroad             0.01847    0.01283   1.440 0.151247    
## LagTotalMinusSales  0.04790    0.01249   3.835 0.000161 ***
## Firstintro1         0.54270    0.07816   6.943 3.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1474 on 241 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.3143, Adjusted R-squared:  0.2916 
## F-statistic: 13.81 on 8 and 241 DF,  p-value: < 2.2e-16

Firstintro1 is highly significant with a coefficient of 0.54. Notably, the adjusted R-squared has increased by almost 100%

Question 6

The coefficients of price and advertising change because we are fixing the issue of omitted variable bias. When we exclude variables that explain a large proportion of the variance in the dependent variable, the model suffers from omitted variable bias and can cause the coefficients within the model to become misleading. In other words, omitting a variable may lead to either the overestimation or the underestimation of a coefficient in the model. In Question 1, we perform simple regression of change in sales on lagged price, so the model is very likely to suffer from omitted variable bias (given only one independent variable), and therefore, we get a coefficient for price that is counterintuitive.

Question 7

Based on the above results I would recommend the producers of Svedka to move into the highest Tier (i.e. Tier 2) since it yields the largest change in sales (12.87%) compared to all other tiers. Furthermore, I would recommend prioritizing marketing expenditure on print since it yields the highest increase in sales (5.45%) while also considering broadcasting expenditure since it yields an increase in sales of 1.85%. Finally, the first three years of a brand’s introduction yield a 54.27% higher change in sales compared to subsequent years, therefore, if Svedka is considering rebranding itself, I would recommend taking aggressive marketing strategies during the first three years of the new brand’s release so as to capitalize on this information.