Project Overview

This project was developed using data supplied by MMA, a consulting firm that specializes in marketing mix models. The purpose of this project is to understand how sales for Brand C relate to factors in the marketplace (own marketing efforts, others’ marketing efforts, environmental factors). This information could then be used to (i) assess the relative impact of different elements of the marketing mix and (ii) forecast sales volume.

This is an integrated dataset with 179 weeks (Feb 2000 – Jul 2003) of observations for a variety of marketing mix variables. Brand C is one of the big players in a fairly commoditized product category. Brands R, E, P and U are some of the other brands in the category. Brands C and U are owned by the same company. Some of the measures in the dataset should look familiar, while others may be new. The key dependent variable in the dataset is equivalent unit sales volume. You have selling price information for brands C, E, and P. The variable disacv_c measures not only how deep a discount was offered for brand C, but also how prevalent that discount was across sales outlets.

A unique measure of promotional activity included in the dataset is expressed in terms of coupon value within an FSI (free-standing insert) drop and how big the coupon drop was (in terms of circulation). This measure is reported as two variables for brand C, depending on whether the period falls on a holiday or not. The dataset also has information about coupon drops for competitors R and E.

Information from Nielsen Media Research is incorporated into the dataset as TV GRP information for commercials featuring brands C and U. A gross rating point (GRP) is a variable used to measure the “impact” of television advertising. There is also an indicator variable for the thematic focus of the television advertising message.

The dataset also includes information about the prevalence of brand C’s bonus pack offering, a measure of line length per store expressed as a rolling average of SKUs per store, and (using panel data) the percent share of brand C that is sold through Wal*Mart.


Definitions of variables in dataset:

week: Week of observations
weeknumber: Week number
month: Month
year: Year
eq_volum_c: Equivalent unit sales volume for brand C (the dependent variable)
disacv_c: Brand C %ACV * % Discount (captures both the depth of the price discount and how prevalent it was; in effect, a weighted average price discount)
bonusacv: %ACV for stores in which brand C bonus pack had sales
price_c: Brand C price per equivalent unit (non promoted price)
price_e: Brand E price per equivalent unit (non promoted price)
price_p: Private label price per equivalent unit (non promoted price)
tvgrp_c: Brand C TV GRPs (GRPs = reach × frequency, i.e., the number of people viewing the commercial times the number of times they see it)
tvgrp_u: Brand U TV GRPs (GRPs = reach × frequency, i.e., the number of people viewing the commercial times the number of times they see it)
trustad: Theme of Brand C TV advertising focused on the message “Trusted”. Included to indicate times when this ad ran.
fsi_holi: Brand C Holiday FSIs (coupon value * circulation)
fsi_non: Brand C Non-Holiday FSIs (coupon value * circulation)
fsi_comp: Brand E or R FSIs (coupon value * circulation)
itemstor: Number of Brand C items sold per store – rolling 13 week average
walmart: Wal*Mart share


Part 1: Data exploration and key findings

Install and load required packages

# Uncomment to install packages for the first time

#install.packages("tidyverse")
#install.packages("vtable")
#install.packages("gmodels")
#install.packages("GGally")
#install.packages("performance")
#install.packages("car")
#install.packages("reshape2")
#install.packages("psych")
#install.packages("corrplot")
#install.packages("lmtest")
#install.packages("gridExtra")


library(tidyverse)
library(vtable)
library(gmodels)
library(GGally) 
library(performance)
library(car)
library(reshape2)
library(corrplot)
library(psych) 
library(lmtest)
library(gridExtra)
# Set the working directory to the location of the regression dataset
Datset = read.csv("MMM regression dataset.csv",header=TRUE)
attach(Datset) 

Summary statistics

# Generate basic summary statistics
sumtable(Datset)
Summary Statistics

Variable     N    Mean         Std. Dev.    Min      Pctl. 25   Pctl. 75   Max
observation  179  90           51.817       1        45.5       134.5      179
weeknumber   179  25.788       14.556       1        14         38         53
year         179  2001.374     1.038        2000     2000.5     2002       2003
eq_volum     179  4430421.475  1982087.846  2829145  3413788.5  4536892.5  15500000
disacv_c     179  15.141       6.133        5.42     10.99      17.325     33.29
bonusacv     179  33.067       22.361       0        18         54         69
price_c      179  0.928        0.022        0.84     0.92       0.94       0.96
price_e      179  1.02         0.054        0.7      0.99       1.06       1.12
price_p      179  0.712        0.038        0.54     0.69       0.74       0.79
tvgrp_c      179  48.144       82.006       0        0          84.25      346
tvgrp_u      179  36.62        81.79        0        0          2.5        398
trustad      179  0.263        0.441        0        0          1          1
fsi_holi     179  1014.827     6180.888     0        0          0          41590
fsi_non      179  3216.587     10883.519    0        0          0          41676
fsi_comp     179  7239.799     17776.866    0        0          0          92896
itemstor     179  9.242        0.38         8.5      9          9.5        10
walmart      179  0.317        0.063        0.22     0.27       0.35       0.45
summary(Datset)
##   observation      weekDay            weeknumber       month          
##  Min.   :  1.0   Length:179         Min.   : 1.00   Length:179        
##  1st Qu.: 45.5   Class :character   1st Qu.:14.00   Class :character  
##  Median : 90.0   Mode  :character   Median :25.00   Mode  :character  
##  Mean   : 90.0                      Mean   :25.79                     
##  3rd Qu.:134.5                      3rd Qu.:38.00                     
##  Max.   :179.0                      Max.   :53.00                     
##       year         eq_volum           disacv_c        bonusacv    
##  Min.   :2000   Min.   : 2829145   Min.   : 5.42   Min.   : 0.00  
##  1st Qu.:2000   1st Qu.: 3413788   1st Qu.:10.99   1st Qu.:18.00  
##  Median :2001   Median : 3879590   Median :14.17   Median :35.00  
##  Mean   :2001   Mean   : 4430421   Mean   :15.14   Mean   :33.07  
##  3rd Qu.:2002   3rd Qu.: 4536892   3rd Qu.:17.32   3rd Qu.:54.00  
##  Max.   :2003   Max.   :15500000   Max.   :33.29   Max.   :69.00  
##     price_c          price_e        price_p          tvgrp_c      
##  Min.   :0.8400   Min.   :0.70   Min.   :0.5400   Min.   :  0.00  
##  1st Qu.:0.9200   1st Qu.:0.99   1st Qu.:0.6900   1st Qu.:  0.00  
##  Median :0.9300   Median :1.02   Median :0.7200   Median :  0.00  
##  Mean   :0.9279   Mean   :1.02   Mean   :0.7117   Mean   : 48.14  
##  3rd Qu.:0.9400   3rd Qu.:1.06   3rd Qu.:0.7400   3rd Qu.: 84.25  
##  Max.   :0.9600   Max.   :1.12   Max.   :0.7900   Max.   :346.00  
##     tvgrp_u          trustad          fsi_holi        fsi_non     
##  Min.   :  0.00   Min.   :0.0000   Min.   :    0   Min.   :    0  
##  1st Qu.:  0.00   1st Qu.:0.0000   1st Qu.:    0   1st Qu.:    0  
##  Median :  0.00   Median :0.0000   Median :    0   Median :    0  
##  Mean   : 36.62   Mean   :0.2626   Mean   : 1015   Mean   : 3217  
##  3rd Qu.:  2.50   3rd Qu.:1.0000   3rd Qu.:    0   3rd Qu.:    0  
##  Max.   :398.00   Max.   :1.0000   Max.   :41590   Max.   :41676  
##     fsi_comp        itemstor         walmart     
##  Min.   :    0   Min.   : 8.500   Min.   :0.220  
##  1st Qu.:    0   1st Qu.: 9.000   1st Qu.:0.270  
##  Median :    0   Median : 9.200   Median :0.300  
##  Mean   : 7240   Mean   : 9.242   Mean   :0.317  
##  3rd Qu.:    0   3rd Qu.: 9.500   3rd Qu.:0.350  
##  Max.   :92896   Max.   :10.000   Max.   :0.450
# generate summary statistics for our response variable
summary(eq_volum)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  2829145  3413788  3879590  4430421  4536892 15500000

Based on the summary statistics, brand E has the highest average price per equivalent unit, whereas brand P (the private label) has the lowest. For our response variable, we observe a huge gap between the mean and the maximum: eq_volum averages around 4.4M, with a minimum of 2.8M and a maximum of 15.5M. This indicates extreme outliers toward the upper end of eq_volum. For the coupon (FSI) variables, our competitors (E or R) show higher average coupon-drop activity than brand C.

The histogram and boxplot for eq_volum (the response variable) show a right-skewed distribution, with a few extreme values at the upper end.

ggplot(Datset, aes(x=eq_volum)) + geom_histogram(bins = 50)

ggplot(Datset, aes(x=eq_volum)) + geom_boxplot()

Dealing with extreme outliers

As noted above, the distribution of eq_volum is right-skewed, with a few extreme values at the upper end. To mitigate the effect of these extreme outliers, we will perform a log transformation on our response variable.

# Create a new column to store the log-transformed sales volume
Datset$log_eq_volum <- log(Datset$eq_volum)
# Visualize distribution again
ggplot(Datset, aes(x=log_eq_volum)) + geom_histogram(bins = 50)

Slightly better, but still not normally distributed. We will consider removing the outliers for now to see their effect.

We will create a new dataset without the outliers to see whether it makes for a better model in the end.

# Create dataset with outliers removed

summary(eq_volum)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  2829145  3413788  3879590  4430421  4536892 15500000
dim(Datset[eq_volum < 8000000,]) # Removing outliers with eq_volum >= 8,000,000 leaves 169 observations, 10 fewer than the original data
## [1] 169  20
Datset[eq_volum >= 8000000,] # The outliers all fall in the last few weeks of a year or the first week of the next year
##     observation    weekDay weeknumber    month year eq_volum disacv_c bonusacv
## 44           44 12/23/2000         52 December 2000 11300000    25.62       22
## 45           45 12/30/2000         53 December 2000 11100000    21.17       24
## 95           95 12/15/2001         50 December 2001  9705819    33.29       60
## 96           96 12/22/2001         51 December 2001 12200000    31.38       63
## 97           97 12/29/2001         52 December 2001 13700000    26.06       65
## 98           98   1/5/2002          1  January 2002  8562433    27.69       63
## 147         147 12/14/2002         50 December 2002  9774228    30.03       66
## 148         148 12/21/2002         51 December 2002 13100000    33.23       67
## 149         149 12/28/2002         52 December 2002 15500000    30.03       69
## 150         150   1/4/2003          1  January 2003  8360552    26.34       67
##     price_c price_e price_p tvgrp_c tvgrp_u trustad fsi_holi fsi_non fsi_comp
## 44     0.90    0.98    0.72     0.0     337       0    19000       0        0
## 45     0.91    1.05    0.71     0.0     398       0        0       0        0
## 95     0.95    1.00    0.70   142.0       0       0    41590       0        0
## 96     0.94    0.92    0.70   190.0       0       0        0       0    65935
## 97     0.93    1.01    0.73   240.0       0       0        0       0        0
## 98     0.94    1.04    0.73   161.7       0       0        0       0        0
## 147    0.95    1.02    0.74   165.8       0       1    41474       0    58564
## 148    0.93    0.95    0.74   199.9       0       1        0       0        0
## 149    0.94    1.02    0.74   305.0       0       1        0       0        0
## 150    0.95    1.00    0.75     0.0       0       1        0       0        0
##     itemstor walmart log_eq_volum
## 44       9.9    0.26     16.24031
## 45       9.8    0.23     16.22246
## 95       9.3    0.33     16.08824
## 96       9.2    0.33     16.31695
## 97       9.1    0.30     16.43291
## 98       9.1    0.30     15.96289
## 147      9.1    0.40     16.09526
## 148      9.1    0.40     16.38812
## 149      9.0    0.39     16.55635
## 150      9.0    0.40     15.93904
outlier_removed <- Datset[eq_volum < 8000000,]
ggplot(outlier_removed, aes(x=log(eq_volum))) + geom_histogram(bins = 50)

Closer to a normal distribution with the outliers removed.

Looking at price variations

Let’s now look at how price varies by week and by year. We will use line charts to visualize how Brand C’s price varies across weeks in each year compared to Brands E and P. Because we have very limited observations (179), we will not remove the outliers moving forward.

We will attach the dataset again after adding in the new log transformed response variable.

attach(Datset)
## The following objects are masked from Datset (pos = 3):
## 
##     bonusacv, disacv_c, eq_volum, fsi_comp, fsi_holi, fsi_non,
##     itemstor, month, observation, price_c, price_e, price_p, trustad,
##     tvgrp_c, tvgrp_u, walmart, weekDay, weeknumber, year
dim(Datset)
## [1] 179  20
# Mean price for Brand C, E, P grouped by week and by year
C_price_week_year <- aggregate(price_c, list(weeknumber, year), mean)
E_price_week_year <- aggregate(price_e, list(weeknumber, year), mean)
P_price_week_year <- aggregate(price_p, list(weeknumber, year), mean)

# Rename the column headers for the three brands
C_price_week_year <- C_price_week_year %>% 
  rename(
    C_price = x,
    week_number = Group.1,
    year = Group.2
  )

E_price_week_year <- E_price_week_year %>% 
  rename(
    E_price = x,
    week_number = Group.1,
    year = Group.2
  )

P_price_week_year <- P_price_week_year %>% 
  rename(
    P_price = x,
    week_number = Group.1,
    year = Group.2
  )

# Generate a line plot to visualize Brand C, E and P prices across week by year
lineplot_C_price_week_year <- C_price_week_year %>% 
  arrange(year, week_number) %>%
  ggplot() +
  geom_line(aes(week_number, C_price, color = factor(year), group = year)) +
  scale_color_brewer(palette = "Set1") +  # Choose a color palette for the year variable
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 5))

lineplot_E_price_week_year <- E_price_week_year %>% 
  arrange(year, week_number) %>%
  ggplot() +
  geom_line(aes(week_number, E_price, color = factor(year), group = year)) +
  scale_color_brewer(palette = "Set1") +  # Choose a color palette for the year variable
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 5))

lineplot_P_price_week_year <- P_price_week_year %>% 
  arrange(year, week_number) %>%
  ggplot() +
  geom_line(aes(week_number, P_price, color = factor(year), group = year)) +
  scale_color_brewer(palette = "Set1") +  # Choose a color palette for the year variable
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 5))


# Compare price across week by year among Brands C, E and P
grid.arrange(lineplot_C_price_week_year, lineplot_E_price_week_year, lineplot_P_price_week_year, nrow = 2)

Looking at price variation across Brands C, E and P, we do not observe any similarity in pricing strategy. Prices for Brand C remain relatively flat in periods where Brands E and P move around, so Brand C’s pricing strategy is much more stable overall. In 2003, however, Brand C’s price dropped significantly around week 20, while Brand E’s pricing that year was relatively more stable than Brand C’s and P’s. The cause of the drop in Brand C’s price could be examined with more business context; a starting point is sketched below.
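As a sketch of how that could start (assuming the weekDay column holds each week’s date, as in the summary output above), we can pull the surrounding weeks for inspection:

# Inspect the weeks around Brand C's 2003 price drop
Datset %>%
  filter(year == 2003, weeknumber %in% 15:25) %>%
  select(weekDay, weeknumber, price_c, price_e, price_p, disacv_c)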

Part 2: Exploring bivariate relationships using correlation

We will use x-y plots to show how sales vary with price.

corr_C <- ggplot(Datset, aes(x=price_c, y=eq_volum)) + 
  geom_point()+
  geom_smooth(method=lm)
corr_E <- ggplot(Datset, aes(x=price_e, y=eq_volum)) + 
  geom_point()+
  geom_smooth(method=lm)
corr_P <- ggplot(Datset, aes(x=price_p, y=eq_volum)) + 
  geom_point()+
  geom_smooth(method=lm)

grid.arrange(corr_C, corr_E, corr_P, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

No strong correlation is observed between eq_volum and any individual brand’s price. Oddly, Brand C’s own price shows a small positive correlation with sales, when you would expect a higher price to reduce sales. At most this suggests that demand for Brand C is relatively price inelastic; more likely, the raw correlation is confounded (for example, by holiday weeks combining high demand with higher prices), and indeed the coefficient on price_c turns negative once other factors are controlled for in the regression models below.

We will generate a correlation matrix to see the associations between the variables.

data.new<-Datset %>% select (eq_volum,price_c,price_e,price_p,tvgrp_c,tvgrp_u,trustad,fsi_holi,fsi_non,fsi_comp,disacv_c,bonusacv,itemstor,walmart)

corr_matrix <- cor(data.new)
corrplot(corr_matrix, method = "color", type = "upper", tl.col = "black",
         tl.srt = 45, addCoef.col = "black", col = colorRampPalette(c("red", "white", "darkgreen"))(100), tl.cex = 0.8, number.cex = 0.7)

According to our x-y plots, changes in Brand E’s price seem to have very little effect on Brand C’s sales, but Brand C’s sales seem to increase somewhat when Brand P’s price increases. However, as shown by the correlation matrix, each of these associations is fairly weak. Additionally, as shown below, the p-value is fairly high for the correlation between price_e and eq_volum (p = 0.55), but lower for eq_volum/price_c (p = 0.093) and eq_volum/price_p (p = 0.035).

Check the significance of correlations that show a decent association between two variables.

# Check for significance of correlations
cor(price_c,eq_volum)
## [1] 0.1260649
cor.test(price_c,eq_volum)$p.value #Not Significant
## [1] 0.09265963
cor(price_p,eq_volum)
## [1] 0.1575022
cor.test(price_p,eq_volum)$p.value #Significant
## [1] 0.03523624
cor(price_e,eq_volum)
## [1] -0.04473694
cor.test(price_e,eq_volum)$p.value #Not Significant
## [1] 0.5520814
cor(disacv_c,eq_volum)
## [1] 0.7854491
cor.test(disacv_c,eq_volum)$p.value #Significant
## [1] 1.001884e-38
cor(bonusacv,eq_volum)
## [1] 0.3918788
cor.test(bonusacv,eq_volum)$p.value #Significant
## [1] 5.801737e-08
cor(tvgrp_c,eq_volum)
## [1] 0.4773334
cor.test(tvgrp_c,eq_volum)$p.value #Significant
## [1] 1.420318e-11
cor(fsi_holi,eq_volum)
## [1] 0.3590815
cor.test(fsi_holi,eq_volum)$p.value #Significant
## [1] 7.967732e-07
cor(bonusacv,disacv_c)
## [1] 0.4911011
cor.test(bonusacv,disacv_c)$p.value #Significant
## [1] 2.958754e-12

The correlation matrix shows no strong correlation between Brand C’s sales and the price of any brand. The variable disacv_c shows the strongest correlation with our response variable. This is expected, because price discounts generally have a strong positive association with sales. We also see a decent positive correlation between tvgrp_c and sales, suggesting that high-impact television campaigns translated into increased sales.
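As a cross-check on the pairwise tests above, the psych package (already loaded) can produce all correlations and p-values in a single call; a quick sketch:

# All pairwise correlations and (unadjusted) p-values in one shot
ct <- corr.test(data.new, adjust = "none")
round(ct$r["eq_volum", ], 2)  # correlations with sales
round(ct$p["eq_volum", ], 3)  # corresponding p-values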

We will now graph x-y plots of the variables with a decent or high association with sales.

highcorr_disacv_c <- ggplot(Datset, aes(x=eq_volum, y=disacv_c)) + 
  geom_point()+
  geom_smooth(method=lm)

highcorr_tvgrp_c <- ggplot(Datset, aes(x=eq_volum, y=tvgrp_c)) + 
  geom_point()+
  geom_smooth(method=lm)

highcorr_bonusacv <- ggplot(Datset, aes(x=eq_volum, y=bonusacv)) + 
  geom_point()+
  geom_smooth(method=lm)

highcorr_fsi_holi <- ggplot(Datset, aes(x=eq_volum, y=fsi_holi)) + 
  geom_point()+
  geom_smooth(method=lm)

grid.arrange(highcorr_disacv_c, highcorr_tvgrp_c, highcorr_bonusacv, highcorr_fsi_holi, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

These associations will be important to keep in mind as we build and modify our regression model.

An additional plot showing a correlation between two independent variables (bonusacv and disacv_c) that is high and significant.

# x-y plots of independent variables with high or decent association with one another
# (Not including response variable)

ggplot(Datset, aes(x=bonusacv, y=disacv_c)) + 
  geom_point()+
  geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

Part 3: Building regression models for marketing mix modeling

Model 0

mod0 includes all price variables regressed onto eq_volum.

mod0<- lm(eq_volum ~ price_c+price_p+price_e)
summary(mod0)
## 
## Call:
## lm(formula = eq_volum ~ price_c + price_p + price_e)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1941683 -1105038  -534027   336833 10743891 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -7839553    6534569  -1.200   0.2319  
## price_c     10404559    6846057   1.520   0.1304  
## price_p      7068768    3989865   1.772   0.0782 .
## price_e     -2368148    2817822  -0.840   0.4018  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1960000 on 175 degrees of freedom
## Multiple R-squared:  0.03881,    Adjusted R-squared:  0.02233 
## F-statistic: 2.355 on 3 and 175 DF,  p-value: 0.07362

The adjusted R-Square is low, but the relationship is not linear, so a log transformation may improve this.

Model 1

mod1 includes all price variables regressed onto the log transformation of eq_volum.

mod1<- lm(log(eq_volum) ~ price_c+price_p+price_e)
summary(mod1)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + price_e)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43123 -0.20394 -0.07393  0.12475  1.25538 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.7924     1.0406  12.294   <2e-16 ***
## price_c       1.9227     1.0902   1.764   0.0795 .  
## price_p       1.2484     0.6354   1.965   0.0510 .  
## price_e      -0.2182     0.4487  -0.486   0.6274    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3121 on 175 degrees of freedom
## Multiple R-squared:  0.04694,    Adjusted R-squared:  0.0306 
## F-statistic: 2.873 on 3 and 175 DF,  p-value: 0.03779

The adjusted R-Square is better, but still quite low.

There is a high degree of seasonality in sales, with volume peaking in the 52nd week of the year, as shown in the following.

ggplot(Datset, aes(weeknumber)) + 
  geom_line(aes(y = eq_volum, colour = "eq_volum"),stat = "summary", fun = "mean") +
    scale_x_continuous(breaks = seq(min(Datset$weeknumber), max(Datset$weeknumber), by = 5))

Let’s create a dummy variable called holiday to account for the spike in sales around the last week of the year.

# Create holiday dummy
Datset$holiday <- ifelse(weeknumber == 52, 1, 0)
attach(Datset)
## The following objects are masked from Datset (pos = 3):
## 
##     bonusacv, disacv_c, eq_volum, fsi_comp, fsi_holi, fsi_non,
##     itemstor, log_eq_volum, month, observation, price_c, price_e,
##     price_p, trustad, tvgrp_c, tvgrp_u, walmart, weekDay, weeknumber,
##     year
## The following objects are masked from Datset (pos = 4):
## 
##     bonusacv, disacv_c, eq_volum, fsi_comp, fsi_holi, fsi_non,
##     itemstor, month, observation, price_c, price_e, price_p, trustad,
##     tvgrp_c, tvgrp_u, walmart, weekDay, weeknumber, year

Model 2

mod2 again regresses onto log(eq_volum), but adds a dummy variable “holiday” to account for seasonality and the discounting variables (disacv_c and bonusacv), while removing price_e, which seems to have little relationship with eq_volum.

mod2<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday)
summary(mod2)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40887 -0.07047 -0.01691  0.05491  0.73982 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.306775   0.439319  34.842  < 2e-16 ***
## price_c     -1.706962   0.496192  -3.440 0.000729 ***
## price_p      1.201763   0.268815   4.471 1.41e-05 ***
## disacv_c     0.039454   0.001840  21.439  < 2e-16 ***
## bonusacv     0.001697   0.000554   3.063 0.002545 ** 
## holiday      0.639085   0.076517   8.352 2.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1259 on 173 degrees of freedom
## Multiple R-squared:  0.8466, Adjusted R-squared:  0.8422 
## F-statistic:   191 on 5 and 173 DF,  p-value: < 2.2e-16

The adjusted R-Square for mod2 is great (0.8422). What about TV activities?

Model 3

mod3 includes the tvgrp variables to see if their television activities can help predict sales.

mod3<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u)
summary(mod3)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday + tvgrp_c + tvgrp_u)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39366 -0.06617 -0.01270  0.05419  0.63552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.3747730  0.3995277  38.482  < 2e-16 ***
## price_c     -1.7470985  0.4506306  -3.877 0.000151 ***
## price_p      1.1407646  0.2447078   4.662 6.29e-06 ***
## disacv_c     0.0355696  0.0017886  19.887  < 2e-16 ***
## bonusacv     0.0023768  0.0005779   4.113 6.07e-05 ***
## holiday      0.5441300  0.0714369   7.617 1.69e-12 ***
## tvgrp_c      0.0007023  0.0001196   5.872 2.19e-08 ***
## tvgrp_u      0.0004574  0.0001301   3.515 0.000564 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1143 on 171 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.8699 
## F-statistic:   171 on 7 and 171 DF,  p-value: < 2.2e-16

The adjusted R-Square looks really good (0.8699) and the VIFs look fine:

vif(mod3) 
##  price_c  price_p disacv_c bonusacv  holiday  tvgrp_c  tvgrp_u 
## 1.391716 1.149278 1.638242 2.273884 1.151487 1.309934 1.542436

Models 4 & 5

mod4 and mod5 assess the additional predictive power of trustad and walmart, respectively.

mod4<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+trustad)
summary(mod4)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday + tvgrp_c + tvgrp_u + trustad)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39349 -0.06615 -0.01300  0.05446  0.63598 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.3573630  0.4386302  35.012  < 2e-16 ***
## price_c     -1.7321136  0.4773277  -3.629 0.000376 ***
## price_p      1.1460839  0.2514029   4.559 9.80e-06 ***
## disacv_c     0.0355627  0.0017952  19.810  < 2e-16 ***
## bonusacv     0.0023524  0.0006313   3.726 0.000264 ***
## holiday      0.5444877  0.0717384   7.590 2.02e-12 ***
## tvgrp_c      0.0007019  0.0001200   5.848 2.49e-08 ***
## tvgrp_u      0.0004580  0.0001307   3.505 0.000584 ***
## trustad      0.0023637  0.0242280   0.098 0.922394    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1147 on 170 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.8691 
## F-statistic: 148.8 on 8 and 170 DF,  p-value: < 2.2e-16
vif(mod4)
##  price_c  price_p disacv_c bonusacv  holiday  tvgrp_c  tvgrp_u  trustad 
## 1.552457 1.205999 1.640808 2.697226 1.154504 1.311456 1.546686 1.547260

We see trustad is not significant; it adds no predictive power.

mod5<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+walmart)
summary(mod5)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday + tvgrp_c + tvgrp_u + walmart)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39249 -0.06616 -0.01461  0.05533  0.63826 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.3179748  0.4768880  32.121  < 2e-16 ***
## price_c     -1.7020223  0.4963309  -3.429  0.00076 ***
## price_p      1.1464215  0.2467402   4.646 6.75e-06 ***
## disacv_c     0.0356026  0.0017999  19.781  < 2e-16 ***
## bonusacv     0.0022901  0.0007013   3.266  0.00132 ** 
## holiday      0.5456007  0.0719489   7.583 2.10e-12 ***
## tvgrp_c      0.0007006  0.0001202   5.829 2.74e-08 ***
## tvgrp_u      0.0004593  0.0001308   3.512  0.00057 ***
## walmart      0.0419640  0.1911064   0.220  0.82646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1147 on 170 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.8692 
## F-statistic: 148.8 on 8 and 170 DF,  p-value: < 2.2e-16
vif(mod5)
##  price_c  price_p disacv_c bonusacv  holiday  tvgrp_c  tvgrp_u  walmart 
## 1.678911 1.161944 1.649709 3.329535 1.161553 1.315527 1.549196 1.933420

We see walmart is not significant; it adds no predictive power.

Model 6

mod6 assesses the predictive power of Brand C’s coupon drops on holidays and non-holidays, and of competitors’ coupon drops.

mod6<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+fsi_comp+fsi_holi+fsi_non)
summary(mod6)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday + tvgrp_c + tvgrp_u + fsi_comp + fsi_holi + fsi_non)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36723 -0.06430 -0.01217  0.05499  0.65931 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.545e+01  4.004e-01  38.595  < 2e-16 ***
## price_c     -1.750e+00  4.488e-01  -3.899 0.000139 ***
## price_p      1.040e+00  2.487e-01   4.182 4.65e-05 ***
## disacv_c     3.486e-02  1.882e-03  18.520  < 2e-16 ***
## bonusacv     2.315e-03  5.758e-04   4.020 8.78e-05 ***
## holiday      5.647e-01  7.179e-02   7.866 4.24e-13 ***
## tvgrp_c      6.762e-04  1.201e-04   5.629 7.45e-08 ***
## tvgrp_u      4.277e-04  1.307e-04   3.273 0.001290 ** 
## fsi_comp     9.350e-07  4.984e-07   1.876 0.062388 .  
## fsi_holi     1.190e-06  1.519e-06   0.783 0.434611    
## fsi_non      7.862e-07  8.099e-07   0.971 0.333085    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1137 on 168 degrees of freedom
## Multiple R-squared:  0.8785, Adjusted R-squared:  0.8712 
## F-statistic: 121.4 on 10 and 168 DF,  p-value: < 2.2e-16
vif(mod6)
##  price_c  price_p disacv_c bonusacv  holiday  tvgrp_c  tvgrp_u fsi_comp 
## 1.394702 1.199014 1.833396 2.280540 1.175098 1.335049 1.571149 1.079837 
## fsi_holi  fsi_non 
## 1.212358 1.068911

None of these variables is significant at the 0.05 level (fsi_comp is only marginally so, p = 0.062); couponing shows no predictive power here.

Model 7

mod7 assesses the predictive power of itemstor, an indicator of line length at stores.

mod7<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+itemstor)
summary(mod7)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday + tvgrp_c + tvgrp_u + itemstor)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20703 -0.05860 -0.00928  0.04545  0.54431 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.8073911  0.4630615  27.658  < 2e-16 ***
## price_c     -1.7618260  0.3831404  -4.598 8.28e-06 ***
## price_p      0.4450893  0.2248533   1.979  0.04938 *  
## disacv_c     0.0302460  0.0016548  18.277  < 2e-16 ***
## bonusacv     0.0073333  0.0007814   9.385  < 2e-16 ***
## holiday      0.5240258  0.0607872   8.621 4.48e-15 ***
## tvgrp_c      0.0005380  0.0001037   5.190 5.95e-07 ***
## tvgrp_u      0.0003943  0.0001109   3.555  0.00049 ***
## itemstor     0.3249773  0.0398350   8.158 7.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09721 on 170 degrees of freedom
## Multiple R-squared:  0.9102, Adjusted R-squared:  0.9059 
## F-statistic: 215.3 on 8 and 170 DF,  p-value: < 2.2e-16
vif(mod7)
##  price_c  price_p disacv_c bonusacv  holiday  tvgrp_c  tvgrp_u itemstor 
## 1.391747 1.342341 1.939906 5.750296 1.153383 1.361305 1.549972 4.324856

This improves the R-Square, but also bumps up the VIF for bonusacv - though not to a concerning level.

At this point, we suspected an interaction between bonusacv and disacv_c, because these terms might be related. For example, when a bonus product is included for Brand C, it may be offered as part of a discount promotion. So let’s add this interaction term and see how we do.

Model 8

mod8 adds an interaction term between disacv_c and bonusacv.

mod8<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+itemstor+disacv_c*bonusacv)
summary(mod8)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday + tvgrp_c + tvgrp_u + itemstor + disacv_c * bonusacv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19167 -0.04995 -0.00688  0.03712  0.56494 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.301e+01  4.343e-01  29.966  < 2e-16 ***
## price_c           -1.850e+00  3.582e-01  -5.165 6.70e-07 ***
## price_p            2.527e-01  2.133e-01   1.185    0.238    
## disacv_c           1.887e-02  2.714e-03   6.952 7.51e-11 ***
## bonusacv           4.149e-03  9.606e-04   4.319 2.66e-05 ***
## holiday            4.922e-01  5.710e-02   8.620 4.65e-15 ***
## tvgrp_c            4.882e-04  9.729e-05   5.017 1.32e-06 ***
## tvgrp_u            4.541e-04  1.042e-04   4.357 2.28e-05 ***
## itemstor           3.402e-01  3.731e-02   9.116  < 2e-16 ***
## disacv_c:bonusacv  2.662e-04  5.222e-05   5.097 9.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09077 on 169 degrees of freedom
## Multiple R-squared:  0.9221, Adjusted R-squared:  0.918 
## F-statistic: 222.4 on 9 and 169 DF,  p-value: < 2.2e-16
vif(mod8)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##           price_c           price_p          disacv_c          bonusacv 
##          1.395007          1.385701          5.986671          9.966421 
##           holiday           tvgrp_c           tvgrp_u          itemstor 
##          1.167341          1.375201          1.569871          4.352653 
## disacv_c:bonusacv 
##         15.004449

This increases the VIF to an undesirable level for several variables, so we will remove the interaction term.
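As the warning printed above suggests, car can instead compute predictor-level GVIFs that account for the interaction structure; a sketch of the suggested call:

# Predictor-level (G)VIFs for a model containing an interaction
vif(mod8, type = "predictor")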

Part 4: Identify best model to use for prediction

Our best model is mod7

summary(mod7)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday + tvgrp_c + tvgrp_u + itemstor)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20703 -0.05860 -0.00928  0.04545  0.54431 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.8073911  0.4630615  27.658  < 2e-16 ***
## price_c     -1.7618260  0.3831404  -4.598 8.28e-06 ***
## price_p      0.4450893  0.2248533   1.979  0.04938 *  
## disacv_c     0.0302460  0.0016548  18.277  < 2e-16 ***
## bonusacv     0.0073333  0.0007814   9.385  < 2e-16 ***
## holiday      0.5240258  0.0607872   8.621 4.48e-15 ***
## tvgrp_c      0.0005380  0.0001037   5.190 5.95e-07 ***
## tvgrp_u      0.0003943  0.0001109   3.555  0.00049 ***
## itemstor     0.3249773  0.0398350   8.158 7.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09721 on 170 degrees of freedom
## Multiple R-squared:  0.9102, Adjusted R-squared:  0.9059 
## F-statistic: 215.3 on 8 and 170 DF,  p-value: < 2.2e-16

The adjusted R-Square of our final model is 0.91. All p-values were less than 0.001 except for price_p, which was 0.049. Although this is close to the 0.05 cutoff, we decided to leave it in, since we felt that the price of a competing brand might be an important predictor of Brand C’s performance. This indicates that each variable has predictive value in our model.
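As a cross-check on this choice, the candidate models can also be compared on AIC; a sketch (mod0 is excluded because its response is untransformed, so its AIC is not comparable with the log-response models):

# Lower AIC indicates a better fit/complexity trade-off
AIC(mod1, mod2, mod3, mod4, mod5, mod6, mod7, mod8)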

Looking at mod7, price discounting is the most effective marketing activity for Brand C in terms of generating sales. The independent variable ‘disacv_c’ has a coefficient estimate of around 0.0302. Because the response is log-transformed, holding all other variables constant, a one-unit increase in disacv_c multiplies expected sales by exp(0.0302460) ≈ 1.031, i.e., roughly a 3.1% increase in sales for Brand C.
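Because the response is log-transformed, exponentiating the coefficients gives these multiplicative effects directly; a quick sketch:

# Multiplicative effect on sales of a one-unit increase in each predictor
round(exp(coef(mod7)), 4)  # e.g., disacv_c ~1.031, i.e., about a 3.1% lift per unit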

Bonus packs come second among Brand C’s marketing activities, with a log-scale coefficient estimate of 0.0073333 (about a 0.7% sales lift per %ACV point). TV advertising comes third, with a log-scale coefficient estimate of 0.0005380 per GRP. Couponing (FSI) activity for Brand C and its competitors was found to have no significant association with sales, hence it was excluded from the model used to predict sales.

Holiday is also an important factor in predicting sales. The holiday variable is binary (1 or 0), with non-holiday weeks (holiday = 0) as the reference group; its log-scale coefficient estimate is 0.5240258. We can interpret this as: holding all other variables constant, expected sales in the week-52 holiday period are exp(0.5240258) ≈ 1.69 times non-holiday sales, i.e., about 69% higher.

Regarding the effects of competitive activity on Brand C’s sales, we did not find any association between the price of Brand E and the sales of Brand C in our earlier models, hence it was not included in our final model, mod7. For Brand P’s pricing, we do observe a significant association: as the price of Brand P increases, sales of Brand C increase. This is anticipated, because both brands sell products in the same category, so Brand C could be perceived as a purchase alternative when Brand P’s price rises.

The variable ‘tvgrp_u’ measures TV GRPs for Brand U’s commercials. The fact that Brands C and U are owned by the same company suggests that customers who viewed Brand U’s commercials could be introduced to Brand C in their customer journey, which helps explain why ‘tvgrp_u’ is positively associated with our sales.

Justifying choices of independent variables in mod7

Price seemed like a logical starting point for our regression analysis because sales are typically closely related to price. We started by regressing all brand prices onto a log transformation of eq_volum. This resulted in a low R-Square value (0.03). We then decided to add the variables associated with discounting (disacv_c and bonusacv) because we assumed (informed by the correlation analysis) that there would be a positive relationship between discounting and unit sales. At this point, we also removed price_e, because it had minimal correlation with eq_volum and a high p-value in the model, clearly making it a poor predictor of eq_volum. We also added a “dummy” variable to account for the seasonality occurring at week 52 each year.

The effect of these three changes resulted in the R-Square jumping to 0.84 - a significant increase. We then decided to evaluate the effect of tvgrp_c and tvgrp_u, given that we saw a strong correlation between tvgrp_c and eq_volum. Adding these effects produced an R-Square of 0.87.

Itemstor, as a measure of a brand’s presence within stores, could be tied to sales: when consumers see more items from the same brand on offer, they may be more likely to purchase from that brand. The addition of itemstor brought the R-Square up to 0.91. At this point we evaluated the VIFs to check for multicollinearity; most scores were below 2, with only bonusacv (5.75) and itemstor (4.32) higher, which we judged acceptable.

We then considered adding trustad and walmart, but decided against including them because neither was significant and both reduced the adjusted R-Square. We also considered the coupon variables (fsi_holi, fsi_non, fsi_comp), but none of these improved the model either. This was surprising, since one would assume that coupon drops would result in increased unit sales.

We also considered adding an interaction term between bonusacv and disacv_c, because we felt that these two variables might be related. Although the addition of this interaction term resulted in an increased R-Square (0.92), it also resulted in a VIF of about 6 for disacv_c, 10 for bonusacv, and 15 for the interaction term itself. For this reason, we felt it better to remove the interaction term from the model.

The final model accounts for seasonality through the “holiday” dummy variable for week 52. It also has fairly low VIFs among the independent variables, with only two exhibiting a score higher than 2 (bonusacv = 5.75, itemstor = 4.32).

Now we will look at model fit of actual vs. predicted sales.

logpredicted7 <- predict(mod7)
predicted7 <- exp(logpredicted7)
ggplot(Datset, aes(x=eq_volum, y=predicted7)) + 
  coord_cartesian(xlim =c(0, 15000000), ylim = c(0, 15000000)) +
  geom_point()+
  geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Datset, aes(observation)) + 
  geom_line(aes(y = eq_volum, colour = "eq_volum"),stat = "summary", fun = "mean") + 
  geom_line(aes(y = predicted7, colour = "predicted7"),stat = "summary", fun = "mean")

Our model did a very good job of predicting equivalent sales volume over time based on the key predictors we identified. It did struggle slightly to predict the annual sales spikes accurately, over-predicting the first and under-predicting the second and third. Some of this may stem from the fact that the sales spikes actually last for about two weeks, while the holiday dummy flags only week 52; a possible refinement is sketched below.
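A possible refinement (not estimated here) would be a dummy that flags the full spike window rather than week 52 only; the outlier listing earlier showed spikes spanning roughly weeks 50-53 and week 1. A sketch, with holiday_wide as a hypothetical variable name:

# Hypothetical wider holiday dummy covering the whole observed spike window
Datset$holiday_wide <- ifelse(weeknumber %in% c(50, 51, 52, 53, 1), 1, 0)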

Assumption check on mod7: Linearity

logeq_volum <- log(eq_volum)
linear_price_c <- ggplot(Datset, aes(x=price_c, y=eq_volum)) + 
  coord_cartesian(xlim =c(0.83, .97), ylim = c(0, 15000000)) +
  geom_point()+
  geom_smooth(method=lm)
linear_price_p <- ggplot(Datset, aes(x=price_p, y=eq_volum)) + 
  coord_cartesian(xlim =c(0.55, .8), ylim = c(0, 15000000)) +
  geom_point()+
  geom_smooth(method=lm)
linear_disacv_c <- ggplot(Datset, aes(x=disacv_c, y=eq_volum)) + 
  coord_cartesian(xlim =c(5, 34), ylim = c(0, 15000000)) +
  geom_point()+
  geom_smooth(method=lm)
linear_bonusacv <- ggplot(Datset, aes(x=bonusacv, y=eq_volum)) + 
  coord_cartesian(xlim =c(0, 70), ylim = c(0, 15000000)) +
  geom_point()+
  geom_smooth(method=lm)
linear_tvgrp_c <- ggplot(Datset, aes(x=tvgrp_c, y=eq_volum)) + 
  coord_cartesian(xlim =c(0, 350), ylim = c(0, 15000000)) +
  geom_point()+
  geom_smooth(method=lm)
linear_tvgrp_u <- ggplot(Datset, aes(x=tvgrp_u, y=eq_volum)) + 
  coord_cartesian(xlim =c(0, 400), ylim = c(0, 15000000)) +
  geom_point()+
  geom_smooth(method=lm)
linear_itemstor <- ggplot(Datset, aes(x=itemstor, y=eq_volum)) + 
  coord_cartesian(xlim =c(8, 10), ylim = c(0, 15000000)) +
  geom_point()+
  geom_smooth(method=lm)

grid.arrange(linear_price_c, linear_price_p, linear_disacv_c, linear_bonusacv, linear_tvgrp_c, linear_tvgrp_u, linear_itemstor, nrow = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

As shown above, each of the predictor variables shows only a weak linear relationship with the untransformed response variable.

Let’s check the linearity again after the log transformation used in mod7.

loglinear_price_c <- ggplot(Datset, aes(x=price_c, y=logeq_volum)) + 
  coord_cartesian(xlim =c(0.83, .97), ylim = c(12, 20)) +
  geom_point()+
  geom_smooth(method=lm)
loglinear_price_p<- ggplot(Datset, aes(x=price_p, y=logeq_volum)) + 
  coord_cartesian(xlim =c(0.55, .8), ylim = c(12, 20)) +
  geom_point()+
  geom_smooth(method=lm)
loglinear_disacv_c<- ggplot(Datset, aes(x=disacv_c, y=logeq_volum)) + 
  coord_cartesian(xlim =c(5, 34), ylim = c(12, 20)) +
  geom_point()+
  geom_smooth(method=lm)
loglinear_bonusacv<- ggplot(Datset, aes(x=bonusacv, y=logeq_volum)) + 
  coord_cartesian(xlim =c(0, 70), ylim = c(12, 20)) +
  geom_point()+
  geom_smooth(method=lm)
loglinear_tvgrp_c<- ggplot(Datset, aes(x=tvgrp_c, y=logeq_volum)) + 
  coord_cartesian(xlim =c(0, 350), ylim = c(12, 20)) +
  geom_point()+
  geom_smooth(method=lm)
loglinear_tvgrp_u<- ggplot(Datset, aes(x=tvgrp_u, y=logeq_volum)) + 
  coord_cartesian(xlim =c(0, 400), ylim = c(12, 20)) +
  geom_point()+
  geom_smooth(method=lm)
loglinear_itemstor<- ggplot(Datset, aes(x=itemstor, y=logeq_volum)) + 
  coord_cartesian(xlim =c(8, 10), ylim = c(12, 20)) +
  geom_point()+
  geom_smooth(method=lm)

grid.arrange(loglinear_price_c, loglinear_price_p, loglinear_disacv_c, loglinear_bonusacv, loglinear_tvgrp_c, loglinear_tvgrp_u, loglinear_itemstor, nrow = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

After the log transformation, the relationships are much more linear. This is a good indication that we have built a strong model.
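As a complementary linearity diagnostic, the car package (already loaded) can draw component-plus-residual plots for every predictor in mod7 in one call; a sketch:

# Component + residual (partial residual) plots for each predictor
crPlots(mod7)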

Assumption check on mod7: Do the residuals have equal variance?

residuals7 <- residuals(mod7)
head(eq_volum)
## [1] 3240042 2885233 2877506 3107180 2954494 2913908
head(predicted7)
##       1       2       3       4       5       6 
## 3217779 3078200 2935131 2967494 2832212 2777099
head (residuals7)
##            1            2            3            4            5            6 
##  0.006894928 -0.064739202 -0.019828075  0.045997742  0.042269347  0.048088243
# plot residuals
plot(mod7,which=1)

As shown above, the model displays a heteroscedastic pattern: the residuals are not evenly distributed. Given that the largest residuals occur at observations coinciding with the sales spikes around week 52, those spikes are likely the cause of the residual outliers. We attempted to combat this effect with the dummy “holiday” variable, but this approach fell short of producing homoscedastic errors.

bptest(mod7)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod7
## BP = 38.614, df = 8, p-value = 5.795e-06

The Breusch-Pagan test is significant (p < 0.05), so there is heteroscedasticity in our mod7 model.
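Given this heteroscedasticity, one common remedy is to re-test the coefficients with heteroscedasticity-robust standard errors as a sensitivity check. A sketch, assuming the sandwich package is available (it is not loaded above):

# install.packages("sandwich")  # if not already installed
library(sandwich)
# Re-test mod7's coefficients using HC1 (White) robust standard errors
coeftest(mod7, vcov = vcovHC(mod7, type = "HC1"))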

Assumption check on mod7: Are the residuals normally distributed?

resi_hist <- ggplot(Datset, aes(x=residuals7)) + geom_histogram()
resi_dens <- ggplot(Datset,aes(x=residuals7)) +
  geom_density()

grid.arrange(resi_hist, resi_dens, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Q-Q plot
plot(mod7, which = 2)

As shown above, the residuals are fairly normally distributed, with the exception of a couple of outliers. This is further evidence of a strong regression model. The brief departure from residual normality is attributed to the “holiday” spike in sales described previously.
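A formal test can supplement the visual checks; a quick sketch using base R’s Shapiro-Wilk test:

# Shapiro-Wilk test: a significant p-value indicates non-normal residuals
shapiro.test(residuals7)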

We will now visualize in-sample and out-of-sample fit

dim(Datset)
## [1] 179  21
length(eq_volum)
## [1] 179
length(eq_volum)*.75
## [1] 134.25
insample=1:(length(eq_volum)*.75)
outsample=c(1:length(eq_volum))[-insample]
tail(insample)
## [1] 129 130 131 132 133 134
head(outsample)
## [1] 135 136 137 138 139 140
# Create Training and Test data

trainingData <- Datset[insample, ]  # model training data
testData  <- Datset[outsample, ]   # test data

Build the model on training data

lmtrain <- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+itemstor, data=trainingData)  # build the model
Logtrainpred<-predict(lmtrain,trainingData)
trainpred=exp(Logtrainpred)
Logtestpred <- predict(lmtrain, testData)  # predict on the test data
testpred=exp(Logtestpred)

We can look at actual vs. predicted in the following ways.

First, in-sample

ggplot(trainingData, aes(observation)) + 
  geom_line(aes(y = eq_volum, colour = "eq_volum"),stat = "summary", fun = "mean") + 
  geom_line(aes(y = trainpred, colour = "trainpred"),stat = "summary", fun = "mean") 

Second, out of sample

ggplot(testData, aes(observation)) + 
  geom_line(aes(y = eq_volum, colour = "eq_volum"),stat = "summary", fun = "mean") + 
  geom_line(aes(y = testpred, colour = "testpred"),stat = "summary", fun = "mean") 

Part 5: What-if analysis. Managing Brand C’s marketing mix using mod7

summary(mod7)
## 
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv + 
##     holiday + tvgrp_c + tvgrp_u + itemstor)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20703 -0.05860 -0.00928  0.04545  0.54431 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.8073911  0.4630615  27.658  < 2e-16 ***
## price_c     -1.7618260  0.3831404  -4.598 8.28e-06 ***
## price_p      0.4450893  0.2248533   1.979  0.04938 *  
## disacv_c     0.0302460  0.0016548  18.277  < 2e-16 ***
## bonusacv     0.0073333  0.0007814   9.385  < 2e-16 ***
## holiday      0.5240258  0.0607872   8.621 4.48e-15 ***
## tvgrp_c      0.0005380  0.0001037   5.190 5.95e-07 ***
## tvgrp_u      0.0003943  0.0001109   3.555  0.00049 ***
## itemstor     0.3249773  0.0398350   8.158 7.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09721 on 170 degrees of freedom
## Multiple R-squared:  0.9102, Adjusted R-squared:  0.9059 
## F-statistic: 215.3 on 8 and 170 DF,  p-value: < 2.2e-16

Assume Brand C wants to predict its sales under different marketing activities, but first wants to know its baseline sales in the absence of any marketing and competitive activity. The intercept of mod7 represents this baseline level of (log) sales when all predictors equal zero: Brand C can expect to sell about exp(12.81) ≈ 365,858 units. (Note that all-zero predictors, e.g., a price of zero, lie far outside the observed data, so this baseline is a modeling construct rather than a realistic scenario.) To predict sales during the holiday week, set holiday = 1 in the equation; for non-holiday weeks, set holiday = 0.

To understand the effect of pricing on Brand C’s sales, we can create scenarios: for example, how do sales vary when price_c = 1.2 and price_p = 0.8 versus when price_c = 1.15 and price_p = 0.75? Using the model, we can estimate how Brand C’s sales would change if competitor Brand P lowered its price by 0.05 and Brand C was considering lowering its price by 0.05 as well; these scenarios can be fed to predict() directly, as sketched below.
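A sketch of those scenarios using predict(), holding the remaining predictors at their sample means (an illustrative assumption; note also that price_c values above the observed maximum of 0.96 are extrapolations):

# Two hypothetical pricing scenarios; other predictors held at sample means
base <- data.frame(disacv_c = mean(disacv_c), bonusacv = mean(bonusacv),
                   holiday = 0, tvgrp_c = mean(tvgrp_c), tvgrp_u = mean(tvgrp_u),
                   itemstor = mean(itemstor))
exp(predict(mod7, newdata = cbind(base, price_c = 1.20, price_p = 0.80)))
exp(predict(mod7, newdata = cbind(base, price_c = 1.15, price_p = 0.75)))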

Let’s say Brand C wants to test the effect of increasing its TV advertising from an average of 500 GRPs to an average of 800 GRPs. We can feed the values into the model and calculate the sales increase (note that both values exceed the observed maximum of 346 GRPs, so this is an extrapolation):

exp(800*0.000538+12.8073911) - exp(500*0.000538+12.8073911) = 83642.48

This suggests that, with all other predictors held at zero (the baseline used in this calculation), increasing Brand C’s TV GRPs from 500 to 800 leads to a predicted increase of about 83,642 unit sales. Equivalently, sales rise by a factor of exp(300 * 0.000538) ≈ 1.175, about 17.5%, regardless of the baseline. Taking the cost of TV advertising into account, Brand C can use this predicted increase to evaluate the cost-effectiveness of a 300-GRP increase.
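The same calculation in R, pulling the coefficient from the fitted model (the intercept 12.8073911 stands in for the all-else-zero baseline used above):

b_tv <- coef(mod7)["tvgrp_c"]
exp(12.8073911 + 800 * b_tv) - exp(12.8073911 + 500 * b_tv)  # ~83,642 units
exp(300 * b_tv)  # multiplicative effect: ~1.175, i.e., +17.5% at any baseline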

What-if analysis:

The equation for the model can be written as:

log(sales) = 12.8073911 - 1.7618260*price_c + 0.4450893*price_p + 0.0302460*disacv_c + 0.0073333*bonusacv + 0.5240258*holiday + 0.0005380*tvgrp_c + 0.0003943*tvgrp_u + 0.3249773*itemstor

Brand C wants to know its predicted sales when price_c = 0.9, price_p = 0.8, disacv_c = 15, and holiday = 0 (non-holiday), with the remaining predictors (bonusacv, tvgrp_c, tvgrp_u, itemstor) each set to 1 for illustration. We will have the following regression formula:

log(sales) = 12.8073911 - (1.7618260*0.9) + (0.4450893*0.8) + (0.0302460*15) + 0.00733330 + (0.5240258*0) + 0.000538 + 0.0003943 + 0.3249773

Based on the formula above, we get predicted sales of exp(12.36475) ≈ 234,392 units. (Note that itemstor = 1 lies far below its observed range of 8.5 to 10, so this scenario is purely illustrative.)
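The same scenario can be verified with predict(); a sketch using the values stated above:

# Scenario from the what-if analysis above
scenario <- data.frame(price_c = 0.9, price_p = 0.8, disacv_c = 15, holiday = 0,
                       bonusacv = 1, tvgrp_c = 1, tvgrp_u = 1, itemstor = 1)
exp(predict(mod7, newdata = scenario))  # ~234,392 units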