This project was developed using data supplied by MMA, a consulting firm that specializes in marketing mix models. The purpose of this project is to understand how sales for Brand C relate to factors in the marketplace (own marketing efforts, others’ marketing efforts, environmental factors). This information could then be used to (i) assess the relative impact of different elements of the marketing mix and (ii) forecast volume.
This is an integrated dataset with 179 weeks (Feb 2000 – Jul 2003) of observations for a variety of marketing mix variables. Brand C is one of the big players in a fairly commoditized product category. Brands R, E, P and U are some of the other brands in the category. Brands C and U are owned by the same company. Some of the measures in the dataset should look familiar, while others may be new. The key dependent variable in the dataset is equivalent unit sales volume. You have selling price information for brands C, E, and P. The variable disacv_c not only measures how deep a discount was offered for brand C, but also how prevalent that discount was across sales outlets.
A unique measure of promotional activity included in the dataset is expressed in terms of the coupon value within an FSI (free-standing insert) drop and how big the coupon drop was (in terms of circulation). This measure is reported as two variables for brand C, contingent on holiday or non-holiday time periods. The dataset also has information about coupon drops for competitors R and E.
Information from Nielsen Media Research is incorporated into the dataset as TV GRP information for commercials featuring brands C and U. A gross rating point (GRP) is a variable used to measure the “impact” of television advertising. There is also an indicator variable for the thematic focus of the television advertising message.
The dataset also includes information about the prevalence of brand C’s bonus pack offering, a measure of line length per store expressed as rolling average of SKU’s per store, and (using panel data) percent share of brand C that is sold through Wal*Mart.
week: Week of observation
weeknumber: Week number
month: Month
year: Year
eq_volum_c: Equivalent unit sales volume for brand C (the dependent variable; appears as eq_volum in the dataset)
disacv_c: Brand C %ACV * % Discount (captures the depth of the price discount and how prevalent it was, i.e., a weighted average price discount)
bonusacv: %ACV for stores in which brand C bonus pack had sales
price_c: Brand C price per equivalent unit (non-promoted price)
price_e: Brand E price per equivalent unit (non-promoted price)
price_p: Private label price per equivalent unit (non-promoted price)
tvgrp_c: Brand C TV GRPs (GRPs are reach times frequency, i.e., the number of people viewing the commercial and how many times they see it)
tvgrp_u: Brand U TV GRPs (defined as for tvgrp_c)
trustad: Theme of Brand C TV advertising focused on the message “Trusted”; included to indicate times when this ad ran
fsi_holi: Brand C holiday FSIs (coupon value * circulation)
fsi_non: Brand C non-holiday FSIs (coupon value * circulation)
fsi_comp: Brand E or R FSIs (coupon value * circulation)
itemstor: Number of Brand C items sold per store (rolling 13-week average)
walmart: Wal*Mart share
# Uncomment to install packages for the first time
#install.packages("tidyverse")
#install.packages("vtable")
#install.packages("gmodels")
#install.packages("GGally")
#install.packages("performance")
#install.packages("car")
#install.packages("reshape2")
#install.packages("psych")
#install.packages("corrplot")
#install.packages("lmtest")
#install.packages("gridExtra")
library(tidyverse)
library(vtable)
library(gmodels)
library(GGally)
library(performance)
library(car)
library(reshape2)
library(corrplot)
library(psych)
library(lmtest)
library(gridExtra)
# Set the working directory to the location of the regression dataset
Datset = read.csv("MMM regression dataset.csv",header=TRUE)
attach(Datset)
# Generate basic summary statistics
sumtable(Datset)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| observation | 179 | 90 | 51.817 | 1 | 45.5 | 134.5 | 179 |
| weeknumber | 179 | 25.788 | 14.556 | 1 | 14 | 38 | 53 |
| year | 179 | 2001.374 | 1.038 | 2000 | 2000.5 | 2002 | 2003 |
| eq_volum | 179 | 4430421.475 | 1982087.846 | 2829145 | 3413788.5 | 4536892.5 | 15500000 |
| disacv_c | 179 | 15.141 | 6.133 | 5.42 | 10.99 | 17.325 | 33.29 |
| bonusacv | 179 | 33.067 | 22.361 | 0 | 18 | 54 | 69 |
| price_c | 179 | 0.928 | 0.022 | 0.84 | 0.92 | 0.94 | 0.96 |
| price_e | 179 | 1.02 | 0.054 | 0.7 | 0.99 | 1.06 | 1.12 |
| price_p | 179 | 0.712 | 0.038 | 0.54 | 0.69 | 0.74 | 0.79 |
| tvgrp_c | 179 | 48.144 | 82.006 | 0 | 0 | 84.25 | 346 |
| tvgrp_u | 179 | 36.62 | 81.79 | 0 | 0 | 2.5 | 398 |
| trustad | 179 | 0.263 | 0.441 | 0 | 0 | 1 | 1 |
| fsi_holi | 179 | 1014.827 | 6180.888 | 0 | 0 | 0 | 41590 |
| fsi_non | 179 | 3216.587 | 10883.519 | 0 | 0 | 0 | 41676 |
| fsi_comp | 179 | 7239.799 | 17776.866 | 0 | 0 | 0 | 92896 |
| itemstor | 179 | 9.242 | 0.38 | 8.5 | 9 | 9.5 | 10 |
| walmart | 179 | 0.317 | 0.063 | 0.22 | 0.27 | 0.35 | 0.45 |
summary(Datset)
## observation weekDay weeknumber month
## Min. : 1.0 Length:179 Min. : 1.00 Length:179
## 1st Qu.: 45.5 Class :character 1st Qu.:14.00 Class :character
## Median : 90.0 Mode :character Median :25.00 Mode :character
## Mean : 90.0 Mean :25.79
## 3rd Qu.:134.5 3rd Qu.:38.00
## Max. :179.0 Max. :53.00
## year eq_volum disacv_c bonusacv
## Min. :2000 Min. : 2829145 Min. : 5.42 Min. : 0.00
## 1st Qu.:2000 1st Qu.: 3413788 1st Qu.:10.99 1st Qu.:18.00
## Median :2001 Median : 3879590 Median :14.17 Median :35.00
## Mean :2001 Mean : 4430421 Mean :15.14 Mean :33.07
## 3rd Qu.:2002 3rd Qu.: 4536892 3rd Qu.:17.32 3rd Qu.:54.00
## Max. :2003 Max. :15500000 Max. :33.29 Max. :69.00
## price_c price_e price_p tvgrp_c
## Min. :0.8400 Min. :0.70 Min. :0.5400 Min. : 0.00
## 1st Qu.:0.9200 1st Qu.:0.99 1st Qu.:0.6900 1st Qu.: 0.00
## Median :0.9300 Median :1.02 Median :0.7200 Median : 0.00
## Mean :0.9279 Mean :1.02 Mean :0.7117 Mean : 48.14
## 3rd Qu.:0.9400 3rd Qu.:1.06 3rd Qu.:0.7400 3rd Qu.: 84.25
## Max. :0.9600 Max. :1.12 Max. :0.7900 Max. :346.00
## tvgrp_u trustad fsi_holi fsi_non
## Min. : 0.00 Min. :0.0000 Min. : 0 Min. : 0
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 0 1st Qu.: 0
## Median : 0.00 Median :0.0000 Median : 0 Median : 0
## Mean : 36.62 Mean :0.2626 Mean : 1015 Mean : 3217
## 3rd Qu.: 2.50 3rd Qu.:1.0000 3rd Qu.: 0 3rd Qu.: 0
## Max. :398.00 Max. :1.0000 Max. :41590 Max. :41676
## fsi_comp itemstor walmart
## Min. : 0 Min. : 8.500 Min. :0.220
## 1st Qu.: 0 1st Qu.: 9.000 1st Qu.:0.270
## Median : 0 Median : 9.200 Median :0.300
## Mean : 7240 Mean : 9.242 Mean :0.317
## 3rd Qu.: 0 3rd Qu.: 9.500 3rd Qu.:0.350
## Max. :92896 Max. :10.000 Max. :0.450
# Generate summary statistics for our response variable
summary(eq_volum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2829145 3413788 3879590 4430421 4536892 15500000
Based on the summary statistics, we see that Brand E has the highest average price per equivalent unit whereas Brand P has the lowest. For our response variable, we observe a large gap between the mean and the maximum value: eq_volum averages around 4.4M with a minimum of 2.8M and a maximum of 15.5M. This indicates extreme outliers toward the upper bound of the eq_volum variable. For the coupon variables captured by the FSI measures, our competitors (E or R) show higher average promotional activity than Brand C in terms of coupon drops.
The histogram and boxplot for eq_volum (the response variable) show a right-skewed distribution, with a few extreme values at the larger end.
ggplot(Datset, aes(x=eq_volum)) + geom_histogram(bins = 50)
ggplot(Datset, aes(x=eq_volum)) + geom_boxplot()
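To put a number on this skew, we can compute the sample skewness; a quick sketch using the skew() function from the psych package loaded above:
# Sample skewness of the response; values well above 0 confirm a long right tail
psych::skew(Datset$eq_volum)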
To mitigate the effect of these extreme outliers, we will perform a log transformation on our response variable.
# Create a new column to store the log-transformed sales volume
Datset$log_eq_volum <- log(Datset$eq_volum)
# Visualize distribution again
ggplot(Datset, aes(x=log_eq_volum)) + geom_histogram(bins = 50)
Slightly better, but still not normally distributed. We will consider removing the outliers for now.
We will create a new dataset without outliers to see whether it ultimately yields a better model.
# Create dataset with outliers removed
summary(eq_volum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2829145 3413788 3879590 4430421 4536892 15500000
dim(Datset[eq_volum < 8000000,]) # Removing outliers with eq_volum >= 8000000 leaves 169 observations, 10 fewer than the original data
## [1] 169 20
Datset[eq_volum >= 8000000,] # We see that the outliers all fall during the last few weeks of a year or in the first week of a year
## observation weekDay weeknumber month year eq_volum disacv_c bonusacv
## 44 44 12/23/2000 52 December 2000 11300000 25.62 22
## 45 45 12/30/2000 53 December 2000 11100000 21.17 24
## 95 95 12/15/2001 50 December 2001 9705819 33.29 60
## 96 96 12/22/2001 51 December 2001 12200000 31.38 63
## 97 97 12/29/2001 52 December 2001 13700000 26.06 65
## 98 98 1/5/2002 1 January 2002 8562433 27.69 63
## 147 147 12/14/2002 50 December 2002 9774228 30.03 66
## 148 148 12/21/2002 51 December 2002 13100000 33.23 67
## 149 149 12/28/2002 52 December 2002 15500000 30.03 69
## 150 150 1/4/2003 1 January 2003 8360552 26.34 67
## price_c price_e price_p tvgrp_c tvgrp_u trustad fsi_holi fsi_non fsi_comp
## 44 0.90 0.98 0.72 0.0 337 0 19000 0 0
## 45 0.91 1.05 0.71 0.0 398 0 0 0 0
## 95 0.95 1.00 0.70 142.0 0 0 41590 0 0
## 96 0.94 0.92 0.70 190.0 0 0 0 0 65935
## 97 0.93 1.01 0.73 240.0 0 0 0 0 0
## 98 0.94 1.04 0.73 161.7 0 0 0 0 0
## 147 0.95 1.02 0.74 165.8 0 1 41474 0 58564
## 148 0.93 0.95 0.74 199.9 0 1 0 0 0
## 149 0.94 1.02 0.74 305.0 0 1 0 0 0
## 150 0.95 1.00 0.75 0.0 0 1 0 0 0
## itemstor walmart log_eq_volum
## 44 9.9 0.26 16.24031
## 45 9.8 0.23 16.22246
## 95 9.3 0.33 16.08824
## 96 9.2 0.33 16.31695
## 97 9.1 0.30 16.43291
## 98 9.1 0.30 15.96289
## 147 9.1 0.40 16.09526
## 148 9.1 0.40 16.38812
## 149 9.0 0.39 16.55635
## 150 9.0 0.40 15.93904
outlier_removed <- Datset[eq_volum < 8000000,]
ggplot(outlier_removed, aes(x=log(eq_volum))) + geom_histogram(bins = 50)
The distribution is closer to normal with the outliers removed.
Let’s now look at how price varies by week and by year. We will use line charts to visualize how Brand C’s prices vary across the weeks of each year compared to Brands E and P. Because we have very limited observations (179), we will not remove the outliers moving forward.
We will attach the dataset again after adding the new log-transformed response variable.
attach(Datset)
## The following objects are masked from Datset (pos = 3):
##
## bonusacv, disacv_c, eq_volum, fsi_comp, fsi_holi, fsi_non,
## itemstor, month, observation, price_c, price_e, price_p, trustad,
## tvgrp_c, tvgrp_u, walmart, weekDay, weeknumber, year
dim(Datset)
## [1] 179 20
# Mean price for Brand C, E, P grouped by week and by year
C_price_week_year <- aggregate(price_c, list(weeknumber, year), mean)
E_price_week_year <- aggregate(price_e, list(weeknumber, year), mean)
P_price_week_year <- aggregate(price_p, list(weeknumber, year), mean)
# Rename the column headers for the three brands
C_price_week_year <- C_price_week_year %>%
rename(
C_price = x,
week_number = Group.1,
year = Group.2
)
E_price_week_year <- E_price_week_year %>%
rename(
E_price = x,
week_number = Group.1,
year = Group.2
)
P_price_week_year <- P_price_week_year %>%
rename(
P_price = x,
week_number = Group.1,
year = Group.2
)
# Generate a line plot to visualize Brand C, E and P prices across week by year
lineplot_C_price_week_year <- C_price_week_year %>%
arrange(year, week_number) %>%
ggplot() +
geom_line(aes(week_number, C_price, color = factor(year), group = year)) +
scale_color_brewer(palette = "Set1") + # Choose a color palette for the year variable
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 5))
lineplot_E_price_week_year <- E_price_week_year %>%
arrange(year, week_number) %>%
ggplot() +
geom_line(aes(week_number, E_price, color = factor(year), group = year)) +
scale_color_brewer(palette = "Set1") + # Choose a color palette for the year variable
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 5))
lineplot_P_price_week_year <- P_price_week_year %>%
arrange(year, week_number) %>%
ggplot() +
geom_line(aes(week_number, P_price, color = factor(year), group = year)) +
scale_color_brewer(palette = "Set1") + # Choose a color palette for the year variable
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 5))
# Compare price across week by year among Brands C, E and P
grid.arrange(lineplot_C_price_week_year, lineplot_E_price_week_year, lineplot_P_price_week_year, nrow = 2)
Looking at price variation across Brands C, E and P, we do not observe any clear similarity in pricing strategy. Prices for Brand C remain relatively flat in most periods compared to those of Brands E and P, so Brand C’s pricing strategy is considerably more stable. In 2003, however, Brand C’s price dropped significantly around week 20, while Brand E’s pricing that year was relatively more stable than Brand C’s and P’s. The cause of the drop for Brand C could be examined with more business context.
We will use x-y plots to show how sales vary with price.
corr_C <- ggplot(Datset, aes(x=price_c, y=eq_volum)) +
geom_point()+
geom_smooth(method=lm)
corr_E <- ggplot(Datset, aes(x=price_e, y=eq_volum)) +
geom_point()+
geom_smooth(method=lm)
corr_P <- ggplot(Datset, aes(x=price_p, y=eq_volum)) +
geom_point()+
geom_smooth(method=lm)
grid.arrange(corr_C, corr_E, corr_P, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
No strong correlation is observed between eq_volum and any brand’s pricing. The price of Brand C has a slight positive correlation with sales, which is odd since one would assume that a higher price leads to reduced sales. This may indicate that Brand C is relatively price inelastic, though it could also reflect confounding, for example prices and volume both being high in holiday weeks.
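As a rough check on the confounding idea, we can recompute the correlation excluding the spike weeks identified in the outlier table above (weeks 50–53 and week 1); a quick sketch:
# Recompute the price/volume correlation excluding the holiday-spike weeks,
# which combine high prices with high volume and may confound the relationship
non_spike <- !(weeknumber %in% c(50, 51, 52, 53, 1))
cor(price_c[non_spike], eq_volum[non_spike])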
We will generate a correlation matrix to see the associations among the variables.
data.new<-Datset %>% select (eq_volum,price_c,price_e,price_p,tvgrp_c,tvgrp_u,trustad,fsi_holi,fsi_non,fsi_comp,disacv_c,bonusacv,itemstor,walmart)
corr_matrix <- cor(data.new)
corrplot(corr_matrix, method = "color", type = "upper", tl.col = "black",
tl.srt = 45, addCoef.col = "black", col = colorRampPalette(c("red", "white", "darkgreen"))(100), tl.cex = 0.8, number.cex = 0.7)
According to our x-y plots, changes in the pricing of Brand E seem to have very little effect on Brand C sales, while Brand C sales seem to increase somewhat when the price of Brand P increases. However, as shown by the correlation matrix, each of these associations is fairly weak. Additionally, as shown below, the p-value for the correlation between price_e and eq_volum is fairly high (p = 0.55), but a bit lower for eq_volum/price_c (p = 0.093) and eq_volum/price_p (p = 0.035).
We will check the significance of the correlations that show a decent association between two variables.
# Check for significance of correlations
cor(price_c,eq_volum)
## [1] 0.1260649
cor.test(price_c,eq_volum)$p.value #Not Significant
## [1] 0.09265963
cor(price_p,eq_volum)
## [1] 0.1575022
cor.test(price_p,eq_volum)$p.value #Significant
## [1] 0.03523624
cor(price_e,eq_volum)
## [1] -0.04473694
cor.test(price_e,eq_volum)$p.value #Not Significant
## [1] 0.5520814
cor(disacv_c,eq_volum)
## [1] 0.7854491
cor.test(disacv_c,eq_volum)$p.value #Significant
## [1] 1.001884e-38
cor(bonusacv,eq_volum)
## [1] 0.3918788
cor.test(bonusacv,eq_volum)$p.value #Significant
## [1] 5.801737e-08
cor(tvgrp_c,eq_volum)
## [1] 0.4773334
cor.test(tvgrp_c,eq_volum)$p.value #Significant
## [1] 1.420318e-11
cor(fsi_holi,eq_volum)
## [1] 0.3590815
cor.test(fsi_holi,eq_volum)$p.value #Significant
## [1] 7.967732e-07
cor(bonusacv,disacv_c)
## [1] 0.4911011
cor.test(bonusacv,disacv_c)$p.value #Significant
## [1] 2.958754e-12
The correlation matrix shows no strong correlation between Brand C’s sales and the price of any brand. The variable disacv_c shows the strongest correlation with our response variable. This is expected because price discounts generally have a strong positive relationship with sales. We also see a decent positive correlation between tvgrp_c and sales, suggesting that high-impact television campaigns translate into increased sales.
We will now graph x-y plots of the variables with a decent or high association with sales.
highcorr_disacv_c <- ggplot(Datset, aes(x=eq_volum, y=disacv_c)) +
geom_point()+
geom_smooth(method=lm)
highcorr_tvgrp_c <- ggplot(Datset, aes(x=eq_volum, y=tvgrp_c)) +
geom_point()+
geom_smooth(method=lm)
highcorr_bonusacv <- ggplot(Datset, aes(x=eq_volum, y=bonusacv)) +
geom_point()+
geom_smooth(method=lm)
highcorr_fsi_holi <- ggplot(Datset, aes(x=eq_volum, y=fsi_holi)) +
geom_point()+
geom_smooth(method=lm)
grid.arrange(highcorr_disacv_c, highcorr_tvgrp_c, highcorr_bonusacv, highcorr_fsi_holi, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
These associations will be important to keep in mind as we build and modify our regression model.
An additional plot shows the pair of independent variables whose correlation with one another is high and significant.
# x-y plot of independent variables with a high or decent association with one another
# (Not including response variable)
ggplot(Datset, aes(x=bonusacv, y=disacv_c)) +
geom_point()+
geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
mod0 includes all price variables regressed onto eq_volum.
mod0<- lm(eq_volum ~ price_c+price_p+price_e)
summary(mod0)
##
## Call:
## lm(formula = eq_volum ~ price_c + price_p + price_e)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1941683 -1105038 -534027 336833 10743891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7839553 6534569 -1.200 0.2319
## price_c 10404559 6846057 1.520 0.1304
## price_p 7068768 3989865 1.772 0.0782 .
## price_e -2368148 2817822 -0.840 0.4018
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1960000 on 175 degrees of freedom
## Multiple R-squared: 0.03881, Adjusted R-squared: 0.02233
## F-statistic: 2.355 on 3 and 175 DF, p-value: 0.07362
The adjusted R-Square is low, but the relationship is not linear, so a log transformation may improve it.
mod1 includes all price variables regressed onto the log transformation of eq_volum.
mod1<- lm(log(eq_volum) ~ price_c+price_p+price_e)
summary(mod1)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + price_e)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43123 -0.20394 -0.07393 0.12475 1.25538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.7924 1.0406 12.294 <2e-16 ***
## price_c 1.9227 1.0902 1.764 0.0795 .
## price_p 1.2484 0.6354 1.965 0.0510 .
## price_e -0.2182 0.4487 -0.486 0.6274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3121 on 175 degrees of freedom
## Multiple R-squared: 0.04694, Adjusted R-squared: 0.0306
## F-statistic: 2.873 on 3 and 175 DF, p-value: 0.03779
The adjusted R-Square is better, but still quite low.
There is a high degree of seasonality in sales, with the largest eq_volum spikes occurring around the 52nd week of the year, as shown in the following.
ggplot(Datset, aes(weeknumber)) +
geom_line(aes(y = eq_volum, colour = "eq_volum"),stat = "summary", fun = "mean") +
scale_x_continuous(breaks = seq(min(Datset$weeknumber), max(Datset$weeknumber), by = 5))
Let’s create a dummy variable called holiday to account for the spike in sales around the last week of the year.
# Create holiday dummy
Datset$holiday <- ifelse(weeknumber == 52, 1, 0)
attach(Datset)
## The following objects are masked from Datset (pos = 3):
##
## bonusacv, disacv_c, eq_volum, fsi_comp, fsi_holi, fsi_non,
## itemstor, log_eq_volum, month, observation, price_c, price_e,
## price_p, trustad, tvgrp_c, tvgrp_u, walmart, weekDay, weeknumber,
## year
## The following objects are masked from Datset (pos = 4):
##
## bonusacv, disacv_c, eq_volum, fsi_comp, fsi_holi, fsi_non,
## itemstor, month, observation, price_c, price_e, price_p, trustad,
## tvgrp_c, tvgrp_u, walmart, weekDay, weeknumber, year
mod2 again regresses onto log(eq_volum), adding the dummy variable “holiday” to account for seasonality and the discounting variables (disacv_c and bonusacv), while removing price_e, which seems to have little relationship with eq_volum.
mod2<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday)
summary(mod2)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40887 -0.07047 -0.01691 0.05491 0.73982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.306775 0.439319 34.842 < 2e-16 ***
## price_c -1.706962 0.496192 -3.440 0.000729 ***
## price_p 1.201763 0.268815 4.471 1.41e-05 ***
## disacv_c 0.039454 0.001840 21.439 < 2e-16 ***
## bonusacv 0.001697 0.000554 3.063 0.002545 **
## holiday 0.639085 0.076517 8.352 2.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1259 on 173 degrees of freedom
## Multiple R-squared: 0.8466, Adjusted R-squared: 0.8422
## F-statistic: 191 on 5 and 173 DF, p-value: < 2.2e-16
The adjusted R-Square for mod2 is strong (0.8422). What about TV activities?
mod3 adds the tvgrp variables to see whether television activity helps predict sales.
mod3<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u)
summary(mod3)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday + tvgrp_c + tvgrp_u)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39366 -0.06617 -0.01270 0.05419 0.63552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3747730 0.3995277 38.482 < 2e-16 ***
## price_c -1.7470985 0.4506306 -3.877 0.000151 ***
## price_p 1.1407646 0.2447078 4.662 6.29e-06 ***
## disacv_c 0.0355696 0.0017886 19.887 < 2e-16 ***
## bonusacv 0.0023768 0.0005779 4.113 6.07e-05 ***
## holiday 0.5441300 0.0714369 7.617 1.69e-12 ***
## tvgrp_c 0.0007023 0.0001196 5.872 2.19e-08 ***
## tvgrp_u 0.0004574 0.0001301 3.515 0.000564 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1143 on 171 degrees of freedom
## Multiple R-squared: 0.875, Adjusted R-squared: 0.8699
## F-statistic: 171 on 7 and 171 DF, p-value: < 2.2e-16
The adjusted R-Square looks very good (0.8699), and the VIFs look acceptable.
vif(mod3)
## price_c price_p disacv_c bonusacv holiday tvgrp_c tvgrp_u
## 1.391716 1.149278 1.638242 2.273884 1.151487 1.309934 1.542436
mod4 and mod5 assess the predictive power of adding trustad and walmart, respectively.
mod4<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+trustad)
summary(mod4)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday + tvgrp_c + tvgrp_u + trustad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39349 -0.06615 -0.01300 0.05446 0.63598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3573630 0.4386302 35.012 < 2e-16 ***
## price_c -1.7321136 0.4773277 -3.629 0.000376 ***
## price_p 1.1460839 0.2514029 4.559 9.80e-06 ***
## disacv_c 0.0355627 0.0017952 19.810 < 2e-16 ***
## bonusacv 0.0023524 0.0006313 3.726 0.000264 ***
## holiday 0.5444877 0.0717384 7.590 2.02e-12 ***
## tvgrp_c 0.0007019 0.0001200 5.848 2.49e-08 ***
## tvgrp_u 0.0004580 0.0001307 3.505 0.000584 ***
## trustad 0.0023637 0.0242280 0.098 0.922394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1147 on 170 degrees of freedom
## Multiple R-squared: 0.875, Adjusted R-squared: 0.8691
## F-statistic: 148.8 on 8 and 170 DF, p-value: < 2.2e-16
vif(mod4)
## price_c price_p disacv_c bonusacv holiday tvgrp_c tvgrp_u trustad
## 1.552457 1.205999 1.640808 2.697226 1.154504 1.311456 1.546686 1.547260
trustad is not significant and adds no predictive power.
mod5<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+walmart)
summary(mod5)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday + tvgrp_c + tvgrp_u + walmart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39249 -0.06616 -0.01461 0.05533 0.63826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3179748 0.4768880 32.121 < 2e-16 ***
## price_c -1.7020223 0.4963309 -3.429 0.00076 ***
## price_p 1.1464215 0.2467402 4.646 6.75e-06 ***
## disacv_c 0.0356026 0.0017999 19.781 < 2e-16 ***
## bonusacv 0.0022901 0.0007013 3.266 0.00132 **
## holiday 0.5456007 0.0719489 7.583 2.10e-12 ***
## tvgrp_c 0.0007006 0.0001202 5.829 2.74e-08 ***
## tvgrp_u 0.0004593 0.0001308 3.512 0.00057 ***
## walmart 0.0419640 0.1911064 0.220 0.82646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1147 on 170 degrees of freedom
## Multiple R-squared: 0.875, Adjusted R-squared: 0.8692
## F-statistic: 148.8 on 8 and 170 DF, p-value: < 2.2e-16
vif(mod5)
## price_c price_p disacv_c bonusacv holiday tvgrp_c tvgrp_u walmart
## 1.678911 1.161944 1.649709 3.329535 1.161553 1.315527 1.549196 1.933420
walmart is not significant and adds no predictive power.
mod6 assesses the predictive power of coupon drops by Brand C on holidays and non-holidays, and of coupon drops by competitors.
mod6<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+fsi_comp+fsi_holi+fsi_non)
summary(mod6)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday + tvgrp_c + tvgrp_u + fsi_comp + fsi_holi + fsi_non)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36723 -0.06430 -0.01217 0.05499 0.65931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.545e+01 4.004e-01 38.595 < 2e-16 ***
## price_c -1.750e+00 4.488e-01 -3.899 0.000139 ***
## price_p 1.040e+00 2.487e-01 4.182 4.65e-05 ***
## disacv_c 3.486e-02 1.882e-03 18.520 < 2e-16 ***
## bonusacv 2.315e-03 5.758e-04 4.020 8.78e-05 ***
## holiday 5.647e-01 7.179e-02 7.866 4.24e-13 ***
## tvgrp_c 6.762e-04 1.201e-04 5.629 7.45e-08 ***
## tvgrp_u 4.277e-04 1.307e-04 3.273 0.001290 **
## fsi_comp 9.350e-07 4.984e-07 1.876 0.062388 .
## fsi_holi 1.190e-06 1.519e-06 0.783 0.434611
## fsi_non 7.862e-07 8.099e-07 0.971 0.333085
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1137 on 168 degrees of freedom
## Multiple R-squared: 0.8785, Adjusted R-squared: 0.8712
## F-statistic: 121.4 on 10 and 168 DF, p-value: < 2.2e-16
vif(mod6)
## price_c price_p disacv_c bonusacv holiday tvgrp_c tvgrp_u fsi_comp
## 1.394702 1.199014 1.833396 2.280540 1.175098 1.335049 1.571149 1.079837
## fsi_holi fsi_non
## 1.212358 1.068911
None of these variables is significant; couponing shows no predictive power here.
mod7 assesses the predictive power of itemstor, an indicator of line length at stores.
mod7<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+itemstor)
summary(mod7)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday + tvgrp_c + tvgrp_u + itemstor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20703 -0.05860 -0.00928 0.04545 0.54431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.8073911 0.4630615 27.658 < 2e-16 ***
## price_c -1.7618260 0.3831404 -4.598 8.28e-06 ***
## price_p 0.4450893 0.2248533 1.979 0.04938 *
## disacv_c 0.0302460 0.0016548 18.277 < 2e-16 ***
## bonusacv 0.0073333 0.0007814 9.385 < 2e-16 ***
## holiday 0.5240258 0.0607872 8.621 4.48e-15 ***
## tvgrp_c 0.0005380 0.0001037 5.190 5.95e-07 ***
## tvgrp_u 0.0003943 0.0001109 3.555 0.00049 ***
## itemstor 0.3249773 0.0398350 8.158 7.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09721 on 170 degrees of freedom
## Multiple R-squared: 0.9102, Adjusted R-squared: 0.9059
## F-statistic: 215.3 on 8 and 170 DF, p-value: < 2.2e-16
vif(mod7)
## price_c price_p disacv_c bonusacv holiday tvgrp_c tvgrp_u itemstor
## 1.391747 1.342341 1.939906 5.750296 1.153383 1.361305 1.549972 4.324856
This improves the R-Square but also raises the VIF for bonusacv, though not to a concerning level.
At this point, we suspected an interaction between bonusacv and disacv_c because these terms might be related. For example, when a bonus product is included for Brand C, it may be offered as part of a discount promotion. So let’s add this interaction term and see how we do.
mod8 adds an interaction term between disacv_c and bonusacv.
mod8<- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+itemstor+disacv_c*bonusacv)
summary(mod8)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday + tvgrp_c + tvgrp_u + itemstor + disacv_c * bonusacv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19167 -0.04995 -0.00688 0.03712 0.56494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.301e+01 4.343e-01 29.966 < 2e-16 ***
## price_c -1.850e+00 3.582e-01 -5.165 6.70e-07 ***
## price_p 2.527e-01 2.133e-01 1.185 0.238
## disacv_c 1.887e-02 2.714e-03 6.952 7.51e-11 ***
## bonusacv 4.149e-03 9.606e-04 4.319 2.66e-05 ***
## holiday 4.922e-01 5.710e-02 8.620 4.65e-15 ***
## tvgrp_c 4.882e-04 9.729e-05 5.017 1.32e-06 ***
## tvgrp_u 4.541e-04 1.042e-04 4.357 2.28e-05 ***
## itemstor 3.402e-01 3.731e-02 9.116 < 2e-16 ***
## disacv_c:bonusacv 2.662e-04 5.222e-05 5.097 9.15e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09077 on 169 degrees of freedom
## Multiple R-squared: 0.9221, Adjusted R-squared: 0.918
## F-statistic: 222.4 on 9 and 169 DF, p-value: < 2.2e-16
vif(mod8)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## price_c price_p disacv_c bonusacv
## 1.395007 1.385701 5.986671 9.966421
## holiday tvgrp_c tvgrp_u itemstor
## 1.167341 1.375201 1.569871 4.352653
## disacv_c:bonusacv
## 15.004449
This raises the VIF to an undesirable level for several variables, so we will remove the interaction term.
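Had we wanted to keep the interaction, one standard option (sketched below, not pursued here) is to mean-center disacv_c and bonusacv before interacting them; centering typically lowers the VIFs of the main effects without changing the interaction’s fit:
# Sketch (not used in our final model): mean-center before interacting to
# reduce collinearity between the main effects and the product term
Datset$disacv_ctr <- Datset$disacv_c - mean(Datset$disacv_c)
Datset$bonus_ctr <- Datset$bonusacv - mean(Datset$bonusacv)
mod8_ctr <- lm(log(eq_volum) ~ price_c + price_p + disacv_ctr + bonus_ctr +
                 holiday + tvgrp_c + tvgrp_u + itemstor + disacv_ctr:bonus_ctr,
               data = Datset)
vif(mod8_ctr, type = "predictor") # per the warning above, for models with interactions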
Our best model is mod7.
summary(mod7)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday + tvgrp_c + tvgrp_u + itemstor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20703 -0.05860 -0.00928 0.04545 0.54431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.8073911 0.4630615 27.658 < 2e-16 ***
## price_c -1.7618260 0.3831404 -4.598 8.28e-06 ***
## price_p 0.4450893 0.2248533 1.979 0.04938 *
## disacv_c 0.0302460 0.0016548 18.277 < 2e-16 ***
## bonusacv 0.0073333 0.0007814 9.385 < 2e-16 ***
## holiday 0.5240258 0.0607872 8.621 4.48e-15 ***
## tvgrp_c 0.0005380 0.0001037 5.190 5.95e-07 ***
## tvgrp_u 0.0003943 0.0001109 3.555 0.00049 ***
## itemstor 0.3249773 0.0398350 8.158 7.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09721 on 170 degrees of freedom
## Multiple R-squared: 0.9102, Adjusted R-squared: 0.9059
## F-statistic: 215.3 on 8 and 170 DF, p-value: < 2.2e-16
The R-Square of our final model is 0.91. All p-values were less than 0.001 except for price_p, which was 0.049. Although this is close to the 0.05 cutoff, we decided to leave it in since we felt that the price of a competing brand might be an important predictor of Brand C’s performance. This indicates that each variable has predictive value in our model.
Looking at mod7, price discounting is the most effective marketing activity for Brand C in terms of generating sales. The independent variable disacv_c has a log-scale coefficient estimate of about 0.03. Holding all other variables constant, a one-unit increase in disacv_c multiplies sales by exp(0.0302460) ≈ 1.031, i.e., roughly a 3.1% increase in sales for Brand C.
Bonus packs come second in terms of effective marketing activity for Brand C, with a log-scale coefficient estimate of 0.0073333. TV advertising comes third, with a log-scale coefficient estimate of 0.0005380. Couponing activities (the FSI variables) for Brand C and its competitors were found to have no significant association with sales, hence they were excluded from the final model.
Holiday is also an important factor in predicting sales. The holiday variable takes a binary value of 1 or 0; its reference group is 0 (non-holiday), and its log-scale coefficient estimate is 0.5240258. We can interpret this as follows: holding all other variables constant, sales in holiday weeks are on average exp(0.5240258) ≈ 1.69 times sales in non-holiday weeks, i.e., about 69% higher.
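These interpretations can be computed directly from the fitted model; a small sketch converting mod7’s log-scale estimates into multiplicative effects:
# Convert log-scale coefficients into multiplicative effects on sales
exp(coef(mod7)["disacv_c"]) # about 1.031: each additional point of disacv_c lifts sales ~3.1%
exp(coef(mod7)["holiday"]) # about 1.689: holiday weeks run ~69% above comparable non-holiday weeks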
Regarding the effects of competitive activity on Brand C’s sales, we did not find any association between the price of Brand E and the sales of Brand C in our earlier models, hence it was not included in our final model, mod7. As for Brand P’s pricing, we do observe a significant association: as the price of Brand P increases, sales of Brand C increase. This is anticipated because both brands sell products in the same category, so Brand C could be perceived as a purchase alternative when the price of Brand P rises.
The variable tvgrp_u measures TV commercial exposure (GRPs) for Brand U. Since Brands C and U are owned by the same company, customers who viewed Brand U’s commercials could be introduced to Brand C along their customer journey, which helps explain why tvgrp_u is positively associated with our sales.
Price seemed like a logical starting point for our regression analysis because sales are typically closely related to price. We started by regressing all brand prices onto a log transformation of eq_volum. This resulted in a low adjusted R-Square value (0.03). We then decided to add the variables associated with discounting (disacv_c and bonusacv) because we assumed, informed by the correlation analysis, that there would be a positive relationship between discounted items and unit sales. At this point, we also removed price_e because it had a minimal correlation with eq_volum and a high p-value in the model, clearly making it a poor predictor of eq_volum. We also added a “dummy” variable to account for the seasonality occurring at week 52 each year.
The effect of these three changes was to raise the R-Square to 0.84, a substantial increase. We then evaluated the effect of tvgrp_c and tvgrp_u, given the decent correlation we saw between tvgrp_c and eq_volum. Adding these effects produced an R-Square of 0.87.
Itemstor, as a measure of a brand’s presence within stores, could be tied to sales: when consumers see more items from the same brand on offer, they may be more likely to purchase from that brand. The addition of itemstor brought the R-Square up to 0.91. At this point we evaluated the VIFs to check for multicollinearity; most scores were below 2, with bonusacv (5.75) and itemstor (4.32) higher but still within an acceptable range.
We then considered adding trustad and walmart but decided against including them because neither showed significance and both resulted in a reduced adjusted R-Square value. We also considered factoring in the coupon variables (fsi_holi, fsi_non, fsi_comp), but none of these variables improved the model either. This was surprising, since one would assume that coupon drops would result in increased unit sales.
We also considered adding an interaction term between bonusacv and disacv_c because we felt that these two variables might be related. Although the addition of this interaction term resulted in an increased R-Square (0.92), it also produced a VIF of about 6 for disacv_c, 10 for bonusacv, and 15 for the interaction term. For this reason, we felt it better to remove the interaction term from the model.
The final model accounts for seasonality through the “holiday” dummy variable for week 52. It also has fairly low VIFs among the independent variables, with only two exhibiting a score higher than 2 (bonusacv = 5.75, itemstor = 4.32).
Now we will look at model fit of actual vs. predicted sales.
logpredicted7 <- predict(mod7)
predicted7 <- exp(logpredicted7)
ggplot(Datset, aes(x=eq_volum, y=predicted7)) +
coord_cartesian(xlim =c(0, 15000000), ylim = c(0, 15000000)) +
geom_point()+
geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Datset, aes(observation)) +
geom_line(aes(y = eq_volum, colour = "eq_volum"),stat = "summary", fun = "mean") +
geom_line(aes(y = predicted7, colour = "predicted7"),stat = "summary", fun = "mean")
Our model did a very good job of predicting equivalent sales volume over time based on the key predictors we identified. It did struggle slightly to predict the annual sales spikes accurately, over-predicting the first and under-predicting the second and third. Some of this may stem from the fact that the sales spikes actually lasted for about two weeks, while our dummy flags only one.
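One refinement worth exploring (a sketch only; we have not re-estimated the model with it) is a wider seasonal dummy covering the full spike window seen in the outlier table, rather than week 52 alone:
# Sketch: flag the whole spike window (weeks 50-53 plus week 1) instead of only
# week 52; refitting mod7 with holiday_wide could then be compared on adjusted R-squared
Datset$holiday_wide <- ifelse(Datset$weeknumber %in% c(50, 51, 52, 53, 1), 1, 0)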
logeq_volum <- log(eq_volum)
linear_price_c <- ggplot(Datset, aes(x=price_c, y=eq_volum)) +
coord_cartesian(xlim =c(0.83, .97), ylim = c(0, 15000000)) +
geom_point()+
geom_smooth(method=lm)
linear_price_p <- ggplot(Datset, aes(x=price_p, y=eq_volum)) +
coord_cartesian(xlim =c(0.55, .8), ylim = c(0, 15000000)) +
geom_point()+
geom_smooth(method=lm)
linear_disacv_c <- ggplot(Datset, aes(x=disacv_c, y=eq_volum)) +
coord_cartesian(xlim =c(5, 34), ylim = c(0, 15000000)) +
geom_point()+
geom_smooth(method=lm)
linear_bonusacv <- ggplot(Datset, aes(x=bonusacv, y=eq_volum)) +
coord_cartesian(xlim =c(0, 70), ylim = c(0, 15000000)) +
geom_point()+
geom_smooth(method=lm)
linear_tvgrp_c <- ggplot(Datset, aes(x=tvgrp_c, y=eq_volum)) +
coord_cartesian(xlim =c(0, 350), ylim = c(0, 15000000)) +
geom_point()+
geom_smooth(method=lm)
linear_tvgrp_u <- ggplot(Datset, aes(x=tvgrp_u, y=eq_volum)) +
coord_cartesian(xlim =c(0, 400), ylim = c(0, 15000000)) +
geom_point()+
geom_smooth(method=lm)
linear_itemstor <- ggplot(Datset, aes(x=itemstor, y=eq_volum)) +
coord_cartesian(xlim =c(8, 10), ylim = c(0, 15000000)) +
geom_point()+
geom_smooth(method=lm)
grid.arrange(linear_price_c, linear_price_p, linear_disacv_c, linear_bonusacv, linear_tvgrp_c, linear_tvgrp_u, linear_itemstor, nrow = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
As shown above, each of the predictor variables shows only a weak linear relationship with the untransformed response variable.
Let’s check the linearity after the log transformation we used for mod7.
loglinear_price_c <- ggplot(Datset, aes(x=price_c, y=logeq_volum)) +
coord_cartesian(xlim =c(0.83, .97), ylim = c(12, 20)) +
geom_point()+
geom_smooth(method=lm)
loglinear_price_p<- ggplot(Datset, aes(x=price_p, y=logeq_volum)) +
coord_cartesian(xlim =c(0.55, .8), ylim = c(12, 20)) +
geom_point()+
geom_smooth(method=lm)
loglinear_disacv_c<- ggplot(Datset, aes(x=disacv_c, y=logeq_volum)) +
coord_cartesian(xlim =c(5, 34), ylim = c(12, 20)) +
geom_point()+
geom_smooth(method=lm)
loglinear_bonusacv<- ggplot(Datset, aes(x=bonusacv, y=logeq_volum)) +
coord_cartesian(xlim =c(0, 70), ylim = c(12, 20)) +
geom_point()+
geom_smooth(method=lm)
loglinear_tvgrp_c<- ggplot(Datset, aes(x=tvgrp_c, y=logeq_volum)) +
coord_cartesian(xlim =c(0, 350), ylim = c(12, 20)) +
geom_point()+
geom_smooth(method=lm)
loglinear_tvgrp_u<- ggplot(Datset, aes(x=tvgrp_u, y=logeq_volum)) +
coord_cartesian(xlim =c(0, 400), ylim = c(12, 20)) +
geom_point()+
geom_smooth(method=lm)
loglinear_itemstor<- ggplot(Datset, aes(x=itemstor, y=logeq_volum)) +
coord_cartesian(xlim =c(8, 10), ylim = c(12, 20)) +
geom_point()+
geom_smooth(method=lm)
grid.arrange(loglinear_price_c, loglinear_price_p, loglinear_disacv_c, loglinear_bonusacv, loglinear_tvgrp_c, loglinear_tvgrp_u, loglinear_itemstor, nrow = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
After the log transformation, the relationships are much more linear. This is a good indication that we have built a strong model.
residuals7 <- residuals(mod7)
head(eq_volum)
## [1] 3240042 2885233 2877506 3107180 2954494 2913908
head(predicted7)
## 1 2 3 4 5 6
## 3217779 3078200 2935131 2967494 2832212 2777099
head(residuals7)
## 1 2 3 4 5 6
## 0.006894928 -0.064739202 -0.019828075 0.045997742 0.042269347 0.048088243
residuals7
## 1 2 3 4 5
## 0.0068949277 -0.0647392017 -0.0198280745 0.0459977422 0.0422693466
## 6 7 8 9 10
## 0.0480882431 -0.0055575647 0.0064553982 -0.0006719353 0.0666728081
## 11 12 13 14 15
## 0.0304126240 0.0185626299 -0.0332281080 -0.0225670041 0.0123430480
## 16 17 18 19 20
## -0.0278751364 0.0960855635 0.0946241523 0.0542141058 0.1330180126
## 21 22 23 24 25
## 0.0300612592 0.0201824240 0.0955602531 0.1133721600 0.1023246312
## 26 27 28 29 30
## 0.0780454940 0.1121440476 -0.0132600080 0.0976349128 -0.0826096704
## 31 32 33 34 35
## 0.0547005023 0.0564011335 0.0486113812 -0.0695821188 -0.1444992904
## 36 37 38 39 40
## -0.0539972014 0.0379532117 -0.0397309349 -0.0836179245 -0.0031899111
## 41 42 43 44 45
## -0.1109653413 -0.0461792149 -0.0328296395 -0.1123030768 0.5443092365
## 46 47 48 49 50
## 0.0458694711 -0.1489600797 -0.0349773087 -0.0969724184 -0.1286771333
## 51 52 53 54 55
## -0.1022464460 -0.0733210481 -0.0163448233 -0.1287436614 0.0344730971
## 56 57 58 59 60
## -0.1337746180 0.0149655890 0.0356339762 0.0648726186 -0.0236546693
## 61 62 63 64 65
## -0.0523492914 -0.2070337810 -0.0512625161 -0.0156171417 0.0441877316
## 66 67 68 69 70
## -0.0092803270 -0.0117795914 -0.0284126831 0.1009083739 0.0985429716
## 71 72 73 74 75
## -0.0163463115 -0.0383940982 -0.0538175972 -0.0582828548 0.0215427611
## 76 77 78 79 80
## 0.0551821445 -0.0061777273 -0.0846500826 -0.1773966575 -0.0677980390
## 81 82 83 84 85
## 0.0548367362 0.1583250549 0.0694326135 -0.0344638904 -0.0064999531
## 86 87 88 89 90
## 0.0791147876 -0.0785030852 0.0616984677 0.1325150678 0.1432124921
## 91 92 93 94 95
## -0.1221673131 -0.0979781893 -0.1614260664 -0.0952774448 0.0974476897
## 96 97 98 99 100
## 0.3509835920 0.0637865656 0.1429099042 -0.0024663927 -0.0430586406
## 101 102 103 104 105
## -0.0732742460 -0.1206319398 -0.0334566812 -0.0071697944 0.0221573600
## 106 107 108 109 110
## 0.0239742666 0.0354151589 0.0049897131 0.0259462714 0.0154137571
## 111 112 113 114 115
## 0.0382276353 -0.0909459286 -0.0105984796 0.0037728703 -0.1091825142
## 116 117 118 119 120
## -0.0828134185 -0.0366550475 -0.0723019039 -0.0970067346 -0.0507503503
## 121 122 123 124 125
## 0.0312630532 -0.0693150938 -0.0633583439 0.0111051318 -0.0861250813
## 126 127 128 129 130
## -0.0225376916 -0.0673712451 -0.0202938053 -0.1751993208 -0.0196980724
## 131 132 133 134 135
## -0.0176490304 -0.1151774711 -0.0704863646 -0.0718480201 0.0532613638
## 136 137 138 139 140
## -0.0726280383 0.0292613249 -0.0109673520 -0.1412730422 -0.0668191012
## 141 142 143 144 145
## -0.0041328985 -0.0099213539 -0.1438547937 -0.0714319096 -0.0088645240
## 146 147 148 149 150
## 0.0606937918 0.1934608074 0.3286213961 0.0485165112 0.2587568828
## 151 152 153 154 155
## -0.0039701945 -0.0376816986 -0.0883729632 -0.0336309314 0.0241210313
## 156 157 158 159 160
## 0.2227154816 0.1376621829 0.0450399924 -0.0042954289 -0.0294938108
## 161 162 163 164 165
## 0.0935331362 -0.0128273327 0.0713752002 0.1660576055 -0.0215742328
## 166 167 168 169 170
## -0.0054752121 -0.0484768224 0.0487284705 -0.0321761605 0.0147343560
## 171 172 173 174 175
## -0.0589159039 0.0023734651 -0.0052539767 -0.0281027281 0.0166929103
## 176 177 178 179
## 0.0279486612 -0.0190145791 -0.0416025136 -0.0132534237
# plot residuals
plot(mod7,which=1)
As shown above, the model displays a heteroscedastic pattern: the residuals are not evenly distributed. Given that the largest residuals occur at observations coinciding with the sales spikes around week 52, those spikes are likely the cause of the residual outliers. We attempted to combat this effect by implementing the dummy “holiday” variable for sales during that week, but the approach fell short of producing homoscedastic errors.
bptest(mod7)
##
## studentized Breusch-Pagan test
##
## data: mod7
## BP = 38.614, df = 8, p-value = 5.795e-06
The Breusch-Pagan test is significant (p < 0.05), so there is heteroscedasticity in our mod7 model.
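Given the heteroscedasticity, one standard remedy is to re-test the coefficients using heteroscedasticity-consistent standard errors; a sketch using hccm() from car and coeftest() from lmtest, both loaded above:
# Re-test mod7's coefficients with heteroscedasticity-consistent (HC3)
# standard errors; the point estimates are unchanged, only the inference adjusts
coeftest(mod7, vcov = hccm(mod7))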
resi_hist <- ggplot(Datset, aes(x=residuals7)) + geom_histogram()
resi_dens <- ggplot(Datset,aes(x=residuals7)) +
geom_density()
grid.arrange(resi_hist, resi_dens, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Q-Q plot
QQplot <- plot(mod7,which=2)
As shown above, the residuals are distributed fairly normally, with the exception of a few outliers. This is further evidence of a sound regression model; the brief departure from residual normality is attributable to the “holiday” spike in sales described previously.
We will now visualize the in-sample and out-of-sample fit.
dim(Datset)
## [1] 179 21
length(eq_volum)
## [1] 179
length(eq_volum)*.75
## [1] 134.25
insample=1:(length(eq_volum)*.75)
outsample=c(1:length(eq_volum))[-insample]
tail(insample)
## [1] 129 130 131 132 133 134
head(outsample)
## [1] 135 136 137 138 139 140
# Create Training and Test data
trainingData <- Datset[insample, ] # model training data
testData <- Datset[outsample, ] # test data
lmtrain <- lm(log(eq_volum) ~ price_c+price_p+disacv_c+bonusacv+holiday+tvgrp_c+tvgrp_u+itemstor, data=trainingData) # build the model
Logtrainpred<-predict(lmtrain,trainingData)
trainpred=exp(Logtrainpred)
Logtestpred <- predict(lmtrain, testData) # predict sales on the holdout test data
testpred=exp(Logtestpred)
First, the in-sample fit:
ggplot(trainingData, aes(observation)) +
geom_line(aes(y = eq_volum, colour = "eq_volum"),stat = "summary", fun = "mean") +
geom_line(aes(y = trainpred, colour = "trainpred"),stat = "summary", fun = "mean")
Second, the out-of-sample fit:
ggplot(testData, aes(observation)) +
geom_line(aes(y = eq_volum, colour = "eq_volum"),stat = "summary", fun = "mean") +
geom_line(aes(y = testpred, colour = "testpred"),stat = "summary", fun = "mean")
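To put a number on the out-of-sample fit, we can compute error metrics on the holdout weeks; a quick sketch using the objects created above:
# Out-of-sample accuracy on the holdout weeks
rmse_test <- sqrt(mean((testData$eq_volum - testpred)^2)) # root mean squared error, in units
mape_test <- mean(abs(testData$eq_volum - testpred) / testData$eq_volum) # mean absolute percentage error
rmse_test
mape_test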
summary(mod7)
##
## Call:
## lm(formula = log(eq_volum) ~ price_c + price_p + disacv_c + bonusacv +
## holiday + tvgrp_c + tvgrp_u + itemstor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20703 -0.05860 -0.00928 0.04545 0.54431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.8073911 0.4630615 27.658 < 2e-16 ***
## price_c -1.7618260 0.3831404 -4.598 8.28e-06 ***
## price_p 0.4450893 0.2248533 1.979 0.04938 *
## disacv_c 0.0302460 0.0016548 18.277 < 2e-16 ***
## bonusacv 0.0073333 0.0007814 9.385 < 2e-16 ***
## holiday 0.5240258 0.0607872 8.621 4.48e-15 ***
## tvgrp_c 0.0005380 0.0001037 5.190 5.95e-07 ***
## tvgrp_u 0.0003943 0.0001109 3.555 0.00049 ***
## itemstor 0.3249773 0.0398350 8.158 7.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09721 on 170 degrees of freedom
## Multiple R-squared: 0.9102, Adjusted R-squared: 0.9059
## F-statistic: 215.3 on 8 and 170 DF, p-value: < 2.2e-16
Assume Brand C wants to predict their sales under different marketing activities, but first wants to know their baseline sales in the absence of any marketing or competitive activity. The intercept of mod7 represents the baseline level of sales when all other variables equal zero: Brand C could expect to sell about exp(12.81), around 365,858 units, in the absence of any marketing and competitive activity (though note that zero prices and zero items per store lie well outside the observed data, so this baseline is an extrapolation). If Brand C wants to predict sales during holiday weeks, the holiday coefficient is multiplied by 1; for non-holiday weeks, it is multiplied by 0.
To understand the effect of pricing on Brand C’s sales, we can create scenarios: for example, how do sales vary when price_c = 1.2 and price_p = 0.8 compared to when price_c = 1.15 and price_p = 0.75, assuming we receive information on a price reduction for Brand P? Using the model, we can estimate how Brand C’s sales would change if competitor Brand P lowered their price by 0.05 and Brand C were considering lowering their price by 0.05 as well (keeping in mind that these prices sit above the observed range, so such predictions are extrapolations).
Let’s say Brand C wants to test the effect of increasing their TV advertising from an average of 500 GRPs to an average of 800 GRPs. We can feed the values into the model and calculate the sales increase:
exp(800*0.000538+12.8073911) - exp(500*0.000538+12.8073911) = 83642.48
This suggests that an increase in Brand C’s TV GRPs from 500 to 800 multiplies predicted sales by exp(300*0.000538) ≈ 1.175, roughly a 17.5% lift; the 83,642.48-unit figure above corresponds to the special case where all other variables are zero. Taking the cost of TV advertising into consideration, Brand C can use this predicted lift to evaluate the cost-effectiveness of a 300-GRP increase.
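The same calculation can be read off the fitted model; a sketch (the percentage lift applies whatever the other variables’ values are, whereas the unit figure above assumes they are zero):
# Multiplicative effect of a 300-GRP increase in Brand C's TV advertising
b_tv <- coef(mod7)["tvgrp_c"]
exp(300 * b_tv) # about 1.175, i.e. roughly a 17.5% lift in sales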
What-if analysis:
The equation for the model can be written as:
log(sales) = 12.8073911 - 1.7618260*price_c + 0.4450893*price_p + 0.0302460*disacv_c + 0.0073333*bonusacv + 0.5240258*holiday + 0.0005380*tvgrp_c + 0.0003943*tvgrp_u + 0.3249773*itemstor
Brand C wants to know their predicted sales when price_c = 0.9, price_p = 0.8, disacv_c = 15 and holiday = 0 (non-holiday), with the remaining variables (bonusacv, tvgrp_c, tvgrp_u, itemstor) each set to 1. Substituting into the regression formula:
log(sales) = 12.8073911 - (1.7618260*0.9) + (0.4450893*0.8) + (0.0302460*15) + 0.00733330 + (0.5240258*0) + 0.000538 + 0.0003943 + 0.3249773
Based on the formula above, we get a predicted log(sales) of about 12.36476, i.e., exp(12.36476) ≈ 234,392 unit sales.
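Rather than substituting coefficients by hand, the same scenario can be run through predict(); here is a minimal sketch mirroring the hand calculation above (remaining variables set to 1, holiday = 0):
# Same what-if via predict(); newdata mirrors the hand calculation above
scenario <- data.frame(price_c = 0.9, price_p = 0.8, disacv_c = 15,
                       bonusacv = 1, holiday = 0, tvgrp_c = 1,
                       tvgrp_u = 1, itemstor = 1)
exp(predict(mod7, newdata = scenario)) # about 234,392 units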