Problem setting

An e-commerce platform’s marketing department has ran 200 campaigns in across its 200 different markets. Depending on the market size, a market could be towns, cities, or countries; the platform’s advertisements were all aimed at its target market persona. The campaigns were conducted across 3 main channels: Google Ads, Meta (Facebook and Instagram), and LinkedIn. The marketing expenditure and sales data for each market were then queried from the database. A 4-variables dataset is made available as follows:

The table below illustrates what the first few observations in the dataset look like.

##   google linkedin facebook_insta Sales
## 1 230100    37800          69200 22100
## 2  44500    39300          45100 10400
## 3  17200    45900          69300 12000
## 4 151500    41300          58500 16500
## 5 180800    10800          58400 17900
## 6   8700    48900          75000  7200

The aim is to inform how the different channels perform, and advise on a future steps.

Analysis

I’ll start with a summary statistics table.

Data summary
Name mdf
Number of rows 200
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
google 0 1 147042.5 85854.24 700 74375 149750 218825 296400
linkedin 0 1 23264.0 14846.81 0 9975 22900 36525 49600
facebook_insta 0 1 30554.0 21778.62 300 12750 25750 45100 114000
Sales 0 1 15130.5 5283.89 1600 11000 16000 19050 27000

15k items per market were sold, on average. The table also illustrates that the marketing expenditure varied greatly by channel. On average, a LinkedIn campaign cost 23k euro, while a Google Ads campaign cost 147k euro.

Let’s see how they’re related to sales:

The correlation between sales and Google Ads seems to be the highest (0.9), followed by LinkedIn and Meta. Interesting to note that there’s moderate correlation between Meta and LinkedIn ads expenditure also.

I kept the x-axis (ads expenditure) the same length throughout the charts for perspective-taking, but also to illustrate the differences in expenditure between the marketing channels. A few things to mention here:

The challenge is to create a model that helps estimate the effect each of the marketing channels has. A linear regression model fitted on all of the data produced the following estimates:

## 
## Call:
## lm(formula = Sales ~ ., data = mdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7303.4  -824.4    -0.8   897.6  3747.3 
## 
## Coefficients:
##                    Estimate   Std. Error t value Pr(>|t|)    
## (Intercept)    4625.1240788  307.5011646  15.041   <2e-16 ***
## google            0.0544458    0.0013752  39.592   <2e-16 ***
## linkedin          0.1070012    0.0084896  12.604   <2e-16 ***
## facebook_insta    0.0003357    0.0057881   0.058    0.954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1662 on 196 degrees of freedom
## Multiple R-squared:  0.9026, Adjusted R-squared:  0.9011 
## F-statistic: 605.4 on 3 and 196 DF,  p-value: < 2.2e-16

The model indicates that both Google Ads and LinkedIn impact sales, but the Meta ads don’t. So I’ll drop the Meta advertising from the model.

## 
## Call:
## lm(formula = Sales ~ google + linkedin, data = mdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7313.1  -826.9     9.5   902.2  3748.4 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) 4630.879464  290.308085   15.95   <2e-16 ***
## google         0.054449    0.001371   39.73   <2e-16 ***
## linkedin       0.107175    0.007926   13.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1657 on 197 degrees of freedom
## Multiple R-squared:  0.9026, Adjusted R-squared:  0.9016 
## F-statistic: 912.7 on 2 and 197 DF,  p-value: < 2.2e-16

Google ads effect can be interpreted as such: a 1 euro increase in advertising on Google Ads is associated with a 0.054 units increase in sales. Put differently, about 18,52 euro of advertising expenditure on Google ads is expected to sell one additional unit.

For LinkedIn ads, a 9,10 euro of advertising expenditure is expected to sell one additional unit.

Looking at the regression model’s summary below, the model seems to explain some 90% of variation in sales. The model also has a close-to-zero p-value.

get_regression_summaries(lm2)
## # A tibble: 1 x 9
##   r_squared adj_r_squared      mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl>    <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.903         0.902 2706053. 1645. 1657.      913.       0     2   200

Note on interaction effects

In marketing, advertising on multiple channels tends to create better effects than advertising on any single channel alone. In statistical inference terminology, this multiplicative effect is known as an interaction effect. In this dataset I know there’s an interaction effect between Google and LinkedIn ads. However, in practice it can be hard to tell whether such an effect exists - especially between numeric variables. The best visual analysis tool that I could come up with for this dataset is this bubblechart:

## `geom_smooth()` using formula 'y ~ x'

The bubble chart illustrates that when more money are spent on LinkedIn ads, the total sales increases, as you’d expect.

But it isn’t all that obvious that a multiplicative effect is present. What indicates this to me is that when the Google Ads expenditure is low but radio expenditure is high (bottom left on the chart), the dots are below or close to the line of best fit (the red line) - thus the model tends to overestimate the sales volume. On the other hand, as the Google Ads expenditure increases, the model tends to underestimate the sales volume when LinkedIn expenditure is also high.

So the bubble chart at least indicates that there’s a distinct possibility that there is one. Statistical model testing is the next step.

## 
## Call:
## lm(formula = Sales ~ google * linkedin, data = mdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6309.3  -868.8   -31.2   908.9  3641.5 
## 
## Coefficients:
##                  Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)     6.193e+03  4.101e+02  15.101     < 2e-16 ***
## google          4.358e-02  2.489e-03  17.512     < 2e-16 ***
## linkedin        4.227e-02  1.473e-02   2.869     0.00457 ** 
## google:linkedin 4.431e-07  8.673e-08   5.110 0.000000763 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1561 on 196 degrees of freedom
## Multiple R-squared:  0.914,  Adjusted R-squared:  0.9127 
## F-statistic: 694.7 on 3 and 196 DF,  p-value: < 2.2e-16

The interaction term (\(google*linkedin\)), is highly significant with a p-value close to 0. The R-squared value also increases from 0.903 to 0.914. This indicates that a moderation effect is present. However, interaction models are more difficult to interpret, and the additional variation explained is small. As an initial model, and for explainability purposes, I would leave it out at this stage. Instead, I would mention that using both channels is likely to “benefit on sales additionally”.

Interaction modelling also comes with a range of caveats but I’ll leave that for another post. Meanwhile, this article that suggests when you should include interaction effects.