How to use correlation analysis in R to improve marketing performance

How to use correlation analysis to improve marketing performance.

Why correlation?

It is important to understand what drives relationships. For example, if you want to determine how marketing performance impact your sales numbers you have to account for all the other major factors that can also explain your sales numbers.

Correlation is a great exploratory tool that often reveals interesting patterns and can suggest hypotheses. Most importantly, it is really straightforward to interpret.

What is correlation?

A correlation is a statistic that quantifies the strength between 2 variables.

We can plot the statistic in a correlation plot or matrix. A number between +1 and -1 calculated so as to represent the linear relationship between two variables. A positive or negative sign indicates the direction of the relationship.

We’re not going to get into the mathematics of how correlation is calculated.

Let’s cover three (3) common correlation methods:

Pearson method - correlation is the default for linear relationships and assumes your data is normally distributed. It is sensitive to outliers and skewed data.
Spearman method - for non-normal populations. Checks for rank or ordered relationships.
Kendall method - for when you have a small dataset and many tied rank relationships.

How to interpret correlation

Correlation thresholds using Cohen’s Rule of Thumb:

Cohen’s Rule of Thumb
Correlation Coefficient	Association
r >= 0.5	Large or strong association
r = 0.3	Medium association
r = 0.1	Small or weak association

If the underlying data distribution is not normal, then you could transform your variables before attempting to apply these thresholds.

What’s an acceptable correlation?

Even if the correlation coefficient is at or near zero, that doesn’t meant there’s no relationship. It’s just that relationship isn’t linear, but there could be other relationships which is why it’s important to visualize your variables.

1. Set up environment

# Install pacman if needed
if (!require("pacman")) install.packages("pacman")

## Loading required package: pacman

# load packages
pacman::p_load(pacman,
  tidyverse, cowplot, VIM, corrplot, parsnip)

setwd('~/Documents/GitHub/Marketing-Analytics-R')

2. Load & inspect data

We have our dataset that is available on Kaggle. https://www.kaggle.com/datasets/veer06b/marrket-mix-dataset. Here is what is known about this dataset according to Kaggle.

The data contains the sales data for two consecutive years of a particular product of a brand
Each row contains the Volume of Sales for a week and different campaign/promotion methods for that product for each week

#Dataset is in datasets subfolder
df <- read.csv("datasets/mktmix.csv", stringsAsFactors = TRUE)

#Check results
str(df)

## 'data.frame':    104 obs. of  9 variables:
##  $ NewVolSales     : int  19564 19387 23889 20055 20064 19481 19509 19033 20498 19509 ...
##  $ Base_Price      : num  15 15 14.6 15.3 15.6 ...
##  $ Radio           : num  245 314 324 298 279 259 235 290 315 318 ...
##  $ InStore         : num  15.5 16.4 62.7 16.6 41.5 ...
##  $ NewspaperInserts: Factor w/ 2 levels "","Insert": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Discount        : num  0 0 0.05 0 0.045 0 0 0 0.035 0.045 ...
##  $ TV              : num  101.8 76.7 131.6 119.6 103.4 ...
##  $ Stout           : num  2.28 2.22 2.01 2.2 1.82 ...
##  $ Website_Campaign: Factor w/ 4 levels "","Facebook",..: 1 1 1 1 1 1 1 1 1 1 ...

#any missing data? Using VIM package
aggr(df, prop = F, numbers = T) # radio has some missing data

#We will place with a 0
df$Radio[is.na(df$Radio)] <- 0

3. Exploratory data visualization

Let’s do some visualization. If you have several variables, plotting them in a pair plot might be a good option. This kills two birds with one stone. The visualization step plus the correlation coefficient.

p1 <- ggplot(df, aes(NewVolSales)) + geom_histogram(bins = 50, aes(y=..density..),  colour="black", fill="grey") + geom_density(alpha=.2, fill="#f4ecc2") + theme_minimal() 
p2 <- ggplot(df, aes(Base_Price, NewVolSales)) + geom_point() + theme_minimal() 
p3 <- ggplot(df, aes(Radio, NewVolSales)) + geom_point() + theme_minimal() 
p4 <- ggplot(df, aes(InStore, NewVolSales)) + geom_point() + theme_minimal() 
p5 <- ggplot(df, aes(NewspaperInserts, NewVolSales)) + geom_boxplot() + theme_minimal() 
p6 <- ggplot(df, aes(Discount, NewVolSales)) + geom_point() + theme_minimal() 
p7 <- ggplot(df, aes(TV, NewVolSales)) + geom_point() + theme_minimal()
p8 <- ggplot(df, aes(Stout, NewVolSales)) + geom_point() + theme_minimal() 
p9 <- ggplot(df, aes(Website_Campaign, NewVolSales)) + geom_boxplot() + theme_minimal() + coord_flip()

plot_grid(p1, p2, p3, p4, p5, p6, p7, p8, p9, labels = "auto")

ggplot(df, aes(Website_Campaign, NewVolSales)) + geom_boxplot() + theme_minimal() + coord_flip()

What can we tell from the exploratory visualization?

Higher sales when the base price is low. Less sales when base price is high.
The price of the product and stout seems to have a negative impact on sales. Not sure what stout refers to.
Both radio and in-store seems to have some impact. More sales when radio spending and in-store is high.
Newspaper insert doesn’t appear to have any significant impact.

4. Correlation analysis

The default method is pearson. You can change the method to kendall or spearman for non-normal distributions.

#Correlation Plot
df %>% 
 select_if(is.numeric) %>% 
  cor(method = "pearson") %>% 
    corrplot(type = "upper", addCoef.col = "black", diag=FALSE)

We have the correlation coefficients in each box. Positive correlations are in blue. Negative correlations in red.

Summary of correlations:

Instore and discount both have a medium positive correlation to NewVolSales.
Radio and TV have a weak positive correlation to NewvolSales.
Base_Price has a strong negative correlation to NewVolSales.
Last, but not least, Stout has a medium negative correlation to NewVolSales

5. Correlation and it’s relationship to regression

Let’s review how correlation and regression are related by reviewing just 2 variables (NewVolSales and Discount).

The correlation coefficient of NewVolSales and Discount ads is 0.42 (rounded to 2 decimal places). If we model this in a linear regression model and extract the r-squared, the result is 0.18. (rounded to 2 decimal places).

If we square the correlation coefficient, we will get our r-squared.

#Extract correlation coefficient
corr_coeff <- cor(df$NewVolSales, df$Discount)
corr_coeff

## [1] 0.4247606

#Extract R-squared from lm
linear_model <- lm(NewVolSales ~ Discount, data = df )
r_squared_lm <- summary(linear_model)$r.squared
r_squared_lm

## [1] 0.1804216

#Now compare the 2 numbers. The correlation coefficient squared and our r-squared extracted from our linear regression model. Rounded to 2 decimal places.
#Are these two the same? Yes they are.
round(corr_coeff^2,2) == round(r_squared_lm,2)

## [1] TRUE

NOTE: The coefficient of determination is our correlation coefficient squared. It is the proportion of the variance in the y variable that is predictable from our x variable.

6. Marketing Mix Modeling

Now we have some insights into which of our variables influence sales, let’s check for significance by using a linear regression model that includes all of our variables.

# Model Spec
model_spec_lm <- linear_reg() %>%
    set_engine('lm') %>% 
    set_mode('regression')

# Fit Linear Regression Model
mkmix_model <- model_spec_lm %>%
    fit(NewVolSales ~ Base_Price + Radio + InStore + factor(NewspaperInserts) + Discount + TV + Stout + factor(Website_Campaign), data = df)

#Print summary of model in a tidy object
mkmix_summary <- tidy(mkmix_model) %>% 
              mutate(significant = p.value <= 0.05)
mkmix_summary

#Use the glance function from the broom package to get additional information from the model (e.g. model statistics like r.squared)
(mod_glance <- glance(mkmix_model))

#If you prefer, you can use summary function to print output to console
summary(mkmix_model$fit)

## 
## Call:
## stats::lm(formula = NewVolSales ~ Base_Price + Radio + InStore + 
##     factor(NewspaperInserts) + Discount + TV + Stout + factor(Website_Campaign), 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1523.13  -473.88    16.83   381.71  2018.65 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                               54394.3976  2278.7953  23.870
## Base_Price                                -2056.8098   136.7813 -15.037
## Radio                                        -0.2801     0.7214  -0.388
## InStore                                      28.6341     7.3836   3.878
## factor(NewspaperInserts)Insert              159.0085   289.0542   0.550
## Discount                                   5554.2422  3618.1645   1.535
## TV                                            3.2233     1.6875   1.910
## Stout                                     -1631.4978   235.9605  -6.914
## factor(Website_Campaign)Facebook            239.8193   362.6601   0.661
## factor(Website_Campaign)Twitter             468.7707   362.9915   1.291
## factor(Website_Campaign)Website Campaign  -1273.3869   315.4384  -4.037
##                                           Pr(>|t|)    
## (Intercept)                                < 2e-16 ***
## Base_Price                                 < 2e-16 ***
## Radio                                     0.698694    
## InStore                                   0.000196 ***
## factor(NewspaperInserts)Insert            0.583570    
## Discount                                  0.128155    
## TV                                        0.059199 .  
## Stout                                     5.83e-10 ***
## factor(Website_Campaign)Facebook          0.510069    
## factor(Website_Campaign)Twitter           0.199762    
## factor(Website_Campaign)Website Campaign  0.000111 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 677.4 on 93 degrees of freedom
## Multiple R-squared:  0.8337, Adjusted R-squared:  0.8159 
## F-statistic: 46.63 on 10 and 93 DF,  p-value: < 2.2e-16

mkmix_summary %>% 
  filter(significant == TRUE) %>% 
  select(term, estimate, significant)

Our baseline revenue is 54,394. The only positive significant (p-value is less than 0.05) variable is InStore. The negative significant variables are Base_Price, Stout & Website Campaign (no campaign). The generic interpretation for each of our coefficients is for every one unit increase in the x variable, the y variable (NewVolSales) increases by beta units. For example, for every 1 unit increase of InStore, sales increase by 28. If our sales volume is in dollars, then this would be a 28 dollar increase.

7. What can we derive from our correlation analysis and how can we use this to inform marketing?

For the purpose of our analysis, we will just focus on the relationship between NewVolSales and each independent (x) variable.

What’s not working in marketing?:

Price - significant. We lose money when we increase the base price.
Stout - significant

What is working with marketing?:

Instore - significant

What is not impactful?:

Radio - not significant
TV - not significant
Discount - not significant

You should at this point have a conversation with your marketing stakeholders to understand their marketing goals and tactics for each of their marketing initiatives. There may be different goals for different initiatives. For instance, if marketing is using Radio and TV for top-funnel activities, then what we see in the data makes sense. Radio and TV are great for branding, but may have less of an impact on bottom-funnel metrics like sales.

8. Next steps

With our correlation analysis we have derived some key insights into what is working and what isn’t working when it comes to increased sales. Now, it’s up to you to go further in your analysis which could include:

Adding in additional factors that were not included initially.
Quantifying the impact of each marketing effort (e.g. ROI).
Build a future forecast based on current levels of marketing spend and promotions.

Pitfalls of correlation:

DANGER! DANGER! Correlation does not equal causation - correlation gives us a way to check if there is an association between two variables, but there could be other explanations.
Latent or hidden variables can affect the relationship between two (2) variables.

Advice:

Always check for highly correlated variables. An r > 0.9. You can do this by either checking the correlation matrix or checking the variance inflation factor (VIF).
If collinearity exists, remove or transform before modeling. Example, taking the natural log of a variable.
Model data using methods robust to collinearity, e.g. Random Forest models.

9. Final Summary & Key Takeaways

We walked through:

How to use correlation to help explain marketing performance.
How to test for significance.
Outlined potential next steps in your analysis.