R Markdown

T-tests compare the means of two groups to determine whether the difference between them is statistically significant. This makes them a handy tool for marketing: you can check whether two customer groups behave differently, or compare two prototypes of a new website design.

Similarly, Analysis of Variance (ANOVA) compares the means of three or more groups instead of two. This is particularly useful when you simply have more than two things to test! For instance, if you run three different promotions, ANOVA tells you whether mean sales differ across them; a follow-up (post-hoc) test then shows which promotions differ.

These are both useful in A/B tests.

Let’s explore both using a sales data set.

Read in the data

library(tidyverse)  # dplyr, purrr, ggplot2 and the %>% pipe
library(knitr)      # kable()
data <- read.csv("g:/Portfolio Projects/tTest_ANOVA/Data/WA_Marketing-Campaign.csv")
kable(head(data, 5), caption = "Examine first few rows")
Examine first few rows

MarketID  MarketSize  LocationID  AgeOfStore  Purchase_Channel  Promotion  week  SalesInThousands
1         Medium      1           4           Online            3          1     33.73
1         Medium      1           4           InStore           3          4     39.25
1         Medium      1           4           Online            3          2     35.67
1         Medium      1           4           InStore           3          3     29.03
1         Medium      2           5           Online            2          1     27.81

Are there missing values in the data?

# check variables for missing values
missing_vals <- data %>% map_lgl(anyNA)   # one TRUE/FALSE per column
missings <- names(which(missing_vals))
missings
## character(0)
# no missing values
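
When a data set does contain missing values, a per-column count is usually more informative than a yes/no check. The sketch below shows one way to get it with base R. The `toy` data frame is made up for illustration and is not the campaign data.

```r
# Hypothetical toy data with a few NAs (not the campaign data set)
toy <- data.frame(
  MarketSize       = c("Small", NA, "Large", "Medium"),
  SalesInThousands = c(33.7, 39.2, NA, NA)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the names of columns that contain at least one NA
cols_with_na <- names(na_counts[na_counts > 0])
cols_with_na
```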

Let’s check out the data

str(data)
## 'data.frame':    548 obs. of  8 variables:
##  $ MarketID        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MarketSize      : chr  "Medium" "Medium" "Medium" "Medium" ...
##  $ LocationID      : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ AgeOfStore      : int  4 4 4 4 5 5 5 5 12 12 ...
##  $ Purchase_Channel: chr  "Online" "InStore" "Online" "InStore" ...
##  $ Promotion       : int  3 3 3 3 2 2 2 2 1 1 ...
##  $ week            : int  1 4 2 3 1 2 3 4 1 4 ...
##  $ SalesInThousands: num  33.7 39.2 35.7 29 27.8 ...
# let's convert the character variables to factor
data<- data %>% 
  mutate(across(where(is.character), as.factor))

# make sure the conversions worked as expected
str(data)
## 'data.frame':    548 obs. of  8 variables:
##  $ MarketID        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MarketSize      : Factor w/ 3 levels "Large","Medium",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LocationID      : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ AgeOfStore      : int  4 4 4 4 5 5 5 5 12 12 ...
##  $ Purchase_Channel: Factor w/ 2 levels "InStore","Online": 2 1 2 1 2 2 2 2 2 2 ...
##  $ Promotion       : int  3 3 3 3 2 2 2 2 1 1 ...
##  $ week            : int  1 4 2 3 1 2 3 4 1 4 ...
##  $ SalesInThousands: num  33.7 39.2 35.7 29 27.8 ...
# we'll want Promotion as a factor instead of an integer later on so let's change that
data$Promotion<- as.factor(data$Promotion)
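
The conversion of Promotion matters for the ANOVA later on. If Promotion stays an integer, `aov()` treats it as a numeric predictor and fits a single slope (1 degree of freedom) instead of comparing one mean per group (2 degrees of freedom for three groups). A minimal sketch on simulated data; `sim` and its values are made up and only mimic the column names used here.

```r
set.seed(42)
# Simulated stand-in for the campaign data (values are made up)
sim <- data.frame(
  Promotion        = rep(1:3, each = 30),
  SalesInThousands = c(rnorm(30, 58, 16), rnorm(30, 47, 16), rnorm(30, 55, 16))
)

# Promotion as integer: treated as a numeric predictor, 1 degree of freedom
df_numeric <- summary(aov(SalesInThousands ~ Promotion, data = sim))[[1]]$Df[1]

# Promotion as factor: one mean per group, k - 1 = 2 degrees of freedom
sim$Promotion <- as.factor(sim$Promotion)
df_factor <- summary(aov(SalesInThousands ~ Promotion, data = sim))[[1]]$Df[1]

c(numeric = df_numeric, factor = df_factor)
```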

T-test

Let’s first test the null hypothesis that mean sales do not differ between in-store and online purchases.

# plot distribution
ggplot(data = data, aes(x = SalesInThousands, color = Purchase_Channel)) +
  geom_density()

From the plot above, there doesn’t appear to be much of a difference in sales between the two purchase channels. Let’s run a t-test to find out.

t.test(SalesInThousands ~ Purchase_Channel, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  SalesInThousands by Purchase_Channel
## t = -0.17192, df = 181.15, p-value = 0.8637
## alternative hypothesis: true difference in means between group InStore and group Online is not equal to 0
## 95 percent confidence interval:
##  -3.719991  3.123715
## sample estimates:
## mean in group InStore  mean in group Online 
##              53.23009              53.52823

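As the t-test output shows, there is no statistically significant difference in mean sales between the purchase channels (p = 0.8637, and the 95% confidence interval for the difference includes 0). Therefore, we fail to reject the null hypothesis.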
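
The output header says "Welch Two Sample t-test" and reports a fractional df (181.15) because `t.test()` does not assume equal variances by default. To make the numbers less of a black box, here is a sketch that rebuilds the Welch statistic, degrees of freedom and p-value by hand and compares them with `t.test()`. The two vectors are simulated, not the campaign data, and the variable names are illustrative.

```r
set.seed(1)
# Simulated sales for two channels (made-up values, not the campaign data)
in_store <- rnorm(40, mean = 53, sd = 17)
online   <- rnorm(60, mean = 54, sd = 15)

n1 <- length(in_store); n2 <- length(online)
v1 <- var(in_store);    v2 <- var(online)

# Welch t statistic: difference in means divided by its standard error
se     <- sqrt(v1 / n1 + v2 / n2)
t_stat <- (mean(in_store) - mean(online)) / se

# Welch-Satterthwaite degrees of freedom
df_welch <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = df_welch)

# Compare with the built-in test (Welch is the default, var.equal = FALSE)
built_in <- t.test(in_store, online)
c(manual_t = t_stat, builtin_t = unname(built_in$statistic))
```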

ANOVA

Let’s now test the hypothesis that mean sales do not differ across the promotions that were run. For this we’ll use ANOVA. Again, let’s start by looking at the distribution of sales.

# plot distribution
ggplot(data = data, aes(x = SalesInThousands, color = Promotion)) +
  geom_density()

Let’s see if there are statistical differences in the promotions with ANOVA!

data %>% 
  aov(SalesInThousands ~ Promotion, data = .) %>% 
  summary()
##              Df Sum Sq Mean Sq F value         Pr(>F)    
## Promotion     2  11449    5725   21.95 0.000000000677 ***
## Residuals   545 142114     261                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

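The ANOVA F-test is statistically significant (p < 0.001), so at least one promotion’s mean sales differ from the others. The F-test alone doesn’t say which, so let’s run a post-hoc test to see which promotions are different and which one our marketing team should use going forward.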
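
The F value in the table is the Promotion mean square divided by the Residuals mean square, that is, between-group variation relative to within-group variation. The sketch below reproduces that calculation by hand and checks it against `aov()`. It uses simulated data (`sim` is made up), so the numbers will not match the table above.

```r
set.seed(7)
# Simulated three-promotion data (made-up values, not the campaign data)
sim <- data.frame(
  Promotion        = factor(rep(1:3, times = c(40, 45, 50))),
  SalesInThousands = c(rnorm(40, 58, 16), rnorm(45, 47, 16), rnorm(50, 55, 16))
)

grand_mean  <- mean(sim$SalesInThousands)
group_means <- tapply(sim$SalesInThousands, sim$Promotion, mean)
group_n     <- tapply(sim$SalesInThousands, sim$Promotion, length)

# Between-group sum of squares: distance of group means from the grand mean
ss_between <- sum(group_n * (group_means - grand_mean)^2)

# Within-group sum of squares: spread of observations around their own group mean
# (ave() returns each observation's group mean)
ss_within <- sum((sim$SalesInThousands -
                  ave(sim$SalesInThousands, sim$Promotion))^2)

k <- nlevels(sim$Promotion)  # number of groups
n <- nrow(sim)               # number of observations

# F = mean square between / mean square within
f_stat <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_val  <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

# Compare with aov()
fit_tab <- summary(aov(SalesInThousands ~ Promotion, data = sim))[[1]]
c(manual_F = f_stat, aov_F = fit_tab[["F value"]][1])
```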

data %>% 
  aov(SalesInThousands ~ Promotion, data = .) %>% 
  TukeyHSD()
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = SalesInThousands ~ Promotion, data = .)
## 
## $Promotion
##           diff        lwr       upr     p adj
## 2-1 -10.769597 -14.773842 -6.765351 0.0000000
## 3-1  -2.734544  -6.738789  1.269702 0.2443878
## 3-2   8.035053   4.120802 11.949304 0.0000055

Promotions 1 and 2 are statistically different, and so are promotions 2 and 3. Promotions 1 and 3 are not (adjusted p = 0.24). So which do we use? Let’s compare the mean sales for each promotion.
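
The Tukey table can also be read programmatically. `TukeyHSD()` returns a list with one matrix per factor, so the significant pairs can be flagged with a simple comparison. A confidence interval that excludes 0 tells the same story as an adjusted p-value below 0.05, which is what the 2-1 and 3-2 rows show above. The sketch runs on simulated data (`sim` is made up) and the 0.05 cutoff is an assumption matching the 95% family-wise level.

```r
set.seed(7)
# Simulated three-promotion data (made-up values, not the campaign data)
sim <- data.frame(
  Promotion        = factor(rep(1:3, each = 50)),
  SalesInThousands = c(rnorm(50, 58, 16), rnorm(50, 47, 16), rnorm(50, 55, 16))
)

fit <- aov(SalesInThousands ~ Promotion, data = sim)

# Matrix with columns diff, lwr, upr, "p adj"; one row per pair of promotions
tukey_tab <- TukeyHSD(fit)$Promotion

# Flag pairs whose adjusted p-value is below 0.05
significant <- tukey_tab[, "p adj"] < 0.05

# Equivalent check: the 95% family-wise interval does not contain 0
excludes_zero <- tukey_tab[, "lwr"] > 0 | tukey_tab[, "upr"] < 0

cbind(tukey_tab, significant, excludes_zero)
```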

data %>% 
  group_by(Promotion) %>% 
  summarise(Promotion_Sales_means = mean(SalesInThousands, na.rm = TRUE))
## # A tibble: 3 × 2
##   Promotion Promotion_Sales_means
##   <fct>                     <dbl>
## 1 1                          58.1
## 2 2                          47.3
## 3 3                          55.4

Conclusion

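Promotion 2 has significantly lower mean sales than both of the others, so it can be ruled out. Since promotions 3 and 1 are not statistically different, we could go with either of them. However, since promotion 1 has the highest mean sales (58.1 vs. 55.4 in SalesInThousands), my recommendation to the marketing team would be to run with promotion 1.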

Both the t-test and ANOVA are incredibly useful marketing tools. They help marketing teams make data-driven choices and optimize for the outcomes they care about.