T-tests compare the means of two groups to determine whether the difference between them is statistically significant. This makes them a great tool for marketing: you can check whether two customer groups behave differently, or compare two prototypes for a new website design.
Similarly, Analysis of Variance (ANOVA) compares the means of three or more groups. This is particularly useful when you have more than two things to test. For instance, if you run three different promotions, ANOVA can tell you whether mean sales differ across them.
These are both useful in A/B tests.
Let’s explore both using a sales data set.
library(tidyverse) # dplyr, purrr and ggplot2 are used below
library(knitr)     # kable()
data <- read.csv("g:/Portfolio Projects/tTest_ANOVA/Data/WA_Marketing-Campaign.csv")
kable(data[1:5, ], caption = "Examine first few rows")
| MarketID | MarketSize | LocationID | AgeOfStore | Purchase_Channel | Promotion | week | SalesInThousands |
|---|---|---|---|---|---|---|---|
| 1 | Medium | 1 | 4 | Online | 3 | 1 | 33.73 |
| 1 | Medium | 1 | 4 | InStore | 3 | 4 | 39.25 |
| 1 | Medium | 1 | 4 | Online | 3 | 2 | 35.67 |
| 1 | Medium | 1 | 4 | InStore | 3 | 3 | 29.03 |
| 1 | Medium | 2 | 5 | Online | 2 | 1 | 27.81 |
# check variables for missing values
missing_vals <- data %>% map_lgl(anyNA)
missings <- names(which(missing_vals))
missings
## character(0)
# no missing values
str(data)
## 'data.frame': 548 obs. of 8 variables:
## $ MarketID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MarketSize : chr "Medium" "Medium" "Medium" "Medium" ...
## $ LocationID : int 1 1 1 1 2 2 2 2 3 3 ...
## $ AgeOfStore : int 4 4 4 4 5 5 5 5 12 12 ...
## $ Purchase_Channel: chr "Online" "InStore" "Online" "InStore" ...
## $ Promotion : int 3 3 3 3 2 2 2 2 1 1 ...
## $ week : int 1 4 2 3 1 2 3 4 1 4 ...
## $ SalesInThousands: num 33.7 39.2 35.7 29 27.8 ...
# let's convert the character variables to factors
data <- data %>%
  mutate(across(where(is.character), as.factor))
# make sure the conversions worked as expected
str(data)
## 'data.frame': 548 obs. of 8 variables:
## $ MarketID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MarketSize : Factor w/ 3 levels "Large","Medium",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ LocationID : int 1 1 1 1 2 2 2 2 3 3 ...
## $ AgeOfStore : int 4 4 4 4 5 5 5 5 12 12 ...
## $ Purchase_Channel: Factor w/ 2 levels "InStore","Online": 2 1 2 1 2 2 2 2 2 2 ...
## $ Promotion : int 3 3 3 3 2 2 2 2 1 1 ...
## $ week : int 1 4 2 3 1 2 3 4 1 4 ...
## $ SalesInThousands: num 33.7 39.2 35.7 29 27.8 ...
# Promotion is a categorical label, not a quantity, so convert it to a factor;
# aov() and TukeyHSD() need that later on to treat the promotions as separate groups
data$Promotion <- as.factor(data$Promotion)
Let’s first test the null hypothesis that mean sales do not differ between in-store and online purchases.
# plot the sales distribution for each purchase channel
ggplot(data = data, aes(x = SalesInThousands, color = Purchase_Channel)) +
  geom_density()
From the above plot, it’s fair to assume that sales don’t differ much between purchase channels. Let’s run a T-test to find out. Note that R’s t.test() runs Welch’s version by default, which does not assume equal variances in the two groups.
t.test(SalesInThousands ~ Purchase_Channel, data = data)
##
## Welch Two Sample t-test
##
## data: SalesInThousands by Purchase_Channel
## t = -0.17192, df = 181.15, p-value = 0.8637
## alternative hypothesis: true difference in means between group InStore and group Online is not equal to 0
## 95 percent confidence interval:
## -3.719991 3.123715
## sample estimates:
## mean in group InStore mean in group Online
## 53.23009 53.52823
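As the T-test output shows, there is no significant difference in mean sales between purchase channels: the p-value of 0.8637 is far above 0.05, and the 95% confidence interval for the difference in means (-3.72 to 3.12) includes 0. Therefore, we fail to reject the null hypothesis.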
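To make the Welch output less of a black box, here is a minimal sketch that reproduces the t statistic and the degrees of freedom by hand. It uses simulated stand-in data, not the campaign data set; the vectors in_store and online and their means, spreads, and sizes are assumptions chosen only for illustration.

```r
# Simulated stand-in data (hypothetical values, not the campaign data set)
set.seed(42)
in_store <- rnorm(40, mean = 53, sd = 16)
online   <- rnorm(60, mean = 54, sd = 17)

# Squared standard error of each group mean: s^2 / n
se2_in <- var(in_store) / length(in_store)
se2_on <- var(online) / length(online)

# Welch t statistic: difference in means over the unpooled standard error
t_manual <- (mean(in_store) - mean(online)) / sqrt(se2_in + se2_on)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (se2_in + se2_on)^2 /
  (se2_in^2 / (length(in_store) - 1) + se2_on^2 / (length(online) - 1))

# t.test() defaults to var.equal = FALSE, i.e. the Welch test
fit <- t.test(in_store, online)

# Both comparisons should return TRUE
all.equal(unname(fit$statistic), t_manual)
all.equal(unname(fit$parameter), df_manual)
```

The fractional degrees of freedom in the real output (181.15) come from this same approximation, which is why they are not a whole number.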
Let’s now test the null hypothesis that mean sales do not differ across the promotions that were run. For this we’ll use ANOVA. Again, let’s look at the distribution of sales.
# plot the sales distribution for each promotion
ggplot(data = data, aes(x = SalesInThousands, color = Promotion)) +
  geom_density()
Let’s see if there are statistical differences in the promotions with ANOVA!
data %>%
  aov(SalesInThousands ~ Promotion, data = .) %>%
  summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## Promotion 2 11449 5725 21.95 0.000000000677 ***
## Residuals 545 142114 261
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
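The F value in the table above can be rebuilt from the other columns: each mean square is the sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The short sketch below does that arithmetic with the rounded numbers copied from the table, so the p-value it produces is only approximately the one R printed; the variable names are my own.

```r
# Values copied from the ANOVA table above (already rounded by summary())
ss_promotion <- 11449
df_promotion <- 2
ss_residual  <- 142114
df_residual  <- 545

# Mean square = sum of squares / degrees of freedom
ms_promotion <- ss_promotion / df_promotion
ms_residual  <- ss_residual / df_residual

# F statistic = between-group mean square / within-group mean square
f_value <- ms_promotion / ms_residual
round(f_value, 2)  # 21.95, matching the table

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_promotion, df_residual, lower.tail = FALSE)
p_value
```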
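The ANOVA result is statistically significant (F = 21.95, p < 0.001), so we reject the null hypothesis: at least one promotion has different mean sales from the others. ANOVA doesn’t tell us which ones, though, so let’s run a post-hoc test (Tukey’s HSD) to see which promotions differ and which one our marketing team should use going forward.
data %>%
  aov(SalesInThousands ~ Promotion, data = .) %>%
  TukeyHSD()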
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = SalesInThousands ~ Promotion, data = .)
##
## $Promotion
## diff lwr upr p adj
## 2-1 -10.769597 -14.773842 -6.765351 0.0000000
## 3-1 -2.734544 -6.738789 1.269702 0.2443878
## 3-2 8.035053 4.120802 11.949304 0.0000055
Promotions 2 and 1 are statistically different (adjusted p < 0.001), as are promotions 3 and 2. Promotions 3 and 1 are not (adjusted p = 0.244, and the confidence interval includes 0). So which do we use? We can compare the mean sales for each promotion.
data %>%
  group_by(Promotion) %>%
  summarise(Promotion_Sales_means = mean(SalesInThousands, na.rm = TRUE))
## # A tibble: 3 × 2
## Promotion Promotion_Sales_means
## <fct> <dbl>
## 1 1 58.1
## 2 2 47.3
## 3 3 55.4
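The diff column in the Tukey output is simply the pairwise difference between these group means; for example, 47.3 - 58.1 ≈ -10.8 matches the 2-1 row. The sketch below illustrates that relationship on simulated data (the sim data frame and its values are hypothetical stand-ins, not the campaign data). It also stores the fitted model once so the same object can be passed to both summary() and TukeyHSD() instead of fitting it twice.

```r
# Simulated stand-in data (hypothetical values, not the campaign data set)
set.seed(1)
sim <- data.frame(
  promotion = factor(rep(1:3, each = 50)),
  sales     = c(rnorm(50, 58, 16), rnorm(50, 47, 16), rnorm(50, 55, 16))
)

# Fit the one-way ANOVA once and reuse the fitted object
fit <- aov(sales ~ promotion, data = sim)
summary(fit)
tukey <- TukeyHSD(fit)

# Mean sales per promotion as a plain named vector
group_means <- sapply(split(sim$sales, sim$promotion), mean)

# The "2-1" diff from Tukey equals mean(promotion 2) - mean(promotion 1)
tukey$promotion["2-1", "diff"]
group_means[["2"]] - group_means[["1"]]
```

Tukey's contribution is therefore not the differences themselves but the confidence intervals and adjusted p-values around them, which account for making several comparisons at once.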
Promotions 1 and 3 both significantly outperform promotion 2, and since they are not statistically different from each other, we could go with either one. However, since promotion 1 has the highest mean sales, my recommendation to the marketing team would be to run with promotion 1.
Both T-tests and ANOVA are incredibly useful marketing tools. They can help marketing teams make data-driven choices and optimize for the desired outcomes.