d %>%
filter(past_purch > 0) %>%
ggplot(aes(x=last_purch, fill=group)) +
geom_histogram(binwidth = 25, alpha=0.2, position="identity") +
xlim(0, 500) +
xlab("Last Purchase") + ylab("Customers") +
labs(title="Distribution of purchase recency by group")
For categorical outcomes
success <- d %>%
filter(group!="ctrl") %>%
group_by(group,click) %>%
count() %>%
filter(click==1)
trials <- d %>%
filter(group!="ctrl") %>%
count(group)
prop.test(success$n,trials$n)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: success$n out of trials$n
## X-squared = 367.2, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.03754391 0.04612615
## sample estimates:
## prop 1 prop 2
## 0.13152992 0.08969489
For continuous variables
d <- d %>%
mutate(email = group %in% c("email_B","email_A"))
t.test(purch ~ email, data=d)
##
## Welch Two Sample t-test
##
## data: purch by email
## t = -5.4485, df = 89037, p-value = 5.093e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.9417432 -0.9143263
## sample estimates:
## mean in group FALSE mean in group TRUE
## 12.32166 13.74970
summary(d$last_purch)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 26.00 63.00 89.98 125.00 992.00
d <- d %>% mutate(recent=(last_purch<summary(d$last_purch)[['Median']]))
d %>%
filter(group != "ctrl") %>%
ggplot(aes(x=recent, fill=as.factor(open)))+
geom_bar()
d %>%
filter(group != "ctrl") %>%
ggplot(aes(x=recent, fill=as.factor(click)))+
geom_bar()
Question: Does the email campaign have more of an effect on the purchases of recent shoppers? If so, what is the difference?
Hint: Start by creating a new variable called ‘email’ that equals 1 if group != ‘ctrl’
d <- d %>% mutate(email = (group!="ctrl"))
d %>%
group_by(recent, email) %>%
summarise(purchases = mean(purch), opens=mean(open), clicks=mean(click))
## # A tibble: 4 x 5
## # Groups: recent [2]
## recent email purchases opens clicks
## <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 FALSE FALSE 6.04 0 0
## 2 FALSE TRUE 6.97 0.536 0.0858
## 3 TRUE FALSE 18.6 0 0
## 4 TRUE TRUE 20.6 0.836 0.136
Looks like emails increase purchases by about $2 for recent shoppers, but only about $1 for aged shoppers.
t.test(purch ~ email, data=d %>% filter(!recent))
##
## Welch Two Sample t-test
##
## data: purch by email
## t = -3.539, df = 45112, p-value = 0.0004021
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.4467326 -0.4154129
## sample estimates:
## mean in group FALSE mean in group TRUE
## 6.038011 6.969084
t.test(purch ~ email, data=d %>% filter(recent))
##
## Welch Two Sample t-test
##
## data: purch by email
## t = -4.4129, df = 44388, p-value = 1.022e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.849100 -1.096596
## sample estimates:
## mean in group FALSE mean in group TRUE
## 18.60653 20.57938
Anyone who keeps historic data on customers or visitors has lots of baseline variables available for slicing and dicing:
Question: Does the email campaign have more of an effect on the behavior of those that purchased syrah in the past?
summary(d$syrah > 0)
## Mode FALSE TRUE
## logical 88359 35629
d %>%
group_by(syrah>0, email) %>%
summarise(purchases = mean(purch), opens=mean(open), clicks=mean(click))
## # A tibble: 4 x 5
## # Groups: syrah > 0 [2]
## `syrah > 0` email purchases opens clicks
## <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 FALSE FALSE 11.7 0 0
## 2 FALSE TRUE 12.8 0.657 0.0997
## 3 TRUE FALSE 14.0 0 0
## 4 TRUE TRUE 16.0 0.756 0.138
Slicing and dicing means you will run many significance tests.
You may remember from intro stats that 1 in 20 significance tests at 95% confidence will be significant, when there is no effect. You will get false positives, especially when slicing and dicing.
When you think you’ve found a golden ticket, re-test before betting the company.
Slicing and dicing will reveal two things about subgroups of customers.
“Experiments are used because they provide credible estimates of the effect of an intervention for a sample population. But underlying this average effect for a sample may be substantial variation in how particular respondents respond to treatments: there may be heterogeneous treatment effects.”
– Athey and Imbens, 2015
Businesses should be interested in heterogeneous treatment effects when there is opportunity to apply different treatments to each subgroup (ie targeting).
email -> high potential for targeting
website -> less potential for targeting
We use a regression model to define a relationship between the response (\(y\)) and the treatment (\(x\)).
\(y = a + b \times x + \varepsilon\)
The model literally says that we get the average response by multiplying the treatment indicator (\(x\)) by \(b\) and adding that to \(a\). When we fit a model, we use data to estimate \(a\) and \(b\).
In R, we can shorthand the model equation with an R formula:
m1 <- lm(purch ~ email, data=d)
get_regression_table(m1)
## # A tibble: 2 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 12.3 0.22 56.0 0 11.9 12.8
## 2 emailTRUE 1.43 0.269 5.3 0 0.9 1.96
get_regression_table(m1)
## # A tibble: 2 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 12.3 0.22 56.0 0 11.9 12.8
## 2 emailTRUE 1.43 0.269 5.3 0 0.9 1.96
t.test(purch ~ email, data=d, var.equal = TRUE) # shoudl we assume if hte variance is equal, when testing the means, can we assume the distributions are the same, we can in this case bc
##
## Two Sample t-test
##
## data: purch by email
## t = -5.2996, df = 123990, p-value = 1.162e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.956172 -0.899897
## sample estimates:
## mean in group FALSE mean in group TRUE
## 12.32166 13.74970
# t.test = care about with smaller samples, assume variance is true, youre giving the ttest more power
?t.test
If you like regression, you can use regression to analyze all your tests.
If you don’t like regression, you should try it because it gives you the ability to pull in baseline variables. This is sometimes called “regression correction.”
m2 <- lm(purch ~ email + recent, data=d)
get_regression_table(m2)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 5.69 0.251 22.7 0 5.20 6.18
## 2 emailTRUE 1.45 0.266 5.45 0 0.929 1.97
## 3 recentTRUE 13.3 0.251 52.8 0 12.8 13.8
Aged customers in the control group purchased on average $5.69 in the 30-days after the email was sent. Recent customers in the control group purchased an additional $13.26. The average effect of the email was $1.45.
lm(purch ~ email + last_purch, data=d) %>%
get_regression_table()
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 19.0 0.251 75.5 0 18.5 19.5
## 2 emailTRUE 1.42 0.266 5.32 0 0.896 1.94
## 3 last_purch -0.074 0.001 -52.8 0 -0.077 -0.071
Adding covariates increases the likelihood of finding significant effects. Why?
To incorporate heterogeneous treatment effects, we need an interaction between the treatment effect (\(x\)) and a baseline variable (\(z\)).
When we interact two terms, we are defining a model that multiplies the two terms:
\(y = a + b x + c z + d (x z) + \varepsilon\)
The R formula for this model is:
purch ~ email + recent + email:recent
…or…
purch ~ email*recent
m3 <- lm(purch ~ email + recent + email:recent, data=d)
get_regression_table(m3)
## # A tibble: 4 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 6.04 0.308 19.6 0 5.44 6.64
## 2 emailTRUE 0.931 0.377 2.47 0.013 0.193 1.67
## 3 recentTRUE 12.6 0.435 28.9 0 11.7 13.4
## 4 emailTRUE:recentTRUE 1.04 0.533 1.96 0.051 -0.003 2.09
The email effect is $0.93 for aged customers plus an additional $1.04 for newer customers (total of $1.99).
An uplift model is a regression model that incorporates many baseline variables.
m4 <- lm(purch ~ email*recent + email*(past_purch > 50) + email*(visits > 3) +
email*(chard > 0) + email*(sav_blanc>0) + email*(syrah>0) + email*(cab>0),
data=d)
get_regression_table(m4)
## # A tibble: 16 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept -0.315 0.596 -0.528 0.597 -1.48 0.854
## 2 emailTRUE -0.527 0.729 -0.723 0.47 -1.96 0.902
## 3 recentTRUE 12.5 0.432 28.9 0 11.6 13.3
## 4 past_purch > 50TRUE 3.08 0.654 4.71 0 1.80 4.36
## 5 visits > 3TRUE 1.44 0.566 2.54 0.011 0.327 2.55
## 6 chard > 0TRUE 4.88 0.573 8.51 0 3.75 6
## 7 sav_blanc > 0TRUE 4.62 0.574 8.05 0 3.49 5.74
## 8 syrah > 0TRUE 1.36 0.517 2.63 0.008 0.348 2.38
## 9 cab > 0TRUE 0.475 0.509 0.932 0.351 -0.523 1.47
## 10 emailTRUE:recentTRUE 1.15 0.53 2.17 0.03 0.114 2.19
## 11 emailTRUE:past_purch … 0.04 0.801 0.05 0.96 -1.53 1.61
## 12 emailTRUE:visits > 3T… 0.901 0.692 1.30 0.193 -0.456 2.26
## 13 emailTRUE:chard > 0TR… 0.570 0.702 0.813 0.416 -0.805 1.95
## 14 emailTRUE:sav_blanc >… 1.42 0.701 2.02 0.044 0.041 2.79
## 15 emailTRUE:syrah > 0TR… 0.702 0.634 1.11 0.268 -0.541 1.94
## 16 emailTRUE:cab > 0TRUE -0.49 0.623 -0.787 0.431 -1.71 0.731
If you have someone who wasn’t in the test, but you know their baseline variables, you can use an uplift model to predict likely treatment effect.
new_cust <- slice(d,1001) %>% select(chard, sav_blanc,syrah, cab, past_purch,recent,visits)
new_cust
## # A tibble: 1 x 7
## chard sav_blanc syrah cab past_purch recent visits
## <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <int>
## 1 0 0 0 0 0 FALSE 4
(pred <- predict(m4, cbind(email=c(TRUE, FALSE), new_cust)))
## 1 2
## 1.495629 1.121613
(lift <- pred[1] - pred[2])
## 1
## 0.3740154
This random customer is predicted to buy $29.74 if they get an email or $26.18 without, for a uplift of $3.55.
new_cust <- slice(d,25) %>% select(chard, sav_blanc,syrah, cab, past_purch,recent,visits)
new_cust
## # A tibble: 1 x 7
## chard sav_blanc syrah cab past_purch recent visits
## <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <int>
## 1 0 0 191. 0 191. FALSE 6
(pred <- predict(m4, cbind(email=c(TRUE, FALSE), new_cust)))
## 1 2
## 6.680073 5.564312
(lift <- pred[1] - pred[2])
## 1
## 1.115761
If treatments are costly (eg catalogs, discounts), then we should target customers that we predict will have a positive effect that exceeds costs.
### <b>
d <- d %>% mutate(lpurch=log(purch+1))
### </b>
m <- lm(lpurch ~ email*recent, data=d)
get_regression_points(m) %>%
mutate(sq_residuals = residual^2) %>%
summarize(ssr = sum(sq_residuals))
## # A tibble: 1 x 1
## ssr
## <dbl>
## 1 286696.
\(R^2=1-\frac{var(resid)}{var(y)}\)
Definition: the proportion of the total variation in the outcome variable y that the model explains.
Why is var(y) guaranteed to be >= var(residuals)?
Note: \(var(s^2)=\frac{\sum(x_i-\bar{x})^2}{n-1}\)
m <- lm(lpurch ~ email*recent, data=d)
get_regression_summaries(m)
## # A tibble: 1 x 8
## r_squared adj_r_squared mse rmse sigma statistic p_value df
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.065 0.065 2.31 1.52 1.52 2886. 0 4
Creating the train and test sets
# Randomly shuffle order of rows:
d_shuffled <- d %>%
sample_frac(size = 1, replace = FALSE)
# Split into train and test:
train <- d_shuffled %>%
slice(1:100000)
test <- d_shuffled %>%
slice(100001:123988)
m <- lm(log(purch+1) ~ email*recent, data=train)
get_regression_table(m)
## # A tibble: 4 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 0.371 0.012 31.6 0 0.348 0.394
## 2 emailTRUE 0.045 0.014 3.13 0.002 0.017 0.073
## 3 recentTRUE 0.756 0.017 45.4 0 0.723 0.788
## 4 emailTRUE:recentTRUE 0.063 0.02 3.08 0.002 0.023 0.103
get_regression_points(m, newdata = test) %>%
mutate(sq_residuals = residual^2) %>%
summarize(rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 1
## rmse
## <dbl>
## 1 46.8
c <- lm(purch ~ email*last_purch, data=d)
get_regression_table(c)
## # A tibble: 4 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 18.4 0.307 60.0 0 17.8 19.0
## 2 emailTRUE 2.22 0.377 5.88 0 1.48 2.96
## 3 last_purch -0.068 0.002 -28.2 0 -0.073 -0.063
## 4 emailTRUE:last_purch -0.009 0.003 -3 0.003 -0.015 -0.003