Agenda

A/B Testing Review (and upgrade)
Multiple Regression
Interactions and Uplift Modeling
Regression Diagnostics

But first…

A/B testing review (and upgrade)

What did we learn last time?

Randomization checks
Difference in proportions
Difference in means
How to calculate sample size (power)

Randomization checks

d %>% 
  filter(past_purch > 0) %>% 
ggplot(aes(x=last_purch, fill=group)) + 
  geom_histogram(binwidth = 25, alpha=0.2, position="identity") +
  xlim(0, 500) + 
  xlab("Last Purchase") + ylab("Customers") + 
  labs(title="Distribution of purchase recency by group")

Testing differences in proportions

For categorical outcomes

success <- d %>% 
  filter(group!="ctrl") %>% 
  group_by(group,click) %>% 
  count() %>% 
  filter(click==1)
trials <- d %>% 
  filter(group!="ctrl") %>% 
  count(group)

prop.test(success$n,trials$n)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  success$n out of trials$n
## X-squared = 367.2, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.03754391 0.04612615
## sample estimates:
##     prop 1     prop 2 
## 0.13152992 0.08969489

Testing differences in means

For continuous variables

d <- d %>% 
  mutate(email = group %in% c("email_B","email_A"))
t.test(purch ~ email, data=d)

## 
##  Welch Two Sample t-test
## 
## data:  purch by email
## t = -5.4485, df = 89037, p-value = 5.093e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.9417432 -0.9143263
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            12.32166            13.74970

(Upgrade) experiments within experiments

Consider the customers who have made a purchase in the last 60-ish days.
Within that subset, customers were randomly assigned to recieve email A, email B or no email.
So, we can analyze the data for a subgroup as it’s own test by slicing down and then re-analyzing.
However, we will only have enough sample in the subgroup if our initial test is big enough.

Exercise

Create a new variable called ‘recent’ that equals 1 (TRUE) if last_purch < median(last_purch)
create bar graphs that show the difference in 1) opens, and 2) clicks
by the newly created variable

Solution opens

summary(d$last_purch)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   26.00   63.00   89.98  125.00  992.00

d <- d %>% mutate(recent=(last_purch<summary(d$last_purch)[['Median']]))
d %>%
  filter(group != "ctrl") %>% 
  ggplot(aes(x=recent, fill=as.factor(open)))+
    geom_bar()

Solution clicks

d %>%
  filter(group != "ctrl") %>%
  ggplot(aes(x=recent, fill=as.factor(click)))+
    geom_bar()

Exercise

Question: Does the email campaign have more of an effect on the purchases of recent shoppers? If so, what is the difference?

Hint: Start by creating a new variable called ‘email’ that equals 1 if group != ‘ctrl’

Solution

d <- d %>% mutate(email = (group!="ctrl"))

d %>%
  group_by(recent, email) %>% 
  summarise(purchases = mean(purch), opens=mean(open), clicks=mean(click))

## # A tibble: 4 x 5
## # Groups:   recent [2]
##   recent email purchases opens clicks
##   <lgl>  <lgl>     <dbl> <dbl>  <dbl>
## 1 FALSE  FALSE      6.04 0     0     
## 2 FALSE  TRUE       6.97 0.536 0.0858
## 3 TRUE   FALSE     18.6  0     0     
## 4 TRUE   TRUE      20.6  0.836 0.136

Looks like emails increase purchases by about $2 for recent shoppers, but only about $1 for aged shoppers.

Testing the difference for aged

t.test(purch ~ email, data=d %>% filter(!recent))

## 
##  Welch Two Sample t-test
## 
## data:  purch by email
## t = -3.539, df = 45112, p-value = 0.0004021
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4467326 -0.4154129
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            6.038011            6.969084

Testing the difference for recent

t.test(purch ~ email, data=d %>% filter(recent))

## 
##  Welch Two Sample t-test
## 
## data:  purch by email
## t = -4.4129, df = 44388, p-value = 1.022e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.849100 -1.096596
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            18.60653            20.57938

We slice based on baseline variables

Anyone who keeps historic data on customers or visitors has lots of baseline variables available for slicing and dicing:

website visits (to particular parts of the site)
sign-ups
geographic location
source
past purchase (by category)
recency
frequency

Exercise

Question: Does the email campaign have more of an effect on the behavior of those that purchased syrah in the past?

summary(d$syrah > 0)

##    Mode   FALSE    TRUE 
## logical   88359   35629

Solution (table)

d %>%
  group_by(syrah>0, email) %>% 
  summarise(purchases = mean(purch), opens=mean(open), clicks=mean(click))

## # A tibble: 4 x 5
## # Groups:   syrah > 0 [2]
##   `syrah > 0` email purchases opens clicks
##   <lgl>       <lgl>     <dbl> <dbl>  <dbl>
## 1 FALSE       FALSE      11.7 0     0     
## 2 FALSE       TRUE       12.8 0.657 0.0997
## 3 TRUE        FALSE      14.0 0     0     
## 4 TRUE        TRUE       16.0 0.756 0.138

Repeated significance testing

Slicing and dicing means you will run many significance tests.

You may remember from intro stats that 1 in 20 significance tests at 95% confidence will be significant, when there is no effect. You will get false positives, especially when slicing and dicing.

When you think you’ve found a golden ticket, re-test before betting the company.

Slicing and dicing: Summary

Slicing and dicing will reveal two things about subgroups of customers.

Subgroups will vary in how much they engage in behaviors
- Recent buyers tend to have higher average purchases in the future
Subgroups vary in how they respond to treatments
- Recent buyers are more affected by the email

Multiple regression

Heterogeneous treatment effects

“Experiments are used because they provide credible estimates of the effect of an intervention for a sample population. But underlying this average effect for a sample may be substantial variation in how particular respondents respond to treatments: there may be heterogeneous treatment effects.”

– Athey and Imbens, 2015

Heterogeneous treatment effects and targeting

Businesses should be interested in heterogeneous treatment effects when there is opportunity to apply different treatments to each subgroup (ie targeting).

email -> high potential for targeting

website -> less potential for targeting

Analyzing experiments with regression

We use a regression model to define a relationship between the response ($y$) and the treatment ($x$).

$y = a + b \times x + \varepsilon$

The model literally says that we get the average response by multiplying the treatment indicator ($x$) by $b$ and adding that to $a$. When we fit a model, we use data to estimate $a$ and $b$.

In R, we can shorthand the model equation with an R formula:

m1 <- lm(purch ~ email, data=d)
get_regression_table(m1)

## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    12.3      0.22       56.0       0     11.9    12.8 
## 2 emailTRUE     1.43     0.269       5.3       0      0.9     1.96

Regression versus significance test

get_regression_table(m1)

## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    12.3      0.22       56.0       0     11.9    12.8 
## 2 emailTRUE     1.43     0.269       5.3       0      0.9     1.96

t.test(purch ~ email, data=d, var.equal = TRUE) # shoudl we assume if hte variance is equal, when testing the means, can we assume the distributions are the same, we can in this case bc

## 
##  Two Sample t-test
## 
## data:  purch by email
## t = -5.2996, df = 123990, p-value = 1.162e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.956172 -0.899897
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            12.32166            13.74970

# t.test = care about with smaller samples, assume variance is true, youre giving the ttest more power

?t.test

Regression versus significance tests

If you like regression, you can use regression to analyze all your tests.

If you don’t like regression, you should try it because it gives you the ability to pull in baseline variables. This is sometimes called “regression correction.”

Model including baseline variables

m2 <- lm(purch ~ email + recent, data=d)
get_regression_table(m2)

## # A tibble: 3 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept      5.69     0.251     22.7        0    5.20      6.18
## 2 emailTRUE      1.45     0.266      5.45       0    0.929     1.97
## 3 recentTRUE    13.3      0.251     52.8        0   12.8      13.8

Aged customers in the control group purchased on average $5.69 in the 30-days after the email was sent. Recent customers in the control group purchased an additional $13.26. The average effect of the email was $1.45.

Model with continuous baseline variable

lm(purch ~ email + last_purch, data=d) %>% 
  get_regression_table()

## # A tibble: 3 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    19.0       0.251     75.5        0   18.5     19.5  
## 2 emailTRUE     1.42      0.266      5.32       0    0.896    1.94 
## 3 last_purch   -0.074     0.001    -52.8        0   -0.077   -0.071

Adding covariates increases the likelihood of finding significant effects. Why?

Interactions and uplift modeling

Incorporating heterogeneous treatment effects

To incorporate heterogeneous treatment effects, we need an interaction between the treatment effect ($x$) and a baseline variable ($z$).

When we interact two terms, we are defining a model that multiplies the two terms:

$y = a + b x + c z + d (x z) + \varepsilon$

The R formula for this model is:

purch ~ email + recent + email:recent

…or…

purch ~ email*recent

Incorporating heterogeneous treatment effects

m3 <- lm(purch ~ email + recent + email:recent, data=d)
get_regression_table(m3)

## # A tibble: 4 x 7
##   term                 estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept               6.04      0.308     19.6    0        5.44      6.64
## 2 emailTRUE               0.931     0.377      2.47   0.013    0.193     1.67
## 3 recentTRUE             12.6       0.435     28.9    0       11.7      13.4 
## 4 emailTRUE:recentTRUE    1.04      0.533      1.96   0.051   -0.003     2.09

The email effect is $0.93 for aged customers plus an additional $1.04 for newer customers (total of $1.99).

An uplift model for purchase amount

An uplift model is a regression model that incorporates many baseline variables.

m4 <- lm(purch ~ email*recent + email*(past_purch > 50) + email*(visits > 3) +
                 email*(chard > 0) + email*(sav_blanc>0) + email*(syrah>0) + email*(cab>0), 
         data=d)
get_regression_table(m4)

## # A tibble: 16 x 7
##    term                   estimate std_error statistic p_value lower_ci upper_ci
##    <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
##  1 intercept                -0.315     0.596    -0.528   0.597   -1.48     0.854
##  2 emailTRUE                -0.527     0.729    -0.723   0.47    -1.96     0.902
##  3 recentTRUE               12.5       0.432    28.9     0       11.6     13.3  
##  4 past_purch > 50TRUE       3.08      0.654     4.71    0        1.80     4.36 
##  5 visits > 3TRUE            1.44      0.566     2.54    0.011    0.327    2.55 
##  6 chard > 0TRUE             4.88      0.573     8.51    0        3.75     6    
##  7 sav_blanc > 0TRUE         4.62      0.574     8.05    0        3.49     5.74 
##  8 syrah > 0TRUE             1.36      0.517     2.63    0.008    0.348    2.38 
##  9 cab > 0TRUE               0.475     0.509     0.932   0.351   -0.523    1.47 
## 10 emailTRUE:recentTRUE      1.15      0.53      2.17    0.03     0.114    2.19 
## 11 emailTRUE:past_purch …    0.04      0.801     0.05    0.96    -1.53     1.61 
## 12 emailTRUE:visits > 3T…    0.901     0.692     1.30    0.193   -0.456    2.26 
## 13 emailTRUE:chard > 0TR…    0.570     0.702     0.813   0.416   -0.805    1.95 
## 14 emailTRUE:sav_blanc >…    1.42      0.701     2.02    0.044    0.041    2.79 
## 15 emailTRUE:syrah > 0TR…    0.702     0.634     1.11    0.268   -0.541    1.94 
## 16 emailTRUE:cab > 0TRUE    -0.49      0.623    -0.787   0.431   -1.71     0.731

# Interactions

Scoring customers with an uplift model

If you have someone who wasn’t in the test, but you know their baseline variables, you can use an uplift model to predict likely treatment effect.

new_cust <- slice(d,1001) %>% select(chard, sav_blanc,syrah, cab, past_purch,recent,visits)
new_cust

## # A tibble: 1 x 7
##   chard sav_blanc syrah   cab past_purch recent visits
##   <dbl>     <dbl> <dbl> <dbl>      <dbl> <lgl>   <int>
## 1     0         0     0     0          0 FALSE       4

(pred <- predict(m4, cbind(email=c(TRUE, FALSE), new_cust)))

##        1        2 
## 1.495629 1.121613

(lift <- pred[1] - pred[2])

##         1 
## 0.3740154

This random customer is predicted to buy $29.74 if they get an email or $26.18 without, for a uplift of $3.55.

Scoring for another (worse) customer

new_cust <- slice(d,25) %>% select(chard, sav_blanc,syrah, cab, past_purch,recent,visits)
new_cust

## # A tibble: 1 x 7
##   chard sav_blanc syrah   cab past_purch recent visits
##   <dbl>     <dbl> <dbl> <dbl>      <dbl> <lgl>   <int>
## 1     0         0  191.     0       191. FALSE       6

(pred <- predict(m4, cbind(email=c(TRUE, FALSE), new_cust)))

##        1        2 
## 6.680073 5.564312

(lift <- pred[1] - pred[2])

##        1 
## 1.115761

Why uplift modeling?

If treatments are costly (eg catalogs, discounts), then we should target customers that we predict will have a positive effect that exceeds costs.

Regression diagnostics

Sum of squared residuals

Find the residuals (error terms)
Square them
Add them all up

### <b>
d <- d %>% mutate(lpurch=log(purch+1))
### </b>
m <- lm(lpurch ~ email*recent, data=d)
get_regression_points(m) %>% 
  mutate(sq_residuals = residual^2) %>%
  summarize(ssr = sum(sq_residuals))

## # A tibble: 1 x 1
##       ssr
##     <dbl>
## 1 286696.

R-squared

$R^2=1-\frac{var(resid)}{var(y)}$

Definition: the proportion of the total variation in the outcome variable y that the model explains.

Why is var(y) guaranteed to be >= var(residuals)?

Note: $var(s^2)=\frac{\sum(x_i-\bar{x})^2}{n-1}$

RMSE

Error = residuals
Square each error
Mean of that squared error
Root of that mean squared error

Roughly, the quality of a model’s predictions

Example

m <- lm(lpurch ~ email*recent, data=d)
get_regression_summaries(m)

## # A tibble: 1 x 8
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl>
## 1     0.065         0.065  2.31  1.52  1.52     2886.       0     4

Cross Validation - test/train

Creating the train and test sets

# Randomly shuffle order of rows:
d_shuffled <- d %>% 
  sample_frac(size = 1, replace = FALSE)
# Split into train and test:
train <- d_shuffled %>%
  slice(1:100000)
test <- d_shuffled %>%
  slice(100001:123988)

m <- lm(log(purch+1) ~ email*recent, data=train)
get_regression_table(m)

## # A tibble: 4 x 7
##   term                 estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept               0.371     0.012     31.6    0        0.348    0.394
## 2 emailTRUE               0.045     0.014      3.13   0.002    0.017    0.073
## 3 recentTRUE              0.756     0.017     45.4    0        0.723    0.788
## 4 emailTRUE:recentTRUE    0.063     0.02       3.08   0.002    0.023    0.103

Cross Validation - validate

get_regression_points(m, newdata = test) %>% 
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))

## # A tibble: 1 x 1
##    rmse
##   <dbl>
## 1  46.8

Bonus: Continuous by categorical interactions

c <- lm(purch ~ email*last_purch, data=d)
get_regression_table(c)

## # A tibble: 4 x 7
##   term                 estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept              18.4       0.307     60.0    0       17.8     19.0  
## 2 emailTRUE               2.22      0.377      5.88   0        1.48     2.96 
## 3 last_purch             -0.068     0.002    -28.2    0       -0.073   -0.063
## 4 emailTRUE:last_purch   -0.009     0.003     -3      0.003   -0.015   -0.003

A/B Testing & Multiple Regression/Diagnostics

Matt

11/8/2019

Agenda

But first…

A/B testing review (and upgrade)

What did we learn last time?

Randomization checks

Testing differences in proportions

Testing differences in means

(Upgrade) experiments within experiments

Exercise

Solution opens

Solution clicks

Exercise

Solution

Testing the difference for aged

Testing the difference for recent

We slice based on baseline variables

Exercise

Solution (table)

Repeated significance testing

Slicing and dicing: Summary

Multiple regression

Heterogeneous treatment effects

Heterogeneous treatment effects and targeting

Analyzing experiments with regression

Regression versus significance test

Regression versus significance tests

Model including baseline variables

Model with continuous baseline variable

Interactions and uplift modeling

Incorporating heterogeneous treatment effects

Incorporating heterogeneous treatment effects

An uplift model for purchase amount

# Interactions

Scoring customers with an uplift model

Scoring for another (worse) customer

Why uplift modeling?

Regression diagnostics

Sum of squared residuals

R-squared

RMSE

Roughly, the quality of a model’s predictions

Example

Cross Validation - test/train

Cross Validation - validate

Bonus: Continuous by categorical interactions