Load data and pre-process

As a first pass, I want to understand the type and breadth of the data. I see that the main table, users is unique by user, which there are 100. I also see that a few variables (i.e. discovery_time) are not in the proper format, so I use this space to update data types. Once I’ve updated data types, I want to make a master table, pairing users with their orders. While I would typically do a left join of users onto orders to make a master data table, there is no overlap in user_id of the two tables.

## Load
users <- data.table(read.csv("users.csv", stringsAsFactors = F))
orders <- data.table(read.csv("orders.csv", stringsAsFactors = F))
products <- data.table(read.csv("products.csv", stringsAsFactors = F))

## Take a look-see
# Users & adding a variable for day of the week
summary(users)
##     user_id        discovery_channel_category discovery_channel 
##  Min.   : 253430   Length:100                 Length:100        
##  1st Qu.:1683500   Class :character           Class :character  
##  Median :2033378   Mode  :character           Mode  :character  
##  Mean   :1979648                                                
##  3rd Qu.:2365550                                                
##  Max.   :2664888                                                
##  discovery_time     discovery_platform   location        
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##     gender         
##  Length:100        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
sapply(users, class)
##                    user_id discovery_channel_category 
##                  "integer"                "character" 
##          discovery_channel             discovery_time 
##                "character"                "character" 
##         discovery_platform                   location 
##                "character"                "character" 
##                     gender 
##                "character"
users[, uniqueN(users)]
## [1] 100
users[, discovery_time :=  ymd_hms(discovery_time)]
users[, discovery_time := as.POSIXct(format(discovery_time, tz="America/Los_Angeles",usetz=TRUE))]
users[, discoveryDay := weekdays(discovery_time)]

# Orders & adding a variable for day of the week
summary(orders)
##   line_item_id         order_id       completed_at          user_id       
##  Min.   :10741316   Min.   :2241571   Length:100         Min.   : 336966  
##  1st Qu.:15003013   1st Qu.:3045172   Class :character   1st Qu.:1465237  
##  Median :17350712   Median :3525374   Mode  :character   Median :1774804  
##  Mean   :16960849   Mean   :3421079                      Mean   :1797260  
##  3rd Qu.:18992314   3rd Qu.:3800378                      3rd Qu.:2278386  
##  Max.   :20022352   Max.   :4028413                      Max.   :2597792  
##     quantity        price          style_id    order_channel     
##  Min.   :1.00   Min.   : 12.0   Min.   : 437   Length:100        
##  1st Qu.:1.00   1st Qu.: 54.5   1st Qu.: 574   Class :character  
##  Median :1.00   Median : 79.5   Median : 649   Mode  :character  
##  Mean   :1.01   Mean   : 84.1   Mean   : 682                     
##  3rd Qu.:1.00   3rd Qu.:100.0   3rd Qu.: 760                     
##  Max.   :2.00   Max.   :225.0   Max.   :1020                     
##  order_channel_category
##  Length:100            
##  Class :character      
##  Mode  :character      
##                        
##                        
## 
sapply(orders, class)
##           line_item_id               order_id           completed_at 
##              "integer"              "integer"            "character" 
##                user_id               quantity                  price 
##              "integer"              "integer"              "numeric" 
##               style_id          order_channel order_channel_category 
##              "integer"            "character"            "character"
orders[, completed_at :=  ymd_hms(completed_at)]
orders[, completed_at := as.POSIXct(format(completed_at, tz="America/Los_Angeles",usetz=TRUE))]
orders[, completedDay := weekdays(completed_at)]


# Products 
summary(products)
##     style_id       style             category         launch_date       
##  Min.   : 433   Length:100         Length:100         Length:100        
##  1st Qu.: 576   Class :character   Class :character   Class :character  
##  Median : 748   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 734                                                           
##  3rd Qu.: 865                                                           
##  Max.   :1086
sapply(products, class)
##    style_id       style    category launch_date 
##   "integer" "character" "character" "character"
products[, launch_date :=  ymd_hms(launch_date)]
products[, launch_date := as.POSIXct(format(launch_date, tz="America/Los_Angeles",usetz=TRUE))]
products[, launchDay := weekdays(launch_date)]

Question 1a

I would define a successful launch in the following ways: - Were users made aware that a product had launched? - Did users who saw the product via ads, social media, or on the landing page go to the product and purchase it? (The proportion of users who saw it and thusly bought it) - Did users buy the product soon after the launch?

I want to know whether users are looking at the product and whether they’re buying the product. Are people responding to the ads, are they coming to the landing page, are they going to the product and eventually buying? Are they keeping it? are they giving positive reviews? Most importantly, a successful product launch generates more business in general and creates a broader customer base. However, since I don’t have this data, I use a created variable, ordersByDay as a proxy. # Question 1b Ideally, I would bring the users data onto the products and orders data, but, since there is no overlap in users between the data sets, I make a merged table of orders onto the product information. Once merged, I first subset the data by products that have a) sold and b) we have launch information about. After merging the product data onto the order records, I drop any rows that do not have launch dates, as they will not be useful in these analytics. Once subset, I get a count of products that were sold and the length of time between each point of sale and their launch dates. As a next point of inquiry, I look to the order_channel-category to understand how users were best targeted.

Per the data, there was an uptick in sales during the latter half of 2017, soon after the launch of the Cotton Long Sleeve V-Neck Crop and the Cotton Turtleneck; presumably, the fall launch was a greater success than previous launches in terms of bringing more customers and orders in. However, there is no conclusive evidence to support this hypothesis with the data on hand. One consideration is understanding whether orders were already rising, despite the launch in August. If orders were rising, it would be valuable to know what else changed within the company. #Question 1c Launches generally are opening many avenues for analytics that could drive products to test what customers want, how much they want it for, when they want to purchase it, and how to contact them and advertise. While the data available isn’t conclusive about what makes a launch successful, the team could gather more data to paint a fuller picture. For example, there may be times of day, days of week, months, or seasons associated with more business. Consequently, when the launch happens may be impacting its success. I believe that there is an opportunity to investigate seasonality of launches (using time series and forecasting) to better understand how a launch will be received. In terms of a launch bringing in new customers, the data reveals that paid means of outreach are the most successful. Thus, there could be some analytics done to look at the change in returns to paid advertisement during periods of a launch. Analytics should also be performed on customers’ price points: there may be associations with how much customers are spending per product and on total orders. on one hand, during a launch, they may be willing to pay more for newer, more seasonal items but there may also be increased traffic to items that are older and that have dropped in price. The data may reveal that launches are creating windows of opportunity for sales and an opportunity to drive more business to the site.

# Make a master data table
setkey(orders, style_id)
setkey(products, style_id)
orderProducts <- products[orders]

## which products have sold? how many?
orderProducts[, numSold := sum(quantity), by = style_id]

## How long have products been on the market at the point of sale?
orderProducts[, dateOnMkt := max(completed_at[!is.na(completed_at)]) - 
                launch_date[!is.na(launch_date)], by = style_id]
orderProducts[, dateOnMkt := completed_at - launch_date]

## how were those customers targeted?
orderProducts[, numOrderChannel := .N, by = order_channel]
orderProducts[, numOrderChannelType := .N, by = order_channel_category]

## within the dates we have available, what proportion of products have been purchased?
length(orders[completed_at >=  min(products$launch_date),  unique(style_id)] %in% uniqueN(products$style_id))/uniqueN(products$style_id)
## [1] 0.74
## Let's look at the number of orders sold and the money generated
orderProducts[, ordersByDay :=  uniqueN(order_id), by = as.Date(completed_at)]
orderProducts[, salesByDay := sum(price), by = as.Date(completed_at)]

## Look for change in orders attributable to a variable
## what about understand proximity to launch?
### simple linear regression of all variables of interest
lmFull <- lm(data = orderProducts,formula = ordersByDay ~ as.factor(category) + as.integer(launch_date) + price + as.integer(dateOnMkt) + as.factor(launchDay) + as.factor(completedDay))
summary(lmFull)
## 
## Call:
## lm(formula = ordersByDay ~ as.factor(category) + as.integer(launch_date) + 
##     price + as.integer(dateOnMkt) + as.factor(launchDay) + as.factor(completedDay), 
##     data = orderProducts)
## 
## Residuals:
##                       5                       7                       9 
##  0.03123034173973376210 -0.00000000000000001041  0.03123034173973377944 
##                      14                      24                      27 
## -0.03123034173973375516 -0.00000000000000000260 -0.09358217370900871745 
##                      28                      29                      30 
##  0.07808180752732210572 -0.01138950275940101356  0.02688986894108761835 
##                      39                      51                      52 
## -0.00000000000000005204 -0.11985602418487037446 -0.09309030061551139268 
##                      54                      55                      57 
##  0.01488434771646066253  0.00000000000000000347  0.00000000000000000347 
##                      69                      77                      85 
##  0.22929231882365472273  0.00000000000000003469 -0.00367038519974558462 
##                      86                      92                      93 
## -0.02755995653998816186 -0.22929231882365472273  0.00000000000000002776 
##                      95                      96                      99 
## -0.18831588485510675213  0.15708554311537309411 -0.00000000000000010755 
##                     100 
##  0.22929231882365486150 
## 
## Coefficients:
##                                              Estimate    Std. Error
## (Intercept)                             57.2915805693 31.3521374049
## as.factor(category)Knit Dresses         -0.8101648837  0.9010852593
## as.factor(category)Knit Tops            -1.6888929986  1.3491026223
## as.factor(category)Outerwear             0.2121042876  0.4593207248
## as.factor(category)Seasonal Accessories  1.3144586344  0.6101940010
## as.factor(category)SLGs                 -2.5552419688  1.8470073728
## as.factor(category)Sweaters             -0.5046232322  0.6465289653
## as.factor(category)Woven Tops           -0.7187783201  0.8440820423
## as.integer(launch_date)                 -0.0000000366  0.0000000202
## price                                   -0.0201640528  0.0143648318
## as.integer(dateOnMkt)                   -0.0004875423  0.0007084754
## as.factor(launchDay)Sunday               0.2448068985  0.4115266723
## as.factor(launchDay)Tuesday              0.0161031063  0.2892112148
## as.factor(launchDay)Wednesday            0.1321649319  0.4236267462
## as.factor(completedDay)Monday            1.6338656551  0.8195608506
## as.factor(completedDay)Saturday          2.3507396039  1.9295087537
## as.factor(completedDay)Sunday            0.1476363739  0.5478388358
## as.factor(completedDay)Thursday          0.5048352212  0.7013210440
## as.factor(completedDay)Tuesday           0.2486067257  0.5563581589
## as.factor(completedDay)Wednesday        -0.8264345641  0.9127433245
##                                         t value Pr(>|t|)  
## (Intercept)                                1.83    0.127  
## as.factor(category)Knit Dresses           -0.90    0.410  
## as.factor(category)Knit Tops              -1.25    0.266  
## as.factor(category)Outerwear               0.46    0.664  
## as.factor(category)Seasonal Accessories    2.15    0.084 .
## as.factor(category)SLGs                   -1.38    0.225  
## as.factor(category)Sweaters               -0.78    0.470  
## as.factor(category)Woven Tops             -0.85    0.433  
## as.integer(launch_date)                   -1.81    0.130  
## price                                     -1.40    0.219  
## as.integer(dateOnMkt)                     -0.69    0.522  
## as.factor(launchDay)Sunday                 0.59    0.578  
## as.factor(launchDay)Tuesday                0.06    0.958  
## as.factor(launchDay)Wednesday              0.31    0.768  
## as.factor(completedDay)Monday              1.99    0.103  
## as.factor(completedDay)Saturday            1.22    0.277  
## as.factor(completedDay)Sunday              0.27    0.798  
## as.factor(completedDay)Thursday            0.72    0.504  
## as.factor(completedDay)Tuesday             0.45    0.674  
## as.factor(completedDay)Wednesday          -0.91    0.407  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.228 on 5 degrees of freedom
##   (75 observations deleted due to missingness)
## Multiple R-squared:  0.922,  Adjusted R-squared:  0.628 
## F-statistic: 3.13 on 19 and 5 DF,  p-value: 0.105

Graphics for Question 1

The first graph shows orders by day over time with a fitted line to show trends; I can see that orders are rising over time. Similarly, with sales, profits seem to be rising over time. The third graphic shows the same orders performance over time, but categorized by the order channel; it seems many of the orders are being driven by bloggers and influencers. Lastly, and arguably most importantly, I want to see how sales are being influenced, if at all, by launch dates. Although it may not be attributable, there is a sharp rise in orders after the August 2017 launch (presumably a Fall collection).

## Launch performance (orders)
plot_ly(data = orderProducts, x = ~completed_at, y = ~ordersByDay, symbols = c('circle','o','x', 'square', 'triangle'),type = "scatter") %>% 
  layout(title = "Orders by Day: Mar 2016 - Dec 2017") %>%
  add_lines(y = ~fitted(loess(as.numeric(ordersByDay) ~ as.numeric(completed_at))),
            line = list(color = 'red'),
            name = "Loess Smoother", showlegend = F)
## Launch performance (sales)
plot_ly(data = orderProducts, x = ~completed_at, y = ~salesByDay, symbols = c('circle','o','x', 'square', 'triangle'),type = "scatter") %>% 
  layout(title = "Saless by Day: Mar 2016 - Dec 2017") %>%
  add_lines(y = ~fitted(loess(salesByDay ~ as.numeric(completed_at))),
            line = list(color = 'red'),
            name = "Loess Smoother", showlegend = F)
## Launch performance by order channel
plot_ly(data = orderProducts, x = ~completed_at, y = ~ordersByDay, color = ~order_channel ,symbols = c('circle','o','x', 'square', 'triangle'),type = "bar") %>% 
  layout(title = "Orders Grouped by Discovery Channel", xaxis = list("Order Date"), yaxis = list("# Orders by Day"))
## By launch
ggplot(data = orderProducts, mapping = aes(x = completed_at, y = ordersByDay)) + 
  geom_smooth(se = F) + 
  geom_point() + theme_minimal() + 
  geom_vline(aes(xintercept=as.numeric(orderProducts[, launch_date])), linetype = "dotted") +
  labs("Orders over Time w/ Launch Intervals")

Question 2a

The strength of customer acquisition can be measured in a simple ratio: how many resources are expended (money, time, employees) in relation to how many customers are gained. In turn, I would then measure the kinds of customers being drawn in to understand whether the marketing had targeted the right customer base, which can be indicated by whether a customer is returning to Everlane post-first purchase, whether they are drawing in others from their network, and whether they are satisfied with the products. From here, there are myriad data points that would help support these analytics to further zoom in on customer acquisition strength; these metrics are discussed in detail in 2c.

Question 2b

I want to get a more comprehensive breakdown of how users are discovering Everlane, so I combing discovery_channel and discovery_channel_category into a single column. The data that we have available is all customers who were brought to Everlane using paid advertisements. I see that, of all users, 51% are drawn in by paid channels, 21% directly, 19% via organic search, and just 9% through email. Within the paid category, social and display discovery channels have the highest performance. Given the user data alone, I would say that paid, social channels are the most valuable.

To further confirm my hypothesis, I look at the users who have placed orders in the past two years. Of all the orders placed, 100% have been driven by paid means. The most successful paid avenue was a direct search, followed by other means; social channels only drove a little over 10% of orders.

As a final check I run two simple models: the effect on all discovery channels on the number of new users by day (userByDay) and the effect of unpaid vs. paid channels on the same variable. In the first model, I find no significant effect of any channel on new user acquisition. In the second, I find significant in the intercept, which leads me to believe that paid means do have effect on increased customer acquisition. However, the model is incredibly weak.

While I can, with confidence, say that paid means are more successful than unpaid, the data is insufficient to measure the kind of paid marketing that drives the highest user acquisition and the best user acquisition.

# users
users[, .N, by = discovery_channel_category]
##    discovery_channel_category  N
## 1:             organic search 19
## 2:                     direct 21
## 3:                      email  9
## 4:                       paid 51
users[discovery_channel == "", discovery_channel := discovery_channel_category]
users[, .N, by = discovery_channel]
##    discovery_channel  N
## 1:    organic search 19
## 2:            direct 21
## 3:             email  9
## 4:         affiliate 13
## 5:           display  8
## 6:             other  7
## 7:            social 16
## 8:            search  7
users[, userByDay := uniqueN(user_id), by = as.Date(discovery_time)]

# orders
orders[, .N, by = .(order_channel, order_channel_category)][order(N)]
##    order_channel order_channel_category  N
## 1:       display                   paid  7
## 2:        social                   paid 11
## 3:     affiliate                   paid 19
## 4:         other                   paid 26
## 5:        search                   paid 37
# users by day: all categories
head(users[order(userByDay, decreasing = T)])
##    user_id discovery_channel_category discovery_channel
## 1: 2619113                     direct            direct
## 2: 2621420                       paid             other
## 3: 2619581                       paid         affiliate
## 4: 1943278                      email             email
## 5: 2497828                       paid            social
## 6: 2435897                     direct            direct
##         discovery_time discovery_platform      location gender
## 1: 2017-12-18 20:57:03             mobile            US female
## 2: 2017-12-19 08:50:49            desktop            US female
## 3: 2017-12-18 23:48:50             mobile international female
## 4: 2017-12-19 09:26:48             mobile            US female
## 5: 2017-11-23 12:21:20            desktop international female
## 6: 2017-11-23 15:44:16             mobile international female
##    discoveryDay userByDay
## 1:       Monday         4
## 2:      Tuesday         4
## 3:       Monday         4
## 4:      Tuesday         4
## 5:     Thursday         3
## 6:     Thursday         3
summary(lm(data = users, formula = userByDay ~ as.factor(discovery_channel)))
## 
## Call:
## lm(formula = userByDay ~ as.factor(discovery_channel), data = users)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5714 -0.4286 -0.3333  0.0282  2.6667 
## 
## Coefficients:
##                                            Estimate Std. Error t value
## (Intercept)                                   1.538      0.206    7.46
## as.factor(discovery_channel)direct           -0.158      0.262   -0.60
## as.factor(discovery_channel)display          -0.288      0.334   -0.86
## as.factor(discovery_channel)email            -0.205      0.322   -0.64
## as.factor(discovery_channel)organic search   -0.433      0.268   -1.62
## as.factor(discovery_channel)other             0.033      0.348    0.09
## as.factor(discovery_channel)search           -0.110      0.348   -0.32
## as.factor(discovery_channel)social           -0.101      0.278   -0.36
##                                                  Pr(>|t|)    
## (Intercept)                                0.000000000047 ***
## as.factor(discovery_channel)direct                   0.55    
## as.factor(discovery_channel)display                  0.39    
## as.factor(discovery_channel)email                    0.53    
## as.factor(discovery_channel)organic search           0.11    
## as.factor(discovery_channel)other                    0.92    
## as.factor(discovery_channel)search                   0.75    
## as.factor(discovery_channel)social                   0.72    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.743 on 92 degrees of freedom
## Multiple R-squared:  0.0415, Adjusted R-squared:  -0.0314 
## F-statistic: 0.569 on 7 and 92 DF,  p-value: 0.779
## users by day: paid vs non-paid
users[discovery_channel_category != "paid", discovery_channel_category := "unpaid"]
summary(lm(data = users, formula = userByDay ~ as.factor(discovery_channel_category)))
## 
## Call:
## lm(formula = userByDay ~ as.factor(discovery_channel_category), 
##     data = users)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4510 -0.4510 -0.2653 -0.0617  2.7347 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                    1.451      0.102   14.20
## as.factor(discovery_channel_category)unpaid   -0.186      0.146   -1.27
##                                                        Pr(>|t|)    
## (Intercept)                                 <0.0000000000000002 ***
## as.factor(discovery_channel_category)unpaid                0.21    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.73 on 98 degrees of freedom
## Multiple R-squared:  0.0162, Adjusted R-squared:  0.0062 
## F-statistic: 1.62 on 1 and 98 DF,  p-value: 0.206
## median sales by day by discovery channel
orderProducts[, median(salesByDay), by = order_channel]
##    order_channel    V1
## 1:         other 101.5
## 2:        social  72.0
## 3:       display  68.0
## 4:        search  98.0
## 5:     affiliate  88.0

Graphics for Question 2

The first graphic shows how users have come to Everlane over time, disaggregated by discovery channel. The second graphic shows the sales by day patterns across each of the channels. We can see that there is no discernable, consistent pattern across the discovery channels, with the exception of affiliate and search channels, which have slight but positive trends over time.

## Looking at users
ggplot(data = users, mapping = aes(x = discovery_time, y = userByDay, color = discovery_channel)) + 
  geom_point() + 
  labs("Users over Time by Discovery Channel")

# looking at the profit generated
ggplot(data = orderProducts, mapping = aes(x = completed_at, y = salesByDay, color = order_channel)) + 
  geom_point() +
  geom_smooth(se = F) + 
  labs("Sales over Time by Discovery Channel")

Question 2c

I want to know how well Everlane a) is drawing users (impressions and visits) in and b) keeping customers (satisfaction) and c) how much each customer is spending over time (total sales). Of the customers that are coming to Everlane via the channels listed in the discovery_channel variable, I would want to know how many people were reached. These data would tell me the proportion of people who are drawn into the site initially. I would then want to know, of the users that visited the site, what those customers viewed, whether there are trends in their viewing patterns, and the proportion of items that they then went onto buy. Once there are orders placed, I would want to know whether customers were satisfied; to quantify this, I would want comments, reviews, the number of items that are returned, and whether a customer returned more than once to place an order. Of the returning customers, I would want data on their spending habits over time (total sales, frequency).

Question 3a

“Shoe lovers” can be defined as customers who either include footwear in each purchase and/or customers who have spent 50% or more of their summed order amounts on footwear. However, given the little information that we have available, we can only identify lovers of certain items as those who have purchased one. Ideally, I would want to know more about a customers’ willingness to pay for a point of shoes; show lovers would have more price flexibility. However, given the data that we have available, I can identify “shoe lovers” as customers who have purchased shoes.

orderProducts[category == "Footwear", user_id ]
## [1] 1286966 2118486 2016243

Question 3b

Using the same methodology, I have generated code that will make lists of all the kinds of item lovers within the existing data. Also, it may be useful to the marketing team to know that emails are, per the data available, the least successful outreach method.

orderProducts[!is.na(category), 
              .(lovers = paste(unique(user_id), collapse = ", ")), 
              by = category]
##                category
## 1:           Woven Tops
## 2:            Knit Tops
## 3:             Footwear
## 4:         Knit Dresses
## 5:                 SLGs
## 6:             Sweaters
## 7: Seasonal Accessories
## 8:            Outerwear
##                                                                    lovers
## 1: 1772889, 2050503, 2022637, 2421729, 1525025, 2105529, 2413754, 1818633
## 2:                                                       1637691, 1717464
## 3:                                              1286966, 2118486, 2016243
## 4:                                                                1323027
## 5:                                                                2359777
## 6:          1854949, 2593502, 2481099, 2380929, 1937006, 1125825, 1671627
## 7:                                                                 925461
## 8:                                                       1375796, 2515621

Question 3b

I would first need to gather the information about who the true shoe lovers are (using some of the methodology described in 3a). More concretely, this metric could be evaluated through a series of models with cross validation. Using existing data, I would build a model to understand who a shoe lover is. To do so, I would experiment with multiple models to ultimately build a supervised classifier to predict whether a customer will buy shoes. Ideally, once this model is built, I will have insights into shoe lovers’ purchasing habits that will help drive my a/b testing hypothesis. For example, if I find that the users who consistently buy a pair of shoes after seeing an Everlane ad on Instagram indicates a shoe lover, I may also assume that, if the same user sees an image of a shoe in an email, they will also be compelled to purchase it.

Once I understand what constituted a shoe lover, I would generate a propensity score by user for their average probability to buy shoes; I would assign the propensity scores using a logit regression. Based on the propensity scores, I would select 25% of users and ensure that the distribution of shoe lovers within that quarter closely matched the distribution of the entire client base. Once the control and treatment groups are balanced, I would add a binary variable to the users’ data table for shoeLover. After launching the email campaign, I would collect up to a month of data, depending on the length and frequency of the email campaign.

While I could use a service like Google Optimize to collect and store the data, this a/b test would only be changing the behavior of users’ visiting a site, rather than how they interact with it. Thusly, tracking those who were a part of the email campaign versus those who were not would be simplified. Depending on the outcome of interest, the data would allow us to analyze the efficacy of the email campaign on increased purchases made.