As a first pass, I want to understand the type and breadth of the data. I see that the main table, users is unique by user, which there are 100. I also see that a few variables (i.e. discovery_time) are not in the proper format, so I use this space to update data types. Once I’ve updated data types, I want to make a master table, pairing users with their orders. While I would typically do a left join of users onto orders to make a master data table, there is no overlap in user_id of the two tables.
## Load
users <- data.table(read.csv("users.csv", stringsAsFactors = F))
orders <- data.table(read.csv("orders.csv", stringsAsFactors = F))
products <- data.table(read.csv("products.csv", stringsAsFactors = F))
## Take a look-see
# Users & adding a variable for day of the week
summary(users)
## user_id discovery_channel_category discovery_channel
## Min. : 253430 Length:100 Length:100
## 1st Qu.:1683500 Class :character Class :character
## Median :2033378 Mode :character Mode :character
## Mean :1979648
## 3rd Qu.:2365550
## Max. :2664888
## discovery_time discovery_platform location
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## gender
## Length:100
## Class :character
## Mode :character
##
##
##
sapply(users, class)
## user_id discovery_channel_category
## "integer" "character"
## discovery_channel discovery_time
## "character" "character"
## discovery_platform location
## "character" "character"
## gender
## "character"
users[, uniqueN(users)]
## [1] 100
users[, discovery_time := ymd_hms(discovery_time)]
users[, discovery_time := as.POSIXct(format(discovery_time, tz="America/Los_Angeles",usetz=TRUE))]
users[, discoveryDay := weekdays(discovery_time)]
# Orders & adding a variable for day of the week
summary(orders)
## line_item_id order_id completed_at user_id
## Min. :10741316 Min. :2241571 Length:100 Min. : 336966
## 1st Qu.:15003013 1st Qu.:3045172 Class :character 1st Qu.:1465237
## Median :17350712 Median :3525374 Mode :character Median :1774804
## Mean :16960849 Mean :3421079 Mean :1797260
## 3rd Qu.:18992314 3rd Qu.:3800378 3rd Qu.:2278386
## Max. :20022352 Max. :4028413 Max. :2597792
## quantity price style_id order_channel
## Min. :1.00 Min. : 12.0 Min. : 437 Length:100
## 1st Qu.:1.00 1st Qu.: 54.5 1st Qu.: 574 Class :character
## Median :1.00 Median : 79.5 Median : 649 Mode :character
## Mean :1.01 Mean : 84.1 Mean : 682
## 3rd Qu.:1.00 3rd Qu.:100.0 3rd Qu.: 760
## Max. :2.00 Max. :225.0 Max. :1020
## order_channel_category
## Length:100
## Class :character
## Mode :character
##
##
##
sapply(orders, class)
## line_item_id order_id completed_at
## "integer" "integer" "character"
## user_id quantity price
## "integer" "integer" "numeric"
## style_id order_channel order_channel_category
## "integer" "character" "character"
orders[, completed_at := ymd_hms(completed_at)]
orders[, completed_at := as.POSIXct(format(completed_at, tz="America/Los_Angeles",usetz=TRUE))]
orders[, completedDay := weekdays(completed_at)]
# Products
summary(products)
## style_id style category launch_date
## Min. : 433 Length:100 Length:100 Length:100
## 1st Qu.: 576 Class :character Class :character Class :character
## Median : 748 Mode :character Mode :character Mode :character
## Mean : 734
## 3rd Qu.: 865
## Max. :1086
sapply(products, class)
## style_id style category launch_date
## "integer" "character" "character" "character"
products[, launch_date := ymd_hms(launch_date)]
products[, launch_date := as.POSIXct(format(launch_date, tz="America/Los_Angeles",usetz=TRUE))]
products[, launchDay := weekdays(launch_date)]
I would define a successful launch in the following ways: - Were users made aware that a product had launched? - Did users who saw the product via ads, social media, or on the landing page go to the product and purchase it? (The proportion of users who saw it and thusly bought it) - Did users buy the product soon after the launch?
I want to know whether users are looking at the product and whether they’re buying the product. Are people responding to the ads, are they coming to the landing page, are they going to the product and eventually buying? Are they keeping it? are they giving positive reviews? Most importantly, a successful product launch generates more business in general and creates a broader customer base. However, since I don’t have this data, I use a created variable, ordersByDay as a proxy. # Question 1b Ideally, I would bring the users data onto the products and orders data, but, since there is no overlap in users between the data sets, I make a merged table of orders onto the product information. Once merged, I first subset the data by products that have a) sold and b) we have launch information about. After merging the product data onto the order records, I drop any rows that do not have launch dates, as they will not be useful in these analytics. Once subset, I get a count of products that were sold and the length of time between each point of sale and their launch dates. As a next point of inquiry, I look to the order_channel-category to understand how users were best targeted.
Per the data, there was an uptick in sales during the latter half of 2017, soon after the launch of the Cotton Long Sleeve V-Neck Crop and the Cotton Turtleneck; presumably, the fall launch was a greater success than previous launches in terms of bringing more customers and orders in. However, there is no conclusive evidence to support this hypothesis with the data on hand. One consideration is understanding whether orders were already rising, despite the launch in August. If orders were rising, it would be valuable to know what else changed within the company. #Question 1c Launches generally are opening many avenues for analytics that could drive products to test what customers want, how much they want it for, when they want to purchase it, and how to contact them and advertise. While the data available isn’t conclusive about what makes a launch successful, the team could gather more data to paint a fuller picture. For example, there may be times of day, days of week, months, or seasons associated with more business. Consequently, when the launch happens may be impacting its success. I believe that there is an opportunity to investigate seasonality of launches (using time series and forecasting) to better understand how a launch will be received. In terms of a launch bringing in new customers, the data reveals that paid means of outreach are the most successful. Thus, there could be some analytics done to look at the change in returns to paid advertisement during periods of a launch. Analytics should also be performed on customers’ price points: there may be associations with how much customers are spending per product and on total orders. on one hand, during a launch, they may be willing to pay more for newer, more seasonal items but there may also be increased traffic to items that are older and that have dropped in price. The data may reveal that launches are creating windows of opportunity for sales and an opportunity to drive more business to the site.
# Make a master data table
setkey(orders, style_id)
setkey(products, style_id)
orderProducts <- products[orders]
## which products have sold? how many?
orderProducts[, numSold := sum(quantity), by = style_id]
## How long have products been on the market at the point of sale?
orderProducts[, dateOnMkt := max(completed_at[!is.na(completed_at)]) -
launch_date[!is.na(launch_date)], by = style_id]
orderProducts[, dateOnMkt := completed_at - launch_date]
## how were those customers targeted?
orderProducts[, numOrderChannel := .N, by = order_channel]
orderProducts[, numOrderChannelType := .N, by = order_channel_category]
## within the dates we have available, what proportion of products have been purchased?
length(orders[completed_at >= min(products$launch_date), unique(style_id)] %in% uniqueN(products$style_id))/uniqueN(products$style_id)
## [1] 0.74
## Let's look at the number of orders sold and the money generated
orderProducts[, ordersByDay := uniqueN(order_id), by = as.Date(completed_at)]
orderProducts[, salesByDay := sum(price), by = as.Date(completed_at)]
## Look for change in orders attributable to a variable
## what about understand proximity to launch?
### simple linear regression of all variables of interest
lmFull <- lm(data = orderProducts,formula = ordersByDay ~ as.factor(category) + as.integer(launch_date) + price + as.integer(dateOnMkt) + as.factor(launchDay) + as.factor(completedDay))
summary(lmFull)
##
## Call:
## lm(formula = ordersByDay ~ as.factor(category) + as.integer(launch_date) +
## price + as.integer(dateOnMkt) + as.factor(launchDay) + as.factor(completedDay),
## data = orderProducts)
##
## Residuals:
## 5 7 9
## 0.03123034173973376210 -0.00000000000000001041 0.03123034173973377944
## 14 24 27
## -0.03123034173973375516 -0.00000000000000000260 -0.09358217370900871745
## 28 29 30
## 0.07808180752732210572 -0.01138950275940101356 0.02688986894108761835
## 39 51 52
## -0.00000000000000005204 -0.11985602418487037446 -0.09309030061551139268
## 54 55 57
## 0.01488434771646066253 0.00000000000000000347 0.00000000000000000347
## 69 77 85
## 0.22929231882365472273 0.00000000000000003469 -0.00367038519974558462
## 86 92 93
## -0.02755995653998816186 -0.22929231882365472273 0.00000000000000002776
## 95 96 99
## -0.18831588485510675213 0.15708554311537309411 -0.00000000000000010755
## 100
## 0.22929231882365486150
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 57.2915805693 31.3521374049
## as.factor(category)Knit Dresses -0.8101648837 0.9010852593
## as.factor(category)Knit Tops -1.6888929986 1.3491026223
## as.factor(category)Outerwear 0.2121042876 0.4593207248
## as.factor(category)Seasonal Accessories 1.3144586344 0.6101940010
## as.factor(category)SLGs -2.5552419688 1.8470073728
## as.factor(category)Sweaters -0.5046232322 0.6465289653
## as.factor(category)Woven Tops -0.7187783201 0.8440820423
## as.integer(launch_date) -0.0000000366 0.0000000202
## price -0.0201640528 0.0143648318
## as.integer(dateOnMkt) -0.0004875423 0.0007084754
## as.factor(launchDay)Sunday 0.2448068985 0.4115266723
## as.factor(launchDay)Tuesday 0.0161031063 0.2892112148
## as.factor(launchDay)Wednesday 0.1321649319 0.4236267462
## as.factor(completedDay)Monday 1.6338656551 0.8195608506
## as.factor(completedDay)Saturday 2.3507396039 1.9295087537
## as.factor(completedDay)Sunday 0.1476363739 0.5478388358
## as.factor(completedDay)Thursday 0.5048352212 0.7013210440
## as.factor(completedDay)Tuesday 0.2486067257 0.5563581589
## as.factor(completedDay)Wednesday -0.8264345641 0.9127433245
## t value Pr(>|t|)
## (Intercept) 1.83 0.127
## as.factor(category)Knit Dresses -0.90 0.410
## as.factor(category)Knit Tops -1.25 0.266
## as.factor(category)Outerwear 0.46 0.664
## as.factor(category)Seasonal Accessories 2.15 0.084 .
## as.factor(category)SLGs -1.38 0.225
## as.factor(category)Sweaters -0.78 0.470
## as.factor(category)Woven Tops -0.85 0.433
## as.integer(launch_date) -1.81 0.130
## price -1.40 0.219
## as.integer(dateOnMkt) -0.69 0.522
## as.factor(launchDay)Sunday 0.59 0.578
## as.factor(launchDay)Tuesday 0.06 0.958
## as.factor(launchDay)Wednesday 0.31 0.768
## as.factor(completedDay)Monday 1.99 0.103
## as.factor(completedDay)Saturday 1.22 0.277
## as.factor(completedDay)Sunday 0.27 0.798
## as.factor(completedDay)Thursday 0.72 0.504
## as.factor(completedDay)Tuesday 0.45 0.674
## as.factor(completedDay)Wednesday -0.91 0.407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.228 on 5 degrees of freedom
## (75 observations deleted due to missingness)
## Multiple R-squared: 0.922, Adjusted R-squared: 0.628
## F-statistic: 3.13 on 19 and 5 DF, p-value: 0.105
The first graph shows orders by day over time with a fitted line to show trends; I can see that orders are rising over time. Similarly, with sales, profits seem to be rising over time. The third graphic shows the same orders performance over time, but categorized by the order channel; it seems many of the orders are being driven by bloggers and influencers. Lastly, and arguably most importantly, I want to see how sales are being influenced, if at all, by launch dates. Although it may not be attributable, there is a sharp rise in orders after the August 2017 launch (presumably a Fall collection).
## Launch performance (orders)
plot_ly(data = orderProducts, x = ~completed_at, y = ~ordersByDay, symbols = c('circle','o','x', 'square', 'triangle'),type = "scatter") %>%
layout(title = "Orders by Day: Mar 2016 - Dec 2017") %>%
add_lines(y = ~fitted(loess(as.numeric(ordersByDay) ~ as.numeric(completed_at))),
line = list(color = 'red'),
name = "Loess Smoother", showlegend = F)
## Launch performance (sales)
plot_ly(data = orderProducts, x = ~completed_at, y = ~salesByDay, symbols = c('circle','o','x', 'square', 'triangle'),type = "scatter") %>%
layout(title = "Saless by Day: Mar 2016 - Dec 2017") %>%
add_lines(y = ~fitted(loess(salesByDay ~ as.numeric(completed_at))),
line = list(color = 'red'),
name = "Loess Smoother", showlegend = F)
## Launch performance by order channel
plot_ly(data = orderProducts, x = ~completed_at, y = ~ordersByDay, color = ~order_channel ,symbols = c('circle','o','x', 'square', 'triangle'),type = "bar") %>%
layout(title = "Orders Grouped by Discovery Channel", xaxis = list("Order Date"), yaxis = list("# Orders by Day"))
## By launch
ggplot(data = orderProducts, mapping = aes(x = completed_at, y = ordersByDay)) +
geom_smooth(se = F) +
geom_point() + theme_minimal() +
geom_vline(aes(xintercept=as.numeric(orderProducts[, launch_date])), linetype = "dotted") +
labs("Orders over Time w/ Launch Intervals")
The strength of customer acquisition can be measured in a simple ratio: how many resources are expended (money, time, employees) in relation to how many customers are gained. In turn, I would then measure the kinds of customers being drawn in to understand whether the marketing had targeted the right customer base, which can be indicated by whether a customer is returning to Everlane post-first purchase, whether they are drawing in others from their network, and whether they are satisfied with the products. From here, there are myriad data points that would help support these analytics to further zoom in on customer acquisition strength; these metrics are discussed in detail in 2c.
I want to get a more comprehensive breakdown of how users are discovering Everlane, so I combing discovery_channel and discovery_channel_category into a single column. The data that we have available is all customers who were brought to Everlane using paid advertisements. I see that, of all users, 51% are drawn in by paid channels, 21% directly, 19% via organic search, and just 9% through email. Within the paid category, social and display discovery channels have the highest performance. Given the user data alone, I would say that paid, social channels are the most valuable.
To further confirm my hypothesis, I look at the users who have placed orders in the past two years. Of all the orders placed, 100% have been driven by paid means. The most successful paid avenue was a direct search, followed by other means; social channels only drove a little over 10% of orders.
As a final check I run two simple models: the effect on all discovery channels on the number of new users by day (userByDay) and the effect of unpaid vs. paid channels on the same variable. In the first model, I find no significant effect of any channel on new user acquisition. In the second, I find significant in the intercept, which leads me to believe that paid means do have effect on increased customer acquisition. However, the model is incredibly weak.
While I can, with confidence, say that paid means are more successful than unpaid, the data is insufficient to measure the kind of paid marketing that drives the highest user acquisition and the best user acquisition.
# users
users[, .N, by = discovery_channel_category]
## discovery_channel_category N
## 1: organic search 19
## 2: direct 21
## 3: email 9
## 4: paid 51
users[discovery_channel == "", discovery_channel := discovery_channel_category]
users[, .N, by = discovery_channel]
## discovery_channel N
## 1: organic search 19
## 2: direct 21
## 3: email 9
## 4: affiliate 13
## 5: display 8
## 6: other 7
## 7: social 16
## 8: search 7
users[, userByDay := uniqueN(user_id), by = as.Date(discovery_time)]
# orders
orders[, .N, by = .(order_channel, order_channel_category)][order(N)]
## order_channel order_channel_category N
## 1: display paid 7
## 2: social paid 11
## 3: affiliate paid 19
## 4: other paid 26
## 5: search paid 37
# users by day: all categories
head(users[order(userByDay, decreasing = T)])
## user_id discovery_channel_category discovery_channel
## 1: 2619113 direct direct
## 2: 2621420 paid other
## 3: 2619581 paid affiliate
## 4: 1943278 email email
## 5: 2497828 paid social
## 6: 2435897 direct direct
## discovery_time discovery_platform location gender
## 1: 2017-12-18 20:57:03 mobile US female
## 2: 2017-12-19 08:50:49 desktop US female
## 3: 2017-12-18 23:48:50 mobile international female
## 4: 2017-12-19 09:26:48 mobile US female
## 5: 2017-11-23 12:21:20 desktop international female
## 6: 2017-11-23 15:44:16 mobile international female
## discoveryDay userByDay
## 1: Monday 4
## 2: Tuesday 4
## 3: Monday 4
## 4: Tuesday 4
## 5: Thursday 3
## 6: Thursday 3
summary(lm(data = users, formula = userByDay ~ as.factor(discovery_channel)))
##
## Call:
## lm(formula = userByDay ~ as.factor(discovery_channel), data = users)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5714 -0.4286 -0.3333 0.0282 2.6667
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.538 0.206 7.46
## as.factor(discovery_channel)direct -0.158 0.262 -0.60
## as.factor(discovery_channel)display -0.288 0.334 -0.86
## as.factor(discovery_channel)email -0.205 0.322 -0.64
## as.factor(discovery_channel)organic search -0.433 0.268 -1.62
## as.factor(discovery_channel)other 0.033 0.348 0.09
## as.factor(discovery_channel)search -0.110 0.348 -0.32
## as.factor(discovery_channel)social -0.101 0.278 -0.36
## Pr(>|t|)
## (Intercept) 0.000000000047 ***
## as.factor(discovery_channel)direct 0.55
## as.factor(discovery_channel)display 0.39
## as.factor(discovery_channel)email 0.53
## as.factor(discovery_channel)organic search 0.11
## as.factor(discovery_channel)other 0.92
## as.factor(discovery_channel)search 0.75
## as.factor(discovery_channel)social 0.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.743 on 92 degrees of freedom
## Multiple R-squared: 0.0415, Adjusted R-squared: -0.0314
## F-statistic: 0.569 on 7 and 92 DF, p-value: 0.779
## users by day: paid vs non-paid
users[discovery_channel_category != "paid", discovery_channel_category := "unpaid"]
summary(lm(data = users, formula = userByDay ~ as.factor(discovery_channel_category)))
##
## Call:
## lm(formula = userByDay ~ as.factor(discovery_channel_category),
## data = users)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4510 -0.4510 -0.2653 -0.0617 2.7347
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.451 0.102 14.20
## as.factor(discovery_channel_category)unpaid -0.186 0.146 -1.27
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## as.factor(discovery_channel_category)unpaid 0.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.73 on 98 degrees of freedom
## Multiple R-squared: 0.0162, Adjusted R-squared: 0.0062
## F-statistic: 1.62 on 1 and 98 DF, p-value: 0.206
## median sales by day by discovery channel
orderProducts[, median(salesByDay), by = order_channel]
## order_channel V1
## 1: other 101.5
## 2: social 72.0
## 3: display 68.0
## 4: search 98.0
## 5: affiliate 88.0
The first graphic shows how users have come to Everlane over time, disaggregated by discovery channel. The second graphic shows the sales by day patterns across each of the channels. We can see that there is no discernable, consistent pattern across the discovery channels, with the exception of affiliate and search channels, which have slight but positive trends over time.
## Looking at users
ggplot(data = users, mapping = aes(x = discovery_time, y = userByDay, color = discovery_channel)) +
geom_point() +
labs("Users over Time by Discovery Channel")
# looking at the profit generated
ggplot(data = orderProducts, mapping = aes(x = completed_at, y = salesByDay, color = order_channel)) +
geom_point() +
geom_smooth(se = F) +
labs("Sales over Time by Discovery Channel")
I want to know how well Everlane a) is drawing users (impressions and visits) in and b) keeping customers (satisfaction) and c) how much each customer is spending over time (total sales). Of the customers that are coming to Everlane via the channels listed in the discovery_channel variable, I would want to know how many people were reached. These data would tell me the proportion of people who are drawn into the site initially. I would then want to know, of the users that visited the site, what those customers viewed, whether there are trends in their viewing patterns, and the proportion of items that they then went onto buy. Once there are orders placed, I would want to know whether customers were satisfied; to quantify this, I would want comments, reviews, the number of items that are returned, and whether a customer returned more than once to place an order. Of the returning customers, I would want data on their spending habits over time (total sales, frequency).
“Shoe lovers” can be defined as customers who either include footwear in each purchase and/or customers who have spent 50% or more of their summed order amounts on footwear. However, given the little information that we have available, we can only identify lovers of certain items as those who have purchased one. Ideally, I would want to know more about a customers’ willingness to pay for a point of shoes; show lovers would have more price flexibility. However, given the data that we have available, I can identify “shoe lovers” as customers who have purchased shoes.
orderProducts[category == "Footwear", user_id ]
## [1] 1286966 2118486 2016243
Using the same methodology, I have generated code that will make lists of all the kinds of item lovers within the existing data. Also, it may be useful to the marketing team to know that emails are, per the data available, the least successful outreach method.
orderProducts[!is.na(category),
.(lovers = paste(unique(user_id), collapse = ", ")),
by = category]
## category
## 1: Woven Tops
## 2: Knit Tops
## 3: Footwear
## 4: Knit Dresses
## 5: SLGs
## 6: Sweaters
## 7: Seasonal Accessories
## 8: Outerwear
## lovers
## 1: 1772889, 2050503, 2022637, 2421729, 1525025, 2105529, 2413754, 1818633
## 2: 1637691, 1717464
## 3: 1286966, 2118486, 2016243
## 4: 1323027
## 5: 2359777
## 6: 1854949, 2593502, 2481099, 2380929, 1937006, 1125825, 1671627
## 7: 925461
## 8: 1375796, 2515621
I would first need to gather the information about who the true shoe lovers are (using some of the methodology described in 3a). More concretely, this metric could be evaluated through a series of models with cross validation. Using existing data, I would build a model to understand who a shoe lover is. To do so, I would experiment with multiple models to ultimately build a supervised classifier to predict whether a customer will buy shoes. Ideally, once this model is built, I will have insights into shoe lovers’ purchasing habits that will help drive my a/b testing hypothesis. For example, if I find that the users who consistently buy a pair of shoes after seeing an Everlane ad on Instagram indicates a shoe lover, I may also assume that, if the same user sees an image of a shoe in an email, they will also be compelled to purchase it.
Once I understand what constituted a shoe lover, I would generate a propensity score by user for their average probability to buy shoes; I would assign the propensity scores using a logit regression. Based on the propensity scores, I would select 25% of users and ensure that the distribution of shoe lovers within that quarter closely matched the distribution of the entire client base. Once the control and treatment groups are balanced, I would add a binary variable to the users’ data table for shoeLover. After launching the email campaign, I would collect up to a month of data, depending on the length and frequency of the email campaign.
While I could use a service like Google Optimize to collect and store the data, this a/b test would only be changing the behavior of users’ visiting a site, rather than how they interact with it. Thusly, tracking those who were a part of the email campaign versus those who were not would be simplified. Depending on the outcome of interest, the data would allow us to analyze the efficacy of the email campaign on increased purchases made.