Hello, my name is Jack Hanlon. I’m the VP of Analytics at Jet.com, where my teams include Analytics, Experimentation, Research, and Data Science. I’m an adjunct professor at NYU Stern in the Luxury and Fashion MBA program, where I teach Retail Strategy and Analytics. Most importantly, I was once sitting where you are and am now a proud alum of NYU’s MSBA program. In the following series of videos we’re going to talk about multivariate regression and price elasticity.
#Regressions
Multivariate Regression is a scientist’s hammer. There is no more widely used technique to identify and describe the relationship between variables. Across disciplines and across problem types, this method is a great place to start because it can be an incredibly effective method for:
(1) identifying whether a relationship exists between the independent and dependent variables, and quantifying the strength of that relationship
(2) establishing the marginal effect of each variable, and
(3) attempting to predict future outcomes given new values attached to these variables
At its core, it is a computationally cheap, easy-to-understand approach that has likely appeared in more published work than any other comparable analytical method.
Perhaps you’ve already encountered this technique or even have been trained to use it, but frequently the gap between the academic setting and the workplace limits people’s ability to execute it with confidence. These videos aim to give you a strong base layer by walking you through implementing a multivariate regression to solve a real business problem on real, messy data.
####What Can It Do?
First let’s talk about some tangible applications in the business world. There are countless ways that multivariate regressions are used in business, but to offer some examples of interesting areas: they can be used to examine multi-touch attribution for marketing (how much is each of my marketing channels contributing to my sales?), they can identify the attributes that most strongly affect the selection process for given products, they can be used to figure out the pricing of houses in a market (how do the various features of a house contribute to its projected value?), and on and on. But how is this actually done?
We take a dataset that has multiple observations of data (perhaps multiple days, or multiple people, etc…). In the house example earlier, this would be a dataset of houses, their sale price and their key attributes.
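In R, that load step might look like the following (a minimal sketch; the file name is a stand-in, not from the videos):

library(readr)
toy_house <- read_csv("toy_house.csv")  # hypothetical file name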
## Parsed with column specification:
## cols(
## House = col_double(),
## `Sale Price` = col_double(),
## `Sq Ft` = col_double(),
## Bedrooms = col_double(),
## Neighborhood = col_character()
## )
summary(toy_house)
## House Sale Price Sq Ft Bedrooms Neighborhood
## Min. :1.0 Min. :380000 Min. :1500 Min. :3 Length:3
## 1st Qu.:1.5 1st Qu.:405000 1st Qu.:1500 1st Qu.:3 Class :character
## Median :2.0 Median :430000 Median :1500 Median :3 Mode :character
## Mean :2.0 Mean :416667 Mean :1500 Mean :3
## 3rd Qu.:2.5 3rd Qu.:435000 3rd Qu.:1500 3rd Qu.:3
## Max. :3.0 Max. :440000 Max. :1500 Max. :3
So for example, we might see that a house in Westchester with 3 bedrooms and 1500 sq ft sold for $430,000. The next week a house with the same attributes around the corner sells for $440,000. That same week a house in a different neighborhood with the same number of bedrooms and square footage sells for $380,000. What explains why this house had such a different sale price despite having the same specs? Using this variability we can try to learn about how well the difference in neighborhood explains the difference in sale price.
Here we might set up a regression that attempts to describe the variation in the sale price of the homes using each home’s key attributes. In this setting, the sale price is the Dependent Variable and the attributes are Independent Variables. A regression aims to produce the coefficients (the weight of the effect of each variable, ceteris paribus, on the dependent variable) given the data. The generic form of the equation is: Dependent Variable = Intercept + coefficient_1*indep.variable_1 + coefficient_2*indep.variable_2 + ... + error term
So in our example, our regression would look like the following: Sale price of house = intercept + coef1*sq footage + coef2*number of bedrooms + coef3*neighborhood + error term
The results of our regression produce the intercept and coefficients in the model.
Additionally, using this information, we could plug the calculated coefficients into this same equation to figure out the potential sale price of a house about to go on the market, given certain attribute values. So if we have a new house coming on the market in Westchester with 3 bedrooms and 2000 sq ft, we could project what we think a fair sale price might be. That equation would look like the following:
Sale price of house = intercept + coef1*2000 + coef2*3 + coef3*1 + error term
Adding the results (note this is a toy example so these are made-up numbers; don’t make any house bids on this…): Sale price of house = 200,000 + 100*2,000 + 15,000*3 + 5,000*1 + 10,000
Our projected sale price would be: $460,000
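As a quick sanity check, here is that same arithmetic in R (same made-up numbers as above):

# Toy coefficients from the example above (made up, not estimated)
intercept <- 200000
coef_sqft <- 100           # $ per square foot
coef_beds <- 15000         # $ per bedroom
coef_neighborhood <- 5000  # $ for the Westchester dummy = 1
intercept + coef_sqft * 2000 + coef_beds * 3 + coef_neighborhood * 1 + 10000  # plus error term
## [1] 460000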
That’s a fun problem, but for our project we want to talk about work that has a huge impact across many industries, so we’re going to explore one of the most crucial and widely used applications of multivariate regression: calculating the Price Elasticity of Demand of a given product.
Thanks for watching.
Time check: about 4:30 at this point
Welcome back. In the previous video we discussed the basics of multivariate regression. In this video we’re going to talk about and calculate one of the most important implementations of this technique: Price Elasticity of Demand.
##Price Elasticity of Demand
Price Elasticity of Demand is a measure used to show the relationship between the quantity demanded of a good or service relative to a change in its price. As a retailer, having a solid understanding of this relationship for any given product allows for vastly more effective control over profit maximization, sales management, and inventory management.
Price Elasticity is calculated using a two-step process:
(1) We take a dataset that shows multiple consistent time periods and the price and units sold in each period. We perform a regression on that data to evaluate the impact of price on quantity sold.
(2) Then we plug the results into the Price Elasticity equation: \(PE = (\Delta Q/\Delta P) * (P/Q)\) where our regression will produce the \((\Delta Q/\Delta P)\) value [which will be the coefficient associated with Price, as you will see in the example] and then \(P/Q\) will be \(mean(P) / mean(Q)\) over that same time period.
####Let’s Code
This notebook will walk you through a few concepts:
(1) How do we determine a simple price elasticity for a given product?
(2) How do we determine how much other factors, besides price, may have impacted the units sold for our product?
(3) How do we improve the accuracy of our modeling?
#####Aside: How will we code?
Modern computing makes regression much more accessible. It can be executed using any number of statistical packages or programming languages. Professor Ghose will use Stata and I will show you R, but there are many other options, ranging from non-coding tools like Excel to other scientific tools like MATLAB.
First let’s load our dataset into a dataframe (effectively just a way of storing a spreadsheet of N rows by M columns). The dataset we have is a by-day look at the price and units sold of a single popular product, like paper towels, within a certain customer segment. Let’s look at our results and the summary stats about our data:
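As before, a minimal sketch of that load step (the file name is a stand-in):

library(readr)
simple_df <- read_csv("simple_price_units.csv")  # hypothetical file name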
## Parsed with column specification:
## cols(
## order_date = col_character(),
## price = col_double(),
## units = col_double()
## )
summary(simple_df)
## order_date price units
## Length:384 Min. :12.99 Min. : 0.00
## Class :character 1st Qu.:16.97 1st Qu.: 80.75
## Mode :character Median :17.45 Median :120.50
## Mean :17.90 Mean :123.19
## 3rd Qu.:17.98 3rd Qu.:153.25
## Max. :23.42 Max. :341.00
Looking at the summary stats here we can see that we have some nice variation in terms of prices and units, and we have a decent time range. This looks like it could be promising – having a good amount of data over a longer time period with variability in the attribute values gives us the right raw material to help solve this problem.
Let’s get to the actual regression. Our independent variable is the price and our dependent variable is the number of units sold. If we only include these two variables we are suggesting that price alone will adequately predict units sold - this is a naive view but let’s see how it goes.
To run this regression in R we use a function called lm() [which stands for Linear Model], which creates a model like our regression equation from earlier, in the form: \(Y = \beta_0 + \beta_1 x_1 + \epsilon\), which with our variables looks like: Units Sold = Intercept + B1*Price
This function generally looks like this, where we would replace our relevant info for the parts that have brackets: [Name Of Model] <- lm(formula = [dependent variable]~[independent variable], data = [our dataset])
Let’s try it:
simplemodel <- lm(formula = units~price, data = simple_df)
summary(simplemodel)
##
## Call:
## lm(formula = units ~ price, data = simple_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.050 -25.050 -1.904 22.295 197.076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 485.342 22.758 21.33 <2e-16 ***
## price -20.229 1.265 -15.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.49 on 382 degrees of freedom
## Multiple R-squared: 0.401, Adjusted R-squared: 0.3995
## F-statistic: 255.8 on 1 and 382 DF, p-value: < 2.2e-16
Wow. There’s a lot here but don’t worry, we’ll make sense of it.
#####Coefficients
First let’s look at the Coefficients section. We recall that our model was Units Sold = Intercept + B1*Price, so we can look in the coefficients area under Estimate, take the values attributed to each of these variables, and see that our basic linear model is as follows: Units Sold = 485.342 - 20.229 * Price
This implies that if we were to plug in a new price value, we would generate the projected Units Sold for that day. Notice that the coefficient on price is negative, which makes sense, as we would expect a higher price to drive down units sold.
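For example, to project units sold at a hypothetical price of $18 (a value chosen purely for illustration), we can hand the fitted model a new data point using R’s predict() function:

# Project units sold for a new day priced at $18
predict(simplemodel, newdata = data.frame(price = 18))
# roughly 485.342 - 20.229 * 18, i.e. about 121 units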
#####Residuals
We also see a section called Residuals. If we take our coefficients and plug in the price value for each day in our dataset, then compare each day’s estimate against the real units sold for that day, the differences we get are called the Residuals. If the residuals are all 0, we have a perfect model, i.e. there’s no difference between the predicted values and the actuals. In our case we’re not so fortunate, as clearly price alone is not telling quite the whole story. We can see the summary stats in the Residuals section, but if we want to dig even deeper we can plot them:
par(mfrow=c(1,2))
plot(simplemodel, which=1:2)
Looking at the first panel, Residuals vs Fitted, we can see where the predictions are and, in the black circles, the distance between each prediction and reality. As you can see there’s a fairly high level of variation, which suggests our model’s accuracy isn’t so great at the moment and that there are other factors contributing to the number of units sold per day besides price.
#####R-squared
One way to succinctly describe our general model fit is using r-squared. In our summary it tells a similar story to the rest of the data with a value of 0.39 (out of 1) – not super impressive. In a real-world environment you might be satisfied with something around 0.70 or higher, depending on how precise you need to be for that application, since the number of potential factors at play here makes a very high r-squared fairly unreasonable to achieve; but we are unlikely to ever find 0.39 very satisfying.
#####P-value
Although our model is not incredibly accurate at the moment, not all is lost: if we look at the p-value of each coefficient, the model is quite sure that price has an effect on units sold, represented here by the *** next to the result.
If we felt that this was an adequate result, we could calculate the price elasticity now. Let’s look back at what we said about the Price Elasticity equation:
The equation is \(PE = (\Delta Q/\Delta P) * (P/Q)\) where our regression will produce our \((\Delta Q/\Delta P)\) value [which will be the coefficient associated with Price, as you’ll see in a moment] and then \(P/Q\) will be \(mean(P) / mean(Q)\) over that same time period.
Having run our regression, we now have the \((\Delta Q/\Delta P)\) value - this is the coefficient associated with Price, which in our case is -20.229. Then for \(P/Q\) we can take the \(mean(P) / mean(Q)\) over that same time period, as we recommended earlier.
# Rather than hardcoding our (ΔQ/ΔP) value, pull the price coefficient from the model,
# so that if we change our data later this estimate changes with it,
# rather than being fixed to an outdated number
elasticity_estimate <- coef(summary(simplemodel))["price", "Estimate"] * mean(simple_df$price) / mean(simple_df$units)
elasticity_estimate
## [1] -2.939698
Plugging in our values: \(PE = -20.229 * (17.90 / 123.19) \approx -2.94\)
This estimate implies that for each 1% we raise the price of our product, we can expect to see a drop of about 2.94% in units sold. So if we increase the price of paper towels by 1% we’d expect to sell roughly 2.94% fewer units per day. (Note that elasticity is a percentage relationship; the raw units-per-dollar effect is the price coefficient itself, about 20.2 fewer units per $1.) You can imagine that given this information you could make choices about ideal profitability, balancing volume and margin. But in order to do that, we’ll have to trust our model a little more. In the next video we’ll talk about how to improve our accuracy by using Multivariate methods.
Thanks for watching.
Time Estimate for this section: ~6 mins, aggregate 10:30
Welcome back. In the previous video we described how to perform a basic regression and translate that into Price Elasticity for a given product. In this video we will explore methods of improving the accuracy of our model.
###Going Multivariate for Better Accuracy
We have a good starting point, but based on the accuracy measures we just discussed, it’s clear that we are not describing all the variation in sales with price alone. With this in mind, how might we improve the performance of our model?
One way to attack this is to add other potentially strong variables that could help us better describe the factors that determine units sold per day. Some of those variables may be endogenous, like marketing spend, promotions, etc., where we have direct control over aspects that may impact unit sales. Considering these efforts are explicitly designed to positively impact unit sales, it stands to reason that they could be key elements. Additionally, exogenous factors like weather or other firms’ decisions could also alter supply or demand in the marketplace. While it is unreasonable to expect that we could completely explain the variation in sales for every day, we still have notable upside given the intuitively important variables mentioned earlier. Let’s take some of the things we have at hand and add some detail to our dataset.
We’re going to add some elements you are likely to have available:
#####Day of Week
In most areas of retail, the Day of the Week frequently has a notable effect on the demand for certain goods. For example, Monday tends to be a much more popular shopping day than Friday, and the weekends tend to be soft sales days for many product types.
#####Promotions
Offering coupons or sales is another method of driving demand and is something that explicitly impacts price, one of our key variables. But price alone may not tell the whole story (e.g. we might do a mailer to announce the low price), so we want to capture the full impact of a Promotion.
#####Search Clicks
In digital, search advertising is a huge harvester of online demand, so the amount of search traffic we drive could be meaningful to how much product we sell.
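Loading this enriched dataset works the same way as before (again, the file name is a stand-in):

library(readr)
betterdf <- read_csv("price_units_enriched.csv")  # hypothetical file name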
## Parsed with column specification:
## cols(
## order_date = col_character(),
## day_of_week = col_character(),
## price = col_double(),
## units = col_double(),
## promo = col_double(),
## promo2 = col_double(),
## search_clicks = col_double()
## )
summary(betterdf)
## order_date day_of_week price units
## Length:373 Length:373 Min. :12.99 Min. : 33.0
## Class :character Class :character 1st Qu.:16.97 1st Qu.: 83.0
## Mode :character Mode :character Median :17.40 Median :122.0
## Mean :17.92 Mean :125.9
## 3rd Qu.:17.98 3rd Qu.:154.0
## Max. :23.42 Max. :341.0
## promo promo2 search_clicks
## Min. :0.0000 Min. :0.0000 Min. : 198.8
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 482.0
## Median :1.0000 Median :0.0000 Median : 607.2
## Mean :0.6836 Mean :0.2815 Mean : 596.8
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 707.7
## Max. :1.0000 Max. :1.0000 Max. :1056.3
In order to run our new multivariate regression with these variables, we add them to our initial formula. Before, our equation was \(Y = \beta_0 + \beta_1 x_1 + \epsilon\); now we add the other variables: \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i + \epsilon\)
Let’s see how this looks in code and run the new model. Again we use lm(), and this time we add the other variables by name.
bettermodel <- lm(formula = units~price + day_of_week + promo + promo2 + search_clicks, data = betterdf)
summary(bettermodel)
##
## Call:
## lm(formula = units ~ price + day_of_week + promo + promo2 + search_clicks,
## data = betterdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.257 -21.932 -4.493 14.873 191.350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 371.1031 22.9563 16.166 < 2e-16 ***
## price -19.6262 1.0864 -18.065 < 2e-16 ***
## day_of_weekMonday 4.0341 7.1033 0.568 0.570446
## day_of_weekSaturday -9.1024 6.9949 -1.301 0.193988
## day_of_weekSunday 2.4491 7.1230 0.344 0.731170
## day_of_weekThursday 5.7162 7.0067 0.816 0.415141
## day_of_weekTuesday -0.9265 7.0538 -0.131 0.895577
## day_of_weekWednesday 4.6673 6.9762 0.669 0.503903
## promo1 34.3401 10.1145 3.395 0.000762 ***
## promo21 41.3242 9.9918 4.136 4.40e-05 ***
## search_clicks 0.1181 0.0164 7.199 3.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.12 on 362 degrees of freedom
## Multiple R-squared: 0.5917, Adjusted R-squared: 0.5804
## F-statistic: 52.45 on 10 and 362 DF, p-value: < 2.2e-16
OK, so what are we seeing here… First, we see that each Day of the Week and each Promo have been broken out into their own variable. Why is that?
####Dummy Variables
Because these variables are categorical (i.e. the value is a category, like text, rather than a number), R breaks them out into dummy variables for us (known as One-Hot-Encoding for all you scikit-learn Python users), where the variable shows a 1 if it is true and a 0 otherwise. So for example, on a Thursday where Promo 1 was active, the row would look like so:
demo_df <- data.frame(
  "date" = c("18/03/03"), "price" = c(18),
  "day_of_weekMonday" = c(0), "day_of_weekTuesday" = c(0),
  "day_of_weekWednesday" = c(0), "day_of_weekThursday" = c(1),
  "day_of_weekFriday" = c(0), "day_of_weekSaturday" = c(0),
  "day_of_weekSunday" = c(0), "promo1" = c(1), "promo21" = c(0)
)
demo_df
## date price day_of_weekMonday day_of_weekTuesday day_of_weekWednesday
## 1 18/03/03 18 0 0 0
## day_of_weekThursday day_of_weekFriday day_of_weekSaturday day_of_weekSunday
## 1 1 0 0 0
## promo1 promo21
## 1 1 0
You can see Thursday and Promo1 have a value of 1 and the other categorical variables have a value of 0. (You may also notice Friday never appears in the regression output above: R drops one level of each factor as the reference level, whose effect is absorbed into the intercept, so the day-of-week coefficients are all measured relative to Friday.) A regression can only be performed on numbers, and this method unlocks our categorical variables, or factors, to be included. We also could have opted to assign each of these a numeric value, e.g. Monday = 1, Tuesday = 2, etc., but had we done so we would have created some weird relationships between categories, e.g. Tuesday = 2*Monday, which is meaningless. Rather than create these structures, we opt for binary 1-0 representations of true/false.
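If you want to see exactly what R builds under the hood, model.matrix() shows the dummy-encoded design matrix that lm() actually regresses on (this sketch assumes promo and promo2 have been converted to factors, which is why the output labels them promo1 and promo21):

# Peek at the first rows of the expanded design matrix; Friday is absent
# because it is the dropped reference level
head(model.matrix(units ~ price + day_of_week + promo + promo2 + search_clicks, data = betterdf))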
Looking at our variables’ performance: while Day of Week is not significant on any of its component days, the Promo variables are very significant. Additionally, our new numeric variable, Search Clicks, is also very significant. As a result of including these factors, if we look at our measures of success, we can see we’ve made considerable improvement. Let’s check our graphs again…
par(mfrow=c(1,2))
plot(bettermodel, which=1:2)
Progress! This is definitely better than it was in the first model as we can see in the improved accuracy measures in our model performance summary (and visually we can see how the dots are closer to the line than in our first example). Even so, some of those dots are pretty far away, and it still feels like we could be doing better than an r-squared of 0.5917 (reminder: this is measured out of 1). How might we continue to improve?
####Digging Deeper
Normally we would never go straight into modeling, instead opting to do a healthy amount of exploratory data analysis to understand our data first. That is out of scope for these videos, but let’s at least take a quick look at our sales plotted over time and see what we find…
library(ggplot2)
# order_date is a character column, so we set group = 1 to draw one continuous line
salesplot <- ggplot(data = betterdf, aes(x = order_date, y = units, group = 1)) + geom_line()
salesplot + scale_color_grey() + theme_classic()
Interesting. Here we can see three potential issues that are quite common to problems of this type: Outliers, Trends, and Seasonality.
#####Outliers
We still have a couple of tricky outliers, visible in panel 1 of our model summary plots (35, 36, and 321 particularly) and panel 2 alike. Additionally, if we look at the sales plot we can see significant variability at the beginning and end of the year. It’s clear that those are not being well explained by our existing variables, so we have a few options. We can keep them in, note them in our analysis, and try to figure out what happened on those days by adding other attributes (including some that we discussed, like Weather). That may explain parts but may not catch everything. If we dig further we may uncover rare events (e.g. Justin Bieber randomly was photographed by the paparazzi holding the product), or we may never find the reason, as we don’t have perfect information about our competitors’ decisions or what’s happening in all areas of all markets.
Fortunately, this doesn’t mean our efforts are in vain. Having a perfect explanation is not as important to us as understanding how we can better take control of our outcomes, so after further analysis and research we may choose instead to remove the rogue observations, say “something weird was happening there,” and move on. Yes, this exposes us to being hit by other rare events later, but it also gives us greater precision in understanding how our core variables explain the great majority of performance. That arms us with the ability to create more optimal outcomes for our business.
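A minimal sketch of that inspect-then-remove workflow, using the row labels flagged in the residual plots:

# Inspect the flagged observations before deciding anything
betterdf[c(35, 36, 321), ]
# If research convinces us they are one-off anomalies, drop them and refit
cleaned_df <- betterdf[-c(35, 36, 321), ]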
#####Trend
Another issue is that sales in general could be growing or shrinking over time. In our sales plot it does look like sales have diminished over the course of the year. If we believe this could be the case there are a few ways of thinking about it, but let’s say we believe there is a linear trend of growth (or shrinkage) occurring. In math terms this looks like: \(Y_i = \beta_0 + \beta_1 T_i + \epsilon_i\) where \(T_i\) is our time, which we can build as an index, time = 1, 2, 3, 4, … so day 1’s value = 1, and day N’s value = N. In our regression we will then be able to see how much the trend does or does not contribute to the overall outcome.
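A minimal sketch of building that index and adding it to the regression (the variable name time_index is my own):

# Linear time index: day 1 = 1, day 2 = 2, ..., day N = N
betterdf$time_index <- seq_len(nrow(betterdf))
trendmodel <- lm(units ~ price + day_of_week + promo + promo2 + search_clicks + time_index, data = betterdf)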
#####Seasonality
Even if we take out the outliers and account for trend, we may have other types of variation that aren’t consistent across our time periods but instead represent variation within our periods, showcasing seasonal demand. We attempted to account for part of this using Day of Week, but there could be longer seasonal cycles as well, such as those we might see for winter jackets or grills.
We could look at our plot and let the seasonality we see determine how to split up the time periods. Some go for a simple seasonal breakdown with dummy variables for winter, spring, summer, fall, etc., which for our case, where there’s a clear dip in fall, may suffice. Others prefer a more granular approach built around certain holidays or sub-portions of the year. This can also vary by geo (e.g. more northern states in the US get warm later in the season and get snow before southern states), so you may find that this geographic variation is also critical.
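A sketch of the simple dummy-variable approach (this assumes order_date parses with as.Date(); adjust the format argument to match your data):

# Map each month number to a coarse season, then add it to the model
month_num <- as.integer(format(as.Date(betterdf$order_date), "%m"))
season_lookup <- c("winter", "winter", "spring", "spring", "spring", "summer",
                   "summer", "summer", "fall", "fall", "fall", "winter")
betterdf$season <- factor(season_lookup[month_num])
seasonmodel <- lm(units ~ price + day_of_week + promo + promo2 + search_clicks + season, data = betterdf)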
Time estimate: ~4:30, aggregate: 15:00
##Looking Ahead
Moving forward, what could we improve?
(1) Control for Outliers/Trend/Seasonality
(2) Add other new variables of interest. We could explore what other data we need to include and see if there are any predictors we have access to but haven’t used yet that could help the effort.
(3) Address issues of linearity. Notice that in our main Residuals plot above the red line is curved, and in panel 2, the Normal Q-Q plot, the dots don’t fit the line at the ends… this suggests that perhaps the relationship is not linear. There is a bit of work we can do to our data to make this fit better, which should also improve our accuracy, but I’ll leave that fun part to Professor Ghose and invite you to come back after his class to see if you can adjust your analysis of this data accordingly.
Thanks for watching and good luck with the program!