The Fresh Case: Multiple Regression, Functional Forms, and Indicators

November 13, 2017

What Determines Laundry Detergent Purchases?

Drivers?
What is under Managerial/corporate control?
Simple vs. complicated?

A Look at the Data

Enterprise Industries: Fresh Detergent

Enterprise Industries, owners of Fresh Detergent, want to predict demand for their product. In this case, the product is an extra large bottle of Fresh liquid detergent. Given a model for demand, Enterprise can:

Plan a production schedule
Estimate production requirements
Plan inventory requirements
Estimate sales revenue (and profits?)

The Data to Work With

Four indicators for 30 sales periods (4 weeks):

The demand for the extra large sized bottle of Fresh (in 100,000s of bottles) in a sales period
The price of Fresh in the sales period (in dollars)
The average competitor price for similar products in the sales period (in dollars)
Enterprise Industries' advertising expenditures (in $100,000s) targeted toward Fresh in the sales period.
Later we might incorporate data on the advertising campaigns.

Data

##    Fresh.Demand Fresh.Price Industry.Price Advertising.Spending
## 1          7.38        3.85           3.80                 5.50
## 2          8.51        3.75           4.00                 6.75
## 3          9.52        3.70           4.30                 7.25
## 4          7.50        3.70           3.70                 5.50
## 5          9.33        3.60           3.85                 7.00
## 6          8.28        3.60           3.80                 6.50
## 7          8.75        3.60           3.75                 6.75
## 8          7.87        3.80           3.85                 5.25
## 9          7.10        3.80           3.65                 5.25
## 10         8.00        3.85           4.00                 6.00
## 11         7.89        3.90           4.10                 6.50
## 12         8.15        3.90           4.00                 6.25
## 13         9.10        3.70           4.10                 7.00
## 14         8.86        3.75           4.20                 6.90
## 15         8.90        3.75           4.10                 6.80
## 16         8.87        3.80           4.10                 6.80
## 17         9.26        3.70           4.20                 7.10
## 18         9.00        3.80           4.30                 7.00
## 19         8.75        3.70           4.10                 6.80
## 20         7.95        3.80           3.75                 6.50

Summaries

	Mean	Std. Dev.	Minimum	Maximum	Atoms
Fresh.Demand	8.38	0.68	7.10	9.52	26.00
Fresh.Price	3.73	0.09	3.55	3.90	8.00
Industry.Price	3.95	0.22	3.65	4.30	11.00
Advertising.Spending	6.45	0.57	5.25	7.25	13.00

	Fresh.Demand	Fresh.Price	Industry.Price	Advertising.Spending
Fresh.Demand	1.00	-0.47	0.74	0.88
Fresh.Price	-0.47	1.00	0.08	-0.47
Industry.Price	0.74	0.08	1.00	0.60
Advertising.Spending	0.88	-0.47	0.60	1.00

Some Simple Graphics

Prices act as we might expect.
Own Price and Industry Price appear to work opposing.
Advertising Spending could be a line or a non-line.

Plotly

Three Dimensions

Let's have a look at the 3-D.

Code

scatter3d(Fresh.Demand~Advertising.Spending+Price.Difference, data=fresh.data, fit=FALSE, residuals=TRUE, bg="white", axis.scales=TRUE, grid=TRUE, ellipsoid=FALSE)
scatter3d(Fresh.Demand~Advertising.Spending+Price.Difference, data=fresh.data, fit="linear", residuals=TRUE, bg="white", axis.scales=TRUE, grid=TRUE, ellipsoid=FALSE)
scatter3d(Fresh.Demand~Advertising.Spending+Price.Difference, data=fresh.data, fit="quadratic", residuals=TRUE, bg="white", axis.scales=TRUE, grid=TRUE, ellipsoid=FALSE)
scatter3d(Fresh.Demand~Advertising.Spending+Price.Difference, data=fresh.data, fit=c("linear","quadratic"), residuals=FALSE, bg="white", axis.scales=TRUE, grid=TRUE, ellipsoid=FALSE)

Naive Regession Model


	Dependent variable:

	Fresh.Demand

Fresh.Price	-2.358^***
	(0.638)

Industry.Price	1.612^***
	(0.295)

Advertising.Spending	0.501^***
	(0.126)

Constant	7.589^***
	(2.445)


Observations	30
R²	0.894
Adjusted R²	0.881
Residual Std. Error	0.235 (df = 26)
F Statistic	72.797^*** (df = 3; 26)

Note:	p<0.1; p<0.05; p<0.01

Interpretation

Conforms to intuition:

Demand is inversely related to own price.
Demand is positively related to industry price.
Advertising enhances demand.

Intervals

Constructing na"ive confidence intervals:

Own price and industry price may cancel out.
-2.36 $\pm t_{.975, 30-4}$*0.64={-1.04,-3.68}
1.61 $\pm t_{.025, 30-4}$*0.30={0.99,2.23}.

How could we test this?

Testing

We can either test a negative equality constraint ($\beta_{1} = -\beta_{2}$) or estimate a single parameter for a variable measuring the price difference.
The former imposes a constraint.
The latter involves a simple manipulation of the data. $Fresh.Price - Industry.Price$ and then compare the two models. \[ y_{t} = \beta_{0} + \beta^{*}_{D}(F.P._{t} - I.P._{t}) + \beta_{3}Ad.Spend_{t} + \epsilon_{t} \]

Result


	Dependent variable:

	Fresh.Demand
	(1)	(2)

Fresh.Price	-2.358^***
	(0.638)

Industry.Price	1.612^***
	(0.295)

Price.Difference		1.588^***
		(0.299)

Advertising.Spending	0.501^***	0.563^***
	(0.126)	(0.119)

Constant	7.589^***	4.407^***
	(2.445)	(0.722)


Observations	30	30
R²	0.894	0.886
Adjusted R²	0.881	0.878
Residual Std. Error	0.235 (df = 26)	0.238 (df = 27)
F Statistic	72.797^*** (df = 3; 26)	104.967^*** (df = 2; 27)

Note:	p<0.1; p<0.05; p<0.01

	Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
1	27	1.53
2	26	1.43	1	0.10	1.85	0.1855

Recreating F

Let's solve for F in terms of r-squared.

What is the difference in r-squared across the two models?
0.007569.
What is the average unexplained variance for the biggest model? 0.0043829
Which yields the following F. 1.8497987

Difference in Predictions?

Basic Scatterplots Revisited

Functional Forms

Though we call it a linear model, the important part about linear is linear coefficients.
Transformations of data make it possible to incorporate non-linear functions. John von Neumann once equated the class of non-linear functions to the class of non-elephant animals.
Caution with overfitting and creating inexplicable functional forms should remain in our minds. There is a widely employed saturated model that fits the data perfectly (with exactly as many parameters as data points).

How should we use the information?

We expect demand to be determined by price and by information.
How should we use the information that we have?
Relevant characteristics of price(s)
What is the effect of advertising spending and what do we think it should be?
For this demand problem, the difference in prices seems sufficient.
For this demand problem, advertising expenditures may be subject to nonconstant marginal returns.

Exploring Advertising Spending

Non-linear fitting

A line struggles with the points on the right of the graphic.
A quadratic term in Advertising.Spending may be justified given non-constant returns.
We can incorporate this with a model written as, \[ y_{t} = \beta_{0} + \beta^{*}_{D}\textrm{Price.Diff.}_{t} + \beta_{3}\textrm{Ad.Spending}_{t} + \beta_{4}\textrm{Ad.Spending}_{t}^{2} + \epsilon_{t} \]
This doesn't violate any assumptions of the regression model as the square is a non-linear transformation.

Tables


	Dependent variable:

	Fresh.Demand
	(1)	(2)	(3)

Fresh.Price	-2.358^***
	(0.638)

Industry.Price	1.612^***
	(0.295)

Price.Difference		1.588^***	1.307^***
		(0.299)	(0.304)

Advertising.Spending	0.501^***	0.563^***	-3.696^*
	(0.126)	(0.119)	(1.850)

I(Advertising.Spending2)			0.349^**
			(0.151)

Constant	7.589^***	4.407^***	17.324^***
	(2.445)	(0.722)	(5.641)


Observations	30	30	30
R²	0.894	0.886	0.905
Adjusted R²	0.881	0.878	0.894
Residual Std. Error	0.235 (df = 26)	0.238 (df = 27)	0.221 (df = 26)
F Statistic	72.797^*** (df = 3; 26)	104.967^*** (df = 2; 27)	82.941^*** (df = 3; 26)

Note:	p<0.1; p<0.05; p<0.01

What does the model say?

All else equal, the price difference effect changes a little
Before, a $0.10 price differential in favor of fresh implied 15,900 more sales [$(1.59x0.1)100,000$]
Now the effect is about 13,100 [$(1.31x0.1)x100,000$].

Model Comparisons

Does Advertising.Spending have an effect? Test the hypothesis that both the linear term and the squared term are jointly zero against the alternative that at least one is not. $F = \frac{\frac{2.8059-1.2733}{(3-1)}}{\frac{1.2733}{26}} = 15.65$.
From the standpoint of model comparison, the regression $F$ statistic does not reject the inclusion of the quadratic term, $F^{E}=t^{2}=5.3176, Pr(F^{E} > F)=0.02934$.
All else equal, as Advertising.Spending increases, sales increase at an increasing rate. Let's look at a picture of the effect.
We could use the tools of calculus to figure this out. Simply take $\frac{\partial \hat{y}}{\partial \textrm{Ad. Spending}} = -3.6956 + 0.3486*\textrm{Ad.Spending}$.

The Picture

The Effect of a Quadratic on the Regression Plane

Instead of a constant slope plane for the regression, we have a non-linear function on one dimension.
what remains are (partial) lines, we are simply shifting this non-linear function.
Let's have a look at the plane implied by this addition to our case.

The Plane

Some Residual Diagnostics

The key model diagnostics suggest a reasonable fit and relevant predictors.
That is not all that we want to justify inference.
Examine key assumptions that undergird the linear model before supplying Enterprise with their forecast.

The Forecasts

Normal Residuals?

par(mfrow=c(2,2))
qqnorm(fresh.model.diff$residuals, main="QQ-Normal: Linear Ad.Spending", datax=TRUE)
qqnorm(fresh.model.sq$residuals, main="QQ-Normal: Quadratic Ad.Spending", datax=TRUE)
plot(fresh.data$Price.Difference,fresh.model.sq$residuals, xlab="Price.Difference")
plot(fresh.data$Price.Difference,fresh.model.diff$residuals, xlab="Price.Difference")

Residual Plots

Putting Together the Forecast

Once we decide on a model, we can come up with at least two very valuable quantities.

The upper bound of the prediction interval [Recall that the prediction interval is \[ \hat{y} \pm t_{\frac{\alpha}{2}}s\sqrt{1 + \frac{1}{n} + \frac{(x_{0} - \overline{x})^{2}}{SS_{xx}}}\]] can be used to bound necessary inventory.
The lower bound of the prediction interval can be used to bound a revenue forecast.

Let's characterize the in choosing among these models.

Picture

Picture 2

What Warnings?

It seems reasonable to expect a further kink in the right of the Advertising Spending graphic were Advertising Spending to increase further. The marginal dollar cannot reasonably be thought to imply such returns.
The general problem this raises is extrapolation.
How do we predict outcomes that are outside of the realm of things we have observed?
More generally, this is where fitting functions to data and being able to generalize functions come to sharp disagreement.

Some New Data

##   Fresh.Price Industry.Price Price.Difference Ad.Exp Fresh.Demand
## 1        3.85           3.80            -0.05   5.50         7.38
## 2        3.75           4.00             0.25   6.75         8.51
## 3        3.70           4.30             0.60   7.25         9.52
## 4        3.70           3.70             0.00   5.50         7.50
## 5        3.60           3.85             0.25   7.00         9.33
## 6        3.60           3.80             0.20   6.50         8.28
##   Ad.Campaign DA DB DC
## 1           B  0  1  0
## 2           B  0  1  0
## 3           B  0  1  0
## 4           A  1  0  0
## 5           C  0  0  1
## 6           A  1  0  0

Introducing Interactions

What if the advertising campaign matters?

FD.lm <- lm(Fresh.Demand ~ Price.Difference+poly(Ad.Exp, 2), data=Fresh)
shapiro.test(FD.lm$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  FD.lm$residuals
## W = 0.9841, p-value = 0.9209

FD.lmI <- lm(Fresh.Demand ~ Price.Difference+Ad.Exp:Ad.Campaign, data=Fresh)
shapiro.test(FD.lmI$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  FD.lmI$residuals
## W = 0.99121, p-value = 0.9958

Comparing Slopes that Depend on Campaigns

anova(FD.lm,FD.lmI)

## Analysis of Variance Table
## 
## Model 1: Fresh.Demand ~ Price.Difference + poly(Ad.Exp, 2)
## Model 2: Fresh.Demand ~ Price.Difference + Ad.Exp:Ad.Campaign
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     26 1.27327                                  
## 2     25 0.75218  1   0.52109 17.319 0.0003268 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Plot

plot(allEffects(FD.lmI))

Are they comparable?

plotmeans(Ad.Exp~Ad.Campaign, data=Fresh)

Prices

plotmeans(Price.Difference~Ad.Campaign, data=Fresh)