Data and code are on the course site. This includes lightbeer-start.R; much of the code you will need has already been written for you in this script.
The data are 73,115 observations of light beer purchases in the US, taken from the Nielsen Home-Scan database.
lightbeer.csv contains the purchase records along with household demographic information.
We will consider beer spend, price per fluid ounce (price_floz), and beer brand (and transformations thereof) as the responses of interest.
Our starter script defines the basic regression mle.
Write the mathematical formula for the regression model behind mle and describe it in words. You don’t need to list every demographic variable, but be sure to describe important interactions and the probability model that is implied by the deviance we’ve minimized.
mle <- glm( logprice ~ logvol*promo*brand + container + ., data=demog )
\[
\log(\text{price})_i = \beta_0 + \beta_1\log(\text{vol})_i + \beta_2\,\text{promo}_i + \beta_3\,\text{brand}_i + \beta_4\log(\text{vol})_i\,\text{promo}_i + \beta_5\log(\text{vol})_i\,\text{brand}_i + \beta_6\,\text{promo}_i\,\text{brand}_i
\]
\[
\qquad + \beta_7\log(\text{vol})_i\,\text{promo}_i\,\text{brand}_i + \beta_8\,\text{container}_i + \sum_{j=1}^{p}\gamma_j\,x^{\text{demog}}_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2)
\]
The mle model predicts the log price per fluid ounce of a purchase from the log volume purchased (logvol), the promotional status of the purchase (promo), and the brand (brand), along with container type and the household demographic variables (brand and the demographics enter as sets of dummy variables). The logvol*promo*brand term expands into all main effects, all two-way interactions, and the three-way interaction among these three variables, so the price-volume relationship (e.g., quantity discounts) is allowed to differ by brand and by whether the purchase was on promotion. Because glm is fit with the default Gaussian family, the deviance we minimize is the sum of squared residuals, which is maximum likelihood under independent normal errors with constant variance.
What is our estimate for the error variance?
The model deviance is 2042.3423749 and the null deviance is 4712.442879; their ratio, 43.34%, is the share of variation left unexplained, so the model explains about 56.7% of the variance.
Our estimate of the error variance is the variance of the fitted residuals, \(\hat\sigma^2 = 0.0279903\).
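As a quick sketch (using the fitted mle object above), essentially the same estimate can be read directly off the glm fit:

# error variance estimate: residual deviance (sum of squared errors) over residual degrees of freedom
sigma2_hat <- mle$deviance / mle$df.residual
sigma2_hat
# nearly identical to the sample variance of the residuals
var(mle$residuals)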
Compare (i.e., with a plot) the fitted residuals and log volume.
# residuals and fitted values from mle, alongside log volume
a = data.frame(mle$residuals, logvol, mle$fitted.values)
# fitted values vs. log volume
ggplot(a, aes(x = mle.fitted.values, y = logvol)) + geom_point() + geom_smooth(method = "lm")
# residuals vs. log volume
ggplot(a, aes(x = mle.residuals, y = logvol)) + geom_point() + geom_smooth(method = "lm")
(a) describe the fit
According to the plots above, the fit between log volume and the residuals from the mle model is poor: there are many extreme values throughout. Ideally, most of the residuals would sit close to 0, meaning the predicted values match the actual values. Instead, a large share of the points lie well above and below 0.
(b) suggest one degree of freedom that could be added to your model to improve the fit (and see how it works).
To improve the fit, we added the variable logspend to the model.
mle2 <- glm( logprice ~ logvol*promo*brand + container + logspend + ., data=demog )
# residuals and fitted values from the expanded model
b = data.frame(mle2$residuals, logvol, mle2$fitted.values)
# fitted values vs. log volume
ggplot(b, aes(x = mle2.fitted.values, y = logvol)) + geom_point() + geom_smooth(method = "lm")
# residuals vs. log volume
ggplot(b, aes(y = mle2.residuals, x = logvol)) + geom_point() + geom_smooth(method = "lm")
Compared to the plots from the original mle model, the residuals are much smaller in magnitude, so adding logspend noticeably improves the fit.
Given the original mle fit, all other things equal, what is the change in price
- for Rural North Dakota relative to Chicago?
- for Busch Light relative to Natural Light?
- for a Coors Light purchase on promotion (evaluated at the average log volume)?
# Chicago is the baseline market absorbed into the intercept, so the Rural North Dakota
# coefficient gives the change relative to Chicago; exponentiate for the multiplicative effect on price
NorthDakota = coef(mle)["marketRURAL NORTH DAKOTA"]
print(exp(NorthDakota))
# Busch Light relative to Natural Light: the difference of the two brand coefficients
Busch = coef(mle)["brandBUSCH LIGHT"]
NattieLight = coef(mle)["brandNATURAL LIGHT"]
busch.vs.nattie = exp(Busch - NattieLight)
print(busch.vs.nattie)
# promotion effect for Coors Light, evaluated at the mean of log volume
Coors = coef(mle)["brandCOORS LIGHT"]                                        # Coors Light main effect
Promotion = coef(mle)["promoTRUE"]                                           # promotion main effect
coors.promo = coef(mle)["logvol:promoTRUE:brandCOORS LIGHT"]                 # three-way interaction
mean.vol.coors.promo = mean(logvol)*coef(mle)["promoTRUE:brandCOORS LIGHT"]  # promo x Coors term, scaled by mean log volume
mean.vol.truepromo = mean(logvol)*coef(mle)["logvol:promoTRUE"]              # logvol x promo term at mean log volume
# multiplicative change in price when Coors Light goes on promotion
exp(Promotion + coors.promo + mean.vol.coors.promo + mean.vol.truepromo)
Exponentiating gives 0.4863119: evaluated at the mean log volume, the price of a Coors Light purchase on promotion is roughly 51% lower, all else equal.
Plot the p-value distribution. What does it tell you about the underlying signal in this regression? What is the p-value cutoff for an FDR of 1%? What is the FDR at a p-value cutoff of 0.05?
# p-values for every coefficient in mle
pvals = coef(summary(mle))[,4]
alpha = .01
# histogram of the p-value distribution
ggplot(data.frame(pval = pvals), aes(x = pval)) + geom_histogram(bins = 6)
# Benjamini-Hochberg cutoff for a 1% FDR target, and the count of p-values below it
fdr1 = fdr_cut(pvals, alpha, plot_it = TRUE)
num_fdr1 = sum(pvals <= fdr1)
# repeat at a 5% target
fdr2 = fdr_cut(pvals, .05, plot_it = TRUE)
num_fdr2 = sum(pvals <= fdr2)
The p-value distribution is heavily concentrated near zero rather than uniform, which indicates that a large share of the coefficients carry real signal. The cutoff for an FDR of 1% is 0.0077746, leaving 122 p-values below the cutoff line; at a target of 5% the cutoff rises to 0.0392872, leaving 133 p-values below the new cutoff.
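For reference, a minimal sketch of the Benjamini-Hochberg rule that fdr_cut is assumed to implement: sort the p-values and return the largest \(p_{(k)}\) with \(p_{(k)} \le q\,k/N\).

# hypothetical stand-in for the course fdr_cut helper (an assumption about its logic)
bh_cut <- function(pvals, q){
  pvals <- sort(pvals)
  N <- length(pvals)
  ok <- which(pvals <= q * (1:N) / N)   # p-values under the BH line
  if(length(ok) == 0) return(0)         # nothing passes: cutoff of zero
  pvals[max(ok)]                        # largest p-value under the line
}
bh_cut(coef(summary(mle))[,4], q = .01)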
Run a bootstrap for the sampling distribution of the coefficient on \(\log(\text{volume})\) (i.e., \(\beta_{\text{logvol}}\)), using the code and random seed in the starter script.
### 1.5
# bootstrapping
# this code takes a minute to run...
B <- 20
volbeta <- rep(NA,B)
set.seed(41201)
for(b in 1:B){
#print(b)
mleb = glm( logprice ~ logvol*promo*brand + container + .,
data=demog,
subset = sample(1:length(logprice), size = 10000, replace = TRUE))
volbeta[b] = coef(mleb)[['logvol']]
#print(paste0("b: ",b," ", volbeta[b]))
}
bootstrap.df = data.frame("Bootstrap Number" = 1:B, "Bootstrapped Coefficient" = volbeta, check.names = FALSE)
kable(bootstrap.df)
Compare the 80% confidence interval built from your bootstrap to the theoretical 80% confidence interval implied by summary(mle). Give potential explanations for any discrepancy in the regions covered by these intervals.
# bootstrap 80% interval vs. the theoretical (normal) 80% interval from summary(mle)
# (confint(mle, "logvol", level = .8) gives the same theoretical interval)
real_beta = coef(summary(mle))['logvol',]   # estimate, std. error, t value, p-value
coef.val = real_beta[1]
# theoretical bounds: estimate plus/minus z times the standard error
ub_beta = coef.val + qnorm(.9)*real_beta[2]
lb_beta = coef.val + qnorm(.1)*real_beta[2]
# bootstrap bounds: 10% and 90% quantiles of the bootstrapped coefficients
bootstrap_ci = quantile(volbeta, c(.1, .9))
ci = data.frame(c(bootstrap_ci[1], lb_beta),
                c(bootstrap_ci[2], ub_beta))
colnames(ci) = c("10%", "90%")
rownames(ci) = c("bootstrapped confidence interval", "confidence interval from regression")
kable(ci)
The intervals differ because each bootstrap fit uses only a 10,000-row subsample drawn with replacement rather than a resample of the full 73,115-row data set, so the bootstrapped coefficients vary more than the full-sample theory implies; in addition, with only B = 20 bootstrap draws, the 10% and 90% quantiles are themselves estimated quite noisily.
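To see how much of the gap comes from the 10,000-row subsampling rather than from the bootstrap itself, the same loop could be re-run with full-size resamples. A sketch only (it reuses the objects from the bootstrap chunk above and is slow to run):

# same bootstrap loop, but resampling all n rows with replacement
B <- 20
volbeta_full <- rep(NA, B)
set.seed(41201)
for(b in 1:B){
  mleb <- glm(logprice ~ logvol*promo*brand + container + .,
              data = demog,
              subset = sample(1:length(logprice), size = length(logprice), replace = TRUE))
  volbeta_full[b] <- coef(mleb)[['logvol']]
}
quantile(volbeta_full, c(.1, .9))   # full-sample bootstrap 80% interval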
The starter script defines a large design of interactions between price and our other variables. This is combined with the beer and demographic information in the sparse matrix x.
Run a cross-validated lasso for logspend onto x.
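The cross-validated fit referenced below as cv.logspend is assumed to come from something like the following sketch (the exact call is in the starter script):

library(gamlr)
# cross-validated lasso of log spend onto the sparse interaction design x
cv.logspend <- cv.gamlr(x, logspend, verb = TRUE)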
Describe the criteria used to choose models under select=“1se” and select=“min” rules.
# cross-validation error curve; the vertical lines mark the "min" and "1se" lambda choices
plot(cv.logspend)
The select = "min" rule chooses the \(\lambda\) with the minimum average out-of-sample deviance across the CV folds (the left-most vertical line in the plot). The select = "1se" rule chooses the largest \(\lambda\) whose average out-of-sample deviance is within one standard error of that minimum, accepting slightly worse estimated fit in exchange for a stronger penalty and fewer nonzero coefficients.
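For reference, the two selected penalties can be pulled directly from the cv.gamlr object (a quick sketch, assuming cv.logspend as above):

# log penalty weights chosen under each rule
log(cv.logspend$lambda.min)   # lambda minimizing average OOS deviance
log(cv.logspend$lambda.1se)   # largest lambda within one standard error of that minimum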
What are estimated out-of-sample R2 for models fit under each rule?
# estimated out-of-sample R2 at the segment chosen under each rule
r2_min <- summary(cv.logspend)[cv.logspend$seg.min, "oos.r2"]
r2_1se <- summary(cv.logspend)[cv.logspend$seg.1se, "oos.r2"]
# for comparison: in-sample correlation between predicted and actual spend (dollar scale)
r2_val_min <- cor(exp(predict(cv.logspend, x, select = "min")[,1]), exp(logspend))
r2_val_1se <- cor(exp(predict(cv.logspend, x, select = "1se")[,1]), exp(logspend))
The estimated \(R^2\) for the min-\(\lambda\) model is 53.56%, and for the 1se-\(\lambda\) model it is 53.21%.
Compare AICc, AIC, and BIC selection for this lasso to each other and to the CV choices. (You can exponentiate each IC divided by n to plot them on the same scale as the CV results).
# correlation between predicted and actual spend (dollar scale) under each selection rule
cv_min = cor(exp(predict(cv.logspend, x, select = "min")[,1]), exp(logspend))
cv_1se = cor(exp(predict(cv.logspend, x, select = "1se")[,1]), exp(logspend))
# logspend2 is the (un-cross-validated) gamlr path fit for logspend;
# select = NULL uses information-criterion selection, AICc by default
aicc = cor(exp(predict(logspend2, x, select = NULL)[,1]), exp(logspend))
# corrected = FALSE drops the AICc correction, giving plain AIC
aic = cor(exp(predict(logspend2, x, select = NULL, corrected = FALSE)[,1]), exp(logspend))
# k = log(n) turns AIC into BIC
bic = cor(exp(predict(logspend2, x, select = NULL, k = log(logspend2$nobs), corrected = FALSE)[,1]), exp(logspend))
table1 = data.frame("Method" = c("Minimum Cross Validation",
                                 "1se Cross Validation",
                                 "AICc", "AIC", "BIC"),
                    "Correlation" = c(cv_min, cv_1se, aicc, aic, bic))
kable(table1)
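The question also suggests putting the information criteria on the same scale as the CV error by exponentiating each IC divided by n. A minimal sketch of that comparison (assuming the gamlr path fit logspend2 and the package's IC functions):

# exp(IC/n) along the path, comparable to the average OOS deviance curve from plot(cv.logspend)
n <- logspend2$nobs
ics <- cbind(AICc = AICc(logspend2), AIC = AIC(logspend2), BIC = BIC(logspend2))
matplot(log(logspend2$lambda), exp(ics/n), type = "l", lty = 1,
        xlab = "log lambda", ylab = "exp(IC/n)")
legend("topleft", legend = colnames(ics), col = 1:3, lty = 1, bty = "n")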
What is the effect of price on sales of Bud Light for people earning > $200k in Los Angeles?
# sub: row index for the Los Angeles, income > $200k subset (defined in an earlier chunk)
la.x = x[sub,]
la.logprice = log(lb[sub,]$price_floz)
# lasso of log price per floz onto the design, for this subset only
los.angeles = gamlr(la.x, la.logprice, lmr = 1e-4)
la.df = data.frame("Value" = coef(los.angeles)[,1],
                   "Variable Name" = rownames(coef(los.angeles)))
la.df = la.df[la.df$Value != 0, ]     # keep only the selected (nonzero) coefficients
la.df$exp.value = exp(la.df$Value)    # multiplicative effects on price
kable(la.df[,-2])
Compare this to the treatment effect estimates in question 2.
Why is it different and in what settings would we prefer the current model?
What is the average price sensitivity in our model? What is the price sensitivity for the median consumer (see starter script for xbar median).
Price elasticity of demand: \(\frac{\Delta \text{Quantity}}{\Delta \text{Price}} \cdot \frac{P}{Q}\)
Average price sensitivity: 0.0237162. For the average consumer, a 1% increase in price is associated with roughly a 0.024% increase in spending, all else equal.
Median price sensitivity (evaluated at the median consumer profile xbar): 0.1154365, i.e., roughly a 0.12% increase in spending for a 1% price increase.
Define the binary outcome variable ybud that is one for Bud Light and zero otherwise. Fit the lasso path for regression of ybud onto the demographic information in xdemog (no need for cross validation). Plot and describe (quickly) the fitted values.
# binary outcome: 1 if the purchase is Bud Light, 0 otherwise
lb$ybud = ifelse(lb$beer_brand == "BUD LIGHT", 1, 0)
# lasso path for the logistic regression of ybud on the demographics
ybud.reg = gamlr(xdemog, lb$ybud, family = "binomial")
plot(ybud.reg, xvar = "lambda")
# fitted probabilities from the selected segment of the path
prediction = predict(ybud.reg, xdemog, type = "response")
predictions = data.frame(prediction, prediction, lb$ybud)
colnames(predictions) = c("Prediction", "Threshold.Result", "Reality")
# classify as a potential Bud Light customer when the fitted probability exceeds 0.1
predictions$Threshold.Result = ifelse(predictions$Prediction > .1, 1, 0)
predictions$compare = ifelse(predictions$Threshold.Result == predictions$Reality, 1, 0)
kable(head(predictions))
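The question also asks for a quick plot of the fitted values; one simple option (a sketch using the predictions data frame above) is to compare the fitted probabilities for actual Bud Light and non-Bud Light purchases:

# distribution of fitted probabilities by actual outcome
boxplot(predictions$Prediction ~ predictions$Reality,
        xlab = "actual Bud Light purchase (0/1)", ylab = "fitted probability")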
What is the precise function that was minimized when fitting the AICc selected model here?
The AICc-selected model here is a lasso-penalized logistic regression, so the precise function minimized (at the AICc-selected penalty weight \(\lambda\)) is the binomial deviance plus an L1 penalty on the coefficients:
\[
\min_{\beta_0,\beta}\; -\frac{2}{n}\sum_{i=1}^{n}\Big[y_i(\beta_0 + x_i'\beta) - \log\big(1+e^{\beta_0 + x_i'\beta}\big)\Big] + \lambda\sum_{j}|\beta_j|,
\]
where \(y_i\) is ybud and \(x_i\) contains the demographic covariates in xdemog. The fitted coefficients are log-odds effects on the probability that a purchase is Bud Light. AICc is used only to pick which \(\lambda\) along the path to report:
\[
\text{AICc} = \text{AIC} + \frac{2k^2 + 2k}{n-k-1},
\]
where \(k\) is the number of parameters in the selected model.
What is the in-sample R2 for this fit?
r.squared = R2(lb$ybud, prediction, family="binomial")   # deviance-based R2 (course helper function)
The in-sample \(R^2\) for this fit is 0.3730698.
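Here R2() is assumed to be the deviance-based \(R^2\) helper from the course utilities, i.e.
\[
R^2 = 1 - \frac{\mathrm{dev}(y, \hat p)}{\mathrm{dev}(y, \bar y)},
\]
with the binomial deviance of the fitted probabilities in the numerator and the deviance of the intercept-only (null) model in the denominator.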
What is the sensitivity and specificity of a classification rule that considers every consumer with probability greater than 0.1 of buying Bud Light a potential Bud customer? Plot the ROC curve and add this point to the picture.
library(caret)
# confusion matrix for the 0.1-threshold classification rule
cm = caret::confusionMatrix(predictions$Threshold.Result, predictions$Reality)
# extract the sensitivity and specificity for flagging Bud Light buyers
sensitivity = cm$byClass[2]
specificity = cm$byClass[1]
# ROC curve (roc() helper from the course scripts), with the 0.1-threshold rule marked
roc(p = prediction, y = lb$ybud)
points(x = (1 - specificity), y = sensitivity, col = "blue", cex = 1.4, pch = 19)
At the 0.1 threshold, the sensitivity of this classification rule is 0.4143586 and the specificity is 0.9575306.
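For reference, with Bud Light purchases treated as the positive class, these are computed from the confusion matrix below as
\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}.
\]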
# lay the confusion matrix out with row and column totals
# (stored as cm.table so the sparse design matrix x is not overwritten)
cm.table = as.table(cm$table)
cm.table <- cbind(cm.table, Total = rowSums(cm.table))
cm.table <- rbind(cm.table, Total = colSums(cm.table))
row.names(cm.table) = c("Predicted Not Bud Light", "Predicted Bud Light", "Total")
colnames(cm.table) = c("Reality Not Bud Light", "Reality Bud Light", "Total")
cm.table = as.data.frame(cm.table)
kable(cm.table, caption = "Confusion matrix for the full data set, threshold p > 0.1")
Looking only at the main effect coefficient (in starter script), how do the odds of a consumer choosing Bud Light change in Rural California relative to the intercept market? What markets are included in this intercept? How does the story change when you consider interaction between income and being in Rural California?
# coefficients from the binary Bud Light fit, on the log-odds scale
ybud.coef = coef(ybud.reg)
ybud.coef = data.frame("Variable" = rownames(ybud.coef), "Coefficient Value" = ybud.coef[,1])
# Rural California x income interactions (plus the intercept), exponentiated to odds multipliers
cali = filter(ybud.coef, grepl("RURAL CALIFORNIA:income", Variable) | grepl("intercept", Variable))
cali$Exp_Value = exp(cali$Coefficient.Value)
kable(cali)
From these interactions, people in Rural California earning under $20k are much more likely to buy Bud Light than the base market, and those with income between $20k and $60k are also somewhat more likely; consumers earning $60k-$100k, $100k-$200k, or $200k+ are less likely to buy Bud Light.
# market main-effect coefficients (no interactions), sorted by magnitude
markets = arrange(ybud.coef, desc(abs(Coefficient.Value))) %>%
  filter((grepl("market", Variable) & !grepl(":", Variable)) | grepl("intercept", Variable))
kable(markets)
Markets whose main-effect dummies the lasso shrinks to exactly zero are pooled into the intercept; here that is all of the markets.
Use the starter script commands to run a multinomial logistic regression for brand choice onto the demographic info. Quickly describe the computing environment set up by the makeCluster command here (including the number of cores you are using). Consider the multinomial regression coefficients related to your answers in 3.4: what can you add to your story?
The multinomial regression uses parallel computing to quickly estimate the coefficients for each brand over the variables and interaction effects in xdemog. The makeCluster() function creates a cluster of worker processes that spreads the job of calculating the coefficients across different CPU cores, and detectCores() reports how many cores are available; here a cluster of 4 cores is used to run the job, so the estimation work is split four ways.
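A minimal sketch of this setup (the actual cluster call is in the starter script, and the multinomial fit that produces multifit is assumed to be passed this cluster):

library(parallel)
detectCores()            # how many CPU cores are available on this machine
cl <- makeCluster(4)     # start a cluster of 4 worker processes
# ... the multinomial regression (multifit) is fit using the cluster cl ...
stopCluster(cl)          # shut the workers down when finished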
# coefficients from the binary ybud fit and from the multinomial fit, merged by variable name
coef1 = as.data.frame(as.matrix(coef(ybud.reg)))
colnames(coef1) = "BINOMIAL BUD LIGHT"             # the single coefficient column from ybud.reg
coef1$names = rownames(coef1)
coef2 = as.data.frame(as.matrix(coef(multifit)))   # one coefficient column per brand
coef2$names = rownames(coef2)
coef3 = merge(coef2, coef1, by = "names")
# exponentiate each coefficient column so the entries are multiplicative (odds-scale) effects
coef3$`BUD LIGHT` = exp(coef3$`BUD LIGHT`)
coef3$`BUSCH LIGHT` = exp(coef3$`BUSCH LIGHT`)
coef3$`COORS LIGHT` = exp(coef3$`COORS LIGHT`)
coef3$`MILLER LITE` = exp(coef3$`MILLER LITE`)
coef3$`NATURAL LIGHT` = exp(coef3$`NATURAL LIGHT`)
coef3$`BINOMIAL BUD LIGHT` = exp(coef3$`BINOMIAL BUD LIGHT`)
kable(arrange(coef3, desc(coef3$`BUD LIGHT`)))
A notable change under the multinomial regression is in the rural market coefficient, where the implied probability of choosing Bud Light increases compared to the previous logistic regression on Bud Light alone.