This notebook includes suggested solutions to the analytical questions in Assignment 2. For some questions, your answers might not be exactly the same as the suggested solutions. These should be fine in terms of grading. For example, for Question 3a, depending on how the train and test set are splitted, you may have different optimal bandwidths.
load("Assignment_2.Rdata")
ls()
## [1] "ad.trial" "airbnb" "app.installs"
This question uses “app.installs”:
We first construct a function which estimates the Bass model with a vector of adoptions (i.e., installations) at different time periods (i.e., days) as input.
# create a function to estimate the Bass model
estimate_bass <- function(x) {
# create a new variable of cumulative sales
cum.x <- cumsum(as.numeric(x)) # to avoid integer overflow
# run a regression with sales as DV and cum_sales and cum_sales^2 as IVs
mdl <- lm(x ~ cum.x + I(cum.x^2))
# get the coefficients
# a: the intercept
# b: the coefficient of cumulative sales
# c: the coefficient of squared cumulative sales
a <- mdl$coefficients[1]
b <- mdl$coefficients[2]
c <- mdl$coefficients[3]
# solving for p, q and M with a, b and c
M1 <- (-b-sqrt(b^2-4*a*c))/(2*c)
M2 <- (-b+sqrt(b^2-4*a*c))/(2*c)
M <- max(M1,M2)
p <- a/M
q <- -c*M
# output a named vector
bass.par <- c(p,q,M)
names(bass.par) <- c("p","q","M")
return(bass.par)
}
We then apply the function to the whole list to get the Bass parameters of all apps.
apps_bass_paras <- lapply(app.installs,estimate_bass)
# transform into a data frame for easy reading
temp <- t(matrix(unlist(apps_bass_paras),
nrow = 3,
ncol = 291
)
)
apps_bass_paras <- data.frame(apps = names(apps_bass_paras),
p = temp[,1],
q = temp[,2],
M = temp[,3],
stringsAsFactors=F)
apps_bass_paras[order(apps_bass_paras$apps),]
This question uses “ad.trial” as the data.
For this question, we only need to run three regressions with the treatment variable using different outcomes. As noted in the question, we need to transform the variables in the estimations.
The ATE on page views:
page_views <- lm(log(pageviews)~treatment,data = ad.trial)
summary(page_views)
##
## Call:
## lm(formula = log(pageviews) ~ treatment, data = ad.trial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9970 -0.2867 -0.1447 0.3769 0.7068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.973689 0.003473 1720.02 <2e-16 ***
## treatmentcurrent.ad 0.191438 0.004912 38.98 <2e-16 ***
## treatmentother.ad 0.141705 0.004912 28.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3473 on 29997 degrees of freedom
## Multiple R-squared: 0.05172, Adjusted R-squared: 0.05166
## F-statistic: 818 on 2 and 29997 DF, p-value: < 2.2e-16
The ATE on reservations:
reservations <- lm(log(reservations)~treatment,data = ad.trial)
summary(reservations)
##
## Call:
## lm(formula = log(reservations) ~ treatment, data = ad.trial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82164 -0.13912 -0.01262 0.13117 0.65744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.506420 0.001902 1843.826 <2e-16 ***
## treatmentcurrent.ad 0.002710 0.002689 1.008 0.314
## treatmentother.ad 0.205587 0.002689 76.443 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1902 on 29997 degrees of freedom
## Multiple R-squared: 0.204, Adjusted R-squared: 0.204
## F-statistic: 3845 on 2 and 29997 DF, p-value: < 2.2e-16
The ATE on conversion rates (CR):
# create a new variable of conversion rate
ad.trial$CR <- ad.trial$reservations/ad.trial$pageviews
CR <- lm(log(CR/(1-CR))~treatment,data = ad.trial)
summary(CR)
##
## Call:
## lm(formula = log(CR/(1 - CR)) ~ treatment, data = ad.trial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28426 -0.20741 0.03596 0.20195 0.83512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.375566 0.002627 -904.13 <2e-16 ***
## treatmentcurrent.ad -0.205930 0.003716 -55.42 <2e-16 ***
## treatmentother.ad 0.070763 0.003716 19.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 29997 degrees of freedom
## Multiple R-squared: 0.1664, Adjusted R-squared: 0.1663
## F-statistic: 2993 on 2 and 29997 DF, p-value: < 2.2e-16
For this quesiton, we run regressions which interact the types of restaurants (ad.trial:restaurant_type) and the treatment variable (ad.trial:treatment).
The heterogeneous treatment effects on page views:
hetero_page_views <- lm(log(pageviews)~treatment*restaurant_type,data = ad.trial)
summary(hetero_page_views)
##
## Call:
## lm(formula = log(pageviews) ~ treatment * restaurant_type, data = ad.trial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71762 -0.06745 0.00480 0.07655 0.41490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.694350 0.001449 3930.675 <2e-16
## treatmentcurrent.ad 0.225303 0.002049 109.970 <2e-16
## treatmentother.ad 0.142213 0.002049 69.414 <2e-16
## restaurant_typechain 0.698349 0.002291 304.877 <2e-16
## treatmentcurrent.ad:restaurant_typechain -0.084662 0.003239 -26.135 <2e-16
## treatmentother.ad:restaurant_typechain -0.001271 0.003239 -0.392 0.695
##
## (Intercept) ***
## treatmentcurrent.ad ***
## treatmentother.ad ***
## restaurant_typechain ***
## treatmentcurrent.ad:restaurant_typechain ***
## treatmentother.ad:restaurant_typechain
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1122 on 29994 degrees of freedom
## Multiple R-squared: 0.901, Adjusted R-squared: 0.901
## F-statistic: 5.46e+04 on 5 and 29994 DF, p-value: < 2.2e-16
The heterogeneous treatment effects on reservations:
hetero_reservations <- lm(log(reservations)~treatment*restaurant_type,data = ad.trial)
summary(hetero_reservations)
##
## Call:
## lm(formula = log(reservations) ~ treatment * restaurant_type,
## data = ad.trial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96377 -0.09089 0.00970 0.09632 0.51531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.391496 0.001748 1940.607 <2e-16
## treatmentcurrent.ad 0.001477 0.002472 0.598 0.55
## treatmentother.ad 0.225755 0.002472 91.342 <2e-16
## restaurant_typechain 0.287311 0.002763 103.975 <2e-16
## treatmentcurrent.ad:restaurant_typechain 0.003082 0.003908 0.789 0.43
## treatmentother.ad:restaurant_typechain -0.050420 0.003908 -12.902 <2e-16
##
## (Intercept) ***
## treatmentcurrent.ad
## treatmentother.ad ***
## restaurant_typechain ***
## treatmentcurrent.ad:restaurant_typechain
## treatmentother.ad:restaurant_typechain ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1354 on 29994 degrees of freedom
## Multiple R-squared: 0.5967, Adjusted R-squared: 0.5966
## F-statistic: 8876 on 5 and 29994 DF, p-value: < 2.2e-16
This question uses “airbnb” as the data.
Here we try to select the “optimal” bandwidth with the 2-fold cross-validation method. The optimal bandwidth is the one that has the smallest mean squared error (MSE) on the testing sample. We test 5 bandwidth. To select the optimal bandwidth, we adopt a regression-based estimator of neighborhood effects. We estimate a regression equation like this: \(Price = Neighbourhood + Distance + Neighbourhood\times Distance + e\).
#===== to split data into test and train set --------------------
# Set seeds for replication
set.seed(12345)
# Permutate the index
N <- dim(airbnb)[1]
idx <- sample(1:N,N,replace = F)
# To obtain a random subsample as train and test data
train <- airbnb[idx[1:(N/2)],]
test <- airbnb[idx[(N/2+1):N],]
#===== To select optimal bandwidths ------------------------------
# Here we test 5 bandwidths.
# standard deviation of distance
H <- rep(sd(airbnb$distance),5)*seq(0.1,0.9,0.2)
# a vector to store MSE
H.fit <- rep(0,5)
# Loop over 5 bandwidths
for (i in 1:5) {
# This is the model specified above. More complex method may be used here.
mdl <- lm(price~neighbourhood*distance,train[abs(train$distance)<H[i],])
# To obtain the predicted outcome for test data.
price.hat <- predict(mdl,newdata = test[abs(test$distance)<H[i],])
# To calcualte and store MSE
H.fit[i] <- mean((test$price[abs(test$distance)<H[i]]-price.hat)^2)
}
# To get the bandwidth of the one with mininum MSE
H <- H[H.fit==min(H.fit)]
# The MSE of the 5 bandwidths:
H.fit
## [1] 0.2796239 0.2795826 0.2843565 0.2734253 0.2711636
#The optimal bandwidth is:
H
## [1] 1.126712
After having the optimal bandwidth \(H\), we rerun the model with the full data. This gives us a “local average treatment effects” (LATE), i.e., the neighborhood effects on prices for hotels located close to the “boarder” of the two neighborhoods.
#Rerun a model on full data with bandwidth set to H
mdl <- lm(price~neighbourhood*distance,airbnb[abs(airbnb$distance)<H,])
# The neighborhood effect is measured by the coefficient of the neighborhood dummy.
summary(mdl)
##
## Call:
## lm(formula = price ~ neighbourhood * distance, data = airbnb[abs(airbnb$distance) <
## H, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07310 -0.33320 -0.04758 0.28273 2.90143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.24018 0.03052 171.706 < 2e-16 ***
## neighbourhoodCentrum-West -0.08917 0.04003 -2.228 0.025988 *
## distance 0.16662 0.04898 3.402 0.000681 ***
## neighbourhoodCentrum-West:distance -0.19706 0.06462 -3.050 0.002315 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5106 on 2491 degrees of freedom
## Multiple R-squared: 0.005084, Adjusted R-squared: 0.003885
## F-statistic: 4.243 on 3 and 2491 DF, p-value: 0.005337
To rerun the model with control variables added to check the robustness of the LATE.
#Rerun a model on full data with bandwidth set to H
mdl.1 <- lm(price~neighbourhood*distance +
room_type +
reviews_per_month +
review_scores_rating +
host_is_superhost,airbnb[abs(airbnb$distance)<H,])
# The neighborhood effect is measured by the coefficient of the neighborhood dummy.
summary(mdl.1)
##
## Call:
## lm(formula = price ~ neighbourhood * distance + room_type + reviews_per_month +
## review_scores_rating + host_is_superhost, data = airbnb[abs(airbnb$distance) <
## H, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14868 -0.31188 -0.03687 0.27048 2.81899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.488548 0.148381 30.250 < 2e-16 ***
## neighbourhoodCentrum-West -0.079688 0.037094 -2.148 0.031787 *
## distance 0.193163 0.045476 4.248 2.24e-05 ***
## room_typeEntire home/apt 0.401022 0.024010 16.702 < 2e-16 ***
## room_typeShared room -0.503600 0.180045 -2.797 0.005196 **
## reviews_per_month -0.016357 0.007995 -2.046 0.040874 *
## review_scores_rating 0.004974 0.001531 3.249 0.001174 **
## host_is_superhostTRUE 0.066735 0.025176 2.651 0.008081 **
## neighbourhoodCentrum-West:distance -0.232013 0.059954 -3.870 0.000112 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4726 on 2485 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1485, Adjusted R-squared: 0.1457
## F-statistic: 54.15 on 8 and 2485 DF, p-value: < 2.2e-16