This document includes the suggested solutions to the final project. For the grading rubrics, please see Canvas.

load("Final_Project.RData")

Question 1

Question 1a

# to specify the constraint matrix with 4 parameters
ui <- diag(4) # an identity matrix (4 x 4) 
ci <- rep(0,4) # an all-zero vector of length 4

# para: the starting values; a vector with 4 elements corresponding to (r,alpha,a,b). 
# the estimates usually do not change much across different starting values. 
# you can change this to other values or test several alternatives.
para <- rep(.01,4)

# to run the constrained optimization 
results <- constrOptim(para, lik_bgnbd, NULL, ui, ci)

# to get the estimated parameters in the order of (r,alpha,a,b)
para <- results$par
names(para) <- c("r","alpha","a","b")
para
##          r      alpha          a          b 
## 0.05731699 0.43525166 0.40168860 3.43605575
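
As noted in the comments above, the estimates should not be sensitive to the starting values. Below is a minimal sketch to verify this, reusing lik_bgnbd, ui, and ci from above (the candidate starting values are arbitrary choices for illustration):

# try a few alternative starting values and compare the resulting estimates
start_list <- list(rep(0.01, 4), rep(0.1, 4), rep(1, 4))
est_by_start <- sapply(start_list, function(s) constrOptim(s, lik_bgnbd, NULL, ui, ci)$par)
rownames(est_by_start) <- c("r","alpha","a","b")

# the columns should be (nearly) identical if the optimum is stable
est_by_start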

Question 1b

To do the prediction, we can use the function pred_Y and set the time t = 60.

# so we set t = 60
t <- 60

# a loop procedure
# you can also use sapply instead of the loop (see the sketch after this block)

# a vector to store the results
Yhat <- rep(0,2074)

for (i in 1:2074) {
  
  H <- CLV_data[i,] # getting the i-th user
  Yhat[i] <- as.numeric(pred_Y(para,H,t)) # storing the results of the i-th user
  
}
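
As mentioned in the comments, the same computation can be written with sapply instead of the explicit loop, which avoids pre-allocating Yhat:

# equivalent to the loop above: apply pred_Y to each row of CLV_data
Yhat <- sapply(1:2074, function(i) as.numeric(pred_Y(para, CLV_data[i,], t)))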

With Yhat, we first visualize the distribution with a histogram.

hist(Yhat, breaks = 50)

To get a closer look at the distribution of Yhat, we next compute the descriptive statistics.

desc_Yhat <- c(mean(Yhat),median(Yhat),sd(Yhat))
names(desc_Yhat) <- c("Mean","Median","Std. Dev.")
desc_Yhat
##       Mean     Median  Std. Dev. 
##  3.0010238  0.1037716 15.0097306

Question 1c

This is an open question. Please see Canvas for the grading criteria. One possible answer could be:

  • The distribution is heavily right-skewed: most users have (close to) no predicted clicks, while a small fraction of users account for most of the predicted clicks (see the sketch below).
  • For the recommender system, this provides a useful benchmark for 1) whether this distribution is what the company expects, and 2) how effective the recommender system can be, given the large mass of users with very few clicks.
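
A quick way to quantify the concentration claim in the first bullet (a sketch; the 90th percentile is an arbitrary cutoff):

# share of total predicted clicks accounted for by the top 10% of users
cutoff <- quantile(Yhat, 0.9)
sum(Yhat[Yhat >= cutoff]) / sum(Yhat)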

Question 2

Question 2a

As Cards is a list, we use lapply to apply the function estimate_bass to each element of the list.

# estimate the bass parameters
bass.pars <- lapply(Cards, estimate_bass)

# transform the parameters to a data frame
bass.pars <- t(matrix(unlist(bass.pars), 3, 2538))
bass.pars <- as.data.frame(bass.pars)
colnames(bass.pars) <- c("p","q","M")

We can next obtain the histogram and descriptive statistics of the Bass parameters. We first build a function to produce the results and then apply it to each of the three parameters in bass.pars.

# the first input argument is the vector x
# the second is the name to be used 
descr_bass <- function(x, name.x) {
  # hist() draws the histogram as a side effect
  hist(x, breaks = 50,
       main = paste("Histogram of", name.x))
  # descriptive statistics, returned as the function value
  desc <- c(mean(x), median(x), sd(x))
  names(desc) <- paste(c("Mean","Median","Std. Dev."),
                       "of", name.x) 
  desc
}

The descriptive statistics and histogram of p:

descr_bass(bass.pars$p,"p")

##      Mean of p    Median of p Std. Dev. of p 
##     0.02208286     0.01251350     0.02931045

The descriptive statistics and histogram of q:

descr_bass(bass.pars$q,"q")

##      Mean of q    Median of q Std. Dev. of q 
##      0.5398106      0.1442026      4.4683900

The descriptive statistics and histogram of M:

descr_bass(bass.pars$M,"M")

##      Mean of M    Median of M Std. Dev. of M 
##     10114.3572       253.8988     57296.7644

Question 2b

You can build the model in different ways. The objective is to beat the benchmark. I’ll show you here the benchmark model, a simple linear regression. Note that with a simple linear regression you cannot include talkId or contentId as IVs, because these categorical variables would expand into far too many dummy variables. For your own model, you can use these variables; for example, a LASSO-based regression can include contentId as an extra IV, since the penalty shrinks many of the dummy coefficients to zero (a sketch is shown after the benchmark results below). For the specific grading criteria, please check Canvas.

First, let’s merge the two datasets bass.pars and Cards_features:

Cards_features <- cbind(Cards_features, bass.pars[,1:2])
head(Cards_features)
##         mlogId publishTime type contentId talkId gender registeredMonthCnt
## 1 GCGCHCMCHCPC         168    0        45     38 female                 48
## 2 GCGCKCKCOCNC         168    0        45    361   male                 48
## 3 GCHCJCMCKCNC         168    0         6     21   male                 32
## 4 GCHCLCNCGCMC         168    0        45    346   male                 37
## 5 GCHCLCPCOCKC         168    0        45    332 female                 43
## 6 GCHCMCLCOCPC         168    0        45     38 female                 30
##   follows followeds creatorType level           p          q
## 1       5      2204           1     8 0.008308272 0.03773790
## 2      20        21           1     7 0.005352587 0.03415290
## 3       9        49           1     8 0.017085047 0.17263693
## 4      22        47           1     8 0.012667936 0.07551139
## 5      26        43           1     8 0.012376498 0.06145323
## 6      18       105           1     9 0.071706785 0.01644253

To run a simple linear regression on the first 2,000 observations (the training sample):

benchmark_p <- lm(log(p) ~ publishTime + type + gender + registeredMonthCnt + 
                  follows + followeds + creatorType + level, Cards_features[1:2000,])
benchmark_q <- lm(log(q) ~ publishTime + type + gender + registeredMonthCnt + 
                  follows + followeds + creatorType + level, Cards_features[1:2000,])

To compute the MSE of the benchmark models on the test sample (the remaining observations):

mse_benchmark_p <- predict(benchmark_p, 
                           newdata = Cards_features[-c(1:2000),])
mse_benchmark_p <- mean((mse_benchmark_p-
                           log(Cards_features$p[-c(1:2000)]))^2)

mse_benchmark_q <- predict(benchmark_q, 
                           newdata = Cards_features[-c(1:2000),])
mse_benchmark_q <- mean((mse_benchmark_q-
                           log(Cards_features$q[-c(1:2000)]))^2)

mse_benchmark <- c(mse_benchmark_p, mse_benchmark_q)
names(mse_benchmark) <- c("p", "q")
mse_benchmark
##         p         q 
## 18.963256  1.239371
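
As mentioned above, one way to go beyond the benchmark is a penalized regression that can also handle the high-dimensional ID variables. Below is a minimal sketch with the glmnet package (using glmnet is an assumption for illustration, not a requirement of the assignment; contentId is expanded into dummies via model.matrix and the LASSO penalty shrinks many of their coefficients to zero):

library(glmnet)

# design matrix: the benchmark IVs plus contentId dummies
X <- model.matrix(~ publishTime + type + gender + registeredMonthCnt +
                    follows + followeds + creatorType + level + factor(contentId),
                  Cards_features)[,-1]

# LASSO for log(p) on the training sample, with lambda chosen by cross-validation
lasso_p <- cv.glmnet(X[1:2000,], log(Cards_features$p[1:2000]), alpha = 1)

# MSE on the test sample, comparable to mse_benchmark above
pred_p <- predict(lasso_p, newx = X[-c(1:2000),], s = "lambda.min")
mean((pred_p - log(Cards_features$p[-c(1:2000)]))^2)

The same steps apply to log(q). Whether this actually beats the benchmark depends on the data, so report the test-sample MSE of your own model against mse_benchmark.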

Question 2c

This is an open question. Please find the specific grading criteria on Canvas.

Question 3

Question 3a

First, we run the Fuzzy RDD analysis following the three steps outlined in the assignment.

To begin, we run a simple linear regression of first_screen on assigned_first_screen, match_scores, and their interaction.

mdl_1 <- lm(first_screen ~ assigned_first_screen*match_scores, Impressions)
summary(mdl_1)
## 
## Call:
## lm(formula = first_screen ~ assigned_first_screen * match_scores, 
##     data = Impressions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1553 -0.3740  0.1544  0.2635  1.1897 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         0.476421   0.003037  156.86   <2e-16 ***
## assigned_first_screen               0.148263   0.006889   21.52   <2e-16 ***
## match_scores                        0.261324   0.005159   50.65   <2e-16 ***
## assigned_first_screen:match_scores -0.130881   0.006476  -20.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4202 on 49996 degrees of freedom
## Multiple R-squared:  0.2059, Adjusted R-squared:  0.2058 
## F-statistic:  4321 on 3 and 49996 DF,  p-value: < 2.2e-16
# get the predicted value of first_screen
first_screen_hat <- predict(mdl_1, type = "response")

With first_screen_hat, we run a second (logistic) regression with first_screen_hat and match_scores as the IVs.

# bind first_screen_hat to Impressions
Impressions$first_screen_hat <- first_screen_hat
mdl_2 <- glm(isClick ~ first_screen_hat + match_scores, family = "binomial", Impressions)
summary(mdl_2)
## 
## Call:
## glm(formula = isClick ~ first_screen_hat + match_scores, family = "binomial", 
##     data = Impressions)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7292  -0.3292  -0.2997  -0.2365   3.2539  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.0892     0.2686 -15.223  < 2e-16 ***
## first_screen_hat   2.6564     0.5572   4.768 1.86e-06 ***
## match_scores      -1.0531     0.1345  -7.828 4.96e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17922  on 49999  degrees of freedom
## Residual deviance: 17532  on 49997  degrees of freedom
## AIC: 17538
## 
## Number of Fisher Scoring iterations: 6

From the regression results, the conclusion is that the recommender system indeed has some value: cards placed on the first screen are more likely to be clicked than those not on the first screen.

To interpret the results, you need to transform the estimated effect into a quantity that is easier to read. You can use, for example, the “odds ratio” (or log odds ratio) or the “average marginal effect”.

For a logistic regression, the coefficient is the log odds ratio. Here the coefficient of first_screen_hat is 2.6564, so being recommended to the first screen multiplies the odds of a card being clicked by exp(2.6564) ≈ 14.2.
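
The odds ratio itself can be computed directly from the fitted model:

# odds ratio for first_screen_hat: the exponential of the coefficient (log odds ratio)
exp(coef(mdl_2)["first_screen_hat"])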

For the average marginal effect, the definition is: \[ AME = \frac{\partial P(isClick)}{\partial FirstScreen} = \frac{1}{N}\sum_i\frac{\exp(V_i)\times\beta_{firstscreen}}{(1+\exp(V_i))^2} \]

where \(V_i=Intercept + \beta_{firstscreen}\times FirstScreen + \beta_{matchingscores}\times MatchingScores\).

# get the predicted V_i from the model
V <- predict(mdl_2)

# getting the average marginal effect
average_marginal_effect <- mean(exp(V)/((1+exp(V))^2))*2.6564
average_marginal_effect
## [1] 0.1099258
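
Equivalently, the hard-coded 2.6564 can be replaced by the coefficient extracted from mdl_2, which avoids rounding and copy errors:

# same calculation, pulling the estimated coefficient of first_screen_hat from the model
beta_fs <- coef(mdl_2)["first_screen_hat"]
mean(exp(V)/((1+exp(V))^2)) * beta_fs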

The average marginal effect is the expected effect of first_screen on the probability of being clicked. On average, being recommended to the first screen increases the click probability of a card by about 11 percentage points.

Lastly, the above are the numerical interpretations. The substantive (causal) interpretation is more subtle, due to the setup and assumptions of the model. There are two important points to emphasize.

  1. First, as stated in the question, we have already restricted the sample to observations with matching scores bounded in a reasonable range. Therefore, the identified effect is a “local” effect, or a “sample average treatment effect” at best. This part is the same as in the ordinary sharp RDD discussed in class.
  2. Second, we essentially use assigned_first_screen as an instrument for first_screen to estimate this local effect. Due to assignment issues, not all cards with matching scores high enough to be assigned to the first screen were actually placed on the first screen, which is why assigned_first_screen and first_screen are not the same. This is the well-known “non-compliance” problem in experimental studies. The identified local effect is therefore even more “local”: assuming there are no “defiers”, we can only identify the effect for “compliers” (cards that should have been assigned to the first screen and actually were).

In conclusion, the identified effect is at best the sample average treatment effect for compliers.

Question 3b

This is an open question. Here I include one set of exemplary answers to the questions. Please see Canvas for the specific grading criteria.

  • One possible design is to use a “random” recommender system as the control group, where the cards shown to users in the control group are selected at random. The treatment group receives the default recommendations made by the company, so the treated users experience no disruption.
  • The counterfactual is the average number of clicks the cards would have received had they been recommended to people at random.
  • One possible and common way to do the assignment is to run the experiment for a certain time interval and, for each visitor to the app, make a Bernoulli draw of 1 or 0 and assign the visitor to the treatment or control group accordingly (see the sketch below).
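
A minimal sketch of such a Bernoulli assignment (the 50/50 split and the number of visitors are hypothetical choices for illustration):

# assign each visitor to treatment (default recommender) or control (random cards)
set.seed(123)
n_visitors <- 10000 # hypothetical number of visitors during the experiment
treated <- rbinom(n_visitors, size = 1, prob = 0.5)
table(treated) # the two groups should be roughly balanced in size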