Instructions

Show your work/explain your thinking on all questions to receive full credit. For the questions involving math, you can either type your answers up using within this file or submit a separate file (handwritten or typed) with your answers.

Remember, your submission must be in the form of a readable and organized PDF file, not a .Rmd or .R file, with your answers/code results easy to find. We will not grade submissions that are in the incorrect format or otherwise unreadable. Likewise, if you handwrite your answers to the math questions, be sure to combine them into a single document rather than submitting several individual image files, organized so that it is clear which question you are answering and what the final answer is.

You are encouraged to discuss and collaborate on this assignment with your peers, but you should never directly copy answers or code from others. We will not grade submissions that show clear evidence of direct copying.

Modeling censored random variables

A common setting where we need to assume a model of \(X\), rather than just calculating some “model free” summary statistics of \(X\), is when \(X\) is imperfectly observed and we need to use a model to impute unobserved values.

One very common way a random variable can be partially unobserved/missing is censoring: we say that a random variable \(X\) is censored when, for some values of \(X\), we can’t observe the exact value, but instead only know that it’s within a certain interval. For instance, in survey questions, we often ask questions in the form of categorical intervals, even though the actual underlying value is continuous. For instance, when we ask people what their income is, the answer choices are usually ranges like “$10,000 or less,” “$10,001 to $25,000,” “$150,001 or more,” etc.

So, while we are really interested in the distribution of income \(X\), all we get to observe is a censored random variable \(X^{*}\) that only tells us ranges of \(X\) values rather than the exact value. However, if we have a good model for the distribution of \(X\), we might be able to impute the distribution of \(X\) from the distribution of \(X^{*}\).

Let’s revisit the Instacart data to see how this works in practice.

#put code for Q1 here
library (tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
instacart_data = read_csv('instacart_sample.csv',col_types = "ccdccccccdccccc")
order_sizes = instacart_data %>%
group_by(order_id) %>% #group together observations from each order
summarize(order_size = n())%>%  #n() counts number of rows per order
mutate(order_size_cens = order_size) 

order_sizes$order_size_cens[order_sizes$order_size_cens >10] = 10
#put code for Q2 here
library(maxLik)
## 载入需要的程辑包:miscTools
## 
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
## 
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
ll_nbd = function(param, x_vec){
#first parameter: shape
r = param[1]
#second parameter: probability
p = param[2]

upper = sum(dnbinom(x_vec[x_vec<9], r, p, log = TRUE))
lower = sum(log(1-pnbinom(x_vec[x_vec>8]-1, r, p)))
s = upper+lower

}
#put code for Q3 here
order_vec_shifted = (order_sizes$order_size_cens) - 1
mle_est_nbd = maxLik(ll_nbd, start = c(3, .5), x_vec = order_vec_shifted)
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
summary(mle_est_nbd)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 6 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -19351.71 
## 2  free parameters
## Estimates:
##      Estimate Std. error t value Pr(> t)    
## [1,] 1.682461   0.041413   40.63  <2e-16 ***
## [2,] 0.154508   0.004278   36.12  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#put code for Q4 here
param_mle = mle_est_nbd$estimate
data.frame(x_grid = 1:30,
  observed = as.vector(table(order_sizes$order_size)[1:30]),
  expected = length(order_vec_shifted)*
  dnbinom(0:29, param_mle[1], param_mle[2])) %>%
pivot_longer(observed:expected) %>%
ggplot(aes(x_grid, value, group = name, fill = name)) +
geom_bar(stat = "identity", position = "dodge") +
#scale_fill_economist("Distribution") +
theme(legend.position = "bottom") +
scale_x_continuous("Number of items ordered") +
scale_y_continuous("Count")

#put code for Q5 here
instacart_data = read_csv('instacart_sample.csv',col_types = "ccdccccccdccccc")
order_sizes = instacart_data %>%
group_by(order_id) %>% #group together observations from each order
summarize(order_size = n())%>%  #n() counts number of rows per order
mutate(order_size_cens = order_size) 

order_sizes$order_size_cens[order_sizes$order_size_cens >5] = 5
ll_nbd = function(param, x_vec){
#first parameter: shape
r = param[1]
#second parameter: probability
p = param[2]

upper = sum(dnbinom(x_vec[x_vec<4], r, p, log = TRUE))
lower = sum(log(1-pnbinom(x_vec[x_vec>3]-1, r, p)))
s = upper+lower

}

order_vec_shifted = (order_sizes$order_size_cens) - 1
mle_est_nbd = maxLik(ll_nbd, start = c(3, .5), x_vec = order_vec_shifted)
## Warning in dnbinom(x_vec[x_vec < 4], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 3] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 4], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 3] - 1, r, p): 产生了NaNs
summary(mle_est_nbd)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 7 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -8666.1 
## 2  free parameters
## Estimates:
##      Estimate Std. error t value Pr(> t)    
## [1,] 1.348949   0.053996   24.98  <2e-16 ***
## [2,] 0.107600   0.007076   15.21  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
param_mle = mle_est_nbd$estimate
data.frame(x_grid = 1:30,
  observed = as.vector(table(order_sizes$order_size)[1:30]),
  expected = length(order_vec_shifted)*
  dnbinom(0:29, param_mle[1], param_mle[2])) %>%
pivot_longer(observed:expected) %>%
ggplot(aes(x_grid, value, group = name, fill = name)) +
geom_bar(stat = "identity", position = "dodge") +
#scale_fill_economist("Distribution") +
theme(legend.position = "bottom") +
scale_x_continuous("Number of items ordered") +
scale_y_continuous("Count")

What makes a good clickbait title? The answer might surprise you!

In the age of digital marketing, randomized experiments are common in the form of A/B tests: marketers often want to run “horse races” where they determine which of several ad copies will get better clickthrough rates, which website design will get better conversion, etc.

The website Upworthy is known for spreading viral content using so-called “clickbait” titles that tend to use vague and/or exagerrated headlines to catch viewers’ attention and encourage them to click into an article. Over the years, Upworthy perfected the clickbait headline through thousands of A/B tests. Let’s analyze Upworthy data to understand what kind of language makes people more likely to click on an article.

The file upworthy_archive_sample.csv contains data on nearly 5,000 A/B tests conducted by Upworthy from 2013 to 2015. See more information on this dataset, the Upworthy Research Archive, in Matias et al. (2021). In particular, definitions of each column in the data are given here.

While viewing an article on Upworthy’s website, visitors would be shown previews of recommended articles on the sidebar, with each article preview consisting of an image and paired headline, as shown in Figure \(\ref{fig:website_example}\). In these A/B tests, Upworthy would randomly choose one of several candidate headlines and images to display in the sidebar: different visitors would see different headlines linking to the same article. Figure \(\ref{fig:headline_example}\) gives an example of several candidate headlines for a single article.

Upworthy’s goal for each A/B test was to infer which headline/image led to the best clickthrough rate (CTR). Each A/B test would be conducted when an article was first published on Upworthy, and once a sufficient sample size was collected, a website editor would choose which headline/image pair to use for all subsequent visitors.

For this assignment, we will get some practice with text analysis, focusing on the headlines. What we want to understand is, generally, what kind of language makes people more or less likely to click on an article? For instance, what features of the headlines in Figure \(\ref{fig:headline_example}\) might make them perform better or worse?

To answer this, we meta-analyze the experiments, combining information across thousands of experiments to learn more generalizable principles about what sorts of headlines get better or worse CTRs, which can help us write better headlines in the future. For instance, if we observe across many articles that headlines phrased as questions get worse CTRs, then we may wish to avoid phrasing headlines as questions in the future.

library(scales)
## 
## 载入程辑包:'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(lfe)
## 载入需要的程辑包:Matrix
## 
## 载入程辑包:'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

Answer:
Many articles uses subordinate clause leaded by”What”, such as “That’s what you need to know,”, “That’s What they needed to hear”, etc. However the seemingly effective approach to simulate curiosity might already has been overused, and audiences automatically shun from the headlines written this way.

data = read_csv("upworthy_archive_sample.csv")
## New names:
## Rows: 22666 Columns: 17
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): clickability_test_id, excerpt, headline, lede, slug, eyecatcher_id... dbl
## (5): ...1, impressions, clicks, significance, test_week lgl (2): first_place,
## winner dttm (2): created_at, updated_at
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
nrow(data)
## [1] 22666
length(unique(data$clickability_test_id))
## [1] 4873
colnames(data)
##  [1] "...1"                 "created_at"           "updated_at"          
##  [4] "clickability_test_id" "excerpt"              "headline"            
##  [7] "lede"                 "slug"                 "eyecatcher_id"       
## [10] "impressions"          "clicks"               "significance"        
## [13] "first_place"          "winner"               "share_text"          
## [16] "square"               "test_week"
ggplot(data) +
  geom_histogram(aes(impressions))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

data = data %>%
  mutate(ctr = clicks / impressions, #clicks as % of impressions
  what = grepl("What", headline)) #True if "What is in the headline

There are 12% headlines that uses “What” in it

mean(data$what) 
## [1] 0.127239
lm(ctr ~ what, data = data) %>%
summary
## 
## Call:
## lm(formula = ctr ~ what, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.016366 -0.008299 -0.003219  0.004529  0.120397 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.567e-02  8.736e-05  179.32  < 2e-16 ***
## whatTRUE    7.005e-04  2.449e-04    2.86  0.00423 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01229 on 22664 degrees of freedom
## Multiple R-squared:  0.0003609,  Adjusted R-squared:  0.0003168 
## F-statistic: 8.182 on 1 and 22664 DF,  p-value: 0.004235

The regression indicates a weak negative correlation between usage of “What” and CTR. However, the test should focus on the use of the word “What” when it functions as the start of a subdornate clause

data %>%
filter(clickability_test_id == "547480311631de7fd8000033") %>%
.$headline
## [1] "Preparing For Lengthy Discussions With Your Relatives At The Holiday Table? Get Learnt Now."      
## [2] "Your Roof, Your Roof, Your Roof Isn’t Covered With Solar Panels. Here's Why."                     
## [3] "What Do We Want? Power! When Do We Want It? Now! Why Can't We Have It? A Whole Host Of Reasons..."
## [4] "Solar Power Might Not Be There Just Yet But We're Getting Closer And Here's How"
felm(ctr ~ what | clickability_test_id, data = data) %>%
summary
## 
## Call:
##    felm(formula = ctr ~ what | clickability_test_id, data = data) 
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.033037 -0.001938 -0.000135  0.001736  0.043648 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)   
## whatTRUE 0.0004263  0.0001559   2.735  0.00624 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004292 on 17792 degrees of freedom
## Multiple R-squared(full model): 0.9042   Adjusted R-squared: 0.878 
## Multiple R-squared(proj model): 0.0004203   Adjusted R-squared: -0.2734 
## F-statistic(full model):34.48 on 4873 and 17792 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 7.481 on 1 and 17792 DF, p-value: 0.006242

Using the fixed effects regression, We see a positive correlation between use of the word “What” and CTR