Instructions

Show your work/explain your thinking on all questions to receive full credit. For the questions involving math, you can either type your answers up using within this file or submit a separate file (handwritten or typed) with your answers.

Remember, your submission must be in the form of a readable and organized PDF file, not a .Rmd or .R file, with your answers/code results easy to find. We will not grade submissions that are in the incorrect format or otherwise unreadable. Likewise, if you handwrite your answers to the math questions, be sure to combine them into a single document rather than submitting several individual image files, organized so that it is clear which question you are answering and what the final answer is.

You are encouraged to discuss and collaborate on this assignment with your peers, but you should never directly copy answers or code from others. We will not grade submissions that show clear evidence of direct copying.

Modeling censored random variables

A common setting where we need to assume a model of $X$, rather than just calculating some “model free” summary statistics of $X$, is when $X$ is imperfectly observed and we need to use a model to impute unobserved values.

One very common way a random variable can be partially unobserved/missing is censoring: we say that a random variable $X$ is censored when, for some values of $X$, we can’t observe the exact value, but instead only know that it’s within a certain interval. For instance, in survey questions, we often ask questions in the form of categorical intervals, even though the actual underlying value is continuous. For instance, when we ask people what their income is, the answer choices are usually ranges like “$10,000 or less,” “$10,001 to $25,000,” “$150,001 or more,” etc.

So, while we are really interested in the distribution of income $X$, all we get to observe is a censored random variable $X^{*}$ that only tells us ranges of $X$ values rather than the exact value. However, if we have a good model for the distribution of $X$, we might be able to impute the distribution of $X$ from the distribution of $X^{*}$.

Let’s revisit the Instacart data to see how this works in practice.

Question 1 (5 points): Use the sample code from class to create a table with the sizes of each order (i.e. the number of items ordered). Then, add a new column order_size_cens where you replace all order sizes of over 10 items with 10. Denoting the order size as a random variable X, this new censored random variable $X^{*}$ is right-censored at 10, meaning that when $X\geq10$, $X^{*}=10$ only tells us that the true order size is $X\geq10$ rather than telling us the exact value.

#put code for Q1 here
library (tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

instacart_data = read_csv('instacart_sample.csv',col_types = "ccdccccccdccccc")
order_sizes = instacart_data %>%
group_by(order_id) %>% #group together observations from each order
summarize(order_size = n())%>%  #n() counts number of rows per order
mutate(order_size_cens = order_size) 

order_sizes$order_size_cens[order_sizes$order_size_cens >10] = 10

Question 2 (15 points): Assuming that $X$ follows a shifted negative binomial distribution (a.k.a., $X-1$ follows a negative binomial distribution), write a function which computes the log-likelihood of $X^{*}$, taking into account the fact that $X^{*}$ is censored. You can modify the function from the sample code used in class. The main modification you need to make is that $X^{*}=10$ means $X\geq10$; as such, the likelihood term for $P_{X^{*}}(10)$ needs to calculate the probability $P(X\geq10)$ instead of $P(X=10)$. Caution: Be careful to choose the correct censoring cutoff for calculating the likelihood. Since we are modeling the shifted counts ($X^*-1$) by a negative binomial distribution, the counts being censored at 10 means that the shifted counts are censored at 9. Caution 2: Note that pnbinom(x, ...) gives you $P(X\leq x)$, such that 1-pnbinom(x, ...) or pnbinom(x, ..., lower.tail = FALSE) give the probability of $P(X>x)$, not $P(X\geq x)$; that is, it’s missing $P(X=x)$. Be careful that your calculation for the censored term includes the boundary value.

#put code for Q2 here
library(maxLik)

## 载入需要的程辑包：miscTools

## 
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
## 
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/

ll_nbd = function(param, x_vec){
#first parameter: shape
r = param[1]
#second parameter: probability
p = param[2]

upper = sum(dnbinom(x_vec[x_vec<9], r, p, log = TRUE))
lower = sum(log(1-pnbinom(x_vec[x_vec>8]-1, r, p)))
s = upper+lower

}

Question 3 (10 points): Estimate the parameters of the shifted NBD model for $X$ via maximum likelihood using the observations of $X^{*}$. How do the estimated parameters (and their associated standard errors) compare to the parameters we found in class using the uncensored observations of $X$? You can use the standard errors given by maxLik; no need to do a bootstrap. Hint: You can use the maxLik code we used in class, substituting in the censored variable $X^{*}$ you made in Q1 and the log-likelihood function you wrote in Q2.

#put code for Q3 here
order_vec_shifted = (order_sizes$order_size_cens) - 1
mle_est_nbd = maxLik(ll_nbd, start = c(3, .5), x_vec = order_vec_shifted)

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs

summary(mle_est_nbd)

## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 6 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -19351.71 
## 2  free parameters
## Estimates:
##      Estimate Std. error t value Pr(> t)    
## [1,] 1.682461   0.041413   40.63  <2e-16 ***
## [2,] 0.154508   0.004278   36.12  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

Question 4 (10 points): Reproduce the histogram we made in class comparing the expected values from the model to the actual values, but substituting in the parameters you estimated in Q3. How well is the model able to recover the (unobserved) distribution of $X$ for values of $X\geq10$?

#put code for Q4 here
param_mle = mle_est_nbd$estimate
data.frame(x_grid = 1:30,
  observed = as.vector(table(order_sizes$order_size)[1:30]),
  expected = length(order_vec_shifted)*
  dnbinom(0:29, param_mle[1], param_mle[2])) %>%
pivot_longer(observed:expected) %>%
ggplot(aes(x_grid, value, group = name, fill = name)) +
geom_bar(stat = "identity", position = "dodge") +
#scale_fill_economist("Distribution") +
theme(legend.position = "bottom") +
scale_x_continuous("Number of items ordered") +
scale_y_continuous("Count")

Question 5 (10 points): Repeat Questions 1 to 4, but censoring at $X\geq5$ instead of $X\geq10$. Does the model’s ability to impute unobserved values degrade when we censor too much information?

#put code for Q5 here
instacart_data = read_csv('instacart_sample.csv',col_types = "ccdccccccdccccc")
order_sizes = instacart_data %>%
group_by(order_id) %>% #group together observations from each order
summarize(order_size = n())%>%  #n() counts number of rows per order
mutate(order_size_cens = order_size) 

order_sizes$order_size_cens[order_sizes$order_size_cens >5] = 5
ll_nbd = function(param, x_vec){
#first parameter: shape
r = param[1]
#second parameter: probability
p = param[2]

upper = sum(dnbinom(x_vec[x_vec<4], r, p, log = TRUE))
lower = sum(log(1-pnbinom(x_vec[x_vec>3]-1, r, p)))
s = upper+lower

}

order_vec_shifted = (order_sizes$order_size_cens) - 1
mle_est_nbd = maxLik(ll_nbd, start = c(3, .5), x_vec = order_vec_shifted)

## Warning in dnbinom(x_vec[x_vec < 4], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 3] - 1, r, p): 产生了NaNs

## Warning in dnbinom(x_vec[x_vec < 4], r, p, log = TRUE): 产生了NaNs

## Warning in pnbinom(x_vec[x_vec > 3] - 1, r, p): 产生了NaNs

summary(mle_est_nbd)

## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 7 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -8666.1 
## 2  free parameters
## Estimates:
##      Estimate Std. error t value Pr(> t)    
## [1,] 1.348949   0.053996   24.98  <2e-16 ***
## [2,] 0.107600   0.007076   15.21  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

param_mle = mle_est_nbd$estimate
data.frame(x_grid = 1:30,
  observed = as.vector(table(order_sizes$order_size)[1:30]),
  expected = length(order_vec_shifted)*
  dnbinom(0:29, param_mle[1], param_mle[2])) %>%
pivot_longer(observed:expected) %>%
ggplot(aes(x_grid, value, group = name, fill = name)) +
geom_bar(stat = "identity", position = "dodge") +
#scale_fill_economist("Distribution") +
theme(legend.position = "bottom") +
scale_x_continuous("Number of items ordered") +
scale_y_continuous("Count")

What makes a good clickbait title? The answer might surprise you!

In the age of digital marketing, randomized experiments are common in the form of A/B tests: marketers often want to run “horse races” where they determine which of several ad copies will get better clickthrough rates, which website design will get better conversion, etc.

The website Upworthy is known for spreading viral content using so-called “clickbait” titles that tend to use vague and/or exagerrated headlines to catch viewers’ attention and encourage them to click into an article. Over the years, Upworthy perfected the clickbait headline through thousands of A/B tests. Let’s analyze Upworthy data to understand what kind of language makes people more likely to click on an article.

The file upworthy_archive_sample.csv contains data on nearly 5,000 A/B tests conducted by Upworthy from 2013 to 2015. See more information on this dataset, the Upworthy Research Archive, in Matias et al. (2021). In particular, definitions of each column in the data are given here.

While viewing an article on Upworthy’s website, visitors would be shown previews of recommended articles on the sidebar, with each article preview consisting of an image and paired headline, as shown in Figure $\ref{fig:website_example}$. In these A/B tests, Upworthy would randomly choose one of several candidate headlines and images to display in the sidebar: different visitors would see different headlines linking to the same article. Figure $\ref{fig:headline_example}$ gives an example of several candidate headlines for a single article.

Upworthy’s goal for each A/B test was to infer which headline/image led to the best clickthrough rate (CTR). Each A/B test would be conducted when an article was first published on Upworthy, and once a sufficient sample size was collected, a website editor would choose which headline/image pair to use for all subsequent visitors.

For this assignment, we will get some practice with text analysis, focusing on the headlines. What we want to understand is, generally, what kind of language makes people more or less likely to click on an article? For instance, what features of the headlines in Figure $\ref{fig:headline_example}$ might make them perform better or worse?

To answer this, we meta-analyze the experiments, combining information across thousands of experiments to learn more generalizable principles about what sorts of headlines get better or worse CTRs, which can help us write better headlines in the future. For instance, if we observe across many articles that headlines phrased as questions get worse CTRs, then we may wish to avoid phrasing headlines as questions in the future.

Question 6 (50 points): Formulate several hypotheses about what sort of language in an article headline may affect CTRs and empirically test your hypotheses using the Upworthy data. For instance, you might wish to test the effects of specific words/phrases, punctuation, sentence structures, positive/negative emotional tone, etc. Once you have run your analysis, write a brief (2-3 paragraph) summary of what we can learn from your analysis. For instance, can your results tell us something about the psychology of attention or curiosity? Can editors/authors on websites like Upworthy use these results to inform how they write article headlines? Hint: Text analysis can be a bit tricky, especially if it’s your first time. See the doc hw4_example included with this homework assignment to see two worked examples. Feel free to base your code off the sample code and adapt it to your own hypotheses (but please come up with different hypotheses than the examples I give!). Note: I encourage you to use a similar modeling approach as in the examples (linear regression with fixed effects), but you don’t have to; feel free to try running your analysis a different way, just be sure to justify your modeling choices carefully. Note 2: Don’t worry if you find statistically insignificant effects; the fact that some language apparently doesn’t affect clickthrough rates can also be informative. Just be sure to provide a thorough analysis and thoughtful discussion.

library(scales)

## 
## 载入程辑包：'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(lfe)

## 载入需要的程辑包：Matrix

## 
## 载入程辑包：'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

Answer:
Many articles uses subordinate clause leaded by”What”, such as “That’s what you need to know,”, “That’s What they needed to hear”, etc. However the seemingly effective approach to simulate curiosity might already has been overused, and audiences automatically shun from the headlines written this way.

data = read_csv("upworthy_archive_sample.csv")

## New names:
## Rows: 22666 Columns: 17
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): clickability_test_id, excerpt, headline, lede, slug, eyecatcher_id... dbl
## (5): ...1, impressions, clicks, significance, test_week lgl (2): first_place,
## winner dttm (2): created_at, updated_at
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

nrow(data)

## [1] 22666

length(unique(data$clickability_test_id))

## [1] 4873

colnames(data)

##  [1] "...1"                 "created_at"           "updated_at"          
##  [4] "clickability_test_id" "excerpt"              "headline"            
##  [7] "lede"                 "slug"                 "eyecatcher_id"       
## [10] "impressions"          "clicks"               "significance"        
## [13] "first_place"          "winner"               "share_text"          
## [16] "square"               "test_week"

ggplot(data) +
  geom_histogram(aes(impressions))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

data = data %>%
  mutate(ctr = clicks / impressions, #clicks as % of impressions
  what = grepl("What", headline)) #True if "What is in the headline

There are 12% headlines that uses “What” in it

mean(data$what)

## [1] 0.127239

lm(ctr ~ what, data = data) %>%
summary

## 
## Call:
## lm(formula = ctr ~ what, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.016366 -0.008299 -0.003219  0.004529  0.120397 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.567e-02  8.736e-05  179.32  < 2e-16 ***
## whatTRUE    7.005e-04  2.449e-04    2.86  0.00423 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01229 on 22664 degrees of freedom
## Multiple R-squared:  0.0003609,  Adjusted R-squared:  0.0003168 
## F-statistic: 8.182 on 1 and 22664 DF,  p-value: 0.004235

The regression indicates a weak negative correlation between usage of “What” and CTR. However, the test should focus on the use of the word “What” when it functions as the start of a subdornate clause

data %>%
filter(clickability_test_id == "547480311631de7fd8000033") %>%
.$headline

## [1] "Preparing For Lengthy Discussions With Your Relatives At The Holiday Table? Get Learnt Now."      
## [2] "Your Roof, Your Roof, Your Roof Isn’t Covered With Solar Panels. Here's Why."                     
## [3] "What Do We Want? Power! When Do We Want It? Now! Why Can't We Have It? A Whole Host Of Reasons..."
## [4] "Solar Power Might Not Be There Just Yet But We're Getting Closer And Here's How"

felm(ctr ~ what | clickability_test_id, data = data) %>%
summary

## 
## Call:
##    felm(formula = ctr ~ what | clickability_test_id, data = data) 
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.033037 -0.001938 -0.000135  0.001736  0.043648 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)   
## whatTRUE 0.0004263  0.0001559   2.735  0.00624 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004292 on 17792 degrees of freedom
## Multiple R-squared(full model): 0.9042   Adjusted R-squared: 0.878 
## Multiple R-squared(proj model): 0.0004203   Adjusted R-squared: -0.2734 
## F-statistic(full model):34.48 on 4873 and 17792 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 7.481 on 1 and 17792 DF, p-value: 0.006242

Using the fixed effects regression, We see a positive correlation between use of the word “What” and CTR

Homework 4

Ziye Shao

MRKT B9659: Statistical Modeling and Decision Making

Instructions

Modeling censored random variables

What makes a good clickbait title? The answer might surprise you!