Show your work/explain your thinking on all questions to receive full credit. For the questions involving math, you can either type your answers up using within this file or submit a separate file (handwritten or typed) with your answers.
Remember, your submission must be in the form of a readable
and organized PDF file, not a .Rmd or .R
file, with your answers/code results easy to find. We will not
grade submissions that are in the incorrect format or otherwise
unreadable. Likewise, if you handwrite your answers to the math
questions, be sure to combine them into a single document rather than
submitting several individual image files, organized so that it is clear
which question you are answering and what the final answer is.
You are encouraged to discuss and collaborate on this assignment with your peers, but you should never directly copy answers or code from others. We will not grade submissions that show clear evidence of direct copying.
A common setting where we need to assume a model of \(X\), rather than just calculating some “model free” summary statistics of \(X\), is when \(X\) is imperfectly observed and we need to use a model to impute unobserved values.
One very common way a random variable can be partially unobserved/missing is censoring: we say that a random variable \(X\) is censored when, for some values of \(X\), we can’t observe the exact value, but instead only know that it’s within a certain interval. For instance, in survey questions, we often ask questions in the form of categorical intervals, even though the actual underlying value is continuous. For instance, when we ask people what their income is, the answer choices are usually ranges like “$10,000 or less,” “$10,001 to $25,000,” “$150,001 or more,” etc.
So, while we are really interested in the distribution of income \(X\), all we get to observe is a censored random variable \(X^{*}\) that only tells us ranges of \(X\) values rather than the exact value. However, if we have a good model for the distribution of \(X\), we might be able to impute the distribution of \(X\) from the distribution of \(X^{*}\).
Let’s revisit the Instacart data to see how this works in practice.
order_size_cens where you replace all order sizes of over
10 items with 10. Denoting the order size as a random variable X, this
new censored random variable \(X^{*}\)
is right-censored at 10, meaning that when \(X\geq10\), \(X^{*}=10\) only tells us that the true
order size is \(X\geq10\) rather than
telling us the exact value.#put code for Q1 here
library (tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
instacart_data = read_csv('instacart_sample.csv',col_types = "ccdccccccdccccc")
order_sizes = instacart_data %>%
group_by(order_id) %>% #group together observations from each order
summarize(order_size = n())%>% #n() counts number of rows per order
mutate(order_size_cens = order_size)
order_sizes$order_size_cens[order_sizes$order_size_cens >10] = 10
pnbinom(x, ...) gives you \(P(X\leq x)\), such that
1-pnbinom(x, ...) or
pnbinom(x, ..., lower.tail = FALSE) give the probability of
\(P(X>x)\), not \(P(X\geq x)\); that is, it’s missing \(P(X=x)\). Be careful that your calculation
for the censored term includes the boundary value.#put code for Q2 here
library(maxLik)
## 载入需要的程辑包:miscTools
##
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
##
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
ll_nbd = function(param, x_vec){
#first parameter: shape
r = param[1]
#second parameter: probability
p = param[2]
upper = sum(dnbinom(x_vec[x_vec<9], r, p, log = TRUE))
lower = sum(log(1-pnbinom(x_vec[x_vec>8]-1, r, p)))
s = upper+lower
}
maxLik; no need to do a bootstrap. Hint:
You can use the maxLik code we used in class, substituting
in the censored variable \(X^{*}\) you
made in Q1 and the log-likelihood function you wrote in Q2.#put code for Q3 here
order_vec_shifted = (order_sizes$order_size_cens) - 1
mle_est_nbd = maxLik(ll_nbd, start = c(3, .5), x_vec = order_vec_shifted)
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 9], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 8] - 1, r, p): 产生了NaNs
summary(mle_est_nbd)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 6 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -19351.71
## 2 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## [1,] 1.682461 0.041413 40.63 <2e-16 ***
## [2,] 0.154508 0.004278 36.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#put code for Q4 here
param_mle = mle_est_nbd$estimate
data.frame(x_grid = 1:30,
observed = as.vector(table(order_sizes$order_size)[1:30]),
expected = length(order_vec_shifted)*
dnbinom(0:29, param_mle[1], param_mle[2])) %>%
pivot_longer(observed:expected) %>%
ggplot(aes(x_grid, value, group = name, fill = name)) +
geom_bar(stat = "identity", position = "dodge") +
#scale_fill_economist("Distribution") +
theme(legend.position = "bottom") +
scale_x_continuous("Number of items ordered") +
scale_y_continuous("Count")
#put code for Q5 here
instacart_data = read_csv('instacart_sample.csv',col_types = "ccdccccccdccccc")
order_sizes = instacart_data %>%
group_by(order_id) %>% #group together observations from each order
summarize(order_size = n())%>% #n() counts number of rows per order
mutate(order_size_cens = order_size)
order_sizes$order_size_cens[order_sizes$order_size_cens >5] = 5
ll_nbd = function(param, x_vec){
#first parameter: shape
r = param[1]
#second parameter: probability
p = param[2]
upper = sum(dnbinom(x_vec[x_vec<4], r, p, log = TRUE))
lower = sum(log(1-pnbinom(x_vec[x_vec>3]-1, r, p)))
s = upper+lower
}
order_vec_shifted = (order_sizes$order_size_cens) - 1
mle_est_nbd = maxLik(ll_nbd, start = c(3, .5), x_vec = order_vec_shifted)
## Warning in dnbinom(x_vec[x_vec < 4], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 3] - 1, r, p): 产生了NaNs
## Warning in dnbinom(x_vec[x_vec < 4], r, p, log = TRUE): 产生了NaNs
## Warning in pnbinom(x_vec[x_vec > 3] - 1, r, p): 产生了NaNs
summary(mle_est_nbd)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 7 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -8666.1
## 2 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## [1,] 1.348949 0.053996 24.98 <2e-16 ***
## [2,] 0.107600 0.007076 15.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
param_mle = mle_est_nbd$estimate
data.frame(x_grid = 1:30,
observed = as.vector(table(order_sizes$order_size)[1:30]),
expected = length(order_vec_shifted)*
dnbinom(0:29, param_mle[1], param_mle[2])) %>%
pivot_longer(observed:expected) %>%
ggplot(aes(x_grid, value, group = name, fill = name)) +
geom_bar(stat = "identity", position = "dodge") +
#scale_fill_economist("Distribution") +
theme(legend.position = "bottom") +
scale_x_continuous("Number of items ordered") +
scale_y_continuous("Count")
In the age of digital marketing, randomized experiments are common in the form of A/B tests: marketers often want to run “horse races” where they determine which of several ad copies will get better clickthrough rates, which website design will get better conversion, etc.
The website Upworthy is known for spreading viral content using so-called “clickbait” titles that tend to use vague and/or exagerrated headlines to catch viewers’ attention and encourage them to click into an article. Over the years, Upworthy perfected the clickbait headline through thousands of A/B tests. Let’s analyze Upworthy data to understand what kind of language makes people more likely to click on an article.
The file upworthy_archive_sample.csv contains data on
nearly 5,000 A/B tests conducted by Upworthy from 2013 to 2015. See more
information on this dataset, the Upworthy Research Archive, in Matias et
al. (2021). In particular, definitions of each column in the data
are given here.
While viewing an article on Upworthy’s website, visitors would be shown previews of recommended articles on the sidebar, with each article preview consisting of an image and paired headline, as shown in Figure \(\ref{fig:website_example}\). In these A/B tests, Upworthy would randomly choose one of several candidate headlines and images to display in the sidebar: different visitors would see different headlines linking to the same article. Figure \(\ref{fig:headline_example}\) gives an example of several candidate headlines for a single article.
Upworthy’s goal for each A/B test was to infer which headline/image led to the best clickthrough rate (CTR). Each A/B test would be conducted when an article was first published on Upworthy, and once a sufficient sample size was collected, a website editor would choose which headline/image pair to use for all subsequent visitors.
For this assignment, we will get some practice with text analysis, focusing on the headlines. What we want to understand is, generally, what kind of language makes people more or less likely to click on an article? For instance, what features of the headlines in Figure \(\ref{fig:headline_example}\) might make them perform better or worse?
To answer this, we meta-analyze the experiments, combining information across thousands of experiments to learn more generalizable principles about what sorts of headlines get better or worse CTRs, which can help us write better headlines in the future. For instance, if we observe across many articles that headlines phrased as questions get worse CTRs, then we may wish to avoid phrasing headlines as questions in the future.
hw4_example included with this homework assignment to see
two worked examples. Feel free to base your code off the sample code and
adapt it to your own hypotheses (but please come up with different
hypotheses than the examples I give!). Note: I encourage you to
use a similar modeling approach as in the examples (linear regression
with fixed effects), but you don’t have to; feel free to try running
your analysis a different way, just be sure to justify your modeling
choices carefully. Note 2: Don’t worry if you find
statistically insignificant effects; the fact that some language
apparently doesn’t affect clickthrough rates can also be
informative. Just be sure to provide a thorough analysis and thoughtful
discussion.library(scales)
##
## 载入程辑包:'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(lfe)
## 载入需要的程辑包:Matrix
##
## 载入程辑包:'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
Answer:
Many articles uses subordinate clause leaded by”What”, such as “That’s
what you need to know,”, “That’s What they needed to hear”, etc. However
the seemingly effective approach to simulate curiosity might already has
been overused, and audiences automatically shun from the headlines
written this way.
data = read_csv("upworthy_archive_sample.csv")
## New names:
## Rows: 22666 Columns: 17
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): clickability_test_id, excerpt, headline, lede, slug, eyecatcher_id... dbl
## (5): ...1, impressions, clicks, significance, test_week lgl (2): first_place,
## winner dttm (2): created_at, updated_at
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
nrow(data)
## [1] 22666
length(unique(data$clickability_test_id))
## [1] 4873
colnames(data)
## [1] "...1" "created_at" "updated_at"
## [4] "clickability_test_id" "excerpt" "headline"
## [7] "lede" "slug" "eyecatcher_id"
## [10] "impressions" "clicks" "significance"
## [13] "first_place" "winner" "share_text"
## [16] "square" "test_week"
ggplot(data) +
geom_histogram(aes(impressions))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
data = data %>%
mutate(ctr = clicks / impressions, #clicks as % of impressions
what = grepl("What", headline)) #True if "What is in the headline
There are 12% headlines that uses “What” in it
mean(data$what)
## [1] 0.127239
lm(ctr ~ what, data = data) %>%
summary
##
## Call:
## lm(formula = ctr ~ what, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.016366 -0.008299 -0.003219 0.004529 0.120397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.567e-02 8.736e-05 179.32 < 2e-16 ***
## whatTRUE 7.005e-04 2.449e-04 2.86 0.00423 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01229 on 22664 degrees of freedom
## Multiple R-squared: 0.0003609, Adjusted R-squared: 0.0003168
## F-statistic: 8.182 on 1 and 22664 DF, p-value: 0.004235
The regression indicates a weak negative correlation between usage of “What” and CTR. However, the test should focus on the use of the word “What” when it functions as the start of a subdornate clause
data %>%
filter(clickability_test_id == "547480311631de7fd8000033") %>%
.$headline
## [1] "Preparing For Lengthy Discussions With Your Relatives At The Holiday Table? Get Learnt Now."
## [2] "Your Roof, Your Roof, Your Roof Isn’t Covered With Solar Panels. Here's Why."
## [3] "What Do We Want? Power! When Do We Want It? Now! Why Can't We Have It? A Whole Host Of Reasons..."
## [4] "Solar Power Might Not Be There Just Yet But We're Getting Closer And Here's How"
felm(ctr ~ what | clickability_test_id, data = data) %>%
summary
##
## Call:
## felm(formula = ctr ~ what | clickability_test_id, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.033037 -0.001938 -0.000135 0.001736 0.043648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## whatTRUE 0.0004263 0.0001559 2.735 0.00624 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.004292 on 17792 degrees of freedom
## Multiple R-squared(full model): 0.9042 Adjusted R-squared: 0.878
## Multiple R-squared(proj model): 0.0004203 Adjusted R-squared: -0.2734
## F-statistic(full model):34.48 on 4873 and 17792 DF, p-value: < 2.2e-16
## F-statistic(proj model): 7.481 on 1 and 17792 DF, p-value: 0.006242
Using the fixed effects regression, We see a positive correlation between use of the word “What” and CTR