Current Pilot
This analysis is based on 1684 respondents who completed the current pilot survey wave (version 5), and 398 respondents who completed the last pilot survey wave (version 4b).
Key Takeaways:
- FB’s A|B testing is NOT a “true” randomized A|B test (a true test would draw a random sample from the total population and randomly assign it to one of the 5 ad sets)
- FB’s A|B testing can pick the best Ad set; its winner matches the best-performing Ad set under our key cost metric at the bottom of the funnel
- When we adjust for the quality of responses (e.g., elicitation, time spent in survey, nonsense answers), airtime ads still outperform our control when using our key cost metric
- Recruiting participants via airtime ads does not appear to skew responses
- The proportion of females among Ad impressions has kept increasing across pilots, from 53% to 62%
- The proportion of people over age 45 among conversations started has kept decreasing, from 40% to 20%
- Most impressions and conversations started come from younger people (ages 18-34)
- In the side-by-side comparisons, unnecessary outperformed risky and inaccessible; airtime outperformed control; image 1 outperformed all other images
Setup
The detailed setup can be found here.
- 1 campaign
- 5 ad sets, 3 ads in each ad set
- 15 ads split into:
  - 3 impediment themes (3 inaccessible, 6 risky, 6 unnecessary)
  - 2 different prompts (6 control and 9 airtime)
  - 9 different images (images 1-6 used twice, images 7-9 used once)
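The list above implies a simple 5 × 3 design (5 ad sets, 3 ads each). A minimal sketch of that grid is below; the image-to-theme mapping (images 1-3 for unnecessary, 4-6 for risky, 7-9 for inaccessible) is an assumption for illustration only, inferred from the per-theme image takeaways later in this report.

```python
import pandas as pd

# Hypothetical reconstruction of the 15-ad design: 5 ad sets x 3 ads each.
# The image-to-theme mapping below is an assumption for illustration only.
ad_sets = {
    ("unnecessary", "airtime"): [1, 2, 3],
    ("unnecessary", "control"): [1, 2, 3],
    ("risky", "airtime"): [4, 5, 6],
    ("risky", "control"): [4, 5, 6],
    ("inaccessible", "airtime"): [7, 8, 9],
}

ads = pd.DataFrame(
    [
        {"theme": theme, "prompt": prompt, "image": img,
         "ad_set": f"pilot_v5_{theme}_{prompt}"}
        for (theme, prompt), images in ad_sets.items()
        for img in images
    ]
)

print(ads.groupby("theme").size())   # 6 unnecessary, 6 risky, 3 inaccessible
print(ads.groupby("prompt").size())  # 6 control, 9 airtime
```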
Overview of the Ads Strategy
Ad Performance will be measured using the following indicators:
- Click-through rate
- Recruitment of non-vaccinated but potentially treatable participants
- Total quantity
- Percent of total participants recruited
- Retention
- Average participant elicitation
- Cost
- per impression
- per Link Click
- per Survey Complete
Key definitions:
unvax: participants with “vax_status” == “unvax”
unvax, open to treatment: participants who are unvax and chose “maybe” or “of course” when asked “would you ever consider getting a vaccine in the future?” in the Treatment section
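A minimal sketch of how these two flags could be derived from the survey export is below; the column names ("vax_status", "treatment_consider") and the toy rows are placeholders, not the actual export schema.

```python
import pandas as pd

# Sketch of deriving the two key flags from a survey export.
# Column names and rows are placeholders for illustration.
responses = pd.DataFrame({
    "vax_status": ["unvax", "vax", "unvax", "unvax"],
    "treatment_consider": ["maybe", "of course", "never", "of course"],
})

responses["unvax"] = responses["vax_status"] == "unvax"
responses["unvax_open_to_treatment"] = (
    responses["unvax"]
    & responses["treatment_consider"].isin(["maybe", "of course"])
)

print(responses["unvax_open_to_treatment"].sum())  # 2 in this toy example
```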
Goals for this analysis script (version 5):
- See whether Facebook A|B testing is a “true” randomized A|B test, and look into whether the best Ad set selected by Facebook aligns with the Ad set selected by our key metrics
- Provide an overview of the performance of the current Ads based on various metrics, and compare Ads performance between version 4b and version 5 in terms of the key cost metrics
- Evaluate the strategy of using airtime text to recruit by estimating the advantages and disadvantages of mentioning airtime in our Ads
- Analyze Ad targeting: find demographic covariates that predict unvax, open to treatment participants
- Understand the variation in the demographic distribution across pilots
- Compare Ads performance across three different themes, built on hypothesized drivers of hesitancy, as well as different creative versions within those themes.
Goals of this pilot with ads:
- Recruit full survey completes for unvaxxed, open to treatment participants at a low cost while maintaining participant response quality
- Take the learnings here and apply them to our next large pilot
Part I: Version 5 Specific Analysis
In this part, we are going to answer the research questions that are specific to this pilot (version 5).
1. FB A|B testing algorithm analysis
Takeaways:
- FB’s A|B testing can pick the best Ad set; its winner matches the best-performing Ad set under our key cost metric at the bottom of the funnel
- FB’s A|B testing is NOT a “true” randomized A|B test (a true test would draw a random sample from the total population and randomly assign it to one of the 5 ad sets)
Winning Ad set comparison
To assess whether Facebook’s A|B test can identify the best-performing ad, we check whether the winning Ad set selected by Facebook matches the winning Ad set selected by our bottom-of-the-funnel key metrics.
- The Facebook A|B test selected pilot_v5_unnecessary_airtime
- The metric table below suggests that pilot_v5_unnecessary_airtime is also the best Ad set based on our funnel analysis
Conclusion:
FB’s winning Ad set matches the winning Ad set selected by our metrics. We can therefore make a preliminary determination that Facebook’s A|B test can identify the best-performing ad.
Interpretation: the FB A|B test splits our budget so that exposure is divided equally and randomly between the Ad sets, and then chooses the most cost-efficient Ad set as the winner, which aligns with our cost metrics. That is, for a given budget, the more ad viewers start a conversation, the more participants (unvaccinated and open to treatment) we obtain, and hence the more cost-efficient the Ad set is on the key cost metrics at the bottom of the funnel.
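A sketch of how the winning Ad set could be selected with the key bottom-of-the-funnel cost metric is below; the spend and completion counts are toy numbers for illustration, not pilot results.

```python
import pandas as pd

# Sketch of the bottom-of-the-funnel winner selection: cost per full survey
# complete among unvax, open-to-treatment participants, by ad set.
# Spend and completion counts below are toy numbers for illustration only.
funnel = pd.DataFrame({
    "ad_set": ["pilot_v5_unnecessary_airtime", "pilot_v5_unnecessary_control",
               "pilot_v5_risky_airtime", "pilot_v5_risky_control",
               "pilot_v5_inaccessible_airtime"],
    "spend_usd": [100.0, 100.0, 100.0, 100.0, 100.0],
    "completes_unvax_open": [40, 18, 25, 12, 20],
})

funnel["cost_per_complete_unvax_open"] = (
    funnel["spend_usd"] / funnel["completes_unvax_open"]
)
winner = funnel.loc[funnel["cost_per_complete_unvax_open"].idxmin(), "ad_set"]
print(winner)  # the lowest cost per key complete wins
```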
Demographic variables distribution
We are interested in:
- is FB A|B testing a “true” randomized A|B test (FB draws a random sample from the total population, and randomly assigns it to one of the 5 ad sets), OR
- is it a different comparative test (FB divides up the total population into five equal samples, then runs an algorithm to identify attractive target populations to maximize cost efficiency on each sample)?
Metric:
- if it is a true A|B test, we should see roughly the same demographic distributions in each of the five ad sets.
- if it isn’t, we should see very different demographic distributions in each of the five ad sets.
Findings:
- Using the Facebook ads data on demographics, if we combine the 18-24 and 25-34 age groups, the age distribution is roughly balanced across the five ad sets
- Similarly, region (as identified by FB) is also roughly balanced
- However, gender is not balanced across the five ad sets: two ad sets have a female share roughly 10 percentage points lower (~57.5%) than the other three ad sets (~67.5%), which might suggest we are hitting different populations.
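One way to formalize this balance check is a chi-square test of independence between ad set and each demographic variable, run on the FB ads breakdown. The sketch below uses toy counts, not the pilot data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Chi-square test of independence between ad set and gender.
# The counts below are toy numbers for illustration only.
gender_counts = pd.DataFrame(
    {"female": [680, 660, 590, 575, 640], "male": [320, 340, 410, 425, 360]},
    index=["A", "B", "C", "D", "E"],  # ad sets
)

chi2, p_value, dof, expected = chi2_contingency(gender_counts)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.4f}")
# A small p-value indicates the gender split differs across ad sets,
# i.e. evidence against a "true" randomized A|B test.
```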
Ad Set Version (legend for all plots in this subsection):
A: unnecessary airtime
B: unnecessary control
C: risky airtime
D: risky control
E: inaccessible airtime
Age

Gender

Region

Age | Gender
Since we found an imbalanced distribution of gender, we are interested in whether the distribution of age is conditional on gender. Based on the plots below, the answer is no.


2. Airtime analysis
In our last pilot, ads that mentioned airtime were a lot more cost-efficient at recruiting a large number of participants.
In this subsection, we are interested in:
When we adjust for the quality of responses (e.g., elicitation, time spent in survey, nonsense answers), do airtime ads still outperform our control when using our key cost metric (survey complete, unvaxxed, open to treatment)?
Does recruiting participants with airtime ads also skew their responses?
Should we continue to use airtime ads to recruit participants?
Takeaways:
- After adjusting for the number of characters elicited or time spent in the survey, ads mentioning airtime are still a lot more cost-efficient.
- When looking at the best treatment respondents select, the selection distribution for the airtime group is roughly the same as for the control group, suggesting that financial incentives in ads do not translate into people selecting “rewards for vaxxing” in treatment.
- We should continue to use airtime ads to recruit participants.
Adjusted cost analysis
- We compared the cost per ten characters elicited in the treatment and impediment explanations between the airtime and control groups.
- We also compared the cost per minute spent in the survey among Survey Completes between the airtime and control groups.
In both comparisons, we found airtime is still a lot more cost-efficient.
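A sketch of the two adjusted cost metrics is below; the spend, character, and minute totals are placeholders, not pilot results.

```python
import pandas as pd

# Sketch of the two quality-adjusted cost metrics. Numbers and column names
# are placeholders; spend is at the group level, characters and minutes are
# summed over completed surveys in each group.
groups = pd.DataFrame({
    "group": ["airtime", "control"],
    "spend_usd": [300.0, 300.0],
    "chars_elicited": [52000, 21000],   # treatment + impediment explanations
    "minutes_in_survey": [1400, 600],
}).set_index("group")

groups["cost_per_10_chars"] = groups["spend_usd"] / (groups["chars_elicited"] / 10)
groups["cost_per_minute"] = groups["spend_usd"] / groups["minutes_in_survey"]
print(groups[["cost_per_10_chars", "cost_per_minute"]])
```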
Adjusted by character elicitation
Adjusted by time spent in survey
Skew responses analysis
We estimated the distribution of the best treatment selected by participants recruited via airtime ads and by participants recruited via control ads, to see whether the airtime distribution is skewed. Based on the table below, the answer is no.
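A sketch of this skew check as a normalized crosstab is below; the column names and rows are placeholders for illustration.

```python
import pandas as pd

# Sketch of the skew check: compare the distribution of the best treatment
# selected between participants recruited by airtime vs control ads.
# Column names and values are placeholders for illustration.
df = pd.DataFrame({
    "recruited_by": ["airtime", "airtime", "control", "control", "airtime"],
    "best_treatment": ["rewards for vaxxing", "more information",
                       "rewards for vaxxing", "easier access", "easier access"],
})

# Share of each best-treatment option within each recruitment arm.
selection_share = pd.crosstab(
    df["recruited_by"], df["best_treatment"], normalize="index"
)
print(selection_share.round(2))
# Similar row profiles => no evidence that airtime recruitment skews selection.
```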
Part II: General Ads A|B Testing
This part includes the analysis that we will apply to every pilot.
1. Demographics and Ads
In this section, we are interested in the changes in the distribution of three demographic variables across pilots.
Takeaway:
- The proportion of females among Ad impressions has kept increasing across pilots, from 53% to 62%
- The proportion of people over age 45 among conversations started has kept decreasing, from 40% to 20%
- Most impressions and conversations started come from younger people (ages 18-34)
Age
Age and Impression count

Age and Conversation Started count

Gender
Gender and Impression count

Gender and Conversation Started count

Region
Region and Impression count

Region and Conversation Started count

2. Side-by-Side Chart on Key Metrics
Metrics explanation:
Impressions (Total Count) = the total number of times our ad has been viewed
Clickthrough (%) = #clicks / #impressions
Messages Sent (%) = #conversations / #clicks
Consent Obtained (%) = #consents / #conversations
Core Survey Complete (%) = #forking section completed / #consents
Treatment Complete (%) = #treatment section completed / #forking section completed
Demo Questions Complete (%) = #demog section completed / #treatment section completed
Full Survey Complete (%) = #full chat completed / #demog section completed
Avg characters elicited per completed survey (treatment) = average #characters in the best treatment explanation per full chat completed
Avg characters elicited per completed survey (impediment explanations) = average #characters in impediment explanations per full chat completed
Cost per Impression = amount spent / #impressions (in USD)
Cost per Link Click = amount spent / #clicks (in USD)
Cost per Survey Complete (All participants) = amount spent / #full chat completed (in USD)
Cost per Survey Complete (Unvax) = amount spent / #full chat completed with unvaccinated participants (in USD)
Cost per Survey Complete (Unvax, Open to Treatment) = amount spent / #full chat completed with unvaccinated and open to treatment participants (in USD)
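A sketch showing how a few of these funnel and cost metrics could be computed from per-ad stage counts is below; the counts and spend are toy numbers, not pilot results.

```python
import pandas as pd

# Sketch of the side-by-side funnel and cost metrics, computed from stage
# counts for one ad (or ad group). Counts and spend are toy numbers.
stages = pd.Series({
    "impressions": 100000, "clicks": 2500, "conversations": 900,
    "consents": 700, "demog_complete": 430, "full_complete": 400,
    "full_complete_unvax_open": 180,
})
spend_usd = 600.0

metrics = {
    "Clickthrough (%)": 100 * stages["clicks"] / stages["impressions"],
    "Messages Sent (%)": 100 * stages["conversations"] / stages["clicks"],
    "Consent Obtained (%)": 100 * stages["consents"] / stages["conversations"],
    "Full Survey Complete (%)": 100 * stages["full_complete"] / stages["demog_complete"],
    "Cost per Impression": spend_usd / stages["impressions"],
    "Cost per Link Click": spend_usd / stages["clicks"],
    "Cost per Survey Complete (Unvax, Open to Treatment)":
        spend_usd / stages["full_complete_unvax_open"],
}
print(pd.Series(metrics).round(4))
```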
Unnecessary vs Risky vs Inaccessible
This table compares the three Ad impediment themes (vaccine is unnecessary vs vaccine is risky vs vaccine is inaccessible) in terms of the metrics described above.
Winner: unnecessary
Body texts used
Unnecessary
- Do you actually NEED the COVID VACCINE? Take a short survey, earn mobile airtime! (Airtime)
- Do you actually NEED the COVID VACCINE? Tell us about your experience, we’re here to listen (Control)
Risky
- Is the COVID VACCINE too RISKY? Take a short survey, earn mobile airtime! (Airtime)
- Is the COVID VACCINE too RISKY? Tell us about your experience, we’re here to listen (Control)
Inaccessible
- IMPOSSIBLE to get a COVID VACCINE? Take a short survey, earn mobile airtime! (Airtime)
Control vs Airtime
This table compares three Ad body text approaches - control (share your opinion) vs airtime (take a short survey and earn airtime) vs survey (take this short survey) - in terms of the metrics described above.
Winner: airtime
Compare by Image
This table compares nine images (provided below the table) in terms of the metrics described above.
Winner: image 1
Takeaway:
Best-performing images that we should use as baseline images for the next large survey/experiment:
- unnecessary: image 1
- risky: image 6, though the key cost metrics of images 4, 5, and 6 are close to each other
- inaccessible: image 7 (note: this image is suppressed by the FB algorithm due to its higher cost per impression and higher cost per link click)
Images used
Image 1

Image 2

Image 3

Image 4

Image 5

Image 6
