Results Analysis

We collected 25,200 responses in total from Google Gemini and OpenAI GPT in March and April 2025. We calculated correct_prediction based on the majority vote across 3 runs.
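As a minimal sketch of the majority-vote step, assuming each prompt was run three times and each run was scored as correct or incorrect (the matrix `runs` and the helper `majority_correct` are hypothetical names, not taken from the study code):

```r
# Majority vote over three repeated runs of the same prompt.
# `runs` is a hypothetical logical matrix: one row per prompt, one column per run,
# TRUE when that run's answer matched the ground truth.
majority_correct <- function(runs) {
  stopifnot(is.matrix(runs), ncol(runs) == 3)
  # A prompt counts as correct when at least 2 of its 3 runs are correct.
  rowSums(runs) >= 2
}

# Made-up example rows, for illustration only.
runs <- matrix(
  c(TRUE,  TRUE,  FALSE,
    FALSE, FALSE, TRUE,
    TRUE,  TRUE,  TRUE),
  ncol = 3, byrow = TRUE
)
correct_prediction <- majority_correct(runs)
```

Because the requested answer is binary, voting on correctness and voting on the predicted "yes"/"no" label should give the same result.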

Setup

Here we filter out the 5th question type “general” because these questions do not have a ground truth. We also separate responses by search-enabled and search-disabled where searching the internet was enabled or disabled for a model, respectively, during the prompting process.
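The filtering and splitting described above could look roughly like this in base R. The column names `type` and `client_type` and the level `without_search` appear later in this analysis; the level name `with_search` and all rows below are assumptions for illustration.

```r
# Toy stand-in for the response table; the rows are made up.
responses <- data.frame(
  type = c("assault", "general", "travel", "gestation", "general", "telemedicine"),
  client_type = c("with_search", "with_search", "without_search",
                  "without_search", "without_search", "with_search"),
  stringsAsFactors = FALSE
)

# Drop "general" questions, which have no ground truth.
scored <- responses[responses$type != "general", ]

# Separate search-enabled from search-disabled responses.
by_client <- split(scored, scored$client_type)
search_enabled <- by_client[["with_search"]]
search_disabled <- by_client[["without_search"]]
```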

The following analysis focuses only on responses from search-enabled models (unless otherwise indicated), as these were the state-of-the-art (SOTA) models at the time of this study.

There are 12,400 responses per client type.

1 Accuracy

How accurate are model responses compared to current state laws when forced to answer one way or another? In other words, how accurate are model responses with respect to a forced ground truth when the requested prompt response is “yes” or “no”?

We deliberately set up prompting this way even though the answer may be more complicated than a binary response can convey. The goal is to understand how often these models are correct and explore what “correct” means in this context.

1.1 Accuracy overall

Model response accuracy across all questions asked, states, and models is about 74% (73.5%).

avg_accuracy	sd_accuracy	se_accuracy	ci_lower	ci_upper	num_responses
0.7351613	0.4412532	0.0022878	0.7306772	0.7396454	37200
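The columns in this summary row can be reproduced from a 0/1 vector of correct predictions. The sketch below assumes a normal-approximation 95% interval with a 1.96 multiplier, which is consistent with the values shown but is not confirmed by the study code. The count of 27,348 correct responses is inferred from 0.7351613 × 37,200, and `summarise_accuracy` is a hypothetical helper name.

```r
# Summarise a 0/1 vector of correct predictions into the columns used in this report.
summarise_accuracy <- function(correct) {
  n <- length(correct)
  avg <- mean(correct)
  sdv <- sd(correct)        # sample standard deviation
  se <- sdv / sqrt(n)       # standard error of the mean
  data.frame(
    avg_accuracy = avg,
    sd_accuracy = sdv,
    se_accuracy = se,
    ci_lower = avg - 1.96 * se,  # assumed normal-approximation 95% CI
    ci_upper = avg + 1.96 * se,
    num_responses = n
  )
}

n_total <- 37200
n_correct <- 27348  # inferred from the reported mean, not taken from the raw data
overall <- summarise_accuracy(rep(c(1, 0), times = c(n_correct, n_total - n_correct)))
```

Under these assumptions, `overall` should match the row above to rounding.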

1.2 Accuracy per state

Across all states, North Dakota has the lowest accuracy and Vermont has the highest. In 5 out of the 50 states, model responses are wrong more than half the time. As these responses come from search-enabled models (meaning the models have access to internet search results to augment their training data), the reason for lower accuracy could be due to regional specificity or inconsistent information that the models struggle with, especially if these are states with rapidly changing laws/abortion-related news coverage. Conversely, in states where models had higher accuracy, there may be less ambiguity in the training and search data.

Notably, the standard deviation also drops as the accuracy increases per state meaning that in general, performance varies more for responses for states with a lower accuracy than those with a higher accuracy.
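Part of this pattern is likely mechanical rather than substantive: for a 0/1 outcome with mean accuracy p, the standard deviation is approximately sqrt(p × (1 − p)), which peaks at p = 0.5 and shrinks as p approaches 1. A small sketch, where `binary_sd` is a hypothetical helper name:

```r
# Standard deviation of a 0/1 outcome as a function of its mean accuracy p.
binary_sd <- function(p) sqrt(p * (1 - p))

# Illustrative accuracy levels from 50% to 95%.
p <- c(0.50, 0.60, 0.75, 0.95)
sd_by_accuracy <- data.frame(accuracy = p, sd = round(binary_sd(p), 3))
```

For example, the overall accuracy of 0.735 gives sqrt(0.735 × 0.265) ≈ 0.441, in line with the sd_accuracy reported in section 1.1.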

In the next sections, we explore whether this underperformance is systematic across geographic regions and political and policy contexts.

1.3 Accuracy by 2024 election ballot measures

Ballot measures to protect abortion rights were added to the 2024 election ballots in 10 states: Arizona, Colorado, Florida, Maryland, Missouri, Montana, Nebraska, Nevada, New York, and South Dakota¹.

These measures passed in all of the states except for Florida, Nebraska, and South Dakota.

However, while Missouri voters approved Amendment 3, which enshrined a right to reproductive freedom in the state constitution, the Missouri Supreme Court issued a ruling in May 2025 that effectively reinstated a ban on abortion in the state. A new ballot measure is being drafted by Republicans². This ruling is an example of the confusing and complicated reality of changing abortion laws in the United States.

However, at the time the model responses were collected for this study, Missouri was not considered to have a total ban³.

It’s interesting that of these states, model responses for Missouri have the worst accuracy at 48%.

A t-test and the accompanying plot show that ballot-measure states are spread across the accuracy spectrum. We cannot reject the null hypothesis of equal means, so there is no evidence that mean accuracy differs between states with ballot measures and states without.
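A comparison of this kind can be sketched with R's `t.test()`. The per-state accuracies below are made up for illustration and are not the study's values. The sketch uses Welch's unequal-variance test, which is the `t.test()` default; the variant actually used in the study is not stated.

```r
# Hypothetical per-state accuracies; NOT the study's data.
state_accuracy <- data.frame(
  accuracy = c(0.48, 0.92, 0.55, 0.93, 0.60, 0.94,
               0.47, 0.90, 0.75, 0.62, 0.88, 0.51),
  group = factor(rep(c("ballot_measure", "no_ballot_measure"), times = 6))
)

# Welch two-sample t-test comparing mean accuracy between the two groups.
result <- t.test(accuracy ~ group, data = state_accuracy)

# Fail to reject the null of equal means when the p-value is at or above 0.05.
reject_null <- result$p.value < 0.05
```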

2 Geography + Political/Policy Bias

Here we ask: how do models handle state-by-state variability with respect to different restrictions?

2.1 By Policy

2.1.1 Guttmacher Institute Abortion Policy Spectrum

Here, we implement The Guttmacher Institute’s state abortion policy framework⁴ where states are categorized based on the restrictiveness of their abortion laws.

Within this spectrum, accuracy is lowest in “most restrictive” states at 60% and highest in “very protective” states at 93%. In other words, models are less likely to provide a correct response in states with restrictive abortion policies.

policy	avg_accuracy	sd_accuracy	se_accuracy	ci_lower	ci_upper
most restrictive	0.6022401	0.4894573	0.0046332	0.5931590	0.6113212
restrictive	0.6303763	0.4827492	0.0066894	0.6172652	0.6434875
very restrictive	0.6442652	0.4788426	0.0101355	0.6243996	0.6641308
some restrictions/protections	0.7290323	0.4445192	0.0072882	0.7147474	0.7433171
most protective	0.8592070	0.3478664	0.0063767	0.8467087	0.8717053
protective	0.8642473	0.3425513	0.0041862	0.8560424	0.8724522
very protective	0.9312596	0.2530365	0.0035063	0.9243873	0.9381319

2.1.2 Total Abortion Bans and Bans Based on Gestational Duration

Another way to frame state abortion policies is by considering states with total abortion bans (“total_ban”), no limit (“allowed_no_limit”), and bans based on gestational duration at a certain point in pregnancy (“gestational_limit {n_weeks_LMP}”). This policy view provides a slightly more granular perspective on model response accuracy than the one above.

Here we see that states that have a 6-week LMP gestational ban (gestational_limit 6) have the lowest accuracy at 52% and states that allow abortion at any stage of pregnancy have the highest accuracy at 94%. This shows us that accuracy is worse when we look at states that have specific restrictions like gestational bans, regardless of how “restrictive” or “protective” a state’s abortion policies may be.

This is important because a state labeled as “protective” could be protective in some respects but restrictive in others which may complicate the information available to provide answers to questions on the legality of abortion in a specific state.

❗️ *** Importantly, fetal viability is generally considered to begin at 23 or 24 weeks gestational age⁵. We take the highest number here so any state that allows abortions until “viability” without a specified gestational age is categorized as gestational_limit 24 with 24 denoting 24 weeks LMP as the threshold for fetal viability.
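The viability rule above can be expressed as a small recoding step. This is a sketch under assumed input encoding: the raw values `"viability"` and week counts stored as text are hypothetical, and only the output labels (`total_ban`, `allowed_no_limit`, `gestational_limit {n}`) come from this analysis.

```r
# Map a raw limit description to the combined_policy_type labels used in this section.
# Input encoding is hypothetical: "total_ban", "allowed_no_limit", "viability",
# or a number of weeks LMP stored as text (e.g. "6").
combined_policy_type <- function(limit) {
  viability_weeks <- "24"  # viability is treated as 24 weeks LMP
  weeks <- ifelse(limit == "viability", viability_weeks, limit)
  ifelse(
    weeks %in% c("total_ban", "allowed_no_limit"),
    weeks,
    paste("gestational_limit", weeks)
  )
}

labels <- combined_policy_type(c("total_ban", "viability", "6", "allowed_no_limit", "26.6"))
```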

combined_policy_type	avg_accuracy	sd_accuracy	se_accuracy	ci_lower	ci_upper
gestational_limit 6	0.5156810	0.4998660	0.0105805	0.4949432	0.5364188
gestational_limit 22	0.5963262	0.4907435	0.0103874	0.5759668	0.6166855
total_ban	0.6141353	0.4868261	0.0051522	0.6040369	0.6242337
gestational_limit 18	0.7459677	0.4356087	0.0159702	0.7146662	0.7772693
gestational_limit 24	0.7544803	0.4304110	0.0037193	0.7471905	0.7617701
gestational_limit 12	0.7560484	0.4296083	0.0111371	0.7342197	0.7778771
gestational_limit 26.6	0.8024194	0.3984419	0.0146076	0.7737885	0.8310502
gestational_limit 26	0.9126344	0.2825600	0.0103592	0.8923305	0.9329383
allowed_no_limit	0.9442951	0.2293681	0.0028030	0.9388012	0.9497890

2.1.3 By Policy x State

Abortion restrictions vary across states under both policy framings, and their effect on accuracy differs when the policies are broken down by state.

By combining both of these policy framings at the state level, we come back to accuracy per state; this time with a clearer view of what kind of policies are affecting model responses in the states with the lowest and highest accuracy.

North Dakota, the state where model responses had the lowest accuracy, is a restrictive state (but not one of the most restrictive states) because a total abortion ban was repealed there in late 2024. However, there are no abortion providers in the state and litigation around the ban is still in progress. What additionally makes North Dakota “restrictive” rather than protective like other states with a viability-based gestational ban, such as Montana, are the additional restrictions pregnant people must abide by to access an abortion. These include waiting periods, required counseling, limited insurance coverage, etc. Wisconsin similarly has a long list of additional restrictions on access to abortion care.
With the exception of California, the states that follow in increasing order of accuracy fall under a restrictive policy category and either have a low gestational duration ban (meaning the state bans abortions early in pregnancy) and/or have additional restrictions alongside a high gestational duration ban (meaning a pregnant person can get an abortion in theory, but faces many obstacles to do so). We will revisit California soon. Accuracy ranges from 32% to 77% for these states.
The states with higher model response accuracy either have no ban and allow abortion throughout pregnancy or allow abortions up to viability (gestational_limit 24) with few additional restrictions. These state policies are in the protective range or have some restrictions/protections. Accuracy ranges from 86% to 99% for these states.

2.2 By Partisan Lean

To compute the partisan lean of a state, we first calculate the presidential deviation for the 2024 U.S. presidential election using data from AP News⁶. We subtract the national popular vote margin (Republican lead by 1.5 percentage points) from each state’s vote margin to find how much a state deviates from the national average in support for the Republican presidential candidate in 2024. This metric highlights how much more Democratic or Republican a state is relative to the nation as a whole⁷.
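As a worked sketch of the deviation calculation, assuming margins are expressed in percentage points with positive values meaning a Republican lead (the helper name `presidential_deviation` and the two example state margins are hypothetical; the cutoffs that turn a deviation into a lean category are not given here and are not guessed):

```r
# Presidential deviation: state margin minus national margin, in percentage points.
# Positive values mean the state voted more Republican than the nation as a whole.
presidential_deviation <- function(state_margin, national_margin = 1.5) {
  state_margin - national_margin
}

# Hypothetical state margins: one with a 10-point Republican lead,
# one with a 5-point Democratic lead (entered as -5).
example_margins <- c(state_a = 10, state_b = -5)
deviation <- presidential_deviation(example_margins)
```

Here `state_a` deviates by 8.5 points toward the Republican candidate and `state_b` by 6.5 points toward the Democratic candidate.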

We see that at a national level, states that lean more Democratic have higher model response accuracy: Strongly Republican states are at 54% and Strongly Democrat states at 84%, with Leans Democrat states highest at 90%. This shows an overall model bias that favors states with a Democratic lean.

partisan_lean	avg_accuracy	sd_accuracy	se_accuracy	ci_lower	ci_upper
Strongly Republican	0.5381720	0.4986078	0.0081750	0.5221490	0.5541950
Leans Republican	0.6340246	0.4817257	0.0047201	0.6247732	0.6432759
Center	0.6982527	0.4590419	0.0048582	0.6887306	0.7077748
Strongly Democrat	0.8443100	0.3626018	0.0054271	0.8336729	0.8549472
Leans Democrat	0.9035360	0.2952418	0.0030021	0.8976519	0.9094200

However, broken down by state, we see that this bias is not as clear cut. We note that partisan lean is based on the results of the 2024 presidential election, not considering current events and the changing political landscape across the U.S.

Given this, we next explore the most likely factors impacting model response accuracy.

2.3 Odds Ratio

Next, we run a logistic regression to test which variables best predict whether a model provides the correct answer (correct_prediction).

We use all our data (search-enabled and search-disabled) for the following analysis.

# Set reference levels so each odds ratio is read relative to a baseline category.
logit_data_per_run$policy <- relevel(factor(logit_data_per_run$policy), ref = "some restrictions/protections")
logit_data_per_run$partisan_lean <- relevel(factor(logit_data_per_run$partisan_lean), ref = "Center")
logit_data_per_run$affirmative_vs_negative <- relevel(factor(logit_data_per_run$affirmative_vs_negative), ref = "negative prompt")
# correct_answer_yes is only used by the alternative specification commented out below.
logit_data_per_run$correct_answer_yes <- factor(logit_data_per_run$correct_answer_yes,
                                                levels = c(FALSE, TRUE))

# Logistic regression: correctness of each per-run response given the covariates.
logit_covariates_per_run <- glm(
  correct_prediction ~ policy + partisan_lean +
    type + model + client_type + affirmative_vs_negative,
  data = logit_data_per_run,
  family = "binomial"
)

# Alternative specification that also controls for whether the ground truth is "yes".
# logit_covariates_per_run <- glm(
#   correct_prediction ~ policy + partisan_lean +
#     type + model + client_type + affirmative_vs_negative + correct_answer_yes,
#   data = logit_data_per_run,
#   family = "binomial"
# )

Below we have the odds ratios from this logistic regression model predicting the likelihood of a correct response from the models in our study, using covariates including state abortion policy, partisan lean, question type, client search settings, and model type.
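For readers unfamiliar with how odds ratios are read off a fitted model, here is a minimal sketch on simulated data. The toy data frame and its two covariates are made up. The sketch exponentiates the coefficients and Wald confidence limits with base R; the table in this section looks consistent with output from `broom::tidy(..., exponentiate = TRUE, conf.int = TRUE)`, but that is an assumption.

```r
set.seed(1)
n <- 2000

# Simulated covariates; NOT the study's data.
toy <- data.frame(
  affirmative = rbinom(n, 1, 0.5),
  search_disabled = rbinom(n, 1, 0.5)
)

# Simulate correctness with a true odds ratio of 1.8 for affirmative prompts.
log_odds <- 0.5 + log(1.8) * toy$affirmative - 0.05 * toy$search_disabled
toy$correct_prediction <- rbinom(n, 1, plogis(log_odds))

fit <- glm(correct_prediction ~ affirmative + search_disabled,
           data = toy, family = "binomial")

# Coefficients are log-odds; exponentiating gives odds ratios.
wald_ci <- confint.default(fit)  # Wald intervals on the log-odds scale
odds_ratios <- data.frame(
  term = names(coef(fit)),
  estimate = exp(coef(fit)),
  conf.low = exp(wald_ci[, 1]),
  conf.high = exp(wald_ci[, 2]),
  row.names = NULL
)
```

An odds ratio above 1 means the term raises the odds of a correct response relative to its reference level, and a value below 1 means it lowers them. An interval that spans 1 indicates no clear effect at that confidence level.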

TODO: fix strip text coloring and casing

term	estimate	std.error	statistic	p.value	conf.low	conf.high	term_label	group
affirmative_vs_negativeaffirmative prompt	1.7715657	0.0181668	31.478554	0.0000000	1.7096346	1.8358229	affirmative prompt	Affirmative vs Negative
modelopenai	1.1814214	0.0178872	9.320547	0.0000000	1.1407276	1.2235830	OpenAI	Model
partisan_leanLeans Democrat	2.4802894	0.0407941	22.267297	0.0000000	2.2897778	2.6868606	Leans Democrat	Partisan Lean
partisan_leanStrongly Democrat	1.2709124	0.0555215	4.317880	0.0000158	1.1400657	1.4172809	Strongly Democrat	Partisan Lean
partisan_leanLeans Republican	0.9150113	0.0243209	-3.651959	0.0002602	0.8724102	0.9596776	Leans Republican	Partisan Lean
partisan_leanStrongly Republican	0.6085394	0.0303789	-16.349951	0.0000000	0.5733547	0.6458648	Strongly Republican	Partisan Lean
policyvery protective	3.7413414	0.0513500	25.695110	0.0000000	3.3845125	4.1392767	very protective	Policy
policyprotective	2.6537326	0.0430503	22.670391	0.0000000	2.4393002	2.8877448	protective	Policy
policymost protective	2.0245514	0.0631610	11.167468	0.0000000	1.7891873	2.2918476	most protective	Policy
policyrestrictive	0.8672656	0.0369150	-3.857785	0.0001144	0.8066666	0.9322670	restrictive	Policy
policymost restrictive	0.8083426	0.0343489	-6.194360	0.0000000	0.7556264	0.8645399	most restrictive	Policy
policyvery restrictive	0.7114474	0.0435359	-7.820077	0.0000000	0.6532379	0.7748003	very restrictive	Policy
typetravel	17.4493478	0.2036735	14.038654	0.0000000	11.9785941	26.7102195	Travel	Question Type
typegestation	1.4934929	0.0196312	20.432680	0.0000000	1.4371855	1.5521492	Gestation	Question Type
typetelemedicine	0.9152404	0.0716273	-1.236519	0.2162658	0.7961551	1.0543259	Telemedicine	Question Type
client_typewithout_search	0.9422990	0.0178774	-3.324453	0.0008859	0.9098510	0.9758996	Search Disabled	Search Setting

3 Question Type + Prompt Sensitivity (Prompt Framing Effect)

Overall, does phrasing matter: By state? By policy? Restrictiveness? Do LLMs respond differently to variations in how the question is asked? Are there inconsistencies depending on user phrasing and state? Where is this more prevalent?

type	avg_accuracy	sd_accuracy	se_accuracy	num_responses	ci_width
telemedicine	0.67500	0.4695502	0.0332022	200	0.1301527
travel	0.97000	0.1710153	0.0120926	200	0.0474030
gestation	0.78525	0.4107001	0.0064937	4000	0.0254555
assault	0.73450	0.4416267	0.0049375	8000	0.0193551

Overall, the mean accuracy for the 4 question types we designed for our prompts ranges between 68% and 97%, with responses to telemedicine questions having the lowest accuracy and responses to travel questions the highest.

However, if we disaggregate question types by the exact prompts used within each type, we see that the phrasing for each question has an effect on overall accuracy. Questions of type “assault” and “gestation” have more prompts because they include variations by gestational age (TODO: explain this better).

For every single question type, prompts beginning with “Is it…” performed worse than prompts beginning with “Can I…”. This discrepancy is clearest between prompt 1 and prompt 2 for the telemedicine question type, the type with the lowest accuracy overall. This is important because it shows how simple rephrasing could change the response someone would receive when asking these questions.

However, since we know a state’s abortion policies have a strong effect on accuracy, when we zoom into question types by policy, we see that the accuracy range widens depending on the question (WARNING: 0% is a VERY small sample size).

Once again, we see a low accuracy swing toward states with restrictive policies while states with less restrictive policies have a higher model response accuracy.

3.1 PLEASE, what’s up with California?

By breaking down question type, we see that California’s model response accuracy is extremely low at 38% for prompts of “assault” type. While California is one of the states with the highest level of protection for abortion care, that only applies to people with private and public insurance providers, not people who are insured through employers who provide “self-funded” policies. So in reality, abortion access in California is entirely dependent on what insurance you have and what they are willing to cover.⁸ From a model information perspective, this stratification might be clear in the state’s policies but unclear in the abortion information available for California.

(TODO: include this as part of thematic analysis for RAG.)

4 Effect of Search

This section uses both search-enabled and search-disabled data.

4.1 By Policy

When we compare models that relied on their cutoff knowledge (search-disabled) with models that augmented responses with search (search-enabled), states on the protective end of the policy spectrum fared worse overall with search. The effect was most negative for states with “protective” policies, where enabling search decreased model response accuracy by 6.17% on average. For states with “most protective” policies, search improved model response accuracy by less than 1%.

4.2 By Questions

type	with_search	without_search	Diff	effect
assault	0.7133750	0.6935417	0.0198333	Search helped
gestation	0.7715833	0.7696667	0.0019167	Search helped
telemedicine	0.6550000	0.8250000	-0.1700000	Search hurt
travel	0.9583333	1.0000000	-0.0416667	Search hurt

Enabling search also hurt model response accuracy for telemedicine questions, reducing accuracy by 17 percentage points on average. Where search did help, the effect was minimal: accuracy increased by less than a quarter of a percentage point for gestation questions and by less than 2 percentage points for assault questions.
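The Diff and effect columns in the table above follow directly from the two accuracy columns. A short sketch using the reported values (the extra column `diff_pp` is added here only to show the change in percentage points):

```r
# Accuracy by question type with and without search, copied from the table above.
search_effect <- data.frame(
  type = c("assault", "gestation", "telemedicine", "travel"),
  with_search = c(0.7133750, 0.7715833, 0.6550000, 0.9583333),
  without_search = c(0.6935417, 0.7696667, 0.8250000, 1.0000000)
)

# Positive Diff means search-enabled responses were more accurate.
search_effect$Diff <- search_effect$with_search - search_effect$without_search
search_effect$effect <- ifelse(search_effect$Diff > 0, "Search helped", "Search hurt")

# Same difference expressed in percentage points.
search_effect$diff_pp <- round(100 * search_effect$Diff, 2)
```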

5 Model Performance

5.1 Overall

model	avg_accuracy	sd_accuracy	se_accuracy	num_responses	ci_low	ci_high
gemini	0.7001075	0.4582229	0.0033599	18600	0.6935222	0.7066928
openai	0.7702151	0.4207058	0.0030848	18600	0.7641689	0.7762612

Overall, OpenAI’s model performed better than Google’s Gemini model at the time of this study (77% vs. 70% accuracy).

5.2 By policy

model	policy	avg_accuracy	sd_accuracy	se_accuracy	num_responses	ci_low	ci_high
gemini	most protective	0.8548280	0.3523324	0.0064071	3024	0.8422701	0.8673860
gemini	most restrictive	0.5032628	0.5000114	0.0046954	11340	0.4940598	0.5124658
gemini	protective	0.8734568	0.3324850	0.0040308	6804	0.8655564	0.8813571
gemini	restrictive	0.6458806	0.4782908	0.0065748	5292	0.6329940	0.6587672
gemini	some restrictions/protections	0.7632275	0.4251577	0.0069152	3780	0.7496737	0.7767813
gemini	very protective	0.9157218	0.2778308	0.0038192	5292	0.9082362	0.9232074
gemini	very restrictive	0.5537919	0.4972076	0.0104404	2268	0.5333287	0.5742550
openai	most protective	0.8287037	0.3768300	0.0068526	3024	0.8152726	0.8421348
openai	most restrictive	0.6584656	0.4742452	0.0044534	11340	0.6497369	0.6671944
openai	protective	0.8883010	0.3150189	0.0038190	6804	0.8808157	0.8957863
openai	restrictive	0.5461073	0.4979166	0.0068446	5292	0.5326920	0.5595227
openai	some restrictions/protections	0.6788360	0.4669854	0.0075955	3780	0.6639488	0.6937232
openai	very protective	0.9195011	0.2720897	0.0037403	5292	0.9121702	0.9268320
openai	very restrictive	0.5930335	0.4913769	0.0103179	2268	0.5728103	0.6132567

5.3 By state

model	state	policy	avg_accuracy	sd_accuracy	se_accuracy	num_responses	ci_low	ci_high
openai	North Dakota	restrictive	0.1084656	0.3111734	0.0113173	756	0.0862838	0.1306475
openai	Wisconsin	restrictive	0.2354497	0.4245605	0.0154411	756	0.2051852	0.2657143
openai	West Virginia	most restrictive	0.3597884	0.4802560	0.0174667	756	0.3255535	0.3940232
gemini	Georgia	very restrictive	0.3809524	0.4859424	0.0176736	756	0.3463122	0.4155925
openai	Mississippi	most restrictive	0.4047619	0.4911709	0.0178637	756	0.3697490	0.4397748
openai	South Carolina	most restrictive	0.4126984	0.4926454	0.0179173	756	0.3775804	0.4478164
openai	Ohio	some restrictions/protections	0.4378307	0.4964484	0.0180557	756	0.4024416	0.4732198
openai	Missouri	some restrictions/protections	0.4589947	0.4986456	0.0181356	756	0.4234490	0.4945404
gemini	Kentucky	most restrictive	0.4629630	0.4989565	0.0181469	756	0.4273951	0.4985308
gemini	Indiana	most restrictive	0.4748677	0.4996986	0.0181739	756	0.4392470	0.5104885
gemini	Arkansas	most restrictive	0.4761905	0.4997634	0.0181762	756	0.4405651	0.5118159
openai	Florida	most restrictive	0.4775132	0.4998248	0.0181785	756	0.4418835	0.5131430
gemini	Oklahoma	most restrictive	0.4788360	0.4998826	0.0181806	756	0.4432021	0.5144699
gemini	North Dakota	restrictive	0.4814815	0.4999877	0.0181844	756	0.4458401	0.5171229
gemini	Tennessee	most restrictive	0.4841270	0.5000788	0.0181877	756	0.4484791	0.5197749
gemini	Idaho	most restrictive	0.4933862	0.5002872	0.0181953	756	0.4577235	0.5290490
gemini	Alabama	most restrictive	0.5013228	0.5003293	0.0181968	756	0.4656570	0.5369885
gemini	South Dakota	most restrictive	0.5039683	0.5003153	0.0181963	756	0.4683035	0.5396330
gemini	Texas	most restrictive	0.5066138	0.5002872	0.0181953	756	0.4709510	0.5422765
gemini	Louisiana	most restrictive	0.5079365	0.5002680	0.0181946	756	0.4722751	0.5435979
gemini	South Carolina	most restrictive	0.5105820	0.5002190	0.0181928	756	0.4749241	0.5462399
gemini	West Virginia	most restrictive	0.5224868	0.4998248	0.0181785	756	0.4868570	0.5581165
gemini	Missouri	some restrictions/protections	0.5277778	0.4995583	0.0181688	756	0.4921670	0.5633885
openai	Iowa	most restrictive	0.5291005	0.4994829	0.0181660	756	0.4934951	0.5647059
gemini	Wisconsin	restrictive	0.5291005	0.4994829	0.0181660	756	0.4934951	0.5647059
gemini	Mississippi	most restrictive	0.5370370	0.4989565	0.0181469	756	0.5014692	0.5726049
gemini	Iowa	most restrictive	0.5410053	0.4986456	0.0181356	756	0.5054596	0.5765510
gemini	Florida	most restrictive	0.5476190	0.4980568	0.0181141	756	0.5121153	0.5831228
openai	Utah	very restrictive	0.5515873	0.4976609	0.0180998	756	0.5161118	0.5870628
openai	Wyoming	restrictive	0.5595238	0.4967729	0.0180675	756	0.5241116	0.5949360
openai	Arizona	restrictive	0.5740741	0.4948100	0.0179961	756	0.5388018	0.6093464
gemini	California	most protective	0.5833333	0.4933330	0.0179423	756	0.5481663	0.6185003
gemini	Wyoming	restrictive	0.5846561	0.4931075	0.0179341	756	0.5495052	0.6198070
openai	California	most protective	0.5939153	0.4914258	0.0178730	756	0.5588843	0.6289464
gemini	Arizona	restrictive	0.6005291	0.4901139	0.0178253	756	0.5655916	0.6354666
openai	Nebraska	very restrictive	0.6084656	0.4884166	0.0177635	756	0.5736491	0.6432821
openai	Georgia	very restrictive	0.6190476	0.4859424	0.0176736	756	0.5844075	0.6536878
gemini	Nebraska	very restrictive	0.6349206	0.4817711	0.0175218	756	0.6005778	0.6692635
gemini	Utah	very restrictive	0.6455026	0.4786774	0.0174093	756	0.6113804	0.6796249
openai	Indiana	most restrictive	0.6812169	0.4663133	0.0169596	756	0.6479760	0.7144578
gemini	Ohio	some restrictions/protections	0.7050265	0.4563328	0.0165967	756	0.6724970	0.7375559
openai	Idaho	most restrictive	0.7063492	0.4557354	0.0165749	756	0.6738623	0.7388361
gemini	North Carolina	restrictive	0.7275132	0.4455337	0.0162039	756	0.6957536	0.7592729
openai	New Hampshire	some restrictions/protections	0.7301587	0.4441711	0.0161543	756	0.6984962	0.7618212
openai	Montana	protective	0.7328042	0.4427884	0.0161041	756	0.7012403	0.7643682
openai	Kentucky	most restrictive	0.7354497	0.4413855	0.0160530	756	0.7039858	0.7669137
gemini	Montana	protective	0.7526455	0.4317602	0.0157030	756	0.7218677	0.7834233
openai	Arkansas	most restrictive	0.7619048	0.4261997	0.0155007	756	0.7315233	0.7922862
openai	Kansas	restrictive	0.7645503	0.4245605	0.0154411	756	0.7342857	0.7948148
openai	Pennsylvania	restrictive	0.7685185	0.4220586	0.0153501	756	0.7384323	0.7986048
gemini	Kansas	restrictive	0.7698413	0.4212130	0.0153194	756	0.7398153	0.7998672
openai	Oklahoma	most restrictive	0.7817460	0.4133342	0.0150328	756	0.7522817	0.8112104
openai	Texas	most restrictive	0.7949735	0.4039882	0.0146929	756	0.7661754	0.8237716
openai	Louisiana	most restrictive	0.7962963	0.4030178	0.0146576	756	0.7675674	0.8250252
openai	Tennessee	most restrictive	0.8042328	0.3970528	0.0144407	756	0.7759291	0.8325365
openai	Alabama	most restrictive	0.8095238	0.3929367	0.0142910	756	0.7815135	0.8375341
openai	North Carolina	restrictive	0.8121693	0.3908355	0.0142145	756	0.7843088	0.8400298
openai	South Dakota	most restrictive	0.8214286	0.3832466	0.0139385	756	0.7941090	0.8487481
gemini	Virginia	some restrictions/protections	0.8280423	0.3775935	0.0137329	756	0.8011258	0.8549589
gemini	Pennsylvania	restrictive	0.8280423	0.3775935	0.0137329	756	0.8011258	0.8549589
gemini	New Hampshire	some restrictions/protections	0.8399471	0.3668979	0.0133439	756	0.8137930	0.8661012
gemini	Rhode Island	protective	0.8412698	0.3656662	0.0132992	756	0.8152035	0.8673362
openai	Michigan	very protective	0.8425926	0.3644256	0.0132540	756	0.8166147	0.8685705
gemini	Michigan	very protective	0.8478836	0.3593714	0.0130702	756	0.8222660	0.8735012
openai	Virginia	some restrictions/protections	0.8505291	0.3567881	0.0129763	756	0.8250956	0.8759626
openai	Delaware	protective	0.8597884	0.3474363	0.0126361	756	0.8350215	0.8845552
openai	Vermont	most protective	0.8637566	0.3432739	0.0124848	756	0.8392865	0.8882267
gemini	Hawaii	protective	0.8730159	0.3331756	0.0121175	756	0.8492656	0.8967661
gemini	Delaware	protective	0.8743386	0.3316868	0.0120633	756	0.8506945	0.8979828
gemini	Massachusetts	protective	0.8783069	0.3271475	0.0118982	756	0.8549863	0.9016274
openai	Hawaii	protective	0.8796296	0.3256096	0.0118423	756	0.8564187	0.9028405
gemini	Washington	very protective	0.8822751	0.3224954	0.0117290	756	0.8592862	0.9052641
openai	Washington	very protective	0.8968254	0.3043882	0.0110705	756	0.8751272	0.9185236
openai	Massachusetts	protective	0.9021164	0.2973539	0.0108147	756	0.8809197	0.9233131
gemini	New York	very protective	0.9034392	0.2955544	0.0107492	756	0.8823707	0.9245076
gemini	Connecticut	protective	0.9034392	0.2955544	0.0107492	756	0.8823707	0.9245076
openai	Connecticut	protective	0.9047619	0.2937379	0.0106831	756	0.8838229	0.9257009
gemini	Alaska	protective	0.9060847	0.2919040	0.0106164	756	0.8852764	0.9268929
gemini	New Mexico	very protective	0.9100529	0.2862954	0.0104125	756	0.8896445	0.9304613
gemini	Illinois	protective	0.9140212	0.2805184	0.0102024	756	0.8940245	0.9340178
openai	Alaska	protective	0.9140212	0.2805184	0.0102024	756	0.8940245	0.9340178
gemini	Nevada	some restrictions/protections	0.9153439	0.2785535	0.0101309	756	0.8954874	0.9352005
openai	Nevada	some restrictions/protections	0.9166667	0.2765684	0.0100587	756	0.8969516	0.9363817
gemini	Maine	protective	0.9179894	0.2745625	0.0099857	756	0.8984174	0.9375615
openai	Minnesota	very protective	0.9206349	0.2704867	0.0098375	756	0.9013534	0.9399164
openai	New Mexico	very protective	0.9259259	0.2620648	0.0095312	756	0.9072448	0.9446071
openai	Maryland	most protective	0.9272487	0.2598998	0.0094525	756	0.9087218	0.9457755
openai	Rhode Island	protective	0.9285714	0.2577099	0.0093728	756	0.9102007	0.9469422
openai	Oregon	most protective	0.9298942	0.2554943	0.0092922	756	0.9116814	0.9481070
openai	Illinois	protective	0.9312169	0.2532524	0.0092107	756	0.9131640	0.9492699
gemini	Maryland	most protective	0.9325397	0.2509836	0.0091282	756	0.9146484	0.9504309
gemini	Vermont	most protective	0.9391534	0.2392069	0.0086999	756	0.9221017	0.9562052
openai	Maine	protective	0.9417989	0.2342782	0.0085206	756	0.9250985	0.9584994
openai	Colorado	very protective	0.9444444	0.2292131	0.0083364	756	0.9281051	0.9607838
openai	New Jersey	very protective	0.9484127	0.2213388	0.0080500	756	0.9326347	0.9641907
gemini	New Jersey	very protective	0.9497354	0.2186350	0.0079517	756	0.9341502	0.9653207
gemini	Minnesota	very protective	0.9497354	0.2186350	0.0079517	756	0.9341502	0.9653207
openai	New York	very protective	0.9576720	0.2014698	0.0073274	756	0.9433103	0.9720336
gemini	Oregon	most protective	0.9642857	0.1856997	0.0067538	756	0.9510482	0.9775232
gemini	Colorado	very protective	0.9669312	0.1789346	0.0065078	756	0.9541760	0.9796865

6 Inconsistency & Variability

mean_inconsistency	sd_inconsistency	se_inconsistency	num_responses
0.2104032	0.4076115	0.0036605	12400

7 Confusion Matrix

Failure in handling negative prompts

## # A tibble: 4 × 4
## # Groups:   affirmative_vs_negative [2]
##   affirmative_vs_negative predicted     n  prop
##   <chr>                   <fct>     <int> <dbl>
## 1 affirmative prompt      yes       15004 0.781
## 2 affirmative prompt      no         4196 0.219
## 3 negative prompt         yes       12344 0.686
## 4 negative prompt         no         5656 0.314

### Overall

•   TP = Model is correct when answer is yes

•   TN = Model is correct when answer is no

•   FP = Model says yes when it should be no (wrong)

•   FN = Model says no when it should be yes (wrong)
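These four cells can be counted with a base R cross-tabulation, treating “yes” as the positive class. The vectors below are made up for illustration; the notebook itself appears to load the caret package for this step, which is not used here.

```r
# Hypothetical ground truth and model answers; NOT the study's data.
truth <- factor(c("yes", "yes", "no", "no", "yes", "no", "yes", "no"),
                levels = c("yes", "no"))
predicted <- factor(c("yes", "no", "no", "yes", "yes", "yes", "yes", "no"),
                    levels = c("yes", "no"))

# Rows are model answers, columns are the ground truth.
cm <- table(predicted = predicted, truth = truth)

counts <- c(
  TP = cm["yes", "yes"],  # model says yes, answer is yes
  TN = cm["no", "no"],    # model says no, answer is no
  FP = cm["yes", "no"],   # model says yes, answer is no
  FN = cm["no", "yes"]    # model says no, answer is yes
)

accuracy <- (counts[["TP"]] + counts[["TN"]]) / sum(counts)
```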

7.1 Affirmative Prompts

For affirmative prompts, a “yes” means OK, so we want high true positive and true negative counts.

Bad:

HIGH false negative = saying no when it’s actually OK (yes) [overpredicting no]

HIGH false positive = saying yes when it’s actually not OK (no)

7.2 Negative Prompts

For negative prompts, a “no” means OK and a “yes” means not OK, so we again want high true positive and true negative counts.

Bad:

HIGH false positive = saying yes (which means it’s not OK) when it’s actually OK (no) [indicates bad at negative prompts]

HIGH false negative = saying no (which means it’s OK) when it’s actually not OK (yes)

8 Misinformation Risk (Pregnancy calculation)

TODO: bring in additional analysis from other notebook

9 Sources

Summary stats

10 Long-Form Responses

10.1 Disclaimers/Hedging

TODO: qualitative analysis