There are 25,200 responses in total that were collected in March and April 2025 from Google Gemini and OpenAI GPT. We calculated correct_prediction based on the majority vote across 3 runs.
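
The majority vote over three runs can be sketched as follows. This is a minimal illustration on made-up data, assuming one row per prompt and run; the column names `prompt_id`, `run`, and `correct` are hypothetical. For yes/no answers, voting on per-run correctness gives the same result as voting on the predicted answer and then comparing it to the ground truth.

```r
# Hypothetical per-run data: one row per prompt and run (3 runs per prompt).
runs <- data.frame(
  prompt_id = rep(1:3, each = 3),
  run       = rep(1:3, times = 3),
  correct   = c(TRUE, TRUE, FALSE,   # prompt 1: 2 of 3 runs correct
                FALSE, FALSE, TRUE,  # prompt 2: 1 of 3 runs correct
                TRUE, TRUE, TRUE)    # prompt 3: 3 of 3 runs correct
)

# A prompt counts as correct when more than half of its runs are correct.
majority <- aggregate(correct ~ prompt_id, data = runs,
                      FUN = function(x) mean(x) > 0.5)
names(majority)[2] <- "correct_prediction"
majority
```

With an odd number of runs and a binary outcome there are no ties, so no tie-breaking rule is needed.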
Here we filter out the fifth question type, “general”, because these questions do not have a ground truth. We also split responses into search-enabled and search-disabled groups, according to whether internet search was enabled or disabled for a model during prompting.
Unless otherwise indicated, the following analysis focuses only on responses from search-enabled models, which were the state-of-the-art models at the time of this study.
There are 12,400 responses per client type.
How accurate are model responses compared to current state laws when forced to answer one way or another? In other words, how accurate are model responses with respect to a forced ground truth when the requested prompt response is “yes” or “no”?
We deliberately set up prompting this way even though the answer may be more complicated than a binary response can convey. The goal is to understand how often these models are correct and explore what “correct” means in this context.
| avg_accuracy | sd_accuracy | se_accuracy | ci_lower | ci_upper | num_responses |
|---|---|---|---|---|---|
| 0.7351613 | 0.4412532 | 0.0022878 | 0.7306772 | 0.7396454 | 37200 |
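
The standard error and confidence interval columns in this table are consistent with the usual normal-approximation formulas. The sketch below recomputes them from the tabulated mean, standard deviation, and sample size; the multiplier 1.96 is an assumption that happens to reproduce the tabulated bounds up to rounding.

```r
# Values copied from the table above.
avg_accuracy  <- 0.7351613
sd_accuracy   <- 0.4412532
num_responses <- 37200

# Standard error of the mean and a 95% normal-approximation interval.
se_accuracy <- sd_accuracy / sqrt(num_responses)
ci_lower    <- avg_accuracy - 1.96 * se_accuracy
ci_upper    <- avg_accuracy + 1.96 * se_accuracy

round(c(se = se_accuracy, lower = ci_lower, upper = ci_upper), 7)
```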
Across all states, North Dakota has the lowest accuracy and Vermont has the highest. In 5 of the 50 states, model responses are wrong more than half the time. Because these responses come from search-enabled models (meaning the models can use internet search results to augment their training data), lower accuracy could stem from regional specificity or from inconsistent information that the models struggle with, especially in states with rapidly changing laws or abortion-related news coverage. Conversely, in states where models had higher accuracy, there may be less ambiguity in the training and search data.
Notably, the standard deviation drops as per-state accuracy increases, meaning that in general performance varies more for states with lower accuracy than for those with higher accuracy. This is partly expected: for a binary correct/incorrect outcome with accuracy p, the standard deviation is √(p(1 − p)), which is largest at p = 0.5.
In the next sections, we explore whether this underperformance is systematic across geographic regions and across political and policy contexts.
Ballot measures to protect abortion rights were added to the 2024 election ballots in 10 states: Arizona, Colorado, Florida, Maryland, Missouri, Montana, Nebraska, Nevada, New York, and South Dakota [1].
These measures passed in all of the states except for Florida, Nebraska, and South Dakota.
However, although Missouri voters approved Amendment 3, which enshrined a right to reproductive freedom in the state constitution, the Missouri Supreme Court issued a ruling in May 2025 that effectively reinstated a ban on abortion in the state. A new ballot measure is being drafted by Republicans [2]. This ruling is an example of the confusing and complicated reality of changing abortion laws in the United States.
At the time the model responses were collected for this study, however, Missouri was considered not to have a total ban [3].
Notably, of these states, Missouri has the lowest model response accuracy, at 48%.
A t-test and the accompanying plot show that ballot-measure states are spread across the accuracy spectrum. We cannot reject the null hypothesis, so there is no evidence that mean accuracy differs between states with ballot measures and states without.
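
A comparison of this kind can be sketched with a two-sample t-test on state-level accuracy. The data below are simulated, and the column names `ballot_measure` and `accuracy` are hypothetical; whether the original analysis used the Welch or the equal-variance variant is not stated, so the default Welch test here is an assumption.

```r
set.seed(1)

# Simulated state-level data: 10 ballot-measure states and 40 others,
# both drawn from the same distribution (so there is no true difference).
state_accuracy <- data.frame(
  ballot_measure = rep(c(TRUE, FALSE), times = c(10, 40)),
  accuracy       = runif(50, min = 0.4, max = 0.95)
)

# Welch two-sample t-test (the default in t.test does not assume equal variances).
result <- t.test(accuracy ~ ballot_measure, data = state_accuracy)

# A large p-value means the null hypothesis of equal means is not rejected.
result$p.value
```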
Here we ask: how do models handle state-by-state variability with respect to different restrictions?
We apply the Guttmacher Institute’s state abortion policy framework [4], which categorizes states by the restrictiveness of their abortion laws.
Within this spectrum, accuracy is lowest in restrictive states: 60% across all “most restrictive” states, compared with 93% across all “very protective” states. In other words, models are less likely to provide a correct response in states with restrictive abortion policies.
| policy | avg_accuracy | sd_accuracy | se_accuracy | ci_lower | ci_upper |
|---|---|---|---|---|---|
| most restrictive | 0.6022401 | 0.4894573 | 0.0046332 | 0.5931590 | 0.6113212 |
| restrictive | 0.6303763 | 0.4827492 | 0.0066894 | 0.6172652 | 0.6434875 |
| very restrictive | 0.6442652 | 0.4788426 | 0.0101355 | 0.6243996 | 0.6641308 |
| some restrictions/protections | 0.7290323 | 0.4445192 | 0.0072882 | 0.7147474 | 0.7433171 |
| most protective | 0.8592070 | 0.3478664 | 0.0063767 | 0.8467087 | 0.8717053 |
| protective | 0.8642473 | 0.3425513 | 0.0041862 | 0.8560424 | 0.8724522 |
| very protective | 0.9312596 | 0.2530365 | 0.0035063 | 0.9243873 | 0.9381319 |
Another way to frame state abortion policies is to distinguish states with total abortion bans (“total_ban”), states with no limit (“allowed_no_limit”), and states with bans after a certain gestational duration (“gestational_limit {n_weeks_LMP}”). This policy view provides a slightly more granular perspective on the accuracy of model responses than the one above.
Here we see that states with a 6-week LMP gestational ban (gestational_limit 6) have the lowest accuracy at 52%, and states that allow abortion at any stage of pregnancy have the highest accuracy at 94%. This suggests that accuracy is worse in states that have specific restrictions like gestational bans, regardless of how “restrictive” or “protective” a state’s abortion policies may be.
This is important because a state labeled as “protective” could be protective in some respects but restrictive in others, which may complicate the information available for answering questions about the legality of abortion in a specific state.
Importantly, fetal viability is generally considered to begin at 23 or 24 weeks gestational age [5]. We take the higher number, so any state that allows abortions until “viability” without a specified gestational age is categorized as gestational_limit 24, with 24 denoting 24 weeks LMP as the threshold for fetal viability.
| combined_policy_type | avg_accuracy | sd_accuracy | se_accuracy | ci_lower | ci_upper |
|---|---|---|---|---|---|
| gestational_limit 6 | 0.5156810 | 0.4998660 | 0.0105805 | 0.4949432 | 0.5364188 |
| gestational_limit 22 | 0.5963262 | 0.4907435 | 0.0103874 | 0.5759668 | 0.6166855 |
| total_ban | 0.6141353 | 0.4868261 | 0.0051522 | 0.6040369 | 0.6242337 |
| gestational_limit 18 | 0.7459677 | 0.4356087 | 0.0159702 | 0.7146662 | 0.7772693 |
| gestational_limit 24 | 0.7544803 | 0.4304110 | 0.0037193 | 0.7471905 | 0.7617701 |
| gestational_limit 12 | 0.7560484 | 0.4296083 | 0.0111371 | 0.7342197 | 0.7778771 |
| gestational_limit 26.6 | 0.8024194 | 0.3984419 | 0.0146076 | 0.7737885 | 0.8310502 |
| gestational_limit 26 | 0.9126344 | 0.2825600 | 0.0103592 | 0.8923305 | 0.9329383 |
| allowed_no_limit | 0.9442951 | 0.2293681 | 0.0028030 | 0.9388012 | 0.9497890 |
Abortion restrictions vary across states under both policy framings, and their effect on accuracy differs once the policies are broken down by state.
By combining both policy framings at the state level, we come back to accuracy per state, this time with a clearer view of which kinds of policies are affecting model responses in the states with the lowest and highest accuracy.
To compute the partisan lean of a state, we first calculate the presidential deviation for the 2024 U.S. presidential election using data from AP News [6]. We subtract the national popular vote margin (a Republican lead of 1.5 percentage points) from each state’s vote margin to find how much the state deviates from the national average in support for the Republican presidential candidate in 2024. This metric highlights how much more Democratic or Republican a state is relative to the nation as a whole [7].
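
The deviation described above can be sketched as a subtraction followed by binning. The state margins and the cut points below are hypothetical (the study’s thresholds for each label are not stated); only the national margin of 1.5 points and the five label names come from this analysis.

```r
national_margin <- 1.5  # Republican lead in percentage points

# Hypothetical state margins (Republican minus Democratic vote share, in points).
states <- data.frame(
  state      = c("State A", "State B", "State C"),
  rep_margin = c(13.0, 1.5, -20.0)
)

# Positive deviation: more Republican than the nation; negative: more Democratic.
states$deviation <- states$rep_margin - national_margin

# Hypothetical cut points for the five partisan-lean labels.
states$partisan_lean <- cut(
  states$deviation,
  breaks = c(-Inf, -15, -5, 5, 15, Inf),
  labels = c("Strongly Democrat", "Leans Democrat", "Center",
             "Leans Republican", "Strongly Republican")
)
states
```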
At a national level, states that lean more Democratic have higher model response accuracy: Strongly Republican states have the lowest accuracy at 54%, while Leans Democrat states have the highest at 90% (Strongly Democrat states are at 84%). This suggests an overall model bias that favors states with a Democratic lean.
| partisan_lean | avg_accuracy | sd_accuracy | se_accuracy | ci_lower | ci_upper |
|---|---|---|---|---|---|
| Strongly Republican | 0.5381720 | 0.4986078 | 0.0081750 | 0.5221490 | 0.5541950 |
| Leans Republican | 0.6340246 | 0.4817257 | 0.0047201 | 0.6247732 | 0.6432759 |
| Center | 0.6982527 | 0.4590419 | 0.0048582 | 0.6887306 | 0.7077748 |
| Strongly Democrat | 0.8443100 | 0.3626018 | 0.0054271 | 0.8336729 | 0.8549472 |
| Leans Democrat | 0.9035360 | 0.2952418 | 0.0030021 | 0.8976519 | 0.9094200 |
However, broken down by state, we see that this bias is not as clear cut. We note that partisan lean is based on the results of the 2024 presidential election, not considering current events and the changing political landscape across the U.S.
Given this, we next explore the most likely factors impacting model response accuracy.
We run a logistic regression to test which variables predict whether a model provides the correct answer (correct_prediction).
We use all our data (search-enabled and search-disabled) for the following analysis.
```r
# Set the reference level for each categorical covariate.
logit_data_per_run$policy <- relevel(factor(logit_data_per_run$policy), ref = "some restrictions/protections")
logit_data_per_run$partisan_lean <- relevel(factor(logit_data_per_run$partisan_lean), ref = "Center")
logit_data_per_run$affirmative_vs_negative <- relevel(factor(logit_data_per_run$affirmative_vs_negative), ref = "negative prompt")
logit_data_per_run$correct_answer_yes <- factor(logit_data_per_run$correct_answer_yes,
                                                levels = c(FALSE, TRUE))

# Logistic regression on per-run correctness.
logit_covariates_per_run <- glm(
  correct_prediction ~ policy + partisan_lean +
    type + model + client_type + affirmative_vs_negative,
  data = logit_data_per_run,
  family = "binomial"
)

# Alternative specification that also includes correct_answer_yes (not used):
# logit_covariates_per_run <- glm(
#   correct_prediction ~ policy + partisan_lean +
#     type + model + client_type + affirmative_vs_negative + correct_answer_yes,
#   data = logit_data_per_run,
#   family = "binomial"
# )
```
Below are the odds ratios from this logistic regression model, which predicts the likelihood of a correct response from the models in our study using covariates including state abortion policy, partisan lean, question type, client search settings, and model type.
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | term_label | group |
|---|---|---|---|---|---|---|---|---|
| affirmative_vs_negativeaffirmative prompt | 1.7715657 | 0.0181668 | 31.478554 | 0.0000000 | 1.7096346 | 1.8358229 | affirmative prompt | Affirmative vs Negative |
| modelopenai | 1.1814214 | 0.0178872 | 9.320547 | 0.0000000 | 1.1407276 | 1.2235830 | OpenAI | Model |
| partisan_leanLeans Democrat | 2.4802894 | 0.0407941 | 22.267297 | 0.0000000 | 2.2897778 | 2.6868606 | Leans Democrat | Partisan Lean |
| partisan_leanStrongly Democrat | 1.2709124 | 0.0555215 | 4.317880 | 0.0000158 | 1.1400657 | 1.4172809 | Strongly Democrat | Partisan Lean |
| partisan_leanLeans Republican | 0.9150113 | 0.0243209 | -3.651959 | 0.0002602 | 0.8724102 | 0.9596776 | Leans Republican | Partisan Lean |
| partisan_leanStrongly Republican | 0.6085394 | 0.0303789 | -16.349951 | 0.0000000 | 0.5733547 | 0.6458648 | Strongly Republican | Partisan Lean |
| policyvery protective | 3.7413414 | 0.0513500 | 25.695110 | 0.0000000 | 3.3845125 | 4.1392767 | very protective | Policy |
| policyprotective | 2.6537326 | 0.0430503 | 22.670391 | 0.0000000 | 2.4393002 | 2.8877448 | protective | Policy |
| policymost protective | 2.0245514 | 0.0631610 | 11.167468 | 0.0000000 | 1.7891873 | 2.2918476 | most protective | Policy |
| policyrestrictive | 0.8672656 | 0.0369150 | -3.857785 | 0.0001144 | 0.8066666 | 0.9322670 | restrictive | Policy |
| policymost restrictive | 0.8083426 | 0.0343489 | -6.194360 | 0.0000000 | 0.7556264 | 0.8645399 | most restrictive | Policy |
| policyvery restrictive | 0.7114474 | 0.0435359 | -7.820077 | 0.0000000 | 0.6532379 | 0.7748003 | very restrictive | Policy |
| typetravel | 17.4493478 | 0.2036735 | 14.038654 | 0.0000000 | 11.9785941 | 26.7102195 | Travel | Question Type |
| typegestation | 1.4934929 | 0.0196312 | 20.432680 | 0.0000000 | 1.4371855 | 1.5521492 | Gestation | Question Type |
| typetelemedicine | 0.9152404 | 0.0716273 | -1.236519 | 0.2162658 | 0.7961551 | 1.0543259 | Telemedicine | Question Type |
| client_typewithout_search | 0.9422990 | 0.0178774 | -3.324453 | 0.0008859 | 0.9098510 | 0.9758996 | Search Disabled | Search Setting |
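
Odds ratios like those in the `estimate` column are obtained by exponentiating the fitted log-odds coefficients. The sketch below shows this on simulated data with a single hypothetical binary covariate (`affirmative`); it uses Wald intervals from `confint.default`, which is an assumption about how the tabulated bounds were computed.

```r
set.seed(42)

# Simulated stand-in for the per-run data, with one binary covariate.
n <- 2000
sim <- data.frame(affirmative = rep(c(0, 1), each = n / 2))

# True log-odds: 0.5 for negative prompts, 0.5 + log(1.8) for affirmative prompts.
sim$correct <- rbinom(n, 1, plogis(0.5 + log(1.8) * sim$affirmative))

fit <- glm(correct ~ affirmative, data = sim, family = "binomial")

# Exponentiate the coefficients and the Wald interval bounds to get odds ratios.
odds_ratios <- exp(cbind(estimate = coef(fit), confint.default(fit)))
odds_ratios
```

An odds ratio above 1 means higher odds of a correct response than in the reference level, and below 1 means lower odds. In the table above, `std.error` stays on the log-odds scale, so `statistic` equals `log(estimate) / std.error`; for the affirmative prompt row, log(1.7715657) / 0.0181668 ≈ 31.48.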
Does phrasing matter overall, and does its effect vary by state, by policy, or by restrictiveness? Do LLMs respond differently to variations in how the question is asked? Are there inconsistencies depending on user phrasing and state, and where are they most prevalent?
| type | avg_accuracy | sd_accuracy | se_accuracy | num_responses | ci_width |
|---|---|---|---|---|---|
| telemedicine | 0.67500 | 0.4695502 | 0.0332022 | 200 | 0.1301527 |
| travel | 0.97000 | 0.1710153 | 0.0120926 | 200 | 0.0474030 |
| gestation | 0.78525 | 0.4107001 | 0.0064937 | 4000 | 0.0254555 |
| assault | 0.73450 | 0.4416267 | 0.0049375 | 8000 | 0.0193551 |
Overall, mean model response accuracy for the 4 question types we designed for our prompts ranges between 68% and 97%, with responses to telemedicine questions having the lowest accuracy and responses to travel questions the highest.
However, if we disaggregate question types by the exact prompts used within each type, we see that the phrasing for each question has an effect on overall accuracy. Questions of type “assault” and “gestation” have more prompts because they include variations by gestational age (TODO: explain this better).
For every question type, prompts beginning with “Is it…” performed worse than prompts beginning with “Can I…”. This discrepancy is clearest between prompt 1 and prompt 2 for the telemedicine question type, the type with the lowest accuracy overall. This is important because it shows how a simple rephrasing could change the response someone receives when asking these questions.
However, since we know a state’s abortion policies have a strong effect on accuracy, when we zoom in on question types by policy, we see that the accuracy range widens depending on the question (note: values of 0% are based on a very small sample size).
Once again, accuracy is lower in states with restrictive policies, while states with less restrictive policies have higher model response accuracy.
Breaking results down by question type, we see that California’s model response accuracy is extremely low, at 38%, for prompts of the “assault” type. While California is one of the states with the highest level of protection for abortion care, that protection only applies to people with private and public insurance providers, not to people insured through employers that provide “self-funded” policies. In practice, abortion access in California depends heavily on what insurance a person has and what that insurer is willing to cover [8]. From a model information perspective, this stratification might be clear in the state’s policies but unclear in the abortion information available for California.
(TODO: include this as part of thematic analysis for RAG.)
Using search and non-search data
When we compare models that relied on their cutoff knowledge (search-disabled) with models that augmented responses with search (search-enabled), states on the protective side of the policy spectrum fared worse overall with search. Enabling search hurt accuracy most for states with “protective” policies, decreasing model response accuracy by 6.17% on average. For states with “most protective” policies, search improved model response accuracy by less than 1%.
| type | with_search | without_search | Diff | effect |
|---|---|---|---|---|
| assault | 0.7133750 | 0.6935417 | 0.0198333 | Search helped |
| gestation | 0.7715833 | 0.7696667 | 0.0019167 | Search helped |
| telemedicine | 0.6550000 | 0.8250000 | -0.1700000 | Search hurt |
| travel | 0.9583333 | 1.0000000 | -0.0416667 | Search hurt |
Enabling search also had a negative effect on model response accuracy for telemedicine questions, reducing accuracy by 17 percentage points on average. Where search did help, the effect was minimal: accuracy increased by less than a quarter of a percentage point for gestation questions and by less than 2 percentage points for assault questions.
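
The `Diff` and `effect` columns in the table above can be reproduced from the two accuracy columns. The sketch below copies the tabulated values; the labeling rule (positive difference means search helped) is an assumption inferred from the table, and a difference of exactly zero would be labeled “Search hurt” under it.

```r
# Accuracy by question type, copied from the table above.
by_type <- data.frame(
  type           = c("assault", "gestation", "telemedicine", "travel"),
  with_search    = c(0.7133750, 0.7715833, 0.6550000, 0.9583333),
  without_search = c(0.6935417, 0.7696667, 0.8250000, 1.0000000)
)

# Difference in accuracy (search-enabled minus search-disabled), as a proportion.
by_type$diff   <- by_type$with_search - by_type$without_search
by_type$effect <- ifelse(by_type$diff > 0, "Search helped", "Search hurt")
by_type
```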
| model | avg_accuracy | sd_accuracy | se_accuracy | num_responses | ci_low | ci_high |
|---|---|---|---|---|---|---|
| gemini | 0.7001075 | 0.4582229 | 0.0033599 | 18600 | 0.6935222 | 0.7066928 |
| openai | 0.7702151 | 0.4207058 | 0.0030848 | 18600 | 0.7641689 | 0.7762612 |
Overall, OpenAI’s model performed better than Gemini’s model at the time of this study.
| model | policy | avg_accuracy | sd_accuracy | se_accuracy | num_responses | ci_low | ci_high |
|---|---|---|---|---|---|---|---|
| gemini | most protective | 0.8548280 | 0.3523324 | 0.0064071 | 3024 | 0.8422701 | 0.8673860 |
| gemini | most restrictive | 0.5032628 | 0.5000114 | 0.0046954 | 11340 | 0.4940598 | 0.5124658 |
| gemini | protective | 0.8734568 | 0.3324850 | 0.0040308 | 6804 | 0.8655564 | 0.8813571 |
| gemini | restrictive | 0.6458806 | 0.4782908 | 0.0065748 | 5292 | 0.6329940 | 0.6587672 |
| gemini | some restrictions/protections | 0.7632275 | 0.4251577 | 0.0069152 | 3780 | 0.7496737 | 0.7767813 |
| gemini | very protective | 0.9157218 | 0.2778308 | 0.0038192 | 5292 | 0.9082362 | 0.9232074 |
| gemini | very restrictive | 0.5537919 | 0.4972076 | 0.0104404 | 2268 | 0.5333287 | 0.5742550 |
| openai | most protective | 0.8287037 | 0.3768300 | 0.0068526 | 3024 | 0.8152726 | 0.8421348 |
| openai | most restrictive | 0.6584656 | 0.4742452 | 0.0044534 | 11340 | 0.6497369 | 0.6671944 |
| openai | protective | 0.8883010 | 0.3150189 | 0.0038190 | 6804 | 0.8808157 | 0.8957863 |
| openai | restrictive | 0.5461073 | 0.4979166 | 0.0068446 | 5292 | 0.5326920 | 0.5595227 |
| openai | some restrictions/protections | 0.6788360 | 0.4669854 | 0.0075955 | 3780 | 0.6639488 | 0.6937232 |
| openai | very protective | 0.9195011 | 0.2720897 | 0.0037403 | 5292 | 0.9121702 | 0.9268320 |
| openai | very restrictive | 0.5930335 | 0.4913769 | 0.0103179 | 2268 | 0.5728103 | 0.6132567 |
| model | state | policy | avg_accuracy | sd_accuracy | se_accuracy | num_responses | ci_low | ci_high |
|---|---|---|---|---|---|---|---|---|
| openai | North Dakota | restrictive | 0.1084656 | 0.3111734 | 0.0113173 | 756 | 0.0862838 | 0.1306475 |
| openai | Wisconsin | restrictive | 0.2354497 | 0.4245605 | 0.0154411 | 756 | 0.2051852 | 0.2657143 |
| openai | West Virginia | most restrictive | 0.3597884 | 0.4802560 | 0.0174667 | 756 | 0.3255535 | 0.3940232 |
| gemini | Georgia | very restrictive | 0.3809524 | 0.4859424 | 0.0176736 | 756 | 0.3463122 | 0.4155925 |
| openai | Mississippi | most restrictive | 0.4047619 | 0.4911709 | 0.0178637 | 756 | 0.3697490 | 0.4397748 |
| openai | South Carolina | most restrictive | 0.4126984 | 0.4926454 | 0.0179173 | 756 | 0.3775804 | 0.4478164 |
| openai | Ohio | some restrictions/protections | 0.4378307 | 0.4964484 | 0.0180557 | 756 | 0.4024416 | 0.4732198 |
| openai | Missouri | some restrictions/protections | 0.4589947 | 0.4986456 | 0.0181356 | 756 | 0.4234490 | 0.4945404 |
| gemini | Kentucky | most restrictive | 0.4629630 | 0.4989565 | 0.0181469 | 756 | 0.4273951 | 0.4985308 |
| gemini | Indiana | most restrictive | 0.4748677 | 0.4996986 | 0.0181739 | 756 | 0.4392470 | 0.5104885 |
| gemini | Arkansas | most restrictive | 0.4761905 | 0.4997634 | 0.0181762 | 756 | 0.4405651 | 0.5118159 |
| openai | Florida | most restrictive | 0.4775132 | 0.4998248 | 0.0181785 | 756 | 0.4418835 | 0.5131430 |
| gemini | Oklahoma | most restrictive | 0.4788360 | 0.4998826 | 0.0181806 | 756 | 0.4432021 | 0.5144699 |
| gemini | North Dakota | restrictive | 0.4814815 | 0.4999877 | 0.0181844 | 756 | 0.4458401 | 0.5171229 |
| gemini | Tennessee | most restrictive | 0.4841270 | 0.5000788 | 0.0181877 | 756 | 0.4484791 | 0.5197749 |
| gemini | Idaho | most restrictive | 0.4933862 | 0.5002872 | 0.0181953 | 756 | 0.4577235 | 0.5290490 |
| gemini | Alabama | most restrictive | 0.5013228 | 0.5003293 | 0.0181968 | 756 | 0.4656570 | 0.5369885 |
| gemini | South Dakota | most restrictive | 0.5039683 | 0.5003153 | 0.0181963 | 756 | 0.4683035 | 0.5396330 |
| gemini | Texas | most restrictive | 0.5066138 | 0.5002872 | 0.0181953 | 756 | 0.4709510 | 0.5422765 |
| gemini | Louisiana | most restrictive | 0.5079365 | 0.5002680 | 0.0181946 | 756 | 0.4722751 | 0.5435979 |
| gemini | South Carolina | most restrictive | 0.5105820 | 0.5002190 | 0.0181928 | 756 | 0.4749241 | 0.5462399 |
| gemini | West Virginia | most restrictive | 0.5224868 | 0.4998248 | 0.0181785 | 756 | 0.4868570 | 0.5581165 |
| gemini | Missouri | some restrictions/protections | 0.5277778 | 0.4995583 | 0.0181688 | 756 | 0.4921670 | 0.5633885 |
| openai | Iowa | most restrictive | 0.5291005 | 0.4994829 | 0.0181660 | 756 | 0.4934951 | 0.5647059 |
| gemini | Wisconsin | restrictive | 0.5291005 | 0.4994829 | 0.0181660 | 756 | 0.4934951 | 0.5647059 |
| gemini | Mississippi | most restrictive | 0.5370370 | 0.4989565 | 0.0181469 | 756 | 0.5014692 | 0.5726049 |
| gemini | Iowa | most restrictive | 0.5410053 | 0.4986456 | 0.0181356 | 756 | 0.5054596 | 0.5765510 |
| gemini | Florida | most restrictive | 0.5476190 | 0.4980568 | 0.0181141 | 756 | 0.5121153 | 0.5831228 |
| openai | Utah | very restrictive | 0.5515873 | 0.4976609 | 0.0180998 | 756 | 0.5161118 | 0.5870628 |
| openai | Wyoming | restrictive | 0.5595238 | 0.4967729 | 0.0180675 | 756 | 0.5241116 | 0.5949360 |
| openai | Arizona | restrictive | 0.5740741 | 0.4948100 | 0.0179961 | 756 | 0.5388018 | 0.6093464 |
| gemini | California | most protective | 0.5833333 | 0.4933330 | 0.0179423 | 756 | 0.5481663 | 0.6185003 |
| gemini | Wyoming | restrictive | 0.5846561 | 0.4931075 | 0.0179341 | 756 | 0.5495052 | 0.6198070 |
| openai | California | most protective | 0.5939153 | 0.4914258 | 0.0178730 | 756 | 0.5588843 | 0.6289464 |
| gemini | Arizona | restrictive | 0.6005291 | 0.4901139 | 0.0178253 | 756 | 0.5655916 | 0.6354666 |
| openai | Nebraska | very restrictive | 0.6084656 | 0.4884166 | 0.0177635 | 756 | 0.5736491 | 0.6432821 |
| openai | Georgia | very restrictive | 0.6190476 | 0.4859424 | 0.0176736 | 756 | 0.5844075 | 0.6536878 |
| gemini | Nebraska | very restrictive | 0.6349206 | 0.4817711 | 0.0175218 | 756 | 0.6005778 | 0.6692635 |
| gemini | Utah | very restrictive | 0.6455026 | 0.4786774 | 0.0174093 | 756 | 0.6113804 | 0.6796249 |
| openai | Indiana | most restrictive | 0.6812169 | 0.4663133 | 0.0169596 | 756 | 0.6479760 | 0.7144578 |
| gemini | Ohio | some restrictions/protections | 0.7050265 | 0.4563328 | 0.0165967 | 756 | 0.6724970 | 0.7375559 |
| openai | Idaho | most restrictive | 0.7063492 | 0.4557354 | 0.0165749 | 756 | 0.6738623 | 0.7388361 |
| gemini | North Carolina | restrictive | 0.7275132 | 0.4455337 | 0.0162039 | 756 | 0.6957536 | 0.7592729 |
| openai | New Hampshire | some restrictions/protections | 0.7301587 | 0.4441711 | 0.0161543 | 756 | 0.6984962 | 0.7618212 |
| openai | Montana | protective | 0.7328042 | 0.4427884 | 0.0161041 | 756 | 0.7012403 | 0.7643682 |
| openai | Kentucky | most restrictive | 0.7354497 | 0.4413855 | 0.0160530 | 756 | 0.7039858 | 0.7669137 |
| gemini | Montana | protective | 0.7526455 | 0.4317602 | 0.0157030 | 756 | 0.7218677 | 0.7834233 |
| openai | Arkansas | most restrictive | 0.7619048 | 0.4261997 | 0.0155007 | 756 | 0.7315233 | 0.7922862 |
| openai | Kansas | restrictive | 0.7645503 | 0.4245605 | 0.0154411 | 756 | 0.7342857 | 0.7948148 |
| openai | Pennsylvania | restrictive | 0.7685185 | 0.4220586 | 0.0153501 | 756 | 0.7384323 | 0.7986048 |
| gemini | Kansas | restrictive | 0.7698413 | 0.4212130 | 0.0153194 | 756 | 0.7398153 | 0.7998672 |
| openai | Oklahoma | most restrictive | 0.7817460 | 0.4133342 | 0.0150328 | 756 | 0.7522817 | 0.8112104 |
| openai | Texas | most restrictive | 0.7949735 | 0.4039882 | 0.0146929 | 756 | 0.7661754 | 0.8237716 |
| openai | Louisiana | most restrictive | 0.7962963 | 0.4030178 | 0.0146576 | 756 | 0.7675674 | 0.8250252 |
| openai | Tennessee | most restrictive | 0.8042328 | 0.3970528 | 0.0144407 | 756 | 0.7759291 | 0.8325365 |
| openai | Alabama | most restrictive | 0.8095238 | 0.3929367 | 0.0142910 | 756 | 0.7815135 | 0.8375341 |
| openai | North Carolina | restrictive | 0.8121693 | 0.3908355 | 0.0142145 | 756 | 0.7843088 | 0.8400298 |
| openai | South Dakota | most restrictive | 0.8214286 | 0.3832466 | 0.0139385 | 756 | 0.7941090 | 0.8487481 |
| gemini | Virginia | some restrictions/protections | 0.8280423 | 0.3775935 | 0.0137329 | 756 | 0.8011258 | 0.8549589 |
| gemini | Pennsylvania | restrictive | 0.8280423 | 0.3775935 | 0.0137329 | 756 | 0.8011258 | 0.8549589 |
| gemini | New Hampshire | some restrictions/protections | 0.8399471 | 0.3668979 | 0.0133439 | 756 | 0.8137930 | 0.8661012 |
| gemini | Rhode Island | protective | 0.8412698 | 0.3656662 | 0.0132992 | 756 | 0.8152035 | 0.8673362 |
| openai | Michigan | very protective | 0.8425926 | 0.3644256 | 0.0132540 | 756 | 0.8166147 | 0.8685705 |
| gemini | Michigan | very protective | 0.8478836 | 0.3593714 | 0.0130702 | 756 | 0.8222660 | 0.8735012 |
| openai | Virginia | some restrictions/protections | 0.8505291 | 0.3567881 | 0.0129763 | 756 | 0.8250956 | 0.8759626 |
| openai | Delaware | protective | 0.8597884 | 0.3474363 | 0.0126361 | 756 | 0.8350215 | 0.8845552 |
| openai | Vermont | most protective | 0.8637566 | 0.3432739 | 0.0124848 | 756 | 0.8392865 | 0.8882267 |
| gemini | Hawaii | protective | 0.8730159 | 0.3331756 | 0.0121175 | 756 | 0.8492656 | 0.8967661 |
| gemini | Delaware | protective | 0.8743386 | 0.3316868 | 0.0120633 | 756 | 0.8506945 | 0.8979828 |
| gemini | Massachusetts | protective | 0.8783069 | 0.3271475 | 0.0118982 | 756 | 0.8549863 | 0.9016274 |
| openai | Hawaii | protective | 0.8796296 | 0.3256096 | 0.0118423 | 756 | 0.8564187 | 0.9028405 |
| gemini | Washington | very protective | 0.8822751 | 0.3224954 | 0.0117290 | 756 | 0.8592862 | 0.9052641 |
| openai | Washington | very protective | 0.8968254 | 0.3043882 | 0.0110705 | 756 | 0.8751272 | 0.9185236 |
| openai | Massachusetts | protective | 0.9021164 | 0.2973539 | 0.0108147 | 756 | 0.8809197 | 0.9233131 |
| gemini | New York | very protective | 0.9034392 | 0.2955544 | 0.0107492 | 756 | 0.8823707 | 0.9245076 |
| gemini | Connecticut | protective | 0.9034392 | 0.2955544 | 0.0107492 | 756 | 0.8823707 | 0.9245076 |
| openai | Connecticut | protective | 0.9047619 | 0.2937379 | 0.0106831 | 756 | 0.8838229 | 0.9257009 |
| gemini | Alaska | protective | 0.9060847 | 0.2919040 | 0.0106164 | 756 | 0.8852764 | 0.9268929 |
| gemini | New Mexico | very protective | 0.9100529 | 0.2862954 | 0.0104125 | 756 | 0.8896445 | 0.9304613 |
| gemini | Illinois | protective | 0.9140212 | 0.2805184 | 0.0102024 | 756 | 0.8940245 | 0.9340178 |
| openai | Alaska | protective | 0.9140212 | 0.2805184 | 0.0102024 | 756 | 0.8940245 | 0.9340178 |
| gemini | Nevada | some restrictions/protections | 0.9153439 | 0.2785535 | 0.0101309 | 756 | 0.8954874 | 0.9352005 |
| openai | Nevada | some restrictions/protections | 0.9166667 | 0.2765684 | 0.0100587 | 756 | 0.8969516 | 0.9363817 |
| gemini | Maine | protective | 0.9179894 | 0.2745625 | 0.0099857 | 756 | 0.8984174 | 0.9375615 |
| openai | Minnesota | very protective | 0.9206349 | 0.2704867 | 0.0098375 | 756 | 0.9013534 | 0.9399164 |
| openai | New Mexico | very protective | 0.9259259 | 0.2620648 | 0.0095312 | 756 | 0.9072448 | 0.9446071 |
| openai | Maryland | most protective | 0.9272487 | 0.2598998 | 0.0094525 | 756 | 0.9087218 | 0.9457755 |
| openai | Rhode Island | protective | 0.9285714 | 0.2577099 | 0.0093728 | 756 | 0.9102007 | 0.9469422 |
| openai | Oregon | most protective | 0.9298942 | 0.2554943 | 0.0092922 | 756 | 0.9116814 | 0.9481070 |
| openai | Illinois | protective | 0.9312169 | 0.2532524 | 0.0092107 | 756 | 0.9131640 | 0.9492699 |
| gemini | Maryland | most protective | 0.9325397 | 0.2509836 | 0.0091282 | 756 | 0.9146484 | 0.9504309 |
| gemini | Vermont | most protective | 0.9391534 | 0.2392069 | 0.0086999 | 756 | 0.9221017 | 0.9562052 |
| openai | Maine | protective | 0.9417989 | 0.2342782 | 0.0085206 | 756 | 0.9250985 | 0.9584994 |
| openai | Colorado | very protective | 0.9444444 | 0.2292131 | 0.0083364 | 756 | 0.9281051 | 0.9607838 |
| openai | New Jersey | very protective | 0.9484127 | 0.2213388 | 0.0080500 | 756 | 0.9326347 | 0.9641907 |
| gemini | New Jersey | very protective | 0.9497354 | 0.2186350 | 0.0079517 | 756 | 0.9341502 | 0.9653207 |
| gemini | Minnesota | very protective | 0.9497354 | 0.2186350 | 0.0079517 | 756 | 0.9341502 | 0.9653207 |
| openai | New York | very protective | 0.9576720 | 0.2014698 | 0.0073274 | 756 | 0.9433103 | 0.9720336 |
| gemini | Oregon | most protective | 0.9642857 | 0.1856997 | 0.0067538 | 756 | 0.9510482 | 0.9775232 |
| gemini | Colorado | very protective | 0.9669312 | 0.1789346 | 0.0065078 | 756 | 0.9541760 | 0.9796865 |
| mean_inconsistency | sd_inconsistency | se_inconsistency | num_responses |
|---|---|---|---|
| 0.2104032 | 0.4076115 | 0.0036605 | 12400 |
Failure in handling negative prompts
## # A tibble: 4 × 4
## # Groups: affirmative_vs_negative [2]
## affirmative_vs_negative predicted n prop
## <chr> <fct> <int> <dbl>
## 1 affirmative prompt yes 15004 0.781
## 2 affirmative prompt no 4196 0.219
## 3 negative prompt yes 12344 0.686
## 4 negative prompt no 5656 0.314
### Overall
• TP = Model is correct when answer is yes
• TN = Model is correct when answer is no
• FP = Model says yes when it should be no (wrong)
• FN = Model says no when it should be yes (wrong)
Correct answer should be yes (meaning OK) == should have high true positive and true negative
Bad:
High false negative = saying no when it’s actually OK (yes) [overpredicting no]
High false positive = saying yes when it’s actually not OK (no)
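
The four quantities defined above can be read off a cross-tabulation of predicted and true answers. This is a minimal base-R sketch on six made-up responses; the vectors `truth` and `predicted` are hypothetical.

```r
# Hypothetical ground truth and model predictions for six responses.
truth     <- factor(c("yes", "yes", "yes", "no", "no", "no"), levels = c("yes", "no"))
predicted <- factor(c("yes", "no",  "yes", "no", "yes", "no"), levels = c("yes", "no"))

# Rows are model predictions, columns are correct answers.
cm <- table(predicted = predicted, truth = truth)

tp <- cm["yes", "yes"]  # model says yes, answer is yes
fn <- cm["no",  "yes"]  # model says no,  answer is yes
fp <- cm["yes", "no"]   # model says yes, answer is no
tn <- cm["no",  "no"]   # model says no,  answer is no

c(TP = tp, TN = tn, FP = fp, FN = fn)
```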
Correct answer should be no (meaning not OK) == should have high true positive and true negative
Bad:
High false positive = saying yes (which means it’s not OK) when it’s actually OK (no) [indicates the model is bad at negative prompts]
High false negative = saying no (which means it’s OK) when it’s actually not OK (yes)
TODO: bring in additional analysis from other notebook
Summary stats
TODO: qualitative analysis
1. https://www.kff.org/womens-health-policy/dashboard/ballot-tracker-status-of-abortion-related-state-constitutional-amendment-measures/
2. https://slate.com/news-and-politics/2025/06/missouri-abortion-ban-explained-amendment-3.html
3. http://web.archive.org/web/20250428015151/https://www.guttmacher.org/state-policy/explore/state-policies-abortion-bans
4. http://web.archive.org/web/20250430012019/https://states.guttmacher.org/policies/
5. https://www.acog.org/advocacy/facts-are-important/understanding-and-navigating-viability
6. https://apnews.com/projects/election-results-2024/?office=P
7. https://www.presidency.ucsb.edu/statistics/data/presidential-election-mandates
8. https://www.aclunc.org/our-work/know-your-rights/know-your-rights-abortion-access-california