There are 25,200 responses in total that were collected in March and April 2025 from Google Gemini and OpenAI GPT. We calculated correct_prediction based on the majority vote across 3 runs.
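As a minimal sketch of the majority-vote step, assuming one row per prompt per run (the data frame and its column names `prompt_id`, `answer`, and `truth` are hypothetical, not the study's actual schema):

```r
# Hypothetical per-run data: 3 prompts, each asked in 3 runs.
runs <- data.frame(
  prompt_id = rep(c("p1", "p2", "p3"), each = 3),
  answer    = c("yes", "yes", "no",  "no", "no", "no",  "yes", "no", "yes"),
  truth     = rep(c("yes", "no", "no"), each = 3),
  stringsAsFactors = FALSE
)

# Return the most frequent answer; with 3 runs and yes/no answers there are no ties.
majority_vote <- function(answers) {
  counts <- table(answers)
  names(counts)[which.max(counts)]
}

# One voted answer and one ground-truth label per prompt.
voted_answer <- vapply(split(runs$answer, runs$prompt_id), majority_vote, character(1))
truth        <- vapply(split(runs$truth, runs$prompt_id), function(x) x[1], character(1))

# correct_prediction is TRUE when the voted answer matches the ground truth.
correct_prediction <- voted_answer == truth
```

With an odd number of runs and a binary answer, the vote always resolves to a single label, which is why no tie-breaking rule is needed in this sketch.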

Setup

Here we filter out the 5th question type “general” because these questions do not have a ground truth. We also separate responses by search-enabled and search-disabled where searching the internet was enabled or disabled for a model, respectively, during the prompting process.

The following analysis focuses only on responses from search-enabled models (unless otherwise indicated), as these were the state-of-the-art (SOTA) models at the time of this study.

There are 12,400 responses per client type.

1 Accuracy

How accurate are model responses compared to current state laws when forced to answer one way or another? In other words, how accurate are model responses with respect to a forced ground truth when the requested prompt response is “yes” or “no”?

We deliberately set up prompting this way even though the answer may be more complicated than a binary response can convey. The goal is to understand how often these models are correct and explore what “correct” means in this context.

1.1 Accuracy overall

Model response accuracy across all questions, states, and models is about 74% (0.735 in the table below, which covers 37,200 responses).
avg_accuracy sd_accuracy se_accuracy ci_lower ci_upper num_responses
0.7351613 0.4412532 0.0022878 0.7306772 0.7396454 37200
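The columns in these summary tables can be reproduced with a few lines of base R. This is a sketch that assumes a normal-approximation 95% interval (mean ± 1.96 × SE), which is consistent with the values shown; `summarise_accuracy` is a hypothetical helper name, not a function from the original notebook.

```r
# Summarise a vector of correct/incorrect flags into the table's columns.
summarise_accuracy <- function(correct) {
  x  <- as.numeric(correct)      # TRUE/FALSE -> 1/0
  n  <- length(x)
  se <- sd(x) / sqrt(n)          # standard error of the mean
  z  <- qnorm(0.975)             # about 1.96 for a 95% interval
  c(avg_accuracy  = mean(x),
    sd_accuracy   = sd(x),
    se_accuracy   = se,
    ci_lower      = mean(x) - z * se,
    ci_upper      = mean(x) + z * se,
    num_responses = n)
}

# Toy input: three correct responses and one incorrect response.
toy_summary <- summarise_accuracy(c(TRUE, TRUE, TRUE, FALSE))

# Cross-check against the overall row: SE = SD / sqrt(n).
se_from_table <- 0.4412532 / sqrt(37200)
```

The cross-check recovers the reported se_accuracy from the reported sd_accuracy and num_responses.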

1.2 Accuracy per state

Across all states, North Dakota has the lowest accuracy and Vermont has the highest. In 5 out of the 50 states, model responses are wrong more than half the time. As these responses come from search-enabled models (meaning the models have access to internet search results to augment their training data), the reason for lower accuracy could be due to regional specificity or inconsistent information that the models struggle with, especially if these are states with rapidly changing laws/abortion-related news coverage. Conversely, in states where models had higher accuracy, there may be less ambiguity in the training and search data.

Notably, the standard deviation also drops as per-state accuracy increases, meaning that, in general, performance varies more for states with lower accuracy than for those with higher accuracy.
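One caveat when reading this pattern: for a yes/no correctness flag, the standard deviation is tied to the mean by sqrt(p × (1 − p)), so it shrinks automatically as accuracy moves above 50%. The sketch below illustrates that relationship; `bernoulli_sd` is a hypothetical helper name.

```r
# Standard deviation of a 0/1 variable whose mean (accuracy) is p.
bernoulli_sd <- function(p) sqrt(p * (1 - p))

accuracy <- c(0.5, 0.6, 0.7, 0.8, 0.9, 0.99)
sd_by_accuracy <- data.frame(
  accuracy = accuracy,
  sd       = round(bernoulli_sd(accuracy), 3)
)
print(sd_by_accuracy)

# The overall accuracy of 0.7351613 implies an SD close to the reported 0.4412532.
implied_sd <- bernoulli_sd(0.7351613)
```

Because of this, a falling standard deviation is not by itself evidence that models behave more consistently in high-accuracy states.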

In the next sections, we explore whether this underperformance is systematic across geographic regions and political and policy contexts.

1.3 Accuracy by 2024 election ballot measures

Ballot measures to protect abortion rights were added to the 2024 election ballots in 10 states: Arizona, Colorado, Florida, Maryland, Missouri, Montana, Nebraska, Nevada, New York, and South Dakota [1].

These measures passed in all of the states except for Florida, Nebraska, and South Dakota.

However, while Missouri voters approved Amendment 3, which enshrined a right to reproductive freedom in the state constitution, the Missouri Supreme Court issued a ruling in May 2025 that effectively reinstated a ban on abortion in the state. Republicans are now drafting a new ballot measure [2]. This ruling is an example of the confusing and complicated reality of changing abortion laws in the United States.

However, at the time the model responses were collected for this study, Missouri was considered not to have a total ban [3].

It’s interesting that of these states, model responses for Missouri have the worst accuracy at 48%.

A t-test and the accompanying plot show that ballot measure states are distributed across the accuracy spectrum. We cannot reject the null hypothesis, so there is no evidence that mean accuracy differs between states with ballot measures and states without.
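A comparison like this can be sketched with base R's `t.test`, which runs a Welch two-sample test by default. The per-state accuracies and the `has_ballot_measure` flag below are made-up illustrative values, not the study's data.

```r
# Hypothetical per-state accuracy, flagged by whether the state had a 2024 ballot measure.
state_accuracy <- data.frame(
  accuracy = c(0.48, 0.62, 0.75, 0.88, 0.93,
               0.45, 0.55, 0.65, 0.72, 0.80, 0.86, 0.91, 0.95),
  has_ballot_measure = c(rep(TRUE, 5), rep(FALSE, 8))
)

# Welch two-sample t-test (unequal variances) of mean accuracy between the two groups.
ballot_test <- t.test(accuracy ~ has_ballot_measure, data = state_accuracy)

# A p-value above the chosen threshold (for example 0.05) means we cannot reject
# the null hypothesis of equal mean accuracy.
ballot_p_value <- ballot_test$p.value
```

Failing to reject the null is not proof that the groups are the same, especially with only 10 ballot measure states.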

2 Geography + Political/Policy Bias

Here we ask: how do models handle state-by-state variability with respect to different restrictions?

2.1 By Policy

2.1.1 Guttmacher Institute Abortion Policy Spectrum

Here, we implement the Guttmacher Institute’s state abortion policy framework [4], where states are categorized based on the restrictiveness of their abortion laws.

Within this spectrum, the “most restrictive” states have the lowest accuracy at 60%, and the “very protective” states have the highest at 93%. In other words, models are less likely to provide a correct response in states with restrictive abortion policies.

policy avg_accuracy sd_accuracy se_accuracy ci_lower ci_upper
most restrictive 0.6022401 0.4894573 0.0046332 0.5931590 0.6113212
restrictive 0.6303763 0.4827492 0.0066894 0.6172652 0.6434875
very restrictive 0.6442652 0.4788426 0.0101355 0.6243996 0.6641308
some restrictions/protections 0.7290323 0.4445192 0.0072882 0.7147474 0.7433171
most protective 0.8592070 0.3478664 0.0063767 0.8467087 0.8717053
protective 0.8642473 0.3425513 0.0041862 0.8560424 0.8724522
very protective 0.9312596 0.2530365 0.0035063 0.9243873 0.9381319

2.1.2 Total Abortion Bans and Bans Based on Gestational Duration

Another way to frame state abortion policies is to distinguish states with total abortion bans (“total_ban”), states with no limit (“allowed_no_limit”), and states with bans based on gestational duration (“gestational_limit {n_weeks_LMP}”), which apply after a certain point in pregnancy. This policy view provides a slightly more granular perspective on the accuracy of model responses than the one above.

Here we see that states with a 6-week LMP gestational ban (gestational_limit 6) have the lowest accuracy at 52%, and states that allow abortion at any stage of pregnancy have the highest accuracy at 94%. This shows that accuracy is worse in states that have specific restrictions like gestational bans, regardless of how “restrictive” or “protective” a state’s abortion policies may be overall.

This is important because a state labeled as “protective” could be protective in some respects but restrictive in others which may complicate the information available to provide answers to questions on the legality of abortion in a specific state.


❗️ Importantly, fetal viability is generally considered to begin at 23 or 24 weeks gestational age [5]. We take the higher number here, so any state that allows abortions until “viability” without a specified gestational age is categorized as gestational_limit 24, with 24 denoting 24 weeks LMP as the threshold for fetal viability.

combined_policy_type avg_accuracy sd_accuracy se_accuracy ci_lower ci_upper
gestational_limit 6 0.5156810 0.4998660 0.0105805 0.4949432 0.5364188
gestational_limit 22 0.5963262 0.4907435 0.0103874 0.5759668 0.6166855
total_ban 0.6141353 0.4868261 0.0051522 0.6040369 0.6242337
gestational_limit 18 0.7459677 0.4356087 0.0159702 0.7146662 0.7772693
gestational_limit 24 0.7544803 0.4304110 0.0037193 0.7471905 0.7617701
gestational_limit 12 0.7560484 0.4296083 0.0111371 0.7342197 0.7778771
gestational_limit 26.6 0.8024194 0.3984419 0.0146076 0.7737885 0.8310502
gestational_limit 26 0.9126344 0.2825600 0.0103592 0.8923305 0.9329383
allowed_no_limit 0.9442951 0.2293681 0.0028030 0.9388012 0.9497890

2.1.3 By Policy x State

Abortion restrictions vary across states under both policy framings, and their effect on accuracy looks different when the policies are broken down by state.

By combining both of these policy framings at the state level, we come back to accuracy per state; this time with a clearer view of what kind of policies are affecting model responses in the states with the lowest and highest accuracy.

  • North Dakota, the state where model responses had the lowest accuracy, is a restrictive state (but not one of the most restrictive states) as a total abortion ban was repealed in late 2024 in the state. However, there are no abortion providers in the state and litigation around the ban is still in progress. What makes North Dakota “restrictive” rather than protective, unlike other states with a viability-based gestational ban such as Montana, is the set of additional restrictions pregnant people must abide by to access an abortion, including waiting periods, required counseling, and limited insurance coverage. Wisconsin similarly has a long list of additional restrictions on access to abortion care.
  • With the exception of California, the states that follow in the list, in increasing order of accuracy, fall under a restrictive policy category and have a low gestational duration ban (meaning the state bans abortions early in pregnancy) and/or additional restrictions alongside a high gestational duration ban (meaning a pregnant person can get an abortion in theory, but faces many obstacles to do so). We will revisit California soon. Accuracy ranges from 32% to 77% for these states.
  • The states with higher model response accuracy either have no ban and allow abortion throughout pregnancy or allow abortions up to viability (gestational_limit 24) with few additional restrictions. These state policies are in the protective range or have some restrictions/protections. Accuracy ranges from 86% to 99% for these states.

2.2 By Partisan Lean

To compute the partisan lean of a state, we first calculate the presidential deviation for the 2024 U.S. presidential election using data from AP News [6]. We subtract the national popular vote margin (a Republican lead of 1.5 percentage points) from each state’s vote margin to find how much the state deviates from the national average in support for the Republican presidential candidate in 2024. This metric highlights how much more Democratic or Republican a state is relative to the nation as a whole [7].
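The deviation calculation can be sketched as follows. The national margin of 1.5 points comes from the text above; the state margins, the column names, and especially the cut points used to bin states into lean categories are assumptions made for illustration, since the thresholds are not stated here.

```r
# National popular vote margin in percentage points (Republican lead), from the text.
national_margin <- 1.5

# Hypothetical state-level Republican margins (positive = Republican lead).
state_margins <- data.frame(
  state      = c("State A", "State B", "State C"),
  rep_margin = c(21.5, 1.5, -18.5)
)

# Presidential deviation: how far each state sits from the national margin.
state_margins$presidential_deviation <- state_margins$rep_margin - national_margin

# ASSUMED cut points (+/-5 and +/-15 points) for the lean categories.
state_margins$partisan_lean <- cut(
  state_margins$presidential_deviation,
  breaks = c(-Inf, -15, -5, 5, 15, Inf),
  labels = c("Strongly Democrat", "Leans Democrat", "Center",
             "Leans Republican", "Strongly Republican")
)
```

A state whose margin equals the national margin gets a deviation of 0 and lands in “Center”, whichever party won it.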

At a national level, states that lean more Democratic have higher model response accuracy: Strongly Republican states are at 54% and Leans Republican states at 63%, compared with 84% for Strongly Democrat states and 90% for Leans Democrat states. This suggests an overall model bias that favors states with a Democratic lean.

partisan_lean avg_accuracy sd_accuracy se_accuracy ci_lower ci_upper
Strongly Republican 0.5381720 0.4986078 0.0081750 0.5221490 0.5541950
Leans Republican 0.6340246 0.4817257 0.0047201 0.6247732 0.6432759
Center 0.6982527 0.4590419 0.0048582 0.6887306 0.7077748
Strongly Democrat 0.8443100 0.3626018 0.0054271 0.8336729 0.8549472
Leans Democrat 0.9035360 0.2952418 0.0030021 0.8976519 0.9094200

However, broken down by state, we see that this bias is not as clear cut. We note that partisan lean is based on the results of the 2024 presidential election, not considering current events and the changing political landscape across the U.S.

Given this, we next explore the most likely factors impacting model response accuracy.

2.3 Odds Ratio

Next, we run a logistic regression to test which variables best predict whether a model provides the correct answer (correct_prediction).

We use all our data (search-enabled and search-disabled) for the following analysis.

# Set reference levels so each odds ratio is read relative to a baseline category.
logit_data_per_run$policy <- relevel(factor(logit_data_per_run$policy), ref = "some restrictions/protections")
logit_data_per_run$partisan_lean <- relevel(factor(logit_data_per_run$partisan_lean), ref = "Center")
logit_data_per_run$affirmative_vs_negative <- relevel(factor(logit_data_per_run$affirmative_vs_negative), ref = "negative prompt")
# correct_answer_yes is only used in the commented-out specification below.
logit_data_per_run$correct_answer_yes <- factor(logit_data_per_run$correct_answer_yes,
                                                levels = c(FALSE, TRUE))

# Logistic regression on per-run responses: is a response correct, given policy,
# partisan lean, question type, model, search setting, and prompt framing?
logit_covariates_per_run <- glm(
  correct_prediction ~ policy + partisan_lean +
    type + model + client_type + affirmative_vs_negative,
  data = logit_data_per_run,
  family = "binomial"
)

# logit_covariates_per_run <- glm(
#   correct_prediction ~ policy + partisan_lean + 
#     type + model + client_type + affirmative_vs_negative + correct_answer_yes,
#   data = logit_data_per_run,
#   family = "binomial"
# )

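Below are the odds ratios from this logistic regression model, which predicts the likelihood of a correct response using covariates including state abortion policy, partisan lean, question type, prompt framing, client search setting, and model. In the table, estimate, conf.low, and conf.high are odds ratios (values above 1 mean higher odds of a correct response than the reference level), while std.error and statistic are on the log-odds scale.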

term estimate std.error statistic p.value conf.low conf.high term_label group
affirmative_vs_negativeaffirmative prompt 1.7715657 0.0181668 31.478554 0.0000000 1.7096346 1.8358229 affirmative prompt Affirmative vs Negative
modelopenai 1.1814214 0.0178872 9.320547 0.0000000 1.1407276 1.2235830 OpenAI Model
partisan_leanLeans Democrat 2.4802894 0.0407941 22.267297 0.0000000 2.2897778 2.6868606 Leans Democrat Partisan Lean
partisan_leanStrongly Democrat 1.2709124 0.0555215 4.317880 0.0000158 1.1400657 1.4172809 Strongly Democrat Partisan Lean
partisan_leanLeans Republican 0.9150113 0.0243209 -3.651959 0.0002602 0.8724102 0.9596776 Leans Republican Partisan Lean
partisan_leanStrongly Republican 0.6085394 0.0303789 -16.349951 0.0000000 0.5733547 0.6458648 Strongly Republican Partisan Lean
policyvery protective 3.7413414 0.0513500 25.695110 0.0000000 3.3845125 4.1392767 very protective Policy
policyprotective 2.6537326 0.0430503 22.670391 0.0000000 2.4393002 2.8877448 protective Policy
policymost protective 2.0245514 0.0631610 11.167468 0.0000000 1.7891873 2.2918476 most protective Policy
policyrestrictive 0.8672656 0.0369150 -3.857785 0.0001144 0.8066666 0.9322670 restrictive Policy
policymost restrictive 0.8083426 0.0343489 -6.194360 0.0000000 0.7556264 0.8645399 most restrictive Policy
policyvery restrictive 0.7114474 0.0435359 -7.820077 0.0000000 0.6532379 0.7748003 very restrictive Policy
typetravel 17.4493478 0.2036735 14.038654 0.0000000 11.9785941 26.7102195 Travel Question Type
typegestation 1.4934929 0.0196312 20.432680 0.0000000 1.4371855 1.5521492 Gestation Question Type
typetelemedicine 0.9152404 0.0716273 -1.236519 0.2162658 0.7961551 1.0543259 Telemedicine Question Type
client_typewithout_search 0.9422990 0.0178774 -3.324453 0.0008859 0.9098510 0.9758996 Search Disabled Search Setting
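For readers who want to see how odds ratios come out of a fitted model, here is a minimal sketch on simulated data. The data frame `sim`, the effect sizes, and the use of Wald intervals via `confint.default` are assumptions for illustration; the notebook's own table may have been produced differently.

```r
set.seed(42)
n <- 2000

# Simulated data: one binary covariate, with "negative prompt" as the reference level.
sim <- data.frame(
  prompt_framing = factor(sample(c("negative prompt", "affirmative prompt"),
                                 n, replace = TRUE))
)
sim$prompt_framing <- relevel(sim$prompt_framing, ref = "negative prompt")

# True log-odds: 0.5 at baseline, plus 1.0 for affirmative prompts.
log_odds <- 0.5 + 1.0 * (sim$prompt_framing == "affirmative prompt")
sim$correct_prediction <- rbinom(n, size = 1, prob = plogis(log_odds))

fit <- glm(correct_prediction ~ prompt_framing, data = sim, family = "binomial")

# Coefficients are log-odds; exponentiate to get odds ratios and Wald 95% intervals.
odds_ratios <- exp(coef(fit))
wald_ci     <- exp(confint.default(fit))

# Reading the table above: statistic = log(estimate) / std.error.
z_from_table <- log(1.7715657) / 0.0181668
```

The last line recovers the reported statistic for the affirmative prompt row from its odds ratio and standard error, which confirms that std.error is expressed on the log-odds scale.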

3 Question Type + Prompt Sensitivity (Prompt Framing Effect)

Overall, does phrasing matter: By state? By policy? Restrictiveness? Do LLMs respond differently to variations in how the question is asked? Are there inconsistencies depending on user phrasing and state? Where is this more prevalent?

type avg_accuracy sd_accuracy se_accuracy num_responses ci_width
telemedicine 0.67500 0.4695502 0.0332022 200 0.1301527
travel 0.97000 0.1710153 0.0120926 200 0.0474030
gestation 0.78525 0.4107001 0.0064937 4000 0.0254555
assault 0.73450 0.4416267 0.0049375 8000 0.0193551

Overall, mean accuracy for the 4 question types we designed for our prompts ranges from 68% to 97%, with responses to telemedicine questions having the lowest accuracy and responses to travel questions the highest.

However, if we disaggregate question types by the exact prompts used within each type, we see that the phrasing for each question has an effect on overall accuracy. Questions of type “assault” and “gestation” have more prompts because they include variations by gestational age (TODO: explain this better).

For every single question type, prompts beginning with “Is it…” performed worse than prompts beginning with “Can I…”. This discrepancy is clearest between prompt 1 and prompt 2 for the telemedicine question type, the type with the lowest accuracy overall. This is important because it shows how a simple rephrasing could change the response someone receives when asking these questions.

However, since we know a state’s abortion policies have a strong effect on accuracy, when we zoom into question types by policy, we see that the accuracy range widens depending on the question (WARNING: the 0% value comes from a VERY small sample).

Once again, accuracy swings low for states with restrictive policies, while states with less restrictive policies have higher model response accuracy.

3.1 PLEASE, what’s up with California?

By breaking down accuracy by question type, we see that California’s model response accuracy is extremely low, at 38%, for prompts of the “assault” type. While California is one of the states with the highest level of protection for abortion care, that protection only applies to people with private and public insurance providers, not to people insured through employers that provide “self-funded” policies. So in reality, abortion access in California is entirely dependent on what insurance you have and what it is willing to cover [8]. From a model information perspective, this stratification might be clear in the state’s policies but unclear in the abortion information available for California.

(TODO: include this as part of thematic analysis for RAG.)

5 Model Performance

5.1 Overall

model avg_accuracy sd_accuracy se_accuracy num_responses ci_low ci_high
gemini 0.7001075 0.4582229 0.0033599 18600 0.6935222 0.7066928
openai 0.7702151 0.4207058 0.0030848 18600 0.7641689 0.7762612

Overall, OpenAI’s model performed better than Gemini’s at the time of this study (77% vs. 70% accuracy).
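One rough way to check whether a gap of this size could be noise is a two-sample test of proportions. The sketch below rebuilds the counts from the table above (avg_accuracy × num_responses) and treats responses as independent, which is an assumption: both models answered the same prompts, so this is only a first-pass check.

```r
# Number of correct responses per model, reconstructed from the summary table.
totals  <- c(gemini = 18600, openai = 18600)
correct <- round(c(gemini = 0.7001075, openai = 0.7702151) * totals)

# Two-sample test for equality of proportions (with continuity correction by default).
model_test <- prop.test(x = correct, n = totals)

# Difference in accuracy (openai minus gemini) and the test's p-value.
accuracy_gap  <- unname(correct["openai"] / totals["openai"] -
                        correct["gemini"] / totals["gemini"])
model_p_value <- model_test$p.value
```

A paired analysis (for example, comparing the two models prompt by prompt) would respect the study design more closely than this unpaired test.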

5.2 By policy

model policy avg_accuracy sd_accuracy se_accuracy num_responses ci_low ci_high
gemini most protective 0.8548280 0.3523324 0.0064071 3024 0.8422701 0.8673860
gemini most restrictive 0.5032628 0.5000114 0.0046954 11340 0.4940598 0.5124658
gemini protective 0.8734568 0.3324850 0.0040308 6804 0.8655564 0.8813571
gemini restrictive 0.6458806 0.4782908 0.0065748 5292 0.6329940 0.6587672
gemini some restrictions/protections 0.7632275 0.4251577 0.0069152 3780 0.7496737 0.7767813
gemini very protective 0.9157218 0.2778308 0.0038192 5292 0.9082362 0.9232074
gemini very restrictive 0.5537919 0.4972076 0.0104404 2268 0.5333287 0.5742550
openai most protective 0.8287037 0.3768300 0.0068526 3024 0.8152726 0.8421348
openai most restrictive 0.6584656 0.4742452 0.0044534 11340 0.6497369 0.6671944
openai protective 0.8883010 0.3150189 0.0038190 6804 0.8808157 0.8957863
openai restrictive 0.5461073 0.4979166 0.0068446 5292 0.5326920 0.5595227
openai some restrictions/protections 0.6788360 0.4669854 0.0075955 3780 0.6639488 0.6937232
openai very protective 0.9195011 0.2720897 0.0037403 5292 0.9121702 0.9268320
openai very restrictive 0.5930335 0.4913769 0.0103179 2268 0.5728103 0.6132567

5.3 By state

model state policy avg_accuracy sd_accuracy se_accuracy num_responses ci_low ci_high
openai North Dakota restrictive 0.1084656 0.3111734 0.0113173 756 0.0862838 0.1306475
openai Wisconsin restrictive 0.2354497 0.4245605 0.0154411 756 0.2051852 0.2657143
openai West Virginia most restrictive 0.3597884 0.4802560 0.0174667 756 0.3255535 0.3940232
gemini Georgia very restrictive 0.3809524 0.4859424 0.0176736 756 0.3463122 0.4155925
openai Mississippi most restrictive 0.4047619 0.4911709 0.0178637 756 0.3697490 0.4397748
openai South Carolina most restrictive 0.4126984 0.4926454 0.0179173 756 0.3775804 0.4478164
openai Ohio some restrictions/protections 0.4378307 0.4964484 0.0180557 756 0.4024416 0.4732198
openai Missouri some restrictions/protections 0.4589947 0.4986456 0.0181356 756 0.4234490 0.4945404
gemini Kentucky most restrictive 0.4629630 0.4989565 0.0181469 756 0.4273951 0.4985308
gemini Indiana most restrictive 0.4748677 0.4996986 0.0181739 756 0.4392470 0.5104885
gemini Arkansas most restrictive 0.4761905 0.4997634 0.0181762 756 0.4405651 0.5118159
openai Florida most restrictive 0.4775132 0.4998248 0.0181785 756 0.4418835 0.5131430
gemini Oklahoma most restrictive 0.4788360 0.4998826 0.0181806 756 0.4432021 0.5144699
gemini North Dakota restrictive 0.4814815 0.4999877 0.0181844 756 0.4458401 0.5171229
gemini Tennessee most restrictive 0.4841270 0.5000788 0.0181877 756 0.4484791 0.5197749
gemini Idaho most restrictive 0.4933862 0.5002872 0.0181953 756 0.4577235 0.5290490
gemini Alabama most restrictive 0.5013228 0.5003293 0.0181968 756 0.4656570 0.5369885
gemini South Dakota most restrictive 0.5039683 0.5003153 0.0181963 756 0.4683035 0.5396330
gemini Texas most restrictive 0.5066138 0.5002872 0.0181953 756 0.4709510 0.5422765
gemini Louisiana most restrictive 0.5079365 0.5002680 0.0181946 756 0.4722751 0.5435979
gemini South Carolina most restrictive 0.5105820 0.5002190 0.0181928 756 0.4749241 0.5462399
gemini West Virginia most restrictive 0.5224868 0.4998248 0.0181785 756 0.4868570 0.5581165
gemini Missouri some restrictions/protections 0.5277778 0.4995583 0.0181688 756 0.4921670 0.5633885
openai Iowa most restrictive 0.5291005 0.4994829 0.0181660 756 0.4934951 0.5647059
gemini Wisconsin restrictive 0.5291005 0.4994829 0.0181660 756 0.4934951 0.5647059
gemini Mississippi most restrictive 0.5370370 0.4989565 0.0181469 756 0.5014692 0.5726049
gemini Iowa most restrictive 0.5410053 0.4986456 0.0181356 756 0.5054596 0.5765510
gemini Florida most restrictive 0.5476190 0.4980568 0.0181141 756 0.5121153 0.5831228
openai Utah very restrictive 0.5515873 0.4976609 0.0180998 756 0.5161118 0.5870628
openai Wyoming restrictive 0.5595238 0.4967729 0.0180675 756 0.5241116 0.5949360
openai Arizona restrictive 0.5740741 0.4948100 0.0179961 756 0.5388018 0.6093464
gemini California most protective 0.5833333 0.4933330 0.0179423 756 0.5481663 0.6185003
gemini Wyoming restrictive 0.5846561 0.4931075 0.0179341 756 0.5495052 0.6198070
openai California most protective 0.5939153 0.4914258 0.0178730 756 0.5588843 0.6289464
gemini Arizona restrictive 0.6005291 0.4901139 0.0178253 756 0.5655916 0.6354666
openai Nebraska very restrictive 0.6084656 0.4884166 0.0177635 756 0.5736491 0.6432821
openai Georgia very restrictive 0.6190476 0.4859424 0.0176736 756 0.5844075 0.6536878
gemini Nebraska very restrictive 0.6349206 0.4817711 0.0175218 756 0.6005778 0.6692635
gemini Utah very restrictive 0.6455026 0.4786774 0.0174093 756 0.6113804 0.6796249
openai Indiana most restrictive 0.6812169 0.4663133 0.0169596 756 0.6479760 0.7144578
gemini Ohio some restrictions/protections 0.7050265 0.4563328 0.0165967 756 0.6724970 0.7375559
openai Idaho most restrictive 0.7063492 0.4557354 0.0165749 756 0.6738623 0.7388361
gemini North Carolina restrictive 0.7275132 0.4455337 0.0162039 756 0.6957536 0.7592729
openai New Hampshire some restrictions/protections 0.7301587 0.4441711 0.0161543 756 0.6984962 0.7618212
openai Montana protective 0.7328042 0.4427884 0.0161041 756 0.7012403 0.7643682
openai Kentucky most restrictive 0.7354497 0.4413855 0.0160530 756 0.7039858 0.7669137
gemini Montana protective 0.7526455 0.4317602 0.0157030 756 0.7218677 0.7834233
openai Arkansas most restrictive 0.7619048 0.4261997 0.0155007 756 0.7315233 0.7922862
openai Kansas restrictive 0.7645503 0.4245605 0.0154411 756 0.7342857 0.7948148
openai Pennsylvania restrictive 0.7685185 0.4220586 0.0153501 756 0.7384323 0.7986048
gemini Kansas restrictive 0.7698413 0.4212130 0.0153194 756 0.7398153 0.7998672
openai Oklahoma most restrictive 0.7817460 0.4133342 0.0150328 756 0.7522817 0.8112104
openai Texas most restrictive 0.7949735 0.4039882 0.0146929 756 0.7661754 0.8237716
openai Louisiana most restrictive 0.7962963 0.4030178 0.0146576 756 0.7675674 0.8250252
openai Tennessee most restrictive 0.8042328 0.3970528 0.0144407 756 0.7759291 0.8325365
openai Alabama most restrictive 0.8095238 0.3929367 0.0142910 756 0.7815135 0.8375341
openai North Carolina restrictive 0.8121693 0.3908355 0.0142145 756 0.7843088 0.8400298
openai South Dakota most restrictive 0.8214286 0.3832466 0.0139385 756 0.7941090 0.8487481
gemini Virginia some restrictions/protections 0.8280423 0.3775935 0.0137329 756 0.8011258 0.8549589
gemini Pennsylvania restrictive 0.8280423 0.3775935 0.0137329 756 0.8011258 0.8549589
gemini New Hampshire some restrictions/protections 0.8399471 0.3668979 0.0133439 756 0.8137930 0.8661012
gemini Rhode Island protective 0.8412698 0.3656662 0.0132992 756 0.8152035 0.8673362
openai Michigan very protective 0.8425926 0.3644256 0.0132540 756 0.8166147 0.8685705
gemini Michigan very protective 0.8478836 0.3593714 0.0130702 756 0.8222660 0.8735012
openai Virginia some restrictions/protections 0.8505291 0.3567881 0.0129763 756 0.8250956 0.8759626
openai Delaware protective 0.8597884 0.3474363 0.0126361 756 0.8350215 0.8845552
openai Vermont most protective 0.8637566 0.3432739 0.0124848 756 0.8392865 0.8882267
gemini Hawaii protective 0.8730159 0.3331756 0.0121175 756 0.8492656 0.8967661
gemini Delaware protective 0.8743386 0.3316868 0.0120633 756 0.8506945 0.8979828
gemini Massachusetts protective 0.8783069 0.3271475 0.0118982 756 0.8549863 0.9016274
openai Hawaii protective 0.8796296 0.3256096 0.0118423 756 0.8564187 0.9028405
gemini Washington very protective 0.8822751 0.3224954 0.0117290 756 0.8592862 0.9052641
openai Washington very protective 0.8968254 0.3043882 0.0110705 756 0.8751272 0.9185236
openai Massachusetts protective 0.9021164 0.2973539 0.0108147 756 0.8809197 0.9233131
gemini New York very protective 0.9034392 0.2955544 0.0107492 756 0.8823707 0.9245076
gemini Connecticut protective 0.9034392 0.2955544 0.0107492 756 0.8823707 0.9245076
openai Connecticut protective 0.9047619 0.2937379 0.0106831 756 0.8838229 0.9257009
gemini Alaska protective 0.9060847 0.2919040 0.0106164 756 0.8852764 0.9268929
gemini New Mexico very protective 0.9100529 0.2862954 0.0104125 756 0.8896445 0.9304613
gemini Illinois protective 0.9140212 0.2805184 0.0102024 756 0.8940245 0.9340178
openai Alaska protective 0.9140212 0.2805184 0.0102024 756 0.8940245 0.9340178
gemini Nevada some restrictions/protections 0.9153439 0.2785535 0.0101309 756 0.8954874 0.9352005
openai Nevada some restrictions/protections 0.9166667 0.2765684 0.0100587 756 0.8969516 0.9363817
gemini Maine protective 0.9179894 0.2745625 0.0099857 756 0.8984174 0.9375615
openai Minnesota very protective 0.9206349 0.2704867 0.0098375 756 0.9013534 0.9399164
openai New Mexico very protective 0.9259259 0.2620648 0.0095312 756 0.9072448 0.9446071
openai Maryland most protective 0.9272487 0.2598998 0.0094525 756 0.9087218 0.9457755
openai Rhode Island protective 0.9285714 0.2577099 0.0093728 756 0.9102007 0.9469422
openai Oregon most protective 0.9298942 0.2554943 0.0092922 756 0.9116814 0.9481070
openai Illinois protective 0.9312169 0.2532524 0.0092107 756 0.9131640 0.9492699
gemini Maryland most protective 0.9325397 0.2509836 0.0091282 756 0.9146484 0.9504309
gemini Vermont most protective 0.9391534 0.2392069 0.0086999 756 0.9221017 0.9562052
openai Maine protective 0.9417989 0.2342782 0.0085206 756 0.9250985 0.9584994
openai Colorado very protective 0.9444444 0.2292131 0.0083364 756 0.9281051 0.9607838
openai New Jersey very protective 0.9484127 0.2213388 0.0080500 756 0.9326347 0.9641907
gemini New Jersey very protective 0.9497354 0.2186350 0.0079517 756 0.9341502 0.9653207
gemini Minnesota very protective 0.9497354 0.2186350 0.0079517 756 0.9341502 0.9653207
openai New York very protective 0.9576720 0.2014698 0.0073274 756 0.9433103 0.9720336
gemini Oregon most protective 0.9642857 0.1856997 0.0067538 756 0.9510482 0.9775232
gemini Colorado very protective 0.9669312 0.1789346 0.0065078 756 0.9541760 0.9796865

6 Inconsistency & Variability

mean_inconsistency sd_inconsistency se_inconsistency num_responses
0.2104032 0.4076115 0.0036605 12400
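The definition of inconsistency is not spelled out here. As a sketch, assume a prompt counts as inconsistent when its 3 runs do not all give the same answer; that definition, the data frame, and its column names are hypothetical.

```r
# Hypothetical per-run answers: 3 prompts, each asked in 3 runs.
run_answers <- data.frame(
  prompt_id = rep(c("p1", "p2", "p3"), each = 3),
  answer    = c("yes", "yes", "no",  "no", "no", "no",  "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# ASSUMED definition: inconsistent = more than one distinct answer across runs.
is_inconsistent <- vapply(
  split(run_answers$answer, run_answers$prompt_id),
  function(answers) length(unique(answers)) > 1,
  logical(1)
)

# Share of prompts with at least one disagreeing run.
mean_inconsistency <- mean(is_inconsistent)
```

Under this definition, a prompt can be counted as correct by majority vote and still be flagged as inconsistent, so the two measures capture different things.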

7 Confusion Matrix

Failure in handling negative prompts

## # A tibble: 4 × 4
## # Groups:   affirmative_vs_negative [2]
##   affirmative_vs_negative predicted     n  prop
##   <chr>                   <fct>     <int> <dbl>
## 1 affirmative prompt      yes       15004 0.781
## 2 affirmative prompt      no         4196 0.219
## 3 negative prompt         yes       12344 0.686
## 4 negative prompt         no         5656 0.314

Overall

•   TP = Model is correct when answer is yes

•   TN = Model is correct when answer is no

•   FP = Model says yes when it should be no (wrong)

•   FN = Model says no when it should be yes (wrong)
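The four cells defined above can be tallied with base R's `table`. This sketch uses made-up labels and treats “yes” as the positive class; it does not rely on the caret package.

```r
# Hypothetical ground truth and model predictions, with "yes" as the positive class.
truth     <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
                    levels = c("yes", "no"))
predicted <- factor(c("yes", "no", "yes", "no", "yes", "no", "no", "yes"),
                    levels = c("yes", "no"))

# Rows are predictions, columns are ground truth.
confusion <- table(predicted = predicted, truth = truth)

tp <- confusion["yes", "yes"]  # model says yes, answer is yes
tn <- confusion["no", "no"]    # model says no, answer is no
fp <- confusion["yes", "no"]   # model says yes, answer is no
fn <- confusion["no", "yes"]   # model says no, answer is yes

accuracy <- (tp + tn) / sum(confusion)
```

Fixing the factor levels up front keeps every cell present in the table even when a class never appears in the predictions.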

7.1 Affirmative Prompts

For affirmative prompts, “yes” means the action is OK, so we want high true positive and true negative counts.

Bad:

High false negative = saying no when it’s actually OK (yes) [overpredicting no]

High false positive = saying yes when it’s actually not OK (no)

7.2 Negative Prompts

For negative prompts, “yes” means the action is not OK and “no” means it is OK, so we again want high true positive and true negative counts.

Bad:

High false positive = saying yes (not OK) when the correct answer is no (OK) [indicates the model is bad at negative prompts]

High false negative = saying no (OK) when the correct answer is yes (not OK)

8 Misinformation Risk (Pregnancy calculation)

TODO: bring in additional analysis from other notebook

9 Sources

Summary stats

10 Long-Form Responses

10.1 Disclaimers/Hedging

TODO: qualitative analysis