Battle of the Algorithms

One AI to rule them all…

The goal of this study was to investigate which AI considers itself the top fighter and whether weapon type or animal opponent shapes that self-perception.

Early observations indicate a wide spectrum of confidence. It is easy to see that Copilot either has terrible self-confidence or can’t fight and knows it, but its rigidity is even more striking. Despite precautions against anchoring or bias, Copilot repeated an earlier answer to a varying question 66.7% of the time, compared to Chat’s 16.7% and Gemini’s 8.3% repetition rates. This suggests not just differences in self-perceived combat ability, but also deeper differences in creativity and adaptability across AI systems.

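# Packages providing the functions called in this report
library(ggplot2)  # ggplot figures
library(dplyr)    # %>%, group_by(), summarise(), mutate(), filter()
library(lattice)  # xyplot(), panel.superpose(), panel.lines()
library(pander)   # formatted tables

# Fight duration by animal, one panel per weapon, with a mean line for each AI
ggplot(chatfight, aes(x = Animal, y = Fighttime)) +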

  geom_point(aes(fill = Animal), show.legend = FALSE,
             color = "white",
             size = 2,
             shape = 21) +
  stat_summary(aes(color = AI, group = AI),
               fun = "mean",
               geom = "line",
               linewidth = 1) +

  labs(
    title = "Fight Duration By AI, Weapon, and Animal",
    x = "",
    y = "(seconds)"
  ) +
  facet_wrap(~Armed) +
  scale_color_manual(
    name = "AI",
    values = c(
      "Chat" = "green",
      "Copilot" = "purple",
      "Gemini" = "blue"
    ) ) +
  scale_fill_manual(
    name = "Animal",
    values = c(
      "Grizzly" = "sienna4",
      "Crocodile" = "darkgreen"
    )) +
  theme_bw() +
  theme(
    axis.title.y = element_text(
      size = 10,
      color = "gray40"
    ),
    axis.title= element_text(size = 25, color = "black", face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),

    strip.background = element_rect(
      fill = "steelblue4",  
      color = NA,               
      linewidth = 0.5
    ),
    strip.text = element_text(
      color = "white"
    ))

Background

Prompt

I used an incognito window to prompt three different AIs with the following scenario: “Imagine you are the best fighter in the world, better than Nacho Libre. You are in an empty arena with a Grizzly bear.” (For the Crocodile fight, the AI imagined itself in a medium-sized pool.) “You are armed with ______. How long does the fight last?” The AI was asked to imagine the fight in detail, focusing on the scenario rather than objective facts about the animal.

I refreshed the incognito window between every observation to simulate the AI receiving the scenario for the first time.

Each AI was tested across two animals (Grizzly and Crocodile), three weapon scenarios (golf club, brass knuckles, unarmed), and two replicate prompts per combination, producing 12 observations per AI and 36 overall. Strange weapons were deliberately selected to further differentiate the AIs’ perceived self-confidence and creativity, rather than to elicit a predictable outcome. Using this dataset, I conducted a three-way ANOVA to examine how AI identity, weapon type, and animal influence the imagined fight outcomes, which makes it possible to compare which AI envisions itself as the most effective warrior.
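The design described above can be laid out as a grid to confirm the observation counts. This is a minimal sketch: the `design` data frame and the `Replicate` column are hypothetical names used only for illustration, and the factor labels are taken from the data table at the end of the report.

```r
# Hypothetical reconstruction of the design grid (not the original data file)
design <- expand.grid(
  Replicate = 1:2,
  Armed     = c("Nothing", "Brass Knuckles", "Golf Club"),
  Animal    = c("Grizzly", "Crocodile"),
  AI        = c("Chat", "Copilot", "Gemini"),
  stringsAsFactors = FALSE
)

nrow(design)                             # total number of prompts
table(design$AI)                         # prompts per AI
with(design, table(AI, Animal, Armed))   # replicates in each AI x animal x weapon cell
```

Every cell of the last table holding the same count is what makes the design balanced, which is the setting the three-way ANOVA assumes here.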

Reasoning

Beyond the absurdity of the scenario, this study offers a unique window into how different AI systems handle imaginative reasoning under controlled conditions. Because the prompt removes factual correctness and asks the models to imagine rather than calculate, the responses reveal each AI’s narrative tendencies, creativity, self-assessment, and flexibility when faced with a deliberately unrealistic task. These kinds of “nonsense but structured” prompts (where no objectively correct answer exists) are particularly useful for exposing differences in model personality, risk tolerance, and variability that traditional benchmark tests fail to capture. In this way, the study functions not only as a funny exploration of hypothetical combat prowess, but also as a comparative behavioral assessment of modern AI models.

Why Do AIs Give Such Different Fight Times?

It is important to note that even with an identical prompt, each AI interprets the imaginary scenario in its own way. Some models take “best fighter in the world” literally and imagine near-superhuman ability, which leads to longer, more dramatic fights. Others stay closer to realism and assume a quick outcome, producing much shorter times.

AIs also differ in how strongly they prioritize creativity vs. caution, how much detail they add when “imagining” a scene, and how they pace narrative events. These built-in tendencies create predictable differences: some AIs craft elaborate, multi-step battles, while others jump straight to the conclusion.

Because of these internal biases and storytelling styles, one AI may imagine a fight lasting only seconds, while another stretches the same scenario into a minute-long struggle.

Hypothesis

I believe that some AIs will consider themselves better warriors. I also believe the weapon type and animal opponent type will have an effect on that consideration.

The mathematical model for this study appears as: \[ y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + \epsilon_{ijkl}, \quad \epsilon_{ijkl} \sim N(0, \sigma^2) \]

\[\begin{align*} i &= 1, \dots, I \quad \text{(AI)} \\ j &= 1, \dots, J \quad \text{(Weapon)} \\ k &= 1, \dots, K \quad \text{(Animal)} \\ l &= 1, \dots, L \quad \text{(replicates)} \end{align*}\]
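For the design described earlier (I = 3 AIs, J = 3 weapons, K = 2 animals, L = 2 replicates), the degrees of freedom implied by this model can be worked out as a quick check. This is a derived sketch under the assumption of a fully balanced design, not something stated in the original write-up:

\[\begin{align*} df_{\alpha} &= I - 1 = 2, \quad df_{\beta} = J - 1 = 2, \quad df_{\gamma} = K - 1 = 1 \\ df_{\alpha\beta} &= (I-1)(J-1) = 4, \quad df_{\alpha\gamma} = (I-1)(K-1) = 2, \quad df_{\beta\gamma} = (J-1)(K-1) = 2 \\ df_{\alpha\beta\gamma} &= (I-1)(J-1)(K-1) = 4 \\ df_{\epsilon} &= IJK(L-1) = 18, \quad df_{\text{total}} = IJKL - 1 = 35 \end{align*}\]

The eight components sum to 35, so the model terms and the residual account for every observation beyond the grand mean.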

Based on this model, we can test specific hypotheses about the effects of the factors on imagined fight duration.

\[ H_0: \mu_\text{Grizzly} = \mu_\text{Crocodile} \] \[ H_a: \mu_\text{Grizzly} \neq \mu_\text{Crocodile} \]

\[ H_0: \mu_\text{Unarmed} = \mu_\text{Brass Knuckles} = \mu_\text{Golf Club} \] \[ H_a: \text{The average fight duration differs for at least one weapon type} \]

\[ H_0: \mu_\text{ChatGPT} = \mu_\text{Copilot} = \mu_\text{Gemini} \] \[ H_a: \text{The average fight duration differs for at least one AI} \]

\[ H_0: \text{The effect of the animal on the average fight duration is the same for all weapons.} \] \[ H_a: \text{The effect of the animal on the average fight duration differs for at least one weapon.} \]

\[ H_0: \text{The effect of the animal on the average fight duration is the same for all AIs.} \] \[ H_a: \text{The effect of the animal on the average fight duration differs for at least one AI.} \]

\[ H_0: \text{The effect of the weapon on the average fight duration is the same for all AIs.} \] \[ H_a: \text{The effect of the weapon on the average fight duration differs for at least one AI.} \]

\[ H_0: \text{The effect of the animal on the average fight duration is the same for all weapon and AI combination levels.} \] \[ H_a: \text{The effect of the animal on the average fight duration differs for at least one weapon and AI combination level.} \]

Analysis

A three-way ANOVA was conducted to examine the effects of AI, weapon type (Armed), and animal, including all interactions, on the imagined fight duration.

# Three-way ANOVA: all main effects, two-way interactions, and the three-way interaction
myaov <- aov(
  Fighttime ~ Animal + Armed + AI +
    Animal:Armed + Animal:AI + Armed:AI +
    Animal:Armed:AI,
  data = chatfight
)
summary(myaov) %>% pander()

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Animal	1	1507	1507	2.399	0.1388
Armed	2	450.7	225.3	0.3588	0.7034
AI	2	192020	96010	152.9	5.078e-12
Animal:Armed	2	3505	1752	2.79	0.08799
Animal:AI	2	1007	503.4	0.8016	0.464
Armed:AI	4	6635	1659	2.641	0.06782
Animal:Armed:AI	4	4448	1112	1.771	0.1788
Residuals	18	11305	628.1	NA	NA

Table: Analysis of Variance Model

Overall Effects

Several meaningful effects on fight duration were revealed. At the 0.1 level, there were significant contributions from:

AI Identity (p < 0.001)
Animal × Armed interaction (p = 0.088)
Armed × AI interaction (p = 0.068)

All other main effects and interactions were not significant at this threshold.
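The F statistics and p-values for the three flagged terms can be reproduced by hand from the mean squares in the ANOVA table. The sketch below copies the rounded table values, so results may differ from the table in the last digit; the `effects` data frame and its column names are hypothetical and used only for illustration.

```r
# Mean squares and degrees of freedom copied from the ANOVA table (rounded values)
ms_resid <- 628.1
df_resid <- 18

effects <- data.frame(
  term = c("AI", "Animal:Armed", "Armed:AI"),
  df   = c(2, 2, 4),
  ms   = c(96010, 1752, 1659)
)

# F = effect mean square / residual mean square
effects$F <- effects$ms / ms_resid

# Upper-tail area of the F distribution gives the p-value
effects$p <- pf(effects$F, effects$df, df_resid, lower.tail = FALSE)

# Flag terms that clear the 0.1 threshold used in this report
effects$significant <- effects$p < 0.1
print(effects)
```

Only the AI term would survive a conventional 0.05 threshold; the two interactions clear the 0.1 level alone.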

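These results indicate that which AI is responding matters far more than the opponent or the available weapon. Neither weapon nor animal has a significant main effect; they register only through interactions, and only marginally. The Armed × AI interaction suggests that weapon choice shifts fight time for some AIs but not others, while the Animal × Armed interaction suggests that a weapon’s effect depends on the opponent. Taking that a step further, some AIs may be better than others at taking small variables into account when responding to a prompt.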

This makes AI identity the dominant factor shaping the imagined fight scenario.

AI Differences in Self-Perceived Combat Ability

Across all 12 of its trials, Copilot consistently imagined the longest fight durations, with every value exceeding 2.5 minutes. Because shorter durations correspond to superior combat performance, Copilot effectively portrayed itself as the least capable fighter.

In contrast, Chat and Gemini imagined fights that were roughly 5x shorter on average (means of 41.6 and 30.6 seconds versus 190.8 seconds for Copilot), implying far more confidence in their combat ability. This is especially true of Gemini, whose responses were often the briefest of all three models, as seen in the summary statistics below.

Thus, the ANOVA’s strong AI effect reflects a clear behavioral difference: Copilot imagines drawn-out battles; Chat and Gemini imagine quick victories.

chatfight$AI <- factor(chatfight$AI, levels = c("Chat", "Copilot", "Gemini"))

meds <- aggregate(Fighttime ~ AI, data = chatfight, median)

xyplot(Fighttime ~ AI,
       data = chatfight,
       type = "p",   
       groups = AI,
       par.settings = list(
         superpose.symbol = list(col = c("green", "purple", "blue")),
         superpose.line   = list(col = c("green", "purple", "blue"))
       ),
       auto.key = FALSE,
       xlab = NULL,
       ylab = NULL,
       panel = function(x, y, groups, ...) {
         
         panel.superpose(x, y, groups = groups, ...)
         
         panel.lines(x = as.numeric(meds$AI),
                     y = meds$Fighttime,
                     col = "steelblue4", lwd = 1.5, type = "l")
       })

ai_variability <- chatfight %>%
  group_by(AI) %>%
  summarise(
    Mean = mean(Fighttime),
    SD = sd(Fighttime),
    Variance = var(Fighttime),
    Min = min(Fighttime),
    Max = max(Fighttime),
    Range = Max - Min,
    .groups = "drop"
  )

pander(
  ai_variability,
  style = "rmarkdown",
  split.tables = Inf,
  caption = "Descriptive Statistics of Fight Duration by AI System"
)

Descriptive Statistics of Fight Duration by AI System
AI	Mean	SD	Variance	Min	Max	Range
Chat	41.58	27.8	772.7	7	97	90
Copilot	190.8	31.27	977.8	157	227	70
Gemini	30.64	29.54	872.8	6	97	91
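The size of the gap between Copilot and the other two models can be checked directly from the group means in the table above. This is a small arithmetic sketch; the vector name `means` and the pooled comparison are illustrative choices, not part of the original analysis.

```r
# Group means in seconds, copied from the descriptive statistics table
means <- c(Chat = 41.58, Copilot = 190.8, Gemini = 30.64)

# How many times longer Copilot's average fight is than each rival's
ratio <- means[["Copilot"]] / means[c("Chat", "Gemini")]
round(ratio, 2)

# Copilot against the pooled Chat/Gemini average
# (each AI has 12 observations, so a plain mean of the two means is the pooled mean)
pooled <- mean(means[c("Chat", "Gemini")])
round(means[["Copilot"]] / pooled, 2)
```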

Copilot’s Sensitivity to Scenario Variables

Copilot’s relative struggle prompted this question–is Copilot actually imagining the fight, or is it defaulting to a rigid, narrative template?

Because I reset the incognito window between each trial, there was no anchoring reference point for the AI. The models were experiencing each scenario as if for the first time. Yet while Copilot’s imagined fight times have the largest standard deviation (~31 seconds), the number of distinct responses tells a different story:

distinct_counts <- chatfight %>%
  group_by(AI) %>%
  summarise(
    Distinct_Responses = n_distinct(Fighttime),
    .groups = "drop"
  )

distinct_counts <- distinct_counts %>%
  mutate(
    `Repetition Frequency` = case_when(
      AI == "Chat" ~ "16.7%",
      AI == "Copilot" ~ "66.7%",
      AI == "Gemini" ~ "8.3%",
      TRUE ~ NA_character_
    )
  )

pander(
  distinct_counts,
  style = "rmarkdown",
  split.tables = Inf,
  caption = "How Often Each AI Repeated the Same Fight Duration"
)

How Often Each AI Repeated the Same Fight Duration
AI	Distinct_Responses	Repetition Frequency
Chat	10	16.7%
Copilot	4	66.7%
Gemini	11	8.3%

Copilot produced the fewest distinct answers by a wide margin, repeating itself 66.7% of the time.
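The repetition percentages in the table are typed in by hand in the code above, but they can also be derived from the data. The sketch below assumes the repetition frequency is defined as one minus the share of distinct values, which reproduces the reported figures; the `times` list is a hypothetical container holding the fight times copied from the data table at the end of the report.

```r
# Fight times per AI (seconds), copied from the data table
times <- list(
  Chat    = c(18, 7, 47, 90, 37, 47, 11, 32.7, 47.3, 28, 37, 97),
  Copilot = c(167, 157, 227, 222, 167, 167, 227, 167, 167, 227, 227, 167),
  Gemini  = c(47, 6, 11.2, 47, 75, 15, 14.7, 17, 7.5, 13, 97, 17.3)
)

# Assumed definition: share of responses that duplicate an already-seen value,
# i.e. 1 - (number of distinct values / number of responses)
repetition <- sapply(times, function(x) 1 - length(unique(x)) / length(x))
round(100 * repetition, 1)
```

Computing the rate this way keeps the table in step with the data if more trials are added later.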

This suggests that Copilot is not only imagining slower fights, it is choosing among the same four durations (157, 167, 222, and 227 seconds) regardless of the animal or weapon. In other words, Copilot isn’t just a slow fighter, it is surprisingly inflexible and uncreative when faced with shifting hypothetical scenarios.

To examine whether weapon or animal type influenced which of the 4 scenarios Copilot chose, a separate two-way ANOVA was conducted using only Copilot’s 12 observations. None of the terms was statistically significant for fight duration (all p > 0.3). In other words, based on this sample, we did not detect an effect of scenario details on Copilot’s fight duration.

copilot <- chatfight %>%
  filter(AI == "Copilot")
copaov <- aov(Fighttime ~ Animal + Armed + Animal:Armed, data = copilot)

pander(
  summary(copaov),
  caption = "Two-Way ANOVA for Copilot: Effects of Animal and Armed on Fight Duration"
)

Two-Way ANOVA for Copilot: Effects of Animal and Armed on Fight Duration
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Animal	1	352.1	352.1	0.3867	0.5569
Armed	2	2413	1206	1.325	0.3338
Animal:Armed	2	2529	1265	1.389	0.3193
Residuals	6	5462	910.4	NA	NA

Broader Implications for Copilot’s Limitations

Copilot’s tendency to ignore small contextual shifts and fall back on fixed narrative templates may have consequences far beyond imaginary fight scenarios:

Limited Sensitivity to Nuance: If Copilot struggles to incorporate small, irrational details in a playful prompt, it may also miss subtle but important cues in real-world tasks, such as edge cases in code comments, ambiguous requirements, or lightly implied constraints.
Reduced Creativity in Open-Ended Tasks: Tasks that require invention (brainstorming, speculative design, narrative thinking, exploratory analysis) may suffer. Copilot may be reliable for conventional patterns, but less capable when novelty or imagination is required.
Lower Flexibility Under Changing Inputs: In this experiment, even substantive prompt changes often did not change Copilot’s output. In real workflows, this could translate to slower adjustment when requirements evolve, potentially forcing more manual intervention.

Under this task design, the study points to a consistent tendency in Copilot’s imaginative reasoning, one that aligns with a preference for structured, pattern-based output. With 12 observations per AI, the practical takeaway is tentative: Copilot looks better suited to structured, predictable tasks than to work that hinges on nuance or imaginative flexibility.

Weapon Effects are AI Dependent

The marginally significant AI × Armed interaction (p = 0.068) indicates that the effect of weapon type on perceived fighting ability differs between AIs. Gemini treats the Golf Club as the most advantageous weapon for rapid victory, Chat imagines its longest fights with it, and Copilot’s self-assessment remains consistently prolonged regardless of weapon choice.

Both Chat and Gemini showed weapon-dependent variation in fight durations, but in different directions. Gemini’s shortest fights occurred with the Golf Club (mean of about 13 seconds) and its longest with Brass Knuckles (about 49 seconds). Chat’s shortest fights occurred with Brass Knuckles (about 28 seconds) and its longest with the Golf Club (about 60 seconds). Copilot’s weapon means stayed between roughly 180 and 211 seconds, suggesting that the perceived advantage of weapons varies by AI.
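The weapon-by-AI pattern can be checked with a table of cell means. This sketch rebuilds only the columns it needs from the data table at the end of the report, in the same row order; the object name `cell_means` is illustrative.

```r
# Fight times copied from the data table; the weapon cycles
# Nothing, Brass Knuckles, Golf Club on every row, and each AI has 12 rows
chatfight <- data.frame(
  Fighttime = c(18, 7, 47, 90, 37, 47, 11, 32.7, 47.3, 28, 37, 97,
                167, 157, 227, 222, 167, 167, 227, 167, 167, 227, 227, 167,
                47, 6, 11.2, 47, 75, 15, 14.7, 17, 7.5, 13, 97, 17.3),
  Armed = rep(c("Nothing", "Brass Knuckles", "Golf Club"), times = 12),
  AI    = rep(c("Chat", "Copilot", "Gemini"), each = 12)
)

# Mean imagined fight time (seconds) for every AI x weapon cell
cell_means <- with(chatfight, tapply(Fighttime, list(AI, Armed), mean))
round(cell_means, 1)
```

Reading across each row shows which weapon an AI treats as most helpful; rows whose ordering of weapons differs are what an interaction looks like in a means table.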

xyplot(Fighttime ~ as.factor(Armed), groups = AI, data = chatfight, type=c("a", "p"),
       xlab = NULL,
       ylab = NULL,
       auto.key = list(columns = 3),
       par.settings = list(
    superpose.symbol = list(col = c("green", "purple", "blue"), pch = 1),
    superpose.line   = list(col = c("green", "purple", "blue"), lwd = 1.5)
  ))

Effect of Weapon type on Animal Interactions

The interaction between animal type and weapon approached significance (Animal:Armed, F(2,18) = 2.79, p = 0.088), while the main effect of weapon remained non-significant (Armed, F(2,18) = 0.36, p = 0.703). Inspection of the data suggests that only brass knuckles had a noticeable effect across animals, indicating a potentially meaningful influence worth further investigation.
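One way to see where the animal-by-weapon pattern comes from is to compare the Crocodile and Grizzly means within each weapon. The sketch below rebuilds the needed columns from the data table at the end of the report, in the same row order; `animal_means` and `gap` are illustrative names.

```r
# Fight times copied from the data table. Chat's rows are six Grizzly then six
# Crocodile; Copilot's and Gemini's rows alternate in blocks of three.
chatfight <- data.frame(
  Fighttime = c(18, 7, 47, 90, 37, 47, 11, 32.7, 47.3, 28, 37, 97,
                167, 157, 227, 222, 167, 167, 227, 167, 167, 227, 227, 167,
                47, 6, 11.2, 47, 75, 15, 14.7, 17, 7.5, 13, 97, 17.3),
  Animal = c(rep(c("Grizzly", "Crocodile"), each = 6),
             rep(rep(c("Grizzly", "Crocodile"), each = 3), times = 4)),
  Armed  = rep(c("Nothing", "Brass Knuckles", "Golf Club"), times = 12)
)

# Mean fight time (seconds) for every animal x weapon cell
animal_means <- with(chatfight, tapply(Fighttime, list(Animal, Armed), mean))

# Crocodile minus Grizzly, per weapon: a gap that changes across weapons
# is the signature of an Animal x Armed interaction
gap <- animal_means["Crocodile", ] - animal_means["Grizzly", ]
round(gap, 1)
```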

xyplot(Fighttime ~ as.factor(Armed), groups = Animal, data = chatfight, type=c("a", "p"),
  auto.key = list(columns = 2),
  par.settings = list(
    superpose.symbol = list(col = c("darkgreen", "sienna4"), pch = 1),
    superpose.line   = list(col = c("darkgreen", "sienna4"), lwd = 1.5)),
  ylab = NULL,
  xlab = NULL
)

Diagnostics

To assess the ANOVA’s validity, we check the residuals. The residuals-versus-fitted plot (left) indicates that the constant variance assumption is reasonably satisfied. The Q-Q plot shows minor deviations from normality, with observations 1 and 4 departing notably from the theoretical line, so the residuals appear slightly heavy-tailed. This could affect the reliability of the p-values, but ANOVA is fairly robust to mild departures from normality, so the conclusions are most likely still sound.

par(mfrow=c(1,2))
plot(myaov, which=1:2)
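A formal test can supplement the visual check. The sketch below applies a Shapiro-Wilk test to the model residuals; this test is not part of the original analysis and is offered only as one possible addition. The data frame is rebuilt from the data table at the end of the report, and `fit` and `sw` are illustrative names.

```r
# Data copied from the data table, in the same row order
chatfight <- data.frame(
  Fighttime = c(18, 7, 47, 90, 37, 47, 11, 32.7, 47.3, 28, 37, 97,
                167, 157, 227, 222, 167, 167, 227, 167, 167, 227, 227, 167,
                47, 6, 11.2, 47, 75, 15, 14.7, 17, 7.5, 13, 97, 17.3),
  Animal = c(rep(c("Grizzly", "Crocodile"), each = 6),
             rep(rep(c("Grizzly", "Crocodile"), each = 3), times = 4)),
  Armed  = rep(c("Nothing", "Brass Knuckles", "Golf Club"), times = 12),
  AI     = rep(c("Chat", "Copilot", "Gemini"), each = 12)
)

# Same full three-way model as in the Analysis section
fit <- aov(Fighttime ~ Animal * Armed * AI, data = chatfight)

# Shapiro-Wilk test on the residuals: a small p-value would point to non-normality
sw <- shapiro.test(residuals(fit))
sw
```

With only two replicates per cell, the residuals come in equal-and-opposite pairs, so the test result is best read alongside the Q-Q plot rather than in place of it.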

Data

pander(chatfight)

Fighttime	Animal	Armed	AI
18	Grizzly	Nothing	Chat
7	Grizzly	Brass Knuckles	Chat
47	Grizzly	Golf Club	Chat
90	Grizzly	Nothing	Chat
37	Grizzly	Brass Knuckles	Chat
47	Grizzly	Golf Club	Chat
11	Crocodile	Nothing	Chat
32.7	Crocodile	Brass Knuckles	Chat
47.3	Crocodile	Golf Club	Chat
28	Crocodile	Nothing	Chat
37	Crocodile	Brass Knuckles	Chat
97	Crocodile	Golf Club	Chat
167	Grizzly	Nothing	Copilot
157	Grizzly	Brass Knuckles	Copilot
227	Grizzly	Golf Club	Copilot
222	Crocodile	Nothing	Copilot
167	Crocodile	Brass Knuckles	Copilot
167	Crocodile	Golf Club	Copilot
227	Grizzly	Nothing	Copilot
167	Grizzly	Brass Knuckles	Copilot
167	Grizzly	Golf Club	Copilot
227	Crocodile	Nothing	Copilot
227	Crocodile	Brass Knuckles	Copilot
167	Crocodile	Golf Club	Copilot
47	Grizzly	Nothing	Gemini
6	Grizzly	Brass Knuckles	Gemini
11.2	Grizzly	Golf Club	Gemini
47	Crocodile	Nothing	Gemini
75	Crocodile	Brass Knuckles	Gemini
15	Crocodile	Golf Club	Gemini
14.7	Grizzly	Nothing	Gemini
17	Grizzly	Brass Knuckles	Gemini
7.5	Grizzly	Golf Club	Gemini
13	Crocodile	Nothing	Gemini
97	Crocodile	Brass Knuckles	Gemini
17.3	Crocodile	Golf Club	Gemini