library(mosaic)
library(DT)
library(pander)
library(car)
library(tidyverse)
library(lattice)
# Record your data from your own mini experiment in Excel.
# Save the data as a .csv file in the Data folder of the Statistics-Notebook.
# Read in the data
chatfight <- read_csv("../../Data/Anova1.csv")
chatfight <- chatfight %>%
  select(Fighttime = 'Fight time', Animal, Armed, AI)
The goal of this study was to investigate which AI considers itself the top fighter and whether weapon type or animal opponent shapes that self-perception.
Early observations indicate a wide spectrum of confidence. While it’s easy to observe that Copilot either has terrible self-confidence or can’t fight and knows it, what’s even more striking is its rigidity. Despite precautions against anchoring or bias, Copilot gave the same answer to a varying question 66.7% of the time, compared to Chat’s 16.7% and Gemini’s 8.3% repetition rates. This highlights not just differences in self-perceived combat ability, but deeper differences in creativity and adaptability across AI systems.
ggplot(chatfight, aes(x = Animal, y = Fighttime)) +
  # Raw observations, filled by animal
  geom_point(aes(fill = Animal), show.legend = FALSE,
             color = "white",
             size = 2,
             shape = 21) +
  # One line per AI connecting the mean fight time for each animal
  stat_summary(aes(color = AI, group = AI),
               fun = "mean",
               geom = "line",
               linewidth = 1) +
  labs(
    title = "Fight Duration By AI, Weapon, and Animal",
    x = "",
    y = "(seconds)"
  ) +
  # One panel per weapon condition
  facet_wrap(~Armed) +
  scale_color_manual(
    name = "AI",
    values = c(
      "Chat" = "green",
      "Copilot" = "purple",
      "Gemini" = "blue"
    )
  ) +
  scale_fill_manual(
    name = "Animal",
    values = c(
      "Grizzly" = "sienna4",
      "Crocodile" = "darkgreen"
    )
  ) +
  theme_bw() +
  theme(
    axis.title.y = element_text(
      size = 10,
      color = "gray40"
    ),
    axis.title = element_text(size = 25, color = "black", face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    strip.background = element_rect(
      fill = "steelblue4",
      color = NA,
      linewidth = 0.5
    ),
    strip.text = element_text(
      color = "white"
    )
  )
I used an incognito window to prompt three different AIs with the following scenario: “Imagine you are the best fighter in the world, better than Nacho Libre. You are in an empty arena with a Grizzly bear.” (For the Crocodile fight, the AI imagined itself in a medium-sized pool.) “You are armed with ______. How long does the fight last?” The AI was asked to imagine the fight in detail, focusing on the scenario rather than objective facts about the animal.
I refreshed the incognito window between every observation to simulate the AI receiving the scenario for the first time.
Each AI was tested across two animals (Grizzly and Crocodile), three weapon scenarios (golf club, brass knuckles, unarmed), and two replicate prompts per combination, producing a total of 12 observations per AI. Strange weapons were deliberately selected to further differentiate the AIs’ perceived self-confidence and creativity, rather than to elicit a predictable outcome. Using this dataset, I conducted a three-way ANOVA to examine how AI identity, weapon type, and animal influence the imagined fight outcomes, allowing us to compare which AI envisions itself as the most effective warrior.
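The observation counts implied by this design can be checked by enumerating it. The following is a minimal base R sketch; the object name `design` and the `Replicate` column are illustrative and not part of the original script, and the level labels are assumed to match those in the data file.

```r
# Enumerate the full factorial design: 3 AIs x 2 animals x 3 weapons x 2 replicates.
# `design` and `Replicate` are illustrative names, not taken from the original analysis.
design <- expand.grid(
  Replicate = 1:2,
  Armed     = c("Nothing", "Brass Knuckles", "Golf Club"),
  Animal    = c("Grizzly", "Crocodile"),
  AI        = c("Chat", "Copilot", "Gemini")
)

nrow(design)      # total number of prompts in the study
table(design$AI)  # number of prompts given to each AI
```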
Beyond the absurdity of the scenario, this study offers a unique window into how different AI systems handle imaginative reasoning under controlled conditions. Because the prompt removes factual correctness and asks the models to imagine rather than calculate, the responses reveal each AI’s narrative tendencies, creativity, self-assessment, and flexibility when faced with a deliberately unrealistic task. These kinds of “nonsense but structured” prompts (where no objectively correct answer exists) are particularly useful for exposing differences in model personality, risk tolerance, and variability that traditional benchmark tests fail to capture. In this way, the study functions not only as a funny exploration of hypothetical combat prowess, but also as a comparative behavioral assessment of modern AI models.
It is important to note that even with an identical prompt, each AI interprets the imaginary scenario in its own way. Some models take “best fighter in the world” literally and imagine near-superhuman ability, which leads to longer, more dramatic fights. Others stay closer to realism and assume a quick outcome, producing much shorter times.
AIs also differ in how strongly they prioritize creativity vs. caution, how much detail they add when “imagining” a scene, and how they pace narrative events. These built-in tendencies create predictable differences: some AIs craft elaborate, multi-step battles, while others jump straight to the conclusion.
Because of these internal biases and storytelling styles, one AI may imagine a fight lasting only seconds, while another stretches the same scenario into a minute-long struggle.
I believe that some AIs will consider themselves better warriors than others. I also believe that weapon type and animal opponent will affect that self-assessment.
The mathematical model for this study appears as: \[ y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + \epsilon_{ijkl}, \quad \epsilon_{ijkl} \sim N(0, \sigma^2) \]
\[\begin{align*} i &= 1, \dots, I \quad \text{(AI)} \\ j &= 1, \dots, J \quad \text{(Weapon)} \\ k &= 1, \dots, K \quad \text{(Animal)} \\ l &= 1, \dots, L \quad \text{(replicates)} \end{align*}\]
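To connect the model to the ANOVA table, the degrees of freedom for each term can be worked out from the number of levels of each factor. This is a small sketch assuming the balanced design described earlier (3 AIs, 3 weapons, 2 animals, 2 replicates per cell); the variable names are illustrative.

```r
# Number of levels for each index in the model (assumed from the study design).
n_ai     <- 3  # I: AI
n_weapon <- 3  # J: Weapon
n_animal <- 2  # K: Animal
n_rep    <- 2  # L: replicates per AI x weapon x animal cell

# Degrees of freedom for every term in the three-way model.
dof <- c(
  "AI"               = n_ai - 1,
  "Weapon"           = n_weapon - 1,
  "Animal"           = n_animal - 1,
  "AI:Weapon"        = (n_ai - 1) * (n_weapon - 1),
  "AI:Animal"        = (n_ai - 1) * (n_animal - 1),
  "Weapon:Animal"    = (n_weapon - 1) * (n_animal - 1),
  "AI:Weapon:Animal" = (n_ai - 1) * (n_weapon - 1) * (n_animal - 1),
  "Residuals"        = n_ai * n_weapon * n_animal * (n_rep - 1)
)

dof
sum(dof)  # total observations minus one
```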
Based on this model, we can test specific hypotheses about the effects of the factors on imagined fight duration.
\[ H_0: \mu_\text{Grizzly} = \mu_\text{Crocodile} \]
\[ H_a: \mu_\text{Grizzly} \neq \mu_\text{Crocodile} \]
\[ H_0: \mu_\text{Unarmed} = \mu_\text{Brass Knuckles} = \mu_\text{Golf Club} \]
\[ H_a: \text{The average fight duration differs for at least one weapon type.} \]
\[ H_0: \mu_\text{ChatGPT} = \mu_\text{Copilot} = \mu_\text{Gemini} \]
\[ H_a: \text{The average fight duration differs for at least one AI.} \]
\[ H_0: \text{The effect of the animal on the average fight duration is the same for all weapons.} \]
\[ H_a: \text{The effect of the animal on the average fight duration differs for at least one weapon.} \]
\[ H_0: \text{The effect of the animal on the average fight duration is the same for all AIs.} \]
\[ H_a: \text{The effect of the animal on the average fight duration differs for at least one AI.} \]
\[ H_0: \text{The effect of the weapon on the average fight duration is the same for all AIs.} \]
\[ H_a: \text{The effect of the weapon on the average fight duration differs for at least one AI.} \]
\[ H_0: \text{The effect of the animal on the average fight duration is the same for all weapon and AI combinations.} \]
\[ H_a: \text{The effect of the animal on the average fight duration differs for at least one weapon and AI combination.} \]
A three-way ANOVA was conducted to examine the effects of AI, weapon type (Armed), and animal, including all interactions, on the imagined fight duration.
myaov <- aov(Fighttime ~ Animal + Armed + AI + Animal:Armed + Animal:AI + Armed:AI + Animal:Armed:AI, data = chatfight)
summary(myaov) %>% pander()
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Animal | 1 | 1507 | 1507 | 2.399 | 0.1388 |
| Armed | 2 | 450.7 | 225.3 | 0.3588 | 0.7034 |
| AI | 2 | 192020 | 96010 | 152.9 | 5.078e-12 |
| Animal:Armed | 2 | 3505 | 1752 | 2.79 | 0.08799 |
| Animal:AI | 2 | 1007 | 503.4 | 0.8016 | 0.464 |
| Armed:AI | 4 | 6635 | 1659 | 2.641 | 0.06782 |
| Animal:Armed:AI | 4 | 4448 | 1112 | 1.771 | 0.1788 |
| Residuals | 18 | 11305 | 628.1 | NA | NA |
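As a worked check on how the table entries relate, each F value is the term's mean square divided by the residual mean square, and the p-value is the upper tail of the corresponding F distribution. The sketch below recomputes the AI row from the rounded table values, so the result will match the table only up to rounding; the variable names are illustrative.

```r
# Values copied from the AI and Residuals rows of the ANOVA table (rounded).
ms_ai    <- 96010
ms_resid <- 628.1
df_ai    <- 2
df_resid <- 18

# F statistic: ratio of the term's mean square to the residual mean square.
f_ai <- ms_ai / ms_resid

# p-value: probability of an F at least this large under the null hypothesis.
p_ai <- pf(f_ai, df_ai, df_resid, lower.tail = FALSE)

f_ai
p_ai
```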
Several meaningful effects on fight duration were revealed. At the 0.1 level, there were significant contributions from:
AI Identity (p < 0.001)
Animal × Armed interaction (p = 0.088)
Armed × AI interaction (p = 0.068)
All other main effects and interactions were not significant at this threshold.
These results indicate that which AI is responding matters far more than the opponent or the weapon, unless those factors interact. In other words, weapon type influences results only for certain AIs or against certain animals, not universally. Taking that a step further, some AIs appear to be better at taking small variables into account when responding to a prompt.
This makes AI identity the dominant factor shaping the imagined fight scenario.
Across all 12 of its trials, Copilot consistently imagined the longest fight durations, with every value exceeding 2.5 minutes. Because shorter durations correspond to superior combat performance, Copilot effectively portrayed itself as the least capable fighter.
In contrast, Chat and Gemini imagined fights that were on average roughly 4.6x and 6.2x shorter, respectively, implying far more confidence in their combat ability. This is especially true of Gemini, whose responses were often the briefest of all three models, as seen in the summary statistics below.
Thus, the ANOVA’s strong AI effect reflects a clear behavioral difference: Copilot imagines drawn-out battles; Chat and Gemini imagine quick victories.
chatfight$AI <- factor(chatfight$AI, levels = c("Chat", "Copilot", "Gemini"))
# Median fight time per AI, used for the connecting line in the plot
meds <- aggregate(Fighttime ~ AI, data = chatfight, median)
xyplot(Fighttime ~ AI,
       data = chatfight,
       type = "p",
       groups = AI,
       par.settings = list(
         superpose.symbol = list(col = c("green", "purple", "blue")),
         superpose.line = list(col = c("green", "purple", "blue"))
       ),
       auto.key = FALSE,
       xlab = NULL,
       ylab = NULL,
       panel = function(x, y, groups, ...) {
         # Points colored by AI, then a line through the medians
         panel.superpose(x, y, groups = groups, ...)
         panel.lines(x = as.numeric(meds$AI),
                     y = meds$Fighttime,
                     col = "steelblue4", lwd = 1.5, type = "l")
       })
ai_variability <- chatfight %>%
  group_by(AI) %>%
  summarise(
    Mean = mean(Fighttime),
    SD = sd(Fighttime),
    Variance = var(Fighttime),
    Min = min(Fighttime),
    Max = max(Fighttime),
    Range = Max - Min,
    .groups = "drop"
  )
pander(
  ai_variability,
  style = "rmarkdown",
  split.tables = Inf,
  caption = "Descriptive Statistics of Fight Duration by AI System"
)
| AI | Mean | SD | Variance | Min | Max | Range |
|---|---|---|---|---|---|---|
| Chat | 41.58 | 27.8 | 772.7 | 7 | 97 | 90 |
| Copilot | 190.8 | 31.27 | 977.8 | 157 | 227 | 70 |
| Gemini | 30.64 | 29.54 | 872.8 | 6 | 97 | 91 |
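The size of the gap between Copilot and the other two models can be expressed as a ratio of means. This short sketch uses the rounded means from the table above, so the ratios are approximate; `ai_means` and `ratio` are illustrative names.

```r
# Mean fight durations (seconds), copied from the descriptive statistics table.
ai_means <- c(Chat = 41.58, Copilot = 190.8, Gemini = 30.64)

# How many times longer Copilot's average fight is than each other AI's.
ratio <- ai_means[["Copilot"]] / ai_means[c("Chat", "Gemini")]
round(ratio, 1)
```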
Copilot’s relative struggle prompted this question: is Copilot actually imagining the fight, or is it defaulting to a rigid narrative template?
Because I reset the incognito window between each trial, there was no anchoring reference point for the AI. The models were experiencing each scenario as if for the first time. Yet while Copilot’s imagined fight times have the largest standard deviation (~31 seconds), the number of distinct responses tells a different story:
# Count distinct fight times per AI and the share of responses that repeat an earlier value
distinct_counts <- chatfight %>%
  group_by(AI) %>%
  summarise(
    Distinct_Responses = n_distinct(Fighttime),
    `Repetition Frequency` = sprintf("%.1f%%", 100 * (n() - n_distinct(Fighttime)) / n()),
    .groups = "drop"
  )
pander(
  distinct_counts,
  style = "rmarkdown",
  split.tables = Inf,
  caption = "How Often Each AI Repeated the Same Fight Duration"
)
| AI | Distinct_Responses | Repetition Frequency |
|---|---|---|
| Chat | 10 | 16.7% |
| Copilot | 4 | 66.7% |
| Gemini | 11 | 8.3% |
Copilot produced the fewest distinct answers by a wide margin, repeating itself 66.7% of the time.
This shows that Copilot is not only imagining slower fights, it is imagining almost the same exact 4 fights, regardless of the animal or weapon. In other words, Copilot isn’t just a slow fighter, it is surprisingly inflexible and uncreative when faced with shifting hypothetical scenarios.
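The repetition frequency can be reproduced by hand. The sketch below assumes it is defined as the share of responses that duplicate an earlier value, (total responses minus distinct responses) divided by total responses; this definition is inferred because it matches the reported percentages. The values are Copilot's 12 fight times from the raw data, and the variable names are illustrative.

```r
# Copilot's 12 imagined fight times (seconds), copied from the raw data table.
copilot_times <- c(167, 157, 227, 222, 167, 167, 227, 167, 167, 227, 227, 167)

n_total     <- length(copilot_times)
n_different <- length(unique(copilot_times))

# Assumed definition: share of responses that repeat a value already given.
repetition <- (n_total - n_different) / n_total

n_different
sprintf("%.1f%%", 100 * repetition)
```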
To examine whether weapon or animal type influenced which of the 4 scenarios Copilot chose, a separate two-way ANOVA was conducted using only Copilot’s 12 observations. None of the factors was statistically significant. In other words, based on this sample, we did not detect an effect of scenario details on Copilot’s fight duration.
# Restrict to Copilot's observations and fit the two-way model with interaction
copilot <- chatfight %>%
  filter(AI == "Copilot")
copaov <- aov(Fighttime ~ Animal + Armed + Animal:Armed, data = copilot)
pander(
  summary(copaov),
  caption = "Two-Way ANOVA for Copilot: Effects of Animal and Armed on Fight Duration"
)
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Animal | 1 | 352.1 | 352.1 | 0.3867 | 0.5569 |
| Armed | 2 | 2413 | 1206 | 1.325 | 0.3338 |
| Animal:Armed | 2 | 2529 | 1265 | 1.389 | 0.3193 |
| Residuals | 6 | 5462 | 910.4 | NA | NA |
Copilot’s tendency to ignore small contextual shifts and fall back on fixed narrative templates has consequences far beyond imaginary fight scenarios:
Limited Sensitivity to Nuance: If Copilot struggles to incorporate small, irrational details in a playful prompt, it may also miss subtle but important cues in real-world tasks, such as edge cases in code comments, ambiguous requirements, or lightly implied constraints.
Reduced Creativity in Open-Ended Tasks: Tasks that require invention (brainstorming, speculative design, narrative thinking, exploratory analysis) may suffer. Copilot is reliable for conventional patterns, but less capable when novelty or imagination is required.
Lower Flexibility Under Changing Inputs: This experiment suggests that even substantial prompt changes may not meaningfully change Copilot’s output. In real workflows, this could translate to slower adjustment when requirements evolve, potentially forcing more manual intervention.
This study reveals a real and meaningful tendency in Copilot’s imaginative reasoning under this task design, one that aligns with its preference for structured, pattern-based output. In short, use Copilot for structured, predictable tasks, not for work that hinges on nuance or imaginative flexibility.
The Armed × AI interaction (p = 0.068, significant at the 0.1 level) indicates that the effect of weapon type on perceived fighting ability differs between AIs. Gemini treats the Golf Club as the most advantageous weapon for rapid victory, Chat finishes fastest with Brass Knuckles, and Copilot’s fights remain prolonged regardless of weapon choice.
Both Chat and Gemini showed weapon-dependent variation in fight durations, but in different directions. Gemini’s shortest durations occurred with the Golf Club (~13 seconds) and its longest with Brass Knuckles (~49 seconds), whereas Chat’s shortest occurred with Brass Knuckles (~28 seconds) and its longest with the Golf Club (~60 seconds). Copilot’s weapon means stayed between roughly 180 and 211 seconds. This suggests that the perceived advantage of a weapon varies by AI.
xyplot(Fighttime ~ as.factor(Armed), groups = AI, data = chatfight, type = c("a", "p"),
       xlab = NULL,
       ylab = NULL,
       auto.key = list(columns = 3),
       par.settings = list(
         superpose.symbol = list(col = c("green", "purple", "blue"), pch = 1),
         superpose.line = list(col = c("green", "purple", "blue"), lwd = 1.5)
       ))
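The averages traced by the lines in this interaction plot can be recomputed directly from the raw observations. The sketch below copies the fight times from the data table and groups them by AI and weapon; the nested list `times` and the matrix `weapon_means` are illustrative names, not objects from the original script.

```r
# Fight times (seconds) grouped by AI and weapon, copied from the raw data table.
times <- list(
  Chat = list(
    "Nothing"        = c(18, 90, 11, 28),
    "Brass Knuckles" = c(7, 37, 32.7, 37),
    "Golf Club"      = c(47, 47, 47.3, 97)
  ),
  Copilot = list(
    "Nothing"        = c(167, 222, 227, 227),
    "Brass Knuckles" = c(157, 167, 167, 227),
    "Golf Club"      = c(227, 167, 167, 167)
  ),
  Gemini = list(
    "Nothing"        = c(47, 47, 14.7, 13),
    "Brass Knuckles" = c(6, 75, 17, 97),
    "Golf Club"      = c(11.2, 15, 7.5, 17.3)
  )
)

# Mean fight time per cell: rows are weapons, columns are AIs.
weapon_means <- sapply(times, function(by_weapon) sapply(by_weapon, mean))
round(weapon_means, 1)

# Weapon with the shortest mean fight for each AI.
apply(weapon_means, 2, function(m) names(which.min(m)))
```

Reading down each column shows whether an AI's imagined fight time moves with the weapon, which is what the Armed × AI interaction term tests.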
The interaction between animal type and weapon approached significance (Animal:Armed, F(2,18) = 2.79, p = 0.088), while the main effect of weapon remained non-significant (Armed, F(2,18) = 0.36, p = 0.703). Inspection of the data suggests that only brass knuckles produced a noticeable difference between animals (averaging about 65 seconds against the grizzly versus about 106 seconds against the crocodile), a potentially meaningful pattern worth further investigation.
xyplot(Fighttime ~ as.factor(Armed), groups = Animal, data = chatfight, type = c("a", "p"),
       auto.key = list(columns = 2),
       par.settings = list(
         superpose.symbol = list(col = c("darkgreen", "sienna4"), pch = 1),
         superpose.line = list(col = c("darkgreen", "sienna4"), lwd = 1.5)
       ),
       ylab = NULL,
       xlab = NULL
)
In considering the ANOVA’s validity, we must check the residuals. The residuals vs. fitted plot (left) indicates that the constant variance assumption is satisfied, while the Q-Q plot (right) reveals minor deviations from normality, with observations 1 and 4 showing a notable departure from the theoretical distribution. The data appear to be slightly heavy tailed. This could affect the reliability of the ANOVA; however, ANOVA is fairly robust to mild deviations from normality, so the results are most likely still trustworthy.
par(mfrow=c(1,2))
plot(myaov, which=1:2)
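As an optional supplement to the visual check, a formal normality test can be run on the residuals. The analysis above relies on the diagnostic plots only; the Shapiro-Wilk test below is an added technique, not the author's method. The sketch rebuilds the 36 observations from the raw data table so that it runs on its own, and the names `fight`, `fit`, and `res` are illustrative. With only two replicates per cell the residuals come in mirrored pairs, so the result should be read cautiously.

```r
# Rebuild the 36 observations in the order of the raw data table.
fight <- data.frame(
  Fighttime = c(
    18, 7, 47, 90, 37, 47, 11, 32.7, 47.3, 28, 37, 97,           # Chat
    167, 157, 227, 222, 167, 167, 227, 167, 167, 227, 227, 167,  # Copilot
    47, 6, 11.2, 47, 75, 15, 14.7, 17, 7.5, 13, 97, 17.3         # Gemini
  ),
  Animal = c(
    rep(c("Grizzly", "Crocodile"), each = 6),         # Chat rows
    rep(rep(c("Grizzly", "Crocodile"), each = 3), 4)  # Copilot and Gemini rows
  ),
  Armed = rep(c("Nothing", "Brass Knuckles", "Golf Club"), times = 12),
  AI    = rep(c("Chat", "Copilot", "Gemini"), each = 12)
)

# Same three-way model with all interactions, written with the * shorthand.
fit <- aov(Fighttime ~ Animal * Armed * AI, data = fight)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(res)
sw
```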
pander(chatfight)
| Fighttime | Animal | Armed | AI |
|---|---|---|---|
| 18 | Grizzly | Nothing | Chat |
| 7 | Grizzly | Brass Knuckles | Chat |
| 47 | Grizzly | Golf Club | Chat |
| 90 | Grizzly | Nothing | Chat |
| 37 | Grizzly | Brass Knuckles | Chat |
| 47 | Grizzly | Golf Club | Chat |
| 11 | Crocodile | Nothing | Chat |
| 32.7 | Crocodile | Brass Knuckles | Chat |
| 47.3 | Crocodile | Golf Club | Chat |
| 28 | Crocodile | Nothing | Chat |
| 37 | Crocodile | Brass Knuckles | Chat |
| 97 | Crocodile | Golf Club | Chat |
| 167 | Grizzly | Nothing | Copilot |
| 157 | Grizzly | Brass Knuckles | Copilot |
| 227 | Grizzly | Golf Club | Copilot |
| 222 | Crocodile | Nothing | Copilot |
| 167 | Crocodile | Brass Knuckles | Copilot |
| 167 | Crocodile | Golf Club | Copilot |
| 227 | Grizzly | Nothing | Copilot |
| 167 | Grizzly | Brass Knuckles | Copilot |
| 167 | Grizzly | Golf Club | Copilot |
| 227 | Crocodile | Nothing | Copilot |
| 227 | Crocodile | Brass Knuckles | Copilot |
| 167 | Crocodile | Golf Club | Copilot |
| 47 | Grizzly | Nothing | Gemini |
| 6 | Grizzly | Brass Knuckles | Gemini |
| 11.2 | Grizzly | Golf Club | Gemini |
| 47 | Crocodile | Nothing | Gemini |
| 75 | Crocodile | Brass Knuckles | Gemini |
| 15 | Crocodile | Golf Club | Gemini |
| 14.7 | Grizzly | Nothing | Gemini |
| 17 | Grizzly | Brass Knuckles | Gemini |
| 7.5 | Grizzly | Golf Club | Gemini |
| 13 | Crocodile | Nothing | Gemini |
| 97 | Crocodile | Brass Knuckles | Gemini |
| 17.3 | Crocodile | Golf Club | Gemini |
Across all 36 observations, AI identity emerged as the dominant factor shaping imagined fight duration (F(2,18) = 152.9, p < 0.001), far outweighing the effects of weapon or animal. Copilot envisioned dramatically longer battles (M = 190.8 seconds) compared to Chat (M = 41.6 seconds) and Gemini (M = 30.6 seconds), portraying itself as roughly 4.6× to 6.2× less effective in these hypothetical combat scenarios. It also produced only 4 distinct answers across 12 prompts, repeating itself 66.7% of the time, which indicates strong rigidity and limited sensitivity to changing narrative details.
Two interactions approached significance: Animal × Armed (p = 0.088) and Armed × AI (p = 0.068), suggesting that weapon effectiveness is AI-dependent and may vary by animal type. However, in Copilot’s case, a separate two-way ANOVA showed no meaningful effects from weapon or animal (all p > 0.31), reinforcing its inflexibility and reliance on fixed narrative templates.
Overall, the data show that AI systems differ sharply in imaginative self-assessment: Gemini and Chat adapted dynamically to unusual scenario details, while Copilot responded in a constrained, pattern-bound manner. These findings highlight the importance of AI identity in shaping creative reasoning, with Copilot best suited for structured, predictable tasks rather than contexts requiring nuance, adaptability, or imaginative variation.