Primary OEC
\[
H_0:\ \mu_{\text{Control}} \le \mu_{\text{Treatment}}
\]
\[
H_1:\ \mu_{\text{Control}} > \mu_{\text{Treatment}}
\]
where \(\mu\) represents the mean Final Type II Rate.
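# Welch two-sample t-test (unequal variances) on Final Type II Rate.
# No `alternative` is given, so R runs its default two-sided test.
t_final_typeII_rate <- t.test(
  final_typeII_rate ~ Variant,
  var.equal = FALSE,
  data = officer_summary
)
t_final_typeII_rate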
##
## Welch Two Sample t-test
##
## data: final_typeII_rate by Variant
## t = 3.4409, df = 10.923, p-value = 0.005571
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
## 0.04514467 0.20580771
## sample estimates:
## mean in group Control mean in group Treatment
## 0.3933333 0.2678571
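Since p = 0.0056 < 0.05 and the Control mean exceeds the Treatment
mean, the null hypothesis is rejected. Note that the test was run
two-sided (R's default), which is more conservative than the one-sided
hypothesis stated above.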
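The one-sided p-value implied by the directional hypotheses can be recovered from the printed statistic. The following is a minimal sketch that assumes only the `t` and `df` values shown in the output above; because those values are rounded, the result is approximate.

```r
# Values copied from the Welch t-test output above (rounded as printed).
t_stat <- 3.4409
df_welch <- 10.923

# One-sided p-value for H1: mean(Control) > mean(Treatment),
# i.e. the upper tail of the t distribution.
p_one_sided <- pt(t_stat, df = df_welch, lower.tail = FALSE)

# Doubling the upper tail should approximately reproduce the
# two-sided p-value reported by t.test().
p_two_sided <- 2 * p_one_sided

print(c(one_sided = p_one_sided, two_sided = p_two_sided))
```

Rerunning `t.test()` with `alternative = "greater"` on the original data would be the more direct route; this sketch only shows that the directional conclusion does not depend on that choice.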
The Treatment model significantly reduced the proportion of bad loans
incorrectly approved by loan officers.
Compared with the Control model, the Treatment model lowered the mean
Final Type II Rate by:
\[
39.3\% - 26.8\% = 12.5\ \text{percentage points}
\]
The 95% confidence interval for this difference runs from 4.5 to 20.6
percentage points.
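An absolute difference in percentage points is easy to confuse with a relative change, so it may help to compute both. The sketch below uses only the two group means printed in the t-test output; the variable names are illustrative.

```r
# Group means copied from the t-test output above.
control_mean <- 0.3933333
treatment_mean <- 0.2678571

# Absolute reduction: difference between the two rates (percentage points).
absolute_reduction <- control_mean - treatment_mean

# Relative reduction: the same difference as a share of the Control rate.
relative_reduction <- absolute_reduction / control_mean

print(round(100 * c(absolute_pp = absolute_reduction,
                    relative_pct = relative_reduction), 1))
```

On these figures, the 12.5-point absolute gap corresponds to a relative reduction of roughly 32% of the Control rate.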
# Standardized effect size; cohen.d() is provided by the effsize package.
cohen.d(
  final_typeII_rate ~ Variant,
  data = officer_summary
)
##
## Cohen's d
##
## d estimate: 1.671575 (large)
## 95 percent confidence interval:
## lower upper
## 0.8292943 2.5138561
A Cohen’s d of 1.67 (95% CI: 0.83 to 2.51) is conventionally classed as
a large effect, indicating that the difference between the Control and
Treatment groups is not only statistically significant but also
practically meaningful.
Taken together, the significant Welch’s t-test and the large effect
size suggest that the Treatment model produced a substantial reduction
in Final Type II errors and a meaningful improvement in loan approval
quality.
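For readers unfamiliar with the statistic, Cohen's d is the difference in group means divided by a pooled standard deviation. The sketch below is a hand-rolled illustration of that formula, assuming the pooled-SD form; the helper `cohens_d_pooled()` and the toy vectors are hypothetical and are not part of the original analysis.

```r
# Cohen's d with a pooled standard deviation (illustrative helper).
cohens_d_pooled <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  # Pooled variance: group variances weighted by their degrees of freedom.
  pooled_var <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Toy data, not the study data: means 4 and 3, both variances equal to 4.
toy_control <- c(2, 4, 6)
toy_treatment <- c(1, 3, 5)
d_toy <- cohens_d_pooled(toy_control, toy_treatment)
print(d_toy)
```

Note that the pooled SD treats the two group variances as similar, whereas the Welch test above does not, so the two statistics rest on slightly different assumptions.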
Supporting OEC
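\[
H_0:\ \text{Control} \ge \text{Treatment}
\]
\[
H_1:\ \text{Control} < \text{Treatment}
\]
where Control and Treatment denote the location of each supporting
metric (Type II Reduction, Agreement Lift, and Conflict Reduction) in
the respective group.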
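# One-sided Wilcoxon rank-sum tests. With Control as the first factor
# level, alternative = "less" tests whether Control values tend to be
# lower than Treatment values.
w_typeII <- wilcox.test(
  typeII_reduction ~ Variant,
  data = officer_summary,
  alternative = "less"
)
w_agreement <- wilcox.test(
  agreement_lift ~ Variant,
  data = officer_summary,
  alternative = "less"
)
w_conflict <- wilcox.test(
  conflict_reduction ~ Variant,
  data = officer_summary,
  alternative = "less"
)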
w_typeII
##
## Wilcoxon rank sum exact test
##
## data: typeII_reduction by Variant
## W = 120, p-value = 0.2559
## alternative hypothesis: true location shift is less than 0
officer_summary %>%
  wilcox_effsize(
    typeII_reduction ~ Variant
  )
## # A tibble: 1 × 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 typeII_reduction Control Treatment 0.110 10 28 small
w_agreement
##
## Wilcoxon rank sum exact test
##
## data: agreement_lift by Variant
## W = 63.5, p-value = 0.004895
## alternative hypothesis: true location shift is less than 0
officer_summary %>%
  wilcox_effsize(
    agreement_lift ~ Variant
  )
## # A tibble: 1 × 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 agreement_lift Control Treatment 0.411 10 28 moderate
w_conflict
##
## Wilcoxon rank sum exact test
##
## data: conflict_reduction by Variant
## W = 71.5, p-value = 0.0109
## alternative hypothesis: true location shift is less than 0
officer_summary %>%
  wilcox_effsize(
    conflict_reduction ~ Variant
  )
## # A tibble: 1 × 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 conflict_reduction Control Treatment 0.368 10 28 moderate
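The `effsize` column reported for the three Wilcoxon tests is the rank-based effect size r, commonly defined as |Z| / sqrt(N). The sketch below back-computes the implied |Z| from the reported r values and applies the conventional magnitude cut-offs (0.3 and 0.5). Both the cut-offs and the back-calculation are assumptions made here for illustration, not output from the original analysis.

```r
# Total sample size from the output above: 10 Control + 28 Treatment.
n_total <- 10 + 28

# Effect sizes copied from the wilcox_effsize() output.
r_reported <- c(
  typeII_reduction   = 0.110,
  agreement_lift     = 0.411,
  conflict_reduction = 0.368
)

# If r = |Z| / sqrt(N), the implied |Z| is r * sqrt(N).
z_implied <- r_reported * sqrt(n_total)

# Conventional labels: below 0.3 small, 0.3 to below 0.5 moderate, else large.
magnitude <- cut(
  r_reported,
  breaks = c(0, 0.3, 0.5, Inf),
  labels = c("small", "moderate", "large"),
  right = FALSE
)

print(data.frame(r = r_reported, z_implied = round(z_implied, 2), magnitude))
```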
Overall OEC Results Summary
| OEC | Statistical Test | P-value | Significant (α = 0.05) | Effect Size | Magnitude | Interpretation |
|---|---|---|---|---|---|---|
| Final Type II Rate | Welch’s t-test | 0.0056 | Yes | d = 1.67 | Large | Treatment reduced Final Type II Rate |
| Type II Reduction | Wilcoxon rank-sum | 0.2559 | No | r = 0.110 | Small | No evidence of greater Type II Reduction |
| Agreement Lift | Wilcoxon rank-sum | 0.0049 | Yes | r = 0.411 | Moderate | Higher adoption of AI recommendations |
| Conflict Reduction | Wilcoxon rank-sum | 0.0109 | Yes | r = 0.368 | Moderate | Lower disagreement with AI |
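Because three supporting metrics are tested side by side, one optional robustness check is to see whether the significance calls survive a multiplicity adjustment. The sketch below applies a Holm correction to the three reported Wilcoxon p-values; the choice of Holm is an assumption made here and was not part of the original analysis.

```r
# Raw one-sided p-values copied from the Wilcoxon outputs above.
p_raw <- c(
  typeII_reduction   = 0.2559,
  agreement_lift     = 0.004895,
  conflict_reduction = 0.0109
)

# Holm step-down adjustment across the three supporting tests.
p_holm <- p.adjust(p_raw, method = "holm")

print(data.frame(
  p_raw = p_raw,
  p_holm = round(p_holm, 4),
  significant = p_holm < 0.05
))
```

On these values, the adjustment leaves the three significance calls unchanged: Agreement Lift and Conflict Reduction remain below 0.05, and Type II Reduction does not.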
Based on the statistical analyses above, several key findings can be
drawn.
First, officers working with the new AI model approved a
significantly smaller share of bad loans: the mean Final Type II
Error rate was 26.8% in the Treatment group versus 39.3% in the
Control group, which used the legacy model.
In addition, both Agreement Lift and Conflict Reduction were
significantly higher in the Treatment group. These results suggest a
consistent behavioral pattern: after being exposed to recommendations
from the new AI model, loan officers were more likely to align their
decisions with the AI’s recommendations. This points to greater trust
in, and adoption of, the new model, and to stronger human-AI decision
alignment than under the legacy model.
However, although the Final Type II Rate was significantly lower
among users of the new model, Type II Reduction (a metric intended to
capture the AI’s ability to improve human decisions) was not
significantly greater in the Treatment group (p = 0.2559). There are
at least two plausible explanations for this result.
1. Insufficient Statistical Power
The Control group included only 10 participants versus 28 in the
Treatment group, so the comparison is both small and unbalanced. With
an observed effect size of only r = 0.11, the Wilcoxon test may have
lacked the statistical power to detect a real but modest difference.
As a result, the non-significant finding may reflect a false negative
(a statistical Type II error, not to be confused with the loan-approval
Type II errors being measured) rather than the absence of a true
improvement effect from the new model.
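The power concern can be made concrete with a small simulation. The sketch below estimates how often a one-sided Wilcoxon rank-sum test with 10 Control and 28 Treatment observations rejects the null under a hypothetical true shift. The normal data-generating model, the shift sizes, the seed, and the helper `simulate_power()` are all assumptions chosen for illustration, not properties of the study data.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Estimate the power of a one-sided Wilcoxon rank-sum test by simulation.
# `shift` is the hypothetical true Treatment advantage, in SD units.
simulate_power <- function(shift, n_control = 10, n_treatment = 28,
                           n_sims = 1000, alpha = 0.05) {
  rejected <- replicate(n_sims, {
    control <- rnorm(n_control, mean = 0, sd = 1)
    treatment <- rnorm(n_treatment, mean = shift, sd = 1)
    # alternative = "less": Control tends to be lower than Treatment.
    wilcox.test(control, treatment, alternative = "less")$p.value < alpha
  })
  mean(rejected)  # share of simulated experiments that reject H0
}

power_small <- simulate_power(shift = 0.2)  # modest hypothetical effect
power_large <- simulate_power(shift = 1.0)  # large hypothetical effect
print(c(shift_0.2 = power_small, shift_1.0 = power_large))
```

Comparing the two estimates shows how much of the test's sensitivity depends on the size of the true effect at these sample sizes.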
2. Baseline Differences Between Groups
If the Treatment group already had a relatively low initial Type II
Error rate, the new AI model would have less room to improve
performance. In contrast, the Control group’s higher initial error rate
would allow greater improvement even with the legacy model.
To test this, an additional analysis compared the two groups’ initial
Type II Error rates, measured before the model intervention.
Baseline Type II Error Rate Comparison Summary
| Metric | Result |
|---|---|
| Control mean Initial Type II Error Rate | 0.4 |
| Treatment mean Initial Type II Error Rate | 0.313 |
| Control median Initial Type II Error Rate | 0.433 |
| Treatment median Initial Type II Error Rate | 0.3 |
| Control normality | Shapiro-Wilk p = 0.0307 (Not normal) |
| Treatment normality | Shapiro-Wilk p = 0.0024 (Not normal) |
| Variance homogeneity | Levene p = 0.5706 (Equal variance) |
| Selected statistical test | Wilcoxon rank-sum test |
| Test statistic | W = 213.5 |
| p-value | 0.0128 |
| Conclusion | Significant baseline difference |
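The decision path summarized in the table (check normality, check variance homogeneity, then pick a test) can be sketched as follows. The data frame here is simulated and purely hypothetical, as is the column name `initial_typeII_rate`. The Levene test is implemented by hand in its median-centred form rather than through a package, so it may differ from the function used in the original analysis.

```r
set.seed(1)  # arbitrary seed; the data below are simulated, not study data

# Hypothetical stand-in for the officer-level baseline data.
baseline <- data.frame(
  Variant = factor(rep(c("Control", "Treatment"), times = c(10, 28))),
  initial_typeII_rate = c(rbeta(10, 4, 6), rbeta(28, 3, 7))
)

# Step 1: Shapiro-Wilk normality test within each group.
normality_p <- tapply(
  baseline$initial_typeII_rate, baseline$Variant,
  function(x) shapiro.test(x)$p.value
)

# Step 2: median-centred Levene test, i.e. a one-way ANOVA on the
# absolute deviations from each group's median.
group_median <- ave(baseline$initial_typeII_rate, baseline$Variant, FUN = median)
baseline$abs_dev <- abs(baseline$initial_typeII_rate - group_median)
levene_p <- anova(lm(abs_dev ~ Variant, data = baseline))[["Pr(>F)"]][1]

# Step 3: t-test if both groups look normal, otherwise Wilcoxon rank-sum.
baseline_test <- if (all(normality_p > 0.05)) {
  t.test(initial_typeII_rate ~ Variant, data = baseline,
         var.equal = levene_p > 0.05)
} else {
  wilcox.test(initial_typeII_rate ~ Variant, data = baseline)
}
print(baseline_test)
```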
The results are consistent with this explanation: the Control group had
a significantly higher initial Type II Error rate than the Treatment
group (p = 0.0128). This baseline difference may help explain why
Type II Reduction was not statistically significant. While the new
model achieved better final outcomes, the Treatment group’s lower
starting error rate limited its room for improvement, making the
observed reduction statistically indistinguishable from that of the
Control group.
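The reported group means allow a rough check of how the baseline gap compares with the final gap. The sketch below uses only the means quoted in this section and assumes that the same officers contribute to both the initial and the final means; it is arithmetic on summary figures, not a reanalysis of the officer-level data.

```r
# Group means copied from the tables and test output in this section.
initial_mean <- c(Control = 0.400, Treatment = 0.313)
final_mean <- c(Control = 0.3933333, Treatment = 0.2678571)

# Mean change within each group (positive = error rate went down).
mean_change <- initial_mean - final_mean

# Between-group gaps before and after the model intervention.
baseline_gap <- initial_mean[["Control"]] - initial_mean[["Treatment"]]
final_gap <- final_mean[["Control"]] - final_mean[["Treatment"]]

print(round(100 * mean_change, 1))  # change per group, percentage points
print(round(100 * c(baseline_gap = baseline_gap, final_gap = final_gap), 1))
```

On these figures the groups already differed by about 8.7 percentage points at baseline, compared with 12.5 points at the end, so a baseline-adjusted comparison may be worth considering when interpreting the final difference.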