Data Preparation and Data Understanding

# Load Packages and Data
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(knitr)
library(kableExtra)
library(moments)
library(broom)
library(patchwork)
library(purrr)
library(car)
library(tibble)
library(effsize)
library(rstatix)
library(DiagrammeR)

df <- read_csv("data.csv", show_col_types = FALSE)

Data Dictionary and Business Interpretation

| Variable | Technical Meaning | Business Interpretation |
|---|---|---|
| Variant | Experimental group assignment | Determine whether officers use old or new AI model |
| loanofficer_id | Unique ID of loan officer | Track individual performance differences |
| day | Experiment day | Observe time trend and learning effect |
| typeI_init | Initial Type I errors | Opportunity loss from rejecting good customers |
| typeI_fin | Final Type I errors | Final opportunity cost after AI assistance |
| typeII_init | Initial Type II errors | Potential bad debt due to human decisions |
| typeII_fin | Final Type II errors | Direct financial losses from approving bad loans |
| ai_typeI | AI Type I errors | AI-caused loss of good customers |
| ai_typeII | AI Type II errors | AI-caused bad debt risk |
| badloans_num | Number of bad loans | Actual defaulted loans |
| goodloans_num | Number of good loans | Actual profitable loans |
| agree_init | Initial agreement with AI | Similarity between human and AI before intervention |
| agree_fin | Final agreement with AI | Degree of accepting AI suggestions |
| conflict_init | Initial conflict with AI | Initial disagreement with AI |
| conflict_fin | Final conflict with AI | Remaining disagreement after AI |
| revised_per_ai | Revised decisions following AI | Degree to which AI influences humans |
| revised_agst_ai | Revised decisions against AI | Degree to which humans override AI |
| confidence_init_total | Initial confidence score | Human confidence before AI |
| confidence_fin_total | Final confidence score | Confidence after AI assistance |
| complt_init | Initial completed decisions | Initial workload |
| complt_fin | Final completed decisions | Final workload |
| fully_complt | Fully completed reviews | Actual completed workload |

Data-Cleaning

Officer-day records with fully_complt = 0 were removed because no loan review in them completed both stages of the decision process, so they are not suitable for evaluating AI-assisted decision quality.

# original data
n_before <- nrow(df)

# keep only records with at least one review that completed the full process (Officer -> Model -> Final)
data_clean <- df %>%
    filter(fully_complt > 0)

# after cleaning
n_after <- nrow(data_clean)

removed <- n_before-n_after

cleaning_summary <- data.frame(
    Before=n_before,
    After=n_after,
    Removed=removed
)
group_summary <- data_clean %>%
  group_by(Variant) %>%
  summarise(
    n_observations = n(),
    n_officers = n_distinct(loanofficer_id)
  )
Variant n_observations n_officers
Control 100 10
Treatment 280 28

After data cleaning, the dataset contained:

  • Control group: 100 observations from 10 loan officers
  • Treatment group: 280 observations from 28 loan officers

Although the groups are unbalanced (10 vs. 28 officers), every officer contributes the same number of daily records (10). Subsequent analyses were conducted at the loan-officer level rather than the observation level so that repeated records from the same officer are not treated as independent observations.

Officer-Level Aggregation and Metric Construction

The dataset contains repeated observations for each loan officer across a 10-day review period. During these ten days, each officer processed multiple loan applications and generated multiple decision records. Therefore, the raw dataset contains repeated measurements for the same individual officer.

To evaluate the overall performance of each loan officer and avoid treating repeated daily records as independent observations, the data were aggregated at the officer level. All relevant variables were summed across the 10-day period for each loanofficer_id.

The aggregated variables include:

  • Initial and final Type I errors
  • Initial and final Type II errors
  • AI Type I and Type II errors
  • Numbers of good loans and bad loans
  • Agreement and conflict measures
  • Revision behavior toward AI recommendations
  • Completion metrics
officer_summary <- data_clean %>%
  group_by(loanofficer_id, Variant) %>%
  summarise(
    total_typeI_init = sum(typeI_init),
    total_typeI_fin = sum(typeI_fin),
    total_typeII_init = sum(typeII_init),
    total_typeII_fin = sum(typeII_fin),
    total_ai_typeI = sum(ai_typeI),
    total_ai_typeII = sum(ai_typeII),
    total_badloans = sum(badloans_num),
    total_goodloans = sum(goodloans_num),
    total_agree_init = sum(agree_init),
    total_agree_fin = sum(agree_fin),
    total_conflict_init = sum(conflict_init),
    total_conflict_fin = sum(conflict_fin),
    total_revised_per_ai = sum(revised_per_ai),
    total_revised_agst_ai = sum(revised_agst_ai),
    total_complt_init = sum(complt_init),
    total_complt_fin = sum(complt_fin),
    total_fully_complt = sum(fully_complt),
    .groups = "drop"
  ) %>%
  mutate(
    Variant = factor(Variant, levels = c("Control", "Treatment"))
  ) %>%
  arrange(Variant, loanofficer_id)
loanofficer_id Variant total_typeI_init total_typeI_fin total_typeII_init total_typeII_fin total_ai_typeI total_ai_typeII total_badloans total_goodloans total_agree_init total_agree_fin total_conflict_init total_conflict_fin total_revised_per_ai total_revised_agst_ai total_complt_init total_complt_fin total_fully_complt
0g7pi6g8 Control 33 37 14 13 24 16 30 70 64 75 35 24 12 1 99 99 99
0gh7r2hr Control 23 23 15 16 24 16 30 70 75 85 21 12 9 0 96 97 96
bzeya726 Control 23 23 14 13 24 16 30 70 79 82 18 18 1 1 97 100 97
dlpxpwdj Control 49 52 8 8 24 16 30 70 50 54 38 40 0 0 88 94 88
i6miisiq Control 50 50 9 9 24 16 30 70 48 59 42 36 8 2 90 95 90
p5g1bxa1 Control 30 31 14 14 24 16 30 70 77 79 22 21 5 4 99 100 99
qwun9ha5 Control 35 34 12 11 24 16 30 70 63 68 31 26 5 0 96 94 94
sarganjx Control 20 23 14 17 24 16 30 70 75 88 17 12 8 1 92 100 92
ugdh6i8o Control 43 50 8 8 24 16 30 70 64 64 32 36 1 5 96 100 96
uui3fiii Control 26 29 12 9 24 16 30 70 71 76 21 18 5 1 92 94 92
0899qxvc Treatment 17 18 7 7 12 8 30 70 79 83 12 9 4 1 91 92 91
09pij0e2 Treatment 31 27 8 8 12 8 30 70 70 79 26 18 9 1 96 97 96
1ckkyukp Treatment 11 15 11 12 12 8 30 70 80 88 13 11 5 1 93 99 93
1ha5khxo Treatment 16 16 10 8 12 8 30 70 83 87 13 12 6 5 96 99 96
2twvlktb Treatment 29 14 10 8 12 8 30 70 65 82 32 18 20 5 97 100 97
4cdwcblq Treatment 18 17 12 9 12 8 30 70 72 80 20 15 6 1 92 95 92
530lfgx0 Treatment 30 16 10 9 12 8 30 70 69 95 25 5 21 0 94 100 94
7bx6hbg5 Treatment 33 20 8 9 12 8 30 70 74 89 25 11 14 0 99 100 99
7fyegrit Treatment 30 15 10 11 12 8 30 70 72 90 28 10 20 2 100 100 100
92vdohom Treatment 18 21 7 6 12 8 30 70 81 81 12 16 1 5 94 97 93
9lejzokf Treatment 62 22 3 7 12 8 30 70 37 81 55 13 43 0 94 94 92
9splhe3u Treatment 12 14 20 9 12 8 30 70 63 97 31 3 28 0 94 100 94
envu2p1p Treatment 23 16 6 6 12 8 30 70 64 76 12 7 6 0 83 83 76
ffg5z8wh Treatment 21 16 9 8 12 8 30 70 75 90 19 10 9 0 94 100 94
gdm2odtq Treatment 29 28 7 7 12 8 30 70 71 78 23 20 5 1 94 98 94
jtifmrhp Treatment 19 21 7 7 12 8 30 70 66 80 17 13 5 0 83 93 83
ju0595ih Treatment 16 14 12 10 12 8 30 70 77 90 22 10 13 0 99 100 99
kmr3oifc Treatment 22 18 7 8 12 8 30 70 79 85 14 9 5 0 93 94 93
o2vs2awq Treatment 13 14 9 8 12 8 30 70 82 94 15 6 9 0 97 100 97
qamcqdoe Treatment 6 15 21 9 12 8 30 70 72 96 27 4 23 0 99 100 99
qo1puiqt Treatment 16 16 9 9 12 8 30 70 76 80 15 14 3 2 91 94 91
srk7424h Treatment 29 24 5 8 12 8 30 70 67 86 24 14 12 0 91 100 91
sybgf6ws Treatment 32 23 10 5 12 8 30 70 67 83 27 13 15 1 94 96 94
uybljp0c Treatment 15 14 12 9 12 8 30 70 73 82 20 12 8 0 93 94 93
v0ml3nyf Treatment 30 33 5 5 12 8 30 70 66 74 25 23 5 0 91 97 91
vflkw3iq Treatment 46 41 5 5 12 8 30 70 54 64 40 33 8 0 94 97 94
yc74rzbp Treatment 21 15 12 11 12 8 30 70 77 92 18 8 11 0 95 100 95
z0icpf5l Treatment 16 20 11 7 12 8 30 70 74 91 24 9 15 0 98 100 98

Preliminary Investigation of Model Performance

Before defining the Overall Evaluation Criteria (OECs), the Model-related metrics were examined.

During data preparation, all loan officers within the same experimental variant were found to have identical AI Type I and AI Type II error counts. This indicates that these metrics reflect the performance of the AI model assigned to each variant rather than individual officer behaviour.
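
The observation above can be checked with a short sketch. It uses four rows copied from the officer-level summary table (two per variant) as a stand-in for `officer_summary`; the names `officer_sample` and `ai_check` are illustrative and not part of the original analysis.

```r
library(dplyr)
library(tibble)

# Four rows copied from the officer-level summary table (two per variant).
officer_sample <- tribble(
  ~loanofficer_id, ~Variant,    ~total_ai_typeI, ~total_ai_typeII,
  "0g7pi6g8",      "Control",   24,              16,
  "0gh7r2hr",      "Control",   24,              16,
  "0899qxvc",      "Treatment", 12,              8,
  "09pij0e2",      "Treatment", 12,              8
)

# One distinct value per variant means the metric describes the model,
# not the individual officer.
ai_check <- officer_sample %>%
  group_by(Variant) %>%
  summarise(
    distinct_ai_typeI  = n_distinct(total_ai_typeI),
    distinct_ai_typeII = n_distinct(total_ai_typeII),
    .groups = "drop"
  )

ai_check
```

Running the same `summarise()` on the full `officer_summary` would confirm the pattern for all 38 officers.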

ai_model_performance <- officer_summary %>%
  mutate(
    ai_typeII_rate = total_ai_typeII / total_badloans
  ) %>%
  group_by(Variant) %>%
  summarise(
    ai_typeII_rate = unique(ai_typeII_rate),
    n_officers = n(),
    .groups = "drop"
  ) %>%
  mutate(
    ai_typeII_rate = round(ai_typeII_rate, 3),
    ai_typeII_rate_pct = paste0(round(ai_typeII_rate * 100, 1), "%")
  )
\[ AI\ TypeII\ Rate= \frac{ai\_typeII} {badloans} \]
Preliminary Investigation of AI Model Type II Error Rate
Variant AI Type II Rate Number of Officers AI Type II Rate (%)
Control 0.533 10 53.3%
Treatment 0.267 28 26.7%

A descriptive comparison shows that the Treatment model achieved a substantially lower AI Type II error rate than the Control model (26.7% vs. 53.3%), suggesting improved predictive performance in identifying bad loans.

Because these metrics do not vary across officers within the same variant, they are not suitable for inferential statistical testing and are instead used as diagnostic indicators of model quality.

As the objective of the experiment is to assess whether improved AI predictions lead to better loan decisions, the subsequent OECs focus on outcomes arising from human–AI interaction rather than model performance alone.

Overall Evaluation Criteria (OEC) Definition

To evaluate whether the new AI model improves loan officers’ decision quality, one primary OEC and several key supporting OECs were defined. The primary OEC focuses on the company’s main business objective, while the supporting OECs help explain the mechanism through which the model may influence decision performance.

Primary OEC – Final Type II Rate

\[ Final\ TypeII\ Rate= \frac{typeII_{fin}} {Total\ Badloans} \]

The primary OEC is defined as the proportion of bad loans that were incorrectly approved after AI-assisted review.

This metric was selected because the company’s primary concern is approving loans that later default. Reducing Type II errors directly lowers potential financial losses and reflects the key business objective.

A lower Final Type II Rate indicates better decision quality (smaller-is-better). Therefore, the Treatment group is expected to achieve a lower Final Type II Rate than the Control group, suggesting that the new AI model can better support officers in identifying risky loan applications.

Supporting OEC 1 – Type II Reduction

\[ TypeII\ Reduction= Init\ TypeII\ Rate - Final\ TypeII\ Rate \]

Type II Reduction measures how much officers corrected Type II errors after receiving AI recommendations by comparing initial and final Type II error rates.

A larger value indicates a greater reduction in Type II errors after interacting with AI recommendations (larger-is-better). This metric helps evaluate whether AI assistance improves decision-making performance beyond officers’ initial decisions.

The Treatment group is expected to show a larger Type II Reduction, suggesting that the new AI model may better support officers in identifying and correcting risky approval decisions.

Supporting OEC 2 – Agreement Lift

\[ Agreement\ Lift= Agree_{fin\_rate} - Agree_{init\_rate} \]

Agreement Lift measures the increase in agreement between officers and AI recommendations from initial to final decisions.

A larger value indicates that officers became more aligned with AI recommendations after receiving AI assistance (larger-is-better). This metric reflects the extent to which officers adjusted their decisions toward AI recommendations.

Supporting OEC 3 – Conflict Reduction

\[ Conflict\ Reduction= Conflict_{init\_rate} - Conflict_{fin\_rate} \]

Conflict Reduction measures the decrease in disagreement between officers and AI recommendations from initial to final decisions.

A larger value indicates fewer conflicts between officers and AI recommendations after AI assistance (larger-is-better). Together with Agreement Lift, this metric reflects changes in alignment between officers and the AI during the decision-making process.

officer_summary <- officer_summary %>%
  mutate(
    # Primary OEC
    final_typeII_rate = total_typeII_fin / total_badloans,

    # Supporting OEC 1
    init_typeII_rate = total_typeII_init / total_badloans,
    typeII_reduction = init_typeII_rate - final_typeII_rate,

    # Supporting OEC 2
    agree_init_rate = total_agree_init / total_fully_complt,
    agree_fin_rate = total_agree_fin / total_fully_complt,
    agreement_lift = agree_fin_rate - agree_init_rate,

    # Supporting OEC 3
    conflict_init_rate = total_conflict_init / total_fully_complt,
    conflict_fin_rate = total_conflict_fin / total_fully_complt,
    conflict_reduction = conflict_init_rate - conflict_fin_rate
  )
loanofficer_id Variant final_typeII_rate typeII_reduction agreement_lift conflict_reduction
0g7pi6g8 Control 0.433 0.033 0.111 0.111
0gh7r2hr Control 0.533 -0.033 0.104 0.094
bzeya726 Control 0.433 0.033 0.031 0.000
dlpxpwdj Control 0.267 0.000 0.045 -0.023
i6miisiq Control 0.300 0.000 0.122 0.067
p5g1bxa1 Control 0.467 0.000 0.020 0.010
qwun9ha5 Control 0.367 0.033 0.053 0.053
sarganjx Control 0.567 -0.100 0.141 0.054
ugdh6i8o Control 0.267 0.000 0.000 -0.042
uui3fiii Control 0.300 0.100 0.054 0.033
0899qxvc Treatment 0.233 0.000 0.044 0.033
09pij0e2 Treatment 0.267 0.000 0.094 0.083
1ckkyukp Treatment 0.400 -0.033 0.086 0.022
1ha5khxo Treatment 0.267 0.067 0.042 0.010
2twvlktb Treatment 0.267 0.067 0.175 0.144
4cdwcblq Treatment 0.300 0.100 0.087 0.054
530lfgx0 Treatment 0.300 0.033 0.277 0.213
7bx6hbg5 Treatment 0.300 -0.033 0.152 0.141
7fyegrit Treatment 0.367 -0.033 0.180 0.180
92vdohom Treatment 0.200 0.033 0.000 -0.043
9lejzokf Treatment 0.233 -0.133 0.478 0.457
9splhe3u Treatment 0.300 0.367 0.362 0.298
envu2p1p Treatment 0.200 0.000 0.158 0.066
ffg5z8wh Treatment 0.267 0.033 0.160 0.096
gdm2odtq Treatment 0.233 0.000 0.074 0.032
jtifmrhp Treatment 0.233 0.000 0.169 0.048
ju0595ih Treatment 0.333 0.067 0.131 0.121
kmr3oifc Treatment 0.267 -0.033 0.065 0.054
o2vs2awq Treatment 0.267 0.033 0.124 0.093
qamcqdoe Treatment 0.300 0.400 0.242 0.232
qo1puiqt Treatment 0.300 0.000 0.044 0.011
srk7424h Treatment 0.267 -0.100 0.209 0.110
sybgf6ws Treatment 0.167 0.167 0.170 0.149
uybljp0c Treatment 0.300 0.100 0.097 0.086
v0ml3nyf Treatment 0.167 0.000 0.088 0.022
vflkw3iq Treatment 0.167 0.000 0.106 0.074
yc74rzbp Treatment 0.367 0.033 0.158 0.105
z0icpf5l Treatment 0.233 0.133 0.173 0.153
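
As a spot check of the OEC definitions, one row of the table above can be reproduced by hand. The sketch below uses the totals for officer 9splhe3u copied from the officer-level summary table; the scalar variable names are illustrative.

```r
# Totals for officer 9splhe3u, copied from the officer-level summary table.
typeII_init   <- 20
typeII_fin    <- 9
badloans      <- 30
agree_init    <- 63
agree_fin     <- 97
conflict_init <- 31
conflict_fin  <- 3
fully_complt  <- 94

# Same formulas as in the mutate() call, applied to a single officer.
final_typeII_rate  <- typeII_fin / badloans
typeII_reduction   <- typeII_init / badloans - final_typeII_rate
agreement_lift     <- (agree_fin - agree_init) / fully_complt
conflict_reduction <- (conflict_init - conflict_fin) / fully_complt

oec_row <- round(
  c(final_typeII_rate, typeII_reduction, agreement_lift, conflict_reduction),
  3
)
oec_row
```

The rounded values match the 9splhe3u row (0.300, 0.367, 0.362, 0.298).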

Statistical Assumption Assessment

Normality Assessment

Before selecting statistical methods, the distribution of each OEC was examined within the Control and Treatment groups.

Normality was assessed using three complementary approaches. First, the Shapiro–Wilk test was used as a formal statistical test. Second, skewness, kurtosis, mean, median, and standard deviation were calculated to summarize distribution shape and potential asymmetry (kurtosis is reported on the Pearson scale used by the moments package, on which a normal distribution has a value of 3). Third, Q-Q plots and histograms were used to visually inspect whether each metric followed an approximately normal distribution.

The hypotheses for the Shapiro–Wilk test are:

\[ H_0: \text{The data follows a normal distribution} \]

\[ H_1: \text{The data does not follow a normal distribution} \]

A p-value greater than 0.05 indicates insufficient evidence to reject the normality assumption, while a p-value below 0.05 suggests that the metric may deviate from normality.

vars <- c(
  "final_typeII_rate",
  "typeII_reduction",
  "agreement_lift",
  "conflict_reduction"
)

normality_summary <- lapply(vars, function(v){

  officer_summary %>%
    group_by(Variant) %>%
    summarise(
      Variable = v,

      Shapiro_p =
        shapiro.test(.data[[v]])$p.value,

      Skewness =
        skewness(.data[[v]], na.rm=TRUE),

      Kurtosis =
        kurtosis(.data[[v]], na.rm=TRUE),

      Mean =
        mean(.data[[v]], na.rm=TRUE),

      Median =
        median(.data[[v]], na.rm=TRUE),

      SD =
        sd(.data[[v]], na.rm=TRUE),

      .groups="drop"
    )

}) %>%
bind_rows()

normality_summary
## # A tibble: 8 × 8
##   Variant   Variable           Shapiro_p Skewness Kurtosis    Mean Median     SD
##   <fct>     <chr>                  <dbl>    <dbl>    <dbl>   <dbl>  <dbl>  <dbl>
## 1 Control   final_typeII_rate   0.300      0.245      1.69 0.393   0.4    0.110 
## 2 Treatment final_typeII_rate   0.218      0.148      2.78 0.268   0.267  0.0591
## 3 Control   typeII_reduction    0.292     -0.348      3.64 0.00667 0      0.0516
## 4 Treatment typeII_reduction    0.000118   1.74       6.37 0.0452  0.0167 0.114 
## 5 Control   agreement_lift      0.503      0.182      1.67 0.0683  0.0538 0.0479
## 6 Treatment agreement_lift      0.00331    1.49       5.71 0.148   0.141  0.101 
## 7 Control   conflict_reduction  0.936     -0.0594     1.95 0.0357  0.0429 0.0495
## 8 Treatment conflict_reduction  0.00281    1.60       6.28 0.109   0.0894 0.101

The Q-Q plots provide visual evidence of normality by comparing the observed data distribution with a theoretical normal distribution. If the points closely follow the reference line, the normality assumption is more plausible. Histograms were also used to identify skewness, heavy tails, or irregular distribution patterns.
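
The plotting code is not shown in this report, so the following is only a sketch of how such a Q-Q plot could be drawn with ggplot2 for the Final Type II Rate. The rates are rebuilt from the final Type II counts in the officer-level summary table (30 bad loans per officer); `qq_data` and `qq_plot` are illustrative names.

```r
library(ggplot2)

# Final Type II error counts per officer (out of 30 bad loans each),
# copied from the officer-level summary table.
control_fin   <- c(13, 16, 13, 8, 9, 14, 11, 17, 8, 9)
treatment_fin <- c(7, 8, 12, 8, 8, 9, 9, 9, 11, 6, 7, 9, 6, 8,
                   7, 7, 10, 8, 8, 9, 9, 8, 5, 9, 5, 5, 11, 7)

qq_data <- data.frame(
  Variant = rep(c("Control", "Treatment"),
                times = c(length(control_fin), length(treatment_fin))),
  final_typeII_rate = c(control_fin, treatment_fin) / 30
)

# One panel per variant; points near the reference line suggest normality.
qq_plot <- ggplot(qq_data, aes(sample = final_typeII_rate)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Variant, scales = "free") +
  labs(
    title = "Q-Q plot: Final Type II Rate",
    x = "Theoretical quantiles",
    y = "Sample quantiles"
  )

qq_plot
```

The same pattern applies to the other three OECs by swapping the plotted column.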

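The results showed that Final Type II Rate was consistent with normality in both groups (Shapiro–Wilk p = 0.300 and 0.218). Type II Reduction, Agreement Lift, and Conflict Reduction were consistent with normality in the Control group but departed from it in the Treatment group (p < 0.01, with skewness of about 1.5 to 1.7), so these three metrics were treated as violating the normality assumption.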

Homogeneity of Variance Assessment

Levene’s test was used because it is more robust than the traditional F-test when the data may deviate from normality.

The hypotheses for Levene’s test are:

\[ H_0: \sigma^2_{Control} = \sigma^2_{Treatment} \]

\[ H_1: \sigma^2_{Control} \neq \sigma^2_{Treatment} \]

where \(\sigma^2_{Control}\) and \(\sigma^2_{Treatment}\) represent the variances of the OEC in the Control and Treatment groups, respectively.

A p-value greater than 0.05 indicates insufficient evidence to reject the equal variance assumption, suggesting that the homogeneity of variance assumption is satisfied. A p-value below 0.05 indicates unequal variances between groups.

# Levene test
levene_results <- map_dfr(vars, function(v){

  result <- leveneTest(
    as.formula(paste(v, "~ Variant")),
    data = officer_summary
  )

  tibble(
    Variable = v,
    F_value = round(result$`F value`[1],3),
    P_value = round(result$`Pr(>F)`[1],4),
    
    Assumption = ifelse(
      result$`Pr(>F)`[1] > 0.05,
      "Satisfied (Equal Variance)",
      "Not satisfied (Unequal Variance)"
    )
  )

})

levene_results
## # A tibble: 4 × 4
##   Variable           F_value P_value Assumption                      
##   <chr>                <dbl>   <dbl> <chr>                           
## 1 final_typeII_rate    10.4   0.0027 Not satisfied (Unequal Variance)
## 2 typeII_reduction      1.56  0.220  Satisfied (Equal Variance)      
## 3 agreement_lift        1.94  0.172  Satisfied (Equal Variance)      
## 4 conflict_reduction    1.41  0.242  Satisfied (Equal Variance)

Statistical Test and Effect Size Selection

After evaluating the normality and homogeneity of variance assumptions, appropriate statistical tests were selected for each OEC.

For variables that satisfied the normality assumption, parametric tests were preferred because they generally provide greater statistical power. When the equal variance assumption was violated, Welch’s t-test was selected because it is robust to unequal variances between groups.

For variables that violated the normality assumption, the Wilcoxon rank-sum test was used. This non-parametric alternative does not require normally distributed data and is suitable for comparing two independent groups.

The decision rules used for statistical test selection are summarized below:

Statistical Test Selection Rules
| Normality | Homogeneity of Variance | Statistical Test |
|---|---|---|
| Satisfied | Satisfied | Independent t-test |
| Satisfied | Not satisfied | Welch’s t-test |
| Not satisfied | Either satisfied or not satisfied | Wilcoxon rank-sum test |
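
The selection rules above can be written as a small helper. This is a minimal sketch; `choose_test` is a hypothetical function name and is not part of the original analysis.

```r
# Map the two assumption checks to the test named in the selection rules.
choose_test <- function(normal, equal_var) {
  if (!normal) {
    "Wilcoxon rank-sum test"   # normality violated: variance check is irrelevant
  } else if (equal_var) {
    "Independent t-test"       # both assumptions satisfied
  } else {
    "Welch's t-test"           # normal but unequal variances
  }
}

choose_test(normal = TRUE,  equal_var = FALSE)  # e.g. Final Type II Rate
choose_test(normal = FALSE, equal_var = TRUE)   # e.g. Type II Reduction
```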

While hypothesis tests indicate whether an observed difference is statistically significant, they do not describe the magnitude or practical importance of that difference. Therefore, effect sizes were calculated to quantify the strength of the observed effects.

Different effect size measures were used depending on the statistical test applied.

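For t-tests, Cohen’s d was used to measure the standardized difference between group means. For Wilcoxon rank-sum tests, the Wilcoxon effect size r was used, computed by rstatix::wilcox_effsize() as the standardized test statistic Z divided by the square root of the total sample size. This is a different quantity from the rank-biserial correlation, and the thresholds below apply to r as defined here.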

The interpretation guidelines are summarized below:

Effect Size Interpretation Guidelines
| Effect Size Measure | Small | Medium | Large |
|---|---|---|---|
| Cohen’s d | 0.20 | 0.50 | 0.80 |
| Wilcoxon r | 0.10 | 0.30 | 0.50 |

The thresholds above were used to interpret the practical magnitude of the observed effects.
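
For orientation, the r reported by wilcox_effsize() can be written as

\[ r = \frac{|Z|}{\sqrt{N}} \]

where \(Z\) is the standardized Wilcoxon test statistic and \(N\) is the total number of officers in the two groups. As a worked example with \(N = 10 + 28 = 38\), the thresholds of 0.10, 0.30, and 0.50 correspond to \(|Z|\) values of roughly 0.62, 1.85, and 3.08.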

The final statistical methods and corresponding effect size measures selected for each OEC are summarized below:
Statistical Test and Effect Size Selection for Each OEC
| OEC | Normality | Homogeneity of Variance | Statistical Test | Effect Size Measure |
|---|---|---|---|---|
| Final Type II Rate | Satisfied | Not satisfied | Welch’s t-test | Cohen’s d |
| Type II Reduction | Not satisfied | Satisfied | Wilcoxon rank-sum test | Wilcoxon r |
| Agreement Lift | Not satisfied | Satisfied | Wilcoxon rank-sum test | Wilcoxon r |
| Conflict Reduction | Not satisfied | Satisfied | Wilcoxon rank-sum test | Wilcoxon r |

Statistical Test and Results

Primary OEC

\[ H_0:\ \mu_{Control} \leq \mu_{Treatment} \]

\[ H_1:\ \mu_{Control} > \mu_{Treatment} \]

where \(\mu\) represents the mean Final Type II Rate.

t_final_typeII_rate <- t.test(
final_typeII_rate~Variant,
var.equal=FALSE,
data=officer_summary
)
t_final_typeII_rate
## 
##  Welch Two Sample t-test
## 
## data:  final_typeII_rate by Variant
## t = 3.4409, df = 10.923, p-value = 0.005571
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
##  0.04514467 0.20580771
## sample estimates:
##   mean in group Control mean in group Treatment 
##               0.3933333               0.2678571

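The test as run is two-sided (p = 0.0056), whereas the hypotheses above are one-sided. Because the observed difference is in the direction of the alternative hypothesis, the corresponding one-sided p-value is half the two-sided value (about 0.0028). In either case p < 0.05, so the null hypothesis is rejected.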

The Treatment model significantly reduced the proportion of bad loans incorrectly approved by loan officers.

Compared with the Control model, the Treatment model reduced the Final Type II Rate by approximately:

\[ 39.3\% - 26.8\% = 12.5\% \]

This corresponds to a reduction of 12.5 percentage points in Final Type II errors.
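
A directional version of the test, matching the alternative hypothesis as stated, can be sketched as follows. The rates are rebuilt from the final Type II counts in the officer-level summary table (30 bad loans per officer), and the vector names are illustrative. Control is passed as the first argument, so `alternative = "greater"` tests whether the Control mean exceeds the Treatment mean.

```r
# Final Type II rates per officer, rebuilt from the officer-level table.
control_rate   <- c(13, 16, 13, 8, 9, 14, 11, 17, 8, 9) / 30
treatment_rate <- c(7, 8, 12, 8, 8, 9, 9, 9, 11, 6, 7, 9, 6, 8,
                    7, 7, 10, 8, 8, 9, 9, 8, 5, 9, 5, 5, 11, 7) / 30

# Welch's t-test with a one-sided alternative: Control > Treatment.
welch_one_sided <- t.test(
  control_rate, treatment_rate,
  alternative = "greater",
  var.equal = FALSE
)

welch_one_sided$statistic  # t statistic
welch_one_sided$parameter  # Welch-Satterthwaite degrees of freedom
welch_one_sided$p.value    # one-sided p-value
```

The t statistic and degrees of freedom are the same as in the two-sided output; only the p-value and the confidence interval change.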

cohen.d(
 final_typeII_rate~Variant,
 data=officer_summary
)
## 
## Cohen's d
## 
## d estimate: 1.671575 (large)
## 95 percent confidence interval:
##     lower     upper 
## 0.8292943 2.5138561

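A Cohen’s d of 1.67 is well above the 0.80 threshold for a large effect, so the difference between the Control and Treatment groups is not only statistically significant but also practically meaningful. By default, cohen.d() standardizes by the pooled standard deviation, which assumes equal variances; because Levene’s test rejected that assumption for this metric, the exact value of d should be read with some caution.

Taken together, the statistically significant Welch’s t-test result and the large effect size suggest that the Treatment model produced a substantial reduction in Final Type II errors and a meaningful improvement in loan approval quality.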
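
To make the effect size calculation transparent, the sketch below recomputes the pooled-SD Cohen’s d by hand from the same officer-level rates. As a sensitivity check that is not part of the original analysis, it also computes Glass’s delta, which standardizes by the Control group’s standard deviation only and so does not rely on equal variances. Variable names are illustrative.

```r
# Final Type II rates per officer, rebuilt from the officer-level table.
control_rate   <- c(13, 16, 13, 8, 9, 14, 11, 17, 8, 9) / 30
treatment_rate <- c(7, 8, 12, 8, 8, 9, 9, 9, 11, 6, 7, 9, 6, 8,
                    7, 7, 10, 8, 8, 9, 9, 8, 5, 9, 5, 5, 11, 7) / 30

n_c <- length(control_rate)
n_t <- length(treatment_rate)
mean_diff <- mean(control_rate) - mean(treatment_rate)

# Cohen's d: mean difference divided by the pooled standard deviation.
pooled_sd <- sqrt(
  ((n_c - 1) * var(control_rate) + (n_t - 1) * var(treatment_rate)) /
    (n_c + n_t - 2)
)
cohens_d <- mean_diff / pooled_sd

# Glass's delta: mean difference divided by the Control group's SD.
glass_delta <- mean_diff / sd(control_rate)

c(cohens_d = cohens_d, glass_delta = glass_delta)
```

The hand-computed d reproduces the cohen.d() estimate of about 1.67. Glass’s delta is smaller because the Control group has the larger spread, but it remains above the 0.80 threshold for a large effect.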

Supporting OEC

\[ H_0:\ Control \geq Treatment \]

\[ H_1:\ Control < Treatment \]

w_typeII <- wilcox.test(
  typeII_reduction ~ Variant,
  data=officer_summary,
  alternative="less"
)

w_agreement <- wilcox.test(
  agreement_lift ~ Variant,
  data=officer_summary,
  alternative="less"
)

w_conflict <- wilcox.test(
  conflict_reduction ~ Variant,
  data=officer_summary,
  alternative="less"
)
w_typeII
## 
##  Wilcoxon rank sum exact test
## 
## data:  typeII_reduction by Variant
## W = 120, p-value = 0.2559
## alternative hypothesis: true location shift is less than 0
officer_summary %>%
  wilcox_effsize(
    typeII_reduction ~ Variant
  )
## # A tibble: 1 × 7
##   .y.              group1  group2    effsize    n1    n2 magnitude
## * <chr>            <chr>   <chr>       <dbl> <int> <int> <ord>    
## 1 typeII_reduction Control Treatment   0.110    10    28 small
w_agreement
## 
##  Wilcoxon rank sum exact test
## 
## data:  agreement_lift by Variant
## W = 63.5, p-value = 0.004895
## alternative hypothesis: true location shift is less than 0
officer_summary %>%
  wilcox_effsize(
    agreement_lift ~ Variant
  )
## # A tibble: 1 × 7
##   .y.            group1  group2    effsize    n1    n2 magnitude
## * <chr>          <chr>   <chr>       <dbl> <int> <int> <ord>    
## 1 agreement_lift Control Treatment   0.411    10    28 moderate
w_conflict
## 
##  Wilcoxon rank sum exact test
## 
## data:  conflict_reduction by Variant
## W = 71.5, p-value = 0.0109
## alternative hypothesis: true location shift is less than 0
officer_summary %>%
  wilcox_effsize(
    conflict_reduction ~ Variant
  )
## # A tibble: 1 × 7
##   .y.                group1  group2    effsize    n1    n2 magnitude
## * <chr>              <chr>   <chr>       <dbl> <int> <int> <ord>    
## 1 conflict_reduction Control Treatment   0.368    10    28 moderate
Overall OEC Results Summary
| OEC | Statistical Test | P-value | Significant | Effect Size | Magnitude | Interpretation |
|---|---|---|---|---|---|---|
| Final Type II Rate | Welch’s t-test | 0.0056 (two-sided) | Yes | d = 1.67 | Large | Treatment reduced Final Type II Rate |
| Type II Reduction | Wilcoxon | 0.2559 (one-sided) | No | r = 0.110 | Small | No evidence of greater Type II Reduction |
| Agreement Lift | Wilcoxon | 0.0049 (one-sided) | Yes | r = 0.411 | Moderate | Higher adoption of AI recommendations |
| Conflict Reduction | Wilcoxon | 0.0109 (one-sided) | Yes | r = 0.368 | Moderate | Lower disagreement with AI |

Based on the statistical analyses above, several key findings can be drawn.

First, the Final Type II Rate was significantly lower for officers working with the new AI model, indicating a lower risk of ultimately approving bad loans. The new model itself also had a lower AI Type II error rate than the legacy model used by the Control group (26.7% vs. 53.3%).

In addition, both Agreement Lift and Conflict Reduction were significantly larger in the Treatment group. These results suggest a consistent behavioral pattern: after being exposed to recommendations from the new AI model, loan officers were more likely to align their decisions with the AI’s recommendations. This is consistent with a higher level of trust in, and adoption of, the new model, resulting in stronger human-AI decision alignment than with the legacy model.

However, although the Final Type II Rate was significantly lower among users of the new model, Type II Reduction (a metric intended to capture the AI’s ability to improve human decisions) was not significantly larger in the Treatment group. There are at least two plausible explanations for this result.

1. Insufficient Statistical Power

The Control group included only 10 loan officers versus 28 in the Treatment group, so the comparison is both small and unbalanced. With an observed effect size of only r = 0.11, the study may have lacked sufficient statistical power to detect a real but modest difference.

As a result, the non-significant finding may reflect a statistical Type II error rather than the absence of a true improvement effect from the new model.
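
The power concern can be illustrated with a small simulation. This is only a sketch under stated assumptions, none of which come from the study: both groups are drawn from normal distributions with equal standard deviation, and `shift_sd` is a hypothetical Treatment advantage expressed in SD units. The function name `simulate_power` is illustrative.

```r
set.seed(123)

# Estimate the power of a one-sided Wilcoxon rank-sum test by simulation.
# Assumptions (not from the study): normal data, equal SD in both groups,
# and a true Treatment advantage of `shift_sd` standard deviations.
simulate_power <- function(shift_sd, n_control = 10, n_treatment = 28,
                           n_sims = 2000, alpha = 0.05) {
  rejections <- replicate(n_sims, {
    control   <- rnorm(n_control, mean = 0)
    treatment <- rnorm(n_treatment, mean = shift_sd)
    # alternative = "less": Control is shifted below Treatment
    wilcox.test(control, treatment, alternative = "less")$p.value < alpha
  })
  mean(rejections)  # share of simulated experiments that reject H0
}

power_small <- simulate_power(shift_sd = 0.2)
power_large <- simulate_power(shift_sd = 1.0)
c(small = power_small, large = power_large)
```

Comparing the two estimates shows how much harder a small shift is to detect than a large one with 10 and 28 officers per group; the real data are skewed, so these figures are indicative only.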

2. Baseline Differences Between Groups

If the Treatment group already had a relatively low initial Type II Error rate, the new AI model would have less room to improve performance. In contrast, the Control group’s higher initial error rate would allow greater improvement even with the legacy model.

To test this, an additional analysis was conducted on initial Type II Error rates before Model intervention.

Baseline Type II Error Rate Comparison Summary
| Metric | Result |
|---|---|
| Control mean Initial Type II Error Rate | 0.400 |
| Treatment mean Initial Type II Error Rate | 0.313 |
| Control median Initial Type II Error Rate | 0.433 |
| Treatment median Initial Type II Error Rate | 0.300 |
| Control normality | Shapiro-Wilk p = 0.0307 (Not normal) |
| Treatment normality | Shapiro-Wilk p = 0.0024 (Not normal) |
| Variance homogeneity | Levene p = 0.5706 (Equal variance) |
| Selected statistical test | Wilcoxon rank-sum test |
| Test statistic | W = 213.5 |
| p-value | 0.0128 |
| Conclusion | Significant baseline difference |

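The results are consistent with this explanation: the Control group had a significantly higher initial Type II Rate than the Treatment group (0.400 vs. 0.313). This baseline difference may help explain why Type II Reduction was not statistically significant. While the new model achieved better final outcomes, the Treatment group’s lower starting error rate limited its potential improvement, making the observed reduction comparable to that of the Control group. The same gap also means that part of the difference in Final Type II Rate (0.393 vs. 0.268) was already present before any AI recommendation was shown, so the primary OEC result should be interpreted with this imbalance in mind.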

Conclusion and Recommendations

Taken together, the results indicate that the new AI model provides meaningful business value by reducing bad-loan approvals and improving human-AI decision alignment. The primary OEC was statistically significant with a large effect size (Cohen’s d = 1.67), providing strong evidence that the new model outperforms the legacy model.

Although Type II Reduction was not statistically significant, this is likely due to limited statistical power and baseline differences between groups rather than a lack of model effectiveness. Overall, the findings support adoption of the new model while continuing to strengthen the evidence through further data collection.

Based on these findings, the following actions are recommended:

1. Proceed with a Controlled Rollout

The model has demonstrated a measurable reduction in bad-loan approval risk. To avoid continued exposure to the higher risk associated with the legacy model, a controlled rollout is recommended while performance continues to be monitored.

2. Continue Data Collection and Improve Experimental Design

Additional data should be collected during rollout to increase statistical power and confidence in the results. Future studies should aim for more balanced Control and Treatment groups (ideally 20–30 participants each). Where resources are limited, cross-over or matched-pair designs may help reduce individual-level variation.

3. Broaden Evaluation Metrics and Analysis

Future evaluations should assess both Type II and Type I Errors to better understand the risk-return trade-off and ensure risk reduction does not come at the cost of rejecting profitable loans.

If case-level data become available, decision-path analysis (Human Initial Decision × AI Recommendation × Human Final Decision × Actual Loan Outcome) could provide deeper insight into how and when the AI improves human decision-making.