Data Preparation and Data Understanding

# Load Packages and Data
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(knitr)
library(kableExtra)
library(moments)
library(broom)
library(patchwork)
library(purrr)
library(car)
library(tibble)
library(effsize)
library(rstatix)
library(DiagrammeR)

df <- read_csv("data.csv", show_col_types = FALSE)

Data Dictionary and Business Interpretation

Variable	Technical_Meaning	Business_Interpretation
Variant	Experimental group assignment	Determine whether officers use old or new AI model
loanofficer_id	Unique ID of loan officer	Track individual performance differences
day	Experiment day	Observe time trend and learning effect
typeI_init	Initial Type I errors	Opportunity loss from rejecting good customers
typeI_fin	Final Type I errors	Final opportunity cost after AI assistance
typeII_init	Initial Type II errors	Potential bad debt due to human decisions
typeII_fin	Final Type II errors	Direct financial losses from approving bad loans
ai_typeI	AI Type I errors	AI-caused loss of good customers
ai_typeII	AI Type II errors	AI-caused bad debt risk
badloans_num	Number of bad loans	Actual defaulted loans
goodloans_num	Number of good loans	Actual profitable loans
agree_init	Initial agreement with AI	Similarity between human and AI before intervention
agree_fin	Final agreement with AI	Degree of accepting AI suggestions
conflict_init	Initial conflict with AI	Initial disagreement with AI
conflict_fin	Final conflict with AI	Remaining disagreement after AI
revised_per_ai	Revised decisions following AI	Degree to which AI influences humans
revised_agst_ai	Revised decisions against AI	Degree to which humans override AI
confidence_init_total	Initial confidence score	Human confidence before AI
confidence_fin_total	Final confidence score	Confidence after AI assistance
complt_init	Initial completed decisions	Initial workload
complt_fin	Final completed decisions	Final workload
fully_complt	Fully completed reviews	Actual completed workload

Data Cleaning

Records with fully_complt = 0 were removed because no review in those records completed both stages of the decision process (initial officer decision, then final decision after the AI recommendation), so they are not suitable for evaluating AI-assisted decision quality.

# original data
n_before <- nrow(df)

# drop records where no review completed the full process (Officer -> Model -> Final)
data_clean <- df %>%
    filter(fully_complt > 0)

# after cleaning
n_after <- nrow(data_clean)

removed <- n_before-n_after

cleaning_summary <- data.frame(
    Before=n_before,
    After=n_after,
    Removed=removed
)
group_summary <- data_clean %>%
  group_by(Variant) %>%
  summarise(
    n_observations = n(),
    n_officers = n_distinct(loanofficer_id)
  )

Variant	n_observations	n_officers
Control	100	10
Treatment	280	28

After data cleaning, the dataset contained:

Control group: 100 observations from 10 loan officers
Treatment group: 280 observations from 28 loan officers

Although the sample sizes differ across groups, subsequent analyses were conducted at the loan-officer level rather than the individual observation level to avoid giving excessive weight to officers with larger workloads.

Officer-Level Aggregation and Metric Construction

The dataset contains repeated observations for each loan officer across a 10-day review period. During these ten days, each officer processed multiple loan applications, so the same officer appears in multiple records of the raw dataset.

To evaluate the overall performance of each loan officer and avoid treating repeated daily records as independent observations, the data were aggregated at the officer level. All relevant variables were summed across the 10-day period for each loanofficer_id.

The aggregated variables include:

Initial and final Type I errors
Initial and final Type II errors
AI Type I and Type II errors
Numbers of good loans and bad loans
Agreement and conflict measures
Revision behavior toward AI recommendations
Completion metrics

officer_summary <- data_clean %>%
  group_by(loanofficer_id, Variant) %>%
  summarise(
    total_typeI_init = sum(typeI_init),
    total_typeI_fin = sum(typeI_fin),
    total_typeII_init = sum(typeII_init),
    total_typeII_fin = sum(typeII_fin),
    total_ai_typeI = sum(ai_typeI),
    total_ai_typeII = sum(ai_typeII),
    total_badloans = sum(badloans_num),
    total_goodloans = sum(goodloans_num),
    total_agree_init = sum(agree_init),
    total_agree_fin = sum(agree_fin),
    total_conflict_init = sum(conflict_init),
    total_conflict_fin = sum(conflict_fin),
    total_revised_per_ai = sum(revised_per_ai),
    total_revised_agst_ai = sum(revised_agst_ai),
    total_complt_init = sum(complt_init),
    total_complt_fin = sum(complt_fin),
    total_fully_complt = sum(fully_complt),
    .groups = "drop"
  ) %>%
  mutate(
    Variant = factor(Variant, levels = c("Control", "Treatment"))
  ) %>%
  arrange(Variant, loanofficer_id)

loanofficer_id	Variant	total_typeI_init	total_typeI_fin	total_typeII_init	total_typeII_fin	total_ai_typeI	total_ai_typeII	total_badloans	total_goodloans	total_agree_init	total_agree_fin	total_conflict_init	total_conflict_fin	total_revised_per_ai	total_revised_agst_ai	total_complt_init	total_complt_fin	total_fully_complt
0g7pi6g8	Control	33	37	14	13	24	16	30	70	64	75	35	24	12	1	99	99	99
0gh7r2hr	Control	23	23	15	16	24	16	30	70	75	85	21	12	9	0	96	97	96
bzeya726	Control	23	23	14	13	24	16	30	70	79	82	18	18	1	1	97	100	97
dlpxpwdj	Control	49	52	8	8	24	16	30	70	50	54	38	40	0	0	88	94	88
i6miisiq	Control	50	50	9	9	24	16	30	70	48	59	42	36	8	2	90	95	90
p5g1bxa1	Control	30	31	14	14	24	16	30	70	77	79	22	21	5	4	99	100	99
qwun9ha5	Control	35	34	12	11	24	16	30	70	63	68	31	26	5	0	96	94	94
sarganjx	Control	20	23	14	17	24	16	30	70	75	88	17	12	8	1	92	100	92
ugdh6i8o	Control	43	50	8	8	24	16	30	70	64	64	32	36	1	5	96	100	96
uui3fiii	Control	26	29	12	9	24	16	30	70	71	76	21	18	5	1	92	94	92
0899qxvc	Treatment	17	18	7	7	12	8	30	70	79	83	12	9	4	1	91	92	91
09pij0e2	Treatment	31	27	8	8	12	8	30	70	70	79	26	18	9	1	96	97	96
1ckkyukp	Treatment	11	15	11	12	12	8	30	70	80	88	13	11	5	1	93	99	93
1ha5khxo	Treatment	16	16	10	8	12	8	30	70	83	87	13	12	6	5	96	99	96
2twvlktb	Treatment	29	14	10	8	12	8	30	70	65	82	32	18	20	5	97	100	97
4cdwcblq	Treatment	18	17	12	9	12	8	30	70	72	80	20	15	6	1	92	95	92
530lfgx0	Treatment	30	16	10	9	12	8	30	70	69	95	25	5	21	0	94	100	94
7bx6hbg5	Treatment	33	20	8	9	12	8	30	70	74	89	25	11	14	0	99	100	99
7fyegrit	Treatment	30	15	10	11	12	8	30	70	72	90	28	10	20	2	100	100	100
92vdohom	Treatment	18	21	7	6	12	8	30	70	81	81	12	16	1	5	94	97	93
9lejzokf	Treatment	62	22	3	7	12	8	30	70	37	81	55	13	43	0	94	94	92
9splhe3u	Treatment	12	14	20	9	12	8	30	70	63	97	31	3	28	0	94	100	94
envu2p1p	Treatment	23	16	6	6	12	8	30	70	64	76	12	7	6	0	83	83	76
ffg5z8wh	Treatment	21	16	9	8	12	8	30	70	75	90	19	10	9	0	94	100	94
gdm2odtq	Treatment	29	28	7	7	12	8	30	70	71	78	23	20	5	1	94	98	94
jtifmrhp	Treatment	19	21	7	7	12	8	30	70	66	80	17	13	5	0	83	93	83
ju0595ih	Treatment	16	14	12	10	12	8	30	70	77	90	22	10	13	0	99	100	99
kmr3oifc	Treatment	22	18	7	8	12	8	30	70	79	85	14	9	5	0	93	94	93
o2vs2awq	Treatment	13	14	9	8	12	8	30	70	82	94	15	6	9	0	97	100	97
qamcqdoe	Treatment	6	15	21	9	12	8	30	70	72	96	27	4	23	0	99	100	99
qo1puiqt	Treatment	16	16	9	9	12	8	30	70	76	80	15	14	3	2	91	94	91
srk7424h	Treatment	29	24	5	8	12	8	30	70	67	86	24	14	12	0	91	100	91
sybgf6ws	Treatment	32	23	10	5	12	8	30	70	67	83	27	13	15	1	94	96	94
uybljp0c	Treatment	15	14	12	9	12	8	30	70	73	82	20	12	8	0	93	94	93
v0ml3nyf	Treatment	30	33	5	5	12	8	30	70	66	74	25	23	5	0	91	97	91
vflkw3iq	Treatment	46	41	5	5	12	8	30	70	54	64	40	33	8	0	94	97	94
yc74rzbp	Treatment	21	15	12	11	12	8	30	70	77	92	18	8	11	0	95	100	95
z0icpf5l	Treatment	16	20	11	7	12	8	30	70	74	91	24	9	15	0	98	100	98

Preliminary Investigation of Model Performance

Before defining the Overall Evaluation Criteria (OECs), the AI model’s own error metrics were examined.

During data preparation, all loan officers within the same experimental variant were found to have identical AI Type I and AI Type II error counts. This indicates that these metrics reflect the performance of the AI model assigned to each variant rather than individual officer behaviour.
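
The check described above can be sketched as follows. This is a minimal base-R illustration on a small hypothetical subset (`toy`, with made-up officer IDs); the error counts are the variant-level values shown in the officer summary table. A result of 1 for every variant means the metric does not vary across officers.

```r
# Hypothetical subset: two officers per variant. The IDs are illustrative;
# the counts are the variant-level values from the officer summary table
# (Control: 24 / 16, Treatment: 12 / 8).
toy <- data.frame(
  loanofficer_id  = c("c1", "c2", "t1", "t2"),
  Variant         = c("Control", "Control", "Treatment", "Treatment"),
  total_ai_typeI  = c(24, 24, 12, 12),
  total_ai_typeII = c(16, 16, 8, 8)
)

# Count the distinct values of each AI error metric within each variant.
# A count of 1 means the metric is constant across officers in that variant.
n_distinct_by_variant <- aggregate(
  cbind(total_ai_typeI, total_ai_typeII) ~ Variant,
  data = toy,
  FUN  = function(x) length(unique(x))
)

print(n_distinct_by_variant)
```

Running the same aggregation on the full officer-level table would confirm (or refute) the constancy for all 38 officers.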

ai_model_performance <- officer_summary %>%
  mutate(
    ai_typeII_rate = total_ai_typeII / total_badloans
  ) %>%
  group_by(Variant) %>%
  summarise(
    ai_typeII_rate = unique(ai_typeII_rate),
    n_officers = n(),
    .groups = "drop"
  ) %>%
  mutate(
    ai_typeII_rate = round(ai_typeII_rate, 3),
    ai_typeII_rate_pct = paste0(round(ai_typeII_rate * 100, 1), "%")
  )

\[ AI\ TypeII\ Rate= \frac{ai\_typeII} {badloans} \]

Preliminary Investigation of AI Model Type II Error Rate
Variant	AI Type II Rate	Number of Officers	AI Type II Rate (%)
Control	0.533	10	53.3%
Treatment	0.267	28	26.7%

A descriptive comparison shows that the Treatment model achieved a substantially lower AI Type II error rate than the Control model (26.7% vs. 53.3%), suggesting improved predictive performance in identifying bad loans.

Because these metrics do not vary across officers within the same variant, they are not suitable for inferential statistical testing and are instead used as diagnostic indicators of model quality.

As the objective of the experiment is to assess whether improved AI predictions lead to better loan decisions, the subsequent OECs focus on outcomes arising from human–AI interaction rather than model performance alone.

Overall Evaluation Criteria (OEC) Definition

To evaluate whether the new AI model improves loan officers’ decision quality, one primary OEC and several key supporting OECs were defined. The primary OEC focuses on the company’s main business objective, while the supporting OECs help explain the mechanism through which the model may influence decision performance.

Primary OEC – Final Type II Rate

\[ Final\ TypeII\ Rate= \frac{typeII\_fin} {badloans} \]

The primary OEC is defined as the proportion of bad loans that were incorrectly approved after Model-assisted review.

This metric was selected because the company’s primary concern is approving loans that later default. Reducing Type II errors directly lowers potential financial losses and reflects the key business objective.

A lower Final Type II Rate indicates better decision quality (smaller-is-better). Therefore, the Treatment group is expected to achieve a lower Final Type II Rate than the Control group, suggesting that the new AI model can better support officers in identifying risky loan applications.

Supporting OEC 1 – Type II Reduction

\[ TypeII\ Reduction= Init\ TypeII\ Rate - Final\ TypeII\ Rate \]

Type II Reduction measures how much officers corrected Type II errors after receiving Model recommendations by comparing initial and final Type II error rates.

A larger value indicates greater reduction in Type II errors after interacting with Model recommendations (larger-is-better). This metric helps evaluate whether Model assistance improves decision-making performance beyond officers’ initial decisions.

The Treatment group is expected to show a larger Type II Reduction, suggesting that the new AI model may better support officers in identifying and correcting risky approval decisions.

Supporting OEC 2 – Agreement Lift

\[ Agreement\ Lift= Agree_{fin\_rate} - Agree_{init\_rate} \]

Agreement Lift measures the increase in agreement between officers and Model recommendations from initial to final decisions.

A larger value indicates that officers became more aligned with Model recommendations after receiving Model assistance (larger-is-better). This metric reflects the extent to which officers adjusted their decisions toward Model recommendations.

Supporting OEC 3 – Conflict Reduction

\[ Conflict\ Reduction= Conflict_{init\_rate} - Conflict_{fin\_rate} \]

Conflict Reduction measures the decrease in disagreement between officers and Model recommendations from initial to final decisions.

A larger value indicates fewer conflicts between officers and Model recommendations after Model assistance (larger-is-better). Together with Agreement Lift, this metric reflects changes in alignment between officers and Model during the decision-making process.

officer_summary <- officer_summary %>%
  mutate(
    # Primary OEC
    final_typeII_rate = total_typeII_fin / total_badloans,

    # Supporting OEC 1
    init_typeII_rate = total_typeII_init / total_badloans,
    typeII_reduction = init_typeII_rate - final_typeII_rate,

    # Supporting OEC 2
    agree_init_rate = total_agree_init / total_fully_complt,
    agree_fin_rate = total_agree_fin / total_fully_complt,
    agreement_lift = agree_fin_rate - agree_init_rate,

    # Supporting OEC 3
    conflict_init_rate = total_conflict_init / total_fully_complt,
    conflict_fin_rate = total_conflict_fin / total_fully_complt,
    conflict_reduction = conflict_init_rate - conflict_fin_rate
  )

loanofficer_id	Variant	final_typeII_rate	typeII_reduction	agreement_lift	conflict_reduction
0g7pi6g8	Control	0.433	0.033	0.111	0.111
0gh7r2hr	Control	0.533	-0.033	0.104	0.094
bzeya726	Control	0.433	0.033	0.031	0.000
dlpxpwdj	Control	0.267	0.000	0.045	-0.023
i6miisiq	Control	0.300	0.000	0.122	0.067
p5g1bxa1	Control	0.467	0.000	0.020	0.010
qwun9ha5	Control	0.367	0.033	0.053	0.053
sarganjx	Control	0.567	-0.100	0.141	0.054
ugdh6i8o	Control	0.267	0.000	0.000	-0.042
uui3fiii	Control	0.300	0.100	0.054	0.033
0899qxvc	Treatment	0.233	0.000	0.044	0.033
09pij0e2	Treatment	0.267	0.000	0.094	0.083
1ckkyukp	Treatment	0.400	-0.033	0.086	0.022
1ha5khxo	Treatment	0.267	0.067	0.042	0.010
2twvlktb	Treatment	0.267	0.067	0.175	0.144
4cdwcblq	Treatment	0.300	0.100	0.087	0.054
530lfgx0	Treatment	0.300	0.033	0.277	0.213
7bx6hbg5	Treatment	0.300	-0.033	0.152	0.141
7fyegrit	Treatment	0.367	-0.033	0.180	0.180
92vdohom	Treatment	0.200	0.033	0.000	-0.043
9lejzokf	Treatment	0.233	-0.133	0.478	0.457
9splhe3u	Treatment	0.300	0.367	0.362	0.298
envu2p1p	Treatment	0.200	0.000	0.158	0.066
ffg5z8wh	Treatment	0.267	0.033	0.160	0.096
gdm2odtq	Treatment	0.233	0.000	0.074	0.032
jtifmrhp	Treatment	0.233	0.000	0.169	0.048
ju0595ih	Treatment	0.333	0.067	0.131	0.121
kmr3oifc	Treatment	0.267	-0.033	0.065	0.054
o2vs2awq	Treatment	0.267	0.033	0.124	0.093
qamcqdoe	Treatment	0.300	0.400	0.242	0.232
qo1puiqt	Treatment	0.300	0.000	0.044	0.011
srk7424h	Treatment	0.267	-0.100	0.209	0.110
sybgf6ws	Treatment	0.167	0.167	0.170	0.149
uybljp0c	Treatment	0.300	0.100	0.097	0.086
v0ml3nyf	Treatment	0.167	0.000	0.088	0.022
vflkw3iq	Treatment	0.167	0.000	0.106	0.074
yc74rzbp	Treatment	0.367	0.033	0.158	0.105
z0icpf5l	Treatment	0.233	0.133	0.173	0.153
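
As a worked example of the four definitions, the sketch below recomputes the OECs for a single officer (0g7pi6g8, Control) directly from that officer's totals in the officer summary table. The scalar variable names are illustrative shorthand, not columns of the dataset.

```r
# Totals for officer 0g7pi6g8 (Control), copied from the officer summary table.
typeII_init   <- 14
typeII_fin    <- 13
badloans      <- 30
agree_init    <- 64
agree_fin     <- 75
conflict_init <- 35
conflict_fin  <- 24
fully_complt  <- 99

# Primary OEC: share of bad loans approved in the final decision.
final_typeII_rate <- typeII_fin / badloans

# Supporting OEC 1: initial rate minus final rate.
typeII_reduction <- typeII_init / badloans - final_typeII_rate

# Supporting OECs 2 and 3: changes in agreement and conflict rates,
# both relative to the number of fully completed reviews.
agreement_lift     <- (agree_fin - agree_init) / fully_complt
conflict_reduction <- (conflict_init - conflict_fin) / fully_complt

oec <- round(c(
  final_typeII_rate  = final_typeII_rate,
  typeII_reduction   = typeII_reduction,
  agreement_lift     = agreement_lift,
  conflict_reduction = conflict_reduction
), 3)

print(oec)
# final_typeII_rate 0.433, typeII_reduction 0.033,
# agreement_lift 0.111, conflict_reduction 0.111
```

These values match the first row of the OEC table above.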

Statistical Assumption Assessment

Normality Assessment

Before selecting statistical methods, the distribution of each OEC was examined within the Control and Treatment groups.

Normality was assessed using three complementary approaches. First, the Shapiro–Wilk test was used as a formal statistical test. Second, skewness, kurtosis, mean, median, and standard deviation were calculated to summarize distribution shape and potential asymmetry. The kurtosis reported by the moments package is Pearson’s kurtosis, for which a normal distribution has a value of 3 (it is not excess kurtosis). Third, Q-Q plots and histograms were used to visually inspect whether each metric followed an approximately normal distribution.

The hypotheses for the Shapiro–Wilk test are:

\[ H_0: \text{The data follows a normal distribution} \]

\[ H_1: \text{The data does not follow a normal distribution} \]

A p-value greater than 0.05 indicates insufficient evidence to reject the normality assumption, while a p-value below 0.05 suggests that the metric may deviate from normality.

vars <- c(
  "final_typeII_rate",
  "typeII_reduction",
  "agreement_lift",
  "conflict_reduction"
)

normality_summary <- lapply(vars, function(v){

  officer_summary %>%
    group_by(Variant) %>%
    summarise(
      Variable = v,

      Shapiro_p =
        shapiro.test(.data[[v]])$p.value,

      Skewness =
        skewness(.data[[v]], na.rm=TRUE),

      Kurtosis =
        kurtosis(.data[[v]], na.rm=TRUE),

      Mean =
        mean(.data[[v]], na.rm=TRUE),

      Median =
        median(.data[[v]], na.rm=TRUE),

      SD =
        sd(.data[[v]], na.rm=TRUE),

      .groups="drop"
    )

}) %>%
bind_rows()

normality_summary

## # A tibble: 8 × 8
##   Variant   Variable           Shapiro_p Skewness Kurtosis    Mean Median     SD
##   <fct>     <chr>                  <dbl>    <dbl>    <dbl>   <dbl>  <dbl>  <dbl>
## 1 Control   final_typeII_rate   0.300      0.245      1.69 0.393   0.4    0.110 
## 2 Treatment final_typeII_rate   0.218      0.148      2.78 0.268   0.267  0.0591
## 3 Control   typeII_reduction    0.292     -0.348      3.64 0.00667 0      0.0516
## 4 Treatment typeII_reduction    0.000118   1.74       6.37 0.0452  0.0167 0.114 
## 5 Control   agreement_lift      0.503      0.182      1.67 0.0683  0.0538 0.0479
## 6 Treatment agreement_lift      0.00331    1.49       5.71 0.148   0.141  0.101 
## 7 Control   conflict_reduction  0.936     -0.0594     1.95 0.0357  0.0429 0.0495
## 8 Treatment conflict_reduction  0.00281    1.60       6.28 0.109   0.0894 0.101

The Q-Q plots provide visual evidence of normality by comparing the observed data distribution with a theoretical normal distribution. If the points closely follow the reference line, the normality assumption is more plausible. Histograms were also used to identify skewness, heavy tails, or irregular distribution patterns.
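
The plotting code is not shown in this report, so the following is only a minimal base-R sketch of how such a Q-Q plot and histogram could be produced for one metric in one group. It uses the Control officers' final Type II counts from the officer summary table; the original figures may well have been drawn differently (for example with ggplot2's `stat_qq()` and `stat_qq_line()`).

```r
# Final Type II rates for the 10 Control officers
# (total_typeII_fin / total_badloans, with 30 bad loans per officer).
control_rate <- c(13, 16, 13, 8, 9, 14, 11, 17, 8, 9) / 30

# qqnorm() returns the theoretical (x) and sample (y) quantiles;
# plot.it = FALSE keeps the coordinates without drawing anything yet.
qq <- qqnorm(control_rate, plot.it = FALSE)

# Q-Q plot: points close to the reference line support approximate normality.
plot(qq$x, qq$y,
     xlab = "Theoretical normal quantiles",
     ylab = "Sample quantiles",
     main = "Q-Q plot: Final Type II Rate (Control)")
qqline(control_rate)

# Histogram of the same values to inspect skewness and tails.
hist(control_rate,
     xlab = "Final Type II Rate",
     main = "Histogram: Final Type II Rate (Control)")
```

With only 10 officers in the Control group, both plots are coarse, which is one reason they were read alongside the Shapiro–Wilk test rather than on their own.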

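Final Type II Rate showed no significant departure from normality in either group (Shapiro–Wilk p = 0.300 for Control and 0.218 for Treatment). Type II Reduction, Agreement Lift, and Conflict Reduction each deviated significantly from normality in the Treatment group (p < 0.001, p = 0.0033, and p = 0.0028, respectively, with skewness between about 1.5 and 1.7), although not in the Control group. These three metrics were therefore treated as violating the normality assumption.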

Homogeneity of Variance Assessment

Levene’s test was used because it is more robust than the traditional F-test when the data may deviate from normality.

The hypotheses for Levene’s test are:

\[ H_0: \sigma^2_{Control} = \sigma^2_{Treatment} \]

\[ H_1: \sigma^2_{Control} \neq \sigma^2_{Treatment} \]

where \(\sigma^2_{Control}\) and \(\sigma^2_{Treatment}\) represent the variances of the OEC in the Control and Treatment groups, respectively.

A p-value greater than 0.05 indicates insufficient evidence to reject the equal variance assumption, suggesting that the homogeneity of variance assumption is satisfied. A p-value below 0.05 indicates unequal variances between groups.

# Levene test
levene_results <- map_dfr(vars, function(v){

  result <- leveneTest(
    as.formula(paste(v, "~ Variant")),
    data = officer_summary
  )

  tibble(
    Variable = v,
    F_value = round(result$`F value`[1],3),
    P_value = round(result$`Pr(>F)`[1],4),
    
    Assumption = ifelse(
      result$`Pr(>F)`[1] > 0.05,
      "Satisfied (Equal Variance)",
      "Not satisfied (Unequal Variance)"
    )
  )

})

levene_results

## # A tibble: 4 × 4
##   Variable           F_value P_value Assumption                      
##   <chr>                <dbl>   <dbl> <chr>                           
## 1 final_typeII_rate    10.4   0.0027 Not satisfied (Unequal Variance)
## 2 typeII_reduction      1.56  0.220  Satisfied (Equal Variance)      
## 3 agreement_lift        1.94  0.172  Satisfied (Equal Variance)      
## 4 conflict_reduction    1.41  0.242  Satisfied (Equal Variance)
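
To make the mechanics of the test concrete: with its default `center = median`, `leveneTest` from car is equivalent to a one-way ANOVA on each observation's absolute deviation from its group median (the Brown–Forsythe variant). The sketch below reproduces this by hand in base R for Final Type II Rate, using the officer-level final Type II counts from the summary table; the vector names are illustrative.

```r
# Final Type II rates per officer (total_typeII_fin / 30), from the summary table.
control   <- c(13, 16, 13, 8, 9, 14, 11, 17, 8, 9) / 30
treatment <- c(7, 8, 12, 8, 8, 9, 9, 9, 11, 6, 7, 9, 6, 8,
               7, 7, 10, 8, 8, 9, 9, 8, 5, 9, 5, 5, 11, 7) / 30

rate  <- c(control, treatment)
group <- factor(rep(c("Control", "Treatment"),
                    times = c(length(control), length(treatment))))

# Absolute deviation of each officer's rate from the median of their own group.
abs_dev <- abs(rate - ave(rate, group, FUN = median))

# One-way ANOVA on the absolute deviations: a large F means the groups
# differ in spread, i.e. the equal-variance assumption is doubtful.
fit      <- anova(lm(abs_dev ~ group))
levene_F <- fit[["F value"]][1]
levene_p <- fit[["Pr(>F)"]][1]

print(round(c(F = levene_F, p = levene_p), 4))
```

The resulting F statistic is approximately 10.4, in line with the first row of the table above.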

Statistical Test and Effect Size Selection

After evaluating the normality and homogeneity of variance assumptions, appropriate statistical tests were selected for each OEC.

For variables that satisfied the normality assumption, parametric tests were preferred because they generally provide greater statistical power. When the equal variance assumption was violated, Welch’s t-test was selected because it is robust to unequal variances between groups.

For variables that violated the normality assumption, the Wilcoxon rank-sum test was used. This non-parametric alternative does not require normally distributed data and is suitable for comparing two independent groups.

The decision rules used for statistical test selection are summarized below:

Statistical Test Selection Rules
Normality	Homogeneity of Variance	Statistical Test
Satisfied	Satisfied	Independent t-test
Satisfied	Not satisfied	Welch’s t-test
Not satisfied	Either satisfied or not satisfied	Wilcoxon rank-sum test
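
The decision rules in the table can be written as a small helper. This is an illustrative sketch only: `choose_test()` and the `assessment` data frame are hypothetical names, and the TRUE/FALSE flags encode the outcomes of the normality and Levene assessments reported in this section.

```r
# Map the two assumption checks to a test, following the rules table:
# non-normal -> Wilcoxon; normal + equal variance -> independent t-test;
# normal + unequal variance -> Welch's t-test.
choose_test <- function(normal_ok, equal_var) {
  if (!normal_ok) {
    "Wilcoxon rank-sum test"
  } else if (equal_var) {
    "Independent t-test"
  } else {
    "Welch's t-test"
  }
}

# Assumption outcomes for each OEC, as reported in the assessments above.
assessment <- data.frame(
  OEC       = c("Final Type II Rate", "Type II Reduction",
                "Agreement Lift", "Conflict Reduction"),
  normal_ok = c(TRUE, FALSE, FALSE, FALSE),
  equal_var = c(FALSE, TRUE, TRUE, TRUE)
)

assessment$test <- mapply(choose_test, assessment$normal_ok, assessment$equal_var)

print(assessment)
```

Applied to these outcomes, the helper selects Welch's t-test for Final Type II Rate and the Wilcoxon rank-sum test for the three supporting OECs.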

While hypothesis tests indicate whether an observed difference is statistically significant, they do not describe the magnitude or practical importance of that difference. Therefore, effect sizes were calculated to quantify the strength of the observed effects.

Different effect size measures were used depending on the statistical test applied.

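For t-tests, Cohen’s d was used to measure the standardized difference between group means. For Wilcoxon rank-sum tests, the effect size \(r = Z / \sqrt{N}\) was used, where \(Z\) is the standardized test statistic and \(N\) is the total sample size. This is the quantity reported by wilcox_effsize in rstatix (it is not the rank-biserial correlation), and it is appropriate for non-parametric comparisons.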

The interpretation guidelines are summarized below:

Effect Size Interpretation Guidelines
Effect Size Measure	Small	Medium	Large
Cohen’s d	0.20	0.50	0.80
Wilcoxon r	0.10	0.30	0.50

The thresholds above were used to interpret the practical magnitude of the observed effects.

The final statistical methods and corresponding effect size measures selected for each OEC are summarized below:

Statistical Test and Effect Size Selection for Each OEC
OEC	Normality	Homogeneity of Variance	Statistical Test	Effect Size Measure
Final Type II Rate	Satisfied	Not satisfied	Welch’s t-test	Cohen’s d
Type II Reduction	Not satisfied	Satisfied	Wilcoxon rank-sum test	Wilcoxon r
Agreement Lift	Not satisfied	Satisfied	Wilcoxon rank-sum test	Wilcoxon r
Conflict Reduction	Not satisfied	Satisfied	Wilcoxon rank-sum test	Wilcoxon r

Statistical Test and Results

Primary OEC

\[ H_0:\ \mu_{Control} \leq \mu_{Treatment} \]

\[ H_1:\ \mu_{Control} > \mu_{Treatment} \]

where \(\mu\) represents the mean Final Type II Rate.

t_final_typeII_rate <- t.test(
final_typeII_rate~Variant,
var.equal=FALSE,
data=officer_summary
)

t_final_typeII_rate

## 
##  Welch Two Sample t-test
## 
## data:  final_typeII_rate by Variant
## t = 3.4409, df = 10.923, p-value = 0.005571
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
##  0.04514467 0.20580771
## sample estimates:
##   mean in group Control mean in group Treatment 
##               0.3933333               0.2678571

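The Welch test reported above was run with R’s default two-sided alternative, which is conservative relative to the one-sided hypotheses stated earlier. Since p = 0.0056 < 0.05 and the observed difference is in the hypothesized direction (Control mean 0.393 > Treatment mean 0.268), the null hypothesis is rejected.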
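
The stated hypotheses are one-sided, whereas the printed output is for a two-sided alternative. A one-sided version that matches the hypotheses can be sketched as below, assuming the officer-level final Type II counts from the summary table (vector names are illustrative). The t statistic and degrees of freedom are unchanged; only the p-value differs, and it is half the two-sided value because the observed difference lies in the direction of the alternative.

```r
# Final Type II rates per officer (total_typeII_fin / 30), from the summary table.
control_fin   <- c(13, 16, 13, 8, 9, 14, 11, 17, 8, 9) / 30
treatment_fin <- c(7, 8, 12, 8, 8, 9, 9, 9, 11, 6, 7, 9, 6, 8,
                   7, 7, 10, 8, 8, 9, 9, 8, 5, 9, 5, 5, 11, 7) / 30

# Welch's t-test with H1: mean(Control) > mean(Treatment).
welch_one_sided <- t.test(
  control_fin, treatment_fin,
  alternative = "greater",
  var.equal   = FALSE
)

print(welch_one_sided$statistic)  # same t as the two-sided test
print(welch_one_sided$parameter)  # same Welch degrees of freedom
print(welch_one_sided$p.value)    # half of the two-sided p-value
```

Either way the conclusion is the same here, since the two-sided p-value is already below 0.05.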

The Treatment model significantly reduced the proportion of bad loans incorrectly approved by loan officers.

Compared with the Control group, the mean Final Type II Rate in the Treatment group was lower by approximately:

\[ 39.3\% - 26.8\% = 12.5\ \text{percentage points} \]

In relative terms, this corresponds to a reduction of roughly 32% in Final Type II errors (12.5 / 39.3).

cohen.d(
 final_typeII_rate~Variant,
 data=officer_summary
)

## 
## Cohen's d
## 
## d estimate: 1.671575 (large)
## 95 percent confidence interval:
##     lower     upper 
## 0.8292943 2.5138561
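
For reference, the estimate above can be reproduced by hand. By default `cohen.d` in effsize standardizes the mean difference by the pooled standard deviation, so the sketch below computes exactly that from the officer-level final Type II counts in the summary table (vector names are illustrative).

```r
# Final Type II rates per officer (total_typeII_fin / 30), from the summary table.
control_fin   <- c(13, 16, 13, 8, 9, 14, 11, 17, 8, 9) / 30
treatment_fin <- c(7, 8, 12, 8, 8, 9, 9, 9, 11, 6, 7, 9, 6, 8,
                   7, 7, 10, 8, 8, 9, 9, 8, 5, 9, 5, 5, 11, 7) / 30

n1 <- length(control_fin)
n2 <- length(treatment_fin)

# Pooled standard deviation: variances weighted by their degrees of freedom.
pooled_sd <- sqrt(
  ((n1 - 1) * var(control_fin) + (n2 - 1) * var(treatment_fin)) / (n1 + n2 - 2)
)

# Cohen's d: mean difference (Control minus Treatment) in pooled-SD units.
d <- (mean(control_fin) - mean(treatment_fin)) / pooled_sd

print(round(d, 3))
```

Note that the pooled SD assumes a common variance, which Levene's test rejected for this metric, so the exact magnitude of d should be read with some caution even though it is clearly large.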

With d = 1.67, well above the 0.80 threshold for a large effect, the difference between the Control and Treatment groups is not only statistically significant but also practically meaningful.

Taken together, the statistically significant Welch’s t-test result and the large effect size suggest that the Treatment model produced a substantial reduction in Final Type II errors and a meaningful improvement in loan approval quality.

Supporting OEC

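\[ H_0:\ \theta_{Control} \geq \theta_{Treatment} \]

\[ H_1:\ \theta_{Control} < \theta_{Treatment} \]

where \(\theta\) represents the location of the distribution of each supporting OEC (Type II Reduction, Agreement Lift, and Conflict Reduction). The same one-sided hypotheses were tested for all three metrics.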

w_typeII <- wilcox.test(
  typeII_reduction ~ Variant,
  data=officer_summary,
  alternative="less"
)

w_agreement <- wilcox.test(
  agreement_lift ~ Variant,
  data=officer_summary,
  alternative="less"
)

w_conflict <- wilcox.test(
  conflict_reduction ~ Variant,
  data=officer_summary,
  alternative="less"
)

w_typeII

## 
##  Wilcoxon rank sum exact test
## 
## data:  typeII_reduction by Variant
## W = 120, p-value = 0.2559
## alternative hypothesis: true location shift is less than 0

officer_summary %>%
  wilcox_effsize(
    typeII_reduction ~ Variant
  )

## # A tibble: 1 × 7
##   .y.              group1  group2    effsize    n1    n2 magnitude
## * <chr>            <chr>   <chr>       <dbl> <int> <int> <ord>    
## 1 typeII_reduction Control Treatment   0.110    10    28 small

w_agreement

## 
##  Wilcoxon rank sum exact test
## 
## data:  agreement_lift by Variant
## W = 63.5, p-value = 0.004895
## alternative hypothesis: true location shift is less than 0

officer_summary %>%
  wilcox_effsize(
    agreement_lift ~ Variant
  )

## # A tibble: 1 × 7
##   .y.            group1  group2    effsize    n1    n2 magnitude
## * <chr>          <chr>   <chr>       <dbl> <int> <int> <ord>    
## 1 agreement_lift Control Treatment   0.411    10    28 moderate

w_conflict

## 
##  Wilcoxon rank sum exact test
## 
## data:  conflict_reduction by Variant
## W = 71.5, p-value = 0.0109
## alternative hypothesis: true location shift is less than 0

officer_summary %>%
  wilcox_effsize(
    conflict_reduction ~ Variant
  )

## # A tibble: 1 × 7
##   .y.                group1  group2    effsize    n1    n2 magnitude
## * <chr>              <chr>   <chr>       <dbl> <int> <int> <ord>    
## 1 conflict_reduction Control Treatment   0.368    10    28 moderate
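
The effect sizes reported by `wilcox_effsize` can be approximated from the W statistics alone, since r = |Z| / sqrt(N). The sketch below uses the plain normal approximation to the rank-sum distribution without any tie correction; this is a simplifying assumption, and rstatix (which, per its documentation, obtains Z from the coin package) handles ties, so small discrepancies are expected. `r_from_W()` is a hypothetical helper name.

```r
# Approximate r = |Z| / sqrt(N) from the Wilcoxon W statistic.
# Assumption: normal approximation without tie or continuity correction.
r_from_W <- function(W, n1, n2) {
  N     <- n1 + n2
  mu    <- n1 * n2 / 2                      # mean of W under H0
  sigma <- sqrt(n1 * n2 * (N + 1) / 12)     # SD of W under H0 (no ties)
  z     <- (W - mu) / sigma
  abs(z) / sqrt(N)
}

# W statistics from the test outputs above; 10 Control and 28 Treatment officers.
r_agreement <- r_from_W(63.5, n1 = 10, n2 = 28)
r_conflict  <- r_from_W(71.5, n1 = 10, n2 = 28)
r_typeII    <- r_from_W(120,  n1 = 10, n2 = 28)

print(round(c(agreement_lift     = r_agreement,
              conflict_reduction = r_conflict,
              typeII_reduction   = r_typeII), 3))
```

The approximation gives about 0.411 and 0.368 for Agreement Lift and Conflict Reduction, matching the reported values, and about 0.108 for Type II Reduction, slightly below the reported 0.110, which is consistent with the missing tie correction.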

Overall OEC Results Summary
OEC	Statistical Test	P-value	Significant	Effect Size	Magnitude	Interpretation
Final Type II Rate	Welch’s t-test	0.0056	Yes	d = 1.67	Large	Treatment reduced Final Type II Rate
Type II Reduction	Wilcoxon	0.2559	No	r = 0.11	Small	No evidence of greater Type II Reduction
Agreement Lift	Wilcoxon	0.0049	Yes	r = 0.411	Moderate	Higher adoption of AI recommendations
Conflict Reduction	Wilcoxon	0.0109	Yes	r = 0.368	Moderate	Lower disagreement with AI

Based on the statistical analyses above, several key findings can be drawn.

First, with respect to the Final Type II Rate, loan officers working with the new AI model approved a significantly smaller share of bad loans than officers working with the legacy model. The new model itself also had a lower AI Type II error rate than the legacy model (26.7% vs. 53.3%).

In addition, both Agreement Lift and Conflict Reduction were significantly higher in the Treatment group. These results suggest a consistent behavioral pattern: after being exposed to recommendations from the new AI model, loan officers were more likely to align their decisions with the AI’s recommendations. This indicates a higher level of trust in, and adoption of, the new model, resulting in stronger human-AI decision alignment compared to the legacy model.

However, although the Final Type II Rate was significantly lower among users of the new model, the difference in Type II Reduction, a metric intended to capture how much officers improve on their initial decisions after AI assistance, did not reach statistical significance. There are at least two plausible explanations for this result.

1. Insufficient Statistical Power

The Control group included only 10 loan officers versus 28 in the Treatment group, so the comparison rests on a small and unbalanced sample. With an observed effect size of only r = 0.11, the study may have lacked sufficient statistical power to detect a real but modest difference.

As a result, the non-significant finding may reflect a statistical Type II error rather than the absence of a true improvement effect from the new model.
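
The power argument can be explored with a small simulation. The sketch below is illustrative only and rests on assumptions that are not part of the original analysis: outcomes are drawn from normal distributions (the observed Treatment data were not normal), with a hypothetical shift of 0.04 and a common SD of 0.1, values chosen to be roughly on the scale of the observed Type II Reduction summaries. `simulate_power()` is a hypothetical helper.

```r
set.seed(1)

# Estimate the power of a one-sided Wilcoxon rank-sum test by simulation.
# Assumptions (illustrative): normal outcomes, common SD, fixed location shift.
simulate_power <- function(n_control, n_treatment, shift, sd,
                           n_sim = 2000, alpha = 0.05) {
  rejections <- replicate(n_sim, {
    control   <- rnorm(n_control,   mean = 0,     sd = sd)
    treatment <- rnorm(n_treatment, mean = shift, sd = sd)
    # H1: Control is shifted below Treatment, as in the supporting OEC tests.
    wilcox.test(control, treatment, alternative = "less")$p.value < alpha
  })
  mean(rejections)  # share of simulated experiments that reject H0
}

# Design used in this experiment: 10 Control vs. 28 Treatment officers.
power_observed <- simulate_power(10, 28, shift = 0.04, sd = 0.1)

# A balanced design with 30 officers per group, for comparison.
power_balanced <- simulate_power(30, 30, shift = 0.04, sd = 0.1)

print(c(observed_design = power_observed, balanced_design = power_balanced))
```

Under these assumptions the balanced design detects the same shift more often than the 10-versus-28 design, but the absolute numbers depend heavily on the assumed shift, spread, and distribution.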

2. Baseline Differences Between Groups

If the Treatment group already had a relatively low initial Type II Error rate, the new AI model would have less room to improve performance. In contrast, the Control group’s higher initial error rate would allow greater improvement even with the legacy model.

To test this, an additional analysis was conducted on initial Type II Error rates before Model intervention.

Baseline Type II Error Rate Comparison Summary
Metric	Result
Control mean Initial Type II Error Rate	0.4
Treatment mean Initial Type II Error Rate	0.313
Control median Initial Type II Error Rate	0.433
Treatment median Initial Type II Error Rate	0.3
Control normality	Shapiro-Wilk p = 0.0307 (Not normal)
Treatment normality	Shapiro-Wilk p = 0.0024 (Not normal)
Variance homogeneity	Levene p = 0.5706 (Equal variance)
Selected statistical test	Wilcoxon rank-sum test
Test statistic	W = 213.5
p-value	0.0128
Conclusion	Significant baseline difference

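The results are consistent with this explanation: the Control group had a significantly higher initial Type II Rate than the Treatment group (mean 0.400 vs. 0.313, p = 0.0128). This baseline difference likely contributes to the non-significant Type II Reduction result. While the new model achieved better final outcomes, the Treatment group’s lower starting error rate limited its room for improvement, making the observed reduction comparable to that of the Control group. Because initial decisions were made before any AI recommendation was shown, the same gap also means that part of the difference in Final Type II Rate was already present at baseline, which should be kept in mind when interpreting the primary OEC.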
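
The code for this baseline comparison is not shown in the report, so the following is only a sketch of how the summary figures could be reproduced from the officer-level initial Type II counts in the summary table (vector names are illustrative). The options used for the original Wilcoxon test are not stated, so the p-value from this sketch may differ slightly from the one reported.

```r
# Initial Type II rates per officer (total_typeII_init / 30), from the summary table.
control_init   <- c(14, 15, 14, 8, 9, 14, 12, 14, 8, 12) / 30
treatment_init <- c(7, 8, 11, 10, 10, 12, 10, 8, 10, 7, 3, 20, 6, 9,
                    7, 7, 12, 7, 9, 21, 9, 5, 10, 12, 5, 5, 12, 11) / 30

# Means and medians as in the baseline summary table.
baseline <- round(c(
  control_mean     = mean(control_init),
  treatment_mean   = mean(treatment_init),
  control_median   = median(control_init),
  treatment_median = median(treatment_init)
), 3)
print(baseline)

# Wilcoxon rank-sum test on the baseline rates. exact = FALSE requests the
# normal approximation, since the data contain ties.
baseline_test <- wilcox.test(control_init, treatment_init, exact = FALSE)
print(baseline_test$statistic)  # W
```

The means, medians, and W statistic agree with the summary table (0.4, 0.313, 0.433, 0.3, and W = 213.5).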

Conclusion and Recommendations

Taken together, the results indicate that the new AI model provides meaningful business value by reducing bad-loan approvals and improving human-AI decision alignment. The primary OEC was statistically significant with a large effect size (Cohen’s d = 1.67), providing strong evidence that the new model outperforms the legacy model.

Although Type II Reduction was not statistically significant, this is likely due to limited statistical power and baseline differences between groups rather than a lack of model effectiveness. Overall, the findings support adoption of the new model while continuing to strengthen the evidence through further data collection.

Based on these findings, the following actions are recommended:

1. Proceed with a Controlled Rollout

The model has demonstrated a measurable reduction in bad-loan approval risk. To avoid continued exposure to the higher risk associated with the legacy model, a controlled rollout is recommended while performance continues to be monitored.

2. Continue Data Collection and Improve Experimental Design

Additional data should be collected during rollout to increase statistical power and confidence in the results. Future studies should aim for more balanced Control and Treatment groups (ideally 20–30 participants each). Where resources are limited, cross-over or matched-pair designs may help reduce individual-level variation.

3. Broaden Evaluation Metrics and Analysis

Future evaluations should assess both Type II and Type I Errors to better understand the risk-return trade-off and ensure risk reduction does not come at the cost of rejecting profitable loans.

If case-level data become available, decision-path analysis (Human Initial Decision × AI Recommendation × Human Final Decision × Actual Loan Outcome) could provide deeper insight into how and when the AI improves human decision-making.

A/B Test for New AI Model Deployment in Loan Approval Decisions