1 Overview
2 Task 0 — Load Data & Quick Check
3 Task 1 — Data Partitioning (Train/Test Split)
- 3.1 1A. Create the split
- 3.2 1B. Check outcome balance
4 Task 2 — Statistical Comparison of Groups (Training Data Only)
5 Task 3 — Baseline Model Estimation (Logistic Regression)
- 5.1 3A. Choose predictors
6 Task 4 — Model Performance Evaluation (Test Data)
- 6.1 4A. Predict probabilities on the test set
- 6.2 4B. Confusion matrix (choose a threshold)
- 7.1 4C. Compute performance metrics (choose at least two)
- 8.1 4E. Optional: Simple Lift Chart (Deciles)
9 Final Deliverables Checklist

1 Overview

Lab 2 — Model Evaluation, Univariate & Multivariate Analysis

Dataset: week3_credit_scoring.csv
Outcome: defaulted (binary)

In this lab you will:
1. Create a train/test split and confirm outcome balance.
2. Run one t-test (numeric predictor by defaulted) and one ANOVA (numeric variable by a 3+ level categorical predictor).
3. Fit a baseline logistic regression model.
4. Evaluate model performance using a confusion matrix, at least two metrics, and one plot (ROC and/or lift).

Important: For statistical tests (t-test and ANOVA), use training data only.

2 Task 0 — Load Data & Quick Check

# Load data
# NOTE: Put the CSV in the same folder as this Rmd, or provide a full path.
df <- read.csv("week3_credit_scoring.csv", stringsAsFactors = FALSE)

# Quick structure check
str(df)

## 'data.frame':    350 obs. of  6 variables:
##  $ applicant_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age         : int  34 25 72 70 35 23 35 45 73 37 ...
##  $ income      : num  68597 58217 24933 60542 90067 ...
##  $ loan_amount : num  19860 20958 16444 10853 11420 ...
##  $ credit_score: int  574 728 536 700 818 709 824 522 578 693 ...
##  $ defaulted   : int  1 0 1 0 0 1 0 0 1 0 ...

head(df)

##   applicant_id age income loan_amount credit_score defaulted
## 1            1  34  68597       19860          574         1
## 2            2  25  58217       20958          728         0
## 3            3  72  24933       16444          536         1
## 4            4  70  60542       10853          700         0
## 5            5  35  90067       11420          818         0
## 6            6  23  38756        6172          709         1

# Confirm outcome variable
table(df$defaulted)

## 
##   0   1 
## 282  68

Write 1–2 sentences: What does defaulted = 1 represent in business terms?

Answer: defaulted = 1 represents a borrower who did not repay their loan as agreed, indicating elevated credit risk and potential financial loss for the lender.
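Before splitting, it can help to confirm that the columns are complete and that the outcome really is coded 0/1. The following is a minimal sketch, assuming a data frame shaped like `df`; `quick_check()` and `demo` are hypothetical names introduced here, and the demo values are made up for illustration.

```r
# Hypothetical helper: summarize missing values and outcome coding.
quick_check <- function(data, outcome = "defaulted") {
  stopifnot(is.data.frame(data), outcome %in% names(data))
  y <- data[[outcome]]
  list(
    n_rows         = nrow(data),
    missing_by_col = colSums(is.na(data)),            # NA count per column
    outcome_is_01  = all(y[!is.na(y)] %in% c(0, 1)),  # TRUE if only 0/1 present
    event_rate     = mean(y == 1, na.rm = TRUE)       # share of rows with outcome 1
  )
}

# Made-up demo data (not the lab dataset)
demo <- data.frame(
  income    = c(50000, NA, 72000, 41000),
  defaulted = c(0, 1, 0, 0)
)
chk <- quick_check(demo)
chk$missing_by_col
chk$event_rate
```

In the lab itself the call would be `quick_check(df)`.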

3 Task 1 — Data Partitioning (Train/Test Split)

3.1 1A. Create the split

set.seed(123)  # for reproducibility

n <- nrow(df)
train_size <- round(0.70 * n)
train_idx <- sample(1:n, size = train_size, replace = FALSE)

train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Sample sizes
nrow(train)

## [1] 245

nrow(test)

## [1] 105

3.2 1B. Check outcome balance

# Proportion of defaulted in full/train/test
prop_full  <- mean(df$defaulted == 1)
prop_train <- mean(train$defaulted == 1)
prop_test  <- mean(test$defaulted == 1)

prop_full

## [1] 0.1942857

prop_train

## [1] 0.2122449

prop_test

## [1] 0.152381

# Simple table for reporting
balance_table <- data.frame(
  split = c("full", "train", "test"),
  n = c(nrow(df), nrow(train), nrow(test)),
  default_rate = c(prop_full, prop_train, prop_test)
)

balance_table

##   split   n default_rate
## 1  full 350    0.1942857
## 2 train 245    0.2122449
## 3  test 105    0.1523810

Write 2–4 sentences: Is the default rate similar across train and test? Why does this matter?

Answer: The default rate is 19.4% in the full data, 21.2% in the training set, and 15.2% in the test set. The train and test rates differ by about six percentage points, a gap that a simple random split of 350 rows can easily produce, but one worth noting because the test set contains only 16 defaults. Similar rates matter because the model should be trained and evaluated on samples that represent the same population; otherwise performance estimates become less reliable.
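If a closer match between train and test default rates is wanted, one option is to sample within each outcome class. This is a sketch of a stratified split rather than the method used above; `stratified_split_idx()` and `y_demo` are hypothetical names, and the demo outcome vector is made up.

```r
# Hypothetical helper: sample train indices separately within each outcome class.
stratified_split_idx <- function(y, train_frac = 0.70) {
  idx_by_class <- split(seq_along(y), y)   # row indices grouped by class
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # Indexing via sample.int() avoids sample()'s behavior when idx has length 1
    idx[sample.int(length(idx), size = round(train_frac * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

# Made-up outcome vector with a 20% event rate (not the lab data)
set.seed(123)
y_demo <- rep(c(0, 1), times = c(280, 70))
tr <- stratified_split_idx(y_demo)

mean(y_demo[tr] == 1)    # default rate in train
mean(y_demo[-tr] == 1)   # default rate in test
```

In the lab this would replace the `sample()` call with `train_idx <- stratified_split_idx(df$defaulted)`.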

4 Task 2 — Statistical Comparison of Groups (Training Data Only)

In this task you will perform:
- One t-test for a numeric predictor (by defaulted)
- One ANOVA where the predictor is a categorical variable with 3+ levels

4.1 2A. Create a 3-level categorical predictor (Option B)

Because the dataset contains only numeric columns, create a categorical variable using transparent thresholds.

Recommended: Create credit_score_group:
- Low: credit_score < 600
- Medium: 600–699
- High: 700+

# Create credit score groups (3 levels)
train$credit_score_group <- NA

train$credit_score_group[train$credit_score < 600] <- "Low"
train$credit_score_group[train$credit_score >= 600 & train$credit_score < 700] <- "Medium"
train$credit_score_group[train$credit_score >= 700] <- "High"

train$credit_score_group <- factor(train$credit_score_group, levels = c("Low", "Medium", "High"))

table(train$credit_score_group)

## 
##    Low Medium   High 
##     63     79    103
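The same three bins can also be built in one step with `cut()`, which makes it easy to apply identical thresholds to the test set later. This is a sketch of an alternative, not the code used above; `score_group()` is a hypothetical helper name.

```r
# Hypothetical helper: bin credit scores into Low / Medium / High.
# right = FALSE makes intervals [lower, upper), so 600 -> Medium and 700 -> High.
score_group <- function(score) {
  cut(score,
      breaks = c(-Inf, 600, 700, Inf),
      labels = c("Low", "Medium", "High"),
      right  = FALSE)
}

# Boundary values chosen to show which side each threshold falls on
grp <- score_group(c(599, 600, 699, 700, 850))
grp
```

In the lab, `train$credit_score_group <- score_group(train$credit_score)` should give the same grouping as the step-by-step assignment.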

4.2 2B. T-test (numeric predictor by default group)

Choose one numeric predictor (example: income, loan_amount, or credit_score).

# Example t-test: income differs by default status
# Replace 'income' with your chosen numeric predictor if you prefer.
ttest_out <- t.test(income ~ defaulted, data = train)
ttest_out

## 
##  Welch Two Sample t-test
## 
## data:  income by defaulted
## t = 0.525, df = 93.394, p-value = 0.6008
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -4494.694  7725.663
## sample estimates:
## mean in group 0 mean in group 1 
##        68123.12        66507.63

# T-test used for the written answer: credit_score by default status
ttest_out <- t.test(credit_score ~ defaulted, data = train)
ttest_out

Interpretation (3–5 sentences):
- What does the p-value suggest?
- What is the direction of the difference (which group has higher mean)?
- What might this mean in the business context?

Answer: The t-test shows no statistically significant difference in mean credit scores between applicants who defaulted and those who did not (p = 0.795). Non-defaulting applicants have a slightly higher average credit score (≈680) than defaulting applicants (≈676), but a gap of about four points is small relative to the sampling variability, so the data are consistent with no difference at all. This suggests that, when considered in isolation, credit score does not strongly distinguish default risk in this training sample. From a business perspective, credit score should be evaluated in combination with other borrower characteristics rather than relied upon as a standalone indicator of default risk.
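A p-value alone does not say how large a difference is, so a standardized effect size can be reported next to it. The sketch below computes Cohen's d with a pooled standard deviation; `cohens_d()` is a hypothetical helper, and the demo numbers are made up.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation.
cohens_d <- function(x, g) {
  g <- factor(g)
  stopifnot(nlevels(g) == 2)
  x1 <- x[g == levels(g)[1]]
  x2 <- x[g == levels(g)[2]]
  n1 <- length(x1)
  n2 <- length(x2)
  sd_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / sd_pooled   # first level minus second level
}

# Made-up numbers for illustration only
x_demo <- c(1, 2, 3, 3, 4, 5)
g_demo <- c(0, 0, 0, 1, 1, 1)
d <- cohens_d(x_demo, g_demo)
d
```

In the lab the call would be `cohens_d(train$credit_score, train$defaulted)`; by a common rule of thumb, absolute values below roughly 0.2 are read as small.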

4.3 2C. ANOVA (numeric variable by 3-level categorical predictor)

Run ANOVA with a numeric response (example: income) across credit_score_group.

# Example ANOVA: mean income differs across credit score groups
fit_aov <- aov(income ~ credit_score_group, data = train)
summary(fit_aov)

##                     Df    Sum Sq   Mean Sq F value Pr(>F)
## credit_score_group   2 4.642e+07  23208222   0.049  0.952
## Residuals          242 1.147e+11 473820263

# Optional: post-hoc comparisons
TukeyHSD(fit_aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = income ~ credit_score_group, data = train)
## 
## $credit_score_group
##                   diff       lwr      upr     p adj
## Medium-Low   -60.65562 -8731.324 8610.013 0.9998500
## High-Low    -914.29111 -9124.560 7295.978 0.9627028
## High-Medium -853.63549 -8530.710 6823.439 0.9628122

Interpretation (4–6 sentences):
- What does the F-test suggest?
- If Tukey is used, which groups differ most?
- Why might this matter for modeling or for business decisions?

Answer: The ANOVA F-test indicates no statistically significant difference in mean income across the low, medium, and high credit score groups (F = 0.049, p = 0.952). This suggests that, in the training data, average income does not vary meaningfully by credit score category. The Tukey post-hoc comparisons confirm this result: the largest estimated gap is between the High and Low groups (about 914 lower in High), yet none of the pairwise differences is statistically significant and all confidence intervals include zero. From a modeling perspective, the lack of association suggests that income and credit score carry largely non-overlapping information, so including both in one model is unlikely to cause collinearity problems. From a business standpoint, income and credit score may capture different aspects of borrower risk and should be evaluated jointly in a multivariate model rather than used independently for segmentation or decision-making.
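To put a size on the group effect, the share of variance explained by the grouping (eta-squared) can be read off the ANOVA table. This is a minimal sketch for a one-way `aov` fit; `eta_squared()` is a hypothetical helper and the demo data are made up.

```r
# Hypothetical helper: eta-squared for a one-way aov fit.
eta_squared <- function(fit) {
  ss <- summary(fit)[[1]][["Sum Sq"]]   # c(between-group SS, residual SS)
  ss[1] / sum(ss)
}

# Made-up data: three groups with means 2, 5, 8 and equal within-group spread
demo <- data.frame(
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  g = factor(rep(c("a", "b", "c"), each = 3))
)
eta <- eta_squared(aov(y ~ g, data = demo))
eta
```

Using the sums of squares printed above (4.642e+07 and 1.147e+11), the lab's eta-squared works out to roughly 0.0004, so credit score group accounts for essentially none of the variation in income.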

5 Task 3 — Baseline Model Estimation (Logistic Regression)

Fit a baseline logistic regression model with a small, clearly stated set of predictors.

5.1 3A. Choose predictors

Suggested baseline predictors:
- credit_score (expected to reduce default risk)
- income (expected to reduce default risk)
- loan_amount (could increase risk)
- age (optional)

# Fit baseline logistic regression (training data)
fit_glm <- glm(defaulted ~ credit_score + income + loan_amount,
               data = train,
               family = binomial)
summary(fit_glm)

## 
## Call:
## glm(formula = defaulted ~ credit_score + income + loan_amount, 
##     family = binomial, data = train)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -7.272e-01  1.229e+00  -0.592    0.554
## credit_score -3.968e-04  1.592e-03  -0.249    0.803
## income       -3.505e-06  7.216e-06  -0.486    0.627
## loan_amount  -5.336e-06  2.097e-05  -0.254    0.799
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 253.29  on 244  degrees of freedom
## Residual deviance: 252.93  on 241  degrees of freedom
## AIC: 260.93
## 
## Number of Fisher Scoring iterations: 4

Write 4–6 sentences:
- Which predictors look most influential?
- Do the coefficient signs make business sense?

Answer: In this baseline logistic regression model, none of the predictors (credit score, income, or loan amount) is statistically significant; all three p-values exceed 0.6. The raw coefficients are not directly comparable because the predictors are measured on different scales, so the z-values are a better guide: income has the largest absolute z-value (0.486), followed by loan amount (0.254) and credit score (0.249), and all are far from conventional significance. The negative coefficients for credit score and income align with business expectations, as higher credit quality and greater income are typically associated with lower default risk. Loan amount also has a negative coefficient, which is less intuitive, but the estimate is so imprecise that its sign should not be over-interpreted. The residual deviance (252.93) is almost identical to the null deviance (253.29), so the model explains very little beyond an intercept-only model. Overall, the results highlight the limitations of this simple baseline and the need to consider additional predictors or interactions to better explain default risk.
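Because the coefficients are per one unit of each predictor, they are easier to read as odds ratios for a meaningful change, such as 10,000 of income. The sketch below shows one way to do that; `odds_ratio_per()` is a hypothetical helper, and the model is fit on simulated data, not the lab dataset.

```r
# Hypothetical helper: odds ratio for a chosen change in each predictor.
# `units` is a named vector, e.g. c(income = 10000) means "per 10,000 of income".
odds_ratio_per <- function(fit, units) {
  b <- coef(fit)[names(units)]
  exp(b * units)
}

# Simulated data for illustration only (not the lab dataset)
set.seed(1)
n <- 500
sim <- data.frame(income = runif(n, 20000, 120000))
sim$defaulted <- rbinom(n, 1, plogis(0.5 - 0.00003 * sim$income))
fit_sim <- glm(defaulted ~ income, data = sim, family = binomial)

or <- odds_ratio_per(fit_sim, c(income = 10000))
or
```

Applied to the lab's printed income coefficient (-3.505e-06), the same arithmetic gives exp(-0.03505), or about 0.97: each additional 10,000 of income is associated with roughly 3% lower odds of default, an effect too small to distinguish from zero here.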

6 Task 4 — Model Performance Evaluation (Test Data)

6.1 4A. Predict probabilities on the test set

# Predicted probability of default
p_hat <- predict(fit_glm, newdata = test, type = "response")
summary(p_hat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1704  0.2002  0.2152  0.2128  0.2235  0.2586

6.2 4B. Confusion matrix (choose a threshold)

threshold <- 0.50 # classification cutoff

pred_class <- ifelse(p_hat >= threshold, 1, 0)

# Confusion matrix (force both classes 0 and 1 to appear)
cm <- table(
  Predicted = factor(pred_class, levels = c(0, 1)),
  Actual    = factor(test$defaulted, levels = c(0, 1))
)
cm

7.1 4C. Compute performance metrics (choose at least two)

# Helper values (derived from confusion matrix)
TP <- cm["1", "1"]
TN <- cm["0", "0"]
FP <- cm["1", "0"]
FN <- cm["0", "1"]

accuracy  <- (TP + TN) / sum(cm)
precision <- ifelse((TP + FP) == 0, NaN, TP / (TP + FP))
recall    <- ifelse((TP + FN) == 0, 0, TP / (TP + FN))
F1        <- ifelse(is.nan(precision) | (precision + recall) == 0, NaN,
                    2 * precision * recall / (precision + recall))

metrics <- data.frame(
  metric = c("accuracy", "precision", "recall", "F1"),
  value  = c(accuracy, precision, recall, F1)
)

metrics


Write 5–7 sentences:
- Which metric matters most for this business problem?
- What is worse: false positives or false negatives?
- How would you change the threshold if the business cost changes?

Answer:
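One way to explore the threshold question is to tabulate recall, precision, and a simple expected cost at several cutoffs. The sketch below does this; `threshold_table()` is a hypothetical helper, the cost values (`cost_fn = 5`, `cost_fp = 1`) are assumptions chosen only for illustration, and the demo labels and scores are made up.

```r
# Hypothetical helper: recall, precision, and expected cost at several cutoffs.
# cost_fn / cost_fp are assumed business costs per missed default / false alarm.
threshold_table <- function(y, p, thresholds, cost_fn = 5, cost_fp = 1) {
  rows <- lapply(thresholds, function(th) {
    pred <- as.integer(p >= th)
    TP <- sum(pred == 1 & y == 1)
    FP <- sum(pred == 1 & y == 0)
    FN <- sum(pred == 0 & y == 1)
    data.frame(
      threshold = th,
      recall    = if (TP + FN == 0) NaN else TP / (TP + FN),
      precision = if (TP + FP == 0) NaN else TP / (TP + FP),
      cost      = cost_fn * FN + cost_fp * FP
    )
  })
  do.call(rbind, rows)
}

# Made-up labels and scores for illustration only
y_demo <- c(0, 0, 1, 0, 1, 1)
p_demo <- c(0.10, 0.15, 0.18, 0.22, 0.25, 0.30)
tt <- threshold_table(y_demo, p_demo, thresholds = c(0.15, 0.20, 0.50))
tt
```

In the lab the call would be `threshold_table(test$defaulted, p_hat, thresholds = ...)`. Since the largest predicted probability on the test set is 0.2586, only cutoffs below that value can flag any applicant as a default.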

7.2 4D. ROC Curve (base R) and AUC (computed)

# ROC helper: compute TPR/FPR at many thresholds
roc_points <- function(y_true, p_score) {
  # y_true must be 0/1
  thresh <- sort(unique(p_score))
  # Add endpoints
  thresh <- c(-Inf, thresh, Inf)

  TPR <- numeric(length(thresh))
  FPR <- numeric(length(thresh))

  for (i in 1:length(thresh)) {
    t <- thresh[i]
    y_pred <- ifelse(p_score >= t, 1, 0)

    TP <- sum(y_pred == 1 & y_true == 1)
    TN <- sum(y_pred == 0 & y_true == 0)
    FP <- sum(y_pred == 1 & y_true == 0)
    FN <- sum(y_pred == 0 & y_true == 1)

    TPR[i] <- ifelse((TP + FN) == 0, NA, TP / (TP + FN))
    FPR[i] <- ifelse((FP + TN) == 0, NA, FP / (FP + TN))
  }

  out <- data.frame(FPR = FPR, TPR = TPR)
  out <- out[complete.cases(out), ]

  # Sort by FPR
  out <- out[order(out$FPR, out$TPR), ]
  out
}

auc_trap <- function(x, y) {
  # trapezoid rule; assumes x is sorted ascending
  dx <- diff(x)
  y_avg <- (y[-1] + y[-length(y)]) / 2
  sum(dx * y_avg)
}

roc_df <- roc_points(test$defaulted, p_hat)
auc <- auc_trap(roc_df$FPR, roc_df$TPR)
auc

## [1] 0.6102528

# Plot ROC
plot(roc_df$FPR, roc_df$TPR, type = "l",
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = "ROC Curve (Test Set)")
abline(0, 1, lty = 2)
legend("bottomright", legend = paste("AUC =", round(auc, 3)), bty = "n")
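As a cross-check on the trapezoid calculation, AUC can also be computed from ranks, since it equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter (ties counted as one half). This is a sketch of that Mann-Whitney form; `auc_rank()` is a hypothetical helper and the demo values are made up.

```r
# Hypothetical helper: AUC from ranks (Mann-Whitney form); ties get average ranks.
auc_rank <- function(y_true, p_score) {
  r  <- rank(p_score)                 # average ranks for tied scores
  n1 <- sum(y_true == 1)
  n0 <- sum(y_true == 0)
  (sum(r[y_true == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up labels and scores for illustration only
y_demo <- c(0, 0, 1, 1)
p_demo <- c(0.1, 0.4, 0.35, 0.8)
a <- auc_rank(y_demo, p_demo)
a
```

In the lab, `auc_rank(test$defaulted, p_hat)` should closely agree with the value returned by `auc_trap()`.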

8.1 4E. Optional: Simple Lift Chart (Deciles)

# Create deciles based on predicted probability
ord <- order(p_hat, decreasing = TRUE)

p_sorted <- p_hat[ord]
y_sorted <- test$defaulted[ord]

n_test <- length(y_sorted)
decile_id <- ceiling((1:n_test) / (n_test/10))
decile_id[decile_id > 10] <- 10

# Default rate by decile
lift_table <- data.frame(decile = 1:10, n = NA, default_rate = NA)

for (d in 1:10) {
  idx <- which(decile_id == d)
  lift_table$n[d] <- length(idx)
  lift_table$default_rate[d] <- mean(y_sorted[idx] == 1)
}

lift_table

##    decile  n default_rate
## 1       1 10   0.30000000
## 2       2 11   0.09090909
## 3       3 10   0.10000000
## 4       4 11   0.36363636
## 5       5 10   0.10000000
## 6       6 11   0.09090909
## 7       7 10   0.10000000
## 8       8 11   0.36363636
## 9       9 10   0.00000000
## 10     10 11   0.00000000

# Plot default rate by decile
plot(lift_table$decile, lift_table$default_rate, type = "b",
     xlab = "Decile (1 = highest predicted risk)",
     ylab = "Observed default rate",
     main = "Lift by Decile (Test Set)")
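The table above reports default rates per decile; lift in the usual sense divides each rate by the overall default rate, and a cumulative capture column shows what share of all defaults falls in the top deciles. The sketch below adds both; `add_lift()` is a hypothetical helper and the two-row demo table is made up.

```r
# Hypothetical helper: add lift and cumulative capture to a decile table
# that has columns `n` and `default_rate` (ordered from highest risk down).
add_lift <- function(tab) {
  events       <- tab$n * tab$default_rate          # defaults per decile
  overall_rate <- sum(events) / sum(tab$n)          # baseline default rate
  tab$lift        <- tab$default_rate / overall_rate
  tab$cum_capture <- cumsum(events) / sum(events)   # share of all defaults found so far
  tab
}

# Made-up two-bin table for illustration only
demo <- data.frame(decile = 1:2, n = c(10, 10), default_rate = c(0.3, 0.1))
res <- add_lift(demo)
res
```

Applied to `lift_table`, the top decile's rate of 0.30 against the test-set baseline of about 0.152 corresponds to a lift of roughly 2.0, although the rates in the printed table do not fall steadily from decile 1 to decile 10.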

Credit Default Risk Model Evaluation

Purpose

The purpose of this analysis was to evaluate a baseline credit risk model and assess how well selected borrower characteristics explain and predict loan default. The analysis focused on model evaluation, interpretation of predictors, and translation of results into business-relevant insights rather than maximizing predictive accuracy.

Key Findings from Statistical Tests

Univariate analysis using a t-test found no statistically significant difference in mean credit scores between defaulting and non-defaulting applicants in the training data. Similarly, ANOVA results showed no significant differences in mean income across low, medium, and high credit score groups, with post-hoc tests confirming the absence of meaningful group-level differences. Taken together, the tests suggest that credit score on its own does not clearly separate defaulters from non-defaulters in this dataset, and that income is essentially unrelated to credit score band; neither variable should be relied upon alone for credit decisions.

Baseline Model Interpretation

A baseline logistic regression model was estimated using credit score, income, and loan amount as predictors. None of the predictors were statistically significant (all p-values above 0.6), and the residual deviance (252.93) was nearly identical to the null deviance (253.29), indicating that the predictors add little explanatory power. The negative signs on credit score and income align with business intuition, suggesting that stronger credit profiles and higher income reduce default risk, while the negative sign on loan amount is counterintuitive and too imprecise to interpret. These findings reinforce the importance of evaluating predictors jointly and treating this model as a starting point rather than a final solution.

Model Performance & Business Implications

Model evaluation on the test set revealed that, at a 0.50 classification threshold, the model predicted no defaults, resulting in high accuracy (approximately 84.8%) but zero recall. This demonstrates that accuracy alone is misleading in imbalanced classification problems such as credit risk. Recall is the most important metric in this context, as false negatives—failing to identify high-risk borrowers—can lead to direct financial losses. The ROC curve yielded an AUC of approximately 0.61, indicating modest discriminatory power that is better than random but insufficient for standalone deployment.

Model Suitability & Next Steps

In its current form, this baseline model is not sufficient for operational decision-making. Key risks include missed defaults due to low recall and limited predictive separation. To improve performance, future work should incorporate additional predictors (e.g., payment history, utilization ratios), explore alternative modeling approaches, and tune classification thresholds based on business cost tradeoffs. With these enhancements, the model could become a more effective tool for supporting credit risk decisions.

9 Final Deliverables Checklist

Submit:
- R script or Rmd (this file) with your code and answers
- Short memo (1–2 pages) summarizing:
  - Key findings from t-test and ANOVA
  - Baseline model interpretation
  - Performance results and business implications

Before submitting:
- Confirm you used training data only for t-test and ANOVA.
- Confirm all plots have titles and labeled axes.
- Confirm your memo states which metric matters most and why.

TBANLT 560 — Lab 2 (Student Template)

Model Evaluation, Univariate & Multivariate Analysis

Evelyn Vasquez

2026-02-01