Lab 2 — Model Evaluation, Univariate & Multivariate Analysis
Dataset: week3_credit_scoring.csv
Outcome: defaulted (binary)
In this lab you will:
1. Create a train/test split and confirm outcome balance.
2. Run one t-test (numeric predictor by defaulted) and one ANOVA (numeric variable by a 3+ level categorical predictor).
3. Fit a baseline logistic regression model.
4. Evaluate model performance using a confusion matrix, two+ metrics, and one plot (ROC and/or lift).
Important: For statistical tests (t-test and ANOVA), use training data only.
# Load data
# NOTE: Put the CSV in the same folder as this Rmd, or provide a full path.
df <- read.csv("week3_credit_scoring.csv", stringsAsFactors = FALSE)
# Quick structure check
str(df)
## 'data.frame': 350 obs. of 6 variables:
## $ applicant_id: int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 34 25 72 70 35 23 35 45 73 37 ...
## $ income : num 68597 58217 24933 60542 90067 ...
## $ loan_amount : num 19860 20958 16444 10853 11420 ...
## $ credit_score: int 574 728 536 700 818 709 824 522 578 693 ...
## $ defaulted : int 1 0 1 0 0 1 0 0 1 0 ...
head(df)
## applicant_id age income loan_amount credit_score defaulted
## 1 1 34 68597 19860 574 1
## 2 2 25 58217 20958 728 0
## 3 3 72 24933 16444 536 1
## 4 4 70 60542 10853 700 0
## 5 5 35 90067 11420 818 0
## 6 6 23 38756 6172 709 1
# Confirm outcome variable
table(df$defaulted)
##
## 0 1
## 282 68
Write 1–2 sentences: What does
defaulted = 1 represent in business terms?
Answer: defaulted = 1 represents a borrower who did not repay their loan as agreed, indicating elevated credit risk and potential financial loss for the lender.
set.seed(123) # for reproducibility
n <- nrow(df)
train_size <- round(0.70 * n)
train_idx <- sample(1:n, size = train_size, replace = FALSE)
train <- df[train_idx, ]
test <- df[-train_idx, ]
# Sample sizes
nrow(train)
## [1] 245
nrow(test)
## [1] 105
# Proportion of defaulted in full/train/test
prop_full <- mean(df$defaulted == 1)
prop_train <- mean(train$defaulted == 1)
prop_test <- mean(test$defaulted == 1)
prop_full
## [1] 0.1942857
prop_train
## [1] 0.2122449
prop_test
## [1] 0.152381
# Simple table for reporting
balance_table <- data.frame(
  split = c("full", "train", "test"),
  n = c(nrow(df), nrow(train), nrow(test)),
  default_rate = c(prop_full, prop_train, prop_test)
)
balance_table
## split n default_rate
## 1 full 350 0.1942857
## 2 train 245 0.2122449
## 3 test 105 0.1523810
Write 2–4 sentences: Is the default rate similar across train and test? Why does this matter?
Answer: The default rate is 19.4% in the full data, 21.2% in the training set, and 15.2% in the test set. The train and test rates differ by about six percentage points, which is a plausible gap for a simple random split with only 105 test rows, so both sets remain broadly representative of the overall outcome distribution. This matters because a model trained or evaluated on a sample with a distorted default rate gives less reliable performance estimates.
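A simple random split does not guarantee equal default rates in the two parts. One possible alternative, sketched here under the assumption that a stratified split is acceptable for the lab, samples 70% of the rows within each outcome class. The helper name `stratified_train_idx` and the toy vector `y_demo` are illustrative and not part of the lab; the class counts 282 and 68 come from the outcome table above.

```r
# Hypothetical helper (not part of the lab): sample a fixed fraction
# of row indices separately within each outcome class.
stratified_train_idx <- function(y, train_frac = 0.70) {
  idx_by_class <- split(seq_along(y), y)
  sampled <- lapply(idx_by_class, function(idx) {
    n_take <- round(train_frac * length(idx))
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(sampled, use.names = FALSE))
}

# Toy outcome vector with the same class counts as the lab data (282 / 68)
y_demo <- rep(c(0, 1), times = c(282, 68))

set.seed(123)
train_idx_demo <- stratified_train_idx(y_demo)

length(train_idx_demo)           # number of training rows
mean(y_demo[train_idx_demo])     # default rate in the stratified training part
mean(y_demo[-train_idx_demo])    # default rate in the stratified test part
```

With stratification the two default rates can differ only through rounding, so any remaining train/test gap in model performance is less likely to come from outcome imbalance.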
In this task you will perform:
- One t-test for a numeric predictor (by defaulted)
- One ANOVA where the predictor is a categorical variable with 3+ levels
Because the dataset contains only numeric columns, create a categorical variable using transparent thresholds.
Recommended: Create credit_score_group:
- Low: credit_score < 600
- Medium: 600–699
- High: 700+
# Create credit score groups (3 levels)
train$credit_score_group <- NA
train$credit_score_group[train$credit_score < 600] <- "Low"
train$credit_score_group[train$credit_score >= 600 & train$credit_score < 700] <- "Medium"
train$credit_score_group[train$credit_score >= 700] <- "High"
train$credit_score_group <- factor(train$credit_score_group, levels = c("Low", "Medium", "High"))
table(train$credit_score_group)
##
## Low Medium High
## 63 79 103
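The same three-level grouping can also be written with base R's `cut()`. The sketch below is an alternative to the assignment code above, not a replacement for it; `scores_demo` is a made-up vector chosen to sit on and around the two thresholds so the boundary behaviour is visible.

```r
# Toy scores on and around the cut points (illustrative values only)
scores_demo <- c(540, 599, 600, 650, 699, 700, 820)

# right = FALSE makes each interval closed on the left: [lower, upper)
group_demo <- cut(
  scores_demo,
  breaks = c(-Inf, 600, 700, Inf),
  labels = c("Low", "Medium", "High"),
  right = FALSE
)

data.frame(score = scores_demo, group = group_demo)
```

Using `right = FALSE` keeps a score of exactly 600 in Medium and exactly 700 in High, matching the thresholds stated in the task.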
Choose one numeric predictor (example:
income, loan_amount, or
credit_score).
# Example t-test: income differs by default status
# Replace 'income' with your chosen numeric predictor if you prefer.
ttest_out <- t.test(income ~ defaulted, data = train)
ttest_out
##
## Welch Two Sample t-test
##
## data: income by defaulted
## t = 0.525, df = 93.394, p-value = 0.6008
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -4494.694 7725.663
## sample estimates:
## mean in group 0 mean in group 1
## 68123.12 66507.63
# t-test used for the interpretation below: credit_score by default status
ttest_out <- t.test(credit_score ~ defaulted, data = train)
ttest_out
Interpretation (3–5 sentences):
- What does the p-value suggest?
- What is the direction of the difference (which group has higher mean)?
- What might this mean in the business context?
Answer: The t-test shows no statistically significant difference in mean credit scores between applicants who defaulted and those who did not (p = 0.795). Non-defaulting applicants have a slightly higher average credit score (≈680) than defaulting applicants (≈676), but the difference is small and, given the p-value, the 95% confidence interval for the difference includes zero. This suggests that, when considered in isolation, credit score does not strongly distinguish default risk in this training sample. From a business perspective, credit score should be evaluated in combination with other borrower characteristics rather than relied upon as a standalone indicator of default risk.
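To report direction and size alongside the p-value, the pieces can be pulled from the object that `t.test()` returns. The sketch below runs on simulated data, not the lab dataset: the data frame `demo` and its values are made up, and the group sizes of 193 and 52 are only what the 245 training rows and the 21.2% training default rate imply. Cohen's d with a pooled standard deviation is one common effect-size choice, assumed here for illustration.

```r
set.seed(1)

# Simulated stand-in for the training data (values are made up)
demo <- data.frame(
  defaulted = rep(c(0, 1), times = c(193, 52)),
  credit_score = round(rnorm(245, mean = 680, sd = 80))
)

tt <- t.test(credit_score ~ defaulted, data = demo)

# Direction of the difference: mean of group 0 minus mean of group 1
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])

# Cohen's d using a pooled standard deviation
grp_sd <- tapply(demo$credit_score, demo$defaulted, sd)
grp_n  <- tapply(demo$credit_score, demo$defaulted, length)
pooled_sd <- sqrt(sum((grp_n - 1) * grp_sd^2) / (sum(grp_n) - 2))
cohens_d <- mean_diff / pooled_sd

c(mean_diff = mean_diff,
  ci_low = tt$conf.int[1],
  ci_high = tt$conf.int[2],
  p_value = tt$p.value,
  cohens_d = cohens_d)
```

A positive `mean_diff` means the non-defaulting group has the higher mean; a confidence interval that spans zero corresponds to a non-significant result at the 5% level.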
Run ANOVA with a numeric response (example: income)
across credit_score_group.
# Example ANOVA: mean income differs across credit score groups
fit_aov <- aov(income ~ credit_score_group, data = train)
summary(fit_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## credit_score_group 2 4.642e+07 23208222 0.049 0.952
## Residuals 242 1.147e+11 473820263
# Optional: post-hoc comparisons
TukeyHSD(fit_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = income ~ credit_score_group, data = train)
##
## $credit_score_group
## diff lwr upr p adj
## Medium-Low -60.65562 -8731.324 8610.013 0.9998500
## High-Low -914.29111 -9124.560 7295.978 0.9627028
## High-Medium -853.63549 -8530.710 6823.439 0.9628122
Interpretation (4–6 sentences):
- What does the F-test suggest?
- If Tukey is used, which groups differ most?
- Why might this matter for modeling or for business decisions?
Answer: The ANOVA F-test indicates no statistically significant difference in mean income across the low, medium, and high credit score groups (F = 0.049, p = 0.952). This suggests that, in the training data, average income does not vary meaningfully by credit score category. The Tukey post-hoc comparisons confirm this result: none of the pairwise differences are statistically significant and all confidence intervals include zero. The largest estimated gap is between the High and Low groups (mean income about 914 lower in the High group), which is negligible relative to a confidence interval that runs from roughly -9,125 to 7,296. From a modeling perspective, credit score group carries little information about income, so the two variables are unlikely to be redundant predictors. From a business standpoint, income and credit score may capture different aspects of borrower risk and should be evaluated jointly in a multivariate model rather than used independently for segmentation or decision-making.
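The size of the group effect can be read directly off the ANOVA table. The sketch below recomputes eta-squared, the F statistic, and its p-value from the sums of squares and degrees of freedom printed by `summary(fit_aov)`. The variable names are illustrative, and the inputs are the rounded values shown in the output, so results match only to the printed precision.

```r
# Values copied from the ANOVA summary above (rounded as printed)
ss_between <- 4.642e+07   # Sum Sq for credit_score_group
ss_within  <- 1.147e+11   # Sum Sq for Residuals
df_between <- 2
df_within  <- 242

# Eta-squared: share of total variance in income explained by the grouping
eta_sq <- ss_between / (ss_between + ss_within)

# F statistic as the ratio of mean squares, and its upper-tail p-value
f_stat <- (ss_between / df_between) / (ss_within / df_within)
p_val  <- pf(f_stat, df1 = df_between, df2 = df_within, lower.tail = FALSE)

c(eta_sq = eta_sq, f_stat = f_stat, p_val = p_val)
```

An eta-squared this close to zero says the grouping explains almost none of the variation in income, which is the same conclusion as the F-test expressed as an effect size.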
Fit a baseline logistic regression model with a small, clearly stated set of predictors.
Suggested baseline predictors:
- credit_score (expected to reduce default risk)
- income (expected to reduce default risk)
- loan_amount (could increase risk)
- age (optional)
# Fit baseline logistic regression (training data)
fit_glm <- glm(defaulted ~ credit_score + income + loan_amount,
               data = train,
               family = binomial)
summary(fit_glm)
##
## Call:
## glm(formula = defaulted ~ credit_score + income + loan_amount,
## family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.272e-01 1.229e+00 -0.592 0.554
## credit_score -3.968e-04 1.592e-03 -0.249 0.803
## income -3.505e-06 7.216e-06 -0.486 0.627
## loan_amount -5.336e-06 2.097e-05 -0.254 0.799
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 253.29 on 244 degrees of freedom
## Residual deviance: 252.93 on 241 degrees of freedom
## AIC: 260.93
##
## Number of Fisher Scoring iterations: 4
Write 4–6 sentences:
- Which predictors look most influential?
- Do the coefficient signs make business sense?
Answer: In this baseline logistic regression model, none of the predictors (credit score, income, or loan amount) is statistically significant; all p-values exceed 0.6. The raw coefficients are not directly comparable because the predictors are on different scales, so the z-statistics are a better guide: income has the largest absolute z value (0.486), followed by loan amount (0.254) and credit score (0.249), but all three effects are weak. The negative coefficients for credit score and income align with business expectations, as higher credit quality and greater income are typically associated with lower default risk. However, loan amount also has a negative coefficient, which is less intuitive and suggests that loan size alone does not adequately capture repayment burden in this dataset. Overall, two of the three signs are directionally reasonable, but the residual deviance (252.93) is barely below the null deviance (253.29), which highlights the limitations of this simple baseline model and the need to consider additional predictors or interactions to better explain default risk.
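Log-odds coefficients are easier to discuss as odds ratios over a meaningful change in each predictor. The sketch below uses the estimates printed in the model summary. The increments (100 score points, 10,000 currency units) are assumptions chosen for illustration, not values from the lab. It also compares the null and residual deviances with a likelihood-ratio test, which is one standard way to ask whether the three predictors jointly improve on an intercept-only model.

```r
# Coefficient estimates copied from summary(fit_glm) above
coefs <- c(credit_score = -3.968e-04,
           income       = -3.505e-06,
           loan_amount  = -5.336e-06)

# Illustrative increments (assumed): 100 score points, 10,000 in income/loan
steps <- c(credit_score = 100, income = 10000, loan_amount = 10000)

# Odds ratio for each increment: exp(beta * step)
odds_ratio <- exp(coefs * steps)
round(odds_ratio, 3)

# Likelihood-ratio test from the reported deviances
lr_stat <- 253.29 - 252.93   # null deviance minus residual deviance
lr_df   <- 244 - 241         # difference in degrees of freedom
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p)
```

Odds ratios just below 1 indicate slightly lower odds of default per increment, and a large likelihood-ratio p-value indicates that the predictors together add little over the intercept alone.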
# Predicted probability of default
p_hat <- predict(fit_glm, newdata = test, type = "response")
summary(p_hat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1704 0.2002 0.2152 0.2128 0.2235 0.2586
threshold <- 0.50 # classification cutoff
pred_class <- ifelse(p_hat >= threshold, 1, 0)
cm <- table(
  Predicted = factor(pred_class, levels = c(0, 1)),
  Actual = factor(test$defaulted, levels = c(0, 1))
)
cm
TP <- cm["1", "1"]
TN <- cm["0", "0"]
FP <- cm["1", "0"]
FN <- cm["0", "1"]
accuracy <- (TP + TN) / sum(cm)
precision <- ifelse((TP + FP) == 0, NaN, TP / (TP + FP))
recall <- ifelse((TP + FN) == 0, 0, TP / (TP + FN))
F1 <- ifelse(is.nan(precision) | (precision + recall) == 0, NaN,
             2 * precision * recall / (precision + recall))
metrics <- data.frame(
  metric = c("accuracy", "precision", "recall", "F1"),
  value = c(accuracy, precision, recall, F1)
)
metrics
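The predicted probabilities summarized above top out near 0.26, so a 0.50 cutoff flags no one as a default. The sketch below shows how the same metrics move as the cutoff is lowered, and how an expected-cost column can guide the choice. Everything in it is illustrative: `threshold_metrics`, the toy vectors `y_toy` and `p_toy`, and the cost weights (a false negative assumed to cost five times a false positive) are assumptions, not lab values.

```r
# Hypothetical helper: confusion-matrix metrics at a given cutoff
threshold_metrics <- function(y_true, p_score, threshold,
                              cost_fp = 1, cost_fn = 5) {
  pred <- as.integer(p_score >= threshold)
  tp <- sum(pred == 1 & y_true == 1)
  fp <- sum(pred == 1 & y_true == 0)
  fn <- sum(pred == 0 & y_true == 1)
  tn <- sum(pred == 0 & y_true == 0)
  data.frame(
    threshold = threshold,
    accuracy  = (tp + tn) / length(y_true),
    precision = if (tp + fp == 0) NA_real_ else tp / (tp + fp),
    recall    = if (tp + fn == 0) NA_real_ else tp / (tp + fn),
    cost      = cost_fp * fp + cost_fn * fn   # assumed cost weights
  )
}

# Toy outcomes and scores in a narrow range, like the model's output
y_toy <- c(0, 0, 0, 0, 0, 0, 1, 0, 1, 1)
p_toy <- c(0.17, 0.18, 0.19, 0.20, 0.21, 0.21, 0.22, 0.23, 0.24, 0.26)

res <- do.call(rbind, lapply(c(0.50, 0.22, 0.20), threshold_metrics,
                             y_true = y_toy, p_score = p_toy))
res
```

In this toy case the 0.50 cutoff has zero recall and the highest cost, while lowering the cutoff trades some precision for recall. Raising the assumed cost of a false negative pushes the cost-minimizing cutoff lower.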
Write 5–7 sentences:
- Which metric matters most for this business problem?
- What is worse: false positives or false negatives?
- How would you change the threshold if the business cost changes?
Answer:
4D. ROC Curve (base R) and AUC (computed)
# ROC helper: compute TPR/FPR at many thresholds
roc_points <- function(y_true, p_score) {
  # y_true must be 0/1
  thresh <- sort(unique(p_score))
  # Add endpoints
  thresh <- c(-Inf, thresh, Inf)
  TPR <- numeric(length(thresh))
  FPR <- numeric(length(thresh))
  for (i in seq_along(thresh)) {
    t <- thresh[i]
    y_pred <- ifelse(p_score >= t, 1, 0)
    TP <- sum(y_pred == 1 & y_true == 1)
    TN <- sum(y_pred == 0 & y_true == 0)
    FP <- sum(y_pred == 1 & y_true == 0)
    FN <- sum(y_pred == 0 & y_true == 1)
    TPR[i] <- ifelse((TP + FN) == 0, NA, TP / (TP + FN))
    FPR[i] <- ifelse((FP + TN) == 0, NA, FP / (FP + TN))
  }
  out <- data.frame(FPR = FPR, TPR = TPR)
  out <- out[complete.cases(out), ]
  # Sort by FPR
  out <- out[order(out$FPR, out$TPR), ]
  out
}
auc_trap <- function(x, y) {
  # trapezoid rule; assumes x is sorted ascending
  dx <- diff(x)
  y_avg <- (y[-1] + y[-length(y)]) / 2
  sum(dx * y_avg)
}
roc_df <- roc_points(test$defaulted, p_hat)
auc <- auc_trap(roc_df$FPR, roc_df$TPR)
auc
## [1] 0.6102528
# Plot ROC
plot(roc_df$FPR, roc_df$TPR, type = "l",
xlab = "False Positive Rate", ylab = "True Positive Rate",
main = "ROC Curve (Test Set)")
abline(0, 1, lty = 2)
legend("bottomright", legend = paste("AUC =", round(auc, 3)), bty = "n")
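A hand-rolled trapezoid AUC is worth cross-checking. One independent route, sketched below, uses the rank (Mann-Whitney) identity: AUC equals the probability that a randomly chosen default gets a higher score than a randomly chosen non-default, with ties counted as one half. The function `auc_rank` and the toy vectors are illustrative and not part of the lab code.

```r
# Hypothetical helper: AUC from average ranks (Mann-Whitney identity)
auc_rank <- function(y_true, p_score) {
  n_pos <- sum(y_true == 1)
  n_neg <- sum(y_true == 0)
  r <- rank(p_score)   # tied scores receive their average rank
  (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data with one tied score across classes (illustrative values)
y_toy <- c(0, 0, 1, 0, 1, 1, 0, 1)
p_toy <- c(0.10, 0.20, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50)

auc_from_ranks <- auc_rank(y_toy, p_toy)

# Brute-force check: compare every (default, non-default) pair of scores
pos <- p_toy[y_toy == 1]
neg <- p_toy[y_toy == 0]
auc_from_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

c(ranks = auc_from_ranks, pairs = auc_from_pairs)
```

Applied to `test$defaulted` and `p_hat`, the rank-based value should agree with the trapezoid result up to floating-point error; a larger gap would point to a problem in how thresholds or ties are handled.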
# Create deciles based on predicted probability
ord <- order(p_hat, decreasing = TRUE)
p_sorted <- p_hat[ord]
y_sorted <- test$defaulted[ord]
n_test <- length(y_sorted)
decile_id <- ceiling((1:n_test) / (n_test/10))
decile_id[decile_id > 10] <- 10
# Default rate by decile
lift_table <- data.frame(decile = 1:10, n = NA, default_rate = NA)
for (d in 1:10) {
  idx <- which(decile_id == d)
  lift_table$n[d] <- length(idx)
  lift_table$default_rate[d] <- mean(y_sorted[idx] == 1)
}
lift_table
## decile n default_rate
## 1 1 10 0.30000000
## 2 2 11 0.09090909
## 3 3 10 0.10000000
## 4 4 11 0.36363636
## 5 5 10 0.10000000
## 6 6 11 0.09090909
## 7 7 10 0.10000000
## 8 8 11 0.36363636
## 9 9 10 0.00000000
## 10 10 11 0.00000000
# Plot default rate by decile
plot(lift_table$decile, lift_table$default_rate, type = "b",
xlab = "Decile (1 = highest predicted risk)",
ylab = "Observed default rate",
main = "Lift by Decile (Test Set)")
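The decile default rates become easier to interpret once they are divided by the overall rate (lift) and accumulated (cumulative capture). The sketch below rebuilds the decile table from the printed output rather than from the model, so it runs on its own; `lift_demo` and its derived columns are illustrative names.

```r
# Decile sizes and default rates copied from the lift table above
lift_demo <- data.frame(
  decile = 1:10,
  n = c(10, 11, 10, 11, 10, 11, 10, 11, 10, 11),
  default_rate = c(0.30, 1/11, 0.10, 4/11, 0.10, 1/11, 0.10, 4/11, 0, 0)
)

# Recover default counts per decile and the overall test-set default rate
lift_demo$defaults <- round(lift_demo$n * lift_demo$default_rate)
overall_rate <- sum(lift_demo$defaults) / sum(lift_demo$n)

# Lift: decile default rate relative to the overall rate (1 = no better than average)
lift_demo$lift <- lift_demo$default_rate / overall_rate

# Cumulative share of all defaults captured down to each decile
lift_demo$cum_capture <- cumsum(lift_demo$defaults) / sum(lift_demo$defaults)

lift_demo
```

A well-ranked model would show lift falling steadily from decile 1 downward. Here the rates jump around, with deciles 4 and 8 above decile 1, which is consistent with the weak separation seen in the ROC curve.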
Credit Default Risk Model Evaluation
Purpose
The purpose of this analysis was to evaluate a baseline credit risk model and assess how well selected borrower characteristics explain and predict loan default. The analysis focused on model evaluation, interpretation of predictors, and translation of results into business-relevant insights rather than maximizing predictive accuracy.
Key Findings from Statistical Tests
Univariate analysis using a t-test found no statistically significant difference in mean credit scores between defaulting and non-defaulting applicants in the training data. Similarly, ANOVA results showed no significant differences in mean income across low, medium, and high credit score groups, with post-hoc tests confirming the absence of meaningful group-level differences. The t-test result suggests that credit score, when considered in isolation, does not strongly distinguish default risk in this dataset and should not be relied upon alone for credit decisions. The ANOVA result says nothing about default directly; it indicates that income and credit score group are largely unrelated in this sample.
Baseline Model Interpretation
A baseline logistic regression model was estimated using credit score, income, and loan amount as predictors. None of the predictors were statistically significant, and coefficient magnitudes were small, indicating weak individual effects. While the negative signs on credit score and income align with business intuition—suggesting that stronger credit profiles and higher income reduce default risk—the overall results highlight the limitations of this simple model. These findings reinforce the importance of evaluating predictors jointly and treating this model as a starting point rather than a final solution.
Model Performance & Business Implications
Model evaluation on the test set revealed that, at a 0.50 classification threshold, the model predicted no defaults, resulting in high accuracy (approximately 84.8%) but zero recall. This demonstrates that accuracy alone is misleading in imbalanced classification problems such as credit risk. Recall is the most important metric in this context, as false negatives—failing to identify high-risk borrowers—can lead to direct financial losses. The ROC curve yielded an AUC of approximately 0.61, indicating modest discriminatory power that is better than random but insufficient for standalone deployment.
Model Suitability & Next Steps
In its current form, this baseline model is not sufficient for operational decision-making. Key risks include missed defaults due to low recall and limited predictive separation. To improve performance, future work should incorporate additional predictors (e.g., payment history, utilization ratios), explore alternative modeling approaches, and tune classification thresholds based on business cost tradeoffs. With these enhancements, the model could become a more effective tool for supporting credit risk decisions.
Submit:
- R script or Rmd (this file) with your code and answers
- Short memo (1–2 pages) summarizing:
  - Key findings from t-test and ANOVA
  - Baseline model interpretation
  - Performance results and business implications
Before submitting:
- Confirm you used training data only for t-test and ANOVA.
- Confirm all plots have titles and labeled axes.
- Confirm your memo states which metric matters most and why.