Stage 2 - Predictive Modeling

Author

Efe Colak

Published

May 25, 2026

Introduction

This Stage 2 report builds predictive models for the two datasets selected in Stage 1. The regression dataset uses professional football player characteristics to predict wages. The classification dataset uses borrower characteristics to predict credit risk.

The report includes train/test splitting, two predictive models for each dataset, test-set evaluation metrics, model comparison, 5-fold cross-validation, an AI interaction log, and a brief conclusion.

Setup

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.3

Warning: package 'purrr' was built under R version 4.5.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rsample)

Warning: package 'rsample' was built under R version 4.5.3

library(knitr)

set.seed(465)

1. Regression Dataset: Player Wages

1.1 Load and Clean Data

The outcome/dependent variable for the regression dataset is Wage. It is a continuous numeric variable representing player wage.

salary_raw <- read_csv("SalaryPrediction.csv", show_col_types = FALSE)

salary_clean <- salary_raw %>%
  mutate(
    wage = parse_number(as.character(Wage)),
    age = as.numeric(Age),
    apps = as.numeric(Apps),
    caps = as.numeric(Caps),
    position = as.factor(Position),
    log_wage = log(wage)
  ) %>%
  select(wage, log_wage, age, apps, caps, position) %>%
  drop_na() %>%
  distinct()

glimpse(salary_clean)

Rows: 3,566
Columns: 6
$ wage     <dbl> 46427000, 42125000, 34821000, 19959000, 19500000, 18810000, 1…
$ log_wage <dbl> 17.65339, 17.55615, 17.36573, 16.80919, 16.78593, 16.74990, 1…
$ age      <dbl> 23, 30, 35, 31, 31, 30, 29, 30, 27, 29, 31, 22, 32, 29, 31, 3…
$ apps     <dbl> 190, 324, 585, 443, 480, 371, 427, 367, 326, 287, 399, 159, 4…
$ caps     <dbl> 57, 119, 162, 120, 45, 94, 102, 85, 77, 86, 91, 21, 105, 50, …
$ position <fct> Forward, Midfilder, Forward, Forward, Goalkeeper, Defender, F…

1.2 Data Splitting

The data was split into 80% training and 20% testing using initial_split(). A seed of 465 was set for reproducibility.

set.seed(465)

salary_split <- initial_split(salary_clean, prop = 0.80)
salary_train <- training(salary_split)
salary_test <- testing(salary_split)

reg_split_sizes <- tibble(
  Dataset = c("Training set", "Test set"),
  Sample_Size = c(nrow(salary_train), nrow(salary_test))
)

kable(reg_split_sizes, caption = "Regression Train/Test Split Sample Sizes")

Regression Train/Test Split Sample Sizes
Dataset	Sample_Size
Training set	2852
Test set	714

1.3 Regression Model 1: Basic Linear Regression

Model 1 predicts log wage using age, appearances, and caps.

reg_model_1 <- lm(log_wage ~ age + apps + caps, data = salary_train)

reg_pred_1 <- salary_test %>%
  mutate(
    pred_log_wage = predict(reg_model_1, newdata = salary_test)
  )

rmse_function <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

r2_function <- function(actual, predicted) {
  1 - sum((actual - predicted)^2, na.rm = TRUE) / 
    sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
}

reg_metrics_1 <- tibble(
  Model = "Model 1: Basic Linear Regression",
  RMSE = rmse_function(reg_pred_1$log_wage, reg_pred_1$pred_log_wage),
  R_squared = r2_function(reg_pred_1$log_wage, reg_pred_1$pred_log_wage)
)

kable(reg_metrics_1, digits = 4, caption = "Regression Model 1 Test Performance")

Regression Model 1 Test Performance
Model	RMSE	R_squared
Model 1: Basic Linear Regression	1.4096	0.3816

1.4 Regression Model 2: Expanded Linear Regression

Model 2 adds player position as an additional predictor.

reg_model_2 <- lm(log_wage ~ age + apps + caps + position, data = salary_train)

reg_pred_2 <- salary_test %>%
  mutate(
    pred_log_wage = predict(reg_model_2, newdata = salary_test)
  )

reg_metrics_2 <- tibble(
  Model = "Model 2: Expanded Linear Regression",
  RMSE = rmse_function(reg_pred_2$log_wage, reg_pred_2$pred_log_wage),
  R_squared = r2_function(reg_pred_2$log_wage, reg_pred_2$pred_log_wage)
)

kable(reg_metrics_2, digits = 4, caption = "Regression Model 2 Test Performance")

Regression Model 2 Test Performance
Model	RMSE	R_squared
Model 2: Expanded Linear Regression	1.3975	0.3922

1.5 Regression Model Comparison

reg_comparison <- bind_rows(reg_metrics_1, reg_metrics_2)

kable(reg_comparison, digits = 4, caption = "Regression Model Comparison on Test Set")

Regression Model Comparison on Test Set
Model	RMSE	R_squared
Model 1: Basic Linear Regression	1.4096	0.3816
Model 2: Expanded Linear Regression	1.3975	0.3922

best_reg_model_name <- reg_comparison %>%
  arrange(RMSE, desc(R_squared)) %>%
  slice(1) %>%
  pull(Model)

best_reg_model_name

[1] "Model 2: Expanded Linear Regression"

The better regression model is selected based mainly on lower RMSE and higher R-squared. RMSE measures average prediction error on the log wage scale, while R-squared measures the share of variation in log wages explained by the model. If the expanded model performs better, this suggests that player position adds useful information for predicting wages.

1.6 5-Fold Cross-Validation for Best Regression Model

set.seed(465)

reg_folds <- vfold_cv(salary_clean, v = 5)

reg_cv_results <- reg_folds %>%
  mutate(
    model = map(splits, ~ lm(log_wage ~ age + apps + caps + position, data = analysis(.x))),
    predictions = map2(model, splits, ~ assessment(.y) %>%
                         mutate(pred_log_wage = predict(.x, newdata = assessment(.y)))),
    RMSE = map_dbl(predictions, ~ rmse_function(.x$log_wage, .x$pred_log_wage)),
    R_squared = map_dbl(predictions, ~ r2_function(.x$log_wage, .x$pred_log_wage))
  ) %>%
  select(id, RMSE, R_squared)

reg_cv_summary <- reg_cv_results %>%
  summarise(
    CV_RMSE = mean(RMSE, na.rm = TRUE),
    CV_R_squared = mean(R_squared, na.rm = TRUE)
  )

kable(reg_cv_results, digits = 4, caption = "Regression 5-Fold Cross-Validation Results")

Regression 5-Fold Cross-Validation Results
id	RMSE	R_squared
Fold1	1.3747	0.4007
Fold2	1.3579	0.4032
Fold3	1.4030	0.4069
Fold4	1.3224	0.4543
Fold5	1.3547	0.3704

kable(reg_cv_summary, digits = 4, caption = "Average Regression Cross-Validated Performance")

Average Regression Cross-Validated Performance
CV_RMSE	CV_R_squared
1.3625	0.4071

Cross-validation helps evaluate whether the selected regression model is stable across different subsets of the data. If the cross-validated RMSE is close to the test-set RMSE, the model appears reasonably stable. A large difference would suggest possible overfitting or sensitivity to the train/test split.

2. Classification Dataset: German Credit Risk

2.1 Load and Clean Data

The outcome/dependent variable for the classification dataset is Risk. It is a binary categorical variable with two possible values:

good: lower credit risk
bad: higher credit risk

For modeling, bad is treated as the positive class because identifying risky borrowers is practically important.

credit_raw <- read_csv("german_credit_data.csv", show_col_types = FALSE)

New names:
• `` -> `...1`

credit_clean <- credit_raw %>%
  rename(
    age = Age,
    sex = Sex,
    job = Job,
    housing = Housing,
    saving_accounts = `Saving accounts`,
    checking_account = `Checking account`,
    credit_amount = `Credit amount`,
    duration = Duration,
    purpose = Purpose,
    risk = Risk
  ) %>%
  mutate(
    risk = factor(risk, levels = c("good", "bad")),
    sex = as.factor(sex),
    job = as.factor(job),
    housing = as.factor(housing),
    saving_accounts = fct_explicit_na(as.factor(saving_accounts), na_level = "unknown"),
    checking_account = fct_explicit_na(as.factor(checking_account), na_level = "unknown"),
    purpose = as.factor(purpose)
  ) %>%
  select(age, sex, job, housing, saving_accounts, checking_account,
         credit_amount, duration, purpose, risk) %>%
  drop_na() %>%
  distinct()

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `saving_accounts = fct_explicit_na(as.factor(saving_accounts),
  na_level = "unknown")`.
Caused by warning:
! `fct_explicit_na()` was deprecated in forcats 1.0.0.
ℹ Please use `fct_na_value_to_level()` instead.

glimpse(credit_clean)

Rows: 1,000
Columns: 10
$ age              <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, 6…
$ sex              <fct> male, female, male, male, male, male, male, male, mal…
$ job              <fct> 2, 2, 1, 2, 2, 1, 2, 3, 1, 3, 2, 2, 2, 1, 2, 1, 2, 2,…
$ housing          <fct> own, own, own, free, free, free, own, rent, own, own,…
$ saving_accounts  <fct> unknown, little, little, little, little, unknown, qui…
$ checking_account <fct> little, moderate, unknown, little, little, unknown, u…
$ credit_amount    <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059,…
$ duration         <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 24…
$ purpose          <fct> radio/TV, radio/TV, education, furniture/equipment, c…
$ risk             <fct> good, bad, good, good, bad, good, good, good, good, b…

2.2 Data Splitting

The data was split into 80% training and 20% testing using initial_split(). A seed of 465 was set for reproducibility.

set.seed(465)

credit_split <- initial_split(credit_clean, prop = 0.80, strata = risk)
credit_train <- training(credit_split)
credit_test <- testing(credit_split)

class_split_sizes <- tibble(
  Dataset = c("Training set", "Test set"),
  Sample_Size = c(nrow(credit_train), nrow(credit_test))
)

kable(class_split_sizes, caption = "Classification Train/Test Split Sample Sizes")

Classification Train/Test Split Sample Sizes
Dataset	Sample_Size
Training set	800
Test set	200

2.3 Classification Model 1: Basic Logistic Regression

Model 1 predicts credit risk using age, credit amount, and duration.

class_model_1 <- glm(
  risk ~ age + credit_amount + duration,
  data = credit_train,
  family = binomial
)

class_pred_1 <- credit_test %>%
  mutate(
    prob_bad = predict(class_model_1, newdata = credit_test, type = "response"),
    pred_risk = factor(if_else(prob_bad >= 0.50, "bad", "good"), levels = c("good", "bad"))
  )

classification_metrics <- function(actual, predicted) {
  tp <- sum(actual == "bad" & predicted == "bad")
  tn <- sum(actual == "good" & predicted == "good")
  fp <- sum(actual == "good" & predicted == "bad")
  fn <- sum(actual == "bad" & predicted == "good")

  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  precision <- ifelse((tp + fp) == 0, NA, tp / (tp + fp))
  recall <- ifelse((tp + fn) == 0, NA, tp / (tp + fn))

  tibble(
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall
  )
}

class_metrics_1 <- classification_metrics(class_pred_1$risk, class_pred_1$pred_risk) %>%
  mutate(Model = "Model 1: Basic Logistic Regression") %>%
  select(Model, Accuracy, Precision, Recall)

kable(class_metrics_1, digits = 4, caption = "Classification Model 1 Test Performance")

Classification Model 1 Test Performance
Model	Accuracy	Precision	Recall
Model 1: Basic Logistic Regression	0.705	0.5385	0.1167

2.4 Classification Model 2: Expanded Logistic Regression

Model 2 adds categorical borrower characteristics such as housing, saving accounts, checking account status, purpose, sex, and job.

class_model_2 <- glm(
  risk ~ age + credit_amount + duration + housing + saving_accounts +
    checking_account + purpose + sex + job,
  data = credit_train,
  family = binomial
)

class_pred_2 <- credit_test %>%
  mutate(
    prob_bad = predict(class_model_2, newdata = credit_test, type = "response"),
    pred_risk = factor(if_else(prob_bad >= 0.50, "bad", "good"), levels = c("good", "bad"))
  )

class_metrics_2 <- classification_metrics(class_pred_2$risk, class_pred_2$pred_risk) %>%
  mutate(Model = "Model 2: Expanded Logistic Regression") %>%
  select(Model, Accuracy, Precision, Recall)

kable(class_metrics_2, digits = 4, caption = "Classification Model 2 Test Performance")

Classification Model 2 Test Performance
Model	Accuracy	Precision	Recall
Model 2: Expanded Logistic Regression	0.745	0.6047	0.4333

2.5 Classification Model Comparison

class_comparison <- bind_rows(class_metrics_1, class_metrics_2)

kable(class_comparison, digits = 4, caption = "Classification Model Comparison on Test Set")

Classification Model Comparison on Test Set
Model	Accuracy	Precision	Recall
Model 1: Basic Logistic Regression	0.705	0.5385	0.1167
Model 2: Expanded Logistic Regression	0.745	0.6047	0.4333

best_class_model_name <- class_comparison %>%
  arrange(desc(Recall), desc(Accuracy)) %>%
  slice(1) %>%
  pull(Model)

best_class_model_name

[1] "Model 2: Expanded Logistic Regression"

Accuracy measures the overall percentage of correct predictions. Precision measures how many predicted bad credit risks are actually bad. Recall measures how many actual bad borrowers the model successfully identifies.

For this research question, recall is especially important because failing to identify high-risk borrowers may be costly for lenders.

2.6 5-Fold Cross-Validation for Best Classification Model

set.seed(465)

class_folds <- vfold_cv(credit_clean, v = 5, strata = risk)

class_cv_results <- class_folds %>%
  mutate(
    model = map(splits, ~ glm(
      risk ~ age + credit_amount + duration + housing + saving_accounts +
        checking_account + purpose + sex + job,
      data = analysis(.x),
      family = binomial
    )),
    predictions = map2(model, splits, ~ assessment(.y) %>%
                         mutate(
                           prob_bad = predict(.x, newdata = assessment(.y), type = "response"),
                           pred_risk = factor(if_else(prob_bad >= 0.50, "bad", "good"),
                                              levels = c("good", "bad"))
                         )),
    metrics = map(predictions, ~ classification_metrics(.x$risk, .x$pred_risk))
  ) %>%
  select(id, metrics) %>%
  unnest(metrics)

class_cv_summary <- class_cv_results %>%
  summarise(
    CV_Accuracy = mean(Accuracy, na.rm = TRUE),
    CV_Precision = mean(Precision, na.rm = TRUE),
    CV_Recall = mean(Recall, na.rm = TRUE)
  )

kable(class_cv_results, digits = 4, caption = "Classification 5-Fold Cross-Validation Results")

Classification 5-Fold Cross-Validation Results
id	Accuracy	Precision	Recall
Fold1	0.690	0.4783	0.3667
Fold2	0.715	0.5333	0.4000
Fold3	0.750	0.6136	0.4500
Fold4	0.760	0.6667	0.4000
Fold5	0.755	0.6486	0.4000

kable(class_cv_summary, digits = 4, caption = "Average Classification Cross-Validated Performance")

Average Classification Cross-Validated Performance
CV_Accuracy	CV_Precision	CV_Recall
0.734	0.5881	0.4033

Cross-validation helps determine whether the selected classification model performs consistently across different folds of the data. If the cross-validated metrics are close to the test-set metrics, this suggests the model is relatively stable. If the test performance is much higher than cross-validation performance, this may indicate overfitting.

3. AI Interaction Log

Prompt / Question

We asked ChatGPT how to organize a Stage 2 predictive modeling report in R using train/test splits, regression models, logistic regression models, and evaluation metrics.

AI Response / Relevant Excerpt

The AI suggested using initial_split() for train/test splitting and recommended basic functions for regression and logistic regression modeling in R. It also mentioned common evaluation metrics such as RMSE, R-squared, accuracy, precision, and recall.

How I Used It

We used the suggestions as a general reference while organizing the analysis. The code was adjusted to fit the actual variables in the datasets, including Wage and Risk.

Reflection

The interaction was useful for reviewing the modeling workflow and evaluation metrics in R. We checked the outputs manually and learned how different metrics can provide different perspectives on model performance.

4. Overall Conclusion

This Stage 2 report built and evaluated predictive models for both datasets.

For the regression dataset, the models attempted to predict football player wages using player characteristics. The better regression model was chosen based on lower RMSE and higher R-squared.

For the classification dataset, logistic regression models were used to predict whether borrowers are classified as good or bad credit risks. The better classification model was chosen by considering accuracy, precision, and recall. Recall was especially important because identifying high-risk borrowers is a key concern in credit risk analysis.

Overall, the modeling results show that predictive models can provide useful insights, but model selection should consider both numerical performance and the practical meaning of the research question.