This Stage 2 report builds predictive models for the two datasets selected in Stage 1. The regression dataset uses professional football player characteristics to predict wages. The classification dataset uses borrower characteristics to predict credit risk.
The report includes train/test splitting, two predictive models for each dataset, test-set evaluation metrics, model comparison, 5-fold cross-validation, an AI interaction log, and a brief conclusion.
Setup
library(tidyverse)
library(rsample)
library(knitr)
set.seed(465)
1. Regression Dataset: Player Wages
1.1 Load and Clean Data
The outcome/dependent variable for the regression dataset is Wage. It is a continuous numeric variable representing player wage.
The better regression model is selected mainly on lower RMSE and higher R-squared on the test set. RMSE measures the typical size of prediction errors on the log wage scale and penalizes large errors more heavily than small ones, while R-squared measures the share of variation in log wages explained by the model. If the expanded model performs better, this suggests that player position adds useful information for predicting wages.
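The comparison described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the data are simulated, the predictor names `Overall`, `Age`, and `Position` are hypothetical stand-ins, and the 80/20 split proportion is an assumption. Only `Wage` and the log wage scale come from the report.

```r
library(rsample)

set.seed(465)

# Simulated stand-in for the player data.
# All column names except Wage are hypothetical.
n <- 400
players <- data.frame(
  Overall  = round(rnorm(n, mean = 70, sd = 6)),
  Age      = sample(18:36, n, replace = TRUE),
  Position = factor(sample(c("GK", "DF", "MF", "FW"), n, replace = TRUE))
)
players$Wage <- exp(2 + 0.09 * players$Overall + 0.01 * players$Age +
                      0.2 * (players$Position == "FW") + rnorm(n, sd = 0.4))
players$log_wage <- log(players$Wage)

# Train/test split (the 80/20 proportion is an assumption).
split <- initial_split(players, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Baseline model and an expanded model that adds position.
model_1 <- lm(log_wage ~ Overall + Age, data = train)
model_2 <- lm(log_wage ~ Overall + Age + Position, data = train)

# Test-set RMSE and R-squared, both on the log wage scale.
evaluate <- function(model, data) {
  pred <- predict(model, newdata = data)
  obs  <- data$log_wage
  data.frame(
    RMSE      = sqrt(mean((obs - pred)^2)),
    R_squared = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  )
}

reg_comparison <- rbind(
  cbind(Model = "Baseline", evaluate(model_1, test)),
  cbind(Model = "Expanded", evaluate(model_2, test))
)
print(reg_comparison)
```

Because both metrics are computed on held-out rows, a lower RMSE for the expanded model reflects out-of-sample improvement rather than the automatic in-sample gain from adding predictors.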
1.6 5-Fold Cross-Validation for Best Regression Model
Cross-validation helps evaluate whether the selected regression model is stable across different subsets of the data. If the cross-validated RMSE is close to the test-set RMSE, the model appears reasonably stable. A large difference would suggest possible overfitting or sensitivity to the train/test split.
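One way to run the 5-fold procedure is with `vfold_cv()` from rsample, as in the hedged sketch below. The data are simulated and the predictors `Overall` and `Age` are hypothetical; the report's selected model may use different variables.

```r
library(rsample)

set.seed(465)

# Simulated stand-in data; Overall and Age are hypothetical predictor names.
n <- 300
players <- data.frame(
  Overall = rnorm(n, mean = 70, sd = 6),
  Age     = sample(18:36, n, replace = TRUE)
)
players$log_wage <- 2 + 0.09 * players$Overall + 0.01 * players$Age +
  rnorm(n, sd = 0.4)

# Five folds: each row is held out exactly once.
folds <- vfold_cv(players, v = 5)

# Fit on each fold's analysis set and score on its assessment set.
fold_rmse <- sapply(folds$splits, function(s) {
  fit  <- lm(log_wage ~ Overall + Age, data = analysis(s))
  held <- assessment(s)
  sqrt(mean((held$log_wage - predict(fit, newdata = held))^2))
})

cv_summary <- c(CV_RMSE = mean(fold_rmse), SD_across_folds = sd(fold_rmse))
print(round(cv_summary, 4))
```

The standard deviation across folds gives a rough sense of how much the RMSE moves from one subset to another, which helps in judging whether a gap between the cross-validated and test-set RMSE is large or just ordinary fold-to-fold noise.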
2. Classification Dataset: German Credit Risk
2.1 Load and Clean Data
The outcome/dependent variable for the classification dataset is Risk. It is a binary categorical variable with two possible values:
good: lower credit risk
bad: higher credit risk
For modeling, bad is treated as the positive class because identifying risky borrowers is practically important.
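In base R, the positive class for logistic regression is determined by factor level order, so it is worth setting the levels explicitly. The sketch below uses a tiny hypothetical vector rather than the real Risk column, and an intercept-only model purely to show which class the fitted probabilities refer to.

```r
# Hypothetical miniature version of the Risk column.
risk_raw <- c("good", "bad", "good", "good", "bad", "good")

# Put "good" first so that "bad" is the second level.
# glm(family = binomial) treats the first factor level as failure and
# all other levels as success, so fitted values are P(Risk == "bad").
risk <- factor(risk_raw, levels = c("good", "bad"))

# Intercept-only model: the fitted probability equals the share of "bad".
fit <- glm(risk ~ 1, family = binomial)
p_bad <- unname(fitted(fit)[1])

print(levels(risk))
print(round(p_bad, 4))
print(mean(risk == "bad"))
```

If the levels were left in alphabetical order (`"bad"`, `"good"`), the same model would return the probability of good instead, and precision and recall computed from it would describe the wrong class.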
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `saving_accounts = fct_explicit_na(as.factor(saving_accounts),
na_level = "unknown")`.
Caused by warning:
! `fct_explicit_na()` was deprecated in forcats 1.0.0.
ℹ Please use `fct_na_value_to_level()` instead.
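The deprecation warning above can be resolved by switching to `fct_na_value_to_level()`, which takes the replacement label through its `level` argument. A minimal sketch, using a hypothetical vector in place of the real `saving_accounts` column:

```r
library(forcats)

# Hypothetical values standing in for the saving_accounts column.
saving_accounts <- c("little", NA, "moderate", "rich", NA)

# Non-deprecated replacement for
# fct_explicit_na(as.factor(saving_accounts), na_level = "unknown").
saving_accounts_clean <- fct_na_value_to_level(
  as.factor(saving_accounts),
  level = "unknown"
)

print(levels(saving_accounts_clean))
print(table(saving_accounts_clean))
```

Inside the existing `mutate()` call, the same expression would replace the `fct_explicit_na()` call on the right-hand side of `saving_accounts = ...`.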
class_comparison <- bind_rows(class_metrics_1, class_metrics_2)
kable(class_comparison, digits = 4, caption = "Classification Model Comparison on Test Set")
Accuracy measures the overall share of correct predictions. Precision measures the share of borrowers predicted to be bad credit risks who are actually bad. Recall measures the share of actual bad borrowers that the model successfully identifies.
For this research question, recall is especially important because failing to identify high-risk borrowers may be costly for lenders.
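To make the three metrics concrete, the sketch below computes them by hand with bad as the positive class. The labels and predicted probabilities are made up for illustration, and the 0.35 cutoff is not something the report uses; it is included only to show how recall responds when more borrowers are flagged as bad.

```r
# Hypothetical actual labels and predicted probabilities of "bad".
actual <- factor(
  c("bad", "bad", "bad", "bad", "good", "good", "good", "good", "good", "good"),
  levels = c("good", "bad")
)
prob_bad <- c(0.80, 0.65, 0.40, 0.30, 0.70, 0.45, 0.20, 0.15, 0.10, 0.05)

# Classify at a given cutoff and compute the metrics with "bad" as positive.
classify <- function(threshold) {
  predicted_bad <- prob_bad >= threshold
  actual_bad    <- actual == "bad"
  tp <- sum(predicted_bad & actual_bad)    # bad, flagged as bad
  fp <- sum(predicted_bad & !actual_bad)   # good, flagged as bad
  fn <- sum(!predicted_bad & actual_bad)   # bad, missed
  tn <- sum(!predicted_bad & !actual_bad)  # good, not flagged
  c(Threshold = threshold,
    Accuracy  = (tp + tn) / length(actual),
    Precision = tp / (tp + fp),
    Recall    = tp / (tp + fn))
}

metrics <- rbind(classify(0.5), classify(0.35))
print(round(metrics, 4))
#      Threshold Accuracy Precision Recall
# [1,]      0.50      0.7    0.6667   0.50
# [2,]      0.35      0.7    0.6000   0.75
```

In this toy example accuracy is identical at both cutoffs, yet the lower cutoff catches three of the four bad borrowers instead of two, at the price of lower precision. This is why accuracy alone can hide the difference that matters most to a lender.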
2.6 5-Fold Cross-Validation for Best Classification Model
Average Classification Cross-Validated Performance

| CV_Accuracy | CV_Precision | CV_Recall |
|------------:|-------------:|----------:|
|       0.734 |       0.5881 |    0.4033 |
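Cross-validation helps determine whether the selected classification model performs consistently across different folds of the data. If the cross-validated metrics are close to the test-set metrics, this suggests the model is relatively stable. If the test performance is much higher than the cross-validated performance, the single train/test split may have been favorable by chance, and the cross-validated estimate is the more reliable one. Overfitting would instead show up as training performance that is much higher than both.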
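A hedged sketch of how averaged fold metrics like these might be produced is shown below. The data are simulated, `Duration` and `Age` are hypothetical predictor names, and both the 0.5 cutoff and the use of stratified folds are assumptions rather than details confirmed by the report.

```r
library(rsample)

set.seed(465)

# Simulated stand-in for the credit data.
# Duration and Age are hypothetical predictor names.
n <- 500
credit <- data.frame(
  Duration = sample(6:60, n, replace = TRUE),
  Age      = sample(19:70, n, replace = TRUE)
)
p_bad <- plogis(-3 + 0.08 * credit$Duration - 0.01 * credit$Age)
credit$Risk <- factor(ifelse(runif(n) < p_bad, "bad", "good"),
                      levels = c("good", "bad"))

# Stratify so each fold keeps roughly the same share of bad borrowers.
folds <- vfold_cv(credit, v = 5, strata = Risk)

# Fit on the analysis set, then score the held-out assessment set.
fold_metrics <- sapply(folds$splits, function(s) {
  fit  <- glm(Risk ~ Duration + Age, data = analysis(s), family = binomial)
  held <- assessment(s)
  pred_bad   <- predict(fit, newdata = held, type = "response") >= 0.5
  actual_bad <- held$Risk == "bad"
  c(CV_Accuracy  = mean(pred_bad == actual_bad),
    CV_Precision = sum(pred_bad & actual_bad) / sum(pred_bad),
    CV_Recall    = sum(pred_bad & actual_bad) / sum(actual_bad))
})

# Average each metric across the five folds.
# na.rm guards against a fold in which no borrower is predicted bad.
cv_means <- rowMeans(fold_metrics, na.rm = TRUE)
print(round(cv_means, 4))
```

Looking at `fold_metrics` directly, not just the row means, shows how much each metric varies from fold to fold, which is the consistency the paragraph above refers to.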
3. AI Interaction Log
Prompt / Question
We asked ChatGPT how to organize a Stage 2 predictive modeling report in R using train/test splits, regression models, logistic regression models, and evaluation metrics.
AI Response / Relevant Excerpt
The AI suggested using initial_split() for train/test splitting and recommended basic functions for regression and logistic regression modeling in R. It also mentioned common evaluation metrics such as RMSE, R-squared, accuracy, precision, and recall.
How We Used It
We used the suggestions as a general reference while organizing the analysis. The code was adjusted to fit the actual variables in the datasets, including Wage and Risk.
Reflection
The interaction was useful for reviewing the modeling workflow and evaluation metrics in R. We checked the outputs manually and learned how different metrics can provide different perspectives on model performance.
4. Overall Conclusion
This Stage 2 report built and evaluated predictive models for both datasets.
For the regression dataset, the models attempted to predict football player wages using player characteristics. The better regression model was chosen based on lower RMSE and higher R-squared.
For the classification dataset, logistic regression models were used to predict whether borrowers are classified as good or bad credit risks. The better classification model was chosen by considering accuracy, precision, and recall. Recall was especially important because identifying high-risk borrowers is a key concern in credit risk analysis.
Overall, the modeling results show that predictive models can provide useful insights, but model selection should consider both numerical performance and the practical meaning of the research question.