Stage 2: Predictive Modeling Report

Authors

Ece Kurtoğlu and Halil Rıfat Başbuğ

Introduction

This Stage 2 report continues the project from Stage 1. In the first stage, two real-world datasets were selected and prepared for prediction.

The first dataset is the Student Performance Factors dataset. This dataset is used for a regression problem because the target variable, exam_score, is continuous.

The second dataset is the Loan Approval dataset. This dataset is used for a classification problem because the target variable shows whether a loan application was approved or rejected.

The goal of this report is not only to build predictive models, but also to understand the economic and behavioral factors associated with academic performance and loan approval decisions.

For the student dataset, the analysis investigates how study behavior, attendance, and socioeconomic characteristics influence exam performance.

For the loan approval dataset, the analysis examines how financial strength, creditworthiness, and applicant characteristics affect loan approval outcomes.

The report compares simple and more comprehensive predictive models and evaluates whether adding explanatory variables improves predictive performance.

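All analyses were conducted in R using the tidymodels framework, together with tidyverse packages for data import, wrangling, and plotting and the janitor package for clean_names().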

student_raw <- read_csv("/Users/ecekurtoglu/ECON 465/ECON465_Stage1_Data_Acquisition_and_Probability_Foundations/StudentPerformanceFactors.csv")

Rows: 6607 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Parental_Involvement, Access_to_Resources, Extracurricular_Activit...
dbl  (7): Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, Tutoring_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

student_clean <- student_raw |>
  clean_names() |>
  drop_na()

loan_raw <- read_csv("/Users/ecekurtoglu/ECON 465/ECON465_Stage1_Data_Acquisition_and_Probability_Foundations/loan_approval_dataset.csv")

Rows: 4269 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): education, self_employed, loan_status
dbl (10): loan_id, no_of_dependents, income_annum, loan_amount, loan_term, c...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

loan_clean <- loan_raw |>
  clean_names() |>
  drop_na() |>
  mutate(
    loan_status = str_trim(loan_status),
    loan_approved = if_else(
      loan_status == "Approved",
      "approved",
      "rejected"
    ),
    loan_approved = factor(
      loan_approved,
      levels = c("approved", "rejected")
    )
  )

glimpse(student_clean)

Rows: 6,378
Columns: 20
$ hours_studied              <dbl> 23, 19, 24, 29, 19, 19, 29, 25, 17, 23, 17,…
$ attendance                 <dbl> 84, 64, 98, 89, 92, 88, 84, 78, 94, 98, 80,…
$ parental_involvement       <chr> "Low", "Low", "Medium", "Low", "Medium", "M…
$ access_to_resources        <chr> "High", "Medium", "Medium", "Medium", "Medi…
$ extracurricular_activities <chr> "No", "No", "Yes", "Yes", "Yes", "Yes", "Ye…
$ sleep_hours                <dbl> 7, 8, 7, 8, 6, 8, 7, 6, 6, 8, 8, 6, 8, 8, 8…
$ previous_scores            <dbl> 73, 59, 91, 98, 65, 89, 68, 50, 80, 71, 88,…
$ motivation_level           <chr> "Low", "Low", "Medium", "Medium", "Medium",…
$ internet_access            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ tutoring_sessions          <dbl> 0, 2, 2, 1, 3, 3, 1, 1, 0, 0, 4, 2, 2, 2, 1…
$ family_income              <chr> "Low", "Medium", "Medium", "Medium", "Mediu…
$ teacher_quality            <chr> "Medium", "Medium", "Medium", "Medium", "Hi…
$ school_type                <chr> "Public", "Public", "Public", "Public", "Pu…
$ peer_influence             <chr> "Positive", "Negative", "Neutral", "Negativ…
$ physical_activity          <dbl> 3, 4, 4, 4, 4, 3, 2, 2, 1, 5, 4, 2, 4, 3, 4…
$ learning_disabilities      <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ parental_education_level   <chr> "High School", "College", "Postgraduate", "…
$ distance_from_home         <chr> "Near", "Moderate", "Near", "Moderate", "Ne…
$ gender                     <chr> "Male", "Female", "Male", "Male", "Female",…
$ exam_score                 <dbl> 67, 61, 74, 71, 70, 71, 67, 66, 69, 72, 68,…

glimpse(loan_clean)

Rows: 4,269
Columns: 14
$ loan_id                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ no_of_dependents         <dbl> 2, 0, 3, 3, 5, 0, 5, 2, 0, 5, 4, 2, 3, 2, 1, …
$ education                <chr> "Graduate", "Not Graduate", "Graduate", "Grad…
$ self_employed            <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", …
$ income_annum             <dbl> 9600000, 4100000, 9100000, 8200000, 9800000, …
$ loan_amount              <dbl> 29900000, 12200000, 29700000, 30700000, 24200…
$ loan_term                <dbl> 12, 8, 20, 8, 20, 10, 4, 20, 20, 10, 2, 18, 1…
$ cibil_score              <dbl> 778, 417, 506, 467, 382, 319, 678, 382, 782, …
$ residential_assets_value <dbl> 2400000, 2700000, 7100000, 18200000, 12400000…
$ commercial_assets_value  <dbl> 17600000, 2200000, 4500000, 3300000, 8200000,…
$ luxury_assets_value      <dbl> 22700000, 8800000, 33300000, 23300000, 294000…
$ bank_asset_value         <dbl> 8000000, 3300000, 12800000, 7900000, 5000000,…
$ loan_status              <chr> "Approved", "Rejected", "Rejected", "Rejected…
$ loan_approved            <fct> approved, rejected, rejected, rejected, rejec…

Interpretation

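The datasets were imported and cleaned before modeling. I used clean_names() to convert the variable names to a consistent lowercase format that is easier to use in R. I also used drop_na() to remove rows with missing values, because incomplete observations cannot be used in model training. This step reduced the student dataset from 6,607 to 6,378 rows, while the loan dataset kept all 4,269 rows.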

For the loan dataset, I created a new binary target variable called loan_approved. This variable has two categories: approved and rejected. This makes the dataset suitable for logistic regression.
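Before dropping rows, it can help to see which columns hold the missing values. The sketch below uses a small made-up data frame (toy_df and its values are hypothetical, not the project data) and base R's complete.cases() as a stand-in for drop_na(), on the assumption that the two agree for ordinary NA values.

```r
# Toy data frame with a few missing values (hypothetical, not the project data)
toy_df <- data.frame(
  hours_studied   = c(23, 19, NA, 29, 19),
  teacher_quality = c("Medium", NA, "High", "Medium", "Low"),
  exam_score      = c(67, 61, 74, 71, 70)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_df))

# Keep only rows with no missing values (base-R analogue of drop_na())
toy_clean <- toy_df[complete.cases(toy_df), ]

# Rows removed by the cleaning step
rows_dropped <- nrow(toy_df) - nrow(toy_clean)

print(na_per_column)
print(rows_dropped)
```

Running the same colSums(is.na(...)) call on student_raw would show which of the 20 columns are responsible for the dropped rows.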

Exploratory Data Analysis (EDA) of Predictor Variables

summary(student_clean)

 hours_studied     attendance     parental_involvement access_to_resources
 Min.   : 1.00   Min.   : 60.00   Length:6378          Length:6378        
 1st Qu.:16.00   1st Qu.: 70.00   Class :character     Class :character   
 Median :20.00   Median : 80.00   Mode  :character     Mode  :character   
 Mean   :19.98   Mean   : 80.02                                           
 3rd Qu.:24.00   3rd Qu.: 90.00                                           
 Max.   :44.00   Max.   :100.00                                           
 extracurricular_activities  sleep_hours     previous_scores 
 Length:6378                Min.   : 4.000   Min.   : 50.00  
 Class :character           1st Qu.: 6.000   1st Qu.: 63.00  
 Mode  :character           Median : 7.000   Median : 75.00  
                            Mean   : 7.035   Mean   : 75.07  
                            3rd Qu.: 8.000   3rd Qu.: 88.00  
                            Max.   :10.000   Max.   :100.00  
 motivation_level   internet_access    tutoring_sessions family_income     
 Length:6378        Length:6378        Min.   :0.000     Length:6378       
 Class :character   Class :character   1st Qu.:1.000     Class :character  
 Mode  :character   Mode  :character   Median :1.000     Mode  :character  
                                       Mean   :1.495                       
                                       3rd Qu.:2.000                       
                                       Max.   :8.000                       
 teacher_quality    school_type        peer_influence     physical_activity
 Length:6378        Length:6378        Length:6378        Min.   :0.000    
 Class :character   Class :character   Class :character   1st Qu.:2.000    
 Mode  :character   Mode  :character   Mode  :character   Median :3.000    
                                                          Mean   :2.973    
                                                          3rd Qu.:4.000    
                                                          Max.   :6.000    
 learning_disabilities parental_education_level distance_from_home
 Length:6378           Length:6378              Length:6378       
 Class :character      Class :character         Class :character  
 Mode  :character      Mode  :character         Mode  :character  
                                                                  
                                                                  
                                                                  
    gender            exam_score    
 Length:6378        Min.   : 55.00  
 Class :character   1st Qu.: 65.00  
 Mode  :character   Median : 67.00  
                    Mean   : 67.25  
                    3rd Qu.: 69.00  
                    Max.   :101.00

student_clean |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

student_clean |>
  ggplot(aes(x = attendance, y = exam_score)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

Interpretation

Before building predictive models, I explored the relationships between important predictor variables and the target variable.

For the student dataset, students with higher study hours and higher attendance levels generally appeared to achieve higher exam scores. This suggests that academic effort and participation may play an important role in student performance.

This exploratory analysis was important because it helped identify meaningful predictor variables before model construction.
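The visual impression from a scatterplot can be summarised with a Pearson correlation. The following is a minimal sketch on simulated data: the variable names mirror the report, but the values and the slopes used to generate them are assumptions, not estimates from the dataset.

```r
set.seed(1)  # arbitrary seed for the simulated example

# Simulated predictors and outcome (hypothetical values, not the project data)
n             <- 200
hours_studied <- round(runif(n, min = 1, max = 44))
attendance    <- round(runif(n, min = 60, max = 100))
exam_score    <- 40 + 0.3 * hours_studied + 0.2 * attendance + rnorm(n, sd = 2)

# Pearson correlation of each predictor with the outcome
cor_hours      <- cor(hours_studied, exam_score)
cor_attendance <- cor(attendance, exam_score)

round(c(hours_studied = cor_hours, attendance = cor_attendance), 2)
```

Applied to student_clean, the same cor() calls would put a number on how strongly each predictor moves with exam_score.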

summary(loan_clean)

    loan_id     no_of_dependents  education         self_employed     
 Min.   :   1   Min.   :0.000    Length:4269        Length:4269       
 1st Qu.:1068   1st Qu.:1.000    Class :character   Class :character  
 Median :2135   Median :3.000    Mode  :character   Mode  :character  
 Mean   :2135   Mean   :2.499                                         
 3rd Qu.:3202   3rd Qu.:4.000                                         
 Max.   :4269   Max.   :5.000                                         
  income_annum      loan_amount         loan_term     cibil_score   
 Min.   : 200000   Min.   :  300000   Min.   : 2.0   Min.   :300.0  
 1st Qu.:2700000   1st Qu.: 7700000   1st Qu.: 6.0   1st Qu.:453.0  
 Median :5100000   Median :14500000   Median :10.0   Median :600.0  
 Mean   :5059124   Mean   :15133450   Mean   :10.9   Mean   :599.9  
 3rd Qu.:7500000   3rd Qu.:21500000   3rd Qu.:16.0   3rd Qu.:748.0  
 Max.   :9900000   Max.   :39500000   Max.   :20.0   Max.   :900.0  
 residential_assets_value commercial_assets_value luxury_assets_value
 Min.   : -100000         Min.   :       0        Min.   :  300000   
 1st Qu.: 2200000         1st Qu.: 1300000        1st Qu.: 7500000   
 Median : 5600000         Median : 3700000        Median :14600000   
 Mean   : 7472616         Mean   : 4973155        Mean   :15126306   
 3rd Qu.:11300000         3rd Qu.: 7600000        3rd Qu.:21700000   
 Max.   :29100000         Max.   :19400000        Max.   :39200000   
 bank_asset_value   loan_status         loan_approved 
 Min.   :       0   Length:4269        approved:2656  
 1st Qu.: 2300000   Class :character   rejected:1613  
 Median : 4600000   Mode  :character                  
 Mean   : 4976692                                     
 3rd Qu.: 7100000                                     
 Max.   :14700000

loan_clean |>
  ggplot(aes(x = cibil_score, fill = loan_approved)) +
  geom_histogram(bins = 30)

loan_clean |>
  ggplot(aes(x = income_annum, fill = loan_approved)) +
  geom_histogram(bins = 30)

Interpretation

For the loan dataset, applicants with higher CIBIL scores and higher annual incomes appeared more likely to receive loan approval. This suggests that financial strength and creditworthiness are important factors in lending decisions.

This exploratory analysis helped identify economically meaningful predictors before building classification models.
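A stacked histogram shows counts, so it can be easier to read approval rates directly. The sketch below groups simulated applicants into score bands and computes the share approved in each band. The data, the band boundaries, and the approval rule are all assumptions made for illustration.

```r
set.seed(2)  # arbitrary seed for the simulated example

# Simulated applicants (hypothetical, not the project data):
# approval probability rises with the credit score
n           <- 1000
cibil_score <- round(runif(n, min = 300, max = 900))
approved    <- rbinom(n, size = 1, prob = plogis(0.02 * (cibil_score - 550)))

# Group scores into three bands of equal width
score_band <- cut(
  cibil_score,
  breaks = c(300, 500, 700, 900),
  include.lowest = TRUE
)

# Share of approved applications within each band
approval_rate <- tapply(approved, score_band, mean)
round(approval_rate, 2)
```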

Extended Exploratory Data Analysis — Student Dataset: Remaining Continuous Predictors

# Scatterplots for remaining continuous predictors vs exam_score
student_clean |>
  ggplot(aes(x = previous_scores, y = exam_score)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Previous Scores vs Exam Score", x = "Previous Scores", y = "Exam Score")

`geom_smooth()` using formula = 'y ~ x'

student_clean |>
  ggplot(aes(x = sleep_hours, y = exam_score)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Sleep Hours vs Exam Score", x = "Sleep Hours", y = "Exam Score")

`geom_smooth()` using formula = 'y ~ x'

student_clean |>
  ggplot(aes(x = tutoring_sessions, y = exam_score)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Tutoring Sessions vs Exam Score", x = "Tutoring Sessions", y = "Exam Score")

`geom_smooth()` using formula = 'y ~ x'

student_clean |>
  ggplot(aes(x = physical_activity, y = exam_score)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Physical Activity vs Exam Score", x = "Physical Activity (hrs/week)", y = "Exam Score")

`geom_smooth()` using formula = 'y ~ x'

Extended Exploratory Data Analysis — Student Dataset: Categorical Predictors (Boxplots)

# Boxplots for categorical predictors vs exam_score
student_clean |>
  ggplot(aes(x = parental_involvement, y = exam_score, fill = parental_involvement)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Parental Involvement vs Exam Score", x = "Parental Involvement", y = "Exam Score")

student_clean |>
  ggplot(aes(x = motivation_level, y = exam_score, fill = motivation_level)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Motivation Level vs Exam Score", x = "Motivation Level", y = "Exam Score")

student_clean |>
  ggplot(aes(x = family_income, y = exam_score, fill = family_income)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Family Income vs Exam Score", x = "Family Income", y = "Exam Score")

student_clean |>
  ggplot(aes(x = access_to_resources, y = exam_score, fill = access_to_resources)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Access to Resources vs Exam Score", x = "Access to Resources", y = "Exam Score")

student_clean |>
  ggplot(aes(x = teacher_quality, y = exam_score, fill = teacher_quality)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Teacher Quality vs Exam Score", x = "Teacher Quality", y = "Exam Score")

student_clean |>
  ggplot(aes(x = peer_influence, y = exam_score, fill = peer_influence)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Peer Influence vs Exam Score", x = "Peer Influence", y = "Exam Score")

Extended Exploratory Data Analysis Interpretation — Student Dataset

The scatterplots show that previous_scores has the strongest positive linear relationship with exam_score among the four continuous predictors plotted in this section, suggesting that prior academic performance is a strong signal of future performance. tutoring_sessions also shows a positive relationship, indicating that additional academic support is associated with higher scores. sleep_hours and physical_activity show much weaker associations with exam scores. The sleep_hours pattern in particular should be read with caution, since its coefficients in the regression models of Section 2.2 are close to zero (-0.014 in Model 1 and 0.007 in Model 2).

The boxplots reveal meaningful differences across categorical groups. Students with high parental involvement, high motivation, and high access to resources consistently achieve higher median exam scores. Higher family income is also associated with better performance, reflecting the role of socioeconomic background in educational outcomes. Teacher quality and peer influence show similar patterns — students in better academic environments tend to score higher. These patterns justify including all of these variables in Model 2.
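Because the categorical predictors are stored as character columns, ggplot2 places the boxes in alphabetical order (High, Low, Medium), which makes trends harder to read. A small sketch, on made-up values, of how the natural order could be imposed with factor() before plotting:

```r
# Made-up values standing in for a column such as parental_involvement
involvement <- c("Low", "High", "Medium", "Low", "High", "Medium", "Medium")

# Default factor levels follow alphabetical order
default_levels <- levels(factor(involvement))

# Explicit levels put the categories in their natural order
involvement_ordered <- factor(involvement, levels = c("Low", "Medium", "High"))
ordered_levels      <- levels(involvement_ordered)

print(default_levels)
print(ordered_levels)
```

Note that changing the level order also changes the reference category used by lm(), so coefficients from a refitted model would be expressed relative to "Low" rather than "High".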

Extended Exploratory Data Analysis — Loan Dataset: Asset Predictors (Boxplots)

# Boxplots for asset variables by loan approval status
loan_clean |>
  ggplot(aes(x = loan_approved, y = residential_assets_value, fill = loan_approved)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Residential Assets by Loan Status", x = "Loan Status", y = "Residential Assets Value")

loan_clean |>
  ggplot(aes(x = loan_approved, y = commercial_assets_value, fill = loan_approved)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Commercial Assets by Loan Status", x = "Loan Status", y = "Commercial Assets Value")

loan_clean |>
  ggplot(aes(x = loan_approved, y = bank_asset_value, fill = loan_approved)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Bank Assets by Loan Status", x = "Loan Status", y = "Bank Asset Value")

loan_clean |>
  ggplot(aes(x = loan_approved, y = luxury_assets_value, fill = loan_approved)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Luxury Assets by Loan Status", x = "Loan Status", y = "Luxury Assets Value")

Extended Exploratory Data Analysis — Loan Dataset: Categorical Predictors

# Boxplot for number of dependents by loan status
loan_clean |>
  ggplot(aes(x = loan_approved, y = no_of_dependents, fill = loan_approved)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Number of Dependents by Loan Status", x = "Loan Status", y = "Number of Dependents")

# Bar chart for education by loan status
loan_clean |>
  ggplot(aes(x = education, fill = loan_approved)) +
  geom_bar(position = "dodge") +
  labs(title = "Loan Approval Count by Education Level", x = "Education", y = "Count", fill = "Loan Status")

# Bar chart for self_employed by loan status
loan_clean |>
  ggplot(aes(x = self_employed, fill = loan_approved)) +
  geom_bar(position = "dodge") +
  labs(title = "Loan Approval Count by Employment Type", x = "Self Employed", y = "Count", fill = "Loan Status")

Extended Exploratory Data Analysis Interpretation — Loan Dataset

The boxplots show that approved applicants tend to have substantially higher asset values across all categories — residential, commercial, bank, and luxury assets — compared to rejected applicants. This suggests that collateral and overall financial wealth are important signals for lenders beyond just income and credit score. Approved applicants also tend to have fewer dependents on average, which may reflect greater disposable income for loan repayment. Education level and employment type show differences in approval rates, suggesting that personal characteristics beyond pure financial metrics also play a role in lending decisions. These findings justify including all asset and personal variables in Model 2.
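Differences read off boxplots can be checked numerically by comparing group medians. This is a minimal sketch on six made-up applicants (toy_loans is hypothetical and the amounts are not taken from the dataset):

```r
# Made-up applicants (hypothetical values, not the project data)
toy_loans <- data.frame(
  loan_approved    = c("approved", "approved", "approved",
                       "rejected", "rejected", "rejected"),
  bank_asset_value = c(8, 12, 5, 3, 7, 2) * 1e6
)

# Median asset value within each approval group
median_assets <- tapply(
  toy_loans$bank_asset_value,
  toy_loans$loan_approved,
  median
)

print(median_assets)
```

The same tapply() call on loan_clean, repeated for each asset column, would show how large the gaps between approved and rejected applicants actually are.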

2.1 Data Splitting

Regression Dataset Split

set.seed(465)

student_split <- initial_split(
  student_clean,
  prop = 0.80
)

student_train <- training(student_split)
student_test <- testing(student_split)

nrow(student_train)

[1] 5102

nrow(student_test)

[1] 1276

Interpretation

For the regression dataset, I split the Student Performance Factors dataset into 80% training data and 20% test data.

The training data is used to build the models, while the test data is used to evaluate model performance on unseen observations.

I used set.seed(465) so that the same split can be reproduced when the code is run again.
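The role of the seed can be illustrated with a base-R sketch of an 80/20 random split. This is not how rsample implements initial_split(), and it will not select the same rows; it only shows that repeating the draw with the same seed gives an identical partition.

```r
# Base-R sketch of an 80/20 random split (not the rsample implementation)
n_rows <- 6378  # rows in student_clean, as reported by glimpse()

set.seed(465)
train_idx_a <- sample(n_rows, size = floor(0.80 * n_rows))

set.seed(465)
train_idx_b <- sample(n_rows, size = floor(0.80 * n_rows))

# Same seed, same rows selected
same_split <- identical(train_idx_a, train_idx_b)

# Rows not drawn for training form the test set
test_idx <- setdiff(seq_len(n_rows), train_idx_a)

c(train = length(train_idx_a), test = length(test_idx))
```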

Classification Dataset Split

set.seed(465)

loan_split <- initial_split(
  loan_clean,
  prop = 0.80,
  strata = loan_approved
)

loan_train <- training(loan_split)
loan_test <- testing(loan_split)

nrow(loan_train)

[1] 3414

nrow(loan_test)

[1] 855

table(loan_train$loan_approved)


approved rejected 
    2124     1290

table(loan_test$loan_approved)


approved rejected 
     532      323

Interpretation

For the classification dataset, I also used an 80% training and 20% test split.

I used strata = loan_approved because the target variable is categorical. This helps keep the approved and rejected loan proportions similar in both the training and test sets.

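This matters because an unstratified random split could leave the two sets with different class proportions, which would make test-set results less representative. Here, about 62% of applications are approved in both the training set (2,124 of 3,414) and the test set (532 of 855).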
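The effect of stratification can be checked by converting the class counts into proportions. The counts below are copied from the table() output in this section; the object names are introduced only for this sketch.

```r
# Class counts copied from the table() output in this report
train_counts <- c(approved = 2124, rejected = 1290)
test_counts  <- c(approved = 532,  rejected = 323)

# Convert counts to proportions within each set
train_share <- train_counts / sum(train_counts)
test_share  <- test_counts / sum(test_counts)

round(rbind(train = train_share, test = test_share), 3)
```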

2.2 Regression Models

The regression dataset predicts student exam scores. Since exam_score is a continuous numerical variable, linear regression is an appropriate method.

Regression Model Specification

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

Regression Model 1

student_model_1 <- lm_spec |>
  fit(
    exam_score ~ hours_studied +
      attendance +
      previous_scores +
      sleep_hours +
      tutoring_sessions +
      physical_activity,
    data = student_train
  )

student_model_1

parsnip model object


Call:
stats::lm(formula = exam_score ~ hours_studied + attendance + 
    previous_scores + sleep_hours + tutoring_sessions + physical_activity, 
    data = data)

Coefficients:
      (Intercept)      hours_studied         attendance    previous_scores  
         40.76799            0.29165            0.19739            0.05069  
      sleep_hours  tutoring_sessions  physical_activity  
         -0.01389            0.49891            0.13833

Model 1 Interpretation

Model 1 is a baseline regression model that focuses exclusively on behavioral and academic predictors — variables that students can directly control or influence.

hours_studied and attendance capture the direct effort a student invests in their education. According to human capital theory, greater investment of time and effort should translate into higher returns, in this case higher exam scores. previous_scores serves as a proxy for accumulated prior knowledge — students who performed well in the past are likely to carry forward stronger foundational skills. tutoring_sessions reflects access to additional academic support, which is expected to have a positive effect on performance. sleep_hours and physical_activity capture student wellbeing, which can indirectly affect cognitive function and exam readiness.

This model is useful as a baseline because it isolates the role of individual behavior and prior achievement, which is the most direct channel through which students influence their own academic outcomes. If this model already performs well, it suggests that these factors account for much of the variation in exam performance, although a good fit alone does not establish that effort is the primary driver.
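To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the Model 1 coefficients from the output above and applies them to a hypothetical student whose values are set to the sample medians shown in summary(student_clean).

```r
# Coefficients copied from the Model 1 output above
coefs <- c(
  intercept         = 40.76799,
  hours_studied     = 0.29165,
  attendance        = 0.19739,
  previous_scores   = 0.05069,
  sleep_hours       = -0.01389,
  tutoring_sessions = 0.49891,
  physical_activity = 0.13833
)

# A hypothetical student (values set to the sample medians)
student <- c(
  intercept         = 1,
  hours_studied     = 20,
  attendance        = 80,
  previous_scores   = 75,
  sleep_hours       = 7,
  tutoring_sessions = 1,
  physical_activity = 3
)

# Predicted score = intercept + sum of coefficient * value
predicted_score <- sum(coefs * student[names(coefs)])
round(predicted_score, 2)
```

The result is about 67.0, close to the sample mean of 67.25. Holding the other predictors fixed, one additional hour studied raises the prediction by about 0.29 points.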

Regression Model 2

student_model_2 <- lm_spec |>
  fit(
    exam_score ~ hours_studied +
      attendance +
      previous_scores +
      sleep_hours +
      tutoring_sessions +
      physical_activity +
      parental_involvement +
      access_to_resources +
      extracurricular_activities +
      motivation_level +
      internet_access +
      family_income +
      teacher_quality +
      school_type +
      peer_influence +
      learning_disabilities +
      parental_education_level +
      distance_from_home +
      gender,
    data = student_train
  )

student_model_2

parsnip model object


Call:
stats::lm(formula = exam_score ~ hours_studied + attendance + 
    previous_scores + sleep_hours + tutoring_sessions + physical_activity + 
    parental_involvement + access_to_resources + extracurricular_activities + 
    motivation_level + internet_access + family_income + teacher_quality + 
    school_type + peer_influence + learning_disabilities + parental_education_level + 
    distance_from_home + gender, data = data)

Coefficients:
                         (Intercept)                         hours_studied  
                           41.631779                              0.293325  
                          attendance                       previous_scores  
                            0.198348                              0.050888  
                         sleep_hours                     tutoring_sessions  
                            0.006635                              0.499156  
                   physical_activity               parental_involvementLow  
                            0.177454                             -1.960630  
          parental_involvementMedium                access_to_resourcesLow  
                           -1.054016                             -2.122102  
           access_to_resourcesMedium         extracurricular_activitiesYes  
                           -1.045808                              0.550513  
                 motivation_levelLow                motivation_levelMedium  
                           -1.049895                             -0.551294  
                  internet_accessYes                      family_incomeLow  
                            0.943451                             -1.001944  
                 family_incomeMedium                    teacher_qualityLow  
                           -0.512330                             -1.086768  
               teacher_qualityMedium                     school_typePublic  
                           -0.553800                              0.019176  
               peer_influenceNeutral                peer_influencePositive  
                            0.543927                              1.066488  
            learning_disabilitiesYes   parental_education_levelHigh School  
                           -0.829881                             -0.465236  
parental_education_levelPostgraduate            distance_from_homeModerate  
                            0.498891                              0.416116  
              distance_from_homeNear                            genderMale  
                            0.892611                             -0.029498

Model 2 Interpretation

Model 2 extends Model 1 by adding socioeconomic and environmental predictors — factors that are largely outside the student’s direct control but shape the conditions under which learning takes place.

family_income and access_to_resources capture material advantages that affect a student’s ability to study effectively, such as access to books, technology, and quiet study space. parental_involvement and parental_education_level reflect the quality of the home learning environment — students with more engaged and educated parents tend to receive greater academic guidance. motivation_level captures psychological engagement with learning, which mediates between external conditions and actual study behavior. teacher_quality and school_type represent institutional factors; better-resourced schools with higher-quality teachers are expected to produce better outcomes. peer_influence reflects the social learning environment — students surrounded by academically motivated peers tend to perform better. learning_disabilities is expected to have a negative effect on performance, not because of ability, but due to additional barriers in standard exam settings. internet_access reflects access to modern learning tools. distance_from_home may capture commuting burden, which can reduce available study time.

The key economic question this model addresses is whether socioeconomic inequality in educational inputs translates into unequal performance outcomes — even after controlling for individual effort. If Model 2 outperforms Model 1, it provides evidence that structural and environmental factors matter beyond individual behavior alone.
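The Model 2 output lists terms such as parental_involvementLow and parental_involvementMedium but no term for High. This is because lm() converts each categorical predictor into indicator columns and uses the first level alphabetically as the reference category. A small sketch with made-up values shows the columns that are created:

```r
# Made-up values standing in for a column such as parental_involvement
involvement <- c("High", "Low", "Medium", "Low")

# model.matrix() shows the indicator columns that lm() builds internally
design <- model.matrix(~ involvement)

print(colnames(design))
print(design[, -1])
```

Each coefficient on an indicator is therefore a difference from the reference group. For example, the estimate of -1.96 for parental_involvementLow compares low with high involvement, holding the other predictors fixed.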

Regression Predictions and Evaluation Metrics

Model 1 Predictions

student_pred_1 <- predict(
  student_model_1,
  student_test
) |>
  bind_cols(student_test)

student_metrics_1 <- student_pred_1 |>
  metrics(
    truth = exam_score,
    estimate = .pred
  )

student_metrics_1

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       2.40 
2 rsq     standard       0.612
3 mae     standard       1.29

Model 2 Predictions

student_pred_2 <- predict(
  student_model_2,
  student_test
) |>
  bind_cols(student_test)

student_metrics_2 <- student_pred_2 |>
  metrics(
    truth = exam_score,
    estimate = .pred
  )

student_metrics_2

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.95 
2 rsq     standard       0.745
3 mae     standard       0.464

Regression Model Comparison Table

student_comparison <- bind_rows(
  student_metrics_1 |>
    mutate(model = "Model 1: Basic Student Factors"),

  student_metrics_2 |>
    mutate(model = "Model 2: Full Student Factors")
) |>
  select(model, .metric, .estimate)

student_comparison

# A tibble: 6 × 3
  model                          .metric .estimate
  <chr>                          <chr>       <dbl>
1 Model 1: Basic Student Factors rmse        2.40 
2 Model 1: Basic Student Factors rsq         0.612
3 Model 1: Basic Student Factors mae         1.29 
4 Model 2: Full Student Factors  rmse        1.95 
5 Model 2: Full Student Factors  rsq         0.745
6 Model 2: Full Student Factors  mae         0.464

Regression Model Comparison Interpretation

The regression models were compared on the test set using RMSE, MAE, and R².

RMSE is the square root of the mean squared prediction error, so it is expressed in exam-score points and penalizes large errors more heavily than MAE, the mean absolute error. Lower values of both indicate better predictive performance.

R² measures how much of the variation in exam scores is explained by the model. Higher R² values generally indicate stronger explanatory power.

Model 2 performed better than Model 1 on all three metrics. RMSE fell from 2.40 to 1.95, MAE fell from 1.29 to 0.464, and R² rose from 0.612 to 0.745. This means Model 2 makes smaller prediction errors and explains more of the variation in student exam scores.

This result shows that adding socioeconomic and environmental variables improved prediction performance. Student success cannot be explained only by academic habits such as study hours and attendance. Family background, motivation, school environment, and access to resources also contribute meaningfully to academic performance. The simpler Model 1 is easier to interpret, but Model 2 is the better predictive model for this research question.

2.2 Classification Models

The classification dataset predicts whether a loan application is approved or rejected. Since the target variable is binary, logistic regression is used.

Logistic Regression Specification

log_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

Classification Model 1

loan_model_1 <- log_spec |>
  fit(
    loan_approved ~ income_annum +
      loan_amount +
      loan_term +
      cibil_score,
    data = loan_train
  )

loan_model_1

parsnip model object


Call:  stats::glm(formula = loan_approved ~ income_annum + loan_amount + 
    loan_term + cibil_score, family = stats::binomial, data = data)

Coefficients:
 (Intercept)  income_annum   loan_amount     loan_term   cibil_score  
   1.105e+01     4.382e-07    -1.386e-07     1.481e-01    -2.417e-02  

Degrees of Freedom: 3413 Total (i.e. Null);  3409 Residual
Null Deviance:      4527 
Residual Deviance: 1543     AIC: 1553

Model 1 Interpretation

Model 1 is a baseline logistic regression model that uses only the core financial indicators traditionally applied in credit scoring decisions.

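cibil_score is the most direct measure of creditworthiness: a higher score signals a reliable repayment history and is expected to strongly increase the probability of loan approval. income_annum captures the applicant’s repayment capacity; higher income reduces default risk, making approval more likely. loan_amount captures the scale of the lender’s exposure: larger loans carry greater default risk and are therefore expected to reduce the probability of approval. loan_term reflects the repayment period; longer terms increase uncertainty and may affect the lender’s risk assessment. When comparing these expectations with the fitted coefficients, note that glm models the log-odds of the second factor level of loan_approved, so the sign of each estimate (for example, the negative coefficient on cibil_score) has to be read against the level ordering set during data preparation.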

This model represents the traditional, narrow view of credit risk: that lending decisions are driven primarily by a borrower’s credit history, income, and the characteristics of the loan itself. It serves as a baseline to assess whether more information about the applicant improves predictive performance beyond these standard financial indicators.
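Because the sign of a logistic regression coefficient depends on which outcome level is being modeled, the sketch below illustrates the point on simulated data. Everything here is an assumption made for illustration: the variable names, the simulated relationship, and the level ordering are hypothetical and are not taken from loan_train. With a factor response, glm treats the first level as "failure" and models the probability of the other level.

```r
set.seed(465)

# Simulated data: approval becomes more likely as the score rises
n     <- 500
score <- runif(n, 300, 900)
p_approved <- plogis((score - 550) / 50)

status <- factor(
  ifelse(runif(n) < p_approved, "Approved", "Rejected"),
  levels = c("Approved", "Rejected")
)

# Levels are (Approved, Rejected), so glm models P(Rejected)
fit_rejected  <- glm(status ~ score, family = binomial)
coef_rejected <- coef(fit_rejected)[["score"]]

# After releveling, levels are (Rejected, Approved), so glm models P(Approved)
status_flipped <- relevel(status, ref = "Rejected")
fit_approved   <- glm(status_flipped ~ score, family = binomial)
coef_approved  <- coef(fit_approved)[["score"]]

c(models_rejected = coef_rejected, models_approved = coef_approved)

# Odds ratio for a one-point increase in score, on the approval scale
exp(coef_approved)
```

The two fits describe the same relationship with opposite signs, which is why checking levels() on the outcome is a useful first step before interpreting any coefficient.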

Classification Model 2

loan_model_2 <- log_spec |>
  fit(
    loan_approved ~ no_of_dependents +
      education +
      self_employed +
      income_annum +
      loan_amount +
      loan_term +
      cibil_score +
      residential_assets_value +
      commercial_assets_value +
      luxury_assets_value +
      bank_asset_value,
    data = loan_train
  )

loan_model_2

parsnip model object


Call:  stats::glm(formula = loan_approved ~ no_of_dependents + education + 
    self_employed + income_annum + loan_amount + loan_term + 
    cibil_score + residential_assets_value + commercial_assets_value + 
    luxury_assets_value + bank_asset_value, family = stats::binomial, 
    data = data)

Coefficients:
             (Intercept)          no_of_dependents     educationNot Graduate  
               1.107e+01                 1.002e-02                 5.676e-02  
        self_employedYes              income_annum               loan_amount  
              -7.368e-02                 6.261e-07                -1.402e-07  
               loan_term               cibil_score  residential_assets_value  
               1.512e-01                -2.432e-02                -1.077e-09  
 commercial_assets_value       luxury_assets_value          bank_asset_value  
              -4.095e-09                -3.573e-08                -6.966e-08  

Degrees of Freedom: 3413 Total (i.e. Null);  3402 Residual
Null Deviance:      4527 
Residual Deviance: 1535     AIC: 1559

Model 2 Interpretation

Model 2 extends Model 1 by adding applicant-level characteristics and a full breakdown of asset holdings, reflecting a more comprehensive view of a borrower’s financial profile.

residential_assets_value, commercial_assets_value, luxury_assets_value, and bank_asset_value capture the applicant’s collateral — assets that can be claimed by the lender in the event of default. Higher total asset value reduces lender risk and is expected to increase approval probability. no_of_dependents reflects the financial obligations of the applicant; more dependents may reduce the disposable income available for loan repayment, increasing rejection risk. education serves as a proxy for long-term income stability — more educated applicants may have more predictable and higher future earnings. self_employed captures employment income uncertainty; self-employed applicants may face more variable income streams, which lenders may view as higher risk compared to salaried applicants.

The key economic question this model addresses is whether lenders’ decisions are truly multidimensional — going beyond a simple credit score and income check to consider the full financial and personal profile of an applicant. If Model 2 outperforms Model 1, it suggests that a richer set of applicant information improves credit risk assessment, consistent with the shift toward comprehensive creditworthiness evaluation in modern lending.

Classification Predictions and Evaluation Metrics

Model 1 Predictions

loan_pred_1 <- predict(
  loan_model_1,
  loan_test
) |>
  bind_cols(loan_test)

loan_accuracy_1 <- loan_pred_1 |>
  accuracy(
    truth = loan_approved,
    estimate = .pred_class
  )

loan_precision_1 <- loan_pred_1 |>
  precision(
    truth = loan_approved,
    estimate = .pred_class
  )

loan_recall_1 <- loan_pred_1 |>
  recall(
    truth = loan_approved,
    estimate = .pred_class
  )

loan_accuracy_1

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.926

loan_precision_1

# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.954

loan_recall_1

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.927

Model 2 Predictions

loan_pred_2 <- predict(
  loan_model_2,
  loan_test
) |>
  bind_cols(loan_test)

loan_accuracy_2 <- loan_pred_2 |>
  accuracy(
    truth = loan_approved,
    estimate = .pred_class
  )

loan_precision_2 <- loan_pred_2 |>
  precision(
    truth = loan_approved,
    estimate = .pred_class
  )

loan_recall_2 <- loan_pred_2 |>
  recall(
    truth = loan_approved,
    estimate = .pred_class
  )

loan_accuracy_2

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.924

loan_precision_2

# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.953

loan_recall_2

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.923

Classification Model Comparison Table

loan_comparison <- bind_rows(
  loan_accuracy_1,
  loan_precision_1,
  loan_recall_1
) |>
  mutate(model = "Model 1: Basic Loan Factors") |>
  bind_rows(
    bind_rows(
      loan_accuracy_2,
      loan_precision_2,
      loan_recall_2
    ) |>
      mutate(model = "Model 2: Full Loan Factors")
  ) |>
  select(model, .metric, .estimate)

loan_comparison

# A tibble: 6 × 3
  model                       .metric   .estimate
  <chr>                       <chr>         <dbl>
1 Model 1: Basic Loan Factors accuracy      0.926
2 Model 1: Basic Loan Factors precision     0.954
3 Model 1: Basic Loan Factors recall        0.927
4 Model 2: Full Loan Factors  accuracy      0.924
5 Model 2: Full Loan Factors  precision     0.953
6 Model 2: Full Loan Factors  recall        0.923

Classification Model Comparison Interpretation

The logistic regression models were evaluated using accuracy, precision, and recall.

Accuracy measures the overall percentage of correct predictions.

Precision measures how many applications predicted as approved were actually approved.

Recall measures how many actual approved applications were correctly identified by the model.
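These three definitions can be traced through a confusion matrix. The following base R sketch uses a handful of toy labels (not rows from the loan dataset) and assumes that "Approved" is the positive class and the first factor level, which matches yardstick's default of event_level = "first"; whether that matches the level ordering of loan_approved in this report is not confirmed here.

```r
# Toy labels for illustration only; "Approved" is treated as the positive class.
lvls  <- c("Approved", "Rejected")
truth <- factor(
  c("Approved", "Approved", "Approved", "Rejected",
    "Rejected", "Approved", "Rejected", "Approved"),
  levels = lvls
)
pred <- factor(
  c("Approved", "Approved", "Rejected", "Rejected",
    "Approved", "Approved", "Rejected", "Approved"),
  levels = lvls
)

# Rows are predictions, columns are actual outcomes
cm <- table(predicted = pred, actual = truth)

tp <- as.numeric(cm["Approved", "Approved"])  # predicted approved, actually approved
fp <- as.numeric(cm["Approved", "Rejected"])  # predicted approved, actually rejected
fn <- as.numeric(cm["Rejected", "Approved"])  # predicted rejected, actually approved
tn <- as.numeric(cm["Rejected", "Rejected"])  # predicted rejected, actually rejected

accuracy_manual  <- (tp + tn) / (tp + fp + fn + tn)
precision_manual <- tp / (tp + fp)
recall_manual    <- tp / (tp + fn)

c(accuracy = accuracy_manual, precision = precision_manual, recall = recall_manual)
```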

Model 2 did not outperform Model 1 on the test set. Model 1 was marginally higher on all three metrics: accuracy 0.926 versus 0.924, precision 0.954 versus 0.953, and recall 0.927 versus 0.923. The gaps are at most 0.004, so the two models make nearly the same number of errors in both directions (incorrect approvals and incorrect rejections).

This result suggests that the core financial indicators carry most of the predictive information about approval. Asset holdings, employment type, education, and number of dependents add little beyond CIBIL score, income, loan amount, and loan term in this sample; the higher AIC of Model 2 (1559 versus 1553) points the same way. On test set performance alone, the simpler Model 1 is at least as good a classifier as Model 2.

2.3 Model Comparison and Selection

Regression Model Selection

student_rmse_comparison <- student_comparison |>
  filter(.metric == "rmse") |>
  arrange(.estimate)

student_rsq_comparison <- student_comparison |>
  filter(.metric == "rsq") |>
  arrange(desc(.estimate))

student_rmse_comparison

# A tibble: 2 × 3
  model                          .metric .estimate
  <chr>                          <chr>       <dbl>
1 Model 2: Full Student Factors  rmse         1.95
2 Model 1: Basic Student Factors rmse         2.40

student_rsq_comparison

# A tibble: 2 × 3
  model                          .metric .estimate
  <chr>                          <chr>       <dbl>
1 Model 2: Full Student Factors  rsq         0.745
2 Model 1: Basic Student Factors rsq         0.612

bind_rows(
  student_metrics_1 |> mutate(model = "Model 1: Behavioral"),
  student_metrics_2 |> mutate(model = "Model 2: Full")
) |>
  filter(.metric %in% c("rmse", "rsq")) |>
  select(model, .metric, .estimate)

# A tibble: 4 × 3
  model               .metric .estimate
  <chr>               <chr>       <dbl>
1 Model 1: Behavioral rmse        2.40 
2 Model 1: Behavioral rsq         0.612
3 Model 2: Full       rmse        1.95 
4 Model 2: Full       rsq         0.745

Interpretation

The comparison table shows RMSE and R² side by side for both models on the test set.

RMSE measures the average prediction error in the same units as exam_score — a lower value means the model’s predictions are closer to the actual exam scores. R² measures what proportion of the variation in exam scores is explained by the model — a higher value indicates stronger explanatory power.

Model 2 is selected as the better regression model because it produces a lower RMSE and a higher R² on the test set. This means it makes smaller prediction errors and captures more of the variation in student performance.

Economically, this result is meaningful: individual study behavior alone is not sufficient to fully explain exam outcomes. Socioeconomic factors — family income, access to resources, parental involvement, motivation level, and school quality — contribute additional predictive power. This finding is consistent with education economics research showing that structural inequality in educational inputs produces unequal performance outcomes, even among students who invest similar effort.

Classification Model Selection

loan_accuracy_comparison <- loan_comparison |>
  filter(.metric == "accuracy") |>
  arrange(desc(.estimate))

loan_precision_comparison <- loan_comparison |>
  filter(.metric == "precision") |>
  arrange(desc(.estimate))

loan_recall_comparison <- loan_comparison |>
  filter(.metric == "recall") |>
  arrange(desc(.estimate))

loan_accuracy_comparison

# A tibble: 2 × 3
  model                       .metric  .estimate
  <chr>                       <chr>        <dbl>
1 Model 1: Basic Loan Factors accuracy     0.926
2 Model 2: Full Loan Factors  accuracy     0.924

loan_precision_comparison

# A tibble: 2 × 3
  model                       .metric   .estimate
  <chr>                       <chr>         <dbl>
1 Model 1: Basic Loan Factors precision     0.954
2 Model 2: Full Loan Factors  precision     0.953

loan_recall_comparison

# A tibble: 2 × 3
  model                       .metric .estimate
  <chr>                       <chr>       <dbl>
1 Model 1: Basic Loan Factors recall      0.927
2 Model 2: Full Loan Factors  recall      0.923

bind_rows(
  bind_rows(loan_accuracy_1, loan_precision_1, loan_recall_1) |> mutate(model = "Model 1: Core Financial"),
  bind_rows(loan_accuracy_2, loan_precision_2, loan_recall_2) |> mutate(model = "Model 2: Full Profile")
) |>
  select(model, .metric, .estimate)

# A tibble: 6 × 3
  model                   .metric   .estimate
  <chr>                   <chr>         <dbl>
1 Model 1: Core Financial accuracy      0.926
2 Model 1: Core Financial precision     0.954
3 Model 1: Core Financial recall        0.927
4 Model 2: Full Profile   accuracy      0.924
5 Model 2: Full Profile   precision     0.953
6 Model 2: Full Profile   recall        0.923

Interpretation

The comparison table shows accuracy, precision, and recall side by side for both models on the test set.
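To see the size and direction of the gap between the two models at a glance, the reported test set values can be placed in one wide table. This is a small base R sketch that re-enters the numbers printed above by hand; the object and column names are hypothetical and are not part of the original workflow.

```r
# Test-set metrics copied from the comparison table in this report
loan_test_metrics <- data.frame(
  metric  = c("accuracy", "precision", "recall"),
  model_1 = c(0.926, 0.954, 0.927),
  model_2 = c(0.924, 0.953, 0.923)
)

# Positive values would mean Model 2 scores higher than Model 1
loan_test_metrics$diff_2_minus_1 <- round(
  loan_test_metrics$model_2 - loan_test_metrics$model_1, 3
)

loan_test_metrics
```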

Accuracy measures the overall share of correct predictions across both approved and rejected applications. Precision measures how many of the applications predicted as approved were actually approved — a low precision means the model is incorrectly approving risky applicants, which is costly for the lender. Recall measures how many of the genuinely approved applications were correctly identified — a low recall means the model is incorrectly rejecting creditworthy applicants, which is a cost both to applicants and to credit market efficiency.

On the test set, Model 2 does not achieve higher accuracy, precision, or recall than Model 1. Model 1 is marginally ahead on all three (by 0.002, 0.001, and 0.004 respectively), so the additional variables do not reduce false approvals or false rejections. Model 2 is carried forward to the cross-validation in Section 2.4 as the more comprehensive specification, but the test metrics favor the simpler model by a narrow margin.

Economically, this result suggests that, in this dataset, approval outcomes are driven mainly by the core financial indicators. CIBIL score, income, and the loan characteristics are informative, while asset holdings, employment type, education, and number of dependents add no measurable predictive gain on the test set. This does not rule out a role for comprehensive applicant profiling in lending generally, but these data do not provide evidence for it.

2.4 Cross-Validation

Cross-validation was used to evaluate whether the selected models are stable and generalizable. I used 5-fold cross-validation for both datasets.
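As a rough illustration of what 5-fold cross-validation does, the sketch below carries out the procedure by hand in base R. It uses the built-in mtcars data and an arbitrary lm formula purely as stand-ins; it is not the report's data or model, and the fold assignment is a simplified version of what vfold_cv() handles internally.

```r
set.seed(465)

k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(i) {
  analysis   <- mtcars[fold_id != i, ]  # k - 1 folds used for fitting
  assessment <- mtcars[fold_id == i, ]  # held-out fold used for scoring

  fit  <- lm(mpg ~ wt + hp, data = analysis)
  pred <- predict(fit, newdata = assessment)

  sqrt(mean((assessment$mpg - pred)^2))
})

# Average performance across folds and its standard error
cv_mean    <- mean(fold_rmse)
cv_std_err <- sd(fold_rmse) / sqrt(k)

round(c(mean = cv_mean, std_err = cv_std_err), 3)
```

Each row is held out exactly once, so the averaged metric reflects performance on data the model did not see during fitting.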

Regression Cross-Validation

set.seed(465)

student_folds <- vfold_cv(
  student_train,
  v = 5
)

student_cv <- fit_resamples(
  lm_spec,
  exam_score ~ hours_studied +
    attendance +
    previous_scores +
    sleep_hours +
    tutoring_sessions +
    physical_activity +
    parental_involvement +
    access_to_resources +
    extracurricular_activities +
    motivation_level +
    internet_access +
    family_income +
    teacher_quality +
    school_type +
    peer_influence +
    learning_disabilities +
    parental_education_level +
    distance_from_home +
    gender,
  resamples = student_folds,
  metrics = metric_set(rmse, rsq)
)

student_cv_results <- collect_metrics(student_cv)

student_cv_results

# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   2.07      5  0.188  pre0_mod0_post0
2 rsq     standard   0.717     5  0.0406 pre0_mod0_post0

# Test set results for comparison
student_metrics_2

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.95 
2 rsq     standard       0.745
3 mae     standard       0.464

Interpretation

The outputs above compare the average 5-fold cross-validation performance with the test set performance of the selected regression model (Model 2).

The CV mean RMSE (2.07) and the test set RMSE (1.95) differ by 0.12, which is less than one standard error of the CV estimate (0.188). Similarly, the CV mean R² (0.717) and the test set R² (0.745) differ by 0.028, also within one standard error (0.0406). This indicates that the model is stable and generalizes well to unseen data rather than memorizing the training set. If the model had overfit, we would expect the test set RMSE to be substantially higher than the CV RMSE, and the test set R² to be substantially lower. Here the test set results are, if anything, slightly better, which gives confidence that Model 2’s predictive ability will hold on new student data drawn from a similar population.
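One way to judge whether a gap between cross-validation and test set performance is large is to express it in units of the CV standard error. The sketch below does this with the values printed above, re-entered by hand. The threshold of two standard errors is a rough heuristic assumed here for illustration, not a formal test, and the object names are hypothetical.

```r
# Values copied from student_cv_results and student_metrics_2 in this report
student_check <- data.frame(
  metric     = c("rmse", "rsq"),
  cv_mean    = c(2.07, 0.717),
  cv_std_err = c(0.188, 0.0406),
  test       = c(1.95, 0.745)
)

# Raw gap between test set value and CV mean
student_check$difference <- student_check$test - student_check$cv_mean

# The same gap expressed in CV standard errors
student_check$diff_in_se <- student_check$difference / student_check$cv_std_err

# Heuristic flag (assumption): gaps within two standard errors are treated as small
student_check$within_2_se <- abs(student_check$diff_in_se) <= 2

student_check
```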

Classification Cross-Validation

set.seed(465)

loan_folds <- vfold_cv(
  loan_train,
  v = 5,
  strata = loan_approved
)

loan_cv <- fit_resamples(
  log_spec,
  loan_approved ~ no_of_dependents +
    education +
    self_employed +
    income_annum +
    loan_amount +
    loan_term +
    cibil_score +
    residential_assets_value +
    commercial_assets_value +
    luxury_assets_value +
    bank_asset_value,
  resamples = loan_folds,
  metrics = metric_set(
    accuracy,
    precision,
    recall
  )
)

loan_cv_results <- collect_metrics(loan_cv)

loan_cv_results

# A tibble: 3 × 6
  .metric   .estimator  mean     n std_err .config        
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy  binary     0.913     5 0.00480 pre0_mod0_post0
2 precision binary     0.929     5 0.00374 pre0_mod0_post0
3 recall    binary     0.932     5 0.00540 pre0_mod0_post0

# Test set results for comparison
loan_accuracy_2

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.924

loan_precision_2

# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.953

loan_recall_2

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.923

Interpretation

The outputs above compare the average 5-fold cross-validation performance with the test set performance of the selected classification model (Model 2).

Test set accuracy (0.924) and precision (0.953) are somewhat higher than their CV means (0.913 and 0.929), while test set recall (0.923) is slightly lower than the CV mean (0.932). The largest gap is 0.024 for precision, which is small in absolute terms although it exceeds the fold-to-fold standard error (0.00374). There is no sign of overfitting: if the model had overfit, the test set accuracy would be noticeably lower than the CV accuracy, as the model would have learned patterns specific to the training data that do not generalize. The small standard errors across folds (at most 0.0054) also indicate that the model’s performance does not depend on any particular subset of the training data. This gives reasonable confidence that Model 2’s classification performance on new loan applications would be similar.

2.5 AI Interaction Log

During Stage 2, I used Claude (Anthropic) to help with a specific technical problem I encountered while building the cross-validation vs. test set comparison.

My Prompt

“I am working on a tidymodels project in R. I ran 5-fold cross-validation using fit_resamples() and collected metrics with collect_metrics(). The output has columns .metric and mean. I also have test set metrics stored in a tibble with columns .metric and .estimate. I want to create a single side-by-side table that shows, for each metric, the CV mean and the test set value in the same row. How do I do this?”

AI Response (Relevant Excerpt)

The AI suggested extracting each metric individually using filter() and pull(), then combining them into a new tibble() manually:

cv_rmse   <- cv_results |> filter(.metric == "rmse") |> pull(mean)
test_rmse <- test_metrics |> filter(.metric == "rmse") |> pull(.estimate)

tibble(
  Metric     = c("RMSE", "R2"),
  `CV Mean`  = round(c(cv_rmse, cv_rsq), 4),
  `Test Set` = round(c(test_rmse, test_rsq), 4),
  Difference = round(c(test_rmse - cv_rmse, test_rsq - cv_rsq), 4)
)

The AI also explained that the Difference column is useful because a value close to zero means the model generalizes well, while a large positive difference in RMSE (or large negative difference in R2) would indicate overfitting.

How I Used It

I adopted the filter() and pull() approach directly, as it was cleaner than the pivot_wider() approach I had initially tried. I adapted the code for both datasets — the regression model (RMSE and R2) and the classification model (accuracy, precision, and recall) — replacing the generic variable names with the actual object names from my workflow. I also added the Difference column to both tables, which was suggested by the AI and turned out to be a useful addition for the overfitting discussion.
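For reference, the same kind of side-by-side table can also be built with a join on .metric. The sketch below is an alternative shown for comparison only: it is not the code used in this report and not part of the AI response. It re-enters the classification values printed in Section 2.4 by hand, and the object names are hypothetical.

```r
library(dplyr)
library(tibble)

# CV means and test-set values copied from Section 2.4 of this report
loan_cv_tbl <- tribble(
  ~.metric,    ~mean,
  "accuracy",  0.913,
  "precision", 0.929,
  "recall",    0.932
)

loan_test_tbl <- tribble(
  ~.metric,    ~.estimate,
  "accuracy",  0.924,
  "precision", 0.953,
  "recall",    0.923
)

# Match rows by metric name, then keep one labeled column per source
loan_cv_vs_test <- loan_cv_tbl |>
  inner_join(loan_test_tbl, by = ".metric") |>
  transmute(
    Metric     = .metric,
    `CV Mean`  = mean,
    `Test Set` = .estimate,
    Difference = round(.estimate - mean, 4)
  )

loan_cv_vs_test
```

A join avoids extracting each value separately, at the cost of requiring that both tibbles use identical metric names.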

Reflection

The AI interaction was helpful for solving a concrete formatting problem quickly. My initial approach using pivot_wider() was not working cleanly because the CV and test set tibbles had different column structures, and the AI identified this as the source of the issue and offered a simpler alternative. I verified that the output matched what I expected by checking individual values manually before writing the interpretation. The explanation of what the Difference column means for overfitting also helped me write a more precise interpretation in Section 2.4.

Final Conclusion

In this Stage 2 report, I built and compared predictive models for two different datasets.

For the Student Performance Factors dataset, I used linear regression because the target variable, exam_score, is continuous.

For the Loan Approval dataset, I used logistic regression because the target variable, loan_approved, is binary.

For each dataset, I created two models and compared their test set performance. The regression models were evaluated using RMSE and R², while the classification models were evaluated using accuracy, precision, and recall.

The comparison results helped identify which model performed better for each dataset. I also used 5-fold cross-validation to evaluate model stability and generalization performance.

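Overall, the more comprehensive model produced clearly stronger predictive performance for the student dataset, while for the loan dataset the two models performed almost identically, with the simpler model marginally ahead.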

For the student dataset, academic performance is influenced not only by study-related behaviors such as hours studied and attendance, but also by socioeconomic and environmental factors including family income, access to resources, parental involvement, and school quality. This suggests that educational inequality in inputs translates into unequal performance outcomes — a finding with direct implications for education policy.

For the loan dataset, the core financial indicators (CIBIL score, income, loan amount, and loan term) account for nearly all of the predictive performance. Adding asset holdings, employment type, education level, and number of dependents did not improve test set accuracy, precision, or recall, so these data do not show that a more comprehensive applicant profile improves the prediction of approval.

The cross-validation results were broadly consistent with the test set results for both selected models, with no sign that test performance degrades relative to cross-validation. This suggests that the models are stable and generalize to unseen data.

These findings demonstrate that predictive modeling is useful not only for generating accurate predictions, but also for understanding the economic and behavioral mechanisms that drive real-world outcomes.