ECON 465 – Final Report: From Data to Economic Insight

Author

Ece Kurtoğlu and Halil Rıfat Başbuğ

Introduction

This final report combines all work from Stages 1 and 2 into a single, reproducible document. The report covers two real-world economic datasets and follows a consistent structure for each: economic question, dataset description, probability analysis, predictive modeling, model comparison, cross-validation, and economic interpretation.

Dataset 1 (Regression): The Student Performance Factors dataset is used to predict a continuous outcome — student exam scores — using behavioral and socioeconomic characteristics.

Dataset 2 (Classification): The Loan Approval dataset is used to predict a binary outcome — whether a loan application is approved or rejected — using financial and applicant characteristics.

All analyses are conducted in R using the tidymodels framework. set.seed(465) is called before each random step (data splitting and cross-validation fold creation) so that the results are fully reproducible. All file paths are relative.
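The setup chunk that loads the packages is not shown in this report. The following is a minimal sketch of what it probably contains, inferred only from the functions called later (read_csv(), clean_names(), initial_split(), tidy(), and so on); the exact package list is an assumption.

```r
# Assumed setup chunk (not shown in the original report)
library(tidyverse)   # read_csv(), dplyr verbs, tidyr, ggplot2, stringr::str_trim()
library(tidymodels)  # initial_split(), linear_reg(), fit_resamples(), metrics(), tidy()
library(janitor)     # clean_names()
```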


Dataset 1: Regression — Student Performance

Economic Question

Do socioeconomic background factors predict exam scores independently of individual study effort, and if so, through which channels — resource access, parental involvement, or institutional quality?

This question is grounded in human capital theory, which views education as an investment that generates returns in the form of productivity and earnings. The key mechanism we investigate is whether structural inequality operates through the learning environment (school quality, resource access) or through the home environment (parental involvement, family income) — or both — even after holding individual effort (hours studied, attendance) constant. If socioeconomic factors predict scores independently of effort, this is evidence that unequal educational inputs translate into unequal outcomes through channels beyond individual control, which has direct implications for where policy intervention would be most effective.

Dataset Description and Source

The first dataset is the Student Performance Factors dataset, obtained from Kaggle. It contains information about students’ study habits, attendance, previous scores, family background, school characteristics, and final exam scores.

  • Source: Kaggle — Student Performance Factors dataset (StudentPerformanceFactors.csv)
  • Target variable: exam_score (continuous, numeric)
  • Why this dataset is relevant: Exam scores are a direct measure of human capital accumulation. Predicting them allows us to understand which behavioral and socioeconomic inputs matter most for educational outcomes, which has direct implications for education economics and policy.

Data Import and Cleaning

# Import the student performance dataset
student_raw <- read_csv("StudentPerformanceFactors.csv")
Rows: 6607 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Parental_Involvement, Access_to_Resources, Extracurricular_Activit...
dbl  (7): Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, Tutoring_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(student_raw)
Rows: 6,607
Columns: 20
$ Hours_Studied              <dbl> 23, 19, 24, 29, 19, 19, 29, 25, 17, 23, 17,…
$ Attendance                 <dbl> 84, 64, 98, 89, 92, 88, 84, 78, 94, 98, 80,…
$ Parental_Involvement       <chr> "Low", "Low", "Medium", "Low", "Medium", "M…
$ Access_to_Resources        <chr> "High", "Medium", "Medium", "Medium", "Medi…
$ Extracurricular_Activities <chr> "No", "No", "Yes", "Yes", "Yes", "Yes", "Ye…
$ Sleep_Hours                <dbl> 7, 8, 7, 8, 6, 8, 7, 6, 6, 8, 8, 6, 8, 8, 8…
$ Previous_Scores            <dbl> 73, 59, 91, 98, 65, 89, 68, 50, 80, 71, 88,…
$ Motivation_Level           <chr> "Low", "Low", "Medium", "Medium", "Medium",…
$ Internet_Access            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ Tutoring_Sessions          <dbl> 0, 2, 2, 1, 3, 3, 1, 1, 0, 0, 4, 2, 2, 2, 1…
$ Family_Income              <chr> "Low", "Medium", "Medium", "Medium", "Mediu…
$ Teacher_Quality            <chr> "Medium", "Medium", "Medium", "Medium", "Hi…
$ School_Type                <chr> "Public", "Public", "Public", "Public", "Pu…
$ Peer_Influence             <chr> "Positive", "Negative", "Neutral", "Negativ…
$ Physical_Activity          <dbl> 3, 4, 4, 4, 4, 3, 2, 2, 1, 5, 4, 2, 4, 3, 4…
$ Learning_Disabilities      <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ Parental_Education_Level   <chr> "High School", "College", "Postgraduate", "…
$ Distance_from_Home         <chr> "Near", "Moderate", "Near", "Moderate", "Ne…
$ Gender                     <chr> "Male", "Female", "Male", "Male", "Female",…
$ Exam_Score                 <dbl> 67, 61, 74, 71, 70, 71, 67, 66, 69, 72, 68,…
nrow(student_raw)
[1] 6607
# Clean variable names and remove missing values
student_clean <- student_raw |>
  clean_names() |>
  drop_na()

glimpse(student_clean)
Rows: 6,378
Columns: 20
$ hours_studied              <dbl> 23, 19, 24, 29, 19, 19, 29, 25, 17, 23, 17,…
$ attendance                 <dbl> 84, 64, 98, 89, 92, 88, 84, 78, 94, 98, 80,…
$ parental_involvement       <chr> "Low", "Low", "Medium", "Low", "Medium", "M…
$ access_to_resources        <chr> "High", "Medium", "Medium", "Medium", "Medi…
$ extracurricular_activities <chr> "No", "No", "Yes", "Yes", "Yes", "Yes", "Ye…
$ sleep_hours                <dbl> 7, 8, 7, 8, 6, 8, 7, 6, 6, 8, 8, 6, 8, 8, 8…
$ previous_scores            <dbl> 73, 59, 91, 98, 65, 89, 68, 50, 80, 71, 88,…
$ motivation_level           <chr> "Low", "Low", "Medium", "Medium", "Medium",…
$ internet_access            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ tutoring_sessions          <dbl> 0, 2, 2, 1, 3, 3, 1, 1, 0, 0, 4, 2, 2, 2, 1…
$ family_income              <chr> "Low", "Medium", "Medium", "Medium", "Mediu…
$ teacher_quality            <chr> "Medium", "Medium", "Medium", "Medium", "Hi…
$ school_type                <chr> "Public", "Public", "Public", "Public", "Pu…
$ peer_influence             <chr> "Positive", "Negative", "Neutral", "Negativ…
$ physical_activity          <dbl> 3, 4, 4, 4, 4, 3, 2, 2, 1, 5, 4, 2, 4, 3, 4…
$ learning_disabilities      <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ parental_education_level   <chr> "High School", "College", "Postgraduate", "…
$ distance_from_home         <chr> "Near", "Moderate", "Near", "Moderate", "Ne…
$ gender                     <chr> "Male", "Female", "Male", "Male", "Female",…
$ exam_score                 <dbl> 67, 61, 74, 71, 70, 71, 67, 66, 69, 72, 68,…
nrow(student_clean)
[1] 6378

The dataset was imported using read_csv(). Variable names were standardised with clean_names() from the janitor package, and rows with missing values were removed using drop_na(), which reduced the data from 6,607 to 6,378 rows (229 rows dropped). The final dataset is tidy: each row is one student observation and each column is one variable.
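Before dropping rows, it can help to see where the missing values are. The sketch below uses a small hypothetical data frame (the values and the choice of columns are illustrative, not taken from the real file) and base R only, to show how missing values per column and the number of rows that drop_na() would remove can be counted.

```r
# Toy data standing in for student_raw (hypothetical values)
toy <- data.frame(
  teacher_quality          = c("High", NA, "Low", "Medium"),
  parental_education_level = c("College", "High School", NA, NA),
  exam_score               = c(67, 61, 74, 71)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with at least one NA, i.e. the rows drop_na() would remove
rows_removed <- sum(!complete.cases(toy))
print(rows_removed)
```

Running the same two checks on student_raw would show which variables account for the 229 dropped rows.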

Predictor Variable Exploration

Before modeling, all predictor variables are explored to understand their distributions, detect outliers, and identify economically meaningful patterns. This directly informs which variables belong in which model.

Summary of All Variables

# Summary statistics for all variables
summary(student_clean)
 hours_studied     attendance     parental_involvement access_to_resources
 Min.   : 1.00   Min.   : 60.00   Length:6378          Length:6378        
 1st Qu.:16.00   1st Qu.: 70.00   Class :character     Class :character   
 Median :20.00   Median : 80.00   Mode  :character     Mode  :character   
 Mean   :19.98   Mean   : 80.02                                           
 3rd Qu.:24.00   3rd Qu.: 90.00                                           
 Max.   :44.00   Max.   :100.00                                           
 extracurricular_activities  sleep_hours     previous_scores 
 Length:6378                Min.   : 4.000   Min.   : 50.00  
 Class :character           1st Qu.: 6.000   1st Qu.: 63.00  
 Mode  :character           Median : 7.000   Median : 75.00  
                            Mean   : 7.035   Mean   : 75.07  
                            3rd Qu.: 8.000   3rd Qu.: 88.00  
                            Max.   :10.000   Max.   :100.00  
 motivation_level   internet_access    tutoring_sessions family_income     
 Length:6378        Length:6378        Min.   :0.000     Length:6378       
 Class :character   Class :character   1st Qu.:1.000     Class :character  
 Mode  :character   Mode  :character   Median :1.000     Mode  :character  
                                       Mean   :1.495                       
                                       3rd Qu.:2.000                       
                                       Max.   :8.000                       
 teacher_quality    school_type        peer_influence     physical_activity
 Length:6378        Length:6378        Length:6378        Min.   :0.000    
 Class :character   Class :character   Class :character   1st Qu.:2.000    
 Mode  :character   Mode  :character   Mode  :character   Median :3.000    
                                                          Mean   :2.973    
                                                          3rd Qu.:4.000    
                                                          Max.   :6.000    
 learning_disabilities parental_education_level distance_from_home
 Length:6378           Length:6378              Length:6378       
 Class :character      Class :character         Class :character  
 Mode  :character      Mode  :character         Mode  :character  
                                                                  
                                                                  
                                                                  
    gender            exam_score    
 Length:6378        Min.   : 55.00  
 Class :character   1st Qu.: 65.00  
 Mode  :character   Median : 67.00  
                    Mean   : 67.25  
                    3rd Qu.: 69.00  
                    Max.   :101.00  

Continuous Predictors — Distribution and Outliers

# Boxplots to detect outliers in continuous predictors
student_clean |>
  select(hours_studied, attendance, previous_scores,
         sleep_hours, tutoring_sessions, physical_activity, exam_score) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  ggplot(aes(x = variable, y = value, fill = variable)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~variable, scales = "free") +
  labs(
    title = "Distribution and Outliers — Continuous Predictors",
    x = NULL, y = "Value"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

Continuous Predictors — Relationship with Exam Score

# Scatterplots: continuous predictors vs exam_score
student_clean |>
  select(hours_studied, attendance, previous_scores,
         sleep_hours, tutoring_sessions, physical_activity, exam_score) |>
  pivot_longer(-exam_score, names_to = "variable", values_to = "value") |>
  ggplot(aes(x = value, y = exam_score)) +
  geom_point(alpha = 0.2, size = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  facet_wrap(~variable, scales = "free_x") +
  labs(
    title = "Continuous Predictors vs Exam Score",
    x = "Predictor Value", y = "Exam Score"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Categorical Predictors — Relationship with Exam Score

# Boxplots: categorical predictors vs exam_score
student_clean |>
  select(exam_score, parental_involvement, access_to_resources,
         motivation_level, family_income, teacher_quality,
         peer_influence, school_type) |>
  pivot_longer(-exam_score, names_to = "variable", values_to = "category") |>
  ggplot(aes(x = category, y = exam_score, fill = category)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~variable, scales = "free_x") +
  labs(
    title = "Categorical Predictors vs Exam Score",
    x = NULL, y = "Exam Score"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

EDA Interpretation: The scatterplots show that attendance and hours_studied have the strongest positive linear relationships with exam_score among continuous predictors; these are the most direct effort signals, and they are also the largest effects in the Model 1 estimates reported later. previous_scores and tutoring_sessions show positive but weaker associations, reflecting prior ability and the value of additional support. The boxplots reveal that students with high parental involvement, high access to resources, high motivation, and high family income consistently achieve higher median exam scores, with visible differences between categories. This pattern directly motivates the two-model structure: Model 1 captures the effort channel; Model 2 adds the structural inequality channel.
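A visual reading of scatterplots can be backed up by ranking the correlation of each continuous predictor with the target. The sketch below does this on simulated data; the data frame, its coefficients, and the noise level are hypothetical stand-ins for student_clean, so the numbers it prints say nothing about the real dataset.

```r
set.seed(465)
n <- 200

# Simulated stand-in for student_clean (hypothetical data)
toy <- data.frame(
  hours_studied = rnorm(n, mean = 20, sd = 6),
  attendance    = runif(n, min = 60, max = 100),
  sleep_hours   = rnorm(n, mean = 7, sd = 1.5)
)
toy$exam_score <- 40 + 0.3 * toy$hours_studied +
  0.2 * toy$attendance + rnorm(n, sd = 2)

# Pearson correlation of each predictor with the target
predictors <- setdiff(names(toy), "exam_score")
cors <- sapply(predictors, function(v) cor(toy[[v]], toy$exam_score))

# Strongest linear association first
print(sort(cors, decreasing = TRUE))
```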

Probability Distribution Analysis

Summary Statistics

# Compute summary statistics for the regression target variable
student_summary <- student_clean |>
  summarise(
    Mean     = mean(exam_score),
    Median   = median(exam_score),
    SD       = sd(exam_score),
    Min      = min(exam_score),
    Q1       = quantile(exam_score, 0.25),
    Q3       = quantile(exam_score, 0.75),
    Max      = max(exam_score)
  )

student_summary
# A tibble: 1 × 7
   Mean Median    SD   Min    Q1    Q3   Max
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  67.3     67  3.91    55    65    69   101

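The mean exam score (67.3) and the median (67) are close to one another, which is an initial indication that the distribution is roughly symmetric. The standard deviation (3.91) measures the spread of scores around the mean. The interquartile range (Q1 = 65 to Q3 = 69) captures the middle 50% of students’ scores and shows where most performance is concentrated. The maximum of 101 lies more than eight standard deviations above the mean and, if the exam is scored out of 100, may reflect a data entry error.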

Histogram of Exam Scores

# Create histogram of exam scores
ggplot(student_clean, aes(x = exam_score)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Exam Scores",
    x = "Exam Score",
    y = "Number of Students"
  ) +
  theme_minimal()

The histogram shows that exam scores are concentrated around the middle range. The distribution appears approximately symmetric with a slight right skew: the upper tail extends to 101, while the minimum is 55. Most students achieve moderate scores, and extreme values are relatively uncommon.

Log-Transformed Histogram

# Apply log transformation to exam scores
student_clean <- student_clean |>
  mutate(log_exam_score = log(exam_score))

ggplot(student_clean, aes(x = log_exam_score)) +
  geom_histogram(bins = 30, fill = "darkorange", color = "white") +
  labs(
    title = "Distribution of Log-Transformed Exam Scores",
    x = "Log of Exam Score",
    y = "Number of Students"
  ) +
  theme_minimal()

After applying the log transformation, the shape of the distribution changes only slightly. This confirms that exam scores are not strongly right-skewed in the original scale. Because the original distribution is already approximately symmetric, the untransformed exam_score variable is more appropriate for regression modeling than the log-transformed version.

Theoretical Distribution

Based on the histogram, exam scores appear to be approximately normally distributed. The concentration of observations around the mean with roughly symmetric, tapering tails is consistent with a normal distribution. Strictly speaking, linear regression assumes approximately normal errors rather than a normal outcome, but a roughly symmetric target gives no reason to transform exam_score before modeling.
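The symmetry judgment made from the histograms can also be expressed as a number. The sketch below defines a simple moment-based skewness coefficient (a helper written for this illustration, not a function used in the report) and applies it to a hypothetical vector of scores with a few high values, before and after a log transformation.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

# Hypothetical scores with a long upper tail
scores <- c(62, 64, 65, 66, 67, 67, 68, 69, 70, 72, 95, 101)

skew_raw <- skewness(scores)       # positive: right-skewed
skew_log <- skewness(log(scores))  # smaller: the log compresses the upper tail

print(c(raw = skew_raw, log = skew_log))
```

A value near zero indicates symmetry. If the coefficient for exam_score is already close to zero, the log transformation has little to correct, which matches the conclusion drawn from the two histograms.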


Data Splitting

set.seed(465)

student_split <- initial_split(student_clean, prop = 0.80)

student_train <- training(student_split)
student_test  <- testing(student_split)

cat("Training set size:", nrow(student_train), "\n")
Training set size: 5102 
cat("Test set size:    ", nrow(student_test), "\n")
Test set size:     1276 

The student dataset was split into 80% training data and 20% test data using initial_split(). The training set is used to estimate the model parameters, and the test set is held out to evaluate predictive performance on unseen observations. set.seed(465) ensures the split is reproducible.

Predictive Modeling

Model Specification

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

Linear regression is appropriate because the target variable exam_score is continuous and numeric.

Model 1: Behavioral Predictors

student_model_1 <- lm_spec |>
  fit(
    exam_score ~ hours_studied +
      attendance +
      previous_scores +
      sleep_hours +
      tutoring_sessions +
      physical_activity,
    data = student_train
  )

tidy(student_model_1)
# A tibble: 7 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        40.8      0.396     103.    0       
2 hours_studied       0.292    0.00583    50.0   0       
3 attendance          0.197    0.00306    64.5   0       
4 previous_scores     0.0507   0.00247    20.6   2.75e-90
5 sleep_hours        -0.0139   0.0240     -0.578 5.63e- 1
6 tutoring_sessions   0.499    0.0286     17.5   2.17e-66
7 physical_activity   0.138    0.0344      4.02  5.82e- 5

Model 1 rationale: This baseline model focuses exclusively on behavioral and academic predictors — variables that students can directly control. hours_studied and attendance capture the direct effort invested in learning. previous_scores proxies for accumulated prior knowledge. tutoring_sessions reflects access to additional academic support. sleep_hours and physical_activity capture student wellbeing, which can indirectly affect cognitive function and exam readiness. This model tests whether individual effort alone is sufficient to explain academic performance.

Model 2: Full Socioeconomic Model

student_model_2 <- lm_spec |>
  fit(
    exam_score ~ hours_studied +
      attendance +
      previous_scores +
      sleep_hours +
      tutoring_sessions +
      physical_activity +
      parental_involvement +
      access_to_resources +
      extracurricular_activities +
      motivation_level +
      internet_access +
      family_income +
      teacher_quality +
      school_type +
      peer_influence +
      learning_disabilities +
      parental_education_level +
      distance_from_home +
      gender,
    data = student_train
  )

tidy(student_model_2)
# A tibble: 28 × 5
   term                       estimate std.error statistic   p.value
   <chr>                         <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                41.6       0.386     108.    0        
 2 hours_studied               0.293     0.00487    60.3   0        
 3 attendance                  0.198     0.00256    77.5   0        
 4 previous_scores             0.0509    0.00206    24.7   4.26e-127
 5 sleep_hours                 0.00664   0.0201      0.331 7.41e-  1
 6 tutoring_sessions           0.499     0.0238     21.0   1.47e- 93
 7 physical_activity           0.177     0.0288      6.17  7.32e- 10
 8 parental_involvementLow    -1.96      0.0851    -23.0   6.96e-112
 9 parental_involvementMedium -1.05      0.0685    -15.4   3.35e- 52
10 access_to_resourcesLow     -2.12      0.0854    -24.8   1.31e-128
# ℹ 18 more rows

Model 2 rationale: This extended model adds socioeconomic and environmental predictors — factors largely outside a student’s direct control. family_income and access_to_resources capture material advantages affecting study quality. parental_involvement and parental_education_level reflect the home learning environment. motivation_level captures psychological engagement. teacher_quality and school_type represent institutional factors. peer_influence reflects the social learning environment. learning_disabilities captures additional barriers in standard exam settings. This model tests whether structural and environmental inequality translates into unequal performance outcomes, even after controlling for individual effort.

Predictions and Evaluation Metrics

student_pred_1 <- predict(student_model_1, student_test) |>
  bind_cols(student_test)

student_metrics_1 <- student_pred_1 |>
  metrics(truth = exam_score, estimate = .pred)

student_metrics_1
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       2.40 
2 rsq     standard       0.612
3 mae     standard       1.29 
student_pred_2 <- predict(student_model_2, student_test) |>
  bind_cols(student_test)

student_metrics_2 <- student_pred_2 |>
  metrics(truth = exam_score, estimate = .pred)

student_metrics_2
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.95 
2 rsq     standard       0.745
3 mae     standard       0.464

Model Comparison and Selection

student_comparison <- bind_rows(
  student_metrics_1 |> mutate(Model = "Model 1: Behavioral"),
  student_metrics_2 |> mutate(Model = "Model 2: Full Socioeconomic")
) |>
  filter(.metric %in% c("rmse", "rsq")) |>
  select(Model, Metric = .metric, Estimate = .estimate) |>
  mutate(Estimate = round(Estimate, 4))

student_comparison
# A tibble: 4 × 3
  Model                       Metric Estimate
  <chr>                       <chr>     <dbl>
1 Model 1: Behavioral         rmse      2.40 
2 Model 1: Behavioral         rsq       0.612
3 Model 2: Full Socioeconomic rmse      1.95 
4 Model 2: Full Socioeconomic rsq       0.745

RMSE (root mean squared error) measures the typical size of the prediction error in the same units as exam scores; a lower value indicates better predictions. R² measures the proportion of variation in exam scores explained by the model; a higher value indicates stronger explanatory power.
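To make the three metrics reported by metrics() concrete, the sketch below computes them by hand in base R on two short hypothetical vectors. R² is computed as the squared correlation between observed and predicted values, which is how the yardstick documentation describes rsq(); the vectors themselves are made up for illustration.

```r
truth <- c(67, 61, 74, 71, 70)  # hypothetical observed exam scores
pred  <- c(66, 63, 73, 70, 71)  # hypothetical model predictions

# Root mean squared error: penalises large errors more heavily
rmse <- sqrt(mean((truth - pred)^2))

# Mean absolute error: average size of the error, ignoring sign
mae <- mean(abs(truth - pred))

# R-squared as the squared correlation between truth and prediction
rsq <- cor(truth, pred)^2

print(c(rmse = rmse, mae = mae, rsq = rsq))
```

Because RMSE squares the errors before averaging, it is always at least as large as MAE; a wide gap between the two suggests that a small number of large errors dominates.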

Model 2 is selected as the better regression model. On the test set it lowers RMSE from 2.40 to 1.95 and raises R² from 0.612 to 0.745, meaning it makes smaller prediction errors and captures more of the variation in student performance. Although Model 2 is more complex, using 19 predictors instead of 6, the improvement in both RMSE and R² on the held-out test set justifies the additional variables. The gain in predictive accuracy is not merely the result of fitting the training data more closely, because the test set was never seen by the model during training. This indicates that the extra socioeconomic predictors carry genuine explanatory signal rather than noise.

Economically, this result is meaningful: individual effort alone is not sufficient to fully explain exam outcomes. Socioeconomic factors — family income, access to resources, parental involvement, motivation level, and school quality — contribute additional predictive power. This finding is consistent with education economics research showing that structural inequality in educational inputs produces unequal performance outcomes, even among students who invest similar effort.

Cross-Validation

set.seed(465)

student_folds <- vfold_cv(student_train, v = 5)

student_cv <- fit_resamples(
  lm_spec,
  exam_score ~ hours_studied +
    attendance +
    previous_scores +
    sleep_hours +
    tutoring_sessions +
    physical_activity +
    parental_involvement +
    access_to_resources +
    extracurricular_activities +
    motivation_level +
    internet_access +
    family_income +
    teacher_quality +
    school_type +
    peer_influence +
    learning_disabilities +
    parental_education_level +
    distance_from_home +
    gender,
  resamples = student_folds,
  metrics = metric_set(rmse, rsq)
)

student_cv_results <- collect_metrics(student_cv)
student_cv_results
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   2.07      5  0.188  pre0_mod0_post0
2 rsq     standard   0.717     5  0.0406 pre0_mod0_post0
cv_rmse  <- student_cv_results |> filter(.metric == "rmse") |> pull(mean)
cv_rsq   <- student_cv_results |> filter(.metric == "rsq")  |> pull(mean)

test_rmse <- student_metrics_2 |> filter(.metric == "rmse") |> pull(.estimate)
test_rsq  <- student_metrics_2 |> filter(.metric == "rsq")  |> pull(.estimate)

tibble(
  Metric     = c("RMSE", "R²"),
  `CV Mean`  = round(c(cv_rmse, cv_rsq), 4),
  `Test Set` = round(c(test_rmse, test_rsq), 4),
  Difference = round(c(test_rmse - cv_rmse, test_rsq - cv_rsq), 4)
)
# A tibble: 2 × 4
  Metric `CV Mean` `Test Set` Difference
  <chr>      <dbl>      <dbl>      <dbl>
1 RMSE       2.07       1.95     -0.121 
2 R²         0.717      0.745     0.0275

The 5-fold cross-validation results for Model 2 are close to the test set results. The test RMSE (1.95) is 0.12 below the CV mean (2.07), and the test R² (0.745) is 0.03 above the CV mean (0.717); both gaps are smaller than the corresponding CV standard errors (0.188 and 0.0406). Because both figures are out-of-sample estimates, their agreement indicates that the test set result is not an artifact of one particular split. If performance depended heavily on which observations happened to be held out, the fold-level estimates would vary widely and the test set metrics could fall far from the CV mean. The consistency here suggests that Model 2’s predictive ability is not specific to the training sample.
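The report relies on vfold_cv() and fit_resamples() for this step. To show what those calls do conceptually, the sketch below runs 5-fold cross-validation by hand in base R on simulated data; the data, the one-predictor model, and the variable names are hypothetical. The standard error is taken as the standard deviation of the fold estimates divided by the square root of the number of folds, which is assumed here to correspond to the std_err column shown above.

```r
set.seed(465)
n <- 100

# Simulated data (hypothetical): one predictor, linear signal plus noise
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 3 * toy$x + rnorm(n)

k <- 5
# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  train <- toy[fold_id != i, ]
  held  <- toy[fold_id == i, ]
  fit   <- lm(y ~ x, data = train)
  sqrt(mean((held$y - predict(fit, newdata = held))^2))
})

cv_mean <- mean(fold_rmse)           # analogue of the "mean" column
cv_se   <- sd(fold_rmse) / sqrt(k)   # analogue of the "std_err" column

print(round(c(mean = cv_mean, std_err = cv_se), 3))
```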

Economic Interpretation — Dataset 1

Answer to the economic question: The full model results confirm that exam scores are shaped by a combination of individual effort and structural background factors. Students who study more hours, attend more regularly, have higher previous scores, and receive tutoring consistently perform better. However, even after controlling for these behavioral variables, socioeconomic characteristics — family income, access to resources, parental involvement, school quality, and motivation — contribute additional explanatory power.

Coefficient interpretation: In a linear regression, each coefficient represents the expected change in exam score for a one-unit increase in that predictor, holding all other variables constant. The table below extracts the key coefficients from Model 2 with their estimates and significance levels:

# Extract and display key coefficients from Model 2
student_coefs <- tidy(student_model_2) |>
  filter(term != "(Intercept)") |>
  select(term, estimate, std.error, p.value) |>
  mutate(
    estimate  = round(estimate, 3),
    std.error = round(std.error, 3),
    p.value   = round(p.value, 4),
    significant = if_else(p.value < 0.05, "Yes", "No")
  ) |>
  arrange(desc(abs(estimate)))

student_coefs
# A tibble: 27 × 5
   term                       estimate std.error p.value significant
   <chr>                         <dbl>     <dbl>   <dbl> <chr>      
 1 access_to_resourcesLow       -2.12      0.085       0 Yes        
 2 parental_involvementLow      -1.96      0.085       0 Yes        
 3 teacher_qualityLow           -1.09      0.107       0 Yes        
 4 peer_influencePositive        1.07      0.08        0 Yes        
 5 parental_involvementMedium   -1.05      0.069       0 Yes        
 6 motivation_levelLow          -1.05      0.085       0 Yes        
 7 access_to_resourcesMedium    -1.05      0.068       0 Yes        
 8 family_incomeLow             -1.00      0.082       0 Yes        
 9 internet_accessYes            0.943     0.113       0 Yes        
10 distance_from_homeNear        0.893     0.101       0 Yes        
# ℹ 17 more rows

Reading these coefficients in terms of our original economic question: the coefficient on hours_studied (0.293) is the marginal return to one additional hour of study per week in exam score points, holding everything else constant; this is the direct effort channel. The coefficient on attendance (0.198) similarly captures the engagement channel. Crucially, the coefficients on family_income, access_to_resources, and parental_involvement, estimated after controlling for hours studied and attendance, represent the structural inequality channel: the part of exam score differences that cannot be explained by effort alone. Because these predictors are categorical, each coefficient is measured relative to the omitted reference category ("High"). The estimates for the Low categories are negative and statistically significant: about -2.12 points for low access to resources, -1.96 for low parental involvement, and -1.00 for low family income. This means that two students who study equally hard and attend equally often still have different expected exam scores depending on their socioeconomic background. This is the core economic finding: structural inputs shape outcomes independently of individual effort, which is what justifies policy intervention beyond motivating students to study more.
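Terms such as access_to_resourcesLow only make sense relative to a reference category. The sketch below, on simulated data with hypothetical effect sizes, shows that lm() treats a character predictor as a factor with alphabetically ordered levels (so "High" becomes the baseline) and that choosing "Low" as the baseline flips the sign of the same contrast without changing the fit.

```r
set.seed(465)

# Simulated data (hypothetical): three resource levels, 20 students each
toy <- data.frame(access = rep(c("High", "Low", "Medium"), each = 20))
shift <- c(High = 0, Low = -2, Medium = -1)  # assumed true gaps vs "High"
toy$score <- 70 + unname(shift[toy$access]) + rnorm(60, sd = 0.5)

# Default coding: levels sorted alphabetically, so "High" is the reference
fit_default <- lm(score ~ access, data = toy)
print(names(coef(fit_default)))
# "(Intercept)" "accessLow" "accessMedium"

# Explicit ordering: "Low" becomes the reference category
toy$access_f <- factor(toy$access, levels = c("Low", "Medium", "High"))
fit_relevel <- lm(score ~ access_f, data = toy)
print(names(coef(fit_relevel)))
# "(Intercept)" "access_fMedium" "access_fHigh"

# Same Low-versus-High gap, opposite sign
print(c(low_vs_high = coef(fit_default)[["accessLow"]],
        high_vs_low = coef(fit_relevel)[["access_fHigh"]]))
```

Setting the factor levels explicitly (for example Low, Medium, High) before fitting would make the reported coefficients read as gains from moving up a category rather than penalties relative to the top one.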

Policy implications: These findings suggest that interventions focused only on encouraging students to study more may be insufficient if structural disadvantages are not addressed. Policies aimed at improving resource access, school quality, and parental engagement are likely to produce additional gains in educational outcomes. From a human capital perspective, equalising educational inputs would reduce performance inequality and increase the aggregate stock of human capital in the economy.


Dataset 2: Classification — Loan Approval

Economic Question

Do lenders rely primarily on creditworthiness signals (CIBIL score, income) or does a broader applicant profile (assets, education, employment type) independently predict loan approval — and what does this reveal about how credit risk is assessed in practice?

Access to credit is a fundamental mechanism in modern economies. The key question is not merely whether we can predict approval, but what information lenders actually use: if simple credit scores and income dominate, this suggests lending follows a narrow risk-based model; if broader profile variables add predictive power, this suggests lenders incorporate a richer assessment of borrower characteristics. Understanding this distinction matters for credit market efficiency and financial inclusion policy.

Dataset Description and Source

The second dataset is the Loan Approval Prediction dataset, obtained from Kaggle. It contains information about loan applicants, including income, loan amount, loan term, credit score, number of dependents, employment status, education, and asset values.

  • Source: Kaggle — Loan Approval Prediction dataset (loan_approval_dataset.csv)
  • Target variable: loan_approved (binary factor with levels approved and rejected, derived from loan_status)
  • Why this dataset is relevant: Loan approval is a central decision in credit markets. Predicting whether a loan will be approved based on observable applicant characteristics allows us to understand what financial and personal factors lenders rely on — and whether simpler or more complex credit scoring models are more effective.

Data Import and Cleaning

# Import the loan approval dataset
loan_raw <- read_csv("loan_approval_dataset.csv")
Rows: 4269 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): education, self_employed, loan_status
dbl (10): loan_id, no_of_dependents, income_annum, loan_amount, loan_term, c...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(loan_raw)
Rows: 4,269
Columns: 13
$ loan_id                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ no_of_dependents         <dbl> 2, 0, 3, 3, 5, 0, 5, 2, 0, 5, 4, 2, 3, 2, 1, …
$ education                <chr> "Graduate", "Not Graduate", "Graduate", "Grad…
$ self_employed            <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", …
$ income_annum             <dbl> 9600000, 4100000, 9100000, 8200000, 9800000, …
$ loan_amount              <dbl> 29900000, 12200000, 29700000, 30700000, 24200…
$ loan_term                <dbl> 12, 8, 20, 8, 20, 10, 4, 20, 20, 10, 2, 18, 1…
$ cibil_score              <dbl> 778, 417, 506, 467, 382, 319, 678, 382, 782, …
$ residential_assets_value <dbl> 2400000, 2700000, 7100000, 18200000, 12400000…
$ commercial_assets_value  <dbl> 17600000, 2200000, 4500000, 3300000, 8200000,…
$ luxury_assets_value      <dbl> 22700000, 8800000, 33300000, 23300000, 294000…
$ bank_asset_value         <dbl> 8000000, 3300000, 12800000, 7900000, 5000000,…
$ loan_status              <chr> "Approved", "Rejected", "Rejected", "Rejected…
nrow(loan_raw)
[1] 4269
# Clean variable names, remove missing values, and create binary target variable
loan_clean <- loan_raw |>
  clean_names() |>
  drop_na() |>
  mutate(
    loan_status   = str_trim(loan_status),
    loan_approved = if_else(
      loan_status == "Approved", "approved", "rejected"
    ),
    loan_approved = factor(loan_approved, levels = c("approved", "rejected"))
  )

glimpse(loan_clean)
Rows: 4,269
Columns: 14
$ loan_id                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ no_of_dependents         <dbl> 2, 0, 3, 3, 5, 0, 5, 2, 0, 5, 4, 2, 3, 2, 1, …
$ education                <chr> "Graduate", "Not Graduate", "Graduate", "Grad…
$ self_employed            <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", …
$ income_annum             <dbl> 9600000, 4100000, 9100000, 8200000, 9800000, …
$ loan_amount              <dbl> 29900000, 12200000, 29700000, 30700000, 24200…
$ loan_term                <dbl> 12, 8, 20, 8, 20, 10, 4, 20, 20, 10, 2, 18, 1…
$ cibil_score              <dbl> 778, 417, 506, 467, 382, 319, 678, 382, 782, …
$ residential_assets_value <dbl> 2400000, 2700000, 7100000, 18200000, 12400000…
$ commercial_assets_value  <dbl> 17600000, 2200000, 4500000, 3300000, 8200000,…
$ luxury_assets_value      <dbl> 22700000, 8800000, 33300000, 23300000, 294000…
$ bank_asset_value         <dbl> 8000000, 3300000, 12800000, 7900000, 5000000,…
$ loan_status              <chr> "Approved", "Rejected", "Rejected", "Rejected…
$ loan_approved            <fct> approved, rejected, rejected, rejected, rejec…
nrow(loan_clean)
[1] 4269
table(loan_clean$loan_approved)

approved rejected 
    2656     1613 

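The dataset was imported using read_csv(). Variable names were standardised with clean_names(), and rows with missing values were removed using drop_na(); the row count is 4,269 both before and after this step, so no rows were actually dropped. Whitespace was then stripped from the loan_status strings using str_trim() before creating the binary factor target loan_approved, because an untrimmed value would not match "Approved" exactly and would be coded as rejected. The result is a tidy dataset with one row per loan application and one column per variable, containing 2,656 approved and 1,613 rejected applications.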
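
The reason for trimming before the comparison can be shown with a minimal base R sketch. The status strings below are hypothetical and are not taken from the loan file; trimws() is used as the base R counterpart of str_trim() so the snippet runs without extra packages.

```r
# Hypothetical raw values with stray whitespace (not taken from the loan file)
raw_status <- c(" Approved", "Rejected ", " Rejected", "Approved")

# Without trimming, " Approved" fails the equality test and is miscoded
coded_untrimmed <- ifelse(raw_status == "Approved", "approved", "rejected")

# trimws() is the base R counterpart of stringr::str_trim()
coded_trimmed <- ifelse(trimws(raw_status) == "Approved", "approved", "rejected")

table(coded_untrimmed)
table(coded_trimmed)
```

In this toy vector one of the two approved applications is lost when the strings are not trimmed, which is the kind of silent miscoding the cleaning step guards against.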

Predictor Variable Exploration

Summary of All Variables

# Summary statistics for all variables
summary(loan_clean)
    loan_id     no_of_dependents  education         self_employed     
 Min.   :   1   Min.   :0.000    Length:4269        Length:4269       
 1st Qu.:1068   1st Qu.:1.000    Class :character   Class :character  
 Median :2135   Median :3.000    Mode  :character   Mode  :character  
 Mean   :2135   Mean   :2.499                                         
 3rd Qu.:3202   3rd Qu.:4.000                                         
 Max.   :4269   Max.   :5.000                                         
  income_annum      loan_amount         loan_term     cibil_score   
 Min.   : 200000   Min.   :  300000   Min.   : 2.0   Min.   :300.0  
 1st Qu.:2700000   1st Qu.: 7700000   1st Qu.: 6.0   1st Qu.:453.0  
 Median :5100000   Median :14500000   Median :10.0   Median :600.0  
 Mean   :5059124   Mean   :15133450   Mean   :10.9   Mean   :599.9  
 3rd Qu.:7500000   3rd Qu.:21500000   3rd Qu.:16.0   3rd Qu.:748.0  
 Max.   :9900000   Max.   :39500000   Max.   :20.0   Max.   :900.0  
 residential_assets_value commercial_assets_value luxury_assets_value
 Min.   : -100000         Min.   :       0        Min.   :  300000   
 1st Qu.: 2200000         1st Qu.: 1300000        1st Qu.: 7500000   
 Median : 5600000         Median : 3700000        Median :14600000   
 Mean   : 7472616         Mean   : 4973155        Mean   :15126306   
 3rd Qu.:11300000         3rd Qu.: 7600000        3rd Qu.:21700000   
 Max.   :29100000         Max.   :19400000        Max.   :39200000   
 bank_asset_value   loan_status         loan_approved 
 Min.   :       0   Length:4269        approved:2656  
 1st Qu.: 2300000   Class :character   rejected:1613  
 Median : 4600000   Mode  :character                  
 Mean   : 4976692                                     
 3rd Qu.: 7100000                                     
 Max.   :14700000                                     
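
The summary reports a minimum of -100,000 for residential_assets_value, which is hard to interpret as an asset value; the report keeps these rows as they are. A quick way to see how many entries are affected is sketched below. The toy data frame is an assumption standing in for loan_clean, and its values are illustrative only.

```r
# Toy data standing in for loan_clean (values are illustrative only)
toy_loans <- data.frame(
  residential_assets_value = c(2400000, -100000, 7100000, 0),
  bank_asset_value         = c(8000000, 3300000, 0, 5000000)
)

# Count negative entries in each numeric column
negative_counts <- vapply(
  toy_loans,
  function(x) sum(x < 0, na.rm = TRUE),
  integer(1)
)

negative_counts
```

Applied to the real data, the same call would show whether the negative values are isolated entries or a wider coding issue.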

Continuous Predictors — Distribution and Outliers by Approval Status

# Boxplots: continuous financial predictors by approval status
loan_clean |>
  select(loan_approved, income_annum, loan_amount, loan_term,
         cibil_score, residential_assets_value,
         commercial_assets_value, luxury_assets_value, bank_asset_value) |>
  pivot_longer(-loan_approved, names_to = "variable", values_to = "value") |>
  ggplot(aes(x = loan_approved, y = value, fill = loan_approved)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~variable, scales = "free_y") +
  labs(
    title = "Continuous Predictors by Loan Approval Status",
    x = "Loan Approval Status", y = "Value"
  ) +
  theme_minimal()

Categorical Predictors — Approval Rate by Group

# Bar charts: approval rate by categorical predictors
loan_clean |>
  select(loan_approved, education, self_employed) |>
  pivot_longer(-loan_approved, names_to = "variable", values_to = "category") |>
  ggplot(aes(x = category, fill = loan_approved)) +
  geom_bar(position = "fill") +
  facet_wrap(~variable, scales = "free_x") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Loan Approval Rate by Categorical Predictors",
    x = NULL, y = "Share of Applications", fill = "Status"
  ) +
  theme_minimal()

CIBIL Score Distribution by Approval Status

# CIBIL score is the key creditworthiness signal — explore its distribution
loan_clean |>
  ggplot(aes(x = cibil_score, fill = loan_approved)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  labs(
    title = "CIBIL Score Distribution by Loan Approval Status",
    x = "CIBIL Score", y = "Count", fill = "Loan Status"
  ) +
  theme_minimal()

EDA Interpretation: The boxplots show that approved applicants have substantially higher CIBIL scores and higher annual incomes on average — these are the clearest separating variables between approved and rejected applications. The CIBIL score histogram shows near-complete separation: rejected applicants cluster at low scores while approved applicants cluster at high scores, suggesting CIBIL score alone carries substantial predictive power. Asset values (residential, commercial, luxury, bank) are also higher for approved applicants, but with more overlap. Education level and employment type show smaller differences in approval rates. This pattern directly motivates Model 1 (core financial indicators) as the baseline and Model 2 (full profile) as the extended test — the EDA already suggests that financial signals will dominate.

Probability Distribution Analysis

Summary Statistics

# Compute summary statistics for the classification target variable
loan_clean |>
  mutate(loan_approved_num = as.numeric(loan_approved == "approved")) |>
  summarise(
    Mean   = mean(loan_approved_num),
    Median = median(loan_approved_num),
    SD     = sd(loan_approved_num),
    Min    = min(loan_approved_num),
    Q1     = quantile(loan_approved_num, 0.25),
    Q3     = quantile(loan_approved_num, 0.75),
    Max    = max(loan_approved_num)
  )
# A tibble: 1 × 7
   Mean Median    SD   Min    Q1    Q3   Max
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.622      1 0.485     0     0     1     1

The mean of the binary target represents the proportion of loan applications that were approved in the dataset. Since the target is binary (0 or 1), the standard deviation and quartiles reflect the balance of the two classes rather than continuous spread.

Histogram of Loan Approval Status

loan_clean |>
  mutate(loan_approved_num = as.numeric(loan_approved == "approved")) |>
  ggplot(aes(x = loan_approved_num)) +
  geom_histogram(bins = 2, fill = "steelblue", color = "white") +
  scale_x_continuous(breaks = c(0, 1),
                     labels = c("0 = Rejected", "1 = Approved")) +
  labs(
    title = "Distribution of Loan Approval Status",
    x = "Loan Approved",
    y = "Number of Applications"
  ) +
  theme_minimal()

The histogram shows the counts of approved and rejected applications. Since loan_approved is a binary variable, this distribution is not continuous.

Log-Transformed Histogram

loan_clean |>
  mutate(loan_approved_num = as.numeric(loan_approved == "approved"),
         log_loan = log(loan_approved_num + 1)) |>
  ggplot(aes(x = log_loan)) +
  geom_histogram(bins = 2, fill = "darkorange", color = "white") +
  labs(
    title = "Distribution of Log-Transformed Loan Approval Status",
    x = "Log of Loan Approved",
    y = "Number of Applications"
  ) +
  theme_minimal()

Because loan_approved is binary, the log transformation does not produce a normal distribution — it only shifts values from {0, 1} to {0, log(2)}. Log transformation is more appropriate for skewed continuous variables and does not improve the distribution of a binary target.

Theoretical Distribution

The loan_approved variable follows a Bernoulli distribution, as it has exactly two possible outcomes: approved (1) or rejected (0). The probability parameter of this distribution equals the proportion of approved applications in the dataset. A normal or log-normal approximation is not appropriate for this variable.
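
As a rough check of the Bernoulli description, the standard deviation implied by the approval share can be compared with the value in the summary table. The sketch below uses only the class counts reported earlier (2,656 approved out of 4,269); the variable names are ours.

```r
# Class counts as reported in the cleaning step
n_approved <- 2656
n_total    <- 4269

# Bernoulli parameter: share of approved applications
p_hat <- n_approved / n_total

# Population SD of a Bernoulli(p) variable
sd_bernoulli <- sqrt(p_hat * (1 - p_hat))

# sd() in R divides by n - 1, so apply the small-sample correction
sd_sample <- sd_bernoulli * sqrt(n_total / (n_total - 1))

round(c(p_hat = p_hat, sd_bernoulli = sd_bernoulli, sd_sample = sd_sample), 3)
```

Both the share (about 0.622) and the implied standard deviation (about 0.485) agree with the summary statistics, so the mean alone fully describes this target.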


Data Splitting

set.seed(465)

loan_split <- initial_split(
  loan_clean,
  prop   = 0.80,
  strata = loan_approved
)

loan_train <- training(loan_split)
loan_test  <- testing(loan_split)

cat("Training set size:", nrow(loan_train), "\n")
Training set size: 3414 
cat("Test set size:    ", nrow(loan_test), "\n")
Test set size:     855 
cat("\nClass balance in training set:\n")

Class balance in training set:
print(table(loan_train$loan_approved))

approved rejected 
    2124     1290 
cat("\nClass balance in test set:\n")

Class balance in test set:
print(table(loan_test$loan_approved))

approved rejected 
     532      323 

The loan dataset was split 80/20 into a training set (3,414 rows) and a test set (855 rows). strata = loan_approved was used so that the share of approved applications is preserved in both sets. This matters for classification models because the classes are not balanced: roughly 62% of applications are approved and 38% rejected, and the printed tables show the same shares in both subsets.
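
The effect of stratification can be verified from the printed counts alone. The sketch below recomputes the approval share in the full, training, and test sets; all numbers are copied from the output above.

```r
# Approved counts and totals as printed in the report
approved <- c(full = 2656, train = 2124, test = 532)
total    <- c(full = 4269, train = 3414, test = 855)

# Share of approved applications in each set
share_approved <- approved / total

round(share_approved, 4)
```

The three shares differ only in the fourth decimal place, which is what a stratified split is meant to deliver.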

Predictive Modeling

Model Specification

log_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

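Logistic regression is the appropriate method because the target variable is binary. It models the log-odds of one outcome as a linear function of the predictors. With the glm engine, the modelled outcome is the second factor level, so with levels = c("approved", "rejected") the coefficients reported below describe the log-odds of rejection, not approval: a negative coefficient means the predictor makes approval more likely.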
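
The direction of the coefficients depends on the order of the factor levels, which is easy to overlook. The base R sketch below uses simulated data (not the loan data; score, outcome, and the simulation parameters are assumptions) to show that glm() with a binomial family models the second level of a factor response, and that reversing the levels flips every sign.

```r
set.seed(465)

# Simulated applicants: a higher score makes approval more likely
n     <- 500
score <- runif(n, 300, 900)
p_approved <- plogis((score - 600) / 50)
status     <- ifelse(runif(n) < p_approved, "approved", "rejected")

# Same level order as in the report: "approved" first, "rejected" second
outcome <- factor(status, levels = c("approved", "rejected"))

# glm() treats the first level as failure, so this models P(rejected)
fit_rejected  <- glm(outcome ~ score, family = binomial)
coef_rejected <- coef(fit_rejected)[["score"]]

# Making "rejected" the reference level models P(approved) instead
outcome_flipped <- relevel(outcome, ref = "rejected")
fit_approved    <- glm(outcome_flipped ~ score, family = binomial)
coef_approved   <- coef(fit_approved)[["score"]]

c(rejected_scale = coef_rejected, approved_scale = coef_approved)
```

The two slopes have the same size and opposite signs. Either level order is valid as long as the interpretation follows the same convention.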

Model 1: Core Financial Indicators

loan_model_1 <- log_spec |>
  fit(
    loan_approved ~ income_annum +
      loan_amount +
      loan_term +
      cibil_score,
    data = loan_train
  )

tidy(loan_model_1)
# A tibble: 5 × 5
  term             estimate    std.error statistic   p.value
  <chr>               <dbl>        <dbl>     <dbl>     <dbl>
1 (Intercept)  11.1         0.456            24.2  1.53e-129
2 income_annum  0.000000438 0.0000000640      6.84 7.79e- 12
3 loan_amount  -0.000000139 0.0000000197     -7.04 1.86e- 12
4 loan_term     0.148       0.0125           11.8  3.93e- 32
5 cibil_score  -0.0242      0.000905        -26.7  3.08e-157

Model 1 rationale: This baseline model uses only the core financial indicators traditionally applied in credit scoring. cibil_score is the most direct measure of creditworthiness — a higher score signals a reliable repayment history and is expected to strongly increase approval probability. income_annum captures repayment capacity; higher income reduces default risk. loan_amount captures lender exposure — larger loans carry more risk and may reduce approval probability. loan_term reflects the repayment horizon; longer terms increase uncertainty. This model tests whether standard financial indicators are sufficient for predicting approval decisions.

Model 2: Full Applicant Profile

loan_model_2 <- log_spec |>
  fit(
    loan_approved ~ no_of_dependents +
      education +
      self_employed +
      income_annum +
      loan_amount +
      loan_term +
      cibil_score +
      residential_assets_value +
      commercial_assets_value +
      luxury_assets_value +
      bank_asset_value,
    data = loan_train
  )

tidy(loan_model_2)
# A tibble: 12 × 5
   term                     estimate    std.error statistic   p.value
   <chr>                       <dbl>        <dbl>     <dbl>     <dbl>
 1 (Intercept)               1.11e+1 0.479          23.1    2.88e-118
 2 no_of_dependents          1.00e-2 0.0389          0.258  7.97e-  1
 3 educationNot Graduate     5.68e-2 0.131           0.433  6.65e-  1
 4 self_employedYes         -7.37e-2 0.131          -0.564  5.73e-  1
 5 income_annum              6.26e-7 0.000000101     6.19   6.16e- 10
 6 loan_amount              -1.40e-7 0.0000000198   -7.09   1.37e- 12
 7 loan_term                 1.51e-1 0.0127         11.9    1.35e- 32
 8 cibil_score              -2.43e-2 0.000912      -26.7    1.46e-156
 9 residential_assets_value -1.08e-9 0.0000000132   -0.0815 9.35e-  1
10 commercial_assets_value  -4.10e-9 0.0000000189   -0.216  8.29e-  1
11 luxury_assets_value      -3.57e-8 0.0000000193   -1.85   6.43e-  2
12 bank_asset_value         -6.97e-8 0.0000000371   -1.88   6.04e-  2

Model 2 rationale: This extended model adds applicant characteristics and a full breakdown of asset holdings. Asset variables (residential_assets_value, commercial_assets_value, luxury_assets_value, bank_asset_value) capture collateral — the lender can claim these in the event of default, which reduces credit risk. no_of_dependents reflects financial obligations that may reduce disposable income. education proxies for long-term income stability. self_employed captures employment income uncertainty, which lenders may treat as higher risk. This model tests whether a richer applicant profile improves loan approval prediction beyond core financial metrics.

Predictions and Evaluation Metrics

loan_pred_1 <- predict(loan_model_1, loan_test) |>
  bind_cols(loan_test)

loan_accuracy_1  <- loan_pred_1 |> accuracy(truth = loan_approved,  estimate = .pred_class)
loan_precision_1 <- loan_pred_1 |> precision(truth = loan_approved, estimate = .pred_class)
loan_recall_1    <- loan_pred_1 |> recall(truth = loan_approved,    estimate = .pred_class)

loan_accuracy_1
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.926
loan_precision_1
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.954
loan_recall_1
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.927
loan_pred_2 <- predict(loan_model_2, loan_test) |>
  bind_cols(loan_test)

loan_accuracy_2  <- loan_pred_2 |> accuracy(truth = loan_approved,  estimate = .pred_class)
loan_precision_2 <- loan_pred_2 |> precision(truth = loan_approved, estimate = .pred_class)
loan_recall_2    <- loan_pred_2 |> recall(truth = loan_approved,    estimate = .pred_class)

loan_accuracy_2
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.924
loan_precision_2
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.953
loan_recall_2
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.923

Model Comparison and Selection

loan_comparison <- bind_rows(
  bind_rows(loan_accuracy_1, loan_precision_1, loan_recall_1) |>
    mutate(Model = "Model 1: Core Financial"),
  bind_rows(loan_accuracy_2, loan_precision_2, loan_recall_2) |>
    mutate(Model = "Model 2: Full Profile")
) |>
  select(Model, Metric = .metric, Estimate = .estimate) |>
  mutate(Estimate = round(Estimate, 4))

loan_comparison
# A tibble: 6 × 3
  Model                   Metric    Estimate
  <chr>                   <chr>        <dbl>
1 Model 1: Core Financial accuracy     0.926
2 Model 1: Core Financial precision    0.954
3 Model 1: Core Financial recall       0.927
4 Model 2: Full Profile   accuracy     0.924
5 Model 2: Full Profile   precision    0.953
6 Model 2: Full Profile   recall       0.923

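Accuracy measures the overall share of correct predictions. Precision and recall are computed with "approved" as the event of interest, since yardstick treats the first factor level as the event by default. Precision measures how many applications predicted as approved were actually approved by the lender: if the model were used to automate decisions, low precision would mean approving applications the lender would have rejected, which is costly if those rejections reflect genuine credit risk. Recall measures how many applications that were actually approved were correctly identified: low recall means rejecting applicants the lender would have accepted, which reduces credit market efficiency.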
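
The three metrics can be computed by hand from a confusion matrix. The report does not print the confusion matrix, so the four counts below are assumed values, chosen to be consistent with the reported Model 1 test metrics and the test set class totals (532 approved, 323 rejected); they should be read as an illustration rather than as the actual cell counts.

```r
# Assumed confusion matrix cells, with "approved" as the event of interest
tp <- 493  # predicted approved, actually approved
fp <- 24   # predicted approved, actually rejected
fn <- 39   # predicted rejected, actually approved
tn <- 299  # predicted rejected, actually rejected

n_test <- tp + fp + fn + tn

accuracy  <- (tp + tn) / n_test  # share of all predictions that are correct
precision <- tp / (tp + fp)      # share of predicted approvals that are correct
recall    <- tp / (tp + fn)      # share of actual approvals that are found

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

Under these assumed counts, false approvals (24) are rarer than missed approvals (39), which is why precision exceeds recall.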

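Model 1 is selected as the preferred classification model. It scores marginally higher on accuracy, precision, and recall on the test set while using four predictors instead of eleven. The gaps are small, between 0.001 and 0.004, and the accuracy gap corresponds to about two of the 855 test applications, so they are not strong evidence that the simpler model generalizes better. The case for Model 1 rests on parsimony: it matches the extended model with far fewer inputs.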
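
To put the size of the accuracy gap in context, it can be compared with the sampling uncertainty of a single test set accuracy. The sketch below uses a simple binomial standard error, which is a rough approximation: both models are scored on the same 855 applications, so a paired comparison would be more precise. The inputs are the reported accuracies; the variable names are ours.

```r
# Reported test set results
n_test <- 855
acc_1  <- 0.926  # Model 1: core financial
acc_2  <- 0.924  # Model 2: full profile

# Binomial standard error and approximate 95% interval for Model 1 accuracy
se_1 <- sqrt(acc_1 * (1 - acc_1) / n_test)
ci_1 <- acc_1 + c(-1, 1) * qnorm(0.975) * se_1

# Accuracy gap between the models, also expressed in test applications
gap <- acc_1 - acc_2
gap_in_applications <- round(gap * n_test)

round(c(se = se_1, ci_low = ci_1[1], ci_high = ci_1[2], gap = gap), 4)
gap_in_applications
```

The gap is several times smaller than one standard error, and Model 2's accuracy lies well inside the approximate interval for Model 1.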

Economically, this result suggests that CIBIL score, income, loan amount, and loan term are the primary predictors of loan approval decisions in this dataset. Adding asset holdings, employment type, education, and number of dependents in Model 2 did not improve predictive performance, and none of the added coefficients is significant at the 5% level (the closest are bank_asset_value and luxury_assets_value, with p ≈ 0.06). These variables carry little additional information beyond the core financial indicators. For practical lending decisions, Model 1 offers the advantage of simplicity and interpretability without sacrificing predictive accuracy.

Cross-Validation

set.seed(465)

loan_folds <- vfold_cv(loan_train, v = 5, strata = loan_approved)

loan_cv <- fit_resamples(
  log_spec,
  loan_approved ~ income_annum +
    loan_amount +
    loan_term +
    cibil_score,
  resamples = loan_folds,
  metrics   = metric_set(accuracy, precision, recall)
)

loan_cv_results <- collect_metrics(loan_cv)
loan_cv_results
# A tibble: 3 × 6
  .metric   .estimator  mean     n std_err .config        
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy  binary     0.914     5 0.00542 pre0_mod0_post0
2 precision binary     0.927     5 0.00448 pre0_mod0_post0
3 recall    binary     0.935     5 0.00610 pre0_mod0_post0
cv_acc  <- loan_cv_results |> filter(.metric == "accuracy")  |> pull(mean)
cv_prec <- loan_cv_results |> filter(.metric == "precision") |> pull(mean)
cv_rec  <- loan_cv_results |> filter(.metric == "recall")    |> pull(mean)

test_acc  <- loan_accuracy_1  |> pull(.estimate)
test_prec <- loan_precision_1 |> pull(.estimate)
test_rec  <- loan_recall_1    |> pull(.estimate)

tibble(
  Metric     = c("Accuracy", "Precision", "Recall"),
  `CV Mean`  = round(c(cv_acc, cv_prec, cv_rec), 4),
  `Test Set` = round(c(test_acc, test_prec, test_rec), 4),
  Difference = round(c(test_acc - cv_acc, test_prec - cv_prec, test_rec - cv_rec), 4)
)
# A tibble: 3 × 4
  Metric    `CV Mean` `Test Set` Difference
  <chr>         <dbl>      <dbl>      <dbl>
1 Accuracy      0.914      0.926     0.0127
2 Precision     0.927      0.954     0.0268
3 Recall        0.935      0.927    -0.0083

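The 5-fold cross-validation results for Model 1 are close to the test set results: accuracy and recall differ by about one percentage point and precision by 2.7 points, with the test set slightly higher on accuracy and precision. Both sets of figures are out-of-sample estimates, since each fold is scored on data held out from fitting, so their agreement indicates that the test result is not an artefact of one particular split. Overfitting would instead show up as performance on held-out data falling well below performance on the data used for fitting. The small standard errors across the five folds (about 0.004 to 0.006) also show that the model's performance does not depend on any particular subset of the training data, giving confidence that Model 1 would perform reliably on new loan applications from the same lender.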

Economic Interpretation — Dataset 2

Answer to the economic question: The model results confirm that applicant characteristics can predict loan approval with high accuracy. The most important predictors are creditworthiness (CIBIL score), repayment capacity (annual income), loan exposure (loan amount), and repayment horizon (loan term). These are the core variables used in traditional credit scoring, and the results show they are sufficient to explain most of the variation in approval decisions.

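Coefficient interpretation: In logistic regression, coefficients represent the change in the log-odds of the modelled outcome for a one-unit increase in each predictor, holding all other variables constant. Because loan_approved has levels c("approved", "rejected") and the glm engine models the second level, the modelled outcome here is rejection, so negative coefficients (odds ratios below 1) favour approval. The table below extracts the key coefficients from Model 1: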

# Extract and display coefficients from the selected classification model (Model 1)
loan_coefs <- tidy(loan_model_1) |>
  filter(term != "(Intercept)") |>
  select(term, estimate, std.error, p.value) |>
  mutate(
    estimate  = round(estimate, 4),
    std.error = round(std.error, 4),
    p.value   = round(p.value, 4),
    odds_ratio  = round(exp(estimate), 4),
    significant = if_else(p.value < 0.05, "Yes", "No")
  )

loan_coefs
# A tibble: 4 × 6
  term         estimate std.error p.value odds_ratio significant
  <chr>           <dbl>     <dbl>   <dbl>      <dbl> <chr>      
1 income_annum   0         0            0      1     Yes        
2 loan_amount    0         0            0      1     Yes        
3 loan_term      0.148     0.0125       0      1.16  Yes        
4 cibil_score   -0.0242    0.0009       0      0.976 Yes        
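
Rounding the estimates to four decimals before exponentiating hides the income and loan amount effects, because a one-unit change in a variable measured in millions is tiny. The sketch below recomputes odds ratios from the unrounded Model 1 estimates printed earlier, using a more meaningful step for each predictor. The step sizes (1,000,000 currency units, one unit of loan term, 100 score points) are our choice, and the odds refer to rejection, the second factor level.

```r
# Unrounded Model 1 estimates, copied from the tidy() output above
model_1 <- data.frame(
  term     = c("income_annum", "loan_amount", "loan_term", "cibil_score"),
  estimate = c(4.38e-7, -1.39e-7, 0.148, -0.0242),
  step     = c(1e6, 1e6, 1, 100)  # assumed step size for each predictor
)

# Change in the log-odds of rejection for one step of each predictor
model_1$log_odds_per_step <- model_1$estimate * model_1$step

# Exponentiate first, round only for display
model_1$odds_ratio <- exp(model_1$log_odds_per_step)
model_1$odds_ratio_display <- round(model_1$odds_ratio, 3)

model_1[, c("term", "step", "odds_ratio_display")]
```

On this scale all four predictors show odds ratios clearly different from 1, which the rounded table cannot display for income_annum and loan_amount.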

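Reading these results in terms of our original economic question: because the model estimates the odds of rejection (the second factor level), an odds ratio below 1 means the predictor makes approval more likely. The odds ratio on cibil_score is 0.976 and highly significant: each additional point lowers the odds of rejection by about 2.4%, so a 100-point increase multiplies them by roughly exp(-2.42) ≈ 0.09. loan_term also behaves as expected, with each additional unit of term raising the odds of rejection by about 16%. The table's rounding hides the other two variables: income_annum and loan_amount appear as 0 with an odds ratio of 1, and every p-value appears as 0, only because the unrounded values in the Model 1 output (4.38e-7 and -1.39e-7 per currency unit) are too small to display at four decimals. Rescaled, an extra 1,000,000 of annual income multiplies the odds of rejection by roughly 1.55 and an extra 1,000,000 of loan amount by roughly 0.87. Both signs run against the expectations stated in the Model 1 rationale, although they are conditional effects estimated with income and loan amount in the same model.

Taken together, creditworthiness has by far the largest test statistic (z = -26.7) and appears to be the primary gateway in this lender's decision process. Lenders do not appear to be simply comparing income to loan size (debt-to-income ratio); they are primarily screening on repayment history as summarised by the credit score. The fact that Model 1 achieves high accuracy with only these four variables is consistent with a narrow, credit-score-dominated rule rather than a holistic applicant assessment. This is relevant for financial inclusion, since applicants with limited credit histories may face barriers regardless of their actual assets or income.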

Policy implications: The finding that core financial indicators dominate approval decisions suggests that lenders in this dataset rely primarily on creditworthiness and repayment capacity rather than personal characteristics such as education or employment type. From a credit market perspective, this is consistent with efficient risk-based lending. However, if applicants with lower CIBIL scores or lower incomes are systematically denied credit, this may perpetuate financial exclusion. Policymakers interested in broadening credit access might consider credit-building programs or alternative lending models for underserved populations.


Limitations and Replication

Limitations

Limitation 1 — External validity of the datasets: Both datasets are from Kaggle and may not represent the full population of students or loan applicants. The student dataset does not specify a country or time period, which limits how far the findings can be generalised. The loan dataset reflects the approval decisions of a particular lender or market context. Any predictive model trained on these data should be validated in other settings before being used for real-world decisions.

Limitation 2 — Omitted variable bias and causality: The regression and classification models identify correlations, not causal relationships. For the student dataset, unobserved factors such as innate ability, teaching quality variation within school type, or household stability may confound the estimated coefficients. For the loan dataset, the CIBIL score itself aggregates many applicant characteristics, which makes it difficult to identify the independent effect of each financial variable. Neither model should be used to make causal claims without additional identification strategies such as instrumental variables or natural experiments.

Reproducibility

All random processes use set.seed(465) to ensure identical results each time the document is rendered. All file paths are relative (e.g., StudentPerformanceFactors.csv and loan_approval_dataset.csv), so the document will render correctly from any machine where the data files are in the same directory as the .qmd file. The tidymodels and tidyverse frameworks are standard R packages available via CRAN. The document can be rendered using quarto::quarto_render("ECON465_Stage3_FinalReport.qmd") or the RStudio Render button.


AI Use Log

Stage 2 Interaction (Documented in Stage 2)

During Stage 2, Claude (Anthropic) was used to resolve a technical problem involving the construction of a side-by-side CV vs. test set comparison table.

Prompt given: “I am working on a tidymodels project in R. I ran 5-fold cross-validation using fit_resamples() and collected metrics with collect_metrics(). The output has columns .metric and mean. I also have test set metrics stored in a tibble with columns .metric and .estimate. I want to create a single side-by-side table that shows, for each metric, the CV mean and the test set value in the same row. How do I do this?”

AI response used: The AI suggested extracting each metric individually using filter() and pull(), then assembling a new tibble() manually. It also recommended adding a Difference column and explained that a value close to zero indicates good generalization.

How it was used: The filter() and pull() approach was adopted directly, as it was cleaner than the pivot_wider() approach initially tried. The code was adapted for both the regression model (RMSE and R²) and the classification model (accuracy, precision, recall), replacing generic variable names with actual object names from the workflow. The Difference column was added to both tables as suggested.

Verification and reflection: The output was verified by manually checking individual metric values before writing interpretations. The AI’s explanation of what the Difference column means for overfitting also directly informed the cross-validation interpretation sections. The interaction was helpful for solving a concrete formatting problem and confirming the correct approach to comparing two tibbles with different column structures.

Stage 3 Interaction

During Stage 3, Claude was used to help structure the economic interpretation section for the regression model.

Prompt given: “In my linear regression model predicting student exam scores, Model 2 performs better than Model 1 when socioeconomic variables are added. How should I interpret this economically — what does it mean that adding family income and parental involvement improved predictions even after controlling for hours studied and attendance?”

AI response: The AI explained that this result is consistent with human capital theory and educational economics research, which documents that structural inequality in educational inputs — differences in resources, guidance, and school quality — translates into unequal performance outcomes even among students who invest similar effort. It suggested framing the finding in terms of the gap between individual effort and structural opportunity, and connecting it to policy implications around equalizing educational inputs.

How it was used: The framing around human capital theory and structural inequality was incorporated into the economic interpretation section. The specific language about “equalizing educational inputs” was reworded and expanded. The AI did not generate the statistical results or code — only the conceptual framing of what the model comparison means economically.

Verification and reflection: The interpretation was cross-checked against the intuition from the coefficient estimates. The AI’s framing was accurate and consistent with standard economic theory. It was useful for connecting the quantitative results to broader economic concepts in a way that adds analytical depth to the discussion.


Final Suggestions

Suggested Improvement

If more time and better data were available, the most valuable improvement would be to test nonlinear models such as random forests or gradient boosting trees for both datasets. The current analysis uses only linear and logistic regression, which assume that predictors enter linearly (in the outcome for linear regression, in the log-odds for logistic regression). Tree-based ensemble methods can capture nonlinear interactions. For example, the combined effect of high family income and high parental involvement may be greater than the sum of their individual effects. For the loan dataset, a random forest might also reduce the dependence on the CIBIL score alone by learning more complex patterns across multiple financial indicators. Comparing regularised regression (Ridge, LASSO) to the unpenalised models used here would also help assess whether variable selection or coefficient shrinkage improves generalization, particularly in Model 2 where many predictors are included.

New Economic Question Inspired by the Analysis

The classification results showed that CIBIL score is the dominant predictor of loan approval. This raises the following question for future research:

“What is the causal effect of a one-unit increase in CIBIL score on the probability of loan approval, and does this effect vary across income groups?”

This question moves beyond prediction to causal identification. Answering it would require either a regression discontinuity design — exploiting the fact that lenders often apply discrete score thresholds — or a quasi-experimental approach using changes in credit reporting rules or scoring formula updates as natural experiments. The answer would have direct implications for credit market efficiency and financial inclusion policy.


Overall Conclusion

This project applied the complete data science workflow — data acquisition, cleaning, probability analysis, predictive modeling, cross-validation, and economic interpretation — to two real-world datasets.

For Dataset 1 (Regression), the full socioeconomic model (Model 2) outperformed the behavioral-only baseline (Model 1) on both RMSE and R². This is consistent with the view that student exam performance is associated with structural inequality in educational inputs, not only with individual effort, although the models identify correlations rather than causal effects. The model is stable across cross-validation folds, indicating reliable predictive performance.

For Dataset 2 (Classification), the simpler core financial model (Model 1) marginally outperformed the extended profile model (Model 2) on accuracy, precision, and recall, by 0.004 or less on each metric. This indicates that CIBIL score, income, loan amount, and loan term carry nearly all of the predictive information about loan approval decisions in this dataset. The model is also stable across cross-validation folds.

Together, these findings illustrate how predictive modeling can generate economically meaningful insights — not only producing accurate predictions, but also revealing which factors matter most for real-world economic outcomes, with direct implications for education policy and credit market design.