ECON 465 - Stage 2: Predictive Modeling

Author

CEREN MURATSU

Published

May 23, 2026

1 Project Overview

This report presents the Stage 2 predictive modeling work for the two datasets introduced in Stage 1. The first dataset is used for a regression problem because the outcome variable is continuous, and the second dataset is used for a classification problem because the outcome variable is binary.

The Stage 1 feedback noted that the economic motivation, the variable roles, and the analysis depth needed to be clearer. This report addresses each point directly. Every dataset begins with a research narrative in the form “I predict [outcome] from [predictors] because [economic logic]”, together with the broader economic problem and a hypothesis. Each dataset then has a variable roles table that separates the target, the main predictors, and the controls and gives the economic reason for each. Finally, the model comparison commits to a clear choice and explains it against the research question, rather than reporting numbers in isolation.

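For reproducibility, set.seed(465) is called before every random step (the train/test splits and the cross-validation fold assignments), and both datasets are embedded inside this document so that it runs on its own without any external file. All modeling and evaluation code is written in base R; the only package call is knitr::kable(), which formats the two test-set performance tables.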

# Helper functions (base R only; knitr::kable() is used later purely to format tables).

# Root Mean Squared Error
rmse <- function(truth, pred) {
  sqrt(mean((truth - pred)^2))
}

# R-squared on a hold-out (test) set
rsq <- function(truth, pred) {
  1 - sum((truth - pred)^2) / sum((truth - mean(truth))^2)
}

# Classification metrics for a binary outcome (positive class = "Y")
class_metrics <- function(truth, pred, positive = "Y") {
  truth <- as.character(truth)
  pred  <- as.character(pred)
  acc <- mean(pred == truth)
  tp  <- sum(pred == positive & truth == positive)
  fp  <- sum(pred == positive & truth != positive)
  fn  <- sum(pred != positive & truth == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy = acc, precision = precision, recall = recall)
}

# Small base R versions of the tidymodels split functions, so the report can
# use initial_split(), training() and testing() exactly as in the instructions
# while still rendering without installing any package.
initial_split <- function(data, prop = 0.8) {
  n <- nrow(data)
  train_id <- sample(seq_len(n), size = floor(prop * n))
  list(data = data, train_id = train_id)
}
training <- function(split) split$data[split$train_id, , drop = FALSE]
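testing  <- function(split) {
  # setdiff() avoids the negative-index pitfall: -integer(0) would select no rows
  test_id <- setdiff(seq_len(nrow(split$data)), split$train_id)
  split$data[test_id, , drop = FALSE]
}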
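As a quick sanity check of the metric helpers, the sketch below applies them to a small set of made-up numbers (the values are illustrative and do not come from either dataset). It also shows a property that matters when reading the later tables: on a hold-out set, R-squared computed this way is 0 for a model that always predicts the mean and can be negative for a worse model.

```r
# Toy check of the metric helpers (illustrative values, not from the datasets).
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
rsq  <- function(truth, pred) {
  1 - sum((truth - pred)^2) / sum((truth - mean(truth))^2)
}

truth <- c(10, 20, 30, 40)
pred  <- c(12, 18, 33, 37)

# Errors are -2, 2, -3, 3, so the squared errors are 4, 4, 9, 9 (mean 6.5).
toy_rmse <- rmse(truth, pred)

# Total sum of squares around the mean (25) is 225 + 25 + 25 + 225 = 500.
toy_rsq <- rsq(truth, pred)

# Always predicting the mean gives a hold-out R-squared of 0 ...
rsq_mean <- rsq(truth, rep(mean(truth), length(truth)))
# ... and predictions worse than the mean give a negative value.
rsq_bad <- rsq(truth, rev(truth))

print(c(rmse = toy_rmse, rsq = toy_rsq, rsq_mean = rsq_mean, rsq_bad = rsq_bad))
```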

2 Dataset 1: Regression (Medical Insurance Charges)

2.1 Research Narrative

Broader economic problem. Medical insurance charges are the price of individual health-care risk in an insurance market. An insurer sets each person’s premium to cover their expected medical cost, so predicting charges from observable characteristics is the core economic problem of risk-based pricing. It also matters to households and policymakers, because it shows how preventable health risks, such as smoking, translate into higher private health-care costs.

Research narrative. I predict individual medical insurance charges from age, BMI, smoking status, number of children, sex, and region, because charges should reflect expected medical cost, and expected cost differs across people according to health risk and demographic factors. Age, BMI, and smoking status are the health-risk variables an insurer can actually observe, while children, sex, and region are demographic and geographic controls.

Hypothesis. I expect smoking status to be the strongest single predictor, followed by age and BMI. I also expect a richer model that includes a BMI-smoker interaction to predict better, because the cost effect of a high BMI should be larger for smokers than for non-smokers.

2.2 Variable Roles

Role	Variable	Economic reason
Target (outcome)	`charges`	Continuous individual medical insurance cost; the price of health-care risk
Main predictor	`age`	Older individuals tend to use more medical services, raising expected cost
Main predictor	`bmi`	Higher BMI is a health-risk indicator linked to higher expected cost
Main predictor	`smoker`	Smoking is a major health-risk behavior expected to raise cost sharply
Control	`children`	Family structure may shift household medical use
Control	`sex`	Controls for demographic differences in cost
Control	`region`	Controls for regional cost and pricing differences

2.3 Data Import and Preparation

# Import the embedded data and set categorical variables as factors.
insurance <- read.csv(text = insurance_csv, stringsAsFactors = FALSE)
insurance$sex    <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)

# Confirm dimensions
data.frame(observations = nrow(insurance), variables = ncol(insurance))

  observations variables
1         1338         7

2.4 Data Splitting

The data is split into 80% training and 20% test sets with initial_split(prop = 0.8), using a fixed seed for reproducibility. The training set is used to estimate the models, and the test set is kept aside to evaluate out-of-sample performance.

set.seed(465)
insurance_split <- initial_split(insurance, prop = 0.8)
train_reg <- training(insurance_split)
test_reg  <- testing(insurance_split)

data.frame(
  set  = c("Training", "Test", "Total"),
  size = c(nrow(train_reg), nrow(test_reg), nrow(insurance))
)

       set size
1 Training 1070
2     Test  268
3    Total 1338

The training set contains 1070 observations and the test set contains 268 observations.
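The reported sizes follow directly from floor(0.8 * 1338) = 1070. The sketch below reproduces that arithmetic on a placeholder data frame with the same number of rows (the `toy` object and its `id` column are hypothetical stand-ins for the insurance data) and checks that no row lands in both sets. The split helpers use the same logic as the ones defined in the helper section.

```r
# Same logic as the report's split helpers, repeated so this block runs alone.
initial_split <- function(data, prop = 0.8) {
  n <- nrow(data)
  train_id <- sample(seq_len(n), size = floor(prop * n))
  list(data = data, train_id = train_id)
}
training <- function(split) split$data[split$train_id, , drop = FALSE]
testing  <- function(split) split$data[-split$train_id, , drop = FALSE]

# Placeholder data with 1338 rows (stand-in for the insurance data).
toy <- data.frame(id = seq_len(1338))

set.seed(465)
toy_split <- initial_split(toy, prop = 0.8)

n_train <- nrow(training(toy_split))   # floor(0.8 * 1338)
n_test  <- nrow(testing(toy_split))    # remaining rows

# A row must never appear in both the training and the test set.
overlap <- intersect(training(toy_split)$id, testing(toy_split)$id)

print(c(train = n_train, test = n_test, overlap = length(overlap)))
```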

2.5 Building Two Models

Model 1 (baseline). A linear regression using only the three main health-risk predictors. This is the simple, interpretable specification suggested by the research narrative.

reg_model1 <- lm(charges ~ age + bmi + smoker, data = train_reg)

Model 2 (richer model). A linear regression that adds the controls and, importantly, the interaction bmi:smoker. The interaction is included for an economic reason: the cost effect of a high BMI is expected to be larger for smokers than for non-smokers.

reg_model2 <- lm(charges ~ age + bmi + smoker + children + sex + region + bmi:smoker,
                 data = train_reg)

Both models are trained on the training set and then used to predict on the test set.

pred_reg1 <- predict(reg_model1, test_reg)
pred_reg2 <- predict(reg_model2, test_reg)

reg_metrics <- data.frame(
  Model = c("Model 1: age + bmi + smoker",
            "Model 2: + children + sex + region + bmi:smoker"),
  RMSE  = c(rmse(test_reg$charges, pred_reg1),
            rmse(test_reg$charges, pred_reg2)),
  R2    = c(rsq(test_reg$charges, pred_reg1),
            rsq(test_reg$charges, pred_reg2))
)
knitr::kable(reg_metrics, digits = c(0, 2, 4),
             caption = "Test set performance for the two regression models")

Test set performance for the two regression models
Model	RMSE	R2
Model 1: age + bmi + smoker	6253.47	0.7293
Model 2: + children + sex + region + bmi:smoker	4757.98	0.8433

2.6 Model Comparison and Selection

Model 2 performs clearly better on the test set. Its RMSE is about 4758, compared with about 6253 for Model 1, meaning the typical prediction error falls by roughly 1500 dollars. Its test R-squared rises from about 0.73 to about 0.84, so it explains a larger share of the variation in charges.

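The improvement is not only statistical. The gain most plausibly comes from the bmi:smoker interaction, which matches the hypothesis: the additional cost of a higher BMI is much larger for smokers than for non-smokers. Because Model 2 adds the controls and the interaction at the same time, this attribution rests on the size of the estimated interaction coefficient reported below rather than on a separate test of each addition. Because Model 2 fits the data better and is also consistent with the economic story of risk-based pricing, it is selected as the better model for this dataset.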
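One way to separate the contribution of the interaction from that of the controls is to fit intermediate specifications and compare their test RMSE. The sketch below does this on simulated data, because the embedded insurance data is not reproduced here: the data-generating numbers, the `sim` data frame, and the three formulas are assumptions chosen for illustration, not results from the report.

```r
# Simulated data with a strong BMI-by-smoker effect (assumed, for illustration).
set.seed(1)
n <- 500
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  children = sample(0:3, n, replace = TRUE)
)
sim$charges <- 250 * sim$age + 50 * sim$bmi +
  ifelse(sim$smoker == "yes", 1400 * sim$bmi - 20000, 0) +
  rnorm(n, sd = 3000)

# 80/20 hold-out split.
train_id <- sample(seq_len(n), size = floor(0.8 * n))
tr <- sim[train_id, ]
te <- sim[-train_id, ]

rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# Nested specifications: add the control and the interaction separately.
specs <- list(
  baseline         = charges ~ age + bmi + smoker,
  controls_only    = charges ~ age + bmi + smoker + children,
  interaction_only = charges ~ age + bmi + smoker + bmi:smoker
)

# Fit each specification on the training rows and score it on the test rows.
test_rmse <- sapply(specs, function(f) {
  rmse(te$charges, predict(lm(f, data = tr), te))
})
print(round(test_rmse))
```

Applied to the real training and test sets, the same pattern would show how much of the RMSE drop from Model 1 to Model 2 is due to the interaction alone.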

# Coefficients of the selected model, for interpretation
round(coef(reg_model2), 2)

    (Intercept)             age             bmi       smokeryes        children 
       -2289.88          262.29           23.45       -19999.23          503.40 
        sexmale regionnorthwest regionsoutheast regionsouthwest   bmi:smokeryes 
        -256.56         -622.43        -1041.36        -1316.08         1431.00

The coefficient on smokeryes is large and negative on its own, but it describes the smoker effect at a BMI of zero, which lies far outside the data, so it must be read together with the positive bmi:smokeryes interaction term. For non-smokers, each additional unit of BMI is associated with about 23 dollars more in charges; for smokers, the figure is about 23 + 1431 = 1454 dollars. The coefficient on age is positive, which means charges rise with age, as expected. These estimates are consistent with the hypothesis that smoking, especially combined with a high BMI, is the dominant driver of cost.
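The combined reading of the two smoker terms can be made concrete with a short calculation. The sketch below copies the three relevant coefficients from the output above and computes the smoker versus non-smoker gap in predicted charges at a few BMI values, holding the other variables fixed. The function name `smoker_gap` and the chosen BMI values are illustrative.

```r
# Coefficients copied from the Model 2 output above.
b_smoker     <- -19999.23
b_bmi        <- 23.45
b_bmi_smoker <- 1431.00

# Predicted smoker minus non-smoker difference in charges at a given BMI,
# holding age, children, sex and region fixed.
smoker_gap <- function(bmi) b_smoker + b_bmi_smoker * bmi

gap_at_25 <- smoker_gap(25)
gap_at_30 <- smoker_gap(30)
gap_at_35 <- smoker_gap(35)

# BMI at which the gap crosses zero; below this the model implies a negative gap.
break_even_bmi <- -b_smoker / b_bmi_smoker

# BMI slope for smokers = main effect + interaction.
bmi_slope_smoker <- b_bmi + b_bmi_smoker

print(c(gap_25 = gap_at_25, gap_30 = gap_at_30, gap_35 = gap_at_35,
        break_even = break_even_bmi, slope_smoker = bmi_slope_smoker))
```

The gap is roughly 22,900 dollars at a BMI of 30 and turns positive at a BMI of about 14, so the negative smokeryes coefficient does not imply that smokers are cheaper at any realistic BMI.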

2.7 Cross-Validation (Best Model)

A 5-fold cross-validation is run on the selected model (Model 2) using the training set only. The average performance across the five folds is reported and then compared with the single test-set result.

set.seed(465)
folds_reg <- sample(rep(1:5, length.out = nrow(train_reg)))

cv_rmse <- numeric(5)
cv_rsq  <- numeric(5)

for (k in 1:5) {
  tr <- train_reg[folds_reg != k, ]
  va <- train_reg[folds_reg == k, ]
  fit_k <- lm(charges ~ age + bmi + smoker + children + sex + region + bmi:smoker,
              data = tr)
  pred_k <- predict(fit_k, va)
  cv_rmse[k] <- rmse(va$charges, pred_k)
  cv_rsq[k]  <- rsq(va$charges, pred_k)
}

data.frame(
  Metric   = c("RMSE", "R2"),
  CV_mean  = c(mean(cv_rmse), mean(cv_rsq)),
  Test_set = c(rmse(test_reg$charges, pred_reg2), rsq(test_reg$charges, pred_reg2))
)

  Metric      CV_mean   Test_set
1   RMSE 4867.8281259 4757.98079
2     R2    0.8352955    0.84327

The average cross-validated RMSE (about 4868) is very close to the test-set RMSE (about 4758), and the average cross-validated R-squared (about 0.84) is almost the same as the test-set R-squared. Because the cross-validation and the test set give similar results, the model appears stable and is not strongly overfitted.
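The fold assignment used above, sample(rep(1:5, length.out = n)), gives folds that are as equal in size as possible and places every training row in exactly one validation fold. The sketch below checks this for a training set of 1070 rows; it only inspects the fold labels and does not refit any model.

```r
# Reproduce the fold labels for a training set of 1070 rows.
set.seed(465)
n_train <- 1070
folds <- sample(rep(1:5, length.out = n_train))

# Number of validation rows in each fold.
fold_sizes <- table(folds)
print(fold_sizes)

# Every row carries exactly one fold label, so it is validated exactly once.
rows_covered <- sum(fold_sizes)
```

With 1070 rows each fold holds 214 validation observations, so every fold model is fitted on 856 rows.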

2.8 Conclusion for Dataset 1

For the regression task, Model 2 is the better model. It produces a lower RMSE and a higher R-squared on the test set, the cross-validation indicates that this performance is stable, and the improvement is most plausibly explained by an economically meaningful interaction between BMI and smoking status. This supports the hypothesis that health risk, especially smoking, drives insurance charges.

3 Dataset 2: Classification (Loan Approval)

3.1 Research Narrative

Broader economic problem. Loan approval is a credit-allocation decision made under asymmetric information. The lender cannot directly observe whether a borrower will repay, so it must rely on observable signals. It needs to reduce the risk of default while avoiding the rejection of applicants who could actually repay. Predicting approval therefore connects to how financial institutions evaluate repayment capacity and past behavior when deciding who receives access to credit.

Research narrative. I predict whether a loan application is approved or not approved from credit history, applicant income, requested loan amount, education, property area, and marital status, because approval should depend on repayment capacity and on past repayment reliability. Credit history is the most direct signal of repayment behavior, income and loan amount capture the capacity to repay, and the remaining variables are background controls.

Hypothesis. I expect credit history to dominate the approval decision, because it directly summarizes past repayment reliability. I expect income and loan amount to matter for capacity, and I expect the richer model to improve prediction only if the additional financial and demographic variables add information beyond credit history.

3.2 Variable Roles

Role	Variable	Economic reason
Target (outcome)	`Loan_Status`	Binary indicator of whether credit was approved (Y / N)
Main predictor	`Credit_History`	Direct signal of past repayment reliability
Main predictor	`ApplicantIncome`	Higher income signals stronger repayment capacity
Main predictor	`LoanAmount`	Larger loans carry more repayment risk
Control	`Education`	Background characteristic that may shift approval
Control	`Property_Area`	Controls for urban / semiurban / rural differences
Control	`Married`	Controls for household structure

3.3 Data Import and Preparation

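The cleaning follows the same steps used in Stage 1. The ID column is removed, missing categorical values are labelled Unknown, and missing values in the numeric loan amount and loan term are filled with the median. Missing credit history is given its own Unknown level, so Credit_History becomes a three-level factor (0, 1, Unknown). The target variable is kept as a factor with two levels, N and Y.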

loan_raw <- read.csv(text = loan_csv, stringsAsFactors = FALSE, na.strings = c("", "NA"))
loan <- loan_raw[, names(loan_raw) != "Loan_ID"]

fix_category <- function(x) {
  x[is.na(x)] <- "Unknown"
  as.factor(x)
}

loan$Gender        <- fix_category(loan$Gender)
loan$Married       <- fix_category(loan$Married)
loan$Dependents    <- fix_category(loan$Dependents)
loan$Education     <- as.factor(loan$Education)
loan$Self_Employed <- fix_category(loan$Self_Employed)
loan$Property_Area <- as.factor(loan$Property_Area)

loan$LoanAmount[is.na(loan$LoanAmount)] <- median(loan$LoanAmount, na.rm = TRUE)
loan$Loan_Amount_Term[is.na(loan$Loan_Amount_Term)] <- median(loan$Loan_Amount_Term, na.rm = TRUE)

loan$Credit_History <- as.factor(ifelse(is.na(loan$Credit_History),
                                        "Unknown",
                                        as.character(loan$Credit_History)))
loan$Loan_Status <- as.factor(loan$Loan_Status)

data.frame(observations = nrow(loan), variables = ncol(loan))

  observations variables
1          614        12
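The effect of these cleaning rules is easiest to see on a few rows. The sketch below applies the same three rules to a small made-up data frame (the `toy` values are hypothetical and are not taken from the loan data).

```r
# Same helper as in the report: label missing categories, then make a factor.
fix_category <- function(x) {
  x[is.na(x)] <- "Unknown"
  as.factor(x)
}

# Made-up rows with one missing value per column.
toy <- data.frame(
  Married        = c("Yes", NA, "No", "Yes"),
  LoanAmount     = c(100, NA, 150, 200),
  Credit_History = c(1, 0, NA, 1),
  stringsAsFactors = FALSE
)

# Rule 1: missing category becomes its own level.
toy$Married <- fix_category(toy$Married)

# Rule 2: missing numeric value is replaced by the median of the observed values.
toy$LoanAmount[is.na(toy$LoanAmount)] <- median(toy$LoanAmount, na.rm = TRUE)

# Rule 3: numeric 0/1 credit history becomes a factor with an Unknown level.
toy$Credit_History <- as.factor(ifelse(is.na(toy$Credit_History),
                                       "Unknown",
                                       as.character(toy$Credit_History)))

print(toy)
print(levels(toy$Credit_History))
```

One design point to keep in mind: in the report the medians are computed on the full dataset before the split. Computing them on the training set only would be the stricter choice, although with a median the practical difference is likely to be small.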

3.4 Data Splitting

set.seed(465)
loan_split <- initial_split(loan, prop = 0.8)
train_cl <- training(loan_split)
test_cl  <- testing(loan_split)

data.frame(
  set  = c("Training", "Test", "Total"),
  size = c(nrow(train_cl), nrow(test_cl), nrow(loan))
)

       set size
1 Training  491
2     Test  123
3    Total  614

The training set contains 491 observations and the test set contains 123 observations.

3.5 Building Two Models

Model 1 (parsimonious). A logistic regression that uses only credit history, which the hypothesis identifies as the main driver of approval.

cl_model1 <- glm(Loan_Status ~ Credit_History,
                 data = train_cl, family = binomial)

Model 2 (richer model). A logistic regression that adds income, loan amount, education, property area, and marital status.

cl_model2 <- glm(Loan_Status ~ Credit_History + ApplicantIncome + LoanAmount +
                   Education + Property_Area + Married,
                 data = train_cl, family = binomial)

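Because Loan_Status is a factor with levels N and Y, glm() with family = binomial treats the first level (N) as failure and models the probability of the second level, Y (approved). Predictions on the test set use a 0.5 probability threshold on that probability to assign each application to approved (Y) or not approved (N).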

prob_cl1 <- predict(cl_model1, test_cl, type = "response")
prob_cl2 <- predict(cl_model2, test_cl, type = "response")

pred_cl1 <- ifelse(prob_cl1 >= 0.5, "Y", "N")
pred_cl2 <- ifelse(prob_cl2 >= 0.5, "Y", "N")

m1 <- class_metrics(test_cl$Loan_Status, pred_cl1)
m2 <- class_metrics(test_cl$Loan_Status, pred_cl2)

cl_metrics <- data.frame(
  Model     = c("Model 1: Credit_History only",
                "Model 2: + income, amount, education, area, married"),
  Accuracy  = c(m1["accuracy"],  m2["accuracy"]),
  Precision = c(m1["precision"], m2["precision"]),
  Recall    = c(m1["recall"],    m2["recall"])
)
knitr::kable(cl_metrics, digits = 4, row.names = FALSE,
             caption = "Test set performance for the two classification models")

Test set performance for the two classification models
Model	Accuracy	Precision	Recall
Model 1: Credit_History only	0.8374	0.8257	0.989
Model 2: + income, amount, education, area, married	0.8374	0.8257	0.989
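The thresholding step depends on which outcome level glm() treats as the event. The sketch below illustrates this on a small constructed dataset (the `toy` counts are made up): with a factor response whose levels are N and Y, the fitted probabilities are probabilities of Y, so comparing them with 0.5 and labelling the upper side "Y" is the right direction.

```r
# Constructed data: history "0" has 5 of 50 approved, history "1" has 40 of 50.
toy <- data.frame(
  history = factor(rep(c("0", "1"), each = 50)),
  status  = factor(c(rep(c("N", "Y"), times = c(45, 5)),
                     rep(c("N", "Y"), times = c(10, 40))))
)

# Factor levels are sorted alphabetically, so N is the first level.
print(levels(toy$status))

fit <- glm(status ~ history, data = toy, family = binomial)

# Fitted probabilities for each history group.
new_rows <- data.frame(history = factor(c("0", "1"), levels = levels(toy$history)))
prob_y <- predict(fit, new_rows, type = "response")

# The fitted values match the share of Y in each group (0.1 and 0.8),
# which shows that the model is estimating P(status = "Y").
pred <- ifelse(prob_y >= 0.5, "Y", "N")
print(data.frame(history = new_rows$history, prob_y = prob_y, pred = pred))
```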

3.6 Model Comparison and Selection

The two models produce the same accuracy, precision, and recall on the test set (about 0.84 accuracy, 0.83 precision, 0.99 recall). In fact, they assign exactly the same approval decision to every test application. The extra variables in Model 2 do not change a single prediction at the 0.5 threshold.

# Number of test applications where the two models disagree
sum(pred_cl1 != pred_cl2)

[1] 0

This result is consistent with the hypothesis and has a clear economic meaning. Credit history is such a strong signal of repayment behavior that it dominates the approval decision; once it is known, the other financial and demographic variables do not move any test application across the 0.5 threshold in this dataset. The two models may still assign different probabilities; only the resulting classifications are identical. Because Model 1 reaches the same performance with far fewer variables, it is the better practical choice: it is simpler, easier to explain to an applicant, and less likely to overfit. Model 1 is therefore selected as the better model.

The high recall (about 0.99) means the model approves almost every application that should be approved, while the lower precision (about 0.83) means some applications it approves were actually rejected in the data. For a lender this is an important trade-off: the model rarely turns away a good borrower, but it does accept some applicants who were declined in practice.
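The trade-off can be expressed in counts. The report does not print a confusion matrix, so the counts below are a reconstruction: they are the only whole numbers for 123 test applications that reproduce the reported accuracy, precision, and recall to four decimals. Treat them as inferred rather than as printed output.

```r
# Counts inferred from the reported test metrics (not printed by the report).
tp <- 90   # approved in the data, predicted approved
fp <- 19   # rejected in the data, predicted approved
fn <- 1    # approved in the data, predicted rejected
tn <- 13   # rejected in the data, predicted rejected
n  <- tp + fp + fn + tn

accuracy  <- (tp + tn) / n
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

# Share of applications rejected in the data that the model still approves.
false_approval_rate <- fp / (fp + tn)

# Accuracy of the trivial rule "approve everyone", for comparison.
approve_all_accuracy <- (tp + fn) / n

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              false_approval_rate = false_approval_rate,
              approve_all_accuracy = approve_all_accuracy), 4))
```

Under this reconstruction the model approves 19 of the 32 applications that were rejected in the data, and its accuracy of about 0.84 compares with about 0.74 for a rule that approves every application.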

# Coefficients of the selected model (log-odds scale)
round(coef(cl_model1), 4)

          (Intercept)       Credit_History1 Credit_HistoryUnknown 
              -2.4423                3.7541                3.3767

On the odds scale, having a positive credit history (Credit_History = 1) multiplies the odds of approval by exp(3.7541), or about 43, compared with a poor credit history (Credit_History = 0). Applicants with an unknown credit history are treated almost as favorably, with odds about 29 times those of the poor-history group. This quantifies how strongly credit history drives the approval decision, in line with the hypothesis.
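The coefficients can also be turned into predicted approval probabilities, which shows how Model 1 classifies each credit-history group. The sketch below copies the three coefficients from the output above; the object names are illustrative.

```r
# Coefficients copied from the Model 1 output above (log-odds scale).
b <- c("(Intercept)"           = -2.4423,
       "Credit_History1"       =  3.7541,
       "Credit_HistoryUnknown" =  3.3767)

# Odds ratios relative to the reference group (Credit_History = 0).
odds_ratio <- exp(b[-1])

# Log-odds of approval for each credit-history group.
log_odds <- c(history_0       = b[[1]],
              history_1       = b[[1]] + b[[2]],
              history_unknown = b[[1]] + b[[3]])

# plogis() is the inverse logit: it maps log-odds to probabilities.
prob_approve <- plogis(log_odds)

# Classification at the 0.5 threshold used in the report.
decision <- ifelse(prob_approve >= 0.5, "Y", "N")

print(round(odds_ratio, 1))
print(data.frame(prob_approve = round(prob_approve, 3), decision = decision))
```

The predicted approval probability is about 0.08 for a poor credit history, about 0.79 for a positive one, and about 0.72 for an unknown one, so Model 1 in effect approves every applicant except those with Credit_History = 0.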

3.7 Cross-Validation (Best Model)

A 5-fold cross-validation is run on the selected model (Model 1) using the training set only.

set.seed(465)
folds_cl <- sample(rep(1:5, length.out = nrow(train_cl)))

cv_acc <- numeric(5)

for (k in 1:5) {
  tr <- train_cl[folds_cl != k, ]
  va <- train_cl[folds_cl == k, ]
  fit_k <- glm(Loan_Status ~ Credit_History, data = tr, family = binomial)
  prob_k <- predict(fit_k, va, type = "response")
  pred_k <- ifelse(prob_k >= 0.5, "Y", "N")
  cv_acc[k] <- mean(pred_k == as.character(va$Loan_Status))
}

data.frame(
  Metric   = "Accuracy",
  CV_mean  = mean(cv_acc),
  Test_set = unname(m1["accuracy"])  # drop the name so it is not used as a row label
)

    Metric   CV_mean  Test_set
1 Accuracy 0.8025149 0.8373984

The average cross-validated accuracy (about 0.80) is about 3.5 percentage points below the test-set accuracy (about 0.84). With only 123 test observations, a gap of this size is within ordinary sampling variation, so it suggests that the model is stable and not seriously overfitted; the single test split simply happens to be slightly easier than the average fold.
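One rough way to gauge the gap between the cross-validated and test-set accuracy is to compare it with the binomial standard error of the test accuracy. The sketch below uses the two accuracies reported above; it treats the test observations as independent draws and ignores the uncertainty in the cross-validated mean, so it is an approximation rather than a formal test.

```r
# Accuracies copied from the output above.
acc_test <- 0.8373984
acc_cv   <- 0.8025149
n_test   <- 123

# Binomial standard error of an accuracy estimated on n_test observations.
se_test <- sqrt(acc_test * (1 - acc_test) / n_test)

# Gap between the two estimates, in absolute terms and in standard errors.
gap       <- acc_test - acc_cv
gap_in_se <- gap / se_test

print(round(c(se_test = se_test, gap = gap, gap_in_se = gap_in_se), 4))
```

The standard error is about 0.033, so the gap of about 0.035 is roughly one standard error.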

3.8 Conclusion for Dataset 2

For the classification task, Model 1 is the better model. It reaches the same test-set accuracy, precision, and recall as the richer model while using only credit history, and the cross-validation indicates that this performance is stable. The main economic finding is that credit history dominates the loan-approval decision, which supports the hypothesis stated in the research narrative.

4 AI Interaction Log

I did the dataset selection, the research questions, the variable choices, the model specifications, and all of the written interpretation myself. I used an AI tool only once, and only for a narrow technical problem.

Tool used: ChatGPT

My prompt: “When I render my Quarto file in RStudio I get a package installation error from tidymodels. Without that package, how can I do an 80/20 split and compute RMSE, R-squared, accuracy, precision and recall in base R?”

Relevant part of the AI response: It explained that a hold-out split can be done with sample(), that R-squared on a test set is 1 - sum((truth - pred)^2) / sum((truth - mean(truth))^2), and that accuracy, precision, and recall can be counted directly from the predicted and true labels.

How I used it: I did not paste the code in as it was. I wrote my own helper functions, kept my own variable names, and chose my own models and predictors. I used the response only to replace the package-dependent code with base R, so that my document would render without the installation error.

Reflection: The help was limited to a coding and rendering issue; the datasets, the economic reasoning, the model choices, and the conclusions are mine. I verified the result by checking that the train and test sample sizes were correct and that the cross-validated metrics were close to the test-set metrics. I learned that train-test splitting and cross-validation do not require a special package and can be written directly in base R, which also made my report more reproducible.

5 Overall Conclusion

For the regression dataset, my hypothesis was that health risk, especially smoking, drives medical insurance charges. The richer model with the BMI-smoker interaction (Model 2) is better: it lowers the test RMSE from about 6253 to about 4758 and raises the test R-squared from about 0.73 to about 0.84. The large estimated interaction coefficient suggests that the gain comes mainly from the interaction, as predicted, although the controls were added in the same step.

For the classification dataset, my hypothesis was that credit history dominates loan approval. The parsimonious model using only credit history (Model 1) is better: it matches the richer model exactly on the test set while using far fewer variables, which indicates that credit history is the decisive factor in the approval decision in this dataset.

In both cases the cross-validation results are close to the test-set results, which indicates that the selected models are stable and not strongly overfitted. Because each model was built from an explicit research narrative, the results can be read directly against the original economic questions rather than as isolated numbers.