library(tidyverse)
library(tidymodels)
library(WDI)
library(readr)
set.seed(465)ECON 465 Stage 2: Predictive Modeling
1. Introduction
This report presents the Stage 2 predictive modeling analysis. The report includes one regression dataset and one classification dataset.
For the regression task, we use World Bank WDI country-level data to predict life expectancy. Based on the feedback received in Stage 1, only 2023 data are used. This means each country appears once, so the dataset is treated as cross-sectional rather than panel data.
For the classification task, we use the Adult Census Income dataset to predict whether an individual earns more than $50,000 per year.
2. Regression Dataset: Life Expectancy
2.1 Data Preparation
df_reg_raw <- WDI(
country = "all",
indicator = c(
life_expectancy = "SP.DYN.LE00.IN",
gdp_per_capita = "NY.GDP.PCAP.CD",
urbanization_rate = "SP.URB.TOTL.IN.ZS",
fertility_rate = "SP.DYN.TFRT.IN"
),
start = 2023,
end = 2023,
extra = TRUE
)
df_reg <- df_reg_raw |>
filter(region != "Aggregates") |>
drop_na(life_expectancy, gdp_per_capita, urbanization_rate, fertility_rate) |>
select(
country, iso3c, year, region, income,
life_expectancy, gdp_per_capita,
urbanization_rate, fertility_rate
)
glimpse(df_reg)Rows: 203
Columns: 9
$ country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Ang…
$ iso3c <chr> "AFG", "ALB", "DZA", "AND", "AGO", "ATG", "ARG", "AR…
$ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
$ region <chr> "Middle East, North Africa, Afghanistan & Pakistan",…
$ income <chr> "Low income", "Upper middle income", "Upper middle i…
$ life_expectancy <dbl> 66.03500, 79.60200, 76.26100, 84.04100, 64.61700, 77…
$ gdp_per_capita <dbl> 413.7579, 9730.8692, 5370.4772, 46812.4261, 2916.136…
$ urbanization_rate <dbl> 25.47305, 58.21061, 74.73281, 88.82016, 69.85150, 24…
$ fertility_rate <dbl> 4.840, 1.348, 2.766, 1.082, 5.124, 1.578, 1.500, 1.9…
The regression dataset is restricted to 2023. Therefore, each country appears only once. This avoids the panel data problem where country-year observations are nested within countries.
2.2 Train/Test Split
set.seed(465)
reg_split <- initial_split(df_reg, prop = 0.8)
reg_train <- training(reg_split)
reg_test <- testing(reg_split)
reg_sample_sizes <- tibble(
dataset = c("Training set", "Test set"),
observations = c(nrow(reg_train), nrow(reg_test))
)
knitr::kable(reg_sample_sizes, caption = "Regression Dataset: Train/Test Split")| dataset | observations |
|---|---|
| Training set | 162 |
| Test set | 41 |
The training set is used to estimate the models. The test set is used to evaluate out-of-sample predictive performance.
2.3 Regression Model 1
reg_model_1 <- lm(
life_expectancy ~ gdp_per_capita + urbanization_rate,
data = reg_train
)
summary(reg_model_1)
Call:
lm(formula = life_expectancy ~ gdp_per_capita + urbanization_rate,
data = reg_train)
Residuals:
Min 1Q Median 3Q Max
-14.8482 -3.1913 0.7464 3.6907 10.6231
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.423e+01 1.144e+00 56.130 < 2e-16 ***
gdp_per_capita 9.770e-05 1.238e-05 7.893 4.52e-13 ***
urbanization_rate 1.148e-01 1.848e-02 6.211 4.41e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.125 on 159 degrees of freedom
Multiple R-squared: 0.4961, Adjusted R-squared: 0.4898
F-statistic: 78.28 on 2 and 159 DF, p-value: < 2.2e-16
Model 1 predicts life expectancy using GDP per capita and urbanization rate. Both coefficients are statistically significant at the 1% level. The positive coefficient on GDP per capita indicates that wealthier countries tend to have higher life expectancy, which is consistent with the Preston curve — a well-established relationship in development economics. The positive coefficient on urbanization rate suggests that countries with larger urban populations also tend to have higher life expectancy, likely reflecting better access to healthcare and sanitation in urban areas. However, with an R-squared of approximately 0.50, the model explains only around half of the variation in life expectancy, suggesting that important predictors are missing.
2.4 Regression Model 2
reg_model_2 <- lm(
life_expectancy ~ gdp_per_capita + urbanization_rate + fertility_rate,
data = reg_train
)
summary(reg_model_2)
Call:
lm(formula = life_expectancy ~ gdp_per_capita + urbanization_rate +
fertility_rate, data = reg_train)
Residuals:
Min 1Q Median 3Q Max
-13.0992 -2.3557 0.3195 2.4182 7.2663
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.906e+01 1.282e+00 61.692 < 2e-16 ***
gdp_per_capita 6.357e-05 8.524e-06 7.458 5.48e-12 ***
urbanization_rate 3.086e-02 1.355e-02 2.277 0.0241 *
fertility_rate -3.699e+00 2.579e-01 -14.340 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.389 on 158 degrees of freedom
Multiple R-squared: 0.7811, Adjusted R-squared: 0.7769
F-statistic: 187.9 on 3 and 158 DF, p-value: < 2.2e-16
Model 2 adds fertility rate as an additional predictor. The coefficient on fertility rate is negative and highly significant, meaning that countries with higher fertility rates tend to have lower life expectancy. This is consistent with demographic transition theory: countries in early stages of development typically have both high fertility and low life expectancy, while countries in later stages have lower fertility and higher life expectancy. Adding fertility rate substantially improves model fit, with R-squared rising to approximately 0.78. GDP per capita and urbanization rate remain statistically significant, though the magnitude of the urbanization coefficient decreases once fertility is controlled for, suggesting that part of the urbanization effect was previously capturing variation explained by fertility differences.
2.5 Regression Test Set Performance
reg_test_predictions <- reg_test |>
mutate(
pred_model_1 = predict(reg_model_1, newdata = reg_test),
pred_model_2 = predict(reg_model_2, newdata = reg_test)
)
reg_metrics_1 <- reg_test_predictions |>
metrics(truth = life_expectancy, estimate = pred_model_1) |>
filter(.metric %in% c("rmse", "rsq")) |>
mutate(model = "Model 1")
reg_metrics_2 <- reg_test_predictions |>
metrics(truth = life_expectancy, estimate = pred_model_2) |>
filter(.metric %in% c("rmse", "rsq")) |>
mutate(model = "Model 2")
reg_comparison <- bind_rows(reg_metrics_1, reg_metrics_2) |>
select(model, .metric, .estimate) |>
pivot_wider(names_from = .metric, values_from = .estimate)
knitr::kable(
reg_comparison,
digits = 3,
caption = "Regression Model Comparison: Test Set Performance"
)| model | rmse | rsq |
|---|---|---|
| Model 1 | 5.189 | 0.473 |
| Model 2 | 3.444 | 0.779 |
2.6 Regression Model Comparison
RMSE shows the average prediction error in years of life expectancy. A lower RMSE means better predictive performance. R-squared shows how much variation in life expectancy is explained by the model.
Model 2 performs better on both metrics. Its RMSE is lower than Model 1, meaning predictions are more accurate. R-squared also increased, meaning Model 2 explains a greater share of the variation in life expectancy across countries. Adding fertility rate as a predictor substantially improved predictive performance. This is consistent with demographic transition theory, which links higher fertility rates to lower life expectancy in developing countries.
2.7 Regression Visualization
ggplot(reg_test_predictions, aes(x = life_expectancy, y = pred_model_2)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(
title = "Actual vs Predicted Life Expectancy (Model 2)",
x = "Actual Life Expectancy",
y = "Predicted Life Expectancy"
) +
theme_minimal()This plot compares actual and predicted life expectancy values for the best regression model. Points closer to the diagonal line indicate better predictive performance. Most countries cluster tightly around the line, suggesting the model predicts well across a wide range of life expectancy values. However, a few countries with very low life expectancy appear below the line, meaning the model overestimates their life expectancy. These are likely low-income countries in Sub-Saharan Africa where factors such as conflict, disease burden, or weak health infrastructure reduce life expectancy beyond what GDP, urbanization, and fertility alone can capture.
2.8 Regression Cross-Validation
set.seed(465)
# Define model specification (as taught in Week 9)
reg_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")
# Create 5 folds from the full dataset
reg_folds <- vfold_cv(df_reg, v = 5)
# Run 5-fold cross-validation on the best model (Model 2)
reg_cv_results <- fit_resamples(
reg_spec,
life_expectancy ~ gdp_per_capita + urbanization_rate + fertility_rate,
resamples = reg_folds,
metrics = metric_set(rmse, rsq)
)
reg_cv_summary <- collect_metrics(reg_cv_results) |>
select(.metric, mean, std_err)
# Save CV values for inline reporting
reg_cv_rmse <- reg_cv_summary |> filter(.metric == "rmse") |> pull(mean)
reg_cv_rsq <- reg_cv_summary |> filter(.metric == "rsq") |> pull(mean)
# Save test set values for inline comparison
reg_test_rmse <- reg_metrics_2 |> filter(.metric == "rmse") |> pull(.estimate)
reg_test_rsq <- reg_metrics_2 |> filter(.metric == "rsq") |> pull(.estimate)
knitr::kable(
reg_cv_summary,
digits = 3,
caption = "Regression Model 2: 5-Fold Cross-Validation Results"
)| .metric | mean | std_err |
|---|---|---|
| rmse | 3.382 | 0.205 |
| rsq | 0.771 | 0.038 |
The cross-validation results show the average model performance across five folds. The cross-validated RMSE is 3.382 years and R-squared is 0.771, compared to the test set RMSE of 3.444 and R-squared of 0.779. The difference between the two is minimal, which suggests the model is stable and not overfitting. The model generalizes well to new data that it has not seen during training.
3. Classification Dataset: Adult Census Income
3.1 Data Preparation
adult_raw <- read_csv("adult.csv", show_col_types = FALSE)
adult_clean <- adult_raw |>
mutate(across(where(is.character), ~ str_trim(.x))) |>
mutate(across(where(is.character), ~ na_if(.x, "?"))) |>
rename(
education_num = education.num,
marital_status = marital.status,
capital_gain = capital.gain,
capital_loss = capital.loss,
hours_per_week = hours.per.week
) |>
select(
income, age, education_num, occupation,
hours_per_week, sex, marital_status,
capital_gain, capital_loss
) |>
mutate(
income = factor(income, levels = c("<=50K", ">50K")),
high_income = factor(
if_else(income == ">50K", "Yes", "No"),
levels = c("No", "Yes")
),
occupation = as.factor(occupation),
sex = as.factor(sex),
marital_status = as.factor(marital_status)
) |>
drop_na()
glimpse(adult_clean)Rows: 30,718
Columns: 10
$ income <fct> <=50K, <=50K, <=50K, <=50K, <=50K, >50K, <=50K, >50K, >…
$ age <dbl> 82, 54, 41, 34, 38, 74, 68, 41, 45, 38, 52, 32, 46, 45,…
$ education_num <dbl> 9, 4, 10, 9, 6, 16, 9, 10, 16, 15, 13, 14, 15, 7, 14, 1…
$ occupation <fct> Exec-managerial, Machine-op-inspct, Prof-specialty, Oth…
$ hours_per_week <dbl> 18, 40, 40, 45, 40, 20, 40, 60, 35, 45, 20, 55, 40, 76,…
$ sex <fct> Female, Female, Female, Female, Male, Female, Female, M…
$ marital_status <fct> Widowed, Divorced, Separated, Divorced, Separated, Neve…
$ capital_gain <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ capital_loss <dbl> 4356, 3900, 3900, 3770, 3770, 3683, 3683, 3004, 3004, 2…
$ high_income <fct> No, No, No, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Y…
The classification dataset contains individual-level observations.
3.2 Train/Test Split
set.seed(465)
class_split <- initial_split(adult_clean, prop = 0.8)
class_train <- training(class_split)
class_test <- testing(class_split)
class_sample_sizes <- tibble(
dataset = c("Training set", "Test set"),
observations = c(nrow(class_train), nrow(class_test))
)
knitr::kable(class_sample_sizes, caption = "Classification Dataset: Train/Test Split")| dataset | observations |
|---|---|
| Training set | 24574 |
| Test set | 6144 |
3.3 Logistic Regression Model 1
class_model_1 <- glm(
high_income ~ age + education_num + hours_per_week,
data = class_train,
family = binomial
)
summary(class_model_1)
Call:
glm(formula = high_income ~ age + education_num + hours_per_week,
family = binomial, data = class_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.418065 0.125777 -66.93 <2e-16 ***
age 0.048430 0.001361 35.57 <2e-16 ***
education_num 0.338738 0.007379 45.90 <2e-16 ***
hours_per_week 0.041494 0.001479 28.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27621 on 24573 degrees of freedom
Residual deviance: 22410 on 24570 degrees of freedom
AIC: 22418
Number of Fisher Scoring iterations: 5
Model 1 predicts high income using age, education level, and hours worked per week. All three coefficients are statistically significant at the 1% level. The positive coefficient on age indicates that older individuals are more likely to earn above $50,000, consistent with returns to experience in labor markets. The positive coefficient on education level reflects the well-documented education premium in earnings. The positive coefficient on hours per week suggests that individuals who work more hours are more likely to be high earners. However, with accuracy of 0.786 on the test set and a recall of only 0.334, the model correctly identifies only one-third of actual high-income individuals, indicating that important predictors are missing.
3.4 Logistic Regression Model 2
class_model_2 <- glm(
high_income ~ age + education_num + hours_per_week +
occupation + sex + capital_gain + capital_loss,
data = class_train,
family = binomial
)Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(class_model_2)
Call:
glm(formula = high_income ~ age + education_num + hours_per_week +
occupation + sex + capital_gain + capital_loss, family = binomial,
data = class_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.157e+00 1.575e-01 -51.809 < 2e-16 ***
age 4.213e-02 1.524e-03 27.642 < 2e-16 ***
education_num 2.532e-01 9.537e-03 26.548 < 2e-16 ***
hours_per_week 3.087e-02 1.666e-03 18.523 < 2e-16 ***
occupationArmed-Forces -7.927e-01 1.245e+00 -0.637 0.5241
occupationCraft-repair 7.080e-02 7.963e-02 0.889 0.3739
occupationExec-managerial 7.657e-01 7.532e-02 10.167 < 2e-16 ***
occupationFarming-fishing -1.285e+00 1.454e-01 -8.837 < 2e-16 ***
occupationHandlers-cleaners -8.632e-01 1.465e-01 -5.894 3.78e-09 ***
occupationMachine-op-inspct -2.172e-01 1.037e-01 -2.094 0.0362 *
occupationOther-service -1.043e+00 1.208e-01 -8.636 < 2e-16 ***
occupationPriv-house-serv -4.503e+00 1.815e+00 -2.481 0.0131 *
occupationProf-specialty 3.886e-01 7.801e-02 4.982 6.31e-07 ***
occupationProtective-serv 3.618e-01 1.235e-01 2.929 0.0034 **
occupationSales 1.911e-01 8.031e-02 2.379 0.0174 *
occupationTech-support 5.613e-01 1.104e-01 5.085 3.67e-07 ***
occupationTransport-moving -1.147e-01 1.020e-01 -1.124 0.2608
sexMale 1.271e+00 4.943e-02 25.707 < 2e-16 ***
capital_gain 3.122e-04 1.123e-05 27.810 < 2e-16 ***
capital_loss 6.457e-04 3.845e-05 16.795 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27621 on 24573 degrees of freedom
Residual deviance: 18980 on 24554 degrees of freedom
AIC: 19020
Number of Fisher Scoring iterations: 8
Model 2 extends Model 1 by adding occupation, sex, capital gain, and capital loss. Most occupation categories are statistically significant, with executive and managerial roles showing a strong positive association with high income, while farming, service, and handler occupations show negative associations. The coefficient on sex indicates that male individuals have substantially higher predicted probability of earning above $50,000, even after controlling for occupation, education, and hours worked, which reflects well-documented gender wage gaps in the labor economics literature. Capital gain and capital loss are both highly significant, capturing investment income that is strongly associated with high earners. These additional variables substantially improve model performance across all three metrics.
3.5 Classification Test Set Performance
class_probs_1 <- predict(class_model_1, newdata = class_test, type = "response")
class_probs_2 <- predict(class_model_2, newdata = class_test, type = "response")
class_test_predictions <- class_test |>
mutate(
pred_model_1 = factor(
ifelse(class_probs_1 > 0.5, "Yes", "No"),
levels = c("No", "Yes")
),
pred_model_2 = factor(
ifelse(class_probs_2 > 0.5, "Yes", "No"),
levels = c("No", "Yes")
)
)
class_metrics_1 <- tibble(
model = "Model 1",
accuracy = accuracy_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_1
),
precision = precision_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_1,
event_level = "second"
),
recall = recall_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_1,
event_level = "second"
)
)
class_metrics_2 <- tibble(
model = "Model 2",
accuracy = accuracy_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_2
),
precision = precision_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_2,
event_level = "second"
),
recall = recall_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_2,
event_level = "second"
)
)
class_comparison <- bind_rows(class_metrics_1, class_metrics_2)
knitr::kable(
class_comparison,
digits = 3,
caption = "Classification Model Comparison: Test Set Performance"
)| model | accuracy | precision | recall |
|---|---|---|---|
| Model 1 | 0.786 | 0.622 | 0.334 |
| Model 2 | 0.825 | 0.720 | 0.477 |
3.6 Classification Model Comparison
Accuracy measures overall prediction performance. Precision measures how often the model is correct when it predicts high income. Recall measures how many actual high-income individuals are identified correctly.
Model 2 performs better across all three metrics. Accuracy, precision, and recall all increased compared to Model 1. The improvement in recall is particularly important: Model 1 correctly identifies only around one-third of actual high-income individuals, while Model 2 identifies close to half. Adding occupation, sex, capital gain, and capital loss substantially improved the model’s ability to identify high-income individuals. This is consistent with labor economics theory, where occupation and gender are well-known predictors of earnings.
3.7 Classification Visualization
class_comparison |>
pivot_longer(
cols = c(accuracy, precision, recall),
names_to = "metric",
values_to = "value"
) |>
ggplot(aes(x = metric, y = value, fill = model)) +
geom_col(position = "dodge") +
labs(
title = "Classification Model Performance Comparison",
x = "Metric",
y = "Value",
fill = "Model"
) +
theme_minimal()This plot compares the classification performance metrics of both logistic regression models. Model 2 consistently outperforms Model 1 across all three metrics. The largest gap between the two models appears in recall, where Model 2 identifies substantially more true high-income individuals. Accuracy differs less dramatically because both models correctly classify the majority class (low income) well.
3.8 Classification Cross-Validation
set.seed(465)
# Define model specification (as taught in Week 9)
logistic_spec <- logistic_reg() |>
set_engine("glm") |>
set_mode("classification")
# Create 5 folds from the full dataset
class_folds <- vfold_cv(adult_clean, v = 5)
# Run 5-fold cross-validation on the best model (Model 2)
class_cv_results <- fit_resamples(
logistic_spec,
high_income ~ age + education_num + hours_per_week +
occupation + sex + capital_gain + capital_loss,
resamples = class_folds,
metrics = metric_set(accuracy)
)→ A | warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
There were issues with some computations A: x1
There were issues with some computations A: x5
class_cv_summary <- collect_metrics(class_cv_results) |>
select(.metric, mean, std_err)
# Save CV accuracy for inline reporting
class_cv_acc <- class_cv_summary |> filter(.metric == "accuracy") |> pull(mean)
# Save test set values for inline comparison
class_test_acc <- class_metrics_2 |> pull(accuracy)
class_test_prec <- class_metrics_2 |> pull(precision)
class_test_rec <- class_metrics_2 |> pull(recall)
knitr::kable(
class_cv_summary,
digits = 3,
caption = "Classification Model 2: 5-Fold Cross-Validation Results"
)| .metric | mean | std_err |
|---|---|---|
| accuracy | 0.825 | 0.002 |
Since the test-set evaluation already reports accuracy, precision, and recall, cross-validation is used here mainly to check the stability of the best model’s overall predictive performance. The cross-validation results show the average classification performance across five folds. The cross-validated accuracy is 0.825, compared to the test set accuracy of 0.825. The two values are almost identical, which confirms that the model is stable and generalizes well to unseen data. There is no sign of overfitting. The small standard error across folds indicates that model performance is consistent regardless of which subset of observations is used for validation. The test set precision of 0.72 and recall of 0.477 further confirm the model’s ability to correctly identify high-income individuals.
4. AI Interaction Log
Tool used: ChatGPT
Prompt:
“In R, I am using tidymodels and I want to calculate precision and recall for a logistic regression model where the positive class is the second factor level. My outcome variable is a factor with levels No and Yes. When I use precision_vec() and recall_vec(), I get results for the wrong class. How do I fix this?”
AI Response (relevant excerpt):
The AI explained that by default, tidymodels treats the first factor level as the positive class. To calculate precision and recall for the second level (Yes), the argument event_level = "second" should be passed to precision_vec() and recall_vec().
How we used it:
We had already written the model building and prediction code ourselves. The issue was specifically that our precision and recall results looked unexpectedly low and we could not identify the cause. We used the AI explanation to understand the event_level argument and added it to our metric calculations. The rest of the code was written and adapted by us.
Reflection:
The AI was helpful for resolving a specific technical issue that was not covered in the course labs. We verified the fix by checking that the resulting precision and recall values were consistent with the confusion matrix. The main lesson was that factor level ordering in R affects how classification metrics are computed, and it is important to specify the positive class explicitly.
5. Conclusion
For the regression task, Model 2 performs better on both RMSE and R squared compared to Model 1. Adding fertility rate as a predictor substantially improved predictive accuracy. The cross validation results confirm model stability, with cross validated performance closely matching the test set results.
For the classification task, Model 2 performs better across all three metrics: accuracy, precision, and recall. Adding occupation, sex, and capital variables improved the model’s ability to identify high-income individuals. The cross validation results are consistent with the test set results, confirming that the model generalizes well to new data with no sign of overfitting.
A key limitation is that the WDI regression analysis uses only 2023 cross sectional country level data in order to avoid panel data dependence problems. Therefore, the results reflect relationships observed in 2023 only, and temporal trends across years are not captured.