ECON 465 – Data Science Project: From Data to Economic Insight

Author

Gül Ertan Özgüzer

Published

May 7, 2025

Project Overview

This is a 3-stage, 4-week project that guides you through the complete data science workflow. You will find two different datasets and answer two different economic questions:

  • Dataset 1 (Regression): Predict a continuous outcome (e.g., house price, GDP, life expectancy)
  • Dataset 2 (Classification): Predict a binary outcome (e.g., default/no default, high/low unemployment)

For each dataset, you will build at least two predictive models, compare them, and evaluate performance. The project is designed to be completed iteratively, with feedback at each stage.

Timeline: 4 weeks (Weeks 11–14)
Weight: 35% of final grade
Group size: Max 2 students (or individual upon request)


Stage 1: Data Acquisition & Probability Foundations (Week 11)

Deliverable: Data Proposal & Probability Analysis Report (25 points)

Due: 10th of May at midnight

Tasks

1.1 Find Two Datasets (8 points)

Find two real-world economic datasets:

Dataset | Outcome Type | Example Target Variable | Example Sources
Dataset 1 (Regression) | Continuous (numeric) | House price, GDP per capita, life expectancy, inflation rate | WDI, FRED, OECD, TURKSTAT, Kaggle
Dataset 2 (Classification) | Binary (Yes/No or 0/1) | Default status, high/low growth, recession indicator, firm bankruptcy | WDI, FRED, Kaggle, public company data

Requirements for each dataset:

  • At least 500 observations and 5 variables (including target variable)
  • You may not use: Gapminder, iris, mtcars, Default, or any dataset used in class labs

Submission for each dataset:

  • Dataset source URL or brief description of how you obtained it
  • Explanation of why this dataset is relevant to an economic question

1.2 Formulate Two Economic Questions (5 points)

Write one clear, focused economic question for each dataset.

Examples for regression (continuous outcome):

  • “What factors predict house prices in İzmir?”
  • “Which economic indicators best predict a country’s GDP growth rate?”

Examples for classification (binary outcome):

  • “Can we predict whether a borrower will default on a loan?”
  • “Can we classify Turkish provinces as high‑unemployment vs. low‑unemployment?”

Submission: State each question in one sentence.

1.3 Data Import & Cleaning (6 points)

For each dataset:

  • Import the data using appropriate functions (read_csv, read_excel, WDI, fredr, etc.)
  • Handle missing values, rename variables, ensure correct data types
  • Create a tidy dataset (each variable is a column, each observation a row)
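The import and cleaning steps might look roughly like the sketch below. It is only an illustration: the inline CSV text, the column names, and the "not recorded" missing-value code are made up, and reading literal text with I() assumes readr 2.0 or later. With a real file you would pass a relative path such as data/your_dataset.csv instead.

```r
library(readr)
library(dplyr)
library(tidyr)

# Hypothetical raw data standing in for a downloaded CSV file (not real data)
raw_csv <- "Sale Price,Size m2,District,Built
2500000,120,Bornova,2005
NA,85,Karsiyaka,1998
3100000,,Konak,2015
1800000,70,Buca,not recorded
"

houses <- read_csv(
  I(raw_csv),
  na = c("", "NA", "not recorded"),  # treat these strings as missing
  show_col_types = FALSE
) |>
  # Rename to short, syntactic variable names
  rename(
    price = `Sale Price`,
    size_m2 = `Size m2`,
    district = District,
    year_built = Built
  ) |>
  # Ensure correct data types
  mutate(
    district = factor(district),
    year_built = as.integer(year_built)
  ) |>
  # Drop rows where the target variable is missing
  drop_na(price)

print(houses)
```

Whether to drop, impute, or keep missing values in the predictors is a judgment call that depends on your data; the sketch only drops rows with a missing target.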

1.4 Probability Distribution Analysis (6 points)

For each dataset, select the target variable (outcome variable):

  • Compute summary statistics (mean, median, standard deviation, quartiles)
  • Create a histogram of the variable. Is it normally distributed or skewed?
  • If skewed, apply a log transformation and create a new histogram. How does the shape change?
  • Based on the shape, propose which theoretical distribution (normal, log‑normal, exponential) might approximate your data.
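As a rough illustration of these steps, the sketch below uses simulated right-skewed values in place of a real target variable. The variable name price and the simulation parameters are assumptions chosen only so the example runs on its own.

```r
set.seed(465)

# Simulated, right-skewed stand-in for a target such as house price (not real data)
price <- rlnorm(1000, meanlog = 13, sdlog = 0.6)

# Summary statistics: mean, median, standard deviation, quartiles
stats <- c(
  mean = mean(price),
  median = median(price),
  sd = sd(price),
  quantile(price, probs = c(0.25, 0.75))
)
print(round(stats))

# Histogram on the original scale
hist(price, breaks = 40, main = "Original scale", xlab = "Price")

# Log transformation and a second histogram
log_price <- log(price)
hist(log_price, breaks = 40, main = "Log scale", xlab = "log(Price)")

# Rough skew check: gap between mean and median, in standard deviation units
gap_raw <- (mean(price) - median(price)) / sd(price)
gap_log <- (mean(log_price) - median(log_price)) / sd(log_price)
print(c(gap_raw = gap_raw, gap_log = gap_log))
```

A mean well above the median on the original scale, together with a roughly symmetric histogram after the log transformation, would be consistent with a log-normal approximation. It does not prove it, so the histograms should still be inspected.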

What to submit (as a Quarto document):

For each dataset:

  1. Dataset description and source
  2. Economic question
  3. Code for importing and cleaning (commented)
  4. Summary statistics table
  5. Two histograms (original and log‑transformed) with interpretation
  6. Proposal of theoretical distribution

Stage 2: Predictive Modeling (Weeks 12–13)

Deliverable: Model Report (35 points)

Due: 24th of May at midnight

Tasks

2.1 Data Splitting (5 points)

For each dataset:

  • Split your data into training (80%) and test (20%) sets using initial_split()
  • Set a seed (set.seed(465)) for reproducibility
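A minimal sketch of the split, assuming the rsample package (part of tidymodels) is installed. The data frame df and its columns are simulated placeholders, not a suggested dataset.

```r
library(rsample)

set.seed(465)  # fixes both the simulated data and the random split

# Simulated placeholder data with 500 observations
df <- data.frame(x1 = rnorm(500), x2 = runif(500))
df$y <- 2 + 1.5 * df$x1 - 0.8 * df$x2 + rnorm(500)

# 80% training, 20% test
split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)

print(c(train = nrow(train), test = nrow(test)))
```

For the classification dataset, initial_split() also accepts a strata argument (for example strata = default, where default is a hypothetical outcome column), which keeps the class proportions similar in both sets.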

2.2 Build at Least Two Predictive Models per Dataset (12 points)

For the regression dataset (continuous outcome): Build and compare at least two of:

  • Linear regression
  • Decision tree
  • Random forest (encouraged)
  • Other regression method (e.g., k‑NN regression)

For the classification dataset (binary outcome): Build and compare at least two of:

  • Logistic regression
  • Decision tree
  • k‑Nearest Neighbors
  • Random forest (encouraged)

For each model:

  • Train the model on the training set
  • Make predictions on the test set
  • Compute appropriate evaluation metrics:
    • Regression: RMSE, R²
    • Classification: Accuracy, confusion matrix, precision, recall
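For the regression case, the train, predict, and evaluate loop could be sketched as follows. The data are simulated, the helper functions rmse() and r_squared() are defined here for illustration, and the split uses base R sample() only to keep the example dependency-light; initial_split() would work equally well. R² is computed as 1 minus SSE/SST on the test set, which can differ from the squared correlation that some packages report.

```r
library(rpart)  # decision trees; ships with standard R installations

set.seed(465)
n <- 500

# Simulated placeholder data (not real data)
df <- data.frame(x1 = rnorm(n), x2 = runif(n))
df$y <- 2 + 1.5 * df$x1 - 0.8 * df$x2 + rnorm(n)

# 80/20 split
train_idx <- sample(n, size = 0.8 * n)
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Train two models on the training set
lm_fit <- lm(y ~ x1 + x2, data = train)
tree_fit <- rpart(y ~ x1 + x2, data = train, method = "anova")

# Illustrative metric helpers
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Predict on the test set and collect the metrics in one comparison table
preds <- list(
  linear = predict(lm_fit, newdata = test),
  tree = predict(tree_fit, newdata = test)
)
results <- data.frame(
  model = names(preds),
  rmse = sapply(preds, rmse, actual = test$y),
  r2 = sapply(preds, r_squared, actual = test$y)
)
print(results)
```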

2.3 Probability Predictions (for Classification Only) (6 points)

For your classification dataset (if using logistic regression or random forest):

  • Output predicted probabilities (not just class labels)
  • Create a histogram of predicted probabilities
  • Choose a probability threshold (e.g., 0.5) and justify your choice
  • Discuss what would happen if you lowered or raised the threshold
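The sketch below illustrates predicted probabilities and the effect of moving the threshold, using logistic regression on simulated data. The variable names (income, debt_ratio, default) and the helper classify_at() are hypothetical and exist only for this example.

```r
set.seed(465)
n <- 1000

# Simulated placeholder data: default is 1 for "default", 0 otherwise
df <- data.frame(income = rnorm(n), debt_ratio = rnorm(n))
p_true <- plogis(-1 - 1.2 * df$income + 1.0 * df$debt_ratio)
df$default <- rbinom(n, size = 1, prob = p_true)

train_idx <- sample(n, size = 0.8 * n)
train <- df[train_idx, ]
test <- df[-train_idx, ]

logit_fit <- glm(default ~ income + debt_ratio, data = train, family = binomial)

# Predicted probabilities, not class labels
prob <- predict(logit_fit, newdata = test, type = "response")
hist(prob, breaks = 20, main = "Predicted default probability", xlab = "Probability")

# Confusion matrix at a threshold of 0.5
print(table(predicted = as.integer(prob >= 0.5), actual = test$default))

# Accuracy, precision, and recall at a given threshold
classify_at <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & test$default == 1)
  fp <- sum(pred == 1 & test$default == 0)
  fn <- sum(pred == 0 & test$default == 1)
  tn <- sum(pred == 0 & test$default == 0)
  c(
    threshold = threshold,
    accuracy = (tp + tn) / length(pred),
    precision = tp / (tp + fp),
    recall = tp / (tp + fn)
  )
}

threshold_table <- t(sapply(c(0.3, 0.5, 0.7), classify_at))
print(round(threshold_table, 3))
```

Lowering the threshold flags more observations as positive, so recall cannot fall, while precision typically drops. Which trade-off is acceptable depends on the relative cost of false positives and false negatives in your economic question.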

2.4 Model Comparison & Selection (6 points)

For each dataset:

  • Compare model performance using test set metrics
  • Identify which model performs better and explain why
  • Discuss the bias‑variance tradeoff for your models

2.5 Cross‑Validation (6 points)

For each dataset, perform k‑fold cross‑validation (5‑fold or 10‑fold) on your best model:

  • Report the average cross‑validated performance
  • Compare the cross‑validated performance with the test set performance
  • Discuss any differences (e.g., overfitting, data leakage)
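One possible way to run 5-fold cross-validation is sketched below, assuming rsample is installed and using linear regression on simulated data as a stand-in for your best model. Cross-validation is run on the training set only, so the test set stays untouched for the final comparison.

```r
library(rsample)

set.seed(465)

# Simulated placeholder data (not real data)
df <- data.frame(x1 = rnorm(500), x2 = runif(500))
df$y <- 2 + 1.5 * df$x1 - 0.8 * df$x2 + rnorm(500)

split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 5-fold cross-validation on the training data
folds <- vfold_cv(train, v = 5)
cv_rmse <- sapply(folds$splits, function(s) {
  fit <- lm(y ~ x1 + x2, data = analysis(s))  # fit on 4 folds
  held_out <- assessment(s)                   # evaluate on the remaining fold
  rmse(held_out$y, predict(fit, newdata = held_out))
})

# Refit on the full training set and evaluate once on the test set
final_fit <- lm(y ~ x1 + x2, data = train)
test_rmse <- rmse(test$y, predict(final_fit, newdata = test))

print(c(cv_mean = mean(cv_rmse), cv_sd = sd(cv_rmse), test = test_rmse))
```

If the test RMSE is far worse than the cross-validated average, that can point to overfitting or an unlucky split; if it is far better, it is worth checking for leakage between training and test data.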

What to submit (as a Quarto document):

For each dataset:

  1. Train/test split description
  2. Model specifications and training code for each model
  3. Prediction results and evaluation metrics
  4. Model comparison table
  5. Cross‑validation results and interpretation
  6. For classification: predicted probability histogram and threshold justification

Stage 3: Final Analysis & Presentation (Weeks 13–14)

Deliverable: Final Report (25 points) + Presentation (15 points)

Final Report Due: 3rd of June at midnight
Presentations: 4th of June in class

Tasks

3.1 Complete Analysis Pipeline (6 points)

  • Combine all work from Stages 1–2 into a single, reproducible Quarto document
  • Organize clearly with sections for Dataset 1 (Regression) and Dataset 2 (Classification)
  • Ensure all code runs without errors (render the document to verify, e.g. with quarto render in the terminal or quarto::quarto_render() in R)
  • Use a clear narrative: Question → Data → Probability Analysis → Modeling → Results → Conclusion (for each dataset)

3.2 Economic Interpretation (6 points)

For each dataset:

  • Answer your original economic question based on your model results
  • Interpret the coefficients (if using linear/logistic regression) or feature importance (if using tree‑based methods)
  • Discuss how your findings could inform economic policy, business decisions, or future research
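As a hedged illustration of coefficient interpretation, the sketch below fits one model with a log-transformed outcome and one logistic regression on simulated data. All variable names (educ, log_wage, debt_ratio, default) and effect sizes are invented for the example.

```r
set.seed(465)
n <- 800

# Linear regression with a log-transformed outcome (simulated data)
educ <- rnorm(n, mean = 12, sd = 3)  # hypothetical years of schooling
log_wage <- 1 + 0.08 * educ + rnorm(n, sd = 0.4)
wage_fit <- lm(log_wage ~ educ)

# With log(y) as the outcome, a coefficient b implies roughly a
# 100 * (exp(b) - 1) percent change in y per one-unit increase in x
b <- coef(wage_fit)["educ"]
pct_effect <- 100 * (exp(b) - 1)

# Logistic regression (simulated data)
debt_ratio <- rnorm(n)
default <- rbinom(n, size = 1, prob = plogis(-1 + 0.9 * debt_ratio))
logit_fit <- glm(default ~ debt_ratio, family = binomial)

# exp(coefficient) is the odds ratio: the multiplicative change in the odds
# of default per one-unit increase in debt_ratio
odds_ratio <- exp(coef(logit_fit)["debt_ratio"])

print(c(pct_effect = unname(pct_effect), odds_ratio = unname(odds_ratio)))
```

These are associations in a predictive model; whether they support a causal or policy conclusion depends on how the data were generated and should be discussed alongside the estimates.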

3.3 Limitations & Replication (4 points)

  • Describe at least two limitations of your overall analysis (they may apply to one or both datasets)
  • Explain what steps make your analysis reproducible (e.g., relative paths, set.seed(), documented environment)
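The reproducibility steps named above could be sketched as follows. The file name is the placeholder used elsewhere in this handout, and the commented-out import line is only an example of how the relative path would be used.

```r
# Same seed, same random draws
set.seed(465)
draw_1 <- runif(3)
set.seed(465)
draw_2 <- runif(3)
print(identical(draw_1, draw_2))

# Relative path, resolved from the project folder (not C:/Users/... or /home/...)
data_path <- file.path("data", "your_dataset.csv")
print(data_path)
# df <- readr::read_csv(data_path)  # uncomment once the file exists

# Document the environment: R version and loaded package versions
print(sessionInfo())
```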

3.4 AI Use Log (4 points)

  • Document any AI tools used (ChatGPT, Copilot, etc.) in an AI Use Log
  • For each AI interaction, state: the prompt given, how you used the output, and how you verified or modified it

3.5 Final Suggestions (5 points)

  • Suggest one improvement you would make if you had more time or better data
  • Pose one new economic question inspired by your analysis that you would like to investigate further

3.6 Presentation (15 points – Week 14)

  • 10‑minute presentation summarizing your project (8 minutes speaking, 2 minutes Q&A)
  • Include for both datasets: economic question, dataset description, probability analysis highlights, model comparison (which model won and why), main findings, limitations
  • Present in class (or submit a recorded video if remote)

What to submit (Week 13):

  • A single, self‑contained Quarto document (.qmd) with all code and narrative for both datasets
  • The rendered output (.html or .pdf)
  • RPubs link (if published)

What to prepare for Week 14:

  • Presentation slides (Quarto revealjs, PowerPoint, or PDF)
  • A brief summary (2‑3 slides) highlighting your main economic insights for both datasets

Grading Summary

Stage | Component | Points
Stage 1 | Data Proposal & Probability Analysis (both datasets) | 25
Stage 2 | Model Report (both datasets) | 35
Stage 3 | Final Report (both datasets) | 25
Stage 3 | Presentation | 15
Total | | 100

Important Notes

  • Reproducibility is mandatory. All file paths must be relative (e.g., data/your_dataset.csv). Set a seed (set.seed(465)) for all random processes.
  • No reused class datasets. Find your own data. If you are unsure about a dataset, ask the instructor.
  • AI assistance is allowed but must be documented. Include an AI Use Log in your final report.
  • Late submission: 10% deduction per day.

Example Project Ideas

Dataset Type | Economic Question | Dataset Source | Target Variable | Models
Regression | What predicts housing prices in İzmir? | Real estate websites | Price (TL) | Linear regression, Random forest
Classification | Can we predict loan default? | Kaggle credit data | Default (Yes/No) | Logistic regression, Decision tree
Regression | What affects life expectancy? | World Bank | Life expectancy | Linear regression, Decision tree
Classification | Can we classify countries by recession risk? | FRED / WDI | Recession (Yes/No) | Logistic regression, Random forest
Regression | What drives GDP per capita growth? | World Bank | GDP growth (%) | Linear regression, Random forest
Classification | Which firms are likely to go bankrupt? | Borsa İstanbul | Bankruptcy (Yes/No) | Logistic regression, k‑NN

Weekly Timeline Summary

Week | Stage | Focus
Week 11 | Stage 1 | Find two datasets, import, clean, probability distributions (due 10 May)
Week 12–13 | Stage 2 | Predictive modeling for both datasets (due 24 May)
Week 13 | Stage 3 | Final report combining both datasets (due 3 June)
Week 14 | Stage 3 | Presentations (in class 4 June)