ECON 465 – Data Science Project: From Data to Economic Insight

Author

Gül Ertan Özgüzer

Published

May 7, 2025

Project Overview

This is a 3-stage, 4-week project that guides you through the complete data science workflow. You will find two different datasets and answer two different economic questions:

Dataset 1 (Regression): Predict a continuous outcome (e.g., house price, GDP, life expectancy)
Dataset 2 (Classification): Predict a binary outcome (e.g., default/no default, high/low unemployment)

For each dataset, you will build at least two predictive models, compare them, and evaluate performance. The project is designed to be completed iteratively, with feedback at each stage.

Timeline: 4 weeks (Weeks 11–14)
Weight: 35% of final grade
Group size: Max 2 students (or individual upon request)

Stage 1: Data Acquisition & Probability Foundations (Week 11)

Deliverable: Data Proposal & Probability Analysis Report (25 points)

Due: 10th of May at midnight

Tasks

1.1 Find Two Datasets (8 points)

Find two real-world economic datasets:

Dataset	Outcome Type	Example Target Variable	Example Sources
Dataset 1 (Regression)	Continuous (numeric)	House price, GDP per capita, life expectancy, inflation rate	WDI, FRED, OECD, TURKSTAT, Kaggle
Dataset 2 (Classification)	Binary (Yes/No or 0/1)	Default status, high/low growth, recession indicator, firm bankruptcy	WDI, FRED, Kaggle, public company data

Requirements for each dataset:

At least 500 observations and 5 variables (including target variable)
You may not use: Gapminder, iris, mtcars, Default, or any dataset used in class labs

Submission for each dataset:

Dataset source URL or brief description of how you obtained it
Explanation of why this dataset is relevant to an economic question

1.2 Formulate Two Economic Questions (5 points)

Write one clear, focused economic question for each dataset.

Examples for regression (continuous outcome):

“What factors predict house prices in İzmir?”
“Which economic indicators best predict a country’s GDP growth rate?”

Examples for classification (binary outcome):

“Can we predict whether a borrower will default on a loan?”
“Can we classify Turkish provinces as high‑unemployment vs. low‑unemployment?”

Submission: State each question in one sentence.

1.3 Data Import & Cleaning (6 points)

For each dataset:

Import the data using appropriate functions (read_csv, read_excel, WDI, fredr, etc.)
Handle missing values, rename variables, and ensure correct data types
Create a tidy dataset (each variable is a column, each observation is a row)
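The import and cleaning steps above can be sketched as follows. This is only an illustration: the file contents, the column names, and the use of ".." as a missing-value marker are all assumptions made up for the example, and the file is written to a temporary path so the code runs anywhere. In a real project the path would be relative, such as data/your_dataset.csv.

```r
library(readr)
library(dplyr)

# Hypothetical raw file, created here only so the sketch is self-contained.
raw_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Province Name,Unemployment Rate (%),GDP per capita,Year",
  "Izmir,12.5,15000,2022",
  "Ankara,NA,17000,2022",
  "Bursa,9.8,..,2022",
  "Antalya,11.2,14000,2022"
), raw_path)

# Import: treat "", "NA" and ".." as missing (".." is an assumed placeholder).
raw <- read_csv(raw_path, na = c("", "NA", ".."), show_col_types = FALSE)

clean <- raw |>
  # Rename to short, consistent variable names
  rename(
    province     = `Province Name`,
    unemployment = `Unemployment Rate (%)`,
    gdp_pc       = `GDP per capita`,
    year         = Year
  ) |>
  # Ensure correct data types
  mutate(year = as.integer(year)) |>
  # Drop rows with missing values (imputation is an alternative, if justified)
  filter(!is.na(unemployment), !is.na(gdp_pc))

# One row per observation, one column per variable
glimpse(clean)
```

Dropping incomplete rows is the simplest choice, but it can discard a lot of data; whichever approach you take, state it and explain why in your report.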

1.4 Probability Distribution Analysis (6 points)

For each dataset, select the target variable (outcome variable):

Compute summary statistics (mean, median, standard deviation, quartiles)
Create a histogram of the variable. Is it normally distributed or skewed?
If skewed, apply a log transformation and create a new histogram. How does the shape change?
Based on the shape, propose which theoretical distribution (normal, log‑normal, exponential) might approximate your data.
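As a rough illustration of these steps, the sketch below uses simulated data in place of a real target variable. The variable name price and the log-normal parameters are assumptions chosen only to produce a right-skewed example, and the skewness() helper is a hypothetical convenience function, not part of base R.

```r
set.seed(465)

# Simulated stand-in for a right-skewed target (e.g., house prices)
price <- rlnorm(1000, meanlog = 13, sdlog = 0.6)

# Summary statistics: mean, median, standard deviation, quartiles
stats <- c(
  mean   = mean(price),
  median = median(price),
  sd     = sd(price),
  quantile(price, probs = c(0.25, 0.75))
)
print(round(stats, 1))

# Histogram of the original variable
hist(price, breaks = 40, main = "Original", xlab = "Price")

# Log transformation (requires strictly positive values)
log_price <- log(price)
hist(log_price, breaks = 40, main = "Log-transformed", xlab = "log(Price)")

# Hypothetical helper: a simple sample skewness measure
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
print(c(original = skewness(price), logged = skewness(log_price)))
```

A mean well above the median and a long right tail point to right skew. If the log-transformed histogram looks roughly symmetric and bell-shaped, a log-normal distribution is a reasonable candidate for the original variable.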

What to submit (as a Quarto document):

For each dataset:

Dataset description and source
Economic question
Code for importing and cleaning (commented)
Summary statistics table
Two histograms (original and log‑transformed) with interpretation
Proposal of theoretical distribution

Stage 2: Predictive Modeling (Weeks 12–13)

Deliverable: Model Report (35 points)

Due: 24th of May at midnight

Tasks

2.1 Data Splitting (5 points)

For each dataset:

Split your data into training (80%) and test (20%) sets using initial_split()
Set a seed (set.seed(465)) for reproducibility
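A minimal sketch of the split, assuming the rsample package (part of tidymodels) is installed. The data frame df is a made-up placeholder for your cleaned dataset.

```r
library(rsample)

# Set the seed before any random step so the split is reproducible
set.seed(465)

# Hypothetical data frame standing in for a cleaned dataset
df <- data.frame(y = rnorm(500), x1 = rnorm(500), x2 = runif(500))

# 80% training, 20% test
split <- initial_split(df, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Check the sizes of the two sets
print(c(train = nrow(train), test = nrow(test)))
```

For the classification dataset, initial_split() also accepts a strata argument, which keeps the share of each class similar in the training and test sets.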

2.2 Build at Least Two Predictive Models per Dataset (12 points)

For the regression dataset (continuous outcome): Build and compare at least two of:

Linear regression
Decision tree
Random forest (encouraged)
Other regression method (e.g., k‑NN regression)

For the classification dataset (binary outcome): Build and compare at least two of:

Logistic regression
Decision tree
k‑Nearest Neighbors
Random forest (encouraged)

For each model:

Train the model on the training set
Make predictions on the test set
Compute appropriate evaluation metrics:
- Regression: RMSE, R²
- Classification: Accuracy, confusion matrix, precision, recall
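The metrics listed above can be computed by hand in base R, as in the sketch below. The regression data are simulated, and the classification labels are a small made-up vector with "yes" as the positive class. R² is computed as one minus the ratio of residual to total sum of squares on the test set, which is one common definition (some packages report the squared correlation instead).

```r
set.seed(465)

# --- Regression: RMSE and R-squared on a held-out test set ---
n <- 500
reg <- data.frame(x = rnorm(n))
reg$y <- 2 + 3 * reg$x + rnorm(n)            # hypothetical relationship

idx    <- sample(n, size = round(0.8 * n))   # training row indices
fit    <- lm(y ~ x, data = reg[idx, ])
pred   <- predict(fit, newdata = reg[-idx, ])
actual <- reg$y[-idx]

rmse <- sqrt(mean((actual - pred)^2))
r2   <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
print(c(RMSE = rmse, R2 = r2))

# --- Classification: confusion matrix, accuracy, precision, recall ---
# Made-up test-set labels and predictions
lv <- c("no", "yes")
truth <- factor(c("yes", "yes", "yes", "no", "no",
                  "no", "no", "yes", "no", "no"), levels = lv)
predicted <- factor(c("yes", "no", "yes", "no", "no",
                      "yes", "no", "yes", "no", "no"), levels = lv)

cm <- table(Predicted = predicted, Actual = truth)
print(cm)

tp <- cm["yes", "yes"]; fp <- cm["yes", "no"]
fn <- cm["no", "yes"];  tn <- cm["no", "no"]

accuracy  <- (tp + tn) / sum(cm)   # share of all predictions that are correct
precision <- tp / (tp + fp)        # share of predicted "yes" that are truly "yes"
recall    <- tp / (tp + fn)        # share of true "yes" that the model found
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

In a tidymodels workflow the yardstick package provides equivalent metric functions; the manual version is shown here only to make the definitions explicit.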

2.3 Probability Predictions (for Classification Only) (6 points)

For your classification dataset (if using logistic regression or random forest):

Output predicted probabilities (not just class labels)
Create a histogram of predicted probabilities
Choose a probability threshold (e.g., 0.5) and justify your choice
Discuss what would happen if you lowered or raised the threshold
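To make the threshold discussion concrete, the sketch below fits a logistic regression to simulated data and compares precision and recall at three thresholds. The variables (income, debt_ratio, default), the data-generating process, and the helper metrics_at() are all hypothetical.

```r
set.seed(465)

# Simulated credit data (hypothetical variables and coefficients)
n <- 1000
credit <- data.frame(income = rnorm(n), debt_ratio = rnorm(n))
p_true <- plogis(-1 + 1.2 * credit$debt_ratio - 0.8 * credit$income)
credit$default <- rbinom(n, size = 1, prob = p_true)

# Train/test split and logistic regression
idx <- sample(n, size = round(0.8 * n))
fit <- glm(default ~ income + debt_ratio,
           data = credit[idx, ], family = binomial)

# Predicted probabilities (not class labels) on the test set
prob  <- predict(fit, newdata = credit[-idx, ], type = "response")
truth <- credit$default[-idx]

hist(prob, breaks = 20, main = "Predicted probability of default",
     xlab = "Predicted probability")

# Hypothetical helper: precision and recall at a given threshold
metrics_at <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  c(threshold = threshold,
    precision = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
    recall    = tp / (tp + fn))
}

results <- t(sapply(c(0.3, 0.5, 0.7), metrics_at))
print(round(results, 3))
```

Lowering the threshold flags more cases as positive, so recall cannot fall, while precision typically drops. Which side to favor depends on the relative cost of the two error types, for example a missed default versus a wrongly rejected borrower.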

2.4 Model Comparison & Selection (6 points)

For each dataset:

Compare model performance using test set metrics
Identify which model performs better and explain why
Discuss the bias‑variance tradeoff for your models

2.5 Cross‑Validation (6 points)

For each dataset, perform k‑fold cross‑validation (5‑fold or 10‑fold) on your best model:

Report the average cross‑validated performance
Compare the cross‑validated performance with the test set performance
Discuss any differences (e.g., overfitting, data leakage)
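One way to carry out k-fold cross-validation by hand is sketched below for a linear regression, using simulated data and 5 folds. The variable names and the model formula are assumptions for illustration. In your project, the folds would normally be drawn from the training set only.

```r
set.seed(465)

# Simulated stand-in for a training set
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(n)

# Randomly assign each row to one of k folds of (near-)equal size
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
cv_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x1 + x2, data = df[fold != i, ])
  pred <- predict(fit, newdata = df[fold == i, ])
  sqrt(mean((df$y[fold == i] - pred)^2))
})

print(round(cv_rmse, 3))                       # RMSE per fold
print(c(mean = mean(cv_rmse), sd = sd(cv_rmse)))  # average and spread
```

Within tidymodels, rsample::vfold_cv() creates the folds for you. A cross-validated error close to the test set error suggests the test result is not a fluke of one particular split; a large gap is worth investigating.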

What to submit (as a Quarto document):

For each dataset:

Train/test split description
Model specifications and training code for each model
Prediction results and evaluation metrics
Model comparison table
Cross‑validation results and interpretation
For classification: predicted probability histogram and threshold justification

Stage 3: Final Analysis & Presentation (Weeks 13–14)

Deliverable: Final Report (25 points) + Presentation (15 points)

Final Report Due: 3rd of June at midnight
Presentations: 4th of June in class

Tasks

3.1 Complete Analysis Pipeline (6 points)

Combine all work from Stages 1–2 into a single, reproducible Quarto document
Organize clearly with sections for Dataset 1 (Regression) and Dataset 2 (Classification)
Ensure all code runs without errors (render the document from a fresh session to verify, e.g., quarto render in the terminal or quarto::quarto_render() in R)
Use a clear narrative: Question → Data → Probability Analysis → Modeling → Results → Conclusion (for each dataset)

3.2 Economic Interpretation (6 points)

For each dataset:

Answer your original economic question based on your model results
Interpret the coefficients (if using linear/logistic regression) or feature importance (if using tree‑based methods)
Discuss how your findings could inform economic policy, business decisions, or future research
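For logistic regression, coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to interpret. A minimal sketch with simulated data, where the predictor x and its true coefficient of 0.7 are assumptions:

```r
set.seed(465)

# Simulated binary outcome with one predictor (hypothetical true effect 0.7)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.7 * x))

fit <- glm(y ~ x, family = binomial)

# Coefficients on the log-odds scale
print(round(coef(fit), 3))

# Odds ratios: multiplicative change in the odds per one-unit increase in x
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))

# Wald 95% confidence intervals, also on the odds-ratio scale
ci <- exp(confint.default(fit))
print(round(ci, 3))
```

An odds ratio above 1 means the odds of the outcome rise as the predictor increases, holding other variables fixed. These are associations in the data, so any causal language in the policy discussion needs separate justification.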

3.3 Limitations & Replication (4 points)

Describe at least two limitations of your overall analysis (they may apply to one or both datasets)
Explain what steps make your analysis reproducible (e.g., relative paths, set.seed(), documented environment)
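The three reproducibility steps named above can be illustrated briefly. The file name is the placeholder used elsewhere in this document, not a real file.

```r
# 1. Seeds: the same seed gives the same random draws
set.seed(465)
a <- sample(100, 5)
set.seed(465)
b <- sample(100, 5)
print(identical(a, b))

# 2. Relative paths: built relative to the project folder,
#    never an absolute path such as C:/Users/...
data_path <- file.path("data", "your_dataset.csv")
print(data_path)

# 3. Documented environment: record the R version and loaded packages
print(sessionInfo())
```

Including the sessionInfo() output at the end of the rendered report lets a reader see which package versions produced your results.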

3.4 AI Use Log (4 points)

Document any AI tools used (ChatGPT, Copilot, etc.) in an AI Use Log
For each AI interaction, state: the prompt given, how you used the output, and how you verified or modified it

3.5 Final Suggestions (5 points)

Suggest one improvement you would make if you had more time or better data
Pose one new economic question inspired by your analysis that you would like to investigate further

3.6 Presentation (15 points – Week 14)

10‑minute presentation summarizing your project (8 minutes speaking, 2 minutes Q&A)
Include for both datasets: economic question, dataset description, probability analysis highlights, model comparison (which model won and why), main findings, limitations
Present in class (or submit a recorded video if remote)

What to submit (Week 13):

A single, self‑contained Quarto document (.qmd) with all code and narrative for both datasets
The rendered output (.html or .pdf)
RPubs link (if published)

What to prepare for Week 14:

Presentation slides (Quarto revealjs, PowerPoint, or PDF)
A brief summary (2‑3 slides) highlighting your main economic insights for both datasets

Grading Summary

Stage	Component	Points
Stage 1	Data Proposal & Probability Analysis (both datasets)	25
Stage 2	Model Report (both datasets)	35
Stage 3	Final Report (both datasets)	25
Stage 3	Presentation	15
Total		100

Important Notes

Reproducibility is mandatory. All file paths must be relative (e.g., data/your_dataset.csv). Set a seed (set.seed(465)) for all random processes.
No reused class datasets. Find your own data. If you are unsure about a dataset, ask the instructor.
AI assistance is allowed but must be documented. Include an AI Use Log in your final report.
Late submission: 10% deduction per day.

Example Project Ideas

Dataset Type	Economic Question	Dataset Source	Target Variable	Models
Regression	What predicts housing prices in İzmir?	Real estate websites	Price (TL)	Linear regression, Random forest
Classification	Can we predict loan default?	Kaggle credit data	Default (Yes/No)	Logistic regression, Decision tree
Regression	What affects life expectancy?	World Bank	Life expectancy	Linear regression, Decision tree
Classification	Can we classify countries by recession risk?	FRED / WDI	Recession (Yes/No)	Logistic regression, Random forest
Regression	What drives GDP per capita growth?	World Bank	GDP growth (%)	Linear regression, Random forest
Classification	Which firms are likely to go bankrupt?	Borsa İstanbul	Bankruptcy (Yes/No)	Logistic regression, k‑NN

Weekly Timeline Summary

Week	Stage	Focus
Week 11	Stage 1	Find two datasets, import, clean, probability distributions (due 10 May)
Weeks 12–13	Stage 2	Predictive modeling for both datasets (due 24 May)
Week 13	Stage 3	Final report combining both datasets (due 3 June)
Week 14	Stage 3	Presentations (in class 4 June)