ECON 465 – Data Science Project: From Data to Economic Insight

Author

Gül Ertan Özgüzer

Published

April 29, 2025

1 Project Overview

This is a 3-stage, 4-week project that guides you through the complete data science workflow. You will find your own dataset, formulate an economic question, clean and explore the data, build predictive models, and present your findings. The project is designed to be completed iteratively, with feedback at each stage.

Timeline: 4 weeks (Weeks 10–13, with final submission and presentations in Week 14)
Weight: 35% of final grade
Group size: max 2 students (or individual upon request)

2 Stage 1: Data Acquisition & Probability Foundations (Week 10)

2.1 Deliverable: Data Proposal & Probability Analysis Report (25 points)

Due: 9th of May at midnight

2.2 Tasks

2.2.1 Find a Dataset (8 points)

Find a real-world economic dataset that interests you. Possible sources:

World Bank Open Data (WDI package)
FRED Economic Data (fredr package)
OECD Data (OECD package)
TURKSTAT (Turkish Statistical Institute, TÜİK)
Kaggle economic datasets
IPUMS (census microdata)

Your dataset should have at least 500 observations and 5 variables (including a target variable for prediction).

You may not use: Gapminder, iris, mtcars, or any dataset used in class labs.

Submission: Provide the dataset source URL or a brief description of how you obtained it. Explain why this dataset is relevant to an economic question.

2.2.2 Formulate an Economic Question (5 points)

Write a clear, focused economic question that your analysis will answer. Examples:

“Can we predict house prices in Istanbul based on size, age, district, and number of rooms?”
“What economic indicators best predict a country’s GDP growth rate?”
“Can we classify Turkish firms as high‑growth vs. low‑growth using financial ratios?”

Submission: State your question in one sentence.

2.2.3 Data Import & Cleaning (6 points)

Import the data into R using appropriate functions (read_csv, read_excel, WDI, fredr, etc.).
Handle missing values, rename variables, and ensure correct data types.
Create a tidy dataset (each variable is a column, each observation a row).
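
The import and cleaning steps above can be sketched as follows. This is a minimal illustration, not a required template: the column names, the made-up values, and the ".." missing-value marker are all assumptions, and a temporary file stands in for your own data/your_dataset.csv.

```r
library(readr)
library(dplyr)
library(tidyr)

# Hypothetical raw file with made-up values; in a real project you would
# read your own file, e.g. read_csv("data/your_dataset.csv").
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Country Name,Year,GDP per capita,Unemployment (%)",
  "Alpha,2019,9000.5,13.7",
  "Alpha,2020,8500.0,13.1",
  "Beta,2019,46000.2,3.1",
  "Beta,2020,..,3.9"
), tmp)

# Treat ".." as missing so numeric columns are not parsed as text
raw <- read_csv(tmp, na = c("", "NA", ".."))

clean <- raw |>
  # Rename to short, syntactic variable names
  rename(
    country    = `Country Name`,
    year       = Year,
    gdp_pc     = `GDP per capita`,
    unemp_rate = `Unemployment (%)`
  ) |>
  # Ensure correct data types
  mutate(year = as.integer(year), country = as.factor(country)) |>
  # Simplest missing-value strategy: drop incomplete rows
  drop_na()

print(clean)
```

Dropping incomplete rows is only one option; whichever strategy you choose (dropping, imputing, flagging), state it and justify it in your report.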

2.2.4 Probability Distribution Analysis (6 points)

Select one continuous variable from your dataset (e.g., income, price, GDP).
Compute summary statistics (mean, median, standard deviation, quartiles).
Create a histogram of the variable. Is it normally distributed or skewed?
If skewed, apply a log transformation and create a new histogram. How does the shape change?
Based on the shape, propose which theoretical distribution (normal, log‑normal, exponential) might approximate your data.
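
As a rough sketch of this analysis in base R, the example below uses simulated right-skewed values in place of a real variable. The variable name income and the simulation parameters are assumptions for illustration only.

```r
set.seed(465)

# Simulated right-skewed "income" values (illustrative, not real data)
income <- rlnorm(1000, meanlog = 10, sdlog = 0.8)

# Summary statistics: mean, median, standard deviation, quartiles
stats <- c(
  mean   = mean(income),
  median = median(income),
  sd     = sd(income),
  quantile(income, c(0.25, 0.75))
)
print(round(stats, 1))

# Histogram on the original scale: a long right tail suggests skewness
hist(income, breaks = 40, main = "Income (original scale)", xlab = "Income")

# Histogram after a log transformation
log_income <- log(income)
hist(log_income, breaks = 40, main = "Income (log scale)", xlab = "log(Income)")
```

A mean well above the median is a quick numerical sign of right skew. If the log-transformed histogram looks roughly symmetric and bell-shaped, a log-normal distribution is a plausible candidate for the original variable.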

What to submit (as a Quarto document):

Dataset description and source
Economic question
Code for importing and cleaning (commented)
Summary statistics table
Two histograms (original and log‑transformed) with interpretation
Proposal of theoretical distribution

3 Stage 2: Predictive Modeling (Weeks 11–12)

3.1 Deliverable: Model Report (35 points)

Due: 24th of May at midnight

3.2 Tasks

3.2.1 Data Splitting (5 points)

Split your data into training (80%) and test (20%) sets using initial_split().
Set a seed (set.seed(465)) for reproducibility.
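
A minimal sketch of the split, assuming the rsample package (part of tidymodels) and a hypothetical data frame df standing in for your cleaned dataset:

```r
library(rsample)

set.seed(465)  # fix the random split so results are reproducible

# Hypothetical data frame standing in for your cleaned dataset
df <- data.frame(
  price = rnorm(500, mean = 100, sd = 20),
  size  = rnorm(500, mean = 120, sd = 30)
)

split <- initial_split(df, prop = 0.8)  # 80% training, 20% test
train <- training(split)
test  <- testing(split)

c(train = nrow(train), test = nrow(test))
```

Call set.seed() immediately before initial_split() so that re-rendering the document reproduces the same split.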

3.2.2 Build at Least Two Predictive Models (12 points)

Based on your economic question, build and compare at least two different predictive models:

If predicting a continuous outcome (e.g., price, GDP):

Linear regression
Decision tree
Random forest (from Week 10 material) – optional but encouraged

If predicting a binary outcome (e.g., high/low, default/no default):

Logistic regression
Decision tree
k‑Nearest Neighbors

For each model:

Train the model on the training set
Make predictions on the test set
Compute appropriate evaluation metrics (RMSE, R² for regression; accuracy, confusion matrix for classification)
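
For a continuous outcome, the train, predict, and evaluate loop might look like the sketch below. It uses base R plus the rpart package for the decision tree; the simulated housing data and the helper functions rmse() and r2() are assumptions for illustration, and a tidymodels workflow would serve equally well.

```r
library(rpart)  # decision trees

set.seed(465)
n <- 500

# Simulated housing data (illustrative only)
houses <- data.frame(size = runif(n, 50, 250), age = runif(n, 0, 60))
houses$price <- 50 + 2 * houses$size - 0.8 * houses$age + rnorm(n, sd = 25)

# 80/20 train/test split
train_idx <- sample(n, size = 0.8 * n)
train <- houses[train_idx, ]
test  <- houses[-train_idx, ]

# Train both models on the training set only
lm_fit   <- lm(price ~ size + age, data = train)
tree_fit <- rpart(price ~ size + age, data = train, method = "anova")

# Hypothetical helper functions for the evaluation metrics
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r2 <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Predict on the test set and collect metrics in one comparison table
preds <- list(
  linear = predict(lm_fit, newdata = test),
  tree   = predict(tree_fit, newdata = test)
)
results <- data.frame(
  model = names(preds),
  rmse  = sapply(preds, rmse, actual = test$price),
  r2    = sapply(preds, r2, actual = test$price)
)
print(results)
```

Both metrics are computed on the test set, never on the data used for fitting, so they estimate performance on unseen observations.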

3.2.3 Probability Predictions (for Classification) (6 points)

If using logistic regression, output predicted probabilities (not just class labels).
Create a histogram of predicted probabilities.
Choose a probability threshold (e.g., 0.5) and justify your choice. What would happen if you lowered or raised the threshold? How does this relate to economic decision‑making?
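
The sketch below illustrates predicted probabilities and the effect of the threshold, using base R's glm(). The simulated firm data, the variable debt_ratio, and the helper confusion_at() are assumptions for illustration.

```r
set.seed(465)
n <- 600

# Simulated firm data: default risk rises with the debt ratio (illustrative)
firms <- data.frame(debt_ratio = runif(n, 0, 1.5))
firms$default <- rbinom(n, 1, plogis(-3 + 3 * firms$debt_ratio))

train_idx <- sample(n, size = 0.8 * n)
train <- firms[train_idx, ]
test  <- firms[-train_idx, ]

logit_fit <- glm(default ~ debt_ratio, data = train, family = binomial)

# type = "response" returns probabilities rather than log-odds
prob <- predict(logit_fit, newdata = test, type = "response")
hist(prob, breaks = 20, main = "Predicted default probabilities",
     xlab = "P(default)")

# Hypothetical helper: confusion matrix at a given threshold
confusion_at <- function(threshold) {
  predicted <- factor(as.integer(prob >= threshold), levels = c(0, 1))
  actual    <- factor(test$default, levels = c(0, 1))
  table(actual = actual, predicted = predicted)
}

for (threshold in c(0.3, 0.5, 0.7)) {
  cat("\nThreshold:", threshold, "\n")
  print(confusion_at(threshold))
}
```

Lowering the threshold flags more cases as positive, which catches more true defaults at the cost of more false alarms; raising it does the opposite. Which error is more costly depends on the economic setting, for example a bank weighing a rejected good borrower against an approved bad one.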

3.2.4 Model Comparison & Selection (6 points)

Compare model performance using test set metrics.
Identify which model performs better and explain why (e.g., “Random forest captures non‑linear relationships better”).
Discuss the bias‑variance tradeoff for your models. Which model has lower bias? Which has lower variance? How does this affect generalization to new data?

3.2.5 Cross‑Validation (6 points)

Perform k‑fold cross‑validation (5‑fold or 10‑fold) on your best model.
Report the average cross‑validated performance.
Compare the cross‑validated performance with the test set performance. Are they similar? If not, what might explain the difference (e.g., overfitting, data leakage)?
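
One way to sketch k‑fold cross‑validation is a manual loop in base R, shown here for a linear model on simulated data (an assumption for illustration). The vfold_cv() function in rsample is an alternative. In your project, run the folds on the training set only, so that the test set stays untouched for the final comparison.

```r
set.seed(465)
n <- 500

# Simulated data standing in for your training set (illustrative only)
houses <- data.frame(size = runif(n, 50, 250), age = runif(n, 0, 60))
houses$price <- 50 + 2 * houses$size - 0.8 * houses$age + rnorm(n, sd = 25)

k <- 5
# Randomly assign each row to one of k folds of (nearly) equal size
fold_id <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
fold_rmse <- sapply(1:k, function(fold) {
  held_out <- fold_id == fold
  fit  <- lm(price ~ size + age, data = houses[!held_out, ])
  pred <- predict(fit, newdata = houses[held_out, ])
  sqrt(mean((houses$price[held_out] - pred)^2))
})

cv_rmse <- mean(fold_rmse)
print(round(fold_rmse, 2))
cat("Average cross-validated RMSE:", round(cv_rmse, 2), "\n")
```

Reporting the spread of the fold-level values alongside the average shows how sensitive the model is to the particular sample it was trained on.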

What to submit (as a Quarto document):

Train/test split description
Model specifications and training code for each model
Prediction results and evaluation metrics
Model comparison table
Cross‑validation results and interpretation
For classification: predicted probability histogram and threshold justification

4 Stage 3: Final Analysis & Presentation (Weeks 13–14)

4.1 Deliverable: Final Report (25 points) + Presentation (15 points)

Final Report Due: 3rd of June at midnight
Presentations: 4th of June in class

4.2 Tasks

4.2.1 Complete Analysis Pipeline (6 points)

Combine all work from Stages 1–2 into a single, reproducible Quarto document.
Ensure all code runs without errors (render the document with quarto render, or quarto::quarto_render() from R, to verify).
Use a clear narrative: Question → Data → Probability Analysis → Modeling → Results → Conclusion.

4.2.2 Economic Interpretation (6 points)

Answer your original economic question based on your model results.
Interpret the coefficients (if using linear/logistic regression) or feature importance (if using tree‑based methods).
Discuss how your findings could inform economic policy, business decisions, or future research.
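
As an illustration of coefficient interpretation, the sketch below fits a log-linear regression on simulated data (the variables and parameter values are assumptions) and converts the slope into an approximate percentage effect.

```r
set.seed(465)
n <- 400

# Simulated data: each extra square metre raises price by about 0.6%
size  <- runif(n, 50, 250)
price <- exp(4 + 0.006 * size + rnorm(n, sd = 0.2))

fit  <- lm(log(price) ~ size)
beta <- coef(fit)["size"]

# With a log outcome, exp(beta * delta) - 1 is the proportional change
# in price associated with a change of delta units in size
pct_per_10sqm <- 100 * (exp(10 * beta) - 1)

print(summary(fit)$coefficients)
cat("Estimated price change for 10 extra square metres:",
    round(pct_per_10sqm, 1), "%\n")
```

Stating effects in economically meaningful units (percent, TL, percentage points) is usually more informative than quoting raw coefficients. Such estimates describe associations in the data and are not causal effects unless the research design supports that reading.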

4.2.3 Limitations & Replication (4 points)

Describe at least two limitations of your analysis (e.g., data quality, omitted variables, small sample size, generalizability outside your sample).
Explain what steps make your analysis reproducible (e.g., relative paths, set.seed(), documented environment).
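
The reproducibility steps mentioned above can be sketched in a few lines of base R. The file name data/your_dataset.csv is a placeholder.

```r
# 1. Seeds: the same seed yields the same "random" draws
set.seed(465)
draw_1 <- sample(1:100, 5)
set.seed(465)
draw_2 <- sample(1:100, 5)
identical(draw_1, draw_2)  # TRUE

# 2. Relative paths: resolved from the project folder, so the code
#    runs unchanged on another computer
data_path <- file.path("data", "your_dataset.csv")
print(data_path)

# 3. Documented environment: record the R and package versions used
sessionInfo()
```

Placing sessionInfo() at the end of the Quarto document records the environment in the rendered output.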

4.2.4 AI Use Log (4 points)

Document any AI tools used (ChatGPT, Copilot, etc.) in an AI Use Log.
For each AI interaction, state: the prompt given, how you used the output, and how you verified or modified it.

4.2.5 Final Suggestions (5 points)

Suggest one improvement you would make if you had more time or better data.
Pose one new economic question inspired by your analysis that you would like to investigate further.

4.2.6 Presentation (15 points – Week 14)

10‑minute presentation summarizing your project (8 minutes speaking, 2 minutes Q&A).
Include: economic question, dataset description, probability analysis highlights (histograms, distribution), model comparison (which model won and why), main findings, limitations.
Present in class (or submit a recorded video if remote).

What to submit (Week 13):

A single, self‑contained Quarto document (.qmd) with all code and narrative.
The rendered output (.html or .pdf).
RPubs link (if published).

What to prepare for Week 14:

Presentation slides (Quarto revealjs, PowerPoint, or PDF).
A brief summary (2‑3 slides) highlighting your main economic insight.

5 Grading Summary

Stage	Component	Points
Stage 1	Data Proposal & Probability Analysis	25
Stage 2	Model Report	35
Stage 3	Final Report	25
Stage 3	Presentation	15
Total		100

6 Important Notes

Reproducibility is mandatory. All file paths must be relative (e.g., data/your_dataset.csv). Set a seed (set.seed(465)) for all random processes.
No reused class datasets. Find your own data. If you are unsure about a dataset, ask the instructor.
AI assistance is allowed but must be documented. Include an AI Use Log in your final report.
Late submission: 10% deduction per day.

7 Example Project Ideas

Economic Question	Dataset Source	Target Variable	Model Type	Probability Focus
What predicts housing prices in İzmir?	Real estate websites	Price (TL)	Linear regression	Distribution of prices (log‑normal)
Can we classify Turkish provinces by unemployment risk?	TÜİK, World Bank	High/low unemployment (binary)	Logistic regression	Distribution of unemployment rates
Which firms are likely to default on loans?	Public company data (Borsa İstanbul)	Default (binary)	Random forest	Distribution of financial ratios
How does life expectancy relate to health spending?	World Bank	Life expectancy	Linear regression	Distribution of life expectancy
What predicts consumer credit risk?	Kaggle credit data	Default (binary)	Logistic regression	Distribution of credit scores

8 Weekly Timeline Summary

Week	Stage	Focus
Week 10	Stage 1	Find data, import, clean, probability distributions
Weeks 11–12	Stage 2	Predictive modeling (2+ models), cross‑validation
Week 14	Stage 3	Final report (combine all stages)
Week 14	Stage 3	Presentations