ECON 465 – Data Science Project: From Data to Economic Insight
Project Overview
This is a 3-stage, 4-week project that guides you through the complete data science workflow. You will find two different datasets and answer two different economic questions:
- Dataset 1 (Regression): Predict a continuous outcome (e.g., house price, GDP, life expectancy)
- Dataset 2 (Classification): Predict a binary outcome (e.g., default/no default, high/low unemployment)
For each dataset, you will build at least two predictive models, compare them, and evaluate performance. The project is designed to be completed iteratively, with feedback at each stage.
Timeline: 4 weeks (Weeks 11–14)
Weight: 35% of final grade
Group size: Max 2 students (or individual upon request)
Stage 1: Data Acquisition & Probability Foundations (Week 11)
Deliverable: Data Proposal & Probability Analysis Report (25 points)
Due: 10th of May at midnight
Tasks
1.1 Find Two Datasets (8 points)
Find two real-world economic datasets:
| Dataset | Outcome Type | Example Target Variable | Example Sources |
|---|---|---|---|
| Dataset 1 (Regression) | Continuous (numeric) | House price, GDP per capita, life expectancy, inflation rate | WDI, FRED, OECD, TURKSTAT, Kaggle |
| Dataset 2 (Classification) | Binary (Yes/No or 0/1) | Default status, high/low growth, recession indicator, firm bankruptcy | WDI, FRED, Kaggle, public company data |
Requirements for each dataset:
- At least 500 observations and 5 variables (including target variable)
- You may not use: Gapminder, iris, mtcars, Default, or any dataset used in class labs
Submission for each dataset:
- Dataset source URL or brief description of how you obtained it
- Explanation of why this dataset is relevant to an economic question
1.2 Formulate Two Economic Questions (5 points)
Write one clear, focused economic question for each dataset.
Examples for regression (continuous outcome):
- “What factors predict house prices in İzmir?”
- “Which economic indicators best predict a country’s GDP growth rate?”
Examples for classification (binary outcome):
- “Can we predict whether a borrower will default on a loan?”
- “Can we classify Turkish provinces as high‑unemployment vs. low‑unemployment?”
Submission: State each question in one sentence.
1.3 Data Import & Cleaning (6 points)
For each dataset:
- Import the data using appropriate functions (read_csv, read_excel, WDI, fredr, etc.)
- Handle missing values, rename variables, ensure correct data types
- Create a tidy dataset (each variable is a column, each observation a row)
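The import and cleaning steps above could look roughly like the sketch below. It uses base R (read.csv) only so that it runs without extra packages; in your project you would more likely use read_csv from readr or one of the other functions listed. The inline data, the column names, and the choice to drop rows with a missing target are all invented for illustration, not a required approach.

```r
# Hypothetical raw data, inlined so the sketch runs without an external file.
raw_text <- "Province Name,Unemp Rate (%),Year,Region
Izmir,14.2,2022,Aegean
Ankara,NA,2022,Central Anatolia
Istanbul,12.9,2022,Marmara
Izmir,13.1,2023,Aegean"

# Import: treat the string "NA" as missing and keep the original header text.
raw <- read.csv(text = raw_text, na.strings = "NA",
                check.names = FALSE, stringsAsFactors = FALSE)

# Rename variables to short snake_case names.
names(raw) <- c("province", "unemp_rate", "year", "region")

# Ensure correct data types.
raw$year   <- as.integer(raw$year)
raw$region <- factor(raw$region)

# Handle missing values: one option is to drop rows with a missing target.
clean <- raw[!is.na(raw$unemp_rate), ]

# Inspect the tidy result: one row per observation, one column per variable.
str(clean)
```

Whether to drop, impute, or flag missing values depends on your data; whichever you choose, state it in a code comment.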
1.4 Probability Distribution Analysis (6 points)
For each dataset, select the target variable (outcome variable):
- Compute summary statistics (mean, median, standard deviation, quartiles)
- Create a histogram of the variable. Is it normally distributed or skewed?
- If skewed, apply a log transformation and create a new histogram. How does the shape change?
- Based on the shape, propose which theoretical distribution (normal, log‑normal, exponential) might approximate your data.
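As a rough illustration of the distribution analysis above, the sketch below simulates a right-skewed target and compares it before and after a log transformation. The variable price is simulated (log-normal by construction), and skewness() is a small helper defined here, not a function from a package.

```r
set.seed(465)

# Hypothetical right-skewed target (e.g., house prices), simulated as log-normal.
price <- rlnorm(1000, meanlog = 13, sdlog = 0.6)

# Summary statistics: mean, median, standard deviation, quartiles.
c(mean = mean(price), median = median(price), sd = sd(price))
quantile(price, probs = c(0.25, 0.50, 0.75))

# Simple moment-based sample skewness (helper defined for this sketch).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price)        # clearly positive for right-skewed data
skew_log <- skewness(log(price))   # close to zero if the data are log-normal

# Histograms of the original and the log-transformed variable.
hist(price, breaks = 40, main = "Original", xlab = "Price")
hist(log(price), breaks = 40, main = "Log-transformed", xlab = "log(Price)")
```

A mean well above the median and a long right tail that disappears after taking logs is the kind of evidence that would support proposing a log-normal distribution.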
What to submit (as a Quarto document):
For each dataset:
- Dataset description and source
- Economic question
- Code for importing and cleaning (commented)
- Summary statistics table
- Two histograms (original and log‑transformed) with interpretation
- Proposal of theoretical distribution
Stage 2: Predictive Modeling (Weeks 12–13)
Deliverable: Model Report (35 points)
Due: 24th of May at midnight
Tasks
2.1 Data Splitting (5 points)
For each dataset:
- Split your data into training (80%) and test (20%) sets using initial_split()
- Set a seed (set.seed(465)) for reproducibility
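To show what the 80/20 split does, here is a minimal base R stand-in. It is not the initial_split() workflow itself: in your project you would call initial_split() from rsample with prop = 0.8 and extract the two sets with training() and testing(). The data frame dat is invented for illustration.

```r
set.seed(465)  # seed from the assignment, so the split is reproducible

# Hypothetical dataset with 500 observations.
dat <- data.frame(id = 1:500, x = rnorm(500))

# Draw 80% of the row indices for training; the remaining rows form the test set.
n         <- nrow(dat)
train_idx <- sample(n, size = floor(0.8 * n))
train     <- dat[train_idx, ]
test      <- dat[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```

Setting the seed before the split is what makes the same rows land in the training set every time the document is rendered.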
2.2 Build at Least Two Predictive Models per Dataset (12 points)
For the regression dataset (continuous outcome): Build and compare at least two of:
- Linear regression
- Decision tree
- Random forest (encouraged)
- Other regression method (e.g., k‑NN regression)
For the classification dataset (binary outcome): Build and compare at least two of:
- Logistic regression
- Decision tree
- k‑Nearest Neighbors
- Random forest (encouraged)
For each model:
- Train the model on the training set
- Make predictions on the test set
- Compute appropriate evaluation metrics:
  - Regression: RMSE, R²
  - Classification: Accuracy, confusion matrix, precision, recall
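The train, predict, evaluate cycle and the metrics listed above can be sketched in base R as follows. The data are simulated and the variable names (size, price, debt_ratio, default) are hypothetical. Only linear and logistic regression are shown; the same metric calculations apply to predictions from any other model.

```r
set.seed(465)
n   <- 500
idx <- sample(n, size = floor(0.8 * n))  # training rows, reused for both examples

# --- Regression: linear model on simulated data ---
reg <- data.frame(size = runif(n, 50, 200))
reg$price <- 1000 + 30 * reg$size + rnorm(n, sd = 500)

fit_lm <- lm(price ~ size, data = reg[idx, ])
reg_test <- reg[-idx, ]
pred <- predict(fit_lm, newdata = reg_test)

rmse <- sqrt(mean((reg_test$price - pred)^2))
# Test-set R-squared as 1 - SSE/SST (one common definition).
r2 <- 1 - sum((reg_test$price - pred)^2) /
          sum((reg_test$price - mean(reg_test$price))^2)

# --- Classification: logistic regression on simulated data ---
cls <- data.frame(debt_ratio = runif(n))
cls$default <- rbinom(n, 1, plogis(-3 + 6 * cls$debt_ratio))

fit_glm <- glm(default ~ debt_ratio, data = cls[idx, ], family = binomial)
cls_test <- cls[-idx, ]
prob <- predict(fit_glm, newdata = cls_test, type = "response")
pred_class <- as.integer(prob >= 0.5)

# Confusion matrix with fixed levels so both classes always appear.
cm <- table(predicted = factor(pred_class, levels = c(0, 1)),
            actual    = factor(cls_test$default, levels = c(0, 1)))

tp <- cm["1", "1"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)   # of predicted defaults, share that truly defaulted
recall    <- tp / (tp + fn)   # of true defaults, share that were caught

c(rmse = rmse, r2 = r2)
cm
c(accuracy = accuracy, precision = precision, recall = recall)
```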
2.3 Probability Predictions (for Classification Only) (6 points)
For your classification dataset (if using logistic regression or random forest):
- Output predicted probabilities (not just class labels)
- Create a histogram of predicted probabilities
- Choose a probability threshold (e.g., 0.5) and justify your choice
- Discuss what would happen if you lowered or raised the threshold
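To see how the threshold choice plays out, the sketch below evaluates precision and recall at three thresholds. The probabilities and labels are simulated rather than taken from a fitted model, and metrics_at() is a helper defined only for this example.

```r
set.seed(465)

# Hypothetical predicted probabilities and true labels (simulated).
n      <- 1000
prob   <- runif(n)
actual <- rbinom(n, 1, prob)  # labels drawn to be consistent with the probabilities

hist(prob, breaks = 20, main = "Predicted probabilities", xlab = "P(default)")

# Precision and recall at a given threshold (helper defined for this sketch).
metrics_at <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & actual == 1)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# One row per threshold.
results <- t(sapply(c(0.3, 0.5, 0.7), metrics_at))
results
```

Raising the threshold can only keep recall the same or lower it, because fewer cases are flagged as positive; precision usually, though not always, rises. Which error is more costly in your economic setting (a missed default or a false alarm) is what should drive your justification.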
2.4 Model Comparison & Selection (6 points)
For each dataset:
- Compare model performance using test set metrics
- Identify which model performs better and explain why
- Discuss the bias‑variance tradeoff for your models
2.5 Cross‑Validation (6 points)
For each dataset, perform k‑fold cross‑validation (5‑fold or 10‑fold) on your best model:
- Report the average cross‑validated performance
- Compare the cross‑validated performance with the test set performance
- Discuss any differences (e.g., overfitting, data leakage)
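The mechanics of k-fold cross-validation can be written out by hand, as in the sketch below. This is a base R illustration on simulated data with a linear model; in your project you could instead create folds with vfold_cv() from rsample, and you would normally draw the folds from the training set only so that the test set stays untouched.

```r
set.seed(465)

# Hypothetical data: y depends linearly on x plus noise with standard deviation 1.
n   <- 500
dat <- data.frame(x = rnorm(n))
dat$y <- 2 + 3 * dat$x + rnorm(n)

k <- 5
# Randomly assign each row to one of k folds of equal size.
folds <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold.
fold_rmse <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]
  valid <- dat[folds == i, ]
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = valid)
  sqrt(mean((valid$y - pred)^2))
})

fold_rmse        # one RMSE per fold
mean(fold_rmse)  # average cross-validated RMSE
```

If the average cross-validated RMSE is much better than the test-set RMSE, that gap is worth discussing: it can point to overfitting or to information from the test set leaking into training.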
What to submit (as a Quarto document):
For each dataset:
- Train/test split description
- Model specifications and training code for each model
- Prediction results and evaluation metrics
- Model comparison table
- Cross‑validation results and interpretation
- For classification: predicted probability histogram and threshold justification
Stage 3: Final Analysis & Presentation (Weeks 13–14)
Deliverable: Final Report (25 points) + Presentation (15 points)
Final Report Due: 3rd of June at midnight
Presentations: 4th of June in class
Tasks
3.1 Complete Analysis Pipeline (6 points)
- Combine all work from Stages 1–2 into a single, reproducible Quarto document
- Organize clearly with sections for Dataset 1 (Regression) and Dataset 2 (Classification)
- Ensure all code runs without errors (run render() to verify)
- Use a clear narrative: Question → Data → Probability Analysis → Modeling → Results → Conclusion (for each dataset)
3.2 Economic Interpretation (6 points)
For each dataset:
- Answer your original economic question based on your model results
- Interpret the coefficients (if using linear/logistic regression) or feature importance (if using tree‑based methods)
- Discuss how your findings could inform economic policy, business decisions, or future research
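For logistic regression, coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below uses simulated data with hypothetical variable names (debt_ratio, default) and is only one way to present coefficients.

```r
set.seed(465)

# Hypothetical data: default probability rises with the debt ratio.
n          <- 1000
debt_ratio <- runif(n)
default    <- rbinom(n, 1, plogis(-3 + 4 * debt_ratio))

fit <- glm(default ~ debt_ratio, family = binomial)

# Coefficients on the log-odds scale.
coef(fit)

# Odds ratio for a one-unit increase in debt_ratio (here, from 0 to 1).
odds_ratio <- exp(coef(fit)["debt_ratio"])

# A 0.1 increase is a more realistic change for a ratio between 0 and 1.
odds_ratio_per_0.1 <- exp(0.1 * coef(fit)["debt_ratio"])

c(odds_ratio, odds_ratio_per_0.1)
```

An odds ratio above 1 means the odds of default rise with the predictor. It describes an association in the data, not necessarily a causal effect, which matters when you discuss policy implications.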
3.3 Limitations & Replication (4 points)
- Describe at least two limitations of your overall analysis (they may apply to one or both datasets)
- Explain what steps make your analysis reproducible (e.g., relative paths, set.seed(), documented environment)
3.4 AI Use Log (4 points)
- Document any AI tools used (ChatGPT, Copilot, etc.) in an AI Use Log
- For each AI interaction, state: the prompt given, how you used the output, and how you verified or modified it
3.5 Final Suggestions (5 points)
- Suggest one improvement you would make if you had more time or better data
- Pose one new economic question inspired by your analysis that you would like to investigate further
3.6 Presentation (15 points – Week 14)
- 10‑minute presentation summarizing your project (8 minutes speaking, 2 minutes Q&A)
- Include for both datasets: economic question, dataset description, probability analysis highlights, model comparison (which model won and why), main findings, limitations
- Present in class (or submit a recorded video if remote)
What to submit (Week 13):
- A single, self‑contained Quarto document (.qmd) with all code and narrative for both datasets
- The rendered output (.html or .pdf)
- RPubs link (if published)
What to prepare for Week 14:
- Presentation slides (Quarto revealjs, PowerPoint, or PDF)
- A brief summary (2‑3 slides) highlighting your main economic insights for both datasets
Grading Summary
| Stage | Component | Points |
|---|---|---|
| Stage 1 | Data Proposal & Probability Analysis (both datasets) | 25 |
| Stage 2 | Model Report (both datasets) | 35 |
| Stage 3 | Final Report (both datasets) | 25 |
| Stage 3 | Presentation | 15 |
| Total | | 100 |
Important Notes
- Reproducibility is mandatory. All file paths must be relative (e.g., data/your_dataset.csv). Set a seed (set.seed(465)) for all random processes.
- No reused class datasets. Find your own data. If you are unsure about a dataset, ask the instructor.
- AI assistance is allowed but must be documented. Include an AI Use Log in your final report.
- Late submission: 10% deduction per day.
Example Project Ideas
| Dataset Type | Economic Question | Dataset Source | Target Variable | Models |
|---|---|---|---|---|
| Regression | What predicts housing prices in İzmir? | Real estate websites | Price (TL) | Linear regression, Random forest |
| Classification | Can we predict loan default? | Kaggle credit data | Default (Yes/No) | Logistic regression, Decision tree |
| Regression | What affects life expectancy? | World Bank | Life expectancy | Linear regression, Decision tree |
| Classification | Can we classify countries by recession risk? | FRED / WDI | Recession (Yes/No) | Logistic regression, Random forest |
| Regression | What drives GDP per capita growth? | World Bank | GDP growth (%) | Linear regression, Random forest |
| Classification | Which firms are likely to go bankrupt? | Borsa İstanbul | Bankruptcy (Yes/No) | Logistic regression, k‑NN |
Weekly Timeline Summary
| Week | Stage | Focus |
|---|---|---|
| Week 11 | Stage 1 | Find two datasets, import, clean, probability distributions (due 10 May) |
| Weeks 12–13 | Stage 2 | Predictive modeling for both datasets (due 24 May) |
| Week 13 | Stage 3 | Final report combining both datasets (due 3 June) |
| Week 14 | Stage 3 | Presentations (in class 4 June) |