ECON 465 – Data Science Project: From Data to Economic Insight
1 Project Overview
This is a 3-stage, 4-week project that guides you through the complete data science workflow. You will find your own dataset, formulate an economic question, clean and explore the data, build predictive models, and present your findings. The project is designed to be completed iteratively, with feedback at each stage.
Timeline: 4 weeks (Weeks 10–13, with final submission and presentations in Week 14)
Weight: 35% of final grade
Group size: max 2 students (or individual upon request)
2 Stage 1: Data Acquisition & Probability Foundations (Week 10)
2.1 Deliverable: Data Proposal & Probability Analysis Report (25 points)
Due: 9th of May at midnight
2.2 Tasks
2.2.1 Find a Dataset (8 points)
Find a real-world economic dataset that interests you. Possible sources:
- World Bank Open Data (`WDI` package)
- FRED Economic Data (`fredr` package)
- OECD Data (`OECD` package)
- TURKSTAT (Turkish Statistical Institute)
- Kaggle economic datasets
- IPUMS (census microdata)
Your dataset should have at least 500 observations and 5 variables (including a target variable for prediction).
You may not use: Gapminder, iris, mtcars, or any dataset used in class labs.
Submission: Provide the dataset source URL or a brief description of how you obtained it. Explain why this dataset is relevant to an economic question.
2.2.2 Formulate an Economic Question (5 points)
Write a clear, focused economic question that your analysis will answer. Examples:
- “Can we predict house prices in Istanbul based on size, age, district, and number of rooms?”
- “What economic indicators best predict a country’s GDP growth rate?”
- “Can we classify Turkish firms as high‑growth vs. low‑growth using financial ratios?”
Submission: State your question in one sentence.
2.2.3 Data Import & Cleaning (6 points)
- Import the data into R using appropriate functions (`read_csv`, `read_excel`, `WDI`, `fredr`, etc.).
- Handle missing values, rename variables, ensure correct data types.
- Create a tidy dataset (each variable is a column, each observation a row).
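The import and cleaning steps above might look roughly like the following sketch. It assumes the `readr` and `dplyr` packages; the inline CSV text and the column names (`Price (TL)`, `Size m2`, and so on) are made up for illustration and stand in for your own file, which you would normally read from a relative path such as `data/your_dataset.csv`.

```r
library(readr)
library(dplyr)

# Hypothetical raw data; in a real project use read_csv("data/your_dataset.csv")
raw_csv <- "Price (TL),Size m2,District,Rooms
2500000,95,Kadikoy,3
NA,120,Besiktas,4
1800000,70,Kadikoy,2
3100000,,Uskudar,3"

raw <- read_csv(I(raw_csv), show_col_types = FALSE)

clean <- raw |>
  # Rename to short, syntactic names
  rename(
    price_tl = `Price (TL)`,
    size_m2  = `Size m2`,
    district = District,
    rooms    = Rooms
  ) |>
  # Ensure sensible data types
  mutate(
    district = as.factor(district),
    rooms    = as.integer(rooms)
  ) |>
  # One simple way to handle missing values: drop incomplete rows
  filter(!is.na(price_tl), !is.na(size_m2))

# Report how many rows the cleaning removed
n_dropped <- nrow(raw) - nrow(clean)
print(clean)
print(n_dropped)
```

Dropping incomplete rows is only one option. If many observations are lost this way, imputation or a different variable choice may be more appropriate, and the report should say which approach was taken and why.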
2.2.4 Probability Distribution Analysis (6 points)
- Select one continuous variable from your dataset (e.g., income, price, GDP).
- Compute summary statistics (mean, median, standard deviation, quartiles).
- Create a histogram of the variable. Is it normally distributed or skewed?
- If skewed, apply a log transformation and create a new histogram. How does the shape change?
- Based on the shape, propose which theoretical distribution (normal, log‑normal, exponential) might approximate your data.
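As a rough illustration of the steps above, the sketch below uses base R only. The `income` variable is simulated from a log-normal distribution purely so the example runs on its own; replace it with a column from your dataset. The `skewness()` helper is a simple moment-based measure defined here for convenience, not a function from a package.

```r
set.seed(465)

# Simulated right-skewed variable (assumption: stands in for your own column)
income <- rlnorm(1000, meanlog = 10, sdlog = 0.8)

# Summary statistics: mean, median, standard deviation, quartiles
summary_stats <- c(
  mean   = mean(income),
  median = median(income),
  sd     = sd(income),
  q25    = unname(quantile(income, 0.25)),
  q75    = unname(quantile(income, 0.75))
)
print(round(summary_stats, 1))

# Log transformation for a skewed variable (requires strictly positive values)
log_income <- log(income)

# Simple moment-based skewness: near 0 for symmetric data, positive for a long right tail
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
print(c(original = skewness(income), log = skewness(log_income)))

# Side-by-side histograms of the original and log-transformed variable
par(mfrow = c(1, 2))
hist(income, breaks = 40, main = "Original", xlab = "Income")
hist(log_income, breaks = 40, main = "Log-transformed", xlab = "log(Income)")
```

If the log-transformed histogram looks roughly symmetric and bell-shaped, a log-normal distribution is a plausible candidate for the original variable. A mean well above the median is another sign of right skew.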
What to submit (as a Quarto document):
- Dataset description and source
- Economic question
- Code for importing and cleaning (commented)
- Summary statistics table
- Two histograms (original and log‑transformed) with interpretation
- Proposal of theoretical distribution
3 Stage 2: Predictive Modeling (Weeks 11–12)
3.1 Deliverable: Model Report (35 points)
Due: 24th of May at midnight
3.2 Tasks
3.2.1 Data Splitting (5 points)
- Split your data into training (80%) and test (20%) sets using `initial_split()`.
- Set a seed (`set.seed(465)`) for reproducibility.
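A minimal sketch of the split, assuming the `rsample` package (part of tidymodels). The `housing` data frame is simulated and hypothetical; use your cleaned dataset instead.

```r
library(rsample)

# Hypothetical data frame standing in for your cleaned dataset
set.seed(465)
housing <- data.frame(
  price   = rlnorm(500, meanlog = 14, sdlog = 0.5),
  size_m2 = runif(500, 40, 250)
)

# Set the seed immediately before the random split so it is reproducible
set.seed(465)
housing_split <- initial_split(housing, prop = 0.8)
housing_train <- training(housing_split)
housing_test  <- testing(housing_split)

print(c(train = nrow(housing_train), test = nrow(housing_test)))
```

For a classification target with unbalanced classes, `initial_split()` also accepts a `strata` argument that keeps the class proportions similar in both sets.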
3.2.2 Build at Least Two Predictive Models (12 points)
Based on your economic question, build and compare at least two different predictive models:
If predicting a continuous outcome (e.g., price, GDP):
- Linear regression
- Decision tree
- Random forest (from Week 10 material), optional but encouraged

If predicting a binary outcome (e.g., high/low, default/no default):
- Logistic regression
- Decision tree
- k‑Nearest Neighbors

For each model:
- Train the model on the training set
- Make predictions on the test set
- Compute appropriate evaluation metrics (RMSE, R² for regression; accuracy, confusion matrix for classification)
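For a continuous outcome, the train, predict, and evaluate loop could be sketched as follows. This example uses base R `lm()` and the `rpart` package for the decision tree, with a plain `sample()` split to keep it dependency-light. The data are simulated and the variable names (`size`, `age`, `price`) are hypothetical; the `rmse()` and `r2()` helpers are defined here, not taken from a package.

```r
library(rpart)

set.seed(465)

# Simulated housing-style data (assumption: replace with your own dataset)
n <- 600
dat <- data.frame(size = runif(n, 50, 200), age = runif(n, 0, 40))
dat$price <- 20 * dat$size - 15 * dat$age + rnorm(n, sd = 300)

# 80/20 train/test split
train_idx <- sample(n, size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Train both models on the training set only
fit_lm   <- lm(price ~ size + age, data = train)
fit_tree <- rpart(price ~ size + age, data = train, method = "anova")

# Evaluation metrics computed on the held-out test set
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
r2   <- function(actual, pred) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}

pred_lm   <- predict(fit_lm, newdata = test)
pred_tree <- predict(fit_tree, newdata = test)

results <- data.frame(
  model = c("Linear regression", "Decision tree"),
  rmse  = c(rmse(test$price, pred_lm), rmse(test$price, pred_tree)),
  r2    = c(r2(test$price, pred_lm), r2(test$price, pred_tree))
)
print(results)
```

Because the simulated relationship here is linear by construction, the comparison says nothing about which model will win on real data. The point is only the workflow: fit on training data, predict on test data, and put the metrics side by side.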
3.2.3 Probability Predictions (for Classification) (6 points)
- If using logistic regression, output predicted probabilities (not just class labels).
- Create a histogram of predicted probabilities.
- Choose a probability threshold (e.g., 0.5) and justify your choice. What would happen if you lowered or raised the threshold? How does this relate to economic decision‑making?
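The effect of the threshold can be explored with a sketch like the one here, written in base R. The `firms` data, the predictors `debt_ratio` and `margin`, and the `confusion()` helper are all hypothetical and exist only to make the example self-contained.

```r
set.seed(465)

# Simulated firm-level data (assumption: replace with your own dataset)
n <- 800
firms <- data.frame(
  debt_ratio = runif(n, 0, 1),
  margin     = rnorm(n, mean = 0.1, sd = 0.05)
)
p_true <- plogis(-3 + 4 * firms$debt_ratio - 10 * firms$margin)
firms$default <- rbinom(n, size = 1, prob = p_true)

train_idx <- sample(n, size = round(0.8 * n))
train <- firms[train_idx, ]
test  <- firms[-train_idx, ]

# Logistic regression; type = "response" returns probabilities, not log-odds
fit  <- glm(default ~ debt_ratio + margin, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

hist(prob, breaks = 20, main = "Predicted default probabilities",
     xlab = "Predicted probability")

# Confusion matrix at a given threshold (rows: actual, columns: predicted)
confusion <- function(actual, prob, threshold) {
  predicted <- factor(as.integer(prob >= threshold), levels = c(0, 1))
  table(actual = factor(actual, levels = c(0, 1)), predicted = predicted)
}

cm_50 <- confusion(test$default, prob, 0.5)
cm_30 <- confusion(test$default, prob, 0.3)
print(cm_50)
print(cm_30)
```

Lowering the threshold from 0.5 to 0.3 can only increase the number of firms flagged as likely defaulters, which catches more true defaults at the cost of more false alarms. Which trade-off is preferable depends on the relative economic cost of the two errors, for example a missed default versus a rejected good borrower.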
3.2.4 Model Comparison & Selection (6 points)
- Compare model performance using test set metrics.
- Identify which model performs better and explain why (e.g., “Random forest captures non‑linear relationships better”).
- Discuss the bias‑variance tradeoff for your models. Which model has lower bias? Which has lower variance? How does this affect generalization to new data?
3.2.5 Cross‑Validation (6 points)
- Perform k‑fold cross‑validation (5‑fold or 10‑fold) on your best model.
- Report the average cross‑validated performance.
- Compare the cross‑validated performance with the test set performance. Are they similar? If not, what might explain the difference (e.g., overfitting, data leakage)?
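One way to carry out these steps by hand is sketched here in base R, using a linear regression on simulated data. The variable names are hypothetical, and running the folds on the training set only is an assumption about the intended workflow. The `rsample` package offers `vfold_cv()` as a ready-made alternative for creating the folds.

```r
set.seed(465)

# Simulated data (assumption: replace with your own dataset and best model)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 1.5 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

train_idx <- sample(n, size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# Assign each training row to one of k folds at random
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(train)))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
cv_rmse <- sapply(1:k, function(f) {
  fit  <- lm(y ~ x1 + x2, data = train[fold_id != f, ])
  held <- train[fold_id == f, ]
  rmse(held$y, predict(fit, newdata = held))
})

# Test-set performance of the model refit on the full training set
final_fit <- lm(y ~ x1 + x2, data = train)
test_rmse <- rmse(test$y, predict(final_fit, newdata = test))

print(c(cv_mean = mean(cv_rmse), cv_sd = sd(cv_rmse), test = test_rmse))
```

If the cross-validated average and the test-set value are close, the model's performance estimate is probably stable. A test error far above the cross-validated error may point to overfitting or an unrepresentative split; a test error far below it is a reason to check for data leakage.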
What to submit (as a Quarto document):
- Train/test split description
- Model specifications and training code for each model
- Prediction results and evaluation metrics
- Model comparison table
- Cross‑validation results and interpretation
- For classification: predicted probability histogram and threshold justification
4 Stage 3: Final Analysis & Presentation (Weeks 13–14)
4.1 Deliverable: Final Report (25 points) + Presentation (15 points)
Final Report Due: 3rd of June at midnight
Presentations: 4th of June in class
4.2 Tasks
4.2.1 Complete Analysis Pipeline (6 points)
- Combine all work from Stages 1–2 into a single, reproducible Quarto document.
- Ensure all code runs without errors (run `render()` to verify).
- Use a clear narrative: Question → Data → Probability Analysis → Modeling → Results → Conclusion.
4.2.2 Economic Interpretation (6 points)
- Answer your original economic question based on your model results.
- Interpret the coefficients (if using linear/logistic regression) or feature importance (if using tree‑based methods).
- Discuss how your findings could inform economic policy, business decisions, or future research.
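For logistic regression, coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are usually easier to interpret. The sketch here illustrates this with simulated data in base R; the `debt_ratio` predictor and the data-generating values are hypothetical.

```r
set.seed(465)

# Simulated data (assumption: replace with your own fitted model)
n <- 1000
d <- data.frame(debt_ratio = runif(n))
d$default <- rbinom(n, size = 1, prob = plogis(-2 + 3 * d$debt_ratio))

fit <- glm(default ~ debt_ratio, data = d, family = binomial)

# Estimates, standard errors, and p-values on the log-odds scale
print(coef(summary(fit)))

# Odds ratios: multiplicative change in the odds for a one-unit increase
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# debt_ratio runs from 0 to 1, so a 0.1 (10 percentage point) step is more meaningful
or_per_10pp <- exp(0.1 * coef(fit)[["debt_ratio"]])
print(or_per_10pp)
```

A value of `or_per_10pp` above 1 means that a 10 percentage point higher debt ratio is associated with proportionally higher odds of default. In a linear regression the coefficient is read directly as the expected change in the outcome per unit of the predictor, holding the other predictors fixed. In both cases these are associations in the sample, not causal effects.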
4.2.3 Limitations & Replication (4 points)
- Describe at least two limitations of your analysis (e.g., data quality, omitted variables, small sample size, generalizability outside your sample).
- Explain what steps make your analysis reproducible (e.g., relative paths, `set.seed()`, documented environment).
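The three reproducibility practices named above can be demonstrated in a few lines of base R. The file name `your_dataset.csv` is a placeholder.

```r
# 1. Same seed, same random draws
set.seed(465)
draw_a <- sample(1:100, 5)
set.seed(465)
draw_b <- sample(1:100, 5)
print(identical(draw_a, draw_b))

# 2. Relative path: resolved from the project folder, not from one machine's drive
data_path <- file.path("data", "your_dataset.csv")
print(data_path)

# 3. Documented environment: R version and loaded package versions
print(sessionInfo())
```

Including the `sessionInfo()` output at the end of the Quarto document lets a reader see which R and package versions produced the results.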
4.2.4 AI Use Log (4 points)
- Document any AI tools used (ChatGPT, Copilot, etc.) in an AI Use Log.
- For each AI interaction, state: the prompt given, how you used the output, and how you verified or modified it.
4.2.5 Final Suggestions (5 points)
- Suggest one improvement you would make if you had more time or better data.
- Pose one new economic question inspired by your analysis that you would like to investigate further.
4.2.6 Presentation (15 points – Week 14)
- 10‑minute presentation summarizing your project (8 minutes speaking, 2 minutes Q&A).
- Include: economic question, dataset description, probability analysis highlights (histograms, distribution), model comparison (which model won and why), main findings, limitations.
- Present in class (or submit a recorded video if remote).
What to submit (Week 13):
- A single, self‑contained Quarto document (`.qmd`) with all code and narrative.
- The rendered output (`.html` or `.pdf`).
- RPubs link (if published).
What to prepare for Week 14:
- Presentation slides (Quarto revealjs, PowerPoint, or PDF).
- A brief summary (2‑3 slides) highlighting your main economic insight.
5 Grading Summary
| Stage | Component | Points |
|---|---|---|
| Stage 1 | Data Proposal & Probability Analysis | 25 |
| Stage 2 | Model Report | 35 |
| Stage 3 | Final Report | 25 |
| Stage 3 | Presentation | 15 |
| Total | | 100 |
6 Important Notes
- Reproducibility is mandatory. All file paths must be relative (e.g., `data/your_dataset.csv`). Set a seed (`set.seed(465)`) for all random processes.
- No reused class datasets. Find your own data. If you are unsure about a dataset, ask the instructor.
- AI assistance is allowed but must be documented. Include an AI Use Log in your final report.
- Late submission: 10% deduction per day.
7 Example Project Ideas
| Economic Question | Dataset Source | Target Variable | Model Type | Probability Focus |
|---|---|---|---|---|
| What predicts housing prices in İzmir? | Real estate websites | Price (TL) | Linear regression | Distribution of prices (log‑normal) |
| Can we classify Turkish provinces by unemployment risk? | TÜİK, World Bank | High/low unemployment (binary) | Logistic regression | Distribution of unemployment rates |
| Which firms are likely to default on loans? | Public company data (Borsa İstanbul) | Default (binary) | Random forest | Distribution of financial ratios |
| How does life expectancy relate to health spending? | World Bank | Life expectancy | Linear regression | Distribution of life expectancy |
| What predicts consumer credit risk? | Kaggle credit data | Default (binary) | Logistic regression | Distribution of credit scores |
8 Weekly Timeline Summary
| Week | Stage | Focus |
|---|---|---|
| Week 10 | Stage 1 | Find data, import, clean, probability distributions |
| Weeks 11–12 | Stage 2 | Predictive modeling (2+ models), cross‑validation |
| Weeks 13–14 | Stage 3 | Final report (combine all stages) |
| Week 14 | Stage 3 | Presentations |