ECON 465: Introduction to Data Science with Applications to Economics

Gül Ertan Özgüzer

2026-02-18

ECON 465 Spring 2025-2026

Week 1: From Economic Questions to Data-Driven Answers

Instructor: Gül Ertan Özgüzer

What is Data Science?

Hacking Skills (Computation & Programming)
Math & Statistics Knowledge (Theory)
Substantive Expertise (Domain Knowledge)

The Intersection is Data Science

The application of computational and statistical methods to extract knowledge and insight from data within a specific domain.

The Output: Actionable insight, prediction, or understanding that informs decisions.

Economics as a Natural Domain for Data Science

Economics has always been data-driven, focused on:

Understanding scarce resource allocation
Modeling human decision-making (micro) and systemic outcomes (macro)
Evaluating the causal effect of policies and events

The Traditional Economist’s Toolkit

Theory and mathematical models
Econometrics—a subfield of statistics focused on causal identification from (often limited) observational data

The Modern Shift

The digital era provides unprecedented data volume and variety (transactions, text, satellite imagery, social media)
This requires new tools to complement traditional econometrics
Data science bridges the gap between abundant data and economic insight

The Two Mindsets in Economic Data Analysis

Feature	Causal Inference	Prediction
Goal	Identify the true effect of X on Y	Forecast Y as accurately as possible using the X’s
Focus	Unbiased estimators, addressing confounding	Out-of-sample accuracy, model flexibility, many predictors
Key Question	“What is the mechanism? Is the effect real?”	“How accurate is the forecast?”
Example	“What is the effect of X on Y?”	“Predict Y.”

This Course: You will learn the tools for both, understanding when and why to use each paradigm.
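
The prediction mindset judges a model by how well it forecasts data it has not seen. The following is a minimal sketch in Python (standard library only), assuming simulated data with a made-up relationship; the helper names `fit_line` and `mse` are illustrative and not part of any library.

```python
import random

random.seed(465)

# Simulated data (an assumption for illustration): y = 2 + 0.5*x + noise
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2 + 0.5 * x + random.gauss(0, 1) for x in xs]

# Hold out the last 50 observations; the model never sees them while fitting.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(x, y, intercept, slope):
    """Mean squared prediction error of the fitted line on (x, y)."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)) / len(x)

intercept, slope = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, intercept, slope)
test_mse = mse(test_x, test_y, intercept, slope)

# Benchmark: always predict the training-sample mean of y.
mean_y = sum(train_y) / len(train_y)
baseline_mse = sum((b - mean_y) ** 2 for b in test_y) / len(test_y)

print(f"train MSE: {train_mse:.2f}")
print(f"test MSE: {test_mse:.2f} (baseline: {baseline_mse:.2f})")
```

The number that matters for prediction is the test MSE, not the fit on the training data. Nothing in this exercise says whether changing x would change y; that is the causal question.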

The Core Challenge – Correlation is not Causation

A Classic Economic Puzzle:

Observed Fact (Correlation): Studies find that, on average, young men who served in the Vietnam War earned lower wages post-service than non-veterans of the same age.
The Naive (and Likely Wrong) Conclusion (Causation): “War service causes lower lifetime earnings.”

Why This Conclusion is Problematic

The Selection Problem: Veterans and non-veterans are not randomly assigned. Men who were drafted or enlisted may have been systematically different before the war.

Potential Confounders (Alternative Explanations):

Lost Labor Market Experience: Missing crucial early-career experience and training.
Disrupted Education: Military service interrupting college attendance.
Selection Bias (The Key Issue): Men from lower-income backgrounds were more likely to be drafted, and a lower-income background is itself associated with lower future earnings.

So, What is the True Causal Effect?

Is the wage gap caused by the experience of war itself?
Or is it simply due to pre-existing differences between the two groups?

Disentangling these is the fundamental challenge of causal inference in economics.
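
The selection problem can be illustrated with a small simulation, sketched here in Python with the standard library only. Every number in it is invented for illustration (the assumed effect `TRUE_EFFECT`, the service probabilities, the earnings scale); none is an estimate from data on Vietnam veterans. It compares the veteran/non-veteran earnings gap when service depends on family background with the gap when service is assigned by a hypothetical lottery.

```python
import random

random.seed(1)

TRUE_EFFECT = -1.0   # assumed causal effect of service on earnings (arbitrary units)
N = 20000            # simulated men per scenario

def simulate(randomized):
    """Return mean earnings of veterans minus mean earnings of non-veterans."""
    vets, nonvets = [], []
    for _ in range(N):
        low_income = random.random() < 0.5           # pre-existing background
        if randomized:
            p_serve = 0.3                            # lottery: same chance for everyone
        else:
            p_serve = 0.5 if low_income else 0.1     # selection: background drives service
        served = random.random() < p_serve
        # Background lowers earnings on its own, whether or not the man served.
        earnings = 10 - 3 * low_income + TRUE_EFFECT * served + random.gauss(0, 1)
        (vets if served else nonvets).append(earnings)
    return sum(vets) / len(vets) - sum(nonvets) / len(nonvets)

naive_gap = simulate(randomized=False)    # observational comparison
lottery_gap = simulate(randomized=True)   # comparison under random assignment

print(f"assumed true effect: {TRUE_EFFECT:.2f}")
print(f"gap with selection:  {naive_gap:.2f}")
print(f"gap with lottery:    {lottery_gap:.2f}")
```

Under these assumptions the gap with selection is noticeably more negative than the assumed effect, because veterans come disproportionately from the low-income group. The lottery gap lands close to the assumed effect, because random assignment balances background across the two groups.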

A Parallel Example – Elite Education

Observed Fact: Graduates of Ivy League universities earn significantly more than graduates of state universities.

Naive Conclusion: Ivy League education causes higher earnings.

Confounder (Selection)

Ivy League students are highly selected. Before college, they likely had:

- Higher SAT scores
- Stronger family connections
- Greater socioeconomic advantages

All these factors independently predict higher future income.

The Causal Question: How much of the earnings premium is the effect of the school itself, versus the effect of the pre-existing attributes of its students?
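
One standard way to write this question down uses potential-outcomes notation. The notation is an assumption here, since it does not appear in the slides: let D_i = 1 if student i attends an Ivy League school and 0 otherwise, and let Y_i(1) and Y_i(0) be that student's earnings with and without attending. The observed earnings gap then splits into two pieces:

```latex
\begin{align*}
\underbrace{E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]}_{\text{observed earnings gap}}
&= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{causal effect on attendees}} \\
&\quad + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
\end{align*}
```

The selection-bias term is positive when Ivy League students would have earned more than state-university students even without attending, in which case the observed gap overstates the effect of the school itself.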

The Data Science & Econometrics Mission

To use data, theory, and clever research designs to move beyond correlation and credibly answer causal questions.

A Machine Learning Example – Finding Hidden Economic Structures

The Puzzle:

International organizations like the World Bank classify countries into income groups (low, middle, high).

But are these categories sufficient for understanding economic development?

Observed Data: We have 50 countries, each with 10 economic indicators:

GDP per capita
Life expectancy
Education spending (% of GDP)
Income inequality (Gini coefficient)
Inflation rate
Unemployment rate
Industrial output share
Agricultural output share
Trade openness
Foreign direct investment

The Challenge:

With 10 different indicators, it’s impossible to visually identify which countries are similar. Each country is a point in 10-dimensional space—our human brains can’t visualize this.

The Economic Question: Are there natural groupings of countries that share similar economic structures, independent of our pre-existing labels?

The Machine Learning Solution: k-means Clustering

An unsupervised learning algorithm that:

Automatically discovers hidden patterns in high-dimensional data
Groups countries based on similarity across ALL indicators simultaneously
Takes the number of clusters, k, as an input chosen by the analyst (it does not find a “natural” number of clusters on its own)

What the Algorithm Does:

Think of it like this: The algorithm places k “centroids” in the 10-dimensional space, then assigns each country to the nearest centroid, then moves the centroids to the center of their assigned countries, and repeats until the groups stabilize.
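
The loop described above can be sketched in a few lines of Python (standard library only). This is a minimal illustration, not a production implementation: the data are six made-up countries with two indicators instead of ten, and the names `standardize`, `sq_dist`, `kmeans`, and the optional `init` argument are hypothetical helpers. The indicators are standardized first so that an indicator measured in large units does not dominate the distances.

```python
import random
from statistics import mean, pstdev

def standardize(rows):
    """Rescale each indicator (column) to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (cluster label per point, list of centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen points unless starting centroids are supplied.
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:   # nobody switched groups: stop
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # an empty cluster keeps its old centroid
                centroids[j] = tuple(mean(col) for col in zip(*members))
    return assignment, centroids

# Made-up data: (GDP per capita in $1000s, agricultural share of output in %)
raw = [(60, 2), (55, 3), (58, 1), (2, 35), (3, 40), (1, 30)]
points = standardize(raw)
labels, centers = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning; only which countries share a label matters. Because the starting centroids are random, different seeds can give different groupings on less clearly separated data, so in practice the algorithm is usually run from several starting points.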

Step by step, with k = 4:

1. Pick 4 random countries as temporary “group centers.”
2. Assign every other country to the center it’s most similar to. Now you have 4 loose groups.
3. Find the true center of each group: the imaginary country that would be the “average” of all members. Move the center there.
4. Reassign countries to the nearest center. Some may switch groups.
5. Repeat steps 3-4 until nobody switches groups anymore.

It’s like repeatedly asking: “Is this still the best group for you? If not, where should you really be?”

The magic: The algorithm finds natural groupings without anyone telling it what to look for.

The Result:

The algorithm might discover something surprising:

Cluster	Characteristics	Example Countries
1	High GDP, high life expectancy, low inequality, high education spending	Norway, Switzerland, Canada
2	Rapidly growing, moderate inequality, industrializing	Turkey, Brazil, Malaysia
3	Resource-dependent, high inequality, moderate GDP	Saudi Arabia, Chile, Russia
4	Low GDP, low life expectancy, agricultural-based	Malawi, Nepal, Haiti

What We Learned:

The algorithm found meaningful economic structures without being told what to look for
These clusters don’t perfectly match traditional income classifications—suggesting development is multidimensional
This knowledge helps:
- Target development aid more effectively
- Identify peer countries for policy comparison
- Understand which economic structures tend to co-occur

The Key Difference from Our Causality Example:

Feature	Causality Example	ML Example
Goal	Find effect of X on Y	Discover hidden patterns
Method	Instrumental variables	k-means clustering
Output	“War causes 15% wage loss”	“There are 4 distinct development types”
Question	“Why?”	“What’s here?”

The Data Science Workflow in Economics

A cyclical, iterative process:

Question & Design: Start with an economic question.
Data Acquisition: Collect data from APIs, files, databases.
Data Wrangling: Clean, transform, and structure data.
Exploration & Visualization (EDA): Understand patterns.
Modeling & Inference: Apply statistical or ML models.
Communication & Reproducibility: Present results clearly.
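
The steps above can be traced in a toy example, sketched here in Python with the standard library only. The inline CSV stands in for a real data source, and every value in it is invented for illustration; the column names are assumptions as well.

```python
import csv
import io
import math
from statistics import mean

# 1. Question: is higher income associated with longer lives?

# 2. Data acquisition: an inline CSV stands in for a file, API, or database.
RAW = """country,gdp_per_capita,life_expectancy
A,1200,58.0
B,4500,67.5
C,,70.1
D,15000,74.0
E,42000,81.2
"""
rows = list(csv.DictReader(io.StringIO(RAW)))

# 3. Data wrangling: drop rows with missing values and convert text to numbers.
clean = [
    {"country": r["country"],
     "gdp": float(r["gdp_per_capita"]),
     "life": float(r["life_expectancy"])}
    for r in rows
    if r["gdp_per_capita"] and r["life_expectancy"]
]

# 4. Exploration: simple summaries before any modeling.
kept = [r["country"] for r in clean]
print("countries kept:", kept)
print("mean life expectancy:", round(mean(r["life"] for r in clean), 1))

# 5. Modeling: least-squares slope of life expectancy on log GDP per capita.
x = [math.log(r["gdp"]) for r in clean]
y = [r["life"] for r in clean]
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# 6. Communication: state the result in plain language, as an association only.
print(f"In this toy sample, 1% higher GDP per capita goes with "
      f"about {slope / 100:.3f} more years of life expectancy.")
```

In a real project each step feeds back into the others: a surprise during exploration often sends you back to wrangling, or even to a sharper question.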

Why R for Economics and Data Science?

Born for Statistics, Built for Interactivity

Created by statisticians for data analysis.
Test ideas quickly, visualize results instantly, and iterate.

Free, Open Source, and Cross-Platform

Zero cost; runs identically on Windows, Mac, and Linux.

A Thriving Ecosystem

Early access to the latest statistical and machine learning methodologies.
Large, active, and welcoming community.

The tidyverse for Clear, Reproducible Code

Provides a consistent “grammar” for data manipulation and visualization.
Clarity is essential for reproducible research.

Seamless Integration with Quarto

R works hand-in-hand with Quarto.
Weave together narrative, code, and output into a single, dynamic document.
Makes every result traceable to the code that produced it, which is the basis of transparency in economic research.