ECON 465 – Week 5 Lab: Probability, Sampling & Simulation in Economics

Author

Gül Ertan Özgüzer

Lab Objectives

By the end of this lab, you will be able to:

Understand the difference between population and sample
Use simulation to demonstrate the Law of Large Numbers
Generate and visualize probability distributions (normal, binomial)
Understand sampling distributions and the Central Limit Theorem
Calculate and interpret confidence intervals for economic estimates
Apply these concepts to real economic data (unemployment, income)

The Economic Question

How confident can we be in estimates from survey data? When the Turkish Statistical Institute (TÜİK) reports that the unemployment rate is 10.5%, what does that number really mean? How much would it change if we surveyed a different set of households? In this lab, we use simulation to understand sampling variability – the foundation of statistical inference in economics.

Datasets for This Lab

We will use:

Simulated data (created in R) to demonstrate concepts
The gapminder dataset (from the dslabs package) for real economic applications
The infer package for modern statistical inference

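# Load required packages
library(tidyverse)  # data manipulation and plotting (dplyr, ggplot2, tidyr)
library(dslabs)     # provides the gapminder dataset
library(infer)      # tidy statistical inference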

# Set seed for reproducibility (so everyone gets same "random" results)
set.seed(465)

1 Population vs. Sample – The Fundamental Distinction

1.1 What Is a Population?

In economics, the population is the entire group we want to understand:

All households in Turkey
All firms in the manufacturing sector
All citizens eligible to vote

Problem: We almost never have data on the entire population. Census data is rare and expensive.

1.2 What Is a Sample?

A sample is a subset of the population that we actually observe. Survey data (like TÜİK’s labor force survey) is a sample.

Key insight: A sample is almost never exactly like the population. The gap between a sample estimate and the true population value is called sampling error.

1.3 Simulating a Population

Let’s create a simple population: suppose we have a country with 100,000 workers. The true unemployment rate is 12%.

# Create a population of workers
population_size <- 100000
true_unemployment <- 0.12  # 12% unemployment

# Generate population: 1 = employed, 0 = unemployed
population <- data.frame(
  worker_id = 1:population_size,
  employed = rbinom(population_size, 1, 1 - true_unemployment)
)

# Check the true unemployment rate
true_rate <- mean(population$employed == 0)
true_rate

[1] 0.11987

Explanation:

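rbinom(n, size, prob) draws n random counts of successes out of size trials; with size = 1, each draw is a single 0/1 outcome
Here, each worker has an 88% chance of being employed (1) and a 12% chance of being unemployed (0)
mean(population$employed == 0) computes the proportion of zeros, i.e., the unemployment rate in the simulated population, which is close to but not exactly 12% because the population itself was drawn at random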
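
To see what the size argument changes, the short sketch below (not part of the original lab; the object names are illustrative) calls rbinom() twice with the same probability: once with size = 1 and once with size = 10.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# size = 1: each draw is a single 0/1 trial (a Bernoulli draw)
single_trials <- rbinom(5, size = 1, prob = 0.88)

# size = 10: each draw is the number of successes in 10 trials
counts_of_ten <- rbinom(5, size = 10, prob = 0.88)

single_trials
counts_of_ten
```

The first vector contains only zeros and ones, which is the form used for the employed column; the second contains counts between 0 and 10.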

1.4 Taking a Sample

Now we act like surveyors: we can’t survey all 100,000 workers. We take a random sample of 1,000.

# Take a random sample of 1,000 workers
sample_size <- 1000
sample_data <- population |>
  slice_sample(n = sample_size)

# Calculate sample unemployment rate
sample_rate <- mean(sample_data$employed == 0)
sample_rate

[1] 0.113

Compare: The sample rate is close to 12% but not exactly equal. If we take another sample, we’ll get a slightly different number. This is sampling variability.
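
One way to see sampling variability directly is to repeat the survey a few times. The sketch below is a self-contained illustration in base R: it rebuilds a comparable population rather than reusing the objects above, and the names employed and five_rates are assumptions introduced here.

```r
set.seed(465)

# Rebuild a comparable population so this sketch runs on its own
# (1 = employed, 0 = unemployed, as in the lab)
employed <- rbinom(100000, size = 1, prob = 0.88)

# Draw five independent samples of 1,000 workers and compute each sample's rate
five_rates <- replicate(5, {
  one_sample <- sample(employed, size = 1000)  # sampling without replacement
  mean(one_sample == 0)
})

five_rates          # five slightly different estimates
range(five_rates)   # smallest and largest of the five
```

Every run of the survey gives a somewhat different answer even though the population never changes.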

2 The Law of Large Numbers

2.1 What Is the Law of Large Numbers?

As the sample size increases, the sample average converges to the population average: large deviations from the true value become increasingly unlikely.

Let’s demonstrate by taking samples of increasing size and plotting the results.

# Loop over increasing sample sizes and compute the unemployment rate for each
sample_sizes <- seq(10, 5000, by = 50)
sample_rates <- numeric(length(sample_sizes))

for (i in seq_along(sample_sizes)) {
  sample_temp <- population |>
    slice_sample(n = sample_sizes[i])
  sample_rates[i] <- mean(sample_temp$employed == 0)
}

# Create data frame for plotting
lln_data <- data.frame(
  sample_size = sample_sizes,
  unemployment_rate = sample_rates
)

# Plot
ggplot(lln_data, aes(x = sample_size, y = unemployment_rate)) +
  geom_line(color = "steelblue", linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4
  geom_hline(yintercept = true_unemployment, color = "red", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Law of Large Numbers: Sample Estimates Converge to True Value",
    subtitle = "Red line: true unemployment rate (12%)",
    x = "Sample Size",
    y = "Estimated Unemployment Rate"
  ) +
  theme_minimal()

Interpretation: With very small samples (n = 10 or 60), estimates vary widely. As the sample size grows, estimates stabilize around the true 12%. This is why TÜİK surveys 50,000+ households: large samples give reliable estimates.
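
The stabilizing pattern has a simple quantitative form. For a proportion p estimated from n independent draws, the standard deviation of the sample proportion is sqrt(p(1 - p) / n). The sketch below tabulates this for a few illustrative sample sizes; it assumes independent sampling and ignores the small finite-population correction.

```r
p <- 0.12                        # true unemployment rate used in the lab
n <- c(10, 100, 1000, 10000)     # illustrative sample sizes

# Standard deviation of the sample proportion under independent sampling
se <- sqrt(p * (1 - p) / n)

data.frame(sample_size = n, standard_error = round(se, 4))
```

Each 100-fold increase in n cuts the standard error by a factor of 10, so precision improves with the square root of the sample size rather than in proportion to it.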

3 Probability Distributions in Economics

3.1 The Normal Distribution

Many variables (heights, test scores, income on a log scale) are approximately normally distributed.

# Generate data from a normal distribution
normal_data <- data.frame(
  value = rnorm(10000, mean = 0, sd = 1)
)

# Plot histogram with normal curve
ggplot(normal_data, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill = "steelblue", color = "white") +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", linewidth = 1) +
  labs(
    title = "Standard Normal Distribution",
    subtitle = "Mean = 0, Standard Deviation = 1",
    x = "Value",
    y = "Density"
  ) +
  theme_minimal()

Economic application: Log income, log GDP, and many economic shocks are often modeled as approximately normal.

3.2 The Binomial Distribution

The binomial distribution models the number of “successes” in a fixed number of trials. Example: number of unemployed workers in a sample of 1,000.

# Simulate 100 surveys, each with 1,000 workers
binomial_data <- data.frame(
  unemployed = rbinom(100, size = 1000, prob = true_unemployment)
)

# Plot
ggplot(binomial_data, aes(x = unemployed)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(
    title = "Binomial Distribution: Number of Unemployed in 1,000-Worker Samples",
    subtitle = "True unemployment rate = 12%",
    x = "Number of Unemployed Workers",
    y = "Number of Surveys"
  ) +
  theme_minimal()
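
The histogram can be checked against theory. A Binomial(n, p) count has mean n × p and standard deviation sqrt(n × p × (1 - p)). The following self-contained sketch compares those values with a larger simulation; the 10,000 replications and the 100 to 140 range are illustrative choices, not part of the lab.

```r
set.seed(465)
n_workers <- 1000
p <- 0.12

# Theoretical mean and standard deviation of a Binomial(n, p) count
theoretical_mean <- n_workers * p
theoretical_sd   <- sqrt(n_workers * p * (1 - p))

# Many simulated surveys (more than the 100 above, to reduce noise)
counts <- rbinom(10000, size = n_workers, prob = p)

c(theory = theoretical_mean, simulated = mean(counts))
c(theory = theoretical_sd,   simulated = sd(counts))

# Probability that a survey finds between 100 and 140 unemployed workers (inclusive)
pbinom(140, n_workers, p) - pbinom(99, n_workers, p)
```

The theoretical mean is 120 unemployed workers with a standard deviation of a little over 10, which matches the center and width of the histogram.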

4 Sampling Distributions

4.1 What Is a Sampling Distribution?

A sampling distribution is the distribution of a statistic (like the sample mean) across many samples. It tells us how much the estimate varies due to random sampling.

4.2 Simulating the Sampling Distribution of the Mean

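Let’s take 1,000 different samples of 500 workers each and record the unemployment rate each time. A proportion is the mean of a 0/1 variable, so this is a sampling distribution of a mean.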

# Simulate 1,000 samples, each of size 500
n_samples <- 1000
sample_rates_sampling <- numeric(n_samples)

for (i in 1:n_samples) {
  sample_temp <- population |>
    slice_sample(n = 500)
  sample_rates_sampling[i] <- mean(sample_temp$employed == 0)
}

# Create data frame
sampling_dist <- data.frame(unemployment_rate = sample_rates_sampling)

# Plot sampling distribution
ggplot(sampling_dist, aes(x = unemployment_rate)) +
  geom_histogram(binwidth = 0.005, fill = "steelblue", color = "white") +
  geom_vline(xintercept = true_unemployment, color = "red", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Sampling Distribution of the Unemployment Rate",
    subtitle = "1,000 samples, each of size 500. Red line: true population rate (12%)",
    x = "Estimated Unemployment Rate",
    y = "Number of Samples"
  ) +
  theme_minimal()

X‑axis: Each bar represents a range of estimated unemployment rates obtained from the 1,000 random samples.

Y‑axis: The number of samples (out of 1,000) that fell into each bin.

Key features:

Center: The histogram is centered around the true population unemployment rate (12%), shown by the red dashed line. This means that on average, the sample estimates are unbiased – they are neither systematically too high nor too low.
Shape: The distribution is approximately bell-shaped (normal). This is a consequence of the Central Limit Theorem: for large samples (here n = 500), the sampling distribution of the sample proportion is nearly normal, even though each individual observation is just a 0 or a 1.
Spread (variability): The width of the histogram reflects how much the estimate varies from sample to sample. The standard deviation of these 1,000 rates is the empirical standard error; theory predicts sqrt(0.12 × 0.88 / 500) ≈ 0.0145, or about 1.5 percentage points.
Implications for real-world inference: If TÜİK surveyed a single random sample of 500 workers, the estimated unemployment rate would fall between roughly 9% and 15% about 95% of the time (the true rate plus or minus about two standard errors).

The histogram shows the sampling variability – the natural random error that comes from surveying only a subset of the population.
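
The spread described above can be compared with the textbook formula for the standard error of a proportion, sqrt(p(1 - p) / n). The sketch below is self-contained and makes one simplifying assumption: it draws independent samples with rbinom() instead of slicing rows from the 100,000-worker population, which gives nearly the same result because each sample is a tiny fraction of the population.

```r
set.seed(465)
p <- 0.12
n <- 500

# Simulate 1,000 survey estimates. Assumption: independent draws via rbinom(),
# which closely approximates sampling 500 rows from a population of 100,000.
rates <- rbinom(1000, size = n, prob = p) / n

empirical_se   <- sd(rates)
theoretical_se <- sqrt(p * (1 - p) / n)

c(empirical = empirical_se, theoretical = theoretical_se)

# Interval containing the central 95% of simulated estimates
quantile(rates, c(0.025, 0.975))

# Normal-approximation counterpart: p plus or minus 1.96 standard errors
p + c(-1.96, 1.96) * theoretical_se
```

The two intervals should roughly agree, which is the practical content of the normal approximation.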

Take‑home message:

The sampling distribution tells us how much trust we can place in a single sample estimate. Because the sampling distribution is approximately normal and its spread is predictable, we can construct confidence intervals that quantify the uncertainty. This is the foundation of inferential statistics in economics.

5 The Central Limit Theorem

5.1 What Is the Central Limit Theorem?

The Central Limit Theorem (CLT) states that for sufficiently large sample sizes, the sampling distribution of the sample mean is approximately normal, regardless of the shape of the population distribution (provided the population has finite variance).

Let’s demonstrate with a non-normal population: income, which is highly skewed.

# Create a skewed population (exponential distribution)
skewed_population <- data.frame(
  income = rexp(100000, rate = 1/50000)  # mean = 50,000 TL
)

# Look at the population distribution
ggplot(skewed_population, aes(x = income)) +
  geom_histogram(bins = 100, fill = "steelblue", color = "white") +
  labs(
    title = "Population Distribution: Highly Skewed (Exponential)",
    subtitle = "Mean = 50,000 TL, but many low incomes and a long tail of high incomes",
    x = "Income (TL)",
    y = "Count"
  ) +
  scale_x_continuous(limits = c(0, 300000)) +
  theme_minimal()

Now, take many samples of size 30 and plot the distribution of sample means.

# Sample means from the skewed population
sample_means <- numeric(1000)

for (i in 1:1000) {
  sample_temp <- skewed_population |>
    slice_sample(n = 30)
  sample_means[i] <- mean(sample_temp$income)
}

# Plot sampling distribution of the mean
clt_data <- data.frame(sample_mean = sample_means)

ggplot(clt_data, aes(x = sample_mean)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  labs(
    title = "Central Limit Theorem: Sampling Distribution of the Mean (n=30)",
    subtitle = "Even though the population is highly skewed, sample means are approximately normal",
    x = "Sample Mean Income (TL)",
    y = "Number of Samples"
  ) +
  theme_minimal()

Key insight: This is why we can use normal-based confidence intervals even when the underlying data is not normal – as long as the sample size is large enough.
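
How large is "large enough" depends on how skewed the population is. As a rough check, the sketch below measures the skewness of the sample means for three illustrative sample sizes. It is a self-contained illustration: the helper functions skewness() and skew_of_means() are defined here, and samples are drawn directly from rexp() rather than from the skewed_population data frame.

```r
set.seed(465)

# Sample skewness (third standardized moment), written out to avoid extra packages
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Skewness of 2,000 sample means for a given sample size n,
# drawing from an exponential distribution with mean 50,000
skew_of_means <- function(n, reps = 2000) {
  means <- replicate(reps, mean(rexp(n, rate = 1 / 50000)))
  skewness(means)
}

sizes <- c(5, 30, 200)
result <- sapply(sizes, skew_of_means)
names(result) <- paste0("n=", sizes)
round(result, 2)
```

For an exponential population the theoretical skewness of the sample mean is 2 / sqrt(n), so some right skew remains at n = 30 and fades as n grows. A perfectly normal distribution has skewness 0.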

6 Confidence Intervals

6.1 What Is a Confidence Interval?

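A confidence interval provides a range of plausible values for a population parameter. For example: “We are 95% confident that the true unemployment rate is between 10.2% and 10.8%.” The 95% refers to the procedure: if we repeated the survey many times and built an interval each time, about 95% of those intervals would contain the true value.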
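
The meaning of "95%" can be checked by simulation, because in a simulation the true value is known. The sketch below is a self-contained illustration using the lab's 12% unemployment rate: it builds a normal-approximation (Wald) interval for each of 2,000 simulated surveys and counts how often the interval contains the truth. The number of replications and the variable names are illustrative choices.

```r
set.seed(465)
p_true <- 0.12
n <- 500
reps <- 2000

# One simulated survey estimate per replication
p_hat <- rbinom(reps, size = n, prob = p_true) / n

# Normal-approximation (Wald) 95% interval for each survey
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
lower <- p_hat - 1.96 * se_hat
upper <- p_hat + 1.96 * se_hat

# Share of intervals that contain the true rate
coverage <- mean(lower <= p_true & p_true <= upper)
coverage
```

The coverage should come out close to 0.95. Any single interval either contains the true rate or does not; the 95% describes the long-run success rate of the method.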

6.2 Calculating Confidence Intervals from the Gapminder Dataset

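Let’s use real data: what is the average life expectancy across European countries? Each observation is one country, and we’ll treat the 2012 values as if they were a random sample from a larger population.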

# Get European countries in 2012
europe_2012 <- gapminder |>
  filter(year == 2012, continent == "Europe") |>
  drop_na(life_expectancy)

# Calculate sample statistics
sample_mean <- mean(europe_2012$life_expectancy)
sample_sd <- sd(europe_2012$life_expectancy)
n <- nrow(europe_2012)

# Standard error
se <- sample_sd / sqrt(n)

# 95% confidence interval (using normal approximation)
margin_error <- 1.96 * se
ci_lower <- sample_mean - margin_error
ci_upper <- sample_mean + margin_error

cat("Sample mean life expectancy:", round(sample_mean, 1), "years\n")

Sample mean life expectancy: 78.3 years

cat("95% Confidence Interval: [", round(ci_lower, 1), ",", round(ci_upper, 1), "]\n")

95% Confidence Interval: [ 77.2 , 79.4 ]

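Interpretation: If these countries can be treated as a random sample, we are 95% confident that the underlying mean of country-level life expectancy in Europe lies between these bounds. Note that this is an unweighted average across countries, not the average over all people living in Europe.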
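
The multiplier 1.96 is the 97.5th percentile of the standard normal distribution. With only a few dozen observations, a Student-t critical value is a common alternative and gives a slightly wider interval. The sketch below compares the two; the sample size of 40 is a hypothetical value chosen for illustration, not the number of countries in the data.

```r
# Critical value for a 95% interval under the normal approximation
z_crit <- qnorm(0.975)
z_crit   # approximately 1.96

# Student-t critical value for a hypothetical sample of 40 observations
n_obs  <- 40                      # illustrative value, not taken from the data
t_crit <- qt(0.975, df = n_obs - 1)
t_crit

# Ratio of interval widths: the t interval is this many times wider
t_crit / z_crit
```

To use the t version in the calculation above, one would replace 1.96 with qt(0.975, df = n - 1).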

6.3 Using the infer Package for Modern Inference

The infer package provides a consistent workflow for statistical inference.

# Calculate confidence interval using infer
europe_ci <- europe_2012 |>
  specify(response = life_expectancy) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean") |>
  get_confidence_interval(level = 0.95)

europe_ci

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1     77.1     79.3

Explanation:

specify(response = life_expectancy) tells infer which variable we’re studying
generate(reps = 1000, type = "bootstrap") creates 1,000 bootstrap samples, each drawn with replacement from the original data and of the same size
calculate(stat = "mean") computes the mean of each bootstrap sample
get_confidence_interval(level = 0.95) returns the middle 95% of those bootstrap means (the percentile method)
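
To make the bootstrap less of a black box, the following base-R sketch mirrors the same steps by hand. It is an illustration of the percentile bootstrap under stated assumptions, not infer's internal code, and the data vector x is simulated as a stand-in for a numeric column.

```r
set.seed(465)

# Hypothetical data standing in for a numeric column such as life_expectancy
x <- rnorm(40, mean = 78, sd = 4)

# generate(): resample the data with replacement, keeping the original size
# calculate(): compute the mean of each resample
boot_means <- replicate(1000, mean(sample(x, size = length(x), replace = TRUE)))

# get_confidence_interval(): take the 2.5th and 97.5th percentiles
boot_ci <- quantile(boot_means, c(0.025, 0.975))
boot_ci
```

Because the interval comes from resampling rather than a formula, it changes slightly from run to run unless a seed is set.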

6.4 Your Turn – Practice with Real Economic Data

Task: Confidence Interval for GDP per Capita

Using the gapminder dataset:

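Filter to Asian countries in 2012. If gdp is missing for that year, use the most recent year with data.
Compute GDP per capita as gdp / population (the dslabs gapminder data has no per-capita column), drop missing values, and calculate the sample mean and standard deviation.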
Compute the standard error.
Construct a 95% confidence interval.
Interpret the result in one sentence.

Glossary of Functions Used

Function	What it does
`rbinom(n, size, prob)`	Generates random binomial data
`rnorm(n, mean, sd)`	Generates random normal data
`rexp(n, rate)`	Generates random exponential data
`slice_sample(n = ...)`	Takes a random sample of n rows (without replacement by default)
`set.seed()`	Makes random numbers reproducible
`mean()`	Calculates average
`sd()`	Calculates standard deviation
`sqrt()`	Square root
`specify()`	Declares variables for inference
`generate()`	Creates bootstrap samples
`calculate()`	Computes statistics from samples
`get_confidence_interval()`	Calculates confidence interval

Summary: What We Learned Today

Population vs. Sample: We rarely have population data – we work with samples.
Law of Large Numbers: As the sample size grows, the sample average converges to the population average.
Probability Distributions: Normal and binomial distributions are building blocks.
Sampling Distribution: The distribution of a statistic across many samples.
Central Limit Theorem: The sampling distribution of the sample mean becomes approximately normal as sample size increases.
Confidence Intervals: A range of plausible values for a population parameter.

These concepts are the foundation of statistical inference in economics. When TÜİK reports an unemployment rate, we now understand: it’s an estimate, it has uncertainty, and confidence intervals tell us how precise that estimate is.