Lecture 4: Distributions

POLS3316, Instructor: Tom Hanna, Spring 2025, University of Houston

2026-02-09

Data Assumptions

Why do we use data?

  • Purpose: analyzing data for causal inference (to begin to make statements about cause and effect - inferring causes)
  • Complex and uncertain data requires that we make…

Assumptions about the data

  • Because the world is complex, to make sense of unknowns we make assumptions about data
  • The assumptions are useful approximations even when not precisely true
  • We still need to check that the real data does not seriously violate the assumptions

Data Assumptions: Random, Independent, and Identically Distributed

  • Randomness and independence matter as assumptions about data
  • Specifically, these are assumptions about the Data Generating Process or DGP
  • The Data Generating Process: the way the world produces the data

The Data Generating Process

  • The source of the data matters - the DGP matters
  • Previously stated: Data comes from a random world
  • So the DGP is random

Independence and Distribution

  • Events in the data are independent and identically distributed - the IID assumption

Statistical Independence

  • Independence is statistical independence - the outcome of one event does not affect our belief about the probability of another event

    -   If we draw a number from a hat and then flip a coin, the hat draw does not affect the outcome of the coin toss (a small simulation of this appears below)
    -   X does not affect Y - the outcome of X does not affect our belief about the probability of Y
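A minimal simulation of this idea in R, assuming a hat holding the numbers 1 through 10 and a fair coin (the object names are just illustrative):

set.seed(123)

# simulate 100,000 pairs: a number drawn from the hat and a coin flip
hat.draw <- sample(1:10, 100000, replace = TRUE)
coin.flip <- sample(c("heads", "tails"), 100000, replace = TRUE)

# if the two events are independent, the probability of heads is roughly 0.5
# no matter what number came out of the hat
mean(coin.flip == "heads")                  # overall, about 0.5
mean(coin.flip[hat.draw == 4] == "heads")   # conditional on drawing a 4, still about 0.5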

What if X does affect Y?

If X does affect Y

  • we can begin to infer a causal relationship
  • possibly operating through one or more additional variables
  • and running in some direction
  • but not necessarily that X causes Y
  • the familiar summary "correlation does not imply causation" is not quite accurate
  • Better: correlation does not prove causation

Identically Distributed

  • Identically distributed: drawn from the same probability distribution

So…

Distributions

Introduction to distributions

  • R has functions for at least 20 probability distributions (the function families are sketched below)
  • The most important is the normal distribution
  • This is because of the central limit theorem
  • We will look at these in the most detail: normal, binomial, uniform, Poisson
  • Because these are probability distributions, they let us calculate the probability that an observed result is due to random chance
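For each distribution, R supplies a family of four functions: d for the density, p for the cumulative probability, q for the quantile, and r for random draws. A quick illustration with the normal distribution:

# the four function families, shown for the normal distribution
dnorm(0)       # density at x = 0
pnorm(1.96)    # probability of a value at or below 1.96
qnorm(0.975)   # value below which 97.5% of the distribution falls
rnorm(5)       # five random draws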

Distribution examples

  • The following are histograms
  • They show the frequency - the count of observations - at each value
  • For example, if the value 4 shows 500, it means that 4 came up 500 times in the data
  • Each graph was produced by generating random numbers from the particular distribution with an R function

Uniform distribution

All outcomes are equally likely

Uniform Distribution: Probability

  • With 10 equally likely bins, the probability of a draw landing in any one bin is 1/10
  • So with 100,000 draws, the expected count per bin is 10,000; any deviation from 10,000 is random variation around that expected value (see the quick check below)
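As a quick check of that expected value, here is a small sketch; the object name draws is illustrative, and the draws are regenerated here so the chunk is self-contained:

set.seed(123)
draws <- runif(100000, min = 0, max = 10)

# with 100,000 draws and 10 equally likely bins, the expected count per bin is
100000 * (1/10)   # 10,000

# the actual counts per bin bounce randomly around that expected value
table(cut(draws, breaks = 0:10))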

Uniform distribution: with code

All outcomes are equally likely

# Set a seed for reproducibility
set.seed(123)

rand.unif <- runif(100000, min = 0, max = 10)
hist(rand.unif, breaks = 20, freq = TRUE,
     main = "uniform distribution of 100,000 random draws from 0 to 10",
     xlab = 'x', col = "red")

Normal Distribution

  • symmetrical around its mean with most values near the central peak
  • width is a function of the standard deviation
  • Other names: Gaussian distribution, bell curve

Normal Distribution

Normal Distribution: Probability

  • The probability of a value is determined by how far it is from the mean in terms of standard deviations
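For example, pnorm() converts a distance from the mean, measured in standard deviations, into a probability for the standard normal:

# probability of a standard normal draw falling more than 1, 2, or 3 SDs above the mean
1 - pnorm(1)   # about 0.16
1 - pnorm(2)   # about 0.023
1 - pnorm(3)   # about 0.0013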

Normal Distribution: with code and probabilities

set.seed(123)

# plot a normal distribution with mean = 0, sd = 1, 100,000 random draws, 200 breaks, and red color
# add lines to the plot to illustrate the 68-95-99.7 rule

rand.norm <- rnorm(100000)
h <- hist(rand.norm, breaks = 200, freq = TRUE,
          main = "normal distribution, mean = 0, sd = 1, 100,000 random draws",
          xlab = 'x', col = "red")
abline(v = c(-1, 1), col = "blue", lwd = 2)
abline(v = c(-2, 2), col = "blue", lwd = 2)
abline(v = c(-3, 3), col = "blue", lwd = 2)

# create labels to indicate the percentage within each range


# get max y from the histogram for label placement
ymax <- max(h$counts)

# add labels for the 68-95-99.7 rule

text(1.1, ymax*0.7, "68%",   pos = 4, col = "blue")
text(2.1, ymax*0.15, "95%",   pos = 4, col = "blue")
text(3.1, ymax*0.05, "99.7%", pos = 4, col = "blue")

# add arrows to illustrate the range of values within each standard deviation range 

## arrows from +sd line to −sd line at same y-level
arrows(x0 = 1,  y0 = ymax*0.7, x1 = -1,  y1 = ymax*0.7,
       code = 3, angle = 15, length = 0.08, col = "blue")   # 68% range
arrows(x0 = 2,  y0 = ymax*0.15, x1 = -2,  y1 = ymax*0.15,
       code = 3, angle = 15, length = 0.08, col = "blue")   # 95% range
arrows(x0 = 3,  y0 = ymax*0.05, x1 = -3,  y1 = ymax*0.05,
       code = 3, angle = 15, length = 0.08, col = "blue")   # 99.7% range

Binomial Distribution

  • binary outcomes
  • success/failure
  • yes/no
  • the distribution of the number of successes in a series of Bernoulli trials (probabilities sketched below)
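As a quick sketch of those probabilities, dbinom() gives the chance of an exact number of successes and pbinom() the chance of at most that many, here for a fair coin:

# probability of exactly 12 heads in 25 flips of a fair coin
dbinom(12, size = 25, prob = 0.5)   # about 0.155

# probability of 10 or fewer heads in 25 flips
pbinom(10, size = 25, prob = 0.5)   # about 0.21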

Binomial - Bernoulli example

  • n = 1 makes this a Bernoulli distribution

Binomial example: with code

  • n = 1 makes this a Bernoulli distribution
rand.binom <- rbinom(100000, 1, .5)
hist(rand.binom, breaks = 200, freq = TRUE,
     main = "binomial distribution, p = .5, 1 trial, 100,000 draws",
     xlab = 'x', col = "red")

Binomial example: with code

  • trials = 25
rand.binom2 <- rbinom(100000, 25, .5)
hist(rand.binom2, breaks = 200, freq = TRUE,
     main = "binomial distribution, p = .5, 25 trials, 100,000 draws",
     xlab = 'x', col = "red")

Preview of the Central Limit Theorem

What happens if we do the same thing as above, but with 1,000 trials per draw, and plot the counts?

Preview of the Central Limit Theorem

Preview of the Central Limit Theorem: code

rand.binom3 <- rbinom(100000, 1000, .5)
hist(rand.binom3, breaks = 200, freq = TRUE,
     main = "Histogram of binomial distribution, p = .5, 1000 trials, 100,000 draws",
     xlab = 'x', col = "red")

Preview of the Central Limit Theorem

  • For sufficiently large sample sizes, the distribution of sample means approximates a normal distribution
  • This means with a large enough number of trials, we can apply the normal distribution to know things about measures of central tendency, measures of dispersion, and probabilities
  • A common rule of thumb is a sample size of at least 30
  • This is just a preview; a small simulation sketch follows below
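A small simulation sketch of the idea, using sample means of uniform draws (the sample size of 50 and the number of samples are arbitrary choices for illustration):

set.seed(123)

# take 1,000 samples of size 50 from a uniform distribution and keep each sample mean
sample.means <- replicate(1000, mean(runif(50, min = 0, max = 10)))

# the histogram of the sample means is approximately normal,
# even though the underlying draws are uniform
hist(sample.means, breaks = 30, freq = TRUE, col = "red",
     main = "sample means of 1,000 uniform samples, n = 50", xlab = "sample mean")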

68-95-99.7 Rule

  • One of the rules for normal distributions is:

The 68-95-99.7 rule

  • 68% of the data is within 1 standard deviation of the mean
  • 95% of the data is within 2 standard deviations of the mean
  • 99.7% of the data is within 3 standard deviations of the mean
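These percentages can be checked against simulated draws from a standard normal:

set.seed(123)
x <- rnorm(100000)

mean(abs(x) <= 1)   # about 0.68
mean(abs(x) <= 2)   # about 0.95
mean(abs(x) <= 3)   # about 0.997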

Preview of the Law of Large Numbers

  • The law of large numbers tells us that if we repeat an experiment a large number of times, the average of the results will be close to the expected value
  • This lets us treat the mean of a sample as an estimate of the mean of the population
  • More generally, we can treat the statistics of the sample as estimates of the parameters of the population (a coin-flip simulation below illustrates the idea)
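A minimal coin-flip illustration of the law of large numbers (10,000 flips is an arbitrary choice):

set.seed(123)

# simulate 10,000 fair coin flips (1 = heads) and track the running average
flips <- rbinom(10000, 1, 0.5)
running.mean <- cumsum(flips) / seq_along(flips)

# the running average settles near the expected value of 0.5 as the number of flips grows
plot(running.mean, type = "l", col = "red",
     main = "running mean of 10,000 coin flips", xlab = "flip", ylab = "proportion heads")
abline(h = 0.5, col = "blue", lwd = 2)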

Statistics and Parameters

  • A statistic is a measure calculated from a sample of data

    - e.g., sample mean, sample variance, sample standard deviation
  • A parameter is a measure calculated from the entire population

    - e.g., population mean, population variance, population standard deviation
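A small sketch of the distinction, treating a large simulated vector as the population (the numbers are made up for illustration):

set.seed(123)

# treat a large simulated vector as the population
population <- rnorm(1000000, mean = 50, sd = 10)

# parameters: calculated from the entire population
mean(population)
sd(population)

# statistics: calculated from a sample and used as estimates of the parameters
my.sample <- sample(population, 100)
mean(my.sample)
sd(my.sample)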

Poisson distribution

  • Counts the number of events in a fixed interval of time or space
  • Known, constant mean rate of occurrence
  • Events occur independently of the time since the last event (probabilities sketched below)
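dpois() and ppois() give the probability of a given count when the mean rate is known; with lambda = 1, the cumulative probabilities for 1, 2, and 3 events are the 74%, 92%, and 98% figures labeled in the plot later in this section:

# probability of exactly 0, 1, 2, or 3 events when the mean rate is 1
dpois(0:3, lambda = 1)   # about 0.37, 0.37, 0.18, 0.06

# cumulative probability of at most 0, 1, 2, or 3 events
ppois(0:3, lambda = 1)   # about 0.37, 0.74, 0.92, 0.98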

Poisson distribution

Poisson distribution: Assumptions

  • The probabilities work differently than they do for the normal distribution
  • lambda is both the mean and the variance of the distribution (a quick check appears below)
  • We don't generally use the standard deviation with the Poisson distribution
  • The probabilities in the following example are specific to this setup with lambda = 1
  • This is why count variables should not strictly be handled with Ordinary Least Squares (OLS) regression, which is the linear regression technique we will be using
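A quick simulated check that the mean and variance are both lambda (lambda = 4 here is an arbitrary choice for illustration):

set.seed(123)

sim <- rpois(100000, lambda = 4)
mean(sim)   # close to 4
var(sim)    # also close to 4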

Poisson distribution: Probability

Poisson distribution: with code

set.seed(123)

rand.poiss <- rpois(100000, 1)
hp <- hist(rand.poiss, breaks = 200, freq = TRUE,
           main = "poisson distribution, lambda = 1, 100,000 draws",
           xlab = 'x', col = "red")

# add lines to illustrate the probabilities of 0, 1, 2, and 3 events occurring

ypmax <- max(hp$counts)

# labels on the right
text(1.1, ypmax*0.9, "74%",   pos = 4, col = "blue")
text(2.1, ypmax*0.8, "92%",   pos = 4, col = "blue")
text(3.1, ypmax*0.2, "98%",   pos = 4, col = "blue")

# arrows spanning from 0 to each quantile line
arrows(x0 = 0.0, y0 = ypmax*0.9, x1 = 1, y1 = ypmax*0.9,
       code = 3, angle = 15, length = 0.08, col = "blue")
arrows(x0 = 0.0, y0 = ypmax*0.8, x1 = 2, y1 = ypmax*0.8,
       code = 3, angle = 15, length = 0.08, col = "blue")
arrows(x0 = 0.0, y0 = ypmax*0.2, x1 = 3, y1 = ypmax*0.2,
       code = 3, angle = 15, length = 0.08, col = "blue")

Why we can’t use standard OLS regression for other DGP directly

  • We base the likelihood of a result being significant on how far it falls from the mean
  • As values get further from the mean in a normal distribution, they become less likely
  • We can apply OLS to sample means of multiple trials because of the Central Limit Theorem
  • For specific distributions, like Poisson, we often have better techniques (a regression sketch follows below)
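One such technique is a generalized linear model with a Poisson family, which models the count DGP directly; a minimal sketch with made-up data (the variable names and coefficients are purely illustrative):

set.seed(123)

# hypothetical data: a count outcome y and a single predictor x
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))

# OLS treats y as continuous and normally distributed
ols.fit <- lm(y ~ x)

# a Poisson GLM respects the count nature of y
pois.fit <- glm(y ~ x, family = poisson)
summary(pois.fit)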

Why we don’t use standard OLS regression for other DGP: Example

Poisson vs. Normal Distribution

Poisson v. Normal: Code

# Set a seed for reproducibility
set.seed(123)

# Generate data
poisson_data <- rpois(1000, lambda = 1)
normal_data <- rnorm(100000, mean = 1, sd = 1)

# Create histogram for Poisson (density)
h <- hist(poisson_data, probability = TRUE, 
          main = "Poisson(λ=1) vs. Normal(μ=1,σ=1)", 
          xlab = "Value", ylab = "Density", ylim = c(0, 0.4), 
          col = rgb(0.7, 0.9, 1, 0.7),
          xlim = c(-1, 6))  # extend x to show negatives

# Overlay normal density
lines(density(normal_data), col = "red", lwd = 2)

# Legend
legend("topright", legend = c("Poisson", "Normal"), 
       col = c("lightblue", "red"), lty = 1, lwd = 2)

# Lines
abline(v = qnorm(0.975, 1, 1), col = "red", lwd = 2, lty = 2)  # ~2.96
abline(v = qpois(0.98, 1), col = "blue", lwd = 2, lty = 2)     # 3

# Labels
text(2.98, 0.35, "97.5%\nNormal", col = "red", pos = 4)
text(3.1,  0.28, "98%\nPoisson", col = "blue", pos = 4)

# Arrow pointing left from 97.5% line to negatives, with label
arrows(x0 = 2.8, y0 = 0.20, x1 = -0.3, y1 = 0.20,
       code = 2, angle = 20, length = 0.1, col = "red", lwd = 1.5)
text(0.5, 0.22, "97.5% includes\nnegative values", col = "red", cex = 0.8)

Authorship and License