Good morning. In our previous lectures, we mastered the art of Descriptive Statistics. We learned to take a given dataset—a sample—and describe it perfectly using tables, graphs, means, and variances. We were working with known information.
Today, we cross a monumental bridge into the world of Statistical Inference. The core question of inference is: How can we use the information from a limited sample to make intelligent, reliable conclusions about the entire population?
Think of it this way:
* A food inspector tests a small batch of soup (the sample) to decide if the entire pot (the population) is safe to eat.
* A political pollster surveys 1,000 people (the sample) to predict the election outcome for millions of voters (the population).
In both cases, we are using a small, known piece to understand a large, unknown whole. This process involves uncertainty and risk. To manage this uncertainty, we need a new language and a new set of tools. That language is the language of probability and random variables.
A Random Variable (R.V.) is a variable whose possible values are numerical outcomes of a random experiment. It’s a formal way to link the unpredictable outcomes of an experiment to a numerical value.
For example, consider a single coin flip. Let X = 1 if the outcome is Heads, and X = 0 if the outcome is Tails. The random variable X provides the numerical representation we need to perform statistical analysis. There are two main types of random variables, which will be our focus today.
A Discrete Random Variable is one that can only take on a countable number of distinct values. Think of “number of children in a family” (0, 1, 2, …), “number of defects in a batch” (0, 1, 2, …), or “the result of a dice roll” (1, 2, 3, 4, 5, 6).
For a discrete R.V., we describe its behavior using a Probability Mass Function (PMF), denoted as \(P_X(x)\). This function gives the probability that the random variable X is exactly equal to some specific value x.
\[ P_X(x) = P(X=x) \]
The PMF has two key properties:
1. \(0 \le P(X=x) \le 1\) (the probability of any value is between 0 and 1).
2. \(\sum_{x} P(X=x) = 1\) (the probabilities over all possible values sum to 1).
The simplest and most fundamental discrete R.V. is the Bernoulli. It models a single trial with only two possible outcomes, which we label “success” (X=1) and “failure” (X=0).
The PMF is: \[ P_X(x) = p^x (1-p)^{1-x} \quad \text{for } x \in \{0, 1\} \]
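In R, a Bernoulli trial is just a Binomial with a single trial, so dbinom with size = 1 evaluates this PMF (a quick sketch; p = 0.3 is an arbitrary example value):

p <- 0.3                       # arbitrary success probability for illustration
dbinom(1, size = 1, prob = p)  # P(X = 1) = p      -> 0.3
dbinom(0, size = 1, prob = p)  # P(X = 0) = 1 - p  -> 0.7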
What happens when we perform \(n\) independent Bernoulli trials and count the number of successes? This gives us a Binomial Random Variable.
A Binomial R.V. X is defined by two parameters:
* \(n\): the number of trials.
* \(p\): the probability of success on each trial.
The PMF, which gives the probability of getting exactly \(x\) successes in \(n\) trials, is: \[ P(X=x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} \]
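The binomial coefficient has a direct R counterpart, choose(n, x), which matches the factorial formula (a quick check using the numbers from the example below):

choose(10, 3)                                  # n! / (x!(n-x)!) = 120
factorial(10) / (factorial(3) * factorial(7))  # same value, computed explicitly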
Let’s use the example from your notes. Out of 10 clients (\(n=10\)), the probability of any single client defaulting is 0.05 (\(p=0.05\)). What is the probability that exactly 3 clients will default (\(x=3\))?
Calculation: \[ P(X=3) = \frac{10!}{3!(10-3)!} (0.05)^3 (1-0.05)^{10-3} \] \[ P(X=3) = \frac{10!}{3!7!} (0.05)^3 (0.95)^7 \] \[ P(X=3) = \frac{10 \cdot 9 \cdot 8}{3 \cdot 2 \cdot 1} \cdot (0.000125) \cdot (0.6983) = 120 \cdot (0.000125) \cdot (0.6983) \approx 0.01047 \]
So, there is about a 1% chance that exactly 3 out of 10 clients will default.
R makes this easy with the dbinom function (d for density/mass).
# P(X=3) for a Binomial(n=10, p=0.05)
prob_3_defaults <- dbinom(x = 3, size = 10, prob = 0.05)
cat("The probability of exactly 3 defaults is:", prob_3_defaults, "\n")## The probability of exactly 3 defaults is: 0.01047506
A Continuous Random Variable can take any value within a given range. Think of height, weight, temperature, or time. The number of possible values is infinite.
Because there are infinite possible values, the probability of a continuous R.V. being exactly equal to any single value is zero. \(P(X=x) = 0\).
Instead, we use a Probability Density Function (PDF), \(f_X(x)\), to describe the likelihood of the variable falling within a range. The probability is the area under the PDF curve between two points.
\[ P(a \le X \le b) = \int_a^b f_X(x) dx \]
The PDF has two key properties:
1. \(f_X(x) \ge 0\) (the curve is always non-negative).
2. \(\int_{-\infty}^{\infty} f_X(x) dx = 1\) (the total area under the curve is 1).
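We can verify these properties numerically, for example for the standard normal density dnorm (a sketch using R's integrate function):

integrate(dnorm, lower = -Inf, upper = Inf)  # total area under the PDF: ~1
integrate(dnorm, lower = -1, upper = 1)      # P(-1 <= X <= 1): ~0.683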
The most important continuous distribution is the Normal Distribution. It’s a bell-shaped, symmetric curve defined by two parameters:
* The mean \(\mu\), which determines the center of the bell.
* The variance \(\sigma^2\), which determines the spread or width of the bell.
We denote this as \(X \sim \mathcal{N}(\mu, \sigma^2)\).
Let’s use the flight duration example from your notes. The duration of a Milan-New York flight is normally distributed with a mean of 500 minutes and a variance of 625 minutes². So, \(X \sim \mathcal{N}(\mu=500, \sigma^2=625)\). This means the standard deviation is \(\sigma = \sqrt{625} = 25\).
Question 1: What is the probability a flight will last between 500 and 550 minutes? We use the pnorm function in R, which calculates the cumulative probability \(P(X \le x)\). The logic is \(P(500 < X < 550) = P(X < 550) - P(X < 500)\).
prob_between <- pnorm(550, mean = 500, sd = 25) - pnorm(500, mean = 500, sd = 25)
cat("Probability of flight lasting between 500 and 550 mins:", prob_between, "\n")## Probability of flight lasting between 500 and 550 mins: 0.4772499
# Visualization
plot_normal(500, 25, lb = 500, ub = 550,
title = "P(500 < Flight Duration < 550)")
text(525, 0.008, labels = paste0(round(prob_between*100, 1), "%"), cex=1.2)

Question 2: What is the maximum flight duration for the 10% fastest flights? This is a percentile question. We need to find the value \(x\) such that \(P(X \le x) = 0.10\). We use the qnorm function (q for quantile).
fastest_10_percent <- qnorm(0.10, mean = 500, sd = 25)
cat("The 10th percentile of flight duration is:", fastest_10_percent, "minutes.\n")## The 10th percentile of flight duration is: 467.9612 minutes.
# Visualization
plot_normal(500, 25, ub = fastest_10_percent, shade_col = "lightgreen",
title = "10% Fastest Flights")
text(460, 0.008, labels = "10%", cex=1.2)
abline(v=fastest_10_percent, lty=2, col="red")

Often, we need to apply a linear transformation to a random variable, such as converting units or calculating profit. If \(Y = a + bX\), where \(a\) and \(b\) are constants:
* \(E(Y) = a + b\,E(X) = a + b\mu\)
* \(Var(Y) = b^2\,Var(X) = b^2\sigma^2\)
Important Note: If \(X\) is normally distributed, \(X \sim \mathcal{N}(\mu, \sigma^2)\), then its linear transformation \(Y\) is also normally distributed: \(Y \sim \mathcal{N}(a+b\mu, b^2\sigma^2)\).
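A quick simulation makes these rules concrete (a sketch; the constants a = 10 and b = 2, and the reuse of the flight-duration parameters, are arbitrary choices for illustration):

set.seed(42)
x <- rnorm(100000, mean = 500, sd = 25)  # X ~ N(500, 625)
y <- 10 + 2 * x                          # Y = a + bX with a = 10, b = 2
mean(y)  # close to a + b*mu      = 10 + 2*500 = 1010
var(y)   # close to b^2 * sigma^2 = 4 * 625    = 2500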
A crucial linear transformation is standardization. It converts any random variable \(X\) into a new variable \(Z\) with a mean of 0 and a standard deviation of 1.
\[ Z = \frac{X - \mu}{\sigma} \]
This is a linear transformation where \(a = -\mu/\sigma\) and \(b = 1/\sigma\). Let’s prove its properties:
* \(E(Z) = E\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma} E(X - \mu) = \frac{1}{\sigma} (E(X) - \mu) = \frac{1}{\sigma} (\mu - \mu) = 0\)
* \(Var(Z) = Var\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} Var(X - \mu) = \frac{1}{\sigma^2} Var(X) = \frac{1}{\sigma^2} \sigma^2 = 1\)
If the original variable \(X\) is Normal, \(X \sim \mathcal{N}(\mu, \sigma^2)\), then the standardized variable \(Z\) follows the Standard Normal Distribution, \(Z \sim \mathcal{N}(0, 1)\).
Let’s re-calculate \(P(X < 540)\) for our flight data using standardization. First, we convert 540 minutes into a Z-score: \[ Z = \frac{540 - 500}{25} = \frac{40}{25} = 1.6 \] So, \(P(X < 540)\) is the same as \(P(Z < 1.6)\).
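We can check both routes in R (a minimal sketch that reproduces the output below; pnorm with no mean or sd arguments defaults to the standard normal):

# Route 1: standardize first, then use the standard normal
cat("P(Z < 1.6) =", pnorm(1.6), "\n")
# Route 2: use the original parameters directly
cat("P(X < 540) =", pnorm(540, mean = 500, sd = 25), "\n")
cat("The results are identical, as expected.\n")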
## P(Z < 1.6) = 0.9452007
## P(X < 540) = 0.9452007
## The results are identical, as expected.
When we draw a random sample of size \(n\) from a population, we are essentially observing \(n\) random variables: \((X_1, X_2, \dots, X_n)\). We assume these variables are independent and identically distributed (i.i.d.), meaning:
* Independent: the value of one observation does not affect the value of another.
* Identically Distributed: each observation \(X_i\) comes from the same population distribution, and thus has the same mean \(\mu\) and variance \(\sigma^2\).
The most important sample statistic is the sample mean, \(\bar{X}\). Since it’s calculated from random variables, \(\bar{X}\) is itself a random variable. The probability distribution of \(\bar{X}\) is called its sampling distribution.
\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} \]
Using the rules for linear combinations, we can find its expected value and variance.
* Expected Value of the Sample Mean: \[ E(\bar{X}) = E\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n} \sum E(X_i) = \frac{1}{n} \sum \mu = \frac{1}{n} (n\mu) = \mu \]
* Variance of the Sample Mean: \[ Var(\bar{X}) = Var\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2} \sum Var(X_i) = \frac{1}{n^2} \sum \sigma^2 = \frac{1}{n^2} (n\sigma^2) = \frac{\sigma^2}{n} \]
* Standard Error: the standard deviation of the sample mean is called the standard error. \[ SE(\bar{X}) = \sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]
Notice that as the sample size \(n\) gets larger, the standard error gets smaller. This means the sample mean becomes a more precise estimate of the population mean, as the simulation below illustrates.
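Here is a small simulation of that shrinking standard error (a sketch; it reuses the flight-duration population \(\mathcal{N}(500, 625)\) purely for concreteness):

set.seed(1)
# Empirical standard error of the mean for samples of size n
se_of_mean <- function(n, reps = 10000) {
  sd(replicate(reps, mean(rnorm(n, mean = 500, sd = 25))))
}
se_of_mean(10)   # close to 25/sqrt(10)  ~ 7.91
se_of_mean(100)  # close to 25/sqrt(100) =  2.5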
This is perhaps the most magical and powerful theorem in all of statistics.
The Central Limit Theorem states: If you draw a sufficiently large random sample (typically \(n > 30\)) from any population (regardless of its original distribution), the sampling distribution of the sample mean \(\bar{X}\) will be approximately normal.
\[ \bar{X} \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n \]
This is incredible! Even if our population is skewed, bimodal, or uniform, the distribution of its sample means will be a nice, predictable normal distribution. This theorem is the foundation that allows us to perform hypothesis tests and create confidence intervals for the mean.
Figure: Illustration of the Central Limit Theorem. As sample size n increases, the sampling distribution of the mean becomes more normal and less spread out, regardless of the original population distribution.
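We can watch the theorem in action with a simulation (a sketch; the strongly right-skewed Exponential population is an arbitrary choice, and any population would do):

set.seed(123)
# Population: Exponential(rate = 1), with mean 1 and sd 1 -- far from normal
xbar <- replicate(10000, mean(rexp(50, rate = 1)))  # 10,000 sample means, n = 50
hist(xbar, breaks = 50, main = "Sampling distribution of the mean (n = 50)")
mean(xbar)  # close to the population mean, 1
sd(xbar)    # close to sigma/sqrt(n) = 1/sqrt(50) ~ 0.141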
Now we have all the tools to formally connect our sample to the population.
| Parameter (Population) | Estimator (Formula / R.V.) | Estimate (A single number) |
|---|---|---|
| Mean \(\mu\) | \(\bar{X} = \frac{\sum X_i}{n}\) | \(\bar{x}\) |
| Variance \(\sigma^2\) | \(S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}\) | \(s^2\) |
| Proportion \(p\) | \(\hat{P} = \frac{\sum X_i}{n}\) | \(\hat{p}\) |
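Given a concrete sample, R computes each estimate in this table directly (a sketch on simulated data; the grades vector and its parameters are hypothetical):

set.seed(99)
grades <- rnorm(50, mean = 100, sd = 4)  # hypothetical sample of 50 grades
mean(grades)        # x-bar: estimate of mu
var(grades)         # s^2 (n-1 denominator): estimate of sigma^2
mean(grades > 105)  # p-hat: estimated proportion with grade > 105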
How do we know if an estimator is a good one? The first property we look for is unbiasedness.
An estimator \(\hat{\theta}\) is unbiased for a parameter \(\theta\) if its expected value is equal to the parameter. \[ E(\hat{\theta}) = \theta \] This means that if we were to take many, many samples, the average of all our estimates would be exactly equal to the true population value. The estimator doesn’t systematically overestimate or underestimate the parameter.
Is the sample mean an unbiased estimator of \(\mu\)? Yes. As we proved earlier, \(E(\bar{X}) = \mu\). So, the sample mean is an unbiased estimator of the population mean.
Is the sample variance an unbiased estimator of \(\sigma^2\)? Yes. The reason the formula for the sample variance has \(n-1\) in the denominator instead of \(n\) is precisely to make it an unbiased estimator of the population variance \(\sigma^2\). That is, \(E(S^2) = \sigma^2\). The simulation below makes this bias correction visible.
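A simulation sketch of the bias correction (the true variance \(\sigma^2 = 4\) and sample size n = 10 are arbitrary choices):

set.seed(7)
n <- 10
# var() in R uses the n-1 denominator
s2_n_minus_1 <- replicate(50000, var(rnorm(n, mean = 0, sd = 2)))
mean(s2_n_minus_1)               # ~ 4: unbiased for sigma^2
mean(s2_n_minus_1 * (n - 1) / n) # ~ 3.6: the n-denominator version underestimates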
Let’s apply these concepts to the examples from your Lectures 13/14 notes.
A population of students has a mean degree grade of \(\mu = 100\) with a variance of \(\sigma^2 = 16\).
Question 1: Consider a sample of 50 students (\(n=50\)). What is the probability that their average grade is less than 101?
Step 1: Define the sampling distribution of the sample mean \(\bar{X}\). Since the sample size \(n=50\) is large (greater than 30), the Central Limit Theorem applies.
* \(E(\bar{X}) = \mu = 100\)
* \(Var(\bar{X}) = \frac{\sigma^2}{n} = \frac{16}{50} = 0.32\)
* \(SE(\bar{X}) = \sqrt{0.32} \approx 0.5657\)
So, the sampling distribution is \(\bar{X} \approx \mathcal{N}(\mu=100, \sigma^2=0.32)\).
Step 2: Calculate the probability. We want to find \(P(\bar{X} < 101)\).
# We use pnorm with the parameters of the SAMPLING DISTRIBUTION
prob_mean_lt_101 <- pnorm(101, mean = 100, sd = sqrt(16/50))
cat("The probability that the sample mean grade is less than 101 is:", prob_mean_lt_101, "\n")## The probability that the sample mean grade is less than 101 is: 0.9614501
# Visualization
plot_normal(100, sqrt(16/50), ub = 101,
title = "Sampling Distribution of the Mean Grade (n=50)")
text(100, 0.4, labels = paste0(round(prob_mean_lt_101*100, 1), "%"), cex=1.2)
abline(v=101, lty=2, col="red")

Question 2: It is known that the proportion of students with a grade > 105 is \(p=0.22\). In a sample of 50 students, what is the probability that the sample proportion with a grade > 105 is more than 0.30?
Step 1: Define the sampling distribution of the sample proportion \(\hat{P}\). This is a Bernoulli population where “success” is having a grade > 105.
* The population proportion is \(p = 0.22\).
* The sample size is \(n = 50\).
* Check CLT condition: \(n \cdot p \cdot (1-p) = 50 \cdot 0.22 \cdot 0.78 = 8.58\). Since \(8.58 > 5\), the normal approximation is appropriate.
The sampling distribution of \(\hat{P}\) is:
* \(E(\hat{P}) = p = 0.22\)
* \(Var(\hat{P}) = \frac{p(1-p)}{n} = \frac{0.22(0.78)}{50} \approx 0.003432\)
* \(SE(\hat{P}) = \sqrt{0.003432} \approx 0.05858\)
So, \(\hat{P} \approx \mathcal{N}(\mu=0.22, \sigma^2=0.003432)\).
Step 2: Calculate the probability. We want to find \(P(\hat{P} > 0.30)\).
# Calculate the standard error of the proportion
se_p <- sqrt(0.22 * (1 - 0.22) / 50)
# We want P(P_hat > 0.3), which is 1 - P(P_hat <= 0.3)
prob_prop_gt_030 <- 1 - pnorm(0.30, mean = 0.22, sd = se_p)
cat("The probability that the sample proportion is greater than 0.30 is:", prob_prop_gt_030, "\n")## The probability that the sample proportion is greater than 0.30 is: 0.08603581
# Visualization
plot_normal(0.22, se_p, lb = 0.30, shade_col = "salmon",
title = "Sampling Distribution of the Proportion (n=50)")
text(0.32, 3, labels = paste0(round(prob_prop_gt_030*100, 1), "%"), cex=1.2)
abline(v=0.30, lty=2, col="red")

Today, we have built the essential bridge from descriptive statistics to statistical inference. We’ve learned that:
* Random Variables are the mathematical language we use to model uncertainty.
* The Normal Distribution is a powerful and ubiquitous tool.
* The Central Limit Theorem is the magic that allows us to make inferences about the mean, even when we don’t know the shape of the population.
* Every sample statistic, like the sample mean or sample proportion, has its own sampling distribution, which describes its behavior across all possible samples.
Understanding these sampling distributions is the absolute foundation for everything that comes next. In our upcoming lectures, we will use this foundation to build two of the most important tools in statistics:
1. Confidence Intervals: estimating a population parameter with a range of plausible values.
2. Hypothesis Testing: making a formal decision about a claim regarding a population parameter.
You have done excellent work today. Mastering these concepts is crucial, so please review them carefully.
🎓 End of Lecture 5 - Well done!
## 📋 Session Information:
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.52
## [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [9] lifecycle_1.0.4 cli_3.6.5 sass_0.4.10 jquerylib_0.1.4
## [13] compiler_4.5.1 rstudioapi_0.17.1 tools_4.5.1 evaluate_1.0.4
## [17] bslib_0.9.0 yaml_2.3.10 rlang_1.1.6 jsonlite_2.0.0