Chapter 1: The Bridge from Description to Inference

1.1 A New Frontier: From Certainty to Uncertainty

Good morning. In our previous lectures, we mastered the art of Descriptive Statistics. We learned to take a given dataset—a sample—and describe it perfectly using tables, graphs, means, and variances. We were working with known information.

Today, we cross a monumental bridge into the world of Statistical Inference. The core question of inference is: How can we use the information from a limited sample to make intelligent, reliable conclusions about the entire population?

Think of it this way:

  • A food inspector tests a small batch of soup (the sample) to decide if the entire pot (the population) is safe to eat.
  • A political pollster surveys 1,000 people (the sample) to predict the election outcome for millions of voters (the population).

In both cases, we are using a small, known piece to understand a large, unknown whole. This process involves uncertainty and risk. To manage this uncertainty, we need a new language and a new set of tools. That language is the language of probability and random variables.

1.2 What is a Random Variable?

A Random Variable (R.V.) is a variable whose possible values are numerical outcomes of a random experiment. It’s a formal way to link the unpredictable outcomes of an experiment to a numerical value.

  • Random Experiment: Tossing a coin.
  • Outcomes: Heads, Tails.
  • Random Variable X: Let X = 1 if the outcome is Heads, and X = 0 if the outcome is Tails.

The random variable X provides the numerical representation we need to perform statistical analysis. There are two main types of random variables, which will be our focus today.

Chapter 2: Discrete Random Variables

A Discrete Random Variable is one that can only take on a countable number of distinct values. Think of “number of children in a family” (0, 1, 2, …), “number of defects in a batch” (0, 1, 2, …), or “the result of a dice roll” (1, 2, 3, 4, 5, 6).

2.1 The Probability Mass Function (PMF)

For a discrete R.V., we describe its behavior using a Probability Mass Function (PMF), denoted as \(P_X(x)\). This function gives the probability that the random variable X is exactly equal to some specific value x.

\[ P_X(x) = P(X=x) \]

The PMF has two key properties:

  1. \(0 \le P(X=x) \le 1\) (the probability of any value is between 0 and 1).
  2. \(\sum_{x} P(X=x) = 1\) (the sum of probabilities over all possible values is 1).

2.2 Expected Value and Variance

  • Expected Value (Mean): The long-run average value of the random variable. It’s a weighted average of all possible values, where the weights are their probabilities. \[ E(X) = \mu = \sum_{x} x \cdot P(X=x) \]
  • Variance: A measure of the spread or dispersion of the random variable’s values around its mean. \[ Var(X) = \sigma^2 = E[(X-\mu)^2] = \sum_{x} (x-\mu)^2 \cdot P(X=x) \] A useful shortcut formula is: \[ Var(X) = E(X^2) - \mu^2 \]
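
To make these formulas concrete, here is a short R sketch (not from the original notes) that computes \(E(X)\) and \(Var(X)\) directly from the PMF of a fair six-sided die:

# Expected value and variance of a fair die roll, computed from its PMF
x <- 1:6                       # possible values
p <- rep(1/6, 6)               # PMF: each face has probability 1/6
mu <- sum(x * p)               # E(X) = sum of x * P(X = x) -> 3.5
sigma2 <- sum((x - mu)^2 * p)  # Var(X) = sum of (x - mu)^2 * P(X = x) -> about 2.917
sum(x^2 * p) - mu^2            # shortcut E(X^2) - mu^2 gives the same variance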

2.3 The Bernoulli Random Variable

The simplest and most fundamental discrete R.V. is the Bernoulli. It models a single trial with only two possible outcomes, which we label “success” (X=1) and “failure” (X=0).

  • Example: A single customer either churns (success, X=1) or does not churn (failure, X=0).
  • The probability of success is denoted by \(p\).
  • The probability of failure is \(1-p\).

The PMF is: \[ P_X(x) = p^x (1-p)^{1-x} \quad \text{for } x \in \{0, 1\} \]

  • Expected Value: \(E(X) = (0 \cdot (1-p)) + (1 \cdot p) = p\)
  • Variance: \(Var(X) = (0-p)^2(1-p) + (1-p)^2 p = p^2(1-p) + (1-p)^2 p = p(1-p)(p + (1-p)) = p(1-p)\)
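
As a quick sanity check (a simulation sketch, not part of the original notes; the value \(p = 0.3\) and the seed are arbitrary), we can draw many Bernoulli trials with rbinom and compare the empirical mean and variance with \(p\) and \(p(1-p)\):

# Simulate 100,000 Bernoulli(p) trials; size = 1 makes rbinom a Bernoulli draw
set.seed(123)
p <- 0.3
x <- rbinom(100000, size = 1, prob = p)
mean(x)   # should be close to p = 0.3
var(x)    # should be close to p * (1 - p) = 0.21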

2.4 The Binomial Random Variable

What happens when we perform \(n\) independent Bernoulli trials and count the number of successes? This gives us a Binomial Random Variable.

  • Example: We contact 10 customers (\(n=10\)). Each has a 5% probability of churning (\(p=0.05\)). The number of customers who churn is a Binomial R.V.

A Binomial R.V. X is defined by two parameters:

  • \(n\): the number of trials.
  • \(p\): the probability of success on each trial.

The PMF, which gives the probability of getting exactly \(x\) successes in \(n\) trials, is: \[ P(X=x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} \]

  • Expected Value: \(E(X) = np\)
  • Variance: \(Var(X) = np(1-p)\)

Manual Example: Client Defaults

Let’s use the example from your notes. Out of 10 clients (\(n=10\)), the probability of any single client defaulting is 0.05 (\(p=0.05\)). What is the probability that exactly 3 clients will default (\(x=3\))?

Calculation:
\[ P(X=3) = \frac{10!}{3!(10-3)!} (0.05)^3 (1-0.05)^{10-3} = \frac{10!}{3!\,7!} (0.05)^3 (0.95)^7 \]
\[ P(X=3) = \frac{10 \cdot 9 \cdot 8}{3 \cdot 2 \cdot 1} \cdot (0.000125) \cdot (0.6983) = 120 \cdot 0.000125 \cdot 0.6983 \approx 0.01047 \]

So, there is about a 1% chance that exactly 3 out of 10 clients will default.

R Example: Using dbinom

R makes this easy with the dbinom function (d for density/mass).

# P(X=3) for a Binomial(n=10, p=0.05)
prob_3_defaults <- dbinom(x = 3, size = 10, prob = 0.05)
cat("The probability of exactly 3 defaults is:", prob_3_defaults, "\n")
## The probability of exactly 3 defaults is: 0.01047506
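
A couple of related calls may also be useful here (a sketch, not from the original notes): dbinom applied to all counts 0 to 10 returns the whole PMF at once, and pbinom gives cumulative probabilities.

# Whole PMF over the possible counts 0..10; the probabilities must sum to 1
pmf <- dbinom(0:10, size = 10, prob = 0.05)
sum(pmf)                               # -> 1
# Cumulative probability P(X <= 3)
pbinom(3, size = 10, prob = 0.05)
# P(X >= 1) = 1 - P(X = 0): probability of at least one default
1 - dbinom(0, size = 10, prob = 0.05)  # -> about 0.40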

Chapter 3: Continuous Random Variables

A Continuous Random Variable can take any value within a given range. Think of height, weight, temperature, or time. The number of possible values is infinite.

3.1 The Probability Density Function (PDF)

Because a continuous R.V. can take uncountably many values, the probability that it equals any single value exactly is zero: \(P(X=x) = 0\) for every \(x\).

Instead, we use a Probability Density Function (PDF), \(f_X(x)\), to describe the likelihood of the variable falling within a range. The probability is the area under the PDF curve between two points.

\[ P(a \le X \le b) = \int_a^b f_X(x) dx \]

The PDF has two key properties:

  1. \(f_X(x) \ge 0\) (the curve is always non-negative).
  2. \(\int_{-\infty}^{\infty} f_X(x) \, dx = 1\) (the total area under the curve is 1).
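
We can verify these properties numerically; here is a brief sketch (not from the original notes) using R's integrate() on the standard normal density:

# The total area under the standard normal PDF is 1
integrate(dnorm, lower = -Inf, upper = Inf)
# P(-1 <= Z <= 1) as an area under the curve, compared with the CDF pnorm()
integrate(dnorm, lower = -1, upper = 1)$value  # -> about 0.6827
pnorm(1) - pnorm(-1)                           # same probability via pnorm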

3.2 The Normal Distribution: The Superstar of Statistics

The most important continuous distribution is the Normal Distribution. It’s a bell-shaped, symmetric curve defined by two parameters:

  • The mean \(\mu\), which determines the center of the bell.
  • The variance \(\sigma^2\), which determines the spread or width of the bell.

We denote this as \(X \sim \mathcal{N}(\mu, \sigma^2)\).

R Example: Flight Durations

Let’s use the flight duration example from your notes. The duration of a Milan-New York flight is normally distributed with a mean of 500 minutes and a variance of 625 minutes². So, \(X \sim \mathcal{N}(\mu=500, \sigma^2=625)\). This means the standard deviation is \(\sigma = \sqrt{625} = 25\).

Question 1: What is the probability a flight will last between 500 and 550 minutes? We use the pnorm function in R, which calculates the cumulative probability \(P(X \le x)\). The logic is \(P(500 < X < 550) = P(X < 550) - P(X < 500)\).

prob_between <- pnorm(550, mean = 500, sd = 25) - pnorm(500, mean = 500, sd = 25)
cat("Probability of flight lasting between 500 and 550 mins:", prob_between, "\n")
## Probability of flight lasting between 500 and 550 mins: 0.4772499
# Visualization
plot_normal(500, 25, lb = 500, ub = 550, 
            title = "P(500 < Flight Duration < 550)")
text(525, 0.008, labels = paste0(round(prob_between*100, 1), "%"), cex=1.2)
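
Note that plot_normal() is a custom plotting helper, presumably defined earlier in the course code rather than a built-in R function. A minimal sketch of what such a helper might look like (an assumption, using only base-R graphics) is:

# Minimal sketch of an assumed plot_normal() helper: draws a normal density
# curve and shades the area between lb and ub
plot_normal <- function(mean, sd, lb = -Inf, ub = Inf,
                        shade_col = "lightblue", title = "") {
  x <- seq(mean - 4 * sd, mean + 4 * sd, length.out = 400)
  plot(x, dnorm(x, mean, sd), type = "l", lwd = 2,
       xlab = "x", ylab = "Density", main = title)
  xs <- seq(max(lb, mean - 4 * sd), min(ub, mean + 4 * sd), length.out = 400)
  polygon(c(xs, rev(xs)), c(dnorm(xs, mean, sd), rep(0, length(xs))),
          col = shade_col, border = NA)
}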

Question 2: What is the maximum flight duration for the 10% fastest flights? This is a percentile question. We need to find the value \(x\) such that \(P(X \le x) = 0.10\). We use the qnorm function (q for quantile).

fastest_10_percent <- qnorm(0.10, mean = 500, sd = 25)
cat("The 10th percentile of flight duration is:", fastest_10_percent, "minutes.\n")
## The 10th percentile of flight duration is: 467.9612 minutes.
# Visualization
plot_normal(500, 25, ub = fastest_10_percent, shade_col = "lightgreen",
            title = "10% Fastest Flights")
text(460, 0.008, labels = "10%", cex=1.2)
abline(v=fastest_10_percent, lty=2, col="red")

Chapter 4: Transforming and Combining Random Variables

4.1 Linear Transformation of a Random Variable

Often, we need to apply a linear transformation to a random variable, such as converting units or calculating profit. If \(Y = a + bX\), where \(a\) and \(b\) are constants:

  • Expected Value: \(E(Y) = E(a + bX) = a + bE(X) = a + b\mu\)
  • Variance: \(Var(Y) = Var(a + bX) = b^2 Var(X) = b^2 \sigma^2\)
  • Standard Deviation: \(Sd(Y) = |b| Sd(X) = |b| \sigma\)

Important Note: If \(X\) is normally distributed, \(X \sim \mathcal{N}(\mu, \sigma^2)\), then its linear transformation \(Y\) is also normally distributed: \(Y \sim \mathcal{N}(a+b\mu, b^2\sigma^2)\).
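
For example (a sketch with hypothetical cost figures, not from the notes), suppose each minute of the Milan-New York flight from Section 3.2 costs 30 euros on top of a fixed cost of 2,000 euros, so the total cost is \(Y = 2000 + 30X\):

# Linear transformation Y = a + b*X with a = 2000, b = 30 (hypothetical cost model)
a <- 2000; b <- 30
mu <- 500; sigma <- 25             # flight duration: mean and standard deviation
a + b * mu                         # E(Y)  = a + b*mu        -> 17000
b^2 * sigma^2                      # Var(Y) = b^2 * sigma^2  -> 562500
abs(b) * sigma                     # Sd(Y) = |b| * sigma     -> 750

# Simulation check: since X is normal, Y = a + b*X is normal as well
set.seed(1)
x <- rnorm(100000, mean = mu, sd = sigma)
y <- a + b * x
c(mean(y), sd(y))                  # close to 17000 and 750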

4.2 Standardization: The Great Equalizer

A crucial linear transformation is standardization. It converts any random variable \(X\) into a new variable \(Z\) with a mean of 0 and a standard deviation of 1.

\[ Z = \frac{X - \mu}{\sigma} \]

This is a linear transformation where \(a = -\mu/\sigma\) and \(b = 1/\sigma\). Let’s prove its properties:

  • \(E(Z) = E\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma} E(X - \mu) = \frac{1}{\sigma} (E(X) - \mu) = \frac{1}{\sigma} (\mu - \mu) = 0\)
  • \(Var(Z) = Var\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} Var(X - \mu) = \frac{1}{\sigma^2} Var(X) = \frac{1}{\sigma^2} \sigma^2 = 1\)

If the original variable \(X\) is Normal, \(X \sim \mathcal{N}(\mu, \sigma^2)\), then the standardized variable \(Z\) follows the Standard Normal Distribution, \(Z \sim \mathcal{N}(0, 1)\).

R Example: Flight Duration via Standardization

Let’s re-calculate \(P(X < 540)\) for our flight data using standardization. First, we convert 540 minutes into a Z-score: \[ Z = \frac{540 - 500}{25} = \frac{40}{25} = 1.6 \] So, \(P(X < 540)\) is the same as \(P(Z < 1.6)\).

prob_z <- pnorm(1.6, mean = 0, sd = 1) # Or just pnorm(1.6)
cat("P(Z < 1.6) =", prob_z, "\n")
## P(Z < 1.6) = 0.9452007
prob_x <- pnorm(540, mean = 500, sd = 25)
cat("P(X < 540) =", prob_x, "\n")
## P(X < 540) = 0.9452007
cat("The results are identical, as expected.\n")
## The results are identical, as expected.

Chapter 5: The Power of Many - Sums, Means, and the Central Limit Theorem

5.1 The Sample: A Collection of i.i.d. Random Variables

When we draw a random sample of size \(n\) from a population, we are essentially observing \(n\) random variables: \((X_1, X_2, \dots, X_n)\). We assume these variables are independent and identically distributed (i.i.d.), meaning:

  • Independent: The value of one observation does not affect the value of another.
  • Identically Distributed: Each observation \(X_i\) comes from the same population distribution, and thus has the same mean \(\mu\) and variance \(\sigma^2\).

5.2 The Sampling Distribution of the Sample Mean

The most important sample statistic is the sample mean, \(\bar{X}\). Since it’s calculated from random variables, \(\bar{X}\) is itself a random variable. The probability distribution of \(\bar{X}\) is called its sampling distribution.

\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} \]

Using the rules for linear combinations of independent random variables, we can find its expected value and variance.

  • Expected Value of the Sample Mean: \[ E(\bar{X}) = E\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n} \sum E(X_i) = \frac{1}{n} \sum \mu = \frac{1}{n} (n\mu) = \mu \]
  • Variance of the Sample Mean: \[ Var(\bar{X}) = Var\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2} \sum Var(X_i) = \frac{1}{n^2} \sum \sigma^2 = \frac{1}{n^2} (n\sigma^2) = \frac{\sigma^2}{n} \]
  • Standard Error: The standard deviation of the sample mean is called the standard error. \[ SE(\bar{X}) = \sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]

Notice that as the sample size \(n\) gets larger, the standard error gets smaller. This means the sample mean becomes a more precise estimate of the population mean, as the simulation sketch below illustrates.
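
Here is a brief simulation sketch (not from the original notes; the population values \(\mu = 100\), \(\sigma = 4\) match the grade example of Chapter 7, and the seed is arbitrary) confirming that \(E(\bar{X}) = \mu\) and \(SE(\bar{X}) = \sigma/\sqrt{n}\):

# Simulate the sampling distribution of the mean for samples of size n = 50
set.seed(42)
n <- 50
sample_means <- replicate(10000, mean(rnorm(n, mean = 100, sd = 4)))
mean(sample_means)   # close to mu = 100
sd(sample_means)     # close to sigma / sqrt(n) = 4 / sqrt(50), about 0.566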

5.3 The Central Limit Theorem (CLT)

This is perhaps the most magical and powerful theorem in all of statistics.

The Central Limit Theorem states: If you draw a sufficiently large random sample (typically \(n > 30\)) from any population (regardless of its original distribution), the sampling distribution of the sample mean \(\bar{X}\) will be approximately normal.

\[ \bar{X} \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n \]

This is incredible! Even if our population is skewed, bimodal, or uniform, the distribution of its sample means will be a nice, predictable normal distribution. This theorem is the foundation that allows us to perform hypothesis tests and create confidence intervals for the mean.

Illustration of the Central Limit Theorem. As sample size n increases, the sampling distribution of the mean becomes more normal and less spread out, regardless of the original population distribution.

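Since the figure itself is not reproduced here, the following simulation sketch (not from the original notes) shows the same idea: even when the population is heavily skewed (exponential), the distribution of sample means becomes bell-shaped and narrower as \(n\) grows.

# CLT in action: the population is exponential (skewed), yet the sample means
# look approximately normal once n is moderately large
set.seed(7)
for (n in c(2, 10, 50)) {
  sample_means <- replicate(5000, mean(rexp(n, rate = 1)))
  hist(sample_means, breaks = 40, xlab = "Sample mean",
       main = paste("Sampling distribution of the mean, n =", n))
}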

Chapter 6: The Logic of Statistical Inference

Now we have all the tools to formally connect our sample to the population.

6.1 Estimators and Estimates

  • An Estimator is a sample statistic used to estimate a population parameter. It is a random variable because its value depends on the random sample selected. We use uppercase letters, e.g., \(\bar{X}\) or \(S^2\).
  • An Estimate is the specific numerical value an estimator takes for a given, observed sample. It is a single number. We use lowercase letters, e.g., \(\bar{x} = 101.5\) or \(s^2 = 15.2\).

| Parameter (Population) | Estimator (Formula / R.V.) | Estimate (A single number) |
|---|---|---|
| Mean \(\mu\) | \(\bar{X} = \frac{\sum X_i}{n}\) | \(\bar{x}\) |
| Variance \(\sigma^2\) | \(S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}\) | \(s^2\) |
| Proportion \(p\) | \(\hat{P} = \frac{\sum X_i}{n}\) | \(\hat{p}\) |
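
To see the distinction in action, here is a small sketch (simulated data, not from the original notes): the estimators are the formulas, and the numbers they return for one observed sample are the estimates.

# One observed sample of n = 50 values (simulated for illustration)
set.seed(99)
grades <- rnorm(50, mean = 100, sd = 4)

mean(grades)         # x_bar: the estimate of mu
var(grades)          # s^2:   the estimate of sigma^2 (var() divides by n - 1)
mean(grades > 105)   # p_hat: the estimate of the proportion with grade > 105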

6.2 Properties of Good Estimators: Unbiasedness

How do we know if an estimator is a good one? The first property we look for is unbiasedness.

An estimator \(\hat{\theta}\) is unbiased for a parameter \(\theta\) if its expected value is equal to the parameter. \[ E(\hat{\theta}) = \theta \] This means that if we were to take many, many samples, the average of all our estimates would be exactly equal to the true population value. The estimator doesn’t systematically overestimate or underestimate the parameter.

Example: Is the Sample Mean \(\bar{X}\) Unbiased?

Yes. As we proved earlier, \(E(\bar{X}) = \mu\). So, the sample mean is an unbiased estimator of the population mean.

Example: Is the Sample Variance \(S^2\) Unbiased?

Yes. The reason the formula for sample variance has \(n-1\) in the denominator instead of \(n\) is precisely to make it an unbiased estimator of the population variance \(\sigma^2\). That is, \(E(S^2) = \sigma^2\).
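
A simulation sketch (not from the original notes) makes the role of \(n-1\) concrete: for a normal population with \(\sigma^2 = 16\), dividing by \(n\) systematically underestimates the variance, while dividing by \(n-1\) does not.

# Compare the unbiased (n - 1) and biased (n) versions of the sample variance
set.seed(2024)
n <- 5                                                                # small n makes the bias visible
vars_unbiased <- replicate(20000, var(rnorm(n, mean = 100, sd = 4)))  # var() divides by n - 1
vars_biased   <- vars_unbiased * (n - 1) / n                          # same data, divided by n
mean(vars_unbiased)   # close to sigma^2 = 16
mean(vars_biased)     # close to (n - 1)/n * sigma^2 = 12.8 (too small)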

Chapter 7: Sampling Distributions in Practice

Let’s apply these concepts to the examples from your Lectures 13/14 notes.

7.1 Example: Bachelor’s Degree Grades

A population of students has a mean degree grade of \(\mu = 100\) with a variance of \(\sigma^2 = 16\).

Question 1: Consider a sample of 50 students (\(n=50\)). What is the probability that their average grade is less than 101?

Step 1: Define the sampling distribution of the sample mean \(\bar{X}\). Since the sample size \(n=50\) is large (greater than 30), the Central Limit Theorem applies.

  • \(E(\bar{X}) = \mu = 100\)
  • \(Var(\bar{X}) = \frac{\sigma^2}{n} = \frac{16}{50} = 0.32\)
  • \(SE(\bar{X}) = \sqrt{0.32} \approx 0.5657\)

So, the sampling distribution is \(\bar{X} \approx \mathcal{N}(\mu=100, \sigma^2=0.32)\).

Step 2: Calculate the probability. We want to find \(P(\bar{X} < 101)\).

# We use pnorm with the parameters of the SAMPLING DISTRIBUTION
prob_mean_lt_101 <- pnorm(101, mean = 100, sd = sqrt(16/50))
cat("The probability that the sample mean grade is less than 101 is:", prob_mean_lt_101, "\n")
## The probability that the sample mean grade is less than 101 is: 0.9614501
# Visualization
plot_normal(100, sqrt(16/50), ub = 101, 
            title = "Sampling Distribution of the Mean Grade (n=50)")
text(100, 0.4, labels = paste0(round(prob_mean_lt_101*100, 1), "%"), cex=1.2)
abline(v=101, lty=2, col="red")

Question 2: It is known that the proportion of students with a grade > 105 is \(p=0.22\). In a sample of 50 students, what is the probability that the sample proportion with a grade > 105 is more than 0.30?

Step 1: Define the sampling distribution of the sample proportion \(\hat{P}\). This is a Bernoulli population where “success” is having a grade > 105.

  • The population proportion is \(p = 0.22\).
  • The sample size is \(n = 50\).
  • Check the CLT condition: \(n \cdot p \cdot (1-p) = 50 \cdot 0.22 \cdot (1-0.22) = 50 \cdot 0.1716 = 8.58\). Since \(8.58 > 5\), the normal approximation is appropriate.

The sampling distribution of \(\hat{P}\) is:

  • \(E(\hat{P}) = p = 0.22\)
  • \(Var(\hat{P}) = \frac{p(1-p)}{n} = \frac{0.22(0.78)}{50} \approx 0.003432\)
  • \(SE(\hat{P}) = \sqrt{0.003432} \approx 0.05858\)

So, \(\hat{P} \approx \mathcal{N}(\mu=0.22, \sigma^2=0.003432)\).

Step 2: Calculate the probability. We want to find \(P(\hat{P} > 0.30)\).

# Calculate the standard error of the proportion
se_p <- sqrt(0.22 * (1 - 0.22) / 50)

# We want P(P_hat > 0.3), which is 1 - P(P_hat <= 0.3)
prob_prop_gt_030 <- 1 - pnorm(0.30, mean = 0.22, sd = se_p)
cat("The probability that the sample proportion is greater than 0.30 is:", prob_prop_gt_030, "\n")
## The probability that the sample proportion is greater than 0.30 is: 0.08603581
# Visualization
plot_normal(0.22, se_p, lb = 0.30, shade_col = "salmon",
            title = "Sampling Distribution of the Proportion (n=50)")
text(0.32, 3, labels = paste0(round(prob_prop_gt_030*100, 1), "%"), cex=1.2)
abline(v=0.30, lty=2, col="red")

Chapter 8: Conclusion and Next Steps

Today, we have built the essential bridge from descriptive statistics to statistical inference. We’ve learned that:

  • Random Variables are the mathematical language we use to model uncertainty.
  • The Normal Distribution is a powerful and ubiquitous tool.
  • The Central Limit Theorem is the magic that allows us to make inferences about the mean, even when we don’t know the shape of the population.
  • Every sample statistic, like the sample mean or sample proportion, has its own sampling distribution, which describes its behavior across all possible samples.

Understanding these sampling distributions is the absolute foundation for everything that comes next. In our upcoming lectures, we will use this foundation to build two of the most important tools in statistics:

  1. Confidence Intervals: Estimating a population parameter with a range of plausible values.
  2. Hypothesis Testing: Making a formal decision about a claim regarding a population parameter.

You have done excellent work today. Mastering these concepts is crucial, so please review them carefully.

🎓 End of Lecture 5 - Well done!

## 📋 Session Information:
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.6.1          fastmap_1.2.0     xfun_0.52        
##  [5] cachem_1.1.0      knitr_1.50        htmltools_0.5.8.1 rmarkdown_2.29   
##  [9] lifecycle_1.0.4   cli_3.6.5         sass_0.4.10       jquerylib_0.1.4  
## [13] compiler_4.5.1    rstudioapi_0.17.1 tools_4.5.1       evaluate_1.0.4   
## [17] bslib_0.9.0       yaml_2.3.10       rlang_1.1.6       jsonlite_2.0.0