---
title: "Foundations of Sampling Theory: A Practical Introduction with R"
author: "Lecture Notes"
date: "2026-04-23"
output:
  html_document:
    toc: true
    toc_float: true
    theme: cosmo
    highlight: tango
---
# 1. Introduction to Sampling Theory
Sampling theory is the study of the relationship between a **population** and the **samples** drawn from it. The goal is to make valid inferences about the population without having to conduct a full census.
### Key Definitions:
* **Population (size $N$):** The complete set of items or individuals under study.
* **Sample (size $n$):** A subset of the population selected for observation.
* **Sampling Frame:** A list of all units in the population from which the sample is drawn.
* **Parameter:** A numerical characteristic of the population (e.g., population mean $\mu$).
* **Statistic:** A numerical characteristic of the sample (e.g., sample mean $\bar{x}$), used to estimate the parameter.
---
# 2. Probability vs. Non-Probability Sampling
Sampling methods are broadly categorized into two types:
1. **Probability Sampling:** Every element in the population has a known, non-zero probability of being selected. This allows for the calculation of sampling error and statistical inference.
2. **Non-Probability Sampling:** Selection is based on convenience or judgment (e.g., "man on the street" interviews). This is easier but cannot be used for formal statistical generalization.
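
The contrast can be illustrated with a small sketch. It assumes that `iris` stands in for the population and that taking the first 20 rows of the list plays the role of a convenience sample; both are illustrative assumptions, not part of the definitions above. Because `iris` is sorted by species, the first 20 rows are all *setosa*, so the convenience mean falls below the population mean, while a random draw has no such built-in tilt.

``` r
population <- iris

# Convenience sample: whatever is first on the list (illustrative assumption)
convenience <- head(population, 20)

# Probability sample: 20 rows chosen at random, without replacement
set.seed(42)                                   # arbitrary seed
random_rows <- sample(nrow(population), size = 20)
probability <- population[random_rows, ]

cat("Population mean: ", mean(population$Sepal.Length), "\n")
cat("Convenience mean:", mean(convenience$Sepal.Length), "\n")
cat("Probability mean:", mean(probability$Sepal.Length), "\n")

# The convenience sample contains a single species
table(convenience$Species)
```
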
---
# 3. Simple Random Sampling (SRS)
In SRS, every possible sample of size $n$ has an equal probability of being selected.
### Mathematical Foundation:
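Each unit has the same chance of being chosen: $P(i) = \frac{1}{N}$ on any single draw, and an inclusion probability of $\pi_i = \frac{n}{N}$ when $n$ units are drawn without replacement.
The sample mean $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is an unbiased estimator of the population mean $\bar{Y}$ (written $\mu$ in Section 1).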
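
The unbiasedness claim can be checked with a short derivation. This is a sketch assuming SRS without replacement; the inclusion indicator $I_i$ (equal to 1 if unit $i$ is in the sample and 0 otherwise) is notation introduced here for illustration.

$$
\bar{y} = \frac{1}{n}\sum_{i=1}^{N} I_i\, y_i, \qquad E[I_i] = \frac{n}{N}
$$

$$
E[\bar{y}] = \frac{1}{n}\sum_{i=1}^{N} \frac{n}{N}\, y_i = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar{Y}
$$

Under the same assumptions, the sampling variance of the mean is

$$
\operatorname{Var}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}, \qquad S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - \bar{Y}\right)^2,
$$

where the factor $1 - n/N$ is the finite population correction.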
### R Example:
We will use the built-in `iris` dataset as our population ($N=150$).
``` r
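library(dplyr)   # provides %>%, slice_sample(), group_by(), mutate(), filter()
library(survey)  # provides svydesign(), svymean(), svytotal() used later

set.seed(123) # For reproducibility
# Define Population
population <- iris
# Draw a Simple Random Sample of n=20
n <- 20
srs_sample <- population %>%
  slice_sample(n = n)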
# Compare Means
cat("Population Mean (Sepal.Length):", mean(population$Sepal.Length), "\n")
## Population Mean (Sepal.Length): 5.843333
cat("Sample Mean (Sepal.Length):", mean(srs_sample$Sepal.Length), "\n")
## Sample Mean (Sepal.Length): 5.795
```

---

# 4. Stratified Sampling
The population is divided into non-overlapping groups called **strata** based on a characteristic (e.g., Gender, Species). A random sample is then drawn independently from each stratum.

**Why?** To ensure representation of small subgroups and to reduce the variance of the estimate.

### R Example:
We will sample 5 plants from each of the 3 species in `iris`.
``` r
# Stratified Sampling: 5 from each species
strat_sample <- population %>%
  group_by(Species) %>%
  slice_sample(n = 5) %>%
  ungroup()
# View counts per stratum
table(strat_sample$Species)
##
## setosa versicolor virginica
## 5 5 5
# Estimate the mean with a design-based estimator (required for stratified designs).
# In R, the 'survey' package is the standard tool for this.
design_strat <- svydesign(id = ~1,
                          strata = ~Species,
                          data = strat_sample,
                          fpc = rep(50, nrow(strat_sample))) # stratum population size: 150 total / 3 strata = 50
svymean(~Sepal.Length, design_strat)
## mean SE
## Sepal.Length 5.7867 0.1891
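
# --- Hand check (sketch): the stratified estimator behind svymean() ---
# A minimal base-R sketch, not part of the original example. It redraws its own
# sample, so its numbers will differ from the svymean() output above.
# Names such as W_h, ybar_h and se_st are illustrative.
# Estimator: ybar_st = sum_h W_h * ybar_h, with W_h = N_h / N.
set.seed(2024)                                   # arbitrary seed
N_h <- as.numeric(table(iris$Species))           # stratum sizes (50, 50, 50)
W_h <- N_h / sum(N_h)                            # stratum weights
n_h <- 5                                         # sample size per stratum
rows_by_stratum <- split(seq_len(nrow(iris)), iris$Species)
idx <- unlist(lapply(rows_by_stratum, function(rows) sample(rows, n_h)))
strat <- iris[idx, ]
ybar_h <- as.numeric(tapply(strat$Sepal.Length, strat$Species, mean))  # stratum means
s2_h   <- as.numeric(tapply(strat$Sepal.Length, strat$Species, var))   # stratum variances
ybar_st <- sum(W_h * ybar_h)                                  # weighted mean
se_st   <- sqrt(sum(W_h^2 * (1 - n_h / N_h) * s2_h / n_h))    # SE with fpc per stratum
cat("Stratified mean:", ybar_st, " SE:", se_st, "\n")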
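```

---

# 5. Systematic Sampling
In systematic sampling, we select a starting point at random between 1 and $k$ and then take every $k^{\text{th}}$ element from the list, where $k = N/n$ is the sampling interval (rounded up when $N/n$ is not a whole number).

### R Example:
``` r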
N <- nrow(population)
n <- 30
k <- ceiling(N / n)
# Choose a random start between 1 and k
start <- sample(1:k, 1)
# Select indices
indices <- seq(start, N, by = k)
# Create sample
sys_sample <- population[indices, ]
head(sys_sample)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
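
# --- Sketch: why periodic lists are risky for systematic sampling ---
# Hypothetical population (not from the notes): values repeat with period 5,
# which equals the sampling interval. Every systematic sample then contains
# one repeated value, so the random start alone decides the estimate.
N_demo <- 150
k_demo <- 5
periodic <- rep(c(10, 20, 30, 40, 50), times = N_demo / k_demo)
means_by_start <- sapply(1:k_demo, function(s) {
  mean(periodic[seq(s, N_demo, by = k_demo)])
})
means_by_start   # one estimate per possible start: 10 20 30 40 50
mean(periodic)   # true population mean: 30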
```

---

# 6. Cluster Sampling
The population is divided into clusters (e.g., schools, neighborhoods). We randomly select entire clusters and then observe every unit within the selected clusters.

**Why?** It is often cheaper or more logistically feasible than SRS.

### R Example:
Let’s pretend we have 10 “Gardens” (clusters) and we sample 2 Gardens.
``` r
# Create dummy cluster IDs (10 gardens of 15 plants each)
population_clusters <- population %>%
  mutate(garden_id = rep(1:10, each = 15))
# Randomly select 2 Garden IDs
selected_gardens <- sample(1:10, 2)
# Extract all plants from those gardens
cluster_sample <- population_clusters %>%
  filter(garden_id %in% selected_gardens)
# Check sample
unique(cluster_sample$garden_id)
## [1] 2 5
nrow(cluster_sample)
## [1] 30
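
# --- Sketch: estimating the mean from a one-stage cluster sample ---
# A minimal base-R sketch, not part of the original example. It redraws its own
# gardens, and the names M, m, total_hat and mean_hat are illustrative.
# Estimator of the total: (M / m) * sum of the sampled cluster totals.
set.seed(7)                                      # arbitrary seed
garden <- rep(1:10, each = 15)                   # same dummy cluster IDs as above
M <- 10                                          # clusters in the population
m <- 2                                           # clusters sampled
chosen <- sample(1:M, m)
in_sample <- garden %in% chosen
cluster_totals <- tapply(iris$Sepal.Length[in_sample], garden[in_sample], sum)
total_hat <- (M / m) * sum(cluster_totals)       # estimated population total
mean_hat  <- total_hat / nrow(iris)              # estimated population mean
cat("Estimated total:", total_hat, " Estimated mean:", mean_hat, "\n")
# With equal-sized clusters, mean_hat equals the plain mean of the sampled plants.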
```

---

# 7. Analysis with the `survey` Package
Real-world survey analysis requires the `survey` package to account for sampling weights. A weight $w_i = 1/\pi_i$ (where $\pi_i$ is the inclusion probability) represents how many population units a single sample unit “stands for.”

### The `svydesign` Object
This is the core of sampling analysis in R.
``` r
# 1. Define the design (e.g., SRS)
# fpc: the population size (N = 150) for every sampled row; survey uses it
# to apply the finite population correction
my_design <- svydesign(id = ~1,
                       data = srs_sample,
                       fpc = rep(150, nrow(srs_sample)))
# 2. Calculate Mean and Standard Error
mean_est <- svymean(~Sepal.Length, my_design)
print(mean_est)
## mean SE
## Sepal.Length 5.795 0.1878
# 3. Calculate Population Total
total_est <- svytotal(~Sepal.Length, my_design)
print(total_est)
## total SE
## Sepal.Length 869.25 28.175
# 4. 95% Confidence Interval
confint(mean_est)
## 2.5 % 97.5 %
## Sepal.Length 5.426854 6.163146
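
# --- Hand check (sketch): weights, SE and confidence interval under SRS ---
# A minimal base-R sketch of the textbook SRS formulas, assuming sampling
# without replacement. It redraws its own sample, so the numbers differ from
# the output above. Names such as N_pop, n_srs and w are illustrative.
set.seed(99)                                     # arbitrary seed
N_pop <- nrow(iris)                              # population size (150)
n_srs <- 20                                      # sample size
y <- iris$Sepal.Length[sample(N_pop, n_srs)]
pi_i <- n_srs / N_pop                            # inclusion probability
w <- rep(1 / pi_i, n_srs)                        # each plant "stands for" 7.5 plants
mean_hat  <- sum(w * y) / sum(w)                 # weighted mean
total_hat <- sum(w * y)                          # estimated population total
se_mean   <- sqrt((1 - n_srs / N_pop) * var(y) / n_srs)  # SE with fpc
se_total  <- N_pop * se_mean
ci <- mean_hat + c(-1, 1) * qnorm(0.975) * se_mean       # normal-based 95% CI
cat("Mean:", mean_hat, " SE:", se_mean, "\n")
cat("Total:", total_hat, " SE:", se_total, "\n")
cat("95% CI:", ci, "\n")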
```

---

# 8. Comparison of Sampling Methods

| Method | Best When… | Pros | Cons |
|---|---|---|---|
| SRS | Population is homogeneous | Simplest math, unbiased | High cost if population is spread out |
| Stratified | Subgroups differ significantly | More precise than SRS when strata differ | Needs stratum information for the whole population |
| Systematic | Population list has no patterns | Fast and easy | Unreliable if the list has periodic patterns |
| Cluster | Sampling units are geographically clustered | Low cost, logistically efficient | Higher variance (less precise) |

---

# 9. Exercise
Using the built-in `mtcars` dataset, perform a stratified sample based on the number of cylinders (`cyl`).
``` r
# Visualizing Sampling Distribution Example
means <- replicate(1000, {
  s <- sample(iris$Sepal.Length, 20)
  mean(s)
})
hist(means, main="Sampling Distribution of Mean", col="skyblue", border="white")
abline(v=mean(iris$Sepal.Length), col="red", lwd=2)
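
# --- Numerical check (sketch) of the simulated sampling distribution ---
# Assumes SRS without replacement, which is what sample() does by default.
# The spread of the simulated means should be close to the theoretical
# standard error sqrt((1 - n/N) * S^2 / n). Names here are illustrative.
set.seed(2026)                                   # arbitrary seed
y_all <- iris$Sepal.Length
N_all <- length(y_all)
n_sim <- 20
sim_means <- replicate(5000, mean(sample(y_all, n_sim)))
theory_se <- sqrt((1 - n_sim / N_all) * var(y_all) / n_sim)
cat("Mean of sample means:", mean(sim_means), "\n")
cat("Population mean:     ", mean(y_all), "\n")
cat("SD of sample means:  ", sd(sim_means), "\n")
cat("Theoretical SE:      ", theory_se, "\n")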
```

End of Lecture Notes