This is a comprehensive, detailed lecture note for Foundations of Sampling Theory, designed as an R Markdown (.Rmd) document. You can copy the code block below directly into a new R Markdown file in RStudio.

How to use this:

  1. Open RStudio.
  2. Go to File -> New File -> R Markdown.
  3. Delete everything in the new file and paste the code below.
  4. Click the Knit button (top of the script editor) to generate the HTML lecture handout.

---
title: "Foundations of Sampling Theory: A Practical Introduction with R"
author: "Lecture Notes"
date: "2026-04-23"
output: 
  html_document:
    toc: true
    toc_float: true
    theme: cosmo
    highlight: tango
---



# 1. Introduction to Sampling Theory

Sampling theory is the study of the relationship between a **population** and the **samples** drawn from it. The goal is to make valid inferences about the population without having to conduct a full census.

### Key Definitions:
*   **Population (size $N$):** The complete set of items or individuals under study.
*   **Sample (size $n$):** A subset of the population selected for observation.
*   **Sampling Frame:** A list of all units in the population from which the sample is drawn.
*   **Parameter:** A numerical characteristic of the population (e.g., population mean $\mu$).
*   **Statistic:** A numerical characteristic of the sample (e.g., sample mean $\bar{x}$), used to estimate the parameter.

---

# 2. Probability vs. Non-Probability Sampling

Sampling methods are broadly categorized into two types:

1.  **Probability Sampling:** Every element in the population has a known, non-zero probability of being selected. This allows for the calculation of sampling error and statistical inference.
2.  **Non-Probability Sampling:** Selection is based on convenience or judgment (e.g., "man on the street" interviews). This is easier but cannot be used for formal statistical generalization.
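
To see why the distinction matters, here is a minimal sketch that uses the built-in `iris` data as a stand-in population. Treating the first 20 rows as a "convenience" sample is an assumption made purely for illustration, and the variable names are illustrative as well.

``` r
# Treat iris as the population (an assumption for illustration)
pop_y <- iris$Sepal.Length
true_mean <- mean(pop_y)

# Convenience sample: whatever comes first in the file (rows 1-20, all setosa)
convenience_mean <- mean(pop_y[1:20])

# Probability sample: 20 values drawn at random without replacement
set.seed(1)
random_mean <- mean(sample(pop_y, 20))

round(c(population = true_mean,
        convenience = convenience_mean,
        random = random_mean), 3)
```

The convenience mean sits well below the population mean because the first rows contain only one species. The random sample has no such built-in tilt, and its error can be quantified with the tools introduced later.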

---

# 3. Simple Random Sampling (SRS)

In SRS, every possible sample of size $n$ has an equal probability of being selected.

### Mathematical Foundation:
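On the first draw every unit has selection probability $\frac{1}{N}$; over the whole without-replacement sample, each unit's inclusion probability is $\pi_i = \frac{n}{N}$.
The sample mean $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is an unbiased estimator of the population mean $\bar{Y}$ (the parameter written $\mu$ in Section 1).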
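
Both properties can be checked by brute force on a tiny population. The sketch below assumes a hypothetical five-unit population `y_toy` and enumerates every possible sample of size 2 with `combn()`; it is an illustration only, not part of the lecture's own code.

``` r
# Hypothetical toy population (values chosen only for illustration)
y_toy <- c(3, 7, 8, 12, 20)
N_toy <- length(y_toy)
n_toy <- 2

# Each column holds the unit indices of one possible sample of size n_toy
all_samples <- combn(N_toy, n_toy)

# Share of samples that contain unit i: the inclusion probability
incl_prob <- sapply(seq_len(N_toy),
                    function(i) mean(colSums(all_samples == i) > 0))
incl_prob                      # each equals n / N = 0.4

# Mean of every possible sample, then the average of those means
sample_means <- apply(all_samples, 2, function(idx) mean(y_toy[idx]))
mean(sample_means)             # equals mean(y_toy), i.e. 10
```

Because every sample is equally likely under SRS, averaging the sample means over all samples gives the expected value of $\bar{y}$, which coincides with the population mean.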

### R Example:
We will use the built-in `iris` dataset as our population ($N=150$).


``` r
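# Packages used throughout: dplyr for %>% and slice_sample(),
# survey for design-based estimates (svydesign, svymean, svytotal)
library(dplyr)
library(survey)

set.seed(123) # For reproducibility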

# Define Population
population <- iris

# Draw a Simple Random Sample of n=20
n <- 20
srs_sample <- population %>%
  slice_sample(n = n)

# Compare Means
cat("Population Mean (Sepal.Length):", mean(population$Sepal.Length), "\n")
## Population Mean (Sepal.Length): 5.843333
cat("Sample Mean (Sepal.Length):", mean(srs_sample$Sepal.Length), "\n")
## Sample Mean (Sepal.Length): 5.795

```

---

# 4. Stratified Random Sampling

The population is divided into non-overlapping groups called **strata** based on a characteristic (e.g., Gender, Species). A random sample is then drawn from each stratum.

**Why?** To ensure representation of small subgroups and to reduce the variance of the estimate.

### R Example:
We will sample 5 plants from each of the 3 species in `iris`.

``` r
# Stratified Sampling: 5 from each species
strat_sample <- population %>%
  group_by(Species) %>%
  slice_sample(n = 5) %>%
  ungroup()

# View counts per stratum
table(strat_sample$Species)
## 
##     setosa versicolor  virginica 
##          5          5          5
# Estimate Mean using weights (Required for Stratified Designs)
# In R, the 'survey' package is the gold standard for this.
design_strat <- svydesign(id = ~1, 
                          strata = ~Species, 
                          data = strat_sample, 
                          fpc = rep(50, 15)) # 150 total / 3 strata = 50 per stratum

svymean(~Sepal.Length, design_strat)
##                mean     SE
## Sepal.Length 5.7867 0.1891
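
# Sketch: the stratified mean computed by hand (base R, no random draws).
# A minimal illustration, assuming the textbook estimator
#   ybar_st = sum over strata h of (N_h / N) * ybar_h.
# `stratified_mean` and the deterministic demo rows (first 5 of each species)
# are hypothetical helpers for illustration, not part of the lecture code.
stratified_mean <- function(y, stratum, N_h) {
  ybar_h <- tapply(y, stratum, mean)                  # per-stratum sample means
  W_h <- as.vector(N_h[names(ybar_h)]) / sum(N_h)     # stratum weights N_h / N
  sum(W_h * as.vector(ybar_h))
}

demo_rows <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), head, 5))
demo <- iris[demo_rows, ]
N_h <- table(iris$Species)                            # stratum sizes (50 each)

stratified_mean(demo$Sepal.Length, demo$Species, N_h)
# With equal stratum sizes and equal allocation this matches the plain mean
# of the demo rows; with unequal strata the two would generally differ.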

```

---

# 5. Systematic Sampling

In systematic sampling, we select a starting point at random among the first $k$ units and then take every $k^{th}$ element from the list, where $k = N/n$ (rounded up when $N/n$ is not an integer).

### R Example:

``` r
N <- nrow(population)
n <- 30
k <- ceiling(N / n)

# Choose a random start between 1 and k
start <- sample(1:k, 1)

# Select indices
indices <- seq(start, N, by = k)

# Create sample
sys_sample <- population[indices, ]

head(sys_sample)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
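
# Sketch: enumerating every possible systematic sample (no random draws).
# With N = 150 and k = 5 there are only k possible samples, one per start.
# The names `k_all` and `sys_means` are illustrative and chosen so they do
# not overwrite the objects created above.
k_all <- 5
sys_means <- sapply(seq_len(k_all), function(s) {
  idx <- seq(s, nrow(iris), by = k_all)   # rows s, s + k, s + 2k, ...
  mean(iris$Sepal.Length[idx])
})
sys_means                  # one sample mean per possible start
mean(sys_means)            # averages back to the population mean here,
mean(iris$Sepal.Length)    # because N is an exact multiple of k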

```

---

# 6. Cluster Sampling

The population is divided into clusters (e.g., schools, neighborhoods). We randomly select entire clusters and then observe every unit within the selected clusters (one-stage cluster sampling).

**Why?** It is often cheaper or more logistically feasible than SRS.

### R Example:
Let's pretend the 150 plants grow in 10 "Gardens" (clusters) of 15 plants each, and we sample 2 Gardens.

``` r
# Create dummy cluster IDs
population_clusters <- population %>%
  mutate(garden_id = rep(1:10, each = 15))

# Randomly select 2 Garden IDs
selected_gardens <- sample(1:10, 2)

# Extract all plants from those gardens
cluster_sample <- population_clusters %>%
  filter(garden_id %in% selected_gardens)

# Check sample
unique(cluster_sample$garden_id)
## [1] 2 5
nrow(cluster_sample)
## [1] 30
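
# Sketch: estimating a population total from whole clusters (no random draws).
# Assumes the standard one-stage estimator
#   total_hat = (M / m) * sum of the sampled cluster totals,
# with M = 10 gardens in the population and m = 2 sampled. Gardens 2 and 5
# are taken from the printed output above; all names here are illustrative.
garden_of_row <- rep(1:10, each = 15)
garden_totals <- tapply(iris$Sepal.Length, garden_of_row, sum)

chosen <- c(2, 5)
total_hat <- (10 / length(chosen)) * sum(garden_totals[chosen])
total_hat                  # estimated population total of Sepal.Length
total_hat / nrow(iris)     # implied estimate of the population mean

# Averaging the estimator over all 45 possible pairs of gardens
# recovers the true total, which is what unbiasedness means here.
all_pairs <- combn(10, 2)
all_hat <- apply(all_pairs, 2, function(g) (10 / 2) * sum(garden_totals[g]))
mean(all_hat)
sum(iris$Sepal.Length)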

```

---

# 7. Estimating Totals and Variances with {survey}

Real-world survey analysis has to account for sampling weights, and the `survey` package is built for that. A weight $w_i = 1/\pi_i$ (where $\pi_i$ is the inclusion probability) represents how many population units a single sample unit "stands for."

### The `svydesign` object
A `svydesign` object bundles the data with the design information (strata, clusters, weights, finite population correction) that the estimation functions need.

``` r
# 1. Define the design (e.g., SRS)
# fpc = Finite Population Correction
my_design <- svydesign(id = ~1, 
                       data = srs_sample, 
                       fpc = rep(150, nrow(srs_sample)))

# 2. Calculate Mean and Standard Error
mean_est <- svymean(~Sepal.Length, my_design)
print(mean_est)
##               mean     SE
## Sepal.Length 5.795 0.1878
# 3. Calculate Population Total
total_est <- svytotal(~Sepal.Length, my_design)
print(total_est)
##               total     SE
## Sepal.Length 869.25 28.175
# 4. 95% Confidence Interval
confint(mean_est)
##                 2.5 %   97.5 %
## Sepal.Length 5.426854 6.163146
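
# Sketch: the SRS standard error by hand (base R, no random draws).
# Assumes the textbook formula SE(ybar) = sqrt((1 - n/N) * s^2 / n), which
# should agree with the SRS design above; `srs_mean_se` is an illustrative
# helper name, not a function from the survey package.
srs_mean_se <- function(y, N) {
  n_obs <- length(y)
  sqrt((1 - n_obs / N) * var(y) / n_obs)   # (1 - n/N) is the fpc
}

srs_mean_se(c(2, 4, 6, 8), N = 8)     # half the population sampled
srs_mean_se(c(2, 4, 6, 8), N = 4)     # a census: no sampling error, SE is 0
srs_mean_se(c(2, 4, 6, 8), N = Inf)   # fpc vanishes: the usual s / sqrt(n)

# Consistency checks on the printed output above
150 * 0.1878                              # total SE = N * mean SE, about 28.17
5.795 + c(-1, 1) * qnorm(0.975) * 0.1878  # normal-theory 95% interval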

```

---

# 8. Summary Comparison

| Method | Best When... | Pros | Cons |
|---|---|---|---|
| SRS | Population is homogeneous | Simplest math, unbiased | High cost if pop. is spread out |
| Stratified | Subgroups vary significantly | More precise estimates when strata are internally similar | Needs info on strata for whole pop. |
| Systematic | Pop. list has no patterns | Fast and easy | Bias if list has periodic patterns |
| Cluster | Sampling units are geographically clustered | Low cost per observation | Higher variance (less precise) |

---

# Exercises for Students

1.  **Modify the SRS code:** Change the sample size $n$ to 50, rebuild the design, and observe what happens to the Standard Error.
2.  **Stratified Sampling:** Using the `mtcars` dataset, perform a stratified sample based on the number of cylinders (`cyl`).
3.  **Visualizing the Sampling Distribution:** Draw 1000 different SRS samples, calculate their means, and plot a histogram to show the "Sampling Distribution of the Mean."

``` r
# Visualizing Sampling Distribution Example
means <- replicate(1000, {
  s <- sample(iris$Sepal.Length, 20)
  mean(s)
})
hist(means, main="Sampling Distribution of Mean", col="skyblue", border="white")
abline(v=mean(iris$Sepal.Length), col="red", lwd=2)

```

End of Lecture Notes