1. Core Concepts

Population vs. Sample

  • Population: The entire group you want to draw conclusions about (e.g., all adults in the USA).
  • Sample: A specific group (subset) that you will collect data from.
  • Parameter: A numerical summary of the population (e.g., Population Mean \(\mu\)).
  • Statistic: A numerical summary of the sample (e.g., Sample Mean \(\bar{x}\)).
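
To make the parameter/statistic distinction concrete, here is a minimal base R sketch. The simulated `population` vector, its size, and the sample size of 200 are assumptions chosen for illustration only; they are not part of the lecture's dataset.

```r
# Hypothetical population: 10,000 simulated incomes (illustrative values only)
set.seed(1)
population <- rlnorm(10000, meanlog = 10, sdlog = 0.5)

# Parameter: computed from every member of the population
mu <- mean(population)

# Statistic: computed from a sample of 200 members
sample_values <- sample(population, size = 200)
x_bar <- mean(sample_values)

cat("Parameter (mu):   ", round(mu, 2), "\n")
cat("Statistic (x_bar):", round(x_bar, 2), "\n")
```

The parameter is fixed for a given population, whereas the statistic changes every time a new sample is drawn.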

Why Sample?

  1. Cost: Collecting data from everyone is expensive.
  2. Time: Sampling is faster.
  3. Feasibility: Sometimes measuring the whole population is impossible (e.g., testing the lifespan of every lightbulb produced).

2. The Dataset: “The Population”

For this lecture, we will use the diamonds dataset from the ggplot2 package. We will treat these 53,940 diamonds as our entire Population.

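# Load packages and data
library(tidyverse)   # dplyr and ggplot2 (ggplot2 provides the diamonds dataset)
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

population_data <- diamonds %>% select(carat, cut, color, price)

# Calculate Population Parameters
pop_mean_price <- mean(population_data$price)
# Note: sd() divides by N - 1; with N = 53,940 the result is practically
# identical to the population formula, which divides by N
pop_sd_price <- sd(population_data$price)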

print(paste("Population Size:", nrow(population_data)))
## [1] "Population Size: 53940"
print(paste("Population Mean Price:", round(pop_mean_price, 2)))
## [1] "Population Mean Price: 3932.8"
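
One detail worth checking when treating a dataset as a full population: R's `sd()` uses the sample formula (dividing by \(N - 1\)), not the population formula (dividing by \(N\)). The sketch below shows the relationship on a tiny vector; the five `prices` are made-up values used only for the arithmetic.

```r
# Hypothetical toy population of five prices
prices <- c(300, 450, 500, 800, 1200)
N <- length(prices)

# sd() uses the sample formula (divides by N - 1)
sample_formula_sd <- sd(prices)

# Population formula (divides by N)
population_sd <- sqrt(mean((prices - mean(prices))^2))

# The two differ by a factor of sqrt((N - 1) / N)
all.equal(population_sd, sample_formula_sd * sqrt((N - 1) / N))  # TRUE
```

For a population as large as the 53,940 diamonds, the factor \(\sqrt{(N-1)/N}\) is so close to 1 that either formula gives essentially the same value.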

3. Probability Sampling Methods

3.1 Simple Random Sampling (SRS)

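Every member of the population has an equal chance of being selected; more precisely, every possible sample of size n is equally likely. In R, we use slice_sample(), which samples without replacement by default.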

set.seed(123) # For reproducibility

# Take a sample of 500 diamonds
srs_sample <- population_data %>%
  slice_sample(n = 500)

mean(srs_sample$price)
## [1] 3922.04
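
The "equal chance" property can be checked by simulation. The following base R sketch assumes a hypothetical population of 20 labelled units and repeatedly draws simple random samples of 5; under SRS each unit should appear in roughly \(n/N = 0.25\) of the samples. The values of `N`, `n`, and `reps` are illustrative choices.

```r
set.seed(2024)
N <- 20       # hypothetical population of 20 labelled units
n <- 5        # sample size
reps <- 10000 # number of repeated samples

# Draw many simple random samples without replacement and count
# how often each unit is selected
counts <- integer(N)
for (i in seq_len(reps)) {
  chosen <- sample.int(N, size = n)
  counts[chosen] <- counts[chosen] + 1L
}

inclusion_freq <- counts / reps
round(inclusion_freq, 3)  # each value should be close to n / N = 0.25
```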

3.2 Stratified Sampling

The population is divided into subgroups (strata) based on a characteristic (e.g., cut), and a random sample is taken from each. This ensures all groups are represented. Taking the same fraction from every stratum, as below, is called proportional allocation.

stratified_sample <- population_data %>%
  group_by(cut) %>%
  slice_sample(prop = 0.01) # Take 1% from each cut category

# Check counts per group
stratified_sample %>% count(cut)
## # A tibble: 5 × 2
## # Groups:   cut [5]
##   cut           n
##   <ord>     <int>
## 1 Fair         16
## 2 Good         49
## 3 Very Good   120
## 4 Premium     137
## 5 Ideal       215
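
When a fixed number of units is drawn from each stratum instead of a fixed proportion, small strata are over-represented and the plain sample mean is no longer a fair estimate. A common remedy is to weight each stratum mean by its population share. The sketch below illustrates this in base R on a simulated population; the strata `A`, `B`, `C`, their sizes, and their price levels are all hypothetical.

```r
set.seed(7)
# Hypothetical population: three strata of very different sizes and price levels
pop <- data.frame(
  stratum = rep(c("A", "B", "C"), times = c(500, 3000, 6500)),
  price   = c(rnorm(500, 8000, 1500), rnorm(3000, 4000, 1000), rnorm(6500, 2000, 500))
)

# Draw 100 units from every stratum (equal, not proportional, allocation)
by_stratum <- split(pop, pop$stratum)
strat_sample <- do.call(rbind, lapply(by_stratum, function(d) d[sample(nrow(d), 100), ]))

# Population share of each stratum (the weights)
stratum_sizes <- tapply(pop$price, pop$stratum, length)
weights <- stratum_sizes / nrow(pop)

# Per-stratum sample means, in the same stratum order as the weights
stratum_means <- tapply(strat_sample$price, strat_sample$stratum, mean)

naive_mean    <- mean(strat_sample$price)   # over-represents the small strata
weighted_mean <- sum(weights * stratum_means)

c(population = mean(pop$price), naive = naive_mean, weighted = weighted_mean)
```

With proportional allocation, as in the diamonds example, the weights are already built into the sample, so the plain sample mean and the weighted mean essentially coincide.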

3.3 Systematic Sampling

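Select every \(k^{th}\) individual from an ordered list, where the interval \(k\) is the population size divided by the desired sample size (rounded up here). The example starts at row 1 for simplicity; in practice the starting point is usually chosen at random between 1 and \(k\).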

# Determine interval k
n <- 500
N <- nrow(population_data)
k <- ceiling(N / n)

# Select indices
indices <- seq(from = 1, to = N, by = k)
systematic_sample <- population_data[indices, ]

head(systematic_sample)
## # A tibble: 6 × 4
##   carat cut       color price
##   <dbl> <ord>     <ord> <int>
## 1  0.23 Ideal     E       326
## 2  0.81 Ideal     F      2761
## 3  0.77 Ideal     H      2781
## 4  1    Premium   J      2801
## 5  0.55 Very Good D      2815
## 6  0.7  Ideal     H      2827
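
The code above always starts at row 1, so the first row is certain to be selected and rows 2 through k never are. A common variant picks the starting row at random. The helper `systematic_indices()` below is a hypothetical name introduced for this sketch, written in base R.

```r
# Systematic sampling with a random start (hypothetical helper)
systematic_indices <- function(N, n) {
  k <- ceiling(N / n)          # sampling interval
  start <- sample.int(k, 1)    # random start between 1 and k
  seq(from = start, to = N, by = k)
}

set.seed(42)
idx <- systematic_indices(N = 53940, n = 500)

length(idx)   # n or n - 1, depending on the start
head(idx)
```

Because k is rounded up, the number of selected rows can fall one short of n when the random start is late in the first interval.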

4. Sampling Bias and Error

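Sampling Error is the difference between the sample statistic and the population parameter that arises purely from the randomness of which units are drawn. Sampling Bias, by contrast, is a systematic error caused by a selection procedure that favors some members of the population over others; it does not shrink as the sample grows. Let’s see how sample size affects sampling error. Each row in the table is a single random draw, so the error does not fall steadily from one row to the next, but its typical magnitude decreases as the sample size increases.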

# Function to get mean from various sample sizes
get_sample_mean <- function(size) {
  population_data %>% 
    slice_sample(n = size) %>% 
    summarize(m = mean(price)) %>% 
    pull(m)
}

sizes <- c(10, 50, 100, 500, 1000, 5000)
results <- data.frame(
  sample_size = sizes,
  sample_mean = sapply(sizes, get_sample_mean)
) %>%
  mutate(error = sample_mean - pop_mean_price)

kable(results) %>% kable_styling(full_width = FALSE)
sample_size  sample_mean        error
         10     2685.700  -1247.09972
         50     4457.780    524.98028
        100     3838.450    -94.34972
        500     3855.658    -77.14172
       1000     3842.328    -90.47172
       5000     3961.164     28.36468
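
A single draw per sample size gives a noisy picture. Repeating the draw many times shows the typical size of the error, which should track the standard error \(\sigma/\sqrt{n}\). The sketch below uses base R and a simulated right-skewed population as a stand-in; the exponential distribution, its scale, and the choice of 300 repetitions are assumptions made for illustration.

```r
set.seed(99)
# Hypothetical right-skewed population (simulated, not the diamonds data)
population <- rexp(50000, rate = 1 / 4000)
sigma <- sqrt(mean((population - mean(population))^2))

sizes <- c(10, 100, 1000)
reps  <- 300

# For each sample size, draw `reps` samples and record the spread of their means
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(reps, mean(sample(population, n))))
})

se_table <- data.frame(
  sample_size    = sizes,
  empirical_se   = round(empirical_se, 1),
  theoretical_se = round(sigma / sqrt(sizes), 1)
)
se_table
```

Multiplying the sample size by 100 cuts the typical error by a factor of about 10, which is why very large samples bring diminishing returns.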

5. The Central Limit Theorem (CLT)

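The CLT states that when the sample size \(n\) is large enough, the distribution of the sample mean across repeated samples is approximately normal, with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\), regardless of the shape of the population’s distribution (provided its variance is finite). Taking many samples does not make the means more normal; it only lets us see their distribution clearly.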

Step 1: Distribution of the Population

The price of diamonds is strongly right-skewed: most diamonds are relatively inexpensive, with a long tail of very expensive stones.

p1 <- ggplot(population_data, aes(x = price)) +
  geom_histogram(fill = "steelblue", bins = 50) +
  labs(title = "Population Distribution (Skewed)", x = "Price", y = "Count")
p1

Step 2: Distribution of Sample Means

We will take 1,000 different samples (size n=100) and plot their means.

# Simulation: Take 1000 samples of size 100
samples_1000 <- replicate(1000, {
  population_data %>% 
    slice_sample(n = 100) %>% 
    summarize(m = mean(price)) %>% 
    pull(m)
})

sample_means_df <- data.frame(mean_price = samples_1000)

p2 <- ggplot(sample_means_df, aes(x = mean_price)) +
  geom_histogram(fill = "darkorange", color = "white", bins = 30) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_vline(xintercept = pop_mean_price, color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of 1,000 Sample Means (Approximately Normal)",
       subtitle = "The red line is the true Population Mean",
       x = "Mean Price", y = "Frequency")

p2
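
The histogram can be backed up with a number. One simple check is skewness: it is large for a skewed population and should move toward 0 for sample means as the sample size grows. The sketch below is base R only; the `skewness()` helper is defined here for illustration, and the simulated exponential population is an assumed stand-in for the price data.

```r
# Sample skewness (third standardized moment); helper defined for illustration
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(2)
population <- rexp(50000, rate = 1 / 4000)  # hypothetical skewed population

# Skewness of 1,000 sample means for several sample sizes
sizes <- c(2, 10, 100)
skew_of_means <- sapply(sizes, function(n) {
  skewness(replicate(1000, mean(sample(population, n))))
})

data.frame(
  distribution = c("population", paste0("means, n = ", sizes)),
  skewness     = round(c(skewness(population), skew_of_means), 2)
)
```

The key driver is the size of each sample, not the number of samples: the 1,000 repetitions only make the shape of the sampling distribution visible.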


6. Visualization: Sample vs. Population

Does a sample actually “look” like the population? Let’s compare a 5% sample to the full population using Carat vs. Price.

small_sample <- population_data %>% slice_sample(prop = 0.05)

ggplot() +
  geom_point(data = population_data, aes(x = carat, y = price), alpha = 0.1, color = "grey") +
  geom_point(data = small_sample, aes(x = carat, y = price), color = "red", alpha = 0.5) +
  theme_minimal() +
  labs(title = "Population (Grey) vs. 5% Sample (Red)",
       caption = "Visualizing representation in sampling")
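
A plot gives a visual impression; comparing quantiles gives a numeric one. The sketch below assumes a simulated log-normal population in place of the diamonds data and compares five quantiles of a 5% sample against the population. The distribution parameters and the chosen quantile levels are illustrative assumptions.

```r
set.seed(11)
# Hypothetical population (simulated); a real price column could be substituted
population <- rlnorm(50000, meanlog = 8, sdlog = 0.9)
small_sample <- sample(population, size = round(0.05 * length(population)))

probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)
comparison <- data.frame(
  quantile   = probs,
  population = quantile(population, probs),
  sample     = quantile(small_sample, probs)
)

# Relative difference between sample and population quantiles
comparison$relative_diff <- comparison$sample / comparison$population - 1
round(comparison, 3)
```

Small relative differences across the whole range suggest the sample reproduces the shape of the population, not just its mean.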


7. Summary Table

Method          When to use                                                R function
Simple Random   When the population is homogeneous.                        slice_sample(n = x)
Stratified      When you want to ensure subgroups are represented.         group_by() %>% slice_sample()
Systematic      When you have a sorted list or physical queue.             seq() indexing
Cluster         When the population is naturally grouped geographically.   filter(group %in% selected_clusters)
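
Cluster sampling appears in the table but is not demonstrated earlier, so here is a minimal base R sketch of a one-stage cluster sample: pick whole clusters at random, then keep every member of the chosen clusters. The `store_id` and `spend` columns and all the numbers are hypothetical; the `%in%` filter mirrors the expression in the table.

```r
set.seed(5)
# Hypothetical population: 40 stores (clusters), 25 customers each
population <- data.frame(
  store_id = rep(1:40, each = 25),
  spend    = round(rgamma(1000, shape = 2, scale = 30), 2)
)

# Step 1: randomly select 8 whole clusters
selected_clusters <- sample(unique(population$store_id), size = 8)

# Step 2: keep every member of the selected clusters
cluster_sample <- population[population$store_id %in% selected_clusters, ]

nrow(cluster_sample)                     # 8 clusters x 25 members = 200 rows
length(unique(cluster_sample$store_id))  # 8
```

Unlike stratified sampling, which draws from every group, cluster sampling leaves most groups out entirely, so it works best when clusters resemble one another.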

8. Lab Exercise

  1. Using the diamonds dataset, create a Stratified Sample based on the color column.
  2. Calculate the mean carat for that sample.
  3. Compare it to the population mean carat.
  4. Bonus: Write a loop that takes 500 samples of size 50 and plot the distribution of their standard deviations.

# Hint for Exercises 1 and 2
diamonds %>%
  group_by(color) %>%
  slice_sample(prop = 0.1) %>%
  ungroup() %>%
  summarize(mean_carat = mean(carat))


Key Takeaways in this Note:

  1. Practical Application: It uses the diamonds dataset, which has over 50,000 rows, making the concept of “sampling from a population” feel real.
  2. Tidyverse Focused: It uses slice_sample(), the current dplyr function for sampling rows, which supersedes the older sample_n() and sample_frac().
  3. Visual Proof: The Central Limit Theorem section visually demonstrates why we can trust samples even when population data is messy or skewed.
  4. Reproducibility: It includes set.seed(), which is a crucial lesson for students to ensure they get the same “random” results every time they run their code.