For this lecture, we will use the diamonds dataset from
the ggplot2 package. We will treat these 53,940
diamonds as our entire Population.
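# Load packages and data
library(tidyverse)   # dplyr verbs and ggplot2 (which provides diamonds)
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
population_data <- diamonds %>% select(carat, cut, color, price)
# Calculate population parameters
pop_mean_price <- mean(population_data$price)
# sd() divides by n - 1; with N = 53,940 the result is nearly identical to the population formula
pop_sd_price <- sd(population_data$price)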
print(paste("Population Size:", nrow(population_data)))
## [1] "Population Size: 53940"
print(paste("Population Mean Price:", round(pop_mean_price, 2)))
## [1] "Population Mean Price: 3932.8"
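Because we are treating the data as a full population, the standard deviation is strictly a parameter computed with N in the denominator, whereas `sd()` uses N - 1. The following is a minimal sketch of the difference; the variable names are illustrative and not part of the lecture code.

```r
library(ggplot2)  # provides the diamonds data

price <- diamonds$price
N <- length(price)

# Population standard deviation: divide by N
pop_sd_exact <- sqrt(mean((price - mean(price))^2))

# sd() uses the sample formula: divide by N - 1
sample_formula_sd <- sd(price)

# The two differ only by the factor sqrt((N - 1) / N)
c(population = pop_sd_exact, sample_formula = sample_formula_sd)
all.equal(pop_sd_exact, sample_formula_sd * sqrt((N - 1) / N))
```

With tens of thousands of rows the two values agree to several significant digits, so either is fine for this lecture.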
In simple random sampling, every member of the population has an equal chance of being selected.
In R, we use slice_sample() from dplyr.
set.seed(123) # For reproducibility
# Take a sample of 500 diamonds
srs_sample <- population_data %>%
  slice_sample(n = 500)
mean(srs_sample$price)
## [1] 3922.04
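To see what "equal chance" means in practice, here is a small base R simulation. It is a sketch on a toy population of 1,000 row indices (an assumption made to keep it fast), not the diamonds data: it draws many samples without replacement and records how often each row is selected.

```r
set.seed(42)

N <- 1000     # toy population size (assumed for illustration)
n <- 100      # sample size
reps <- 2000  # number of repeated samples

counts <- integer(N)
for (i in seq_len(reps)) {
  idx <- sample.int(N, n)          # simple random sample without replacement
  counts[idx] <- counts[idx] + 1L  # tally which rows were selected
}

# Each row should be included in roughly n / N = 10% of the samples
inclusion_rate <- counts / reps
summary(inclusion_rate)
```

Every row's inclusion rate should sit close to n / N, which is the defining property of a simple random sample.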
In stratified sampling, the population is divided into subgroups (strata)
based on a characteristic (e.g., cut), and a random sample
is taken from each stratum. This ensures that every group is represented.
stratified_sample <- population_data %>%
  group_by(cut) %>%
  slice_sample(prop = 0.01) # Take 1% from each cut category
# Check counts per group
stratified_sample %>% count(cut)
## # A tibble: 5 × 2
## # Groups:   cut [5]
##   cut           n
##   <ord>     <int>
## 1 Fair         16
## 2 Good         49
## 3 Very Good   120
## 4 Premium     137
## 5 Ideal       215
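Taking the same proportion from every stratum keeps each group's share of the sample close to its share of the population. The sketch below checks this with base R. It assumes that group sizes are rounded down, which is consistent with the counts shown above; the data frame `shares` is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data

pop_counts <- table(diamonds$cut)

# Assumption: 1% of each stratum, rounded down to a whole number of rows
expected_n <- floor(0.01 * pop_counts)

shares <- data.frame(
  cut          = names(pop_counts),
  pop_share    = as.vector(round(prop.table(pop_counts), 3)),
  expected_n   = as.vector(expected_n),
  sample_share = as.vector(round(expected_n / sum(expected_n), 3))
)
shares
```

The `pop_share` and `sample_share` columns should agree closely, which is what proportional allocation is meant to achieve.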
Systematic sampling selects every \(k^{th}\) individual from an ordered list, where \(k \approx N/n\).
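# Determine the sampling interval k
n <- 500
N <- nrow(population_data)
k <- ceiling(N / n)
# Select every k-th row index, starting from row 1
# (a random start between 1 and k is the more common choice)
indices <- seq(from = 1, to = N, by = k)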
systematic_sample <- population_data[indices, ]
head(systematic_sample)
## # A tibble: 6 × 4
##   carat cut       color price
##   <dbl> <ord>     <ord> <int>
## 1  0.23 Ideal     E       326
## 2  0.81 Ideal     F      2761
## 3  0.77 Ideal     H      2781
## 4  1    Premium   J      2801
## 5  0.55 Very Good D      2815
## 6  0.7  Ideal     H      2827
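The example above always starts at the first row. A common variant picks the starting position at random so that every row has a chance of being selected. The following is a minimal sketch of that variant; `start` is a name introduced here, and N is hard-coded to the population size stated earlier.

```r
set.seed(123)

N <- 53940            # population size used in this lecture
n <- 500              # target sample size
k <- ceiling(N / n)   # sampling interval

# Random start somewhere in the first interval
start <- sample.int(k, 1)

# Every k-th index from the random start
indices <- seq(from = start, to = N, by = k)

length(indices)  # n or n - 1, depending on where the start falls
head(indices)
```

Because N is not an exact multiple of k, the resulting sample can be one row short of the target size.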
Sampling error is the difference between a sample statistic and the population parameter it estimates. Let’s see how sample size affects this error.
# Function to draw one sample of a given size and return its mean price
get_sample_mean <- function(size) {
  population_data %>%
    slice_sample(n = size) %>%
    summarize(m = mean(price)) %>%
    pull(m)
}
sizes <- c(10, 50, 100, 500, 1000, 5000)
results <- data.frame(
  sample_size = sizes,
  sample_mean = sapply(sizes, get_sample_mean)
) %>%
  mutate(error = sample_mean - pop_mean_price)
kable(results) %>% kable_styling(full_width = FALSE)
| sample_size | sample_mean | error |
|---|---|---|
| 10 | 2685.700 | -1247.09972 |
| 50 | 4457.780 | 524.98028 |
| 100 | 3838.450 | -94.34972 |
| 500 | 3855.658 | -77.14172 |
| 1000 | 3842.328 | -90.47172 |
| 5000 | 3961.164 | 28.36468 |
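A single draw per sample size is noisy: in the table above, the sample of 1,000 happens to land further from the population mean than the sample of 500. The typical size of the error is described by the standard error, which shrinks with the square root of the sample size. The sketch below computes it for the same sizes, including the finite population correction for sampling without replacement; the names `sigma` and `se` are illustrative.

```r
library(ggplot2)  # provides the diamonds data

price <- diamonds$price
N <- length(price)

# Population standard deviation (divide by N)
sigma <- sqrt(mean((price - mean(price))^2))

sizes <- c(10, 50, 100, 500, 1000, 5000)

# Standard error of the sample mean, with the finite population
# correction for sampling without replacement
se <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

data.frame(sample_size = sizes, standard_error = round(se, 1))
```

The standard error falls steadily as the sample size grows, even though any one sample can be lucky or unlucky.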
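The Central Limit Theorem (CLT) states that when the sample size is large enough, the distribution of sample means across repeated samples is approximately normal, regardless of the shape of the population’s distribution (provided the population has finite variance).
The price of diamonds is strongly right-skewed, which makes it a good test case.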
p1 <- ggplot(population_data, aes(x = price)) +
  geom_histogram(fill = "steelblue", bins = 50) +
  labs(title = "Population Distribution (Skewed)", x = "Price", y = "Count")
p1
We will take 1,000 different samples (size n=100) and plot their means.
# Simulation: Take 1000 samples of size 100
samples_1000 <- replicate(1000, {
  population_data %>%
    slice_sample(n = 100) %>%
    summarize(m = mean(price)) %>%
    pull(m)
})
sample_means_df <- data.frame(mean_price = samples_1000)
p2 <- ggplot(sample_means_df, aes(x = mean_price)) +
  geom_histogram(fill = "darkorange", color = "white", bins = 30) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_vline(xintercept = pop_mean_price, color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of 1,000 Sample Means (Approximately Normal)",
       subtitle = "The red line is the true Population Mean",
       x = "Mean Price", y = "Frequency")
p2
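Besides the shape of the histogram, the CLT also predicts the center and spread of the sample means: they should cluster around the population mean with a standard deviation close to \(\sigma / \sqrt{n}\). The following base R sketch checks both numerically; it repeats the simulation with `sample()` instead of `slice_sample()`, so the exact values will differ from the plot above.

```r
library(ggplot2)  # provides the diamonds data
set.seed(123)

price <- diamonds$price
n <- 100

# 1,000 sample means, each from a sample of size n drawn without replacement
sample_means <- replicate(1000, mean(sample(price, n)))

# Center: the mean of the sample means vs. the population mean
c(mean_of_means = mean(sample_means), pop_mean = mean(price))

# Spread: the SD of the sample means vs. the theoretical sigma / sqrt(n)
c(sd_of_means = sd(sample_means), theory = sd(price) / sqrt(n))
```

Both pairs of numbers should be close, which is a quick numerical complement to the histogram.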
Does a sample actually “look” like the population? Let’s compare a 5% sample to the full population using Carat vs. Price.
small_sample <- population_data %>% slice_sample(prop = 0.05)
ggplot() +
  geom_point(data = population_data, aes(x = carat, y = price), alpha = 0.1, color = "grey") +
  geom_point(data = small_sample, aes(x = carat, y = price), color = "red", alpha = 0.5) +
  theme_minimal() +
  labs(title = "Population (Grey) vs. 5% Sample (Red)",
       caption = "Visualizing representation in sampling")
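The visual comparison can be backed up with a few numbers. The sketch below draws a fresh 5% sample with base R and compares the mean carat and the carat-price correlation against the full population; the data frame `comparison` and its column names are illustrative.

```r
library(ggplot2)  # provides the diamonds data
set.seed(123)

# Draw a 5% simple random sample of row indices
n_small <- floor(0.05 * nrow(diamonds))
small <- diamonds[sample.int(nrow(diamonds), n_small), ]

# Compare summary statistics between the population and the sample
comparison <- data.frame(
  dataset         = c("population", "5% sample"),
  mean_carat      = c(mean(diamonds$carat), mean(small$carat)),
  cor_carat_price = c(cor(diamonds$carat, diamonds$price),
                      cor(small$carat, small$price))
)
comparison
```

If the sample is representative, the two rows should be nearly the same.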
| Method | When to use | R Function |
|---|---|---|
| Simple Random | When the population is homogeneous. | slice_sample(n=x) |
| Stratified | When you want to ensure subgroups are represented. | group_by() %>% slice_sample() |
| Systematic | When you have a sorted list or physical queue. | seq() indexing |
| Cluster | When the population is naturally grouped geographically. | filter(group %in% selected_clusters) |
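The table lists cluster sampling, which the lecture does not demonstrate. The following is a rough sketch of the idea. It treats the color grades as if they were clusters, which is purely an assumption for illustration (real clusters would usually be geographic units such as stores or cities): a few clusters are chosen at random, and every row in the chosen clusters is kept.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)
set.seed(123)

# Assumption for illustration: each color grade acts as one cluster
clusters <- levels(diamonds$color)

# Randomly select 2 whole clusters
selected_clusters <- sample(clusters, 2)

# Keep every diamond that belongs to a selected cluster
cluster_sample <- diamonds %>%
  filter(color %in% selected_clusters)

cluster_sample %>% count(color)
```

Unlike stratified sampling, which samples within every group, cluster sampling keeps whole groups and drops the rest.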
1. Using the diamonds dataset, create a Stratified Sample based on the color column.
2. Calculate the mean carat for that sample.
3. [...] carat.
# Hint for Exercise 1 & 2
diamonds %>%
  group_by(color) %>%
  slice_sample(prop = 0.1) %>%
  ungroup() %>%
  summarize(mean_carat = mean(carat))
- diamonds dataset, which has over 50,000 rows, making the concept of “sampling from a population” feel real.
- slice_sample(), which is the modern way to sample in R (replacing the older sample_n).
- set.seed(), which is a crucial lesson for students to ensure they get the same “random” results every time they run their code.