Simple random sampling
The simplest method of sampling a population, is known as simple
random sampling (sometimes abbreviated to “SRS”), and involves
picking rows at random, one at a time, where each row has the same
chance of being picked as any other.
To make it easier to see which rows end up in the sample, it’s
helpful to include a row ID column in the dataset before you take the
sample.
We’ll look at sampling methods using a synthetic (fictional) employee
attrition dataset from IBM, where “attrition” means
leaving the company.
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(fst)
library(tibble)
Load dataset attrition_pop
path_attr <- ruta_fst <- "C:/Users/JuanFer Mosquera/Documents/datasets/attrition.fst"
attrition_pop <- read_fst(path_attr)
colnames(attrition_pop)
[1] "Age" "Attrition" "BusinessTravel" "DailyRate" "Department"
[6] "DistanceFromHome" "Education" "EducationField" "EnvironmentSatisfaction" "Gender"
[11] "HourlyRate" "JobInvolvement" "JobLevel" "JobRole" "JobSatisfaction"
[16] "MaritalStatus" "MonthlyIncome" "MonthlyRate" "NumCompaniesWorked" "OverTime"
[21] "PercentSalaryHike" "PerformanceRating" "RelationshipSatisfaction" "StockOptionLevel" "TotalWorkingYears"
[26] "TrainingTimesLastYear" "WorkLifeBalance" "YearsAtCompany" "YearsInCurrentRole" "YearsSinceLastPromotion"
[31] "YearsWithCurrManager"
attrition_pop
set.seed(590497)
attrition_samp <- attrition_pop %>%
# Add a row id column
rowid_to_column() %>%
# Get 200 rows using simple random sampling
slice_sample(n = 200)
# View the attrition_samp dataset
attrition_samp
Systematic sample
One sampling method that avoids randomness is called systematic
sampling. Here, you pick rows from the population at regular
intervals.
For example, if the population dataset had \(1000\) rows and you wanted a sample size of
five, you’d pick rows 200
, 400
,
600
, 800
, and 1000
.
# Set the sample size
sample_size_atr <- 200
# Get the population size from attrition_pop
pop_size_atr <- nrow(attrition_pop)
# Calculate the interval
interval_atr <- pop_size_atr %/% sample_size_atr
# Get row indexes for the sample
row_indexes_atr <- seq_len(sample_size_atr) * interval_atr
attrition_sys_samp <- attrition_pop %>%
# Add a row id column
rowid_to_column() %>%
# Get 200 rows using systematic sampling
slice(row_indexes_atr)
# View the results
attrition_sys_samp
NA
Is systematic sampling ok?
Systematic sampling has a problem: if the data has been sorted, or
there is some sort of pattern or meaning behind the row order, then the
resulting sample may not be representative of the whole population. The
problem can be solved by shuffling the rows, but then systematic
sampling is equivalent to simple random sampling.
Here you’ll look at how to determine whether or not there is a
problem.
# Add a row ID column to attrition_pop
attrition_pop_id <- attrition_pop %>%
rowid_to_column()
# Using attrition_pop_id, plot YearsAtCompany vs. rowid
ggplot(attrition_pop_id, aes(x = rowid, y = YearsAtCompany)) +
# Make it a scatter plot
geom_point() +
# Add a smooth trend line
geom_smooth()

# Shuffle the rows of attrition_pop then add row IDs
attrition_shuffled <- attrition_pop %>%
slice_sample(prop = 1) %>%
rowid_to_column()
# Using attrition_shuffled, plot YearsAtCompany vs. rowid
attrition_shuffled %>%
ggplot(aes(x = rowid, y = YearsAtCompany)) +
geom_point() +
geom_smooth()

A systematic sample does not always produce a sample similar
to a simple random sample.
Proportional stratified sampling
If you are interested in subgroups within the population, then you
may need to carefully control the counts of each subgroup within the
population. Proportional stratified sampling results in
subgroup sizes within the sample that are representative of the subgroup
sizes within the population. It is equivalent to performing a simple
random sample on each subgroup.
education_counts_pop <- attrition_pop %>%
# Count the employees by education level
count(Education, sort = TRUE) %>%
# Add a percent column
mutate(percent = 100 * n / sum(n))
education_counts_pop
# Use proportional stratified sampling to get 40% of each Education group
attrition_strat <- attrition_pop %>%
group_by(Education) %>%
slice_sample(prop = 0.4)
attrition_strat
# Get the counts and percents from attrition strat
education_counts_strat <- attrition_strat %>%
count(Education, sort = TRUE) %>%
mutate(percent = 100 * n / sum(n))
education_counts_strat
Equal counts stratified sampling
f one subgroup is larger than another subgroup in the population, but
you don’t want to reflect that difference in your analysis, then you can
use equal counts stratified sampling to generate samples where
each subgroup has the same amount of data. For example, if you are
analyzing blood types, O is the most common blood type worldwide, but
you may wish to have equal amounts of O, A, B, and AB in your
sample.
# Use equal counts stratified sampling to get 30 employees from each education group
attrition_eq <- attrition_pop %>%
group_by(Education) %>%
slice_sample(n = 30) %>%
ungroup()
# See the results
attrition_eq
# Get the counts and percent of attrition_eq
education_counts_eq <- attrition_eq %>%
count(Education) %>%
mutate(percent = 100 * n / sum(n))
# See the results
education_counts_eq
Weighted sampling
Stratified sampling provides rules about the probability of picking
rows from your dataset at the subgroup level. A generalization of this
is weighted sampling, which lets you specify rules
about the probability of picking rows at the row level. The probability
of picking any given row is proportional to the weight value for that
row.
# Using attrition_pop plot YearsAtCompany as a histogram with binwidth = 1
ggplot(attrition_pop, aes(YearsAtCompany)) +
geom_histogram(binwidth = 1)

# Sample 400 employees weighted by YearsAtCompany
attrition_weight <- attrition_pop %>%
slice_sample(n = 400, weight_by = YearsAtCompany)
# See the result
attrition_weight
# Using attrition weight, plot YearsAtCompany as a histogram with binwidth = 1
ggplot(attrition_weight, aes(YearsAtCompany)) +
geom_histogram(binwidth = 1)

3 kinds of samping
Let’s compare the performance of point estimates using simple,
stratified, and cluster sampling. Before we do that, you’ll have to set
up the samples. We’ll use the RelationshipSatisfaction
column of the attrition dataset, which categorizes the employee’s
relationship with the company. It’s an ordered factor with four levels:
Low
, Medium
, High
, and
Very_High
.
attrition_srs <- attrition_pop %>%
slice_sample(prop = 0.25)
# Perform stratified sampling to get 0.25 of each relationship group
attrition_strt <- attrition_pop %>%
group_by(RelationshipSatisfaction) %>%
slice_sample(prop = 0.25)
# Get unique values of RelationshipSatisfaction
satisfaction_unique <- unique(attrition_pop$RelationshipSatisfaction)
# Ramdomly sample for 2 of the unique satisfaction values
satisfaction_samp <- sample(satisfaction_unique)
# Perform cluster sampling on the selected group getting 0.25 of the population
attrition_clust <- attrition_pop %>%
filter(RelationshipSatisfaction %in% satisfaction_samp) %>%
group_by(RelationshipSatisfaction) %>%
slice_sample(n = round(nrow(attrition_pop) / 4)) %>%
ungroup()
Summary statistics on different kinds of
sample
Now you have three types of sample (simple, stratified, cluster), you
can compare point estimates from each sample to the population
parameter. That is, you can calculate the same summary statistic on each
sample and see how it compares to the summary statistic for the
population.
Here, we’ll look at how satisfaction with the company affects whether
or not the employee leaves the company. That is, you’ll calculate the
proportion of employees who left the company (they have an
Attrition
value of "Yes"
), for each value of
RelationshipSatisfaction
.
Whole population dataset
# Use the whole population dataset
mean_attrition_pop <- attrition_pop %>%
# Group by relationship satisfaction level
group_by(RelationshipSatisfaction) %>%
# Calculate the proportion of employee attrition
summarise(mean_attrition = mean(Attrition == "Yes"))
# See the result
mean_attrition_pop
attrition_srs sataset
mean_attrition_srs <- attrition_srs %>%
group_by(RelationshipSatisfaction) %>%
summarise(mean_attrition = mean(Attrition == "Yes"))
#See the result
mean_attrition_srs
attrition_strat dataset
mean_attrition_strt <- attrition_strt %>%
group_by(RelationshipSatisfaction) %>%
summarise(mean_attrition = mean(Attrition == "Yes"))
# See the result
mean_attrition_strt
attrition_clust dataset
mean_attrition_clust <- attrition_clust %>%
group_by(RelationshipSatisfaction) %>%
summarise(mean_attrition = mean(Attrition == "Yes"))
# See the results
mean_attrition_clust
Calculate relative errors
The size of the sample you take affects how accurately the point
estimates reflect the corresponding population parameter. For example,
when you calculate a sample mean, you want it to be close to the
population mean. However, if your sample is too small, this might not be
the case.
The most common metric for assessing accuracy is relative
error. This is the absolute difference between the population
parameter and the point estimate, all divided by the population
parameter. It is sometimes expressed as a percentage.
# Generate a simple random sample of 10 rows
attrition_srs10 <- attrition_pop %>%
slice_sample(n = 10)
# Calculate the proportion of employee attrition in the sample
mean_attrition_srs10 <- attrition_srs10 %>%
summarise(mean_attrition = mean(Attrition == "Yes")) %>%
pull(mean_attrition)
# Calculate the relative error percentage
rel_error_pct10 <- 100 * abs(mean_attrition_pop - mean_attrition_srs10) / mean_attrition_pop
Warning: '-' is not meaningful for ordered factorsWarning: '/' is not meaningful for ordered factors
# See the results
rel_error_pct10
Calculate the relative error percentage again. This time, use a
simple random sample of one hundred rows of
attrition_pop
.
attrition_srs100 <- attrition_pop %>%
slice_sample(n = 100)
mean_attrition_srs100 <- attrition_srs100 %>%
summarise(mean_attrition = mean(Attrition == "Yes")) %>%
pull(mean_attrition)
rel_error_pct100 <- 100 * abs(mean_attrition_pop - mean_attrition_srs10) / mean_attrition_pop
Warning: '-' is not meaningful for ordered factorsWarning: '/' is not meaningful for ordered factors
rel_error_pct100
Replicating samples
When you calculate a point estimate such as a sample mean, the value
you calculate depends on the rows that were included in the sample. That
means that there is some randomness in the answer. In order to quantify
the variation caused by this randomness, you can create many samples and
calculate the sample mean (or other statistic) for each sample.
# Replicate this code 500 times
mean_attritions <- replicate(n = 500,
expr = attrition_pop %>%
slice_sample(n = 20) %>%
summarise(mean_attrition = mean(Attrition == "Yes")) %>%
pull(mean_attrition)
)
# see the result
head(mean_attritions)
[1] 0.25 0.35 0.05 0.20 0.20 0.00
# Store mean attritions in a tibble in a column named sample_mean
sample_means_att <- tibble(sample_mean = mean_attritions)
# Plot a histogram of the "sample_mean" column, binwidth 0.05
ggplot(sample_means_att, aes(sample_mean)) +
geom_histogram(binwidth = 0.05)

Population and sampling distributions means
One of the useful features of sampling distributions is that we can
quantify them. In particular, we can calculate summary statistics on
them. Here, we’ll look at the relationship between the mean of the
sampling distribution and the population parameter that the sampling is
supposed to estimate.
Three sampling distributions are provided. In each case, the employee
attrition dataset was sampled using simple random sampling, then the
mean attrition was calculated. This was done 1000 times to get a
sampling distribution of mean attritions. One sampling distribution used
a sample size of 5 for each replicate, one used 50, and one used
500.
Let’s create the datasets of sample distributions:
num_replc <- 1000
# Replicate this code 1000 times
samp_distribution_5 <- replicate(n = num_replc,
expr = attrition_pop %>%
slice_sample(n = 5) %>%
summarise(mean_attrition = mean(Attrition == "Yes")) %>%
pull(mean_attrition), simplify = FALSE
)
# Store mean attritions in a tibble in a column named sample_mean
samp_distribution_5_tibl <- tibble(replicate = 1:num_replc, mean_attrition = unlist(samp_distribution_5))
samp_distribution_5_tibl
# For 50 samples
samp_distribution_50 <- replicate(n = num_replc,
expr = attrition_pop %>%
slice_sample(n = 50) %>%
summarise(mean_attrition = mean(Attrition == "Yes")) %>%
pull(mean_attrition), simplify = FALSE
)
samp_distribution_50_tibl <- tibble(replicate = 1:num_replc, mean_attrition = unlist(samp_distribution_50))
samp_distribution_50_tibl
# For 500 samples
samp_distribution_500 <- replicate(n = num_replc,
expr = attrition_pop %>%
slice_sample(n = 50) %>%
summarise(mean_attrition = mean(Attrition == "Yes")) %>%
pull(mean_attrition), simplify = FALSE
)
samp_distribution_500_tibl <- tibble(replicate = 1:num_replc, mean_attrition = unlist(samp_distribution_500))
samp_distribution_500_tibl
# Calculate the mean across replicates of the mean attritions in samp_distribution_5_tibl, samp_distribution_50_tibl, samp_distribution_500_tibl
mean_of_means5 <- samp_distribution_5_tibl %>%
summarise(mean_mean_attrition = mean(mean_attrition))
mean_of_means50 <- samp_distribution_50_tibl %>%
summarise(mean_mean_attrition = mean(mean_attrition))
mean_of_means500 <- samp_distribution_500_tibl %>%
summarise(mean_mean_attrition = mean(mean_attrition))
mean_of_means5
mean_of_means50
mean_of_means500
Population and sampling distribution variation
You just calculated the mean of the sampling distribution and saw how
it is an estimate of the corresponding population parameter. Similarly,
as a result of the central limit theorem, the standard deviation of the
sampling distribution has an interesting relationship with the
population parameter’s standard deviation and the sample size.
sd_of_means5 <- samp_distribution_5_tibl %>%
summarise(sd_mean_attrition = sd(mean_attrition))
sd_of_means50 <- samp_distribution_50_tibl %>%
summarise(sd_mean_attrition = sd(mean_attrition))
sd_of_means500 <- samp_distribution_500_tibl %>%
summarise(sd_mean_attrition = sd(mean_attrition))
sd_of_means5
sd_of_means50
sd_of_means500
coffee_focus <- coffee_ratings %>%
select(variety, country_of_origin, flavor) %>%
rowid_to_column()
Error in select(., variety, country_of_origin, flavor) :
object 'coffee_ratings' not found
Resample with slice_sample()
To sample with replacement, you call slice_sample()
as
usual, but set the replace
argument to TRUE
.
Setting prop to 1 gives a sample with the same size as the original
dataset.
Repeated coffees
Counting the rowid
shows how many times each coffee
ended up in the resampled dataset. Some coffees are present five times
in the new dataset.
Missing coffees
That means that some coffees didn’t end up in the resampled dataset.
By taking the number of distinct row IDs in the resampled dataset using
dplyr’s n_distinct, you can see that 834 different coffees were
included, and 505 coffees weren’t included.
---
title: "Sampling in R"
output: html_notebook
author: Juan Fernando Mosquera Araujo
date: 2024-03-06
---

### **Simple random sampling**

The simplest method of sampling a population, is known as *simple random sampling* (sometimes abbreviated to "SRS"), and involves picking rows at random, one at a time, where each row has the same chance of being picked as any other.

To make it easier to see which rows end up in the sample, it's helpful to include a row ID column in the dataset before you take the sample.

We'll look at sampling methods using a synthetic (fictional) employee attrition dataset from **IBM,** where "attrition" means leaving the company.

```{r}
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(fst)
library(tibble)
```

### **Load dataset attrition_pop**

```{r}
path_attr <- ruta_fst <- "C:/Users/JuanFer Mosquera/Documents/datasets/attrition.fst"
attrition_pop <- read_fst(path_attr)
colnames(attrition_pop)
attrition_pop
```

```{r}
set.seed(590497)
attrition_samp <- attrition_pop %>%
  # Add a row id column
  rowid_to_column() %>%
  # Get 200 rows using simple random sampling
  slice_sample(n = 200)

# View the attrition_samp dataset
attrition_samp
```

### **Systematic sample**

One sampling method that avoids randomness is called *systematic sampling*. Here, you pick rows from the population at regular intervals.

For example, if the population dataset had $1000$ rows and you wanted a sample size of five, you'd pick rows `200`, `400`, `600`, `800`, and `1000`.

```{r}
# Set the sample size
sample_size_atr <- 200

# Get the population size from attrition_pop
pop_size_atr <- nrow(attrition_pop)

# Calculate the interval
interval_atr <- pop_size_atr %/% sample_size_atr
```

```{r}
# Get row indexes for the sample 
row_indexes_atr <- seq_len(sample_size_atr) * interval_atr

attrition_sys_samp <- attrition_pop %>%
  # Add a row id column
  rowid_to_column() %>%
  # Get 200 rows using systematic sampling
  slice(row_indexes_atr)

# View the results
attrition_sys_samp

```

### **Is systematic sampling ok?**

Systematic sampling has a problem: if the data has been sorted, or there is some sort of pattern or meaning behind the row order, then the resulting sample may not be representative of the whole population. The problem can be solved by shuffling the rows, but then systematic sampling is equivalent to simple random sampling.

Here you'll look at how to determine whether or not there is a problem.

```{r}
# Add a row ID column to attrition_pop
attrition_pop_id <- attrition_pop %>%
  rowid_to_column()

# Using attrition_pop_id, plot YearsAtCompany vs. rowid
ggplot(attrition_pop_id, aes(x = rowid, y = YearsAtCompany)) +
  # Make it a scatter plot
  geom_point() + 
  # Add a smooth trend line
  geom_smooth()
```

```{r}
# Shuffle the rows of attrition_pop then add row IDs
attrition_shuffled <- attrition_pop %>%
  slice_sample(prop = 1) %>%
  rowid_to_column()

# Using attrition_shuffled, plot YearsAtCompany vs. rowid
attrition_shuffled %>%
  ggplot(aes(x = rowid, y = YearsAtCompany)) +
  geom_point() + 
  geom_smooth()
```

**A systematic sample does not always produce a sample similar to a simple random sample.**

### **Proportional stratified sampling**

If you are interested in subgroups within the population, then you may need to carefully control the counts of each subgroup within the population. *Proportional stratified sampling* results in subgroup sizes within the sample that are representative of the subgroup sizes within the population. It is equivalent to performing a simple random sample on each subgroup.

```{r}
education_counts_pop <- attrition_pop %>%
  # Count the employees by education level
  count(Education, sort = TRUE) %>%
  # Add a percent column
  mutate(percent = 100 * n / sum(n))

education_counts_pop
```

```{r}
# Use proportional stratified sampling to get 40% of each Education group 
attrition_strat <- attrition_pop %>%
  group_by(Education) %>%
  slice_sample(prop = 0.4)

attrition_strat
```

```{r}
# Get the counts and percents from attrition strat
education_counts_strat <- attrition_strat %>%
  count(Education, sort = TRUE) %>%
  mutate(percent = 100 * n / sum(n))

education_counts_strat
```

### **Equal counts stratified sampling**

f one subgroup is larger than another subgroup in the population, but you don't want to reflect that difference in your analysis, then you can use *equal counts stratified sampling* to generate samples where each subgroup has the same amount of data. For example, if you are analyzing blood types, O is the most common blood type worldwide, but you may wish to have equal amounts of O, A, B, and AB in your sample.

```{r}
# Use equal counts stratified sampling to get 30 employees from each education group
attrition_eq <- attrition_pop %>%
  group_by(Education) %>%
  slice_sample(n = 30) %>%
  ungroup() 

# See the results
attrition_eq
```

```{r}
# Get the counts and percent of attrition_eq
education_counts_eq <- attrition_eq %>%
  count(Education) %>%
  mutate(percent = 100 * n / sum(n))

# See the results 
education_counts_eq
```

### **Weighted sampling**

Stratified sampling provides rules about the probability of picking rows from your dataset at the subgroup level. A generalization of this is **weighted sampling**, which lets you specify rules about the probability of picking rows at the row level. The probability of picking any given row is proportional to the weight value for that row.

```{r}
# Using attrition_pop plot YearsAtCompany as a histogram with binwidth = 1
ggplot(attrition_pop, aes(YearsAtCompany)) + 
  geom_histogram(binwidth = 1)
```

```{r}
# Sample 400 employees weighted by YearsAtCompany
attrition_weight <- attrition_pop %>%
  slice_sample(n = 400, weight_by = YearsAtCompany)

# See the result 
attrition_weight
```

```{r}
# Using attrition weight, plot YearsAtCompany as a histogram with binwidth = 1
ggplot(attrition_weight, aes(YearsAtCompany)) +
  geom_histogram(binwidth = 1)
```

### **Performing cluster sampling**

Now that you know when to use cluster sampling, it's time to put it into action. In this exercise you'll explore the `JobRole` column of the attrition dataset. You can think of each job role as a subgroup of the whole population of employees.

```{r}
# Get unique JobRole values
job_roles_pop <- unique(attrition_pop$JobRole)

# Randomly sample 4 JobRole values
job_roles_samp <- sample(job_roles_pop, size = 4)

# see the result
job_roles_samp
```

```{r}
# Filter for rows where JobRole is in job_roles_samp
attrition_filtered <- attrition_pop %>%
  filter(JobRole %in% job_roles_samp) %>%
  group_by(JobRole)

# Randomly sample 10 employees from each sampled job role
 attrition_clus <-  attrition_filtered %>%
   slice_sample(n = 10)
  
 attrition_clus
```

### **3 kinds of samping**

Let's compare the performance of point estimates using simple, stratified, and cluster sampling. Before we do that, you'll have to set up the samples. We'll use the `RelationshipSatisfaction` column of the attrition dataset, which categorizes the employee's relationship with the company. It's an ordered factor with four levels: `Low`, `Medium`, `High`, and `Very_High`.

```{r}
attrition_srs <- attrition_pop %>%
  slice_sample(prop = 0.25)

# Perform stratified sampling to get 0.25 of each relationship group
attrition_strt <- attrition_pop %>%
  group_by(RelationshipSatisfaction) %>%
  slice_sample(prop = 0.25)

# Get unique values of RelationshipSatisfaction 
satisfaction_unique <- unique(attrition_pop$RelationshipSatisfaction)

# Ramdomly sample for 2 of the unique satisfaction values 
satisfaction_samp <- sample(satisfaction_unique)

# Perform cluster sampling on the selected group getting 0.25 of the population
attrition_clust <- attrition_pop %>%
  filter(RelationshipSatisfaction %in% satisfaction_samp) %>%
  group_by(RelationshipSatisfaction) %>%
  slice_sample(n = round(nrow(attrition_pop) / 4)) %>%
  ungroup()
```

### **Summary statistics on different kinds of sample**

Now you have three types of sample (simple, stratified, cluster), you can compare point estimates from each sample to the population parameter. That is, you can calculate the same summary statistic on each sample and see how it compares to the summary statistic for the population.

Here, we'll look at how satisfaction with the company affects whether or not the employee leaves the company. That is, you'll calculate the proportion of employees who left the company (they have an `Attrition` value of `"Yes"`), for each value of `RelationshipSatisfaction`.

### **Whole population dataset**

```{r}
# Use the whole population dataset
mean_attrition_pop <- attrition_pop %>%
  # Group by relationship satisfaction level
  group_by(RelationshipSatisfaction) %>%
  # Calculate the proportion of employee attrition
  summarise(mean_attrition = mean(Attrition == "Yes"))

# See the result
mean_attrition_pop
```

### **attrition_srs sataset**

```{r}
mean_attrition_srs <- attrition_srs %>%
  group_by(RelationshipSatisfaction) %>%
  summarise(mean_attrition = mean(Attrition == "Yes"))

#See the result
mean_attrition_srs
```

### **attrition_strat dataset**

```{r}
mean_attrition_strt <- attrition_strt %>%
  group_by(RelationshipSatisfaction) %>%
  summarise(mean_attrition = mean(Attrition == "Yes"))

# See the result
mean_attrition_strt
```

### **attrition_clust dataset**

```{r}
mean_attrition_clust <- attrition_clust %>%
  group_by(RelationshipSatisfaction) %>%
  summarise(mean_attrition = mean(Attrition == "Yes"))

# See the results
mean_attrition_clust
```

### **Calculate relative errors**

The size of the sample you take affects how accurately the point estimates reflect the corresponding population parameter. For example, when you calculate a sample mean, you want it to be close to the population mean. However, if your sample is too small, this might not be the case.

The most common metric for assessing accuracy is *relative error*. This is the absolute difference between the population parameter and the point estimate, all divided by the population parameter. It is sometimes expressed as a percentage.

```{r}
# Generate a simple random sample of 10 rows 
attrition_srs10 <- attrition_pop %>%
  slice_sample(n = 10)

# Calculate the proportion of employee attrition in the sample
  mean_attrition_srs10 <- attrition_srs10 %>%
    summarise(mean_attrition = mean(Attrition == "Yes")) %>%
    pull(mean_attrition)
  
# Calculate the relative error percentage
rel_error_pct10 <- 100 * abs(mean_attrition_pop - mean_attrition_srs10) / mean_attrition_pop

# See the results
rel_error_pct10
```

Calculate the relative error percentage again. This time, use a simple random sample of one hundred rows of `attrition_pop`.

```{r}
attrition_srs100 <- attrition_pop %>%
  slice_sample(n = 100)

  mean_attrition_srs100 <- attrition_srs100 %>%
    summarise(mean_attrition = mean(Attrition == "Yes")) %>%
    pull(mean_attrition)
  
rel_error_pct100 <- 100 * abs(mean_attrition_pop - mean_attrition_srs10) / mean_attrition_pop

rel_error_pct100
```

### **Replicating samples**

When you calculate a point estimate such as a sample mean, the value you calculate depends on the rows that were included in the sample. That means that there is some randomness in the answer. In order to quantify the variation caused by this randomness, you can create many samples and calculate the sample mean (or other statistic) for each sample.

```{r}
# Replicate this code 500 times
mean_attritions <- replicate(n = 500, 
  expr = attrition_pop %>%
    slice_sample(n = 20) %>%
    summarise(mean_attrition = mean(Attrition == "Yes")) %>%
    pull(mean_attrition)
)

# see the result
head(mean_attritions)
```

```{r}
# Store mean attritions in a tibble in a column named sample_mean
sample_means_att <- tibble(sample_mean = mean_attritions)

# Plot a histogram of the "sample_mean" column, binwidth 0.05
ggplot(sample_means_att, aes(sample_mean)) + 
  geom_histogram(binwidth = 0.05)
```

### **Population and sampling distributions means**

One of the useful features of sampling distributions is that we can quantify them. In particular, we can calculate summary statistics on them. Here, we'll look at the relationship between the mean of the sampling distribution and the population parameter that the sampling is supposed to estimate.

Three sampling distributions are provided. In each case, the employee attrition dataset was sampled using simple random sampling, then the mean attrition was calculated. This was done 1000 times to get a sampling distribution of mean attritions. One sampling distribution used a sample size of 5 for each replicate, one used 50, and one used 500.

Let's create the datasets of sample distributions:

```{r}
num_replc <- 1000

# Replicate this code 1000 times
samp_distribution_5 <- replicate(n = num_replc, 
  expr = attrition_pop %>%
    slice_sample(n = 5) %>%
    summarise(mean_attrition = mean(Attrition == "Yes")) %>%
    pull(mean_attrition), simplify = FALSE
)

# Store mean attritions in a tibble in a column named sample_mean
samp_distribution_5_tibl <- tibble(replicate = 1:num_replc, mean_attrition = unlist(samp_distribution_5))
samp_distribution_5_tibl

# For 50 samples
samp_distribution_50 <- replicate(n = num_replc, 
  expr = attrition_pop %>%
    slice_sample(n = 50) %>%
    summarise(mean_attrition = mean(Attrition == "Yes")) %>%
    pull(mean_attrition), simplify = FALSE
)
samp_distribution_50_tibl <- tibble(replicate = 1:num_replc, mean_attrition = unlist(samp_distribution_50))
samp_distribution_50_tibl

# For 500 samples
samp_distribution_500 <- replicate(n = num_replc, 
  expr = attrition_pop %>%
    slice_sample(n = 50) %>%
    summarise(mean_attrition = mean(Attrition == "Yes")) %>%
    pull(mean_attrition), simplify = FALSE
)
samp_distribution_500_tibl <- tibble(replicate = 1:num_replc, mean_attrition = unlist(samp_distribution_500))
samp_distribution_500_tibl
```

```{r}
# Calculate the mean across replicates of the mean attritions in samp_distribution_5_tibl, samp_distribution_50_tibl, samp_distribution_500_tibl
mean_of_means5 <- samp_distribution_5_tibl %>%
  summarise(mean_mean_attrition = mean(mean_attrition))

mean_of_means50 <- samp_distribution_50_tibl %>%
  summarise(mean_mean_attrition = mean(mean_attrition))

mean_of_means500 <- samp_distribution_500_tibl %>%
  summarise(mean_mean_attrition = mean(mean_attrition))

mean_of_means5
mean_of_means50
mean_of_means500
```

### **Population and sampling distribution variation**

You just calculated the mean of the sampling distribution and saw how it is an estimate of the corresponding population parameter. Similarly, as a result of the central limit theorem, the standard deviation of the sampling distribution has an interesting relationship with the population parameter's standard deviation and the sample size.

```{r}
sd_of_means5 <- samp_distribution_5_tibl %>%
  summarise(sd_mean_attrition = sd(mean_attrition))

sd_of_means50 <- samp_distribution_50_tibl %>%
  summarise(sd_mean_attrition = sd(mean_attrition))

sd_of_means500 <- samp_distribution_500_tibl %>%
  summarise(sd_mean_attrition = sd(mean_attrition))

sd_of_means5
sd_of_means50
sd_of_means500
```

```{r}
coffee_focus <- coffee_ratings %>%
  select(variety, country_of_origin, flavor) %>%
  rowid_to_column()

glimpse(coffee_focus)
```

### **Resample with `slice_sample()`**

To sample with replacement, you call `slice_sample()` as usual, but set the `replace` argument to `TRUE`. Setting prop to 1 gives a sample with the same size as the original dataset.

```{r}
coffee_resamp <- coffee_focus %>%
  slice_sample(prop = 1, replace = TRUE)
```

### **Repeated coffees**

Counting the `rowid` shows how many times each coffee ended up in the resampled dataset. Some coffees are present five times in the new dataset.

```{r}
coffee_resamp %>%
  count(rowid, sort = TRUE)
```

### **Missing coffees**

That means that some coffees didn't end up in the resampled dataset. By taking the number of distinct row IDs in the resampled dataset using dplyr's n_distinct, you can see that 834 different coffees were included, and 505 coffees weren't included.

```{r}

```
