This week's data dive focuses on understanding sampling and its relationship to the central limit theorem. Specifically, we will use the bank marketing dataset, break it into several samples, and then compare summary statistics across the samples to understand how they differ and how they relate more broadly to the “population,” where the population is defined as the entire dataset.
# Declare libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.2.1
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")
# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl (7): age, balance, day, duration, campaign, pdays, previous
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To begin, we will create a dataframe that stores all of the samples, where each sample is a random draw, with replacement, equal in size to 15% of the population. We will select 5 samples for our sampling distribution.
# Creating a Sample Dataframe
sample_frac = 0.15 # Declare that each sample is 15% of the population set
n_samples = 5 # Specify 5 samples
df_samples = tibble() # Declare an empty sample dataframe to bind to
for (sample_i in 1:n_samples) {
df_i <- bank_marketing |>
sample_n(size = sample_frac * nrow(bank_marketing), replace = TRUE) |>
mutate(sample_num = sample_i)
df_samples = bind_rows(df_samples, df_i)
}
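As a side note, sample_n() is superseded in current versions of dplyr; if you are on dplyr 1.0 or later, an equivalent draw inside the loop could be written with slice_sample(), roughly:
# Equivalent draw using the newer slice_sample() verb (assumes dplyr >= 1.0)
df_i <- bank_marketing |>
  slice_sample(prop = sample_frac, replace = TRUE) |>
  mutate(sample_num = sample_i)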
To understand how these samples relate to one another and what these principles may tell us about the data, we will begin by creating some basic grouped data frames to compare the categorical columns education, job, and marital status.
# Group by Sample for Education Variable
df_education <- df_samples |>
group_by(sample_num, education) |>
summarize(n = n(), .groups = "drop") |>
pivot_wider(
names_from = education,
values_from = n
)
df_education
## # A tibble: 5 × 5
## sample_num primary secondary tertiary unknown
## <int> <int> <int> <int> <int>
## 1 1 1025 3408 2064 284
## 2 2 1008 3481 1997 295
## 3 3 1022 3515 1978 266
## 4 4 988 3451 2063 279
## 5 5 1094 3490 1916 281
Here we can see that the education variable maintains a similar distribution across all samples, although slight variation exists. For instance, sample #4 has the lowest number of bank clients who have completed up to a primary education. Despite this, these minor differences have almost no meaningful impact on how the samples are distributed proportionally, as demonstrated below, especially when rounded to the nearest whole percentage point.
# Converting Education values counts to a Percentage
df_education <- df_samples |>
count(sample_num, education) |>
group_by(sample_num) |>
mutate(percentage = round((n / sum(n)) * 100, 0)) |>
select(sample_num, education, percentage) |>
pivot_wider(
names_from = education,
values_from = percentage
)
df_education
## # A tibble: 5 × 5
## # Groups: sample_num [5]
## sample_num primary secondary tertiary unknown
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 15 50 30 4
## 2 2 15 51 29 4
## 3 3 15 52 29 4
## 4 4 15 51 30 4
## 5 5 16 51 28 4
# Group by Job Variable
df_job <- df_samples |>
group_by(sample_num,job) |>
summarise(n = n(), .groups = "drop") |>
pivot_wider(
names_from = job,
values_from = n
)
df_job
## # A tibble: 5 × 13
## sample_num admin. `blue-collar` entrepreneur housemaid management retired
## <int> <int> <int> <int> <int> <int> <int>
## 1 1 741 1456 229 176 1441 324
## 2 2 787 1458 226 187 1445 341
## 3 3 766 1498 230 196 1410 326
## 4 4 755 1486 208 171 1433 312
## 5 5 766 1504 217 194 1360 373
## # ℹ 6 more variables: `self-employed` <int>, services <int>, student <int>,
## # technician <int>, unemployed <int>, unknown <int>
Again, looking across the raw value counts, there do not appear to be any significant differences among the samples aside from slight variation. If we convert these to percentages we would likely see the same story, with each sample falling within 1 or 2 percentage points of the others.
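As a quick check of that claim, the same percentage conversion used for education can be applied to the job variable; a sketch (the df_job_pct name is mine):
# Converting Job value counts to a percentage (mirrors the education chunk above)
df_job_pct <- df_samples |>
  count(sample_num, job) |>
  group_by(sample_num) |>
  mutate(percentage = round((n / sum(n)) * 100, 0)) |>
  select(sample_num, job, percentage) |>
  pivot_wider(
    names_from = job,
    values_from = percentage
  )
df_job_pct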
# Group by Marital Status Variable
df_marital <- df_samples |>
group_by(sample_num, marital) |>
summarise(n = n(), .groups = "drop") |>
pivot_wider(
names_from = marital,
values_from = n
)
df_marital
## # A tibble: 5 × 4
## sample_num divorced married single
## <int> <int> <int> <int>
## 1 1 775 4014 1992
## 2 2 731 4096 1954
## 3 3 777 4071 1933
## 4 4 789 4091 1901
## 5 5 764 4027 1990
The same story is told across the marital status variable as well, with no outstanding differences between samples. One interesting observation, though, is that the larger a category's share of the sample, the smaller its variation across samples relative to its size. For instance, the count of divorced clients ranges from 731 to 789 across the samples, a spread of 58 on an average of roughly 770 (around 8%). Married clients range from 4,014 to 4,096, a spread of 82 on an average of roughly 4,060 (around 2%). The absolute spread is larger for the bigger category, but proportionally the variation shrinks as the categorical subset grows. This can also be observed in the other categorical data frames above, and I believe it exhibits the law of large numbers.
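To make that comparison concrete, here is a rough sketch (the summary column names are my own) that computes each marital status's count range across samples as a share of its average count:
# Range of counts across samples, relative to the average count, by marital status
df_samples |>
  count(sample_num, marital) |>
  group_by(marital) |>
  summarise(
    avg_count = mean(n),
    count_range = max(n) - min(n),
    relative_range_pct = round((max(n) - min(n)) / mean(n) * 100, 1),
    .groups = "drop"
  )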
Next, we will focus on how the data are distributed for continuous variables. Since each sample is 15% of the population, roughly 6,800 observations, we should not expect large variation in how the data are distributed across samples. Let's focus specifically on the interquartile ranges for age and yearly average balance.
# Age Interquartile Range Box Plot
df_samples |>
ggplot() +
geom_boxplot(aes(x = factor(sample_num), y = age)) +
labs(
x = "Sample Number",
y = "Age",
title = "Age Interquartile Distribution Across Samples 1- 5"
) +
theme_bw()
Here we can see that the distribution of age shows almost no meaningful difference in median or interquartile range from sample to sample. We do see differences in the number of outliers, represented by the dots beyond the whiskers, especially in sample 1, but the summary statistics themselves show little to no difference. Now, if the goal were to analyze only the outlier population, say to understand clients over the age of 75, we might see more difference in how this specific subset is distributed from one sample to the next. So let's find out.
# Subset of Samples where Age is over 75
df_samples |>
filter(age >= 75) |>
ggplot(aes(x = factor(sample_num), y = age)) +
geom_boxplot() +
labs(
x = "Sample Number",
y = "Age",
title = "75+ Interquartile Distribution Across Samples 1–5"
) +
theme_bw()
Now we're in business and can see some clear variation in how the data are distributed for the 75+ clients within these sub-samples. I think this gets at the heart of the central limit theorem: as the sample size grows at the level of analysis we are working with, the sampling variability between samples should shrink. Let's run this same 75+ analysis after creating samples composed of 35 percent of the population instead of 15 percent to see if the variation changes.
# Creating a 35% Sample Dataframe
sample_frac = 0.35 # Declare that each sample is 35% of the population set
n_samples = 5 # Specify 5 samples
df_75_plus = tibble() # Declare an empty sample dataframe to bind to
for (sample_i in 1:n_samples) {
df_i <- bank_marketing |>
sample_n(size = sample_frac * nrow(bank_marketing), replace = TRUE) |>
mutate(sample_num = sample_i)
df_75_plus = bind_rows(df_75_plus, df_i)
}
df_75_plus |>
filter(age >= 75) |>
ggplot() +
geom_boxplot(aes(x = factor(sample_num), y = age)) +
labs(
x = "Sample Number",
y = "Age",
title = "75+ Interquartile Distribution Across Samples 1-5 for 35% Sample Size"
) +
theme_bw()
Here we can see that the median lines and interquartile ranges are now closer together than they were previously. If we compare the count of observations in each sample from each table, we see a clear difference in the number of records included. For instance, sample one jumps from 37 records to 104. In the data frame built from 15% samples, the counts range over 18 observations (35 to 53) on roughly 40 per sample, whereas in the data frame built from 35% samples the range is 24 (99 to 123) on roughly 108 per sample. In absolute terms the spread grows slightly, but proportionally it is much tighter; the quick check after the count table below quantifies this.
# Create a combined dataset and display count of rows for those 75+
df_combined <- bind_rows(
df_15 = df_samples,
df_35 = df_75_plus,
.id = "table_name")
df_combined |>
filter(age >= 75) |>
group_by(table_name, sample_num) |>
count(table_name) |>
pivot_wider(
names_from = sample_num,
values_from = n
)
## # A tibble: 2 × 6
## # Groups: table_name [2]
## table_name `1` `2` `3` `4` `5`
## <chr> <int> <int> <int> <int> <int>
## 1 df_15 37 42 35 35 53
## 2 df_35 104 103 109 99 123
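To quantify that proportional tightening, here is a quick sketch (the summary column names are my own) comparing the spread of the 75+ counts in each table:
# Spread of 75+ counts across samples, relative to the average count, for each table
df_combined |>
  filter(age >= 75) |>
  count(table_name, sample_num) |>
  group_by(table_name) |>
  summarise(
    avg_n = mean(n),
    range_n = max(n) - min(n),
    relative_range_pct = round((max(n) - min(n)) / mean(n) * 100, 1),
    .groups = "drop"
  )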
I would have done the same for balance, but I'm getting tired and started this a tad late. I think I demonstrated the principles of the central limit theorem and how it relates to the sampling distribution and the size of each sample within that distribution pretty well, though!
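For completeness, the balance comparison would follow the same pattern as the age box plots; here is a minimal sketch (not evaluated here):
# Yearly average balance interquartile ranges across the 15% samples
df_samples |>
  ggplot(aes(x = factor(sample_num), y = balance)) +
  geom_boxplot() +
  labs(
    x = "Sample Number",
    y = "Yearly Average Balance",
    title = "Balance Interquartile Distribution Across Samples 1-5"
  ) +
  theme_bw()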
From this exercise we have gleaned:
The importance of the law of large numbers and the central limit theorem in determining the reliability of a sample. The larger the sample size (i.e., the number of observations), the closer the sample comes to representing the true population.
Level of analysis matters. If we are looking at a specific outlier population like those over the age of 75 in our bank marketing dataset, then we need a larger sample size to compensate for potential variations in sampling.
Data are always a subset of the truth, and we can improve our ability to resolve that truth by increasing the number of data points we have access to. In this context, there are two layers of population to keep in mind for bank marketing: (1) the broader population as a whole, much of which may not be clients of this bank, and (2) the subset of clients at this bank who were involved in the marketing campaign.