This week's data dive focuses on understanding sampling and its relationship to the central limit theorem. Specifically, we will use the bank marketing dataset, break it into several samples, and then compare summary statistics across the samples to understand how they differ and how they relate more broadly to the “population,” where the population is defined as the entire dataset.
# Declare libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.2.1
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")
# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl (7): age, balance, day, duration, campaign, pdays, previous
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To begin, we will create a dataframe that stores all of the samples, where each sample is a random draw, with replacement, equal in size to 15% of the population. We will select 5 samples for our sampling distribution.
# Creating a Sample Dataframe
sample_frac = 0.15 # Declare that each sample is 15% of the population set
n_samples = 5 # Specify 5 samples
df_samples = tibble() # Declare an empty sample dataframe to bind to
for (sample_i in 1:n_samples) {
df_i <- bank_marketing |>
sample_n(size = sample_frac * nrow(bank_marketing), replace = TRUE) |>
mutate(sample_num = sample_i)
df_samples = bind_rows(df_samples, df_i)
}
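As a side note, sample_n() is superseded in current versions of dplyr; if you are on dplyr 1.0 or later, an equivalent draw inside the loop could be written with slice_sample(), roughly:
# Equivalent draw using the newer slice_sample() verb (assumes dplyr >= 1.0)
df_i <- bank_marketing |>
  slice_sample(prop = sample_frac, replace = TRUE) |>
  mutate(sample_num = sample_i)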
To understand how these samples relate to one another and what these principles may tell us about the data, we will begin by creating some basic grouped data frames to compare the categorical columns education, job, and marital status.
# Group by Sample for Education Variable
df_education <- df_samples |>
group_by(sample_num, education) |>
summarize(n = n(), .groups = "drop") |>
pivot_wider(
names_from = education,
values_from = n
)
df_education
## # A tibble: 5 × 5
## sample_num primary secondary tertiary unknown
## <int> <int> <int> <int> <int>
## 1 1 1025 3408 2064 284
## 2 2 1008 3481 1997 295
## 3 3 1022 3515 1978 266
## 4 4 988 3451 2063 279
## 5 5 1094 3490 1916 281
Here we can see that the education variable maintains a similar distribution across all samples, although slight variation exists. For instance, sample #4 has the lowest number of bank clients who have completed up to a primary education. Despite this, these minor differences have almost no meaningful impact on how the samples are distributed proportionally, as demonstrated below, especially when rounded to the nearest whole percentage point.
# Converting Education values counts to a Percentage
df_education <- df_samples |>
count(sample_num, education) |>
group_by(sample_num) |>
mutate(percentage = round((n / sum(n)) * 100, 0)) |>
select(sample_num, education, percentage) |>
pivot_wider(
names_from = education,
values_from = percentage
)
df_education
## # A tibble: 5 × 5
## # Groups: sample_num [5]
## sample_num primary secondary tertiary unknown
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 15 50 30 4
## 2 2 15 51 29 4
## 3 3 15 52 29 4
## 4 4 15 51 30 4
## 5 5 16 51 28 4
# Group by Job Variable
df_job <- df_samples |>
group_by(sample_num,job) |>
summarise(n = n(), .groups = "drop") |>
pivot_wider(
names_from = job,
values_from = n
)
df_job
## # A tibble: 5 × 13
## sample_num admin. `blue-collar` entrepreneur housemaid management retired
## <int> <int> <int> <int> <int> <int> <int>
## 1 1 741 1456 229 176 1441 324
## 2 2 787 1458 226 187 1445 341
## 3 3 766 1498 230 196 1410 326
## 4 4 755 1486 208 171 1433 312
## 5 5 766 1504 217 194 1360 373
## # ℹ 6 more variables: `self-employed` <int>, services <int>, student <int>,
## # technician <int>, unemployed <int>, unknown <int>
Again, looking across the raw value counts, there do not appear to be any significant differences among the samples aside from slight variation. If we convert these to percentages we would likely see the same story, with each sample falling within 1 or 2 percentage points of the others.
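As a quick check of that claim, the same percentage conversion used for education can be applied to the job variable; a sketch (the df_job_pct name is mine):
# Converting Job value counts to a percentage (mirrors the education chunk above)
df_job_pct <- df_samples |>
  count(sample_num, job) |>
  group_by(sample_num) |>
  mutate(percentage = round((n / sum(n)) * 100, 0)) |>
  select(sample_num, job, percentage) |>
  pivot_wider(
    names_from = job,
    values_from = percentage
  )
df_job_pct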
# Group by Marital Status Variable
df_marital <- df_samples |>
group_by(sample_num, marital) |>
summarise(n = n(), .groups = "drop") |>
pivot_wider(
names_from = marital,
values_from = n
)
df_marital
## # A tibble: 5 × 4
## sample_num divorced married single
## <int> <int> <int> <int>
## 1 1 775 4014 1992
## 2 2 731 4096 1954
## 3 3 777 4071 1933
## 4 4 789 4091 1901
## 5 5 764 4027 1990
The same story is told across the marital status variable as well, with no outstanding differences between samples. One interesting observation, though, is that the larger a category's share of the sample, the smaller its variation across samples relative to its size. For instance, the count of divorced clients ranges from 731 to 789 across the samples, a spread of 58 on an average of roughly 770 (around 8%). Married clients range from 4,014 to 4,096, a spread of 82 on an average of roughly 4,060 (around 2%). The absolute spread is larger for the bigger category, but proportionally the variation shrinks as the categorical subset grows. This can also be observed in the other categorical data frames above, and I believe it exhibits the law of large numbers.
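To make that comparison concrete, here is a rough sketch (the summary column names are my own) that computes each marital status's count range across samples as a share of its average count:
# Range of counts across samples, relative to the average count, by marital status
df_samples |>
  count(sample_num, marital) |>
  group_by(marital) |>
  summarise(
    avg_count = mean(n),
    count_range = max(n) - min(n),
    relative_range_pct = round((max(n) - min(n)) / mean(n) * 100, 1),
    .groups = "drop"
  )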
Next, we will focus on how the data are distributed for continuous variables. Since each sample is 15% of the population, roughly 6,800 observations, we should not expect large variation in how the data are distributed across samples. Let's focus specifically on the interquartile ranges for age and yearly average balance.
# Age Interquartile Range Box Plot
df_samples |>
ggplot() +
geom_boxplot(aes(x = factor(sample_num), y = age)) +
labs(
x = "Sample Number",
y = "Age",
title = "Age Interquartile Distribution Across Samples 1- 5"
) +
theme_bw()
Here we can see that the distribution of age shows almost no meaningful difference in median or interquartile range from sample to sample. We do see differences in the number of outliers, represented by the dots beyond the whiskers, especially in sample 1, but the summary statistics themselves show little to no difference. Now, if the goal were to analyze only the outlier population, say to understand clients over the age of 75, we might see more difference in how this specific subset is distributed from one sample to the next. So let's find out.
# Subset of Samples where Age is over 75
df_samples |>
filter(age >= 75) |>
ggplot(aes(x = factor(sample_num), y = age)) +
geom_boxplot() +
labs(
x = "Sample Number",
y = "Age",
title = "75+ Interquartile Distribution Across Samples 1–5"
) +
theme_bw()
Now we're in business and can see some clear variation in how the data are distributed for the 75+ clients within these sub-samples. I think this gets at the heart of the central limit theorem: as the sample size grows at the level of analysis we are working with, the sampling variability between samples should shrink. Let's run this same 75+ analysis after creating samples composed of 35 percent of the population instead of 15 percent to see if the variation changes.
# Creating a 35% Sample Dataframe
sample_frac = 0.35 # Declare that each sample is 35% of the population set
n_samples = 5 # Specify 5 samples
df_75_plus = tibble() # Declare an empty sample dataframe to bind to
for (sample_i in 1:n_samples) {
df_i <- bank_marketing |>
sample_n(size = sample_frac * nrow(bank_marketing), replace = TRUE) |>
mutate(sample_num = sample_i)
df_75_plus = bind_rows(df_75_plus, df_i)
}
df_75_plus |>
filter(age >= 75) |>
ggplot() +
geom_boxplot(aes(x = factor(sample_num), y = age)) +
labs(
x = "Sample Number",
y = "Age",
title = "75+ Interquartile Distribution Across Samples 1-5 for 35% Sample Size"
) +
theme_bw()
Here we can see that the median lines and interquartile ranges are now closer together than they were previously. If we compare the count of observations in each sample from each table, we see a clear difference in the number of records included. For instance, sample one jumps from 37 records to 104. In the data frame built from 15% samples, the counts range over 18 observations (35 to 53) on roughly 40 per sample, whereas in the data frame built from 35% samples the range is 24 (99 to 123) on roughly 108 per sample. In absolute terms the spread grows slightly, but proportionally it is much tighter; the quick check after the count table below quantifies this.
# Create a combined dataset and display count of rows for those 75+
df_combined <- bind_rows(
df_15 = df_samples,
df_35 = df_75_plus,
.id = "table_name")
df_combined |>
filter(age >= 75) |>
group_by(table_name, sample_num) |>
count(table_name) |>
pivot_wider(
names_from = sample_num,
values_from = n
)
## # A tibble: 2 × 6
## # Groups: table_name [2]
## table_name `1` `2` `3` `4` `5`
## <chr> <int> <int> <int> <int> <int>
## 1 df_15 37 42 35 35 53
## 2 df_35 104 103 109 99 123
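To quantify that proportional tightening, here is a quick sketch (the summary column names are my own) comparing the spread of the 75+ counts in each table:
# Spread of 75+ counts across samples, relative to the average count, for each table
df_combined |>
  filter(age >= 75) |>
  count(table_name, sample_num) |>
  group_by(table_name) |>
  summarise(
    avg_n = mean(n),
    range_n = max(n) - min(n),
    relative_range_pct = round((max(n) - min(n)) / mean(n) * 100, 1),
    .groups = "drop"
  )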
I would have done the same for balance, but I'm getting tired and started this a tad late. I think I demonstrated the principles of the central limit theorem and how it relates to the sampling distribution and the size of each sample within that distribution pretty well, though!
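For completeness, the balance comparison would follow the same pattern as the age box plots; here is a minimal sketch (not evaluated here):
# Yearly average balance interquartile ranges across the 15% samples
df_samples |>
  ggplot(aes(x = factor(sample_num), y = balance)) +
  geom_boxplot() +
  labs(
    x = "Sample Number",
    y = "Yearly Average Balance",
    title = "Balance Interquartile Distribution Across Samples 1-5"
  ) +
  theme_bw()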
From this exercise we have gleaned:
The importance of the law of large numbers and the central limit theorem in determining the reliability of a sample. The larger the sample size (i.e., the number of observations), the closer the sample comes to representing the true population.
Level of analysis matters. If we are looking at a specific outlier population like those over the age of 75 in our bank marketing dataset, then we need a larger sample size to compensate for potential variations in sampling.
Data are always a subset of the truth, and we can improve our ability to resolve that truth by increasing the number of data points we have access to. In this context, there are two layers of population to keep in mind for bank marketing: (1) the broader population as a whole, much of which may not be clients of this bank, and (2) the subset of clients at this bank who were involved in the marketing campaign.