Stratified Sampling in Survey Research with Python

PyLadiesCon 2025

Cary K. Jim, Ph.D.

Introduction

Learning Outcomes

Understand when to use stratified sampling over random sampling
Learn how to define subgroups (strata) and implement sampling in Python
Evaluate sample representativeness by checking distribution patterns
Apply weights to adjust for unequal proportions in score estimation

Today’s presentation

Demonstrate how stratified random sampling to identify subgroups - Post-stratification weights are applied to correct unequal representation

Stratified Sampling: Why and When?

Challenges in surveying large populations: time, budget, and resource constraints
Importance of representative samples in research to reduce bias
Align research outcomes with actionable goals: informing policy, guiding interventions, or enhancing program effectiveness
Allow for subgroup analysis to understand diverse perspectives within the population

Figure 1. Types of Sampling Methods with Illustration (Generated with Copilot)

Stratified Sampling

Overview

Stratified sampling is a research design method that involves dividing a population into smaller subgroups known as strata.

Strata are formed based on shared attributes or characteristics (e.g., age group, gender, geographic location).
A sampling technique, such as random, systematic, or convenience sampling, is then applied to select participants in each stratum.

📘 Examples

Systematic sampling within strata: You divided a college students population based on their major and then select every kth element based on the size needed from each stratum.
Stratified random sampling within strata: You divided a college students population based on their major and then randomly select a proportionate number of students from each major.

Figure 2. Stratified Random Sampling Workflow Diagram (Generated with Copilot)

Python Implementation of Stratified Random Sampling

Population Dataset

In this example, the population is defined as 25,000 college students.

Variables in the population dataset:

Code

# View the first three rows of the population dataset
population_data.head(3)

	StudentID	Gender	Age	Race	Ethnicity	StudentType	OverallGPA
0	SID000001	Male	18	White	Hispanic or Latino	Freshman	4.00
1	SID000002	Female	40	Two or more races	Hispanic or Latino	Continuing Education	0.00
2	SID000003	Female	19	White	Not Hispanic or Latino	Junior	3.23

Cochran’s Sample Size Formula

The ideal sample size for an infinite population is given by: \[ n_0 = \frac{Z^2 \, p \, (1 - p)}{e^2} \]

where:

( n₀ ) = initial sample size from Cochran’s formula for an infinite population
( Z ) = Z-value (e.g., 1.96 for 95% confidence, z-table lookup)
( p ) = estimated proportion of the attribute present in the population (assuming 0.5 for largest variability)
( e ) = margin of error (e.g., 0.05 for 5%, the desired level of precision)

The sample size adjusted for a finite population is given by:

\[ n = \frac{n_0}{1 + \left( \frac{n_0 - 1}{N} \right)} \]

where:

( n ) = adjusted sample size for a finite population
( n₀ ) = theoretical sample value obtained from Cochran’s formula for an infinite population
( N ) = population size

Code example: Cochran’s Sample Size CalculationCode# Cochran's Sample Size Calculation
import scipy.stats as stats
def cochran_sample_size(Z, p, e, N):
    n0 = (Z**2 * p * (1 - p)) / (e**2)
    n = n0 / (1 + ((n0 - 1) / N))
    return int(n)

# Parameters
Z = stats.norm.ppf(0.975)   # Calculate the critical z-value for a 95% confidence level in a standard normal distribution
p = 0.5                     # Estimated proportion (use 0.5 for maximum variability, sort of like worst-case scenario)
e = 0.05                    # Margin of error (precision level, more precision requires larger sample size)
N = 25000                   # Finite population size (in this example, student population size)

# Calculate sample size

# Calculate n0 and adjusted sample size for finite population
n0 = (Z**2 * p * (1 - p)) / (e**2)
sample_size = n0 / (1 + ((n0 - 1) / N))
print(f"Ideal sample size (n0): {round(n0)}")
print(f"Adjusted sample size for finite population: {int(sample_size)}")

Ideal sample size (n0): 384
Adjusted sample size for finite population: 378

But Wait…There’s Re-coding to Do!

Example 1: By increments of 2 or 10

Code

# Binding age groups by increments of 2 years for younger ages and 10 years for older ages
population_data['Age_Group_1'] = pd.cut(population_data['Age'], 
                                       bins=[17, 19, 21, 23, 25, 27, 29, 31, 41, 50],
                                       labels=['17-18', '19-20', '21-22', '23-24', '25-26', '27-28', '29-30', '31-40', '41-50'], right=False)
# View the distribution after recoding for both counts and proportions
# Make a list of two outputs: counts and proportions
# To view counts
age_1_group_counts = population_data['Age_Group_1'].value_counts().sort_index()
# To view proportions
age_1_group_proportions = population_data['Age_Group_1'].value_counts(normalize=True).sort_index().round(3)
age_1_group_counts, age_1_group_proportions

(Age_Group_1
 17-18    10123
 19-20     8896
 21-22     4222
 23-24      506
 25-26      329
 27-28      277
 29-30      235
 31-40      400
 41-50       12
 Name: count, dtype: int64,
 Age_Group_1
 17-18    0.405
 19-20    0.356
 21-22    0.169
 23-24    0.020
 25-26    0.013
 27-28    0.011
 29-30    0.009
 31-40    0.016
 41-50    0.000
 Name: proportion, dtype: float64)

Example 2: By meaningful groups

Code

# We will keep the first three groups like the previous example, but then group all older ages into one group, as 31 and above.

population_data['Age_Group_2'] = pd.cut(population_data['Age'], 
                                       bins=[17, 19, 21, 23, 25, 31, 44],
                                       labels=['17-18', '19-20', '21-22', '23-25', '26-30', '31+'],
                                   right=False)
# View the distribution after recoding for both counts and proportions
# To view counts
age_2_group_counts = population_data['Age_Group_2'].value_counts().sort_index()
# To view proportions
age_2_group_proportions = population_data['Age_Group_2'].value_counts(normalize=True).sort_index().round(3)
age_2_group_counts, age_2_group_proportions

(Age_Group_2
 17-18    10123
 19-20     8896
 21-22     4222
 23-25      506
 26-30      841
 31+        411
 Name: count, dtype: int64,
 Age_Group_2
 17-18    0.405
 19-20    0.356
 21-22    0.169
 23-25    0.020
 26-30    0.034
 31+      0.016
 Name: proportion, dtype: float64)

Reminder: The proportion of each subgroup should be calculated after re-coding. If you adjust any levels within a variable, it can affect the distribution of values in the dataset. In practice, you may repeat this step until you are able to draw a proper set of subgroups before creating the strata.

Calculate Proportions

The strata (subgroups) are determined by the following categories:

Gender (Female or Male)
Race (Asian, Black or African American, White, Two or more races, Other)
Age Group (16-18, 19-20, 21-22, 23-30, 31+)

Code example: Proportion by each stratum

Code

# Create a distribution chart for each of the three variables after recoding
import matplotlib.pyplot as plt

# Proportion of Gender in the Sample Selection Data and add the proportion on top of each bar
gender_proportions = sample_selection_data['Gender_m'].value_counts(normalize=True).round(2)
gender_proportions.plot(kind='bar', title='Proportion of Gender')
plt.xticks(rotation=0)
# Round to 2 decimal places and convert to percentage
for i, v in enumerate(gender_proportions):
    plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()

Code

# Proportion of Race in the Sample Selection Data
race_proportions = sample_selection_data['Race_m'].value_counts(normalize=True).round(2)
race_proportions.plot(kind='bar', title='Proportion of Race')
plt.xticks(rotation=0)
# Round to 2 decimal places and convert to percentage
for i, v in enumerate(race_proportions):
    plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()

Code

# Proportion of Age Group in the Sample Selection Data
age_proportions = sample_selection_data['Age_Group_2'].value_counts(normalize=True).round(2)
age_proportions.plot(kind='bar', title='Proportion of Age Group')
plt.xticks(rotation=0)
for i, v in enumerate(age_proportions):
    plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()

Re-sampling

Code

# This time we are using a different set of parameters. The formula remains the same from Step 2.
# Parameters
Z = stats.norm.ppf(0.975)   # Calculate the critical z-value for
# a 95% confidence level in a standard normal distribution
p = 0.5                     # Estimated proportion (use 0.5 for maximum variability, sort of like worst-case scenario)
e_3 = 0.03                    # Margin of error (precision level, more precision requires larger sample size)
N = 25000                   # Finite population size (in this example, student population size)

# Calculate sample size
# Calculate n0 and adjusted sample size for finite population
n0 = (Z**2 * p * (1 - p)) / (e_3**2)
sample_size_larger_e = n0 / (1 + ((n0 - 1) / N))
print(f"Ideal sample size (n0) for 95% confidence level with 3% margin of error: {round(n0)}")
print(f"Adjusted sample size of finite population for 95% confidence level with 3% margin of error: {int(sample_size_larger_e)}")

Ideal sample size (n0) for 95% confidence level with 3% margin of error: 1067
Adjusted sample size of finite population for 95% confidence level with 3% margin of error: 1023

Then, we apply the proportions to get the sample of this updated sample size.

Code

# Determine the combined proportions of the strata in the sample selection data with the modified variables
stratum_set_1 = sample_selection_data.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
stratum_set_1['Proportion'] = stratum_set_1['Count'] / stratum_set_1['Count'].sum()

# use the 3% sample size to draw the stratified random sample
stratum_sizes_three_moe = (stratum_set_1['Proportion'] * int(sample_size_larger_e)).round().astype(int)
# Add the grouping variables back to the stratum sizes DataFrame for clarity
stratum_sizes_three_moe = pd.concat([stratum_set_1[['Gender_m', 'Race_m', 'Age_Group_2']], stratum_sizes_three_moe.rename('Sample_Size')], axis=1)
print("Stratum sizes for each subgroup based on 3% margin of error sample size (first 3 rows):")
stratum_sizes_three_moe.head(3)

Stratum sizes for each subgroup based on 3% margin of error sample size (first 3 rows):

	Gender_m	Race_m	Age_Group_2	Sample_Size
0	Female	Asian	17-18	25
1	Female	Asian	19-20	21
2	Female	Asian	21-22	10

Draw the stratified random sample based on the calculated stratum sizes.

Code

# Initialize an empty list to hold sampled dataframes
sampled_groups = []
for i, row in stratum_sizes_three_moe.iterrows():
    stratum_filter = (
        (sample_selection_data['Gender_m'] == row['Gender_m']) &
        (sample_selection_data['Race_m'] == row['Race_m']) &
        (sample_selection_data['Age_Group_2'] == row['Age_Group_2'])
    )
    stratum_population = sample_selection_data[stratum_filter]
    # Use the sample size from the current row
    stratum_sample_size = int(row['Sample_Size'])

    # Ensure we don't sample from empty strata or more than available
    n = min(stratum_sample_size, len(stratum_population))
    if n > 0:
        stratum_sample = stratum_population.sample(n=n, replace=False)
        sampled_groups.append(stratum_sample)

# Combine all sampled dataframes into a single dataframe (guard against empty list)
if sampled_groups:
    sampled_dataset = pd.concat(sampled_groups).reset_index(drop=True)
else:
    sampled_dataset = pd.DataFrame(columns=sample_selection_data.columns)

# Create an alias expected by downstream cells and align column naming
sampled_data = sampled_dataset.copy()
if 'Age_Group' not in sampled_data.columns and 'Age_Group_2' in sampled_data.columns:
    sampled_data['Age_Group'] = sampled_data['Age_Group_2']
# View the first three rows of the sampled dataset
print("Sampled dataset (first 3 rows):")
sampled_data[['StudentID', 'Gender_m', 'Race_m', 'Age_Group']].head(3)

Sampled dataset (first 3 rows):

	StudentID	Gender_m	Race_m	Age_Group
0	SID024775	Female	Asian	17-18
1	SID019476	Female	Asian	17-18
2	SID018187	Female	Asian	17-18

Compare Sample Proportions to Population Proportions

Stratum are indexed based on the order in the DataFrame.

Sampled Dataset (n = 1024)

Code

print("Sampled dataset size:", len(sampled_dataset))

# Calculate the proportions in the sampled data
sample_proportions = sampled_dataset.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
sample_proportions['Proportion'] = sample_proportions['Count'] / sample_proportions['Count'].sum()
# View the distribution of the sampled data in a chart
import matplotlib.pyplot as plt
# Chart for Sampled Data Proportions
plt.figure(figsize=(5, 3))
plt.bar(range(len(sample_proportions)), sample_proportions['Proportion'])
plt.title('Sampled Data Proportions by Stratum')
plt.xlabel('Stratum Index')
plt.ylabel('Proportion')
plt.xticks(range(len(sample_proportions)), sample_proportions.index, rotation=90)
plt.tight_layout()
plt.show()
# Display the sampled proportions DataFrame
sample_proportions[['Proportion']].round(3).head(5)

Sampled dataset size: 1024

	Proportion
0	0.024
1	0.021
2	0.010
3	0.001
4	0.002

Population Dataset (N = 25,000)

Code

print("population:", len(sample_selection_data))

# Calculate the proportions in the population data
population_proportions = sample_selection_data.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
population_proportions['Proportion'] = population_proportions['Count'] / population_proportions['Count'].sum()
# View the distribution of the population data in a chart
plt.figure(figsize=(5, 3))
plt.bar(range(len(population_proportions)), population_proportions['Proportion'])
plt.title('Population Data Proportions by Stratum')
plt.xlabel('Stratum Index')
plt.ylabel('Proportion')
plt.xticks(range(len(population_proportions)), population_proportions.index, rotation=90)
plt.tight_layout()
plt.show()
# Display the population proportions DataFrame
population_proportions[['Proportion']].head(5).round(3)

population: 25000

	Proportion
0	0.024
1	0.021
2	0.010
3	0.001
4	0.002

Once verified, the sampled dataset can be used for survey distribution.

Weight Calculation for Survey Analysis

Post-Stratification Weights

Essential when sample proportions differ from population proportions.
Corrects over- or under-representation of certain groups and improves accuracy and validity of survey results.

Post-Stratification weights can be calculated in different ways:

Each stratum individually
Combined categories (joint distribution considered)
Note. weights can be a fraction, but are always positive and non-zero.

Today’s demonstration is similar to individual strata weighting but accounts for joint distribution.

Weights Calculation Formula

Each weight is determined by dividing the sample proportion by the respondent proportion of each variable. Then, multiply the variable weights together to get the overall weight for each respondent.

In general, the formula for calculating weight for each variable is:

\[ \text{Weight} = \dfrac{\text{Sample Proportion}}{\text{Respondent Proportion}} \]

Because we used a combination of variables in the strata, each participant should be assigned an overall weight:

\[ W_{\text{overall}} = W_{\text{race}} \times W_{\text{gender}} \times W_{\text{age}} \]

The overall weight assigned to each respondent is then used in survey item analysis.

Overall Weight Calculation

This table shows an example of weight calculation for a respondent (rounded to 3 decimal places for illustration purposes):

Participant	Variables	Sample Proportion	Respondent Proportion	Variable weight = sample_p/respondent_p
Gender	Female	.574	.631	.910
Age Group	17-18	.404	.261	1.548
Race	White	.568	.556	1.022

The final weight for this respondent is \[ 0.910*1.548*1.022 = 1.439 \]

Calculate Weights in Respondents Data

Respondents Dataset

Synthetic respondents data based on a survey distributed to the sampled dataset.

Code

import pandas as pd
data_folder = './Data Files/'
# Import the respondents data file (synthetic respondents data based on a survey)
respondents_data = pd.read_excel(data_folder + 'respondents_file.xlsx')
# Make the output view smaller for presentation
respondents_data[:3]

	Q1_Gender	Q2_Race	Q3_AgeGroup	Q4_Traffic_Accidents	Q5_Academic_Performance
0	Male	White	17-18	4.0	4.0
1	Female	White	17-18	3.0	3.0
2	Female	White	17-18	5.0	5.0

Calculate Weights

Code

# Recall the propotions from the sampled data
sampled_data_proportions = {
    "gender_sampled": sampled_dataset['Gender_m'].value_counts(normalize=True).round(3).to_dict(),
    "race_sampled": sampled_dataset['Race_m'].value_counts(normalize=True).round(3).to_dict(),
    "age_group_sampled": sampled_dataset['Age_Group_2'].value_counts(normalize=True).round(3).to_dict()
}

## Calculate the respondents demographic proportions of the three variables: Gender, Race, and Age Group and create a dictionary by the propotion and round to three decimal places
respondent_proportions = {
    "Gender_r": respondents_data['Q1_Gender'].value_counts(normalize=True).round(3).to_dict(),
    "Race_r": respondents_data['Q2_Race'].value_counts(normalize=True).round(3).to_dict(),
    "Age_Group_r": respondents_data['Q3_AgeGroup'].value_counts(normalize=True).round(3).to_dict()
}

# Define the two dictionaries for population and respondents proportions
popl_dict = sampled_data_proportions
resp_dict = respondent_proportions

# Using each of the labels in the variables, then the population or respondent proportions can be retrieved from the dictionaries.
# For example, if Q1_Gender is 'Female', then the population proportion of 0.574 will be retrieved.
# Then, the respondent proportion, 'Gender_r' of 0.631 will be retrieved for the denominator.

# Now we can calculate the weights for each respondent based on the three variables
def calculate_weight(row, popl_dict, resp_dict):
    gender_weight = popl_dict['gender_sampled'].get(row['Q1_Gender'], 0) / resp_dict['Gender_r'].get(row['Q1_Gender'], 1)
    race_weight = popl_dict['race_sampled'].get(row['Q2_Race'], 0) / resp_dict['Race_r'].get(row['Q2_Race'], 1)
    age_group_weight = popl_dict['age_group_sampled'].get(row['Q3_AgeGroup'], 0) / resp_dict['Age_Group_r'].get(row['Q3_AgeGroup'], 1)
    total_weight = gender_weight * race_weight * age_group_weight
    return total_weight

Preview the weights calculated for the first three respondents.

Code

# Apply the weight calculation to each row in the respondents_data DataFrame
respondents_data['O_Weight'] = respondents_data.apply(calculate_weight, axis=1, args=(popl_dict, resp_dict))
# View the first three rows with the calculated weights
print("Respondents data with calculated weights (first 3 rows, rounded to 3 decimal places):")  
respondents_data[['Q1_Gender', 'Q2_Race', 'Q3_AgeGroup', 'O_Weight']].head(3).round(3)

Respondents data with calculated weights (first 3 rows, rounded to 3 decimal places):

	Q1_Gender	Q2_Race	Q3_AgeGroup	O_Weight
0	Male	White	17-18	1.830
1	Female	White	17-18	1.442
2	Female	White	17-18	1.442

Note. Due to decimal rounding, some weights will be slightly different from manual calculations in the previous slide.

Apply Weights to Survey Item

Question: On a scale from 0-5, how often does drug abuse lead to traffic accidents among college students?

Scale: 0 - Never to 5 - Very Often

Code

# Apply the weight calculation to each row in the respondents_data DataFrame
respondents_data['Q4_with_weights'] = respondents_data['Q4_Traffic_Accidents'] * respondents_data['O_Weight']
# View the first three rows with the calculated weighted scores
print("Respondents data with weighted Q4_Traffic_Accidents scores (first 3 rows, rounded to 3 decimal places):")  
respondents_data[['Q4_Traffic_Accidents', 'O_Weight', 'Q4_with_weights']].head(3).round(3)

Respondents data with weighted Q4_Traffic_Accidents scores (first 3 rows, rounded to 3 decimal places):

	Q4_Traffic_Accidents	O_Weight	Q4_with_weights
0	4.0	1.830	7.320
1	3.0	1.442	4.326
2	5.0	1.442	7.210

Next, we can compare the non-weighted mean to the weighted mean of the survey item.

Code

traffic_accident_response = respondents_data[['Q4_Traffic_Accidents', 'Q4_with_weights']].copy()
non_weighted_mean = traffic_accident_response['Q4_Traffic_Accidents'].mean()
weighted_mean = traffic_accident_response['Q4_with_weights'].mean()
print(f"Non-weighted mean of Q4_Traffic_Accidents: {non_weighted_mean:.3f}")
print(f"Weighted mean of Q4_Traffic_Accidents: {weighted_mean:.3f}")

Non-weighted mean of Q4_Traffic_Accidents: 4.606
Weighted mean of Q4_Traffic_Accidents: 4.940

Distribution of Non-Weighted and Weighted Scores

Visual comparison of the non-weighted and weighted scores with histogram:

Code

# Visual comparison of non-weighted vs weighted means with histogram
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.clf()  # Clear any previous plots

# Histogram of non-weighted responses
n, bins, patches = plt.hist(
    respondents_data['Q4_Traffic_Accidents'],
    bins=6,
    alpha=0.5,
    label='Non-weighted',
    color='blue',
    edgecolor='black'
)

# Histogram of weighted responses
n_w, bins_w, patches_w = plt.hist(
    respondents_data['Q4_with_weights'],
    bins=6,
    alpha=0.5,
    label='Weighted',
    color='orange',
    edgecolor='black'
)

plt.xlabel('Q4_Traffic_Accidents (0=Never, 5=Very Often)')
plt.ylabel('Frequency')
plt.title('Distribution of Q4_Traffic_Accidents Responses')
plt.legend()
plt.tight_layout()
plt.show()

The distribution of responses is more balanced after applying weights, indicating that the weighted mean better reflects the population characteristics.

Code

respondents_data[['Q4_Traffic_Accidents', 'Q4_with_weights']].describe().round(3)

	Q4_Traffic_Accidents	Q4_with_weights
count	203.000	203.000
mean	4.606	4.940
std	0.753	2.345
min	1.000	0.174
25%	4.000	3.635
50%	5.000	4.723
75%	5.000	6.022
max	5.000	10.561

Thank you!

GitHub Repository: https://github.com/caryjim/pyladies2025-presentation
Contact Me: caryjim.data@gmail.com
LinkedIn: https://www.linkedin.com/in/cary-k-jim-40046152
ResearchGate: https://www.researchgate.net/profile/Cary-Jim
Instagram (food, travel, and cat photos): @ladybirdcary