Stratified Sampling in Survey Research with Python

PyLadiesCon 2025

Cary K. Jim, Ph.D.

Introduction

Learning Outcomes

  1. Understand when to use stratified sampling over random sampling

  2. Learn how to define subgroups (strata) and implement sampling in Python

  3. Evaluate sample representativeness by checking distribution patterns

  4. Apply weights to adjust for unequal proportions in score estimation

Today’s presentation

Demonstrate how stratified random sampling to identify subgroups - Post-stratification weights are applied to correct unequal representation

Stratified Sampling: Why and When?

  • Challenges in surveying large populations: time, budget, and resource constraints

  • Importance of representative samples in research to reduce bias

  • Align research outcomes with actionable goals: informing policy, guiding interventions, or enhancing program effectiveness

  • Allow for subgroup analysis to understand diverse perspectives within the population

Figure 1. Types of Sampling Methods with Illustration (Generated with Copilot)

Stratified Sampling

Overview

Stratified sampling is a research design method that involves dividing a population into smaller subgroups known as strata.

  • Strata are formed based on shared attributes or characteristics (e.g., age group, gender, geographic location).

  • A sampling technique, such as random, systematic, or convenience sampling, is then applied to select participants in each stratum.

📘 Examples

  1. Systematic sampling within strata: You divided a college students population based on their major and then select every kth element based on the size needed from each stratum.

  2. Stratified random sampling within strata: You divided a college students population based on their major and then randomly select a proportionate number of students from each major.

Figure 2. Stratified Random Sampling Workflow Diagram (Generated with Copilot)

Python Implementation of Stratified Random Sampling

Population Dataset

In this example, the population is defined as 25,000 college students.

Variables in the population dataset:

Code
# View the first three rows of the population dataset
population_data.head(3)
StudentID Gender Age Race Ethnicity StudentType OverallGPA
0 SID000001 Male 18 White Hispanic or Latino Freshman 4.00
1 SID000002 Female 40 Two or more races Hispanic or Latino Continuing Education 0.00
2 SID000003 Female 19 White Not Hispanic or Latino Junior 3.23

Cochran’s Sample Size Formula

The ideal sample size for an infinite population is given by: \[ n_0 = \frac{Z^2 \, p \, (1 - p)}{e^2} \]

where:

  • ( n0 ) = initial sample size from Cochran’s formula for an infinite population
  • ( Z ) = Z-value (e.g., 1.96 for 95% confidence, z-table lookup)
  • ( p ) = estimated proportion of the attribute present in the population (assuming 0.5 for largest variability)
  • ( e ) = margin of error (e.g., 0.05 for 5%, the desired level of precision)

The sample size adjusted for a finite population is given by:

\[ n = \frac{n_0}{1 + \left( \frac{n_0 - 1}{N} \right)} \]

where:

  • ( n ) = adjusted sample size for a finite population
  • ( n0 ) = theoretical sample value obtained from Cochran’s formula for an infinite population
  • ( N ) = population size

Code example: Cochran’s Sample Size Calculation

Code
# Cochran's Sample Size Calculation
import scipy.stats as stats
def cochran_sample_size(Z, p, e, N):
    n0 = (Z**2 * p * (1 - p)) / (e**2)
    n = n0 / (1 + ((n0 - 1) / N))
    return int(n)

# Parameters
Z = stats.norm.ppf(0.975)   # Calculate the critical z-value for a 95% confidence level in a standard normal distribution
p = 0.5                     # Estimated proportion (use 0.5 for maximum variability, sort of like worst-case scenario)
e = 0.05                    # Margin of error (precision level, more precision requires larger sample size)
N = 25000                   # Finite population size (in this example, student population size)

# Calculate sample size

# Calculate n0 and adjusted sample size for finite population
n0 = (Z**2 * p * (1 - p)) / (e**2)
sample_size = n0 / (1 + ((n0 - 1) / N))
print(f"Ideal sample size (n0): {round(n0)}")
print(f"Adjusted sample size for finite population: {int(sample_size)}")
Ideal sample size (n0): 384
Adjusted sample size for finite population: 378

But Wait…There’s Re-coding to Do!

Example 1: By increments of 2 or 10

Code
# Binding age groups by increments of 2 years for younger ages and 10 years for older ages
population_data['Age_Group_1'] = pd.cut(population_data['Age'], 
                                       bins=[17, 19, 21, 23, 25, 27, 29, 31, 41, 50],
                                       labels=['17-18', '19-20', '21-22', '23-24', '25-26', '27-28', '29-30', '31-40', '41-50'], right=False)
# View the distribution after recoding for both counts and proportions
# Make a list of two outputs: counts and proportions
# To view counts
age_1_group_counts = population_data['Age_Group_1'].value_counts().sort_index()
# To view proportions
age_1_group_proportions = population_data['Age_Group_1'].value_counts(normalize=True).sort_index().round(3)
age_1_group_counts, age_1_group_proportions
(Age_Group_1
 17-18    10123
 19-20     8896
 21-22     4222
 23-24      506
 25-26      329
 27-28      277
 29-30      235
 31-40      400
 41-50       12
 Name: count, dtype: int64,
 Age_Group_1
 17-18    0.405
 19-20    0.356
 21-22    0.169
 23-24    0.020
 25-26    0.013
 27-28    0.011
 29-30    0.009
 31-40    0.016
 41-50    0.000
 Name: proportion, dtype: float64)

Example 2: By meaningful groups

Code
# We will keep the first three groups like the previous example, but then group all older ages into one group, as 31 and above.

population_data['Age_Group_2'] = pd.cut(population_data['Age'], 
                                       bins=[17, 19, 21, 23, 25, 31, 44],
                                       labels=['17-18', '19-20', '21-22', '23-25', '26-30', '31+'],
                                   right=False)
# View the distribution after recoding for both counts and proportions
# To view counts
age_2_group_counts = population_data['Age_Group_2'].value_counts().sort_index()
# To view proportions
age_2_group_proportions = population_data['Age_Group_2'].value_counts(normalize=True).sort_index().round(3)
age_2_group_counts, age_2_group_proportions
(Age_Group_2
 17-18    10123
 19-20     8896
 21-22     4222
 23-25      506
 26-30      841
 31+        411
 Name: count, dtype: int64,
 Age_Group_2
 17-18    0.405
 19-20    0.356
 21-22    0.169
 23-25    0.020
 26-30    0.034
 31+      0.016
 Name: proportion, dtype: float64)

Reminder: The proportion of each subgroup should be calculated after re-coding. If you adjust any levels within a variable, it can affect the distribution of values in the dataset. In practice, you may repeat this step until you are able to draw a proper set of subgroups before creating the strata.

Calculate Proportions

The strata (subgroups) are determined by the following categories:

  • Gender (Female or Male)

  • Race (Asian, Black or African American, White, Two or more races, Other)

  • Age Group (16-18, 19-20, 21-22, 23-30, 31+)

Code example: Proportion by each stratum

Code
# Create a distribution chart for each of the three variables after recoding
import matplotlib.pyplot as plt

# Proportion of Gender in the Sample Selection Data and add the proportion on top of each bar
gender_proportions = sample_selection_data['Gender_m'].value_counts(normalize=True).round(2)
gender_proportions.plot(kind='bar', title='Proportion of Gender')
plt.xticks(rotation=0)
# Round to 2 decimal places and convert to percentage
for i, v in enumerate(gender_proportions):
    plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()

Code
# Proportion of Race in the Sample Selection Data
race_proportions = sample_selection_data['Race_m'].value_counts(normalize=True).round(2)
race_proportions.plot(kind='bar', title='Proportion of Race')
plt.xticks(rotation=0)
# Round to 2 decimal places and convert to percentage
for i, v in enumerate(race_proportions):
    plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()

Code
# Proportion of Age Group in the Sample Selection Data
age_proportions = sample_selection_data['Age_Group_2'].value_counts(normalize=True).round(2)
age_proportions.plot(kind='bar', title='Proportion of Age Group')
plt.xticks(rotation=0)
for i, v in enumerate(age_proportions):
    plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()

Re-sampling

Code
# This time we are using a different set of parameters. The formula remains the same from Step 2.
# Parameters
Z = stats.norm.ppf(0.975)   # Calculate the critical z-value for
# a 95% confidence level in a standard normal distribution
p = 0.5                     # Estimated proportion (use 0.5 for maximum variability, sort of like worst-case scenario)
e_3 = 0.03                    # Margin of error (precision level, more precision requires larger sample size)
N = 25000                   # Finite population size (in this example, student population size)

# Calculate sample size
# Calculate n0 and adjusted sample size for finite population
n0 = (Z**2 * p * (1 - p)) / (e_3**2)
sample_size_larger_e = n0 / (1 + ((n0 - 1) / N))
print(f"Ideal sample size (n0) for 95% confidence level with 3% margin of error: {round(n0)}")
print(f"Adjusted sample size of finite population for 95% confidence level with 3% margin of error: {int(sample_size_larger_e)}")
Ideal sample size (n0) for 95% confidence level with 3% margin of error: 1067
Adjusted sample size of finite population for 95% confidence level with 3% margin of error: 1023

Then, we apply the proportions to get the sample of this updated sample size.

Code
# Determine the combined proportions of the strata in the sample selection data with the modified variables
stratum_set_1 = sample_selection_data.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
stratum_set_1['Proportion'] = stratum_set_1['Count'] / stratum_set_1['Count'].sum()

# use the 3% sample size to draw the stratified random sample
stratum_sizes_three_moe = (stratum_set_1['Proportion'] * int(sample_size_larger_e)).round().astype(int)
# Add the grouping variables back to the stratum sizes DataFrame for clarity
stratum_sizes_three_moe = pd.concat([stratum_set_1[['Gender_m', 'Race_m', 'Age_Group_2']], stratum_sizes_three_moe.rename('Sample_Size')], axis=1)
print("Stratum sizes for each subgroup based on 3% margin of error sample size (first 3 rows):")
stratum_sizes_three_moe.head(3)
Stratum sizes for each subgroup based on 3% margin of error sample size (first 3 rows):
Gender_m Race_m Age_Group_2 Sample_Size
0 Female Asian 17-18 25
1 Female Asian 19-20 21
2 Female Asian 21-22 10

Draw the stratified random sample based on the calculated stratum sizes.

Code
# Initialize an empty list to hold sampled dataframes
sampled_groups = []
for i, row in stratum_sizes_three_moe.iterrows():
    stratum_filter = (
        (sample_selection_data['Gender_m'] == row['Gender_m']) &
        (sample_selection_data['Race_m'] == row['Race_m']) &
        (sample_selection_data['Age_Group_2'] == row['Age_Group_2'])
    )
    stratum_population = sample_selection_data[stratum_filter]
    # Use the sample size from the current row
    stratum_sample_size = int(row['Sample_Size'])

    # Ensure we don't sample from empty strata or more than available
    n = min(stratum_sample_size, len(stratum_population))
    if n > 0:
        stratum_sample = stratum_population.sample(n=n, replace=False)
        sampled_groups.append(stratum_sample)

# Combine all sampled dataframes into a single dataframe (guard against empty list)
if sampled_groups:
    sampled_dataset = pd.concat(sampled_groups).reset_index(drop=True)
else:
    sampled_dataset = pd.DataFrame(columns=sample_selection_data.columns)

# Create an alias expected by downstream cells and align column naming
sampled_data = sampled_dataset.copy()
if 'Age_Group' not in sampled_data.columns and 'Age_Group_2' in sampled_data.columns:
    sampled_data['Age_Group'] = sampled_data['Age_Group_2']
# View the first three rows of the sampled dataset
print("Sampled dataset (first 3 rows):")
sampled_data[['StudentID', 'Gender_m', 'Race_m', 'Age_Group']].head(3)
Sampled dataset (first 3 rows):
StudentID Gender_m Race_m Age_Group
0 SID024775 Female Asian 17-18
1 SID019476 Female Asian 17-18
2 SID018187 Female Asian 17-18

Compare Sample Proportions to Population Proportions

Stratum are indexed based on the order in the DataFrame.

Sampled Dataset (n = 1024)

Code
print("Sampled dataset size:", len(sampled_dataset))

# Calculate the proportions in the sampled data
sample_proportions = sampled_dataset.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
sample_proportions['Proportion'] = sample_proportions['Count'] / sample_proportions['Count'].sum()
# View the distribution of the sampled data in a chart
import matplotlib.pyplot as plt
# Chart for Sampled Data Proportions
plt.figure(figsize=(5, 3))
plt.bar(range(len(sample_proportions)), sample_proportions['Proportion'])
plt.title('Sampled Data Proportions by Stratum')
plt.xlabel('Stratum Index')
plt.ylabel('Proportion')
plt.xticks(range(len(sample_proportions)), sample_proportions.index, rotation=90)
plt.tight_layout()
plt.show()
# Display the sampled proportions DataFrame
sample_proportions[['Proportion']].round(3).head(5)
Sampled dataset size: 1024

Proportion
0 0.024
1 0.021
2 0.010
3 0.001
4 0.002

Population Dataset (N = 25,000)

Code
print("population:", len(sample_selection_data))

# Calculate the proportions in the population data
population_proportions = sample_selection_data.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
population_proportions['Proportion'] = population_proportions['Count'] / population_proportions['Count'].sum()
# View the distribution of the population data in a chart
plt.figure(figsize=(5, 3))
plt.bar(range(len(population_proportions)), population_proportions['Proportion'])
plt.title('Population Data Proportions by Stratum')
plt.xlabel('Stratum Index')
plt.ylabel('Proportion')
plt.xticks(range(len(population_proportions)), population_proportions.index, rotation=90)
plt.tight_layout()
plt.show()
# Display the population proportions DataFrame
population_proportions[['Proportion']].head(5).round(3)
population: 25000

Proportion
0 0.024
1 0.021
2 0.010
3 0.001
4 0.002

Once verified, the sampled dataset can be used for survey distribution.

Weight Calculation for Survey Analysis

Post-Stratification Weights

  • Essential when sample proportions differ from population proportions.

  • Corrects over- or under-representation of certain groups and improves accuracy and validity of survey results.

Post-Stratification weights can be calculated in different ways:

  • Each stratum individually

  • Combined categories (joint distribution considered)

  • Note. weights can be a fraction, but are always positive and non-zero.

Today’s demonstration is similar to individual strata weighting but accounts for joint distribution.

Weights Calculation Formula

Each weight is determined by dividing the sample proportion by the respondent proportion of each variable. Then, multiply the variable weights together to get the overall weight for each respondent.

In general, the formula for calculating weight for each variable is:

\[ \text{Weight} = \dfrac{\text{Sample Proportion}}{\text{Respondent Proportion}} \]

Because we used a combination of variables in the strata, each participant should be assigned an overall weight:

\[ W_{\text{overall}} = W_{\text{race}} \times W_{\text{gender}} \times W_{\text{age}} \]

The overall weight assigned to each respondent is then used in survey item analysis.

Overall Weight Calculation

This table shows an example of weight calculation for a respondent (rounded to 3 decimal places for illustration purposes):

Participant Variables Sample Proportion Respondent Proportion Variable weight = sample_p/respondent_p
Gender Female .574 .631 .910
Age Group 17-18 .404 .261 1.548
Race White .568 .556 1.022

The final weight for this respondent is \[ 0.910*1.548*1.022 = 1.439 \]

Calculate Weights in Respondents Data

Respondents Dataset

Synthetic respondents data based on a survey distributed to the sampled dataset.

Code
import pandas as pd
data_folder = './Data Files/'
# Import the respondents data file (synthetic respondents data based on a survey)
respondents_data = pd.read_excel(data_folder + 'respondents_file.xlsx')
# Make the output view smaller for presentation
respondents_data[:3]
Q1_Gender Q2_Race Q3_AgeGroup Q4_Traffic_Accidents Q5_Academic_Performance
0 Male White 17-18 4.0 4.0
1 Female White 17-18 3.0 3.0
2 Female White 17-18 5.0 5.0

Calculate Weights

Code
# Recall the propotions from the sampled data
sampled_data_proportions = {
    "gender_sampled": sampled_dataset['Gender_m'].value_counts(normalize=True).round(3).to_dict(),
    "race_sampled": sampled_dataset['Race_m'].value_counts(normalize=True).round(3).to_dict(),
    "age_group_sampled": sampled_dataset['Age_Group_2'].value_counts(normalize=True).round(3).to_dict()
}

## Calculate the respondents demographic proportions of the three variables: Gender, Race, and Age Group and create a dictionary by the propotion and round to three decimal places
respondent_proportions = {
    "Gender_r": respondents_data['Q1_Gender'].value_counts(normalize=True).round(3).to_dict(),
    "Race_r": respondents_data['Q2_Race'].value_counts(normalize=True).round(3).to_dict(),
    "Age_Group_r": respondents_data['Q3_AgeGroup'].value_counts(normalize=True).round(3).to_dict()
}

# Define the two dictionaries for population and respondents proportions
popl_dict = sampled_data_proportions
resp_dict = respondent_proportions

# Using each of the labels in the variables, then the population or respondent proportions can be retrieved from the dictionaries.
# For example, if Q1_Gender is 'Female', then the population proportion of 0.574 will be retrieved.
# Then, the respondent proportion, 'Gender_r' of 0.631 will be retrieved for the denominator.

# Now we can calculate the weights for each respondent based on the three variables
def calculate_weight(row, popl_dict, resp_dict):
    gender_weight = popl_dict['gender_sampled'].get(row['Q1_Gender'], 0) / resp_dict['Gender_r'].get(row['Q1_Gender'], 1)
    race_weight = popl_dict['race_sampled'].get(row['Q2_Race'], 0) / resp_dict['Race_r'].get(row['Q2_Race'], 1)
    age_group_weight = popl_dict['age_group_sampled'].get(row['Q3_AgeGroup'], 0) / resp_dict['Age_Group_r'].get(row['Q3_AgeGroup'], 1)
    total_weight = gender_weight * race_weight * age_group_weight
    return total_weight

Preview the weights calculated for the first three respondents.

Code
# Apply the weight calculation to each row in the respondents_data DataFrame
respondents_data['O_Weight'] = respondents_data.apply(calculate_weight, axis=1, args=(popl_dict, resp_dict))
# View the first three rows with the calculated weights
print("Respondents data with calculated weights (first 3 rows, rounded to 3 decimal places):")  
respondents_data[['Q1_Gender', 'Q2_Race', 'Q3_AgeGroup', 'O_Weight']].head(3).round(3)
Respondents data with calculated weights (first 3 rows, rounded to 3 decimal places):
Q1_Gender Q2_Race Q3_AgeGroup O_Weight
0 Male White 17-18 1.830
1 Female White 17-18 1.442
2 Female White 17-18 1.442

Note. Due to decimal rounding, some weights will be slightly different from manual calculations in the previous slide.

Apply Weights to Survey Item

Question: On a scale from 0-5, how often does drug abuse lead to traffic accidents among college students?

Scale: 0 - Never to 5 - Very Often

Code
# Apply the weight calculation to each row in the respondents_data DataFrame
respondents_data['Q4_with_weights'] = respondents_data['Q4_Traffic_Accidents'] * respondents_data['O_Weight']
# View the first three rows with the calculated weighted scores
print("Respondents data with weighted Q4_Traffic_Accidents scores (first 3 rows, rounded to 3 decimal places):")  
respondents_data[['Q4_Traffic_Accidents', 'O_Weight', 'Q4_with_weights']].head(3).round(3)
Respondents data with weighted Q4_Traffic_Accidents scores (first 3 rows, rounded to 3 decimal places):
Q4_Traffic_Accidents O_Weight Q4_with_weights
0 4.0 1.830 7.320
1 3.0 1.442 4.326
2 5.0 1.442 7.210

Next, we can compare the non-weighted mean to the weighted mean of the survey item.

Code
traffic_accident_response = respondents_data[['Q4_Traffic_Accidents', 'Q4_with_weights']].copy()
non_weighted_mean = traffic_accident_response['Q4_Traffic_Accidents'].mean()
weighted_mean = traffic_accident_response['Q4_with_weights'].mean()
print(f"Non-weighted mean of Q4_Traffic_Accidents: {non_weighted_mean:.3f}")
print(f"Weighted mean of Q4_Traffic_Accidents: {weighted_mean:.3f}")
Non-weighted mean of Q4_Traffic_Accidents: 4.606
Weighted mean of Q4_Traffic_Accidents: 4.940

Distribution of Non-Weighted and Weighted Scores

Visual comparison of the non-weighted and weighted scores with histogram:

Code
# Visual comparison of non-weighted vs weighted means with histogram
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.clf()  # Clear any previous plots

# Histogram of non-weighted responses
n, bins, patches = plt.hist(
    respondents_data['Q4_Traffic_Accidents'],
    bins=6,
    alpha=0.5,
    label='Non-weighted',
    color='blue',
    edgecolor='black'
)

# Histogram of weighted responses
n_w, bins_w, patches_w = plt.hist(
    respondents_data['Q4_with_weights'],
    bins=6,
    alpha=0.5,
    label='Weighted',
    color='orange',
    edgecolor='black'
)

plt.xlabel('Q4_Traffic_Accidents (0=Never, 5=Very Often)')
plt.ylabel('Frequency')
plt.title('Distribution of Q4_Traffic_Accidents Responses')
plt.legend()
plt.tight_layout()
plt.show()

The distribution of responses is more balanced after applying weights, indicating that the weighted mean better reflects the population characteristics.

Code
respondents_data[['Q4_Traffic_Accidents', 'Q4_with_weights']].describe().round(3)
Q4_Traffic_Accidents Q4_with_weights
count 203.000 203.000
mean 4.606 4.940
std 0.753 2.345
min 1.000 0.174
25% 4.000 3.635
50% 5.000 4.723
75% 5.000 6.022
max 5.000 10.561

Thank you!