In epidemiological research, especially in environmental health studies, properly aggregating exposure data over time is crucial for accurate exposure assessment (Brink and Haagsma 2024). Environmental exposures such as noise, typically reported using standardized metrics like day-evening-night level (\(L_{den}\)) (European Union 2002; Day–evening–night noise level 2025), exhibit significant temporal variations that must be carefully processed to avoid exposure misclassification. This document provides a comprehensive guide to using the create_period_exposures() function, which offers flexible methods for aggregating environmental exposure measurements across different time periods.
Key Features of the Function
The create_period_exposures() function provides:
Aggregation of exposures into standardized time periods (month, quarter, half-year, year)
Support for both point-in-time data and period-based exposure data
Special handling for logarithmic values (e.g., decibel-scale noise measurements)
Time-weighted averaging to account for varying exposure durations
Data quality metrics to assess completeness of exposure information
Flexible parameter options to customize the aggregation process
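Function Documentation
Function Signature and Parameters
For reference, the function's signature and default argument values (shown as documentation, not evaluated):
Code
create_period_exposures <- function(
  data,
  id_var = "id",
  date_var = NULL,
  in_date_var = "period_start",
  out_date_var = "period_end",
  exposure_var = "lden",
  period = "quarter",
  energy = TRUE,
  average_function = "mean",
  weight_by_days = TRUE,
  # output variables
  result_variable = "avg_exposure",
  period_id_var = "period_id",
  period_start_var = "period_start",
  period_end_var = "period_end",
  keep_original = FALSE,
  # debugging
  debug = FALSE
)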
Parameter Descriptions

| Parameter | Description |
|-----------|-------------|
| data | Data frame containing exposure data |
| id_var | Name of ID variable (default: "id") |
| date_var | Name of date variable for point data (default: NULL) |
| in_date_var | Name of start date variable for period data (default: "period_start") |
| out_date_var | Name of end date variable for period data (default: "period_end") |
| exposure_var | Name of exposure variable (default: "lden") |
| period | Aggregation period type: "month", "quarter", "halfyear", "year" (default: "quarter") |
| energy | Whether to use energy-based averaging for logarithmic values (default: TRUE) |
| average_function | Function to use for averaging: "mean" or "median" (default: "mean") |
| weight_by_days | Whether to weight by days in period (default: TRUE) |
| result_variable | Name for the result column (default: "avg_exposure") |
| period_id_var | Name for period identifier column (default: "period_id") |
| period_start_var | Name for period start column (default: "period_start") |
| period_end_var | Name for period end column (default: "period_end") |
| keep_original | Whether to keep original values (default: FALSE) |
| debug | Whether to show debugging information (default: FALSE) |
Return Value
The function returns a data frame with one row per ID and time period, containing:
ID variable
Period identifier (e.g., “2020-01”, “2020Q1”)
Period start and end dates
Aggregated exposure value
Minimum and maximum exposure values in period
Data quality metrics:
Days in period
Days with data
Missing days
Missing percentage
Method description (e.g., “Energy-based mean (day-weighted)”)
Use Cases and Examples
Example 1: Point-in-Time Data (Daily Measurements)
This example demonstrates how to aggregate daily PM2.5 measurements into monthly periods.
Generate Sample Daily Data
Code
# Create a date sequence for 2020
all_dates <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")

# Create sample daily data for 3 individuals with extensive missing date patterns
# to better illustrate missingness threshold effects
daily_data <- data.frame(
  id = rep(1:3, each = length(all_dates)),
  date = rep(all_dates, 3)
) %>%
  # Add diverse missing patterns:
  # ID 1: Various missing patterns throughout the year
  filter(!(id == 1 & month(date) == 1 & day(date) %in% c(5:9, 15:19, 25:29))) %>%  # Jan: 15 missing days (3 sets of 5 days)
  filter(!(id == 1 & month(date) == 3 & day(date) %in% c(10:25))) %>%              # Mar: 16 consecutive days missing
  filter(!(id == 1 & month(date) %in% c(5, 6) & day(date) %% 3 == 0)) %>%          # May-Jun: Every 3rd day missing
  filter(!(id == 1 & month(date) == 8 & day(date) %in% c(1:7, 15:21))) %>%         # Aug: Two weeks missing
  filter(!(id == 1 & month(date) == 11 & day(date) %in% c(1:5, 11:15, 21:25))) %>% # Nov: 15 days missing in sets
  # ID 2: Some months with low to moderate missingness, some with high
  filter(!(id == 2 & month(date) == 2 & day(date) %% 5 == 0)) %>%                  # Feb: Every 5th day missing (low)
  filter(!(id == 2 & month(date) == 4 & day(date) %in% c(1:20))) %>%               # Apr: 20 days missing (high)
  filter(!(id == 2 & month(date) == 6 & day(date) %% 3 == 0)) %>%                  # Jun: Every 3rd day missing (moderate)
  filter(!(id == 2 & month(date) == 7 & day(date) %in% c(5:10, 20:31))) %>%        # Jul: 18 days missing (high)
  filter(!(id == 2 & month(date) == 9 & day(date) %% 4 == 0)) %>%                  # Sep: Every 4th day missing (low)
  filter(!(id == 2 & month(date) == 12 & day(date) %in% c(10:30))) %>%             # Dec: 21 days missing (high)
  # ID 3: Gradual increase in missingness through the year
  filter(!(id == 3 & month(date) == 1 & day(date) %in% c(15:17))) %>%   # Jan: 3 days missing (~10%)
  filter(!(id == 3 & month(date) == 2 & day(date) %in% c(5:9))) %>%     # Feb: 5 days missing (~17%)
  filter(!(id == 3 & month(date) == 4 & day(date) %in% c(10:16))) %>%   # Apr: 7 days missing (~23%)
  filter(!(id == 3 & month(date) == 5 & day(date) %in% c(5:14))) %>%    # May: 10 days missing (~33%)
  filter(!(id == 3 & month(date) == 7 & day(date) %in% c(1:15))) %>%    # Jul: 15 days missing (~48%)
  filter(!(id == 3 & month(date) == 9 & day(date) %in% c(1:20))) %>%    # Sep: 20 days missing (~67%)
  filter(!(id == 3 & month(date) == 11 & day(date) %in% c(1:25))) %>%   # Nov: 25 days missing (~83%)
  # Add PM2.5 exposure values
  mutate(
    # Base level
    base_pm25 = case_when(
      id == 1 ~ 12.0,  # urban area
      id == 2 ~ 15.5,  # near highway
      id == 3 ~ 8.5,   # suburban
      TRUE ~ 10.0
    ),
    # Seasonal pattern (higher in winter)
    month_effect = 5 * cos((month(date) - 1) * pi / 6),
    # Random daily variation
    daily_variation = rnorm(n(), 0, 2),
    # Final PM2.5 value (ensure no negative values)
    pm25 = pmax(0, base_pm25 + month_effect + daily_variation)
  )

# Calculate missing days per month for documentation
monthly_summary <- daily_data %>%
  group_by(id, year = year(date), month = month(date)) %>%
  summarize(
    days_with_data = n(),
    days_in_month = days_in_month(first(date)),
    missing_days = days_in_month - days_with_data,
    missing_pct = 100 * missing_days / days_in_month,
    .groups = "drop"
  )

# Show sample of the data
head(daily_data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | date | base_pm25 | month_effect | daily_variation | pm25 |
|----|------------|-----------|--------------|-----------------|----------|
| 1 | 2020-01-01 | 12 | 5 | -1.1209513 | 15.87905 |
| 1 | 2020-01-02 | 12 | 5 | -0.4603550 | 16.53965 |
| 1 | 2020-01-03 | 12 | 5 | 3.1174166 | 20.11742 |
| 1 | 2020-01-04 | 12 | 5 | 0.1410168 | 17.14102 |
| 1 | 2020-01-10 | 12 | 5 | 0.2585755 | 17.25858 |
| 1 | 2020-01-11 | 12 | 5 | 3.4301300 | 20.43013 |
Plot Daily Data to Show Patterns
Code
# Plot daily values for each ID
ggplot(daily_data, aes(x = date, y = pm25, color = factor(id))) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  facet_wrap(~ id, ncol = 1) +
  labs(
    title = "Daily PM2.5 Exposure by Individual",
    subtitle = "Note the missing data patterns and seasonal trends",
    x = "Date",
    y = "PM2.5 (μg/m³)",
    color = "ID"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Aggregate to Monthly Periods
Code
# Aggregate daily data to monthly periods
monthly_data <- create_period_exposures(
  data = daily_data,
  id_var = "id",
  date_var = "date",
  exposure_var = "pm25",
  period = "month",
  energy = FALSE,            # Arithmetic averaging for PM2.5 values
  average_function = "mean",
  result_variable = "monthly_pm25",
  debug = FALSE
)

# Show the monthly aggregated data
monthly_data %>%
  select(id, period_id, period_start, period_end, monthly_pm25,
         missing_percentage, method_description) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_id | period_start | period_end | monthly_pm25 | missing_percentage | method_description |
|----|-----------|--------------|------------|--------------|--------------------|--------------------------------|
| 1 | 2020-01 | 2020-01-01 | 2020-01-31 | 17.509085 | 48.38710 | Arithmetic mean (day-weighted) |
| 1 | 2020-02 | 2020-02-01 | 2020-02-29 | 16.257268 | 0.00000 | Arithmetic mean (day-weighted) |
| 1 | 2020-03 | 2020-03-01 | 2020-03-31 | 14.622774 | 51.61290 | Arithmetic mean (day-weighted) |
| 1 | 2020-04 | 2020-04-01 | 2020-04-30 | 12.048841 | 0.00000 | Arithmetic mean (day-weighted) |
| 1 | 2020-05 | 2020-05-01 | 2020-05-31 | 9.484487 | 32.25806 | Arithmetic mean (day-weighted) |
| 1 | 2020-06 | 2020-06-01 | 2020-06-30 | 7.438511 | 33.33333 | Arithmetic mean (day-weighted) |
| 1 | 2020-07 | 2020-07-01 | 2020-07-31 | 6.705139 | 0.00000 | Arithmetic mean (day-weighted) |
| 1 | 2020-08 | 2020-08-01 | 2020-08-31 | 8.116065 | 45.16129 | Arithmetic mean (day-weighted) |
| 1 | 2020-09 | 2020-09-01 | 2020-09-30 | 9.502457 | 0.00000 | Arithmetic mean (day-weighted) |
| 1 | 2020-10 | 2020-10-01 | 2020-10-31 | 11.822999 | 0.00000 | Arithmetic mean (day-weighted) |
Visualize Monthly Aggregates
Code
# Plot monthly values
ggplot(monthly_data, aes(x = period_start, y = monthly_pm25, color = factor(id))) +
  geom_point() +
  geom_line() +
  facet_wrap(~ id, ncol = 1) +
  labs(
    title = "Monthly Aggregated PM2.5 Exposure",
    x = "Month",
    y = "PM2.5 (μg/m³)",
    color = "ID"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Visualize Data Completeness
Code
# Plot missing data percentage by month
ggplot(monthly_data, aes(x = period_start, y = missing_percentage, fill = factor(id))) +
  geom_col(position = "dodge") +
  labs(
    title = "Missing Data Percentage by Month",
    x = "Month",
    y = "Missing Data (%)",
    fill = "ID"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Example 2: Period Data (Address History with Noise Exposures)
This example demonstrates how to handle exposure data that already covers periods (e.g., address history with noise exposure estimates).
Generate Sample Address History Data
Code
# Create sample address history with noise exposure data
address_data <- data.frame(
  id = rep(1:3, each = 4),
  period_start = as.Date(c(
    # ID 1 address history
    "2020-01-01", "2020-04-15", "2020-07-01", "2020-10-01",
    # ID 2 address history
    "2020-01-01", "2020-03-01", "2020-08-15", "2020-11-01",
    # ID 3 address history
    "2020-01-01", "2020-05-01", "2020-06-15", "2020-09-01"
  )),
  period_end = as.Date(c(
    # ID 1 address history
    "2020-04-14", "2020-06-30", "2020-09-30", "2020-12-31",
    # ID 2 address history
    "2020-02-29", "2020-08-14", "2020-10-31", "2020-12-31",
    # ID 3 address history
    "2020-04-30", "2020-06-14", "2020-08-31", "2020-12-31"
  )),
  # Different Lden (noise) exposure at each address
  lden = c(
    62.3, 55.7, 58.2, 64.1,  # ID 1 exposures
    60.5, 54.2, 66.8, 61.3,  # ID 2 exposures
    53.7, 56.9, 59.5, 65.2   # ID 3 exposures
  )
)

# Show the address history data
address_data %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_start | period_end | lden |
|----|--------------|------------|------|
| 1 | 2020-01-01 | 2020-04-14 | 62.3 |
| 1 | 2020-04-15 | 2020-06-30 | 55.7 |
| 1 | 2020-07-01 | 2020-09-30 | 58.2 |
| 1 | 2020-10-01 | 2020-12-31 | 64.1 |
| 2 | 2020-01-01 | 2020-02-29 | 60.5 |
| 2 | 2020-03-01 | 2020-08-14 | 54.2 |
| 2 | 2020-08-15 | 2020-10-31 | 66.8 |
| 2 | 2020-11-01 | 2020-12-31 | 61.3 |
| 3 | 2020-01-01 | 2020-04-30 | 53.7 |
| 3 | 2020-05-01 | 2020-06-14 | 56.9 |
| 3 | 2020-06-15 | 2020-08-31 | 59.5 |
| 3 | 2020-09-01 | 2020-12-31 | 65.2 |
Visualize Address History Timeline
Code
# Create a timeline plot of address history
address_data %>%
  mutate(
    duration = as.numeric(period_end - period_start) + 1,
    address_number = rep(1:4, 3)  # Add address number for each person
  ) %>%
  ggplot(aes(x = period_start, xend = period_end, y = factor(id), yend = factor(id),
             color = lden, size = lden)) +
  geom_segment() +
  geom_point(aes(x = period_start), size = 3) +
  geom_point(aes(x = period_end), size = 3) +
  geom_text(aes(label = sprintf("%.1f", lden),
                x = period_start + (period_end - period_start) / 2),
            vjust = -1, size = 3) +
  scale_color_viridis_c(option = "plasma") +
  labs(
    title = "Address History Timeline with Noise Exposure",
    x = "Date",
    y = "Individual ID",
    color = "Noise level (dB)",
    size = "Noise level (dB)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Aggregate to Quarterly Periods
Code
# Aggregate to quarterly periods
quarterly_data <- create_period_exposures(
  data = address_data,
  id_var = "id",
  in_date_var = "period_start",   # For period data, use in_date_var and out_date_var
  out_date_var = "period_end",
  exposure_var = "lden",
  period = "quarter",
  energy = TRUE,                  # Energy-based averaging for noise values
  weight_by_days = TRUE,          # Weight by duration at each address
  result_variable = "quarterly_lden",
  debug = FALSE
)

# Show the quarterly aggregated data
quarterly_data %>%
  select(id, period_id, period_start, period_end, quarterly_lden,
         days_with_data, days_in_period, method_description) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_id | period_start | period_end | quarterly_lden | days_with_data | days_in_period | method_description |
|----|-----------|--------------|------------|----------------|----------------|----------------|----------------------------------|
| 1 | 2020Q1 | 2020-01-01 | 2020-03-31 | 62.30000 | 91 | 91 | Energy-based mean (day-weighted) |
| 1 | 2020Q2 | 2020-04-01 | 2020-06-30 | 55.70000 | 77 | 91 | Energy-based mean (day-weighted) |
| 1 | 2020Q3 | 2020-07-01 | 2020-09-30 | 58.20000 | 92 | 92 | Energy-based mean (day-weighted) |
| 1 | 2020Q4 | 2020-10-01 | 2020-12-31 | 64.10000 | 92 | 92 | Energy-based mean (day-weighted) |
| 2 | 2020Q1 | 2020-01-01 | 2020-03-31 | 59.18761 | 91 | 91 | Energy-based mean (day-weighted) |
| 2 | 2020Q3 | 2020-07-01 | 2020-09-30 | 66.80000 | 47 | 92 | Energy-based mean (day-weighted) |
| 2 | 2020Q4 | 2020-10-01 | 2020-12-31 | 61.30000 | 61 | 92 | Energy-based mean (day-weighted) |
| 3 | 2020Q1 | 2020-01-01 | 2020-03-31 | 53.70000 | 91 | 91 | Energy-based mean (day-weighted) |
| 3 | 2020Q2 | 2020-04-01 | 2020-06-30 | 57.74578 | 61 | 91 | Energy-based mean (day-weighted) |
| 3 | 2020Q3 | 2020-07-01 | 2020-09-30 | 65.20000 | 30 | 92 | Energy-based mean (day-weighted) |
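As a sanity check, the 2020Q1 value for ID 2 can be reproduced by hand. That quarter combines 60 days at the first address (60.5 dB, 2020-01-01 to 2020-02-29) and 31 days at the second address (54.2 dB, from 2020-03-01), averaged on the energy scale with day weights:
Code
# Manual day-weighted energy-based mean for ID 2, Q1 2020
10 * log10((60 * 10^(60.5 / 10) + 31 * 10^(54.2 / 10)) / 91)
# ~ 59.18761, matching the table above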
Visualize Quarterly Aggregates
Code
# Plot quarterly noise exposures
ggplot(quarterly_data, aes(x = period_start, y = quarterly_lden, color = factor(id))) +
  geom_point(size = 3) +
  geom_line() +
  labs(
    title = "Quarterly Aggregated Noise Exposure",
    subtitle = "Day-weighted energy-based averages from address history",
    x = "Quarter",
    y = "Noise Level (dB)",
    color = "ID"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Advanced Features
Handling Logarithmic Values (Energy-Based Averaging)
For exposures measured on logarithmic scales (like decibel-based noise measurements), arithmetic averaging is incorrect. The function provides energy-based averaging options.
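Energy-based averaging converts decibel values to the linear energy scale, averages, and converts back. Given exposure values \(L_i\) (in dB) observed for \(d_i\) days each, the day-weighted energy-based mean is

\[
\bar{L} = 10 \log_{10}\!\left( \frac{\sum_i d_i \, 10^{L_i/10}}{\sum_i d_i} \right),
\]

which reduces to \(10 \log_{10}\big(\tfrac{1}{n}\sum_i 10^{L_i/10}\big)\) when all weights are equal. This is the quantity reported as "Energy-based mean (day-weighted)" in the output above and computed explicitly in the comparison below.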
Comparing Arithmetic vs. Energy-Based Averaging for Noise
Code
# Generate sample monthly noise data with large variations across the year
noise_data <- data.frame(
  id = rep(1, 12),
  month = 1:12,
  period_start = seq(as.Date("2020-01-01"), by = "month", length.out = 12),
  # Create a realistic yearly pattern with one very loud month (summer festival in July)
  lden = c(60, 59, 61, 62, 63, 64, 75, 65, 63, 62, 60, 61)
)

# Calculate arithmetic and energy-based means
arithmetic_mean <- mean(noise_data$lden)
energy_mean <- 10 * log10(mean(10^(noise_data$lden / 10)))

# Show the results
results_df <- data.frame(
  Method = c("Arithmetic Mean", "Energy-Based Mean"),
  Value = c(arithmetic_mean, energy_mean)
)

results_df %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Method | Value |
|-------------------|----------|
| Arithmetic Mean | 62.91667 |
| Energy-Based Mean | 66.17781 |
Code
# Plot to show the difference
noise_data %>%
  mutate(month_name = factor(month.name[month], levels = month.name)) %>%
  ggplot(aes(x = month_name, y = lden, group = 1)) +
  geom_point(size = 3) +
  geom_line() +
  geom_hline(yintercept = arithmetic_mean, color = "red", linetype = "dashed") +
  geom_hline(yintercept = energy_mean, color = "blue", linetype = "dashed") +
  annotate("text", x = 11.5, y = arithmetic_mean + 0.5, label = "Arithmetic Mean", hjust = 1, color = "red") +
  annotate("text", x = 11.5, y = energy_mean - 0.5, label = "Energy-Based Mean", hjust = 1, color = "blue") +
  labs(
    title = "Effect of Energy-Based Averaging for Annual Noise Levels",
    subtitle = "Note how energy-based averaging gives more weight to the loud July values",
    x = "Month",
    y = "Noise Level (dB)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Code
# Add explanation textcat("The example above shows monthly noise levels throughout a year, with a particularly loud month in July (75 dB).\n\n")
The example above shows monthly noise levels throughout a year, with a particularly loud month in July (75 dB).
Code
cat("The arithmetic mean (", round(arithmetic_mean, 1), " dB) underestimates the actual perceived average noise level.\n")
The arithmetic mean ( 62.9 dB) underestimates the actual perceived average noise level.
Code
cat("The energy-based mean (", round(energy_mean, 1), " dB) properly accounts for the logarithmic nature of the decibel scale,\n")
The energy-based mean ( 66.2 dB) properly accounts for the logarithmic nature of the decibel scale,
Code
cat("giving more weight to louder periods as they contribute disproportionately to the overall sound energy.\n\n")
giving more weight to louder periods as they contribute disproportionately to the overall sound energy.
Code
cat("This difference is crucial in epidemiological studies of noise exposure, where using arithmetic\n")
This difference is crucial in epidemiological studies of noise exposure, where using arithmetic
Code
cat("means would systematically underestimate the true exposure and potentially lead to exposure misclassification.\n")
means would systematically underestimate the true exposure and potentially lead to exposure misclassification.
Time-Weighted Averaging with Different Day Counts
When aggregating period data, weighting by the number of days ensures proper representation of exposures.
Code
# Create sample data with uneven period lengths
uneven_data <- data.frame(
  id = rep(1, 3),
  period_start = as.Date(c("2020-01-01", "2020-01-15", "2020-02-01")),
  period_end = as.Date(c("2020-01-14", "2020-01-31", "2020-02-28")),
  no2 = c(25, 15, 20)  # NO2 values
)

# Show the data
uneven_data %>%
  mutate(days = as.numeric(period_end - period_start) + 1) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_start | period_end | no2 | days |
|----|--------------|------------|-----|------|
| 1 | 2020-01-01 | 2020-01-14 | 25 | 14 |
| 1 | 2020-01-15 | 2020-01-31 | 15 | 17 |
| 1 | 2020-02-01 | 2020-02-28 | 20 | 28 |
Code
# Calculate weighted and unweighted monthly averages
monthly_weighted <- create_period_exposures(
  data = uneven_data,
  in_date_var = "period_start",
  out_date_var = "period_end",
  exposure_var = "no2",
  period = "month",
  energy = FALSE,
  weight_by_days = TRUE,
  result_variable = "monthly_no2_weighted"
)

monthly_unweighted <- create_period_exposures(
  data = uneven_data,
  in_date_var = "period_start",
  out_date_var = "period_end",
  exposure_var = "no2",
  period = "month",
  energy = FALSE,
  weight_by_days = FALSE,
  result_variable = "monthly_no2_unweighted"
)

# Compare the results
comparison <- left_join(
  select(monthly_weighted, id, period_id, monthly_no2_weighted),
  select(monthly_unweighted, id, period_id, monthly_no2_unweighted),
  by = c("id", "period_id")
) %>%
  mutate(difference = monthly_no2_weighted - monthly_no2_unweighted)

comparison %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_id | monthly_no2_weighted | monthly_no2_unweighted | difference |
|----|-----------|----------------------|------------------------|------------|
| 1 | 2020-01 | 19.51613 | 20 | -0.483871 |
| 1 | 2020-02 | 20.00000 | 20 | 0.000000 |
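The January figure can be verified by hand: 14 days at 25 and 17 days at 15 give a day-weighted mean of (14 × 25 + 17 × 15) / 31, versus the simple unweighted mean of 20:
Code
# Manual check of the day-weighted January mean
weighted.mean(c(25, 15), w = c(14, 17))
# 19.51613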
Considerations for Data Quality
The function provides metrics to assess data quality, which can be used to filter results based on completeness.
Code
# Filter monthly data based on 5% missing threshold
complete_months_5pct <- monthly_data %>%
  filter(missing_percentage <= 5) %>%
  select(id, period_id, monthly_pm25, missing_percentage)

incomplete_months_5pct <- monthly_data %>%
  filter(missing_percentage > 5) %>%
  select(id, period_id, monthly_pm25, missing_percentage)

# Compare number of complete vs incomplete months
# cat("Complete months (≤5% missing):", nrow(complete_months_5pct), "\n")
# cat("Incomplete months (>5% missing):", nrow(incomplete_months_5pct), "\n")
The choice of missingness threshold depends on your study requirements and exposure type; different thresholds yield different sets of included and excluded periods. Some considerations:
Stricter thresholds (e.g., 5%): Ensure high data quality but may exclude many periods
Moderate thresholds (e.g., 10-20%): Balance between data quality and sample size
Permissive thresholds (e.g., 30%): Include more periods but with potentially less reliable estimates
For critical research applications, it’s often advisable to perform sensitivity analyses using different thresholds to assess the impact on your findings.
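One way to structure such a sensitivity analysis is to recompute the retained periods and a summary of interest under several thresholds. A minimal sketch using the monthly_data object from Example 1 (dplyr loaded; adapt the summary to your own outcome model):
Code
# Sketch: number of periods retained and mean exposure at several missingness thresholds
thresholds <- c(5, 10, 20, 30)
sapply(thresholds, function(t) {
  kept <- monthly_data %>% filter(missing_percentage <= t)
  c(threshold = t,
    periods_kept = nrow(kept),
    mean_pm25 = mean(kept$monthly_pm25))
})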
Performance Considerations
The function can process large datasets efficiently, but performance may vary based on the number of IDs and time periods.
Code
# Generate a larger dataset
large_daily_data <- data.frame(
  id = rep(1:1000, each = 365),
  date = rep(seq(as.Date("2020-01-01"), by = "day", length.out = 365), 1000),
  exposure = rnorm(1000 * 365, 50, 10)
)

# Time the execution
system.time({
  large_monthly_data <- create_period_exposures(
    data = large_daily_data,
    date_var = "date",
    exposure_var = "exposure",
    period = "month"
  )
})
For very large datasets, consider using parallel processing:
Code
library(parallel)

# Split data by ID
ids <- unique(large_daily_data$id)
id_chunks <- split(ids, ceiling(seq_along(ids) / 100))  # Process 100 IDs at a time

# Function to process a chunk of IDs
process_chunk <- function(ids) {
  chunk_data <- large_daily_data[large_daily_data$id %in% ids, ]
  create_period_exposures(
    data = chunk_data,
    date_var = "date",
    exposure_var = "exposure",
    period = "month"
  )
}

# Process chunks in parallel
results_list <- mclapply(id_chunks, process_chunk, mc.cores = detectCores() - 1)

# Combine results
combined_results <- do.call(rbind, results_list)
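Note that mclapply() relies on process forking, which is not available on Windows (where mc.cores must be 1); parLapply() with a PSOCK cluster from the same parallel package is a portable alternative.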
Best Practices and Recommendations
General Guidelines
Choose appropriate time periods based on your research question and the exposure dynamics
Use energy-based averaging for logarithmic values (noise, decibels)
Weight by days when using period data with varying durations
Filter by data quality metrics for more reliable analyses
Validate aggregated results against raw data when possible
Common Issues and Solutions
| Issue | Solution |
|-------|----------|
| Missing data | Use missing_percentage to filter periods with insufficient data |
| Mixed data types | Convert point data to period data by duplicating the date (see the sketch below) |
| Incorrect averaging | Ensure energy = TRUE for decibel values |
| Memory limitations | Process the data in batches by ID |
| Performance concerns | Consider parallel processing for very large datasets |
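For the "mixed data types" case, a point measurement can be recast as a one-day period by using its date as both the period start and end. A minimal sketch, assuming a hypothetical data frame point_data with id, date, and pm25 columns:
Code
# Hypothetical conversion of point-in-time rows to one-day periods
period_data <- point_data %>%
  mutate(period_start = date, period_end = date) %>%
  select(id, period_start, period_end, pm25)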
Conclusion
The create_period_exposures() function provides a flexible and powerful tool for aggregating exposure data in epidemiological research. By properly accounting for time periods, data quality, and appropriate averaging methods, researchers can generate more reliable exposure metrics for health outcome analysis.
When publishing research using this function, we recommend documenting:
The chosen time periods and their relevance to health outcomes
Averaging methods (energy-based vs. arithmetic)
Data completeness thresholds applied
Any pre-processing steps applied to the raw exposure data
Some considerations:* **Stricter thresholds (e.g., 5%)**: Ensure high data quality but may exclude many periods* **Moderate thresholds (e.g., 10-20%)**: Balance between data quality and sample size* **Permissive thresholds (e.g., 30%)**: Include more periods but with potentially less reliable estimatesFor critical research applications, it's often advisable to perform sensitivity analyses using different thresholds to assess the impact on your findings.# Performance ConsiderationsThe function can process large datasets efficiently, but performance may vary based on the number of IDs and time periods.```{r}#| label: performance#| eval: false# Generate a larger datasetlarge_daily_data <-data.frame(id =rep(1:1000, each =365),date =rep(seq(as.Date("2020-01-01"), by ="day", length.out =365), 1000),exposure =rnorm(1000*365, 50, 10))# Time the executionsystem.time({ large_monthly_data <-create_period_exposures(data = large_daily_data,date_var ="date",exposure_var ="exposure",period ="month" )})```For very large datasets, consider using parallel processing:```{r}#| label: parallel-processing#| eval: falselibrary(parallel)# Split data by IDids <-unique(large_daily_data$id)id_chunks <-split(ids, ceiling(seq_along(ids) /100)) # Process 100 IDs at a time# Function to process a chunk of IDsprocess_chunk <-function(ids) { chunk_data <- large_daily_data[large_daily_data$id %in% ids, ]create_period_exposures(data = chunk_data,date_var ="date",exposure_var ="exposure",period ="month" )}# Process chunks in parallelresults_list <-mclapply(id_chunks, process_chunk, mc.cores =detectCores() -1)# Combine resultscombined_results <-do.call(rbind, results_list)```# Best Practices and Recommendations## General Guidelines1. **Choose appropriate time periods** based on your research question and the exposure dynamics2. **Use energy-based averaging** for logarithmic values (noise, decibels)3. **Weight by days** when using period data with varying durations4. **Filter by data quality** metrics for more reliable analyses5. **Validate aggregated results** against raw data when possible## Common Issues and Solutions| Issue | Solution ||-------|----------|| Missing data | Use missing_percentage to filter periods with insufficient data || Mixed data types | Convert point data to period data by duplicating the date || Incorrect averaging | Ensure energy=TRUE for decibel values || Memory limitations | Process the data in batches by ID || Performance concerns | Consider parallel processing for very large datasets |# ConclusionThe `create_period_exposures()` function provides a flexible and powerful tool for aggregating exposure data in epidemiological research. By properly accounting for time periods, data quality, and appropriate averaging methods, researchers can generate more reliable exposure metrics for health outcome analysis.When publishing research using this function, we recommend documenting:1. The chosen time periods and their relevance to health outcomes2. Averaging methods (energy-based vs. arithmetic)3. Data completeness thresholds applied4. Any pre-processing steps applied to the raw exposure data# Session Information```{r}#| label: session-infosessionInfo()```