In epidemiological research, especially in environmental health studies, properly aggregating exposure data over time is crucial for accurate exposure assessment (Brink and Haagsma 2024). Environmental exposures such as noise, typically reported using standardized metrics like day-evening-night level (\(L_{den}\)) (European Union 2002; Day–evening–night noise level 2025), exhibit significant temporal variations that must be carefully processed to avoid exposure misclassification. This document provides a comprehensive guide to using the create_period_exposures() function, which offers flexible methods for aggregating environmental exposure measurements across different time periods.
Key Features of the Function
The create_period_exposures() function provides:
Aggregation of exposures into standardized time periods (month, quarter, half-year, year)
Support for both point-in-time data and period-based exposure data
Special handling for logarithmic values (e.g., decibel-scale noise measurements)
Time-weighted averaging to account for varying exposure durations
Data quality metrics to assess completeness of exposure information
Flexible parameter options to customize the aggregation process
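Function Documentation
Function Signature and Parameters
For reference, the function's signature and default argument values (shown as documentation, not evaluated):
Code
create_period_exposures <- function(
  data,
  id_var = "id",
  date_var = NULL,
  in_date_var = "period_start",
  out_date_var = "period_end",
  exposure_var = "lden",
  period = "quarter",
  energy = TRUE,
  average_function = "mean",
  weight_by_days = TRUE,
  # output variables
  result_variable = "avg_exposure",
  period_id_var = "period_id",
  period_start_var = "period_start",
  period_end_var = "period_end",
  keep_original = FALSE,
  # debugging
  debug = FALSE
)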
Parameter Descriptions

| Parameter | Description |
|-----------|-------------|
| data | Data frame containing exposure data |
| id_var | Name of ID variable (default: "id") |
| date_var | Name of date variable for point data (default: NULL) |
| in_date_var | Name of start date variable for period data (default: "period_start") |
| out_date_var | Name of end date variable for period data (default: "period_end") |
| exposure_var | Name of exposure variable (default: "lden") |
| period | Aggregation period type: "month", "quarter", "halfyear", "year" (default: "quarter") |
| energy | Whether to use energy-based averaging for logarithmic values (default: TRUE) |
| average_function | Function to use for averaging: "mean" or "median" (default: "mean") |
| weight_by_days | Whether to weight by days in period (default: TRUE) |
| result_variable | Name for the result column (default: "avg_exposure") |
| period_id_var | Name for period identifier column (default: "period_id") |
| period_start_var | Name for period start column (default: "period_start") |
| period_end_var | Name for period end column (default: "period_end") |
| keep_original | Whether to keep original values (default: FALSE) |
| debug | Whether to show debugging information (default: FALSE) |
Return Value
The function returns a data frame with one row per ID and time period, containing:
ID variable
Period identifier (e.g., “2020-01”, “2020Q1”)
Period start and end dates
Aggregated exposure value
Minimum and maximum exposure values in period
Data quality metrics:
Days in period
Days with data
Missing days
Missing percentage
Method description (e.g., “Energy-based mean (day-weighted)”)
Use Cases and Examples
Example 1: Point-in-Time Data (Daily Measurements)
This example demonstrates how to aggregate daily PM2.5 measurements into monthly periods.
Generate Sample Daily Data
Code
# Create a date sequence for 2020
all_dates <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")

# Create sample daily data for 3 individuals with extensive missing date patterns
# to better illustrate missingness threshold effects
daily_data <- data.frame(
  id = rep(1:3, each = length(all_dates)),
  date = rep(all_dates, 3)
) %>%
  # Add diverse missing patterns:
  # ID 1: Various missing patterns throughout the year
  filter(!(id == 1 & month(date) == 1 & day(date) %in% c(5:9, 15:19, 25:29))) %>%  # Jan: 15 missing days (3 sets of 5 days)
  filter(!(id == 1 & month(date) == 3 & day(date) %in% c(10:25))) %>%              # Mar: 16 consecutive days missing
  filter(!(id == 1 & month(date) %in% c(5, 6) & day(date) %% 3 == 0)) %>%          # May-Jun: Every 3rd day missing
  filter(!(id == 1 & month(date) == 8 & day(date) %in% c(1:7, 15:21))) %>%         # Aug: Two weeks missing
  filter(!(id == 1 & month(date) == 11 & day(date) %in% c(1:5, 11:15, 21:25))) %>% # Nov: 15 days missing in sets
  # ID 2: Some months with low to moderate missingness, some with high
  filter(!(id == 2 & month(date) == 2 & day(date) %% 5 == 0)) %>%                  # Feb: Every 5th day missing (low)
  filter(!(id == 2 & month(date) == 4 & day(date) %in% c(1:20))) %>%               # Apr: 20 days missing (high)
  filter(!(id == 2 & month(date) == 6 & day(date) %% 3 == 0)) %>%                  # Jun: Every 3rd day missing (moderate)
  filter(!(id == 2 & month(date) == 7 & day(date) %in% c(5:10, 20:31))) %>%        # Jul: 18 days missing (high)
  filter(!(id == 2 & month(date) == 9 & day(date) %% 4 == 0)) %>%                  # Sep: Every 4th day missing (low)
  filter(!(id == 2 & month(date) == 12 & day(date) %in% c(10:30))) %>%             # Dec: 21 days missing (high)
  # ID 3: Gradual increase in missingness through the year
  filter(!(id == 3 & month(date) == 1 & day(date) %in% c(15:17))) %>%   # Jan: 3 days missing (~10%)
  filter(!(id == 3 & month(date) == 2 & day(date) %in% c(5:9))) %>%     # Feb: 5 days missing (~17%)
  filter(!(id == 3 & month(date) == 4 & day(date) %in% c(10:16))) %>%   # Apr: 7 days missing (~23%)
  filter(!(id == 3 & month(date) == 5 & day(date) %in% c(5:14))) %>%    # May: 10 days missing (~33%)
  filter(!(id == 3 & month(date) == 7 & day(date) %in% c(1:15))) %>%    # Jul: 15 days missing (~48%)
  filter(!(id == 3 & month(date) == 9 & day(date) %in% c(1:20))) %>%    # Sep: 20 days missing (~67%)
  filter(!(id == 3 & month(date) == 11 & day(date) %in% c(1:25))) %>%   # Nov: 25 days missing (~83%)
  # Add PM2.5 exposure values
  mutate(
    # Base level
    base_pm25 = case_when(
      id == 1 ~ 12.0,  # urban area
      id == 2 ~ 15.5,  # near highway
      id == 3 ~ 8.5,   # suburban
      TRUE ~ 10.0
    ),
    # Seasonal pattern (higher in winter)
    month_effect = 5 * cos((month(date) - 1) * pi / 6),
    # Random daily variation
    daily_variation = rnorm(n(), 0, 2),
    # Final PM2.5 value (ensure no negative values)
    pm25 = pmax(0, base_pm25 + month_effect + daily_variation)
  )

# Calculate missing days per month for documentation
monthly_summary <- daily_data %>%
  group_by(id, year = year(date), month = month(date)) %>%
  summarize(
    days_with_data = n(),
    days_in_month = days_in_month(first(date)),
    missing_days = days_in_month - days_with_data,
    missing_pct = 100 * missing_days / days_in_month,
    .groups = "drop"
  )

# Show sample of the data
head(daily_data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | date | base_pm25 | month_effect | daily_variation | pm25 |
|----|------------|-----------|--------------|-----------------|----------|
| 1 | 2020-01-01 | 12 | 5 | -1.1209513 | 15.87905 |
| 1 | 2020-01-02 | 12 | 5 | -0.4603550 | 16.53965 |
| 1 | 2020-01-03 | 12 | 5 | 3.1174166 | 20.11742 |
| 1 | 2020-01-04 | 12 | 5 | 0.1410168 | 17.14102 |
| 1 | 2020-01-10 | 12 | 5 | 0.2585755 | 17.25858 |
| 1 | 2020-01-11 | 12 | 5 | 3.4301300 | 20.43013 |
Plot Daily Data to Show Patterns
Code
# Plot daily values for each ID
ggplot(daily_data, aes(x = date, y = pm25, color = factor(id))) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  facet_wrap(~ id, ncol = 1) +
  labs(
    title = "Daily PM2.5 Exposure by Individual",
    subtitle = "Note the missing data patterns and seasonal trends",
    x = "Date",
    y = "PM2.5 (μg/m³)",
    color = "ID"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Aggregate to Monthly Periods
Code
# Aggregate daily data to monthly periods
monthly_data <- create_period_exposures(
  data = daily_data,
  id_var = "id",
  date_var = "date",
  exposure_var = "pm25",
  period = "month",
  energy = FALSE,            # Arithmetic averaging for PM2.5 values
  average_function = "mean",
  result_variable = "monthly_pm25",
  debug = FALSE
)

# Show the monthly aggregated data
monthly_data %>%
  select(id, period_id, period_start, period_end, monthly_pm25,
         missing_percentage, method_description) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_id | period_start | period_end | monthly_pm25 | missing_percentage | method_description |
|----|-----------|--------------|------------|--------------|--------------------|--------------------------------|
| 1 | 2020-01 | 2020-01-01 | 2020-01-31 | 17.509085 | 48.38710 | Arithmetic mean (day-weighted) |
| 1 | 2020-02 | 2020-02-01 | 2020-02-29 | 16.257268 | 0.00000 | Arithmetic mean (day-weighted) |
| 1 | 2020-03 | 2020-03-01 | 2020-03-31 | 14.622774 | 51.61290 | Arithmetic mean (day-weighted) |
| 1 | 2020-04 | 2020-04-01 | 2020-04-30 | 12.048841 | 0.00000 | Arithmetic mean (day-weighted) |
| 1 | 2020-05 | 2020-05-01 | 2020-05-31 | 9.484487 | 32.25806 | Arithmetic mean (day-weighted) |
| 1 | 2020-06 | 2020-06-01 | 2020-06-30 | 7.438511 | 33.33333 | Arithmetic mean (day-weighted) |
| 1 | 2020-07 | 2020-07-01 | 2020-07-31 | 6.705139 | 0.00000 | Arithmetic mean (day-weighted) |
| 1 | 2020-08 | 2020-08-01 | 2020-08-31 | 8.116065 | 45.16129 | Arithmetic mean (day-weighted) |
| 1 | 2020-09 | 2020-09-01 | 2020-09-30 | 9.502457 | 0.00000 | Arithmetic mean (day-weighted) |
| 1 | 2020-10 | 2020-10-01 | 2020-10-31 | 11.822999 | 0.00000 | Arithmetic mean (day-weighted) |
Visualize Monthly Aggregates
Code
# Plot monthly values
ggplot(monthly_data, aes(x = period_start, y = monthly_pm25, color = factor(id))) +
  geom_point() +
  geom_line() +
  facet_wrap(~ id, ncol = 1) +
  labs(
    title = "Monthly Aggregated PM2.5 Exposure",
    x = "Month",
    y = "PM2.5 (μg/m³)",
    color = "ID"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Visualize Data Completeness
Code
# Plot missing data percentage by month
ggplot(monthly_data, aes(x = period_start, y = missing_percentage, fill = factor(id))) +
  geom_col(position = "dodge") +
  labs(
    title = "Missing Data Percentage by Month",
    x = "Month",
    y = "Missing Data (%)",
    fill = "ID"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Example 2: Period Data (Address History with Noise Exposures)
This example demonstrates how to handle exposure data that already covers periods (e.g., address history with noise exposure estimates).
Generate Sample Address History Data
Code
# Create sample address history with noise exposure data
address_data <- data.frame(
  id = rep(1:3, each = 4),
  period_start = as.Date(c(
    # ID 1 address history
    "2020-01-01", "2020-04-15", "2020-07-01", "2020-10-01",
    # ID 2 address history
    "2020-01-01", "2020-03-01", "2020-08-15", "2020-11-01",
    # ID 3 address history
    "2020-01-01", "2020-05-01", "2020-06-15", "2020-09-01"
  )),
  period_end = as.Date(c(
    # ID 1 address history
    "2020-04-14", "2020-06-30", "2020-09-30", "2020-12-31",
    # ID 2 address history
    "2020-02-29", "2020-08-14", "2020-10-31", "2020-12-31",
    # ID 3 address history
    "2020-04-30", "2020-06-14", "2020-08-31", "2020-12-31"
  )),
  # Different Lden (noise) exposure at each address
  lden = c(
    62.3, 55.7, 58.2, 64.1,  # ID 1 exposures
    60.5, 54.2, 66.8, 61.3,  # ID 2 exposures
    53.7, 56.9, 59.5, 65.2   # ID 3 exposures
  )
)

# Show the address history data
address_data %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_start | period_end | lden |
|----|--------------|------------|------|
| 1 | 2020-01-01 | 2020-04-14 | 62.3 |
| 1 | 2020-04-15 | 2020-06-30 | 55.7 |
| 1 | 2020-07-01 | 2020-09-30 | 58.2 |
| 1 | 2020-10-01 | 2020-12-31 | 64.1 |
| 2 | 2020-01-01 | 2020-02-29 | 60.5 |
| 2 | 2020-03-01 | 2020-08-14 | 54.2 |
| 2 | 2020-08-15 | 2020-10-31 | 66.8 |
| 2 | 2020-11-01 | 2020-12-31 | 61.3 |
| 3 | 2020-01-01 | 2020-04-30 | 53.7 |
| 3 | 2020-05-01 | 2020-06-14 | 56.9 |
| 3 | 2020-06-15 | 2020-08-31 | 59.5 |
| 3 | 2020-09-01 | 2020-12-31 | 65.2 |
Visualize Address History Timeline
Code
# Create a timeline plot of address history
address_data %>%
  mutate(
    duration = as.numeric(period_end - period_start) + 1,
    address_number = rep(1:4, 3)  # Add address number for each person
  ) %>%
  ggplot(aes(x = period_start, xend = period_end, y = factor(id), yend = factor(id),
             color = lden, size = lden)) +
  geom_segment() +
  geom_point(aes(x = period_start), size = 3) +
  geom_point(aes(x = period_end), size = 3) +
  geom_text(aes(label = sprintf("%.1f", lden),
                x = period_start + (period_end - period_start) / 2),
            vjust = -1, size = 3) +
  scale_color_viridis_c(option = "plasma") +
  labs(
    title = "Address History Timeline with Noise Exposure",
    x = "Date",
    y = "Individual ID",
    color = "Noise level (dB)",
    size = "Noise level (dB)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Aggregate to Quarterly Periods
Code
# Aggregate to quarterly periods
quarterly_data <- create_period_exposures(
  data = address_data,
  id_var = "id",
  in_date_var = "period_start",   # For period data, use in_date_var and out_date_var
  out_date_var = "period_end",
  exposure_var = "lden",
  period = "quarter",
  energy = TRUE,                  # Energy-based averaging for noise values
  weight_by_days = TRUE,          # Weight by duration at each address
  result_variable = "quarterly_lden",
  debug = FALSE
)

# Show the quarterly aggregated data
quarterly_data %>%
  select(id, period_id, period_start, period_end, quarterly_lden,
         days_with_data, days_in_period, method_description) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_id | period_start | period_end | quarterly_lden | days_with_data | days_in_period | method_description |
|----|-----------|--------------|------------|----------------|----------------|----------------|----------------------------------|
| 1 | 2020Q1 | 2020-01-01 | 2020-03-31 | 62.30000 | 91 | 91 | Energy-based mean (day-weighted) |
| 1 | 2020Q2 | 2020-04-01 | 2020-06-30 | 55.70000 | 77 | 91 | Energy-based mean (day-weighted) |
| 1 | 2020Q3 | 2020-07-01 | 2020-09-30 | 58.20000 | 92 | 92 | Energy-based mean (day-weighted) |
| 1 | 2020Q4 | 2020-10-01 | 2020-12-31 | 64.10000 | 92 | 92 | Energy-based mean (day-weighted) |
| 2 | 2020Q1 | 2020-01-01 | 2020-03-31 | 59.18761 | 91 | 91 | Energy-based mean (day-weighted) |
| 2 | 2020Q3 | 2020-07-01 | 2020-09-30 | 66.80000 | 47 | 92 | Energy-based mean (day-weighted) |
| 2 | 2020Q4 | 2020-10-01 | 2020-12-31 | 61.30000 | 61 | 92 | Energy-based mean (day-weighted) |
| 3 | 2020Q1 | 2020-01-01 | 2020-03-31 | 53.70000 | 91 | 91 | Energy-based mean (day-weighted) |
| 3 | 2020Q2 | 2020-04-01 | 2020-06-30 | 57.74578 | 61 | 91 | Energy-based mean (day-weighted) |
| 3 | 2020Q3 | 2020-07-01 | 2020-09-30 | 65.20000 | 30 | 92 | Energy-based mean (day-weighted) |
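As a sanity check, the 2020Q1 value for ID 2 can be reproduced by hand. That quarter combines 60 days at the first address (60.5 dB, 2020-01-01 to 2020-02-29) and 31 days at the second address (54.2 dB, from 2020-03-01), averaged on the energy scale with day weights:
Code
# Manual day-weighted energy-based mean for ID 2, Q1 2020
10 * log10((60 * 10^(60.5 / 10) + 31 * 10^(54.2 / 10)) / 91)
# ~ 59.18761, matching the table above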
Visualize Quarterly Aggregates
Code
# Plot quarterly noise exposures
ggplot(quarterly_data, aes(x = period_start, y = quarterly_lden, color = factor(id))) +
  geom_point(size = 3) +
  geom_line() +
  labs(
    title = "Quarterly Aggregated Noise Exposure",
    subtitle = "Day-weighted energy-based averages from address history",
    x = "Quarter",
    y = "Noise Level (dB)",
    color = "ID"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Advanced Features
Handling Logarithmic Values (Energy-Based Averaging)
For exposures measured on logarithmic scales (like decibel-based noise measurements), arithmetic averaging is incorrect. The function provides energy-based averaging options.
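Energy-based averaging converts decibel values to the linear energy scale, averages, and converts back. Given exposure values \(L_i\) (in dB) observed for \(d_i\) days each, the day-weighted energy-based mean is

\[
\bar{L} = 10 \log_{10}\!\left( \frac{\sum_i d_i \, 10^{L_i/10}}{\sum_i d_i} \right),
\]

which reduces to \(10 \log_{10}\big(\tfrac{1}{n}\sum_i 10^{L_i/10}\big)\) when all weights are equal. This is the quantity reported as "Energy-based mean (day-weighted)" in the output above and computed explicitly in the comparison below.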
Comparing Arithmetic vs. Energy-Based Averaging for Noise
Code
# Generate sample monthly noise data with large variations across the year
noise_data <- data.frame(
  id = rep(1, 12),
  month = 1:12,
  period_start = seq(as.Date("2020-01-01"), by = "month", length.out = 12),
  # Create a realistic yearly pattern with one very loud month (summer festival in July)
  lden = c(60, 59, 61, 62, 63, 64, 75, 65, 63, 62, 60, 61)
)

# Calculate arithmetic and energy-based means
arithmetic_mean <- mean(noise_data$lden)
energy_mean <- 10 * log10(mean(10^(noise_data$lden / 10)))

# Show the results
results_df <- data.frame(
  Method = c("Arithmetic Mean", "Energy-Based Mean"),
  Value = c(arithmetic_mean, energy_mean)
)

results_df %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Method | Value |
|-------------------|----------|
| Arithmetic Mean | 62.91667 |
| Energy-Based Mean | 66.17781 |
Code
# Plot to show the difference
noise_data %>%
  mutate(month_name = factor(month.name[month], levels = month.name)) %>%
  ggplot(aes(x = month_name, y = lden, group = 1)) +
  geom_point(size = 3) +
  geom_line() +
  geom_hline(yintercept = arithmetic_mean, color = "red", linetype = "dashed") +
  geom_hline(yintercept = energy_mean, color = "blue", linetype = "dashed") +
  annotate("text", x = 11.5, y = arithmetic_mean + 0.5, label = "Arithmetic Mean", hjust = 1, color = "red") +
  annotate("text", x = 11.5, y = energy_mean - 0.5, label = "Energy-Based Mean", hjust = 1, color = "blue") +
  labs(
    title = "Effect of Energy-Based Averaging for Annual Noise Levels",
    subtitle = "Note how energy-based averaging gives more weight to the loud July values",
    x = "Month",
    y = "Noise Level (dB)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Code
# Add explanation textcat("The example above shows monthly noise levels throughout a year, with a particularly loud month in July (75 dB).\n\n")
The example above shows monthly noise levels throughout a year, with a particularly loud month in July (75 dB).
Code
cat("The arithmetic mean (", round(arithmetic_mean, 1), " dB) underestimates the actual perceived average noise level.\n")
The arithmetic mean ( 62.9 dB) underestimates the actual perceived average noise level.
Code
cat("The energy-based mean (", round(energy_mean, 1), " dB) properly accounts for the logarithmic nature of the decibel scale,\n")
The energy-based mean ( 66.2 dB) properly accounts for the logarithmic nature of the decibel scale,
Code
cat("giving more weight to louder periods as they contribute disproportionately to the overall sound energy.\n\n")
giving more weight to louder periods as they contribute disproportionately to the overall sound energy.
Code
cat("This difference is crucial in epidemiological studies of noise exposure, where using arithmetic\n")
This difference is crucial in epidemiological studies of noise exposure, where using arithmetic
Code
cat("means would systematically underestimate the true exposure and potentially lead to exposure misclassification.\n")
means would systematically underestimate the true exposure and potentially lead to exposure misclassification.
Time-Weighted Averaging with Different Day Counts
When aggregating period data, weighting by the number of days ensures proper representation of exposures.
Code
# Create sample data with uneven period lengths
uneven_data <- data.frame(
  id = rep(1, 3),
  period_start = as.Date(c("2020-01-01", "2020-01-15", "2020-02-01")),
  period_end = as.Date(c("2020-01-14", "2020-01-31", "2020-02-28")),
  no2 = c(25, 15, 20)  # NO2 values
)

# Show the data
uneven_data %>%
  mutate(days = as.numeric(period_end - period_start) + 1) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_start | period_end | no2 | days |
|----|--------------|------------|-----|------|
| 1 | 2020-01-01 | 2020-01-14 | 25 | 14 |
| 1 | 2020-01-15 | 2020-01-31 | 15 | 17 |
| 1 | 2020-02-01 | 2020-02-28 | 20 | 28 |
Code
# Calculate weighted and unweighted monthly averages
monthly_weighted <- create_period_exposures(
  data = uneven_data,
  in_date_var = "period_start",
  out_date_var = "period_end",
  exposure_var = "no2",
  period = "month",
  energy = FALSE,
  weight_by_days = TRUE,
  result_variable = "monthly_no2_weighted"
)

monthly_unweighted <- create_period_exposures(
  data = uneven_data,
  in_date_var = "period_start",
  out_date_var = "period_end",
  exposure_var = "no2",
  period = "month",
  energy = FALSE,
  weight_by_days = FALSE,
  result_variable = "monthly_no2_unweighted"
)

# Compare the results
comparison <- left_join(
  select(monthly_weighted, id, period_id, monthly_no2_weighted),
  select(monthly_unweighted, id, period_id, monthly_no2_unweighted),
  by = c("id", "period_id")
) %>%
  mutate(difference = monthly_no2_weighted - monthly_no2_unweighted)

comparison %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| id | period_id | monthly_no2_weighted | monthly_no2_unweighted | difference |
|----|-----------|----------------------|------------------------|------------|
| 1 | 2020-01 | 19.51613 | 20 | -0.483871 |
| 1 | 2020-02 | 20.00000 | 20 | 0.000000 |
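The January figure can be verified by hand: 14 days at 25 and 17 days at 15 give a day-weighted mean of (14 × 25 + 17 × 15) / 31, versus the simple unweighted mean of 20:
Code
# Manual check of the day-weighted January mean
weighted.mean(c(25, 15), w = c(14, 17))
# 19.51613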
Considerations for Data Quality
The function provides metrics to assess data quality, which can be used to filter results based on completeness.
Code
# Filter monthly data based on 5% missing threshold
complete_months_5pct <- monthly_data %>%
  filter(missing_percentage <= 5) %>%
  select(id, period_id, monthly_pm25, missing_percentage)

incomplete_months_5pct <- monthly_data %>%
  filter(missing_percentage > 5) %>%
  select(id, period_id, monthly_pm25, missing_percentage)

# Compare number of complete vs incomplete months
# cat("Complete months (≤5% missing):", nrow(complete_months_5pct), "\n")
# cat("Incomplete months (>5% missing):", nrow(incomplete_months_5pct), "\n")
The choice of missingness threshold depends on your study requirements and exposure type; different thresholds yield different sets of included and excluded periods. Some considerations:
Stricter thresholds (e.g., 5%): Ensure high data quality but may exclude many periods
Moderate thresholds (e.g., 10-20%): Balance between data quality and sample size
Permissive thresholds (e.g., 30%): Include more periods but with potentially less reliable estimates
For critical research applications, it’s often advisable to perform sensitivity analyses using different thresholds to assess the impact on your findings.
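One way to structure such a sensitivity analysis is to recompute the retained periods and a summary of interest under several thresholds. A minimal sketch using the monthly_data object from Example 1 (dplyr loaded; adapt the summary to your own outcome model):
Code
# Sketch: number of periods retained and mean exposure at several missingness thresholds
thresholds <- c(5, 10, 20, 30)
sapply(thresholds, function(t) {
  kept <- monthly_data %>% filter(missing_percentage <= t)
  c(threshold = t,
    periods_kept = nrow(kept),
    mean_pm25 = mean(kept$monthly_pm25))
})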
Performance Considerations
The function can process large datasets efficiently, but performance may vary based on the number of IDs and time periods.
Code
# Generate a larger dataset
large_daily_data <- data.frame(
  id = rep(1:1000, each = 365),
  date = rep(seq(as.Date("2020-01-01"), by = "day", length.out = 365), 1000),
  exposure = rnorm(1000 * 365, 50, 10)
)

# Time the execution
system.time({
  large_monthly_data <- create_period_exposures(
    data = large_daily_data,
    date_var = "date",
    exposure_var = "exposure",
    period = "month"
  )
})
For very large datasets, consider using parallel processing:
Code
library(parallel)

# Split data by ID
ids <- unique(large_daily_data$id)
id_chunks <- split(ids, ceiling(seq_along(ids) / 100))  # Process 100 IDs at a time

# Function to process a chunk of IDs
process_chunk <- function(ids) {
  chunk_data <- large_daily_data[large_daily_data$id %in% ids, ]
  create_period_exposures(
    data = chunk_data,
    date_var = "date",
    exposure_var = "exposure",
    period = "month"
  )
}

# Process chunks in parallel
results_list <- mclapply(id_chunks, process_chunk, mc.cores = detectCores() - 1)

# Combine results
combined_results <- do.call(rbind, results_list)
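Note that mclapply() relies on process forking, which is not available on Windows (where mc.cores must be 1); parLapply() with a PSOCK cluster from the same parallel package is a portable alternative.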
Best Practices and Recommendations
General Guidelines
Choose appropriate time periods based on your research question and the exposure dynamics
Use energy-based averaging for logarithmic values (noise, decibels)
Weight by days when using period data with varying durations
Filter by data quality metrics for more reliable analyses
Validate aggregated results against raw data when possible
Common Issues and Solutions
| Issue | Solution |
|-------|----------|
| Missing data | Use missing_percentage to filter periods with insufficient data |
| Mixed data types | Convert point data to period data by duplicating the date (see the sketch below) |
| Incorrect averaging | Ensure energy = TRUE for decibel values |
| Memory limitations | Process the data in batches by ID |
| Performance concerns | Consider parallel processing for very large datasets |
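For the "mixed data types" case, a point measurement can be recast as a one-day period by using its date as both the period start and end. A minimal sketch, assuming a hypothetical data frame point_data with id, date, and pm25 columns:
Code
# Hypothetical conversion of point-in-time rows to one-day periods
period_data <- point_data %>%
  mutate(period_start = date, period_end = date) %>%
  select(id, period_start, period_end, pm25)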
Conclusion
The create_period_exposures() function provides a flexible and powerful tool for aggregating exposure data in epidemiological research. By properly accounting for time periods, data quality, and appropriate averaging methods, researchers can generate more reliable exposure metrics for health outcome analysis.
When publishing research using this function, we recommend documenting:
The chosen time periods and their relevance to health outcomes
Averaging methods (energy-based vs. arithmetic)
Data completeness thresholds applied
Any pre-processing steps applied to the raw exposure data
Some considerations:* **Stricter thresholds (e.g., 5%)**: Ensure high data quality but may exclude many periods* **Moderate thresholds (e.g., 10-20%)**: Balance between data quality and sample size* **Permissive thresholds (e.g., 30%)**: Include more periods but with potentially less reliable estimatesFor critical research applications, it's often advisable to perform sensitivity analyses using different thresholds to assess the impact on your findings.# Performance ConsiderationsThe function can process large datasets efficiently, but performance may vary based on the number of IDs and time periods.```{r}#| label: performance#| eval: false# Generate a larger datasetlarge_daily_data <-data.frame(id =rep(1:1000, each =365),date =rep(seq(as.Date("2020-01-01"), by ="day", length.out =365), 1000),exposure =rnorm(1000*365, 50, 10))# Time the executionsystem.time({ large_monthly_data <-create_period_exposures(data = large_daily_data,date_var ="date",exposure_var ="exposure",period ="month" )})```For very large datasets, consider using parallel processing:```{r}#| label: parallel-processing#| eval: falselibrary(parallel)# Split data by IDids <-unique(large_daily_data$id)id_chunks <-split(ids, ceiling(seq_along(ids) /100)) # Process 100 IDs at a time# Function to process a chunk of IDsprocess_chunk <-function(ids) { chunk_data <- large_daily_data[large_daily_data$id %in% ids, ]create_period_exposures(data = chunk_data,date_var ="date",exposure_var ="exposure",period ="month" )}# Process chunks in parallelresults_list <-mclapply(id_chunks, process_chunk, mc.cores =detectCores() -1)# Combine resultscombined_results <-do.call(rbind, results_list)```# Best Practices and Recommendations## General Guidelines1. **Choose appropriate time periods** based on your research question and the exposure dynamics2. **Use energy-based averaging** for logarithmic values (noise, decibels)3. **Weight by days** when using period data with varying durations4. **Filter by data quality** metrics for more reliable analyses5. **Validate aggregated results** against raw data when possible## Common Issues and Solutions| Issue | Solution ||-------|----------|| Missing data | Use missing_percentage to filter periods with insufficient data || Mixed data types | Convert point data to period data by duplicating the date || Incorrect averaging | Ensure energy=TRUE for decibel values || Memory limitations | Process the data in batches by ID || Performance concerns | Consider parallel processing for very large datasets |# ConclusionThe `create_period_exposures()` function provides a flexible and powerful tool for aggregating exposure data in epidemiological research. By properly accounting for time periods, data quality, and appropriate averaging methods, researchers can generate more reliable exposure metrics for health outcome analysis.When publishing research using this function, we recommend documenting:1. The chosen time periods and their relevance to health outcomes2. Averaging methods (energy-based vs. arithmetic)3. Data completeness thresholds applied4. Any pre-processing steps applied to the raw exposure data# Session Information```{r}#| label: session-infosessionInfo()```