A raw time-series dataset containing daily data on mortality levels, temperature, and air pollution was provided to answer the following questions in order to assess the short-term impact of PM2.5 on daily mortality.
# Data loading
data <- import(here("PhD_exercise.rda"))
str(data)
## Classes 'data.table' and 'data.frame': 3652 obs. of 6 variables:
## $ city : chr "Roma" "Roma" "Roma" "Roma" ...
## $ date : Date, format: "2006-01-01" "2006-01-02" ...
## $ doy : int 1 2 3 4 5 6 7 8 9 10 ...
## $ nat : num 64 65 73 70 59 59 78 72 81 66 ...
## $ pm25 : num 21.9 22 23.1 24.7 34.3 ...
## $ tmean: num 10.4 6.65 6 5.95 5.75 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(data)
## city date doy nat
## Length:3652 Min. :2006-01-01 Min. : 1.0 Min. : 35.00
## Class :character 1st Qu.:2008-07-01 1st Qu.: 92.0 1st Qu.: 64.00
## Mode :character Median :2010-12-31 Median :183.0 Median : 71.00
## Mean :2010-12-31 Mean :183.1 Mean : 73.01
## 3rd Qu.:2013-07-01 3rd Qu.:274.0 3rd Qu.: 79.00
## Max. :2015-12-31 Max. :366.0 Max. :999.00
## pm25 tmean
## Min. : 4.553 Min. :-1.00
## 1st Qu.: 13.541 1st Qu.:10.40
## Median : 17.283 Median :15.60
## Mean : 25.420 Mean :16.02
## 3rd Qu.: 23.716 3rd Qu.:21.90
## Max. :999.000 Max. :33.00
# is.na(data) # data doesnot have missing value
This time-series dataset comprises 3652 observations spanning from January 1, 2006, to December 31, 2015, encompassing the daily counts of mortality, as well as the corresponding daily measurements of temperature and air pollution concentrations within the city of Rome. The dataset includes 6 variables:
city: Rome
date: date from 01/01/2006 to 31/12/2015
doy - Day of the year: Days in the year are numbered from 1 to 366.
nat - the daily number of deaths: The daily number of deaths ranged from 35 to 999. However, 999 might be considered an outlier since the mean, median, and 3rd quartile values of this variable are all less than 100. We will further investigate this in the next section.
PM2.5 - the concentration of PM2.5 (particulate matter with a diameter of 2.5 micrometers: The daily concentration of PM2.5 ranged from 4.553 to 999. However, 999 might be considered an outlier since the mean, median, and 3rd quartile values of this variable are all less than 25. We will further investigate this in the next section.
tmean - the daily mean temperature: The daily mean ambient temperature ranged from -1 ℃ to 33 ℃.
Descriptive analyses on daily levels of air pollution and natural deaths. Are there any data point (outliers) that differs significantly from other observations? Are they reasonable? If not, what will you do with those observation? Please justify and apply it to your data.
First, we can gain a better understanding of any patterns or correlations that may emerge by visualizing daily levels of air pollution and natural deaths over time as the figure below:
# SET THE PLOTTING PARAMETERS FOR THE PLOT (SEE ?par)
oldpar <- par(no.readonly=TRUE)
par(mex=0.8,mfrow=c(2,1)) #ve 2 bieu do tren 2 hang
# SUB-PLOT FOR DAILY DEATHS, WITH VERTICAL LINES DEFINING YEARS
plot(data$date,data$nat,pch=".",main="Daily deaths over time",
ylab="Daily number of deaths",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)
# THE SAME FOR OZONE LEVELS
plot(data$date,data$pm25,pch=".",main="PM2.5 levels over time",
ylab="Daily mean PM2.5 level (ug/m3)",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)
par(oldpar)
layout(1)
The figure above illustrates the daily levels of air pollution and natural deaths. However, the presence of some 999 values as outliers may hinder a clear representation of the data’s trends and patterns. We need to examine the frequency of occurrences of the value 999 in this dataset.
nat_999 <- table(data$nat == 999)
pm25_999 <- table(data$pm25 == 999)
# Print the frequency
print(nat_999) #4
##
## FALSE TRUE
## 3648 4
print(pm25_999) #19
##
## FALSE TRUE
## 3633 19
There is a possibility that 999 values may represent missing data. The variables nat and pm25 contain 4 and 19 instances of the value 999, respectively, which is considerably less than 10% of the total number of observations.
In this case, single imputation can be employed by replacing the 999 value with nearby variables. However, a previous study has shown that covariate-stratified effect estimates from the complete-case analysis are unbiased when missingness is very small, independent of the outcome conditional on the exposure and covariates (Rachael K Ross et al.). nHence, we can safely remove these 999 values from the dataset.
data <- data %>% filter(nat != 999) %>% filter(pm25 != 999.000000) # N=3629
It is also important to assess the presence of missing data or occurrences where values are recorded as 999 in other variables as well. Fortunately, the analysis reveals that the remaining variables do not exhibit any missing values or instances of 999 as recorded values.
# is.na(data) # data doesnot have missing value
doy_999 <- table(data$doy == 999)
tmean_999 <- table(data$tmean == 999)
# Print the frequency
print(doy_999) #0
##
## FALSE
## 3629
print(tmean_999) #0
##
## FALSE
## 3629
After removing outliers, we describe again the daily levels of air pollution and natural deaths.
# SET THE PLOTTING PARAMETERS FOR THE PLOT (SEE ?par)
oldpar <- par(no.readonly=TRUE)
par(mex=0.8,mfrow=c(2,1)) #ve 2 bieu do tren 2 hang
# SUB-PLOT FOR DAILY DEATHS, WITH VERTICAL LINES DEFINING YEARS
plot(data$date,data$nat,pch=".",main="Daily deaths over time",
ylab="Daily number of deaths",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)
# THE SAME FOR OZONE LEVELS
plot(data$date,data$pm25,pch=".",main="PM2.5 levels over time",
ylab="Daily mean PM2.5 level (ug/m3)",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)
par(oldpar)
layout(1)
We can enhance the visual representation of data patterns by incorporating the smooth function to generate moving average plots. These plots serve as valuable complements to raw scatter plots as they effectively reduce the noise in the data by calculating the average of neighboring data points within a fixed window size. By smoothing out the raw data, these plots highlight underlying trends and patterns that may be obscured in the original scatter plots.
# smth function
data$nat_smt <- as.factor(data$nat)
data$nat_smt <- SMA(data$nat_smt,n=18)
data$pm25_smt <- as.factor(data$pm25)
data$pm25_smt <- SMA(data$pm25_smt,n=35)
oldpar <- par(no.readonly=TRUE)
par(mex=0.8,mfrow=c(2,1)) #ve 2 bieu do tren 2 hang
# SUB-PLOT FOR DAILY DEATHS, WITH VERTICAL LINES DEFINING YEARS
plot(data$date,data$nat_smt,pch=".",main="Daily deaths over time",
ylab="Daily number of deaths",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)
# THE SAME FOR OZONE LEVELS
plot(data$date,data$pm25_smt,pch=".",main="PM2.5 levels over time",
ylab="Daily mean PM2.5 level (ug/m3)",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)
par(oldpar)
layout(1)
The presented plots reveal a clear association between PM2.5 levels and death counts, both exhibiting prominent annual seasonal patterns. Winter months exhibit higher PM2.5 levels as well as increased mortality, while summer months demonstrate lower levels of PM2.5 and reduced mortality. It is important to note, however, that the observed correlation does not necessarily imply a causal relationship between high PM2.5 levels and increased mortality during winter. It is common for time series data to exhibit systematic patterns over time, leading to correlations that are often not indicative of causal connections.
What are the mean daily deaths?, What is the mean concentration of PM2.5 ?
The mean daily number of death is 72.0 (SD = 11.6). The mean daily concentration of PM2.5 is 20.3 (SD = 10.3).
# Create a table1 object with the 'nat', 'pm25', and 'temp' variables
table1_data <- table1(~nat + pm25 + tmean, data=data)
table1_data
| Overall (N=3629) |
|
|---|---|
| nat | |
| Mean (SD) | 72.0 (11.6) |
| Median [Min, Max] | 71.0 [35.0, 122] |
| pm25 | |
| Mean (SD) | 20.3 (10.3) |
| Median [Min, Max] | 17.3 [4.55, 68.3] |
| tmean | |
| Mean (SD) | 16.0 (6.95) |
| Median [Min, Max] | 15.6 [-1.00, 33.0] |
Are there any seasonal differences of daily number of deaths? How can this be explained?
The decomposition figure below illustrates the presence of seasonal variations in the daily number of deaths. We observed that in Rome, from 2006 to 2015, the winter months consistently exhibited higher levels of mortality in comparison to the summer months. Previous research has indicated that a significant portion of this increase can be attributed to seasonal variations in behavior and the human body, along with heightened exposure to respiratory diseases. For example, winter is typically flu season (Climate Change Indicatorsy). Furthermore, cold temperatures can exacerbate pre-existing medical conditions, including cardiovascular and respiratory diseases. The incidence of heart attacks tends to rise as temperatures drop, potentially due to the impact of cold weather on blood circulation, blood vessels, and other related factors (Gasparrini et al.).
During the summer months, we frequently witness a concerning rise in mortality rates, and one significant contributing factor is the occurrence of heat waves. These prolonged periods of extreme heat not only create oppressive conditions but also pose a substantial risk to human well-being, leading to an increase in mortality rates (Claire Demoury et al.).
Moreover, the decomposition figure presented below depicts the presence of temporal trends in the daily number of deaths. It reveals a positive temporal trend, indicating a consistent pattern of increasing daily deaths over the specified period. This trend can be attributed to the concurrent growth in Rome’s population size, which increased from 2,547,677 in 2006 to 2,872,021 in 2015 (Data Commons).
# Convert the 'nat' column to a time series object
nat_ts <- ts(data$nat_smt, start = 1, frequency = 120)
# Decompose the time series
nat_decomp <- decompose(nat_ts)
plot(nat_decomp)
Are there temporal trends in the levels of PM2.5? How can this be explained?
The decomposition figure below illustrates the presence of temporal trends in the levels of PM2.5. A negative temporal trend was found indicating an decreasing pattern from 2006 to 2015. This might be explained by (environmental regulations). The European Union’s Air Quality Directive, adopted in 1996 and updated in 2008, sets stricter standards and includes more pollutants to improve air quality in EU. The impact of the directive on air quality in the European Union has been observed over the years following its implementation. Progress in improving air quality varies across member states and regions, with some areas experiencing significant improvements including Rome.
Furthermore, the decomposition figure below demonstrates the existence of seasonal variations in PM2.5 levels. Winter months exhibit higher PM2.5 levels than summer months. Winter months are commonly characterized by stable atmospheric conditions, including temperature inversions and diminished wind speeds. These conditions tend to trap pollutants in proximity to the ground, resulting in elevated concentrations of PM2.5 particles. Conversely, summer months typically experience more favorable atmospheric conditions, facilitating the dispersion of pollutants and leading to lower levels of PM2.5 (Robert Cichowicz et al.).
pm25_ts <- ts(data$pm25_smt, start = 1, frequency = 120)
# Decompose the time series
pm25_decomp <- decompose(pm25_ts)
plot(pm25_decomp)
The research question aims to explore the relationship between short-term changes in mortality and daily PM2.5 levels. However, the raw outcome and exposure data is likely influenced by seasonal patterns and long-term trends, necessitating the control of these patterns in the regression model. The reasons for controlling these patterns are as follows:
Confounding: Seasonal patterns and long-term trends can independently affect both the exposure (PM2.5 levels) and the outcome (mortality). By controlling for these patterns, we ensure that the observed associations between exposure and outcome are not confounded by these extraneous factors. This allows for a more accurate interpretation of the relationship, focusing specifically on day-to-day associations.
Precision and Power: Controlling for seasonal patterns and long-term trends improves the precision and statistical power of the regression model. It reduces variability and noise associated with these patterns, enabling better estimation of the short-term associations of interest. Additionally, it enhances comparability across different time periods, ensuring valid comparisons of the short-term associations between exposure and outcome.
In this study, it is important to consider short-term weekly variations as a potential confounder. The table below demonstrates the significant relationship between short-term weekly variations and PM2.5 levels, and the daily number of deaths. It means that short-term weekly variations might be a confounder. To accurately assess the association between PM2.5 and daily mortality, it is crucial to account for these short-term weekly variations in the model. Adjusting for these confounding factors will provide a more comprehensive understanding of the relationship between PM2.5 and daily mortality.
data$dow <- as.factor(weekdays(data$date))
# reorder dow
data$dow <- factor(data$dow, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# selecting variables
database <- data[, .(nat, pm25, dow)]
database %>% # keep only the columns of interest
tbl_summary(by = dow, percent = "row",digits = list(everything() ~ c(2))) %>% # produce summary table and specify grouping variable
add_p()
| Characteristic | Monday, N = 5151 | Tuesday, N = 5211 | Wednesday, N = 5201 | Thursday, N = 5171 | Friday, N = 5191 | Saturday, N = 5171 | Sunday, N = 5201 | p-value2 |
|---|---|---|---|---|---|---|---|---|
| nat | 72.00 (64.50, 81.00) | 72.00 (64.00, 81.00) | 72.00 (65.00, 81.00) | 72.00 (64.00, 80.00) | 71.00 (64.00, 79.00) | 71.00 (64.00, 78.00) | 69.00 (62.00, 76.00) | <0.001 |
| pm25 | 16.86 (13.25, 22.88) | 17.12 (13.58, 23.55) | 17.52 (13.31, 23.75) | 17.86 (13.83, 24.79) | 17.86 (14.00, 24.00) | 17.41 (13.73, 23.43) | 16.31 (12.93, 22.54) | 0.047 |
| 1 Median (IQR) | ||||||||
| 2 Kruskal-Wallis rank sum test | ||||||||
Significant negative correlations between daily mean temperatures and PM2.5 and daily number of death were also found (Figure below). Therefore, it is important to adjust for daily mean temperatures in the regression model because it is associated with both PM2.5 levels and day-to-day changes in mortality. By including daily mean temperatures as a confounder, we aim to eliminate or reduce its influence on the estimated association between exposure and outcome. This adjustment helps isolate the direct effect of PM2.5 levels on mortality and allows for a more accurate assessment of its true impact.
# correlation analysis
# library("PerformanceAnalytics")
my_data <- data[, c(4,5,6)]
chart.Correlation(my_data, histogram=TRUE, pch=19)
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
It is worth noting that while common confounders in epidemiology include factors such as age, sex, body mass index, smoking status, and drinking, these “standard confounders” may not be applicable to the specific data under consideration. Since their distribution is unlikely to vary on a day-to-day basis and they are not associated with short-term fluctuations in environmental exposures like pollution levels, they may not confound the observed associations in this context.
A quasi-Poisson generalized additive model was applied. This model should be adjusted for
A penalized spline smooth function of calendar day with seven degrees of freedom (df) per year to account for underlying long-term and seasonal time-trends,
Day-of-week to account for weekly variations
Natural spline function with 4 df for daily mean air temperature.
Seasonal time-trends: Cyclic splines
Although Dr. Jeroen specifically requested a penalized spline smooth function of the calendar day with seven degrees of freedom (df) per year to capture long-term and seasonal time-trends, Dr. Gavin Simpson suggests considering the adjustment of seasonal time-trends separately in time series analysis. In light of this recommendation, I propose to incorporate a seasonal time-trends variable into the model and assess its impact on model performance.
Therefore, the model of Mortality - PM2.5 association can be written as below:
\[ \begin{split} Model=GAM (death∼ β0 + β1∗PM2.5 + s(time, bs = "ps",df=7∗10) + ns(tmean,df=4) +s (doy, bs ="cc")+ β2∗DOW) \end{split} \]
Therefore, we we need to mutate these variables.
Long-term patterns
Seasonality
Weekly variations
The variable representing long-term patterns was transformed from a date object into a numeric format (integer) and subsequently divided by 1000. This conversion allows for the indication of the relative position of observations in time and contributes to achieving stability in model fitting. The seasonal time variable, on the other hand, was represented by the day of the year (Gavin Simpson).
data$dow <- as.factor(weekdays(data$date))
# reorder dow
data$dow <- factor(data$dow, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
#time
data$time <- as.numeric(data$date)/1000
# to obtain the RR per 10g/m3 of PM2.5, pm25_10 was created:
data$pm25_10 <- data$pm25/10 # create pm2.5*10
To assess the long-term patterns, we initially employed a GAM model excluding PM2.5. By employing this approach, we aimed to determine if these specifications adequately accounted for the long-term and seasonal time-trends in our analysis.
# PLOT RESPONSE RESIDUALS OVER TIME
# Model without pm25
#####################################
model0 <- gam(nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) + s(doy, bs = "cc"),
data = data,
family = quasipoisson,method = "REML")
# GENERATE RESIDUALS
res3 <- residuals(model0,type="response")
############
# FIGURE residual variation
############
plot(data$date,res3,ylim=c(-50,150),pch=19,cex=0.4,col=grey(0.6),
main="Residuals over time",ylab="Residuals (observed-fitted)",xlab="Date")
abline(h=1,lty=2,lwd=2)
The figure above illustrates the residual variation in daily deaths after accounting for the seasonal and long-term trends through modeling. It indicates our successful control over the seasonal and long-term effects, as the residual variation predominantly clusters around zero. By incorporating PM2.5 into this model, we can now address our primary objective, which is to examine whether the remaining short-term variation, in relation to the long-term pattern, is partially attributable to PM2.5.
data$pm25_10 <- data$pm25/10 # create pm2.5*10
model1 <- gam(nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) + s(doy, bs = "cc") + pm25_10,
data = data,
family = quasipoisson,method = "REML")
summary(model1)
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) +
## s(doy, bs = "cc") + pm25_10
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.218715 0.022765 185.319 < 2e-16 ***
## factor(dow)Tuesday -0.003039 0.007825 -0.388 0.69771
## factor(dow)Wednesday -0.006707 0.007835 -0.856 0.39203
## factor(dow)Thursday -0.011441 0.007860 -1.455 0.14562
## factor(dow)Friday -0.014655 0.007857 -1.865 0.06224 .
## factor(dow)Saturday -0.023236 0.007881 -2.949 0.00321 **
## factor(dow)Sunday -0.049753 0.007921 -6.281 3.76e-10 ***
## ns(tmean, 4)1 0.038372 0.020885 1.837 0.06625 .
## ns(tmean, 4)2 0.035816 0.025436 1.408 0.15919
## ns(tmean, 4)3 0.234245 0.049453 4.737 2.26e-06 ***
## ns(tmean, 4)4 0.331764 0.029549 11.227 < 2e-16 ***
## pm25_10 0.007841 0.002609 3.005 0.00267 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 16.241 19.71 22.84 <2e-16 ***
## s(doy) 7.495 8.00 55.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.377 Deviance explained = 37.9%
## -REML = 2173.8 Scale est. = 1.1565 n = 3629
gam.check(model1)
##
## Method: REML Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-1.095943e-05,4.860971e-06]
## (score 2173.803 & scale 1.156452).
## Hessian positive definite, eigenvalue range [2.209941,1808.04].
## Model rank = 89 / 89
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(time) 69.00 16.24 0.90 <2e-16 ***
## s(doy) 8.00 7.49 0.99 0.26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Running gam.check() on a model provides several outputs, in both the console and as plots. We’ll start with the console output.
It reports full convergence after 5 iterations. R has found a best solution.
Below, we see a table of basis checking results. This shows a statistical test for patterns in model residuals, which should be random. Each line reports the test results for one smooth. It shows the k value or number of basis functions, the effective degrees of freedom, a test statistic, and p-value. Here, small p-values indicate that residuals are not randomly distributed. This often means there are not enough basis functions. However, according to Gasparrini et al., although the optimal number of knots is a topic of ongoing debate and there is no consensus, a commonly supported approach suggests that using approximately 7 knots per year strikes a balance. This choice aims to provide effective control for seasonality and mitigate other potential confounding factors caused by temporal trends, while still retaining enough information to estimate exposure effects accurately. Therefore, we will keep 70 basis functions in this model for the time variable.
Some basic model checking plots for model final fitted to the data. The upper left normal QQ-plot is very close to a straight line, suggesting that the distributional assumption is reasonable. The upper right plot suggests that variance is approximately constant as the mean increases. The histogram of residuals at lower left appears approximately consistent with normality. The lower right plot, of response against fitted values, shows a positive linear relationship with a good deal of scatter: nothing problematic.
eff <- ci.lin(model1,subset="pm25_10",Exp=T)
tabeff <- rbind(eff)[,5:7]
tabeff
## exp(Est.) 2.5% 97.5%
## 1.007872 1.002731 1.013039
After adding adjustment for season and long-term trend to the model, each 10 µg/m3 increase in PM2.5 levels is associated with a mortality risk ratio of 1.008 (95% CI 1.002 to 1.013, p < 0.001), suggesting that in the short term, higher PM2.5 levels is associated with higher mortality risk. While small effect sizes are common in environmental epidemiology, it is important to recognize their significance when entire populations are exposed.
The study’s findings underscore the importance of implementing solutions to minimize PM2.5 exposure and effectively mitigate pollution, thus safeguarding the well-being of the public in Rome. Given the significant contribution of traffic emissions to PM2.5 pollution, potential solutions such as promoting electric vehicles, improving public transportation, implementing integrated urban planning strategies as well as other possible solutions targeting different sources of PM2.5 should be considered in Rome city.
We can also utilize a smoothing function for PM2.5 to assess whether the relationship between mortality and PM2.5 exhibits non-linearity. By applying smoothing techniques, we can gain insights into the potential nonlinear nature of the association between these variables.
model2 <- gam(nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) + s(doy, bs = "cc") + s(pm25_10),
data = data,
family = quasipoisson)
summary(model2)
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) +
## s(doy, bs = "cc") + s(pm25_10)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.218821 0.022897 184.255 < 2e-16 ***
## factor(dow)Tuesday -0.003031 0.007714 -0.393 0.69438
## factor(dow)Wednesday -0.006764 0.007725 -0.876 0.38133
## factor(dow)Thursday -0.011715 0.007752 -1.511 0.13081
## factor(dow)Friday -0.014864 0.007750 -1.918 0.05521 .
## factor(dow)Saturday -0.023258 0.007771 -2.993 0.00278 **
## factor(dow)Sunday -0.049490 0.007809 -6.337 2.63e-10 ***
## ns(tmean, 4)1 0.060989 0.021342 2.858 0.00429 **
## ns(tmean, 4)2 0.059062 0.026202 2.254 0.02425 *
## ns(tmean, 4)3 0.252140 0.050793 4.964 7.23e-07 ***
## ns(tmean, 4)4 0.338637 0.031374 10.794 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 57.921 62.377 9.662 < 2e-16 ***
## s(doy) 5.733 8.000 4.385 < 2e-16 ***
## s(pm25_10) 1.737 2.201 4.677 0.00788 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.397 Deviance explained = 40.4%
## GCV = 1.1538 Scale est. = 1.1238 n = 3629
plot(model2, select =3)
In the analysis of smooth terms, the first column displays the effective degrees of freedom (edf), which serves as an indicator of smooth complexity. The effective degrees of freedom (edf) for PM2.5 range from 1 to 2, implying that the relationship between mortality and PM2.5 exhibits nearly linearity. This result reaffirms our previous conclusion regarding the linear association between these variables and provides further support for the insights gained from the analysis of the linear relationship.
Our current modeling approach has focused on examining the relationship between mortality and PM2.5 levels on the same day. However, it is important to consider the possibility of a delayed or ‘lagged’ association between exposure and outcome. It is conceivable that the PM2.5 level from the previous day might be a more influential predictor of today’s mortality risk than the PM2.5 level observed on the same day.
A distributed lag model with 0-7 lags was applied to explore whether there is any delayed association.
# PRODUCE THE CROSS-BASIS FOR PM2.5 (SCALING NOT NEEDED)
# A SIMPLE UNSTRANSFORMED LINEAR TERM AND THE UNCONSTRAINED LAG STRUCTURE
cbpm25unc <- crossbasis(data$pm25,lag=c(0,7),argvar=list(fun="lin"),
arglag=list(fun="integer"))
summary(cbpm25unc)
## CROSSBASIS FUNCTIONS
## observations: 3629
## range: 4.552505 to 68.26521
## lag period: 0 7
## total df: 8
##
## BASIS FOR VAR:
## fun: lin
## intercept: FALSE
##
## BASIS FOR LAG:
## fun: integer
## values: 0 1 2 3 4 5 6 ...
## intercept: TRUE
# RUN THE MODEL AND OBTAIN PREDICTIONS FOR OZONE LEVEL 10ug/m3
model1 <- gam(nat ~ s(time, bs = "ps", k = 70) + factor(dow) + s(doy, bs = "cc") + ns(tmean, 4) + cbpm25unc,
data = data,
family = quasipoisson)
summary(model1)
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## nat ~ s(time, bs = "ps", k = 70) + factor(dow) + s(doy, bs = "cc") +
## ns(tmean, 4) + cbpm25unc
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.212e+00 2.398e-02 175.637 < 2e-16 ***
## factor(dow)Tuesday -2.097e-03 7.729e-03 -0.271 0.78619
## factor(dow)Wednesday -6.211e-03 7.742e-03 -0.802 0.42251
## factor(dow)Thursday -1.047e-02 7.772e-03 -1.348 0.17782
## factor(dow)Friday -1.423e-02 7.769e-03 -1.831 0.06715 .
## factor(dow)Saturday -2.359e-02 7.791e-03 -3.028 0.00248 **
## factor(dow)Sunday -5.010e-02 7.824e-03 -6.404 1.71e-10 ***
## ns(tmean, 4)1 4.221e-02 2.222e-02 1.900 0.05753 .
## ns(tmean, 4)2 4.787e-02 2.634e-02 1.818 0.06918 .
## ns(tmean, 4)3 2.237e-01 5.196e-02 4.305 1.71e-05 ***
## ns(tmean, 4)4 3.325e-01 3.075e-02 10.814 < 2e-16 ***
## cbpm25uncv1.l1 -3.262e-05 3.860e-04 -0.085 0.93264
## cbpm25uncv1.l2 8.820e-04 5.025e-04 1.755 0.07935 .
## cbpm25uncv1.l3 4.154e-04 5.074e-04 0.819 0.41302
## cbpm25uncv1.l4 -7.250e-05 5.087e-04 -0.143 0.88667
## cbpm25uncv1.l5 4.653e-04 5.086e-04 0.915 0.36036
## cbpm25uncv1.l6 -7.539e-04 5.073e-04 -1.486 0.13733
## cbpm25uncv1.l7 6.257e-04 4.954e-04 1.263 0.20667
## cbpm25uncv1.l8 -4.153e-04 3.757e-04 -1.105 0.26914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 56.868 61.69 9.360 <2e-16 ***
## s(doy) 5.772 8.00 4.583 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.398 Deviance explained = 40.6%
## GCV = 1.1549 Scale est. = 1.123 n = 3622
pred1 <- crosspred(cbpm25unc,model1,at=10)
# ESTIMATED EFFECTS AT EACH LAG
tablag2 <- with(pred1,t(rbind(matRRfit,matRRlow,matRRhigh)))
colnames(tablag2) <- c("RR","ci.low","ci.hi")
tablag2
## RR ci.low ci.hi
## lag0 0.9996738 0.9921402 1.007265
## lag1 1.0088585 0.9989704 1.018844
## lag2 1.0041627 0.9942258 1.014199
## lag3 0.9992753 0.9893623 1.009288
## lag4 1.0046636 0.9946983 1.014729
## lag5 0.9924895 0.9826708 1.002406
## lag6 1.0062768 0.9965532 1.016095
## lag7 0.9958558 0.9885488 1.003217
#############
# FIGURE LAG effect
#############
plot(pred1,var=10,type="p",ci="bars",col=1,pch=19,ylim=c(0.99,1.03),
main="All lag terms modelled together (unconstrained)",xlab="Lag (days)",
ylab="RR and 95%CI per 10ug/m3 PM2.5 increase")
We found that the effect of each 10 µg/m3 increase in PM2.5 levels on risk of morality was higher at day 1, 2, 4 and 6 before the event days, but all of them were not significant.
The cumulative effect of an exposure over several lag days can be calculated from a distributed lag model as the sum of the coefficients. Based on the results below, the cumulative 7 days RR for mortality was significantly larger than 1, (RR= 1.01, 95%CI: 1.002 - 1.02).
# OVERALL CUMULATIVE (NET) EFFECT
pred1$allRRfit ; pred1$allRRlow ; pred1$allRRhigh
## 10
## 1.011203
## 10
## 1.002689
## 10
## 1.019789
In our current model, we have included daily mean ambient temperature as a confounder. However, it is important to recognize that daily mean ambient temperature could potentially act as an effect modifier in the relationship between mortality and PM2.5. Consequently, in this section, we will conduct an effect modification analysis to investigate this possibility and examine how the relationship between mortality and PM2.5 may vary across different levels of daily mean ambient temperature.
Firstly, I divided the data into two separate datasets based on the median temperature. This division allows us to create two distinct groups: one comprising data points with temperatures below the median, and the other with data points having temperatures above or equal to the median. By doing so, we can explore potential variations in the relationship between mortality and PM2.5 across these two temperature groups and evaluate the effect modification of daily mean ambient temperature on this association.
hist(data$tmean) #non-normal distribution --> Using cut-off Median
summary(data$tmean) #Median: 15.60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00 10.40 15.60 16.02 21.90 33.00
data <- data %>%
mutate(
# Create age level variable
temp_level = case_when(
pm25 <=15.06 ~ 1,
pm25 > 15.06 ~ 2),
temp_level = as.factor(temp_level)
)
table(data$temp_level) # 1 (1292) 2 (2337)
##
## 1 2
## 1292 2337
temp_high <- data %>%
filter(temp_level == "2")
temp_low <- data %>%
filter(temp_level == "1")
#Model for high level of temp
model3 <- gam(nat ~ s(time, bs = "ps", k = 70) + s(doy, bs = "cc")
+ factor(dow)
+ pm25_10,
data = temp_high,
family = quasipoisson)
#Model for low level of temp
model4 <- gam(nat ~ s(time, bs = "ps", k = 70) + s(doy, bs = "cc")
+ factor(dow)
+ pm25_10,
data = temp_low,
family = quasipoisson)
#RR of PM2.5
tabeff_low <- rbind(ci.lin(model4, subset = "pm25_10", Exp = TRUE))[, 5:7]
tabeff_high <- rbind(ci.lin(model3, subset = "pm25_10", Exp = TRUE))[, 5:7]
combined_table <- rbind(tabeff_low, tabeff_high)
combined_table
## exp(Est.) 2.5% 97.5%
## tabeff_low 1.073324 1.037064 1.110852
## tabeff_high 1.008747 1.002263 1.015272
Secondly, I employed the current model to analyze each dataset individually in order to determine the relative risks (RRs) of PM2.5 in each model. According to the information below, when considering lower temperature levels, the estimated effect size of PM2.5 on mortality exhibits a larger magnitude, resulting in an average 7.3% increase in the risk of mortality for every 10 µg/m3 rise in PM2.5 levels. In contrast, at higher temperature levels, the estimated effect size is smaller, indicating a 0.9% increase in mortality per 10 µg/m3 increment in PM2.5 levels. These findings suggest that the impact of PM2.5 on mortality may be more pronounced in lower temperature conditions compared to higher temperature conditions.
The findings emphasize the public health concerns associated with both air pollution and climate change Rome. Each 10 µg/m3 rise in PM2.5 levels is linked to a 0.8% [0.3–1.0] increase in mortality risk. In addition, when temperatures are below 15.06 ℃, the impact of PM2.5 on mortality is more pronounced, resulting in an average 7.3% increase in risk for every 10 µg/m3 increase in PM2.5 levels. Conversely, at temperatures above 15.06 ℃, the effect size is smaller, indicating a 0.9% increase in mortality per 10 µg/m3 rise in PM2.5 levels.
Thank you for taking the time to read through the information. If you have any further inquiries, please feel free to reach out to me via email at khanhhuyenhnhuph@gmail.com or my linkedin. I’m more than happy to assist you with any questions you may have.