Practical exercise for the PhD position

A raw time-series dataset containing daily data on mortality levels, temperature, and air pollution was provided to answer the following questions in order to assess the short-term impact of PM_2.5 on daily mortality.

Loading data

# Data loading 
data <- import(here("PhD_exercise.rda"))
str(data)

## Classes 'data.table' and 'data.frame':   3652 obs. of  6 variables:
##  $ city : chr  "Roma" "Roma" "Roma" "Roma" ...
##  $ date : Date, format: "2006-01-01" "2006-01-02" ...
##  $ doy  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ nat  : num  64 65 73 70 59 59 78 72 81 66 ...
##  $ pm25 : num  21.9 22 23.1 24.7 34.3 ...
##  $ tmean: num  10.4 6.65 6 5.95 5.75 ...
##  - attr(*, ".internal.selfref")=<externalptr>

summary(data)

##      city                date                 doy             nat        
##  Length:3652        Min.   :2006-01-01   Min.   :  1.0   Min.   : 35.00  
##  Class :character   1st Qu.:2008-07-01   1st Qu.: 92.0   1st Qu.: 64.00  
##  Mode  :character   Median :2010-12-31   Median :183.0   Median : 71.00  
##                     Mean   :2010-12-31   Mean   :183.1   Mean   : 73.01  
##                     3rd Qu.:2013-07-01   3rd Qu.:274.0   3rd Qu.: 79.00  
##                     Max.   :2015-12-31   Max.   :366.0   Max.   :999.00  
##       pm25             tmean      
##  Min.   :  4.553   Min.   :-1.00  
##  1st Qu.: 13.541   1st Qu.:10.40  
##  Median : 17.283   Median :15.60  
##  Mean   : 25.420   Mean   :16.02  
##  3rd Qu.: 23.716   3rd Qu.:21.90  
##  Max.   :999.000   Max.   :33.00

# is.na(data) # data doesnot have missing value

This time-series dataset comprises 3652 observations spanning from January 1, 2006, to December 31, 2015, encompassing the daily counts of mortality, as well as the corresponding daily measurements of temperature and air pollution concentrations within the city of Rome. The dataset includes 6 variables:

city: Rome
date: date from 01/01/2006 to 31/12/2015
doy - Day of the year: Days in the year are numbered from 1 to 366.
nat - the daily number of deaths: The daily number of deaths ranged from 35 to 999. However, 999 might be considered an outlier since the mean, median, and 3^rd quartile values of this variable are all less than 100. We will further investigate this in the next section.
PM_2.5 - the concentration of PM_2.5 (particulate matter with a diameter of 2.5 micrometers: The daily concentration of PM_2.5 ranged from 4.553 to 999. However, 999 might be considered an outlier since the mean, median, and 3^rd quartile values of this variable are all less than 25. We will further investigate this in the next section.
tmean - the daily mean temperature: The daily mean ambient temperature ranged from -1 ℃ to 33 ℃.

Descriptive analysis

Question 1

Descriptive analyses on daily levels of air pollution and natural deaths. Are there any data point (outliers) that differs significantly from other observations? Are they reasonable? If not, what will you do with those observation? Please justify and apply it to your data.

First, we can gain a better understanding of any patterns or correlations that may emerge by visualizing daily levels of air pollution and natural deaths over time as the figure below:

# SET THE PLOTTING PARAMETERS FOR THE PLOT (SEE ?par)
oldpar <- par(no.readonly=TRUE)
par(mex=0.8,mfrow=c(2,1)) #ve 2 bieu do tren 2 hang

# SUB-PLOT FOR DAILY DEATHS, WITH VERTICAL LINES DEFINING YEARS
plot(data$date,data$nat,pch=".",main="Daily deaths over time",
     ylab="Daily number of deaths",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)

# THE SAME FOR OZONE LEVELS
plot(data$date,data$pm25,pch=".",main="PM2.5 levels over time",
     ylab="Daily mean PM2.5 level (ug/m3)",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)

par(oldpar)
layout(1)

The figure above illustrates the daily levels of air pollution and natural deaths. However, the presence of some 999 values as outliers may hinder a clear representation of the data’s trends and patterns. We need to examine the frequency of occurrences of the value 999 in this dataset.

nat_999 <- table(data$nat == 999)
pm25_999 <- table(data$pm25 == 999)
# Print the frequency
print(nat_999)  #4

## 
## FALSE  TRUE 
##  3648     4

print(pm25_999) #19

## 
## FALSE  TRUE 
##  3633    19

There is a possibility that 999 values may represent missing data. The variables nat and pm25 contain 4 and 19 instances of the value 999, respectively, which is considerably less than 10% of the total number of observations.

In this case, single imputation can be employed by replacing the 999 value with nearby variables. However, a previous study has shown that covariate-stratified effect estimates from the complete-case analysis are unbiased when missingness is very small, independent of the outcome conditional on the exposure and covariates (Rachael K Ross et al.). nHence, we can safely remove these 999 values from the dataset.

data <- data %>% filter(nat != 999) %>%  filter(pm25 != 999.000000)  # N=3629

It is also important to assess the presence of missing data or occurrences where values are recorded as 999 in other variables as well. Fortunately, the analysis reveals that the remaining variables do not exhibit any missing values or instances of 999 as recorded values.

# is.na(data) # data doesnot have missing value
doy_999 <- table(data$doy == 999)
tmean_999 <- table(data$tmean == 999)
# Print the frequency
print(doy_999)  #0

## 
## FALSE 
##  3629

print(tmean_999) #0

## 
## FALSE 
##  3629

After removing outliers, we describe again the daily levels of air pollution and natural deaths.

# SET THE PLOTTING PARAMETERS FOR THE PLOT (SEE ?par)
oldpar <- par(no.readonly=TRUE)
par(mex=0.8,mfrow=c(2,1)) #ve 2 bieu do tren 2 hang

# SUB-PLOT FOR DAILY DEATHS, WITH VERTICAL LINES DEFINING YEARS
plot(data$date,data$nat,pch=".",main="Daily deaths over time",
     ylab="Daily number of deaths",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)

# THE SAME FOR OZONE LEVELS
plot(data$date,data$pm25,pch=".",main="PM2.5 levels over time",
     ylab="Daily mean PM2.5 level (ug/m3)",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)

par(oldpar)
layout(1)

We can enhance the visual representation of data patterns by incorporating the smooth function to generate moving average plots. These plots serve as valuable complements to raw scatter plots as they effectively reduce the noise in the data by calculating the average of neighboring data points within a fixed window size. By smoothing out the raw data, these plots highlight underlying trends and patterns that may be obscured in the original scatter plots.

# smth function 
data$nat_smt <- as.factor(data$nat)
data$nat_smt <- SMA(data$nat_smt,n=18)
data$pm25_smt <- as.factor(data$pm25)
data$pm25_smt <- SMA(data$pm25_smt,n=35)

oldpar <- par(no.readonly=TRUE)
par(mex=0.8,mfrow=c(2,1)) #ve 2 bieu do tren 2 hang

# SUB-PLOT FOR DAILY DEATHS, WITH VERTICAL LINES DEFINING YEARS
plot(data$date,data$nat_smt,pch=".",main="Daily deaths over time",
     ylab="Daily number of deaths",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)

# THE SAME FOR OZONE LEVELS
plot(data$date,data$pm25_smt,pch=".",main="PM2.5 levels over time",
     ylab="Daily mean PM2.5 level (ug/m3)",xlab="Date")
abline(v=data$date[grep("-01-01",data$date)],col=grey(0.6),lty=2)

par(oldpar)
layout(1)

The presented plots reveal a clear association between PM_2.5 levels and death counts, both exhibiting prominent annual seasonal patterns. Winter months exhibit higher PM_2.5 levels as well as increased mortality, while summer months demonstrate lower levels of PM_2.5 and reduced mortality. It is important to note, however, that the observed correlation does not necessarily imply a causal relationship between high PM_2.5 levels and increased mortality during winter. It is common for time series data to exhibit systematic patterns over time, leading to correlations that are often not indicative of causal connections.

Question 2

What are the mean daily deaths?, What is the mean concentration of PM_2.5 ?

The mean daily number of death is 72.0 (SD = 11.6). The mean daily concentration of PM_2.5 is 20.3 (SD = 10.3).

# Create a table1 object with the 'nat', 'pm25', and 'temp' variables
table1_data <- table1(~nat + pm25 + tmean, data=data)
table1_data

	Overall (N=3629)
nat
Mean (SD)	72.0 (11.6)
Median [Min, Max]	71.0 [35.0, 122]
pm25
Mean (SD)	20.3 (10.3)
Median [Min, Max]	17.3 [4.55, 68.3]
tmean
Mean (SD)	16.0 (6.95)
Median [Min, Max]	15.6 [-1.00, 33.0]

Are there any seasonal differences of daily number of deaths? How can this be explained?

The decomposition figure below illustrates the presence of seasonal variations in the daily number of deaths. We observed that in Rome, from 2006 to 2015, the winter months consistently exhibited higher levels of mortality in comparison to the summer months. Previous research has indicated that a significant portion of this increase can be attributed to seasonal variations in behavior and the human body, along with heightened exposure to respiratory diseases. For example, winter is typically flu season (Climate Change Indicatorsy). Furthermore, cold temperatures can exacerbate pre-existing medical conditions, including cardiovascular and respiratory diseases. The incidence of heart attacks tends to rise as temperatures drop, potentially due to the impact of cold weather on blood circulation, blood vessels, and other related factors (Gasparrini et al.).

During the summer months, we frequently witness a concerning rise in mortality rates, and one significant contributing factor is the occurrence of heat waves. These prolonged periods of extreme heat not only create oppressive conditions but also pose a substantial risk to human well-being, leading to an increase in mortality rates (Claire Demoury et al.).

Moreover, the decomposition figure presented below depicts the presence of temporal trends in the daily number of deaths. It reveals a positive temporal trend, indicating a consistent pattern of increasing daily deaths over the specified period. This trend can be attributed to the concurrent growth in Rome’s population size, which increased from 2,547,677 in 2006 to 2,872,021 in 2015 (Data Commons).

# Convert the 'nat' column to a time series object
nat_ts <- ts(data$nat_smt, start = 1, frequency = 120)
# Decompose the time series
nat_decomp <- decompose(nat_ts)
plot(nat_decomp)

Question 3

Are there temporal trends in the levels of PM_2.5? How can this be explained?

The decomposition figure below illustrates the presence of temporal trends in the levels of PM_2.5. A negative temporal trend was found indicating an decreasing pattern from 2006 to 2015. This might be explained by (environmental regulations). The European Union’s Air Quality Directive, adopted in 1996 and updated in 2008, sets stricter standards and includes more pollutants to improve air quality in EU. The impact of the directive on air quality in the European Union has been observed over the years following its implementation. Progress in improving air quality varies across member states and regions, with some areas experiencing significant improvements including Rome.

Furthermore, the decomposition figure below demonstrates the existence of seasonal variations in PM_2.5 levels. Winter months exhibit higher PM_2.5 levels than summer months. Winter months are commonly characterized by stable atmospheric conditions, including temperature inversions and diminished wind speeds. These conditions tend to trap pollutants in proximity to the ground, resulting in elevated concentrations of PM_2.5 particles. Conversely, summer months typically experience more favorable atmospheric conditions, facilitating the dispersion of pollutants and leading to lower levels of PM_2.5 (Robert Cichowicz et al.).

pm25_ts <- ts(data$pm25_smt, start = 1, frequency = 120)
# Decompose the time series
pm25_decomp <- decompose(pm25_ts)
plot(pm25_decomp)

Adjustment factors: Control for seasonality and long-term trends

Underlying time-trends of mortality

The research question aims to explore the relationship between short-term changes in mortality and daily PM_2.5 levels. However, the raw outcome and exposure data is likely influenced by seasonal patterns and long-term trends, necessitating the control of these patterns in the regression model. The reasons for controlling these patterns are as follows:

Confounding: Seasonal patterns and long-term trends can independently affect both the exposure (PM_2.5 levels) and the outcome (mortality). By controlling for these patterns, we ensure that the observed associations between exposure and outcome are not confounded by these extraneous factors. This allows for a more accurate interpretation of the relationship, focusing specifically on day-to-day associations.
Precision and Power: Controlling for seasonal patterns and long-term trends improves the precision and statistical power of the regression model. It reduces variability and noise associated with these patterns, enabling better estimation of the short-term associations of interest. Additionally, it enhances comparability across different time periods, ensuring valid comparisons of the short-term associations between exposure and outcome.

Short-term weekly variations

In this study, it is important to consider short-term weekly variations as a potential confounder. The table below demonstrates the significant relationship between short-term weekly variations and PM_2.5 levels, and the daily number of deaths. It means that short-term weekly variations might be a confounder. To accurately assess the association between PM_2.5 and daily mortality, it is crucial to account for these short-term weekly variations in the model. Adjusting for these confounding factors will provide a more comprehensive understanding of the relationship between PM_2.5 and daily mortality.

data$dow <- as.factor(weekdays(data$date))
# reorder dow
data$dow <- factor(data$dow, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# selecting variables
database <- data[, .(nat, pm25, dow)]
database %>%   # keep only the columns of interest
  tbl_summary(by = dow, percent = "row",digits = list(everything() ~ c(2))) %>%  # produce summary table and specify grouping variable
  add_p()

Characteristic	Monday, N = 515¹	Tuesday, N = 521¹	Wednesday, N = 520¹	Thursday, N = 517¹	Friday, N = 519¹	Saturday, N = 517¹	Sunday, N = 520¹	p-value²
nat	72.00 (64.50, 81.00)	72.00 (64.00, 81.00)	72.00 (65.00, 81.00)	72.00 (64.00, 80.00)	71.00 (64.00, 79.00)	71.00 (64.00, 78.00)	69.00 (62.00, 76.00)	<0.001
pm25	16.86 (13.25, 22.88)	17.12 (13.58, 23.55)	17.52 (13.31, 23.75)	17.86 (13.83, 24.79)	17.86 (14.00, 24.00)	17.41 (13.73, 23.43)	16.31 (12.93, 22.54)	0.047
¹ Median (IQR)
² Kruskal-Wallis rank sum test

Confounding: Temperature

Significant negative correlations between daily mean temperatures and PM_2.5 and daily number of death were also found (Figure below). Therefore, it is important to adjust for daily mean temperatures in the regression model because it is associated with both PM_2.5 levels and day-to-day changes in mortality. By including daily mean temperatures as a confounder, we aim to eliminate or reduce its influence on the estimated association between exposure and outcome. This adjustment helps isolate the direct effect of PM_2.5 levels on mortality and allows for a more accurate assessment of its true impact.

# correlation analysis
# library("PerformanceAnalytics")
my_data <- data[, c(4,5,6)]
chart.Correlation(my_data, histogram=TRUE, pch=19)

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

Other “usual” confounders

It is worth noting that while common confounders in epidemiology include factors such as age, sex, body mass index, smoking status, and drinking, these “standard confounders” may not be applicable to the specific data under consideration. Since their distribution is unlikely to vary on a day-to-day basis and they are not associated with short-term fluctuations in environmental exposures like pollution levels, they may not confound the observed associations in this context.

Analyses: Modelling the exposure-outcome association

A quasi-Poisson generalized additive model was applied. This model should be adjusted for

A penalized spline smooth function of calendar day with seven degrees of freedom (df) per year to account for underlying long-term and seasonal time-trends,
Day-of-week to account for weekly variations
Natural spline function with 4 df for daily mean air temperature.
Seasonal time-trends: Cyclic splines

Although Dr. Jeroen specifically requested a penalized spline smooth function of the calendar day with seven degrees of freedom (df) per year to capture long-term and seasonal time-trends, Dr. Gavin Simpson suggests considering the adjustment of seasonal time-trends separately in time series analysis. In light of this recommendation, I propose to incorporate a seasonal time-trends variable into the model and assess its impact on model performance.

Therefore, the model of Mortality - PM_2.5 association can be written as below:

\[ \begin{split} Model=GAM (death∼ β0 + β1∗PM2.5 + s(time, bs = "ps",df=7∗10) + ns(tmean,df=4) +s (doy, bs ="cc")+ β2∗DOW) \end{split} \]

Step 1: Mutate variables

Therefore, we we need to mutate these variables.

Long-term patterns
Seasonality
Weekly variations

The variable representing long-term patterns was transformed from a date object into a numeric format (integer) and subsequently divided by 1000. This conversion allows for the indication of the relative position of observations in time and contributes to achieving stability in model fitting. The seasonal time variable, on the other hand, was represented by the day of the year (Gavin Simpson).

data$dow <- as.factor(weekdays(data$date))
# reorder dow
data$dow <- factor(data$dow, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
#time
data$time <- as.numeric(data$date)/1000
# to obtain the RR per 10g/m3 of PM2.5, pm25_10 was created:
data$pm25_10 <- data$pm25/10  # create pm2.5*10

Step 2: Check the long-term patterns and seasonality

To assess the long-term patterns, we initially employed a GAM model excluding PM_2.5. By employing this approach, we aimed to determine if these specifications adequately accounted for the long-term and seasonal time-trends in our analysis.

# PLOT RESPONSE RESIDUALS OVER TIME
# Model without pm25
#####################################
model0 <- gam(nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) + s(doy, bs = "cc"),
             data = data,
             family = quasipoisson,method = "REML")
# GENERATE RESIDUALS
res3 <- residuals(model0,type="response")
############
# FIGURE residual variation
############
plot(data$date,res3,ylim=c(-50,150),pch=19,cex=0.4,col=grey(0.6),
  main="Residuals over time",ylab="Residuals (observed-fitted)",xlab="Date")
abline(h=1,lty=2,lwd=2)

The figure above illustrates the residual variation in daily deaths after accounting for the seasonal and long-term trends through modeling. It indicates our successful control over the seasonal and long-term effects, as the residual variation predominantly clusters around zero. By incorporating PM_2.5 into this model, we can now address our primary objective, which is to examine whether the remaining short-term variation, in relation to the long-term pattern, is partially attributable to PM_2.5.

Step 3: Apply model for PM_2.5.

data$pm25_10 <- data$pm25/10  # create pm2.5*10
model1 <- gam(nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) + s(doy, bs = "cc") + pm25_10,
             data = data,
             family = quasipoisson,method = "REML")
summary(model1)

## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) + 
##     s(doy, bs = "cc") + pm25_10
## 
## Parametric coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.218715   0.022765 185.319  < 2e-16 ***
## factor(dow)Tuesday   -0.003039   0.007825  -0.388  0.69771    
## factor(dow)Wednesday -0.006707   0.007835  -0.856  0.39203    
## factor(dow)Thursday  -0.011441   0.007860  -1.455  0.14562    
## factor(dow)Friday    -0.014655   0.007857  -1.865  0.06224 .  
## factor(dow)Saturday  -0.023236   0.007881  -2.949  0.00321 ** 
## factor(dow)Sunday    -0.049753   0.007921  -6.281 3.76e-10 ***
## ns(tmean, 4)1         0.038372   0.020885   1.837  0.06625 .  
## ns(tmean, 4)2         0.035816   0.025436   1.408  0.15919    
## ns(tmean, 4)3         0.234245   0.049453   4.737 2.26e-06 ***
## ns(tmean, 4)4         0.331764   0.029549  11.227  < 2e-16 ***
## pm25_10               0.007841   0.002609   3.005  0.00267 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(time) 16.241  19.71 22.84  <2e-16 ***
## s(doy)   7.495   8.00 55.84  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.377   Deviance explained = 37.9%
## -REML = 2173.8  Scale est. = 1.1565    n = 3629

Step 4: Model checking

gam.check(model1)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-1.095943e-05,4.860971e-06]
## (score 2173.803 & scale 1.156452).
## Hessian positive definite, eigenvalue range [2.209941,1808.04].
## Model rank =  89 / 89 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'   edf k-index p-value    
## s(time) 69.00 16.24    0.90  <2e-16 ***
## s(doy)   8.00  7.49    0.99    0.26    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Running gam.check() on a model provides several outputs, in both the console and as plots. We’ll start with the console output.

It reports full convergence after 5 iterations. R has found a best solution.

Below, we see a table of basis checking results. This shows a statistical test for patterns in model residuals, which should be random. Each line reports the test results for one smooth. It shows the k value or number of basis functions, the effective degrees of freedom, a test statistic, and p-value. Here, small p-values indicate that residuals are not randomly distributed. This often means there are not enough basis functions. However, according to Gasparrini et al., although the optimal number of knots is a topic of ongoing debate and there is no consensus, a commonly supported approach suggests that using approximately 7 knots per year strikes a balance. This choice aims to provide effective control for seasonality and mitigate other potential confounding factors caused by temporal trends, while still retaining enough information to estimate exposure effects accurately. Therefore, we will keep 70 basis functions in this model for the time variable.

Some basic model checking plots for model final fitted to the data. The upper left normal QQ-plot is very close to a straight line, suggesting that the distributional assumption is reasonable. The upper right plot suggests that variance is approximately constant as the mean increases. The histogram of residuals at lower left appears approximately consistent with normality. The lower right plot, of response against fitted values, shows a positive linear relationship with a good deal of scatter: nothing problematic.

Step 5: Interpretation for PM_2.5

eff <- ci.lin(model1,subset="pm25_10",Exp=T)
tabeff <- rbind(eff)[,5:7]
tabeff

## exp(Est.)      2.5%     97.5% 
##  1.007872  1.002731  1.013039

After adding adjustment for season and long-term trend to the model, each 10 µg/m³ increase in PM_2.5 levels is associated with a mortality risk ratio of 1.008 (95% CI 1.002 to 1.013, p < 0.001), suggesting that in the short term, higher PM_2.5 levels is associated with higher mortality risk. While small effect sizes are common in environmental epidemiology, it is important to recognize their significance when entire populations are exposed.

The study’s findings underscore the importance of implementing solutions to minimize PM_2.5 exposure and effectively mitigate pollution, thus safeguarding the well-being of the public in Rome. Given the significant contribution of traffic emissions to PM_2.5 pollution, potential solutions such as promoting electric vehicles, improving public transportation, implementing integrated urban planning strategies as well as other possible solutions targeting different sources of PM_2.5 should be considered in Rome city.

Extra analysis: Improve the current model

We can also utilize a smoothing function for PM_2.5 to assess whether the relationship between mortality and PM_2.5 exhibits non-linearity. By applying smoothing techniques, we can gain insights into the potential nonlinear nature of the association between these variables.

model2 <- gam(nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) + s(doy, bs = "cc") + s(pm25_10),
             data = data,
             family = quasipoisson)
summary(model2)

## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## nat ~ s(time, bs = "ps", k = 70) + factor(dow) + ns(tmean, 4) + 
##     s(doy, bs = "cc") + s(pm25_10)
## 
## Parametric coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.218821   0.022897 184.255  < 2e-16 ***
## factor(dow)Tuesday   -0.003031   0.007714  -0.393  0.69438    
## factor(dow)Wednesday -0.006764   0.007725  -0.876  0.38133    
## factor(dow)Thursday  -0.011715   0.007752  -1.511  0.13081    
## factor(dow)Friday    -0.014864   0.007750  -1.918  0.05521 .  
## factor(dow)Saturday  -0.023258   0.007771  -2.993  0.00278 ** 
## factor(dow)Sunday    -0.049490   0.007809  -6.337 2.63e-10 ***
## ns(tmean, 4)1         0.060989   0.021342   2.858  0.00429 ** 
## ns(tmean, 4)2         0.059062   0.026202   2.254  0.02425 *  
## ns(tmean, 4)3         0.252140   0.050793   4.964 7.23e-07 ***
## ns(tmean, 4)4         0.338637   0.031374  10.794  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df     F p-value    
## s(time)    57.921 62.377 9.662 < 2e-16 ***
## s(doy)      5.733  8.000 4.385 < 2e-16 ***
## s(pm25_10)  1.737  2.201 4.677 0.00788 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.397   Deviance explained = 40.4%
## GCV = 1.1538  Scale est. = 1.1238    n = 3629

plot(model2, select =3)

In the analysis of smooth terms, the first column displays the effective degrees of freedom (edf), which serves as an indicator of smooth complexity. The effective degrees of freedom (edf) for PM_2.5 range from 1 to 2, implying that the relationship between mortality and PM_2.5 exhibits nearly linearity. This result reaffirms our previous conclusion regarding the linear association between these variables and provides further support for the insights gained from the analysis of the linear relationship.

Extra analysis: Check for delayed effects

Our current modeling approach has focused on examining the relationship between mortality and PM_2.5 levels on the same day. However, it is important to consider the possibility of a delayed or ‘lagged’ association between exposure and outcome. It is conceivable that the PM_2.5 level from the previous day might be a more influential predictor of today’s mortality risk than the PM_2.5 level observed on the same day.

A distributed lag model with 0-7 lags was applied to explore whether there is any delayed association.

# PRODUCE THE CROSS-BASIS FOR PM2.5 (SCALING NOT NEEDED)
# A SIMPLE UNSTRANSFORMED LINEAR TERM AND THE UNCONSTRAINED LAG STRUCTURE
cbpm25unc <- crossbasis(data$pm25,lag=c(0,7),argvar=list(fun="lin"),
  arglag=list(fun="integer"))
summary(cbpm25unc)

## CROSSBASIS FUNCTIONS
## observations: 3629 
## range: 4.552505 to 68.26521 
## lag period: 0 7 
## total df:  8 
## 
## BASIS FOR VAR:
## fun: lin 
## intercept: FALSE 
## 
## BASIS FOR LAG:
## fun: integer 
## values: 0 1 2 3 4 5 6 ... 
## intercept: TRUE

# RUN THE MODEL AND OBTAIN PREDICTIONS FOR OZONE LEVEL 10ug/m3

model1 <- gam(nat ~ s(time, bs = "ps", k = 70) + factor(dow) + s(doy, bs = "cc")  + ns(tmean, 4) + cbpm25unc,
             data = data,
             family = quasipoisson)
summary(model1)

## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## nat ~ s(time, bs = "ps", k = 70) + factor(dow) + s(doy, bs = "cc") + 
##     ns(tmean, 4) + cbpm25unc
## 
## Parametric coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.212e+00  2.398e-02 175.637  < 2e-16 ***
## factor(dow)Tuesday   -2.097e-03  7.729e-03  -0.271  0.78619    
## factor(dow)Wednesday -6.211e-03  7.742e-03  -0.802  0.42251    
## factor(dow)Thursday  -1.047e-02  7.772e-03  -1.348  0.17782    
## factor(dow)Friday    -1.423e-02  7.769e-03  -1.831  0.06715 .  
## factor(dow)Saturday  -2.359e-02  7.791e-03  -3.028  0.00248 ** 
## factor(dow)Sunday    -5.010e-02  7.824e-03  -6.404 1.71e-10 ***
## ns(tmean, 4)1         4.221e-02  2.222e-02   1.900  0.05753 .  
## ns(tmean, 4)2         4.787e-02  2.634e-02   1.818  0.06918 .  
## ns(tmean, 4)3         2.237e-01  5.196e-02   4.305 1.71e-05 ***
## ns(tmean, 4)4         3.325e-01  3.075e-02  10.814  < 2e-16 ***
## cbpm25uncv1.l1       -3.262e-05  3.860e-04  -0.085  0.93264    
## cbpm25uncv1.l2        8.820e-04  5.025e-04   1.755  0.07935 .  
## cbpm25uncv1.l3        4.154e-04  5.074e-04   0.819  0.41302    
## cbpm25uncv1.l4       -7.250e-05  5.087e-04  -0.143  0.88667    
## cbpm25uncv1.l5        4.653e-04  5.086e-04   0.915  0.36036    
## cbpm25uncv1.l6       -7.539e-04  5.073e-04  -1.486  0.13733    
## cbpm25uncv1.l7        6.257e-04  4.954e-04   1.263  0.20667    
## cbpm25uncv1.l8       -4.153e-04  3.757e-04  -1.105  0.26914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(time) 56.868  61.69 9.360  <2e-16 ***
## s(doy)   5.772   8.00 4.583  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.398   Deviance explained = 40.6%
## GCV = 1.1549  Scale est. = 1.123     n = 3622

pred1 <- crosspred(cbpm25unc,model1,at=10)

# ESTIMATED EFFECTS AT EACH LAG
tablag2 <- with(pred1,t(rbind(matRRfit,matRRlow,matRRhigh)))
colnames(tablag2) <- c("RR","ci.low","ci.hi")
tablag2

##             RR    ci.low    ci.hi
## lag0 0.9996738 0.9921402 1.007265
## lag1 1.0088585 0.9989704 1.018844
## lag2 1.0041627 0.9942258 1.014199
## lag3 0.9992753 0.9893623 1.009288
## lag4 1.0046636 0.9946983 1.014729
## lag5 0.9924895 0.9826708 1.002406
## lag6 1.0062768 0.9965532 1.016095
## lag7 0.9958558 0.9885488 1.003217

#############
# FIGURE LAG effect
#############

plot(pred1,var=10,type="p",ci="bars",col=1,pch=19,ylim=c(0.99,1.03),
  main="All lag terms modelled together (unconstrained)",xlab="Lag (days)",
  ylab="RR and 95%CI per 10ug/m3 PM2.5 increase")

We found that the effect of each 10 µg/m³ increase in PM_2.5 levels on risk of morality was higher at day 1, 2, 4 and 6 before the event days, but all of them were not significant.

The cumulative effect of an exposure over several lag days can be calculated from a distributed lag model as the sum of the coefficients. Based on the results below, the cumulative 7 days RR for mortality was significantly larger than 1, (RR= 1.01, 95%CI: 1.002 - 1.02).

# OVERALL CUMULATIVE (NET) EFFECT
pred1$allRRfit ; pred1$allRRlow ; pred1$allRRhigh

##       10 
## 1.011203

##       10 
## 1.002689

##       10 
## 1.019789

Extra analysis: Effect modification by Temperatures

In our current model, we have included daily mean ambient temperature as a confounder. However, it is important to recognize that daily mean ambient temperature could potentially act as an effect modifier in the relationship between mortality and PM_2.5. Consequently, in this section, we will conduct an effect modification analysis to investigate this possibility and examine how the relationship between mortality and PM_2.5 may vary across different levels of daily mean ambient temperature.

Firstly, I divided the data into two separate datasets based on the median temperature. This division allows us to create two distinct groups: one comprising data points with temperatures below the median, and the other with data points having temperatures above or equal to the median. By doing so, we can explore potential variations in the relationship between mortality and PM_2.5 across these two temperature groups and evaluate the effect modification of daily mean ambient temperature on this association.

hist(data$tmean) #non-normal distribution --> Using cut-off Median

summary(data$tmean) #Median: 15.60

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00   10.40   15.60   16.02   21.90   33.00

data <- data %>%
  mutate(
    # Create age level variable
    temp_level = case_when(
      pm25 <=15.06            ~ 1,
      pm25 > 15.06            ~ 2),
    temp_level = as.factor(temp_level)
  )
table(data$temp_level) # 1 (1292)   2 (2337)

## 
##    1    2 
## 1292 2337

temp_high <- data %>% 
  filter(temp_level == "2")

temp_low <- data %>% 
  filter(temp_level == "1")

#Model for high level of temp
model3 <- gam(nat ~ s(time, bs = "ps", k = 70) + s(doy, bs = "cc")
               + factor(dow) 
               + pm25_10,
               data = temp_high,
               family = quasipoisson)
#Model for low level of temp
model4 <- gam(nat ~ s(time, bs = "ps", k = 70) + s(doy, bs = "cc")
               + factor(dow) 
               + pm25_10,
               data = temp_low,
               family = quasipoisson)
#RR of PM2.5
tabeff_low <- rbind(ci.lin(model4, subset = "pm25_10", Exp = TRUE))[, 5:7]
tabeff_high <- rbind(ci.lin(model3, subset = "pm25_10", Exp = TRUE))[, 5:7]
combined_table <- rbind(tabeff_low, tabeff_high)
combined_table

##             exp(Est.)     2.5%    97.5%
## tabeff_low   1.073324 1.037064 1.110852
## tabeff_high  1.008747 1.002263 1.015272

Secondly, I employed the current model to analyze each dataset individually in order to determine the relative risks (RRs) of PM_2.5 in each model. According to the information below, when considering lower temperature levels, the estimated effect size of PM_2.5 on mortality exhibits a larger magnitude, resulting in an average 7.3% increase in the risk of mortality for every 10 µg/m³ rise in PM_2.5 levels. In contrast, at higher temperature levels, the estimated effect size is smaller, indicating a 0.9% increase in mortality per 10 µg/m³ increment in PM_2.5 levels. These findings suggest that the impact of PM_2.5 on mortality may be more pronounced in lower temperature conditions compared to higher temperature conditions.

Conclusion

The findings emphasize the public health concerns associated with both air pollution and climate change Rome. Each 10 µg/m³ rise in PM_2.5 levels is linked to a 0.8% [0.3–1.0] increase in mortality risk. In addition, when temperatures are below 15.06 ℃, the impact of PM_2.5 on mortality is more pronounced, resulting in an average 7.3% increase in risk for every 10 µg/m³ increase in PM_2.5 levels. Conversely, at temperatures above 15.06 ℃, the effect size is smaller, indicating a 0.9% increase in mortality per 10 µg/m³ rise in PM_2.5 levels.

Thank you for taking the time to read through the information. If you have any further inquiries, please feel free to reach out to me via email at khanhhuyenhnhuph@gmail.com or my linkedin. I’m more than happy to assist you with any questions you may have.

Practical exercise for the PhD position

Nguyen Thi Khanh Huyen

2023-06-17

Loading data

Descriptive analysis

Question 1

Question 2

Question 3

Adjustment factors: Control for seasonality and long-term trends

Underlying time-trends of mortality

Short-term weekly variations

Confounding: Temperature

Other “usual” confounders

Analyses: Modelling the exposure-outcome association

Step 1: Mutate variables

Step 2: Check the long-term patterns and seasonality

Step 3: Apply model for PM_2.5.

Step 4: Model checking

Step 5: Interpretation for PM_2.5

Extra analysis: Improve the current model

Extra analysis: Check for delayed effects

Extra analysis: Effect modification by Temperatures

Conclusion

Practical exercise for the PhD position

Nguyen Thi Khanh Huyen

2023-06-17

Loading data

Descriptive analysis

Question 1

Question 2

Question 3

Adjustment factors: Control for seasonality and long-term trends

Underlying time-trends of mortality

Short-term weekly variations

Confounding: Temperature

Other “usual” confounders

Analyses: Modelling the exposure-outcome association

Step 1: Mutate variables

Step 2: Check the long-term patterns and seasonality

Step 3: Apply model for PM2.5.

Step 4: Model checking

Step 5: Interpretation for PM2.5

Extra analysis: Improve the current model

Extra analysis: Check for delayed effects

Extra analysis: Effect modification by Temperatures

Conclusion

Step 3: Apply model for PM_2.5.

Step 5: Interpretation for PM_2.5