DEFRA question

Loading data

Background: https://uk-air.defra.gov.uk/pm25targets/progress

calculations: https://uk-air.defra.gov.uk/pm25targets/calculation

library(openair)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(maps)


Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map

library(reshape2)


Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths

Meta data

meta_aurn = importMeta(source = "aurn")

Locations of all sites (not sure what’s going on with Ireland)

ggplot(data = meta_aurn, aes(x=longitude, y=latitude)) +
  borders(fill = "white") +
  geom_point(aes(colour = site_type)) + 
  xlim(-10,2) + ylim(50,61)

Restrict to ‘Urban Background’ and ‘Suburban Background’ which is what the methodology page says are used

aurn_background = meta_aurn |> filter(str_detect(site_type, "rban Background"))

ggplot(data = aurn_background, aes(x=longitude, y=latitude)) +
  borders(fill = "white") +
  geom_point(aes(colour = site_type)) + 
  xlim(-10,2) + ylim(50,61)

Methodology

Annual increment from year \(n\) to year \(n+1\) are calculated by finding the annual mean for every site for years \(n-2\) to \(n+1\).

This includes only sites that are:

‘Urban background’ or ‘suburban background’
Have been operational for all four years
Have a data capture rate of at least 85%

For each of the four years, calculate the annual mean concentration for each site.

Then, find the mean of the averages for each year, to produce an England-wide (or UK-wide?) mean.

Find the average over the first three years (\(n-2\) to \(n\)) to find the reference indicator \(RI_\text{year}\).

Find the average over the final three years (\(n-1\) to \(n+1\)) to find the population exposure indicator \(PEI_\text{year}\).

Then, the yearly increment is

\[\Delta PEI_\text{year} = PEI_\text{year} - RI_\text{year}. \]

These yearly increments are summed to find the change over multiple years.

Question about the methodology

Are these weighted averages? If not, then surely we have the UK/England wide average for each year (found from the eligible sites, which are only included if they are eligible for the entire 4 year period), say

\[M_{n}\]

for the UK-wide average for year \(n\).

Then we have

\[ \begin{align*} RI_n &= \frac{1}{3}\left(M_{n-2} + M_{n-1} + M_{n}\right)\\ PEI_n & = \frac{1}{3}\left(M_{n-1} + M_{n} + M_{n+1}\right) \end{align*} \]

and so

\[\Delta PEI_n = \frac{1}{3}\left(M_{n+1} - M_{n-2}\right). \] That is, it doesn’t depend on the averages for years \(n\) or \(n-1\). It is a yearly average calculated from a 3 year difference. This would make calculations about standard error easier, but it does seem a bit strange.

Uncertainty

How is uncertainty being handled? Eg sample standard error?

We can consider the uncertainty from the sequence of calculations, since these are just a sequence of statistics calculated from data that is (in theory) a random sample from the population:

Standard error in individual year-site means
Standard error in region-wide year means (could be estimated using SE above?)
Standard error of \(\Delta PEI\)
Standard error in sum of \(\Delta PEI\) to find difference over multiple years

I can’t find any mention of a threshold on the uncertainty of the estimate, which I think means it impossible to determine how many sensors are needed. But perhaps there is one and I’ve missed it / it isn’t on the webpage.

What do the data look like?

Download data for some sites and plot it:

Sites nearby to one another and of similar types
Sites of different types?
Sites far from one anohter and of different types?

How noisy are the data? Are there broad trends? Are they correlated? Plot at different scales.

Restrict to a small number near Welsh border

aurn_wb = meta_aurn |> 
  filter(
    between(latitude, 52, 53) & 
      between(longitude, -3, -2) &
      str_detect(site_type, "rban Background"))

Find codes

aurn_wb_codes = aurn_wb$code

Make a list of the data (probably a bad way to do it), restrict to one year. Remove problematic codes manually (ones with no data for 2022). This gets us from 9 to 3.

aurn_wb_codes = c("LEOM", "TDHD", "WAL4")
aurn_wb2022 = list()
for (code in aurn_wb_codes){
  print(code)
  aurn_wb2022[[code]] = importAURN(
   site = code,
   year = 2022)
}

[1] "LEOM"
[1] "TDHD"
[1] "WAL4"

Check what columns they each have

for (i in 1:length(aurn_wb_codes)){
  code_i = aurn_wb_codes[i]
  print(sprintf("%s: %s", code_i , paste(names(aurn_wb2022[[code_i]]), collapse = ", ")))
}

[1] "LEOM: source, site, code, date, nox, no2, no, o3, ws, wd, air_temp"
[1] "TDHD: source, site, code, date, nox, no2, no, pm10, pm2.5, ws, wd, air_temp"
[1] "WAL4: source, site, code, date, nox, no2, no, o3, ws, wd, air_temp"

Only TDHD has pm2.5, might as well re-import

aurn_TDHD = importAURN(site = "TDHD", year = 2018:2024)
ggplot(data = aurn_TDHD, aes(x=date, y=pm2.5)) + geom_point()

Warning: Removed 39593 rows containing missing values or values outside the scale range
(`geom_point()`).

Restrict \(x\) axis

ggplot(
  data = aurn_TDHD |> filter(!is.na(pm2.5)), 
  aes(x=date, y=pm2.5)) + 
  geom_point(size=0.1, alpha=0.6) +
  scale_x_datetime(date_labels ="%b-%Y")

So:

Looks quite noisy
From these 9 codes in the area I selected, none would be able to create a \(\Delta PEI\) involving the year 2022 (ie. for 2021 to 2024).

Try for wider area, with a view to calculating the 2023-24 increment - so I need data for 2021 to 2024.

all_bg_codes = aurn_background$code

Go through full list of codes, selecting what seem to be the eligible ones:

aurn_23_full = list()
for (code in all_bg_codes){
  
  ## See if the data will even import
  ## I think this fails if there's no data in that time range
  try_code = try(importAURN(
    site = code, 
    year = 2021:2024
    ))
  print(sprintf("Done: %s", code))
  if(any(class(try_code) == "try-error")){
    print(sprintf("Error: %s", code))
  } else {
    ## Want to only keep those sites with enough data for this time period
    # First check there is a pm2.5 column
    # then check it's at least 85% complete
    # If yes to both, add it to the list
    if(any(names(try_code) == "pm2.5")){
      na_count = sum(is.na(try_code$pm2.5))
      data_count = sum(!is.na(try_code$pm2.5))
      if((na_count / data_count) < 0.15){
        try_keep = try_code |> select(code, date, pm2.5)
        aurn_23_full[[code]] = try_keep
      } else {
        print(sprintf("Insufficient data: %s", code))
      }
    } else {
      print(sprintf("No column for pm2.5: %s", code))
    }
      
  }
  
}
save(aurn_23_full, file = "aurn_23_full.Rdata")

Load dataset. There are 50 eligible sites.

load("aurn_23_full.Rdata")
length(aurn_23_full)

[1] 50

Find the metadata for these

codes_23_eligible = names(aurn_23_full)
meta_23_eligible = meta_aurn |> filter(code %in% codes_23_eligible)

Plot locations (just out of curiosity)

ggplot(data = meta_23_eligible, aes(x=longitude, y=latitude)) +
  borders(fill = "white") +
  geom_point(aes(colour = site_type)) + 
  xlim(-10,2) + ylim(50,61) + coord_fixed()

Now unlist them to one big dataframe

aurn_23_unlist = bind_rows(aurn_23_full)

How much of a disaster is it if I plot all the data?! Have to get it to a very small time-scale in order to see any detail. Also clearer at log scale since because of the extreme peaks.

Some arbitrary two-week periods

ggplot(aurn_23_unlist |> 
         filter(between(date, as.Date("2024-06-15"), as.Date("2024-07-01"))), 
       aes(x=date, y=log(pm2.5))) +
  geom_line(aes(group = code), size=0.1, alpha=0.5) +
  scale_x_datetime(date_labels = "%d-%b-%y")

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Warning in log(pm2.5): NaNs produced

Warning: Removed 502 rows containing missing values or values outside the scale range
(`geom_line()`).

Same 2 weeks, 2022

ggplot(aurn_23_unlist |> 
         filter(between(date, as.Date("2022-06-15"), as.Date("2022-07-01"))), 
       aes(x=date, y=log(pm2.5))) +
  geom_line(aes(group = code), size=0.1, alpha=0.5) +
  scale_x_datetime(date_labels = "%d-%b-%y")

Warning in log(pm2.5): NaNs produced

Christmas period 2023 - appears more periodic (daily)

ggplot(aurn_23_unlist |> 
         filter(between(date, as.Date("2023-12-18"), as.Date("2024-01-03"))), 
       aes(x=date, y=log(pm2.5))) +
  geom_line(aes(group = code), size=0.1, alpha=0.5) +
  scale_x_datetime(date_labels = "%d-%b-%y")

Warning in log(pm2.5): NaNs produced

Warning: Removed 386 rows containing missing values or values outside the scale range
(`geom_line()`).

FInd correlation between different stations

aurn_23_wide = aurn_23_unlist |> pivot_wider(
  names_from = code,
  values_from = pm2.5
  )
cor_23_full = cor(aurn_23_wide[,-1], use = "pairwise.complete.obs")
melted_cormat = melt(cor_23_full)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() + scale_color_viridis_b()

Yearly averages for each site

aurn_23_unlist$year = lubridate::year(aurn_23_unlist$date)
aurn_23_summary = aurn_23_unlist |> 
  group_by(code, year) |> 
  summarise(
    mean = mean(pm2.5, na.rm=T), 
    sd = sd(pm2.5, na.rm=T),
    n = length(na.omit(pm2.5)))

`summarise()` has grouped output by 'code'. You can override using the
`.groups` argument.

Add in some meta columns, and a standard error col

aurn_23_summary = left_join(
  aurn_23_summary,
  meta_aurn,
  by = "code") |> mutate(se = sd / sqrt(n))

Plot means

ggplot(data = aurn_23_summary, aes(as.factor(year), mean)) +
  geom_boxplot()

What are the sample standard deviations like?

ggplot(data = aurn_23_summary) + geom_histogram(aes(x=se)) +
  xlim(0, NA)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).

Which is the one with the really high SD?

aurn_23_summary$code[aurn_23_summary$sd == max(aurn_23_summary$sd)]

[1] "NOTT"

ggplot(aurn_23_unlist |> 
         filter(code == "NOTT"), 
       aes(x=date, y=log(pm2.5))) +
  geom_line(aes(group = code), size=0.1, alpha=0.5) +
  scale_x_datetime(date_labels = "%d-%b-%y")

Warning: Removed 25 rows containing missing values or values outside the scale range
(`geom_line()`).

Now see if there’s any kind of pattern of mean pm2.5 by area

ggplot(aurn_23_summary, aes(x=longitude, y=latitude, col=mean)) +
  borders(fill = "white") +
  xlim(-7.5, 2) +
  ylim(50, 57.5)+
  geom_point() +
  facet_wrap(~year)+
  coord_fixed() +
  scale_color_continuous(type = "viridis")

Can I do this with a linear model?

lm_all = lm(
  mean ~ longitude + latitude + longitude:latitude + year,
  data = aurn_23_summary)
summary(lm_all)


Call:
lm(formula = mean ~ longitude + latitude + longitude:latitude + 
    year, data = aurn_23_summary)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1623 -0.7132 -0.1399  0.5607  2.9592 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        485.50804  136.00855   3.570 0.000466 ***
longitude           -0.90498    1.65369  -0.547 0.584933    
latitude            -0.33961    0.08533  -3.980 0.000102 ***
year                -0.22740    0.06717  -3.385 0.000885 ***
longitude:latitude   0.01685    0.03095   0.545 0.586781    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.001 on 168 degrees of freedom
Multiple R-squared:  0.2844,    Adjusted R-squared:  0.2674 
F-statistic: 16.69 on 4 and 168 DF,  p-value: 1.542e-11

Outliers on QQ plot are 109, 135, 108 Highest leverage are 41, 43

aurn_23_summary[c(41, 43, 108, 109, 135),]

# A tibble: 5 × 11
# Groups:   code [3]
  code   year  mean    sd     n source site  latitude longitude site_type     se
  <chr> <dbl> <dbl> <dbl> <int> <chr>  <chr>    <dbl>     <dbl> <chr>      <dbl>
1 DERR   2022  8.53  9.40  8349 aurn   Derr…     55.0     -7.33 Urban Ba… 0.103 
2 DERR   2024  8.15  7.11  8066 aurn   Derr…     55.0     -7.33 Urban Ba… 0.0792
3 NOTT   2023 10.5  12.2   8745 aurn   Nott…     53.0     -1.15 Urban Ba… 0.130 
4 NOTT   2024 10.1   7.55   703 aurn   Nott…     53.0     -1.15 Urban Ba… 0.285 
5 ROED   2024 10.1   9.88  2521 aurn   Roth…     53.4     -1.32 Urban Ba… 0.197