DEFRA question

Loading data

Background: https://uk-air.defra.gov.uk/pm25targets/progress

calculations: https://uk-air.defra.gov.uk/pm25targets/calculation

library(openair)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(maps)

Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map
library(reshape2)

Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths

Meta data

meta_aurn = importMeta(source = "aurn")

Locations of all sites (not sure what’s going on with Ireland)

ggplot(data = meta_aurn, aes(x=longitude, y=latitude)) +
  borders(fill = "white") +
  geom_point(aes(colour = site_type)) + 
  xlim(-10,2) + ylim(50,61)

Restrict to ‘Urban Background’ and ‘Suburban Background’ which is what the methodology page says are used

aurn_background = meta_aurn |> filter(str_detect(site_type, "rban Background"))
ggplot(data = aurn_background, aes(x=longitude, y=latitude)) +
  borders(fill = "white") +
  geom_point(aes(colour = site_type)) + 
  xlim(-10,2) + ylim(50,61)

Methodology

Annual increment from year \(n\) to year \(n+1\) are calculated by finding the annual mean for every site for years \(n-2\) to \(n+1\).

This includes only sites that are:

  • ‘Urban background’ or ‘suburban background’
  • Have been operational for all four years
  • Have a data capture rate of at least 85%

For each of the four years, calculate the annual mean concentration for each site.

Then, find the mean of the averages for each year, to produce an England-wide (or UK-wide?) mean.

Find the average over the first three years (\(n-2\) to \(n\)) to find the reference indicator \(RI_\text{year}\).

Find the average over the final three years (\(n-1\) to \(n+1\)) to find the population exposure indicator \(PEI_\text{year}\).

Then, the yearly increment is

\[\Delta PEI_\text{year} = PEI_\text{year} - RI_\text{year}. \]

These yearly increments are summed to find the change over multiple years.

Question about the methodology

Are these weighted averages? If not, then surely we have the UK/England wide average for each year (found from the eligible sites, which are only included if they are eligible for the entire 4 year period), say

\[M_{n}\]

for the UK-wide average for year \(n\).

Then we have

\[ \begin{align*} RI_n &= \frac{1}{3}\left(M_{n-2} + M_{n-1} + M_{n}\right)\\ PEI_n & = \frac{1}{3}\left(M_{n-1} + M_{n} + M_{n+1}\right) \end{align*} \]

and so

\[\Delta PEI_n = \frac{1}{3}\left(M_{n+1} - M_{n-2}\right). \] That is, it doesn’t depend on the averages for years \(n\) or \(n-1\). It is a yearly average calculated from a 3 year difference. This would make calculations about standard error easier, but it does seem a bit strange.

Uncertainty

How is uncertainty being handled? Eg sample standard error?

We can consider the uncertainty from the sequence of calculations, since these are just a sequence of statistics calculated from data that is (in theory) a random sample from the population:

  • Standard error in individual year-site means
  • Standard error in region-wide year means (could be estimated using SE above?)
  • Standard error of \(\Delta PEI\)
  • Standard error in sum of \(\Delta PEI\) to find difference over multiple years

I can’t find any mention of a threshold on the uncertainty of the estimate, which I think means it impossible to determine how many sensors are needed. But perhaps there is one and I’ve missed it / it isn’t on the webpage.

What do the data look like?

Download data for some sites and plot it:

  • Sites nearby to one another and of similar types
  • Sites of different types?
  • Sites far from one anohter and of different types?

How noisy are the data? Are there broad trends? Are they correlated? Plot at different scales.

Restrict to a small number near Welsh border

aurn_wb = meta_aurn |> 
  filter(
    between(latitude, 52, 53) & 
      between(longitude, -3, -2) &
      str_detect(site_type, "rban Background"))

Find codes

aurn_wb_codes = aurn_wb$code

Make a list of the data (probably a bad way to do it), restrict to one year. Remove problematic codes manually (ones with no data for 2022). This gets us from 9 to 3.

aurn_wb_codes = c("LEOM", "TDHD", "WAL4")
aurn_wb2022 = list()
for (code in aurn_wb_codes){
  print(code)
  aurn_wb2022[[code]] = importAURN(
   site = code,
   year = 2022)
}
[1] "LEOM"
[1] "TDHD"
[1] "WAL4"

Check what columns they each have

for (i in 1:length(aurn_wb_codes)){
  code_i = aurn_wb_codes[i]
  print(sprintf("%s: %s", code_i , paste(names(aurn_wb2022[[code_i]]), collapse = ", ")))
}
[1] "LEOM: source, site, code, date, nox, no2, no, o3, ws, wd, air_temp"
[1] "TDHD: source, site, code, date, nox, no2, no, pm10, pm2.5, ws, wd, air_temp"
[1] "WAL4: source, site, code, date, nox, no2, no, o3, ws, wd, air_temp"

Only TDHD has pm2.5, might as well re-import

aurn_TDHD = importAURN(site = "TDHD", year = 2018:2024)
ggplot(data = aurn_TDHD, aes(x=date, y=pm2.5)) + geom_point()
Warning: Removed 39593 rows containing missing values or values outside the scale range
(`geom_point()`).

Restrict \(x\) axis

ggplot(
  data = aurn_TDHD |> filter(!is.na(pm2.5)), 
  aes(x=date, y=pm2.5)) + 
  geom_point(size=0.1, alpha=0.6) +
  scale_x_datetime(date_labels ="%b-%Y") 

So:

  • Looks quite noisy
  • From these 9 codes in the area I selected, none would be able to create a \(\Delta PEI\) involving the year 2022 (ie. for 2021 to 2024).

Try for wider area, with a view to calculating the 2023-24 increment - so I need data for 2021 to 2024.

all_bg_codes = aurn_background$code

Go through full list of codes, selecting what seem to be the eligible ones:

aurn_23_full = list()
for (code in all_bg_codes){
  
  ## See if the data will even import
  ## I think this fails if there's no data in that time range
  try_code = try(importAURN(
    site = code, 
    year = 2021:2024
    ))
  print(sprintf("Done: %s", code))
  if(any(class(try_code) == "try-error")){
    print(sprintf("Error: %s", code))
  } else {
    ## Want to only keep those sites with enough data for this time period
    # First check there is a pm2.5 column
    # then check it's at least 85% complete
    # If yes to both, add it to the list
    if(any(names(try_code) == "pm2.5")){
      na_count = sum(is.na(try_code$pm2.5))
      data_count = sum(!is.na(try_code$pm2.5))
      if((na_count / data_count) < 0.15){
        try_keep = try_code |> select(code, date, pm2.5)
        aurn_23_full[[code]] = try_keep
      } else {
        print(sprintf("Insufficient data: %s", code))
      }
    } else {
      print(sprintf("No column for pm2.5: %s", code))
    }
      
  }
  
}
save(aurn_23_full, file = "aurn_23_full.Rdata")

Load dataset. There are 50 eligible sites.

load("aurn_23_full.Rdata")
length(aurn_23_full)
[1] 50

Find the metadata for these

codes_23_eligible = names(aurn_23_full)
meta_23_eligible = meta_aurn |> filter(code %in% codes_23_eligible)

Plot locations (just out of curiosity)

ggplot(data = meta_23_eligible, aes(x=longitude, y=latitude)) +
  borders(fill = "white") +
  geom_point(aes(colour = site_type)) + 
  xlim(-10,2) + ylim(50,61) + coord_fixed()

Now unlist them to one big dataframe

aurn_23_unlist = bind_rows(aurn_23_full)

How much of a disaster is it if I plot all the data?! Have to get it to a very small time-scale in order to see any detail. Also clearer at log scale since because of the extreme peaks.

Some arbitrary two-week periods

ggplot(aurn_23_unlist |> 
         filter(between(date, as.Date("2024-06-15"), as.Date("2024-07-01"))), 
       aes(x=date, y=log(pm2.5))) +
  geom_line(aes(group = code), size=0.1, alpha=0.5) +
  scale_x_datetime(date_labels = "%d-%b-%y")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning in log(pm2.5): NaNs produced
Warning: Removed 502 rows containing missing values or values outside the scale range
(`geom_line()`).

Same 2 weeks, 2022

ggplot(aurn_23_unlist |> 
         filter(between(date, as.Date("2022-06-15"), as.Date("2022-07-01"))), 
       aes(x=date, y=log(pm2.5))) +
  geom_line(aes(group = code), size=0.1, alpha=0.5) +
  scale_x_datetime(date_labels = "%d-%b-%y")
Warning in log(pm2.5): NaNs produced

Christmas period 2023 - appears more periodic (daily)

ggplot(aurn_23_unlist |> 
         filter(between(date, as.Date("2023-12-18"), as.Date("2024-01-03"))), 
       aes(x=date, y=log(pm2.5))) +
  geom_line(aes(group = code), size=0.1, alpha=0.5) +
  scale_x_datetime(date_labels = "%d-%b-%y")
Warning in log(pm2.5): NaNs produced
Warning: Removed 386 rows containing missing values or values outside the scale range
(`geom_line()`).

FInd correlation between different stations

aurn_23_wide = aurn_23_unlist |> pivot_wider(
  names_from = code,
  values_from = pm2.5
  )
cor_23_full = cor(aurn_23_wide[,-1], use = "pairwise.complete.obs")
melted_cormat = melt(cor_23_full)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() + scale_color_viridis_b()

Yearly averages for each site

aurn_23_unlist$year = lubridate::year(aurn_23_unlist$date)
aurn_23_summary = aurn_23_unlist |> 
  group_by(code, year) |> 
  summarise(
    mean = mean(pm2.5, na.rm=T), 
    sd = sd(pm2.5, na.rm=T),
    n = length(na.omit(pm2.5)))
`summarise()` has grouped output by 'code'. You can override using the
`.groups` argument.

Add in some meta columns, and a standard error col

aurn_23_summary = left_join(
  aurn_23_summary,
  meta_aurn,
  by = "code") |> mutate(se = sd / sqrt(n))

Plot means

ggplot(data = aurn_23_summary, aes(as.factor(year), mean)) +
  geom_boxplot()

What are the sample standard deviations like?

ggplot(data = aurn_23_summary) + geom_histogram(aes(x=se)) +
  xlim(0, NA)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).

Which is the one with the really high SD?

aurn_23_summary$code[aurn_23_summary$sd == max(aurn_23_summary$sd)]
[1] "NOTT"
ggplot(aurn_23_unlist |> 
         filter(code == "NOTT"), 
       aes(x=date, y=log(pm2.5))) +
  geom_line(aes(group = code), size=0.1, alpha=0.5) +
  scale_x_datetime(date_labels = "%d-%b-%y")
Warning: Removed 25 rows containing missing values or values outside the scale range
(`geom_line()`).

Now see if there’s any kind of pattern of mean pm2.5 by area

ggplot(aurn_23_summary, aes(x=longitude, y=latitude, col=mean)) +
  borders(fill = "white") +
  xlim(-7.5, 2) +
  ylim(50, 57.5)+
  geom_point() +
  facet_wrap(~year)+
  coord_fixed() +
  scale_color_continuous(type = "viridis")

Can I do this with a linear model?

lm_all = lm(
  mean ~ longitude + latitude + longitude:latitude + year,
  data = aurn_23_summary)
summary(lm_all)

Call:
lm(formula = mean ~ longitude + latitude + longitude:latitude + 
    year, data = aurn_23_summary)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1623 -0.7132 -0.1399  0.5607  2.9592 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        485.50804  136.00855   3.570 0.000466 ***
longitude           -0.90498    1.65369  -0.547 0.584933    
latitude            -0.33961    0.08533  -3.980 0.000102 ***
year                -0.22740    0.06717  -3.385 0.000885 ***
longitude:latitude   0.01685    0.03095   0.545 0.586781    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.001 on 168 degrees of freedom
Multiple R-squared:  0.2844,    Adjusted R-squared:  0.2674 
F-statistic: 16.69 on 4 and 168 DF,  p-value: 1.542e-11

Outliers on QQ plot are 109, 135, 108 Highest leverage are 41, 43

aurn_23_summary[c(41, 43, 108, 109, 135),]
# A tibble: 5 × 11
# Groups:   code [3]
  code   year  mean    sd     n source site  latitude longitude site_type     se
  <chr> <dbl> <dbl> <dbl> <int> <chr>  <chr>    <dbl>     <dbl> <chr>      <dbl>
1 DERR   2022  8.53  9.40  8349 aurn   Derr…     55.0     -7.33 Urban Ba… 0.103 
2 DERR   2024  8.15  7.11  8066 aurn   Derr…     55.0     -7.33 Urban Ba… 0.0792
3 NOTT   2023 10.5  12.2   8745 aurn   Nott…     53.0     -1.15 Urban Ba… 0.130 
4 NOTT   2024 10.1   7.55   703 aurn   Nott…     53.0     -1.15 Urban Ba… 0.285 
5 ROED   2024 10.1   9.88  2521 aurn   Roth…     53.4     -1.32 Urban Ba… 0.197