These yearly increments are summed to find the change over multiple years.
Question about the methodology
Are these weighted averages? If not, then surely we have the UK/England-wide average for each year (found from the eligible sites, which are only included if they are eligible for the entire 4-year period), say \(M_n\) for year \(n\), so that
\[\Delta PEI_n = \frac{1}{3}\left(M_{n+1} - M_{n-2}\right). \] That is, it doesn’t depend on the averages for years \(n\) or \(n-1\). It is a yearly average calculated from a 3-year difference. This would make calculations about standard error easier, but it does seem a bit strange.
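As a quick check that the middle years really drop out, here is a minimal sketch in R. The helper name `delta_pei` and the yearly means are made up for illustration (they are not real AURN values); the only thing taken from above is the formula.

```r
# Hypothetical helper implementing the formula above:
#   Delta PEI_n = (M_{n+1} - M_{n-2}) / 3
# `M` is a named numeric vector of yearly means; the names are the years.
delta_pei <- function(M, n) {
  (M[[as.character(n + 1)]] - M[[as.character(n - 2)]]) / 3
}

# Illustrative (made-up) yearly means
M <- c("2021" = 9.0, "2022" = 8.4, "2023" = 8.1, "2024" = 7.5)
d1 <- delta_pei(M, 2023)

# Changing the two middle years leaves the increment unchanged
M_alt <- M
M_alt[c("2022", "2023")] <- c(20, 30)
d2 <- delta_pei(M_alt, 2023)

identical(d1, d2)
```

Only the first and last of the four years enter the result, which is why the standard error calculation would be simpler than it first looks.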
Uncertainty
How is uncertainty being handled? E.g. sample standard error?
We can consider the uncertainty from the sequence of calculations, since these are just a sequence of statistics calculated from data that is (in theory) a random sample from the population:
Standard error in individual year-site means
Standard error in region-wide year means (could be estimated using SE above?)
Standard error of \(\Delta PEI\)
Standard error in sum of \(\Delta PEI\) to find difference over multiple years
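For the last two steps, a rough sketch of how the errors might propagate, using the formula \(\Delta PEI_n = \frac{1}{3}(M_{n+1} - M_{n-2})\) from above. The assumptions here are mine: the yearly region-wide means \(M_k\) are treated as independent, each with standard error \(\operatorname{SE}(M_k)\), and the same set of sites is used for every increment. Neither is obviously true, since the same sites contribute to both years (a positive covariance would shrink the variance) and eligibility is decided per 4-year window.

\[
\operatorname{SE}(\Delta PEI_n) = \frac{1}{3}\sqrt{\operatorname{SE}(M_{n+1})^2 + \operatorname{SE}(M_{n-2})^2}
\]

\[
\sum_{n=a}^{b} \Delta PEI_n = \frac{1}{3}\left[\left(M_{b-1} + M_{b} + M_{b+1}\right) - \left(M_{a-2} + M_{a-1} + M_{a}\right)\right]
\]

Under those assumptions the sum telescopes, so its uncertainty depends on at most six yearly means, however many increments are added. The increments themselves are not independent of one another because they share \(M_k\) terms, so their standard errors should not simply be added in quadrature.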
I can’t find any mention of a threshold on the uncertainty of the estimate, which I think means it is impossible to determine how many sensors are needed. But perhaps there is one and I’ve missed it, or it isn’t on the webpage.
What do the data look like?
Download data for some sites and plot it:
Sites nearby to one another and of similar types
Sites of different types?
Sites far from one another and of different types?
How noisy are the data? Are there broad trends? Are they correlated? Plot at different scales.
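For the correlation question, a minimal base-R sketch, assuming a data frame with columns `code`, `date` (POSIXct) and `pm2.5`. The helper name `daily_site_cor` and the toy data are made up for illustration. It averages to daily means first, which is one choice of scale among several.

```r
# Hypothetical helper: correlation between sites of their daily mean pm2.5.
# `df` is assumed to have columns code, date (POSIXct) and pm2.5.
daily_site_cor <- function(df) {
  df$day <- as.Date(df$date)
  # Mean per site and day; rows with missing pm2.5 are dropped by aggregate()
  daily <- aggregate(pm2.5 ~ code + day, data = df, FUN = mean)
  # One row per day, one column per site
  wide <- reshape(daily, idvar = "day", timevar = "code", direction = "wide")
  site_cols <- setdiff(names(wide), "day")
  cor(wide[, site_cols, drop = FALSE], use = "pairwise.complete.obs")
}

# Toy data (made up): two sites, three days, two readings per day
toy <- data.frame(
  code  = rep(c("A", "B"), each = 6),
  date  = rep(as.POSIXct("2023-01-01 00:00", tz = "UTC") +
                c(0, 1, 24, 25, 48, 49) * 3600, 2),
  pm2.5 = c(1, 3, 5, 7, 2, 4, 2, 6, 10, 14, 4, 8)
)
r <- daily_site_cor(toy)
```

Swapping `as.Date()` for a coarser or finer grouping would give the same comparison at other time scales.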
Make a list of the data (probably a bad way to do it), restrict to one year. Remove problematic codes manually (ones with no data for 2022). This gets us from 9 to 3.
Of the 9 codes in the area I selected, none would be able to produce a \(\Delta PEI\) involving the year 2022 (i.e. for 2021 to 2024).
Try a wider area, with a view to calculating the 2023-24 increment, so I need data for 2021 to 2024.
all_bg_codes = aurn_background$code
Go through full list of codes, selecting what seem to be the eligible ones:
library(openair)  # importAURN()
library(dplyr)    # select()

aurn_23_full = list()
for (code in all_bg_codes) {
  ## See if the data will even import
  ## I think this fails if there's no data in that time range
  try_code = try(importAURN(site = code, year = 2021:2024))
  print(sprintf("Done: %s", code))
  if (inherits(try_code, "try-error")) {
    print(sprintf("Error: %s", code))
  } else {
    ## Want to only keep those sites with enough data for this time period
    # First check there is a pm2.5 column
    # then check it's at least 85% complete
    # If yes to both, add it to the list
    if ("pm2.5" %in% names(try_code)) {
      # Fraction of rows with missing pm2.5 (NAs over all rows, not NAs over non-NAs)
      na_frac = mean(is.na(try_code$pm2.5))
      if (isTRUE(na_frac <= 0.15)) {
        try_keep = try_code |> select(code, date, pm2.5)
        aurn_23_full[[code]] = try_keep
      } else {
        print(sprintf("Insufficient data: %s", code))
      }
    } else {
      print(sprintf("No column for pm2.5: %s", code))
    }
  }
}
save(aurn_23_full, file = "aurn_23_full.Rdata")
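The check above looks at the whole 2021 to 2024 period at once. If the completeness requirement is meant to hold for each calendar year separately (an assumption on my part, not something I have confirmed), a per-year version could look like the sketch below. The helper name `yearly_capture` and the toy data are hypothetical.

```r
# Hypothetical helper: fraction of non-missing pm2.5 values in each calendar year.
# Only rows present in `df` are counted, so hours that are absent from the
# download altogether are not penalised.
yearly_capture <- function(df) {
  yr <- format(df$date, "%Y")
  tapply(!is.na(df$pm2.5), yr, mean)
}

# Toy data (made up): 2021 is half missing, 2022 is complete
toy_site <- data.frame(
  date  = as.POSIXct(c("2021-06-01", "2021-06-02",
                       "2022-06-01", "2022-06-02"), tz = "UTC"),
  pm2.5 = c(1, NA, 2, 3)
)
cap <- yearly_capture(toy_site)

# Keep the site only if every year reaches the 85% threshold
keep_site <- all(cap >= 0.85)
```

A site with one poor year and three good ones could pass the whole-period check but fail this one, so the two approaches can select different sites.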
How much of a disaster is it if I plot all the data?! I have to get it down to a very small time-scale in order to see any detail. It is also clearer on a log scale because of the extreme peaks.