We recently received a data source called InfrDensity from the European Commission. This data was generated by the Joint Research Centre in a Freshwater Ecosystem Assessment. The full report is available here. This metric was generated using a combination of Corine Land Cover and the density of aquatic infrastructure based on the OpenStreetMap datasource. Publicly, these data are available at the NUTS Classification, however that is not useful for us. As such, we received this datasource at a finer scale.
This document details the datasource we received and potential ways we can use it for a European-wide tool in the NAVIDIV project. As an outline I’ll show the datasource, and walk through how I think we should use it.
.
# Packages
library(tidyverse)
library(sf)
library(mapview)
infra <- st_read("../InfrDensity/infrDensity.gdb/a00000009.gdbtable")
## Reading layer `InfrDensity' from data source
## `C:\Users\Aaron Sexton\Desktop\InfrDensity\infrDensity.gdb\a00000009.gdbtable'
## using driver `OpenFileGDB'
## Simple feature collection with 311369 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 2647600 ymin: 1462200 xmax: 6787300 ymax: 5380300
## Projected CRS: ETRS89-extended / LAEA Europe
ggplot() + geom_sf(data = infra) + aes()
Pan-European Cover. Can’t make an interactive map with this because the file is too large So can show you still images of the map zoomed in a bit so that we can see the coverage.
Very fine scale polygons.
When we zoom in even more and add a 1km buffer around each study
site, we can see an issue. Within each 1km buffer are multiple polygons
with multiple values.
In QGIS I clipped this layer at 1km surrounding each study site in the Seine. below is a look into that csv file
infraden <- read.csv("../InfrDensity/infrdensity_seine_1k.csv")
head(infraden)
## OBJECTID HydroID CatchmentID AreaSqKm InfrDensity Shape_Length Shape_Area fid
## 1 65098 8088365 337714 0.50 1.07 4200 500000 20
## 2 65098 8088365 337714 0.50 1.07 4200 500000 21
## 3 65098 8088365 337714 0.50 1.07 4200 500000 146
## 4 65104 8088432 337867 3.20 0.97 11400 3200000 21
## 5 65137 8088687 338435 0.78 12.62 5000 780000 46
## 6 65137 8088687 338435 0.78 12.62 5000 780000 178
## study station size
## 1 Seine_Fish_AJ 3080003 392285.62
## 2 Seine_Fish_AJ 3080036 135374.35
## 3 Seine_Inverts_ET 3149176 193254.56
## 4 Seine_Fish_AJ 3080036 1743760.23
## 5 Seine_Fish_AJ 3134000 37270.52
## 6 Seine_Inverts_ET 3134000 37195.69
This gives us a table of all the polygons within the 1km buffer of
each site Since there may be multiple values per polygon there are
multiple rows of varying sizes and values for each site.
This is an issue, because we only want one value per site. So, lets do
that.
Briefly, I took the proportion each polygon makes up of the total coverage within the 1km buffer, and multiplied the InfrDensity score by the proportion. This way the score is weighted by how present it is.
# Remove unnecessary columns
infraden2 <- dplyr::select(infraden, station, InfrDensity, size)
str(infraden2)
## 'data.frame': 503 obs. of 3 variables:
## $ station : int 3080003 3080036 3149176 3080036 3134000 3134000 3154000 3135000 3173000 3137000 ...
## $ InfrDensity: num 1.07 1.07 1.07 0.97 12.62 ...
## $ size : num 392286 135374 193255 1743760 37271 ...
# Now lets create a column that just tells us the total size of riparian areas
# within the 1km buffer for each site. Because we may have meanders, or thick or
# skinny stretches.
infraden3 <- aggregate(infraden2$size, by = list(station = infraden2$station), FUN = sum)
head(infraden3)
## station x
## 1 909090 935949
## 2 3004095 1430207
## 3 3004349 4341862
## 4 3005200 2348466
## 5 3005999 2213552
## 6 3006000 3868853
infraden3 <- dplyr::left_join(infraden2, infraden3, by =c("station" = "station"))
infraden3 <- dplyr::rename(infraden3, totalsize = x)
head(infraden3)
## station InfrDensity size totalsize
## 1 3080003 1.07 392285.62 2214292
## 2 3080036 1.07 135374.35 2047613
## 3 3149176 1.07 193254.56 2094815
## 4 3080036 0.97 1743760.23 2047613
## 5 3134000 12.62 37270.52 3269968
## 6 3134000 12.62 37195.69 3269968
# Now lets calculate the proportion each little polygon makes up for the total
# size coverage of each 1km polygon
infraden3 <- dplyr::mutate(infraden3, proportion = size / totalsize)
head(infraden3)
## station InfrDensity size totalsize proportion
## 1 3080003 1.07 392285.62 2214292 0.17716072
## 2 3080036 1.07 135374.35 2047613 0.06611324
## 3 3149176 1.07 193254.56 2094815 0.09225374
## 4 3080036 0.97 1743760.23 2047613 0.85160627
## 5 3134000 12.62 37270.52 3269968 0.01139783
## 6 3134000 12.62 37195.69 3269968 0.01137494
# Now lets multiply the InfrDensity score of each polygon by the proportion cover
infraden3 <- dplyr::mutate(infraden3, weighted = InfrDensity * proportion)
# Then sum all the values for each site
infraden4 <- aggregate(infraden3$weighted, by = list(station = infraden3$station), FUN = sum)
infraden4 <- dplyr::rename(infraden4, cum_wtd_infrden = x)
head(infraden4)
## station cum_wtd_infrden
## 1 909090 3.819643
## 2 3004095 4.553032
## 3 3004349 2.590000
## 4 3005200 3.443493
## 5 3005999 1.319771
## 6 3006000 1.320000
hist(sqrt(infraden4$cum_wtd_infrden))
So this now gives us one value of infrastructure density. It seems to make sense when I check it against the map, but that’s just spot checking, its not reproducible. So how do we ensure it is the correct way to do this?
May be better to standardize the process at the 25km stretch? With Cybill’s 25km buffer? Might make more sense at the European scale.
.
Let’s plot them out to visualize that
# Upload the sites from biodiversity collection with no data in them
sites_seine <- st_read("../InfrDensity/seine_pts.gpkg")
## Reading layer `seine_pts' from data source
## `C:\Users\Aaron Sexton\Desktop\InfrDensity\seine_pts.gpkg'
## using driver `GPKG'
## Simple feature collection with 206 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 3651566 ymin: 2739610 xmax: 3963050 ymax: 2976726
## Projected CRS: ETRS89-extended / LAEA Europe
# Attach the infrdensity value we calculated
infra5 <- left_join(sites_seine, infraden4, by = c("station" = "station"))
# sqrt convert it to normality
infra5$sqrtd <- sqrt(infra5$cum_wtd_infrden)
# Plot it out
mapview(infra5, zcol = "sqrtd")
Calculating the weighted average of infrastructure density within the watershed makes sense both biologically and statistically.
We can use the HydroBASINS layer created by the HydroSHEDS group, which can be found here.
What is particularly great about this data source is that we can select the scale that makes the most sense - there are 12 levels of spatial fineness. Lets take a look at these different layers spatial size.
largebasin <- st_read("../InfrDensity/hybas_eu_lev02_v1c.shp")
## Reading layer `hybas_eu_lev02_v1c' from data source
## `C:\Users\Aaron Sexton\Desktop\InfrDensity\hybas_eu_lev02_v1c.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 9 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -24.54232 ymin: 12.59131 xmax: 69.55452 ymax: 81.85898
## Geodetic CRS: WGS 84
ggplot() + geom_sf(data = largebasin) + aes()
Obviously too large a scale for us.
finebasin <- st_read("../InfrDensity/hybas_eu_lev12_v1c.shp")
## Reading layer `hybas_eu_lev12_v1c' from data source
## `C:\Users\Aaron Sexton\Desktop\InfrDensity\hybas_eu_lev12_v1c.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 137658 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -24.54232 ymin: 12.59131 xmax: 69.55452 ymax: 81.85898
## Geodetic CRS: WGS 84
ggplot() + geom_sf(data = finebasin) + aes()+ xlim(-2, 9) + ylim(42, 51)
Probably too fine a scale for us.
Lets find the goldilocks layer.
goldbasin <- st_read("../InfrDensity/hybas_eu_lev07_v1c.shp")
## Reading layer `hybas_eu_lev07_v1c' from data source
## `C:\Users\Aaron Sexton\Desktop\InfrDensity\hybas_eu_lev07_v1c.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 7820 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -24.54232 ymin: 12.59131 xmax: 69.55452 ymax: 81.85898
## Geodetic CRS: WGS 84
ggplot() + geom_sf(data = goldbasin) + aes() + xlim(-2, 9) + ylim(42, 51)
This seems great.
Now in QGIS I combined the HydroSHEDS layer with the InfrDensity layer so that it tells us which InfrDensity polygons are within each HydroSHEDS polygons so that we can run that same math as above.
infr_inhydro <- read.csv("../InfrDensity/infraden_in_hydrosheds.csv")
head(infr_inhydro)
## OBJECTID HydroID CatchmentID AreaSqKm InfrDensity Shape_Length Shape_Area
## 1 1 8000004 18403 0.01 14.88 400 10000
## 2 2 8000017 18446 0.14 2.98 4200 140000
## 3 3 8000033 18469 1.84 1.25 9400 1840000
## 4 4 8000036 18476 0.28 1.32 3000 280000
## 5 5 8000051 18503 0.79 2.62 5000 790000
## 6 6 8000054 18512 0.10 0.00 3600 100000
## HYBAS_ID NEXT_DOWN NEXT_SINK MAIN_BAS DIST_SINK DIST_MAIN SUB_AREA
## 1 2070027050 0 2070027050 2070027050 0 0 2221.3
## 2 2070033080 0 2070033080 2070033080 0 0 2588.5
## 3 2070033080 0 2070033080 2070033080 0 0 2588.5
## 4 2070032600 0 2070032600 2070032600 0 0 3429.9
## 5 2070032600 0 2070032600 2070032600 0 0 3429.9
## 6 2070032600 0 2070032600 2070032600 0 0 3429.9
## UP_AREA PFAF_ID ENDO COAST ORDER SORT size
## 1 2221.3 2427050 0 1 0 2375 10000
## 2 2588.5 2447902 0 0 1 2886 140000
## 3 2588.5 2447902 0 0 1 2886 1840000
## 4 3429.9 2447506 0 0 1 2878 280000
## 5 3429.9 2447506 0 0 1 2878 790000
## 6 3429.9 2447506 0 0 1 2878 100000
# Lets get ride of some unnecessary columns
infr_inhydro <- dplyr::select(infr_inhydro, InfrDensity, size, HYBAS_ID)
head(infr_inhydro)
## InfrDensity size HYBAS_ID
## 1 14.88 10000 2070027050
## 2 2.98 140000 2070033080
## 3 1.25 1840000 2070033080
## 4 1.32 280000 2070032600
## 5 2.62 790000 2070032600
## 6 0.00 100000 2070032600
# Now lets sum the sizes of all the infrden polygons within each shed polygon
infrhydro2 <- aggregate(infr_inhydro$size, by = list(shed = infr_inhydro$HYBAS_ID), FUN = sum)
head(infrhydro2)
## shed x
## 1 2070001320 10289580.35
## 2 2070001360 206617398.24
## 3 2070001370 56663.07
## 4 2070001450 7408958.97
## 5 2070001460 83280775.02
## 6 2070001720 31264945.29
# Now we can add that back into the main df
infrhydro3 <- dplyr::left_join(infr_inhydro, infrhydro2, by = c("HYBAS_ID" = "shed"))
infrhydro3 <- dplyr::rename(infrhydro3, ripinshed = x)
head(infrhydro3)
## InfrDensity size HYBAS_ID ripinshed
## 1 14.88 10000 2070027050 98088710
## 2 2.98 140000 2070033080 177751269
## 3 1.25 1840000 2070033080 177751269
## 4 1.32 280000 2070032600 575067545
## 5 2.62 790000 2070032600 575067545
## 6 0.00 100000 2070032600 575067545
# Now we calculate the proportion cover each infr polygon makes up for the total
# infr cover within each hydroshed
infrhydro4 <- dplyr::mutate(infrhydro3, proportion = size / ripinshed)
head(infrhydro4)
## InfrDensity size HYBAS_ID ripinshed proportion
## 1 14.88 10000 2070027050 98088710 0.0001019485
## 2 2.98 140000 2070033080 177751269 0.0007876174
## 3 1.25 1840000 2070033080 177751269 0.0103515435
## 4 1.32 280000 2070032600 575067545 0.0004868993
## 5 2.62 790000 2070032600 575067545 0.0013737517
## 6 0.00 100000 2070032600 575067545 0.0001738926
# Now we multiply the InfrDensity score of each polygon by the proportion cover
infrhydro4 <- dplyr::mutate(infrhydro4, weighted = InfrDensity * proportion)
# Then sum all the values for each watershed
infrhydro5 <- aggregate(infrhydro4$weighted, by = list(shed = infrhydro4$HYBAS_ID), FUN = sum)
head(infrhydro5)
## shed x
## 1 2070001320 0.0007815448
## 2 2070001360 0.7172531324
## 3 2070001370 0.0032223951
## 4 2070001450 0.3489788212
## 5 2070001460 2.5629695115
## 6 2070001720 0.9638749491
infrhydro5 <- dplyr::rename(infrhydro5, cum_wtd_infrden = x)
head(infrhydro5)
## shed cum_wtd_infrden
## 1 2070001320 0.0007815448
## 2 2070001360 0.7172531324
## 3 2070001370 0.0032223951
## 4 2070001450 0.3489788212
## 5 2070001460 2.5629695115
## 6 2070001720 0.9638749491
#Put them back together
infrhydro4 <- dplyr::left_join(infrhydro4, infrhydro5, by = c("HYBAS_ID" = "shed"))
head(infrhydro4)
## InfrDensity size HYBAS_ID ripinshed proportion weighted
## 1 14.88 10000 2070027050 98088710 0.0001019485 0.0015169942
## 2 2.98 140000 2070033080 177751269 0.0007876174 0.0023471000
## 3 1.25 1840000 2070033080 177751269 0.0103515435 0.0129394294
## 4 1.32 280000 2070032600 575067545 0.0004868993 0.0006427071
## 5 2.62 790000 2070032600 575067545 0.0013737517 0.0035992294
## 6 0.00 100000 2070032600 575067545 0.0001738926 0.0000000000
## cum_wtd_infrden
## 1 0.7165765
## 2 2.0398237
## 3 2.0398237
## 4 1.0234892
## 5 1.0234892
## 6 1.0234892
# Looks nice!
Now lets plot it out to see how it looks spatially.
# We can combine the table we just made with the GIS layer we looked at earlier
# We'll keep just the hydroshed ID and the score we created to make the join clean
shedscore <- dplyr::select(infrhydro4, HYBAS_ID, cum_wtd_infrden)
shedscore$infraden <- sqrt(shedscore$cum_wtd_infrden)
shedscore <- dplyr::select(shedscore, HYBAS_ID, infraden)
str(shedscore)
## 'data.frame': 318215 obs. of 2 variables:
## $ HYBAS_ID: int 2070027050 2070033080 2070033080 2070032600 2070032600 2070032600 2070032600 2070032280 2070025100 2070032600 ...
## $ infraden: num 0.847 1.428 1.428 1.012 1.012 ...
str(goldbasin)
## Classes 'sf' and 'data.frame': 7820 obs. of 14 variables:
## $ HYBAS_ID : num 2.07e+09 2.07e+09 2.07e+09 2.07e+09 2.07e+09 ...
## $ NEXT_DOWN: num 0.00 0.00 0.00 0.00 2.07e+09 ...
## $ NEXT_SINK: num 2.07e+09 2.07e+09 2.07e+09 2.07e+09 2.07e+09 ...
## $ MAIN_BAS : num 2.07e+09 2.07e+09 2.07e+09 2.07e+09 2.07e+09 ...
## $ DIST_SINK: num 0 0 0 0 21.2 ...
## $ DIST_MAIN: num 0 0 0 0 21.2 ...
## $ SUB_AREA : num 5.1 6264.2 7763.3 308.9 3612.7 ...
## $ UP_AREA : num 5.1 6264.2 7763.3 23856.5 3613.1 ...
## $ PFAF_ID : int 2110110 2110120 2110130 2110141 2110142 2110143 2110144 2110145 2110147 2110146 ...
## $ ENDO : int 0 0 0 0 0 0 0 0 0 0 ...
## $ COAST : int 1 0 1 0 0 0 0 0 0 0 ...
## $ ORDER : int 0 1 0 1 2 1 2 1 1 2 ...
## $ SORT : num 1 2 3 4 5 6 7 8 9 10 ...
## $ geometry :sfc_MULTIPOLYGON of length 7820; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:9, 1:2] 32.3 32.3 32.3 32.3 32.3 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:13] "HYBAS_ID" "NEXT_DOWN" "NEXT_SINK" "MAIN_BAS" ...
goldbasin2 <- dplyr::left_join(goldbasin, shedscore, by = c("HYBAS_ID" = "HYBAS_ID"))
goldbasin2 <- goldbasin2 %>%
drop_na(infraden)
goldbasin2 <- dplyr::select(goldbasin2, HYBAS_ID, infraden, geometry)
# Saving it locally - st_write(goldbasin2, "../InfrDensity/HSHEDS_infraden.gpkg")
iden <- st_read("../InfrDensity/HSHEDS_infraden.gpkg")
## Reading layer `HSHEDS_infraden' from data source
## `C:\Users\Aaron Sexton\Desktop\InfrDensity\HSHEDS_infraden.gpkg'
## using driver `GPKG'
## Simple feature collection with 2416 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -10.66733 ymin: 34.63693 xmax: 41.01665 ymax: 71.18401
## Geodetic CRS: WGS 84
hist(iden$infraden)
hist(sqrt(iden$infraden))
ggplot(data = iden) + geom_sf(data = iden, aes(fill = infraden)) +
scale_fill_steps(low = "royalblue2", high = "maroon",
breaks = c(0.75, 1.25, 1.75, 2.25, 2.75))
Looks good. We likely will need to sqrt transform the data to normality, but this looks promising for a European scale assessment?