Introduction


After 10 years of cycling on the Strava platform, you might want to know in which places you have cycled. And where you have not or rarely ever been spotted. The Strava heatmap gives a nice impression. Some areas are apparently more favored than others. What quantifiable cycle triggers could underlie the image?

Map 1. A cyclist's Strava heatmap [List of maps](#maplist)

Map 1. A cyclist’s Strava heatmap List of maps



This article is about the characteristics of the most visited places, versus the empty spots on the heatmap. For that purpose we try to collect possibly relevant stats from public sources. When looking for data, we keep in mind why certain places in this country attracts cyclists.

1. First, the distance to the home address is of course very decisive: the further from home, the longer the travel time to get there. Usually you leave through your own back door. Sometimes you go on a day trip with your racing bike on the bicycle carrier, or you book a short cycling holiday.
2. Sparsely populated with good roads and cycle paths favorable for speed.
3. Hilly enough to allow some climbing and descending.
4. Environmental factors such as forest, heathland, water in the form of lakes, rivers, coastal areas.
5. Authentic places with history, special architecture, picturesque sites.

The title-term ‘city’ has different meanings around the world and in some places the settlement can be very small. Since we’re using its data, we’ll start with the question: what does Strava mean by ‘city’? In addition to ‘country’, Strava offers two country-dependent spatial units, firstly ‘state’: this can be a large part of the country or a province - and secondly ‘city’. In the Netherlands, this is now the municipality. Previously, a lower spatial level was used, the place/village/city.2 Strava does not perform backward adjustments to any spatial layout. Thus, a reference table town/village/city - municipality should guarantee a uniform classification. In short, the municipality will be the spatial entity in this writing.


Data collection

Below, collected data and their sources are listed successively, together with all R coding.


1. Municipalities of the Netherlands


First of all, we need an official national list of Dutch municipalities. With the exception of the three overseas municipalities.3 These literally don’t fit well on the map.
The number of municipalities on January 1, 2022 was 345.


version[['version.string']]
## [1] "R version 4.3.2 (2023-10-31 ucrt)"
library(readr)
library(readxl)
library(sf)          
library(tidyverse)   

options(scipen = 80)

# Municipalities of the Netherlands on January 1, 2022
# https://www.cbs.nl/nl-nl/onze-diensten/methoden/classificaties/overig/gemeentelijke-indelingen-per-jaar/indeling-per-jaar/gemeentelijke-indeling-op-1-januari-2022

CBS_municipalities_1_1_2022 <- read_excel("data/CBS gemeenten 1-1-2022.xlsx") %>% 
  rename(Province_code = ProvinciecodePV, Province = Provincienaam, Municipality = Gemeentenaam, Municipality_code = GemeentecodeGM) %>%
  select(- "Gemeentecode", - "Provinciecode")

nrow(CBS_municipalities_1_1_2022)
## [1] 345
head(CBS_municipalities_1_1_2022)
## # A tibble: 6 × 4
##   Municipality_code Municipality  Province_code Province     
##   <chr>             <chr>         <chr>         <chr>        
## 1 GM1680            Aa en Hunze   PV22          Drenthe      
## 2 GM0358            Aalsmeer      PV27          Noord-Holland
## 3 GM0197            Aalten        PV25          Gelderland   
## 4 GM0059            Achtkarspelen PV21          Fryslân      
## 5 GM0482            Alblasserdam  PV28          Zuid-Holland 
## 6 GM0613            Albrandswaard PV28          Zuid-Holland


2. Number of protected buildings/objects


In the Netherlands, a national monument, in Dutch a “Rijksmonument”, is a protected building or object, or the remainder thereof, that is of general interest because of its beauty, its significance for science or its cultural-historical value. In 2022, the Netherlands had 63,207 national monuments, of which 7,700 were located in Amsterdam.


# Number of protected buildings/objects per municipality on January 1, 2022

# https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=85538NED&_theme=429

# 2023 backward coding to 2022.

#        2022                          2023      
# Weesp           GM0457          Amsterdam       GM0363
# Brielle         GM0501          Voorne aan Zee  GM1992
# Hellevoetsluis  GM0530          Voorne aan Zee  GM1992
# Westvoorne      GM0614          Voorne aan Zee  GM1992

correction_2022 <- tribble(
  ~Municipality_code, ~n_of_Protected_objects,
  "GM0457",   221,
  "GM0501",   366,
  "GM0530",   49,
  "GM0614",   0  )

Protected_objects_per_municipality_2022 <- read.csv("~/R/Strava/data/rijksmonumenten per gemeente 1-1-2023.csv", header=TRUE, sep=";") %>% 
  select (RegioS, Rijksmonumenten_1) %>% 
  rename (n_of_Protected_objects = Rijksmonumenten_1, Municipality_code = RegioS) %>% 
  filter (Municipality_code != "GM1992") %>%   # Voorne aan Zee eruit (is nieuw per 1-1-23)
  mutate(n_of_Protected_objects = ifelse(Municipality_code == "GM0363", n_of_Protected_objects - 221, n_of_Protected_objects)) %>% 
  group_by(Municipality_code) %>% 
  summarise(n_of_Protected_objects = sum(n_of_Protected_objects)) %>% 
  rbind(., correction_2022) 

rm(correction_2022)

head(Protected_objects_per_municipality_2022)
## # A tibble: 6 × 2
##   Municipality_code n_of_Protected_objects
##   <chr>                              <dbl>
## 1 GM0014                               844
## 2 GM0034                                15
## 3 GM0037                                71
## 4 GM0047                                76
## 5 GM0050                                 1
## 6 GM0059                                65
sum(Protected_objects_per_municipality_2022$n_of_Protected_objects)
## [1] 63207


3. Population, population density, water surface


Traffic congestion in urban communities hinders cyclists. Nevertheless, there are surprisingly many Strava segments in such an area, at least in the Netherlands. Some large municipalities consciously focus on a good cycling network. We cannot escape the number of inhabitants and population density.
A water-rich environment attracts cyclists in this country. Consider, for example, the Frisian lakes and the dikes along the major rivers and the IJsselmeer.


options(width = 2000)

# Per municipality: population, population density, water surface in square km.

# https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=70072ned&_theme=235
# choose "downloads, onbewerkte dataset" en selecteer jaar, regioindeling, totale bevolking, bevolkingsdichtheid, water totaal.

CBS_population_1_1_2022 <- read_delim("data/CBS bevolkingsgegevens 1-1-2022.csv", 
                                           delim = ";", escape_double = FALSE, trim_ws = TRUE) %>%
          na.omit() %>%
          rename(Population_1_1_2022 = "TotaleBevolking_1") %>% 
          rename(Population_density_1_1_2022 = Bevolkingsdichtheid_57) %>% 
          rename(km2_Water_surface = WaterTotaal_245) %>% 
          rename(Municipality_code = RegioS) %>%
          select( - c(ID, Perioden))

head(CBS_population_1_1_2022)
## # A tibble: 6 × 4
##   Municipality_code Population_1_1_2022 Population_density_1_1_2022 km2_Water_surface
##   <chr>                           <dbl>                       <dbl>             <dbl>
## 1 GM1680                          25579                          93              2.81
## 2 GM0358                          32452                        1615             12.2 
## 3 GM0197                          27100                         281              0.52
## 4 GM0059                          27954                         274              1.77
## 5 GM0482                          20087                        2290              1.29
## 6 GM0613                          25934                        1197              2.09


4. Forest

The number of hectares of forest per municipality is an indicator for a beautiful, wooded cycling environment.


# Hectares of forest per municipality as defined by CBS (they exclude city parks).
# 2015 translated to 2022. 
# download csv file: https://opendata.cbs.nl/statline/#/CBS/nl/dataset/70262NED/table
# most recent data (2015) converted to municipality classification 1-1-2022.
# 341.266 ha bos - that is about 10% of the land area.

ha_Forest <- read_delim("data/ha_bos_2015.csv", 
                          delim = ";", escape_double = FALSE, trim_ws = TRUE) %>% 
  rename (Municipality_code = GemeentecodeGM, ha_Forest = ha_Bos_2015)


head(ha_Forest)
## # A tibble: 6 × 2
##   Municipality_code ha_Forest
##   <chr>                 <dbl>
## 1 GM0014                  460
## 2 GM0034                 2738
## 3 GM0037                  558
## 4 GM0047                   54
## 5 GM0050                 4863
## 6 GM0059                   59


5. Nature/environment


This score is primarily aimed at the quality of living in a municipality in terms of nature.


# Score on nature/environment.

# "Natuurscore" 2022 Bureau Louter / Elsevier. range 1 = high to 8 = low
# Excel download:
# https://www.ewmagazine.nl/nederland/achtergrond/2022/08/dit-zijn-de-pluspunten-van-alle-344-gemeenten-24588w/
# The list shows 344 municipalities (reference date spring 2022): Weesp became part of Amsterdam in January 2022. Weesp gets its 2021 score.

Environment_nature_score_1_1_2022 <- read_excel("data/Gemeenten natuurscore 1-1-2022.xlsx") %>% 
  rename (Municipality_code = GemeentecodeGM, Environment_nature_score = Natuurscore)

head(Environment_nature_score_1_1_2022)
## # A tibble: 6 × 2
##   Municipality_code Environment_nature_score
##   <chr>                                <dbl>
## 1 GM0050                                   1
## 2 GM0060                                   1
## 3 GM0088                                   1
## 4 GM0093                                   1
## 5 GM0096                                   1
## 6 GM0233                                   1


6. Natura 2000

Natura 2000 is a European network of protected natural areas. In these areas, certain animals, plants and their natural habitat are protected to maintain biodiversity. The Dutch contribution to the European network of protected nature reserves (Natura 2000) consists of 162 areas. These areas are located both on land and at sea with 246 municipalities involved.


# Natura2000 associated municipalities.


# List of Natura 2000 areas and associated municipalities. 
# The Netherlands has 162 Natura 2000 areas spread over 246 municipalities.
# https://www.natura2000.nl/gebieden

Municipalities_Natura2000 <- read_excel("data/Gemeenten Natura2000.xlsx") %>%
  select(- "Nr_gebied", - "Provincie") %>%
  filter (Gemeenten != '(zie 129)') %>% 
  mutate(Municipality = str_split(Gemeenten, ",")) %>%
  unnest(Municipality) %>%
  mutate(Municipality = trimws(Municipality)) %>% 
  group_by(Municipality) %>%
  summarize(n_of_Natura2000_areas = n())

head(Municipalities_Natura2000)
## # A tibble: 6 × 2
##   Municipality     n_of_Natura2000_areas
##   <chr>                            <int>
## 1 's-Gravenhage                        3
## 2 's-Hertogenbosch                     1
## 3 Aa en Hunze                          2
## 4 Alblasserdam                         1
## 5 Albrandswaard                        1
## 6 Alkmaar                              1


7. Villages, towns and hamlets


A reference table with places/villages/hamlets is necessary because Strava used this smaller-scale division for its segments until recently. Nowadays Strava uses municipal entities for the Netherlands.


# Reference table 6000 villages and towns and hamlets in The Netherlands, including GPS coordinates.

Reference_village_town_hamlet_2022 <- read_excel("data/plaatsnamen thesaurus 2022.xlsx") %>% 
  rename (Province = Provincie, City = Plaats, Municipality_code = GemeentecodeGM, Municipality = Gemeente) %>% 
  arrange(City, Municipality, Municipality_code, Province) %>% 
  mutate(Latitude = round(as.numeric(Latitude), 7)) %>% 
  mutate(Longitude = round(as.numeric(Longitude), 7))

head(Reference_village_town_hamlet_2022)
## # A tibble: 6 × 6
##   City                  Municipality   Municipality_code Province      Latitude Longitude
##   <chr>                 <chr>          <chr>             <chr>            <dbl>     <dbl>
## 1 's-Graveland          Wijdemeren     GM1696            Noord-Holland     52.2      5.12
## 2 's-Gravendeel         Hoeksche Waard GM1963            Zuid-Holland      51.8      4.62
## 3 's-Gravenhage         's-Gravenhage  GM0518            Zuid-Holland      52.1      4.30
## 4 's-Gravenmoer         Dongen         GM0766            Noord-Brabant     51.7      4.94
## 5 's-Gravenmoerse Vaart Dongen         GM0766            Noord-Brabant     NA       NA   
## 6 's-Gravenpolder       Borsele        GM0654            Zeeland           51.5      3.90


8. Hilly area


Hills in this predominantly very flat country? Maybe you prefer to call them molehills or microrelief. Anyway, there are a number of irregularities in the landscape, apart from dikes and viaducts. The southernmost part of the Netherlands, South Limburg, has been lifted by tectonics, creating a plateau landscape in which a large number of valleys have been carved out by water erosion. Furthermore, the penultimate ice age - the Saale glacial stage - has left its mark on the landscape, especially in the middle of the country. On top of the Vaalserberg you will find the highest point in the Netherlands, at 323 meters above sea level.
A very interesting site for anyone who loves climbing is Climbfinder (https://www.climbfinder.com). All climbs on Climbfinder are arranged by countries and regions. For each climb, they provide key facts such as its length in km, average gradient (%), and the total ascent in meters. In addition, every ascent has a difficulty indicator allowing climbs to be easily compared with each other.4 The indicator is associated with the Belgian Encyclopedia Cotacol.
Not unimportant: the location/city/hamlet is mentioned, useful for linking to a reference table.5


# Climbfinder "ascents" in the Netherlands.

# https://climbfinder.com/nl
# 98 municipalities with 'ascents'.
# Climbing points according to COTACOL score formula. This is based on 100 meter stretches.
# The score is calculated based on various parameters such as length, average slope, maximum slope angle 
# and road surface condition. The higher the number of points, the more difficult the climb.

Climbfinder_ascents_Netherlands <- read_delim("data/Climbfinder beklimmingen in Nederland.csv", 
                                                    delim = ";", escape_double = FALSE, locale = locale(decimal_mark = ",", grouping_mark = ".")) %>%
  rename(City = Plaats, Province = Provincie) %>% 
  left_join(Reference_village_town_hamlet_2022, by=c("City", "Province")) %>% 
  select(- c(Latitude, Longitude)) %>% 
  group_by(Municipality_code) %>%
  summarise(n_of_Climbs=n(), .groups = 'drop', 
            Climbing_points=sum(Klimpunten),
            mean_Climbing_points = round(Climbing_points / n_of_Climbs, 1))

head(Climbfinder_ascents_Netherlands)
## # A tibble: 6 × 4
##   Municipality_code n_of_Climbs Climbing_points mean_Climbing_points
##   <chr>                   <int>           <dbl>                <dbl>
## 1 GM0014                      3               9                  3  
## 2 GM0093                      3              15                  5  
## 3 GM0096                      1              27                 27  
## 4 GM0114                      2              41                 20.5
## 5 GM0148                      1               5                  5  
## 6 GM0153                      3              19                  6.3


9. Strava/VeloViewer data: cycle intensity, elevation change meters, grade

Strava segment data can be accessed via VeloViewer https://www.veloviewer.com. VeloViewer offers two export CSV files, both of which we use the latter. The first with all attempts per segment, the second with only the best raced segment based on speed.
As a measure of cycling intensity, we use the relative position score from VeloViewer. This score provides a value that represents your relative position to the other riders on each segment. Close to 100 is very good, close to 0 very bad. The formula used to work out this score is (“number of riders on segment” + 1 – “your segment position”) * 100 / (“number of riders on segment” + 1). Because VeloViewer adds 1 to both of the riders counts in the equation it is impossible to actually get a value of 0 or 100.


# Cycled Strava segments in the Netherlands. Source: Strava, VeloViewer

Cycled_Strava_segments_NL <- read_csv2(file = "data/segmenten_R_12_05_2024_rpubs_municipalities.csv", col_names = TRUE) %>% 
  select(c(Status, Country, Type, State, City, n_Riders, Tries, Pos_Score, Speed_kmh, Elv_change_m, Grade)) %>%
  filter (Status == 'active' & Country == 'Nederland' & Type == 'Ride') %>%
  rename (Province = State) %>%
  left_join(Reference_village_town_hamlet_2022, by = c( "City", "Province")) %>% 
  group_by (Province, Municipality) %>%
  summarise(n_cycled_segments = n(), .groups = 'drop',
            max_n_riders = max (n_Riders),
            max_tries = max (Tries),
            mean_PosScore = mean (Pos_Score),
            mean_Speedkmh = mean (Speed_kmh),
            mean_Elvchangem = mean(abs(Elv_change_m)),
            mean_Grade = mean(abs(Grade))) %>%
  arrange(Municipality)

head(Cycled_Strava_segments_NL)
## # A tibble: 6 × 9
##   Province      Municipality     n_cycled_segments max_n_riders max_tries mean_PosScore mean_Speedkmh mean_Elvchangem mean_Grade
##   <chr>         <chr>                        <int>        <dbl>     <dbl>         <dbl>         <dbl>           <dbl>      <dbl>
## 1 Zuid-Holland  's-Gravenhage                    9        35378         3          59.6          28.3            3.99     0.480 
## 2 Noord-Brabant 's-Hertogenbosch                 3        11693         1          67.2          30.2            5.37     0.705 
## 3 Drenthe       Aa en Hunze                    478        19172        30          91.1          38.2            5.90     0.240 
## 4 Gelderland    Aalten                          11         3489         1          86.2          35.5            3.72     0.218 
## 5 Fryslân       Achtkarspelen                  147         7002        35          93.8          41.2            3.15     0.0994
## 6 Zuid-Holland  Albrandswaard                   10        22359         1          64.6          31.2            4.91     0.146
Percentage_cycled_municipalities <- round(nrow(Cycled_Strava_segments_NL) / nrow(CBS_municipalities_1_1_2022),2) * 100
Percentage_cycled_municipalities
## [1] 77


10. Strava/VeloViewer: cycle frequency


The number of cycled segments in a specific municipality together with the maximum number of attempts on a segment defines the cycling frequency. This is calculated by a principal components analysis (PCA): the first PCA component of both variables “number of cycled segments” and “maximum number of tries” approximates the cycling frequency. The continuous component is converted to an interval ranging from 1 (low) to 9 (high cycle frequency).


# Join data frames 1-9.

Municipality_data_2022 <- CBS_municipalities_1_1_2022 %>%
  left_join(CBS_population_1_1_2022, by="Municipality_code") %>% 
  left_join(ha_Forest, by="Municipality_code") %>% 
  left_join(Protected_objects_per_municipality_2022, by="Municipality_code") %>% 
  left_join(Environment_nature_score_1_1_2022, by="Municipality_code") %>% 
  left_join(Municipalities_Natura2000, by="Municipality") %>% 
  mutate(n_of_Natura2000_areas = ifelse(is.na(n_of_Natura2000_areas), 0, n_of_Natura2000_areas)) %>% 
  left_join(Cycled_Strava_segments_NL, by = c("Province", "Municipality")) %>% 
  mutate(n_cycled_segments = ifelse(is.na(n_cycled_segments), 0, n_cycled_segments)) %>% 
  mutate(max_tries = ifelse(is.na(max_tries), 0, max_tries)) %>%
  left_join(Climbfinder_ascents_Netherlands, by = "Municipality_code") %>% 
  mutate(n_of_Climbs = ifelse(is.na(n_of_Climbs), 0, n_of_Climbs)) %>% 
  mutate(Climbing_points = ifelse(is.na(Climbing_points), 0, Climbing_points)) %>% 
  select (-mean_Climbing_points) %>% 
  arrange(Municipality_code)


# add municipality "cycle frequency" as the first PCA component of n_cycled_segments and max_tries.

cycle_frequency <- Municipality_data_2022 %>%
  select(n_cycled_segments, max_tries) %>%
  mutate(n_cycled_segments = log1p(n_cycled_segments),
         max_tries = log1p(max_tries)) 

cycle_frequency_pca_model <- prcomp(cycle_frequency, scale = TRUE)
summary(cycle_frequency_pca_model)
## Importance of components:
##                           PC1     PC2
## Standard deviation     1.3677 0.35982
## Proportion of Variance 0.9353 0.06473
## Cumulative Proportion  0.9353 1.00000
Municipality_data_2022 <- cbind(Municipality_data_2022,cycle_frequency_pca_model$x[,1:1]) %>% 
  rename(Cycle_frequency_PC1 = last_col()) %>% 
  mutate(Cycle_frequency_PC1_class = as.numeric(cut(0-Cycle_frequency_PC1, 9),  ordered_result = TRUE))

table(Municipality_data_2022$Cycle_frequency_PC1_class)  # 1 = low, 9 = high
## 
##  1  2  3  4  5  6  7  8  9 
## 88 51 77 66 34 18  6  3  2
# test "Nature" as the first PCA component of four variables.
# result: very bad. Skip.

Nature <- Municipality_data_2022 %>%
  select(Environment_nature_score, ha_Forest, n_of_Natura2000_areas, km2_Water_surface) %>%
  mutate(ha_Forest             = log1p(ha_Forest),
         n_of_Natura2000_areas = log1p(n_of_Natura2000_areas),
         km2_Water_surface     = log1p(km2_Water_surface)) 

Nature_pca_model <- prcomp(Nature, scale = TRUE)
summary(Nature_pca_model)
## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.3228 1.0996 0.7999 0.6334
## Proportion of Variance 0.4375 0.3023 0.1600 0.1003
## Cumulative Proportion  0.4375 0.7397 0.8997 1.0000
Municipality_data_2022 <- cbind(Municipality_data_2022,Nature_pca_model$x[,1:1]) %>% 
  rename(Nature_PC1 = last_col()) 


11. Geometric data: municipality polygons


The National Georegister is the location for geo-information in the Netherlands. The datasets can be viewed in a map viewer, downloaded, or linked. The “geometry” contains National Triangle coordinates of the National Triangle System (RD, “Rijksdriehoekstelsel”). The central point of the system is in Amersfoort. This point has the coordinates x = 155,000 m, y = 463,000 m. These values have been chosen in such a way that the x-coordinate is for each point in the entire (European) Netherlands on land is always between 0 and 280 km and the y-coordinate between 300 and 625 km. So all coordinates have a positive value and the y-coordinate is always greater than the x-coordinate.


# Geometric data (polygons) for map of municipalities as of 1-1-2022 

map_municipalities_2022 <- st_read("https://service.pdok.nl/cbs/gebiedsindelingen/2022/wfs/v1_0?request=GetFeature&service=WFS&version=2.0.0&typeName=cbsgebiedsindelingen:gemeente_gegeneraliseerd&outputFormat=json") %>%
  mutate(statnaam = replace(statnaam, statnaam == "Hoekse Waard", "Hoeksche Waard")) %>% 
  arrange(statcode) %>% select(-c("jrstatcode","rubriek"))
## Reading layer `gemeente_gegeneraliseerd' from data source `https://service.pdok.nl/cbs/gebiedsindelingen/2022/wfs/v1_0?request=GetFeature&service=WFS&version=2.0.0&typeName=cbsgebiedsindelingen:gemeente_gegeneraliseerd&outputFormat=json' using driver `GeoJSON'
## Simple feature collection with 345 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 13565.4 ymin: 306846.2 xmax: 278026.1 ymax: 619231.6
## Projected CRS: Amersfoort / RD New
head(map_municipalities_2022)
## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 136985.7 ymin: 473841.6 xmax: 269937.9 ymax: 592658.9
## Projected CRS: Amersfoort / RD New
##   id statcode      statnaam                       geometry
## 1  1   GM0014     Groningen MULTIPOLYGON (((245194.7 59...
## 2  2   GM0034        Almere MULTIPOLYGON (((146891.1 49...
## 3  3   GM0037   Stadskanaal MULTIPOLYGON (((263763.9 56...
## 4  4   GM0047       Veendam MULTIPOLYGON (((256231.9 57...
## 5  5   GM0050      Zeewolde MULTIPOLYGON (((170596.3 48...
## 6  6   GM0059 Achtkarspelen MULTIPOLYGON (((210504.1 58...


12. Distance from Groningen


As mentioned before, the distance to the home address is very decisive. Proxy: the straight line distance. Not the way one cycles, but “as the crow flies”.


# Distance table: the straight line distance in km from home base Groningen to another municipality in the Netherlands.

centre_polygon <- st_as_sf(st_centroid(map_municipalities_2022))

head(centre_polygon)
## Simple feature collection with 6 features and 3 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 145427.5 ymin: 484222.2 xmax: 263951 ymax: 582034.1
## Projected CRS: Amersfoort / RD New
##   id statcode      statnaam                  geometry
## 1  1   GM0014     Groningen POINT (237390.7 582034.1)
## 2  2   GM0034        Almere POINT (145427.5 486598.2)
## 3  3   GM0037   Stadskanaal     POINT (263951 558466)
## 4  4   GM0047       Veendam POINT (255012.7 567944.7)
## 5  5   GM0050      Zeewolde POINT (159344.5 484222.2)
## 6  6   GM0059 Achtkarspelen POINT (205498.4 581383.4)
distance_table <- data.frame(st_distance(centre_polygon, centre_polygon[1,])) %>% 
  rename(afstand_in_m = 1) %>% 
  mutate(km_distance_to_Groningen = round((as.numeric(afstand_in_m) / 1000),1)) %>% 
  select (- afstand_in_m)

Municipality_data_2022 <- cbind(Municipality_data_2022, distance_table, st_transform(centre_polygon, crs = 4326)) %>%  # check sequence
  mutate(longitude = sf::st_coordinates(geometry)[,1],              # convert Amersfoort RD to crs 4326
         latitude = sf::st_coordinates(geometry)[,2]) %>% select(-c("geometry", "id", "statcode", "statnaam"))

rm(distance_table)
rm(centre_polygon)


13. Elevation


In addition to Climbfinder climbing points, we also use the elevation meters, provides by numerous benchmarks. The following data has been recorded for each benchmark: Normaal Amsterdams Peil (NAP) height, the measurement date, a description of the location and the x and y coordinate of the mark in the National Triangle System (RD).
Amsterdam Ordnance Datum or Normaal Amsterdams Peil (NAP) is a vertical datum in use in large parts of Western Europe. Originally created for use in the Netherlands, its height was used by Prussia in 1879 for defining Normalnull, and in 1955 by other European countries. In the 1990s, it was used as the reference level for the United European leveling Network (UELN) which in turn led to the European Vertical Reference System (EVRS).
The most suitable variable might be the standard deviation of the NAP altitude.


# Elevation: NAP benchmarks (NAP WFS info from National Georegister)

# https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/g698b941-93e5-4a5f-a291-d070588d431b

nap_benchmarks <- st_read("https://geo.rijkswaterstaat.nl/services/ogc/gdr/nap/ows?service=WFS&request=getcapabilities&version=2.0.0")    %>% 
 mutate(nap_hoogte = as.numeric(hoogte)) %>% 
 select(puntnummer, x_rd, y_rd, type, omschrijving, shape, nap_hoogte)
## Multiple layers are present in data source https://geo.rijkswaterstaat.nl/services/ogc/gdr/nap/ows?service=WFS&request=getcapabilities&version=2.0.0, reading layer `nap:punten_actueel'.
## Use `st_layers' to list all layer names and their type in a data source.
## Set the `layer' argument in `st_read' to read a particular layer.
## Reading layer `nap:punten_actueel' from data source `https://geo.rijkswaterstaat.nl/services/ogc/gdr/nap/ows?service=WFS&request=getcapabilities&version=2.0.0' using driver `WFS'
## Simple feature collection with 33692 features and 19 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 14040 ymin: 307110 xmax: 277600 ymax: 613224
## Projected CRS: Amersfoort / RD New
head(nap_benchmarks)
## Simple feature collection with 6 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 244888 ymin: 464746 xmax: 248746 ymax: 466980
## Projected CRS: Amersfoort / RD New
##   puntnummer   x_rd   y_rd type                  omschrijving nap_hoogte                 shape
## 1   034E0380 247493 465812   PM          SCH BDR ALBERTSWEG 2    24.0048 POINT (247493 465812)
## 2   034E0050 246140 466980   PM            HS BDR RONDEEL 32     21.4794 POINT (246140 466980)
## 3   034E0398 248746 466855   PM NO LANDHOOFD VIADUCT OVER N18    30.7727 POINT (248746 466855)
## 4   034E0420 245566 466320   PM DKR NABIJ RIOOLWATERZUIVERING    20.2545 POINT (245566 466320)
## 5   034E0406 244888 464746   PM NO LANDHOOFD VIADUCT OVER N18    27.5530 POINT (244888 464746)
## 6   034E0401 245086 466009   PM           HS GOORSESTRAAT 169    21.8148 POINT (245086 466009)
# link point to polygon with sf::st_join.

point_to_polygon <- map_municipalities_2022 %>% 
  st_join (nap_benchmarks) %>% 
  rename(Municipality_code = statcode) %>% 
  rename(Municipality = statnaam)

head(point_to_polygon[c("Municipality","puntnummer", "x_rd", "y_rd", "omschrijving", "nap_hoogte")],5)
## Simple feature collection with 5 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 226868.9 ymin: 569546.1 xmax: 247393.6 ymax: 592658.9
## Projected CRS: Amersfoort / RD New
##     Municipality puntnummer   x_rd   y_rd                              omschrijving nap_hoogte                       geometry
## 1      Groningen   007C0186 227130 581180                      BDR ZUIDWENDING 335      0.2858 MULTIPOLYGON (((245194.7 59...
## 1.1    Groningen   007B0211 238909 590520     VDC LUTJE WOLDE, IN VOETSTUK PIJLERS     -0.1219 MULTIPOLYGON (((245194.7 59...
## 1.2    Groningen   012B0204 238330 572330 BOVENLEIDING MAST 66/5, Z. SPWG-OVERGANG      4.9374 MULTIPOLYGON (((245194.7 59...
## 1.3    Groningen   007D0140 236920 575750 HS RIJKSTRAATWG 290,BIJ INGANG BEGRAAFPL      3.7421 MULTIPOLYGON (((245194.7 59...
## 1.4    Groningen   007D0220 236980 580400               PIJLER VDC A7 RD=079346-15      1.2324 MULTIPOLYGON (((245194.7 59...
# Aggregate the benchmark characteristics per municipality.

aggregate_benchmarks <- point_to_polygon  %>% 
  as.data.frame() %>%     # transform an sf object to a data frame and remove the geometry column
  select (- geometry) %>% 
  group_by(Municipality) %>%
  summarize(n_of_benchmarks = n(), 
            mean_NAP_altitude = mean(nap_hoogte),
            min_NAP_altitude = min(nap_hoogte),
            max_NAP_altitude = max(nap_hoogte),
            sd_NAP_altitude = sd(nap_hoogte)) %>% 
  mutate(range_NAP_altitude = max_NAP_altitude - min_NAP_altitude)

head(aggregate_benchmarks)
## # A tibble: 6 × 7
##   Municipality     n_of_benchmarks mean_NAP_altitude min_NAP_altitude max_NAP_altitude sd_NAP_altitude range_NAP_altitude
##   <chr>                      <int>             <dbl>            <dbl>            <dbl>           <dbl>              <dbl>
## 1 's-Gravenhage                116             2.28           -3.21              10.4             2.71              13.6 
## 2 's-Hertogenbosch             105             5.88            2.02              10.8             1.47               8.80
## 3 Aa en Hunze                  243            12.0            -0.0796            22.4             6.09              22.5 
## 4 Aalsmeer                      22            -0.545          -4.03               2.36            1.84               6.39
## 5 Aalten                        59            25.5            16.5               40.8             6.12              24.3 
## 6 Achtkarspelen                 95             1.97           -0.0265             8.36            1.26               8.39
aggregate_benchmarks <- aggregate_benchmarks %>% 
  select(Municipality, sd_NAP_altitude, max_NAP_altitude)


Municipality_data_2022 <- Municipality_data_2022 %>%
  left_join(aggregate_benchmarks, by="Municipality") 

rm(nap_benchmarks)
rm(aggregate_benchmarks)
rm(point_to_polygon)

# test "Elevation" as the first PCA component of four variables.
# result: disappointing.

Elevation <- Municipality_data_2022 %>%
  select(n_of_Climbs, Climbing_points, sd_NAP_altitude, max_NAP_altitude) %>%
  mutate(n_of_Climbs             = log1p(n_of_Climbs),
         sd_NAP_altitude = log(sd_NAP_altitude),
         max_NAP_altitude     = log(max_NAP_altitude)) 

Elevation_pca_model <- prcomp(Elevation, scale = TRUE)
summary(Elevation_pca_model)
## Importance of components:
##                          PC1    PC2     PC3     PC4
## Standard deviation     1.745 0.7495 0.50898 0.36782
## Proportion of Variance 0.761 0.1404 0.06477 0.03382
## Cumulative Proportion  0.761 0.9014 0.96618 1.00000
Municipality_data_2022 <- cbind(Municipality_data_2022,Elevation_pca_model$x[,1:1]) %>% 
  rename(Elevation_PC1 = last_col()) %>% 
  mutate(Elevation_PC1_class = as.numeric(cut(Elevation_PC1, 9),  ordered_result = TRUE))

table(Municipality_data_2022$Elevation_PC1_class)
## 
##   1   3   4   5   6   7   8   9 
##   1   5   7   8  27  49 194  54
rm(Elevation,Elevation_pca_model)

# export results.

write_csv2(Municipality_data_2022, "C:/Users/acmsc/Documents/R/Strava/werkstuk RPubs/Municipality_data_2022_RPUBS.csv")
rm(Municipality_data_2022)
Municipality_data_2022 <- read_csv2(file = "werkstuk RPubs/Municipality_data_2022_RPUBS.csv", col_names = TRUE)



rm(nap_benchmarks)
rm(aggregate_benchmarks)
rm(point_to_polygon)


Mapping


A dozen general characteristics have been mapped at the municipal level. The piece of script below shows all recoding.


# prepare mapping data.

map_data <-
  map_municipalities_2022 %>%
  left_join(Municipality_data_2022, by=c("statcode"="Municipality_code")) %>% 
  mutate(Population_class
         = cut(Population_1_1_2022, breaks = c(0, 20000, 50000, 100000, 200000, 500000, 999999),
               labels = c("< 20,000", "20-50,000", "50-100,000", "100-200,000", "200-500,000", "> 500,000"))) %>% 
  mutate(max_n_riders_class = cut(max_n_riders, breaks = c(1, 10000, 20000, 30000, 40000, 60000, 100000, 150000, 175000, 200000), 
                                  labels = c("< 10,000", "10-20,000", "20-30,000", "30-40,000", "40-60,000", "60-100,000", "100-150,000", "150-175,000", ">175,000"))) %>% 
  mutate(mean_PosScore_class = cut(mean_PosScore, breaks = c(-1,50,75,80,85,90,95,999),
                         labels = c("< 50","50-75","75-80","80-85","85-90","90-95","95-100"))) %>% 
  mutate(n_of_Protected_objects_class
         = cut(n_of_Protected_objects, breaks = c(-1, 10, 50, 200, 400, 1000, 6000, 9999),
               labels = c("< 10", "10-50", "50-200", "200-400", "400-1,000", "1,000-6,000", "> 6,000"))) %>% 
  mutate(Population_density_class
                = cut(Population_density_1_1_2022, breaks = c(0, 249, 500, 1000, 2000, 4000, 9999),
                  labels = c("< 250", "250-500", "500-1,000", "1,000-2,000", "2,000-4,000", "> 4,000"))) %>% 
  mutate(km2_Water_surface_class
         = cut(km2_Water_surface, breaks = c(-1, 1, 5, 10, 20, 50, 100, 999),
               labels = c("< 1", "1-5", "5-10", "10-20", "20-50", "50-100", "> 100"))) %>% 
  mutate(n_of_Climbs_class = cut(n_of_Climbs, breaks = c(-1,0,1,4,9,19,39,69,999),
                                      labels = c("0","1","2-4","5-9","10-19","20-39","40-69",">= 70"))) %>% 
  mutate(max_NAP_altitude_class = cut(max_NAP_altitude, breaks = c(-1,5,10,15,25,50,125,200,300,999),
                                        labels = c("< 5","5-10","10-15","15-25","25-50","50-125","125-200","200-300","> 300"))) %>% 
  mutate(sd_NAP_altitude_class = cut(sd_NAP_altitude, breaks = c(-1,2,4,7,10,20,30,40,99),
                                       labels = c("<2", "2-4", "4-7", "7-10","10-20", "20-30", "30-40", ">40"))) %>% 
  mutate(Climbing_points_class = cut(Climbing_points, breaks = c(-1,1,50,100,250,500,1000,1500,9999),
                                           labels = c("0","1-50","50-100","100-250","250-500","500-1,000","1,000-1,500","> 1,500")))


The maps:

- Population
- Population density
- Environment/nature
- Water surface
- Natura2000 areas
- Maximum NAP altitude
- Standard deviation NAP altitude
- Climbfinder ascents
- Climbfinder points
- Protected buildings/objects

We omit the R scripts which generate the maps.

Map 2. Population [List of maps](#maplist)

Map 2. Population List of maps


Map 3. Population density [List of maps](#maplist)

Map 3. Population density List of maps


Map 4. Environment/nature score [List of maps](#maplist)

Map 4. Environment/nature score List of maps


Map 5. water surface [List of maps](#maplist)

Map 5. water surface List of maps


Map 6. Natura2000 areas [List of maps](#maplist)

Map 6. Natura2000 areas List of maps


Map 7. Max. NAP altitude [List of maps](#maplist)

Map 7. Max. NAP altitude List of maps


Map 8. S.D. NAP altitude [List of maps](#maplist)

Map 8. S.D. NAP altitude List of maps


Map 9. Climbfinder climbs [List of maps](#maplist)

Map 9. Climbfinder climbs List of maps


Map 10. Climbfinder points  [List of maps](#maplist)

Map 10. Climbfinder points List of maps


Map 11. protected buildings/objects [List of maps](#maplist)

Map 11. protected buildings/objects List of maps


Next, four maps with Strava segment stats. All based on segments cycled by myself.

- Cycle frequency
- Cycle intensity
- Number of Strava riders
- High and low cycle frequency outer quartiles

Map 12. Cycle frequency [List of maps](#maplist)

Map 12. Cycle frequency List of maps


Map 13. Cycle intensity [List of maps](#maplist)

Map 13. Cycle intensity List of maps


Map 14. Strava riders [List of maps](#maplist)

Map 14. Strava riders List of maps


Map 15. High and low cycle frequency [List of maps](#maplist)

Map 15. High and low cycle frequency List of maps



This chapter’s last map is an interactive one.
As mentioned twice, the distance to the home address is very decisive. The proxy is the straight line distance. The map below shows the distance to Groningen for 50 places. Clicking on a line or end point gives a pop-up with car travel distance and time. A route planner provided this data. Searched one by one.



library(readxl)
library(igraph)
library(sp)   
library(leaflet)
library(tidyverse)   

# Reference table 6000 villages and towns and hamlets in The Netherlands, including GPS coordinates.

Reference_village_town_hamlet_2022 <- read_excel("data/plaatsnamen thesaurus 2022.xlsx") %>% 
  rename (Province = Provincie, City = Plaats, Municipality_code = GemeentecodeGM, Municipality = Gemeente) %>% 
  arrange(City, Municipality, Municipality_code, Province) %>% 
  mutate(Latitude = round(as.numeric(Latitude), 7)) %>% 
  mutate(Longitude = round(as.numeric(Longitude), 7)) %>% 
  filter(!is.na(Latitude)) %>% 
  select(City, Province, Longitude, Latitude) %>% 
  group_by (City) %>% 
  filter( !duplicated(City))

# Function to calculate distance between start and end points.

distance_between_two_points <- function(lat1, lon1, lat2, lon2) {
  
  theta <-  lon2 - lon1
  d1 <- (sin((atan(1)/45) * lat2) * sin((atan(1)/45) * lat1)) + (cos((atan(1)/45) * lat2) * cos((atan(1)/45) * lat1) * cos((atan(1)/45) * theta))
  d2 <- 2*atan(1) - asin(d1)
  d2[is.na(d2)] <- 0
  d3 = (45/atan(1))*(d2)
  distmile = d3 * 60 * 1.1515
  distkm = distmile * 1.609344
  distance_meter = distkm * 1000
  distance_km <- round(distance_meter / 1000,3)
  
  return(data.frame(distance_km))
  
}

# Calculate straight point distance from home town various places in the country.

Distance_line <- Reference_village_town_hamlet_2022 %>% 
  mutate(Home_latitude = 53.22438,
         Home_longitude = 6.566260) %>% 
  select(City, Province, Latitude, Longitude, Home_latitude, Home_longitude)

Distance_line <- cbind(Distance_line, distance_between_two_points (Distance_line$Home_latitude, Distance_line$Home_longitude,Distance_line$Latitude,Distance_line$Longitude )) %>% 
  rename(Dist_km_straight_line = distance_km) %>% 
  group_by(City, Province, Dist_km_straight_line, Longitude, Latitude, Home_longitude, Home_latitude) %>% 
  summarise()

rm(Reference_village_town_hamlet_2022)

# Add car travel distance and time.

travel <- data.frame ("from" = rep("Groningen", times=51),
                  tribble(~to, ~travel_km, ~travel_time,
                          "Groningen"       ,  0,"0:00",
                          "Overberg"        ,206,"1:50",
                          "Schiermonnikoog" , 57,"1:14",
                          "Eemshaven"       , 40,"0:39",
                          "Aalten"          ,187,"2:14",
                          "Formerum"        ,131,"3:25",
                          "Bad Nieuweschans", 60,"0:46",
                          "Aardenburg"      ,379,"4:02",
                          "Ter Apel"        , 63,"0:56",
                          "Emmen"           , 61,"0:48",
                          "Enschede"        ,151,"1:52",
                          "Bourtange"       , 71,"1:01",
                          "Zwolle"          ,105,"1:09",
                          "Termunterzijl"   , 59,"0:51",
                          "Den Dungen"      ,235,"2:28",
                          "Nijmegen"        ,195,"2:16",
                          "Amersfoort"      ,173,"2:00",
                          "Vaals"           ,352,"3:48",
                          "Eijsden"         ,342,"3:42",
                          "Eindhoven"       ,268,"2:49",
                          "Ossendrecht"     ,298,"3:13",
                          "Sas van Gent"    ,353,"3:45",
                          "Cadzand"         ,386,"4:10",
                          "Burgh-Haamstede" ,307,"3:19",
                          "Hoek van Holland",246,"2:42",
                          "Middelburg"      ,346,"3:17",
                          "Slenaken"        ,352,"3:26",
                          "Terheijden"      ,243,"2:18",
                          "Amsterdam"       ,174,"1:42",
                          "Assen"           , 31,"0:32",
                          "Delfzijl"        , 41,"0:37",
                          "Volendam"        ,182,"1:49",
                          "Eerbeek"         ,169,"1:38",
                          "Ommen"           , 88,"1:01",
                          "Leidschendam"    ,225,"2:09",
                          "Drachten"        , 36,"0:26",
                          "Lauwersoog"      , 42,"0:42",
                          "Middelstum"      , 25,"0:23",
                          "Bellingwolde"    , 60,"0:44",
                          "Dokkum"          , 65,"0:42",
                          "Sellingen"       , 75,"0:57",
                          "Holten"          ,123,"1:29",
                          "Lochem"          ,141,"1:43",
                          "Putten"          ,156,"1:37",
                          "Orvelte"         , 60,"0:49",
                          "Bergen aan Zee"  ,176,"2:09",
                          "Julianadorp"     ,148,"1:41",
                          "De Cocksdorp"    ,174,"2:39",
                          "Vlieland"        ,129,"2:06",
                          "Harlingen"       , 85,"0:57",
                          "Ballum"          , 94,"2:05"))

travel_ext <- travel %>% 
  left_join(Distance_line, by = c("to" = 'City')) %>% 
  rename (City= to) %>% 
  select(City, Province, Longitude, Latitude, Dist_km_straight_line, travel_km, travel_time)

head(travel_ext)
##              City   Province Longitude Latitude Dist_km_straight_line travel_km travel_time
## 1       Groningen  Groningen  6.566260 53.22438                 0.000         0        0:00
## 2        Overberg    Utrecht  5.498236 52.03952               150.167       206        1:50
## 3 Schiermonnikoog    Fryslân  6.156031 53.47860                39.246        57        1:14
## 4       Eemshaven  Groningen  6.801389 53.44750                29.312        40        0:39
## 5          Aalten Gelderland  6.590469 51.92604               144.371       187        2:14
## 6        Formerum    Fryslân  5.309896 53.38985                85.474       131        3:25
# Now some fussing and fiddling with R. It could probably be done easier. 

g <- graph_from_data_frame(travel, directed=FALSE, vertices=travel_ext)
class(g)
## [1] "igraph"
lo <- norm_coords(as.matrix(travel_ext[,3:4]))
class(lo)
## [1] "matrix" "array"
gg <- igraph::as_data_frame(g, "both")
class(gg)
## [1] "list"
vert <- gg$vertices
coordinates(vert) <- ~Longitude+Latitude
class(vert)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
edges <- gg$edges
edges <- lapply(1:nrow(edges), function(i) {
  as(rbind(vert[vert$name == edges[i, "from"], ], 
           vert[vert$name == edges[i, "to"], ]), "SpatialLines")})

for (i in seq_along(edges)) {
  edges[[i]] <- spChFIDs(edges[[i]], as.character(i))}

edges <- do.call(rbind, edges)
class(edges)
## [1] "SpatialLines"
## attr(,"package")
## [1] "sp"
# Pop-up info when you double-click on a point or line. 

content <- paste("<b>", "City: ", travel_ext$City, "</b>", "<br/>", 
                 "Province: ", travel_ext$Province, "<br/>",
                 "Straight line distance: ", round(travel_ext$Dist_km_straight_line,0), " km.", "<br/>",
                 "Travel distance (by car): ", travel_ext$travel_km, " km.", "<br/>",
                 "Travel time (by car): ", travel_ext$travel_time)

# Home town red icon.

getColor <- function(x) {
  sapply(x$travel_km, function(travel_km) {
    if(travel_km == 0) {
      "red"
    }  else {
      "green"
    } })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'white',
  library = 'ion',
  markerColor = getColor(travel_ext)
)

Map 16. Distance to Groningen List of maps



Statistical methods


Linear regression is a commonly used method of predictive analysis. It uses linear relationships between a dependent variable to be explained and one or more independent variables (predictors) to predict. It shows whether changes observed in the dependent variable are associated with changes in one or more of the explanatory or independent variables.
In R “GLM” is used to fit generalized linear models, such as (logistic) regression.
“spaMM” is an R package originally designed for fitting Spatial generalized linear Mixed Models, particularly so-called geostatistical models allowing prediction in spatial context.
Logistic regression or ‘logit’ analysis has been used particularly to investigate the relationship between a binary response and explanatory variables.



1. GLM model


The cycle frequency proxy is the dependent variable, the remainder on the list below is the independent part.

input variables for stepwise regressions
Population_1_1_2022
Population_density_1_1_2022
n_of_Natura2000_areas
Environment_nature_score
km2_Water_surface
ha_Forest
km_distance_to_Groningen
n_of_Climbs
Climbing_points
max_NAP_altitude
sd_NAP_altitude
n_of_Protected_objects
mean_Elvchangem
mean_Grade
Elevation_PC1_class
Cycle_frequency_PC1_class


We have to check a few assumptions before using linear regression. These assumptions are important conditions that should be met before we draw inferences regarding the model estimates. First, variables should be distributed normally. To check, we plot histograms and transform right-skewed distributions - those having a long tail on their right side - into logarithmic scales.




The second assumption: no multicollinearity. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other. To fix multicollinearity, one can remove one of the highly correlated variables or use a technique such as principal component analysis (PCA) to reduce the number of variables while retaining most of the information. We calculated ‘elevation’ as the first PCA component of four related variables:

- log1p(n_of_Climbs)
- log1p(Climbing_points)
- log1p(sd_NAP_altitude)
- log1p(max_NAP_altitude)

The third assumption we make when fitting a regression model is that the residuals of the model (i.e. what isn’t explained by the covariates) are independent. Often, however, with spatial data this assumption isn’t met. Neighboring values may correlate with each other, a phenomenon known as residual spatial autocorrelation (RSA). RSA may bias inferential results. Therefore, something must be done if strong RSA is detected. We can turn to arithmetic add-ons to regression models that basically “absorb” RSA. An attempt to remove it by modifying the model is another option, e.g. through including additional predictor variables.

When thinking about spatial autocorrelation in the present data set we note the following. If the spatial autocorrelation of a response variable is caused by autocorrelated predictor variables, such as elevation data, human population densities or Natura 2000 areas, we are not alarmed. This is not the type of autocorrelation statisticians are concerned with. We do not wish to remove the effect of such predictors after RSA model correction.

The distance to the home base will be included in the model as an important covariate, but this variable implies undeniably spatial autocorrelation. There are also spatial patterns present in the remainder of independent variables (such as forest, Natura 2000, elevation - all spread over neighboring municipalities). This type of spatial autocorrelation arises from the following situation: the difference between the (larger) scale of variation of a phenomenon and the (smaller) scale of the spatial framework used to capture that variation. Spatial autocorrelation here is endogenous, it is caused by determining factors.



The step-by-step data reduction plan is as follows:

1. Print the correlation matrix;
2. Stepwise fit a parsimonious non-spatial linear regression model;
3. Test for spatial autocorrelation in the residuals;
4. Decide if its okay to continue with the non-spatial model, or to fit a spatial one.


The correlation matrix below shows that the target dependent variable cycling frequency has a significant linear relationship with each of the independent variables.

library(Hmisc)
library(corrplot)

Municipality_data_2022 <- read_csv2(file = "werkstuk RPubs/Municipality_data_2022_RPUBS.csv", col_names = TRUE)

Municipality_data_2022_transform <- Municipality_data_2022 %>% 
  mutate(log1p_Population_1_1_2022 = log1p(Population_1_1_2022),
         log1p_ha_Forest = log1p(ha_Forest),
         log1p_n_of_Protected_objects = log1p(n_of_Protected_objects),
         log1p_n_of_Natura2000_areas = log1p(n_of_Natura2000_areas),
         log1p_km2_Water_surface = log1p(km2_Water_surface),
         log1p_max_NAP_altitude = log1p(max_NAP_altitude),
         log1p_sd_NAP_altitude = log1p(sd_NAP_altitude),
         log1p_n_of_Climbs = log1p(n_of_Climbs),
         log1p_Climbing_points = log1p(Climbing_points),
         log1p_mean_Elvchangem = log1p(mean_Elvchangem),
         log1p_mean_Grade = log1p(mean_Grade),  
         Cycle_frequency_PC1 = 0 - Cycle_frequency_PC1,  # n.b. reverse sign (only here)!
         Elevation_PC1 = 0 - Elevation_PC1,               # reverse sign !
         Environment_nature_score = 0 - Environment_nature_score) %>%   # reverse sign !
  dplyr::select(Cycle_frequency_PC1,km_distance_to_Groningen,log1p_Population_1_1_2022,Population_density_1_1_2022,  log1p_n_of_Protected_objects,log1p_ha_Forest,log1p_n_of_Natura2000_areas,Environment_nature_score,    log1p_km2_Water_surface,Elevation_PC1,log1p_max_NAP_altitude,log1p_sd_NAP_altitude,log1p_n_of_Climbs,                 log1p_Climbing_points,log1p_mean_Elvchangem,log1p_mean_Grade)

correlation_full <- Hmisc::rcorr(as.matrix(Municipality_data_2022_transform))
correlation_r <- correlation_full$r
correlation_P <- correlation_full$P

corrplot::corrplot(correlation_r, p.mat = correlation_P, sig.level = 0.05, type = "upper", diag = FALSE, order = "original",
                   insig = 'blank', tl.cex = 0.7, tl.col    = "black",  tl.srt = 45,
         title='Correlation matrix (p < 0.05, insignificant cells suppressed)',  cex.main = 1, mar=c(0,0,1,0),tl.offset = 0.5)



Step 2 resulted in the following model:


lm_concise <- glm(Cycle_frequency_PC1_class ~ log1p(n_of_Natura2000_areas) + 
                                   log1p(n_of_Climbs) + 
                                   km_distance_to_Groningen, data = Municipality_data_2022)


jtools::summ(lm_concise, scale = FALSE)
Observations 345
Dependent variable Cycle_frequency_PC1_class
Type Linear regression
χ²(3) 465.71
Pseudo-R² (Cragg-Uhler) 0.47
Pseudo-R² (McFadden) 0.16
AIC 1146.10
BIC 1165.32
Est. S.E. t val. p
(Intercept) 4.74 0.21 22.55 0.00
log1p(n_of_Natura2000_areas) 0.40 0.13 3.02 0.00
log1p(n_of_Climbs) 0.63 0.07 9.50 0.00
km_distance_to_Groningen -0.01 0.00 -13.55 0.00
Standard errors: MLE


The three key explanatory variables of cycling frequency are:

- the distance to the home base,
- Climbfinder hills, and
- the presence of Natura 2000 areas.

Now we want to check for dependence of residuals. There are basically two classes of dependencies.
a. Residuals correlate with another variable;
b. Residuals correlate with close residuals (autocorrelation).


  1. Residuals may correlate with another variable. This is usually identified visually using graphs such as the quartet below. The residual versus fitted plot can reveal non-linearity because of inequality of the variances of the error terms. For instance a parabola indicates non-linear relationship. Here, we don’t see any distinctive pattern.
    The plot that is in the right upper corner, is the normal probability plot of residuals. It is an indication to see if the model residuals follow a normal distribution. Do residuals follow a straight line well or do they deviate severely?
    The Scale-Location plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal line with equally (randomly) spread points.
    The leverage idea picks out the potential for a specific value to be influential.
    It is noticeable that the estimated values do not exceed six, while the maximum observed is nine.



autoplot(lm_concise)



  1. Residuals correlate with close residuals. A test exists for spatial autocorrelation in the residuals.
    We create a correlogram to explore Moran’s I over different spatial lags. Positive spatial autocorrelation occurs when Moran’s I is close to +1. This means values cluster together, data is clustered.
    Statistically significant values are plotted in red. The table and spatial correlogram below suggest that there is spatial autocorrelation up to scales of ~0.6 decimal degrees and therefore that the residuals are not independent.


options(scipen = 75)

nbc <- 10
cor_r <- pgirmess::correlog(coords=Municipality_data_2022[,c("longitude", "latitude")],   # 345 municipalities
                            z=lm_concise$residuals,                                       # 345 residuals
                            method="Moran", nbclass=nbc)
cor_r
## Moran I statistic 
##       dist.class         coef                                                                               p.value     n
##  [1,]  0.2241789  0.260463613 0.00000000000000000000000000000000000000000000000000000000000000000000000000002682796 12618
##  [2,]  0.6218686 -0.045633096 0.99999939758857503413480571907712146639823913574218750000000000000000000000000000000 25428
##  [3,]  1.0195579 -0.058522830 0.99999999999957822627294490303029306232929229736328125000000000000000000000000000000 26634
##  [4,]  1.4172472 -0.010262447 0.81049198195281069168061094387667253613471984863281250000000000000000000000000000000 23324
##  [5,]  1.8149364  0.010099447 0.10813608122453155502284971589688211679458618164062500000000000000000000000000000000 16864
##  [6,]  2.2126257  0.008245622 0.27215563808514287025275280029745772480964660644531250000000000000000000000000000000  8904
##  [7,]  2.6103150 -0.134460051 0.99999776658245675964309384653461165726184844970703125000000000000000000000000000000  3414
##  [8,]  3.0080042 -0.159608301 0.99919466926464084455261627226718701422214508056640625000000000000000000000000000000  1052
##  [9,]  3.4056935 -0.130006469 0.92100518033972889053728749786387197673320770263671875000000000000000000000000000000   360
## [10,]  3.8033828 -0.516081721 0.99930201088457082381211193933268077671527862548828125000000000000000000000000000000    82
plot(cor_r)      # statistically significant values (p<0.05) are plotted in red


De distance classes on the x-axis can be converted from decimal degrees to distance. Let’s take the cutpoint found above, 0.6 degrees, as an example:

# conversion function of degree to radians
deg2rad <- function(deg) {(deg * pi) / (180)}
# conversion of radians to degree (for completeness only)
rad2deg <- function(rad) {(rad * 180) / (pi)}

degrees <-  0.6
mean_nl_latitude <- 52    # latitude range in the Netherlands: 50.5 - 53.5

lat_distance_spat_autocorr <- round((((2 * pi * 6400000) * degrees) / 360)/1000,3)

# convert from degrees to radians before calling cos function!
lon_distance_spat_autocorr <- round(lat_distance_spat_autocorr *  cos(deg2rad(mean_nl_latitude)),3)

# East-West distance in km per decimal degree longitude, at a given latitude:
lon_distance_spat_autocorr
## [1] 41.262


Basically municipality centroid polygon points more than about 40 km away have a low correlation.
The maximum center - center straight line distance in our group of municipalities is:

max(Municipality_data_2022$km_distance_to_Groningen)
## [1] 298.5


2. spaMM model


Suppose we conclude that he observed spatial autocorrelation in the previous section requires a different model, then we can turn to a spatial one:


lm_concise_spatial <- spaMM::fitme(Cycle_frequency_PC1_class ~ log1p(n_of_Natura2000_areas) + 
                                     log1p(n_of_Climbs) +
                                     km_distance_to_Groningen +
                                     Matern(1|latitude+longitude), data = Municipality_data_2022)


The output is missing p-values and R2, so we calculate them ourselves:


coef_pvalue <- as.data.frame(summary(lm_concise_spatial)$beta_table) %>% 
  rename(Cond_SE = "Cond. SE", t_value = "t-value" ) %>% 
  mutate(lower = Estimate -  1.96 * Cond_SE) %>% 
  mutate(upper = Estimate +  1.96 * Cond_SE) %>% 
  mutate(p_value = round((2*pt(t_value, df=df.residual(lm_concise_spatial), lower.tail=FALSE)),3))
## formula: Cycle_frequency_PC1_class ~ log1p(n_of_Natura2000_areas) + log1p(n_of_Climbs) + 
##     km_distance_to_Groningen + Matern(1 | latitude + longitude)
## ML: Estimation of corrPars, lambda and phi by ML.
##     Estimation of fixed effects by ML.
## Estimation of lambda and phi by 'outer' ML, maximizing logL.
## family: gaussian( link = identity ) 
##  ------------ Fixed effects (beta) ------------
##                              Estimate Cond. SE t-value
## (Intercept)                   3.46837 1.049660  3.3043
## log1p(n_of_Natura2000_areas)  0.32913 0.095567  3.4440
## log1p(n_of_Climbs)            0.37568 0.056044  6.7032
## km_distance_to_Groningen     -0.00537 0.005732 -0.9368
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##      1.nu     1.rho 
## 0.9095624 3.4031042 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    latitude .  :  3.212  
## # of obs: 345; # of groups: latitude ., 345 
##  -------------- Residual variance  ------------
## phi estimate was 0.206948 
##  ------------- Likelihood values  -------------
##                         logLik
## logL       (p_v(h)): -442.7489
coef_pvalue
##                                  Estimate     Cond_SE    t_value       lower       upper p_value
## (Intercept)                   3.468373759 1.049660146  3.3042826  1.41103987 5.525707645   0.001
## log1p(n_of_Natura2000_areas)  0.329128616 0.095566954  3.4439584  0.14181739 0.516439847   0.001
## log1p(n_of_Climbs)            0.375678641 0.056044286  6.7032460  0.26583184 0.485525442   0.000
## km_distance_to_Groningen     -0.005370094 0.005732419 -0.9367938 -0.01660564 0.005865446   1.650
spaMM::pseudoR2(lm_concise_spatial)
##        R2 
## 0.7394511


The correlation parameter ν and ρ represent the strength and the speed of decay in the spatial effect, we’ll use them later.
Surprisingly for me, the distance to Groningen is no longer a significant predictor. The estimate for the effect of this variable is no longer relevant.
At the same time, the significance of the residual spatial autocorrelation has disappeared:


nbc <- 10
cor_r <- pgirmess::correlog(coords=Municipality_data_2022[,c("longitude", "latitude")],
                            z = residuals(lm_concise_spatial),
                            method="Moran", nbclass=nbc)

plot(cor_r) # statistically significant values (p<0.05) are plotted in red


After removing the non-significant “km_distance_to_Groningen” the final model shows up.

lm_concise_spatial <- spaMM::fitme(Cycle_frequency_PC1_class ~ log1p(n_of_Natura2000_areas) + 
                                     log1p(n_of_Climbs) +
                                     Matern(1|latitude+longitude), data = Municipality_data_2022)

coef_pvalue <- as.data.frame(summary(lm_concise_spatial)$beta_table) %>% 
  rename(Cond_SE = "Cond. SE", t_value = "t-value" ) %>% 
  mutate(lower = Estimate -  1.96 * Cond_SE) %>% 
  mutate(upper = Estimate +  1.96 * Cond_SE) %>% 
  mutate(p_value = round((2*pt(t_value, df=df.residual(lm_concise_spatial), lower.tail=FALSE)),3))
## formula: Cycle_frequency_PC1_class ~ log1p(n_of_Natura2000_areas) + log1p(n_of_Climbs) + 
##     Matern(1 | latitude + longitude)
## ML: Estimation of corrPars, lambda and phi by ML.
##     Estimation of fixed effects by ML.
## Estimation of lambda and phi by 'outer' ML, maximizing logL.
## family: gaussian( link = identity ) 
##  ------------ Fixed effects (beta) ------------
##                              Estimate Cond. SE t-value
## (Intercept)                    2.6198  0.62581   4.186
## log1p(n_of_Natura2000_areas)   0.3275  0.09541   3.433
## log1p(n_of_Climbs)             0.3732  0.05590   6.677
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##      1.nu     1.rho 
## 0.8725448 2.9557618 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    latitude .  :  3.672  
## # of obs: 345; # of groups: latitude ., 345 
##  -------------- Residual variance  ------------
## phi estimate was 0.202767 
##  ------------- Likelihood values  -------------
##                         logLik
## logL       (p_v(h)): -443.0979
coef_pvalue
##                               Estimate    Cond_SE  t_value     lower     upper p_value
## (Intercept)                  2.6198060 0.62581331 4.186242 1.3932119 3.8464000   0.000
## log1p(n_of_Natura2000_areas) 0.3275112 0.09540875 3.432716 0.1405100 0.5145123   0.001
## log1p(n_of_Climbs)           0.3732164 0.05589960 6.676549 0.2636532 0.4827796   0.000
spaMM::pseudoR2(lm_concise_spatial)
##        R2 
## 0.7389233
nbc <- 10
cor_r <- pgirmess::correlog(coords=Municipality_data_2022[,c("longitude", "latitude")],
                            z = residuals(lm_concise_spatial),
                            method="Moran", nbclass=nbc)

plot(cor_r) # statistically significant values (p<0.05) are plotted in red

summary(resid(lm_concise_spatial))    # standardized residuals
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.940223 -0.165938  0.004641  0.000000  0.169317  1.172185


We note in passing that the standardized residuals do not show any outliers, see the descriptives at the end of the code above.

The correlation parameter ν and ρ represent the strength and the speed of decay in the spatial effect, which can be turned into the actual spatial correlation effect by plotting the estimated correlation between two locations against their distance (showing a hyperbolic decline curve):

euclid_dist <- dist(Municipality_data_2022[,c("longitude","latitude")], method = "euclidean")
matern_corr <- spaMM::MaternCorr(euclid_dist, nu = 0.8725448, rho = 2.9557618)  # copy values form final model
plot(as.numeric(euclid_dist), as.numeric(matern_corr), 
     xlab = "Distance between pairs of location (in degrees)", 
     ylab = "Estimated correlation")


The predictive value of the GLM and spaMM models is best reflected in maps. The series of maps shown above contained one with the observed cycling frequency in nine classes based on a PCA (Map 12). Now we will see to what extent both models can reproduce this map.


First the code.

# Map GLM fitted cycle frequency

Seq_n <- c(1:9)
Freq_ind <- c(
  '1 = low', 
  '2',   
  '3',   
  '4',   
  '5',   
  '6',   
  '7',   
  '8',   
  '9 = high')

pred_glm <- add_predictions(Municipality_data_2022, lm_concise, var = "pred", type = NULL) %>% 
  mutate(Cycle_frequency_GLM_fit = as.numeric(round(pred,0))) %>%
  dplyr::select(Municipality, Cycle_frequency_GLM_fit) 
  
map_data <- map_data %>% left_join(pred_glm, by = "Municipality") %>% 
  mutate(Cycle_frequency_GLM_fit = factor(as.integer(Cycle_frequency_GLM_fit), levels = Seq_n, labels = Freq_ind)) 

Map_glm_fit <- map_data %>%
  ggplot() +
  geom_sf(aes(fill = Cycle_frequency_GLM_fit)) +
  labs(title="Cycle frequency indicator fitted by GLM model",
       subtitle = paste0("Predictors: log1p(n_of_Natura2000_areas) + log1p(n_of_Climbs) + km_distance_to_Groningen", "\n",
                  "Pseudo R\u00b2: ", round(describedata::nagelkerke(lm_concise),2)),
       caption = paste0("Reference date: ", today(), ". Source: Strava/VeloViewer"),
       fill = "indicator fitted") +
  theme(plot.title = element_text(size=15), legend.title  = element_text(size=11)) +
  scale_fill_brewer(palette="Oranges", direction = 1,  labels = c("1 (almost) never","2","3","4","5","6","7","8","9 very frequent"),drop = FALSE) 


The fitted map (Map 17) is somewhat more evenly colored than the observed one. The three most intensive classes 7-9 do not appear at all in the model. You may switch between Map 12 (observed) and map 17 (GLM predicted) via the overview of maps to compare both.


Map 17. Cycle frequency fitted by GLM model [List of maps](#maplist)

Map 17. Cycle frequency fitted by GLM model List of maps


Now the spatial regression model. First the script.

# Map spaMM fitted cycle frequency

pred_spatial <- modelr::add_predictions(Municipality_data_2022, lm_concise_spatial, var = "pred", type = NULL) %>% 
         mutate(Cycle_frequency_spaMM_fit = as.numeric(round(pred,0))) %>%
  dplyr::select(Municipality, Cycle_frequency_spaMM_fit)

map_data <- map_data %>% left_join(pred_spatial, by = "Municipality")

Map_spatial_fit <- map_data %>%
  ggplot() +
  geom_sf(aes(fill = as.factor(Cycle_frequency_spaMM_fit))) +
  labs(title="Cycle frequency indicator fitted by spatial spaMM model",
       subtitle = paste0("Predictors: log1p(n_of_Natura2000_areas) + log1p(n_of_Climbs) + Matern(1 | latitude + longitude)", "\n",
                         "Pseudo R\u00b2: ", round(spaMM::pseudoR2(lm_concise_spatial),2)),
       caption = paste0("Reference date: ", today(), ". Source: Strava/VeloViewer"),
       fill = "indicator fitted") +
  theme(plot.title = element_text(size=15), legend.title  = element_text(size=11)) +
  scale_fill_brewer(palette="Oranges", direction = 1, labels = c("1 (almost) never","2","3","4","5","6","7","8","9 very frequent")) 


Plotting the map:

Map 18. Cycle frequency fitted by spaMM model [List of maps](#maplist)

Map 18. Cycle frequency fitted by spaMM model List of maps


The spaMM model predicts astonishingly well, only based on Natura 2000 areas and the presence of Climbfinder’s trajectories, corrected for spatial autocorrelation. You can switch between Map 12 (observed) and map 18 (spaMM predicted) via the overview of maps to compare both.

The disadvantage is that the most important predictor - distance to Groningen - has been absorbed by the model. Although understandable in a technical sense, we would still like to have the distance variable as a quantified predictor in the model. We will check this out this with a logit model in which the dependent variable is dichotomized and with the distance a priori as one of the independent variables.



3. Logit model (GLM)



A logit model is a regression type model where the response variable is a categorical variable taking on only discrete values, in our case (see Map 15):

1 = high frequency or ‘hardcore’ cycle municipality (n=50)
0 = low cycle frequency (n=150). Have we ever cycled there? My memory is failing me…

But first, we will examine the bivariate relationships between the dichotomized cycling frequency and the now well-known set of independent variables. For this purpose we shall use the Wilcoxon unpaired signed-rank test.
The Wilcoxon test is a robust non-parametric test, specifically testing whether the median of a variable differs in both groups. Particularly useful in the presence of outliers or heavily skewed distributions.

The measure of central tendency - the median - appears to differ significantly for each variable, see the overview below. The box plots show that all relationships are significant and in the expected direction, except for the number of inhabitants, which is slightly surprising. On closer inspection, both groups lack large cities. Let’s wait and see the effect of population size in the multivariate analysis.



The log1p_n_of_Climbs estimator is positive. The more climbs, the greater the chance of being a favorite cycling municipality. The n_of_Natura2000_areas estimator is positive. The more Natura2000 area in a municipality, the greater the chance of being a favorite. km_distance_to_Groningen is negative. The further away, the smaller the chance of being a favorite, as the summary of the stepwise logit analyses below shows.



Observations 200
Dependent variable outer_quartiles
Type Generalized linear model
Family binomial
Link logit
χ²(3) 131.33
Pseudo-R² (Cragg-Uhler) 0.71
Pseudo-R² (McFadden) 0.58
AIC 101.61
BIC 114.80
Est. S.E. z val. p
(Intercept) 1.42 0.74 1.92 0.05
log1p_n_of_Climbs 1.59 0.28 5.74 0.00
log1p_n_of_Natura2000_areas 1.12 0.55 2.05 0.04
km_distance_to_Groningen -0.03 0.00 -6.46 0.00
Standard errors: MLE


The 2x2 confusion matrix helps assess classification model performance by comparing predicted values against actual values for a dataset. Confusion matrices can be used with any classifier algorithm, such as logistic regression models. It has been around for many years; lately it has been in vogue for machine learning. Therefore the question: what about the confusion matrix?
Schematic:



The R package “caret” calculates a cross-tabulation of observed and predicted classes with associated statistics. The functions requires that the factors have exactly the same levels.
The coding and its output:

library(caret) # Confusion matrix

bin_labels <- c("high", "low")
bin_values <- 1:0

conf_tab_data <- Municipality_data_2022_outer_quartiles %>% 
  dplyr::select(outer_quartiles, predicted_cycle_frequency) %>% 
  rename(observed_cycle_frequency = outer_quartiles) %>% 
  mutate(observed_cycle_frequency = factor(observed_cycle_frequency, levels=bin_values, labels = bin_labels),
         predicted_cycle_frequency = factor(ifelse(predicted_cycle_frequency == "very frequent", 1, 0), levels=bin_values, labels = bin_labels))

confusionMatrix(conf_tab_data$predicted_cycle_frequency, conf_tab_data$observed_cycle_frequency ,
                dnn = c("predicted cycle frequency", "observed cycle frequency"))
## Confusion Matrix and Statistics
## 
##                          observed cycle frequency
## predicted cycle frequency high low
##                      high   40   8
##                      low    10 142
##                                           
##                Accuracy : 0.91            
##                  95% CI : (0.8615, 0.9458)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.000000006942  
##                                           
##                   Kappa : 0.7568          
##                                           
##  Mcnemar's Test P-Value : 0.8137          
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.9467          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.9342          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2000          
##    Detection Prevalence : 0.2400          
##       Balanced Accuracy : 0.8733          
##                                           
##        'Positive' Class : high            
## 


Not enough statistics yet? Certainly not, there are still some missing, such as the false positive rate and odds ratio:6


# item                                         calculation                          outcome
# -----------------------------------------------------------------------------------------------
# accuracy (prop. cases classified correctly)  TP + TN / (TP + FP + FN + TN)        0.91
# sensitivity = recall                         TP / (TP + FN)                       0.80
# specificity                                  TN / (TN + FP)                       0.95 
# precision = true positive rate (TPR) = PPV   TP / (TP + FP)                       0.83
# true negatieve rate (TNR)            = NPV   TN / (TN + FN)                       0.93
# false positive rate (FPR)    1-TPR           FP / (FP + TP)                       0.17 
# false negative rate (FNR)    1-TNR           FN / (FN + TN)                       0.07
# balanced accuracy                            (sensitivity + specificity) / 2      0.875
# detection rate                               TP / (TP + FP + FN + TN)             0.20
# prevalence                                   (TP + FN) / (TP + FP + FN + TN)      0.25
# detection prevalence                         (TP + FP) / (TP + FP + FN + TN)      0.24
# odds ratio (OR)                              TP * TN / FP * FN                    71   p < 0.01
# relative risk (RR)                           TP * (FN + TN) / FN * (TP + FP)      13   p < 0.01
# -----------------------------------------------------------------------------------------------


Write the logit formula:


b0 <- coefficients(fit)[1] # intercept
b1 <- coefficients(fit)[2] # log1p_n_of_Climbs
b2 <- coefficients(fit)[3] # log1p_n_of_Natura2000_areas          
b3 <- coefficients(fit)[4] # km_distance_to_Groningen            

logit_model <- paste0("log[p/(1-p)] = ", round(b0,3), " + ", round(b1,3), "*log1p_n_of_Climbs + ", round(b2,3), "*log1p_n_of_Natura2000_areas + ",  round(b3,3), "*km_distance_to_Groningen")
logit_model
## [1] "log[p/(1-p)] = 1.423 + 1.593*log1p_n_of_Climbs + 1.119*log1p_n_of_Natura2000_areas + -0.029*km_distance_to_Groningen"



An R script to print seven examples with all relevant model input and output data:


logit_model_examples <-  t(Municipality_data_2022_outer_quartiles %>% 
  rename(observed_cycle_frequency = outer_quartiles) %>% 
  mutate(logit = (b0 + b1*log1p_n_of_Climbs + b2*log1p_n_of_Natura2000_areas + b3*km_distance_to_Groningen),
          probability = exp(logit)/(1 + exp(logit)),
          deviance_residual = round(fit$residuals,4),   # deviance residuals
          standardized_residual = round(resid(fit),4),   # standardized residuals
          predicted_cycle_frequency = ifelse(probability > 0.500, 1, 0),
          predicted_cycle_frequency_label = ifelse(probability > 0.500, 'high', "low"),
          observed_cycle_frequency_label = ifelse(observed_cycle_frequency == 1, 'high', "low"), 
          raw_residual = round(observed_cycle_frequency - probability,6)) %>% 
  dplyr::select(Municipality, n_cycled_segments, max_tries, Cycle_frequency_PC1, Cycle_frequency_PC1_class,
    observed_cycle_frequency, observed_cycle_frequency_label, n_of_Climbs, log1p_n_of_Climbs, n_of_Natura2000_areas, log1p_n_of_Natura2000_areas, km_distance_to_Groningen, logit, probability, raw_residual,deviance_residual,standardized_residual, predicted_cycle_frequency, predicted_cycle_frequency_label) %>% 
  filter(Municipality %in% c("Voorschoten", "Eindhoven", "Vaals", "Midden-Drenthe", "Arnhem", "Heerlen", "Sluis")))

colnames(logit_model_examples) <- unlist(logit_model_examples[1, ])

logit_model_examples <- logit_model_examples[2:nrow(logit_model_examples), ]
logit_model_examples <- format(logit_model_examples, justify = "left")

knitr::kable(logit_model_examples, format = "html", escape = FALSE) %>% 
  row_spec(1:nrow(logit_model_examples),font_size=10) %>% 
  row_spec(0,bold=TRUE) %>% 
  row_spec(5,bold=TRUE,color = "#006D2C") %>% 
  row_spec(8,bold=TRUE,color = "#006D2C") %>% 
  row_spec(10:11,bold=TRUE,color = "#006D2C") %>% 
  row_spec(14:16,bold=TRUE,color = "#FC9272") %>% 
  row_spec(12:13,color = "blue") %>% 
  row_spec(17,color = "blue") %>% 
  kableExtra::kable_styling(full_width = TRUE, font_size = 10) 
Midden-Drenthe Vaals Arnhem Sluis Heerlen Voorschoten Eindhoven
n_cycled_segments 420 300 286 190 10 6 0
max_tries 51 13 10 7 1 1 0
Cycle_frequency_PC1 -2.8817857 -1.9198604 -1.7480349 -1.3993882 0.5008414 0.6606654 1.7942023
Cycle_frequency_PC1_class 7 6 5 5 2 2 1
observed_cycle_frequency 1 1 1 1 0 0 0
observed_cycle_frequency_label high high high high low low low
n_of_Climbs 3 26 71 3 40 0 0
log1p_n_of_Climbs 1.386294 3.295837 4.276666 1.386294 3.713572 0.000000 0.000000
n_of_Natura2000_areas 7 1 2 4 3 0 1
log1p_n_of_Natura2000_areas 2.0794415 0.6931472 1.0986123 1.6094379 1.3862944 0.0000000 0.6931472
km_distance_to_Groningen 39.4 275.5 144.2 298.5 262.5 191.3 212.1
logit 4.8219794 -0.4986478 5.3053688 -3.1787330 1.3174332 -4.0953321 -3.9197905
probability 0.99201346 0.37785850 0.99505966 0.03997393 0.78875435 0.01637753 0.01945908
raw_residual 0.007987 0.622142 0.004940 0.960026 -0.788754 -0.016378 -0.019459
deviance_residual 1.0081 2.6465 1.0050 25.0163 -4.7338 -1.0167 -1.0198
standardized_residual 0.1266 1.3952 0.0995 2.5375 -1.7634 -0.1817 -0.1982
predicted_cycle_frequency 1 0 1 0 1 0 0
predicted_cycle_frequency_label high low high low high low low



The ROC curve and AUC are typical performance measurements for a binary classifier. The ROC is a curve generated by plotting sensitivity against the specificity at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5. An alternative is to plot True Positive Rate against False Positive Rate.



#                                                            odds ratio                     b
# exp(coefficients(fit)[1])   # intercept :                   4.151617    ln(4.151617)  =  1.423
# exp(coefficients(fit)[2])   # log1p_n_of_Climbs :           4.918343    ln(4.918343)  =  1.593
# exp(coefficients(fit)[3])   # log1p_n_of_Natura2000_areas : 3.061662    ln(3.061662)  =  1.119
# exp(coefficients(fit)[4])   # km_distance_to_Groningen :    0.9715631   ln(0.9715631) = -0.029

logits <- b0 + b1 * Municipality_data_2022_outer_quartiles$log1p_n_of_Climbs + 
  b2 * Municipality_data_2022_outer_quartiles$log1p_n_of_Natura2000_areas +
  b3 * Municipality_data_2022_outer_quartiles$km_distance_to_Groningen

probs <- exp(logits)/(1 + exp(logits))

Municipality_data_2022_outer_quartiles <- Municipality_data_2022_outer_quartiles %>% cbind(.,data.frame(logits)) %>%
  cbind(.,data.frame(probs)) %>%
  mutate(must_be_zero = probabilities - probs)

# check: outcome must be zeros (compare two ways to extract probs).
#summary(Municipality_data_2022_outer_quartiles$must_be_zero)

pROC_obj <- roc(Municipality_data_2022_outer_quartiles$outer_quartiles ~ Municipality_data_2022_outer_quartiles$probs,
                smoothed = TRUE, xlab="Specificity", ylab="Sensitivity",
                # arguments for ci
                ci=TRUE, ci.alpha=0.9, stratified=FALSE,
                # arguments for plot
                plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
                print.auc=TRUE, show.thres=TRUE)

sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
plot(sens.ci, type="bars")


Although the presented logit model has a high goodness-of fit, we are slightly less happy with some outliers (defined by high standardized residu values), see the summary below, aspecially the min and max values.


summary(resid(fit))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.32473 -0.28791 -0.18356 -0.07672 -0.04792  2.61121


Let’s take a closer look. We can track down outliers by filtering on all false positives (n=8, in green) and false negatives (n=10, in orange), see table below.
The positive thing about this list is that municipalities appear where, according to the predictors, it is advisable to cycle more often. Such as Heerlen and Woensdrecht (both with surprising elevation profile) and the Wadden Islands (relatively nearby).
On the other hand, there are a number of attractive cycling places on the false negative list that are not predicted as such by the model. In particular Sluis and Veere in Zeeland. And Vaals, remarkable because the highest NAP point in the Netherlands is located here. Who wouldn’t want to cycle there? But as far as Vaals is concerned, my concern is mainly emotional: the residual score is lower than 2 and in that respect may be be considered non-significant.


Municipality observed_cycle_freq n_of_Climbs n_of_Natura2000_areas km_distance_to_Groningen predicted_probability predicted_cycle_freq stand_residual
Terschelling low 3 3 88.4 0.93 high -2.325
Enschede low 3 3 112.4 0.87 high -2.037
Schiermonnikoog low 0 3 40.1 0.86 high -1.984
Heerlen low 40 3 262.5 0.79 high -1.763
Ameland low 0 3 63.5 0.76 high -1.685
Vlieland low 1 3 110.0 0.71 high -1.578
Venlo low 12 1 205.8 0.59 high -1.328
Woensdrecht low 21 2 254.4 0.56 high -1.280
Hoogeveen high 0 0 55.6 0.45 low 1.255
Vaals high 26 1 275.5 0.38 low 1.395
Den Helder high 0 3 129.4 0.32 low 1.512
Schouwen-Duiveland high 4 5 251.4 0.22 low 1.738
Nijkerk high 0 2 135.9 0.22 low 1.741
Barneveld high 0 1 134.3 0.16 low 1.922
Alkmaar high 0 1 140.2 0.14 low 1.996
Heumen high 1 0 169.3 0.09 low 2.212
Sluis high 3 4 298.5 0.04 low 2.538
Veere high 1 6 280.1 0.03 low 2.611



4. Logit model (spaMM)


We cannot escape the last question: what about the spatial autocorrelation in the logit GLM model?

The correlogram below is a visualization of Moran´s I as a function of distance (degrees) among municipalities. Distances with significant spatial autocorrelation (values of Moran´s I are significantly larger or smaller than expected) are showed in red points.

nbc <- 10
cor_r <- pgirmess::correlog(coords=as.data.frame(Municipality_data_2022_outer_quartiles)[,c("longitude", "latitude")],
                            z=fit$residuals,                              
                            method="Moran", nbclass=nbc)
cor_r
## Moran I statistic 
##       dist.class        coef                                                      p.value    n
##  [1,]  0.2320618  0.32477626 0.0000000000000000000000000000000000000000000000000008120005 5032
##  [2,]  0.6289218 -0.01934953 0.8485299660047946890273351527866907417774200439453125000000 8728
##  [3,]  1.0257813 -0.04526124 0.9986433247426138581914756287005729973316192626953125000000 7960
##  [4,]  1.4226408 -0.04974797 0.9993609245985559663694175469572655856609344482421875000000 7054
##  [5,]  1.8195003 -0.01914532 0.8108224217535907607512513095571193844079971313476562500000 5744
##  [6,]  2.2163597  0.23174165 0.0000000000000000000011684653551528313966331729112368975620 3432
##  [7,]  2.6132192 -0.10073651 0.9915429657446078737237371569790411740541458129882812500000 1358
##  [8,]  3.0100787 -0.04121701 0.6631331486291521803977389026840683072805404663085937500000  364
##  [9,]  3.4069382 -0.52128522 0.9999962705871252577338736955425702035427093505859375000000  106
## [10,]  3.8037976 -0.72057418 0.9968691411806792634919816009642090648412704467773437500000   22
plot(cor_r)      # statistically significant values (p<0.05) are plotted in red


After running the equivalent spaMM logit model, both the distance to Groningen and Natura 2000 appear to have been absorbed. Climbfinder remains as the only predictor. After removing both redundant predictors we get the following output:


library(ROI.plugin.glpk) # spaMM properly checks (quasi-)separation in binary regression problem.

fit_concise_spatial <- spaMM::fitme(outer_quartiles ~ log1p(n_of_Climbs) +
                                     Matern(1|latitude+longitude), data = as.data.frame(Municipality_data_2022_outer_quartiles), family=binomial())

coef_pvalue <- as.data.frame(summary(fit_concise_spatial)$beta_table) %>% 
  rename(Cond_SE = "Cond. SE", t_value = "t-value" ) %>% 
  mutate(lower = Estimate -  1.96 * Cond_SE) %>% 
  mutate(upper = Estimate +  1.96 * Cond_SE) %>% 
  mutate(p_value = round((2*pt(t_value, df=df.residual(fit_concise_spatial), lower.tail=FALSE)),3))
## formula: outer_quartiles ~ log1p(n_of_Climbs) + Matern(1 | latitude + 
##     longitude)
## Estimation of corrPars and lambda by ML (p_v approximation of logL).
## Estimation of fixed effects by ML (p_v approximation of logL).
## Estimation of lambda by 'outer' ML, maximizing logL.
## family: binomial( link = logit ) 
##  ------------ Fixed effects (beta) ------------
##                    Estimate Cond. SE t-value
## (Intercept)          -5.329    3.213  -1.659
## log1p(n_of_Climbs)    3.506    1.355   2.587
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##     1.nu    1.rho 
## 8.442754 9.591327 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    latitude .  :  70.71  
## # of obs: 200; # of groups: latitude ., 200 
##  ------------- Likelihood values  -------------
##                         logLik
## logL       (p_v(h)): -37.18975
coef_pvalue
##                     Estimate  Cond_SE   t_value       lower     upper p_value
## (Intercept)        -5.328746 3.212935 -1.658529 -11.6260994 0.9686065   1.901
## log1p(n_of_Climbs)  3.505711 1.355090  2.587069   0.8497351 6.1616860   0.010
spaMM::pseudoR2(fit_concise_spatial)
##        R2 
## 0.5289414
summary(resid(fit_concise_spatial))    # standardized residuals
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.1433996 -0.0408154 -0.0137502 -0.0134721 -0.0005351  1.4328161
nbc <- 10
cor_r <- pgirmess::correlog(coords=as.data.frame(Municipality_data_2022_outer_quartiles)[,c("longitude", "latitude")],
                            z = residuals(fit_concise_spatial),
                           method="Moran", nbclass=nbc)

plot(cor_r) # statistically significant values (p<0.05) are plotted in red


We pick some figures from the list above.
1. The Matérn function which has two parameters: ν (strength of association) and ρ (the speed of decay in the spatial effect, smaller values mean faster decay). The listing above shows ν = 8.44 (rather strong) and ρ = 9.59 (slow decay).
2. The residual range is limited (a maximum of 1.43).

The confusion matrix below reveals only one false negative municipality and none false positives:


## Confusion Matrix and Statistics
## 
##                          observed cycle frequency
## predicted cycle frequency high low
##                      high   49   0
##                      low     1 150
##                                              
##                Accuracy : 0.995              
##                  95% CI : (0.9725, 0.9999)   
##     No Information Rate : 0.75               
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9866             
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 0.9800             
##             Specificity : 1.0000             
##          Pos Pred Value : 1.0000             
##          Neg Pred Value : 0.9934             
##              Prevalence : 0.2500             
##          Detection Rate : 0.2450             
##    Detection Prevalence : 0.2450             
##       Balanced Accuracy : 0.9900             
##                                              
##        'Positive' Class : high               
## 


The model written in full:

b0 <- spaMM::fixef(fit_concise_spatial)[1] # intercept
b1 <- spaMM::fixef(fit_concise_spatial)[2] # log1p_n_of_Climbs

logit_model_spatial <- paste0("log[p/(1-p)] ~ ", round(b0,3), " + ", round(b1,3), " * log1p_n_of_Climbs + Matern(1 | latitude + longitude)")
logit_model_spatial
## [1] "log[p/(1-p)] ~ -5.329 + 3.506 * log1p_n_of_Climbs + Matern(1 | latitude + longitude)"


Only one is left out, the false negative:

Municipality observed_cycle_frequency n_of_Climbs log1p_n_of_Climbs predicted_probability predicted_cycle_frequency Matern(1|latitude+longitude) stand_residual
Alkmaar high 0 0 0.3583 low 4.745846 1.433


A comparison of the non-spatial and spatial variant of regression with a continuous dependent variable and dichotomous logit regression gives the same result: the spatial variants allow a better prediction of the observed data. Dropping the distance to Groningen as a fixed-effect parameter makes interpretation of the model more difficult. How can cycling frequency be explained by a model that does not know where you live? A spatial model based on similarity of neighboring geographical coordinates can do? Indeed, it does not matter where you live for the spatial model. There is no need to name a home base here.

Summary



In short: this publication is about personal cycling activities, exploring data, mapping, data reduction and a journey through R libraries.

After 10 years of uploading cycling rides to Strava, which produces a nice heatmap (Map 1), you would like to get to know the characteristics of your favorite and less favorite cycling places. The spatial unit in our analysis, the “place”, is the Dutch municipality (n = 345). Strava and public sources provide the necessary data. The variable to be explained is cycling frequency, an interval variable, operationalized as the grouped first component of a PCA on the number of cycled segments in a municipality, and the highest number of efforts on one segment.
Possible explanatory variables include: population, population density, number of associated Natura 2000 areas, an environment/nature score, water surface, forest, distance to Groningen, number of Climbfinder climbs and climbing points, maximum NAP altitude, standard deviation of the NAP altitude, number of protected buildings/objects, Strava’s elevation change meters and grade and finally elevation as the first component after PCA on the elevation variables. The data is visualized in a series of maps. Logarithmization has been applied to highly skewed variables.

At the bivariate level, cycling frequency has a significant linear relationship with each of the independent variables.
We present four parsimonious multivariate models, a couple with the cycle frequency interval variable mentioned in the previous paragraph as variable to be explained, and another pair with a dichotomized one based on outer ‘quartiles’, “high” (n = 50) versus “low cycle frequency” (n = 200). Each set has a spatial and non-spatial variant.

GLM model (pseudo-R2 = 0.47):

Cycle_frequency ~ 4.74 - 0.01 * log1p(km_distance_to_Groningen) + 0.63 * log1p(n_of_Climbs) + 0.40 * n_of_Natura2000_areas

Spatial (spaMM) model (pseudo-R2 = 0.74):

Cycle_frequency ~ 2.62 + Matern(1|latitude+longitude) + 0.37 * log1p(n_of_Climbs) + 0.33 * n_of_Natura2000_areas

GLM logit model (pseudo-R2 = 0.71, balanced accuracy = 0.87): (where p is the probability of high cycle frequency)

log[p/(1-p)] ~ 1.42 - 0.03 * km_distance_to_Groningen + 1.59 * log1p_n_of_Climbs + 1.12 * log1p_n_of_Natura2000_areas

Spatial (spaMM) logit model (pseudo-R2 = 0.53, balanced accuracy = 0.99)):

log[p/(1-p)] ~ -5.33 + Matern(1|latitude + longitude) + 3.51 * log1p_n_of_Climbs

The explanatory variables of cycle frequency according to the non-spatial models are: 1. distance to Groningen, 2. number of Climbfinder segments and 3. the number of associated Natura 2000 areas. The spaMM models absorb the distance variable in the Matern coefficient, which is not a fixed effect, but calculated for each municipality and correlated with the distance-to-Groningen variable. The GLM predictive power is much lower than its spaMM equivalents, which has been made clear by comparing maps (Map 12, 17, 18) and confusion matrices.

According to objective metrics spatial autocorrelation is present in our GLM models. The spatial model basically translates the expectation that closer municipalities are more correlated. The fact is that elevation profile, Natura 2000 areas and also cycle routes transcend the municipality as a spatial unit. The distance-to-Groningen variable is intrinsically spatially autocorrelated. So before starting to fit a spatial model, one should check that such complexity is warranted by the data. Even if there are spatial patterns in the data this does not mean that spatial regression models should be used.
Fortunately, we do not have to make any decisions regarding the choice of a spatial or non-spatial model. The idea behind it was more descriptive and explanatory, than use for future predictions. After correction for distance, either with a covariate or by Matérn calculations, Natura 2000 areas and Climbfinder segments - from a set of fifteen collected items - appear to be the most important underlying cycling triggers explaining my personal Strava heatmap.

Bibliography



Stack Overflow, the R library of libraries
https://stackoverflow.com/

Strava
https://www.strava.com

VeloViewer
https://www.veloviewer.com

Climbfinder
https://www.climbfinder.com

Lionel Hertzog: “Spatial regression in R part 1: spaMM vs glmmTMB”, DataScience+, Sep 2, 2019
https://datascienceplus.com/spatial-regression-in-r-part-1-spamm-vs-glmmtmb/

Hugh Sturrock, Adam Bennett, Francois Rerolle, Amanda Irish: “Applied Spatial Analysis for Public Health”, 2021
https://hughst.github.io/

François Rousset: “The spaMM package”
https://gitlab.mbb.univ-montp2.fr/francois/spamm-ref

François Rousset, Jean-Baptiste Ferdy, Alexandre Courtiol:
“spaMM: Mixed-Effect Models, with or without Spatial Random Effects”
https://rdrr.io/cran/spaMM/

Paula Moraga: “Spatial Statistics for Data Science: Theory and Practice with R”, 2023, Chapman & Hall/CRC Data Science Series.
https://www.paulamoraga.com/book-spatial/index.html



Footnotes


  1. Geographer, data-analist and cyclist↩︎

  2. Strava switched from Google Maps to MapBox↩︎

  3. Bonaire, Sint Eustatius and Saba are special municipalities of the Netherlands. Together these three special municipalities are called the Caribbean Netherlands, also known as the BES islands. Aruba, Curaçao and Sint Maarten are, like the Netherlands, independent countries.↩︎

  4. Permission has been granted to use Climbfinder data for publishing purposes.↩︎

  5. I have added some little mini Hondsrug hill-ups on my own initiative, although they are too flat to qualify for Climbfinder. Consider it a sign of frustration living in an area with no elevation difference.↩︎

  6. The relative risk (RR) is the risk of the event in an experimental group relative to that in a control group. The odds ratio (OR) is the odds of an event in an experimental group relative to that in a control group. An RR or OR of 1 indicates that the probability of high cycle frequency is comparable in the two groups. A value greater than 1 indicates increased risk.↩︎