Introduction

It will not be unknown to many cyclists: on some days the air feels ‘light’ during your bike ride and you can easily achieve higher speed. On other days, with ‘heavy’ air, you can’t seem to get through. A second experience is that the average ride speed decreases with increasing wind force. The two relevant weather characteristics in this context are air density and wind force. In this article we will investigate to what extent we can quantify the expected influences of both variables on the average speed of a bicycle trip in a raw, unpolished personal data set.

Data collection

We use hourly weather data in the Netherlands. The Royal Netherlands Meteorological Institute (KNMI) is the dutch national weather service. Their hourly weather data provide an overview of the observations made over the years at approximately 35 stations.
Since the city of Groningen is my start and finish location, we download the data from the nearest weather station: Eelde, 8.5 kilometers away as the crow flies.
Since 2008 - still in the pre-Strava era - I have been collecting some cycling data in an Excel file: date, distance of the ride, duration, speed, region and sometimes some comments. Sufficient data to make a selection of solo rides in the north of the country, i.e. the provinces of Groningen, Fryslân (excl. Wadden Islands) and Drenthe. The weather data can be linked to the cycling table with date as the key field.
All MTB and cyclo-cross activities, as well as rides shorter than 15 and longer than 165 km, and a few fun rides on my retro racing bike are not taken into account. By applying these selection criteria, the sample of bicycle rides is still far from homogeneous. The sample contains both time trials - where we ride from start to finish as quickly as possible - and interval training. The latter especially since the introduction of Strava, when we suddenly wanted to set fast times on segments. Furthermore, the sample also includes recovery rides and less motivated ones. Almost all of them are round rides, there and back in one day, usually in the afternoon.
Below is the R code to create the working file. The code also contains the programming of the formula for air density - this variable is not offered by the KNMI. Later in the article we present the formula in a neat way.
Winter fitness is partly maintained by indoor cycling activities - these are obviously not used in this article.
We use average weather data from each cycling day over an 8-hour period, from 11 a.m. to 7 p.m., because there is a good chance that cycling occurred during that time period. It should be clear, however, that in the working file the weather data may apply to a time interval when I was not on the bike.
Finally, it can be noted that this analysis covers a period of more than 15 years of cycling, where, taking into account the age of the author, it is the highest achievable to compensate for the natural loss of fitness as much as possible.

version[['version.string']]

## [1] "R version 4.3.2 (2023-10-31 ucrt)"

library(openxlsx)
library(tidyverse)

options(width = 2500, scipen = 99)  


# SOURCE: ROYAL NETHERLANDS METEOROLOGICAL INSTITUTE (KNMI)
# STN         LON(east)   LAT(north)  ALT(m)      NAME
# 280         6.585       53.125      5.20        Eelde
# HH        : Hour; HH 12 is from van 11.00 UT to 12.00 UT
# FH        : Hourly mean wind speed (in 0.1 m/s)
# T         : Temperature (in 0.1 degrees Celsius) at 1.50 m at the time of observation
# P         : Air pressure (in 0.1 hPa) reduced to mean sea level; at the time of observation
# U         : Relative atmospheric humidity (in percents) at 1.50 m at the time of observation       

# download.file("https://www.daggegevens.knmi.nl/klimatologie/uurgegevens", "data/knmi uurgegevens 2008-2017.txt", "curl",
#               extra = "-d stns=280&vars=FH:T:P:U&start=2008010112&end=2018010120&fmt=csv") # KNMI requires TLS v1.2

# download.file("https://www.daggegevens.knmi.nl/klimatologie/uurgegevens", "data/knmi uurgegevens 2018-2027.txt", "curl",
#               extra = "-d stns=280&vars=FH:T:P:U&start=2018010112&end=2027010120&fmt=csv") # KNMI requires TLS v1.2

knmi_hourly_data_2008_2017 <- read_csv("data/knmi uurgegevens 2008-2017.txt", col_names = FALSE, comment = "#")
knmi_hourly_data_2018_2027 <- read_csv("data/knmi uurgegevens 2018-2027.txt", col_names = FALSE, comment = "#")

knmi_hourly_data <- rbind(knmi_hourly_data_2008_2017, knmi_hourly_data_2018_2027) %>% 
  mutate(Date = ymd(X2), Hour = X3, Windforce_ms = X4 / 10, Temperature_C = X5 / 10, Air_pressure_hPa = X6 / 10, Relative_humidity = X7) %>%
  select( - c(X1:X7)) %>% 
  group_by(Date) %>% 
  summarise(Windforce_ms = round(max(Windforce_ms),0),
            Temperature_C = round(mean(Temperature_C),1),
            Air_pressure_hPa = round(mean(Air_pressure_hPa),1),
            Relative_humidity = round(mean(Relative_humidity),0),) %>% 
  mutate(Air_pressure_area = ifelse(Air_pressure_hPa < 1010, "L", ifelse(Air_pressure_hPa > 1020, "H", "normal")),
         Windforce_Bft = cut(Windforce_ms, breaks = c(0.3, 1.6, 3.4, 5.5, 7.9, 10.8, 13.9, 17.2, 20.8, 24.5, 28.5, 32.6),
                                           labels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))) %>% 
  mutate(Windforce_kmh = Windforce_ms * 3.6)

region <- read_csv("data/regio thesaurus.txt", 
                   locale = locale(encoding = "WINDOWS-1252")) %>% 
  filter(regio == "Noordelijke prov.") 

regional_cycle_activities <- read.xlsx("data/sport.xlsx", sheet="sport", detectDates = TRUE) %>% 
  left_join(region, by = "route") %>%
  left_join(knmi_hourly_data, by = c("datum"="Date")) %>% 
  replace_na(list(fiets = '', route = '', opmerking = '')) %>%
  mutate(Windforce_Bft = ifelse(row_number() == 4785, 7, ifelse(as.numeric(Windforce_Bft) == 1, 2, Windforce_Bft))) %>%  # 1 ride Bft 9 and 4 rides Bft 1
  mutate(Windforce_ms = ifelse(row_number() == 4785, 17, ifelse(Windforce_ms == 1, 2, Windforce_ms))) %>%
  mutate(Windforce_Bft = factor(Windforce_Bft, levels = c(2:7), labels = c("2", "3", "4", "5", "6", "7")))  %>% 
  filter (!opmerking %in% c("censuur", "Tacx Boost", "VR") &
            !fiets %in% c("Lapierre","cross","Casati","EM", "Bottecchia") & 
            regio == 'Noordelijke prov.' & sport == "wielrennen" & (trunc(km) >= 15 & trunc(km) <= 165))  %>%
  rename (Date_cycle_activity = datum, Speed_kmh = "km/uur", Distance_km = km) %>% 
  mutate(Spring_summer = month(Date_cycle_activity) %in% c(5,6,7,8)) %>% 
  mutate( p1 = 0.61078 * exp((17.27 * Temperature_C)/(Temperature_C + 237.3)),   # Calculate the saturation vapor pressure 
          pv = p1 * Relative_humidity,                                       # Find the actual vapor pressure by multiplying the saturation vapor pressure by the relative humidity
          pd = Air_pressure_hPa * 100 - pv,                                                                           # 1 hPa = 1 mbar = 100 Pa
          Air_density_kg_m3 = (pd / (287.058 * (Temperature_C + 273.15))) + (pv / (461.495 * (Temperature_C + 273.15)))) %>%    # Kelvin = Celsius + 273.15 %>% 
  mutate(Absolute_humidity_g_m3 = (p1 * Relative_humidity * 100 / (461.495 * (Temperature_C + 273.15)))*100) %>%   # Relative humidity in %
  select(Date_cycle_activity, Speed_kmh, Distance_km, Temperature_C, Windforce_Bft, Windforce_ms, Relative_humidity, Absolute_humidity_g_m3, Air_pressure_hPa, Air_pressure_area, Air_density_kg_m3) %>%
  mutate(Drag_force_N = 0.5 * Air_density_kg_m3 * (Speed_kmh/3.6)^2 * 0.5 * 0.9) %>%   # speed required in m/s
  filter(!is.na(Windforce_Bft)) %>% mutate(ID = row_number())

tail(regional_cycle_activities)

##      Date_cycle_activity Speed_kmh Distance_km Temperature_C Windforce_Bft Windforce_ms Relative_humidity Absolute_humidity_g_m3 Air_pressure_hPa Air_pressure_area Air_density_kg_m3 Drag_force_N   ID
## 1764          2024-07-16  29.53636       75.81          18.6             4            6                77               12.25609           1010.1            normal          1.205357     18.25606 1764
## 1765          2024-08-06  34.32308       44.62          25.9             2            3                50               12.10604           1009.8                 L          1.175575     24.04364 1765
## 1766          2024-08-06  29.00513       56.56          25.9             2            3                50               12.10604           1009.8                 L          1.175575     17.17028 1766
## 1767          2024-08-12  28.46809      111.50          26.6             3            5                53               13.34226           1011.0            normal          1.174147     16.52025 1767
## 1768          2024-08-13  27.96397      126.77          28.8             4            6                67               19.03756           1008.7                 L          1.162587     15.78340 1768
## 1769          2024-08-15  30.62333       91.87          22.8             4            7                71               14.42848           1013.8            normal          1.192463     19.41456 1769

The script above ends with a list of the most recent trips in the working file and its variables.

I. Cycle speed and wind force

Let’s look at the relationship between speed and wind force, first. We do this by successively creating a density graph, a box plot, a curve and a regression model.

Two quantities to measure wind force are Beaufort and meters per second. A cross tabulation of both quantities in our working file is presented below. Please note that these are mean wind speeds, averaged over 60 minutes.² The wind speeds shown in the table below and that you hear quoted in weather reports are always measured at 10 meters above the ground. In general, they do not reflect exactly the wind speeds that you would feel on the ground.
Meters per second is an interval scale - the difference between e.g. 5 and 6 m/s is identical to the difference between 30 and 31 m/s. Beaufort does not have a constant pace of change.
For the purposes of this article, the Beaufort scale was chosen because of my familiarity with it after years of using weather apps.³ A second reason to use Beaufort is that it has fewer classes and therefore more observations per class.

table(regional_cycle_activities$Windforce_Bft, regional_cycle_activities$Windforce_ms, dnn = c("in Bft", "Windforce in meters per second"))

##       Windforce in meters per second
## in Bft   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##      2  36 156   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##      3   0   0 308 364   0   0   0   0   0   0   0   0   0   0   0   0
##      4   0   0   0   0 323 242   0   0   0   0   0   0   0   0   0   0
##      5   0   0   0   0   0   0 153  74  49   0   0   0   0   0   0   0
##      6   0   0   0   0   0   0   0   0   0  24  13  14   0   0   0   0
##      7   0   0   0   0   0   0   0   0   0   0   0   0   5   5   1   2

1. Density plot

A density plot is a representation of the distribution of a numeric variable. It is a smoothed version of the histogram; a smooth curve shows the distribution of the data in a more continuous way. Histograms show counts, density plots show proportions. A multi density chart is a density chart where several groups are represented. It allows to compare their distribution. The issue with this kind of chart is that it gets easily cluttered: groups overlap each other and the figure gets unreadable. That’s why we use transparency in the color scheme.
Figure 1 is a multi density chart of the observed distribution of the average bicycle speed, per Beaufort wind force number. It is very clear that as the wind force decreases, the distribution of speed shifts slightly further to the right, indicating an increasing average ride speed.
A normal distribution speed curve of the complete sample (n = 1769) is superimposed. This can be used as a reference and to check to what extent a possible necessary normality applies.

windcolors = c("#99FF00", "lightgreen", "green", "yellow", "orange", "red")

ggplot(regional_cycle_activities, aes(x = Speed_kmh, fill = Windforce_Bft)) +
  scale_fill_manual(values = windcolors) +
  geom_density(alpha = 0.2) +
  stat_function(fun = dnorm, colour="red", linewidth = 1.5, args = list(mean =mean(regional_cycle_activities$Speed_kmh), sd = sd(regional_cycle_activities$Speed_kmh))) +
  labs(title = "Density plots of average cycle ride speed grouped by windforce",
       subtitle = paste0("compared to the normal distribution of the complete sample (n = ", nrow(regional_cycle_activities),")"),
       caption = paste0("All my solo rides 2008 - ", year(today()), " in the north of the Netherlands. Reference date: ", today()),
       fill='windforce (Bft)')  + 
  ylab("density") + xlab("Ride average speed (km/h)") +
  geom_segment(aes(x=22, xend=23.5, y=0.18, yend=0.18), color = 'red', linewidth = 1.5) +
  annotate("text", x = 26, y = 0.18, label = "sample normal distribution", color = 'red')

Figure 1. Density plot speed and windforce List of figures

2. Boxplot

A boxplot is a graph that gives a visual indication of how a variable’s 25th percentile, 50th percentile (the median), 75th percentile, minimum, maximum and outlier values are spread out in different groups. The median of a variable is the value separating the higher half from the lower half of a variable.

Figure 2. How to read a boxplot (image: Michael Galarnyk) List of figures

Below is the R code and the resulting boxplot.

library(rstatix) # Kruskal-Wallis test, Anova, kruskal_effsize

res.kruskal <- regional_cycle_activities %>% kruskal_test(Speed_kmh ~ Windforce_Bft)

# The first input variables the function stat_box_data is the y variable, the other input is the upper_limit. 
# It basically is the value to position the text on top of the boxes. For each plot it is necessary to adapt upper_limit and vjust.

stat_box_data <- function(y, upper_limit = max(as.numeric(regional_cycle_activities$Windforce_Bft)) * 4.4) {
  return( 
    data.frame(
      y = 0.95 * as.numeric(upper_limit),
      label = paste('n of rides:', length(y), '\n',
                    'mean km/h:', round(mean(y),1), '\n',
                    'median km/h:', round(median(y),1), '\n')
    )
  )
}

ggplot(regional_cycle_activities, aes(Windforce_Bft,  y = Speed_kmh, fill=as.numeric(Windforce_Bft))) +
  geom_boxplot() +
  theme(legend.position="none",
  plot.title = element_text(size=17)) +
  labs(title = "Relationship cycle trip speed and Beaufort windforce in the northern part of the Netherlands",
       subtitle = get_test_label(res.kruskal, detailed = TRUE),
       caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today())) + 
  xlab("windforce (Bft)") + ylab("Cycle trip speed km/h") +
  scale_fill_gradient(high=hcl(15,100,75), low=hcl(195,100,75)) + 
  stat_summary(
    fun.data = stat_box_data, 
    geom = "text",  hjust = 0.5, vjust = 1.3
  )

Figure 3. Boxplot Beaufort windforce and ride speed List of figures

The Kruskal–Wallis test is a non-parametric method for testing whether samples originate from the same distribution.⁴
After applying the Kruskal–Wallis test, the null hypothesis that there is no relationship between wind force and median ride cycling speed is rejected as the p-value is less than the significance level 0.05. From the output of the Kruskal-Wallis test, we know that there is a significant difference in median ride speed between wind force groups, but we don’t know which pairs of groups are different. It’s possible to use the pairwise Wilcoxon test to calculate pairwise comparisons between group levels with corrections for multiple testing. The pairwise comparison shows that all are significantly different (p < 0.05), except 2 compared to 3, as shown in the table after the following script.

pc <- pairwise.wilcox.test(regional_cycle_activities$Speed_kmh, as.numeric(regional_cycle_activities$Windforce_Bft), p.adjust.method = "BH")

pc_df <- as.data.frame(round(pc$p.value,9)) %>% 
  replace(is.na(.), "-")

colnames(pc_df) <-  c("2", "3", "4", "5", "6") 
rownames(pc_df) <-  c("3", "4", "5", "6", "7")

pc_df

##             2           3           4           5           6
## 3 0.159849100           -           -           -           -
## 4 0.000624398 0.001039662           -           -           -
## 5 0.000000127 0.000000002 0.000538863           -           -
## 6 0.000000008 0.000000002  0.00000022 0.000290843           -
## 7 0.000000611 0.000000237 0.000000651 0.000004819 0.009895642

3. (Non-)linear relation

In addition to the significant relationship between wind force and the average speed of the entire trip, Figure 3 reveals a non-linear decrease. With every increase in Beaufort, the speed decreases disproportionately. Not surprising, because the Beaufort scale is not a linear scale. A twice as high wind force does not correspond to double the wind speed.
The script below calculates the absolute and relative decrease in the median trip speed per Beaufort. Graphically shown in Figure 4.

decreasing_speed_table <- regional_cycle_activities %>% 
  group_by(Windforce_Bft) %>% 
  summarise(n_of_rides  = n(), median_kmh = round(median(Speed_kmh),5)) %>% 
  mutate(change_in_median = round(median_kmh - lag(median_kmh),5),
         across(where(is.numeric), ~replace(., is.na(.), 0)),
         cum_change_in_median = cumsum(change_in_median),
         cum_perc_change_in_median = round((1 - median_kmh / median_kmh[1]) * 100,5))
         
head(decreasing_speed_table)

## # A tibble: 6 × 6
##   Windforce_Bft n_of_rides median_kmh change_in_median cum_change_in_median cum_perc_change_in_median
##   <fct>              <dbl>      <dbl>            <dbl>                <dbl>                     <dbl>
## 1 2                    192       31.7            0                    0                         0    
## 2 3                    672       31.4           -0.307               -0.307                     0.967
## 3 4                    565       31.0           -0.445               -0.752                     2.37 
## 4 5                    276       30.3           -0.732               -1.48                      4.68 
## 5 6                     51       29.0           -1.27                -2.75                      8.67 
## 6 7                     13       27.5           -1.51                -4.26                     13.4

scale_factor <- max(-decreasing_speed_table$cum_change_in_median) / max(decreasing_speed_table$cum_perc_change_in_median)

decreasing_speed_model <- lm(formula = cum_change_in_median ~ poly(as.numeric(Windforce_Bft), 2), data = decreasing_speed_table)
decreasing_speed_model$fitted.values

##           1           2           3           4           5           6 
## -0.04866357 -0.22293100 -0.72964629 -1.56880943 -2.74042043 -4.24447929

ggplot(decreasing_speed_table, aes(x = Windforce_Bft, group=1)) +
  geom_line(aes(y=cum_change_in_median), color = "blue", linewidth = 1.1) +
  stat_smooth(aes(y=cum_change_in_median), method = lm, se = FALSE, color = "darkgreen", linewidth = 1.0) +
  stat_smooth(aes(y=cum_change_in_median), method = lm, formula = y ~ poly(x, 2), se = FALSE, color = "red", linewidth = 1.5) +
  scale_y_continuous(name="change speed in km/h (≈ median cycle speed)", breaks = seq(-4,0,1), labels = c("-4 (≈28)", "-3 (≈29)", "-2 (≈30)", "-1 (≈31)", "0 (≈32)"), sec.axis=sec_axis(~./scale_factor, breaks = seq(-14,0,1), name="% decrease from reference wind force 2")) +
  theme(
    axis.title.y.left=element_text(color = "black"),
    axis.text.y.left=element_text(color = "black"),
    axis.title.y.right=element_text(color = "black"),
    axis.text.y.right=element_text(color = "black")) +
  labs(title = "Decreasing median trip speed (in km/h and as %) with increasing wind force (Bft)",
       subtitle = paste0("in a group of ",  nrow(regional_cycle_activities), " rides in the northern part of the Netherlands"),
       caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today())) + 
  xlab("wind force (Bft)") +
  geom_segment(aes(x=4.6, xend=4.95, y=-0.2, yend=-0.2), color = 'red', linewidth = 1.4) +
  annotate("text", x = 5.5, y = -0.2, label = "y ~ poly(x, 2)", color = 'red') +
  geom_segment(aes(x=4.6, xend=4.95, y=-0.4, yend=-0.4), color = 'blue', linewidth = 1.4) +
  annotate("text", x = 5.5, y = -0.4, label = "observed      ", color = 'blue') +
  geom_segment(aes(x=4.6, xend=4.95, y=-0.6, yend=-0.6), color = 'darkgreen', linewidth = 1.4) +
  annotate("text", x = 5.5, y = -0.6, label = "linear           ", color = 'darkgreen') +
  annotate("segment", x = 1, xend = 2, y = decreasing_speed_model$fitted.values[2], yend = decreasing_speed_model$fitted.values[2], color = "red") +
  annotate("segment", x = 1, xend = 3, y = decreasing_speed_model$fitted.values[3], yend = decreasing_speed_model$fitted.values[3], color = "red") +
  annotate("segment", x = 1, xend = 4, y = decreasing_speed_model$fitted.values[4], yend = decreasing_speed_model$fitted.values[4], color = "red") +
  annotate("segment", x = 1, xend = 5, y = decreasing_speed_model$fitted.values[5], yend = decreasing_speed_model$fitted.values[5], color = "red") +
  annotate("segment", x = 1, xend = 6, y = decreasing_speed_model$fitted.values[6], yend = decreasing_speed_model$fitted.values[6], color = "red") +
  annotate("segment", x = 2, xend = Inf, y = decreasing_speed_model$fitted.values[2], yend = decreasing_speed_model$fitted.values[2], color = "red", linetype='dotted') +
  annotate("segment", x = 3, xend = Inf, y = decreasing_speed_model$fitted.values[3], yend = decreasing_speed_model$fitted.values[3], color = "red", linetype='dotted') +
  annotate("segment", x = 4, xend = Inf, y = decreasing_speed_model$fitted.values[4], yend = decreasing_speed_model$fitted.values[4], color = "red", linetype='dotted') +
  annotate("segment", x = 5, xend = Inf, y = decreasing_speed_model$fitted.values[5], yend = decreasing_speed_model$fitted.values[5], color = "red", linetype='dotted') +
  annotate("segment", x = 6, xend = Inf, y = decreasing_speed_model$fitted.values[6], yend = decreasing_speed_model$fitted.values[6], color = "red", linetype='dotted') +
  geom_point(aes(x = 1, y = decreasing_speed_model$fitted.values[1]), colour="red", size = 3.5) +
  geom_point(aes(x = 2, y = decreasing_speed_model$fitted.values[2]), colour="red", size = 3.5) +
  geom_point(aes(x = 3, y = decreasing_speed_model$fitted.values[3]), colour="red", size = 3.5) +
  geom_point(aes(x = 4, y = decreasing_speed_model$fitted.values[4]), colour="red", size = 3.5) +
  geom_point(aes(x = 5, y = decreasing_speed_model$fitted.values[5]), colour="red", size = 3.5) +
  geom_point(aes(x = 6, y = decreasing_speed_model$fitted.values[6]), colour="red", size = 3.5) +
  annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[2], label = round(decreasing_speed_model$fitted.values[2],1), color = 'red') +
  annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[3], label = round(decreasing_speed_model$fitted.values[3],1), color = 'red') +
  annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[4], label = round(decreasing_speed_model$fitted.values[4],1), color = 'red') +
  annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[5], label = round(decreasing_speed_model$fitted.values[5],1), color = 'red') +
  annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[6], label = round(decreasing_speed_model$fitted.values[6],1), color = 'red')

Figure 4. Non-linear relation median ride speed and Beaufort wind force List of figures

The left y-axis of Figure 4 represents the median speed of the ride and the descent of this median. In blue the observed scores, in red the polynomial smoothed ones. The right y-axis contains the percentage decrease. We conclude that the trip median speed at wind force 7 (less than 28 km/h) has decreased by 4 km/hour compared to wind force 2 (32 km/h), which is a decrease of more than 13%. The speed reduction per Beaufort increases progressively.
Finally, the question arises whether the relationship between ride speed and wind speed measured in meters per second is also non-linear. We answer that question by means of \(R^2\) model comparison.
First we collect some goodness-of-fit scores for four model variants with ride average speed as dependent variable and wind direction as predictor, measured in the following ways: 1. wind force in Bft; 2. Bft polynomial; 3. wind force in m/s; 4. m/s polynomial. Script plus output:

model1 <- lm(Speed_kmh ~ as.numeric(Windforce_Bft), data = regional_cycle_activities)
model2 <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2), data = regional_cycle_activities)
model3 <- lm(Speed_kmh ~ as.numeric(Windforce_ms), data = regional_cycle_activities %>% filter(as.numeric(Windforce_ms) >= 2 & as.numeric(Windforce_ms) <= 12))
model4 <- lm(Speed_kmh ~ poly(as.numeric(Windforce_ms), 2), data = regional_cycle_activities %>% filter(as.numeric(Windforce_ms) >= 2 & as.numeric(Windforce_ms) <= 12))

model_AIC <- as.data.frame(c(AIC(model1),AIC(model2),AIC(model3),AIC(model4)))

model_RSS <- as.data.frame(c(sum(resid(model1)^2),sum(resid(model2)^2),sum(resid(model3)^2),sum(resid(model4)^2)))

model_r_squared <- as.data.frame(c(round(summary(model1)$r.squared,3),
                             round(summary(model2)$r.squared,3),
                             round(summary(model3)$r.squared,3),
                             round(summary(model4)$r.squared,3)))

model_p <- as.data.frame(c(anova(model1)$'Pr(>F)'[1],anova(model2)$'Pr(>F)'[1],anova(model3)$'Pr(>F)'[1],anova(model4)$'Pr(>F)'[1]))

model_variables <- as.data.frame(c("Speed ~ Windforce (Bft)",
                                    "Speed ~ polynomial(Windforce (Bft))",
                                    "Speed ~ Windforce (m/s)",
                                    "Speed ~ polynomial(Windforce (m/s))"))

models_compared <- cbind(model_variables, model_AIC, model_RSS, model_r_squared, model_p)

get_p_value_class <- function(pvalue){                # Function to get parameter p-value classification.
  p_class = cut(pvalue, breaks = c(0,0.001,0.01,0.05,1),
                                 labels = c("p < .001", "p < .01", "p < .05", "n.s."))}         

names(models_compared) <- c("model", "AIC", "RSS", "R²", "significance")

models_compared <- models_compared %>% mutate(significance = get_p_value_class(significance))

models_compared

##                                 model      AIC      RSS    R² significance
## 1             Speed ~ Windforce (Bft) 7978.051 9384.569 0.059     p < .001
## 2 Speed ~ polynomial(Windforce (Bft)) 7958.462 9270.737 0.071     p < .001
## 3             Speed ~ Windforce (m/s) 7825.121 9077.016 0.044     p < .001
## 4 Speed ~ polynomial(Windforce (m/s)) 7824.703 9064.421 0.045     p < .001

Next we test the linear with its polynomial counterpart. The output below shows that the speed and wind force relationship in Beaufort is non-linear and therefore requires a polynomial component. The non-linear model is significantly better in our dataset (p < .001). However, if we represent the wind force in meters per second, a linear model is sufficient. An extra component doesn’t add anything significant.

compare_12 <- anova(model1, model2)$"Pr(>F)"[2]
compare_34 <- anova(model3, model4)$"Pr(>F)"[2]

significance <- as.data.frame(c(compare_12,compare_34)) %>% mutate(significance = get_p_value_class(.[[1]])) %>% select(significance)

model_speed_predictors <- as.data.frame(c("linear Beaufort wind force model versus polynomial Bft",
                                "linear wind force model in m/s versus polynomial m/s"))


bft_ms_compared <- cbind(model_speed_predictors, significance)

names(bft_ms_compared) <- c("models compared", "significance")

bft_ms_compared

##                                          models compared significance
## 1 linear Beaufort wind force model versus polynomial Bft     p < .001
## 2   linear wind force model in m/s versus polynomial m/s         n.s.

4. Model wind force

The following regression model summarizes the relationship between ride average speed and Beaufort wind force in my data.

Average ride speed = 30.7975693 + -24.300838 * Windforce + -10.6692425 * Windforce\(^2\)

\(R^2\) = 0.0706

II. Cycle speed and air density

In general terms, density is the mass of anything – including air – divided by the volume it occupies. In the metric system air density is measured in terms of kilograms per cubic meter. The air’s density depends on its temperature, its pressure and how much water vapor is in the air (appendix Figure 7d,e,i). Air pressure is inversely associated with altitude, but since my home cycling area is at sea level, we can suffice with the KNMI air pressure. If the analyzed location is at a high altitude, you can apply a specific formula to establish a more accurate value for air pressure.⁵
At 1013.25 hPa sea level pressure and 15 °C, air has a density of approximately 1.225 \(kg/m^3\).⁶
Other things being equal (most notably the pressure and humidity), hotter air is less dense than cooler air and will thus rise while cooler air tends to fall.
Humid air is lighter, or less dense, than dry air at the same temperature and pressure. Explanation: replacing nitrogen and oxygen with water vapor decreases the weight of the air in the cubic meter; that is, it’s density decreases. A fixed volume of gas, say one cubic meter, at the same temperature and pressure, has always the same number of molecules no matter what gas is in it.⁷ Compared to the differences made by temperature and air pressure, humidity has a small effect on the air’s density. Air can only contain a certain amount of water vapor: the lower the temperature; the smaller the quantity (appendix Figure 7b). When the air is humid, heat feels more oppressive than when the air is dry. This is because cooling through evaporation is made more difficult at high humidity. Due to a combination of high humidity and high temperatures, it feels stuffy and there is less enthusiasm for strenuous exercise. The body temperature and heart rate increase rapidly in such conditions.⁸
There is also a second adverse effect of humid air. With high humidity, there is on average more water vapor in the air, up to approximately 20 grams per m³ at a temperature of 30 °C (appendix Figure 7b). This water vapor displaces other gases, resulting in less oxygen per m³ of air in sultry, warm weather, causing the body to inhale less oxygen with the same amount of effort.
High air pressure means ‘heavy’ air, but there is no correlation low air pressure - ‘light’ air. Low air pressure can indicate either ‘heavy’ or ‘light’ air (appendix Figure 7i).

Air density is a significant factor in many disciplines. Just to mention a few examples: power from a wind turbine and range of a baseball.
The power available in the wind is directly proportional to air density. As air density increases the available power also increases, lower air density results in lower power from a wind turbine.⁹
When the air density is lower, baseballs can travel further. The aerodynamic ‘drag’ force gets smaller. Conversely, cold temperatures can reduce the distance a ball travels due to increased air density. But at the same time, as air density gets smaller, the Magnus force¹⁰ also gets smaller, which means that the ball will not be held aloft as long and will therefore not go as far. These two effects are in opposite directions. Simulation shows that the change in the drag force affects the trajectory of the ball more than the change in the Magnus force. Therefore, as air density goes down, the range of a potential home run ball increases.¹¹

The importance of air density for cyclists is the following: more dense, or ‘heavier’ air will slow down cyclists moving through it more because the cyclist has to, in effect, shove aside more or heavier molecules. Such air resistance, ‘drag’, is proportional to the density of the air though which a cyclist is pushing both body and bike. If the air density decreases by a certain percentage, the drag drops by the same percentage, assuming that other relevant factors remain the same.
A 5% reduction in air density implies that, other things being equal, the cube of your speed can increase by 5%, resulting in a 1.6% reduction in time. The difference is a second per minute.¹²
Let’s check the following: a cyclist completing a 30 km lap in one hour, averaging 30 km/h, would finish in 59 minutes in 5% less dense air, ceteris paribus.

\(\Delta_{speed\space km/h} = \frac{\sqrt[3]{30^{3}\cdot 1.05} - 30}{30}\cdot100 = \frac{0.49189}{30}\cdot100 = 1.63963\space \%\)

\(duration_{minutes} = \frac{30\space km}{30.49189} \cdot 60 = 59\)

Low air density not only implies speed gains, but also a lower oxygen concentration (called partial pressure of oxygen). Suppose you are cycling in the mountains at an altitude of 3,000 meters. At that location the oxygen concentration is approximately 70% of that at sea level. The air you breathe therefore contains significantly less oxygen which of course results in a reduced maximum performance. Studies show that V̇O₂max decreases as altitude increases. A small decrease in V̇O₂max, compared to values at sea level, is already observed at an altitude of 580 meters. As altitude increases, the amount of oxygen that can be absorbed into the blood decreases, causing V̇O₂max to further decrease by an average of about 15% at 3,000 meters relative to sea level.¹³ After listing the formulas in sections II.1 and II.2, we will present this data in a table.
It is expected that a lower oxygen content due to low air density does not play any role in my data. The probably minimal loss of power due to less oxygen is more than compensated by the reduced air resistance.

1. Air density formula

The density of humid air is found by:¹⁴

\(\displaystyle \rho\,_{humid\,air} = \frac{p_d}{R_dT} + \frac{p_v}{R_vT}\)

where:

\(\displaystyle \rho\,_{humid\,air}\) density of the humid air (\(kg/m^3\))

\(\displaystyle p_d\) partial pressure of dry air (Pa)

\(\displaystyle R_d\) specific gas constant for dry air, 287.058 J/(kg·K)

\(\displaystyle T\) temperature (K)

\(\displaystyle p_v\) pressure of water vapor (Pa)

\(\displaystyle R_v\) specific gas constant for water vapor, 461.495 J/(kg·K)

The vapor pressure of water may be calculated from the saturation vapor pressure and relative humidity:

\(\displaystyle p_v = \phi p_{sat}\)

where:

\(\displaystyle p_v\) pressure of water vapor (Pa)

\(\displaystyle \phi\) relative humidity (range 0.0 – 1.0)

\(\displaystyle p_{sat}\) saturation vapor pressure (kPa)

The saturation vapor pressure of water at any given temperature is the vapor pressure when relative humidity is 100%. One formula used to find the saturation vapor pressure is:

\(\displaystyle p_{sat} = 0.61078 \, · exp \left(\frac{17.27T}{T + 237.3}\right)\)

where:

\(\displaystyle p_{sat}\) saturation vapor pressure (kPa)

\(\displaystyle T\) temperature (°C)

The partial pressure of dry air \(\displaystyle p_d\) is found considering partial pressure, resulting in:

\(\displaystyle p_d = p - p_v\)

where \(\displaystyle p\) denotes the observed absolute pressure.

Below is my R script for the air density formula:

mutate(p1 = 0.61078 * exp((17.27 * Temperature_C)/(Temperature_C + 237.3)),  # Calculate the saturation vapor pressure 
       pv = p1 * Relative_humidity,   # Find the actual vapor pressure by multiplying the saturation vapor pressure by the relative humidity
       pd = Air_pressure_hPa * 100 - pv,     # 1 hPa = 1 mbar = 100 Pa
       Air_density_kg_m3 = (pd / (287.058 * (Temperature_C + 273.15))) + (pv / (461.495 * (Temperature_C + 273.15)))) %>%    # Kelvin = Celsius + 273.15

2. Air pressure at altitude

To calculate air pressure at altitude it is necessary to use the barometric formula:¹⁵

\(\displaystyle p = p_0 \, · exp \left(\frac{−gM(h−h_0)}{RT}\right)\)

where:

\(\displaystyle p\)    pressure in hPa at altitude h
\(\displaystyle h\)    altitude in m
\(\displaystyle p_0\) pressure at sea level (1013.25 hPa)
\(\displaystyle h_0\) altitude at sea level (0 m)
\(\displaystyle T\)    temperature in Kelvins at altitude h
\(\displaystyle M\) molar mass of air, 0.0289644 kg/mol
\(\displaystyle R\)    ideal universal gas constant, 8.31432 N · m/(mol · K)
\(\displaystyle g\)     acceleration due to gravity, 9.80665 (\(m/s\))

R script:

pressure_at_altitude_h <- 1013.25 * exp((-9.80665 * 0.0289644 * Altitude)/((8.31432 * (Temperature_C + 273.15))))

By applying the above formulas we can determine the air density at different altitudes, for example at 15 degrees Celsius and a humidity of 70%, see output below. The V̇O₂max decline column is taken from Charles Fulco et al.¹⁶ It is an estimate based on various studies.

##                                Pass Altitude_m Temperature_C Relative_humidity Air_pressure_hPa Oxygen_level_perc VO2max_decline_perc Air_density_kg_m3
## 1       Groningen (The Netherlands)          0            15                70          1013.25               100                   0             1.224
## 2          Langenhard (Schwarzwald)        430            15                70           962.89                95                   0             1.164
## 3                  Valbrona (Italy)        500            15                70           954.93                94                   0             1.154
## 4    Col du Rosier (Walloon region)        580            15                70           945.92                93                   1             1.143
## 5  Col de la Rivière Noire (France)       1000            15                70           899.97                89                   3             1.087
## 6            Cruz de Fierro (Spain)       1500            15                70           848.17                84                   5             1.025
## 7               Bormio 2000 (Italy)       2000            15                70           799.35                79                   8             0.966
## 8            Großglockner (Austria)       2500            15                70           753.34                74                  11             0.910
## 9             Passo Stelvio (Italy)       2750            15                70           731.34                72                  13             0.884
## 10       Col de Sommeiller (France)       3000            15                70           709.98                70                  15             0.858

When we reach the top of the Col de Sommeiller, we cycle at an altitude of 3,000 meters. The air pressure is 709.98 hPa (compared to 1013.25 at sea level). The oxygen concentration is 70% (100% at sea level). The V̇O₂max has decreased by 15%. The air density is not higher than 0.858 kg/m3, which of course also corresponds to 70% of the reference value 1.224.
The lowest hPa in my cycle weather data is 979.3. The minimum density measured is 1.147 - both independent of temperature and humidity. These relatively high scores make us refrain from further using V̇O₂max in this article.

3. Absolute humidity

Relative humidity is the amount of water vapor present in the air expressed as a percentage of the maximum possible amount of water vapor that the air can hold at a given temperature. It depends on the temperature.
Absolute humidity describes the actual amount of water vapour in the air in \(g/m^3\). Absolute humidity does not take into account the air temperature. It only depends on the amount of water vapor.

If we know the relative humidity and air temperature, we can easily calculate the absolute humidity (AH) in \(g/m^3\) using the equation:¹⁷

\(\displaystyle AH = \left(\frac{p_{sat} \phi · 100}{R_vT}\right) · 100\)

In R:

mutate(Absolute_humidity_g_m3 = (p1 * Relative_humidity * 100 / (461.495 * (Temperature_C + 273.15)))*100) %>%   # Relative humidity in %

4. Drag equation

If an object moves through the air, it experiences a resisting force, the drag force. The value of this force depends on the size and shape of an object, the air density and the velocity. The drag equation is a formula used to calculate this force. ¹⁸

\(F_d = \frac{1}{2}\space \rho\space u^2\space A\space C_d\)

where:

\(F_d\) the drag force in Newton (\(N\))

\(\rho\) air density (\(kg/m^3\))

\(u\) velocity (\(m/s\)). It’s squared in the formula: aerodynamic drag increases exponentially the faster you go.

\(A\) reference area (\(m^2\)) i.e. the frontal area of the cyclist, let’s say 0.5.

\(C_d\) drag coefficient, let’s say 0.9.

The typical drag coefficient of a cyclist on a road bike is around 0.8 to 1.0. This value depends on factors such as the cyclist’s body position, the bike design, and the cyclist’s clothing. More aerodynamic body positions and equipment can reduce the drag coefficient down to around 0.7, while a less aerodynamic setup may result in a coefficient closer to 1.0.¹⁹
In cycling the drag area is often used, \(AC_d\) (\(m^2\)), which is the product of the frontal area of the cyclist and the drag coefficient.

In R:

mutate(Drag_force_N = 0.5 * Air_density_kg_m3 * (Speed_kmh/3.6)^2 * 0.5 * 0.9)  # speed required in m/s

An online drag equation calculator may help you compute the force.²⁰ Here you can simply confirm that a 10% decrease in air density (for example from 1.28 to 1.152) leads to an equal decrease in drag (from 20 to 18 \(N\)). Tables are available that show how drag reduction affects you at your speeds.²¹

5. Segmented regression

Segmented or broken-line models are regression models where the relationships between the response and one or more explanatory variables are piecewise linear, namely represented by two or more straight lines connected at unknown values, called breakpoints.²² ²³ A segmented regression model has two or more sub-models. Segmented regression is useful when the independent variable, clustered into two or more groups, exhibits a different relationship with the dependent one in these groups.
The estimated model can be visualized by the print summary. The variable labeled with ‘U1.Air_density_kg_m3’ stands for the difference-in-slope parameter of the variable ‘Air_density_kg_m3’. The p-value relevant to U1 is never reported in the ‘segmented’ output and NA is printed because “standard asymptotics do not apply”.²⁴ In order to test for a significant difference-in-slope, the Davies’ test can be used. The output below shows that a significant breakpoint exists. The use of two submodels is therefore justified. The submodel-intercepts and slopes are printed in the output, as well as the breakpoint’s confidence interval.

library(segmented)

fit.lm <- lm(Speed_kmh ~ Air_density_kg_m3, data = regional_cycle_activities)
fit.seg <- segmented(fit.lm, seg.Z = ~ Air_density_kg_m3, psi = 1.225)
summary(fit.seg)

## 
##  ***Regression Model with Segmented Relationship(s)***
## 
## Call: 
## segmented.lm(obj = fit.lm, seg.Z = ~Air_density_kg_m3, psi = 1.225)
## 
## Estimated Break-Point(s):
##                          Est. St.Err
## psi1.Air_density_kg_m3 1.196  0.003
## 
## Coefficients of the linear terms:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -42.85      19.18  -2.234  0.02563 *  
## Air_density_kg_m3       62.42      16.19   3.855  0.00012 ***
## U1.Air_density_kg_m3   -86.10      16.33  -5.272       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.287 on 1765 degrees of freedom
## Multiple R-Squared: 0.07466,  Adjusted R-squared: 0.07309 
## 
## Boot restarting based on 6 samples. Last fit:
## Convergence attained in 2 iterations (rel. change 0.0000000000052253)

davies.test(fit.lm, seg.Z = ~ Air_density_kg_m3,  k=15)

## 
##  Davies' test for a change in the slope
## 
## data:  formula = Speed_kmh ~ Air_density_kg_m3 ,   method = lm 
## model = gaussian , link = identity  
## segmented variable = Air_density_kg_m3
## 'best' at = 1.1885, n.points = 15, p-value = 0.000000001902
## alternative hypothesis: two.sided

intercept(fit.seg)

## $Air_density_kg_m3
##               Est.
## intercept1 -42.848
## intercept2  60.120

slope(fit.seg)

## $Air_density_kg_m3
##           Est. St.Err.  t value CI(95%).l CI(95%).u
## slope1  62.419  16.192   3.8549    30.661    94.178
## slope2 -23.679   2.128 -11.1270   -27.853   -19.506

seg_preds <- predict(fit.seg)
seg_res <- regional_cycle_activities$Speed_kmh - seg_preds

CI_break_point <- confint(fit.seg)
CI_break_point

##                           Est. CI(95%).low CI(95%).up
## psi1.Air_density_kg_m3 1.19592      1.1898    1.20205

The results obtained can now be graphed with air density on the x-axis and the ride average of the speed on the y-axis. Each point, i.e. each ride, is given a color depending on the daytime temperature (Figure 5a).
The segmented regression model looks like an inverted v-shaped curve (red line). The gam-smoothing curve (in green) approximates the v-shape. The breakpoint is estimated at 1.196 (vertical thin line).
The calculated turning point is personal; it will be different for every cyclist. Cyclists from Eritrea are very likely to suffer less from heat. The breaking point can be interpreted as the optimal air density to achieve a high cycling speed.

pred_speed_CI_break_point_low <- predict(fit.seg, data.frame(Air_density_kg_m3 = c(CI_break_point[,2])))
pred_speed_CI_break_point_upp <- predict(fit.seg, data.frame(Air_density_kg_m3 = c(CI_break_point[,3])))

regional_cycle_activities <- regional_cycle_activities %>%
  mutate(Temp_class = cut(Temperature_C, breaks = c(-10,0,9,14,19,24,29,34,40),
                          labels = c("<0", "1-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35+")))

tempcolors = c("<0" = "black", "1-9" = "blue", "10-14" = "lightblue", "15-19" = "#FFCCCC", "20-24" = "#FFCC99", "25-29" = "#FF9933", "30-34" = "red", "35+" = "#CC0000" )


ggplot(regional_cycle_activities, aes(x = Air_density_kg_m3,  y = Speed_kmh )) +
  geom_point(aes(color = Temp_class)) +
  scale_color_manual(values=tempcolors) +
  scale_x_continuous(expression ("air density in "~kg/m^3)) +
  scale_y_continuous("speed km/h") +
  geom_smooth(method = "gam", se = TRUE, level = 0.95, color="darkgreen") +
  geom_line(aes(y = seg_preds), color = "red", linewidth = 1) +
  annotate("segment", x = round(fit.seg$psi[2],3), xend = round(fit.seg$psi[2],3), y = -Inf, yend = max(seg_preds), color = "red", linetype= 'dotted', linewidth = 1) +
    annotate("segment", x = CI_break_point[,2], xend = CI_break_point[,2], y = -Inf, yend = pred_speed_CI_break_point_low, color = "darkgrey") +
  annotate("segment", x = CI_break_point[,3], xend = CI_break_point[,3], y = -Inf, yend = pred_speed_CI_break_point_upp, color = "darkgrey") +
  annotate("rect", xmin =  CI_break_point[,2], xmax = CI_break_point[,3], ymin = -Inf, ymax = pred_speed_CI_break_point_low, alpha = .2) +
  geom_segment(aes(x=round(fit.seg$psi[2],3)+0.002, xend=round(fit.seg$psi[2],3)+0.018, y=24, yend=24),
               arrow = arrow(length = unit(0.4, "cm")), color = 'red') +
  annotate("text", x = round(fit.seg$psi[2],3)+0.045, y = 24, label = paste0("breakpoint ", round(fit.seg$psi[2],3), "  (CI = 0.95)"), color = 'red') +
  annotate("segment", x = 1.26, xend = 1.275, y = 37.5, yend = 37.5, color = "darkgreen") +
  annotate("segment", x = 1.26, xend = 1.275, y = 37, yend = 37, color = "red") +
  annotate("text", x =  1.30, y =  37.5, label = "gam smoothing (CI = 0.95)",  size=3.5, color = "darkgreen") +
  annotate("text", x =  1.30, y =  37, label = "segmented regression        ",  size=3.5, color = "red") +
  theme(plot.title = element_text(size=17)) +
  labs(title = "Segmented relationship average ride speed and air density",
       subtitle = paste0("in a group of ",  nrow(regional_cycle_activities), " rides in the northern part of the Netherlands"),
       caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today()),   
       colour='temperature °C')

Figure 5a. Segmented relation average ride speed and air density List of figures

As expected, we see a decreasing average trip speed with increasing air density, but the decrease does not start until after the breaking point. At very low air density the chance of a high average speed apparently decreases. These appear to be very warm days - at least by local standards. The explanatory factor is high absolute humidity that can occur at high temperatures. Moist air is lighter, but it makes the air temperature feel much hotter than it is. It interferes with the body’s natural cooling mechanism causing the heart rate to increase. It could lead to dehydration.

6. Scatterplot in 3d

The bivariate daytime relationships between temperature, absolute humidity and air density, as recorded in Eelde in the period 2008-2024, are plotted in Appendix Figure 7b,c,d.
We can add the absolute humidity as an extra dimension to Figure 5a. Below the script and the output, Figure 5b, an interactive 3D scatter plot. While hovering over the plot, the popup text shows the relevant ride information.

library(plotly)

plot_ly(data = regional_cycle_activities, x = ~Air_density_kg_m3, y = ~Absolute_humidity_g_m3, z = ~Speed_kmh, type="scatter3d", mode="markers",
        color = ~Temp_class, colors = tempcolors, alpha = 0.9, size = 0.4, text = ~paste('id: ', ID, '<br>distance:', Distance_km, '<br>year:', year(Date_cycle_activity), '<br>windforce:', Windforce_Bft)) %>%
  layout(legend = list(orientation = "vertical",   
                       yanchor = "center",  
                       y = 0.5)) %>%
  layout( title = "<br><br><br>Air density, humidity, temperature and cycle speed",
          scene = list(xaxis = list(title = 'x = air density (kg/m³)'),
                       yaxis = list(title = 'y = absolute humidity (g/m³)'),
                       zaxis = list(title = 'z = speed (km/h)')),
          annotations = list(
            x = 1.15,
            y = 0.60,
            text = 'temperature (°C)',
            xref = 'paper',
            yref = 'paper',
            showarrow = FALSE))

Figure 5b. Speed, air density and humidity in 3d List of figures

7. Model air density

The following regression model summarizes the relationship between average speed of the ride and air density in my data.

if Air density < 1.196 : Speed = -42.848 + 62.419 * Air density

if Air density = 1.196 : Speed = -42.848 + 62.419 * 1.196 = 60.12 + -23.679 * 1.196 = 31.799916

if Air density > 1.196 : Speed = 60.12 + -23.679 * Air density

\(R^2\) = 0.0747

8. Deviation from drag equation predicted speed

Now we can compare the air density - speed relationship modeled according to our own data with the actual, scientific relationship. An air density of 1.31 is chosen as a reference, together with the observed median speed of just over 29 km per hour. The table below shows the effect on speed if the air density decreases in steps of 10 grams per cubic meter. Completely as expected, the table shows that, for example, a 5.3% density decrease compared to 1.31 leads to a 1.75% speed gain if we apply the drag equation, which corresponds to 62 seconds per hour. The table also contains the predicted speed according to the segmented regression model and that model indicates an improvement of 3:14 minutes per hour for the same decrease in air density, which is more than three times faster than the 62 seconds just mentioned.

air_density_values <- data.frame(seq(from = 1.31, to = 1.15, by = -0.01))
names(air_density_values) <- "Air_density_kg_m3"

segm_prediction <- data.frame(predict(fit.seg, air_density_values))
names(segm_prediction) <- "speed_predicted_segmented"

format_min_sec <- function(s) {
  s <- as.integer(s)
  sprintf('%02i:%02i', s %/% 60 %% 60, s %% 60)}

air_density_pred <- regional_cycle_activities %>% 
  mutate(air_density_kg_m3 = round(Air_density_kg_m3,2),
         air_density_kg_m3 = ifelse(air_density_kg_m3 <= 1.15, 1.15, ifelse(air_density_kg_m3 >= 1.31, 1.31, air_density_kg_m3)),
         deviation_from_reference = 1.31 - air_density_kg_m3,
         prop_deviation_from_reference = deviation_from_reference / 1.31) %>% 
  group_by(air_density_kg_m3) %>%
  summarise(n_rides = n(), speed_kmh_observed_median = median(Speed_kmh),
            speed_kmh_observed_mean = mean(Speed_kmh),
            deviation_from_reference = mean(deviation_from_reference), prop_deviation_from_reference = mean(prop_deviation_from_reference)) %>% 
  arrange(desc(air_density_kg_m3)) %>% 
  cbind(.,segm_prediction) %>% 
  mutate(speed_increase_in_perc_due_to_less_drag = (((first(speed_kmh_observed_median)^3 * (1+prop_deviation_from_reference))^(1 / 3) - first(speed_kmh_observed_median)) / first(speed_kmh_observed_median) ) * 100 ,
         speed_predicted_drag = (1+speed_increase_in_perc_due_to_less_drag/100) * first(speed_kmh_observed_median),
         delta_speed_segmented = speed_predicted_segmented - lag(speed_predicted_segmented),
         delta_speed_drag = speed_predicted_drag - lag(speed_predicted_drag),
         drag_force_N = 0.5 * air_density_kg_m3 * (first(speed_kmh_observed_median)/3.6)^2 * 0.5 * 0.9,
         drag_gain_minutes_per_hour = format_min_sec(round(abs((first(speed_kmh_observed_median) / speed_predicted_drag * 3600) - 3600),0)),
         segm_gain_minutes_per_hour = format_min_sec(round(abs((first(speed_predicted_segmented) / speed_predicted_segmented * 3600) - 3600),0))) %>% 
  dplyr::select(air_density_kg_m3, n_rides, deviation_from_reference, prop_deviation_from_reference, speed_increase_in_perc_due_to_less_drag, speed_predicted_drag, delta_speed_drag, drag_gain_minutes_per_hour, drag_force_N, speed_predicted_segmented, delta_speed_segmented, segm_gain_minutes_per_hour, speed_kmh_observed_median, speed_kmh_observed_mean) %>% 
  mutate(across(c(4:7,9:11,13:14), ~ format(round(.x, digits=3), nsmall = 3))) 

air_density_pred

##    air_density_kg_m3 n_rides deviation_from_reference prop_deviation_from_reference speed_increase_in_perc_due_to_less_drag speed_predicted_drag delta_speed_drag drag_gain_minutes_per_hour drag_force_N speed_predicted_segmented delta_speed_segmented segm_gain_minutes_per_hour speed_kmh_observed_median speed_kmh_observed_mean
## 1               1.31      22                     0.00                         0.000                                   0.000               29.369               NA                      00:00       19.616                    29.100                    NA                      00:00                    29.369                  29.681
## 2               1.30      30                     0.01                         0.008                                   0.254               29.443            0.075                      00:09       19.467                    29.337                 0.237                      00:29                    29.780                  29.504
## 3               1.29      51                     0.02                         0.015                                   0.506               29.517            0.074                      00:18       19.317                    29.574                 0.237                      00:58                    30.390                  30.046
## 4               1.28      82                     0.03                         0.023                                   0.758               29.591            0.074                      00:27       19.167                    29.810                 0.237                      01:26                    30.312                  29.669
## 5               1.27     111                     0.04                         0.031                                   1.008               29.665            0.073                      00:36       19.017                    30.047                 0.237                      01:53                    30.222                  29.946
## 6               1.26     154                     0.05                         0.038                                   1.256               29.738            0.073                      00:45       18.868                    30.284                 0.237                      02:21                    30.741                  30.304
## 7               1.25     178                     0.06                         0.046                                   1.504               29.810            0.073                      00:53       18.718                    30.521                 0.237                      02:48                    30.375                  30.230
## 8               1.24     180                     0.07                         0.053                                   1.750               29.883            0.072                      01:02       18.568                    30.758                 0.237                      03:14                    31.006                  30.756
## 9               1.23     236                     0.08                         0.061                                   1.996               29.955            0.072                      01:10       18.418                    30.994                 0.237                      03:40                    31.009                  30.826
## 10              1.22     206                     0.09                         0.069                                   2.240               30.026            0.072                      01:19       18.269                    31.231                 0.237                      04:06                    31.869                  31.425
## 11              1.21     188                     0.10                         0.076                                   2.482               30.098            0.071                      01:27       18.119                    31.468                 0.237                      04:31                    31.673                  31.591
## 12              1.20     147                     0.11                         0.084                                   2.724               30.169            0.071                      01:35       17.969                    31.705                 0.237                      04:56                    32.286                  31.765
## 13              1.19     104                     0.12                         0.092                                   2.965               30.239            0.071                      01:44       17.819                    31.431                -0.273                      04:27                    31.921                  31.429
## 14              1.18      50                     0.13                         0.099                                   3.204               30.310            0.070                      01:52       17.670                    30.807                -0.624                      03:19                    31.029                  31.189
## 15              1.17      19                     0.14                         0.107                                   3.442               30.380            0.070                      02:00       17.520                    30.183                -0.624                      02:09                    28.468                  29.928
## 16              1.16       6                     0.15                         0.115                                   3.680               30.449            0.070                      02:08       17.370                    29.559                -0.624                      00:56                    28.328                  28.933
## 17              1.15       5                     0.16                         0.122                                   3.916               30.519            0.069                      02:16       17.220                    28.935                -0.624                      00:21                    28.899                  28.496

Let’s see how both models compare (Figure 5c). The blue line is the speed prediction according to the segmented regression based on the dataset, as shown in Figure 5a. The red line represents the actual effect of air density according to the drag equation. The secondary y-axis shows the speed gain in minutes and as a percentage, both compared to the reference speed cycled at an air density of 1.31 (29.369 km per hour). Due to reduced drag, the speed increases to 30.519 km/h at an air density of 1.15, other things being equal. A time saving of 02:16 minutes per hour. In our own cycling data, which in no way meet the ceteris paribus clause, a maximum average time saving of 5 minutes per hour has been predicted at an air density of 1.196, after which speed will decrease. This decrease is the result of unpleasant effects (on myself) of high humidity associated with - for local standards - high temperatures, as noted earlier. The speed increase with decreasing air density to the breaking point is greater in the segmented regression model than calculated according to reduced drag. No doubt a number of other factors play a role here, such as dry roads, aero summer outfit, use of a time trial bike.

ylim.prim <- c(28, 32)   
ylim.sec <- c(-120, 320)    

b <- diff(ylim.prim)/diff(ylim.sec)
a <- ylim.prim[1] - b*ylim.sec[1]

ggplot() +
  geom_line(aes(x = air_density_pred$air_density_kg_m3,  y = as.numeric(air_density_pred$speed_predicted_drag)), color="red", linewidth = 1.1) +
  annotate("segment", x = fit.seg$psi[2], xend = fit.seg$psi[2], y = -Inf, yend = max(seg_preds),linewidth = 1, linetype= 'dotted', color = "blue") +
  geom_line(aes(x = regional_cycle_activities$Air_density_kg_m3, y = seg_preds), color = "blue", linewidth = 1) +
    scale_y_continuous("predicted speed km/h", limits=c(28,32), breaks = c(28:32), sec.axis = sec_axis(~ (. - a)/b, name = "gain in min/h and % compared to reference 1.31", breaks= seq(0,300, by=60), labels = c("00:00","01:00 = 1.7%","02:00 = 3.3%","03:00 = 5.0%","04:00 = 6.7%","05:00 = 8.3%"))) +
  geom_segment(aes(x=round(fit.seg$psi[2],3)+0.0053, xend=round(fit.seg$psi[2],3)+0.018, y=28.2, yend=28.2), 
               arrow = arrow(length = unit(0.35, "cm")), color = 'blue') +
  annotate("text", x = round(fit.seg$psi[2],3)+0.035, y = 28.2, label = paste0("breakpoint ", round(fit.seg$psi[2],3)), color = 'blue') +
  annotate("segment", x = 1.24, xend = 1.25, y = 28.5, yend = 28.5, color = "red") +
  annotate("segment", x = 1.24, xend = 1.25, y = 28.7, yend = 28.7, color = "blue") +
  annotate("text", x =  1.27, y =  28.4, label = "drag equation-predicted\n(reference point 1.31)",  size=3.5, color = "red") +
  annotate("text", x =  1.27, y =  28.7, label = "segmented regression  ",  size=3.5, color = "blue") +
  scale_x_continuous(expression ("air density in "~kg/m^3),limits=c(1.15,1.31),breaks=c(1.15,1.16,1.17,1.18,1.19,1.20,1.21,1.22,1.23,1.24,1.25,1.26,1.27,1.28,1.29,1.30,1.31)) + 
  theme(plot.title = element_text(size=17), panel.grid.minor.x = element_blank()) +
  labs(title = "Relationship trip cycle speed and air density: segmented versus drag equation",
       subtitle = paste0("in a group of ",  nrow(regional_cycle_activities), " rides in the northern part of the Netherlands"),
       caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today()))

Figure 5c. Relationship trip cycle speed and air density: segmented versus drag equation List of figures

III. Multiple regression

Multiple regression is an extension of simple linear regression, which we have used so far. It is used when we want to predict the value of a variable based on the value of two or more other variables. The dependent variable is the average trip speed. In addition to both weather characteristics, wind force and air density, we add two control variables: the length of the ride in km, and the year of the activity.²⁵

1. Model selection

We collected goodness-of-fit measures and p-values for simple and multiple regressions, paying particular attention to the AIC and R². The AIC (‘Akaike Information Criterion’) is a a widely used metric for evaluating the accuracy of predictions. The lower the AIC score the better. The aim is to find the model that minimizes AIC. R² is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by the predictors.

lm_no_1 <- lm(Speed_kmh ~ as.numeric(Windforce_Bft), data = regional_cycle_activities)
lm_no_2 <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2), data = regional_cycle_activities)

lm_no_3 <- lm(Speed_kmh ~ Air_density_kg_m3, data = regional_cycle_activities)
lm_no_4 <- segmented(lm_no_3,seg.Z = ~ Air_density_kg_m3, psi = 1.20)

lm_tem <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2) + Air_density_kg_m3, data = regional_cycle_activities)
lm_no_5 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)

lm_tem <- lm(Speed_kmh ~ (poly(as.numeric(Windforce_Bft), 2) + Air_density_kg_m3)^2, data = regional_cycle_activities)
lm_no_6 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)

lm_no_7 <- lm(Speed_kmh ~ Distance_km, data = regional_cycle_activities)
lm_no_8 <- lm(Speed_kmh ~ year(Date_cycle_activity), data = regional_cycle_activities)

lm_tem <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2) + Distance_km + Air_density_kg_m3, data = regional_cycle_activities)
lm_no_9 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)

lm_tem <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2) + year(Date_cycle_activity) + Air_density_kg_m3, data = regional_cycle_activities)
lm_no_10 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)

lm_tem <- lm(Speed_kmh ~ year(Date_cycle_activity) +  poly(as.numeric(Windforce_Bft), 2) + Distance_km + Air_density_kg_m3, data = regional_cycle_activities)
lm_no_11 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)

# the glm p-value can be determined with the anova function comparing the fitted model to a null model. 
# The null model is fit with only an intercept term on the right side of the model.
glm.0 <- glm(Speed_kmh ~ 1, data = regional_cycle_activities)

lm_tem <- glm(Speed_kmh ~  poly(as.numeric(Windforce_Bft), 2) + Air_density_kg_m3 + (year(Date_cycle_activity)  + Distance_km)^2, data = regional_cycle_activities)
lm_no_12 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)

model_aic <- as.data.frame(c(AIC(lm_no_1),AIC(lm_no_2),AIC(lm_no_3),AIC(lm_no_4),AIC(lm_no_5),AIC(lm_no_6),AIC(lm_no_7),AIC(lm_no_8),AIC(lm_no_9),AIC(lm_no_10),AIC(lm_no_11),AIC(lm_no_12)))

model_r_squared <- as.data.frame(c(round(summary(lm_no_1)$r.squared,3),round(summary(lm_no_2)$r.squared,3),
                             round(summary(lm_no_3)$r.squared,3),round(summary(lm_no_4)$r.squared,3),
                             round(summary(lm_no_5)$r.squared,3),round(summary(lm_no_6)$r.squared,3),
                             round(summary(lm_no_7)$r.squared,3),round(summary(lm_no_8)$r.squared,3),
                             round(summary(lm_no_9)$r.squared,3),round(summary(lm_no_10)$r.squared,3),round(summary(lm_no_11)$r.squared,3),round(cor(lm_no_12$fitted, lm_no_12$y)^2, 3)))

model_compared <- as.data.frame(c("Speed ~ Windforce",
                                  "Speed ~ polynomial(Windforce)",
                                  "Speed ~ Air density",
                                  "Speed ~ segmented(Air density)",
                                  "Speed ~ polynomial(Windforce) + segmented(Air density)",
                                  "Speed ~ polynomial(Windforce) + segmented(Air density) + interaction",
                                  "Speed ~ Distance",
                                  "Speed ~ Year",
                                  "Speed ~ polynomial(Windforce) + segmented(Air density) + Distance",
                                  "Speed ~ polynomial(Windforce) + segmented(Air density) + Year",
                                  "Speed ~ polynomial(Windforce) + segmented(Air density) + Distance + Year",
                                  "Speed ~ polynomial(Windforce) + segmented(Air density) + interaction(Distance + Year)"))

model_rss <- as.data.frame(c(sum(resid(lm_no_1)^2),
                             sum(resid(lm_no_2)^2),
                             sum(resid(lm_no_3)^2),
                             sum(resid(lm_no_4)^2),
                             sum(resid(lm_no_5)^2),
                             sum(resid(lm_no_6)^2),
                             sum(resid(lm_no_7)^2),
                             sum(resid(lm_no_8)^2),
                             sum(resid(lm_no_9)^2), 
                             sum(resid(lm_no_10)^2),
                             sum(resid(lm_no_11)^2),
                             sum(resid(lm_no_12)^2)))


model_p <- as.data.frame(c(anova(lm_no_1)$'Pr(>F)'[1],anova(lm_no_2)$'Pr(>F)'[1],anova(lm_no_3)$'Pr(>F)'[1],anova(lm_no_4)$'Pr(>F)'[1],anova(lm_no_5)$'Pr(>F)'[1],anova(lm_no_6)$'Pr(>F)'[1],anova(lm_no_7)$'Pr(>F)'[1],anova(lm_no_8)$'Pr(>F)'[1],anova(lm_no_9)$'Pr(>F)'[1],anova(lm_no_10)$'Pr(>F)'[1],anova(lm_no_11)$'Pr(>F)'[1],anova(lm_no_12, glm.0, test="F")$'Pr(>F)'[2]
))

models_compared <- cbind(model_compared, model_aic, model_rss, model_r_squared, model_p)

names(models_compared) <- c("model", "AIC", "RSS", "R²", "significance")

models_compared <- models_compared %>% mutate(significance = get_p_value_class(significance))

models_compared

##                                                                                    model      AIC      RSS    R² significance
## 1                                                                      Speed ~ Windforce 7978.051 9384.569 0.059     p < .001
## 2                                                          Speed ~ polynomial(Windforce) 7958.462 9270.737 0.071     p < .001
## 3                                                                    Speed ~ Air density 7994.251 9470.909 0.051     p < .001
## 4                                                         Speed ~ segmented(Air density) 7952.740 9230.356 0.075     p < .001
## 5                                 Speed ~ polynomial(Windforce) + segmented(Air density) 7815.463 8521.866 0.146     p < .001
## 6                   Speed ~ polynomial(Windforce) + segmented(Air density) + interaction 7817.430 8512.076 0.147     p < .001
## 7                                                                       Speed ~ Distance 8023.638 9629.553 0.035     p < .001
## 8                                                                           Speed ~ Year 7450.890 6966.182 0.302     p < .001
## 9                      Speed ~ polynomial(Windforce) + segmented(Air density) + Distance 7702.614 7986.175 0.199     p < .001
## 10                         Speed ~ polynomial(Windforce) + segmented(Air density) + Year 7097.051 5671.154 0.431     p < .001
## 11              Speed ~ polynomial(Windforce) + segmented(Air density) + Distance + Year 7037.266 5476.499 0.451     p < .001
## 12 Speed ~ polynomial(Windforce) + segmented(Air density) + interaction(Distance + Year) 7035.652 5465.324 0.452     p < .001

From the above list it appears that the best single predictor of bicycle trip speed is the passage of time (AIC 7450.9), followed by the two weather variables segmented air density (AIC 7952.7) and polynomial wind force (AIC 7958.5). Distance is the least important (AIC 8023.6). In terms of R² of the simple regressions: the passing of the years, in other words, growing older, explains 30% of the variability of trip speed, the weather variables 7.1 - 7.5%, distance only 3%.
The intercept b0 is the starting point of the regression line, the slope b1 is the predicted change in speed with one unit change in the year. The age effect on speed according to the simple regression model is as follows (model no. 8 on the list above). Given an intercept of 699.57 and a slope of -0.3318, the predicted average speed in the starting year 2008 is 33.4 km per hour, and in 2024 28.1 km per hour, which is a decrease of 5.3 km per hour over a period of 17 years of cycling.

Multiple regression shows that both weather variables together explain 14.6% of the variability of the dependent variable (model no. 5 on the list above). The best model contains all four predictors, including distance (model no. 11, percentage explained variance 45%). The model with distance is significantly better than a model without, given the p-value less than 0.05 (see output below where model no. 10 and 11 are compared).

p <- get_p_value_class(anova(lm_no_10, lm_no_11, test="F")$"Pr(>F)"[2])
p

## [1] p < .001
## Levels: p < .001 p < .01 p < .05 n.s.

Although we assume that we have made longer trips more often in recent years, an interaction term year - distance does not provide a better model in terms of explaining average speed. Specifying the main effects is sufficient.

p <- get_p_value_class(anova(lm_no_11, lm_no_12, test="F")$"Pr(>F)"[2])
p

## [1] n.s.
## Levels: p < .001 p < .01 p < .05 n.s.

Adding an interaction effect of both weather variables doesn’t make the model significantly better, given the result in the output below where model no. 5 and 6 are compared.²⁶

p <- get_p_value_class(anova(lm_no_5, lm_no_6, test="F")$"Pr(>F)"[2])
p

## [1] n.s.
## Levels: p < .001 p < .01 p < .05 n.s.

2. Model summary

Below are the summary printouts of the chosen model. It is worth mentioning that the air density breakpoint in the multiple model (1.198) is almost equal to the value found in the simple regression model (1.196).
The explained proportion of variance is very low (R² = 0.451), making the model extremely unqualified for prediction. However, we only use the multiple model to measure the effect of covariates. The conclusion is that the relationships found at bivariate level between the weather variables and trip speed do not disappear after controlling for trip length and a time variable. An interaction effect of wind - air density cannot be statistically proven.

lm_basic <- lm(Speed_kmh ~ year(Date_cycle_activity) +  poly(as.numeric(Windforce_Bft), 2) + Distance_km + Air_density_kg_m3, data = regional_cycle_activities)

lm_segmented <- segmented(lm_basic, seg.Z = ~ Air_density_kg_m3, psi = 1.20)
summary(lm_segmented)

## 
##  ***Regression Model with Segmented Relationship(s)***
## 
## Call: 
## segmented.lm(obj = lm_basic, seg.Z = ~Air_density_kg_m3, psi = 1.2)
## 
## Estimated Break-Point(s):
##                          Est. St.Err
## psi1.Air_density_kg_m3 1.198  0.003
## 
## Coefficients of the linear terms:
##                                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                         617.652074  26.662692  23.165 < 0.0000000000000002 ***
## year(Date_cycle_activity)            -0.310077   0.010916 -28.407 < 0.0000000000000002 ***
## poly(as.numeric(Windforce_Bft), 2)1 -23.981342   1.782149 -13.456 < 0.0000000000000002 ***
## poly(as.numeric(Windforce_Bft), 2)2  -8.334884   1.767477  -4.716  0.00000259851877523 ***
## Distance_km                          -0.013933   0.001762  -7.910  0.00000000000000451 ***
## Air_density_kg_m3                    33.688025  11.662768   2.889              0.00392 ** 
## U1.Air_density_kg_m3                -58.263762  11.774643  -4.948                   NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.763 on 1761 degrees of freedom
## Multiple R-Squared: 0.451,  Adjusted R-squared: 0.4488 
## 
## Boot restarting based on 10 samples. Last fit:
## Convergence attained in 2 iterations (rel. change -0.00000051467)

davies.test(lm_basic, seg.Z = ~ Air_density_kg_m3,  k=15)

## 
##  Davies' test for a change in the slope
## 
## data:  formula = Speed_kmh ~ year(Date_cycle_activity) + poly(as.numeric(Windforce_Bft), 2) + Distance_km + Air_density_kg_m3 ,   method = lm 
## model = gaussian , link = identity  
## segmented variable = Air_density_kg_m3
## 'best' at = 1.2011, n.points = 15, p-value = 0.00000001118
## alternative hypothesis: two.sided

intercept(lm_segmented)

## $Air_density_kg_m3
##              Est.
## intercept1 617.65
## intercept2 687.44

slope(lm_segmented)

## $Air_density_kg_m3
##           Est. St.Err.  t value CI(95%).l CI(95%).u
## slope1  33.688 11.6630   2.8885    10.814    56.562
## slope2 -24.576  1.6796 -14.6320   -27.870   -21.281

3. Model equation

Written as an equation:

if Air density < 1.198 : Speed = 617.65 + -0.3100773 * Year + -23.9813416 * Windforce
+ -8.3348836 * Windforce\(^2\) + -0.0139333 * Distance + 33.688 * Air density

if Air density = 1.198 : Speed = 617.65 + -0.3100773 * Year + -23.9813416 * Windforce
+ -8.3348836 * Windforce\(^2\) + -0.0139333 * Distance + 33.688 * 1.198
= 687.44 + -0.3100773 * Year + -23.9813416 * Windforce
+ -8.3348836 * Windforce\(^2\) + -0.0139333 * Distance + -24.576 * 1.198

if Air density > 1.198 : Speed = 687.44 + -0.3100773 * Year + -23.9813416 * Windforce
+ -8.3348836 * Windforce\(^2\) + -0.0139333 * Distance + -24.576 * Air density

\(R^2\) = 0.451

4. Residuals and leverage

We conclude with a section on residual and leverage analysis. The residual for each observation is the difference between the predicted and observed values. A high-leverage point has extreme values for one or more predictors, or unusual combinations of predictor values.
Residuals may correlate with (an)other variable(s). This is usually identified visually using graphs such as the quartet below. The residual versus fitted plot can reveal non-linearity because of inequality of the variances of the error terms. For instance a parabola indicates non-linear relationship. Here, we don’t see any distinctive pattern. But given the low R², we know that there are important uncaptured variables left out of the model.
The plot that is in the right upper corner, is the normal probability plot of residuals. It is an indication to see if the model residuals follow a normal distribution. Do residuals follow a straight line well or do they deviate severely? The Scale-Location plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal line with equally (randomly) spread points. The leverage idea picks out the potential for a specific value to be influential. Leverage refers to the extent to which the coefficients in the regression model would change if a particular observation was removed from the dataset. Observations with high leverage have a strong influence on the coefficients in the regression model. Leverage values range between 0 (no influence) and 1 (strong influence). There doesn’t seem to be any serious leverage involved.

library(ggfortify)

autoplot(lm_segmented)

Figure 6a. Residual analysis multiple regression model List of figures

Cook’s distance is useful for identifying outliers in the predictor variables. It also shows the influence of each observation on the fitted response values.²⁷

library(olsrr)

ols_plot_cooksd_chart(lm_segmented, threshold = 0.01)

Figure 6b. Cook’s distance chart multiple regression model List of figures

Following a suggestion on Stack Overflow, we will merge the residuals/leverage graph from Figure 6a with Cook’s distance chart (Figure 6b) by adding Cook’s distance levels to the former.²⁸

model_residual_data <- regional_cycle_activities %>% 
  mutate(Speed_predicted = predict.segmented(lm_segmented)) %>% 
  mutate(cooks_distance = cooks.distance(lm_segmented),
        leverage = hatvalues(lm_segmented),
        residuals_standard = rstandard(lm_segmented)) %>% 
  mutate(influence = ifelse(cooks_distance >= 0.01 | leverage >= 0.04 |  abs(residuals_standard) >= 3.0, TRUE, FALSE)) %>% 
  mutate(rownumber = row_number()) 

influential_observations <- model_residual_data %>%  filter(influence) %>% 
  mutate(observation = ifelse(cooks_distance >= 0.01 & abs(residuals_standard) >= 3.0, "cook's + outlier",
                         ifelse(cooks_distance >= 0.01, "cook's",
                           ifelse(leverage >= 0.04, "leverage",
                             ifelse(abs(residuals_standard) >= 3.0, "outlier", NA))))) %>% 
  dplyr::select(rownumber,Date_cycle_activity,observation,Distance_km,Speed_kmh,Speed_predicted,Temperature_C,Windforce_Bft,Air_density_kg_m3,cooks_distance,leverage,residuals_standard) %>% 
  mutate(Distance_km = round(Distance_km,0)) %>% 
  arrange(observation,Date_cycle_activity)

cooksd_contour_pos <- function(leverage, level, model) {sqrt(level*length(coef(model))*(1-leverage)/leverage)}
cooksd_contour_neg <- function(leverage, level, model) {-cooksd_contour_pos(leverage, level, model)}

obscolors = c("outlier" = "red", "leverage" = "blue", "cook's" = "orange", "cook's + outlier" = "#CC0000" )

ggplot(model_residual_data, aes(x = leverage, y = residuals_standard)) +
  geom_hline(yintercept = 3, linetype = 'dotted', color= "red", linewidth = 1) +
  geom_hline(yintercept = -3, linetype = 'dotted', color= "red", linewidth = 1) +
  geom_vline(xintercept = 0.04, linetype = 'dotted', color= "blue", linewidth = 1) +
  geom_point(alpha = 0.2) +
  geom_point(data = influential_observations, aes(x = leverage, y = residuals_standard,  colour = observation)) +
  geom_smooth(method = "loess", se = FALSE, span = 0.75, color = "black") + 
  stat_function(fun = cooksd_contour_pos, args = list(level = 0.01, model = lm_segmented), xlim = c(0, 0.05), lty = 2, colour = "orange") +
  stat_function(fun = cooksd_contour_neg, args = list(level = 0.01, model = lm_segmented), xlim = c(0, 0.05), lty = 2, colour = "orange") +
  ggrepel::geom_text_repel(data = influential_observations, aes(label = rownumber, colour = observation), cex=4, hjust=1.1, fontface = 'italic', show.legend = FALSE) +
  scale_y_continuous("standardized residual",limits = c(-4, 4)) +
  scale_color_manual(values=obscolors) +
  annotate("text", x =  0.047, y =  4.0, label = "leverage threshold: 0.04",  size=3.5, color = "blue") +
  annotate("text", x =  0.049, y =  1.45, label = "cook's d threshold: 0.01",  size=3.5, color = "orange") +
  annotate("text", x =  0.020, y = -3.2, label = "abs. outlier threshold: 3.0",  size=3.5, color = "red") +
  labs(title = "Influential rides and outliers in the regression model",
      subtitle = paste0("in a group of ",  nrow(regional_cycle_activities), " rides in the northern part of the Netherlands"),
      caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today()),
      colour='observation')

Figure 6c. Influential rides and outliers in multiple regression model List of figures

Finally, a list of all 21 rides that exceed the three defined thresholds. High leverage appears to be absent in the database - the threshold value has been deliberately set very low at 0.06. On the other hand, there are quite a few outliers. As an example, we have traced one striking outlier, ride number 1744: January 27, 2024, average speed 32.8 km/h, distance 138 km, wind force 4, air density 1.29 and a temperature of 5 degrees Celsius. It turns out to be a one-way trip from Harlingen to Groningen with the wind at our backs. (… and ride numbers 1385 and 1387 are a one-way journey in two stages with a headwind).

library(kableExtra)
library(knitr)

knitr::kable(influential_observations, format = "html", escape = FALSE, caption = "Listing influential rides in regression model",
             col.names = c("Ride", "Date", "Influence", "Distance (km)", "Speed (km/h)", "Predicted speed", "Temp.(°C)", "Windforce (Bft)", "Air density (kg/m³)", "Cook's D", "leverage", "stand.residual")) %>% 
  row_spec(1:nrow(influential_observations),font_size=10) %>% 
  row_spec(0,bold=TRUE) %>% 
  row_spec(which(influential_observations$observation == "outlier"),italic=FALSE, bold=FALSE, color = 'red') %>%
  row_spec(which(influential_observations$observation == "leverage"),italic=FALSE, bold=FALSE, color = 'blue') %>% 
  row_spec(which(influential_observations$observation == "cook's"),italic=FALSE, bold=FALSE, color = 'orange') %>% 
  row_spec(which(influential_observations$observation == "cook's + outlier"),italic=FALSE, bold=TRUE, color = '#CC0000') %>% 
  kableExtra::kable_styling(full_width = TRUE, font_size = 10) %>% 
  column_spec(2:3, width = "11em")

Listing influential rides in regression model
Ride	Date	Influence	Distance (km)	Speed (km/h)	Predicted speed	Temp.(°C)	Windforce (Bft)	Air density (kg/m³)	Cook’s D	leverage	stand.residual
708	2015-07-05	cook’s	53	36.13636	31.65783	21.5	5	1.197559	0.0101980886086	0.0123395	2.555406676
913	2016-07-20	cook’s	62	34.12844	30.89153	28.8	4	1.165747	0.0106329767880	0.0240483	1.857992850
1293	2019-02-07	cook’s	94	23.45000	26.34377	7.5	7	1.245624	0.0104027848173	0.0291324	-1.665373749
1453	2020-02-09	cook’s	74	23.52632	27.11390	10.4	7	1.212612	0.0162494363712	0.0295793	-2.065144296
1765	2024-08-06	cook’s	45	34.32308	29.61288	25.9	2	1.175575	0.0163882035589	0.0177316	2.694957904
1385	2019-08-07	cook’s + outlier	50	24.79339	30.24329	20.9	5	1.191146	0.0104030879086	0.0085654	-3.103738934
1387	2019-08-07	cook’s + outlier	26	24.00000	30.57769	20.9	5	1.191146	0.0192594896507	0.0108360	-3.750314986
1744	2024-01-27	cook’s + outlier	138	32.89048	26.05431	5.1	4	1.294321	0.0168807909593	0.0088287	3.893735765
1203	2018-07-26	leverage	111	28.95652	29.62366	33.1	3	1.153235	0.0010126377113	0.0509803	-0.388336870
1204	2018-07-27	leverage	105	28.89908	29.19931	32.5	4	1.151709	0.0002264925515	0.0557395	-0.175200878
1209	2018-08-07	leverage	112	30.13453	29.62382	31.0	3	1.153653	0.0005783712916	0.0498098	0.297095695
1369	2019-07-25	leverage	120	27.48092	28.96770	34.6	3	1.146690	0.0073531004604	0.0713671	-0.874889858
1370	2019-07-26	leverage	131	27.01031	28.49778	32.0	4	1.150842	0.0059044276226	0.0588120	-0.869439281
1371	2019-07-27	leverage	80	29.62963	28.66047	28.7	5	1.156569	0.0018218041246	0.0440933	0.562105102
1674	2022-07-19	leverage	100	28.62571	28.61213	32.7	3	1.155555	0.0000003665128	0.0450673	0.007882155
12	2009-02-08	outlier	77	26.40000	32.22723	2.3	4	1.267063	0.0046042305724	0.0033508	-3.309934883
1351	2019-05-28	outlier	51	24.87805	30.23137	12.1	4	1.236845	0.0031238992011	0.0026974	-3.039749584
1580	2021-06-12	outlier	52	34.82697	29.00831	16.0	5	1.230857	0.0052389704840	0.0038204	3.305841746
1591	2021-07-22	outlier	31	25.45135	30.79288	18.9	3	1.218457	0.0071432051875	0.0061523	-3.038320732
1601	2021-09-16	outlier	45	36.39200	30.06308	16.4	4	1.221582	0.0070816197702	0.0043602	3.596720821
1604	2021-09-21	outlier	46	35.97632	29.58680	15.6	4	1.240916	0.0064404933780	0.0038943	3.630305271

Summary

The influence of the weather variables wind force and air density on the average ride-cycling speed is investigated using two files. The first is a personal file with data regarding all solo cycling trips in the period 2008 to the present in my home region, which is the flat northern provinces of the Netherlands. Located at sea level. Date of activity, distance and average speed are selected for each trip. The rides are very diverse in nature, ranging from time trial to interval training including KOM hunting, as well as recovery and less motivated efforts. The distance is between 15 and 165 km, with a median distance of 80. Median speed 31.03 km/h. The second source is the hourly weather data from the Royal Netherlands Meteorological Institute (KNMI), Eelde station - which is located about eight kilometers from my home town, the city of Groningen. The afternoon weather data is averaged and then merged by date with the cycling data file, resulting in a working database of 1769 rides.

Two quantities to measure wind force are Beaufort and meters per second. Meters per second is an interval scale, Beaufort does not have a constant pace of change. There is a significant difference in median ride speed between Beaufort wind force groups. The trip median speed at wind force 7 (less than 28 km/h) has decreased by 4 km/hour compared to reference wind force 2 (32 km/h), which is a decrease of more than 13%. The speed reduction per Beaufort increases progressively (see left summary graph below). The speed and wind force relationship in Beaufort is non-linear and requires a polynomial component. The non-linear model is significantly better. However, if we represent the wind force in meters per second, a linear model is sufficient. An extra component doesn’t add anything significant in this case.

The connection between air density and ride speed in my data can best be described with a segmented regression model where the relationship is represented by two lines with a breakpoint in between. Up to the breaking point we observe: the lower the air density, the higher the ride average speed. The significant breaking point has been estimated at an air density of 1.196 (see right summary graph above). After this breaking point, the speed drops, despite the lower air density. The drop is caused by a combination of both high temperature and high absolute humidity, resulting in fewer oxygen molecules per m³ of air, a stuffy feeling and a higher heart rate. Probably every cyclist has their own favorite air density. Mine is around 1.2.

The red line in the summary graph represents the actual effect of air density according to the drag equation. Assuming a reference speed of 29.369 km per hour at an air density of 1.31 kg per m³, and we reduce this density in steps of 10 grams, each step leads to a time gain of 8.5 seconds per hour, other things being equal. A density of 1.15 - a difference of 160 grams per m³ compared to 1.31 - equates to 30.519 km per hour and a time saving of 02:16 compared to the reference. In our own cycling data, which in no way meet the ceteris paribus clause, a maximum average time saving of 04:56 per hour has been predicted at an air density of 1.196, after which the speed will decrease.

The wind force - speed and air density - speed relationships found remain unchanged in a multivariate analysis with distance and year as control variables. Our data reveals no interaction effect of both weather variables of average ride speed.

The regression models presented have no predictive value whatsoever, given the recurring low R². Instead, we regard them as a good description of the direction and shape of the relationships.

Appendix: Eelde weather data

Figure 7a-l. Daytime weather data, Eelde, NL, average 2008-2024 List of figures

Relation between humidity and temperature. With the same amount of absolute humidity, air will have a higher relative humidity if the air is cooler, and a lower relative humidity if the air is warmer. If temperature increases above freezing point, it will on average lead to a decrease in relative humidity.
What we “feel” outside is the absolute humidity in the air. As the temperature rises, this will lead to an increase in absolute humidity with increasing variance of the latter.
Very low air density is associated with low relative humidity. Both quantities increase up to approximately the average air density, after which the relationship seems to disappear.
The density of air decreases with increasing absolute humidity. The relationship is inversely proportional.
As the air heats up the density of the air decreases. When air cools, its density increases. Warmer molecules of air move faster, creating an expansion effect that decreases air density.
As the wind force increases, the temperature variance decreases.
As the wind force increases, the air density variance decreases.
When the speed of wind increases, the water molecules are scattered and decrease the amount of water vapor in the surrounding. Higher wind speed causes minimum evaporation of water and low humidity and lower wind speed cause maximum evaporation of water and high humidity.
Higher air pressure implies a higher chance of high air density - in an H the relationship appears linear. In an L, on the other hand, the air pressure - air density relationship seems absent.
Apparently the highest temperature is not found at either extremely low or high air pressure. In an H we find a negative correlation air pressure - temperature, in an L a positive one. Seems suitable for a segmented regression.
We may associate a storm with a low-pressure area.
The lowest relative humidity appears to be associated with normal air pressure. As the air pressure decreases in an L, the relative humidity increases.

List of figures

Figure 1. Density plot speed and wind force
Figure 2. How to read a boxplot
Figure 3. Boxplot Beaufort wind force
Figure 4. Non-linear relation average ride speed and Beaufort wind force
Figure 5a. Segmented relation speed and air density
Figure 5b. 3d plot speed, air density and humidity
Figure 5c. Relationship trip cycle speed and air density: segmented versus drag equation
Figure 6a. Residual analysis multiple regression model
Figure 6b. Cook’s distance chart multiple regression model
Figure 6c. Influential rides and outliers in multiple regression model
Figure 7a-l. Daytime weather data, Eelde, NL, average 2008-2024

Scientific publications on cycling aerodynamics

Bert Blocken, Thijs van Druenen, Yasin Toparlar, Thomas Andrianne “CFD analysis of an exceptional cyclist sprint position”. Sports Engineering. 22, 2019
https://www.researchgate.net/publication/331366310_CFD_analysis_of_an_exceptional_cyclist_sprint_position

Fabio Malizia, Bert Blocken “Cyclist aerodynamics through time: Better, faster, stronger”. Journal of Wind Engineering and Industrial Aerodynamics, Volume 214, 2021
https://www.sciencedirect.com/science/article/pii/S0167610521001574

Dwyer, D. B. “The effect of environmental conditions on performance in timed cycling events”. Journal of Science and Cycling, 3(3), 17-22, 2014
https://www.jsc-journal.com/index.php/JSC/article/view/152

Footnotes

Geographer, data-analist and cyclist↩︎
The wind force according to Beaufort is determined from the average of the wind speed over 10 minutes. KNMI offers ‘FF’: Wind speed (in 0.1 m/s) averaged over the last 10 minutes of the past hour. We use “FH”: the average over 1 hour is decisive.↩︎
Strava reports wind force in km/h instead of Beaufort for every ride. That in itself is of course more accurate.↩︎
The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA)↩︎
How to calculate air pressure at high altitude: https://www.omnicalculator.com/physics/air-pressure-at-altitude ↩︎
Source: https://en.wikipedia.org/wiki/Density_of_air ↩︎
Source: https://iflycoast.com/understanding-air-density-and-its-effects/↩︎
It is good to realize that, depending on your condition, a high percentage of the carbohydrates and fats that you burn while cycling are converted into heat that your body must lose.↩︎
Badran, Omar & Abdulhadi, Emad & Mamlook, Rustom. (1992). Evaluation of parameters affecting wind turbine power generation.
https://www.researchgate.net/publication/229042504_Evaluation_of_parameters_affecting_wind_turbine_power_generation ↩︎
The Magnus effect is the name given to the physical phenomenon that the rotation of objects in a liquid or in air affects their forward motion. If the object is spinning, there is a force perpendicular to the direction of motion↩︎
Bahill, A. Terry & Baldwin, David & Ramberg, John. (2009). Effects of altitude and atmospheric conditions on the flight of a baseball. International Journal of Sports Science and Engineering. 03. 1750-9823
https://www.researchgate.net/publication/228668381_Effects_of_altitude_and_atmospheric_conditions_on_the_flight_of_a_baseball ↩︎
Gavin Francis, The best weather conditions for a KOM on Strava: https://science4performance.com/2017/03/01/the-best-weather-conditions-for-a-kom-on-strava/↩︎
Charles Fulco, Paul Rock & A. Cymerman (1998). “Maximal and submaximal exercise performance at altitude”. In: Aviation, space, and environmental medicine, no. 69. pp. 793-801. https://www.researchgate.net/publication/13569370_Maximal_and_submaximal_exercise_performance_at_altitude ↩︎
Source: https://en.wikipedia.org/wiki/Density_of_air ↩︎
How to calculate air pressure at high altitude: https://www.omnicalculator.com/physics/air-pressure-at-altitude ↩︎
see footnote 13, Figure 1, page 794↩︎
Absolute humidity calculator: https://www.omnicalculator.com/physics/absolute-humidity ↩︎
Source: https://en.wikipedia.org/wiki/Drag_equation ↩︎
Source: https://www.quora.com/What-is-the-drag-coefficient-of-a-cyclist-on-a-road-bike ↩︎
https://www.omnicalculator.com/physics/drag-equation ↩︎
For instance “Time Savings at Any Speed when Removing 100g of Drag”. Jon Thornham (2019) https://blog.flocycling.com/aero-wheels/cycling-wheel-aerodynamics-how-speed-time-and-power-are-affected-by-reducing-drag/↩︎
Vito Muggeo: Segmented: An R Package to Fit Regression Models With Broken-Line Relationships https://www.researchgate.net/publication/234092680_Segmented_An_R_Package_to_Fit_Regression_Models_With_Broken-Line_Relationships ↩︎
Breakpoint analysis and segmented regression: https://rpubs.com/MarkusLoew/12164 ↩︎
Vito Muggeo, see footnote 22↩︎
The terms predictor, control, independent variable and covariate are often used interchangeably in regression contexts.↩︎
Appendix Figure 7g shows the correlation between both weather variables.↩︎
Source: https://rpubs.com/DragonflyStats/Cooks-Distance ↩︎
Source: https://stackoverflow.com/questions/48962406/add-cooks-distance-levels-to-ggplot2 ↩︎

Wind force, air density and average cycle ride speed at sea level. And your favorite air density for fast cycling.

Jos Schakenraad¹

First published on 2024-09-27 (updated 2024-09-27)

Introduction

Data collection

I. Cycle speed and wind force

1. Density plot

2. Boxplot

3. (Non-)linear relation

4. Model wind force

II. Cycle speed and air density

1. Air density formula

2. Air pressure at altitude

3. Absolute humidity

4. Drag equation

5. Segmented regression

6. Scatterplot in 3d

7. Model air density

8. Deviation from drag equation predicted speed

III. Multiple regression

1. Model selection

2. Model summary

3. Model equation

4. Residuals and leverage

Summary

Appendix: Eelde weather data

List of figures

Wind force, air density and average cycle ride speed at sea level. And your favorite air density for fast cycling.

Jos Schakenraad1

First published on 2024-09-27 (updated 2024-09-27)

Introduction

Data collection

I. Cycle speed and wind force

1. Density plot

2. Boxplot

3. (Non-)linear relation

4. Model wind force

II. Cycle speed and air density

1. Air density formula

2. Air pressure at altitude

3. Absolute humidity

4. Drag equation

5. Segmented regression

6. Scatterplot in 3d

7. Model air density

8. Deviation from drag equation predicted speed

III. Multiple regression

1. Model selection

2. Model summary

3. Model equation

4. Residuals and leverage

Summary

Appendix: Eelde weather data

List of figures

Jos Schakenraad¹