It will not be unknown to many cyclists: on some days the
air feels ‘light’ during your bike ride and you can easily achieve
higher speed. On other days, with ‘heavy’ air, you can’t seem to get
through. A second experience is that the average ride speed
decreases with increasing wind force. The two relevant weather
characteristics in this context are air density and wind force. In this
article we will investigate to what extent we can quantify the expected
influences of both variables on the average speed of a bicycle trip in a
raw, unpolished personal data set.
We use hourly weather data in the Netherlands. The Royal
Netherlands Meteorological Institute (KNMI) is the dutch national
weather service. Their hourly weather data provide an overview of the
observations made over the years at approximately 35 stations.
Since
the city of Groningen is my start and finish location, we download the
data from the nearest weather station: Eelde, 8.5 kilometers away as the
crow flies.
Since 2008 - still in the pre-Strava era - I have been
collecting some cycling data in an Excel file: date, distance of the
ride, duration, speed, region and sometimes some comments. Sufficient
data to make a selection of solo rides in the north of the country,
i.e. the provinces of Groningen, Fryslân (excl. Wadden Islands) and
Drenthe. The weather data can be linked to the cycling table with date
as the key field.
All MTB and cyclo-cross activities, as well as
rides shorter than 15 and longer than 165 km, and a few fun rides on my
retro racing bike are not taken into account. By applying these
selection criteria, the sample of bicycle rides is still far from
homogeneous. The sample contains both time trials - where we ride from
start to finish as quickly as possible - and interval training. The
latter especially since the introduction of Strava, when we suddenly
wanted to set fast times on segments. Furthermore, the sample also
includes recovery rides and less motivated ones. Almost all of them are
round rides, there and back in one day, usually in the afternoon.
Below is the R code to create the working file. The code also contains
the programming of the formula for air density - this variable is not
offered by the KNMI. Later in the article we present the formula in a
neat way.
Winter fitness is partly maintained by indoor cycling
activities - these are obviously not used in this article.
We use
average weather data from each cycling day over an 8-hour period, from
11 a.m. to 7 p.m., because there is a good chance that cycling occurred
during that time period. It should be clear, however, that in the
working file the weather data may apply to a time interval when I was
not on the bike.
Finally, it can be noted that this analysis covers
a period of more than 15 years of cycling, where, taking into account
the age of the author, it is the highest achievable to compensate for
the natural loss of fitness as much as possible.
version[['version.string']]
## [1] "R version 4.3.2 (2023-10-31 ucrt)"
library(openxlsx)
library(tidyverse)
options(width = 2500, scipen = 99)
# SOURCE: ROYAL NETHERLANDS METEOROLOGICAL INSTITUTE (KNMI)
# STN LON(east) LAT(north) ALT(m) NAME
# 280 6.585 53.125 5.20 Eelde
# HH : Hour; HH 12 is from van 11.00 UT to 12.00 UT
# FH : Hourly mean wind speed (in 0.1 m/s)
# T : Temperature (in 0.1 degrees Celsius) at 1.50 m at the time of observation
# P : Air pressure (in 0.1 hPa) reduced to mean sea level; at the time of observation
# U : Relative atmospheric humidity (in percents) at 1.50 m at the time of observation
# download.file("https://www.daggegevens.knmi.nl/klimatologie/uurgegevens", "data/knmi uurgegevens 2008-2017.txt", "curl",
# extra = "-d stns=280&vars=FH:T:P:U&start=2008010112&end=2018010120&fmt=csv") # KNMI requires TLS v1.2
# download.file("https://www.daggegevens.knmi.nl/klimatologie/uurgegevens", "data/knmi uurgegevens 2018-2027.txt", "curl",
# extra = "-d stns=280&vars=FH:T:P:U&start=2018010112&end=2027010120&fmt=csv") # KNMI requires TLS v1.2
knmi_hourly_data_2008_2017 <- read_csv("data/knmi uurgegevens 2008-2017.txt", col_names = FALSE, comment = "#")
knmi_hourly_data_2018_2027 <- read_csv("data/knmi uurgegevens 2018-2027.txt", col_names = FALSE, comment = "#")
knmi_hourly_data <- rbind(knmi_hourly_data_2008_2017, knmi_hourly_data_2018_2027) %>%
mutate(Date = ymd(X2), Hour = X3, Windforce_ms = X4 / 10, Temperature_C = X5 / 10, Air_pressure_hPa = X6 / 10, Relative_humidity = X7) %>%
select( - c(X1:X7)) %>%
group_by(Date) %>%
summarise(Windforce_ms = round(max(Windforce_ms),0),
Temperature_C = round(mean(Temperature_C),1),
Air_pressure_hPa = round(mean(Air_pressure_hPa),1),
Relative_humidity = round(mean(Relative_humidity),0),) %>%
mutate(Air_pressure_area = ifelse(Air_pressure_hPa < 1010, "L", ifelse(Air_pressure_hPa > 1020, "H", "normal")),
Windforce_Bft = cut(Windforce_ms, breaks = c(0.3, 1.6, 3.4, 5.5, 7.9, 10.8, 13.9, 17.2, 20.8, 24.5, 28.5, 32.6),
labels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))) %>%
mutate(Windforce_kmh = Windforce_ms * 3.6)
region <- read_csv("data/regio thesaurus.txt",
locale = locale(encoding = "WINDOWS-1252")) %>%
filter(regio == "Noordelijke prov.")
regional_cycle_activities <- read.xlsx("data/sport.xlsx", sheet="sport", detectDates = TRUE) %>%
left_join(region, by = "route") %>%
left_join(knmi_hourly_data, by = c("datum"="Date")) %>%
replace_na(list(fiets = '', route = '', opmerking = '')) %>%
mutate(Windforce_Bft = ifelse(row_number() == 4785, 7, ifelse(as.numeric(Windforce_Bft) == 1, 2, Windforce_Bft))) %>% # 1 ride Bft 9 and 4 rides Bft 1
mutate(Windforce_ms = ifelse(row_number() == 4785, 17, ifelse(Windforce_ms == 1, 2, Windforce_ms))) %>%
mutate(Windforce_Bft = factor(Windforce_Bft, levels = c(2:7), labels = c("2", "3", "4", "5", "6", "7"))) %>%
filter (!opmerking %in% c("censuur", "Tacx Boost", "VR") &
!fiets %in% c("Lapierre","cross","Casati","EM", "Bottecchia") &
regio == 'Noordelijke prov.' & sport == "wielrennen" & (trunc(km) >= 15 & trunc(km) <= 165)) %>%
rename (Date_cycle_activity = datum, Speed_kmh = "km/uur", Distance_km = km) %>%
mutate(Spring_summer = month(Date_cycle_activity) %in% c(5,6,7,8)) %>%
mutate( p1 = 0.61078 * exp((17.27 * Temperature_C)/(Temperature_C + 237.3)), # Calculate the saturation vapor pressure
pv = p1 * Relative_humidity, # Find the actual vapor pressure by multiplying the saturation vapor pressure by the relative humidity
pd = Air_pressure_hPa * 100 - pv, # 1 hPa = 1 mbar = 100 Pa
Air_density_kg_m3 = (pd / (287.058 * (Temperature_C + 273.15))) + (pv / (461.495 * (Temperature_C + 273.15)))) %>% # Kelvin = Celsius + 273.15 %>%
mutate(Absolute_humidity_g_m3 = (p1 * Relative_humidity * 100 / (461.495 * (Temperature_C + 273.15)))*100) %>% # Relative humidity in %
select(Date_cycle_activity, Speed_kmh, Distance_km, Temperature_C, Windforce_Bft, Windforce_ms, Relative_humidity, Absolute_humidity_g_m3, Air_pressure_hPa, Air_pressure_area, Air_density_kg_m3) %>%
mutate(Drag_force_N = 0.5 * Air_density_kg_m3 * (Speed_kmh/3.6)^2 * 0.5 * 0.9) %>% # speed required in m/s
filter(!is.na(Windforce_Bft)) %>% mutate(ID = row_number())
tail(regional_cycle_activities)
## Date_cycle_activity Speed_kmh Distance_km Temperature_C Windforce_Bft Windforce_ms Relative_humidity Absolute_humidity_g_m3 Air_pressure_hPa Air_pressure_area Air_density_kg_m3 Drag_force_N ID
## 1764 2024-07-16 29.53636 75.81 18.6 4 6 77 12.25609 1010.1 normal 1.205357 18.25606 1764
## 1765 2024-08-06 34.32308 44.62 25.9 2 3 50 12.10604 1009.8 L 1.175575 24.04364 1765
## 1766 2024-08-06 29.00513 56.56 25.9 2 3 50 12.10604 1009.8 L 1.175575 17.17028 1766
## 1767 2024-08-12 28.46809 111.50 26.6 3 5 53 13.34226 1011.0 normal 1.174147 16.52025 1767
## 1768 2024-08-13 27.96397 126.77 28.8 4 6 67 19.03756 1008.7 L 1.162587 15.78340 1768
## 1769 2024-08-15 30.62333 91.87 22.8 4 7 71 14.42848 1013.8 normal 1.192463 19.41456 1769
The script above ends with a list of the most recent trips in the
working file and its variables.
Let’s look at the relationship between speed and wind force, first.
We do this by successively creating a density graph, a box plot, a curve
and a regression model.
Two quantities to measure wind force are
Beaufort and meters per second. A cross tabulation of both quantities in
our working file is presented below. Please note that these are mean
wind speeds, averaged over 60 minutes.2 The wind speeds shown
in the table below and that you hear quoted in weather reports are
always measured at 10 meters above the ground. In general, they do not
reflect exactly the wind speeds that you would feel on the ground.
Meters per second is an interval scale - the difference between e.g. 5
and 6 m/s is identical to the difference between 30 and 31 m/s. Beaufort
does not have a constant pace of change.
For the purposes of this
article, the Beaufort scale was chosen because of my familiarity with it
after years of using weather apps.3 A second reason to use Beaufort is that it
has fewer classes and therefore more observations per class.
table(regional_cycle_activities$Windforce_Bft, regional_cycle_activities$Windforce_ms, dnn = c("in Bft", "Windforce in meters per second"))
## Windforce in meters per second
## in Bft 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 2 36 156 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 308 364 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 323 242 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 153 74 49 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 24 13 14 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0 0 0 5 5 1 2
A density plot is a representation of the distribution of a
numeric variable. It is a smoothed version of the histogram; a smooth
curve shows the distribution of the data in a more continuous way.
Histograms show counts, density plots show proportions. A multi density
chart is a density chart where several groups are represented. It allows
to compare their distribution. The issue with this kind of chart is that
it gets easily cluttered: groups overlap each other and the figure gets
unreadable. That’s why we use transparency in the color scheme.
Figure 1 is a multi density chart of the observed distribution of the
average bicycle speed, per Beaufort wind force number. It is very clear
that as the wind force decreases, the distribution of speed shifts
slightly further to the right, indicating an increasing average ride
speed.
A normal distribution speed curve of the complete sample (n =
1769) is superimposed. This can be used as a reference and to check to
what extent a possible necessary normality applies.
windcolors = c("#99FF00", "lightgreen", "green", "yellow", "orange", "red")
ggplot(regional_cycle_activities, aes(x = Speed_kmh, fill = Windforce_Bft)) +
scale_fill_manual(values = windcolors) +
geom_density(alpha = 0.2) +
stat_function(fun = dnorm, colour="red", linewidth = 1.5, args = list(mean =mean(regional_cycle_activities$Speed_kmh), sd = sd(regional_cycle_activities$Speed_kmh))) +
labs(title = "Density plots of average cycle ride speed grouped by windforce",
subtitle = paste0("compared to the normal distribution of the complete sample (n = ", nrow(regional_cycle_activities),")"),
caption = paste0("All my solo rides 2008 - ", year(today()), " in the north of the Netherlands. Reference date: ", today()),
fill='windforce (Bft)') +
ylab("density") + xlab("Ride average speed (km/h)") +
geom_segment(aes(x=22, xend=23.5, y=0.18, yend=0.18), color = 'red', linewidth = 1.5) +
annotate("text", x = 26, y = 0.18, label = "sample normal distribution", color = 'red')
Figure 1. Density plot speed and windforce List of figures
A boxplot is a graph that gives a visual indication of how a variable’s 25th percentile, 50th percentile (the median), 75th percentile, minimum, maximum and outlier values are spread out in different groups. The median of a variable is the value separating the higher half from the lower half of a variable.
Figure 2. How to read a boxplot (image: Michael Galarnyk) List of figures
Below is the R code and the resulting boxplot.
library(rstatix) # Kruskal-Wallis test, Anova, kruskal_effsize
res.kruskal <- regional_cycle_activities %>% kruskal_test(Speed_kmh ~ Windforce_Bft)
# The first input variables the function stat_box_data is the y variable, the other input is the upper_limit.
# It basically is the value to position the text on top of the boxes. For each plot it is necessary to adapt upper_limit and vjust.
stat_box_data <- function(y, upper_limit = max(as.numeric(regional_cycle_activities$Windforce_Bft)) * 4.4) {
return(
data.frame(
y = 0.95 * as.numeric(upper_limit),
label = paste('n of rides:', length(y), '\n',
'mean km/h:', round(mean(y),1), '\n',
'median km/h:', round(median(y),1), '\n')
)
)
}
ggplot(regional_cycle_activities, aes(Windforce_Bft, y = Speed_kmh, fill=as.numeric(Windforce_Bft))) +
geom_boxplot() +
theme(legend.position="none",
plot.title = element_text(size=17)) +
labs(title = "Relationship cycle trip speed and Beaufort windforce in the northern part of the Netherlands",
subtitle = get_test_label(res.kruskal, detailed = TRUE),
caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today())) +
xlab("windforce (Bft)") + ylab("Cycle trip speed km/h") +
scale_fill_gradient(high=hcl(15,100,75), low=hcl(195,100,75)) +
stat_summary(
fun.data = stat_box_data,
geom = "text", hjust = 0.5, vjust = 1.3
)
Figure 3. Boxplot Beaufort windforce and ride speed List of figures
The Kruskal–Wallis test is a non-parametric method for testing
whether samples originate from the same distribution.4
After applying the
Kruskal–Wallis test, the null hypothesis that there is no relationship
between wind force and median ride cycling speed is rejected as
the p-value is less than the significance level 0.05. From the output of
the Kruskal-Wallis test, we know that there is a significant difference
in median ride speed between wind force groups, but we don’t know which
pairs of groups are different. It’s possible to use the pairwise
Wilcoxon test to calculate pairwise comparisons between group levels
with corrections for multiple testing. The pairwise comparison shows
that all are significantly different (p < 0.05),
except 2 compared to 3, as shown in the table after the
following script.
pc <- pairwise.wilcox.test(regional_cycle_activities$Speed_kmh, as.numeric(regional_cycle_activities$Windforce_Bft), p.adjust.method = "BH")
pc_df <- as.data.frame(round(pc$p.value,9)) %>%
replace(is.na(.), "-")
colnames(pc_df) <- c("2", "3", "4", "5", "6")
rownames(pc_df) <- c("3", "4", "5", "6", "7")
pc_df
## 2 3 4 5 6
## 3 0.159849100 - - - -
## 4 0.000624398 0.001039662 - - -
## 5 0.000000127 0.000000002 0.000538863 - -
## 6 0.000000008 0.000000002 0.00000022 0.000290843 -
## 7 0.000000611 0.000000237 0.000000651 0.000004819 0.009895642
In addition to the significant relationship between wind force and
the average speed of the entire trip, Figure 3 reveals a non-linear
decrease. With every increase in Beaufort, the speed decreases
disproportionately. Not surprising, because the Beaufort scale is not a
linear scale. A twice as high wind force does not correspond to double
the wind speed.
The script below calculates the absolute and
relative decrease in the median trip speed per Beaufort. Graphically
shown in Figure 4.
decreasing_speed_table <- regional_cycle_activities %>%
group_by(Windforce_Bft) %>%
summarise(n_of_rides = n(), median_kmh = round(median(Speed_kmh),5)) %>%
mutate(change_in_median = round(median_kmh - lag(median_kmh),5),
across(where(is.numeric), ~replace(., is.na(.), 0)),
cum_change_in_median = cumsum(change_in_median),
cum_perc_change_in_median = round((1 - median_kmh / median_kmh[1]) * 100,5))
head(decreasing_speed_table)
## # A tibble: 6 × 6
## Windforce_Bft n_of_rides median_kmh change_in_median cum_change_in_median cum_perc_change_in_median
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 192 31.7 0 0 0
## 2 3 672 31.4 -0.307 -0.307 0.967
## 3 4 565 31.0 -0.445 -0.752 2.37
## 4 5 276 30.3 -0.732 -1.48 4.68
## 5 6 51 29.0 -1.27 -2.75 8.67
## 6 7 13 27.5 -1.51 -4.26 13.4
scale_factor <- max(-decreasing_speed_table$cum_change_in_median) / max(decreasing_speed_table$cum_perc_change_in_median)
decreasing_speed_model <- lm(formula = cum_change_in_median ~ poly(as.numeric(Windforce_Bft), 2), data = decreasing_speed_table)
decreasing_speed_model$fitted.values
## 1 2 3 4 5 6
## -0.04866357 -0.22293100 -0.72964629 -1.56880943 -2.74042043 -4.24447929
ggplot(decreasing_speed_table, aes(x = Windforce_Bft, group=1)) +
geom_line(aes(y=cum_change_in_median), color = "blue", linewidth = 1.1) +
stat_smooth(aes(y=cum_change_in_median), method = lm, se = FALSE, color = "darkgreen", linewidth = 1.0) +
stat_smooth(aes(y=cum_change_in_median), method = lm, formula = y ~ poly(x, 2), se = FALSE, color = "red", linewidth = 1.5) +
scale_y_continuous(name="change speed in km/h (≈ median cycle speed)", breaks = seq(-4,0,1), labels = c("-4 (≈28)", "-3 (≈29)", "-2 (≈30)", "-1 (≈31)", "0 (≈32)"), sec.axis=sec_axis(~./scale_factor, breaks = seq(-14,0,1), name="% decrease from reference wind force 2")) +
theme(
axis.title.y.left=element_text(color = "black"),
axis.text.y.left=element_text(color = "black"),
axis.title.y.right=element_text(color = "black"),
axis.text.y.right=element_text(color = "black")) +
labs(title = "Decreasing median trip speed (in km/h and as %) with increasing wind force (Bft)",
subtitle = paste0("in a group of ", nrow(regional_cycle_activities), " rides in the northern part of the Netherlands"),
caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today())) +
xlab("wind force (Bft)") +
geom_segment(aes(x=4.6, xend=4.95, y=-0.2, yend=-0.2), color = 'red', linewidth = 1.4) +
annotate("text", x = 5.5, y = -0.2, label = "y ~ poly(x, 2)", color = 'red') +
geom_segment(aes(x=4.6, xend=4.95, y=-0.4, yend=-0.4), color = 'blue', linewidth = 1.4) +
annotate("text", x = 5.5, y = -0.4, label = "observed ", color = 'blue') +
geom_segment(aes(x=4.6, xend=4.95, y=-0.6, yend=-0.6), color = 'darkgreen', linewidth = 1.4) +
annotate("text", x = 5.5, y = -0.6, label = "linear ", color = 'darkgreen') +
annotate("segment", x = 1, xend = 2, y = decreasing_speed_model$fitted.values[2], yend = decreasing_speed_model$fitted.values[2], color = "red") +
annotate("segment", x = 1, xend = 3, y = decreasing_speed_model$fitted.values[3], yend = decreasing_speed_model$fitted.values[3], color = "red") +
annotate("segment", x = 1, xend = 4, y = decreasing_speed_model$fitted.values[4], yend = decreasing_speed_model$fitted.values[4], color = "red") +
annotate("segment", x = 1, xend = 5, y = decreasing_speed_model$fitted.values[5], yend = decreasing_speed_model$fitted.values[5], color = "red") +
annotate("segment", x = 1, xend = 6, y = decreasing_speed_model$fitted.values[6], yend = decreasing_speed_model$fitted.values[6], color = "red") +
annotate("segment", x = 2, xend = Inf, y = decreasing_speed_model$fitted.values[2], yend = decreasing_speed_model$fitted.values[2], color = "red", linetype='dotted') +
annotate("segment", x = 3, xend = Inf, y = decreasing_speed_model$fitted.values[3], yend = decreasing_speed_model$fitted.values[3], color = "red", linetype='dotted') +
annotate("segment", x = 4, xend = Inf, y = decreasing_speed_model$fitted.values[4], yend = decreasing_speed_model$fitted.values[4], color = "red", linetype='dotted') +
annotate("segment", x = 5, xend = Inf, y = decreasing_speed_model$fitted.values[5], yend = decreasing_speed_model$fitted.values[5], color = "red", linetype='dotted') +
annotate("segment", x = 6, xend = Inf, y = decreasing_speed_model$fitted.values[6], yend = decreasing_speed_model$fitted.values[6], color = "red", linetype='dotted') +
geom_point(aes(x = 1, y = decreasing_speed_model$fitted.values[1]), colour="red", size = 3.5) +
geom_point(aes(x = 2, y = decreasing_speed_model$fitted.values[2]), colour="red", size = 3.5) +
geom_point(aes(x = 3, y = decreasing_speed_model$fitted.values[3]), colour="red", size = 3.5) +
geom_point(aes(x = 4, y = decreasing_speed_model$fitted.values[4]), colour="red", size = 3.5) +
geom_point(aes(x = 5, y = decreasing_speed_model$fitted.values[5]), colour="red", size = 3.5) +
geom_point(aes(x = 6, y = decreasing_speed_model$fitted.values[6]), colour="red", size = 3.5) +
annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[2], label = round(decreasing_speed_model$fitted.values[2],1), color = 'red') +
annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[3], label = round(decreasing_speed_model$fitted.values[3],1), color = 'red') +
annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[4], label = round(decreasing_speed_model$fitted.values[4],1), color = 'red') +
annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[5], label = round(decreasing_speed_model$fitted.values[5],1), color = 'red') +
annotate("text", x = 0.85, y = decreasing_speed_model$fitted.values[6], label = round(decreasing_speed_model$fitted.values[6],1), color = 'red')
Figure 4. Non-linear relation median ride speed and Beaufort wind force List of figures
The left y-axis of Figure 4 represents the median speed of the
ride and the descent of this median. In blue the observed scores, in red
the polynomial smoothed ones. The right y-axis contains the percentage
decrease. We conclude that the trip median speed at wind force 7 (less
than 28 km/h) has decreased by 4 km/hour compared to wind force 2 (32
km/h), which is a decrease of more than 13%. The speed reduction per
Beaufort increases progressively.
Finally, the question arises
whether the relationship between ride speed and wind speed measured in
meters per second is also non-linear. We answer that question by means
of \(R^2\) model comparison.
First
we collect some goodness-of-fit scores for four model variants with ride
average speed as dependent variable and wind direction as predictor,
measured in the following ways: 1. wind force in Bft; 2. Bft polynomial;
3. wind force in m/s; 4. m/s polynomial. Script plus output:
model1 <- lm(Speed_kmh ~ as.numeric(Windforce_Bft), data = regional_cycle_activities)
model2 <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2), data = regional_cycle_activities)
model3 <- lm(Speed_kmh ~ as.numeric(Windforce_ms), data = regional_cycle_activities %>% filter(as.numeric(Windforce_ms) >= 2 & as.numeric(Windforce_ms) <= 12))
model4 <- lm(Speed_kmh ~ poly(as.numeric(Windforce_ms), 2), data = regional_cycle_activities %>% filter(as.numeric(Windforce_ms) >= 2 & as.numeric(Windforce_ms) <= 12))
model_AIC <- as.data.frame(c(AIC(model1),AIC(model2),AIC(model3),AIC(model4)))
model_RSS <- as.data.frame(c(sum(resid(model1)^2),sum(resid(model2)^2),sum(resid(model3)^2),sum(resid(model4)^2)))
model_r_squared <- as.data.frame(c(round(summary(model1)$r.squared,3),
round(summary(model2)$r.squared,3),
round(summary(model3)$r.squared,3),
round(summary(model4)$r.squared,3)))
model_p <- as.data.frame(c(anova(model1)$'Pr(>F)'[1],anova(model2)$'Pr(>F)'[1],anova(model3)$'Pr(>F)'[1],anova(model4)$'Pr(>F)'[1]))
model_variables <- as.data.frame(c("Speed ~ Windforce (Bft)",
"Speed ~ polynomial(Windforce (Bft))",
"Speed ~ Windforce (m/s)",
"Speed ~ polynomial(Windforce (m/s))"))
models_compared <- cbind(model_variables, model_AIC, model_RSS, model_r_squared, model_p)
get_p_value_class <- function(pvalue){ # Function to get parameter p-value classification.
p_class = cut(pvalue, breaks = c(0,0.001,0.01,0.05,1),
labels = c("p < .001", "p < .01", "p < .05", "n.s."))}
names(models_compared) <- c("model", "AIC", "RSS", "R²", "significance")
models_compared <- models_compared %>% mutate(significance = get_p_value_class(significance))
models_compared
## model AIC RSS R² significance
## 1 Speed ~ Windforce (Bft) 7978.051 9384.569 0.059 p < .001
## 2 Speed ~ polynomial(Windforce (Bft)) 7958.462 9270.737 0.071 p < .001
## 3 Speed ~ Windforce (m/s) 7825.121 9077.016 0.044 p < .001
## 4 Speed ~ polynomial(Windforce (m/s)) 7824.703 9064.421 0.045 p < .001
Next we test the linear with its polynomial counterpart. The
output below shows that the speed and wind force relationship in
Beaufort is non-linear and therefore requires a polynomial component.
The non-linear model is significantly better in our dataset (p <
.001). However, if we represent the wind force in meters per second, a
linear model is sufficient. An extra component doesn’t add anything
significant.
compare_12 <- anova(model1, model2)$"Pr(>F)"[2]
compare_34 <- anova(model3, model4)$"Pr(>F)"[2]
significance <- as.data.frame(c(compare_12,compare_34)) %>% mutate(significance = get_p_value_class(.[[1]])) %>% select(significance)
model_speed_predictors <- as.data.frame(c("linear Beaufort wind force model versus polynomial Bft",
"linear wind force model in m/s versus polynomial m/s"))
bft_ms_compared <- cbind(model_speed_predictors, significance)
names(bft_ms_compared) <- c("models compared", "significance")
bft_ms_compared
## models compared significance
## 1 linear Beaufort wind force model versus polynomial Bft p < .001
## 2 linear wind force model in m/s versus polynomial m/s n.s.
The following regression model summarizes the relationship between ride average speed and Beaufort wind force in my data.
Average ride speed = 30.7975693 + -24.300838 * Windforce + -10.6692425 * Windforce\(^2\)
\(R^2\) = 0.0706
In general terms, density is the mass of anything – including air –
divided by the volume it occupies. In the metric system air density is
measured in terms of kilograms per cubic meter. The air’s density
depends on its temperature, its pressure and how much water vapor is in
the air (appendix Figure 7d,e,i). Air pressure is inversely associated
with altitude, but since my home cycling area is at sea level, we can
suffice with the KNMI air pressure. If the analyzed location is at a
high altitude, you can apply a specific formula to establish a more
accurate value for air pressure.5
At 1013.25 hPa sea level pressure and
15 °C, air has a density of approximately 1.225 \(kg/m^3\).6
Other things being
equal (most notably the pressure and humidity), hotter air is less dense
than cooler air and will thus rise while cooler air tends to fall.
Humid air is lighter, or less dense, than dry air at the same
temperature and pressure. Explanation: replacing nitrogen and oxygen
with water vapor decreases the weight of the air in the cubic meter;
that is, it’s density decreases. A fixed volume of gas, say one cubic
meter, at the same temperature and pressure, has always the same number
of molecules no matter what gas is in it.7 Compared to the
differences made by temperature and air pressure, humidity has a small
effect on the air’s density. Air can only contain a certain amount of
water vapor: the lower the temperature; the smaller the quantity
(appendix Figure 7b). When the air is humid, heat feels more oppressive
than when the air is dry. This is because cooling through evaporation is
made more difficult at high humidity. Due to a combination of high
humidity and high temperatures, it feels stuffy and there is less
enthusiasm for strenuous exercise. The body temperature and heart rate
increase rapidly in such conditions.8
There is also a second adverse effect
of humid air. With high humidity, there is on average more water vapor
in the air, up to approximately 20 grams per m³ at a temperature of 30
°C (appendix Figure 7b). This water vapor displaces other gases,
resulting in less oxygen per m³ of air in sultry, warm weather, causing
the body to inhale less oxygen with the same amount of effort.
High
air pressure means ‘heavy’ air, but there is no correlation low air
pressure - ‘light’ air. Low air pressure can indicate either ‘heavy’ or
‘light’ air (appendix Figure 7i).
Air density is a significant
factor in many disciplines. Just to mention a few examples:
power from a wind turbine and range of a
baseball.
The power available in the wind is directly
proportional to air density. As air density increases the available
power also increases, lower air density results in lower power from a
wind turbine.9
When the air density is lower,
baseballs can travel further. The aerodynamic ‘drag’ force gets smaller.
Conversely, cold temperatures can reduce the distance a ball travels due
to increased air density. But at the same time, as air density gets
smaller, the Magnus force10 also gets smaller, which means that the
ball will not be held aloft as long and will therefore not go as far.
These two effects are in opposite directions. Simulation shows that the
change in the drag force affects the trajectory of the ball more than
the change in the Magnus force. Therefore, as air density goes down, the
range of a potential home run ball increases.11
The importance of air density for cyclists is the
following: more dense, or ‘heavier’ air will slow down cyclists moving
through it more because the cyclist has to, in effect, shove aside more
or heavier molecules. Such air resistance, ‘drag’, is proportional to
the density of the air though which a cyclist is pushing both body and
bike. If the air density decreases by a certain percentage, the drag
drops by the same percentage, assuming that other relevant factors
remain the same.
A 5% reduction in air density implies that, other
things being equal, the cube of your speed can increase by 5%, resulting
in a 1.6% reduction in time. The difference is a second per minute.12
Let’s check the following: a cyclist completing a 30 km lap in one hour,
averaging 30 km/h, would finish in 59 minutes in 5% less dense air,
ceteris paribus.
\(\Delta_{speed\space km/h} =
\frac{\sqrt[3]{30^{3}\cdot 1.05} - 30}{30}\cdot100 =
\frac{0.49189}{30}\cdot100 = 1.63963\space \%\)
\(duration_{minutes} = \frac{30\space km}{30.49189}
\cdot 60 = 59\)
Low air density not only implies speed gains, but also a lower
oxygen concentration (called partial pressure of oxygen). Suppose you
are cycling in the mountains at an altitude of 3,000 meters. At that
location the oxygen concentration is approximately 70% of that at sea
level. The air you breathe therefore contains significantly less oxygen
which of course results in a reduced maximum performance. Studies show
that V̇O₂max decreases as altitude increases. A small decrease in V̇O₂max,
compared to values at sea level, is already observed at an altitude of
580 meters. As altitude increases, the amount of oxygen that can be
absorbed into the blood decreases, causing V̇O₂max to further decrease by
an average of about 15% at 3,000 meters relative to sea level.13 After
listing the formulas in sections II.1 and II.2, we will present this
data in a table.
It is expected that a lower oxygen content due to
low air density does not play any role in my data. The probably minimal
loss of power due to less oxygen is more than compensated by the reduced
air resistance.
The density of humid air is found by:14
\(\displaystyle \rho\,_{humid\,air} =
\frac{p_d}{R_dT} + \frac{p_v}{R_vT}\)
where:
\(\displaystyle
\rho\,_{humid\,air}\) density of the humid air (\(kg/m^3\))
\(\displaystyle p_d\) partial pressure of dry air (Pa)
\(\displaystyle R_d\) specific gas constant for dry air, 287.058 J/(kg·K)
\(\displaystyle T\) temperature (K)
\(\displaystyle p_v\) pressure of water vapor (Pa)
\(\displaystyle R_v\) specific gas constant for water vapor, 461.495 J/(kg·K)
The vapor pressure of water may be calculated from the
saturation vapor pressure and relative humidity:
\(\displaystyle p_v = \phi p_{sat}\)
where:
\(\displaystyle p_v\) pressure of water vapor (Pa)
\(\displaystyle \phi\) relative humidity (range 0.0 – 1.0)
\(\displaystyle p_{sat}\)
saturation vapor pressure (kPa)
The saturation vapor pressure
of water at any given temperature is the vapor pressure when relative
humidity is 100%. One formula used to find the saturation vapor pressure
is:
\(\displaystyle p_{sat} = 0.61078 \, · exp \left(\frac{17.27T}{T + 237.3}\right)\)
where:
\(\displaystyle p_{sat}\) saturation vapor pressure (kPa)
\(\displaystyle T\) temperature
(°C)
The partial pressure of dry air \(\displaystyle p_d\) is found considering
partial pressure, resulting in:
\(\displaystyle p_d = p - p_v\)
where \(\displaystyle p\) denotes the observed absolute pressure.
Below is my R script for the air density formula:
mutate(p1 = 0.61078 * exp((17.27 * Temperature_C)/(Temperature_C + 237.3)), # Calculate the saturation vapor pressure
pv = p1 * Relative_humidity, # Find the actual vapor pressure by multiplying the saturation vapor pressure by the relative humidity
pd = Air_pressure_hPa * 100 - pv, # 1 hPa = 1 mbar = 100 Pa
Air_density_kg_m3 = (pd / (287.058 * (Temperature_C + 273.15))) + (pv / (461.495 * (Temperature_C + 273.15)))) %>% # Kelvin = Celsius + 273.15
To calculate air pressure at altitude it is necessary to use
the barometric formula:15
\(\displaystyle p = p_0 \, · exp
\left(\frac{−gM(h−h_0)}{RT}\right)\)
where:
\(\displaystyle p\) pressure in
hPa at altitude h
\(\displaystyle
h\) altitude in m
\(\displaystyle p_0\) pressure at sea
level (1013.25 hPa)
\(\displaystyle
h_0\) altitude at sea level (0 m)
\(\displaystyle T\) temperature in
Kelvins at altitude h
\(\displaystyle
M\) molar mass of air, 0.0289644 kg/mol
\(\displaystyle R\) ideal universal gas
constant, 8.31432 N · m/(mol · K)
\(\displaystyle g\) acceleration due to
gravity, 9.80665 (\(m/s\))
R script:
pressure_at_altitude_h <- 1013.25 * exp((-9.80665 * 0.0289644 * Altitude)/((8.31432 * (Temperature_C + 273.15))))
By applying the above formulas we can determine the air density
at different altitudes, for example at 15 degrees Celsius and a humidity
of 70%, see output below. The V̇O₂max decline column is taken from
Charles Fulco et al.16 It is an estimate based on various
studies.
## Pass Altitude_m Temperature_C Relative_humidity Air_pressure_hPa Oxygen_level_perc VO2max_decline_perc Air_density_kg_m3
## 1 Groningen (The Netherlands) 0 15 70 1013.25 100 0 1.224
## 2 Langenhard (Schwarzwald) 430 15 70 962.89 95 0 1.164
## 3 Valbrona (Italy) 500 15 70 954.93 94 0 1.154
## 4 Col du Rosier (Walloon region) 580 15 70 945.92 93 1 1.143
## 5 Col de la Rivière Noire (France) 1000 15 70 899.97 89 3 1.087
## 6 Cruz de Fierro (Spain) 1500 15 70 848.17 84 5 1.025
## 7 Bormio 2000 (Italy) 2000 15 70 799.35 79 8 0.966
## 8 Großglockner (Austria) 2500 15 70 753.34 74 11 0.910
## 9 Passo Stelvio (Italy) 2750 15 70 731.34 72 13 0.884
## 10 Col de Sommeiller (France) 3000 15 70 709.98 70 15 0.858
When we reach the top of the Col de Sommeiller, we cycle at an
altitude of 3,000 meters. The air pressure is 709.98 hPa (compared to
1013.25 at sea level). The oxygen concentration is 70% (100% at sea
level). The V̇O₂max has decreased by 15%. The air density is not higher
than 0.858 kg/m3, which of course also corresponds to 70% of the
reference value 1.224.
The lowest hPa in my cycle weather data is
979.3. The minimum density measured is 1.147 - both independent of
temperature and humidity. These relatively high scores make us refrain
from further using V̇O₂max in this article.
Relative humidity is the amount of water vapor present in the air
expressed as a percentage of the maximum possible amount of water vapor
that the air can hold at a given temperature. It depends on the
temperature.
Absolute humidity describes the actual amount of water
vapour in the air in \(g/m^3\).
Absolute humidity does not take into account the air temperature. It
only depends on the amount of water vapor.
If we know the relative humidity and air temperature, we can easily calculate the absolute humidity (AH) in \(g/m^3\) using the equation:17
\(\displaystyle AH = \left(\frac{p_{sat} \phi · 100}{R_vT}\right) · 100\)
In R:
mutate(Absolute_humidity_g_m3 = (p1 * Relative_humidity * 100 / (461.495 * (Temperature_C + 273.15)))*100) %>% # Relative humidity in %
If an object moves through the air, it experiences a
resisting force, the drag force. The value of this force depends on the
size and shape of an object, the air density and the velocity. The drag
equation is a formula used to calculate this force. 18
\(F_d = \frac{1}{2}\space \rho\space u^2\space A\space C_d\)
where:
\(F_d\) the drag force in Newton (\(N\))
\(\rho\) air density (\(kg/m^3\))
\(u\) velocity (\(m/s\)). It’s squared in the formula: aerodynamic drag increases exponentially the faster you go.
\(A\) reference area (\(m^2\)) i.e. the frontal area of the cyclist, let’s say 0.5.
\(C_d\) drag coefficient, let’s
say 0.9.
The typical drag coefficient of a cyclist on a road
bike is around 0.8 to 1.0. This value depends on factors such as the
cyclist’s body position, the bike design, and the cyclist’s clothing.
More aerodynamic body positions and equipment can reduce the drag
coefficient down to around 0.7, while a less aerodynamic setup may
result in a coefficient closer to 1.0.19
In cycling the
drag area is often used, \(AC_d\)
(\(m^2\)), which is the product of the
frontal area of the cyclist and the drag coefficient.
In R:
mutate(Drag_force_N = 0.5 * Air_density_kg_m3 * (Speed_kmh/3.6)^2 * 0.5 * 0.9) # speed required in m/s
An online drag equation calculator may help you compute the force.20 Here
you can simply confirm that a 10% decrease in air density (for example
from 1.28 to 1.152) leads to an equal decrease in drag (from 20 to 18
\(N\)). Tables are available that show
how drag reduction affects you at your speeds.21
Segmented or broken-line models are regression models where the
relationships between the response and one or more explanatory variables
are piecewise linear, namely represented by two or more straight lines
connected at unknown values, called breakpoints.22 23 A segmented
regression model has two or more sub-models. Segmented regression is
useful when the independent variable, clustered into two or more groups,
exhibits a different relationship with the dependent one in these
groups.
The estimated model can be visualized by the print summary.
The variable labeled with ‘U1.Air_density_kg_m3’ stands for the
difference-in-slope parameter of the variable ‘Air_density_kg_m3’. The
p-value relevant to U1 is never reported in the ‘segmented’ output and
NA is printed because “standard asymptotics do not apply”.24 In
order to test for a significant difference-in-slope, the Davies’ test
can be used. The output below shows that a significant breakpoint
exists. The use of two submodels is therefore justified. The
submodel-intercepts and slopes are printed in the output, as well as the
breakpoint’s confidence interval.
library(segmented)
fit.lm <- lm(Speed_kmh ~ Air_density_kg_m3, data = regional_cycle_activities)
fit.seg <- segmented(fit.lm, seg.Z = ~ Air_density_kg_m3, psi = 1.225)
summary(fit.seg)
##
## ***Regression Model with Segmented Relationship(s)***
##
## Call:
## segmented.lm(obj = fit.lm, seg.Z = ~Air_density_kg_m3, psi = 1.225)
##
## Estimated Break-Point(s):
## Est. St.Err
## psi1.Air_density_kg_m3 1.196 0.003
##
## Coefficients of the linear terms:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42.85 19.18 -2.234 0.02563 *
## Air_density_kg_m3 62.42 16.19 3.855 0.00012 ***
## U1.Air_density_kg_m3 -86.10 16.33 -5.272 NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.287 on 1765 degrees of freedom
## Multiple R-Squared: 0.07466, Adjusted R-squared: 0.07309
##
## Boot restarting based on 6 samples. Last fit:
## Convergence attained in 2 iterations (rel. change 0.0000000000052253)
davies.test(fit.lm, seg.Z = ~ Air_density_kg_m3, k=15)
##
## Davies' test for a change in the slope
##
## data: formula = Speed_kmh ~ Air_density_kg_m3 , method = lm
## model = gaussian , link = identity
## segmented variable = Air_density_kg_m3
## 'best' at = 1.1885, n.points = 15, p-value = 0.000000001902
## alternative hypothesis: two.sided
intercept(fit.seg)
## $Air_density_kg_m3
## Est.
## intercept1 -42.848
## intercept2 60.120
slope(fit.seg)
## $Air_density_kg_m3
## Est. St.Err. t value CI(95%).l CI(95%).u
## slope1 62.419 16.192 3.8549 30.661 94.178
## slope2 -23.679 2.128 -11.1270 -27.853 -19.506
seg_preds <- predict(fit.seg)
seg_res <- regional_cycle_activities$Speed_kmh - seg_preds
CI_break_point <- confint(fit.seg)
CI_break_point
## Est. CI(95%).low CI(95%).up
## psi1.Air_density_kg_m3 1.19592 1.1898 1.20205
The results obtained can now be graphed with air density on
the x-axis and the ride average of the speed on the y-axis. Each point,
i.e. each ride, is given a color depending on the daytime temperature
(Figure 5a).
The segmented regression model looks like an inverted
v-shaped curve (red line). The gam-smoothing curve (in green)
approximates the v-shape. The breakpoint is estimated at 1.196 (vertical
thin line).
The calculated turning point is personal; it will be
different for every cyclist. Cyclists from Eritrea are very likely to
suffer less from heat. The breaking point can be interpreted as the
optimal air density to achieve a high cycling speed.
pred_speed_CI_break_point_low <- predict(fit.seg, data.frame(Air_density_kg_m3 = c(CI_break_point[,2])))
pred_speed_CI_break_point_upp <- predict(fit.seg, data.frame(Air_density_kg_m3 = c(CI_break_point[,3])))
regional_cycle_activities <- regional_cycle_activities %>%
mutate(Temp_class = cut(Temperature_C, breaks = c(-10,0,9,14,19,24,29,34,40),
labels = c("<0", "1-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35+")))
tempcolors = c("<0" = "black", "1-9" = "blue", "10-14" = "lightblue", "15-19" = "#FFCCCC", "20-24" = "#FFCC99", "25-29" = "#FF9933", "30-34" = "red", "35+" = "#CC0000" )
ggplot(regional_cycle_activities, aes(x = Air_density_kg_m3, y = Speed_kmh )) +
geom_point(aes(color = Temp_class)) +
scale_color_manual(values=tempcolors) +
scale_x_continuous(expression ("air density in "~kg/m^3)) +
scale_y_continuous("speed km/h") +
geom_smooth(method = "gam", se = TRUE, level = 0.95, color="darkgreen") +
geom_line(aes(y = seg_preds), color = "red", linewidth = 1) +
annotate("segment", x = round(fit.seg$psi[2],3), xend = round(fit.seg$psi[2],3), y = -Inf, yend = max(seg_preds), color = "red", linetype= 'dotted', linewidth = 1) +
annotate("segment", x = CI_break_point[,2], xend = CI_break_point[,2], y = -Inf, yend = pred_speed_CI_break_point_low, color = "darkgrey") +
annotate("segment", x = CI_break_point[,3], xend = CI_break_point[,3], y = -Inf, yend = pred_speed_CI_break_point_upp, color = "darkgrey") +
annotate("rect", xmin = CI_break_point[,2], xmax = CI_break_point[,3], ymin = -Inf, ymax = pred_speed_CI_break_point_low, alpha = .2) +
geom_segment(aes(x=round(fit.seg$psi[2],3)+0.002, xend=round(fit.seg$psi[2],3)+0.018, y=24, yend=24),
arrow = arrow(length = unit(0.4, "cm")), color = 'red') +
annotate("text", x = round(fit.seg$psi[2],3)+0.045, y = 24, label = paste0("breakpoint ", round(fit.seg$psi[2],3), " (CI = 0.95)"), color = 'red') +
annotate("segment", x = 1.26, xend = 1.275, y = 37.5, yend = 37.5, color = "darkgreen") +
annotate("segment", x = 1.26, xend = 1.275, y = 37, yend = 37, color = "red") +
annotate("text", x = 1.30, y = 37.5, label = "gam smoothing (CI = 0.95)", size=3.5, color = "darkgreen") +
annotate("text", x = 1.30, y = 37, label = "segmented regression ", size=3.5, color = "red") +
theme(plot.title = element_text(size=17)) +
labs(title = "Segmented relationship average ride speed and air density",
subtitle = paste0("in a group of ", nrow(regional_cycle_activities), " rides in the northern part of the Netherlands"),
caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today()),
colour='temperature °C')
Figure 5a. Segmented relation average ride speed and air density List of figures
As expected, we see a decreasing average trip speed with
increasing air density, but the decrease does not start until after the
breaking point. At very low air density the chance of a high average
speed apparently decreases. These appear to be very warm days - at least
by local standards. The explanatory factor is high absolute humidity
that can occur at high temperatures. Moist air is lighter, but it makes
the air temperature feel much hotter than it is. It interferes with the
body’s natural cooling mechanism causing the heart rate to increase. It
could lead to dehydration.
The bivariate daytime relationships between temperature,
absolute humidity and air density, as recorded in Eelde in the period
2008-2024, are plotted in Appendix Figure 7b,c,d.
We can add the
absolute humidity as an extra dimension to Figure 5a. Below the script
and the output, Figure 5b, an interactive 3D scatter plot. While
hovering over the plot, the popup text shows the relevant ride
information.
library(plotly)
plot_ly(data = regional_cycle_activities, x = ~Air_density_kg_m3, y = ~Absolute_humidity_g_m3, z = ~Speed_kmh, type="scatter3d", mode="markers",
color = ~Temp_class, colors = tempcolors, alpha = 0.9, size = 0.4, text = ~paste('id: ', ID, '<br>distance:', Distance_km, '<br>year:', year(Date_cycle_activity), '<br>windforce:', Windforce_Bft)) %>%
layout(legend = list(orientation = "vertical",
yanchor = "center",
y = 0.5)) %>%
layout( title = "<br><br><br>Air density, humidity, temperature and cycle speed",
scene = list(xaxis = list(title = 'x = air density (kg/m³)'),
yaxis = list(title = 'y = absolute humidity (g/m³)'),
zaxis = list(title = 'z = speed (km/h)')),
annotations = list(
x = 1.15,
y = 0.60,
text = 'temperature (°C)',
xref = 'paper',
yref = 'paper',
showarrow = FALSE))
Figure 5b. Speed, air density and humidity in 3d List of figures
The following regression model summarizes the relationship
between average speed of the ride and air density in my data.
if Air density < 1.196 : Speed = -42.848 + 62.419 * Air density
if Air density = 1.196 : Speed = -42.848 + 62.419 * 1.196 = 60.12 + -23.679 * 1.196 = 31.799916
if Air density > 1.196 : Speed = 60.12 + -23.679 * Air density
\(R^2\) = 0.0747
Now we can compare the air density - speed
relationship modeled according to our own data with the actual,
scientific relationship. An air density of 1.31 is chosen as a
reference, together with the observed median speed of just over 29 km
per hour. The table below shows the effect on speed if the air density
decreases in steps of 10 grams per cubic meter. Completely as expected,
the table shows that, for example, a 5.3% density decrease compared to
1.31 leads to a 1.75% speed gain if we apply the drag equation, which
corresponds to 62 seconds per hour. The table also contains the
predicted speed according to the segmented regression model and that
model indicates an improvement of 3:14 minutes per hour for the same
decrease in air density, which is more than three times faster than the
62 seconds just mentioned.
air_density_values <- data.frame(seq(from = 1.31, to = 1.15, by = -0.01))
names(air_density_values) <- "Air_density_kg_m3"
segm_prediction <- data.frame(predict(fit.seg, air_density_values))
names(segm_prediction) <- "speed_predicted_segmented"
format_min_sec <- function(s) {
s <- as.integer(s)
sprintf('%02i:%02i', s %/% 60 %% 60, s %% 60)}
air_density_pred <- regional_cycle_activities %>%
mutate(air_density_kg_m3 = round(Air_density_kg_m3,2),
air_density_kg_m3 = ifelse(air_density_kg_m3 <= 1.15, 1.15, ifelse(air_density_kg_m3 >= 1.31, 1.31, air_density_kg_m3)),
deviation_from_reference = 1.31 - air_density_kg_m3,
prop_deviation_from_reference = deviation_from_reference / 1.31) %>%
group_by(air_density_kg_m3) %>%
summarise(n_rides = n(), speed_kmh_observed_median = median(Speed_kmh),
speed_kmh_observed_mean = mean(Speed_kmh),
deviation_from_reference = mean(deviation_from_reference), prop_deviation_from_reference = mean(prop_deviation_from_reference)) %>%
arrange(desc(air_density_kg_m3)) %>%
cbind(.,segm_prediction) %>%
mutate(speed_increase_in_perc_due_to_less_drag = (((first(speed_kmh_observed_median)^3 * (1+prop_deviation_from_reference))^(1 / 3) - first(speed_kmh_observed_median)) / first(speed_kmh_observed_median) ) * 100 ,
speed_predicted_drag = (1+speed_increase_in_perc_due_to_less_drag/100) * first(speed_kmh_observed_median),
delta_speed_segmented = speed_predicted_segmented - lag(speed_predicted_segmented),
delta_speed_drag = speed_predicted_drag - lag(speed_predicted_drag),
drag_force_N = 0.5 * air_density_kg_m3 * (first(speed_kmh_observed_median)/3.6)^2 * 0.5 * 0.9,
drag_gain_minutes_per_hour = format_min_sec(round(abs((first(speed_kmh_observed_median) / speed_predicted_drag * 3600) - 3600),0)),
segm_gain_minutes_per_hour = format_min_sec(round(abs((first(speed_predicted_segmented) / speed_predicted_segmented * 3600) - 3600),0))) %>%
dplyr::select(air_density_kg_m3, n_rides, deviation_from_reference, prop_deviation_from_reference, speed_increase_in_perc_due_to_less_drag, speed_predicted_drag, delta_speed_drag, drag_gain_minutes_per_hour, drag_force_N, speed_predicted_segmented, delta_speed_segmented, segm_gain_minutes_per_hour, speed_kmh_observed_median, speed_kmh_observed_mean) %>%
mutate(across(c(4:7,9:11,13:14), ~ format(round(.x, digits=3), nsmall = 3)))
air_density_pred
## air_density_kg_m3 n_rides deviation_from_reference prop_deviation_from_reference speed_increase_in_perc_due_to_less_drag speed_predicted_drag delta_speed_drag drag_gain_minutes_per_hour drag_force_N speed_predicted_segmented delta_speed_segmented segm_gain_minutes_per_hour speed_kmh_observed_median speed_kmh_observed_mean
## 1 1.31 22 0.00 0.000 0.000 29.369 NA 00:00 19.616 29.100 NA 00:00 29.369 29.681
## 2 1.30 30 0.01 0.008 0.254 29.443 0.075 00:09 19.467 29.337 0.237 00:29 29.780 29.504
## 3 1.29 51 0.02 0.015 0.506 29.517 0.074 00:18 19.317 29.574 0.237 00:58 30.390 30.046
## 4 1.28 82 0.03 0.023 0.758 29.591 0.074 00:27 19.167 29.810 0.237 01:26 30.312 29.669
## 5 1.27 111 0.04 0.031 1.008 29.665 0.073 00:36 19.017 30.047 0.237 01:53 30.222 29.946
## 6 1.26 154 0.05 0.038 1.256 29.738 0.073 00:45 18.868 30.284 0.237 02:21 30.741 30.304
## 7 1.25 178 0.06 0.046 1.504 29.810 0.073 00:53 18.718 30.521 0.237 02:48 30.375 30.230
## 8 1.24 180 0.07 0.053 1.750 29.883 0.072 01:02 18.568 30.758 0.237 03:14 31.006 30.756
## 9 1.23 236 0.08 0.061 1.996 29.955 0.072 01:10 18.418 30.994 0.237 03:40 31.009 30.826
## 10 1.22 206 0.09 0.069 2.240 30.026 0.072 01:19 18.269 31.231 0.237 04:06 31.869 31.425
## 11 1.21 188 0.10 0.076 2.482 30.098 0.071 01:27 18.119 31.468 0.237 04:31 31.673 31.591
## 12 1.20 147 0.11 0.084 2.724 30.169 0.071 01:35 17.969 31.705 0.237 04:56 32.286 31.765
## 13 1.19 104 0.12 0.092 2.965 30.239 0.071 01:44 17.819 31.431 -0.273 04:27 31.921 31.429
## 14 1.18 50 0.13 0.099 3.204 30.310 0.070 01:52 17.670 30.807 -0.624 03:19 31.029 31.189
## 15 1.17 19 0.14 0.107 3.442 30.380 0.070 02:00 17.520 30.183 -0.624 02:09 28.468 29.928
## 16 1.16 6 0.15 0.115 3.680 30.449 0.070 02:08 17.370 29.559 -0.624 00:56 28.328 28.933
## 17 1.15 5 0.16 0.122 3.916 30.519 0.069 02:16 17.220 28.935 -0.624 00:21 28.899 28.496
Let’s see how both models compare (Figure 5c). The blue line
is the speed prediction according to the segmented regression based on
the dataset, as shown in Figure 5a. The red line represents the actual
effect of air density according to the drag equation. The secondary
y-axis shows the speed gain in minutes and as a percentage, both
compared to the reference speed cycled at an air density of 1.31 (29.369
km per hour). Due to reduced drag, the speed increases to 30.519 km/h at
an air density of 1.15, other things being equal. A time saving of 02:16
minutes per hour. In our own cycling data, which in no way meet the
ceteris paribus clause, a maximum average time saving of 5 minutes per
hour has been predicted at an air density of 1.196, after which speed
will decrease. This decrease is the result of unpleasant effects (on
myself) of high humidity associated with - for local standards - high
temperatures, as noted earlier. The speed increase with decreasing air
density to the breaking point is greater in the segmented regression
model than calculated according to reduced drag. No doubt a number of
other factors play a role here, such as dry roads, aero summer outfit,
use of a time trial bike.
ylim.prim <- c(28, 32)
ylim.sec <- c(-120, 320)
b <- diff(ylim.prim)/diff(ylim.sec)
a <- ylim.prim[1] - b*ylim.sec[1]
ggplot() +
geom_line(aes(x = air_density_pred$air_density_kg_m3, y = as.numeric(air_density_pred$speed_predicted_drag)), color="red", linewidth = 1.1) +
annotate("segment", x = fit.seg$psi[2], xend = fit.seg$psi[2], y = -Inf, yend = max(seg_preds),linewidth = 1, linetype= 'dotted', color = "blue") +
geom_line(aes(x = regional_cycle_activities$Air_density_kg_m3, y = seg_preds), color = "blue", linewidth = 1) +
scale_y_continuous("predicted speed km/h", limits=c(28,32), breaks = c(28:32), sec.axis = sec_axis(~ (. - a)/b, name = "gain in min/h and % compared to reference 1.31", breaks= seq(0,300, by=60), labels = c("00:00","01:00 = 1.7%","02:00 = 3.3%","03:00 = 5.0%","04:00 = 6.7%","05:00 = 8.3%"))) +
geom_segment(aes(x=round(fit.seg$psi[2],3)+0.0053, xend=round(fit.seg$psi[2],3)+0.018, y=28.2, yend=28.2),
arrow = arrow(length = unit(0.35, "cm")), color = 'blue') +
annotate("text", x = round(fit.seg$psi[2],3)+0.035, y = 28.2, label = paste0("breakpoint ", round(fit.seg$psi[2],3)), color = 'blue') +
annotate("segment", x = 1.24, xend = 1.25, y = 28.5, yend = 28.5, color = "red") +
annotate("segment", x = 1.24, xend = 1.25, y = 28.7, yend = 28.7, color = "blue") +
annotate("text", x = 1.27, y = 28.4, label = "drag equation-predicted\n(reference point 1.31)", size=3.5, color = "red") +
annotate("text", x = 1.27, y = 28.7, label = "segmented regression ", size=3.5, color = "blue") +
scale_x_continuous(expression ("air density in "~kg/m^3),limits=c(1.15,1.31),breaks=c(1.15,1.16,1.17,1.18,1.19,1.20,1.21,1.22,1.23,1.24,1.25,1.26,1.27,1.28,1.29,1.30,1.31)) +
theme(plot.title = element_text(size=17), panel.grid.minor.x = element_blank()) +
labs(title = "Relationship trip cycle speed and air density: segmented versus drag equation",
subtitle = paste0("in a group of ", nrow(regional_cycle_activities), " rides in the northern part of the Netherlands"),
caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today()))
Figure 5c. Relationship trip cycle speed and air density: segmented versus drag equation List of figures
Multiple regression is an extension of simple linear
regression, which we have used so far. It is used when we want to
predict the value of a variable based on the value of two or more other
variables. The dependent variable is the average trip speed. In addition
to both weather characteristics, wind force and air density, we add two
control variables: the length of the ride in km, and the year of the
activity.25
We collected goodness-of-fit measures and p-values for
simple and multiple regressions, paying particular attention to the AIC
and R². The AIC (‘Akaike Information Criterion’) is a a widely used
metric for evaluating the accuracy of predictions. The lower the AIC
score the better. The aim is to find the model that minimizes AIC. R² is
a statistical measure that represents the proportion of the variance for
a dependent variable that’s explained by the predictors.
lm_no_1 <- lm(Speed_kmh ~ as.numeric(Windforce_Bft), data = regional_cycle_activities)
lm_no_2 <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2), data = regional_cycle_activities)
lm_no_3 <- lm(Speed_kmh ~ Air_density_kg_m3, data = regional_cycle_activities)
lm_no_4 <- segmented(lm_no_3,seg.Z = ~ Air_density_kg_m3, psi = 1.20)
lm_tem <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2) + Air_density_kg_m3, data = regional_cycle_activities)
lm_no_5 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)
lm_tem <- lm(Speed_kmh ~ (poly(as.numeric(Windforce_Bft), 2) + Air_density_kg_m3)^2, data = regional_cycle_activities)
lm_no_6 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)
lm_no_7 <- lm(Speed_kmh ~ Distance_km, data = regional_cycle_activities)
lm_no_8 <- lm(Speed_kmh ~ year(Date_cycle_activity), data = regional_cycle_activities)
lm_tem <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2) + Distance_km + Air_density_kg_m3, data = regional_cycle_activities)
lm_no_9 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)
lm_tem <- lm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2) + year(Date_cycle_activity) + Air_density_kg_m3, data = regional_cycle_activities)
lm_no_10 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)
lm_tem <- lm(Speed_kmh ~ year(Date_cycle_activity) + poly(as.numeric(Windforce_Bft), 2) + Distance_km + Air_density_kg_m3, data = regional_cycle_activities)
lm_no_11 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)
# the glm p-value can be determined with the anova function comparing the fitted model to a null model.
# The null model is fit with only an intercept term on the right side of the model.
glm.0 <- glm(Speed_kmh ~ 1, data = regional_cycle_activities)
lm_tem <- glm(Speed_kmh ~ poly(as.numeric(Windforce_Bft), 2) + Air_density_kg_m3 + (year(Date_cycle_activity) + Distance_km)^2, data = regional_cycle_activities)
lm_no_12 <- segmented(lm_tem, seg.Z = ~ Air_density_kg_m3, psi = 1.20)
model_aic <- as.data.frame(c(AIC(lm_no_1),AIC(lm_no_2),AIC(lm_no_3),AIC(lm_no_4),AIC(lm_no_5),AIC(lm_no_6),AIC(lm_no_7),AIC(lm_no_8),AIC(lm_no_9),AIC(lm_no_10),AIC(lm_no_11),AIC(lm_no_12)))
model_r_squared <- as.data.frame(c(round(summary(lm_no_1)$r.squared,3),round(summary(lm_no_2)$r.squared,3),
round(summary(lm_no_3)$r.squared,3),round(summary(lm_no_4)$r.squared,3),
round(summary(lm_no_5)$r.squared,3),round(summary(lm_no_6)$r.squared,3),
round(summary(lm_no_7)$r.squared,3),round(summary(lm_no_8)$r.squared,3),
round(summary(lm_no_9)$r.squared,3),round(summary(lm_no_10)$r.squared,3),round(summary(lm_no_11)$r.squared,3),round(cor(lm_no_12$fitted, lm_no_12$y)^2, 3)))
model_compared <- as.data.frame(c("Speed ~ Windforce",
"Speed ~ polynomial(Windforce)",
"Speed ~ Air density",
"Speed ~ segmented(Air density)",
"Speed ~ polynomial(Windforce) + segmented(Air density)",
"Speed ~ polynomial(Windforce) + segmented(Air density) + interaction",
"Speed ~ Distance",
"Speed ~ Year",
"Speed ~ polynomial(Windforce) + segmented(Air density) + Distance",
"Speed ~ polynomial(Windforce) + segmented(Air density) + Year",
"Speed ~ polynomial(Windforce) + segmented(Air density) + Distance + Year",
"Speed ~ polynomial(Windforce) + segmented(Air density) + interaction(Distance + Year)"))
model_rss <- as.data.frame(c(sum(resid(lm_no_1)^2),
sum(resid(lm_no_2)^2),
sum(resid(lm_no_3)^2),
sum(resid(lm_no_4)^2),
sum(resid(lm_no_5)^2),
sum(resid(lm_no_6)^2),
sum(resid(lm_no_7)^2),
sum(resid(lm_no_8)^2),
sum(resid(lm_no_9)^2),
sum(resid(lm_no_10)^2),
sum(resid(lm_no_11)^2),
sum(resid(lm_no_12)^2)))
model_p <- as.data.frame(c(anova(lm_no_1)$'Pr(>F)'[1],anova(lm_no_2)$'Pr(>F)'[1],anova(lm_no_3)$'Pr(>F)'[1],anova(lm_no_4)$'Pr(>F)'[1],anova(lm_no_5)$'Pr(>F)'[1],anova(lm_no_6)$'Pr(>F)'[1],anova(lm_no_7)$'Pr(>F)'[1],anova(lm_no_8)$'Pr(>F)'[1],anova(lm_no_9)$'Pr(>F)'[1],anova(lm_no_10)$'Pr(>F)'[1],anova(lm_no_11)$'Pr(>F)'[1],anova(lm_no_12, glm.0, test="F")$'Pr(>F)'[2]
))
models_compared <- cbind(model_compared, model_aic, model_rss, model_r_squared, model_p)
names(models_compared) <- c("model", "AIC", "RSS", "R²", "significance")
models_compared <- models_compared %>% mutate(significance = get_p_value_class(significance))
models_compared
## model AIC RSS R² significance
## 1 Speed ~ Windforce 7978.051 9384.569 0.059 p < .001
## 2 Speed ~ polynomial(Windforce) 7958.462 9270.737 0.071 p < .001
## 3 Speed ~ Air density 7994.251 9470.909 0.051 p < .001
## 4 Speed ~ segmented(Air density) 7952.740 9230.356 0.075 p < .001
## 5 Speed ~ polynomial(Windforce) + segmented(Air density) 7815.463 8521.866 0.146 p < .001
## 6 Speed ~ polynomial(Windforce) + segmented(Air density) + interaction 7817.430 8512.076 0.147 p < .001
## 7 Speed ~ Distance 8023.638 9629.553 0.035 p < .001
## 8 Speed ~ Year 7450.890 6966.182 0.302 p < .001
## 9 Speed ~ polynomial(Windforce) + segmented(Air density) + Distance 7702.614 7986.175 0.199 p < .001
## 10 Speed ~ polynomial(Windforce) + segmented(Air density) + Year 7097.051 5671.154 0.431 p < .001
## 11 Speed ~ polynomial(Windforce) + segmented(Air density) + Distance + Year 7037.266 5476.499 0.451 p < .001
## 12 Speed ~ polynomial(Windforce) + segmented(Air density) + interaction(Distance + Year) 7035.652 5465.324 0.452 p < .001
From the above list it appears that the best single
predictor of bicycle trip speed is the passage of time (AIC 7450.9),
followed by the two weather variables segmented air density (AIC 7952.7)
and polynomial wind force (AIC 7958.5). Distance is the least important
(AIC 8023.6). In terms of R² of the simple regressions: the passing of
the years, in other words, growing older, explains 30% of the
variability of trip speed, the weather variables 7.1 - 7.5%, distance
only 3%.
The intercept b0 is the starting point of the regression
line, the slope b1 is the predicted change in speed with one unit change
in the year. The age effect on speed according to the simple regression
model is as follows (model no. 8 on the list above). Given an intercept
of 699.57 and a slope of -0.3318, the predicted average speed in the
starting year 2008 is 33.4 km per hour, and in 2024 28.1 km per hour,
which is a decrease of 5.3 km per hour over a period of 17 years of
cycling.
Multiple regression shows that both weather variables
together explain 14.6% of the variability of the dependent variable
(model no. 5 on the list above). The best model contains all four
predictors, including distance (model no. 11, percentage explained
variance 45%). The model with distance is significantly better than a
model without, given the p-value less than 0.05 (see output below where
model no. 10 and 11 are compared).
p <- get_p_value_class(anova(lm_no_10, lm_no_11, test="F")$"Pr(>F)"[2])
p
## [1] p < .001
## Levels: p < .001 p < .01 p < .05 n.s.
Although we assume that we have made longer trips more often in
recent years, an interaction term year - distance does not provide a
better model in terms of explaining average speed. Specifying the main
effects is sufficient.
p <- get_p_value_class(anova(lm_no_11, lm_no_12, test="F")$"Pr(>F)"[2])
p
## [1] n.s.
## Levels: p < .001 p < .01 p < .05 n.s.
Adding an interaction effect of both weather variables doesn’t make
the model significantly better, given the result in the output below
where model no. 5 and 6 are compared.26
p <- get_p_value_class(anova(lm_no_5, lm_no_6, test="F")$"Pr(>F)"[2])
p
## [1] n.s.
## Levels: p < .001 p < .01 p < .05 n.s.
Below are the summary printouts of the chosen model. It is
worth mentioning that the air density breakpoint in the multiple model
(1.198) is almost equal to the value found in the simple regression
model (1.196).
The explained proportion of variance is very low (R²
= 0.451), making the model extremely unqualified for prediction.
However, we only use the multiple model to measure the effect of
covariates. The conclusion is that the relationships found at bivariate
level between the weather variables and trip speed do not disappear
after controlling for trip length and a time variable. An interaction
effect of wind - air density cannot be statistically proven.
lm_basic <- lm(Speed_kmh ~ year(Date_cycle_activity) + poly(as.numeric(Windforce_Bft), 2) + Distance_km + Air_density_kg_m3, data = regional_cycle_activities)
lm_segmented <- segmented(lm_basic, seg.Z = ~ Air_density_kg_m3, psi = 1.20)
summary(lm_segmented)
##
## ***Regression Model with Segmented Relationship(s)***
##
## Call:
## segmented.lm(obj = lm_basic, seg.Z = ~Air_density_kg_m3, psi = 1.2)
##
## Estimated Break-Point(s):
## Est. St.Err
## psi1.Air_density_kg_m3 1.198 0.003
##
## Coefficients of the linear terms:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 617.652074 26.662692 23.165 < 0.0000000000000002 ***
## year(Date_cycle_activity) -0.310077 0.010916 -28.407 < 0.0000000000000002 ***
## poly(as.numeric(Windforce_Bft), 2)1 -23.981342 1.782149 -13.456 < 0.0000000000000002 ***
## poly(as.numeric(Windforce_Bft), 2)2 -8.334884 1.767477 -4.716 0.00000259851877523 ***
## Distance_km -0.013933 0.001762 -7.910 0.00000000000000451 ***
## Air_density_kg_m3 33.688025 11.662768 2.889 0.00392 **
## U1.Air_density_kg_m3 -58.263762 11.774643 -4.948 NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.763 on 1761 degrees of freedom
## Multiple R-Squared: 0.451, Adjusted R-squared: 0.4488
##
## Boot restarting based on 10 samples. Last fit:
## Convergence attained in 2 iterations (rel. change -0.00000051467)
davies.test(lm_basic, seg.Z = ~ Air_density_kg_m3, k=15)
##
## Davies' test for a change in the slope
##
## data: formula = Speed_kmh ~ year(Date_cycle_activity) + poly(as.numeric(Windforce_Bft), 2) + Distance_km + Air_density_kg_m3 , method = lm
## model = gaussian , link = identity
## segmented variable = Air_density_kg_m3
## 'best' at = 1.2011, n.points = 15, p-value = 0.00000001118
## alternative hypothesis: two.sided
intercept(lm_segmented)
## $Air_density_kg_m3
## Est.
## intercept1 617.65
## intercept2 687.44
slope(lm_segmented)
## $Air_density_kg_m3
## Est. St.Err. t value CI(95%).l CI(95%).u
## slope1 33.688 11.6630 2.8885 10.814 56.562
## slope2 -24.576 1.6796 -14.6320 -27.870 -21.281
Written as an equation:
if Air density
< 1.198 : Speed = 617.65 + -0.3100773 * Year +
-23.9813416 * Windforce
+ -8.3348836 * Windforce\(^2\) + -0.0139333 * Distance + 33.688 * Air
density
if Air density
= 1.198 : Speed = 617.65 + -0.3100773 * Year +
-23.9813416 * Windforce
+ -8.3348836 * Windforce\(^2\) + -0.0139333 * Distance + 33.688 *
1.198
= 687.44 + -0.3100773 * Year + -23.9813416 *
Windforce
+ -8.3348836 * Windforce\(^2\) + -0.0139333 * Distance + -24.576 *
1.198
if Air density
> 1.198 : Speed = 687.44 + -0.3100773 * Year +
-23.9813416 * Windforce
+ -8.3348836 * Windforce\(^2\) + -0.0139333 * Distance + -24.576 *
Air density
\(R^2\) = 0.451
We conclude with a section on residual and leverage analysis. The
residual for each observation is the difference between the predicted
and observed values. A high-leverage point has extreme values for one or
more predictors, or unusual combinations of predictor values.
Residuals may correlate with (an)other variable(s). This is usually
identified visually using graphs such as the quartet below. The residual
versus fitted plot can reveal non-linearity because of inequality of the
variances of the error terms. For instance a parabola indicates
non-linear relationship. Here, we don’t see any distinctive pattern. But
given the low R², we know that there are important uncaptured variables
left out of the model.
The plot that is in the right upper corner,
is the normal probability plot of residuals. It is an indication to see
if the model residuals follow a normal distribution. Do residuals follow
a straight line well or do they deviate severely? The Scale-Location
plot shows if residuals are spread equally along the ranges of
predictors. It’s good if you see a horizontal line with equally
(randomly) spread points. The leverage idea picks out the potential for
a specific value to be influential. Leverage refers to the extent to
which the coefficients in the regression model would change if a
particular observation was removed from the dataset. Observations with
high leverage have a strong influence on the coefficients in the
regression model. Leverage values range between 0 (no influence) and 1
(strong influence). There doesn’t seem to be any serious leverage
involved.
library(ggfortify)
autoplot(lm_segmented)
Figure 6a. Residual analysis multiple regression model List of figures
Cook’s distance is useful for identifying outliers in the predictor
variables. It also shows the influence of each observation on the fitted
response values.27
library(olsrr)
ols_plot_cooksd_chart(lm_segmented, threshold = 0.01)
Figure 6b. Cook’s distance chart multiple regression model List of figures
Following a suggestion on Stack Overflow, we will merge the residuals/leverage graph from Figure 6a with Cook’s distance chart (Figure 6b) by adding Cook’s distance levels to the former.28
model_residual_data <- regional_cycle_activities %>%
mutate(Speed_predicted = predict.segmented(lm_segmented)) %>%
mutate(cooks_distance = cooks.distance(lm_segmented),
leverage = hatvalues(lm_segmented),
residuals_standard = rstandard(lm_segmented)) %>%
mutate(influence = ifelse(cooks_distance >= 0.01 | leverage >= 0.04 | abs(residuals_standard) >= 3.0, TRUE, FALSE)) %>%
mutate(rownumber = row_number())
influential_observations <- model_residual_data %>% filter(influence) %>%
mutate(observation = ifelse(cooks_distance >= 0.01 & abs(residuals_standard) >= 3.0, "cook's + outlier",
ifelse(cooks_distance >= 0.01, "cook's",
ifelse(leverage >= 0.04, "leverage",
ifelse(abs(residuals_standard) >= 3.0, "outlier", NA))))) %>%
dplyr::select(rownumber,Date_cycle_activity,observation,Distance_km,Speed_kmh,Speed_predicted,Temperature_C,Windforce_Bft,Air_density_kg_m3,cooks_distance,leverage,residuals_standard) %>%
mutate(Distance_km = round(Distance_km,0)) %>%
arrange(observation,Date_cycle_activity)
cooksd_contour_pos <- function(leverage, level, model) {sqrt(level*length(coef(model))*(1-leverage)/leverage)}
cooksd_contour_neg <- function(leverage, level, model) {-cooksd_contour_pos(leverage, level, model)}
obscolors = c("outlier" = "red", "leverage" = "blue", "cook's" = "orange", "cook's + outlier" = "#CC0000" )
ggplot(model_residual_data, aes(x = leverage, y = residuals_standard)) +
geom_hline(yintercept = 3, linetype = 'dotted', color= "red", linewidth = 1) +
geom_hline(yintercept = -3, linetype = 'dotted', color= "red", linewidth = 1) +
geom_vline(xintercept = 0.04, linetype = 'dotted', color= "blue", linewidth = 1) +
geom_point(alpha = 0.2) +
geom_point(data = influential_observations, aes(x = leverage, y = residuals_standard, colour = observation)) +
geom_smooth(method = "loess", se = FALSE, span = 0.75, color = "black") +
stat_function(fun = cooksd_contour_pos, args = list(level = 0.01, model = lm_segmented), xlim = c(0, 0.05), lty = 2, colour = "orange") +
stat_function(fun = cooksd_contour_neg, args = list(level = 0.01, model = lm_segmented), xlim = c(0, 0.05), lty = 2, colour = "orange") +
ggrepel::geom_text_repel(data = influential_observations, aes(label = rownumber, colour = observation), cex=4, hjust=1.1, fontface = 'italic', show.legend = FALSE) +
scale_y_continuous("standardized residual",limits = c(-4, 4)) +
scale_color_manual(values=obscolors) +
annotate("text", x = 0.047, y = 4.0, label = "leverage threshold: 0.04", size=3.5, color = "blue") +
annotate("text", x = 0.049, y = 1.45, label = "cook's d threshold: 0.01", size=3.5, color = "orange") +
annotate("text", x = 0.020, y = -3.2, label = "abs. outlier threshold: 3.0", size=3.5, color = "red") +
labs(title = "Influential rides and outliers in the regression model",
subtitle = paste0("in a group of ", nrow(regional_cycle_activities), " rides in the northern part of the Netherlands"),
caption = paste0("All my solo rides 2008 - ", year(today()), ". Reference date: ", today()),
colour='observation')
Figure 6c. Influential rides and outliers in multiple regression model List of figures
Finally, a list of all 21 rides that exceed the three
defined thresholds. High leverage appears to be absent in the database -
the threshold value has been deliberately set very low at 0.06. On the
other hand, there are quite a few outliers. As an example, we have
traced one striking outlier, ride number 1744: January 27, 2024, average
speed 32.8 km/h, distance 138 km, wind force 4, air density 1.29 and a
temperature of 5 degrees Celsius. It turns out to be a one-way trip from
Harlingen to Groningen with the wind at our backs. (… and ride numbers
1385 and 1387 are a one-way journey in two stages with a headwind).
library(kableExtra)
library(knitr)
knitr::kable(influential_observations, format = "html", escape = FALSE, caption = "Listing influential rides in regression model",
col.names = c("Ride", "Date", "Influence", "Distance (km)", "Speed (km/h)", "Predicted speed", "Temp.(°C)", "Windforce (Bft)", "Air density (kg/m³)", "Cook's D", "leverage", "stand.residual")) %>%
row_spec(1:nrow(influential_observations),font_size=10) %>%
row_spec(0,bold=TRUE) %>%
row_spec(which(influential_observations$observation == "outlier"),italic=FALSE, bold=FALSE, color = 'red') %>%
row_spec(which(influential_observations$observation == "leverage"),italic=FALSE, bold=FALSE, color = 'blue') %>%
row_spec(which(influential_observations$observation == "cook's"),italic=FALSE, bold=FALSE, color = 'orange') %>%
row_spec(which(influential_observations$observation == "cook's + outlier"),italic=FALSE, bold=TRUE, color = '#CC0000') %>%
kableExtra::kable_styling(full_width = TRUE, font_size = 10) %>%
column_spec(2:3, width = "11em")
| Ride | Date | Influence | Distance (km) | Speed (km/h) | Predicted speed | Temp.(°C) | Windforce (Bft) | Air density (kg/m³) | Cook’s D | leverage | stand.residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 708 | 2015-07-05 | cook’s | 53 | 36.13636 | 31.65783 | 21.5 | 5 | 1.197559 | 0.0101980886086 | 0.0123395 | 2.555406676 |
| 913 | 2016-07-20 | cook’s | 62 | 34.12844 | 30.89153 | 28.8 | 4 | 1.165747 | 0.0106329767880 | 0.0240483 | 1.857992850 |
| 1293 | 2019-02-07 | cook’s | 94 | 23.45000 | 26.34377 | 7.5 | 7 | 1.245624 | 0.0104027848173 | 0.0291324 | -1.665373749 |
| 1453 | 2020-02-09 | cook’s | 74 | 23.52632 | 27.11390 | 10.4 | 7 | 1.212612 | 0.0162494363712 | 0.0295793 | -2.065144296 |
| 1765 | 2024-08-06 | cook’s | 45 | 34.32308 | 29.61288 | 25.9 | 2 | 1.175575 | 0.0163882035589 | 0.0177316 | 2.694957904 |
| 1385 | 2019-08-07 | cook’s + outlier | 50 | 24.79339 | 30.24329 | 20.9 | 5 | 1.191146 | 0.0104030879086 | 0.0085654 | -3.103738934 |
| 1387 | 2019-08-07 | cook’s + outlier | 26 | 24.00000 | 30.57769 | 20.9 | 5 | 1.191146 | 0.0192594896507 | 0.0108360 | -3.750314986 |
| 1744 | 2024-01-27 | cook’s + outlier | 138 | 32.89048 | 26.05431 | 5.1 | 4 | 1.294321 | 0.0168807909593 | 0.0088287 | 3.893735765 |
| 1203 | 2018-07-26 | leverage | 111 | 28.95652 | 29.62366 | 33.1 | 3 | 1.153235 | 0.0010126377113 | 0.0509803 | -0.388336870 |
| 1204 | 2018-07-27 | leverage | 105 | 28.89908 | 29.19931 | 32.5 | 4 | 1.151709 | 0.0002264925515 | 0.0557395 | -0.175200878 |
| 1209 | 2018-08-07 | leverage | 112 | 30.13453 | 29.62382 | 31.0 | 3 | 1.153653 | 0.0005783712916 | 0.0498098 | 0.297095695 |
| 1369 | 2019-07-25 | leverage | 120 | 27.48092 | 28.96770 | 34.6 | 3 | 1.146690 | 0.0073531004604 | 0.0713671 | -0.874889858 |
| 1370 | 2019-07-26 | leverage | 131 | 27.01031 | 28.49778 | 32.0 | 4 | 1.150842 | 0.0059044276226 | 0.0588120 | -0.869439281 |
| 1371 | 2019-07-27 | leverage | 80 | 29.62963 | 28.66047 | 28.7 | 5 | 1.156569 | 0.0018218041246 | 0.0440933 | 0.562105102 |
| 1674 | 2022-07-19 | leverage | 100 | 28.62571 | 28.61213 | 32.7 | 3 | 1.155555 | 0.0000003665128 | 0.0450673 | 0.007882155 |
| 12 | 2009-02-08 | outlier | 77 | 26.40000 | 32.22723 | 2.3 | 4 | 1.267063 | 0.0046042305724 | 0.0033508 | -3.309934883 |
| 1351 | 2019-05-28 | outlier | 51 | 24.87805 | 30.23137 | 12.1 | 4 | 1.236845 | 0.0031238992011 | 0.0026974 | -3.039749584 |
| 1580 | 2021-06-12 | outlier | 52 | 34.82697 | 29.00831 | 16.0 | 5 | 1.230857 | 0.0052389704840 | 0.0038204 | 3.305841746 |
| 1591 | 2021-07-22 | outlier | 31 | 25.45135 | 30.79288 | 18.9 | 3 | 1.218457 | 0.0071432051875 | 0.0061523 | -3.038320732 |
| 1601 | 2021-09-16 | outlier | 45 | 36.39200 | 30.06308 | 16.4 | 4 | 1.221582 | 0.0070816197702 | 0.0043602 | 3.596720821 |
| 1604 | 2021-09-21 | outlier | 46 | 35.97632 | 29.58680 | 15.6 | 4 | 1.240916 | 0.0064404933780 | 0.0038943 | 3.630305271 |
The influence of the weather variables wind force
and air density on the average ride-cycling speed is
investigated using two files. The first is a personal file with
data regarding all solo cycling trips in the period 2008 to the present
in my home region, which is the flat northern provinces of the
Netherlands. Located at sea level. Date of activity, distance and
average speed are selected for each trip. The rides are very diverse in
nature, ranging from time trial to interval training including KOM
hunting, as well as recovery and less motivated efforts. The distance is
between 15 and 165 km, with a median distance of 80. Median speed 31.03
km/h. The second source is the hourly weather data from the
Royal Netherlands Meteorological Institute (KNMI), Eelde station - which
is located about eight kilometers from my home town, the city of
Groningen. The afternoon weather data is averaged and then merged by
date with the cycling data file, resulting in a working database of 1769
rides.
Two quantities to measure wind force are
Beaufort and meters per second. Meters per second is an interval scale,
Beaufort does not have a constant pace of change. There is a significant
difference in median ride speed between Beaufort wind force groups. The
trip median speed at wind force 7 (less than 28 km/h) has decreased by 4
km/hour compared to reference wind force 2 (32 km/h), which is a
decrease of more than 13%. The speed reduction per Beaufort increases
progressively (see left summary graph below). The speed and wind force
relationship in Beaufort is non-linear and requires a polynomial
component. The non-linear model is significantly better. However, if we
represent the wind force in meters per second, a linear model is
sufficient. An extra component doesn’t add anything significant in this
case.
The connection between air density and ride speed in
my data can best be described with a segmented regression model where
the relationship is represented by two lines with a breakpoint in
between. Up to the breaking point we observe: the lower the air density,
the higher the ride average speed. The significant breaking point has
been estimated at an air density of 1.196 (see right summary graph
above). After this breaking point, the speed drops, despite the lower
air density. The drop is caused by a combination of both high
temperature and high absolute humidity, resulting in fewer oxygen
molecules per m³ of air, a stuffy feeling and a higher heart rate.
Probably every cyclist has their own favorite air density. Mine is
around 1.2.
The red line in the summary graph represents
the actual effect of air density according to the drag
equation. Assuming a reference speed of 29.369 km per hour at
an air density of 1.31 kg per m³, and we reduce this density in steps of
10 grams, each step leads to a time gain of 8.5 seconds per hour, other
things being equal. A density of 1.15 - a difference of 160 grams per m³
compared to 1.31 - equates to 30.519 km per hour and a time saving of
02:16 compared to the reference. In our own cycling data, which in no
way meet the ceteris paribus clause, a maximum average time saving of
04:56 per hour has been predicted at an air density of 1.196, after
which the speed will decrease.
The wind force - speed
and air density - speed relationships found remain unchanged in
a multivariate analysis with distance and year as control variables. Our
data reveals no interaction effect of both weather variables of average
ride speed.
The regression models presented have no predictive
value whatsoever, given the recurring low R². Instead, we regard them as
a good description of the direction and shape of the relationships.
Figure 7a-l. Daytime weather data, Eelde, NL, average 2008-2024 List of figures
Figure 1. Density plot speed and wind force
Figure 2. How to read a boxplot
Figure 3. Boxplot Beaufort wind force
Figure 4. Non-linear relation average ride speed and Beaufort wind
force
Figure 5a. Segmented relation speed and
air density
Figure 5b. 3d plot speed, air
density and humidity
Figure 5c. Relationship
trip cycle speed and air density: segmented versus drag equation
Figure 6a. Residual analysis multiple regression
model
Figure 6b. Cook’s distance chart multiple
regression model
Figure 6c. Influential rides
and outliers in multiple regression model
Figure 7a-l. Daytime weather data, Eelde, NL, average
2008-2024
Scientific publications on cycling aerodynamics
Bert Blocken, Thijs van Druenen, Yasin Toparlar, Thomas Andrianne
“CFD analysis of an exceptional cyclist sprint position”. Sports
Engineering. 22, 2019
https://www.researchgate.net/publication/331366310_CFD_analysis_of_an_exceptional_cyclist_sprint_position
Fabio Malizia, Bert Blocken “Cyclist aerodynamics through time:
Better, faster, stronger”. Journal of Wind Engineering and Industrial
Aerodynamics, Volume 214, 2021
https://www.sciencedirect.com/science/article/pii/S0167610521001574
Dwyer, D. B. “The effect of environmental conditions on performance
in timed cycling events”. Journal of Science and Cycling, 3(3), 17-22,
2014
https://www.jsc-journal.com/index.php/JSC/article/view/152
Footnotes
Geographer, data-analist and cyclist↩︎
The wind force according to Beaufort is determined from the average of the wind speed over 10 minutes. KNMI offers ‘FF’: Wind speed (in 0.1 m/s) averaged over the last 10 minutes of the past hour. We use “FH”: the average over 1 hour is decisive.↩︎
Strava reports wind force in km/h instead of Beaufort for every ride. That in itself is of course more accurate.↩︎
The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA)↩︎
How to calculate air pressure at high altitude: https://www.omnicalculator.com/physics/air-pressure-at-altitude↩︎
Source: https://iflycoast.com/understanding-air-density-and-its-effects/↩︎
It is good to realize that, depending on your condition, a high percentage of the carbohydrates and fats that you burn while cycling are converted into heat that your body must lose.↩︎
Badran, Omar & Abdulhadi, Emad & Mamlook,
Rustom. (1992). Evaluation of parameters affecting wind turbine power
generation.
https://www.researchgate.net/publication/229042504_Evaluation_of_parameters_affecting_wind_turbine_power_generation↩︎
The Magnus effect is the name given to the physical phenomenon that the rotation of objects in a liquid or in air affects their forward motion. If the object is spinning, there is a force perpendicular to the direction of motion↩︎
Bahill, A. Terry & Baldwin, David & Ramberg,
John. (2009). Effects of altitude and atmospheric conditions on the
flight of a baseball. International Journal of Sports Science and
Engineering. 03. 1750-9823
https://www.researchgate.net/publication/228668381_Effects_of_altitude_and_atmospheric_conditions_on_the_flight_of_a_baseball↩︎
Gavin Francis, The best weather conditions for a KOM on Strava: https://science4performance.com/2017/03/01/the-best-weather-conditions-for-a-kom-on-strava/↩︎
Charles Fulco, Paul Rock & A. Cymerman (1998). “Maximal and submaximal exercise performance at altitude”. In: Aviation, space, and environmental medicine, no. 69. pp. 793-801. https://www.researchgate.net/publication/13569370_Maximal_and_submaximal_exercise_performance_at_altitude↩︎
How to calculate air pressure at high altitude: https://www.omnicalculator.com/physics/air-pressure-at-altitude↩︎
see footnote 13, Figure 1, page 794↩︎
Absolute humidity calculator: https://www.omnicalculator.com/physics/absolute-humidity↩︎
Source: https://www.quora.com/What-is-the-drag-coefficient-of-a-cyclist-on-a-road-bike↩︎
For instance “Time Savings at Any Speed when Removing 100g of Drag”. Jon Thornham (2019) https://blog.flocycling.com/aero-wheels/cycling-wheel-aerodynamics-how-speed-time-and-power-are-affected-by-reducing-drag/↩︎
Vito Muggeo: Segmented: An R Package to Fit Regression Models With Broken-Line Relationships https://www.researchgate.net/publication/234092680_Segmented_An_R_Package_to_Fit_Regression_Models_With_Broken-Line_Relationships↩︎
Breakpoint analysis and segmented regression: https://rpubs.com/MarkusLoew/12164↩︎
Vito Muggeo, see footnote 22↩︎
The terms predictor, control, independent variable and covariate are often used interchangeably in regression contexts.↩︎
Appendix Figure 7g shows the correlation between both weather variables.↩︎
Source: https://stackoverflow.com/questions/48962406/add-cooks-distance-levels-to-ggplot2↩︎