Predicting Cycling Speed based on Slope and Terrain Type

Project for GEO 880: Patterns and Trends in Environmental Data – Computational Movement Analysis

Author

Jan Krummenacher & Anna Schweiter

Code

# libraries
library(XML)
library(sf)
library(tidyverse)
library(lubridate)
library(leaflet)
library(RColorBrewer)
library(ggplot2)
library(patchwork)
library(terra)
library(readr)
library(sp)
library(dplyr)
library(geosphere)
library(tmap)
library(viridis)
library(zoo)
library(mgcv)
library(randomForest)
library(future)
library(ranger)

Background and research goals

Cycling speed is influenced by a wide range of factors, including the cyclist’s physical condition as well as environmental conditions such as weather, wind, road surface, topography, and the type of bicycle used (Flügel et al., 2019). It can vary significantly across individuals, regions, road types and even the bicycle itself. Prior studies have looked at factors influencing cycling speed, often in urban contexts. For example, Montoya-Zamora et al. (2024) predicted cyclist speed in urban environments using GPS-data with elevation and temporal attributes to inform urban infrastructure planning. Our study extends this research by including unpaved trails and varied terrain, allowing speed prediction in more diverse and challenging environments. Laube (2017) emphasized the often network-based nature of human movement, which is also relevant for cycling, as it typically occurs along defined paths such as roads or trails. This motivated us to model speed based on attributes of the underlying network characteristics. Based on personal cycling experience, factors such as weather, training companions, slope and road surface were recognized to influence speed. Given the limited scope of this project, we chose to focus specifically on slope and terrain type. As previous studies have shown, cyclists represent a highly heterogeneous group in terms of age, gender, and physical characteristics (Parkin and Rotheram, 2010), which makes it challenging to predict speed in a generalized way. Therefore, we focus our analysis on high-resolution GPS tracks from a single person training for a long endurance mountain bike race. Our project addresses the following research questions:

- How accurately can speed be predicted using a regression model on a predefined bike track?

- How do slope and terrain type influence cycling speed, and which of these factors are more relevant in predicting speed?

To explore these questions, we considered the role of road surface type and the impact of elevation changes. The ability to predict speed accurately based on such factors can provide valuable insights for performance optimization in mountain biking. Understanding how the different factors play a role can additionally help to design better personal training plans for cyclists.

Data and Methods

Data

The data used in this project was collected between March 3, 2025, and May 1, 2025, from various bike routes. To ensure computational feasibility and maintain data consistency, we focused our analysis on two specific geographic regions. The Zürcher Oberland was selected due to the high density of recorded tracks in this area. In contrast, the Albula region was chosen for its greater topographic diversity, with tracks reaching higher elevations and displaying more significant elevation changes. Additionally, initial exploration using the Garmin Connect tool revealed that the Albula region includes a higher proportion of unpaved trails compared to the Zürcher Oberland, where paved surfaces are more common. This variation in terrain and elevation provided valuable input for training and validating our speed prediction model.

Figure 1: Overview of GPS tracks used for model training in the Zürcher Oberland region.

Figure 2: Overview of GPS tracks used for model training in the Albula region.

Code

# read GPX files
read_gpx_folder <- function(folder_path, crs = 2056) {
  gpx_files <- list.files(folder_path, pattern = "^activity_.*\\.gpx$", full.names = TRUE)
  # the HTML parser lets the plain XPath expressions below match
  # without any GPX namespace handling
  activities <- lapply(gpx_files, function(path) htmlTreeParse(path, useInternalNodes = TRUE))
  
  # one data frame per activity: position, timestamp, elevation
  extract_df <- function(doc, id_text) {
    coords <- xpathSApply(doc, "//trkpt", xmlAttrs)
    ele <- xpathSApply(doc, "//trkpt/ele", xmlValue)
    time <- xpathSApply(doc, "//trkpt/time", xmlValue)
    data.frame(
      lat = as.numeric(coords["lat", ]),
      lon = as.numeric(coords["lon", ]),
      ts = ymd_hms(time, tz = "Europe/Zurich"),
      elevation = as.numeric(ele),
      ID_text = id_text
    )
  }
  
  all_data <- Map(extract_df, activities, basename(gpx_files))
  combined_df <- bind_rows(all_data, .id = "ID")
  combined_df$ID <- paste0("act_", combined_df$ID)
  
  # GPX coordinates are WGS84 longitude/latitude (EPSG:4326): declare them
  # as such first, then project to the target CRS (default: LV95, EPSG:2056)
  combined_df %>%
    st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
    st_transform(crs)
}

albula_sf <- read_gpx_folder("data/albula")
duernten_sf <- read_gpx_folder("data/dürnten")

plot_tracks <- function(points_sf, title = "Tracks") {
  # build one line per activity; leaflet expects WGS84, so transform back
  lines_sf <- points_sf %>%
    arrange(ID_text, ts) %>%
    group_by(ID_text) %>%
    summarise(do_union = FALSE) %>%
    st_cast("LINESTRING") %>%
    st_transform(4326)
  ids <- unique(lines_sf$ID_text)
  # brewer.pal() needs at least 3 colours; recycle if there are more than 8 tracks
  colors <- rep(brewer.pal(max(3, min(length(ids), 8)), "Dark2"), length.out = length(ids))
  color_df <- data.frame(ID_text = ids, color = colors)
  lines_sf <- left_join(lines_sf, color_df, by = "ID_text")
  leaflet(lines_sf) %>%
    addProviderTiles("CartoDB.Positron") %>%
    addPolylines(color = ~color, weight = 3, label = ~ID_text) %>%
    addLegend("bottomright", title = title, colors = color_df$color, labels = color_df$ID_text)
}

# create map
plot_tracks(albula_sf, title = "Albula Activities")
plot_tracks(duernten_sf, title = "Zürcher Oberland Activities")

We used GPS data with a temporal resolution of 1 second from activities recorded on a Garmin device. The exported GPX files include the following attributes for each GPS point: geometry (latitude, longitude), timestamp, and elevation. The tracks lasted between one and five hours and covered elevation gains from 200 to 1’400 meters. Following Montoya-Zamora et al. (2024), who used 70% of their data for model training and the remaining 30% for validation, we split our 11 tracks into 8 training tracks (2 in the Albula region and 6 in the Zürcher Oberland, nearly 70’000 observations) and 3 validation tracks (1 in the Albula region and 2 in the Zürcher Oberland, roughly 25’000 observations).

To get data for factors influencing speed, the following sources were used:

- Terrain type: The dataset SWISSTLM3D_CHLV95LN02.gdb provided by Swisstopo (2025) was used, from which the layer TLM_STRASSE was extracted. From this layer, we retained the attributes BELAGSART (road surface classification) and SHAPE (geometry).

- Elevation: A digital terrain model (DTM) with a 5 m spatial resolution was generated from the official swissALTI3D elevation dataset provided by Swisstopo (2025).

Methods

Pre-Processing

Tracks

Since the bike data originates from the Garmin platform (https://connect.garmin.com), and the automatic assignment of activity types was found to be unreliable, each activity had to be manually reviewed to determine whether it represented a bike tour or another type of activity, such as a ski tour, hike, or jogging session. The classification was primarily based on the route locations and observed speed ranges. After selecting the appropriate tracks and exporting them as GPX files, we began by applying all preprocessing steps to a single track. These steps included selecting relevant columns (latitude, longitude, timestamp, and elevation), calculating speed, and adding terrain type and slope information. This initial test helped us refine our methods before scaling up the analysis.

The tracks were not divided into segments prior to running our processing loop to avoid the underestimation of average speeds per segment (Laube and Purves, 2011). Instead, we used a moving window approach to smooth the data and reduce noise. This technique is robust to static points, which do not influence calculations. Speed was computed as the ratio of step length to time lag over the window, resulting in average speed in meters per second. Similarly, slope was estimated from the elevation difference over the same window, providing a measure of terrain steepness. Static points (speed = 0) were excluded before performing regression analysis and training the predictive model. Once the preprocessing workflow was finished, it was applied in a loop to all training tracks. The resulting datasets (excluding the three held back for validation) were then merged into a single data frame used for model training.
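
A minimal sketch of such a centred moving-window calculation is shown below. It is not the exact code used in this project: the helper name `window_metrics`, the half-window of five fixes, the column names and the expression of slope in percent are all assumptions made for illustration.

```r
# Hypothetical helper: centred moving-window speed and slope for one track.
# Assumes projected coordinates in metres (x, y), elevation in metres and
# POSIXct timestamps (ts), with rows sorted by time.
window_metrics <- function(track, half_window = 5) {
  n <- nrow(track)
  # step length from the previous fix to the current one (NA for the first fix)
  step <- c(NA, sqrt(diff(track$x)^2 + diff(track$y)^2))
  speed <- rep(NA_real_, n)
  slope <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    lo <- i - half_window
    hi <- i + half_window
    if (lo < 1 || hi > n) next               # window does not fit at the track ends
    dist <- sum(step[(lo + 1):hi])           # path length inside the window (m)
    dt <- as.numeric(difftime(track$ts[hi], track$ts[lo], units = "secs"))
    dz <- track$elevation[hi] - track$elevation[lo]
    if (dt > 0) speed[i] <- dist / dt        # m/s
    if (dist > 0) slope[i] <- 100 * dz / dist  # percent (assumed unit)
  }
  cbind(track, speed_w = speed, slope_w = slope)
}

# Toy track: 1 s sampling, 4 m of horizontal travel and 0.2 m of climb per second
toy_track <- data.frame(
  x = 2700000 + 4 * (0:20),
  y = 1200000,
  elevation = 500 + 0.2 * (0:20),
  ts = as.POSIXct("2025-03-03 10:00:00", tz = "UTC") + 0:20
)
toy_result <- window_metrics(toy_track)
toy_result[11, c("speed_w", "slope_w")]
```

Because the window sums the step lengths between the first and last fix, a few static fixes inside the window add zero distance but still count towards the elapsed time, so they lower the windowed average instead of producing isolated zero-speed spikes.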

The selection of validation tracks was checked by comparing the distributions of speed and slope in the training and validation data (Figure 3).

Code

# Define shared axis limits
speed_limits <- range(c(training_df$speed_10, validation_df$speed_10), na.rm = TRUE)

# Histograms
p1 <- ggplot(training_df, aes(x = speed_10)) +
  geom_histogram(binwidth = 1, fill = "#2c7fb8", color = "white") +
  coord_cartesian(xlim = speed_limits) +
  labs(title = "Training: Speed Distribution", x = "Speed (m/s)", y = "Count") +
  theme_minimal()

p2 <- ggplot(validation_df, aes(x = speed_10)) +
  geom_histogram(binwidth = 1, fill = "#2c7fb8", color = "white") +
  coord_cartesian(xlim = speed_limits) +
  labs(title = "Validation: Speed Distribution", x = "Speed (m/s)", y = "Count") +
  theme_minimal()

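# Shared slope axis and bin width so that both panels are directly comparable
slope_limits <- range(c(training_df$slope_10, validation_df$slope_10), na.rm = TRUE)

p3 <- ggplot(training_df, aes(x = slope_10)) +
  geom_histogram(binwidth = 2, fill = "#2c7fb8", color = "white") +
  coord_cartesian(xlim = slope_limits) +
  labs(title = "Training: Slope Distribution", x = "Slope (m/m)", y = "Count") +
  theme_minimal()

p4 <- ggplot(validation_df, aes(x = slope_10)) +
  geom_histogram(binwidth = 2, fill = "#2c7fb8", color = "white") +
  coord_cartesian(xlim = slope_limits) +
  labs(title = "Validation: Slope Distribution", x = "Slope (m/m)", y = "Count") +
  theme_minimal()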

# Combine into 2x2 layout
(p1 + p2) / (p3 + p4)

Figure 3: Comparison of speed and slope distribution in training and validation tracks.

Terrain type

The analysis of road types was done using the SWISSTLM3D_CHLV95LN02.gdb dataset, which provides detailed information on the Swiss road network. For each GPS point of the tracks, the nearest feature in the road surface layer was identified. The corresponding terrain type was assigned to the respective GPS point and a categorical label for the terrain type was created.
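
The nearest-feature assignment described above can be sketched with `sf::st_nearest_feature()`. The geometries below are toy data standing in for the TLM_STRASSE layer and the GPS points, and the label "Hart" for the paved class is an assumption; only the attribute name BELAGSART is taken from the dataset described in this report.

```r
library(sf)

# Two toy road segments in LV95 (EPSG:2056) standing in for TLM_STRASSE
roads <- st_sf(
  BELAGSART = c("Hart", "Natur"),   # "Hart" is an assumed label for paved roads
  geometry = st_sfc(
    st_linestring(rbind(c(2700000, 1200000), c(2700100, 1200000))),
    st_linestring(rbind(c(2700000, 1200050), c(2700100, 1200050))),
    crs = 2056
  )
)

# Two toy GPS fixes, already projected to the same CRS as the roads
points <- st_sf(
  id = 1:2,
  geometry = st_sfc(
    st_point(c(2700050, 1200005)),
    st_point(c(2700050, 1200040)),
    crs = 2056
  )
)

# Index of the nearest road feature for every point
nearest <- st_nearest_feature(points, roads)

# Transfer the surface class and keep the snapping distance for quality checks
points$BELAGSART <- roads$BELAGSART[nearest]
points$snap_dist_m <- as.numeric(
  st_distance(points, roads[nearest, ], by_element = TRUE)
)
points
```

Keeping the snapping distance is a design choice of this sketch: it would make it possible to flag fixes that lie far from any mapped road, for example on small trails that are missing from the road layer.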

Elevation

Garmin reports a barometric altitude accuracy of ±15 meters and a GPS-based elevation accuracy of ±120 meters (Garmin 2025). We therefore obtained more accurate elevation data from the swissALTI3D model (Swisstopo 2025). Because the original dataset was too large to process efficiently, multiple raster files were first combined into a seamless digital terrain model (DTM), which was then resampled to a spatial resolution of 5 m. We acknowledge that this reduction in resolution results in a loss of detail, but it was necessary to keep the processing computationally feasible. The elevation values were extracted from the resampled DTM for each GPS point and added as a new variable, which was used in all further analyses.
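
The merge, resample and extract steps can be sketched with the terra package as follows. The two small constant-valued tiles are synthetic stand-ins for swissALTI3D tiles, and the 1 m input resolution and the aggregation factor of 5 are assumptions chosen only so that the result has 5 m cells.

```r
library(terra)

# Two adjacent synthetic tiles (1 m cells) standing in for swissALTI3D tiles
tile_a <- rast(nrows = 10, ncols = 10,
               xmin = 2700000, xmax = 2700010,
               ymin = 1200000, ymax = 1200010,
               crs = "EPSG:2056", vals = 100)
tile_b <- rast(nrows = 10, ncols = 10,
               xmin = 2700010, xmax = 2700020,
               ymin = 1200000, ymax = 1200010,
               crs = "EPSG:2056", vals = 200)

# 1) combine the tiles into one seamless DTM
dtm <- merge(tile_a, tile_b)

# 2) coarsen to 5 m cells by averaging blocks of 5 x 5 input cells
dtm_5m <- aggregate(dtm, fact = 5, fun = "mean")
names(dtm_5m) <- "elevation"

# 3) look up the DTM elevation at the GPS point locations (x/y matrix)
xy <- cbind(x = c(2700002.5, 2700017.5), y = c(1200002.5, 1200007.5))
dtm_elevation <- extract(dtm_5m, xy)[["elevation"]]
dtm_elevation
```

Averaging during aggregation smooths small terrain features, which is the loss of detail mentioned above.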

Regression

Linear

To assess the relationship between slope, terrain type and cycling speed, linear regression models were fitted: first with slope as the sole predictor, then with terrain type as the sole predictor, and finally with both predictors and their interaction.

Random forest

To predict cycling speed, a Random Forest model was used. The model was trained several times with different inputs: slope alone, slope combined with terrain type, and slope combined with the point coordinates. The speeds predicted by each model were then compared to the speeds measured on the three separate validation tracks. This comparison serves to evaluate the accuracy of the model’s predictions.
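
One possible way to quantify this comparison is sketched below. The helper `eval_predictions` and the choice of metrics (RMSE, MAE, mean bias and R²) are assumptions for illustration and were not necessarily computed in this project; the numbers are toy values.

```r
# Hypothetical helper: summary metrics for predicted versus measured speed.
eval_predictions <- function(measured, predicted) {
  err <- predicted - measured
  c(
    rmse = sqrt(mean(err^2)),        # typical size of the error (m/s)
    mae  = mean(abs(err)),           # mean absolute error (m/s)
    bias = mean(err),                # > 0: speed is overestimated on average
    r2   = 1 - sum(err^2) / sum((measured - mean(measured))^2)
  )
}

# Toy example with four validation points
measured  <- c(2, 4, 6, 8)
predicted <- c(3, 4, 5, 8)
metrics <- eval_predictions(measured, predicted)
round(metrics, 3)
```

Unlike the out-of-bag statistics printed with a fitted Random Forest, these metrics are computed only on tracks the model has never seen.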

Results

Linear Regression

The regression results show a statistically significant negative relationship between slope and speed (β = -0.34, p < 0.001). This indicates that for each 1-unit increase in slope (in %), speed decreases on average by approximately 0.34 m/s. The intercept of the model is 3.81 m/s, which can be interpreted as the expected speed when the slope is 0% (i.e., flat terrain). Although the effect is statistically significant, the model explains only a small portion of the variance in speed (R² = 0.15), suggesting that slope alone is not sufficient to predict cycling speed accurately. The residual standard error is 2.28 m/s, reflecting considerable variability not captured by the model.
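
As a small worked example, the fitted slope-only line can be evaluated by hand. The coefficients are copied from the model summary in this report; the helper name `predict_speed_linear` and the chosen slope values are illustrative assumptions.

```r
# Coefficients of the slope-only linear model (from the summary output)
intercept  <- 3.814568
beta_slope <- -0.339867

# Hypothetical helper: speed predicted by the fitted straight line
predict_speed_linear <- function(slope) intercept + beta_slope * slope

# Predicted speed (m/s) for a descent, flat terrain and a climb
linear_pred <- predict_speed_linear(c(-5, 0, 5))
round(linear_pred, 2)
# 5.51 3.81 2.12

# Slope value at which the straight line reaches a speed of zero
zero_speed_slope <- -intercept / beta_slope
round(zero_speed_slope, 2)
# 11.22
```

Beyond a slope value of about 11.2 the straight line predicts negative speeds, which is physically impossible and is one indication that a purely linear relationship is too simple.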

In combination with terrain type, slope remains a significant predictor of cycling speed (β = -0.38, p < 0.001). Terrain type is also significant: compared to the paved reference category, speed is on average 1.97 m/s lower on natural surfaces (Natur) and 3.81 m/s lower in the k_W category (both p < 0.001). The interaction between slope and natural surface is statistically significant (β = 0.07, p < 0.001), suggesting that the effect of slope on speed is slightly weaker on natural surfaces, whereas the interaction between slope and the k_W category is not (p = 0.141). Overall, the model explains about twice as much variance as the slope-only model (R² = 0.320 vs. 0.153).

Only slope

Code

# Linear Regression: only slope
linear_modell_slope <- lm(speed_10 ~ slope_10, data = training_df)
summary(linear_modell_slope)


Call:
lm(formula = speed_10 ~ slope_10, data = training_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.5316  -1.6377  -0.0911   1.4867  21.3375 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.814568   0.008720   437.4   <2e-16 ***
slope_10    -0.339867   0.003064  -110.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.278 on 68258 degrees of freedom
Multiple R-squared:  0.1527,    Adjusted R-squared:  0.1527 
F-statistic: 1.23e+04 on 1 and 68258 DF,  p-value: < 2.2e-16

Only terrain type

Code

# Linear Regression: only terrain type
linear_modell_belag <- lm(speed_10 ~ BELAGSART_LABEL, data = training_df)
summary(linear_modell_belag)


Call:
lm(formula = speed_10 ~ BELAGSART_LABEL, data = training_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7431 -1.5700 -0.3883  1.3662 16.1128 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.74316    0.01182  401.43   <2e-16 ***
BELAGSART_LABELk_W   -3.76730    0.10695  -35.22   <2e-16 ***
BELAGSART_LABELNatur -1.97760    0.01741 -113.58   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.26 on 68257 degrees of freedom
Multiple R-squared:  0.1663,    Adjusted R-squared:  0.1663 
F-statistic:  6808 on 2 and 68257 DF,  p-value: < 2.2e-16

Combination slope and terrain type

Code

# Linear Regression: combination of slope and terrain type
linear_modell_slope_belag <- lm(speed_10 ~ slope_10 * BELAGSART_LABEL, data = training_df)
summary(linear_modell_slope_belag)


Call:
lm(formula = speed_10 ~ slope_10 * BELAGSART_LABEL, data = training_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.4293  -1.2854  -0.1909   1.1366  21.0977 

Coefficients:
                               Estimate Std. Error  t value Pr(>|t|)    
(Intercept)                    4.741634   0.010670  444.378   <2e-16 ***
slope_10                      -0.376446   0.003913  -96.215   <2e-16 ***
BELAGSART_LABELk_W            -3.806384   0.096935  -39.267   <2e-16 ***
BELAGSART_LABELNatur          -1.972087   0.015724 -125.418   <2e-16 ***
slope_10:BELAGSART_LABELk_W    0.086634   0.058849    1.472    0.141    
slope_10:BELAGSART_LABELNatur  0.073353   0.005497   13.345   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.041 on 68254 degrees of freedom
Multiple R-squared:  0.3202,    Adjusted R-squared:  0.3201 
F-statistic:  6429 on 5 and 68254 DF,  p-value: < 2.2e-16

GAM (Generalized Additive Model)

Only slope

Code

# GAM: only slope
gam_model_slope <- gam(speed_10 ~ s(slope_10), data = training_df)
summary(gam_model_slope)


Family: gaussian 
Link function: identity 

Formula:
speed_10 ~ s(slope_10)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.813578   0.008082   471.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df    F p-value    
s(slope_10) 8.988      9 2836  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.272   Deviance explained = 27.2%
GCV = 4.4597  Scale est. = 4.459     n = 68260

Combination of slope and terrain type

Code

# GAM: combination of slope and terrain type
gam_model_slope_belag <- gam(speed_10 ~ s(slope_10) + BELAGSART_LABEL, data = training_df)
summary(gam_model_slope_belag)


Family: gaussian 
Link function: identity 

Formula:
speed_10 ~ s(slope_10) + BELAGSART_LABEL

Parametric coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.738403   0.009769  485.04   <2e-16 ***
BELAGSART_LABELk_W   -3.846886   0.088110  -43.66   <2e-16 ***
BELAGSART_LABELNatur -1.966045   0.014468 -135.88   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
             edf Ref.df    F p-value    
s(slope_10) 8.99      9 3600  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.435   Deviance explained = 43.5%
GCV = 3.4641  Scale est. = 3.4635    n = 68260

Random Forest

Random Forest models require more computational effort than the previous models. To keep rendering times short, each model was therefore trained only once and saved afterwards. The original code is provided, but the models are loaded from a previous save.

Only slope, 100 trees

Code

# Load model
rf_model_slope_100 <- read_rds("rf_model_slope.rds")

print(rf_model_slope_100)


Call:
 randomForest(formula = speed_10 ~ slope_10, data = rf_data, ntree = 100,      importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 100
No. of variables tried at each split: 1

          Mean of squared residuals: 4.057427
                    % Var explained: 33.77

Code

randomForest::importance(rf_model_slope_100)

          %IncMSE IncNodePurity
slope_10 519.4971      372989.1

Slope, terrain type, 300 trees

Code

# Load model
rf_model_slope_belag_300 <- read_rds("rf_model_slope_belag_300.rds")

print(rf_model_slope_belag_300)


Call:
 randomForest(formula = speed_10 ~ slope_10 + BELAGSART, data = rf_data,      ntree = 300, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 300
No. of variables tried at each split: 1

          Mean of squared residuals: 3.467947
                    % Var explained: 43.39

Code

randomForest::importance(rf_model_slope_belag_300)

            %IncMSE IncNodePurity
slope_10   34.39113      85878.31
BELAGSART 237.19290      68993.43

Slope, spatial smoothing, 300 trees

Code

# Load model
rf_model_slope_sm_300 <- read_rds("rf_model_slope_sm_300.rds")

print(rf_model_slope_sm_300)

Ranger result

Call:
 ranger(formula = speed_10 ~ slope_10 + x + y, data = rf_data,      num.trees = 300, importance = "permutation", num.threads = parallel::detectCores(),      verbose = TRUE) 

Type:                             Regression 
Number of trees:                  300 
Sample size:                      68260 
Number of independent variables:  3 
Mtry:                             1 
Target node size:                 5 
Variable importance mode:         permutation 
Splitrule:                        variance 
OOB prediction error (MSE):       0.3060598 
R squared (OOB):                  0.9500404

Code

print(rf_model_slope_sm_300$variable.importance)

slope_10        x        y 
4.403146 7.741671 6.970755

Predictions

The three plots below show the speed predictions of the three Random Forest models. Measured speed is shown on the x-axis and the speed predicted by the model on the y-axis. The red dashed line marks a perfect 1:1 relationship, on which all points would lie if the model predicted speed exactly. Overall, none of the three models was able to predict speed with high accuracy. The model that applies spatial smoothing seems to be the most accurate but nevertheless has a large error margin. In the second prediction, the categorical terrain type seems to cause problems: the predicted speeds are grouped in horizontal bands, which means that many points received the same predicted speed from the model although their measured speeds vary.
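
A plot of this kind could be produced roughly as follows. This is a sketch and not the original plotting code: the helper name `plot_pred_vs_measured` is hypothetical, the column names `speed_10` and `prediction` follow the ones used elsewhere in this report, and the three data points are toy values.

```r
library(ggplot2)

# Hypothetical helper: predicted versus measured speed with a 1:1 reference line
plot_pred_vs_measured <- function(df, measured = "speed_10", predicted = "prediction") {
  # identical limits on both axes so that the 1:1 line runs along the diagonal
  lims <- range(c(df[[measured]], df[[predicted]]), na.rm = TRUE)
  ggplot(df, aes(x = .data[[measured]], y = .data[[predicted]])) +
    geom_point(alpha = 0.3, size = 0.8) +
    geom_abline(slope = 1, intercept = 0, colour = "red", linetype = "dashed") +
    coord_equal(xlim = lims, ylim = lims) +
    labs(x = "Measured speed (m/s)", y = "Predicted speed (m/s)") +
    theme_minimal()
}

# Toy data in place of the validation data frame
toy_validation <- data.frame(
  speed_10   = c(2.0, 4.5, 7.0),
  prediction = c(2.8, 4.1, 5.5)
)
pred_plot <- plot_pred_vs_measured(toy_validation)
pred_plot
```

Points above the dashed line are overestimated by the model, points below it are underestimated.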

Prediction: Slope, 100 trees

Figure 4: Predicted versus measured Speed using RandomForest (slope).

Prediction: Slope, terrain type, 300 trees

Figure 5: Predicted versus measured Speed using RandomForest (slope and terrain type).

Prediction: Slope, spatial smoothing, 300 trees

Code

validation_sf <- st_as_sf(validation_df, coords = c("lon", "lat"), crs = 2056)

# extract coordinates
validation_sf$x <- st_coordinates(validation_sf)[, 1]
validation_sf$y <- st_coordinates(validation_sf)[, 2]

# remove geometry
validation_df <- st_drop_geometry(validation_sf)

# remove NAs
validation_df <- validation_df %>%
  dplyr::filter(!is.na(slope_10), !is.na(x), !is.na(y))

# predict
validation_df$prediction <- predict(rf_model_slope_sm_300, data = validation_df)$predictions

Figure 6: Predicted versus measured Speed using RandomForest (slope and spatial smoothing).

Visual comparison

We extracted the predicted and measured speed values for one track in the validation group and mapped them together with their difference to visualize where the errors occur in space. A positive difference (red) means that the model overestimated the speed for the respective point. A negative difference (green) represents an underestimation by the model, i.e. the measured speed was higher than predicted.

Code

library(ggplot2)
library(ggmap)

# build an sf object from the prediction data frame;
# the prediction column is carried over automatically
validation_sf <- validation_df %>%
  st_as_sf(coords = c("x", "y"), crs = 2056)

# select a single validation track
a_0105 <- filter(validation_sf, source_file == "activity_0105")

# difference between predicted and measured speed
a_0105 <- mutate(a_0105, diff = prediction - speed_10)

Code

ggplot() +
  geom_sf(data = a_0105, aes(color = prediction), size = 1, alpha = 0.6) +
  scale_color_viridis_c(option = "viridis") +  
  theme_minimal() +
  labs(title = "Predicted Speed on Map",
       color = "Predicted Speed") +
  theme(legend.position = "bottom")

Figure 7: Speed Prediction for a single track.

Code

ggplot() +
  geom_sf(data = a_0105, aes(color = speed_10), size = 1, alpha = 0.6) +
  scale_color_viridis_c(option = "viridis") +  
  theme_minimal() +
  labs(title = "Measured Speed on Map",
       color = "Measured Speed") +
  theme(legend.position = "bottom")

Figure 8: Speed Measurements for a single track (validation).

Code

ggplot() +
  geom_sf(data = a_0105, aes(color = diff), size = 1, alpha = 0.6) +
  scale_color_gradient2(midpoint = 0, low = "green", mid = "white", high = "red") +
  # apply the complete theme first; otherwise it resets the legend position
  theme_minimal() +
  labs(title = "Difference Speed on Map",
       color = "Difference") +
  theme(legend.position = "bottom")

Figure 9: Difference of Speed Prediction and Measurements for a single track.

Discussion

Our analysis confirms a statistically significant negative relationship between slope and cycling speed, consistent with expectations and findings from previous studies. In the initial linear regression model, speed decreases by approximately 0.34 m/s for every 1% increase in slope (β = -0.34, p < 0.001). This aligns closely with the findings of Parkin and Rotheram (2010), who reported a reduction of 0.39 m/s per 1% uphill gradient. However, with an R² value of only 0.15, the model explains a relatively small proportion of the variance in speed, suggesting that slope alone does not capture the full complexity of cycling dynamics.

A separate linear model with terrain type as the only predictor similarly explains only 16.6% of the variance, despite being statistically significant (p < 2.2e-16). On natural surfaces (Natur), speed is on average about 2 m/s lower than on paved surfaces. When the two variables are combined in a multiple linear regression model, the explanatory power improves to an R² of 0.32, which suggests that both factors are important predictors. Still, the R² values highlight limitations of a linear modeling approach, suggesting that the relationship between speed, slope, and terrain may be nonlinear.

The results of the Generalized Additive Model (GAM) show a significant nonlinear relationship between slope and speed. This model fits the data better than the linear one, explaining 27% of the variance. Using the GAM with slope and surface type further improves its performance as it now explains 43.5% of the variance. This indicates that nonlinear patterns, such as cyclists braking on steep descents or slowing down disproportionately on steep inclines, are better captured. The unexplained variance could be due to factors not included in our models, such as rider effort, wind resistance or weather conditions.

In our first Random Forest model we used only slope as a predictor and set the number of trees to 100 to get a first impression of the resulting model accuracy and running time. The model was able to explain 33.77% of the variance. Unsurprisingly, the importance of the variable slope was very high (%IncMSE and IncNodePurity), since it is the only predictor and has a significant influence on speed (see the linear models). However, the overall model performance was insufficient for speed predictions.

In our second attempt, we included a second predictor variable, terrain type, and increased the number of trees in the model to 300. Overall model performance improved by about 10 percentage points, reaching 43% of variance explained. Terrain type appeared more important based on the %IncMSE value, whereas slope was favored in terms of IncNodePurity. %IncMSE indicates how much the model’s prediction error increases when the values of a predictor are randomly permuted. Since terrain type is a categorical variable with only three levels, randomly shuffling its values has a stronger effect than shuffling the values of slope, a continuous variable for which randomly chosen values may still be relatively close to the actual ones. For this reason, we consider slope to be a more reliable predictor than terrain type, despite the higher %IncMSE of the latter.
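
The permutation idea behind %IncMSE can be illustrated with a simplified sketch. Note the assumptions: the data are synthetic, a linear model stands in for the Random Forest, the error is measured on a held-out part of the data instead of out-of-bag samples, and the raw increase in MSE is reported without the scaling that randomForest applies. The sketch only shows the mechanism and does not reproduce the values above.

```r
set.seed(42)

# Synthetic data: speed depends on slope (continuous) and surface (categorical)
n <- 2000
toy <- data.frame(
  slope   = rnorm(n, mean = 0, sd = 3),
  surface = factor(sample(c("Hart", "Natur"), n, replace = TRUE))
)
toy$speed <- 5 - 0.3 * toy$slope - 2 * (toy$surface == "Natur") + rnorm(n)

train <- toy[1:1500, ]
test  <- toy[1501:2000, ]

# Stand-in model (assumption): a linear model instead of a Random Forest
fit <- lm(speed ~ slope + surface, data = train)
mse <- function(d) mean((d$speed - predict(fit, newdata = d))^2)
baseline_mse <- mse(test)

# Shuffle one predictor at a time and record how much the error grows
permutation_increase <- function(var) {
  shuffled <- test
  shuffled[[var]] <- sample(shuffled[[var]])   # breaks the link to speed
  mse(shuffled) - baseline_mse
}
perm_importance <- sapply(c("slope", "surface"), permutation_increase)
perm_importance
```

A predictor whose permutation barely changes the error carries little information that the model relies on, while a large increase indicates that the model depends strongly on it.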

Our third attempt added location data (the x and y coordinates of each point) as predictors. This approach is known as spatial smoothing, in our case based on geolocation (Fahrmeir et al. 2023). The idea is that the model can use spatial patterns to decide how differences and similarities in the training data should be interpreted, taking into account that points close to each other should be similar. As before, we kept slope as a predictor and used 300 trees. This time the model improved considerably and reached 95% of explained variance (out-of-bag). However, regarding the prediction accuracy on the validation tracks, this model did not perform much better than the others. A likely reason is that consecutive GPS points on the same track are strongly correlated, so the out-of-bag estimate is probably too optimistic for unseen tracks.

Our analysis indicates that both slope and terrain type significantly influence cycling speed, with each factor contributing similarly to the overall variations in speed. Initially, we anticipated that slope would play a much larger role in predicting speed, but the results showed that terrain type also has a considerable impact, particularly when combined with slope in nonlinear models like the Generalized Additive Models (GAMs). While slope had a slightly greater numerical effect, both factors were important predictors of cycling speed.

In terms of prediction accuracy, regression models can provide basic speed predictions, with the linear models explaining up to 32% of the variance when both slope and terrain type are considered. Nonlinear models offered a better fit, explaining 43.5% (GAM) and 43% (Random Forest) of the variance in speed. With location data included, the Random Forest even reached 95% of explained variance on the training data, although this did not translate into clearly better predictions on the validation tracks. Despite these improvements, significant unexplained variability remains, likely due to missing factors such as rider effort, environmental conditions, and surface roughness. While these models provide useful insights, they highlight that more factors need to be considered for more precise speed prediction.

Challenges and limitations

Data-related challenges included the coarse terrain classification: given the available data structure, we could essentially only distinguish paved from unpaved surfaces. This categorization does not capture the full range of surface types and thus does not reflect important differences in surface quality. Additionally, the terrain types were not validated against ground truth, so small trails may have been misclassified or linked to the nearest road. We could also not identify which type of bicycle (e.g., mountain bike, gravel bike, road bike) was used for each track. Since bike characteristics such as weight, tire width, and rolling resistance can significantly affect speed (Flügel et al., 2019), this missing information introduces uncertainty. Furthermore, variables such as wind, temperature, and rider effort were not available but likely contribute to unexplained variance.

Tool-related challenges included large dataset sizes (elevation and terrain type at very fine spatial resolution), which resulted in long processing times. The computational load could be reduced by limiting the geographical extent to our two areas of interest.

Conclusions

This study developed and compared multiple regression models to predict cycling speed using GPS-derived variables. The best-performing model, a GAM incorporating both slope and terrain type, explained 43.5% of the variance in speed. We demonstrated that slope has a slightly greater numerical influence than surface type, but both are important predictors.

This project opens several opportunities for further research. With more time and data, it would be possible to conduct a more in-depth analysis and improve the accuracy of speed predictions. Future work could include additional influencing factors such as trail sinuosity, the impact of fatigue over time, a more detailed classification of road surface types, and weather conditions. Including these elements would likely enhance the model’s performance and provide a more comprehensive understanding of what affects cycling speed.

References

Data

GPS data of bike tracks: Garmin (2025): https://connect.garmin.com
Elevation data: Swisstopo (2025): swissALTI3D. Available at: https://www.swisstopo.admin.ch/de/hoehenmodell-swissalti3d (08.01.2024).
Terrain type: Swisstopo (2025): swissTLM3D. Available at: https://www.swisstopo.admin.ch/de/landschaftsmodell-swisstlm3d (08.03.2024).

Literature

Flügel, S. et al. (2019) ‘Empirical speed models for cycling in the Oslo road network’, Transportation, 46(4), pp. 1395–1419. Available at: https://doi.org/10.1007/s11116-017-9841-8.
Fahrmeir, L. et al. (2023) ‘Spatial smoothing revisited: An application to rental data in Munich’, Statistical Modelling, pp. 480–494. Available at: https://doi.org/10.1177/1471082X231158465.
Garmin (2025) Genauigkeit der Höhenmessung von Outdoor- und Fitnessgeräten. Available at: https://support.garmin.com/de-CH/?faq=WlvNrOungC28xGtwB7hLY5 (accessed: 20.04.2025).
Laube, P. (2017) ‘Representation: Trajectories’, in D. Richardson et al. (eds) International Encyclopedia of Geography. 1st edn. Wiley, pp. 1–11. Available at: https://doi.org/10.1002/9781118786352.wbieg0593.
Laube, P. and Purves, R.S. (2011) ‘How fast is a cow? Cross‐Scale Analysis of Movement Data’, Transactions in GIS, 15(3), pp. 401–418. Available at: https://doi.org/10.1111/j.1467-9671.2011.01256.x.
Montoya-Zamora, R. et al. (2024) ‘Predicting Cyclist Speed in Urban Contexts: A Neural Network Approach’, Modelling, 5(4), pp. 1601–1617. Available at: https://doi.org/10.3390/modelling5040084.
Parkin, J. and Rotheram, J. (2010) ‘Design speeds and acceleration characteristics of bicycle traffic for use in planning, design and appraisal’, Transport Policy, 17(5), pp. 335–341. Available at: https://doi.org/10.1016/j.tranpol.2010.03.001.

Use of AI

ChatGPT was used as inspiration for coding and for troubleshooting errors during the implementation process. In addition, it supported grammar and spelling correction and was used to provide suggestions for clearer wording throughout the report.

We hereby declare that we have composed this work independently and without the use of any aids other than those declared (ChatGPT). We are aware that we take full responsibility for the scientific character of the submitted text, even if AI aids were used and declared. All passages taken verbatim or in sense from published or unpublished writings are identified as such.

Code

library("pacman")
wordcountaddin:::text_stats("report_group5.qmd")

Method	koRpus	stringi
Word count	3237	3132
Character count	20909	21209
Sentence count	245	Not available
Reading time	16.2 minutes	15.7 minutes