Winding Paths: Identifying & Characterizing Mountainbike Trails Using Spatiotemporal Data
Project for Patterns & Trends in Environmental Data / Computational Movement Analysis Geo 880
Author
Reto Deuber and Gabriel Sciascia
Abstract
Regional planning is concerned with balancing recreational use of natural landscapes and ecological preservation. Mountainbiking on unofficial trails can lead to soil degradation and loss of biodiversity. This study explores the use of spatiotemporal movement data to detect and classify frequently used mountainbike segments, with the goal of supporting sustainable trail management. Using GPS data from 23 mountainbike sessions, three movement features - speed, slope, and sinuosity - were derived and applied to train a random forest classifier to distinguish between downhill trails and connecting paths. The model was trained on manually labeled data and evaluated using both cross-validation and qualitative map-based assessment. Results indicate that the classifier can reliably identify purpose-built downhill trails, with sinuosity emerging as the most significant predictive variable. However, the models performance decreases with natural trails and with curved sections of connecting paths, revealing limitations in the modelling approach applied here. We discuss potential improvements including the integration of additional sensor data, refined terrain modeling, and broader, more diverse model training datasets. Our findings demonstrate the potential of computational movement analysis to inform regional trail planning, while highlighting the need for careful model design.
Introduction
In regional planning, recreational use of natural landscapes must be balanced with ecological functions such as biodiversity and soil quality. Mountainbiking on unofficial paths poses a threat to these functions (Smith and Pickering (2025)). This study explores the possibility of locating commonly used biking trails and connecting paths based on GPS data of mountainbikers. Such data could provide a basis for planning official mountainbike infrastructure, channeling traffic flows and limiting damages to ecosystems.
Specifically, this study aims to apply principles of computational movement analysis (Laube (2014), Laube and Purves (2011)) to differentiate two types of downhill segments: Connecting paths are typically fire roads, logging roads or tarmac roads. They are characterized by moderate to steep slopes, low sinuosity and high riding speeds. Downhill trails are technically difficult, narrow and winding dirt paths. Their characteristics are steep slopes, high sinuosity and low to moderate riding speeds. Our goal is to develop a model predicting the segment type of GPS fixes, by training a machine learning algorithm to classify fixes as belonging to connecting paths or downhill trails based on the characteristics mentioned.
Material and Methods
Data
Data was obtained from the Strava records of one of the researchers, recorded with a Garmin Forerunner 745 and a Wahoo ELMNT Roam. GPS fixes were recorded in intervals of 1 second. After preprocessing the data consisted of 28’965 GPS fixes on downhill paths, contained in 23 individual mountainbiking sessions taking place from March 2023 to February 2025. Figure 1 displays a 3D plot of an example session (05.02.2025 Madeira, Portugal) with multiple downhill rides.
Code
# 3D-Graphlibrary(pacman)p_load(dplyr, readr, lubridate, sf, plotly)data_mtb_proc <-read_delim("data_mtb_proc.csv", ",")# Changing the coordinates to geometry (needed for "distance_by_element")data_mtb_proc <- data_mtb_proc |>mutate(lon = position_long,lat = position_lat) |>st_as_sf(coords =c("position_long", "position_lat"), crs =4326) |>st_transform(crs =3857)# Selecting a specific sessionsession1 <- data_mtb_proc |>ungroup() |>filter(date(session_id) ==as.Date("2025-02-05"))# Adding new columns for the plotsession1 <- session1 |>mutate( x = sf::st_coordinates(geometry)[, 1], y = sf::st_coordinates(geometry)[, 2])# Manually changing the values for better plot visualization session1$x <- session1$x +1921000session1$y <- session1$y -3862000# Creating the plotplot_ly(data = session1, x =~x, y =~y, z =~enhanced_altitude, type ="scatter3d", mode ="markers",marker =list(size =2, color =~enhanced_altitude, colorscale ='Viridis')) |>layout(scene =list(aspectmode ="data", xaxis =list(title ="Easting (m)"), yaxis =list(title ="Northing (m)"),zaxis =list(tick0 =0, dtick =500, title ="Altitude (m)")))
Figure 1: 3D plot of an example session (05.02.2025 Madeira, Portugal).
A function was written to extract the necessary variables from .fit files obtained via Strava bulk download. Unprocessed data contained the following variables:
sport & sub_sport: identifying different types of sport
timestamp: date and time of the GPS fix
enhanced_altitude: altitude calculated by recording devices from geodata and barometric measurements, measured at discrete intervals of 0.2 meters
session_id: end time of the session as unique identifier
sequence_number: sequential number of the GPS fix in the respective session
lon & lat: Longitude and latitude in the WGS84 format
Code
# Processing a single .fit filep_load(FITfileR)process_fit_file <-function(file_path) { fit <-tryCatch({readFitFile(file_path) }, error =function(e) {warning(paste("Error reading file:", basename(file_path), ":", e$message))return(NULL) # Return NULL if file cannot be read })# If the file couldn't be read, skip further processingif (is.null(fit)) {return(NULL) # Skip this file }# Extract geolocation data safely record_data <-records(fit)# Define crucial and optional variables crucial_vars <-c("timestamp", "position_lat", "position_long") optional_vars <-c("enhanced_altitude", "gps_accuracy", "altitude", "grade", "distance", "cadence", "speed", "enhanced_speed", "ascent", "descent")# Check if record_data is a data frame or list# dfif (is.data.frame(record_data)) {# If it's a data frame, check if all crucial columns are present missing_vars <-setdiff(crucial_vars, names(record_data))if (length(missing_vars) ==0) {# All crucial variables are present, compile geo_data geo_data <- record_data %>%select(any_of(c(crucial_vars, optional_vars))) %>%mutate(sequence_number =row_number()) } else {# Some crucial variables are missing, return the list of missing variableswarning(paste("Skipping file (missing crucial variables in record_data df:", paste(missing_vars, collapse =", "), "):", basename(file_path)))return(NULL) # Skip this file } } # listelseif (is.list(record_data)) {# If it's a list, flatten the list and check for crucial variables geo_data <-tryCatch({ flattened_data <- record_data %>% data.table::rbindlist(fill =TRUE) %>%# Flatten list and fill missing values tibble::as_tibble()# Check if all crucial variables are present missing_vars <-setdiff(crucial_vars, names(flattened_data))if (length(missing_vars) ==0) {# All crucial variables are present, compile geo_data flattened_data %>%select(any_of(c(crucial_vars, optional_vars))) %>%mutate(sequence_number =row_number()) } else {# Some crucial variables are missing, return the list of missing variableswarning(paste("Skipping file (missing crucial variables in record_data list:", paste(missing_vars, collapse =", "), "):", basename(file_path)))return(NULL) # Skip this file } }, error =function(e) {warning(paste("Error processing records in:", basename(file_path)))return(NULL) # Skip this file }) } else {warning(paste("Skipping file (invalid record data):", basename(file_path)))return(NULL) }# Extract session data session <-getMessagesByType(fit, message_type ="session") %>%as.data.frame() %>%select(timestamp, sport, sub_sport) %>%rename(session_id = timestamp)# Ensure only one session row existsif (nrow(session) ==0) {warning(paste("Skipping file (no session data):", basename(file_path)))return(NULL) } session <- session[1, ] # Take the first row# Merge session data with geo_data geo_data <- geo_data %>%mutate(session_id = session$session_id,sport = session$sport,sub_sport = session$sub_sport,file_name =basename(file_path)) # Track source filereturn(geo_data)}# Processing a folder of .fit filesprocess_fit_folder <-function(folder_path) { fit_files <-list.files(folder_path, pattern ="\\.fit$", full.names =TRUE)results <-lapply(fit_files, function(file) {message(paste("Processing:", basename(file))) # Print progressprocess_fit_file(file) })# Remove NULL results (skipped files) results <-Filter(Negate(is.null), results)if (length(results) ==0) {warning("No valid FIT files processed.")return(NULL) }# Combine all data frames into one combined_data <- dplyr::bind_rows(results)return(combined_data)}
To obtain the relevant data, mountainbiking data was selected and cleaned from unrealistic or unusable data in various variables. The data frame was then converted to the shapefile format in order to use vector geometry variables. Furthermore, the location data was transformed from the “WGS84” coordinate reference system, which expresses positions in radial units, to “WGS84 / Pseudo-Mercator”, which uses metrical units. The following variables were then derived from the geometry data to characterize individual GPS fixes:
Speed [m/s]: Speed was calculated by dividing the distances (step_length) from one fix to the next by the time difference between the two. Data points with unrealistic values above 23.6111 m/s (=85 km/h) were omitted.
Slope [degrees]: The slope of each fix was calculated with the trigonometric arctangent function using the distances between the geometry objects (step_length) and the height difference (derived from the difference in enhanced_altitude).
slope = (atan(height_difference / step length))
To remove unrealistic data, all data points with a height difference of more than 20 meters were omitted. Furthermore, data points with slope values of more than 40° were omitted too.
Sinuosity [ratio]: The sinuosity of a fix was calculated looking at a moving window of eleven fixes. The linear distance between the 5th fix before and the 5th fix after the reference fix (air_distance) was divided by the distance of the actual path between these fixes (sum of step_lengths). The size of the moving window was chosen to give points on short straight sections between narrow curves high sinuosity values too, and not just the points that form the curve. To remove unrealistic data, data points with sinuosity values higher than 10 were omitted.
Finally, the data frame was confined to data of downhill movement by removing data with slope 0° or higher. Static data was removed by omitting fixes with speeds lower than 1m/s.
Code
# Read Datafit_data <-read_delim("data_raw.csv")# Cleaning Data# Select only cycling dataunique(fit_data$sport) unique(fit_data$sub_sport)data_cy <- fit_data |>filter( sport =="cycling" )# Selecting only MTB data, removing NAs of fixes and removing 180 values of position_lat/longunique(data_cy$sub_sport)data_mtb <- data_cy |>filter( sub_sport =="mountain") |>filter(!is.na(position_lat)) |>filter(!is.na(position_long)) |>filter(position_lat !=180) |>filter(position_long !=180) |> dplyr::select(!(cadence:descent))# For some reason lat/long 180 cannot be removed with filter. Manually remove them by sorting and deleting rows containing 180.data_mtb <-sort_by(data_mtb, data_mtb$position_lat)data_mtb <- data_mtb[1:124721, ]data_mtb <-sort_by(data_mtb, data_mtb$timestamp)unique(data_mtb$session_id) # Identify Number of sessions in data# Adding functions for time difference and steplengthdifftime_secs <-function(x, y){as.numeric(difftime(x, y, units ="secs"))}distance_by_element <-function(later, now){as.numeric(st_distance(later, now, by_element =TRUE))}# Changing the coordinates to geometry (needed for "distance_by_element")mtb_sf_merc <- data_mtb |>mutate(lon = position_long,lat = position_lat) |>st_as_sf(coords =c("position_long", "position_lat"), crs =4326) |>st_transform(crs =3857)str(mtb_sf_merc)# Adding time difference, steplength and speeddata_mtb_speed <- mtb_sf_merc |>group_by(session_id) |>mutate(timelag =difftime_secs(lead(timestamp), timestamp)) |>mutate(steplength =distance_by_element(lead(geometry), geometry)) |>mutate(speed = steplength/timelag) # Adding slope # Adding the height_difference and slope data_mtb_speed_slope <- data_mtb_speed |>group_by(session_id) |>mutate(height_difference =lead(enhanced_altitude) - enhanced_altitude) |>mutate(slope =if_else( # Berechnung von Slope nur wenn steplength != 0, da dort slope = 90 steplength !=0,atan(height_difference / steplength) * (180/ pi),0 )) # Adding sinuosity1 data_mtb_speed_slope_sinuosity <- data_mtb_speed_slope |>group_by(session_id) |>mutate(pathway_length =lag(steplength, 2) +lag(steplength) + steplength +lead(steplength)) |>mutate(air_length =distance_by_element(lag(geometry, 2), lead(geometry, 2))) |>mutate(sinuosity = pathway_length / air_length)# Adding sinuosity2 (calculated for a trajectory of 11 points)data_mtb_speed_slope_sinuosity <- data_mtb_speed_slope_sinuosity |>group_by(session_id) |>mutate(pathway_length_2 =lag(steplength, 5) +lag(steplength, 4) +lag(steplength, 3) +lag(steplength, 2) +lag(steplength) + steplength +lead(steplength) +lead(steplength, 2) +lead(steplength, 3) +lead(steplength, 4)) |>mutate(air_length_2 =distance_by_element(lag(geometry, 5), lead(geometry, 5))) |>mutate(sinuosity_2 = pathway_length_2 / air_length_2)# Renaming PreProcessed Datadata_mtb_proc <- data_mtb_speed_slope_sinuosity# Removing unrealistic datadata_mtb_proc <- data_mtb_proc |>filter(sinuosity <10) |>filter(sinuosity_2 <10) |>filter( slope >-40& slope <40) |># Removing unrealistic slopesfilter(height_difference <20& height_difference >-20 ) |># Removing unrealistic height differencesfilter(speed <23.6111) # filter above 85 km/h# Downhill-only data analysis# Looking only at negative slopes, with speed >1m/sdh <- data_mtb_proc |>filter(slope <0) |>filter(speed >1)
Figure 2 (a) displays the distribution of all obtained speed values. The modal value is at approximately 7.5 m/s (27 km/h) and the number of observations steeply decreases with increasing speed. Figure 2 (b) shows the distribution of all observations of slope. The slope was derived from the enhanced_altitude value, which was measured at discrete intervals of 0.2 meters. Therefore, many slope values on rather flat sections were rounded to exactly 0°. Figure 2 (c) displays the distribution of all obtained sinuosity values. As expected, the amount of observations decreases with higher sinuosity (since a majority of the data consists of rather straight sections).
Code
# Visualisation of speed p1 <-ggplot(dh, aes(speed)) +geom_histogram(binwidth = .1) +labs(x ="speed in m/s") +ggtitle("(a)")# Visualisation of slope p2 <-ggplot(dh, aes(slope)) +geom_histogram(binwidth = .3) +labs(x ="slope in °") +ggtitle("(b)")# Visualize sinuosity2p3 <-ggplot(dh, aes(sinuosity_2)) +geom_histogram(binwidth =0.002) +scale_x_continuous(limits =c(1, 2)) +labs(x ="sinuosity") +ggtitle("(c)")# Plots next to each otherp1 + p2 + p3
Figure 2: Histogram of (a) speed, (b) slope and (c) sinuosity values.
Development of prediction model
To train the machine learning algorithm, a training data set with fixes labeled as either connecting path or downhill trail is required. This was achieved by manually classifying the fixes of three sessions. This data was then used to train a random forest algorithm (Breiman (2001)). This algorithm achieves classification of segment type (downhill trail / connecting path) by creating multiple decision trees with feature classes (speed, sinuosity, slope) as nodes. Figure 3 shows an exemplary decision tree for our model. The final classification of a fix is determined by the classification that the majority of trees assign to this fix. Model quality was assessed by splitting the training data into subsets for training and cross-validation testing. The resulting model was then used to predict segment type for the fixes of the unlabeled data. Accuracy of this prediction is assessed qualitatively, by evaluating if the mapped GPS fixes colored according to predicted path type correspond to the actual path type as identified by the researchers.
Figure 3: An example of one random forest. The boxes contain the prediction for each node, the probability of the prediction and the percentage of fixes in the node.
Results
To assess model quality, two thirds of the training data are used to train the model, which is then tested on the remaining third of the data. Figure 4 gives an indication of model quality by plotting the share of falsely classified fixes against the number of trees integrated into the model. Ideally, such a plot shows a steep initial descent, as the first few trees greatly reduce error, and then stabilizes, as new trees do not bring any more variance in precision. Our models’ error probability dropped sharply and stabilized after roughly 100 trees, which indicates a significant improvement of predictions through our feature variables sinuosity, speed and slope. After that, the error stabilizes at a probability of roughly 0.2, which is a decent value yet leaves some room improvement of the model.
Code
# Error plot of modelplot(model_rf, main ="Classification error with increasing number of trees",)legend("topright",legend =colnames(model_rf$err.rate)[-1], # Class namescol =2:(ncol(model_rf$err.rate)), # Start from color 2lty =1)
Figure 4: Probability of falsely classified fixes depending on number of trees integrated in the model.
Variable significance gives an indication of how important each feature variable is for predicting path type. It is measured as mean decrease in Gini impurity. The goal of a random forest model is to obtain samples that are more homogeneous regarding true path type at every node of the tree. Mean decrease in Gini impurity measures how much a variable has decreased heterogeneity at all the nodes it was applied. Figure 5 shows mean decrease in Gini impurity for speed, sinuosity and slope. It is evident that sinuosity and speed are most important in predicting path types while slope plays a smaller, yet not negligible role.
Code
# Plot of variable importancevar_imp <-importance(model_rf)var_imp_df <-data.frame(Variable =rownames(var_imp),Importance = var_imp[, "MeanDecreaseGini"])var_imp_df$Variable <-recode(var_imp_df$Variable,slope ="Slope",speed ="Speed",sinuosity_2 ="Sinuosity")barplot(var_imp_df$Importance, names.arg = var_imp_df$Variable, las =1, # Makes text horizontalmain ="Variable Importance",xlab ="Mean Decrease in Gini Impurity",ylim =c(0, 600*1.2))
Figure 5: Mean decrease in Gini impurity for the variables in the model.
For a qualitative assessment of the model predictions on unlabeled data, a map with GPS fixes colored according to path type was created. In Figure Figure 6, the data is limited to the region of Eschenbergwald in Winterthur for a better overview. A visual assessment shows, that possible locations of trails can be estimated fairly good, but exact predictions need to be improved. The model appears to perform best in identifying mountainbiking trails which are built for this specific purpose, as can be found in the west/southwest corner of Eschenbergwald. On the other hand, the model struggles to consistently identify more natural, multi-purpose trails, as seen for example center of the map, west of Wildpark Bruderhaus or in the very south, near Kyburg. Connecting paths are identified quite reliably both on fire roads as well as in the city. However, fixes in curves of connecting paths are often falsely classified as trails. Vice versa, longer straight sections on trails are sometimes classified as connecting paths. This reflects the high importance of the sinuosity variable in our model. The overall impression is that the model mainly detects curvature of paths. A further observation is that bulks of fixes that were factually static but registered as moving due to GPS inaccuracies, are classified as downhill trails due to their high sinuosity. Ways to address these issue are discussed in the last section.
Code
# Only Data from Eschenbergwaldunlabeled_data_map <- unlabeled_data |>filter(date(session_id) %in%c(date("2024-07-11 17:27:18"), date("2023-10-08 16:21:19"),date("2023-07-27 15:09:18"),date("2024-05-01 16:25:39"),date("2023-06-01 18:30:28")))# Render Maptmap_mode("view")tm_shape(unlabeled_data_map) +tm_dots(col ="predicted_type", palette =c("blue", "red") ) +tm_basemap("OpenStreetMap") +tm_view(set.view =c(lon =8.73, lat =47.478, zoom =15))
Figure 6: Map of path type predictions for the area of Eschenbergwald.
Discussion & Conclusion
The goal of this study was to explore possibilities of localizing and identifying different types of mountainbiking paths using spatiotemporal data. To do so, raw GPS data was amended with derived variables. For a subset of data, true trail type was assigned manually. A random forest machine learning algorithm was trained on this subset to obtain a model for identifying trail type based on the derived variables. Model performance was tested quantitatively through error probabilities and qualitatively by assessing if paths could be correctly identified by humans using a map created with the model. The results showed promising potential for this method. However, it needs refinement to accurately and precisely predict path types.
A first set of obstacles to overcome is related to the ground truth definition of different trail types. The model performs best at identifying purpose-specific (as opposed to natural) downhill trails, because the unique features of trails defined here (i.e. high sinuosity, moderate speeds and steep slopes) are most pronounced on such trails. The ground truth classification as downhill trail or connecting path is more fuzzy for natural trails, as natural trails come in more diverse shapes. For example, they often have stretches with no curvature. Furthermore, there may be differences in trail features between geographic regions as these generally differ in elevation and slope profiles. Besides topography, vegetation and soil type influence paths and their characteristics too. Another factor to take into account are varying difficulty levels of trails, as more difficult trails are more likely to be steeper in nature and allow less speed. To objectively and consistently define downhill trails and connecting paths, more features of trails should be considered and integrated into the definition. This reduces subjectivity in the ground truth definition. Alternative ways of definition could be explored as well, for example through existing data-bases.
Further difficulties lie in the way the model is trained and defined. First, the above-mentioned differences in path types should be integrated as model variables, reducing variance and leading to more accurate predictions. Possible additional variables could emerge from a multi-source approach. An example would be the addition of information about the rider’s heart rate, skill level or other attributes influencing the ride such as the weather. Second, the feature variables used here could be calculated differently or complemented with further variables to improve the model. For example, sinuosity could be calculated with different methods or with a different size of moving window, which would result in differing model predictions. Furthermore, more precise height differences than the enhanced_altitude value (which was measured at discrete intervals of 0.2 meters by the recording devices), could potentially increase the slope variable’s importance in the model. An option to obtain more precise height differences is the implementation of a comprehensive height model. Third, in this study the model was trained on data containing mostly purpose-specific trails situated in one specific region, which might have led to a model bias. For a more generally applicable model, a bigger and broader training data set should be applied. This would also make grouping of different regions or path difficulties more viable. Fourth, the capabilities of other modelling approaches than machine learning could be explored.
Another issue are factually static fixes, often classified as mountainbike trails. They cannot simply be removed by raising the speed threshold (of >1m/s) to an random value, since especially mountainbike trail sections with low speed would be deleted too. A possible solution would be an estimation of the uncertainty of the fixes with a Monte Carlo Simulation (Metropolis et al. (1953)). Another option could be an additional variable or algorithm checking if the fixes in question are progressing on a path or if they are just GPS inaccuracies within factually static fixes (generating high sinuosity values).
The fact that sinuosity is the most important variable, together with the observation that the prediction model mainly identifies curvature of paths, raises the question if the combination of the three feature variables chosen here allows for reliable predictions at all. It could be the case that the complex influences of various intervening factors such as rider skill, weather conditions or surface structures cannot be controlled for and that the nature of mountain biking trails is too diverse to be predicted.
Laube, Patrick. 2014. Computational MovementAnalysis. SpringerBriefs in ComputerScience. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-10268-9.
Laube, Patrick, and Ross S Purves. 2011. “How Fast Is a Cow? Cross-Scale Analysis of Movement Data.”Transactions in GIS 15 (3): 401–18.
Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of StateCalculations by FastComputingMachines.”The Journal of Chemical Physics 21 (6): 1087–92. https://doi.org/10.1063/1.1699114.
Smith, Isabella, and Catherine Marina Pickering. 2025. “Assessing the EnvironmentalImpacts, Condition and Sustainability of MountainBikingTrails in an UrbanNationalPark.”Environmental Management 75 (4): 793–805. https://doi.org/10.1007/s00267-024-02029-6.