IRON MAN

Author

Walter Hinkley

Published

December 7, 2024

Women's 2022 Lake Placid Ironman

Which event is the best event to predict overall rank?

I am looking at the Women's Lake Placid Ironman data set. The source of the data set is Ironman Lake Placid Results and Articles, CoachCox (n.d.). An Ironman is a very long version of a triathlon: 140.6 miles in total, made up of a 2.4 mile swim, a 112 mile bike, and a 26.2 mile run. I will be looking at the variables Swim rank, Bike rank, and Run rank, as well as overall rank. I want to find out which of the three events has the greatest impact on overall victory. Which event should someone train for and focus on most to achieve the best results? I will also look at how the Division variable affects people's performance. Divisions other than FPRO are age groups, starting at F18-24 and then running in 5-year bands through F70-74. I am interested in this topic because, while I am not interested in trying an Ironman, I do think a shorter triathlon might be fun and would certainly be good exercise in three different disciplines. The data was collected by timers associated with the Lake Placid Ironman Group.

Load libraries and data set

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(jpeg)
library(ggpubr)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(viridis)

Loading required package: viridisLite

iron <- read_csv("ironman_lake_placid_female_2022.csv")

Rows: 489 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): Name, Country, Gender, Division, Finish.Status, Location
dbl (11): Bib, Division.Rank, Overall.Time, Overall.Rank, Swim.Time, Swim.Ra...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#install.packages("shiny")
#install.packages("shinythemes")
#install.packages("shinydashboard")
#install.packages("shinyWidgets")
library(shiny)
library(shinydashboard)


Attaching package: 'shinydashboard'

The following object is masked from 'package:graphics':

    box

library(shinythemes)
library(shinyWidgets)
library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(scales)


Attaching package: 'scales'

The following object is masked from 'package:viridis':

    viridis_pal

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

library(RColorBrewer)

Which event has the most weight towards achieving overall victory?

Create multiple linear regression comparing Swim, Bike, and Run

lm_iron <- lm(Overall.Rank ~ Swim.Rank + Bike.Rank + Run.Rank, data = iron)
summary(lm_iron)


Call:
lm(formula = Overall.Rank ~ Swim.Rank + Bike.Rank + Run.Rank, 
    data = iron)

Residuals:
     Min       1Q   Median       3Q      Max 
-136.907  -31.222   -7.815   26.890  303.304 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.363e+02  5.490e+00  -24.82   <2e-16 ***
Swim.Rank    9.302e-02  4.600e-03   20.22   <2e-16 ***
Bike.Rank    4.877e-01  6.408e-03   76.11   <2e-16 ***
Run.Rank     5.103e-01  5.865e-03   87.00   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 48.03 on 485 degrees of freedom
Multiple R-squared:  0.9915,    Adjusted R-squared:  0.9914 
F-statistic: 1.879e+04 on 3 and 485 DF,  p-value: < 2.2e-16

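The p-values for all three events are very low (< 2e-16), so each one is a significant predictor of overall rank. Because all three predictors are ranks on the same scale, their coefficients can be compared directly. Running carries the most weight in this particular race because it has the largest regression coefficient estimate, 5.103e-01, narrowly ahead of the bike at 4.877e-01 and well ahead of the swim at 9.302e-02. This means that, with swim rank and bike rank held fixed, each one-place increase in run rank raises the predicted overall rank by about 0.5103 places.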
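Since the run and bike coefficients are close, it can help to look at confidence intervals and not only the point estimates. The following is a minimal sketch on simulated data, not on the race file: the `toy` data frame and its segment times are made up for illustration, and only `lm()` and `confint()` from base R are used.

```r
# Hypothetical stand-in for the race data: random segment times turned into ranks.
set.seed(1)
n <- 200
swim <- rnorm(n, mean = 80, sd = 10)   # made-up swim times (minutes)
bike <- rnorm(n, mean = 400, sd = 40)  # made-up bike times (minutes)
run  <- rnorm(n, mean = 300, sd = 40)  # made-up run times (minutes)

toy <- data.frame(
  Swim.Rank    = rank(swim),
  Bike.Rank    = rank(bike),
  Run.Rank     = rank(run),
  Overall.Rank = rank(swim + bike + run)
)

# Same model form as lm_iron, fitted to the toy data
fit <- lm(Overall.Rank ~ Swim.Rank + Bike.Rank + Run.Rank, data = toy)

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)
print(ci)
```

On the real model the same call would be `confint(lm_iron)`. As a rough check from the standard errors reported in the summary, estimate plus or minus about 1.96 standard errors gives approximately 0.475 to 0.500 for the bike and 0.499 to 0.522 for the run, so the two ranges nearly touch and the run's lead over the bike is narrow.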

Plot the linear regression model

par(mfrow=c(2,2))
plot(lm_iron)

#pairs(iron[, c("Swim.Rank", "Run.Rank", "Bike.Rank")], panel = panel.smooth)

The pairs() function draws a scatterplot matrix in which each listed variable is plotted against each of the others.
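A numeric companion to a scatterplot matrix is a table of rank correlations. The sketch below uses simulated data again (the `toy` data frame is hypothetical and the numbers it produces say nothing about the real race); it only illustrates how `cor()` with `method = "spearman"` could be used to compare each segment rank with overall rank.

```r
# Hypothetical data: made-up segment times converted to ranks.
set.seed(42)
n <- 100
swim <- rnorm(n, mean = 80, sd = 10)
bike <- rnorm(n, mean = 400, sd = 40)
run  <- rnorm(n, mean = 300, sd = 40)

toy <- data.frame(
  Swim.Rank    = rank(swim),
  Bike.Rank    = rank(bike),
  Run.Rank     = rank(run),
  Overall.Rank = rank(swim + bike + run)
)

# Spearman correlation between every pair of columns
rank_cor <- cor(toy, method = "spearman")

# Correlation of each segment rank with overall rank
print(round(rank_cor[, "Overall.Rank"], 2))
```

With the real data, the equivalent call would select the four rank columns from `iron` before passing them to `cor()`.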

Create another linear regression model including all the divisions

lm_iron2 <- lm(Overall.Rank ~ Swim.Rank + Bike.Rank + Run.Rank + Division, data = iron)
summary(lm_iron2)


Call:
lm(formula = Overall.Rank ~ Swim.Rank + Bike.Rank + Run.Rank + 
    Division, data = iron)

Residuals:
     Min       1Q   Median       3Q      Max 
-130.741  -27.175   -3.588   25.039  299.287 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1.525e+02  2.132e+01  -7.153 3.30e-12 ***
Swim.Rank       9.605e-02  4.513e-03  21.282  < 2e-16 ***
Bike.Rank       4.950e-01  6.177e-03  80.139  < 2e-16 ***
Run.Rank        5.129e-01  5.763e-03  88.989  < 2e-16 ***
DivisionF25-29 -9.810e+00  2.174e+01  -0.451    0.652    
DivisionF30-34 -4.732e+00  2.124e+01  -0.223    0.824    
DivisionF35-39  4.431e+00  2.108e+01   0.210    0.834    
DivisionF40-44 -7.457e-01  2.078e+01  -0.036    0.971    
DivisionF45-49 -3.801e-01  2.083e+01  -0.018    0.985    
DivisionF50-54  3.049e+00  2.082e+01   0.146    0.884    
DivisionF55-59 -6.420e+00  2.132e+01  -0.301    0.763    
DivisionF60-64 -4.422e+00  2.352e+01  -0.188    0.851    
DivisionF65-69 -3.969e+01  3.783e+01  -1.049    0.295    
DivisionF70-74 -2.196e+01  4.950e+01  -0.444    0.658    
DivisionFPC    -3.049e+00  4.951e+01  -0.062    0.951    
DivisionFPRO    1.094e+02  2.436e+01   4.491 8.95e-06 ***
DivisionM30-34 -1.655e+01  4.956e+01  -0.334    0.739    
DivisionM35-39  6.956e+01  4.945e+01   1.407    0.160    
DivisionM40-44  6.311e+00  4.949e+01   0.128    0.899    
DivisionM50-54  2.483e+01  4.963e+01   0.500    0.617    
DivisionM55-59  5.578e+01  4.954e+01   1.126    0.261    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.07 on 468 degrees of freedom
Multiple R-squared:  0.9928,    Adjusted R-squared:  0.9924 
F-statistic:  3205 on 20 and 468 DF,  p-value: < 2.2e-16

Division is a poor predictor of overall rank unless you are in the Pro division. Each division coefficient is measured against the baseline level, F18-24, and none of the age-group divisions differs significantly from it. This makes sense, since the other divisions are not made up of professional triathletes. The FPRO division, though, has a very low and significant p-value of 8.95e-06.
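Two base-R tools are useful when reading a factor such as Division in a regression: `anova()` compares the models with and without the factor as a whole, and `relevel()` changes which level the coefficients are measured against. The sketch below is an assumption-laden toy example; the data frame, the variables `x` and `y`, and the effect sizes are all invented for illustration.

```r
# Hypothetical data: one numeric predictor and a three-level Division factor.
set.seed(2)
n <- 150
toy <- data.frame(
  x = runif(n),
  Division = factor(sample(c("F18-24", "F25-29", "FPRO"), n, replace = TRUE))
)
# Made-up response in which only FPRO shifts the outcome
toy$y <- 2 * toy$x + ifelse(toy$Division == "FPRO", 1, 0) + rnorm(n, sd = 0.3)

small <- lm(y ~ x, data = toy)             # model without Division
big   <- lm(y ~ x + Division, data = toy)  # model with Division

# F-test of whether Division as a whole improves the fit
cmp <- anova(small, big)
print(cmp)

# Make FPRO the baseline so other levels are compared against it
toy$Division <- relevel(toy$Division, ref = "FPRO")
print(coef(lm(y ~ x + Division, data = toy)))
```

For the models in this document, the corresponding comparison would be `anova(lm_iron, lm_iron2)`.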

Plot linear regression model that includes all divisions

iron_pl <- ggplot(data = iron, aes(x=Swim.Rank, y=Overall.Rank, color=Division))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)+
  scale_color_viridis(discrete = TRUE, option = "C")+
  labs(title = "Swim Rank vs Overall Rank",
       x="Swim Rank",
       y="Overall Rank",
       caption = "Ironman Lake Placid Results and articles. CoachCox.(n.d)")+
  theme_classic()
ggplotly(iron_pl)

`geom_smooth()` using formula = 'y ~ x'

Plot second visualization in Highcharter

Remove slower divisions to reduce the number of competitors and make the visualization easier to read

iron2 <- iron[iron$Division %in% c("FPRO", "F18-24", "F25-29", "F30-34", "F35-39"),]
head(iron2)

# A tibble: 6 × 17
    Bib Name     Country Gender Division Division.Rank Overall.Time Overall.Rank
  <dbl> <chr>    <chr>   <chr>  <chr>            <dbl>        <dbl>        <dbl>
1     3 Sarah T… United… Female FPRO                 1         540.           11
2     1 Heather… United… Female FPRO                 2         556.           13
3     8 Jodie R… United… Female FPRO                 3         562.           16
4     5 Rachel … United… Female FPRO                 4         573.           20
5     2 Melanie… Canada  Female FPRO                 5         575.           21
6    10 Angela … United… Female FPRO                 6         586.           28
# ℹ 9 more variables: Swim.Time <dbl>, Swim.Rank <dbl>, Bike.Time <dbl>,
#   Bike.Rank <dbl>, Run.Time <dbl>, Run.Rank <dbl>, Finish.Status <chr>,
#   Location <chr>, Year <dbl>

Clean the column names so the tooltip template does not misread the "." in names such as Overall.Rank

iron2 <- iron2 %>%
            clean_names()
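The effect of this step on the column names can be approximated in base R. This is only a rough stand-in, since janitor's clean_names() handles many more cases than a simple substitution; the three names are taken from the column listing shown earlier. The reason it matters, as far as I understand the tooltip format strings, is that a dot inside `{point.…}` is read as access to a nested property, so a column called `Overall.Rank` would not resolve.

```r
# Three of the original column names from the data set
original <- c("Overall.Rank", "Run.Rank", "Finish.Status")

# Rough base-R approximation: replace literal dots with underscores, lower-case
cleaned <- tolower(gsub(".", "_", original, fixed = TRUE))
print(cleaned)
# "overall_rank" "run_rank" "finish_status"
```

After cleaning, the tooltip can refer to `{point.overall_rank}` and `{point.run_rank}` without ambiguity.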

Highcharter Plot of Racers, Run ranks, and Overall ranks

cols <- magma(5)  # one color per remaining division
highchart() |>
  hc_add_series(data = iron2,
                type = "bubble",
                hcaes(x = run_rank,
                      y = overall_rank,
                      group = division)) |>
  hc_colors(cols) |>
  hc_title(text = "Female Ranks in 2022 Iron Man") |>
  hc_xAxis(title = list(text = "Run Rank")) |>
  hc_yAxis(title = list(text = "Overall Rank")) |>
  # Label each value so the tooltip is readable on its own
  hc_tooltip(shared = TRUE,
             borderColor = "black",
             pointFormat = "{point.name}: {point.country}<br>
    Overall rank: {point.overall_rank}<br>
    Run rank: {point.run_rank}")

Conclusion

In conclusion, to answer my question, I have found that in this event the run is the most significant event. After investigating further, I found that the run is the most significant discipline in most longer-distance Ironman triathlons. However, when other triathlon distances are examined, the results differ. An article in Frontiers in Physiology (Sousa et al., 2021) concluded that swimming is the most important predictor discipline in Sprint and Olympic distance triathlons, cycling in Ironman 70.3, and running in Ironman 140.6. I have two visualizations. The first uses ggplot2 and shows the relationship between swim rank and overall rank, with division shown by color. As expected, the FPRO athletes sit in the lower left, since they are the fastest. Beyond that, the plot shows a mostly positive relationship, with F65-69 being the exception, but the points are scattered and the relationship is not the strongest. In the Highcharter visualization, which plots run rank against overall rank for each division separately, the relationship is also positive but stronger and more linear.

Sources

2022 Women's Ironman Lake Placid data set source: Ironman Lake Placid Results and Articles. CoachCox. (n.d.). https://www.coachcox.co.uk/imstats/series/13/ and https://data.scorenetwork.org/data/ironman_lake_placid_female_2022.csv

Women’s victor photo source: tri247.com Sarah True wins IRONMAN Lake Placid 2022 [Photo credit: Patrick McDermott / Getty Images for IRONMAN] https://www.tri247.com/triathlon-news/elite/ironman-lake-placid-2022-results-report

Additional triathlon analysis source: Frontiers in Physiology. Sousa CV, Aguiar S, Olher RR, Cunha R, Nikolaidis PT, Villiger E, Rosemann T and Knechtle B (2021) What Is the Best Discipline to Predict Overall Triathlon Performance? An Analysis of Sprint, Olympic, Ironman® 70.3, and Ironman® 140.6. Front. Physiol. 12:654552. doi: 10.3389/fphys.2021.654552 https://www.frontiersin.org/journals/physiology/articles/10.3389/fphys.2021.654552/full