Which event is the best predictor of overall rank?
I am looking at the Women’s Lake Placid Ironman data set. The source of the data set is Ironman Lake Placid Results and articles. CoachCox. (n.d.). An Ironman is a very long version of a triathlon: 140.6 miles in total, with a 2.4-mile swim, a 112-mile bike, and a 26.2-mile run. I will be looking at the variables Swim rank, Bike rank, and Run rank, as well as overall rank. I want to find out which of the three events has the greatest impact on overall victory: which event should someone train for and focus on most to achieve the best results? I will also look at how the Division variable affects people’s performance. Divisions other than FPRO are broken down by age every 5 years, starting at F18-24 and running through F70-74. I am interested in this topic because, while I am not interested in trying an Ironman, I do think a shorter triathlon might be fun and would certainly be good exercise in three different disciplines. The data was collected by timers associated with the Lake Placid Ironman Group.
Load libraries and data set
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(jpeg)
library(ggpubr)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(viridis)
Loading required package: viridisLite
iron <- read_csv("ironman_lake_placid_female_2022.csv")
Rows: 489 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Name, Country, Gender, Division, Finish.Status, Location
dbl (11): Bib, Division.Rank, Overall.Time, Overall.Rank, Swim.Time, Swim.Ra...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:viridis':
viridis_pal
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(RColorBrewer)
Which event has the most weight towards achieving overall victory?
Create multiple linear regression comparing Swim, Bike, and Run
Call:
lm(formula = Overall.Rank ~ Swim.Rank + Bike.Rank + Run.Rank,
data = iron)
Residuals:
     Min       1Q   Median       3Q      Max
-136.907  -31.222   -7.815   26.890  303.304
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.363e+02  5.490e+00  -24.82   <2e-16 ***
Swim.Rank    9.302e-02  4.600e-03   20.22   <2e-16 ***
Bike.Rank    4.877e-01  6.408e-03   76.11   <2e-16 ***
Run.Rank     5.103e-01  5.865e-03   87.00   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 48.03 on 485 degrees of freedom
Multiple R-squared: 0.9915, Adjusted R-squared: 0.9914
F-statistic: 1.879e+04 on 3 and 485 DF, p-value: < 2.2e-16
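Because each event has a very low p-value, all three are significant predictors of overall rank. In this particular race the run carries the most weight, with the highest regression coefficient estimate of 5.103e-01, followed closely by the bike at 4.877e-01; the swim is much smaller at 9.302e-02. Since all three predictors are ranks on the same scale, the coefficients can be compared directly. The run coefficient means that, with swim rank and bike rank held constant, each one-place increase in run rank raises the predicted overall rank by about 0.51 places.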
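To see how far apart the three slopes really are, approximate 95% confidence intervals can be built from the estimates and standard errors printed in the summary above. This is a minimal sketch that re-types those printed numbers by hand rather than refitting the model; the object names `est`, `se`, and `ci` are illustrative only.

```r
# Estimates and standard errors copied from the printed lm() summary
est <- c(Swim.Rank = 0.09302, Bike.Rank = 0.4877, Run.Rank = 0.5103)
se  <- c(Swim.Rank = 0.004600, Bike.Rank = 0.006408, Run.Rank = 0.005865)

# t critical value for a 95% interval with 485 residual degrees of freedom
crit <- qt(0.975, df = 485)

# Interval for each slope: estimate +/- critical value * standard error
ci <- cbind(estimate = est,
            lower    = est - crit * se,
            upper    = est + crit * se)
print(round(ci, 4))
```

With these inputs the bike's upper bound sits just above the run's lower bound, so the run's lead over the bike is narrow. A formal test of the difference would also need the covariance between the two estimates, which the printed summary does not show; with the fitted model object available, `confint()` gives the same kind of intervals directly.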
Division is a poor predictor of overall rank unless you are in the Pro division. This makes sense, since athletes in the other divisions are not professionally trained triathletes. The FPRO division, though, has a very low and significant p-value of 8.90e-06.
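The output for the Division model is not shown in this report, so the following is only a sketch of how such a model might be fit, assuming a formula of the form `Overall.Rank ~ Division`. It uses a small synthetic data frame (`toy`, with made-up values, not the Lake Placid results) so that it runs on its own.

```r
# Synthetic stand-in data; the values are invented for illustration only
set.seed(1)
toy <- data.frame(
  Division = rep(c("F18-24", "F30-34", "FPRO"), each = 20)
)
toy$Overall.Rank <- ifelse(toy$Division == "FPRO",
                           rnorm(60, mean = 15, sd = 5),    # pros finish near the front
                           rnorm(60, mean = 250, sd = 80))  # age groups spread out

# A categorical predictor is expanded into one dummy variable per level,
# except the first level (here F18-24), which becomes the baseline
fit_div <- lm(Overall.Rank ~ Division, data = toy)
div_coefs <- coef(summary(fit_div))
print(div_coefs)
```

Each Division row in a table like this compares that division with the baseline division, so a p-value for FPRO describes how FPRO differs from the baseline group rather than from the field as a whole.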
Plot linear regression model that includes all divisions
iron_pl <- ggplot(data = iron, aes(x = Swim.Rank, y = Overall.Rank, color = Division)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_viridis(discrete = TRUE, option = "C") +
  labs(title = "Swim Rank vs Overall Rank",
       x = "Swim Rank",
       y = "Overall Rank",
       caption = "Ironman Lake Placid Results and articles. CoachCox.(n.d)") +
  theme_classic()
ggplotly(iron_pl)
`geom_smooth()` using formula = 'y ~ x'
Plot second visualization in Highcharter
Remove Slower Divisions to reduce number of competitors to better visualize
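The filtering code is not shown in this report. One way to keep only a handful of divisions is sketched below on a tiny synthetic data frame; the rows in `toy` are made up, and the list in `keep_divs` is a hypothetical choice, not necessarily the divisions kept in the actual chart.

```r
# Synthetic stand-in rows; not the real race results
toy <- data.frame(
  Division     = c("FPRO", "F25-29", "F30-34", "F65-69", "F70-74", "FPRO"),
  Swim.Rank    = c(1, 40, 55, 300, 410, 2),
  Overall.Rank = c(1, 35, 60, 380, 450, 3)
)

# Hypothetical list of divisions to keep
keep_divs <- c("FPRO", "F25-29", "F30-34")

# Keep only the rows whose Division appears in keep_divs
toy_fast <- subset(toy, Division %in% keep_divs)
print(table(toy_fast$Division))
```

With the tidyverse loaded, `filter(iron, Division %in% keep_divs)` would do the same job on the full data set.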
In conclusion, to answer my question, I found that in this event the run carries the most weight. After investigating further, I found that the run is the strongest predictor in most longer-distance Ironman triathlons. However, when other triathlon distances are examined, the results differ. An article in Frontiers in Physiology (Sousa et al., 2021) concluded that swimming is the most important predictor discipline in Sprint and Olympic distance triathlons, cycling in Ironman 70.3, and running in Ironman 140.6.
I have two visualizations. The first uses ggplot2 and shows the relation between swim rank and overall rank, with division shown by color. As expected, the FPRO athletes sit in the lower left, since they are the fastest. Beyond that, the plot mostly shows a positive relationship, with F65-69 being the exception, but the points are scattered and the relationship is not the strongest. In the Highcharter visualization, which looks at each division separately, the relationship is also positive, but stronger and more linear.
Sources
2022 Women’s Ironman Lake Placid data set source: Ironman Lake Placid Results and articles. CoachCox. (n.d.). https://www.coachcox.co.uk/imstats/series/13/ and https://data.scorenetwork.org/data/ironman_lake_placid_female_2022.csv
Women’s victor photo source: tri247.com Sarah True wins IRONMAN Lake Placid 2022 [Photo credit: Patrick McDermott / Getty Images for IRONMAN] https://www.tri247.com/triathlon-news/elite/ironman-lake-placid-2022-results-report
Additional triathlon analysis source: Frontiers in Physiology. Sousa CV, Aguiar S, Olher RR, Cunha R, Nikolaidis PT, Villiger E, Rosemann T and Knechtle B (2021) What Is the Best Discipline to Predict Overall Triathlon Performance? An Analysis of Sprint, Olympic, Ironman® 70.3, and Ironman® 140.6. Front. Physiol. 12:654552. doi: 10.3389/fphys.2021.654552 https://www.frontiersin.org/journals/physiology/articles/10.3389/fphys.2021.654552/full