Introduction

This project analyzes the results of the female participants in the 2022 Ironman Lake Placid triathlon. An Ironman triathlon is a long-distance race consisting of three consecutive events: swimming, biking, and running. The dataset contains performance metrics for each athlete who participated in the event. My objective is to explore three things: the distribution of overall completion times and ranks among the participants; the individual times in each segment (swimming, biking, and running) and how they correlate with overall rank; and the average overall completion time and how individual performances compare to it.

Dataset variables

Bib: The race number assigned to each participant.
Name: The name of the participant.
Country: The country of origin of the participant.
Gender: The gender of the participant.
Division: The division in which the participant competed.
Division.Rank: The rank of the participant within their division.
Overall.Time: The total time taken by the participant to complete the triathlon (in minutes).
Overall.Rank: The overall rank of the participant among all competitors.
Swim.Time: The time taken by the participant to complete the swimming portion of the triathlon (in minutes).
Swim.Rank: The rank of the participant in the swimming portion.
Bike.Time: The time taken by the participant to complete the biking portion of the triathlon (in minutes).
Bike.Rank: The rank of the participant in the biking portion.
Run.Time: The time taken by the participant to complete the running portion of the triathlon (in minutes).
Run.Rank: The rank of the participant in the running portion.
Finish.Status: The status of the participant at the end of the race (e.g., Finisher).
Location: The location of the event.
Year: The year the event took place.
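Because every time variable is stored in decimal minutes, a value such as 540.3667 is hard to read as a race time. The sketch below shows one way to convert such values to an hours:minutes:seconds string. The helper `minutes_to_hms` is a hypothetical name introduced here for illustration and is not part of the original analysis; it assumes the input is a numeric vector of non-negative minutes.

```r
# Hypothetical helper (not in the original analysis): format durations given
# in decimal minutes as "H:MM:SS" strings.
minutes_to_hms <- function(minutes) {
  # Work in whole seconds so rounding happens exactly once.
  total_seconds <- round(minutes * 60)
  hours <- total_seconds %/% 3600
  mins  <- (total_seconds %% 3600) %/% 60
  secs  <- total_seconds %% 60
  sprintf("%d:%02d:%02d", hours, mins, secs)
}

# Overall and swim time of the first row shown by head(ironman).
minutes_to_hms(c(540.3667, 55.6))
# "9:00:22" "0:55:36"
```

Applied to a column, for example `minutes_to_hms(ironman$Overall.Time)`, this would give a readable label for tables or plot annotations while the numeric column stays available for calculations.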

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
getwd()
## [1] "C:/Users/dylan/OneDrive/Documents/Data 110 Summer"
setwd("C:/Users/dylan/OneDrive/Documents/Data 110 Summer")
ironman <- read.csv("ironman_lake_placid_female_2022.csv")
head(ironman)
##   Bib             Name       Country Gender Division Division.Rank Overall.Time
## 1   3       Sarah True United States Female     FPRO             1     540.3667
## 2   1  Heather Jackson United States Female     FPRO             2     556.3833
## 3   8  Jodie Robertson United States Female     FPRO             3     562.0333
## 4   5 Rachel Zilinskas United States Female     FPRO             4     572.5500
## 5   2  Melanie Mcquaid        Canada Female     FPRO             5     574.5333
## 6  10     Angela Naeth United States Female     FPRO             6     585.6000
##   Overall.Rank Swim.Time Swim.Rank Bike.Time Bike.Rank Run.Time Run.Rank
## 1           11  55.60000        28  295.5000        21 184.1167        7
## 2           13  65.36667       238  292.3500        18 193.8833       16
## 3           16  62.91667       148  304.5333        33 187.6833       11
## 4           20  50.95000        10  311.8667        47 203.3833       30
## 5           21  58.05000        57  305.5333        35 205.7500       35
## 6           28  65.83333       254  306.0333        36 208.1333       38
##   Finish.Status    Location Year
## 1      Finisher Lake Placid 2022
## 2      Finisher Lake Placid 2022
## 3      Finisher Lake Placid 2022
## 4      Finisher Lake Placid 2022
## 5      Finisher Lake Placid 2022
## 6      Finisher Lake Placid 2022
summary(ironman$Overall.Time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   540.4   777.0   857.6   849.3   937.8  1041.4
cleaned_data <- ironman %>% filter(!is.na(Overall.Time) & Overall.Time >= 0)
average_time <- mean(cleaned_data$Overall.Time, na.rm = TRUE)
print(average_time)
## [1] 849.2822
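
One of the stated objectives is to compare individual performances with the average. As a rough sketch of how that comparison could be made, the block below subtracts the printed average (849.2822 minutes) from the overall times of the six athletes shown by `head(ironman)`. The data frame `top_rows` and the two new column names are assumptions made for this example; with the full data, the same two lines could be applied to `cleaned_data` and `average_time` instead.

```r
# The six athletes printed by head(ironman), with overall times in minutes.
top_rows <- data.frame(
  Name = c("Sarah True", "Heather Jackson", "Jodie Robertson",
           "Rachel Zilinskas", "Melanie Mcquaid", "Angela Naeth"),
  Overall.Time = c(540.3667, 556.3833, 562.0333, 572.5500, 574.5333, 585.6000)
)

# Average overall time for the whole field, as printed above.
field_average <- 849.2822

# Negative values mean the athlete finished faster than the field average.
top_rows$Minutes.From.Average <- top_rows$Overall.Time - field_average
top_rows$Percent.From.Average <- 100 * top_rows$Minutes.From.Average / field_average

print(top_rows)
```

All six values come out negative, which is expected: these rows are the professional division, so they say nothing about how a typical participant compares with the average.
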
# Reshape the filtered data to long format: one row per athlete per segment
data_long <- cleaned_data %>%
  select(Overall.Rank, Swim.Time, Bike.Time, Run.Time) %>%
  pivot_longer(cols = c(Swim.Time, Bike.Time, Run.Time), names_to = "Activity", values_to = "Time")
ggplot(data_long, aes(x = Time, y = Overall.Rank, color = Activity)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(values = c("Swim.Time" = "blue", "Bike.Time" = "red", "Run.Time" = "green")) +
  geom_vline(xintercept = average_time, linetype = "dashed", color = "black", linewidth = 1) +
  ggtitle('Time vs. Overall Rank (Swim, Bike, and Run)') +
  xlab('Time (minutes)') +
  ylab('Overall Rank') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(color = guide_legend(title = "Activity")) +
  annotate("text", x = average_time, y = max(ironman$Overall.Rank, na.rm = TRUE), 
           label = paste('Average Time:', round(average_time, 2)), 
           vjust = -1, hjust = 1, color = "black")

Explaining visualization

The resulting visualization is a scatter plot of swim, bike, and run times against overall rank for each participant, with points colored by activity. A dashed vertical line marks the average overall completion time for all participants. The plot helps show how each segment time relates to overall rank and how the times are distributed. Preparing the data involved filtering out rows with missing or negative overall times, reshaping the three segment times into long format, and calculating the average overall time. One surprising pattern is that swim times cluster tightly, while bike and run times show more variability. The main challenges were missing data points, which did not contribute to the overall distribution, and the fact that many different variables influence the competitors' times.
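
The relationships described above can also be checked numerically. The following is a minimal sketch, not the method used in this report: it computes the Spearman rank correlation between each segment time and overall rank, plus two measures of spread per segment. It uses only the six rows printed by `head(ironman)`, so the numbers are purely illustrative; the names `sample_rows`, `rank_cor`, and `spread` are assumptions made for the example, and the full `ironman` data frame could be substituted for `sample_rows`.

```r
# The six rows printed by head(ironman); swap in the full data frame
# to get meaningful results.
sample_rows <- data.frame(
  Overall.Rank = c(11, 13, 16, 20, 21, 28),
  Swim.Time = c(55.60000, 65.36667, 62.91667, 50.95000, 58.05000, 65.83333),
  Bike.Time = c(295.5000, 292.3500, 304.5333, 311.8667, 305.5333, 306.0333),
  Run.Time  = c(184.1167, 193.8833, 187.6833, 203.3833, 205.7500, 208.1333)
)

segments <- c("Swim.Time", "Bike.Time", "Run.Time")

# Spearman correlation of each segment time with overall rank.
# Values near 1 mean slower segment times go with worse overall ranks.
rank_cor <- sapply(segments, function(seg) {
  cor(sample_rows[[seg]], sample_rows$Overall.Rank,
      method = "spearman", use = "complete.obs")
})

# Spread per segment: standard deviation (minutes) and coefficient of
# variation (sd divided by mean), which adjusts for segment length.
spread <- sapply(segments, function(seg) {
  x <- sample_rows[[seg]]
  c(sd = sd(x, na.rm = TRUE),
    cv = sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))
})

print(round(rank_cor, 2))
print(round(spread, 3))
```

The coefficient of variation is included because the swim is much shorter than the bike or run, so its times can look tightly clustered in raw minutes simply because the segment takes less time. Comparing spread relative to the mean gives a fairer test of whether swim times really vary less.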