This project analyzes the results of the female participants in the 2022 Ironman Lake Placid triathlon. An Ironman triathlon is a long-distance race consisting of three consecutive events: swimming, biking, and running. The dataset contains performance metrics for each athlete who participated in the event. My objective is to explore three things: the distribution of overall completion times and ranks among the participants; the individual times in each segment (swimming, biking, and running) and how they correlate with overall rank; and the average overall completion time and how individual performances compare to it.
The dataset contains the following variables:
Bib: The race number assigned to each participant.
Name: The name of the participant.
Country: The country of origin of the participant.
Gender: The gender of the participant.
Division: The division in which the participant competed.
Division.Rank: The rank of the participant within their division.
Overall.Time: The total time taken by the participant to complete the triathlon (in minutes).
Overall.Rank: The overall rank of the participant among all competitors.
Swim.Time: The time taken by the participant to complete the swimming portion of the triathlon (in minutes).
Swim.Rank: The rank of the participant in the swimming portion.
Bike.Time: The time taken by the participant to complete the biking portion of the triathlon (in minutes).
Bike.Rank: The rank of the participant in the biking portion.
Run.Time: The time taken by the participant to complete the running portion of the triathlon (in minutes).
Run.Rank: The rank of the participant in the running portion.
Finish.Status: The status of the participant at the end of the race (e.g., Finisher).
Location: The location of the event.
Year: The year the event took place.
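Because all times are stored as decimal minutes, they can be hard to read next to published race results. The sketch below shows one way to convert them to an hours:minutes:seconds string. The helper name format_minutes is hypothetical and is not part of the original analysis; the two input values are the winning overall time and the mean overall time reported in this document.

```r
# Hypothetical helper (not part of the original analysis):
# convert decimal minutes to an "H:MM:SS" string.
format_minutes <- function(minutes) {
  total_seconds <- round(minutes * 60)       # work in whole seconds
  hours <- total_seconds %/% 3600
  mins  <- (total_seconds %% 3600) %/% 60
  secs  <- total_seconds %% 60
  sprintf("%d:%02d:%02d", hours, mins, secs)
}

# Winning overall time and mean overall time, both taken from this document.
format_minutes(c(540.3667, 849.2822))
# [1] "9:00:22"  "14:09:17"
```

The function is vectorized, so it could also be applied to a whole column such as Overall.Time.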
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
getwd()
## [1] "C:/Users/dylan/OneDrive/Documents/Data 110 Summer"
setwd("C:/Users/dylan/OneDrive/Documents/Data 110 Summer")
ironman <- read.csv("ironman_lake_placid_female_2022.csv")
head(ironman)
## Bib Name Country Gender Division Division.Rank Overall.Time
## 1 3 Sarah True United States Female FPRO 1 540.3667
## 2 1 Heather Jackson United States Female FPRO 2 556.3833
## 3 8 Jodie Robertson United States Female FPRO 3 562.0333
## 4 5 Rachel Zilinskas United States Female FPRO 4 572.5500
## 5 2 Melanie Mcquaid Canada Female FPRO 5 574.5333
## 6 10 Angela Naeth United States Female FPRO 6 585.6000
## Overall.Rank Swim.Time Swim.Rank Bike.Time Bike.Rank Run.Time Run.Rank
## 1 11 55.60000 28 295.5000 21 184.1167 7
## 2 13 65.36667 238 292.3500 18 193.8833 16
## 3 16 62.91667 148 304.5333 33 187.6833 11
## 4 20 50.95000 10 311.8667 47 203.3833 30
## 5 21 58.05000 57 305.5333 35 205.7500 35
## 6 28 65.83333 254 306.0333 36 208.1333 38
## Finish.Status Location Year
## 1 Finisher Lake Placid 2022
## 2 Finisher Lake Placid 2022
## 3 Finisher Lake Placid 2022
## 4 Finisher Lake Placid 2022
## 5 Finisher Lake Placid 2022
## 6 Finisher Lake Placid 2022
summary(ironman$Overall.Time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 540.4 777.0 857.6 849.3 937.8 1041.4
cleaned_data <- ironman %>% filter(!is.na(Overall.Time) & Overall.Time >= 0)
average_time <- mean(cleaned_data$Overall.Time, na.rm = TRUE)
print(average_time)
## [1] 849.2822
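One of the stated objectives is to compare individual performances to this average. A minimal sketch of that comparison follows, using only the six rows shown in the head(ironman) output and the mean printed above. The column names Diff.From.Avg and Pct.Of.Avg are hypothetical additions, and base R is used here so the example runs without the CSV file or any packages.

```r
# Six rows copied from the head(ironman) output above (times in minutes).
sample_rows <- data.frame(
  Name = c("Sarah True", "Heather Jackson", "Jodie Robertson",
           "Rachel Zilinskas", "Melanie Mcquaid", "Angela Naeth"),
  Overall.Time = c(540.3667, 556.3833, 562.0333, 572.5500, 574.5333, 585.6000)
)

# Mean overall time printed above for the full data set.
average_time <- 849.2822

# Hypothetical columns: minutes relative to the field average
# (negative means faster than average) and time as a percentage of the average.
sample_rows$Diff.From.Avg <- sample_rows$Overall.Time - average_time
sample_rows$Pct.Of.Avg <- round(100 * sample_rows$Overall.Time / average_time, 1)

print(sample_rows)
```

These six athletes are the top professional finishers, so every difference is negative. Applied to the full cleaned_data frame, the same two lines would give the comparison for every participant.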
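# Reshape to long format: one row per athlete per segment.
# Uses cleaned_data so the plot matches the rows used for average_time.
data_long <- cleaned_data %>%
  select(Overall.Rank, Swim.Time, Bike.Time, Run.Time) %>%
  pivot_longer(cols = c(Swim.Time, Bike.Time, Run.Time), names_to = "Activity", values_to = "Time")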
# Scatter plot of segment time against overall rank, colored by segment.
ggplot(data_long, aes(x = Time, y = Overall.Rank, color = Activity)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(values = c("Swim.Time" = "blue", "Bike.Time" = "red", "Run.Time" = "green")) +
  # linewidth replaces size for lines (size was deprecated in ggplot2 3.4.0).
  geom_vline(xintercept = average_time, linetype = "dashed", color = "black", linewidth = 1) +
  ggtitle('Time vs. Overall Rank (Swim, Bike, and Run)') +
  xlab('Time (minutes)') +
  ylab('Overall Rank') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(color = guide_legend(title = "Activity")) +
  annotate("text", x = average_time, y = max(ironman$Overall.Rank, na.rm = TRUE),
           label = paste('Average Time:', round(average_time, 2)),
           vjust = -1, hjust = 1, color = "black")
The resulting visualization is a scatter plot of swim, bike, and run times against overall rank, with one point per participant per segment, colored by activity. A dashed vertical line marks the average overall time for all participants. The plot helps show how the times in each segment are distributed and how they relate to overall rank.

Cleaning the data involved removing rows with missing overall times, reshaping the segment times (swimming, biking, and running) into long format so they could be plotted together, and calculating the average. One surprising pattern is that swim times cluster tightly, while bike and run times show more variability. The main challenges were the missing data points, which did not contribute to the overall distribution, and the fact that multiple variables influence each competitor's times.
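The clustering and correlation patterns described above can also be checked numerically rather than by eye. The sketch below is one possible approach, not the method used in this report: it computes the standard deviation of each segment's times and the Spearman rank correlation between each segment time and Overall.Rank. It uses only the six rows shown in the head(ironman) output, so it runs without the CSV file. Those rows are all top professional finishers, so the numbers it prints are not representative of the full field.

```r
# Six rows copied from the head(ironman) output (times in minutes).
sample_rows <- data.frame(
  Overall.Rank = c(11, 13, 16, 20, 21, 28),
  Swim.Time = c(55.60000, 65.36667, 62.91667, 50.95000, 58.05000, 65.83333),
  Bike.Time = c(295.5000, 292.3500, 304.5333, 311.8667, 305.5333, 306.0333),
  Run.Time  = c(184.1167, 193.8833, 187.6833, 203.3833, 205.7500, 208.1333)
)

segments <- c("Swim.Time", "Bike.Time", "Run.Time")

# Spread of each segment: a smaller standard deviation means tighter clustering.
segment_sd <- sapply(sample_rows[segments], sd)

# Spearman correlation of each segment time with overall rank.
# Spearman is an assumption here; it suits rank data because it only
# depends on the ordering of the values.
rank_cor <- sapply(sample_rows[segments], function(times) {
  cor(times, sample_rows$Overall.Rank, method = "spearman")
})

print(round(segment_sd, 2))
print(round(rank_cor, 2))
```

To run the same check on the full data set, sample_rows could be replaced with cleaned_data, adding use = "complete.obs" to the cor() call if any segment times are missing.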