Project 1 - Women's Ironman 2022

Author

M. Tariq

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(RColorBrewer)
library(readr)
library(tidyr)
library(ggfortify)
library(reshape2)

Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':

    smiths

Comments about each chunk appear UNDER the chunk itself

Introduction to the Data

This data set covers the results of all the female Ironman competitors at one 2022 event, which took place in Lake Placid, New York, on July 24. It contains records for 489 contestants. The qualitative variables describe each contestant's country and division. The quantitative variables are the times each competitor took to complete each leg of the race (swimming, biking, and running), along with their overall time, overall rank, and rank in each individual sport. The data came from CoachCox, a triathlon and Ironman training center and program that records competitor data from each year.

Load in Data set

# this chunk loads the data from my computer
setwd("C:/Users/tmanh/OneDrive/Documents/college stuff/Data 110")
ironman_fdata22 <- read_csv("ironman_lake_placid_female_2022.csv")
Rows: 489 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): Name, Country, Gender, Division, Finish.Status, Location
dbl (11): Bib, Division.Rank, Overall.Time, Overall.Rank, Swim.Time, Swim.Ra...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data set did not require cleaning (yet) and can be used as is. Fun fact: it had ONE SINGLE data point that was in the THOUSANDS, so I had to go back to the Excel sheet, do the math, and replace the value, since it was both incorrect and an outlier.
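A check like this can also be done in R instead of scanning the spreadsheet by eye. The following is only a minimal sketch on an invented data frame: the `times` object, its values, and the 17-hour cutoff used as a sanity limit are all assumptions for illustration, not part of the original file.

```r
# Toy data standing in for the real file; every value here is invented.
times <- data.frame(
  Bib = c(101, 102, 103, 104),
  Overall.Time = c(612.4, 745.0, 7301.2, 688.9)  # minutes; 7301.2 is the planted bad value
)

# Assumed sanity limit: a 17-hour race cutoff expressed in minutes.
cutoff_minutes <- 17 * 60

# Keep only the rows whose time is impossible under that limit.
suspect <- times[times$Overall.Time > cutoff_minutes, ]
print(suspect)
```

Flagging the row in code leaves a record of which value was questioned, which a manual spreadsheet edit does not.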

Correlation Showcase

library(psych)

Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':

    %+%, alpha
pairs.panels(ironman_fdata22[8:14],   # columns 8 to 14: Overall.Rank through Run.Rank (times and ranks)
             gap = 0,
             pch = 21,
             lm = TRUE)

I decided to run this plot to showcase the correlation between the three sports of the Ironman and overall rank. It gives us some context on just how closely these sports are linked, especially in an event like this. I really like this plot in particular because it shows the correlations of all these sports individually against one another; for example, the correlation coefficient (not a p-value) between bike and swim time is 0.59.
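The numbers printed in the upper panels of a pairs plot are correlation coefficients, and a p-value comes from a separate significance test. A small sketch of the difference, using simulated values rather than the race data (the `swim` and `bike` vectors and their parameters are made up):

```r
# Simulated stand-ins for two time columns; not the real race data.
set.seed(1)
swim <- rnorm(50, mean = 80, sd = 10)
bike <- 400 + 1.5 * swim + rnorm(50, sd = 20)

# Pearson correlation coefficient: strength and direction, between -1 and 1.
r <- cor(swim, bike)

# cor.test() reports the same coefficient plus a p-value for H0: correlation = 0.
test <- cor.test(swim, bike)

print(r)
print(test$p.value)
```

The coefficient describes how tightly the two variables move together, while the p-value only says how surprising that coefficient would be if the true correlation were zero.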

Multiple Linear Regression Model and Analysis

model <- lm(Overall.Time ~ Swim.Time + Bike.Time + Run.Time, data = ironman_fdata22)
summary(model)

Call:
lm(formula = Overall.Time ~ Swim.Time + Bike.Time + Run.Time, 
    data = ironman_fdata22)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.000  -3.239  -0.355   2.592  63.730 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22.289154   2.377938  -9.373   <2e-16 ***
Swim.Time     1.046668   0.028263  37.033   <2e-16 ***
Bike.Time     1.072959   0.008348 128.532   <2e-16 ***
Run.Time      1.019555   0.006753 150.974   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.707 on 485 degrees of freedom
Multiple R-squared:  0.9971,    Adjusted R-squared:  0.997 
F-statistic: 5.496e+04 on 3 and 485 DF,  p-value: < 2.2e-16

Summary of the regression model

autoplot(model, 1:4, nrow=2, ncol=2)

Plotted summary of the regression model

This is a multiple linear regression model that analyzes the relationship between the times for the three sports of the Ironman event and the contestants' overall time. The diagnostic plots do flag one outlier (observation 470), but excluding this data point would upset the data as a whole, so it didn't seem necessary or correct to remove it. The model has an adjusted R-squared value of 0.997, or 99.7%. This means that the model can explain almost all of the variance within the outcome (Overall.Time), which is expected, since the overall time is essentially the sum of the three legs plus transition time.
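One way to judge how much a single flagged observation matters is to look at Cook's distance and compare the fit with and without that point. This is only a sketch on invented data: `x`, `y`, and the planted unusual point at position 11 are assumptions, not the race results.

```r
# Deterministic toy data with one planted unusual observation (index 11).
x <- seq(350, 450, length.out = 21)
y <- 100 + 1.1 * x + rep(c(-2, 0, 2), 7)
y[11] <- y[11] + 150

fit_all <- lm(y ~ x)

# Cook's distance measures how much each observation moves the fitted values.
cd <- cooks.distance(fit_all)
most_influential <- unname(which.max(cd))

# Refit without that observation and compare the coefficients side by side.
fit_drop <- lm(y ~ x, subset = -most_influential)
print(rbind(all = coef(fit_all), dropped = coef(fit_drop)))
```

If the coefficients barely change between the two fits, keeping the point is easy to defend; if they change a lot, that is worth reporting either way.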

Violin / Box Plot of Individual Sport Times

top_20 <- ironman_fdata22 %>%
  arrange(Overall.Time) %>%  
  slice_head(n = 20)
print(top_20)
# A tibble: 20 × 17
     Bib Name    Country Gender Division Division.Rank Overall.Time Overall.Rank
   <dbl> <chr>   <chr>   <chr>  <chr>            <dbl>        <dbl>        <dbl>
 1     3 Sarah … United… Female FPRO                 1         540.           11
 2     1 Heathe… United… Female FPRO                 2         556.           13
 3     8 Jodie … United… Female FPRO                 3         562.           16
 4     5 Rachel… United… Female FPRO                 4         573.           20
 5     2 Melani… Canada  Female FPRO                 5         575.           21
 6    10 Angela… United… Female FPRO                 6         586.           28
 7     7 Jessic… United… Female FPRO                 7         591.           33
 8     6 Dede G… United… Female FPRO                 8         593.           35
 9   313 Annama… United… Female F30-34               1         597.           39
10    15 Alexan… United… Female FPRO                 9         598.           40
11     9 Pamela… Canada  Female FPRO                10         610.           51
12   333 Marni … United… Female F40-44               1         625.           67
13    14 Alice … United… Female FPRO                11         629.           70
14   408 Caitli… United… Female F30-34               2         631.           74
15    12 Amy Va… United… Female FPRO                12         635.           79
16   345 Tara M… United… Female F35-39               1         636.           80
17   209 Liz Mi… United… Female F35-39               2         637.           82
18    17 Sarah … United… Female FPRO                13         639.           84
19  1759 Barbar… Germany Female F40-44               2         648.           94
20   412 Rebecc… United… Female F25-29               1         649.           96
# ℹ 9 more variables: Swim.Time <dbl>, Swim.Rank <dbl>, Bike.Time <dbl>,
#   Bike.Rank <dbl>, Run.Time <dbl>, Run.Rank <dbl>, Finish.Status <chr>,
#   Location <chr>, Year <dbl>

This isolates the top 20 contestants by their overall time (or finishing time).

top_20_long <- top_20 %>%
  pivot_longer(cols = c(Swim.Time, Bike.Time, Run.Time), 
               names_to = "discipline", 
               values_to = "time")

Changing the data to a “long” format for the graph
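To see what the reshape does, here is a tiny example on a made-up two-row table (the `wide` data frame and its values are invented for illustration). Each contestant's three time columns become three rows, one per discipline.

```r
library(tidyr)

# Invented wide table: one row per contestant, one column per discipline.
wide <- data.frame(
  Bib = c(1, 2),
  Swim.Time = c(55, 60),
  Bike.Time = c(300, 320),
  Run.Time = c(190, 205)
)

# Long format: one row per contestant-discipline pair.
long <- pivot_longer(wide,
                     cols = c(Swim.Time, Bike.Time, Run.Time),
                     names_to = "discipline",
                     values_to = "time")
print(long)
```

The long table has a single `time` column, which is what lets ggplot map `discipline` to the x-axis and fill.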

ggplot(top_20_long, aes(x = discipline, y = time, fill = discipline)) +
  geom_violin(trim = FALSE) +  # Set trim = TRUE if you want to cut the tails of the distribution
  geom_boxplot(width = 0.1, fill = "white", alpha = 0.3) +  
  labs(title = "Distribution of Swim, Bike, and Run Times",
       x = "Discipline", 
       y = "Time (in minutes)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

This is a layered violin and box plot showing the distribution of the times for the individual sports among the top 20 finishers. As we can see, bike and run times have the most variance and are spread out fairly evenly, while swim times have the least variance, clustering around 53 minutes.
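The visual impression of spread can be backed up with a few summary statistics per discipline. A minimal sketch, assuming a long-format table shaped like the one used for the plot; the numbers below are invented, not the actual top-20 times.

```r
library(dplyr)

# Invented long-format data with the same column names as the plotted table.
long <- data.frame(
  discipline = rep(c("Swim.Time", "Bike.Time", "Run.Time"), each = 3),
  time = c(52, 53, 55, 300, 315, 330, 185, 200, 220)
)

# Median, standard deviation, and interquartile range for each discipline.
spread_by_discipline <- long %>%
  group_by(discipline) %>%
  summarise(median_time = median(time),
            sd_time = sd(time),
            iqr_time = IQR(time))
print(spread_by_discipline)
```

Because the legs differ so much in length, comparing standard deviations alone favors the longer legs; dividing `sd_time` by `median_time` would give a relative measure if that comparison matters.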

Histogram Showing the Finishing Time Distributions

ggplot(top_20, aes(x = Overall.Time)) +
  geom_histogram(binwidth = 5, fill = "#385661", color = "grey") +
  labs(title = "Finishing Times Distribution", x = "Finishing Time (minutes)", y = "Frequency") +
  theme_minimal()

This was really just for fun, to see the distribution of the finishing times of the top 20 contestants. I was interested in seeing whether there were any noticeable trends, and the distribution does seem to be skewed right a bit.

Short Essay

Well, this data set was pretty easy and fun to use. It had already been “cleaned” in a sense, and the only thing giving me trouble was the single data point I mentioned earlier. I still had to do some reshaping later on for the layered violin box plot, turning the data from a “wide” format to a “long” format in order to get the correct values. As for the visualizations, nothing really jumped out as surprising to me. Considering the model's high R-squared value, it made sense that the visualizations came out the way they did. I did want to create a heat map at first, but I was having a bit of trouble with the correlation matrix, and even after Googling it I wasn't too confident in the results. Still, I included my attempt below just for fun; hopefully I can get some feedback on that.

Attempted Heat Map

top_20_subset <- top_20[, c("Swim.Time", "Bike.Time", "Run.Time", "Overall.Time")]
cor_matrix <- cor(top_20_subset, use = "complete.obs")
cor_matrix_melt <- melt(cor_matrix)

Finding the correlation between the sports

ggplot(cor_matrix_melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limits = c(-1, 1), space = "Lab", 
                       name = "Correlation") +
  theme_minimal() + 
  labs(title = "Correlation Heatmap: Swim, Bike, Run, and Finish Times", 
       x = "Discipline", y = "Discipline") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1))
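One possible refinement is to print the correlation value inside each tile so the colors can be checked against numbers. The sketch below uses simulated data (the `toy` data frame is invented) and base R's `as.table()` in place of `melt()`, which produces the same long layout for a matrix.

```r
library(ggplot2)

# Simulated stand-in for the time columns; not the real race data.
set.seed(7)
toy <- data.frame(Swim.Time = rnorm(20, mean = 55, sd = 3))
toy$Bike.Time <- 300 + 0.8 * toy$Swim.Time + rnorm(20, sd = 10)
toy$Run.Time <- 200 + rnorm(20, sd = 12)

cor_matrix <- cor(toy)

# One row per matrix cell: row name, column name, correlation value.
cor_long <- as.data.frame(as.table(cor_matrix))
names(cor_long) <- c("Var1", "Var2", "value")

heat <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2))) +  # numeric label on every tile
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  theme_minimal()
print(heat)
```

With the labels in place, the diagonal should read 1 and the matrix should be symmetric, which is a quick way to confirm the correlation matrix was built correctly.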