The following object is masked from 'package:tidyr':
smiths
Comments about the chunks are UNDER the chunk itself
Introduction to the Data
This data set covers the statistics of all the female Ironman competitors at the 2022 event held at Lake Placid, New York, on July 24th. It contains 489 contestants, with qualitative variables describing their country and division. The quantitative variables are the times each competitor took to complete each leg of the race (swimming, biking, and running), along with their overall time, overall rank, and rank in each individual sport. The data came from CoachCox, a triathlon and Ironman training center and program that records competitor data each year.
Load in Data set
# This chunk loads the data from my computer
setwd("C:/Users/tmanh/OneDrive/Documents/college stuff/Data 110")
ironman_fdata22 <- read_csv("ironman_lake_placid_female_2022.csv")
Rows: 489 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Name, Country, Gender, Division, Finish.Status, Location
dbl (11): Bib, Division.Rank, Overall.Time, Overall.Rank, Swim.Time, Swim.Ra...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data set did not require cleaning (yet) and can be used as is. Fun fact: it had ONE single data point that was in the THOUSANDS, so I had to go back to the Excel sheet, do the math, and replace that value, since it was both incorrect and an outlier.
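The same kind of check can also be done in R instead of the spreadsheet. This is only a minimal sketch: the data frame `toy` is made up (it is not the race file), the times are assumed to be in minutes, and the 1000-minute cutoff is an arbitrary choice for illustration.

```r
# Hypothetical toy data, NOT the real race file: times assumed to be in minutes
toy <- data.frame(
  Swim.Time = c(62, 71, 80),
  Bike.Time = c(340, 365, 402),
  Run.Time  = c(215, 250, 3100)  # one implausible entry planted on purpose
)

# Flag every value above an arbitrary cutoff of 1000 minutes
cutoff <- 1000
suspect <- which(toy > cutoff, arr.ind = TRUE)  # matrix with "row" and "col"

suspect_rows <- as.integer(suspect[, "row"])
suspect_cols <- names(toy)[as.integer(suspect[, "col"])]

# Show where the suspicious values sit so they can be checked by hand
print(data.frame(row = suspect_rows, column = suspect_cols))
```

Recording the fix in code (rather than editing the spreadsheet) keeps the correction visible to anyone rerunning the analysis.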
Correlation Showcase
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
# Plot distributions and correlations for columns 8 through 14
# (use c(8, 10, 12, 14) instead to pick individual columns)
pairs.panels(ironman_fdata22[8:14],
             gap = 0,
             pch = 21,
             lm = TRUE)
I decided to make this plot to showcase the correlation between the three sports of the Ironman and overall rank. It gives us some context on just how closely these sports are linked, especially in an event like this. I really like this plot in particular because it shows the correlation of each sport against every other one. For example, the correlation coefficient (r) between bike and swim time is 0.59. Note that this number measures the strength of the relationship; it is not a p-value.
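To make the difference between a correlation coefficient and a p-value concrete, here is a small sketch using base R's `cor.test()`, which reports both. The `swim` and `bike` vectors are simulated stand-ins, not the race data, so the numbers it prints will not match the plot above.

```r
# Simulated stand-in data (NOT the race results), times assumed in minutes
set.seed(110)
swim <- rnorm(200, mean = 80, sd = 12)
bike <- 300 + 1.5 * swim + rnorm(200, sd = 25)

# cor.test() reports the Pearson correlation (r) and a separate p-value
result  <- cor.test(swim, bike, method = "pearson")
r_value <- unname(result$estimate)  # strength and direction, between -1 and 1
p_value <- result$p.value           # evidence against "the true correlation is 0"

cat("r =", round(r_value, 2), " p-value =", signif(p_value, 3), "\n")
```

The value shown in the upper panels of `pairs.panels()` corresponds to `r_value` here, not `p_value`.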
Multiple Linear Regression Model and Analysis
model <- lm(Overall.Time ~ Swim.Time + Bike.Time + Run.Time, data = ironman_fdata22)
summary(model)
Call:
lm(formula = Overall.Time ~ Swim.Time + Bike.Time + Run.Time,
data = ironman_fdata22)
Residuals:
Min 1Q Median 3Q Max
-19.000 -3.239 -0.355 2.592 63.730
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.289154 2.377938 -9.373 <2e-16 ***
Swim.Time 1.046668 0.028263 37.033 <2e-16 ***
Bike.Time 1.072959 0.008348 128.532 <2e-16 ***
Run.Time 1.019555 0.006753 150.974 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.707 on 485 degrees of freedom
Multiple R-squared: 0.9971, Adjusted R-squared: 0.997
F-statistic: 5.496e+04 on 3 and 485 DF, p-value: < 2.2e-16
Summary of the regression model
autoplot(model, 1:4, nrow=2, ncol=2)
Plotted Summary of the regression model
This is a multiple linear regression model that analyzes the relationship between the three sports of the Ironman event and the contestants’ overall time. There technically was an outlier in the data (470), but excluding this data point would upset the data as a whole, so it didn’t seem necessary or correct to remove it. The model has an adjusted R-squared value of 0.997, or 99.7%. This means that the model can explain almost all of the variance in the outcome (Overall.Time). That is expected, since overall time is essentially the sum of the three legs plus transition time, which is also why each coefficient is close to 1.
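One way to check how much a single unusual row matters is to look at Cook's distance and refit the model without that row. The sketch below uses simulated data (not the race file) with one unusual row planted deliberately; the 12-minute transition time and the 120-minute offset are assumptions made purely for illustration.

```r
# Simulated stand-in data (NOT the race results), times assumed in minutes
set.seed(22)
n <- 100
sim <- data.frame(
  Swim.Time = rnorm(n, 80, 12),
  Bike.Time = rnorm(n, 400, 40),
  Run.Time  = rnorm(n, 290, 45)
)

# Overall time = three legs + transition time (assumed ~12 minutes here)
sim$Overall.Time <- with(sim, Swim.Time + Bike.Time + Run.Time) + rnorm(n, 12, 3)
sim$Overall.Time[100] <- sim$Overall.Time[100] + 120  # plant one unusual row

fit_all <- lm(Overall.Time ~ Swim.Time + Bike.Time + Run.Time, data = sim)

# Cook's distance measures how much each row moves the fitted values
cooks   <- cooks.distance(fit_all)
flagged <- unname(which.max(cooks))

# Refit without the flagged row and compare coefficients side by side
fit_drop <- lm(Overall.Time ~ Swim.Time + Bike.Time + Run.Time,
               data = sim[-flagged, ])
print(round(cbind(all_rows = coef(fit_all),
                  without_flagged = coef(fit_drop)), 3))
```

If the two coefficient columns barely differ, keeping the point is easy to justify; if they differ a lot, it is worth reporting both fits.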
Changing the data to a “long” format for the graph
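The chunk that builds the long-format data is not shown here, so the following is only a sketch of one way to do it with `dplyr` and `tidyr`. It assumes the top finishers are chosen by `Overall.Rank` and that the long data uses the column names `discipline` and `time` from the plot; the data frame `toy` and its values are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real data: three finishers, times in minutes
toy <- data.frame(
  Name         = c("A", "B", "C"),
  Overall.Rank = c(2, 1, 3),
  Swim.Time    = c(58, 55, 61),
  Bike.Time    = c(330, 325, 341),
  Run.Time     = c(205, 198, 214)
)

# Keep the best-ranked rows (the report uses 20; 2 here for the toy data)
top_rows <- toy %>%
  arrange(Overall.Rank) %>%
  slice_head(n = 2)

# Wide -> long: one row per athlete per discipline
top_long <- top_rows %>%
  pivot_longer(
    cols      = c(Swim.Time, Bike.Time, Run.Time),
    names_to  = "discipline",
    values_to = "time"
  )

print(top_long)
```

Each athlete now appears three times, once per discipline, which is the shape `ggplot()` needs to draw one violin per discipline.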
ggplot(top_20_long, aes(x = discipline, y = time, fill = discipline)) +
  geom_violin(trim = FALSE) +  # Set trim = TRUE if you want to cut the tails of the distribution
  geom_boxplot(width = 0.1, fill = "white", alpha = 0.3) +
  labs(title = "Distribution of Swim, Bike, and Run Times",
       x = "Discipline", y = "Time (in minutes)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
This is a layered violin and box plot showing the distribution of times for the individual sports. As we can see, bike and run times have the most variance and are spread out fairly evenly, while swim times have the least variance, clustering around 53 minutes.
Histogram Showing the Finishing Time Distributions
ggplot(top_20, aes(x = Overall.Time)) +
  geom_histogram(binwidth = 5, fill = "#385661", color = "grey") +
  labs(title = "Finishing Times Distribution",
       x = "Finishing Time (minutes)", y = "Frequency") +
  theme_minimal()
This was really just for fun in order to see the distribution of the finishing time of the contestants. I was interested in seeing if there were any noticeable trends and it does seem to be skewed right a bit.
Short Essay
This data set was pretty easy and fun to use. It had already been “cleaned” in a sense, and the only thing giving me trouble was the single data point I mentioned earlier. I still had to do some reshaping later on for the layered violin and box plot, turning the data from a “wide” format to a “long” format in order to get the correct values. As for the visualizations, nothing really jumped out as surprising to me. Considering the model’s high R-squared value, it made sense that the visualizations came out the way they did. I did want to create a heat map at first, but I was having a bit of trouble with the correlation matrix, and even after Googling it I wasn’t too confident in the results. Still, I included my attempt below just for fun; hopefully I can get some feedback on that.
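For the correlation heat map mentioned above, one common pattern is to compute the matrix with `cor()`, reshape it to one row per pair of variables, and draw it with `geom_tile()`. This is a sketch on simulated data (not the race file); the names `var1`, `var2`, and `r` and the color choices are arbitrary.

```r
library(ggplot2)

# Simulated stand-in data (NOT the race results), times assumed in minutes
set.seed(1)
sim <- data.frame(
  Swim.Time = rnorm(100, 80, 12),
  Bike.Time = rnorm(100, 400, 40),
  Run.Time  = rnorm(100, 290, 45)
)
sim$Overall.Time <- with(sim, Swim.Time + Bike.Time + Run.Time)

# Square matrix of Pearson correlations between all numeric columns
corr <- cor(sim)

# as.table() + as.data.frame() gives one row per pair of variables
corr_long <- as.data.frame(as.table(corr))
names(corr_long) <- c("var1", "var2", "r")

# One tile per pair, colored by correlation and labeled with its value
heat <- ggplot(corr_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(r, 2))) +
  scale_fill_gradient2(low = "#385661", mid = "white", high = "firebrick",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation Heat Map", x = NULL, y = NULL, fill = "r") +
  theme_minimal()

heat
```

Only numeric columns should be passed to `cor()`, so with the real data the character columns would need to be dropped first.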