1. Athletics needs a new breed of scouts and managers
Athletics goes back to the original Olympics. Since then, little has changed. Athletes compete as individuals, seeking to throw the farthest, jump the farthest (or highest) and run the fastest. But people like cheering for teams, waving banners and yelling like mad during matches, wearing their favorite player’s jerseys and staying loyal to their side through thick and thin.
What if athletics were a team sport? It could be more interesting and would give us a new set of sports analytics to discuss. We might even reduce the incentives to do unsavory things in the pursuit of altius, fortius and citius.
This dataset contains results from American athletes in the horizontal jumps (triple jump and long jump) and throws (shot put, discus, javelin, hammer and weight). Let’s read that in and examine women’s javelin.
# Load the tidyverse package
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import the full dataset
data <- read_csv("~/Desktop/Scout your Athletics Fantasy Team/datasets/athletics.csv")
## Rows: 2098 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Event, Male_Female, Athlete
## dbl (7): EventID, Flight1, Flight2, Flight3, Flight4, Flight5, Flight6
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Select the results of interest: women's javelin
javelin <- data %>%
filter(Male_Female == "Female" & Event == "Javelin") %>%
select(-Male_Female, -Event)
# Give yourself a snapshot of your data
head(javelin)
## # A tibble: 6 × 8
## EventID Athlete Flight1 Flight2 Flight3 Flight4 Flight5 Flight6
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8 Brittany Borman 54.0 51.2 57.3 52.6 57.0 60.9
## 2 8 Ariana Ince 49.0 54.8 53.6 55.1 55.3 56.7
## 3 8 Kara Patterson 50.1 52.1 0 50.8 55.9 54.6
## 4 8 Kimberley Hamilton 48.0 0 50.9 54.1 55.2 53.3
## 5 8 Laura Loht 44.4 53.8 50.6 54.2 0 49.0
## 6 8 Brianna Bain 49.3 0 51.3 0 48.6 53.0
summary(javelin)
## EventID Athlete Flight1 Flight2
## Min. : 8.0 Length:178 Min. : 0.00 Min. : 0.00
## 1st Qu.: 178.0 Class :character 1st Qu.:41.53 1st Qu.:40.23
## Median : 511.0 Mode :character Median :48.85 Median :48.85
## Mean : 796.8 Mean :40.80 Mean :39.87
## 3rd Qu.:1703.0 3rd Qu.:53.20 3rd Qu.:53.07
## Max. :1859.0 Max. :64.94 Max. :61.38
## Flight3 Flight4 Flight5 Flight6
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.:40.57 1st Qu.: 0.00 1st Qu.: 0.00
## Median :47.34 Median :49.30 Median :48.01 Median :46.80
## Mean :34.22 Mean :39.37 Mean :32.97 Mean :34.82
## 3rd Qu.:52.08 3rd Qu.:52.10 3rd Qu.:51.44 3rd Qu.:52.44
## Max. :62.42 Max. :61.56 Max. :60.84 Max. :64.45
2. Managers love tidy data
This view shows each athlete’s results at individual track meets. Athletes have six throws, but in these meets only one – their longest – actually matters. If all we wanted to do was talk regular track and field, we would have a very easy task: create a new column taking the max of each row, arrange the data frame by that column in descending order and we’d be done.
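For contrast, the traditional ranking described above takes only a few lines. This is a minimal sketch on a made-up three-athlete table; the `toy` data frame and the `Best` column are illustrative assumptions, not part of the dataset, and the code uses base R so it runs on its own.

```r
# Hypothetical mini-results table: three athletes, three throws each (0 = foul)
toy <- data.frame(
  Athlete = c("A", "B", "C"),
  Flight1 = c(54.0, 49.0, 50.1),
  Flight2 = c(51.2, 54.8, 52.1),
  Flight3 = c(57.3, 53.6, 0)
)

# Best throw per row: the only throw that counts in a traditional meet
flight_cols <- grep("^Flight", names(toy), value = TRUE)
toy$Best <- apply(toy[, flight_cols], 1, max)

# Rank athletes by their best throw, longest first
ranked <- toy[order(toy$Best, decreasing = TRUE), ]
ranked
```

Everything except the single best mark is discarded here, which is exactly the information our managers cannot afford to lose.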
But our managers need to do and know much more than that! This is a sport of strategy, where every throw matters. Managers need a deeper analysis to choose their teams, craft their plan and make decisions on match-day.
We first need to make this standard “wide” view tidy data. We’re not completely done with the wide view, but the tidy data will allow us to compute our summary statistics.
# Assign the tidy data to javelin_long
javelin_long <- javelin %>% gather(Flight1:Flight6, key = "Flight", value = "Distance")
# Make Flight a numeric
javelin_long$Flight <- as.numeric(gsub("Flight", "", javelin_long$Flight))
# Examine the first 6 rows
head(javelin_long)
## # A tibble: 6 × 4
## EventID Athlete Flight Distance
## <dbl> <chr> <dbl> <dbl>
## 1 8 Brittany Borman 1 54.0
## 2 8 Ariana Ince 1 49.0
## 3 8 Kara Patterson 1 50.1
## 4 8 Kimberley Hamilton 1 48.0
## 5 8 Laura Loht 1 44.4
## 6 8 Brianna Bain 1 49.3
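`gather()` still works, but the tidyr documentation marks it as superseded in favor of `pivot_longer()`. The following is a hedged sketch of the equivalent reshape on a two-row stand-in for `javelin`; the `toy` values are the rounded figures displayed in the earlier snapshot, used only for illustration. Note that `pivot_longer()` returns rows grouped by athlete rather than by flight, so `head()` would print in a different order than the output shown above.

```r
library(tidyr)

# Hypothetical two-row stand-in for the wide javelin table
toy <- data.frame(
  EventID = c(8, 8),
  Athlete = c("Brittany Borman", "Ariana Ince"),
  Flight1 = c(54.0, 49.0),
  Flight2 = c(51.2, 54.8),
  Flight3 = c(57.3, 53.6)
)

# Reshape to one row per throw; strip the "Flight" prefix and
# convert the flight number to an integer in the same step
toy_long <- pivot_longer(
  toy,
  cols = Flight1:Flight3,
  names_to = "Flight",
  names_prefix = "Flight",
  names_transform = list(Flight = as.integer),
  values_to = "Distance"
)
toy_long
```

Handling the prefix inside the reshape removes the need for the separate `gsub()` step.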
3. Every throw matters
A throw is a foul if the athlete commits a technical violation during the throw. In javelin, the most common foul is stepping over the release line. Traditionally, the throw is scored as an “F” and it has no further significance. Athletes can also choose to pass on a throw – scored as a “P” – if they are content with their earlier throws and want to “save themselves” for later.
Remember when we said every throw matters? Here, the goal is not for each player to have one great throw. All their throws in each event are summed together, and the team with the highest total distance wins the point. Fouls are scored as 0 and passing, well, your manager and teammates would not be pleased.
Here, we examine which athletes cover the most distance in each of their meets, along with two ways to talk about their consistency.
javelin_totals <- javelin_long %>%
filter(Distance > 0) %>%
group_by(Athlete, EventID) %>%
summarize(TotalDistance = sum(Distance), StandardDev = round(sd(Distance),3), Success = n())
## `summarise()` has grouped output by 'Athlete'. You can override using the
## `.groups` argument.
javelin_totals[60:70,]
## # A tibble: 11 × 5
## # Groups: Athlete [6]
## Athlete EventID TotalDistance StandardDev Success
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Emma Fitzgerald 176 89.4 1.97 2
## 2 Emma Fitzgerald 815 217. 1.26 5
## 3 Erma Gene Evans 247 197. 1.81 4
## 4 Fawn Miller 20 227. 0.317 4
## 5 Fawn Miller 1575 149. 1.27 3
## 6 Gabby Kearney 176 135. 2.75 3
## 7 Gabby Kearney 815 241. 1.48 5
## 8 Grace Zollman 178 180. 0.731 4
## 9 Grace Zollman 180 236. 3.20 5
## 10 Grace Zollman 247 196. 3.04 4
## 11 Haley Crouser 938 312. 2.30 6
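To make the three summary columns concrete, here is a small worked example for a single athlete at a single meet. It uses Kara Patterson's six throws at meet 8 as displayed (rounded) in the first snapshot; the variable names are illustrative assumptions.

```r
# Six throws as displayed in the snapshot (rounded); 0 = foul
throws <- c(50.1, 52.1, 0, 50.8, 55.9, 54.6)

# Drop fouls, mirroring filter(Distance > 0)
legal <- throws[throws > 0]

total_distance <- sum(legal)        # distance the athlete contributes to the team
spread <- round(sd(legal), 3)       # consistency measure 1: spread of legal throws
success <- length(legal)            # consistency measure 2: number of legal throws

c(TotalDistance = total_distance, StandardDev = spread, Success = success)
```

Because fouls are scored as 0, dropping them leaves the total unchanged; the filter matters only for the standard deviation and the count. An athlete with a single legal throw gets `NA` for the standard deviation, since `sd()` needs at least two values.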
4. Find the clutch performers
In many traditional track meets, after the first three throws the leaders in the field are whittled down to the top eight (sometimes more, sometimes fewer) athletes. Like the meet overall, this is solely based on their best throw of those first three.
We give the choice to the managers. Of the three athletes who start each event, the manager chooses the two who will continue on for the last three throws. The manager will need to know which players tend to come alive – or at least maintain their form – in the late stages of a match. They also need to know if a player’s first three throws are consistent with their playing history. Otherwise, they could make a poor decision about who stays in based only on the sample unfolding in front of them.
For now, let’s examine just our top-line stat – total distance covered – for differences between early and late stages of the match.
javelin <- javelin %>%
mutate(early = Flight1 + Flight2 + Flight3, late = Flight4 + Flight5 + Flight6, diff = late - early)
# Examine the last ten rows
tail(javelin, 10)
## # A tibble: 10 × 11
## EventID Athlete Flight1 Flight2 Flight3 Flight4 Flight5 Flight6 early late
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1773 Melissa … 47.6 48.7 47.5 0 0 45.6 144. 45.6
## 2 1773 Kaelyn C… 43.4 44.9 40 43.2 40.3 40.6 128. 124.
## 3 1859 Kara Win… 56.9 52.9 55.5 54.4 57.6 62.9 165. 175.
## 4 1859 Avione A… 56.5 0 54.4 51.6 54.3 0 111. 106.
## 5 1859 Ariana I… 51.9 53.5 52.4 56.0 55.2 0 158. 111.
## 6 1859 Bethany … 49.9 51.0 54.2 0 50.6 0 155. 50.6
## 7 1859 Alyssa O… 0 53.7 52.1 51.5 0 52.8 106. 104.
## 8 1859 Dominiqu… 49.6 44.2 50.6 51.3 49.2 53.2 144. 154.
## 9 1859 Kristen … 47.2 50.9 0 48.2 49.3 49.6 98.1 147.
## 10 1859 Rebekah … 48.8 0 50.4 48.2 0 46.6 99.2 94.9
## # ℹ 1 more variable: diff <dbl>
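As a quick check of the arithmetic, the early/late split for one row can be reproduced by hand. The sketch below uses the third row of the output above, with values as displayed (rounded); the vector and variable names are illustrative assumptions.

```r
# Row 3 of the output above, values as displayed (rounded)
flights <- c(Flight1 = 56.9, Flight2 = 52.9, Flight3 = 55.5,
             Flight4 = 54.4, Flight5 = 57.6, Flight6 = 62.9)

early <- sum(flights[c("Flight1", "Flight2", "Flight3")])
late  <- sum(flights[c("Flight4", "Flight5", "Flight6")])

# Positive values mean the athlete covered more distance late in the match
late_minus_early <- late - early
late_minus_early
```

Since fouls count as 0, one late foul can swing this statistic heavily: the first row above drops from roughly 144 early to 45.6 late on the back of two fouls.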
5. Pull the pieces together for a new look at the athletes
The aggregate stats are in two data frames. By joining the two together, we can take our first rough look at how the athletes compare.
javelin_totals <- javelin_totals %>%
left_join(javelin, by=c("EventID", "Athlete")) %>%
select(Athlete, TotalDistance, StandardDev, Success, diff)
# Examine the first ten rows
head(javelin_totals, 10)
## # A tibble: 10 × 5
## # Groups: Athlete [4]
## Athlete TotalDistance StandardDev Success diff
## <chr> <dbl> <dbl> <int> <dbl>
## 1 Abigail Gomez 152. 1.23 3 -52.9
## 2 Abigail Gomez 244. 1.63 5 -48
## 3 Abigail Gomez 207. 2.97 4 -110.
## 4 Abigail Gomez 222. 1.30 4 -3.11
## 5 Abigail Gomez 155. 1.03 3 53.4
## 6 Abigail Gomez Hernandez 135. 0.718 3 45.6
## 7 Alicia DeShasier 270. 2.15 5 60.0
## 8 Alicia DeShasier 320. 2.26 6 0.740
## 9 Alicia DeShasier 275. 1.53 5 53.5
## 10 Allison Updike 147. 3.84 3 -46.6
6. Normalize the data to compare across stats
The four summary statistics - total distance, standard deviation, number of successful throws and our measure of early vs. late - are on different scales and measure very different things. Managers need to be able to compare these to each other and then weigh them based on what is most important to their vision and strategy for the team. A simple normalization will allow for these comparisons.
norm <- function(result) {
(result - min(result)) / (max(result) - min(result))
}
aggstats <- c("TotalDistance", "StandardDev", "Success", "diff")
javelin_norm <- javelin_totals %>%
ungroup() %>%
mutate_at(aggstats, norm) %>%
group_by(Athlete) %>%
summarize_all(mean)
head(javelin_norm)
## # A tibble: 6 × 5
## Athlete TotalDistance StandardDev Success diff
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abigail Gomez 0.446 0.268 0.45 0.383
## 2 Abigail Gomez Hernandez 0.244 0.115 0.25 0.720
## 3 Alicia DeShasier 0.753 0.327 0.833 0.687
## 4 Allison Updike 0.283 0.639 0.25 0.320
## 5 Alyssa Olin 0.469 0.250 0.5 0.309
## 6 Ariana Ince 0.660 0.342 0.692 0.446
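The `norm()` function above is a min-max rescaling: the smallest value maps to 0, the largest to 1, and everything else falls in between. The sketch below shows it on toy vectors, along with the one edge case worth knowing about. `norm_safe()` is a hypothetical guarded variant added for illustration; it is not part of the original analysis.

```r
norm <- function(result) {
  (result - min(result)) / (max(result) - min(result))
}

norm(c(2, 4, 6))   # 0.0 0.5 1.0
norm(c(5, 5, 5))   # NaN NaN NaN, because max - min is 0

# Hypothetical guard: return 0 for every element when all values are equal
norm_safe <- function(result) {
  span <- max(result) - min(result)
  if (span == 0) {
    return(rep(0, length(result)))
  }
  (result - min(result)) / span
}

norm_safe(c(5, 5, 5))   # 0 0 0
```

In the pipeline above, the rescaling is applied across all athlete-meet rows first and only then averaged per athlete, so each athlete's score is relative to the whole pool.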
7. What matters most when building your squad?
Managers have to decide what kind of players they want on their team - who matches their vision, who has the skills they need to play their style of athletics and - ultimately - who will deliver the wins. A risk-averse manager will want players who rarely foul. The steely-eyed manager will want the players who can deliver the win with their final throws.
Like any other sport (or profession), rarely will any one player be equally strong in all areas. Managers have to make trade-offs in selecting their teams. Our first batch of managers have the added disadvantage of selecting players based on data from a related but distinct sport. Our data comes from traditional track and field meets, where the motivations and goals are much different than our own.
This is why managers make the big money and get sacked when results go south.
weights <- c(2.1, 4.9, .5, 2.5)
javelin_team <- javelin_norm %>%
mutate(TotalScore = TotalDistance * weights[1] + StandardDev * weights[2] + Success * weights[3] + diff * weights[4]) %>%
arrange(desc(TotalScore)) %>%
slice(1:5) %>%
select(Athlete, TotalScore)
javelin_team
## # A tibble: 5 × 2
## Athlete TotalScore
## <chr> <dbl>
## 1 Asia Easley 7.09
## 2 Madalaine Stulce 6.87
## 3 Dominique Ouellette 5.86
## 4 Laura Loht 5.76
## 5 Tairyn Montgomery 5.74
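`TotalScore` is a weighted sum of the four normalized statistics. As a worked example, the sketch below applies the same weights to Alicia DeShasier's normalized row as displayed (rounded) in the earlier output; naming the vector elements is an illustrative choice, not something the original code does.

```r
weights <- c(TotalDistance = 2.1, StandardDev = 4.9, Success = 0.5, diff = 2.5)

# Normalized stats as displayed (rounded) in the javelin_norm output
stats <- c(TotalDistance = 0.753, StandardDev = 0.327, Success = 0.833, diff = 0.687)

# Element-wise product, then sum: 1.5813 + 1.6023 + 0.4165 + 1.7175
total_score <- sum(stats * weights)
total_score   # 5.3176
```

One thing to keep in mind when picking weights: `norm()` maps the largest standard deviation in the pool to 1, so a positive weight on `StandardDev` rewards athletes whose throws vary more. A manager who values consistency could, as one option, score `1 - StandardDev` instead; that would be a strategic choice, not something the analysis above does.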
8. Get to know your players
The data has spoken! Now we have our five javelin throwers but we still don’t really know them. The javelin_totals data frame has the data that went into the decision process, so we will pull that up. This gives us an idea of what they each bring to the team.
We can also take a look at how they compare to the pool of athletes we started from by taking the mean and maximum of each statistic.
team_stats <- javelin_totals %>%
filter(Athlete %in% javelin_team$Athlete)%>%
summarize_all(mean)
pool_stats <- data.frame(do.call('cbind', sapply(javelin_totals, function(x) if(is.numeric(x)) c(max(x), mean(x)))))
pool_stats$MaxAve <- c("Maximum", "Average")
pool_stats <- pool_stats %>%
gather(key="Statistic", value="Aggregate", -MaxAve)
# Examine team stats
team_stats
## # A tibble: 5 × 5
## Athlete TotalDistance StandardDev Success diff
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Asia Easley 188. 4.78 4.5 64.1
## 2 Dominique Ouellette 299. 2.91 6 2.88
## 3 Laura Loht 252. 3.98 5 -45.6
## 4 Madalaine Stulce 275. 4.47 6 -6.42
## 5 Tairyn Montgomery 235. 2.87 5 48.2
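The `pool_stats` one-liner above is dense, so here is a more explicit sketch of the same idea: compute the maximum and the mean of every numeric column. It runs on a made-up stand-in for `javelin_totals` (the `toy_totals` values are illustrative) and stops at the summary matrix rather than reshaping it for plotting.

```r
# Hypothetical stand-in for javelin_totals
toy_totals <- data.frame(
  Athlete = c("A", "B", "C"),
  TotalDistance = c(152, 244, 320),
  Success = c(3, 5, 6)
)

# Keep only the numeric columns, then summarize each one
numeric_cols <- names(toy_totals)[sapply(toy_totals, is.numeric)]
pool <- sapply(
  toy_totals[numeric_cols],
  function(x) c(Maximum = max(x), Average = mean(x))
)
pool
```

The result is a matrix with one row per aggregate and one column per statistic, which is the shape the `gather()` call above then turns into long form.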
9. Make your case to the front office
The manager knows what she wants out of the team and has the data to support her choices, but she still needs to defend her decisions to the team owners. They do write the checks, after all.
The owners are busy people. Many of them work other jobs and own other companies. They trust their managers, so as long as the manager can give them an easy-to-digest visual presentation of why they should sign these five athletes out of all the others, they will approve.
A series of plots showing how each athlete compares to the maximum and the average of each statistic will be enough for them.
p <- team_stats %>%
gather(key = "Statistic", value = "Aggregate", -Athlete) %>%
ggplot(aes(x=Athlete, y=Aggregate, fill=Athlete)) +
geom_bar(stat="identity", position="dodge")+
facet_wrap(vars(Statistic), scales="free_y") +
geom_hline(data=pool_stats, aes(yintercept=Aggregate, group=Statistic, color=MaxAve), linewidth=1) +
# labs(title=".... Your Team Name....: Women's Javelin", color="Athlete pool maximum / average") +
scale_fill_hue(l=70) +
scale_color_hue(l=20) +
theme_minimal() +
theme(axis.text.x=element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank())
p
10. Time to throw down
Before the athletics season opens, you, as manager, will perform similar analyses for the other throws, the jumps and the running events. Then you’ll game out different permutations of your team and your opponent’s to come up with the best lineup and make the best decisions on match day. For now, since it’s what we know best and we’re almost out of time, let’s simulate a simple javelin match.
The winner is the team that throws the highest combined distance: six throws from each of your three players against six throws from each of the opponent’s three players.
home <- c(1,3,5)
away <- sample(1:nrow(javelin_totals), 3, replace=FALSE)
HomeTeam <- round(sum(team_stats$TotalDistance[home]),2)
AwayTeam <- round(sum(javelin_totals$TotalDistance[away]),2)
print(paste0("Javelin match, Final Score: ", HomeTeam, " - ", AwayTeam))
## [1] "Javelin match, Final Score: 674.59 - 570.65"
print(if (HomeTeam > AwayTeam) "Win!" else "Sometimes you just have to take the L.")
## [1] "Win!"
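A single simulated match says little about how strong the lineup is, because the away team is drawn at random. One way to get a steadier read is to repeat the draw many times and look at the share of wins. The sketch below is a hedged illustration only: it fixes an arbitrary seed, reuses the home score printed above, and draws opponents from the eleven per-meet totals displayed (rounded) in the earlier `javelin_totals` output rather than from the full pool.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

home_total <- 674.59  # home score from the match above

# Per-meet totals as displayed (rounded) in rows 60-70 of javelin_totals
pool <- c(89.4, 217, 197, 227, 149, 135, 241, 180, 236, 196, 312)

# Draw 1000 random three-athlete away teams and total their distances
away_totals <- replicate(1000, sum(sample(pool, 3, replace = FALSE)))

# Share of simulated matches the home team wins
win_rate <- mean(home_total > away_totals)
win_rate
```

With the full `javelin_totals` table in place of `pool`, the same few lines would give a rough win probability for any candidate lineup.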