1. Athletics needs a new breed of scouts and managers

Athletics goes back to the original Olympics. Since then, little has changed. Athletes compete as individuals, seeking to throw the farthest, jump the farthest (or highest) and run the fastest. But people like cheering for teams, waving banners and yelling like mad during matches, wearing their favorite player’s jerseys and staying loyal to their side through thick and thin.

What if athletics were a team sport? It could be more interesting and would give us a new set of sports analytics to discuss. We might even reduce the incentives to do unsavory things in the pursuit of altius, fortius and citius.

This dataset contains results from American athletes in the horizontal jumps (triple jump and long jump) and throws (shot put, discus, javelin, hammer and weight). Let’s read that in and examine women’s javelin.

# Load the tidyverse package
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import the full dataset
data <- read_csv("~/Desktop/Scout your Athletics Fantasy Team/datasets/athletics.csv")
## Rows: 2098 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Event, Male_Female, Athlete
## dbl (7): EventID, Flight1, Flight2, Flight3, Flight4, Flight5, Flight6
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Select the results of interest: women's javelin
javelin <- data %>%
    filter(Male_Female == "Female" & Event == "Javelin") %>%
    select(-Male_Female, -Event)
# Give yourself a snapshot of your data 
head(javelin)
## # A tibble: 6 × 8
##   EventID Athlete            Flight1 Flight2 Flight3 Flight4 Flight5 Flight6
##     <dbl> <chr>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1       8 Brittany Borman       54.0    51.2    57.3    52.6    57.0    60.9
## 2       8 Ariana Ince           49.0    54.8    53.6    55.1    55.3    56.7
## 3       8 Kara Patterson        50.1    52.1     0      50.8    55.9    54.6
## 4       8 Kimberley Hamilton    48.0     0      50.9    54.1    55.2    53.3
## 5       8 Laura Loht            44.4    53.8    50.6    54.2     0      49.0
## 6       8 Brianna Bain          49.3     0      51.3     0      48.6    53.0
summary(javelin)
##     EventID         Athlete             Flight1         Flight2     
##  Min.   :   8.0   Length:178         Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 178.0   Class :character   1st Qu.:41.53   1st Qu.:40.23  
##  Median : 511.0   Mode  :character   Median :48.85   Median :48.85  
##  Mean   : 796.8                      Mean   :40.80   Mean   :39.87  
##  3rd Qu.:1703.0                      3rd Qu.:53.20   3rd Qu.:53.07  
##  Max.   :1859.0                      Max.   :64.94   Max.   :61.38  
##     Flight3         Flight4         Flight5         Flight6     
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.00   1st Qu.:40.57   1st Qu.: 0.00   1st Qu.: 0.00  
##  Median :47.34   Median :49.30   Median :48.01   Median :46.80  
##  Mean   :34.22   Mean   :39.37   Mean   :32.97   Mean   :34.82  
##  3rd Qu.:52.08   3rd Qu.:52.10   3rd Qu.:51.44   3rd Qu.:52.44  
##  Max.   :62.42   Max.   :61.56   Max.   :60.84   Max.   :64.45

2. Managers love tidy data

This view shows each athlete’s results at individual track meets. Athletes have six throws, but in these meets only one – their longest – actually matters. If all we wanted to do was talk regular track and field, we would have a very easy task: create a new column taking the max of each row, arrange the data frame by that column in descending order and we’d be done.
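
For comparison, that traditional ranking could be sketched roughly as follows. The toy data frame `toy_meet` and the column name `Best` are made up for illustration; the distances are borrowed from the first three flights of three athletes in the printout above.

```r
library(dplyr)

# Toy stand-in for the wide javelin table (three athletes, three flights only)
toy_meet <- data.frame(
  Athlete = c("Kara Patterson", "Brittany Borman", "Ariana Ince"),
  Flight1 = c(50.1, 54.0, 49.0),
  Flight2 = c(52.1, 51.2, 54.8),
  Flight3 = c(0, 57.3, 53.6)
)

# pmax() takes the maximum across the flight columns, row by row
toy_ranked <- toy_meet %>%
  mutate(Best = pmax(Flight1, Flight2, Flight3)) %>%
  arrange(desc(Best))

toy_ranked
```

On the full data, all six flight columns would go into `pmax()`.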

But our managers need to do and know much more than that! This is a sport of strategy, where every throw matters. Managers need a deeper analysis to choose their teams, craft their plan and make decisions on match-day.

We first need to make this standard “wide” view tidy data. We’re not completely done with the wide view, but the tidy data will allow us to compute our summary statistics.

# Assign the tidy data to javelin_long
javelin_long <- javelin %>% gather(Flight1:Flight6, key = "Flight", value = "Distance")
# Make Flight a numeric
javelin_long$Flight <- as.numeric(gsub("Flight", "", javelin_long$Flight))
# Examine the first 6 rows
head(javelin_long)
## # A tibble: 6 × 4
##   EventID Athlete            Flight Distance
##     <dbl> <chr>               <dbl>    <dbl>
## 1       8 Brittany Borman         1     54.0
## 2       8 Ariana Ince             1     49.0
## 3       8 Kara Patterson          1     50.1
## 4       8 Kimberley Hamilton      1     48.0
## 5       8 Laura Loht              1     44.4
## 6       8 Brianna Bain            1     49.3
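
As an aside, tidyr now recommends `pivot_longer()` over `gather()`. A minimal sketch of the same reshape on a made-up two-athlete table (`toy_wide` is hypothetical; only two flights are shown) might look like this. The final `arrange()` is there only to reproduce the flight-by-flight row order that `gather()` gives.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide javelin table
toy_wide <- data.frame(
  EventID = c(8, 8),
  Athlete = c("Brittany Borman", "Ariana Ince"),
  Flight1 = c(54.0, 49.0),
  Flight2 = c(51.2, 54.8)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = Flight1:Flight2,
    names_to = "Flight",
    names_prefix = "Flight",                      # strip the "Flight" text
    names_transform = list(Flight = as.numeric),  # "1" -> 1
    values_to = "Distance"
  ) %>%
  arrange(Flight)  # pivot_longer() goes athlete by athlete; reorder by flight

toy_long
```

This folds the separate "make Flight numeric" step into the reshape itself.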

3. Every throw matters

A throw is a foul if the athlete commits a technical violation during the throw. In javelin, the most common foul is stepping over the release line. Traditionally, the throw is scored as an “F” and it has no further significance. Athletes can also choose to pass on a throw – scored as a “P” – if they are content with their earlier throws and want to “save themselves” for later.

Remember when we said every throw matters? Here, the goal is not for each player to have one great throw. All their throws in each event are summed together, and the team with the highest total distance wins the point. Fouls are scored as 0 and passing, well, your manager and teammates would not be pleased.

Here, we examine which athletes cover the most distance in each of their meets, along with two ways to talk about their consistency.

javelin_totals <- javelin_long %>% 
  filter(Distance > 0) %>%
  group_by(Athlete, EventID) %>%
  summarize(TotalDistance = sum(Distance), StandardDev = round(sd(Distance),3), Success = n())
## `summarise()` has grouped output by 'Athlete'. You can override using the
## `.groups` argument.
javelin_totals[60:70,]
## # A tibble: 11 × 5
## # Groups:   Athlete [6]
##    Athlete         EventID TotalDistance StandardDev Success
##    <chr>             <dbl>         <dbl>       <dbl>   <int>
##  1 Emma Fitzgerald     176          89.4       1.97        2
##  2 Emma Fitzgerald     815         217.        1.26        5
##  3 Erma Gene Evans     247         197.        1.81        4
##  4 Fawn Miller          20         227.        0.317       4
##  5 Fawn Miller        1575         149.        1.27        3
##  6 Gabby Kearney       176         135.        2.75        3
##  7 Gabby Kearney       815         241.        1.48        5
##  8 Grace Zollman       178         180.        0.731       4
##  9 Grace Zollman       180         236.        3.20        5
## 10 Grace Zollman       247         196.        3.04        4
## 11 Haley Crouser       938         312.        2.30        6
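
To make the three statistics concrete, here is the same calculation by hand for a single row, using Kara Patterson's six throws at EventID 8 as they appear (rounded) in the first printout. The variable names are made up for this sketch.

```r
# Kara Patterson's six throws at EventID 8, as rounded in the printout
throws <- c(50.1, 52.1, 0, 50.8, 55.9, 54.6)

legal <- throws[throws > 0]     # drop the foul, which is scored as 0

total_distance <- sum(legal)    # 263.5; the 0 would not have added anything
std_dev <- round(sd(legal), 3)  # spread of the legal throws only
success <- length(legal)        # 5 legal throws out of 6

c(TotalDistance = total_distance, StandardDev = std_dev, Success = success)
```

Filtering out the zeros leaves the total unchanged, but it matters for the other two columns: with the foul included, the standard deviation would be inflated and the count would always be six. Note also that `sd()` returns `NA` when an athlete has only one legal throw in a meet.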

4. Find the clutch performers

In many traditional track meets, after the first three throws the leaders in the field are whittled down to the top eight (sometimes more, sometimes fewer) athletes. Like the meet overall, this is solely based on their best throw of those first three.

We give the choice to the managers. Of the three athletes who start each event, the manager chooses the two who will continue on for the last three throws. The manager will need to know which players tend to come alive – or at least maintain their form – in the late stages of a match. They also need to know if a player’s first three throws are consistent with their playing history. Otherwise, they could make a poor decision about who stays in based only on the sample unfolding in front of them.

For now, let’s examine just our top-line stat – total distance covered – for differences between early and late stages of the match.

javelin <- javelin %>% 
    mutate(early = Flight1 + Flight2 + Flight3, late = Flight4 + Flight5 + Flight6, diff = late - early)
# Examine the last ten rows
tail(javelin, 10)
## # A tibble: 10 × 11
##    EventID Athlete   Flight1 Flight2 Flight3 Flight4 Flight5 Flight6 early  late
##      <dbl> <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl> <dbl>
##  1    1773 Melissa …    47.6    48.7    47.5     0       0      45.6 144.   45.6
##  2    1773 Kaelyn C…    43.4    44.9    40      43.2    40.3    40.6 128.  124. 
##  3    1859 Kara Win…    56.9    52.9    55.5    54.4    57.6    62.9 165.  175. 
##  4    1859 Avione A…    56.5     0      54.4    51.6    54.3     0   111.  106. 
##  5    1859 Ariana I…    51.9    53.5    52.4    56.0    55.2     0   158.  111. 
##  6    1859 Bethany …    49.9    51.0    54.2     0      50.6     0   155.   50.6
##  7    1859 Alyssa O…     0      53.7    52.1    51.5     0      52.8 106.  104. 
##  8    1859 Dominiqu…    49.6    44.2    50.6    51.3    49.2    53.2 144.  154. 
##  9    1859 Kristen …    47.2    50.9     0      48.2    49.3    49.6  98.1 147. 
## 10    1859 Rebekah …    48.8     0      50.4    48.2     0      46.6  99.2  94.9
## # ℹ 1 more variable: diff <dbl>
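
As a quick check of the arithmetic, row 3 of the printout (EventID 1859) can be reproduced by hand. The distances below are the rounded values shown there, and the variable names are made up for this sketch.

```r
# Row 3 of the printout above, distances as rounded there
flights <- c(56.9, 52.9, 55.5, 54.4, 57.6, 62.9)

early <- sum(flights[1:3])        # throws 1-3: 165.3
late  <- sum(flights[4:6])        # throws 4-6: 174.9
diff_late_early <- late - early   # 9.6, a stronger finish than start

c(early = early, late = late, diff = diff_late_early)
```

Because fouls count as 0, a single late foul can swing `diff` sharply negative, as row 1 of the same printout shows.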

5. Pull the pieces together for a new look at the athletes

The aggregate stats are in two data frames. By joining the two together, we can take our first rough look at how the athletes compare.

javelin_totals <- javelin_totals %>%
    left_join(javelin, by=c("EventID", "Athlete")) %>%
    select(Athlete, TotalDistance, StandardDev, Success, diff)
# Examine the first ten rows
head(javelin_totals, 10)
## # A tibble: 10 × 5
## # Groups:   Athlete [4]
##    Athlete                 TotalDistance StandardDev Success     diff
##    <chr>                           <dbl>       <dbl>   <int>    <dbl>
##  1 Abigail Gomez                    152.       1.23        3  -52.9  
##  2 Abigail Gomez                    244.       1.63        5  -48    
##  3 Abigail Gomez                    207.       2.97        4 -110.   
##  4 Abigail Gomez                    222.       1.30        4   -3.11 
##  5 Abigail Gomez                    155.       1.03        3   53.4  
##  6 Abigail Gomez Hernandez          135.       0.718       3   45.6  
##  7 Alicia DeShasier                 270.       2.15        5   60.0  
##  8 Alicia DeShasier                 320.       2.26        6    0.740
##  9 Alicia DeShasier                 275.       1.53        5   53.5  
## 10 Allison Updike                   147.       3.84        3  -46.6

6. Normalize the data to compare across stats

The four summary statistics - total distance, standard deviation, number of successful throws and our measure of early vs. late - are on different scales and measure very different things. Managers need to be able to compare these to each other and then weigh them based on what is most important to their vision and strategy for the team. A simple normalization will allow for these comparisons.

norm <- function(result) {
    (result - min(result)) / (max(result) - min(result))
}
aggstats <- c("TotalDistance", "StandardDev", "Success", "diff")
javelin_norm <- javelin_totals %>%
    ungroup() %>%
    mutate_at(aggstats, norm) %>%
    group_by(Athlete) %>%
    summarize_all(mean)

head(javelin_norm)
## # A tibble: 6 × 5
##   Athlete                 TotalDistance StandardDev Success  diff
##   <chr>                           <dbl>       <dbl>   <dbl> <dbl>
## 1 Abigail Gomez                   0.446       0.268   0.45  0.383
## 2 Abigail Gomez Hernandez         0.244       0.115   0.25  0.720
## 3 Alicia DeShasier                0.753       0.327   0.833 0.687
## 4 Allison Updike                  0.283       0.639   0.25  0.320
## 5 Alyssa Olin                     0.469       0.250   0.5   0.309
## 6 Ariana Ince                     0.660       0.342   0.692 0.446
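
The normalization used here is min-max scaling: the smallest value in a column maps to 0, the largest to 1, and everything else falls proportionally in between. A tiny worked example on made-up numbers (`rescale01` is just a hypothetical name for the same formula as `norm()` above):

```r
# Same min-max formula as norm(), under a hypothetical name
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

success_counts <- c(2, 3, 4, 6)  # made-up counts of legal throws
scaled <- rescale01(success_counts)
scaled
# [1] 0.00 0.25 0.50 1.00
```

One caveat: if a column contains an `NA`, `min()` and `max()` return `NA` and the whole column follows. Passing `na.rm = TRUE` to both is one way around that.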

7. What matters most when building your squad?

Managers have to decide what kind of players they want on their team - who matches their vision, who has the skills they need to play their style of athletics and - ultimately - who will deliver the wins. A risk-averse manager will want players who rarely foul. The steely-eyed manager will want the players who can deliver the win with their final throws.

Like any other sport (or profession), rarely will any one player be equally strong in all areas. Managers have to make trade-offs in selecting their teams. Our first batch of managers has the added disadvantage of selecting players based on data from a related but distinct sport. Our data comes from traditional track and field meets, where the motivations and goals are quite different from our own.

This is why managers make the big money and get sacked when results go south.

weights <- c(2.1, 4.9, .5, 2.5)
javelin_team <- javelin_norm %>%
    mutate(TotalScore = TotalDistance * weights[1] + StandardDev * weights[2] + Success * weights[3] + diff * weights[4]) %>%
    arrange(desc(TotalScore)) %>%
    slice(1:5) %>%
    select(Athlete, TotalScore)

javelin_team
## # A tibble: 5 × 2
##   Athlete             TotalScore
##   <chr>                    <dbl>
## 1 Asia Easley               7.09
## 2 Madalaine Stulce          6.87
## 3 Dominique Ouellette       5.86
## 4 Laura Loht                5.76
## 5 Tairyn Montgomery         5.74
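
To see how a single score comes together, here is the weighted sum worked out for one athlete, using Abigail Gomez's normalized averages as rounded in the earlier printout. The named vectors are a convenience for this sketch; because the inputs are rounded, the result is only approximate.

```r
weights <- c(TotalDistance = 2.1, StandardDev = 4.9, Success = 0.5, diff = 2.5)

# Abigail Gomez's normalized averages, as rounded in the printout
gomez <- c(TotalDistance = 0.446, StandardDev = 0.268, Success = 0.45, diff = 0.383)

# 0.446*2.1 + 0.268*4.9 + 0.45*0.5 + 0.383*2.5
total_score <- sum(gomez * weights)
round(total_score, 2)
# [1] 3.43
```

Since a larger normalized StandardDev means a wider spread between throws, a positive weight on it rewards less consistent throwers. A manager who prizes consistency might instead weight `1 - StandardDev`; that is a strategy choice rather than something the data dictates.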

8. Get to know your players

The data has spoken! Now we have our five javelin throwers but we still don’t really know them. The javelin_totals data frame has the data that went into the decision process, so we will pull that up. This gives us an idea of what they each bring to the team.

We can also take a look at how they compare to the pool of athletes we started from by taking the mean and maximum of each statistic.

team_stats <- javelin_totals %>% 
    filter(Athlete %in% javelin_team$Athlete)%>%
    summarize_all(mean)

pool_stats <- data.frame(do.call('cbind', sapply(javelin_totals, function(x) if(is.numeric(x)) c(max(x), mean(x)))))
pool_stats$MaxAve <- c("Maximum", "Average")
pool_stats <- pool_stats %>%
    gather(key="Statistic", value="Aggregate", -MaxAve)
                                                 
# Examine team stats
team_stats
## # A tibble: 5 × 5
##   Athlete             TotalDistance StandardDev Success   diff
##   <chr>                       <dbl>       <dbl>   <dbl>  <dbl>
## 1 Asia Easley                  188.        4.78     4.5  64.1 
## 2 Dominique Ouellette          299.        2.91     6     2.88
## 3 Laura Loht                   252.        3.98     5   -45.6 
## 4 Madalaine Stulce             275.        4.47     6    -6.42
## 5 Tairyn Montgomery            235.        2.87     5    48.2
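
The pool-wide maximum and average could also be computed with `summarise(across())`, which may be easier to read than the `sapply()` construction above. This is only a sketch on a made-up table (`toy_totals` is hypothetical); on the real data you would call `ungroup()` first, since `javelin_totals` is still grouped by Athlete.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for javelin_totals
toy_totals <- data.frame(
  Athlete = c("A", "A", "B"),
  TotalDistance = c(150, 250, 200),
  Success = c(3, 5, 4)
)

toy_pool <- toy_totals %>%
  # One column per statistic/function pair, e.g. TotalDistance_Maximum
  summarise(across(where(is.numeric), list(Maximum = max, Average = mean))) %>%
  # Split those names back into a Statistic column and a MaxAve column
  pivot_longer(
    everything(),
    names_to = c("Statistic", "MaxAve"),
    names_sep = "_",
    values_to = "Aggregate"
  )

toy_pool
```

The `names_sep = "_"` split relies on the statistic names themselves containing no underscore, which holds for the four columns used here.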

9. Make your case to the front office

The manager knows what she wants out of the team and has the data to support her choices, but she still needs to defend her decisions to the team owners. They do write the checks, after all.

The owners are busy people. Many of them work other jobs and own other companies. They trust their managers, so as long as the manager can give them an easy-to-digest visual presentation of why they should sign these five athletes out of all the others, they will approve.

A series of plots showing how each athlete compares to the maximum and the average of each statistic will be enough for them.

p <- team_stats %>%
    gather(key = "Statistic", value = "Aggregate", -Athlete) %>%
    ggplot(aes(x = Athlete, y = Aggregate, fill = Athlete)) +
    geom_bar(stat = "identity", position = "dodge") +
    facet_wrap(vars(Statistic), scales = "free_y") +
    # linewidth replaces size for lines (size was deprecated in ggplot2 3.4.0)
    geom_hline(data = pool_stats, aes(yintercept = Aggregate, group = Statistic, color = MaxAve), linewidth = 1) +
#   labs(title=".... Your Team Name....: Women's Javelin", color="Athlete pool maximum / average") +
    scale_fill_hue(l = 70) +
    scale_color_hue(l = 20) +
    theme_minimal() +
    theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank())
p

10. Time to throw down

Before the athletics season opens, you, as manager, will perform similar analyses for the other throws, the jumps and the running events. Then you’ll game out different permutations of your team and your opponent’s to come up with the best lineup and make the best decisions on match day. For now, since it’s what we know best and we’re almost out of time, let’s simulate a simple javelin match.

The winner is the team that throws the highest combined distance: six throws from each of your three players against six throws from each of the opponent’s three players.

home <- c(1,3,5)
away <- sample(1:nrow(javelin_totals), 3, replace=FALSE)

HomeTeam <- round(sum(team_stats$TotalDistance[home]),2)
AwayTeam <- round(sum(javelin_totals$TotalDistance[away]),2)

print(paste0("Javelin match, Final Score: ", HomeTeam, " - ", AwayTeam))
## [1] "Javelin match, Final Score: 674.59 - 570.65"
if (HomeTeam > AwayTeam) print("Win!") else print("Sometimes you just have to take the L.")
## [1] "Win!"
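
The away team above is drawn with `sample()` and no seed, so rerunning the match will generally produce a different away score. If you want a repeatable simulation, one option is to fix the seed before each draw. The sketch below uses made-up totals and a hypothetical helper, `draw_away()`; the seed value is arbitrary.

```r
# Made-up per-meet totals standing in for javelin_totals$TotalDistance
pool_totals <- c(152, 244, 207, 222, 155, 270, 320, 275)

# Hypothetical helper: sum the totals of n rows drawn without replacement
draw_away <- function(totals, n = 3) {
  sum(totals[sample(seq_along(totals), n, replace = FALSE)])
}

set.seed(2024)                      # arbitrary seed, chosen only for illustration
first_draw <- draw_away(pool_totals)

set.seed(2024)                      # resetting the seed replays the same draw
second_draw <- draw_away(pool_totals)

identical(first_draw, second_draw)  # TRUE
```

Varying the seed over many runs would then give a rough feel for how often a given home lineup wins.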