The purpose of this project is to uncover and document valued actionable insights which are contained within the available source data for the benefit of the target audience.
The target audience includes coaches, personal trainers, athletes, athletics governing body officials, interested members of the public, sports enthusiasts, sports statisticians and data scientists.
The objective is to explore Victorian inter-club track & field athletic competition results data for the complete 2017-18 summer season and identify:
Important note: This analysis is a not-for-profit independent analysis conducted by Bree McLennan, using publicly available data from the Athletics Victoria Website. This analysis does not represent the opinions of Athletics Victoria.
The data for inter-club rounds 1 to 12 is contained in individual csv files, by round, for each participating Victorian region.
General description of the source data:
Technical approach to creating the analysis data:
We commenced with 21 variables in the source data-set. After performing some high level checks on the data we discovered opportunities to create new variables (features) by applying context/subject matter expertise and joining relevant reference tables to the source data.
All variables have been renamed to use a three-letter acronym prefix to denote the expected general data type values.
## Observations: 28,799
## Variables: 61
## $ KEYRegistrationNumber <fct> 2559, 10002, 446, 459, ...
## $ NUMDistance <int> NA, NA, NA, NA, NA, NA,...
## $ NUMStandardDistance <int> NA, NA, NA, NA, NA, NA,...
## $ NUMPerformance <dbl> 11.53, 22.93, 4.16, 4.0...
## $ NUMWindReading <dbl> NA, NA, -2.1, -1.4, 2.6...
## $ NUMPointsAwarded <int> 153, NA, 307, 283, 222,...
## $ CATAthleteRegisteredZone <fct> Ballarat, Ballarat, Bal...
## $ CATCompetitionVenue <fct> Ballarat, Ballarat, Bal...
## $ CATGender <fct> Female, Female, Female,...
## $ CATAthleteRegisteredClub <fct> BYC, WEN, BHA, WEN, BHA...
## $ CATAgeGroup <fct> F14, F14, F14, F14, F14...
## $ CATAthleteDivision <fct> 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CATAthleteTeamNumber <fct> 1, 0, 1, 1, 1, 1, 1, 1,...
## $ CATEventDiscipline <fct> Javelin, Javelin, Long ...
## $ CATDistance <fct> NA, NA, NA, NA, NA, NA,...
## $ CATStandardDistance <fct> NA, NA, NA, NA, NA, NA,...
## $ CATEventSpecification <chr> "400g", "400g", "", "",...
## $ CATEventHeat <fct> 1, 1, 1, 1, 1, 1, 1, 1,...
## $ CATEventNote <fct> 400g, 400g, INV, None, ...
## $ ORDCompetitionRound <fct> 1, 1, 1, 1, 1, 1, 1, 1,...
## $ TXTPerformanceFormatted <chr> "11.53m", "22.93m", "4....
## $ BINValidEventAttempt <fct> 1, 1, 1, 1, 1, 1, 1, 1,...
## $ BINInvitationEventOrAthlete <fct> 0, 1, 0, 0, 0, 0, 0, 0,...
## $ BINTeamEvent <fct> 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CATEventFullName <fct> Javelin 400g, Javelin 4...
## $ CATEventGroupL1 <fct> Field, Field, Field, Fi...
## $ CATEventGroupL2 <fct> Throw, Throw, Jump, Jum...
## $ CATAgeGroupLeveL1 <fct> Juniors, Juniors, Junio...
## $ CATAgeGroupKey <fct> , , , , , , , , , , , ,...
## $ BINAthleteWithDisability <fct> 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CATAthleteAWDClass <fct> NA, NA, NA, NA, NA, NA,...
## $ ORDEventFinishOrder <int> 2, 1, 1, 2, 3, 4, 5, 6,...
## $ ORDEventFinishOrderPoints <dbl> 1, 0, 1, 2, 3, 4, 5, 6,...
## $ NUMTotAthletesInVenueAgeEventPoints <dbl> 1, 1, 6, 6, 6, 6, 6, 6,...
## $ NUMTotAthletesInVenueAgeEvent <int> 2, 2, 7, 7, 7, 7, 7, 7,...
## $ ORDRoundEventFinishOrder <int> 11, 4, 10, 13, 21, 26, ...
## $ ORDRoundEventFinishOrderPoints <dbl> 11, 0, 11, 14, 21, 26, ...
## $ NUMTotAthletesInRoundAgeEventPoints <dbl> 13, 13, 32, 32, 32, 32,...
## $ NUMTotAthletesInRoundAgeEvent <int> 13, 13, 33, 33, 33, 33,...
## $ CATClubDistrict <fct> Country, Country, Count...
## $ CATClubZoneName <fct> Ballarat, Ballarat, Bal...
## $ CATAthleticClubName <fct> Ballarat YCW Harriers, ...
## $ CATAthleticTrackVenueName <fct> Llanberris Athletic Res...
## $ CATVenueTrackType <fct> Synthetic, Synthetic, S...
## $ CATVenueAddress <chr> "York St, Golden Point,...
## $ CATVenueDistrict <fct> Country, Country, Count...
## $ CATVenueMapCoord <chr> "-37.5693937,143.865968...
## $ NUMVenueLatitude <dbl> -37.56939, -37.56939, -...
## $ NUMVenueLongitude <dbl> 143.866, 143.866, 143.8...
## $ CATVenueZone <fct> Ballarat, Ballarat, Bal...
## $ CATVenueVictoriaRegion <fct> Ballarat, Ballarat, Bal...
## $ NUMAdjustFactorAWD <dbl> NA, NA, NA, NA, NA, NA,...
## $ NUMAdjustFactorVET <dbl> NA, NA, NA, NA, NA, NA,...
## $ BINAthleteCompeteAwayVenue <fct> 1, 1, 1, 1, 1, 1, 1, 1,...
## $ CATAthleteEventStatus <fct> OK, OK, OK, OK, OK, OK,...
## $ NUMPerformanceAWDAdjusted <dbl> 11.53, 22.93, 4.16, 4.0...
## $ ORDRoundEventFinishOrderAWDAdj <int> 11, 4, 10, 13, 21, 26, ...
## $ ORDRoundEventFinishOrderPointsAWDAdj <dbl> 11, 0, 11, 14, 21, 26, ...
## $ NUMEventFinishOrderPoints11 <dbl> 11, 0, 11, 9, 8, 7, 6, ...
## $ NUMRoundEventFinishOrderPoints11 <dbl> 1, 0, 1, 1, 1, 1, 1, 1,...
## $ NUMRoundEventFinishOrderPoints11AWDAdj <dbl> 1, 0, 1, 1, 1, 1, 1, 1,...
# Count of unique registration numbers (Including "0" and invitational IDs) which have participated in season 2017-18
n_distinct(wrk.03DataTrans_03$KEYRegistrationNumber)## [1] 2830
1. What are the participation rates at interclub competitions?
Let’s begin by setting up a meaningful summary data-set to enable us to visualise the data.
# Calculate the participation rates for each round
wrk.03DataTrans_Q1A <- wrk.03DataTrans_03 %>%
filter(KEYRegistrationNumber %ni% c("0")) %>% #remove teams
group_by(ORDCompetitionRound) %>%
summarise(NUMAthletesParticipating = n_distinct(KEYRegistrationNumber),
NUMTotalEventsParticipated = n(),
NUMEventsPerAthlete = as.numeric(round((NUMTotalEventsParticipated/NUMAthletesParticipating),digits = 1)) )
# Visualise the dataset
ggplot(wrk.03DataTrans_Q1A, aes(x = ORDCompetitionRound, y = NUMAthletesParticipating )) +
geom_bar(stat = "identity", fill = 'lightblue') +
geom_text(aes(label = NUMAthletesParticipating), vjust = 1.6, color = "white",
size = 3.5) +
labs(title = "Athlete participation by round",
x = "Number of athletes",
y = "Competition round") +
scale_y_continuous(breaks = c(0,250,500,750,1000,1250,1500))
| ORDCompetitionRound | NUMAthletesParticipating | NUMTotalEventsParticipated | NUMEventsPerAthlete |
|---|---|---|---|
| 1 | 1414 | 2898 | 2 |
| 2 | 1295 | 2644 | 2 |
| 3 | 1423 | 2886 | 2 |
| 4 | 1443 | 2736 | 1.9 |
| 5 | 1363 | 2791 | 2 |
| 6 | 1184 | 2434 | 2.1 |
| 8 | 1170 | 2367 | 2 |
| 9 | 1152 | 2444 | 2.1 |
| 10 | 968 | 2078 | 2.1 |
| 11 | 1228 | 2581 | 2.1 |
| 12 | 1294 | 2512 | 1.9 |
We can see from this summary table and the bar plot that there is a steady volume of athletes participating in round 1,3 and 4, matched with steady volumes in events participated. There is a clear deviation at round 10, with a drop off in athlete volumes.
Curious questions arise, what could be some potential causal factors for this pattern? Could it be time of year, event schedule/program, or something else?
We know that:
Let’s now take a look at the distribution of event participation across rounds.
This cumulative distribution reveals:
| NUMAthletesRounds | NUMTotalAthletesAllRounds |
|---|---|
| 1 | 363 |
| 2 | 302 |
| 3 | 399 |
| 4 | 347 |
| 5 | 286 |
| 6 | 293 |
| 7 | 227 |
| 8 | 210 |
| 9 | 167 |
| 10 | 163 |
| 11 | 72 |
This table highlights, out of nearly 3000 athletes who have competed in the summer inter-club season, only 72 participated in all available rounds of competition. That is less than 5% of the athlete population.
A large proportion of athletes participated in up to 4 rounds of competition.
Curious question at this point: What would be the “optimum” number of rounds for an athlete to participate in?
| BINAthleteCompeteAwayVenue | NUMAthletesAway |
|---|---|
| 1 | 2369 |
This table informs us, nearly 80% of the athlete population competed at a venue which was outside their registered club’s Victorian district. At such a broad level view, we don’t see the breakdown of clubs. Subject matter expertise suggests that the remaining 20% are likely Victorian country region athletes that don’t travel to venues outside their district to compete, while the 80% is likely to be metropolitan with zone versus zone rounds held at different venues each round.
| CATCompetitionVenue | CATVenueZone | NUMAthletesParticipated |
|---|---|---|
| Nunawading | White | 1150 |
| Lakeside | Yellow | 990 |
| Doncaster | Red | 946 |
| Aberfeldie | Yellow | 909 |
| Knox | Blue | 798 |
| Frankston | Blue | 651 |
| Casey Fields | Blue | 613 |
| Meadowglen | Yellow | 609 |
| Werribee | Yellow | 592 |
| Geelong | Geelong | 527 |
| Williamstown | Yellow | 400 |
| Yarra Ranges | White | 372 |
| Ballarat | Ballarat | 233 |
| Bendigo | Bendigo | 232 |
| Clifton Hill | Yellow | 73 |
| Mentone | Blue | 21 |
| Box Hill | White | 19 |
This table is rolling up rounds into the total calculation. Venues which have hosted multiple rounds of competition will have increased chances of having higher numbers of distinct individual athletes.
It is still curious to note, the metropolitan venues dominate the participation by venue. We see country region venues toward the lower half of the list.
The venues of Clifton Hill, Mentone and Box Hill have distinctly lower numbers, and this is likely due to half-completed rounds of competition.
2. How many opportunities are there for each event type?
Let’s now take a look at the most popular events for the whole season.
This graph shows us quite clearly the sprint, middle distance and horizontal jumps events dominate in participation rates. It’s interesting to note:
3. How many incomplete events or invalid event attempts occurred?
The events with the highest non-completion rates are 200m, 800m and 400m. Non-Completion could be due to injury or fatigue.
Let’s visualise this and observe the events with status stacked.
The visual is consistent with the table above. We can begin to see some groupings emerge in the field events.
For field events, is it more appropriate to use “DNF” or “NM” to signify the athlete commenced the event but recorded no final result? Is this a data integrity/consistency of entry issue?
Equally, it is a little concerning to observe the status of “NT” on track events, assuming this is “No time”, under what conditions were no times recorded for these events? Technical equipment malfunction? Or is this a data recording issue?
4. Is there any pattern to performances as the season progresses?
We can tackle this question by selecting three events, one from each general track & field discipline:
Let’s observe how the performances of these events vary throughout the season.
A performance trend across the season looks to be a little cloudy and weak. However we do notice for both men and women:
5. How many venues are involved throughout the season?
The interactive geographical plot below illustrates the venues hosting competitions throughout the season, by round. The Victorian country region venues do not change round to round, and we notice the Melbourne metropolitan venues change each round.
Let’s look at the venues by wind reading. We’ll use the max and min functions across wind reading by venue and round, this is because we know that events which require wind readings can run in multiple directions on competition day.
6. Can we see how points are distributed for performances?
This might prove to be a hot topic to cover. The method in which points are calculated can be the difference between a team qualifying for shield final or an athlete receiving an award at the end of season.
This section is dedicated to offering independent thought on alternative options to point scoring, while showing the differences to the current state.
Methods of point scoring calculation
Method #1 AV Shield 2017-18: Points awarded based on performance within 80% of decathlon world record (method in use for inter-club round only).
Method #2 Points awarded 1st place = 11 points, 2nd-9th = 11 minus placing, 10th and above = 1 point. Grouping based on Round, Venue, Event, Age Group, Finishing Order.
Method #3 Points awarded 1st place = 11 points, 2nd-9th = 11 minus placing, 10th and above = 1 point. Grouping based on Round, Event, Age Group, Finishing Order.
Alteration: Similar to method 2, but grouping next level up, finishing order based on Round/Event results and not Venue/Event.
Method #4 Normalising performances by athletes with disability, using respective adjustment table for AWD classification. Points awarded 1st place = 11 points, 2nd-9th = 11 minus placing, 10th and above = 1 point. Grouping based on Round, Venue, Event, Age Group, Finishing Order using the normalised performance result.
These graphs illustrate that by changing the method of points calculation it does have an impact, at the granular individual athlete level. We observe the ordering of the highest point scoring athletes changes with the method used.
With the 4th method illustrated, while the highest 6 athlete finishing orders do not change, we see some movement between finishing orders 7 and 10.
Consider this:
A significant portion of this analysis was spent on data preparation, cleanup and feature engineering. Spending time carefully assessing the source data helps greatly in establishing a solid foundation to conduct an analysis on.
Here’s the rough breakdown of time spent:
Observations made along the journey in data prep:
Where to next?
There were a few elements in this analysis we could unpack further, such as:
As mentioned in the purpose and objective of this analysis, having a clear business problem to address goes a long way to creating useful, actionable insights, to assist the target audience. If there are more business problems to consider and unpack then it is likely we can look at part two of this analysis.
We’d love to receive your thoughts, queries and feedback on this analysis. Please feel free to reach out to Bree at bree_mclennan@outlook.com .
Thanks for connecting and reading!
Bree.