Purpose & objective of this analysis

The purpose of this project is to uncover and document valued actionable insights which are contained within the available source data for the benefit of the target audience.

The target audience includes coaches, personal trainers, athletes, athletics governing body officials, interested members of the public, sports enthusiasts, sports statisticians and data scientists.

The objective is to explore Victorian inter-club track & field athletic competition results data for the complete 2017-18 summer season and identify:

  • Natural groupings & patterns within the data
  • Basic descriptive statistics across the entire data-set
    • How many competitions per athlete
    • How many events per athlete
    • Min, max, averages and quartiles for each event
    • Geography statistics: By whole of Victorian state and by competition zone/region
  • Further opportunities for athletics data analysis

Important note: This analysis is a not-for-profit independent analysis conducted by Bree McLennan, using publicly available data from the Athletics Victoria Website. This analysis does not represent the opinions of Athletics Victoria.

Key questions from the target audience to guide this analysis

  1. What are the participation rates at inter-club competitions?
    • Can we break this down by zones and clubs?
    • What’s the distribution of total competitions athletes participate in during the season?
      • How many events per competition do athletes partake in?
    • How many athletes participated in all rounds of competition?
    • How many athletes competed at “away” venues? (ignoring metro zone v zone)
  2. How many opportunities are there for each event type?
    • Can we see stats by event grouping?
    • What’s the most poplar event?
  3. How many incomplete events or invalid event attempts occurred?
  4. Is there any pattern to performances as the season progresses?
  5. How many venues are involved throughout the season?
    • What’s the “windiest” venue?
  6. Can we see how points are distributed for performances?
    • What other alternatives to point scoring are there?

Context specific process flow for this analysis

  1. Define the parameters: Purpose, objective, and rough timelines.
  2. Obtain target audience input.
  3. Obtain source data Athletics Victoria “AV Shield 2017-18”.
  4. Conduct a risk assessment on source data with respect to purpose & objective.
    • Discarding any data which is not relevant to the analysis guiding questions.
  5. Technical setup to commence analysis:
    • Github repository
    • R Project file.
      • Load data >> Prepare data >> Merge reference data >> Transform data >> Analyse data
  6. Explore data & key guiding questions to discover answers.
  7. Peer review & publish analysis and findings.
  8. Obtain audience feedback, review and apply updates where appropriate.
    • Opportunities for subsequent analysis.

Analysis data considerations

The data for inter-club rounds 1 to 12 is contained in individual csv files, by round, for each participating Victorian region.

General description of the source data:

  • Round 7 is excluded because it was cancelled due to extreme weather.
  • There are 77 individual csv files for season 2017-18
  • The CSV file contents can be described as: performance results for each athlete by event completed, for a round of competition for a specific region. Season 2017-18.
  • There are 21 variables in the source data, ranging from athlete registration ID, event specification, performance result, age group, club, venue, wind reading and completion status.
  • Dates and times of competitions and events are not included with the source data-sets.

Technical approach to creating the analysis data:

  • Append all CSVs together to create one source data-set.
  • Re-name and format variables for data type consistency.
  • Binarise variables where appropriate.
  • Create hierarchical groupings for event types, venues, and age groups.
  • Triage missing data (careful application of subject matter expertise).
    • Particularly with AWD classification performance adjustments, venue names, event specifications and event status (DNQ, INV).
  • Merging on reference data by created keys:
    • Club details (short-name, full name, zone).
    • Venue details (geographical location, track type).
    • Performance adjustment (AWD & masters age group athletes).
  • Calculate athlete finishing order per event and point scoring methods.

Structure of the created analysis dataset

We commenced with 21 variables in the source data-set. After performing some high level checks on the data we discovered opportunities to create new variables (features) by applying context/subject matter expertise and joining relevant reference tables to the source data.

All variables have been renamed to use a three-letter acronym prefix to denote the expected general data type values.

  • KEY: Primary key.
  • FOR: Foreign key, where other tables are referenced and to be joined.
  • NUM: Numeric values which have a range beyond binary format, may include NA (Missing/Null).
  • BIN: Binary values. 1 and 0 only.
  • CAT: Categorical value. Structured and consistent groupings
  • ORD: Ordinal value, can be numeric or categorical in origin, but ordering of values is significant.
  • TXT: Free/unstructured text.
  • DTM: Date-time values. Specifically formatted as YYY-MM-DD HH:MM:SS.ss

## Observations: 28,799
## Variables: 61
## $ KEYRegistrationNumber                  <fct> 2559, 10002, 446, 459, ...
## $ NUMDistance                            <int> NA, NA, NA, NA, NA, NA,...
## $ NUMStandardDistance                    <int> NA, NA, NA, NA, NA, NA,...
## $ NUMPerformance                         <dbl> 11.53, 22.93, 4.16, 4.0...
## $ NUMWindReading                         <dbl> NA, NA, -2.1, -1.4, 2.6...
## $ NUMPointsAwarded                       <int> 153, NA, 307, 283, 222,...
## $ CATAthleteRegisteredZone               <fct> Ballarat, Ballarat, Bal...
## $ CATCompetitionVenue                    <fct> Ballarat, Ballarat, Bal...
## $ CATGender                              <fct> Female, Female, Female,...
## $ CATAthleteRegisteredClub               <fct> BYC, WEN, BHA, WEN, BHA...
## $ CATAgeGroup                            <fct> F14, F14, F14, F14, F14...
## $ CATAthleteDivision                     <fct> 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CATAthleteTeamNumber                   <fct> 1, 0, 1, 1, 1, 1, 1, 1,...
## $ CATEventDiscipline                     <fct> Javelin, Javelin, Long ...
## $ CATDistance                            <fct> NA, NA, NA, NA, NA, NA,...
## $ CATStandardDistance                    <fct> NA, NA, NA, NA, NA, NA,...
## $ CATEventSpecification                  <chr> "400g", "400g", "", "",...
## $ CATEventHeat                           <fct> 1, 1, 1, 1, 1, 1, 1, 1,...
## $ CATEventNote                           <fct> 400g, 400g, INV, None, ...
## $ ORDCompetitionRound                    <fct> 1, 1, 1, 1, 1, 1, 1, 1,...
## $ TXTPerformanceFormatted                <chr> "11.53m", "22.93m", "4....
## $ BINValidEventAttempt                   <fct> 1, 1, 1, 1, 1, 1, 1, 1,...
## $ BINInvitationEventOrAthlete            <fct> 0, 1, 0, 0, 0, 0, 0, 0,...
## $ BINTeamEvent                           <fct> 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CATEventFullName                       <fct> Javelin 400g, Javelin 4...
## $ CATEventGroupL1                        <fct> Field, Field, Field, Fi...
## $ CATEventGroupL2                        <fct> Throw, Throw, Jump, Jum...
## $ CATAgeGroupLeveL1                      <fct> Juniors, Juniors, Junio...
## $ CATAgeGroupKey                         <fct> , , , , , , , , , , , ,...
## $ BINAthleteWithDisability               <fct> 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CATAthleteAWDClass                     <fct> NA, NA, NA, NA, NA, NA,...
## $ ORDEventFinishOrder                    <int> 2, 1, 1, 2, 3, 4, 5, 6,...
## $ ORDEventFinishOrderPoints              <dbl> 1, 0, 1, 2, 3, 4, 5, 6,...
## $ NUMTotAthletesInVenueAgeEventPoints    <dbl> 1, 1, 6, 6, 6, 6, 6, 6,...
## $ NUMTotAthletesInVenueAgeEvent          <int> 2, 2, 7, 7, 7, 7, 7, 7,...
## $ ORDRoundEventFinishOrder               <int> 11, 4, 10, 13, 21, 26, ...
## $ ORDRoundEventFinishOrderPoints         <dbl> 11, 0, 11, 14, 21, 26, ...
## $ NUMTotAthletesInRoundAgeEventPoints    <dbl> 13, 13, 32, 32, 32, 32,...
## $ NUMTotAthletesInRoundAgeEvent          <int> 13, 13, 33, 33, 33, 33,...
## $ CATClubDistrict                        <fct> Country, Country, Count...
## $ CATClubZoneName                        <fct> Ballarat, Ballarat, Bal...
## $ CATAthleticClubName                    <fct> Ballarat YCW Harriers, ...
## $ CATAthleticTrackVenueName              <fct> Llanberris Athletic Res...
## $ CATVenueTrackType                      <fct> Synthetic, Synthetic, S...
## $ CATVenueAddress                        <chr> "York St, Golden Point,...
## $ CATVenueDistrict                       <fct> Country, Country, Count...
## $ CATVenueMapCoord                       <chr> "-37.5693937,143.865968...
## $ NUMVenueLatitude                       <dbl> -37.56939, -37.56939, -...
## $ NUMVenueLongitude                      <dbl> 143.866, 143.866, 143.8...
## $ CATVenueZone                           <fct> Ballarat, Ballarat, Bal...
## $ CATVenueVictoriaRegion                 <fct> Ballarat, Ballarat, Bal...
## $ NUMAdjustFactorAWD                     <dbl> NA, NA, NA, NA, NA, NA,...
## $ NUMAdjustFactorVET                     <dbl> NA, NA, NA, NA, NA, NA,...
## $ BINAthleteCompeteAwayVenue             <fct> 1, 1, 1, 1, 1, 1, 1, 1,...
## $ CATAthleteEventStatus                  <fct> OK, OK, OK, OK, OK, OK,...
## $ NUMPerformanceAWDAdjusted              <dbl> 11.53, 22.93, 4.16, 4.0...
## $ ORDRoundEventFinishOrderAWDAdj         <int> 11, 4, 10, 13, 21, 26, ...
## $ ORDRoundEventFinishOrderPointsAWDAdj   <dbl> 11, 0, 11, 14, 21, 26, ...
## $ NUMEventFinishOrderPoints11            <dbl> 11, 0, 11, 9, 8, 7, 6, ...
## $ NUMRoundEventFinishOrderPoints11       <dbl> 1, 0, 1, 1, 1, 1, 1, 1,...
## $ NUMRoundEventFinishOrderPoints11AWDAdj <dbl> 1, 0, 1, 1, 1, 1, 1, 1,...
## [1] 2830


A random sampling of records from the analysis data-set to demonstrate the visual of what we are working with:


Exploring the key questions

1. What are the participation rates at interclub competitions?

  • Can we break this down by zones and clubs?
  • What’s the distribution of total competitions athletes participate in during the season?
    • How many events per competition do athletes partake in?
  • How many athletes participated in all rounds of competition?
  • How many athletes competed at “away” venues? (ignoring metro zone v zone)



Let’s begin by setting up a meaningful summary data-set to enable us to visualise the data.




Summary table of athlete participation by round
ORDCompetitionRound NUMAthletesParticipating NUMTotalEventsParticipated NUMEventsPerAthlete
1 1414 2898 2
2 1295 2644 2
3 1423 2886 2
4 1443 2736 1.9
5 1363 2791 2
6 1184 2434 2.1
8 1170 2367 2
9 1152 2444 2.1
10 968 2078 2.1
11 1228 2581 2.1
12 1294 2512 1.9

We can see from this summary table and the bar plot that there is a steady volume of athletes participating in round 1,3 and 4, matched with steady volumes in events participated. There is a clear deviation at round 10, with a drop off in athlete volumes.

Curious questions arise, what could be some potential causal factors for this pattern? Could it be time of year, event schedule/program, or something else?

We know that:

  • Rounds 1 to 5 are generally early October to mid November, held in mid-late afternoon.
  • Rounds 6 to 8 are immediately before the end of year holiday season.
  • Rounds 9 to 12 are generally in January, held as twilight meets.


Let’s now take a look at the distribution of event participation across rounds.




This cumulative distribution reveals:

  • The average athlete in the population is likely to participate in at least 2 events per round of competition
  • The fitted regression line shows a slightly more positive slope change at round 8.
  • Potential emergence of sub groups:
    • Athletes who participate in a single event per round (below average): Could these be niche/specialist event athletes?
    • Average population (the red regression line), participating in two events per round.
    • “Decathletes” (the outliers, above average), athletes who seek to have a go at every event on the program, every round.


Summary of athlete participation by total number of rounds
NUMAthletesRounds NUMTotalAthletesAllRounds
1 363
2 302
3 399
4 347
5 286
6 293
7 227
8 210
9 167
10 163
11 72

This table highlights, out of nearly 3000 athletes who have competed in the summer inter-club season, only 72 participated in all available rounds of competition. That is less than 5% of the athlete population.

A large proportion of athletes participated in up to 4 rounds of competition.

Curious question at this point: What would be the “optimum” number of rounds for an athlete to participate in?



Summary of athletes who competed in away venues
BINAthleteCompeteAwayVenue NUMAthletesAway
1 2369

This table informs us, nearly 80% of the athlete population competed at a venue which was outside their registered club’s Victorian district. At such a broad level view, we don’t see the breakdown of clubs. Subject matter expertise suggests that the remaining 20% are likely Victorian country region athletes that don’t travel to venues outside their district to compete, while the 80% is likely to be metropolitan with zone versus zone rounds held at different venues each round.

Let’s take a look at the participation by venue and Victorian region:
Summary of venues and total distinct athlete participation
CATCompetitionVenue CATVenueZone NUMAthletesParticipated
Nunawading White 1150
Lakeside Yellow 990
Doncaster Red 946
Aberfeldie Yellow 909
Knox Blue 798
Frankston Blue 651
Casey Fields Blue 613
Meadowglen Yellow 609
Werribee Yellow 592
Geelong Geelong 527
Williamstown Yellow 400
Yarra Ranges White 372
Ballarat Ballarat 233
Bendigo Bendigo 232
Clifton Hill Yellow 73
Mentone Blue 21
Box Hill White 19



This table is rolling up rounds into the total calculation. Venues which have hosted multiple rounds of competition will have increased chances of having higher numbers of distinct individual athletes.

It is still curious to note, the metropolitan venues dominate the participation by venue. We see country region venues toward the lower half of the list.

The venues of Clifton Hill, Mentone and Box Hill have distinctly lower numbers, and this is likely due to half-completed rounds of competition.



2. How many opportunities are there for each event type?

  • Can we see stats by event grouping?
    • What’s the most poplar event?




This table displays the details of event participation by round. Sorted by the highest number of athletes participating, the track sprint events 100m and 200m come out on top. The track run events, in particular the 100m and 3000/5000m are held very frequently and often by invitation when not officially scheduled for a round of competition.

Let’s now take a look at the most popular events for the whole season.



This graph shows us quite clearly the sprint, middle distance and horizontal jumps events dominate in participation rates. It’s interesting to note:

  • The 800m has higher participation than 400m.
  • Shot Put doesn’t make it into the top 10.
  • Highly technical events such as hurdles, steeple, walks and pole vault are not in the top 10.

3. How many incomplete events or invalid event attempts occurred?





This table gives us a detailed view as to the nature of a event result being incomplete or unsuccessful. An event status can be:

  • DNF - Did not finish
  • DQ - Disqualified (IAAF rule breach)
  • NT - No time recorded
  • NM - No measurement made
  • DNS - Did not start

The events with the highest non-completion rates are 200m, 800m and 400m. Non-Completion could be due to injury or fatigue.

Let’s visualise this and observe the events with status stacked.




The visual is consistent with the table above. We can begin to see some groupings emerge in the field events.

For field events, is it more appropriate to use “DNF” or “NM” to signify the athlete commenced the event but recorded no final result? Is this a data integrity/consistency of entry issue?

Equally, it is a little concerning to observe the status of “NT” on track events, assuming this is “No time”, under what conditions were no times recorded for these events? Technical equipment malfunction? Or is this a data recording issue?

4. Is there any pattern to performances as the season progresses?

We can tackle this question by selecting three events, one from each general track & field discipline:

  • Run: We’ll select 400m sprint. No wind readings are required, its a circular event, good measure of speed endurance.
  • Jump: We’ll select Long Jump. It made it into the top 10 events for popularity by participation. This will be impacted by wind conditions.
  • Throw: We’ll select Shot Put. Might prove to be a good measure of power & strength.

Let’s observe how the performances of these events vary throughout the season.







A performance trend across the season looks to be a little cloudy and weak. However we do notice for both men and women:

  • A larger number of outliers in the 400m, with slower times than average.
  • The outliers in shot put are all above the mean, throwing better than average.
  • Most of the outliers in Long Jump are below the mean, jumping less than average.




5. How many venues are involved throughout the season?

  • What’s the “windiest” venue?

The interactive geographical plot below illustrates the venues hosting competitions throughout the season, by round. The Victorian country region venues do not change round to round, and we notice the Melbourne metropolitan venues change each round.




Let’s look at the venues by wind reading. We’ll use the max and min functions across wind reading by venue and round, this is because we know that events which require wind readings can run in multiple directions on competition day.

  • Negative wind readings are head wind.
  • Positive wind readings are tail wind.
  • Positive wind readings above +2.0 M/sec are considered “illegal” when attempting to break records or achieve qualification standards.
  • Just because there is a wind reading does not mean it remained at that speed for the entire competition, but it may serve as an interesting indicator when it shows up at a consistent rate.







The table indicates to us Geelong, Bendigo and Werribee have recorded some of the highest wind readings across the state, significantly above +2.0 m/s. These are strong wind gusts, and with curiosity, does this impact performances in any significant way?

6. Can we see how points are distributed for performances?

  • What other alternatives to point scoring are there?



This might prove to be a hot topic to cover. The method in which points are calculated can be the difference between a team qualifying for shield final or an athlete receiving an award at the end of season.

This section is dedicated to offering independent thought on alternative options to point scoring, while showing the differences to the current state.

Methods of point scoring calculation



Method #1 AV Shield 2017-18: Points awarded based on performance within 80% of decathlon world record (method in use for inter-club round only).

  • Pros:
    • Pure performance based versus the world record
    • No discrimination over finishing order, round or venue.
  • Cons:
    • Severely penalises athletes who are well below the “norm” or outside the 80% threshold.
    • Athletes below the threshold receive zero (0) points.
    • Penalty is not a good look for “rewarding” general participation regardless of ability.
    • Time cost of recalculation: For regions which desire to use a different method on the basis of fairness for all athletes, alternative and manual re-calculation of points is required.
    • The threshold needs to be reviewed regularly or when new records are set.

Method #2 Points awarded 1st place = 11 points, 2nd-9th = 11 minus placing, 10th and above = 1 point. Grouping based on Round, Venue, Event, Age Group, Finishing Order.

  • Pros:
    • Rewards all participation and race/event-flight-based competitive finishing order.
    • Encourages balanced seeding of events to promote challenging competition
  • Cons:
    • Will be biased toward smaller competition regions
    • Method is not purely performance based.

Method #3 Points awarded 1st place = 11 points, 2nd-9th = 11 minus placing, 10th and above = 1 point. Grouping based on Round, Event, Age Group, Finishing Order.

Alteration: Similar to method 2, but grouping next level up, finishing order based on Round/Event results and not Venue/Event.

  • Pros:
    • Finishing order calculated versus other athletes competing in same event, at same round.
    • Expands the number of athletes in calculation group.
    • Encourages balanced seeding of events to promote challenging competition
  • Cons:
    • For increased balance and fairness, same events should be held on the same round and not staggered/fragmented each round because of venue resources.
    • Method is not purely performance based.
    • The overall percentages of participation for athletes with disability is less than 5% of the athlete population. The impacts will not be greatly observed, but will be subtle enough to see some minor differences in total points awarded (final finishing orders) over the season.

Method #4 Normalising performances by athletes with disability, using respective adjustment table for AWD classification. Points awarded 1st place = 11 points, 2nd-9th = 11 minus placing, 10th and above = 1 point. Grouping based on Round, Venue, Event, Age Group, Finishing Order using the normalised performance result.

  • Pros:
    • Using the same grouping method as Method 3, but with performance normalisation.
    • Offers equity in points awarded, by recognising input effort
    • Encourages balanced seeding of events to promote challenging competition
  • Cons:
    • For increased balance and fairness, same events should be held on the same round and not staggered/fragmented each round because of venue resources.
    • Method is not purely performance based.







These graphs illustrate that by changing the method of points calculation it does have an impact, at the granular individual athlete level. We observe the ordering of the highest point scoring athletes changes with the method used.

With the 4th method illustrated, while the highest 6 athlete finishing orders do not change, we see some movement between finishing orders 7 and 10.

Consider this:

  • If the methods impact at the individual athlete level, what will happen when it rolls up to club and zone?
  • How should we design a point scoring method which rewards participation at any performance level and is fair and balanced across regions?



Summary notes & future analyses

A significant portion of this analysis was spent on data preparation, cleanup and feature engineering. Spending time carefully assessing the source data helps greatly in establishing a solid foundation to conduct an analysis on.

Here’s the rough breakdown of time spent:

  • 70% - Data understanding, preparation, checking, cleanup
  • 10% - Key feature engineering (AWD classifications, Venue geography, useful data groupings/classifications)
  • 10% - Deciding how to “tell the story”, visualisations, summary tables, appropriate aggregations
  • 10% - Writing the story, peer-review/copy writing, technical design of this R markdown document.

Observations made along the journey in data prep:

  • No dates are recorded in the source data. Is this worth adding?
  • Wind direction and air temperature is not recorded. Could environmental conditions play a part in performance and participation?
  • Inconsistent data entry for athletes with disability, flagging their classification. This is important to consider when calculating normalised performances.
  • Inconsistent data entry for flagging events or athletes as invitation. It is unclear if the athlete is competing as invitation or if the event was held as invitation.
  • Inconsistent data entry for venue hosting competition. For example, small things like “Meadowglen” versus “Meadow Glen”.
  • Different events are held at different venues for a single round, there is no overarching consistency, therefore making it challenging to compare “apples for apples” by round of competition.
    • Subject matter expertise suggests that, some athletics track venues don’t support certain event groupings running on the same program due to safety concerns or logistics, however this can be remedied with program event scheduling. EG: Hammer throw scheduled very early, and Javelin held very late in the event schedule, avoiding the issue of throwing implements crossing the infield simultaneously.


Where to next?

There were a few elements in this analysis we could unpack further, such as:

  • participation by club
  • geographically plotting athlete/club commutes & volumes to venues outside their club’s district
  • athletes breaking records or achieving qualifying standards (Nationals, World Juniors, Commonwealth Games, Olympics)
  • wind readings & performance impacts
  • event performance trends by age group.
    • could we predict performances for athletes, based on participation history?
  • External perspective: How does Victorian athletics compare to other states of Australia, and the rest of the world?
  • Contemporary risk management approach: For Victorian athletics, a key risk event could be described as “Inter-club ceases due to insufficient participation rates”. What causal factors and consequences could we unpack from the data to provide insight into developing potential controls and treatments?


As mentioned in the purpose and objective of this analysis, having a clear business problem to address goes a long way to creating useful, actionable insights, to assist the target audience. If there are more business problems to consider and unpack then it is likely we can look at part two of this analysis.


We’d love to receive your thoughts, queries and feedback on this analysis. Please feel free to reach out to Bree at bree_mclennan@outlook.com .

Thanks for connecting and reading!

Bree.