Join me as I build a linear model from AFL data scraped by the fryzigg function call in the package fitzRoy in RStudio.

Here is a brief outline of what we’ll do in this session:

  1. Wrangling data
  2. Exploratory data analysis
  3. Building models
  4. Evaluating accuracy
  5. Interpreting final model
  6. Linear Model Assumptions

Wrangling Data

# import data from fryzigg

s24 <- fetch_player_stats_fryzigg(season = 2024)
## 
ℹ Returning cached AFLM data from 2024

✔ Returning cached AFLM data from 2024 ... done
# average statistics

player_aves <- s24 %>%
  select(-c(player_first_name, player_last_name)) %>%
  group_by(player_id) %>%
  summarise_all(list(mean))
## Warning: There were 7238 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `venue_name = (function (x, ...) ...`.
## ℹ In group 1: `player_id = 11506`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 7237 remaining warnings.
# inspect variables

str(player_aves)
## tibble [658 × 79] (S3: tbl_df/tbl/data.frame)
##  $ player_id                     : int [1:658] 11506 11554 11591 11683 11692 11706 11724 11759 11766 11770 ...
##  $ venue_name                    : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ match_id                      : num [1:658] 16709 16665 16737 16747 16670 ...
##  $ match_home_team               : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ match_away_team               : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ match_date                    : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ match_local_time              : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ match_attendance              : num [1:658] 0 0 0 0 0 0 0 0 0 0 ...
##  $ match_round                   : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ match_home_team_goals         : num [1:658] 12.8 12.8 12 13.1 13.3 ...
##  $ match_home_team_behinds       : num [1:658] 11 11.4 12.7 11.4 10.9 ...
##  $ match_home_team_score         : num [1:658] 87.8 88.4 84.7 90 90.6 ...
##  $ match_away_team_goals         : num [1:658] 11.6 12.2 11 11.7 11.5 ...
##  $ match_away_team_behinds       : num [1:658] 11.7 10.7 10.3 10.2 10.6 ...
##  $ match_away_team_score         : num [1:658] 81 83.7 76.1 80 79.6 ...
##  $ match_margin                  : num [1:658] 20.6 20.6 31.2 20.3 22.1 ...
##  $ match_winner                  : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ match_weather_temp_c          : num [1:658] 19.1 20.7 19.7 18.1 19.1 ...
##  $ match_weather_type            : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ player_height_cm              : num [1:658] 191 197 183 187 201 189 192 184 184 176 ...
##  $ player_weight_kg              : num [1:658] 0 0 0 0 0 0 0 0 0 0 ...
##  $ player_is_retired             : num [1:658] 0 0 0 0 0 0 0 0 0 0 ...
##  $ player_team                   : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ guernsey_number               : num [1:658] 10 26 10 8 17 35 13 22 14 10 ...
##  $ kicks                         : num [1:658] 11 5.58 10.78 9.65 3.14 ...
##  $ marks                         : num [1:658] 3.1 2.92 5 4.35 1.79 ...
##  $ handballs                     : num [1:658] 9.7 3.5 7.43 8.75 7.07 ...
##  $ disposals                     : num [1:658] 20.7 9.08 18.22 18.4 10.21 ...
##  $ effective_disposals           : num [1:658] 14.65 5.33 13.39 13.85 8.14 ...
##  $ disposal_efficiency_percentage: num [1:658] 69.5 54.8 72.7 75.5 80.5 ...
##  $ goals                         : num [1:658] 0.35 1.25 0.217 0.65 0.143 ...
##  $ behinds                       : num [1:658] 0.2 0.6667 0.7391 0.1 0.0714 ...
##  $ hitouts                       : num [1:658] 0 4.42 0 0 24.57 ...
##  $ tackles                       : num [1:658] 3.95 1.58 2.91 2.7 1.86 ...
##  $ rebounds                      : num [1:658] 1.6 0 1.217 1.65 0.143 ...
##  $ inside_fifties                : num [1:658] 2.8 1.58 3.83 1.85 1.71 ...
##  $ clearances                    : num [1:658] 4.15 0.667 1.391 2.65 2.929 ...
##  $ clangers                      : num [1:658] 2.85 2.5 2.22 2.2 2.14 ...
##  $ free_kicks_for                : num [1:658] 1 0.167 0.391 0.85 1.857 ...
##  $ free_kicks_against            : num [1:658] 0.55 0.25 0.783 0.6 1.286 ...
##  $ brownlow_votes                : num [1:658] 0 0 0 0 0 0 0 0 0 0 ...
##  $ contested_possessions         : num [1:658] 8.3 4.83 5.39 7 5.79 ...
##  $ uncontested_possessions       : num [1:658] 11.65 4.58 13.04 11.5 4.5 ...
##  $ contested_marks               : num [1:658] 0.05 1 0.174 0.2 0.429 ...
##  $ marks_inside_fifty            : num [1:658] 0.15 1.833 0.217 0.55 0.214 ...
##  $ one_percenters                : num [1:658] 1.05 1.33 1.26 1.15 2.57 ...
##  $ bounces                       : num [1:658] 0.1 0 0.174 0.2 0 ...
##  $ goal_assists                  : num [1:658] 0.6 0.417 0.783 0.2 0.357 ...
##  $ time_on_ground_percentage     : num [1:658] 73.2 87.3 78 78.3 72.7 ...
##  $ afl_fantasy_score             : num [1:658] 79.2 50.8 74 73.3 59.9 ...
##  $ supercoach_score              : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ centre_clearances             : num [1:658] 2.2 0 0.087 0.35 1.429 ...
##  $ stoppage_clearances           : num [1:658] 1.95 0.667 1.304 2.3 1.5 ...
##  $ score_involvements            : num [1:658] 5.25 4.25 5 4.05 3.21 ...
##  $ metres_gained                 : num [1:658] 282.9 121.9 304.4 236.2 97.3 ...
##  $ turnovers                     : num [1:658] 2.3 2.33 2.96 2.55 1.71 ...
##  $ intercepts                    : num [1:658] 2.9 0.333 2.696 3.2 1.143 ...
##  $ tackles_inside_fifty          : num [1:658] 0.45 0.833 0.783 0.3 0.143 ...
##  $ contest_def_losses            : num [1:658] 0 0 0.087 0 0.0714 ...
##  $ contest_def_one_on_ones       : num [1:658] 0.05 0 0.261 0.1 0.5 ...
##  $ contest_off_one_on_ones       : num [1:658] 0 2.75 0.13 0.15 0.571 ...
##  $ contest_off_wins              : num [1:658] 0 0.9167 0.0435 0 0.0714 ...
##  $ def_half_pressure_acts        : num [1:658] 9.3 1.42 6.35 6.15 3.43 ...
##  $ effective_kicks               : num [1:658] 7.1 2.67 7.09 6.25 2.57 ...
##  $ f50_ground_ball_gets          : num [1:658] 0.4 1.583 0.783 0.45 0.286 ...
##  $ ground_ball_gets              : num [1:658] 5.85 2.75 4.39 5.3 2.71 ...
##  $ hitouts_to_advantage          : num [1:658] 0 1.08 0 0 7 ...
##  $ hitout_win_percentage         : num [1:658] 0 43 0 0 46.8 ...
##  $ intercept_marks               : num [1:658] 0.25 0 0.435 0.6 0.143 ...
##  $ marks_on_lead                 : num [1:658] 0.1 1.25 0.261 0.3 0.143 ...
##  $ pressure_acts                 : num [1:658] 19.3 9 14.61 11.85 8.64 ...
##  $ rating_points                 : num [1:658] 12.26 7.33 9.2 9.77 9.36 ...
##  $ ruck_contests                 : num [1:658] 0.05 10.08 0 0 53.64 ...
##  $ score_launches                : num [1:658] 1.45 0.75 1.09 0.9 2.14 ...
##  $ shots_at_goal                 : num [1:658] 0.65 2.417 1.043 0.9 0.214 ...
##  $ spoils                        : num [1:658] 0.3 0.5 0.957 0.7 1.714 ...
##  $ subbed                        : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ player_position               : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
##  $ date                          : Date[1:658], format: "2024-05-31" "2024-05-01" ...
# bind name, team, height and position

player_details <- s24 %>%
  mutate(
    full_name = paste(player_first_name, player_last_name)
  ) %>%
  select(player_id, full_name, player_team, player_height_cm, player_position)

full_df <- left_join(player_details, player_aves, by = 'player_id')

names(full_df)
##  [1] "player_id"                      "full_name"                     
##  [3] "player_team.x"                  "player_height_cm.x"            
##  [5] "player_position.x"              "venue_name"                    
##  [7] "match_id"                       "match_home_team"               
##  [9] "match_away_team"                "match_date"                    
## [11] "match_local_time"               "match_attendance"              
## [13] "match_round"                    "match_home_team_goals"         
## [15] "match_home_team_behinds"        "match_home_team_score"         
## [17] "match_away_team_goals"          "match_away_team_behinds"       
## [19] "match_away_team_score"          "match_margin"                  
## [21] "match_winner"                   "match_weather_temp_c"          
## [23] "match_weather_type"             "player_height_cm.y"            
## [25] "player_weight_kg"               "player_is_retired"             
## [27] "player_team.y"                  "guernsey_number"               
## [29] "kicks"                          "marks"                         
## [31] "handballs"                      "disposals"                     
## [33] "effective_disposals"            "disposal_efficiency_percentage"
## [35] "goals"                          "behinds"                       
## [37] "hitouts"                        "tackles"                       
## [39] "rebounds"                       "inside_fifties"                
## [41] "clearances"                     "clangers"                      
## [43] "free_kicks_for"                 "free_kicks_against"            
## [45] "brownlow_votes"                 "contested_possessions"         
## [47] "uncontested_possessions"        "contested_marks"               
## [49] "marks_inside_fifty"             "one_percenters"                
## [51] "bounces"                        "goal_assists"                  
## [53] "time_on_ground_percentage"      "afl_fantasy_score"             
## [55] "supercoach_score"               "centre_clearances"             
## [57] "stoppage_clearances"            "score_involvements"            
## [59] "metres_gained"                  "turnovers"                     
## [61] "intercepts"                     "tackles_inside_fifty"          
## [63] "contest_def_losses"             "contest_def_one_on_ones"       
## [65] "contest_off_one_on_ones"        "contest_off_wins"              
## [67] "def_half_pressure_acts"         "effective_kicks"               
## [69] "f50_ground_ball_gets"           "ground_ball_gets"              
## [71] "hitouts_to_advantage"           "hitout_win_percentage"         
## [73] "intercept_marks"                "marks_on_lead"                 
## [75] "pressure_acts"                  "rating_points"                 
## [77] "ruck_contests"                  "score_launches"                
## [79] "shots_at_goal"                  "spoils"                        
## [81] "subbed"                         "player_position.y"             
## [83] "date"
# bind name, team, height and position

player_details <- s24 %>%
  mutate(
    full_name = paste(player_first_name, player_last_name)
  ) %>%
  select(player_id, full_name, player_team, player_height_cm, player_position)

full_df <- left_join(player_details, player_aves, by = 'player_id')

# I want the champion data official positions, not whatever these positions are (I think these might be from the listed teams on the AFL website)

# now to scrape player details using a different function call

new_details <- fetch_player_details_afl(season = 2024, team = NULL, comp = 'AFLM')
## 
ℹ Fetching player details for Adelaide Crows, 2024

✔ Fetching player details for Adelaide Crows, 2024 ... done
## 
ℹ Fetching player details for Brisbane Lions, 2024

✔ Fetching player details for Brisbane Lions, 2024 ... done
## 
ℹ Fetching player details for Collingwood, 2024

✔ Fetching player details for Collingwood, 2024 ... done
## 
ℹ Fetching player details for Gold Coast SUNS, 2024

✔ Fetching player details for Gold Coast SUNS, 2024 ... done
## 
ℹ Fetching player details for Carlton, 2024

✔ Fetching player details for Carlton, 2024 ... done
## 
ℹ Fetching player details for North Melbourne, 2024

✔ Fetching player details for North Melbourne, 2024 ... done
## 
ℹ Fetching player details for Port Adelaide, 2024

✔ Fetching player details for Port Adelaide, 2024 ... done
## 
ℹ Fetching player details for Western Bulldogs, 2024

✔ Fetching player details for Western Bulldogs, 2024 ... done
## 
ℹ Fetching player details for Hawthorn, 2024

✔ Fetching player details for Hawthorn, 2024 ... done
## 
ℹ Fetching player details for Geelong Cats, 2024

✔ Fetching player details for Geelong Cats, 2024 ... done
## 
ℹ Fetching player details for St Kilda, 2024

✔ Fetching player details for St Kilda, 2024 ... done
## 
ℹ Fetching player details for Essendon, 2024

✔ Fetching player details for Essendon, 2024 ... done
## 
ℹ Fetching player details for Sydney Swans, 2024

✔ Fetching player details for Sydney Swans, 2024 ... done
## 
ℹ Fetching player details for Fremantle, 2024

✔ Fetching player details for Fremantle, 2024 ... done
## 
ℹ Fetching player details for GWS GIANTS, 2024

✔ Fetching player details for GWS GIANTS, 2024 ... done
## 
ℹ Fetching player details for Richmond, 2024

✔ Fetching player details for Richmond, 2024 ... done
## 
ℹ Fetching player details for Melbourne, 2024

✔ Fetching player details for Melbourne, 2024 ... done
## 
ℹ Fetching player details for West Coast Eagles, 2024

✔ Fetching player details for West Coast Eagles, 2024 ... done
# inspect resulting positions

unique(new_details$position)
## [1] "KEY_FORWARD"        "MEDIUM_FORWARD"     "MEDIUM_DEFENDER"   
## [4] "KEY_DEFENDER"       "MIDFIELDER"         "RUCK"              
## [7] "MIDFIELDER_FORWARD"
# bind data frames together
# must use names rather than id since both data frames have different player id's

new_details <- new_details %>% 
  mutate(full_name = paste(firstName, surname)) %>%
  select(-c(firstName:id, season, jumperNumber, providerId, weightInKg,
            data_accessed))

afl_stats <- left_join(
  new_details, full_df, by = join_by(full_name)
)

# change character to factor

afl_stats <- afl_stats %>%
  mutate(position = as.factor(position))

# players listed with multiple positions originally have created players with multiple rows, let's slice so we only get one - all averages are the same anyway

afl_player_stats <- afl_stats %>%
  group_by(player_id) %>%
  slice(1)

Exploratory Data Analysis

afl_player_stats %>%
  ggplot(aes(disposal_efficiency_percentage)) +
  geom_histogram(bins = 50)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

Shape of the data appears to be normally distributed.

ggplot(afl_player_stats, aes(x = position, y = disposal_efficiency_percentage)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

Here we see that two positions have significantly higher disposal efficiency than other positions, with Key Defenders #1 and Medium Defenders #2.

Building Models

m0 <- lm(
  disposal_efficiency_percentage ~ 1,
  afl_player_stats
)

m1 <- lm(
  disposal_efficiency_percentage ~ position,
  afl_player_stats
)

m2 <- lm(
  disposal_efficiency_percentage ~ position + heightInCm,
  afl_player_stats
)

Evaluating Accuracy

compare_performance(m0, m1, m2)
## # Comparison of Model Performance Indices
## 
## Name | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
## ---------------------------------------------------------------------------------------------------
## m0   |    lm | 4836.2 (<.001) | 4836.2 (<.001) | 4845.2 (<.001) | 0.000 |     0.000 | 9.516 | 9.523
## m1   |    lm | 4588.7 (0.288) | 4588.9 (0.293) | 4624.6 (0.792) | 0.326 |     0.320 | 7.812 | 7.854
## m2   |    lm | 4586.8 (0.712) | 4587.1 (0.707) | 4627.2 (0.208) | 0.330 |     0.323 | 7.790 | 7.838

Model m2 is the superior model as it has the lowest AIC, albeit very slightly, with 4586.8. Perhaps more variables are needed for this model. I attempted to add a ‘international variable’ using an indicator variable if the player was born in Australia or Overseas, suspecting that Overseas born players may have a lower disposal efficiency. However, the recruitedFrom variable has many NA values, including players like Mykelti Lefau (born in New Zealand).

Interpreting Final Model

tab_model(m2)
  disposal efficiency
percentage
Predictors Estimates CI p
(Intercept) 53.39 26.42 – 80.37 <0.001
position [KEY_FORWARD] -13.93 -16.35 – -11.50 <0.001
position
[MEDIUM_DEFENDER]
-0.34 -2.93 – 2.25 0.797
position [MEDIUM_FORWARD] -10.97 -13.91 – -8.02 <0.001
position [MIDFIELDER] -8.94 -11.48 – -6.40 <0.001
position
[MIDFIELDER_FORWARD]
-9.55 -12.91 – -6.20 <0.001
position [RUCK] -16.72 -19.87 – -13.57 <0.001
heightInCm 0.14 -0.00 – 0.27 0.052
Observations 658
R2 / R2 adjusted 0.330 / 0.323

Height is suprising correlated with imroved disposal efficiency, with all players equal heights, Medium defenders would have the best efficiency, with a baseline of 53.35 and Rucks would have the worst with a baseline of 36.67. However, efficiency increases by 0.14% with every cm increase in height.

Linear Model Assumptions

Certain assumptions need to be met in order to use linear modelling for prediction.

par(mfrow = c(2, 2))
plot(m2)

  1. Residuals vs Fitted plot (top left)

    Trendline is flat, validating the assumption of linearity

  2. Q-Q Residuals plot (top right)

    Trendline linear towards the centre of the data but quite non-linear at the tails - this assumption may be violated as the distribution may not be normal.

  3. Scale-Location plot (bottom left)
    Spread of residuals appears random (although the no values between 72 and 76 gives me pause).

  4. Residuals vs Leverage (bottom right)
    Cook’s distance is not displayed, indicating an absence of influential cases, meaning the model wouldn’t change significantly is any data points were excluded.

Based on assumption 2 appearing to be violated, it seems as though our model needs to be tweaked further in order to fully trust it’s future predictions.