Join me as I build a linear model from AFL data scraped by the fryzigg function call in the package fitzRoy in RStudio.
Here is a brief outline of what we’ll do in this session:
# import data from fryzigg
s24 <- fetch_player_stats_fryzigg(season = 2024)
##
ℹ Returning cached AFLM data from 2024
✔ Returning cached AFLM data from 2024 ... done
# average statistics
player_aves <- s24 %>%
select(-c(player_first_name, player_last_name)) %>%
group_by(player_id) %>%
summarise_all(list(mean))
## Warning: There were 7238 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `venue_name = (function (x, ...) ...`.
## ℹ In group 1: `player_id = 11506`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 7237 remaining warnings.
# inspect variables
str(player_aves)
## tibble [658 × 79] (S3: tbl_df/tbl/data.frame)
## $ player_id : int [1:658] 11506 11554 11591 11683 11692 11706 11724 11759 11766 11770 ...
## $ venue_name : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ match_id : num [1:658] 16709 16665 16737 16747 16670 ...
## $ match_home_team : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ match_away_team : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ match_date : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ match_local_time : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ match_attendance : num [1:658] 0 0 0 0 0 0 0 0 0 0 ...
## $ match_round : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ match_home_team_goals : num [1:658] 12.8 12.8 12 13.1 13.3 ...
## $ match_home_team_behinds : num [1:658] 11 11.4 12.7 11.4 10.9 ...
## $ match_home_team_score : num [1:658] 87.8 88.4 84.7 90 90.6 ...
## $ match_away_team_goals : num [1:658] 11.6 12.2 11 11.7 11.5 ...
## $ match_away_team_behinds : num [1:658] 11.7 10.7 10.3 10.2 10.6 ...
## $ match_away_team_score : num [1:658] 81 83.7 76.1 80 79.6 ...
## $ match_margin : num [1:658] 20.6 20.6 31.2 20.3 22.1 ...
## $ match_winner : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ match_weather_temp_c : num [1:658] 19.1 20.7 19.7 18.1 19.1 ...
## $ match_weather_type : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ player_height_cm : num [1:658] 191 197 183 187 201 189 192 184 184 176 ...
## $ player_weight_kg : num [1:658] 0 0 0 0 0 0 0 0 0 0 ...
## $ player_is_retired : num [1:658] 0 0 0 0 0 0 0 0 0 0 ...
## $ player_team : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ guernsey_number : num [1:658] 10 26 10 8 17 35 13 22 14 10 ...
## $ kicks : num [1:658] 11 5.58 10.78 9.65 3.14 ...
## $ marks : num [1:658] 3.1 2.92 5 4.35 1.79 ...
## $ handballs : num [1:658] 9.7 3.5 7.43 8.75 7.07 ...
## $ disposals : num [1:658] 20.7 9.08 18.22 18.4 10.21 ...
## $ effective_disposals : num [1:658] 14.65 5.33 13.39 13.85 8.14 ...
## $ disposal_efficiency_percentage: num [1:658] 69.5 54.8 72.7 75.5 80.5 ...
## $ goals : num [1:658] 0.35 1.25 0.217 0.65 0.143 ...
## $ behinds : num [1:658] 0.2 0.6667 0.7391 0.1 0.0714 ...
## $ hitouts : num [1:658] 0 4.42 0 0 24.57 ...
## $ tackles : num [1:658] 3.95 1.58 2.91 2.7 1.86 ...
## $ rebounds : num [1:658] 1.6 0 1.217 1.65 0.143 ...
## $ inside_fifties : num [1:658] 2.8 1.58 3.83 1.85 1.71 ...
## $ clearances : num [1:658] 4.15 0.667 1.391 2.65 2.929 ...
## $ clangers : num [1:658] 2.85 2.5 2.22 2.2 2.14 ...
## $ free_kicks_for : num [1:658] 1 0.167 0.391 0.85 1.857 ...
## $ free_kicks_against : num [1:658] 0.55 0.25 0.783 0.6 1.286 ...
## $ brownlow_votes : num [1:658] 0 0 0 0 0 0 0 0 0 0 ...
## $ contested_possessions : num [1:658] 8.3 4.83 5.39 7 5.79 ...
## $ uncontested_possessions : num [1:658] 11.65 4.58 13.04 11.5 4.5 ...
## $ contested_marks : num [1:658] 0.05 1 0.174 0.2 0.429 ...
## $ marks_inside_fifty : num [1:658] 0.15 1.833 0.217 0.55 0.214 ...
## $ one_percenters : num [1:658] 1.05 1.33 1.26 1.15 2.57 ...
## $ bounces : num [1:658] 0.1 0 0.174 0.2 0 ...
## $ goal_assists : num [1:658] 0.6 0.417 0.783 0.2 0.357 ...
## $ time_on_ground_percentage : num [1:658] 73.2 87.3 78 78.3 72.7 ...
## $ afl_fantasy_score : num [1:658] 79.2 50.8 74 73.3 59.9 ...
## $ supercoach_score : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ centre_clearances : num [1:658] 2.2 0 0.087 0.35 1.429 ...
## $ stoppage_clearances : num [1:658] 1.95 0.667 1.304 2.3 1.5 ...
## $ score_involvements : num [1:658] 5.25 4.25 5 4.05 3.21 ...
## $ metres_gained : num [1:658] 282.9 121.9 304.4 236.2 97.3 ...
## $ turnovers : num [1:658] 2.3 2.33 2.96 2.55 1.71 ...
## $ intercepts : num [1:658] 2.9 0.333 2.696 3.2 1.143 ...
## $ tackles_inside_fifty : num [1:658] 0.45 0.833 0.783 0.3 0.143 ...
## $ contest_def_losses : num [1:658] 0 0 0.087 0 0.0714 ...
## $ contest_def_one_on_ones : num [1:658] 0.05 0 0.261 0.1 0.5 ...
## $ contest_off_one_on_ones : num [1:658] 0 2.75 0.13 0.15 0.571 ...
## $ contest_off_wins : num [1:658] 0 0.9167 0.0435 0 0.0714 ...
## $ def_half_pressure_acts : num [1:658] 9.3 1.42 6.35 6.15 3.43 ...
## $ effective_kicks : num [1:658] 7.1 2.67 7.09 6.25 2.57 ...
## $ f50_ground_ball_gets : num [1:658] 0.4 1.583 0.783 0.45 0.286 ...
## $ ground_ball_gets : num [1:658] 5.85 2.75 4.39 5.3 2.71 ...
## $ hitouts_to_advantage : num [1:658] 0 1.08 0 0 7 ...
## $ hitout_win_percentage : num [1:658] 0 43 0 0 46.8 ...
## $ intercept_marks : num [1:658] 0.25 0 0.435 0.6 0.143 ...
## $ marks_on_lead : num [1:658] 0.1 1.25 0.261 0.3 0.143 ...
## $ pressure_acts : num [1:658] 19.3 9 14.61 11.85 8.64 ...
## $ rating_points : num [1:658] 12.26 7.33 9.2 9.77 9.36 ...
## $ ruck_contests : num [1:658] 0.05 10.08 0 0 53.64 ...
## $ score_launches : num [1:658] 1.45 0.75 1.09 0.9 2.14 ...
## $ shots_at_goal : num [1:658] 0.65 2.417 1.043 0.9 0.214 ...
## $ spoils : num [1:658] 0.3 0.5 0.957 0.7 1.714 ...
## $ subbed : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ player_position : num [1:658] NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date[1:658], format: "2024-05-31" "2024-05-01" ...
# bind name, team, height and position
player_details <- s24 %>%
mutate(
full_name = paste(player_first_name, player_last_name)
) %>%
select(player_id, full_name, player_team, player_height_cm, player_position)
full_df <- left_join(player_details, player_aves, by = 'player_id')
names(full_df)
## [1] "player_id" "full_name"
## [3] "player_team.x" "player_height_cm.x"
## [5] "player_position.x" "venue_name"
## [7] "match_id" "match_home_team"
## [9] "match_away_team" "match_date"
## [11] "match_local_time" "match_attendance"
## [13] "match_round" "match_home_team_goals"
## [15] "match_home_team_behinds" "match_home_team_score"
## [17] "match_away_team_goals" "match_away_team_behinds"
## [19] "match_away_team_score" "match_margin"
## [21] "match_winner" "match_weather_temp_c"
## [23] "match_weather_type" "player_height_cm.y"
## [25] "player_weight_kg" "player_is_retired"
## [27] "player_team.y" "guernsey_number"
## [29] "kicks" "marks"
## [31] "handballs" "disposals"
## [33] "effective_disposals" "disposal_efficiency_percentage"
## [35] "goals" "behinds"
## [37] "hitouts" "tackles"
## [39] "rebounds" "inside_fifties"
## [41] "clearances" "clangers"
## [43] "free_kicks_for" "free_kicks_against"
## [45] "brownlow_votes" "contested_possessions"
## [47] "uncontested_possessions" "contested_marks"
## [49] "marks_inside_fifty" "one_percenters"
## [51] "bounces" "goal_assists"
## [53] "time_on_ground_percentage" "afl_fantasy_score"
## [55] "supercoach_score" "centre_clearances"
## [57] "stoppage_clearances" "score_involvements"
## [59] "metres_gained" "turnovers"
## [61] "intercepts" "tackles_inside_fifty"
## [63] "contest_def_losses" "contest_def_one_on_ones"
## [65] "contest_off_one_on_ones" "contest_off_wins"
## [67] "def_half_pressure_acts" "effective_kicks"
## [69] "f50_ground_ball_gets" "ground_ball_gets"
## [71] "hitouts_to_advantage" "hitout_win_percentage"
## [73] "intercept_marks" "marks_on_lead"
## [75] "pressure_acts" "rating_points"
## [77] "ruck_contests" "score_launches"
## [79] "shots_at_goal" "spoils"
## [81] "subbed" "player_position.y"
## [83] "date"
# bind name, team, height and position
player_details <- s24 %>%
mutate(
full_name = paste(player_first_name, player_last_name)
) %>%
select(player_id, full_name, player_team, player_height_cm, player_position)
full_df <- left_join(player_details, player_aves, by = 'player_id')
# I want the champion data official positions, not whatever these positions are (I think these might be from the listed teams on the AFL website)
# now to scrape player details using a different function call
new_details <- fetch_player_details_afl(season = 2024, team = NULL, comp = 'AFLM')
##
ℹ Fetching player details for Adelaide Crows, 2024
✔ Fetching player details for Adelaide Crows, 2024 ... done
##
ℹ Fetching player details for Brisbane Lions, 2024
✔ Fetching player details for Brisbane Lions, 2024 ... done
##
ℹ Fetching player details for Collingwood, 2024
✔ Fetching player details for Collingwood, 2024 ... done
##
ℹ Fetching player details for Gold Coast SUNS, 2024
✔ Fetching player details for Gold Coast SUNS, 2024 ... done
##
ℹ Fetching player details for Carlton, 2024
✔ Fetching player details for Carlton, 2024 ... done
##
ℹ Fetching player details for North Melbourne, 2024
✔ Fetching player details for North Melbourne, 2024 ... done
##
ℹ Fetching player details for Port Adelaide, 2024
✔ Fetching player details for Port Adelaide, 2024 ... done
##
ℹ Fetching player details for Western Bulldogs, 2024
✔ Fetching player details for Western Bulldogs, 2024 ... done
##
ℹ Fetching player details for Hawthorn, 2024
✔ Fetching player details for Hawthorn, 2024 ... done
##
ℹ Fetching player details for Geelong Cats, 2024
✔ Fetching player details for Geelong Cats, 2024 ... done
##
ℹ Fetching player details for St Kilda, 2024
✔ Fetching player details for St Kilda, 2024 ... done
##
ℹ Fetching player details for Essendon, 2024
✔ Fetching player details for Essendon, 2024 ... done
##
ℹ Fetching player details for Sydney Swans, 2024
✔ Fetching player details for Sydney Swans, 2024 ... done
##
ℹ Fetching player details for Fremantle, 2024
✔ Fetching player details for Fremantle, 2024 ... done
##
ℹ Fetching player details for GWS GIANTS, 2024
✔ Fetching player details for GWS GIANTS, 2024 ... done
##
ℹ Fetching player details for Richmond, 2024
✔ Fetching player details for Richmond, 2024 ... done
##
ℹ Fetching player details for Melbourne, 2024
✔ Fetching player details for Melbourne, 2024 ... done
##
ℹ Fetching player details for West Coast Eagles, 2024
✔ Fetching player details for West Coast Eagles, 2024 ... done
# inspect resulting positions
unique(new_details$position)
## [1] "KEY_FORWARD" "MEDIUM_FORWARD" "MEDIUM_DEFENDER"
## [4] "KEY_DEFENDER" "MIDFIELDER" "RUCK"
## [7] "MIDFIELDER_FORWARD"
# bind data frames together
# must use names rather than id since both data frames have different player id's
new_details <- new_details %>%
mutate(full_name = paste(firstName, surname)) %>%
select(-c(firstName:id, season, jumperNumber, providerId, weightInKg,
data_accessed))
afl_stats <- left_join(
new_details, full_df, by = join_by(full_name)
)
# change character to factor
afl_stats <- afl_stats %>%
mutate(position = as.factor(position))
# players listed with multiple positions originally have created players with multiple rows, let's slice so we only get one - all averages are the same anyway
afl_player_stats <- afl_stats %>%
group_by(player_id) %>%
slice(1)
afl_player_stats %>%
ggplot(aes(disposal_efficiency_percentage)) +
geom_histogram(bins = 50)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
Shape of the data appears to be normally distributed.
ggplot(afl_player_stats, aes(x = position, y = disposal_efficiency_percentage)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
Here we see that two positions have significantly higher disposal efficiency than other positions, with Key Defenders #1 and Medium Defenders #2.
m0 <- lm(
disposal_efficiency_percentage ~ 1,
afl_player_stats
)
m1 <- lm(
disposal_efficiency_percentage ~ position,
afl_player_stats
)
m2 <- lm(
disposal_efficiency_percentage ~ position + heightInCm,
afl_player_stats
)
compare_performance(m0, m1, m2)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2 | R2 (adj.) | RMSE | Sigma
## ---------------------------------------------------------------------------------------------------
## m0 | lm | 4836.2 (<.001) | 4836.2 (<.001) | 4845.2 (<.001) | 0.000 | 0.000 | 9.516 | 9.523
## m1 | lm | 4588.7 (0.288) | 4588.9 (0.293) | 4624.6 (0.792) | 0.326 | 0.320 | 7.812 | 7.854
## m2 | lm | 4586.8 (0.712) | 4587.1 (0.707) | 4627.2 (0.208) | 0.330 | 0.323 | 7.790 | 7.838
Model m2 is the superior model as it has the lowest AIC, albeit very slightly, with 4586.8. Perhaps more variables are needed for this model. I attempted to add a ‘international variable’ using an indicator variable if the player was born in Australia or Overseas, suspecting that Overseas born players may have a lower disposal efficiency. However, the recruitedFrom variable has many NA values, including players like Mykelti Lefau (born in New Zealand).
tab_model(m2)
disposal efficiency percentage |
|||
---|---|---|---|
Predictors | Estimates | CI | p |
(Intercept) | 53.39 | 26.42 – 80.37 | <0.001 |
position [KEY_FORWARD] | -13.93 | -16.35 – -11.50 | <0.001 |
position [MEDIUM_DEFENDER] |
-0.34 | -2.93 – 2.25 | 0.797 |
position [MEDIUM_FORWARD] | -10.97 | -13.91 – -8.02 | <0.001 |
position [MIDFIELDER] | -8.94 | -11.48 – -6.40 | <0.001 |
position [MIDFIELDER_FORWARD] |
-9.55 | -12.91 – -6.20 | <0.001 |
position [RUCK] | -16.72 | -19.87 – -13.57 | <0.001 |
heightInCm | 0.14 | -0.00 – 0.27 | 0.052 |
Observations | 658 | ||
R2 / R2 adjusted | 0.330 / 0.323 |
Height is suprising correlated with imroved disposal efficiency, with all players equal heights, Medium defenders would have the best efficiency, with a baseline of 53.35 and Rucks would have the worst with a baseline of 36.67. However, efficiency increases by 0.14% with every cm increase in height.
Certain assumptions need to be met in order to use linear modelling for prediction.
par(mfrow = c(2, 2))
plot(m2)
Residuals vs Fitted plot (top left)
Trendline is flat, validating the assumption of linearity
Q-Q Residuals plot (top right)
Trendline linear towards the centre of the data but quite non-linear at the tails - this assumption may be violated as the distribution may not be normal.
Scale-Location plot (bottom left)
Spread of residuals appears random (although the no values between 72
and 76 gives me pause).
Residuals vs Leverage (bottom right)
Cook’s distance is not displayed, indicating an absence of influential
cases, meaning the model wouldn’t change significantly is any data
points were excluded.
Based on assumption 2 appearing to be violated, it seems as though our model needs to be tweaked further in order to fully trust it’s future predictions.