The first step in using R to analyse data about Australian Rules Football is to get the data in the right format.
This article describes how to work with AFL data supplied by Sorenson Technologies Pty Ltd ( http://www.sorensen.com.au/index.html ). Sorenson's provide AFL data for either
Other sources of AFL data include:
Champion Data - the main supplier of AFL data, but for the sports enthusiast it's only available either from the press or by scraping various websites.
AFL Tables ( http://afltables.com/afl/afl_index.html ). Depending on the data, it's available either as a table which you can scrape or as a downloadable file.
This article will refer to data supplied by Sorenson Technologies, as that's what I use.
The data supplied by Sorenson Technologies is contained in three files:
Details of each file are listed on my blog:
http://afandr.blogspot.com.au/2016/08/afldataplayers.html
http://afandr.blogspot.com.au/2016/08/afldatamatches.html
http://afandr.blogspot.com.au/2016/08/afldataladders.html
The Players data file is the file that contains information about match activity, and that is the file we will be using.
The first step is to import the data file and load R packages used in the processing. I also set the working directory to make saving intermediate files easier.
The R packages used are:
dplyr - a "Hadley Wickham" package with a lot of very useful data processing functions
stringr - another "Hadley Wickham" package which provides a set of simple and consistent string processing functions
data.table - a data processing package that has similar functions to dplyr.
This code assumes the packages are already installed. If they are not, you will need to install them using the install.packages() function.
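As a minimal sketch (not part of my original workflow), the following checks which of the three packages are missing before anything is loaded. The variable names are made up for illustration, and the code only reports what is missing rather than installing anything.

```r
# Packages used in this article
required_pkgs <- c("dplyr", "stringr", "data.table")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install with: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
}
```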
setwd("C:/Users/Graham/Dropbox/1 Sport/Data - Main AFL/Temp Files Go Here")
library(dplyr)
library(stringr)
library(data.table)
# Read in data file from "Use_this_data" folder
# stringsAsFactors = TRUE keeps text columns as factors (this was the default before R 4.0.0)
data <- read.csv("C:/Users/Graham/Dropbox/1 Sport/Data - Main AFL/Use_this_data/2011_2015 Player_Consolidated.csv",
                 stringsAsFactors = TRUE)
Once the data is imported, the next step is to confirm that the file structure and the spelling of team names haven't changed. We won't go through the code to do that; however, I have checked for data consistency and found two spellings of the Brisbane Lions in the data ("Brisbane" and "Brisbane Lions").
The following code converts instances of "Brisbane" to "Brisbane Lions" (this applies to two variables: Team_name and Opposing_team).
You can use the levels function to display a list of factor levels; for example: levels(data$Opposing_team)
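To illustrate the check and the fix on something small, here is a sketch using a made-up vector of team names (toy_teams is invented for illustration and is not the Sorenson data). It lists the distinct spellings, applies the same two gsub() steps used on the real data, and lists the spellings again.

```r
# Hypothetical team names containing both spellings of the Brisbane Lions
toy_teams <- factor(c("Brisbane", "Brisbane Lions", "Carlton", "Brisbane"))

# Before: both spellings appear as separate factor levels
levels(toy_teams)

# Step 1: strip " Lions" so that every Brisbane row reads "Brisbane"
step1 <- gsub(" Lions", "", toy_teams)
# Step 2: expand "Brisbane" to "Brisbane Lions"
fixed_teams <- as.factor(gsub("Brisbane", "Brisbane Lions", step1))

# After: only one spelling of the Brisbane Lions remains
levels(fixed_teams)
```

Doing the replacement in two steps avoids turning an existing "Brisbane Lions" into "Brisbane Lions Lions".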
# this ensures consistent spelling of Brisbane in "Team_name" variable
data$Team_name <- as.factor(gsub(" Lions", "", data$Team_name))
data$Team_name <- as.factor(gsub("Brisbane", "Brisbane Lions", data$Team_name))
# this ensures consistent spelling of Brisbane in "Opposing_team" variable
data$Opposing_team <- as.factor(gsub(" Lions", "", data$Opposing_team))
data$Opposing_team <- as.factor(gsub("Brisbane", "Brisbane Lions", data$Opposing_team))
The concept of "tidy data" is one that has been popularised by Hadley Wickham. To quote Wickham, "tidy data sets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table".
The exact structure of a tidy data set will be context specific.
The structure of the data files provided by Sorenson Technologies is "one row per player per game". This means there are 46 rows of data for each game: 22 players per team, two teams, and one row per team for "rushed behinds" (22 x 2 + 2 = 46).
That structure will be "tidy" if the focus is on the player. For the time being, my focus is on the team. So there are two data manipulation steps I need to do:
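The row count gives a simple sanity check on the file structure. The sketch below uses an invented one-game data frame (toy_players, not the real file) and base R only; it counts the rows for each team in each game and flags any team-game that does not have the expected 23 rows (22 players plus the "rushed behinds" row).

```r
# Toy data: one game, two teams, 23 rows per team (22 players + 1 "rushed behinds" row)
toy_players <- data.frame(
  Season = 2015,
  Round = 1,
  Team_name = rep(c("Adelaide", "Hawthorn"), each = 23),
  stringsAsFactors = FALSE
)

# Count the rows for each team in each game
toy_players$n <- 1L
rows_per_team <- aggregate(n ~ Season + Round + Team_name, data = toy_players, FUN = sum)

# Any team-game without exactly 23 rows deserves a closer look
unexpected <- rows_per_team[rows_per_team$n != 23, ]
nrow(unexpected)
```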
summarise player data into team data
for each game for each team, match with their opponent's game data.
The second step is necessary because a team's performance is meaningful only in the context of how their opponents performed - who won, who lost.
The following code summarises player data into team data (one row per team per game; i.e. with two teams, that means two rows per game).
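Before running this on the full file, the idea can be tried on a toy data frame. This is only a sketch: the numbers are made up, there are two players per team instead of 22, and only two statistics are summed.

```r
library(dplyr)

# Toy player-level data: two players per team, made-up numbers
toy_players <- data.frame(
  Season = 2015,
  Round = 1,
  Team_name = c("Adelaide", "Adelaide", "Hawthorn", "Hawthorn"),
  Opposing_team = c("Hawthorn", "Hawthorn", "Adelaide", "Adelaide"),
  Kicks = c(10, 12, 8, 15),
  Goals = c(1, 2, 0, 3),
  stringsAsFactors = FALSE
)

# Collapse the player rows into one row per team per game
toy_game_stats <- toy_players %>%
  group_by(Season, Round, Team_name, Opposing_team) %>%
  summarise(Kicks = sum(Kicks), goals = sum(Goals))

toy_game_stats
```

The four player rows collapse into two team rows, one for each side of the game.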
# This line summarises data by game, with two rows per game
game_stats <- data %>%
  group_by(Season, Round, Team_name, Opposing_team) %>%
  summarise(Kicks = sum(Kicks),
            Marks = sum(Marks),
            Contested.Marks = sum(Contested_marks),
            Uncontested.Marks = sum(Uncontested_marks),
            Handball.s = sum(Handballs),
            Effective.Possessions = sum(Effective_possessions),
            Contsted.Possessions = sum(Contested_possessions),
            Uncontested.Possessions = sum(Uncontested_possessions),
            goals = sum(Goals),
            behinds = sum(Behinds),
            hitouts = sum(Hitouts),
            tackes = sum(Tackles),
            rebounds = sum(Rebounds),
            inside50 = sum(Inside50),
            clearances = sum(Clearances),
            clangers = sum(Clangers),
            Frees.for = sum(Frees_for),
            frees.against = sum(Frees_against),
            assists = sum(Assists),
            brownlow.votes = sum(Brownlow_votes),
            marks.inside.50 = sum(Marks_inside50),
            one.percenters = sum(One_percenters),
            bounces = sum(Bounces),
            centre.clearances = sum(Centre_clearances),
            stoppages = sum(Stoppages))
The next step (in a roundabout way) matches, for each game, each team with its opponent.
The first section of code creates a function called "find_first_team". It builds a key which, in conjunction with "Season" and "Round", links together the two teams (rows) that played each other in the one game.
The key consists of the first 6 characters of each team name, joined in alphabetical order.
The key is added to each row in the variable named "abr".
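To show what the key looks like, here is a small base R sketch of the same idea. The function name make_game_key is hypothetical and it uses substr() rather than stringr, so it runs without any packages.

```r
# Base R sketch of the key: first 6 characters of each name, lower case, alphabetical order
make_game_key <- function(x, y) {
  a <- tolower(substr(x, 1, 6))
  b <- tolower(substr(y, 1, 6))
  ifelse(a < b, paste0(a, b), paste0(b, a))
}

make_game_key("Adelaide", "Hawthorn")  # "adelaihawtho"
make_game_key("Hawthorn", "Adelaide")  # "adelaihawtho"
```

Because the two halves are always put in alphabetical order, both rows for a game receive the same key regardless of which team is listed first.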
# Next section of code converts 2 row per game format to one row per game format
# This function takes as input the names of two football teams (x and y respectively)
# The first 6 characters of the team names are then assigned to a and b respectively
# a and b are compared using < operator
# The output of the function is 12 characters and comprised of the two 6 character segments, in alphabetical order
find_first_team <- function(x, y) {
  a <- tolower(str_sub(x, 1, 6))
  b <- tolower(str_sub(y, 1, 6))
  ifelse(a < b, paste0(a, b), paste0(b, a))
}
# Assigns output of function to new field in game_stats
game_stats$abr <- find_first_team(game_stats$Team_name, game_stats$Opposing_team)
The following code section takes the game_stats file and removes one of the two rows for each game, leaving one row per game.
This is accomplished using the unique() function in the data.table package. Whilst this file has one row per game, it does not include any data about the opposing team.
This file version is called "new_test_1".
The left_join() function comes from the dplyr package and is used to link the two files; the setkeyv() function (from the data.table package) sets the key columns that identify a game.
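The de-duplication step can be illustrated without data.table. The following is a base R sketch of the same idea using duplicated() on an invented two-game data frame (toy_games); it is an equivalent for illustration, not the code used in this article.

```r
# Toy team-level data: two games, two rows per game, sharing a key within each game
toy_games <- data.frame(
  Season = c(2015, 2015, 2015, 2015),
  Round = c(1, 1, 1, 1),
  Team_name = c("Adelaide", "Hawthorn", "Carlton", "Richmond"),
  abr = c("adelaihawtho", "adelaihawtho", "carltorichmo", "carltorichmo"),
  stringsAsFactors = FALSE
)

# Columns that together identify a game
key_cols <- c("Season", "Round", "abr")

# Keep only the first row seen for each game
one_row_per_game <- toy_games[!duplicated(toy_games[, key_cols]), ]
one_row_per_game
```

Which of the two rows survives depends only on row order, which is why the opponent's statistics have to be joined back on afterwards.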
new_test_1 <- game_stats
#Uses datatable package to use setkey and unique functions
new_test_1 <- data.table(new_test_1)
# Creates a key on nominated columns
setkeyv(new_test_1, c("Season", "Round", "abr"))
# unique() returns a data.table with duplicated rows removed. Duplication is based on the columns named in "by"
# (since data.table 1.9.8, unique() no longer uses the key by default, so the columns are named explicitly)
# In this case, the duplication is defined by Season, Round, abr
# abr is the field created above by combining, in alphabetical order, the first 6 characters of the team names
# for example: adelaihawtho
# The original file has two rows per game
# This step removes one of the two rows per game
# Required as part of the process to convert file from 2 rows per game to one row per game
write.csv(new_test_1, file = "new_test_1.csv")
new_test_2 <- unique(new_test_1, by = c("Season", "Round", "abr"))
write.csv(new_test_2, file = "new_test_2.csv")
# Convert new_test_2 to data frame, so that dplyr functions can be used
new_test_2 <- data.frame(new_test_2)
final2 <- left_join(new_test_2, game_stats, by = c("Season" = "Season", "Round" = "Round", "Opposing_team" = "Team_name"))
final2$Opposing_team.y <- final2$Opposing_team
write.csv(final2, file = "game_data_wide.csv")
###################################
The output file, final2 (or "game_data_wide", as the saved version is called), contains one row per game, with the playing statistics of each team included.
The next step in the process is to take the data file where there is one row per game and create a summary - for whatever time frame is covered by the data file - of each team that includes a summary of both
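The join that attaches the opponent's statistics can be seen on a single toy game. This sketch uses base R merge() in place of left_join(), with invented numbers and a hypothetical opp_goals column name, to show how matching "Opposing_team" in one file to "Team_name" in the other puts both teams on one row.

```r
# Toy team-level data: one game, two rows (one per team), made-up goals
team_rows <- data.frame(
  Season = c(2015, 2015),
  Round = c(1, 1),
  Team_name = c("Adelaide", "Hawthorn"),
  Opposing_team = c("Hawthorn", "Adelaide"),
  goals = c(12, 15),
  stringsAsFactors = FALSE
)

# One row per game: keep the first team's row only
one_per_game <- team_rows[1, ]

# Opponent lookup: the same data, with the statistic renamed to avoid a clash
opponent <- team_rows[, c("Season", "Round", "Team_name", "goals")]
names(opponent)[names(opponent) == "goals"] <- "opp_goals"

# Match each game's Opposing_team to that team's own row
wide <- merge(one_per_game, opponent,
              by.x = c("Season", "Round", "Opposing_team"),
              by.y = c("Season", "Round", "Team_name"),
              all.x = TRUE)
wide
```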
A more detailed summary would show the playing team with each opposition team shown separately.