The first step in using R to analyse data about Australian Rules Football is to get the data in the right format.

This article describes how to work with AFL data supplied by Sorenson Technologies Pty Ltd (http://www.sorensen.com.au/index.html). Sorenson provides AFL data for either

This article refers to data supplied by Sorenson Technologies, as that is what I use.

Sorenson Data Files

The Players data file contains information about match activity, and it is the file we will be using.

Setup

The first step is to import the data file and load R packages used in the processing. I also set the working directory to make saving intermediate files easier.

This code assumes the packages are already installed. If they are not, install them first using the install.packages() function.
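As a rough sketch of that check, the snippet below reports which of the three packages used in this article are missing. The names required_pkgs, is_installed and to_install are illustrative only, and the install call is left commented out so that nothing is installed by accident.

```r
# Packages loaded later in this article
required_pkgs <- c("dplyr", "stringr", "data.table")

# TRUE for each package that can be loaded, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need installing
to_install <- required_pkgs[!is_installed]

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```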

setwd("C:/Users/Graham/Dropbox/1 Sport/Data - Main AFL/Temp Files Go Here")


library(dplyr)
library(stringr)
library(data.table)


# Read in data file from "Use_this_data" folder


data <- read.csv("C:/Users/Graham/Dropbox/1 Sport/Data - Main AFL/Use_this_data/2011_2015 Player_Consolidated.csv")

Once the data is imported, the next step is to confirm that the file structure and the spelling of team names haven't changed. We won't go through the code to do that; however, I have checked for data consistency and found that the Brisbane Lions have been spelled two different ways.

The following code converts instances of "Brisbane" to "Brisbane Lions" (this applies to two variables: Team_name and Opposing_team).

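You can use the levels() function to display a list of factor levels; for example: levels(data$Opposing_team). Note that from R 4.0.0 onwards read.csv() reads text columns as character rather than factor by default, in which case levels() returns NULL and unique(data$Opposing_team) shows the distinct values instead.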
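A minimal sketch of that kind of check, using a small made-up factor (the vector teams is hypothetical and simply stands in for a column such as data$Opposing_team):

```r
# Hypothetical team names with two spellings of the same club
teams <- factor(c("Brisbane", "Brisbane Lions", "Carlton", "Brisbane Lions"))

# Distinct spellings present in the factor
team_levels <- levels(teams)
print(team_levels)

# How many rows use each spelling
team_counts <- table(teams)
print(team_counts)
```

Seeing both "Brisbane" and "Brisbane Lions" in the output is the signal that the names need to be harmonised.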

# Ensure consistent spelling of Brisbane in the "Team_name" variable.
# Step 1 strips any existing " Lions" suffix; step 2 then adds it back to every "Brisbane",
# so names that were already correct do not end up as "Brisbane Lions Lions".

data$Team_name <- as.factor(gsub(" Lions", "", data$Team_name))
data$Team_name <- as.factor(gsub("Brisbane", "Brisbane Lions", data$Team_name))

# Same two steps for the "Opposing_team" variable

data$Opposing_team <- as.factor(gsub(" Lions", "", data$Opposing_team))
data$Opposing_team <- as.factor(gsub("Brisbane", "Brisbane Lions", data$Opposing_team))
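To see why the replacement is done in two passes, here is a small sketch on a made-up character vector (teams, one_step, stripped and two_step are illustrative names, not part of the data file):

```r
# Hypothetical names containing both spellings
teams <- c("Brisbane", "Brisbane Lions", "Carlton")

# A single replacement also hits the names that were already correct
one_step <- gsub("Brisbane", "Brisbane Lions", teams)
print(one_step)

# Stripping " Lions" first gives every Brisbane row the same starting point
stripped <- gsub(" Lions", "", teams)
two_step <- gsub("Brisbane", "Brisbane Lions", stripped)
print(two_step)
```

The single pass leaves one name as "Brisbane Lions Lions", whereas the two-pass version gives "Brisbane Lions" for both Brisbane rows and leaves "Carlton" untouched.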

The concept of "tidy data" is one that has been popularised by Hadley Wickham. To quote Wickham, "tidy data sets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table".

The structure of the data files provided by Sorenson Technologies is "one row per player per game". This means there are 46 rows of data for each game: 22 players per team, two teams, and a row for each team for "rushed behinds".
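The row count can be checked with a quick tally. The sketch below builds a hypothetical single game (toy_players is made up and carries only the identifying columns) and counts the rows per team; on the real file the same tally could be run on data.

```r
library(dplyr)

# Hypothetical single game: 22 players plus one "rushed behinds" row per team
toy_players <- data.frame(
  Season = 2015,
  Round = 1,
  Team_name = rep(c("Adelaide", "Hawthorn"), each = 23),
  Opposing_team = rep(c("Hawthorn", "Adelaide"), each = 23),
  stringsAsFactors = FALSE
)

# Count the rows for each team in each game
rows_per_team <- toy_players %>%
  group_by(Season, Round, Team_name) %>%
  summarise(rows = n()) %>%
  ungroup()

print(rows_per_team)
```

Each team should show 23 rows, giving 46 for the game; any other count would suggest the file structure has changed.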

That structure will be "tidy" if the focus is on the player. For the time being, my focus is on the team, so there are two data manipulation steps I need to do: first, summarise the player data into team data (one row per team per game); second, combine each team's row with its opponent's row, giving one row per game.

The second step is necessary because a team's performance is meaningful only in the context of how its opponent performed: who won and who lost.

The following code summarises player data into team data (one row per team per game; i.e. with two teams, that means two rows per game).

# Summarise player rows into team rows: one row per team per game (two rows per game)

game_stats <- data %>%
  group_by(Season, Round, Team_name, Opposing_team) %>%
  summarise(
    Kicks = sum(Kicks),
    Marks = sum(Marks),
    Contested.Marks = sum(Contested_marks),
    Uncontested.Marks = sum(Uncontested_marks),
    Handballs = sum(Handballs),
    Effective.Possessions = sum(Effective_possessions),
    Contested.Possessions = sum(Contested_possessions),
    Uncontested.Possessions = sum(Uncontested_possessions),
    goals = sum(Goals),
    behinds = sum(Behinds),
    hitouts = sum(Hitouts),
    tackles = sum(Tackles),
    rebounds = sum(Rebounds),
    inside50 = sum(Inside50),
    clearances = sum(Clearances),
    clangers = sum(Clangers),
    Frees.for = sum(Frees_for),
    frees.against = sum(Frees_against),
    assists = sum(Assists),
    brownlow.votes = sum(Brownlow_votes),
    marks.inside.50 = sum(Marks_inside50),
    one.percenters = sum(One_percenters),
    bounces = sum(Bounces),
    centre.clearances = sum(Centre_clearances),
    stoppages = sum(Stoppages)
  )
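The same pattern on a tiny made-up table may make the effect easier to see. Here toy_players is hypothetical, with two players per team and only two statistics:

```r
library(dplyr)

# Hypothetical player rows for one game, two players per team
toy_players <- data.frame(
  Season = 2015,
  Round = 1,
  Team_name = c("Adelaide", "Adelaide", "Hawthorn", "Hawthorn"),
  Opposing_team = c("Hawthorn", "Hawthorn", "Adelaide", "Adelaide"),
  Kicks = c(10, 15, 8, 20),
  Goals = c(1, 2, 0, 3),
  stringsAsFactors = FALSE
)

# Collapse player rows into one row per team per game
toy_team <- toy_players %>%
  group_by(Season, Round, Team_name, Opposing_team) %>%
  summarise(Kicks = sum(Kicks), goals = sum(Goals)) %>%
  ungroup()

print(toy_team)
```

The four player rows collapse to two team rows: Adelaide with 25 kicks and 3 goals, and Hawthorn with 28 kicks and 3 goals.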

The next step (in a roundabout way) matches, for each game, each team and its opponent.

The first section of code creates a function called "find_first_team". It creates a key which, in conjunction with "Season" and "Round", links together the two teams (rows) that played each other in the one game.

The key consists of the first 6 characters of each team name, joined in alphabetical order.

# Next section of code converts the two-rows-per-game format to a one-row-per-game format

# This function takes as input the names of two football teams (x and y respectively)
# The first 6 characters of each team name, in lower case, are assigned to a and b respectively
# a and b are compared using the < operator
# The output is the two 6-character segments pasted together in alphabetical order
# (12 characters when both names have at least 6 characters)

find_first_team <- function(x, y) {
   a <- tolower(str_sub(as.character(x), 1, 6))
   b <- tolower(str_sub(as.character(y), 1, 6))
   ifelse(a < b, paste0(a, b), paste0(b, a))
}

# Assigns output of function to new field in game_stats

game_stats$abr <- find_first_team(game_stats$Team_name, game_stats$Opposing_team)
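As a quick illustration of how the key behaves, the sketch below rebuilds the same logic with substr() from base R so that it runs without stringr. The name make_game_key is hypothetical and used only for this example.

```r
# Same idea as find_first_team, written with base R only (illustrative name)
make_game_key <- function(x, y) {
  a <- tolower(substr(as.character(x), 1, 6))
  b <- tolower(substr(as.character(y), 1, 6))
  ifelse(a < b, paste0(a, b), paste0(b, a))
}

# The key is the same whichever team is listed first
key_home <- make_game_key("Adelaide", "Hawthorn")
key_away <- make_game_key("Hawthorn", "Adelaide")

print(key_home)
print(key_away)
```

Both calls return "adelaihawtho", which is what allows the two rows of a game to be recognised as belonging together.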

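The next step reduces the file from two rows per game to one row per game. This is accomplished using the unique() method that the data.table package provides for data tables. Whilst the resulting file has one row per game, it does not include any data about the opposing team.

The opposing team's data is then added with the left_join() function, which is part of the dplyr package. The setkeyv() function (from the data.table package) sets the key columns that identify a game.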

new_test_1 <- game_stats


# Uses the data.table package for the setkeyv() and unique() functions

new_test_1 <- data.table(new_test_1)

# Creates a key on nominated columns

setkeyv(new_test_1, c("Season", "Round", "abr"))

# unique() returns a data table with duplicated rows removed
# Since data.table 1.9.8, unique() compares all columns by default,
# so the columns that define a duplicate are passed explicitly with the by argument
# In this case, the duplication is defined by Season, Round, abr
# abr is the field created above by combining, in alphabetical order, the first 6 characters of the team names
# for example: adelaihawtho
# The original file has two rows per game
# This step removes one of the two rows per game
# Required as part of the process to convert the file from two rows per game to one row per game

write.csv(new_test_1, file = "new_test_1.csv")

new_test_2 <- unique(new_test_1, by = c("Season", "Round", "abr"))

write.csv(new_test_2, file = "new_test_2.csv")

# Convert new_test_2 to data frame, so that dplyr functions can be used

new_test_2 <- data.frame(new_test_2)

final2 <- left_join(new_test_2, game_stats, by = c("Season" = "Season", "Round" = "Round", "Opposing_team" = "Team_name"))

final2$Opposing_team.y <- final2$Opposing_team

write.csv(final2, file = "game_data_wide.csv")

###################################

The output file, final2 (or "game_data_wide", as the saved version is called), contains one row per game, with the playing statistics of each team included.
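To make the shape of that output concrete, here is a minimal sketch of the same de-duplicate-then-join pattern on a made-up single game. The table toy and its goals column are hypothetical, and the abr key is typed in by hand rather than generated.

```r
library(dplyr)
library(data.table)

# Hypothetical two-rows-per-game table for a single game
toy <- data.frame(
  Season = c(2015, 2015),
  Round = c(1, 1),
  Team_name = c("Adelaide", "Hawthorn"),
  Opposing_team = c("Hawthorn", "Adelaide"),
  goals = c(10, 12),
  abr = c("adelaihawtho", "adelaihawtho"),
  stringsAsFactors = FALSE
)

# Keep one row per game, with the game defined by Season, Round and abr
one_per_game <- unique(as.data.table(toy), by = c("Season", "Round", "abr"))

# Attach the opponent's statistics by matching Opposing_team to Team_name
wide <- left_join(
  as.data.frame(one_per_game), toy,
  by = c("Season" = "Season", "Round" = "Round", "Opposing_team" = "Team_name")
)

print(wide)
```

The result has a single row for the game. Columns that exist in both tables are given the suffixes .x (the team kept by unique()) and .y (its opponent), so here goals.x is 10 and goals.y is 12.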

The next step in the process is to take the data file where there is one row per game and create a summary - for whatever time frame is covered by the data file - of each team that includes a summary of both

A more detailed summary would show the playing team with each opposition team shown separately.
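One possible way to build such summaries is sketched below on a hypothetical table with one row per team per game. The table toy_games and its columns goals_for and goals_against are made up for illustration and are not columns of the Sorenson file.

```r
library(dplyr)

# Hypothetical table: one row per team per game, with the opponent's goals alongside
toy_games <- data.frame(
  Team_name = c("Adelaide", "Hawthorn", "Adelaide", "Sydney"),
  Opposing_team = c("Hawthorn", "Adelaide", "Sydney", "Adelaide"),
  goals_for = c(10, 12, 15, 9),
  goals_against = c(12, 10, 9, 15),
  stringsAsFactors = FALSE
)

# Overall summary: one row per team across the whole period
team_summary <- toy_games %>%
  group_by(Team_name) %>%
  summarise(
    games = n(),
    goals_for = sum(goals_for),
    goals_against = sum(goals_against)
  ) %>%
  ungroup()

# More detailed summary: one row per team per opposition team
by_opponent <- toy_games %>%
  group_by(Team_name, Opposing_team) %>%
  summarise(
    games = n(),
    goals_for = sum(goals_for),
    goals_against = sum(goals_against)
  ) %>%
  ungroup()

print(team_summary)
print(by_opponent)
```

In this toy data Adelaide played two games, kicking 25 goals and conceding 21, and the detailed table splits those totals by opponent.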
