Exploratory analysis involves summarising and visualising the main characteristics of a data set in order to gain a more holistic understanding of it. This is a critical step that should be completed before any in-depth analysis or application begins. Through this process an analyst can begin to evaluate data quality, develop an understanding of trends in the data that supports the formation of hypotheses, and start cleaning and preprocessing the data, all of which enable higher-quality analysis.
In this instance we are using the fitzRoy package to access the 2023 AFL match data.
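A minimal sketch of how the data might be loaded. The choice of fetch_player_stats with the afltables source is an assumption (suggested by column names such as Inside.50s used later), and load_afl2023 is a hypothetical helper name introduced only for illustration. The call needs the fitzRoy package and an internet connection.

```r
# Hypothetical helper (not from the original analysis): wraps the download
# so that nothing is fetched until the function is actually called.
load_afl2023 <- function() {
  if (!requireNamespace("fitzRoy", quietly = TRUE)) {
    stop("Please install the fitzRoy package first.")
  }
  # Assumption: player-level statistics from the afltables source.
  fitzRoy::fetch_player_stats(season = 2023, source = "afltables")
}

# Usage (requires network access):
# afl2023 <- load_afl2023()
```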
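We can use View to look at the 2023 AFL match data. Note that View comes from base R's utils package rather than dplyr; dplyr's glimpse gives a similar compact overview in the console. This enables quick inspection of the type of data contained within each variable and identification of errors or missing data.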
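As a small illustration of this kind of inspection, the sketch below uses a made-up data frame in place of the real match data. The team names and values are invented, and the column names are assumptions modelled on those used later.

```r
# Toy stand-in for the player-level match data (all values are made up).
afl_toy <- data.frame(
  Round       = c("1", "1", "1", "1"),
  Playing.for = c("Team A", "Team A", "Team B", "Team B"),
  Inside.50s  = c(4, NA, 3, 5),
  Marks       = c(6, 2, 7, 3)
)

# Compact overview: one line per column with its type and first values.
str(afl_toy)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(afl_toy))
print(missing_per_column)
```

Here str plays the same role as View or glimpse, but prints to the console, and the missing-value count points directly at the columns that need attention.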
We can use summary to create an overview of the data. This provides key information about the range of each variable, along with summary statistics including the mean and median.
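A brief sketch of what summary returns, again on made-up values rather than the real match data (the column names are assumptions):

```r
# Made-up values standing in for two numeric columns of the match data.
afl_toy <- data.frame(
  Inside.50s = c(4, 6, 3, 5, 2),
  Marks      = c(6, 2, 7, 3, 10)
)

# Per-column minimum, quartiles, median, mean and maximum.
print(summary(afl_toy))

# The same statistics for a single column, as a named numeric vector.
marks_summary <- summary(afl_toy$Marks)
marks_mean    <- marks_summary[["Mean"]]
marks_median  <- marks_summary[["Median"]]
```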
The data can then be grouped to provide information at different levels, e.g. match level. In the code below, the data has been grouped (using group_by) by team for each round, keeping only the variables that vary between individual players, and then aggregated with summarise to create a holistic overview of team actions and outputs. An additional column has also been added using mutate to indicate whether the team was the home or away team in that match.
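The grouping step could look roughly like the sketch below. It runs on a made-up player-level data frame; the column names (Round, Home.team, Away.team, Playing.for, Inside.50s, Marks) and the new Status column are assumptions for illustration, not necessarily those of the original analysis.

```r
library(dplyr)

# Toy player-level rows for one match (all values are made up).
players <- data.frame(
  Round       = c(1, 1, 1, 1),
  Home.team   = "Team A",
  Away.team   = "Team B",
  Playing.for = c("Team A", "Team A", "Team B", "Team B"),
  Inside.50s  = c(4, 6, 3, 5),
  Marks       = c(6, 2, 7, 3)
)

team_totals <- players %>%
  # One group per team per round; the match-level columns are kept as keys.
  group_by(Round, Home.team, Away.team, Playing.for) %>%
  # Sum the player-level statistics to team totals.
  summarise(
    Inside50s = sum(Inside.50s),
    Marks     = sum(Marks),
    .groups   = "drop"
  ) %>%
  # Flag whether each team was playing at home or away.
  mutate(Status = if_else(Playing.for == Home.team, "Home", "Away"))

print(team_totals)
```

Summing is one reasonable aggregation for counting statistics; means or maxima may suit other variables better.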
We can also use the information provided to determine the match outcome for each team, which gives further information for use in trend detection.
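One way the outcome might be derived is sketched below, assuming each team-level row carries the home and away scores. The column names Home.score and Away.score, the outcome labels and all values are assumptions for illustration.

```r
library(dplyr)

# Toy team-level rows for one match (all values are made up).
team_rows <- data.frame(
  Playing.for = c("Team A", "Team B"),
  Home.team   = "Team A",
  Away.team   = "Team B",
  Home.score  = 90,
  Away.score  = 70
)

match_outcome <- team_rows %>%
  mutate(
    # Margin from the point of view of the team in this row.
    margin = if_else(Playing.for == Home.team,
                     Home.score - Away.score,
                     Away.score - Home.score),
    # Translate the margin into a categorical result.
    outcome = case_when(
      margin > 0 ~ "Win",
      margin < 0 ~ "Loss",
      TRUE       ~ "Draw"
    )
  )

print(match_outcome)
```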
Data visualisation is extremely useful for identifying trends and patterns within the data. The ggplot2 package can be used to create a variety of visualisations that support trend detection.
For example, the following code could be used to examine the relationship between Inside.50s and margin.
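A minimal sketch of such a plot, following the same pattern as the other plots in this section. The toy data frame and its Inside50s column name are assumptions for illustration; in practice afl2023_matchoutcome would be passed to ggplot instead. The cor call at the end is an addition that puts a number on the strength of the relationship.

```r
library(ggplot2)

# Toy team-level data (made-up values); column names are assumptions.
toy_matchoutcome <- data.frame(
  Inside50s = c(40, 45, 50, 55, 60, 65),
  margin    = c(-30, -12, -5, 8, 20, 35)
)

# Scatter plot with a linear trend line (blue by default).
Inside50s_margin <- ggplot(toy_matchoutcome, aes(y = Inside50s, x = margin)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  ggtitle("Relationship between Inside 50s and Score Margin") +
  theme(plot.title = element_text(size = 8))

# Pearson correlation quantifies the strength of the linear relationship.
inside50s_cor <- cor(toy_matchoutcome$Inside50s, toy_matchoutcome$margin)
print(inside50s_cor)

# print(Inside50s_margin) draws the plot.
```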
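The blue trend line, a linear fit added by geom_smooth, suggests a strong relationship between the two metrics; how tightly the points cluster around the line indicates the strength of the correlation. We can repeat this with other variables and compare the plots side by side using gridExtra.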
# Frees for against score margin
Freesfor_margin <- afl2023_matchoutcome %>%
  ggplot(aes(y = Freesfor, x = margin)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Relationship between Frees For and Score Margin") +
  theme(plot.title = element_text(size = 8))

# Marks against score margin
Marks_margin <- afl2023_matchoutcome %>%
  ggplot(aes(y = Marks, x = margin)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Relationship between Marks and Score Margin") +
  theme(plot.title = element_text(size = 8))

# Brownlow votes against score margin
Brownlowvotes_margin <- afl2023_matchoutcome %>%
  ggplot(aes(y = Brownlowvotes, x = margin)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Relationship between Brownlow Votes and Score Margin") +
  theme(plot.title = element_text(size = 8))

# Arrange the four plots in a 2 x 2 grid
# (Inside50s_margin is the plot created in the previous step)
library(gridExtra)
plots_arranged <- grid.arrange(Inside50s_margin,
                               Freesfor_margin,
                               Marks_margin,
                               Brownlowvotes_margin,
                               ncol = 2)