Required packages

# This is the R chunk for the required packages
library(fitzRoy)
library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(editrules)

Executive Summary

In this assignment I sought to combine data on the Australian Football League or AFL from the website AFL Tables with rainfall data from the Bureau of Meteorology.

Specifically, I narrowed the focus to AFL matches played at the MCG since rainfall data from the BOM’s Olympic Park weather station, near the MCG, became available on June 2 2013, through to the conclusion of the 2019 AFL season.

I combined these two datasets based on the date variable, which is present in both the AFL data and the BOM rainfall data.

I converted some variables in the resulting dataset to factor type as appropriate and created new variables recording the accuracy and total score of the relevant AFL matches.

I also created a second dataset on the umpiring of AFL matches. This data arrived in an untidy format and it was necessary to tidy it for this purpose.

Both datasets were checked for missing and special values, impossible values or obvious errors, and outliers. No significant issues were uncovered.

This report also contains a demonstration of how the ‘Attendance’ variable could be re-scaled for analysis.

Data

Data was obtained from two seperate sources: AFL Tables, and the Bureau of Meteorology.

AFL Tables data is best obtained through use of the fitzRoy package’s get_afltables_stats function. I specified the upper and lower date limits for data collection.

This produces the following variables:

Some of these variables are player-specific and unique on each line, and same are match-specific and repeated for each instance of a player in that match. Since I wished to use match data only, I used the dplyr select function to select only match data, and then used the distinct function to remove the redundant observations.

I then filtered the data to only include matches played at the MCG.

BOM data can be downloaded directly from the BOM website, where it comes as a csv. I read it into R using readr’s read_csv function.

This data has the following variables:

Although data collection at the site only began on June 2, 2013, this dataset contains observations for all days in 2013. As such, I used the complete cases function to remove observations with empty values from the dataset.

I renamed the variables in this dataset for ease of use, and then created a new variable using the dplyr and lubridate packages to combine the Year, Month and Day variables into a single date variable.

This was necessary in order to combine the weather data with the AFL data, which also contains a Date variable. I subsetted the weather data down to two variables - Date and Rainfall - and then used the dplyr left_join function to add the rainfall data to the afl dataset.

The use of left_join insures that only weather data observations with a matching date are brought into the AFL dataset - inner_join would also have been appropriate.

# Loading AFL data
afl <- get_afltables_stats(start_date = '2013-06-02', end_date = '2020-01-01')

head(afl)

# Cleaning AFL data
afl <- afl %>% select(Season:Home.team, HQ4G:Away.team, AQ4G:Away.score, Umpire.1:Umpire.4)
afl <- afl %>% distinct()
afl <- afl %>% filter(Venue == 'M.C.G.')

# Loading weather data
weather <- read_csv('weather.csv')
Parsed with column specification:
cols(
  `Product code` = col_character(),
  `Bureau of Meteorology station number` = col_character(),
  Year = col_double(),
  Month = col_character(),
  Day = col_character(),
  `Rainfall amount (millimetres)` = col_double(),
  `Period over which rainfall was measured (days)` = col_double(),
  Quality = col_character()
)
head(weather)

# Cleaning weather data
weather <- weather[complete.cases(weather), ]
colnames(weather) <- c('ProductCode', 'Station', 'Year', 'Month', 'Day', 'Rainfall', 'Period', 'Quality')
weather <- weather %>% mutate(Date = make_date(Year, Month, Day))
weather <- weather %>% select(Date, Rainfall)

# Joining datasets
afl <- afl %>% left_join(weather, by = "Date")

head(afl)

Understand

I used the str function to look at the structure of the combined AFL dataset.

I identified five variables in need of conversion to factor type - Season, Round, Venue, Home.team and Away.team.

I passed these variables into the table function to check for the possibility of typos in the values which might need to be corrected before converting the datatype, or NA values. There were no instances present for either of these possible issues.

I converted these variables to factor type. For the Round variable, I defined a specific vector of labels, as this variable contains a mix of numeric and alphabetic data, and would not be appropriately ordered by standard alphabetic order. However as these rounds are not numerically “greater” than each other I did not set Ordered = TRUE.

A second check of the structure of the dataset shows the successful datatype conversions. All other variables in this dataset have an appropriate datatype.

# Checking data structure
str(afl)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   328 obs. of  19 variables:
 $ Season          : num  2013 2013 2014 2014 2015 ...
 $ Round           : chr  "12" "16" "10" "18" ...
 $ Date            : Date, format: "2013-06-15" "2013-07-12" "2014-05-25" ...
 $ Local.start.time: int  1340 1950 1640 1640 1340 1920 1950 1950 1520 1310 ...
 $ Venue           : chr  "M.C.G." "M.C.G." "M.C.G." "M.C.G." ...
 $ Attendance      : int  43615 54790 32419 41486 32035 60872 70879 45781 29133 32430 ...
 $ Home.team       : chr  "Richmond" "Collingwood" "Carlton" "Collingwood" ...
 $ HQ4G            : int  16 17 12 12 14 16 21 17 15 7 ...
 $ HQ4B            : int  14 9 9 10 6 13 9 10 8 5 ...
 $ Home.score      : int  110 111 81 82 90 109 135 112 98 47 ...
 $ Away.team       : chr  "Adelaide" "Adelaide" "Adelaide" "Adelaide" ...
 $ AQ4G            : int  10 12 10 14 14 14 8 17 18 16 ...
 $ AQ4B            : int  12 12 16 14 15 18 13 7 12 11 ...
 $ Away.score      : int  72 84 76 98 99 102 61 109 120 107 ...
 $ Umpire.1        : chr  "Dean Margetts" "Stephen McBurney" "Matthew Leppard" "Luke Farmer" ...
 $ Umpire.2        : chr  "Shane Stewart" "Matt Stevic" "Sam Hay" "Matt Stevic" ...
 $ Umpire.3        : chr  "Simon Meredith" "Shane Stewart" "Simon Meredith" "Jason Armstrong" ...
 $ Umpire.4        : chr  "" "" "" "" ...
 $ Rainfall        : num  0 0 0.6 1.4 0.2 0 0 21.6 0 0 ...
# Checking for typos and NA values
afl$Season %>% table(useNA = "ifany")
.
2013 2014 2015 2016 2017 2018 2019 
  28   50   49   50   50   51   50 
afl$Round %>% table(useNA = "ifany")
.
 1 10 11 12 13 14 15 16 17 18 19  2 20 21 22 23  3  4  5  6  7  8  9 EF GF PF QF SF 
15 11 11 11  8 13 14 11 12 11 13 14 15 14 15 14 14 12 17 12 13 12 12  6  7  7  6  8 
afl$Venue %>% table(useNA = "ifany")
.
M.C.G. 
   328 
afl$Home.team %>% table(useNA = "ifany")
.
        Adelaide          Carlton      Collingwood         Essendon          Geelong 
               1               36               62               26               18 
        Hawthorn        Melbourne  North Melbourne    Port Adelaide         Richmond 
              48               57                3                1               70 
        St Kilda           Sydney       West Coast Western Bulldogs 
               2                1                1                2 
afl$Away.team %>% table(useNA = "ifany")
.
              Adelaide         Brisbane Lions                Carlton            Collingwood 
                    16                     11                     19                     36 
              Essendon              Fremantle                Geelong             Gold Coast 
                    24                     14                     26                      8 
Greater Western Sydney               Hawthorn              Melbourne        North Melbourne 
                    17                     26                     22                     10 
         Port Adelaide               Richmond               St Kilda                 Sydney 
                    12                     24                     14                     20 
            West Coast       Western Bulldogs 
                    16                     13 
# Converting variables to factor type
afl$Season <- as.factor(afl$Season)
afl$Round <- factor(afl$Round, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "QF", "EF", "SF", "PF", "GF"))
afl$Venue <- as.factor(afl$Venue)
afl$Home.team <- as.factor(afl$Home.team)
afl$Away.team <- as.factor(afl$Away.team)

# Confiriming successful conversions
str(afl)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   328 obs. of  19 variables:
 $ Season          : Factor w/ 7 levels "2013","2014",..: 1 1 2 2 3 3 3 4 4 4 ...
 $ Round           : Factor w/ 28 levels "1","2","3","4",..: 12 16 10 18 10 25 26 5 15 16 ...
 $ Date            : Date, format: "2013-06-15" "2013-07-12" "2014-05-25" ...
 $ Local.start.time: int  1340 1950 1640 1640 1340 1920 1950 1950 1520 1310 ...
 $ Venue           : Factor w/ 1 level "M.C.G.": 1 1 1 1 1 1 1 1 1 1 ...
 $ Attendance      : int  43615 54790 32419 41486 32035 60872 70879 45781 29133 32430 ...
 $ Home.team       : Factor w/ 14 levels "Adelaide","Carlton",..: 10 3 2 3 2 1 6 6 7 2 ...
 $ HQ4G            : int  16 17 12 12 14 16 21 17 15 7 ...
 $ HQ4B            : int  14 9 9 10 6 13 9 10 8 5 ...
 $ Home.score      : int  110 111 81 82 90 109 135 112 98 47 ...
 $ Away.team       : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 18 1 1 1 1 ...
 $ AQ4G            : int  10 12 10 14 14 14 8 17 18 16 ...
 $ AQ4B            : int  12 12 16 14 15 18 13 7 12 11 ...
 $ Away.score      : int  72 84 76 98 99 102 61 109 120 107 ...
 $ Umpire.1        : chr  "Dean Margetts" "Stephen McBurney" "Matthew Leppard" "Luke Farmer" ...
 $ Umpire.2        : chr  "Shane Stewart" "Matt Stevic" "Sam Hay" "Matt Stevic" ...
 $ Umpire.3        : chr  "Simon Meredith" "Shane Stewart" "Simon Meredith" "Jason Armstrong" ...
 $ Umpire.4        : chr  "" "" "" "" ...
 $ Rainfall        : num  0 0 0.6 1.4 0.2 0 0 21.6 0 0 ...

Tidy & Manipulate Data I

In this dataset, there are four variables dedicated to recording the umpires who officiated the relevant AFL match - Umpire.1, Umpire.2, Umpire.3 and Umpire.4. Each match has at least three umpires and some matches have a fouth umpire.

In AFL matches, there is no distinction made between a “first” or “second” umpire - the numbering here is used only to keep all umpire information contained to a single row. This is untidy as there are four columns used to record a single variable.

To tidy this data, I decided to split it off into a seperate dataset, umpires. This was done using the dplyr select function. I then used the tidyr gather function to convert the data to long format, creating a row for each indiviudal umpire.

This is combined with information on the match they umpired - in a relational database, this infomration could be condensed down to a single unique indentifier for each match which would be shared with the original dataset.

I removed the key variable created by the gather function as this provides no useful information. I used complete_cases to remove observations where there was no value in the umpire variable (due to many observations of Umpire.4 being blank).

With this data successfully tidied in a seperate dataset, I removed it from the main AFL dataset.

# Subsetting umpire data
umpires <- afl %>% select(Season:Date, Venue, Home.team, Away.team, Umpire.1:Umpire.4)

# Converting to long format
umpires <- umpires %>% gather(`Umpire.1`, `Umpire.2`, `Umpire.3`, `Umpire.4`, key = 'No', value = 'Umpire')

# Removing unnecessary key variable
umpires <- umpires %>% select(Season:Away.team, Umpire)

# Removing incomplete observations
umpires <- umpires[complete.cases(umpires), ]

head(umpires)

# Removing umpire data from original dataset
afl <- afl %>% select(Season:Away.score, Rainfall)

head(afl)

Tidy & Manipulate Data II

In an earlier step the dplyr mutate function was combined with the lubridate make_date function to create an additional variable. However, there are other possible instances where this would be a relevant task in this dataset.

I used mutate to create a new variable in the AFL dataset called Total.score, which combines the existing variables Home.score and Away.score, thus tracking the combined number of points scored in a match.

I also created an additional variable called Accuracy. This records what propotion of the combined total of scoring shots from the match (the combined total of HQ4G, AQ4G, HQ4B, and AQ4B) were goals (the combined total of HQ4G and AQ4G).

# Adding Total.score variable
afl <- afl %>% mutate(Total.score = Home.score+Away.score)

# Adding Accuracy variable
afl <- afl %>% mutate(Accuracy = (HQ4G+AQ4G)/(HQ4G+AQ4G+HQ4B+AQ4B))

head(afl)

Scan I

I put the two datasets, AFL and Umpires, through the is.na, is.infinite and is.nan functions to check for the presence of missing or special values. Neither is present in any of the variables in these two datasets.

I used the editrules package to define some rules to check the variables for obvious errors. These rules were: checking that the Local.start.time for each match was between noon and 10pm, checking that the crowd at a match was not a negative value nor an impossibly high amount, checking that the home and away scores reflected the expected total of six points per goal and one point per behind and that neither was greater than an impossibly high value, and checking that rainfall was never negative nor above an impossibly high value.

I then used the violatedEdits function to check if any observations breaked these set rules. There were no instances of broken rules.

# Checking for missing values
afl %>% is.na() %>% sum()
[1] 0
umpires %>% is.na() %>% sum()
[1] 0
# Checking for special values
afl %>% sapply(is.infinite) %>% sum()
[1] 0
umpires %>% sapply(is.infinite) %>% sum()
[1] 0
afl %>% sapply(is.nan) %>% sum()
[1] 0
umpires %>% sapply(is.nan) %>% sum()
[1] 0
# Defining variables rules
TimeRule <- editset(c("Local.start.time >= 1200", "Local.start.time <= 2200"))
CrowdRule <- editset(c("Attendance >= 0", "Attendance <= 150000"))
ScoreRule <- editset(c("Home.score == HQ4G * 6 + HQ4B", "Away.score == AQ4G * 6 + AQ4B", "Home.score <= 300", "Away.score <= 300"))
RainRule <- editset(c("Rainfall >= 0", "Rainfall <= 1000"))
Rules <- (c(TimeRule, CrowdRule, ScoreRule, RainRule))

# Checking for broken rules
violatedEdits(Rules, afl) %>% sum()
[1] 0

Scan II

To check the numerical variables for outliers, I created a series of boxplots for each numerical value.

The HQ4G, HQ4B, AQ4G, AQ4B, Home.score, Away.score and Accuracy variables all reported between 1-3 outliers, all at the high end of the boxplot.

I checked the highest values in these variables against an independent record of the data to ensure they were not incorrect values, which I can confirm they are not. As such, I decided to keep them as they are in the dataset.

The Rainfall data has a large number of outliers. This is due to this data being heavily right skewed - the majority of observations of this variable are 0.0, and as such many observations are considered outliers.

This is primarily an illustration of the skewedness of this dataset, as such, I decided to retain the values as they are.

I also checked three likely relationships of interest for bivariate outliers, creating scatter plots that each use Rainfall on the x axis and use Accuracy, Total.score and Attendance on the three y axes respectively.

These plots also show how the vast majority of rainfall data is at or near a zero value. There do not appear to be any bivariate outliers (or correlation) on the Accuracy or Total.score plots.

The attendance variable does show more evidence of a possible relationship and some potential outliers, but the data for high values of rainfall is too sparse to be confident that this is the case.

# Checking single variables for outliers
afl$Attendance %>% boxplot(main="Box Plot of MCG Crowds", ylab="Crowd")

afl$HQ4G %>% boxplot(main="Box Plot of HQ4G", ylab="HQ4G")

afl$HQ4B %>% boxplot(main="Box Plot of HQ4B", ylab="HQ4B")

afl$AQ4G %>% boxplot(main="Box Plot of AQ4G", ylab="AQ4G")

afl$AQ4B %>% boxplot(main="Box Plot of AQ4B", ylab="AQ4B")

afl$Home.score %>% boxplot(main="Box Plot of Home Score", ylab="Home Score")

afl$Away.score %>% boxplot(main="Box Plot of Away Score", ylab="Away Score")

afl$Total.score %>% boxplot(main="Box Plot of Total Score", ylab="Total Score")

afl$Rainfall %>% boxplot(main="Box Plot of Rainfall", ylab="Rainfall")

afl$Accuracy %>% boxplot(main="Box Plot of Accuracy", ylab="Accuracy")


#Checking for bivariate outliers
afl %>% plot(Accuracy ~ Rainfall, data =., ylab = "Accuracy", xlab = "Rainfall", main = "Accuracy by Rainfall")

afl %>% plot(Total.score ~ Rainfall, data =., ylab = "Total Score", xlab = "Rainfall", main = "Total Score by Rainfall")

afl %>% plot(Attendance ~ Rainfall, data =., ylab = "Attendance", xlab = "Rainfall", main = "Attendance by Rainfall")

Transform

The attendance variable has a large range, containing both observations below 10,000 and greater than 100,000 in the dataset. As such, for some analysis it might be valuable to transform this variable.

To demonstrated this I chose to use the baseR scale function to convert each observation in the variable to a Z-score, representing the number of standard deviations that the value sits away from the variable mean.

A summary of the new variable (in this case saved as a seperate object, zAttendance), shows that the mean of the variable is now 0, as we would expect following a conversion of this type.

# Transfomring attendance variable to z-scores
zAttendance <- scale(afl$Attendance, center = TRUE, scale = TRUE)

zAttendance %>% summary()
       V1          
 Min.   :-1.94272  
 1st Qu.:-0.79158  
 Median :-0.08648  
 Mean   : 0.00000  
 3rd Qu.: 0.72797  
 Max.   : 2.29240  



---
title: "MATH2349 Semester 1, 2020"
author: "Joshua Elliott s3853220"
subtitle: Assignment 2
output:
  html_notebook: default
---

## Required packages 

```{r}
# This is the R chunk for the required packages
library(fitzRoy)
library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(editrules)
```

## Executive Summary 
In this assignment I sought to combine data on the Australian Football League or AFL from the website AFL Tables with rainfall data from the Bureau of Meteorology.

Specifically, I narrowed the focus to AFL matches played at the MCG since rainfall data from the BOM's Olympic Park weather station, near the MCG, became available on June 2 2013, through to the conclusion of the 2019 AFL season.

I combined these two datasets based on the date variable, which is present in both the AFL data and the BOM rainfall data.

I converted some variables in the resulting dataset to factor type as appropriate and created new variables recording the accuracy and total score of the relevant AFL matches.

I also created a second dataset on the umpiring of AFL matches. This data arrived in an untidy format and it was necessary to tidy it for this purpose.

Both datasets were checked for missing and special values, impossible values or obvious errors, and outliers. No significant issues were uncovered.

This report also contains a demonstration of how the 'Attendance' variable could be re-scaled for analysis.

## Data 
Data was obtained from two seperate sources: AFL Tables, and the Bureau of Meteorology.

AFL Tables data is best obtained through use of the fitzRoy package's get_afltables_stats function. I specified the upper and lower date limits for data collection.

This produces the following variables: 

- Season - The AFL season the match was played in
- Round - The round of the season the match was played in
- Date - The date the match was played on
- Local.start.time - The time the match was played, as per local time zone
- Venue - The location the match was played
- Attendance - The recorded crowd at the match
- Home.team - the AFL side listed as the home side for the match
- HQ1G - The number of goals kicked by the home side at the end of the first quarter
- HQ1B - The number of behinds kicked by the home side at the end of the first quarter
- HQ2G - The number of goals kicked by the home side at the end of the second quarter
- HQ2B - The number of behinds kicked by the home side at the end of the second quarter
- HQ3G - The number of goals kicked by the home side at the end of the third quarter
- HQ3B - The number of behinds kicked by the home side at the end of the third quarter
- HQ4G - The number of goalss kicked by the home side at the end of the fourth quarter
- HQ4B - The number of behinds kicked by the home side at the end of the fourth quarter
- Home.score - The total score of the home side at the end of the match
- Away.team - The AFL side listed as the away side for the match
- HQ1G - The number of goals kicked by the away side at the end of the first quarter
- HQ1B - The number of behinds kicked by the away side at the end of the first quarter
- HQ2G - The number of goals kicked by the away side at the end of the second quarter
- HQ2B - The number of behinds kicked by the away side at the end of the second quarter
- HQ3G - The number of goals kicked by the away side at the end of the third quarter
- HQ3B - The number of behinds kicked by the away side at the end of the third quarter
- HQ4G - The number of goalss kicked by the away side at the end of the fourth quarter
- HQ4B - The number of behinds kicked by the away side at the end of the fourth quarter
- Away.score - The total score of the away side at the end of the match
- First.name - The first name of the player
- Surname - The surname of the player
- ID - A unique identifier for each player
- Jumper.No. - The number worn by the player
- Playing.for - The AFL club the player represents
- Kicks - The number of kicks recorded by the player
- Marks - The number of marks recorded by the player
- Handballs - The number of handballs recorded by the player
- Goals - The number of goals recorded by the player
- Behinds - The number of behinds recorded by the player
- Hit.outs - The number of hit outs recorded by the player
- Tackles - The number of tackles recorded by the player
- Rebounds - The number of rebounds recorded by the player
- Inside.50s - The number of inside 50s recorded by the player
- Clearances - The number of clearances recorded by the player
- Clangers - The number of clangers recorded by the player
- Frees.For - The number of free kicks for recorded by the player
- Free.Against - The number of free kicks against recorded by the player
- Brownlow.Votes - The number of Brownlow Votes recorded by the player
- Contested.Possessions - The number of contested possessions recorded by the player
- Uncontested.Possessions - The number of uncontested possessions recorded by the player
- Contested.Marks - The number of contested marks recorded by the player
- Marks.Inside.50 - The number of marks inside 50 recorded by the player
- One.Percenters - The number of one-percenters recorded by the player
- Bounces - The number of bounces recorded by the player
- Goal.Assists - The number of goal assists recorded by the player
- Time.on.Ground - The percentage of match time the player spent on ground
- Substitute - Whether or not the player was involved in a substitution
- Umpire.1 - The first umpire assigned to the match, ranked alphabetically by surname
- Umpire.2 - The second umpire assigned to the match, ranked alphabetically by surname
- Umpire.3 - The third umpire assigned to the match, ranked alphabetically by surname
- Umpire.4 - The fourth umpire assigned to the match, ranked alphabetically by surname
- group_id - A unique identifier for each possible combination of umpires

Some of these variables are player-specific and unique on each line, and same are match-specific and repeated for each instance of a player in that match. Since I wished to use match data only, I used the dplyr select function to select only match data, and then used the distinct function to remove the redundant observations.

I then filtered the data to only include matches played at the MCG.

BOM data can be downloaded directly from the BOM website, where it comes as a csv. I read it into R using readr's read_csv function.

This data has the following variables:

- Product code - A label for the dataset
- Bureau of Meteorology station number - A unique identifier for individual weather stations
- Year - The year the data was recorded in
- Month - The month the data was recorded in
- Day - The day the data was recorded on
- Rainfall amount (millimetres) - The amount of rainfall measured, in millimetres
- Period over which rainfall was measured (days) - The length, in days, of the period the data represents
- Quality - An categorisation of the quality of the measurement

Although data collection at the site only began on June 2, 2013, this dataset contains observations for all days in 2013. As such, I used the complete cases function to remove observations with empty values from the dataset.

I renamed the variables in this dataset for ease of use, and then created a new variable using the dplyr and lubridate packages to combine the Year, Month and Day variables into a single date variable.

This was necessary in order to combine the weather data with the AFL data, which also contains a Date variable. I subsetted the weather data down to two variables - Date and Rainfall - and then used the dplyr left_join function to add the rainfall data to the afl dataset. 

The use of left_join insures that only weather data observations with a matching date are brought into the AFL dataset - inner_join would also have been appropriate. 

```{r, results='hide'}
# Loading AFL data
afl <- get_afltables_stats(start_date = '2013-06-02', end_date = '2020-01-01')
```
```{r}

head(afl)

# Cleaning AFL data
afl <- afl %>% select(Season:Home.team, HQ4G:Away.team, AQ4G:Away.score, Umpire.1:Umpire.4)
afl <- afl %>% distinct()
afl <- afl %>% filter(Venue == 'M.C.G.')

# Loading weather data
weather <- read_csv('weather.csv')

head(weather)

# Cleaning weather data
weather <- weather[complete.cases(weather), ]
colnames(weather) <- c('ProductCode', 'Station', 'Year', 'Month', 'Day', 'Rainfall', 'Period', 'Quality')
weather <- weather %>% mutate(Date = make_date(Year, Month, Day))
weather <- weather %>% select(Date, Rainfall)

# Joining datasets
afl <- afl %>% left_join(weather, by = "Date")

head(afl)
```

## Understand 
I used the str function to look at the structure of the combined AFL dataset. 

I identified five variables in need of conversion to factor type - Season, Round, Venue, Home.team and Away.team. 

I passed these variables into the table function to check for the possibility of typos in the values which might need to be corrected before converting the datatype, or NA values. There were no instances present for either of these possible issues.

I converted these variables to factor type. For the Round variable, I defined a specific vector of labels, as this variable contains a mix of numeric and alphabetic data, and would not be appropriately ordered by standard alphabetic order. However as these rounds are not numerically "greater" than each other I did not set Ordered = TRUE.

A second check of the structure of the dataset shows the successful datatype conversions. All other variables in this dataset have an appropriate datatype.

```{r}
# Checking data structure
str(afl)

# Checking for typos and NA values
afl$Season %>% table(useNA = "ifany")
afl$Round %>% table(useNA = "ifany")
afl$Venue %>% table(useNA = "ifany")
afl$Home.team %>% table(useNA = "ifany")
afl$Away.team %>% table(useNA = "ifany")

# Converting variables to factor type
afl$Season <- as.factor(afl$Season)
afl$Round <- factor(afl$Round, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "QF", "EF", "SF", "PF", "GF"))
afl$Venue <- as.factor(afl$Venue)
afl$Home.team <- as.factor(afl$Home.team)
afl$Away.team <- as.factor(afl$Away.team)

# Confiriming successful conversions
str(afl)
```

##	Tidy & Manipulate Data I 
In this dataset, there are four variables dedicated to recording the umpires who officiated the relevant AFL match - Umpire.1, Umpire.2, Umpire.3 and Umpire.4. Each match has at least three umpires and some matches have a fouth umpire.

In AFL matches, there is no distinction made between a "first" or "second" umpire - the numbering here is used only to keep all umpire information contained to a single row. This is untidy as there are four columns used to record a single variable.

To tidy this data, I decided to split it off into a seperate dataset, umpires. This was done using the dplyr select function. I then used the tidyr gather function to convert the data to long format, creating a row for each indiviudal umpire. 

This is combined with information on the match they umpired - in a relational database, this infomration could be condensed down to a single unique indentifier for each match which would be shared with the original dataset.

I removed the key variable created by the gather function as this provides no useful information. I used complete_cases to remove observations where there was no value in the umpire variable (due to many observations of Umpire.4 being blank).

With this data successfully tidied in a seperate dataset, I removed it from the main AFL dataset.

```{r}
# Subsetting umpire data
umpires <- afl %>% select(Season:Date, Venue, Home.team, Away.team, Umpire.1:Umpire.4)

# Converting to long format
umpires <- umpires %>% gather(`Umpire.1`, `Umpire.2`, `Umpire.3`, `Umpire.4`, key = 'No', value = 'Umpire')

# Removing unnecessary key variable
umpires <- umpires %>% select(Season:Away.team, Umpire)

# Removing incomplete observations
umpires <- umpires[complete.cases(umpires), ]

head(umpires)

# Removing umpire data from original dataset
afl <- afl %>% select(Season:Away.score, Rainfall)

head(afl)
```

##	Tidy & Manipulate Data II 
In an earlier step the dplyr mutate function was combined with the lubridate make_date function to create an additional variable. However, there are other possible instances where this would be a relevant task in this dataset.

I used mutate to create a new variable in the AFL dataset called Total.score, which combines the existing variables Home.score and Away.score, thus tracking the combined number of points scored in a match.

I also created an additional variable called Accuracy. This records what propotion of the combined total of scoring shots from the match (the combined total of HQ4G, AQ4G, HQ4B, and AQ4B) were goals (the combined total of HQ4G and AQ4G).

```{r}
# Adding Total.score variable
afl <- afl %>% mutate(Total.score = Home.score+Away.score)

# Adding Accuracy variable
afl <- afl %>% mutate(Accuracy = (HQ4G+AQ4G)/(HQ4G+AQ4G+HQ4B+AQ4B))

head(afl)
```

##	Scan I 
I put the two datasets, AFL and Umpires, through the is.na, is.infinite and is.nan functions to check for the presence of missing or special values. Neither is present in any of the variables in these two datasets.

I used the editrules package to define some rules to check the variables for obvious errors. These rules were: checking that the Local.start.time for each match was between noon and 10pm, checking that the crowd at a match was not a negative value nor an impossibly high amount, checking that the home and away scores reflected the expected total of six points per goal and one point per behind and that neither was greater than an impossibly high value, and checking that rainfall was never negative nor above an impossibly high value.

I then used the violatedEdits function to check if any observations breaked these set rules. There were no instances of broken rules.

```{r}
# Checking for missing values
afl %>% is.na() %>% sum()
umpires %>% is.na() %>% sum()

# Checking for special values
afl %>% sapply(is.infinite) %>% sum()
umpires %>% sapply(is.infinite) %>% sum()
afl %>% sapply(is.nan) %>% sum()
umpires %>% sapply(is.nan) %>% sum()

# Defining variables rules
TimeRule <- editset(c("Local.start.time >= 1200", "Local.start.time <= 2200"))
CrowdRule <- editset(c("Attendance >= 0", "Attendance <= 150000"))
ScoreRule <- editset(c("Home.score == HQ4G * 6 + HQ4B", "Away.score == AQ4G * 6 + AQ4B", "Home.score <= 300", "Away.score <= 300"))
RainRule <- editset(c("Rainfall >= 0", "Rainfall <= 1000"))
Rules <- (c(TimeRule, CrowdRule, ScoreRule, RainRule))

# Checking for broken rules
violatedEdits(Rules, afl) %>% sum()
```

##	Scan II
To check the numerical variables for outliers, I created a series of boxplots for each numerical value.

The HQ4G, HQ4B, AQ4G, AQ4B, Home.score, Away.score and Accuracy variables all reported between 1-3 outliers, all at the high end of the boxplot. 

I checked the highest values in these variables against an independent record of the data to ensure they were not incorrect values, which I can confirm they are not. As such, I decided to keep them as they are in the dataset.

The Rainfall data has a large number of outliers. This is due to this data being heavily right skewed - the majority of observations of this variable are 0.0, and as such many observations are considered outliers. 

This is primarily an illustration of the skewedness of this dataset, as such, I decided to retain the values as they are.

I also checked three likely relationships of interest for bivariate outliers, creating scatter plots that each use Rainfall on the x axis and use Accuracy, Total.score and Attendance on the three y axes respectively.

These plots also show how the vast majority of rainfall data is at or near a zero value. There do not appear to be any bivariate outliers (or correlation) on the Accuracy or Total.score plots. 

The attendance variable does show more evidence of a possible relationship and some potential outliers, but the data for high values of rainfall is too sparse to be confident that this is the case.

```{r}
# Checking single variables for outliers
afl$Attendance %>% boxplot(main="Box Plot of MCG Crowds", ylab="Crowd")
afl$HQ4G %>% boxplot(main="Box Plot of HQ4G", ylab="HQ4G")
afl$HQ4B %>% boxplot(main="Box Plot of HQ4B", ylab="HQ4B")
afl$AQ4G %>% boxplot(main="Box Plot of AQ4G", ylab="AQ4G")
afl$AQ4B %>% boxplot(main="Box Plot of AQ4B", ylab="AQ4B")
afl$Home.score %>% boxplot(main="Box Plot of Home Score", ylab="Home Score")
afl$Away.score %>% boxplot(main="Box Plot of Away Score", ylab="Away Score")
afl$Total.score %>% boxplot(main="Box Plot of Total Score", ylab="Total Score")
afl$Rainfall %>% boxplot(main="Box Plot of Rainfall", ylab="Rainfall")
afl$Accuracy %>% boxplot(main="Box Plot of Accuracy", ylab="Accuracy")

#Checking for bivariate outliers
afl %>% plot(Accuracy ~ Rainfall, data =., ylab = "Accuracy", xlab = "Rainfall", main = "Accuracy by Rainfall")
afl %>% plot(Total.score ~ Rainfall, data =., ylab = "Total Score", xlab = "Rainfall", main = "Total Score by Rainfall")
afl %>% plot(Attendance ~ Rainfall, data =., ylab = "Attendance", xlab = "Rainfall", main = "Attendance by Rainfall")
```

##	Transform 
The attendance variable has a large range, containing both observations below 10,000 and greater than 100,000 in the dataset. As such, for some analysis it might be valuable to transform this variable.

To demonstrated this I chose to use the baseR scale function to convert each observation in the variable to a Z-score, representing the number of standard deviations that the value sits away from the variable mean.

A summary of the new variable (in this case saved as a seperate object, zAttendance), shows that the mean of the variable is now 0, as we would expect following a conversion of this type.

```{r}
# Transfomring attendance variable to z-scores
zAttendance <- scale(afl$Attendance, center = TRUE, scale = TRUE)

zAttendance %>% summary()

```

<br>
<br>
