Preparing Data

This project walks through some common data tidying tasks: renaming and reordering columns, and adding and removing columns.

First load the dplyr package:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Load the CSV file that we downloaded from fbref.com:

df <- read.csv("./brentford_away.csv", header = TRUE, sep = ",")

Below is a function that allows easy reordering of columns. This will come in useful later:

##arrange df vars by position
##'vars' must be a named vector, e.g. c("var.name"=1)
arrange.vars <- function(data, vars){
    ##stop if not a data.frame (names() and the column indexing below assume one)
    stopifnot(is.data.frame(data))

    ##sort out inputs
    data.nms <- names(data)
    var.nr <- length(data.nms)
    var.nms <- names(vars)
    var.pos <- vars
    ##sanity checks
    stopifnot( !any(duplicated(var.nms)), 
               !any(duplicated(var.pos)) )
    stopifnot( is.character(var.nms), 
               is.numeric(var.pos) )
    stopifnot( all(var.nms %in% data.nms) )
    stopifnot( all(var.pos > 0), 
               all(var.pos <= var.nr) )

    ##prepare output
    out.vec <- character(var.nr)
    out.vec[var.pos] <- var.nms
    out.vec[-var.pos] <- data.nms[ !(data.nms %in% var.nms) ]
    stopifnot( length(out.vec)==var.nr )

    ##re-arrange vars by position
    data <- data[ , out.vec, drop = FALSE]
    return(data)
}

View the head of the file:

head(df)

We can see that read.csv hasn’t used the column names we want as the header, because of how the original file is structured: the names we want have ended up in the first row of data. We need to promote that row to the header and then remove it from the data:

colnames(df) <- as.character(unlist(df[1, ])) # Set column names to the values in row 1 (as a plain character vector)
df <- df[-1, ] # Remove the first row now that its values are the header
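To see the header fix in isolation, here is a minimal sketch on a made-up three-column file. The names Group, Other, Player, Min and Gls and all the values are invented for illustration, not taken from the fbref export. It also shows two things worth knowing: the promoted columns are still text until converted, and read.csv can skip the extra line itself, assuming the unwanted header occupies exactly one line.

```r
# Toy stand-in for a file whose real header sits on the second line.
# All names and values below are made up for illustration.
csv_text <- "Group,Group,Other
Player,Min,Gls
Smith,90,1
Jones,45,0"

# Read as in the tutorial: line 1 becomes the header, the real names land in row 1
toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Promote row 1 to the header, then drop it and reset the row names
names(toy) <- as.character(unlist(toy[1, ]))
toy <- toy[-1, ]
rownames(toy) <- NULL

# Every column was read as text because row 1 held names; convert where possible
toy <- type.convert(toy, as.is = TRUE)

# Alternative: skip the first line so read.csv picks up the real header directly
toy_skip <- read.csv(text = csv_text, skip = 1, stringsAsFactors = FALSE)
```

Both routes should give the same column names here; the skip route avoids the manual type conversion.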

Rename the duplicate columns; otherwise we may get an error when using dplyr:

colnames(df)[26] <- "Passes_Att"
colnames(df)[28] <- "Passes_Prog"
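The positions 26 and 28 depend on the layout of this particular export. As a rough sketch, the duplicates can also be located programmatically before deciding what to rename. The vector of names below is invented for illustration and is much shorter than the real header.

```r
# Hypothetical header with repeated names (not the real fbref columns)
nms <- c("Player", "Att", "Cmp", "Att", "Prog", "Prog")

# Positions of every name that repeats an earlier one
dup_idx <- which(duplicated(nms))

# One option: let base R append a suffix to each repeat
unique_nms <- make.unique(nms, sep = "_")
```

Here `dup_idx` holds the positions to rename by hand, while `make.unique()` is a quick fallback when descriptive names such as Passes_Att are not needed.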

Next, delete some unwanted columns. Let’s remove Nation using ‘select’ with the ‘-’ operator:

df <- select(df, -Nation)

Let’s remove Age by its column index:

df <- df[-c(4)]
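Dropping by index only works while Age really is the fourth column. A small sketch of dropping by name in base R, on a toy data frame whose columns and values are made up for illustration:

```r
# Toy data frame; the values are invented
toy <- data.frame(
  Player = c("Smith", "Jones"),
  Nation = c("ENG", "WAL"),
  Age    = c(24, 31),
  Min    = c(90, 45),
  stringsAsFactors = FALSE
)

# Keep every column whose name is not in the drop list
drop_cols <- c("Nation", "Age")
toy <- toy[ , !(names(toy) %in% drop_cols), drop = FALSE]
```

Because the columns are matched by name, the result does not change if the column order in the source file shifts.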

Now we’ll add a column called Opposition and fill it with ‘Brentford’. This will come in useful when we start to combine player data from other matches:

df$Opposition <- rep("Brentford", times = nrow(df)) # one value per row (14 rows in this file)

By default the new column is added at the end of the data frame, so we need to reorder the columns.

Now set Opposition to be at index 4:

df <- arrange.vars(df, c("Opposition"=4))
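For a single column, the same move can be sketched in a few lines of base R. The toy data frame below is invented for illustration; only the column name Opposition and the target position 4 come from the steps above.

```r
# Toy data frame with made-up columns
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

# A length-one value is recycled to every row
toy$Opposition <- "Brentford"

# Build the new order: all other columns, with Opposition slotted in at position 4
pos <- 4
others <- setdiff(names(toy), "Opposition")
new_order <- append(others, "Opposition", after = pos - 1)
toy <- toy[ , new_order, drop = FALSE]
```

dplyr 1.0.0 and later also provide `relocate()` for moving columns; `arrange.vars` remains handy when several columns need fixed positions at once.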

Now that we’re happy with the data, we can copy it to a variable with a name that distinguishes it from other matches:

brentford_away <- df
brentford_away