This project walks through some common data tidying tasks: renaming, reordering, adding and removing columns.
First, load the dplyr package:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Load the CSV file that we downloaded from fbref.com:
df <- read.csv("./brentford_away.csv", header = TRUE, sep = ",")
Below is a function that allows easy reordering of columns. This will come in useful later:
##arrange df vars by position
##'vars' must be a named vector, e.g. c("var.name"=1)
arrange.vars <- function(data, vars) {
  ## stop if not a data.frame (but should work for matrices as well)
  stopifnot(is.data.frame(data))

  ## sort out inputs
  data.nms <- names(data)
  var.nr <- length(data.nms)
  var.nms <- names(vars)
  var.pos <- vars

  ## sanity checks
  stopifnot(!any(duplicated(var.nms)),
            !any(duplicated(var.pos)))
  stopifnot(is.character(var.nms),
            is.numeric(var.pos))
  stopifnot(all(var.nms %in% data.nms))
  stopifnot(all(var.pos > 0),
            all(var.pos <= var.nr))

  ## prepare output
  out.vec <- character(var.nr)
  out.vec[var.pos] <- var.nms
  out.vec[-var.pos] <- data.nms[!(data.nms %in% var.nms)]
  stopifnot(length(out.vec) == var.nr)

  ## re-arrange vars by position
  data <- data[, out.vec]
  return(data)
}
View the head of the file:
head(df)
We can see that the header does not contain the column names we want. Because of how the original file is structured, the real column names have ended up in the first row of data. We need to make that row the header and then drop it from the data:
colnames(df) <- as.character(unlist(df[1, ])) # Set column names to the values in row 1 (as a plain character vector)
df <- df[-1, ] # Remove the first row, as its values are now the header
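An alternative worth considering is to skip the extra line while reading the file, so that R takes the second line as the header. This also keeps numeric columns numeric, whereas promoting a data row to the header leaves every column as character because the names were read in as data. The sketch below uses made-up inline CSV text with hypothetical player names, and it assumes there is exactly one extra line above the real header, which may not hold for the actual download.

```r
# Hypothetical CSV text with an extra grouping line above the real header
csv_text <- paste(
  ",,Performance,Performance",
  "Player,Min,Gls,Ast",
  "Player A,90,1,0",
  "Player B,75,0,1",
  sep = "\n"
)

# skip = 1 drops the first line, so the second line is used as the header
toy <- read.csv(text = csv_text, skip = 1, header = TRUE)

names(toy)      # column names taken from the second line
class(toy$Min)  # numeric columns are parsed as numbers, not text
```

With a real file the same idea would be `read.csv("./brentford_away.csv", skip = 1)`, provided the layout matches the assumption above.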
Rename the duplicated column names, otherwise we may get an error when using dplyr:
colnames(df)[26] <- "Passes_Att"
colnames(df)[28] <- "Passes_Prog"
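The positions 26 and 28 above depend on the exact layout of this file. As a rough sketch, duplicated names can also be located programmatically with base R. The name vector here is made up for illustration and is not the real fbref header.

```r
# Hypothetical column names containing repeats
nms <- c("Player", "Att", "Cmp", "Att", "Prog", "Prog")

# Positions of names that repeat an earlier name
dup_pos <- which(duplicated(nms))
dup_pos

# make.unique() appends a numeric suffix to each repeat
unique_nms <- make.unique(nms, sep = "_")
unique_nms
```

Automatic suffixes such as `Att_1` are less descriptive than hand-picked names like `Passes_Att`, so this is mainly useful for finding which columns need renaming.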
Delete some unwanted columns. Let’s remove Nation using dplyr’s select() with the ‘-’ operator:
df <- select(df, -(Nation))
Let’s remove Age by its column index:
df <- df[-c(4)]
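Removing by index breaks silently if the column order ever changes, so dropping by name is often the safer option. A minimal sketch on a made-up data frame (the values are hypothetical):

```r
library(dplyr)

# Hypothetical player data
toy <- data.frame(
  Player = c("Player A", "Player B"),
  Nation = c("xx", "yy"),
  Pos    = c("FW", "MF"),
  Age    = c(25, 30)
)

# Drop both columns by name in a single call
toy <- select(toy, -Nation, -Age)
names(toy)
```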
Now we’ll add a column called Opposition and fill it with ‘Brentford’. This will come in useful when we start to combine player data from other matches:
df$Opposition <- rep("Brentford", times = nrow(df)) # one value per row, without hard-coding the row count
By default the new column is added at the end of the data frame, so we need to reorder the columns.
Set Opposition to be at index 4 using the function defined earlier:
df <- arrange.vars(df, c("Opposition"=4))
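For comparison, dplyr (version 1.0.0 and later) provides relocate(), which moves columns relative to other columns by name. This is a sketch on a made-up data frame rather than a replacement for the helper above; the column values are hypothetical.

```r
library(dplyr)

# Hypothetical data with Opposition as the last column
toy <- data.frame(
  Player     = c("Player A", "Player B"),
  Pos        = c("FW", "MF"),
  Min        = c(90, 75),
  Opposition = "Brentford"
)

# Move Opposition so that it sits directly after Player
toy <- relocate(toy, Opposition, .after = Player)
names(toy)
```

Positioning relative to a named column keeps working even if columns are added or removed elsewhere in the data frame.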
Now that we’re happy with the data, we can copy it to a more specific variable name so that it is distinct from the data for other matches:
brentford_away <- df
brentford_away
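The Opposition column becomes useful once several matches are stacked into one table. The following sketch assumes two made-up match data frames with identical columns; the player names, the second opponent and the minutes are all hypothetical.

```r
library(dplyr)

# Two hypothetical match data frames with the same columns
match_one <- data.frame(
  Player     = c("Player A", "Player B"),
  Opposition = "Brentford",
  Min        = c(90, 75)
)
match_two <- data.frame(
  Player     = c("Player A", "Player C"),
  Opposition = "Team X",
  Min        = c(60, 90)
)

# Stack the matches; Opposition records which match each row came from
all_matches <- bind_rows(match_one, match_two)

# Pull out the rows for a single match again
brentford_rows <- filter(all_matches, Opposition == "Brentford")
brentford_rows
```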