Working with Data : Assignment 1

Introduction

This notebook combines standard data sets on professional football clubs and players from the top five football leagues in Europe. It explores and compares the data in an attempt to find useful information. The data can be used for educational purposes and, hypothetically, to inform sports bettors when creating or placing bets, although this is not encouraged. All data was extracted from FBREF.com (2022, November 4).

Part 1 (Importing, merging, and cleaning data)

Begin by setting the working directory to the folder containing the data sets, then read each data set into its own data frame.

#Set working directory
setwd("D:/R Studio/Data Sets/PlayerStats2021-22CSV")
getwd()

## [1] "D:/R Studio/Data Sets/PlayerStats2021-22CSV"

#Assign data sets
premierLeagueStandardData <- read.csv("PremierLeagueStandardData.csv", encoding = "UTF-8")
                      
laLigaStandardData <- read.csv("LaLigaStandardData.csv", encoding = "UTF-8")
bundesligaStandardData <- read.csv("BundasligaStandardData.csv", encoding = "UTF-8")
seriaAStandardData <- read.csv("SeriaAStandardData.csv", encoding = "UTF-8")
ligue1StandardData <- read.csv("Ligue1StandardData.csv", encoding = "UTF-8")

Merging and cleaning data

Add ‘League’ Features to each data set

#Create vector that holds the different possible values for the new feature
leagueName <- c("Premier League", "La Liga","Bundesliga","Serie A","Ligue 1")

#Create function which adds a feature (column) holding the given value
CreateFeature <- function(dataset,values, featureName)
{
  #A single assignment is enough; the value is recycled to every row
  dataset[[featureName]] <- values
  
  return(dataset)
}

#Call function once on each data set (no loop is needed)
premierLeagueStandardData <- CreateFeature(premierLeagueStandardData,leagueName[1], featureName = "League")
laLigaStandardData <- CreateFeature(laLigaStandardData,leagueName[2], featureName = "League")
bundesligaStandardData <- CreateFeature(bundesligaStandardData,leagueName[3], featureName = "League")
seriaAStandardData <- CreateFeature(seriaAStandardData,leagueName[4], featureName = "League")
ligue1StandardData <- CreateFeature(ligue1StandardData,leagueName[5], featureName = "League")
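
As a sketch of the same idea on toy data (the two small data frames, their values, and the helper name AddLeague are made up for illustration), a constant column can be added to several data frames at once with Map():

```r
# Toy stand-ins for two league tables (hypothetical values, not FBref data)
toyLeagues <- list(
  data.frame(Player = c("A", "B"), Gls = c(3, 1)),
  data.frame(Player = "C", Gls = 7)
)
toyNames <- c("Premier League", "La Liga")

# Add a constant "League" column; the single value is recycled to every row
AddLeague <- function(dataset, league)
{
  dataset[["League"]] <- league
  return(dataset)
}

# Pair each data frame with its league name
toyLeagues <- Map(AddLeague, toyLeagues, toyNames)
toyLeagues[[1]]
```

Keeping the data frames in a list from the start would also remove the need to build the list by hand in the next step.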

Create list that contains each data frame

#Create list containing each data frame
dataSetList <- list(premierLeagueStandardData,laLigaStandardData,bundesligaStandardData,seriaAStandardData,ligue1StandardData)

Create function which merges data frames

MergeDataFrames <- function(df1,df2)
{
 merge(df1,df2,all.x = T, all.y = T)
}

#Combine elements of "dataSetList" in "MergeDataFrames" function using "Reduce()"
finalDataFrame <- Reduce(MergeDataFrames,dataSetList)
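
To illustrate what the Reduce() call does, here is a minimal sketch on made-up data frames. Because the frames share every column name, merge() keeping all rows from both sides joins on all of the columns and effectively stacks the rows, sorted by the shared columns. Assuming the real files all have identical columns, do.call(rbind, ...) would be a simpler alternative that keeps the original row order.

```r
# Made-up data frames sharing the same columns (not FBref data)
toyA <- data.frame(Player = c("A", "B"), Gls = c(3, 1), League = "Premier League")
toyB <- data.frame(Player = "C", Gls = 7, League = "La Liga")
toyC <- data.frame(Player = "D", Gls = 0, League = "Serie A")
toyList <- list(toyA, toyB, toyC)

# Full outer join on all shared columns, applied pairwise across the list
merged <- Reduce(function(df1, df2) merge(df1, df2, all = TRUE), toyList)

# Row-binding gives the same set of rows in this case
stacked <- do.call(rbind, toyList)
```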

Remove empty and unneeded rows

#Remove rows in which every value is empty
finalDataFrame <- finalDataFrame[rowSums(finalDataFrame != "", na.rm = TRUE) > 0,]

Rename columns

#Create vector that will replace uninformative labels with concise labels
conciseLabels <- c("Rank","Player","Nation","Position","Club","Age","Born","Matches Played","Starts","Minutes Played","90s Played","Total Goals","Total Assists","Non-Penalty Goals","Penalty Kicks Made","Penalty Kicks Attempted","Yellow Cards","Red Cards","Goals per 90","Assists per 90","Goals plus Assists per 90","Goals minus Penalty Kicks Made per 90","Goals plus Assists minus Penalty Kicks Made per 90","Expected Goals per Season","Non-penalty Expected Goals","Expected Assists per Season","Non-penalty Expected Goals plus Expected Assists per Season","Expected Goals per 90","Expected Assists per 90","Expected Goals + Assists per 90","Non-penalty Expected Goals per 90","Non-penalty Expected Goals + Expected Assists per 90","Matches", "League")

#Change column names
colnames(finalDataFrame) <- conciseLabels

Remove rows with NA values. These belong to young players who played only a small number of games during the season, so removing them does no harm.

#Remove empty rows
finalDataFrame <- finalDataFrame[!apply(finalDataFrame == "", 1, all),]

#Use "dplyr" library 
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Convert blank values to NA
finalDataFrame <- finalDataFrame %>% mutate_all(na_if,"")

#Remove NA values
finalDataFrame <- na.omit(finalDataFrame)

#Remove the last five rows (they contain feature names only)
#head() with a negative count drops rows from the end and keeps the original order
finalDataFrame <- head(finalDataFrame, -5)
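
When dropping trailing rows by index, note that `:` binds more tightly than `-` in R, so an expression of the form nrow(x) - 5:nrow(x) is read as nrow(x) - (5:nrow(x)). A small sketch with a made-up ten-row data frame shows the effect:

```r
toy <- data.frame(id = 1:10)
n <- nrow(toy)

# Parsed as n - (5:n): a descending vector that ends in 0
n - 5:n

# Indexing with it keeps rows 5 down to 1 (index 0 is ignored), in reverse order
reversed <- toy[n - 5:n, , drop = FALSE]

# head() with a negative count drops the last five rows and keeps the order
trimmedRows <- head(toy, -5)
```

Both forms drop the last five rows here; they differ only in the order of the rows that remain.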

Clean columns

#Remove the lower-case country code and the space that follows it (e.g. "eng ENG" becomes "ENG")
finalDataFrame[,"Nation"] <- sub("^[a-z]+ ", "", finalDataFrame[,"Nation"])
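
The Nation values pair a lower-case code with an upper-case one; the sample values below are assumed from that pattern rather than taken from the data. A fixed-position cut with substring() works only when the prefix length is constant, whereas a regular expression handles prefixes of any length, as this sketch shows:

```r
nation <- c("eng ENG", "fr FRA", "es ESP")

# Fixed-position cut: fine for two-letter prefixes, leaves " ENG" for "eng ENG"
fixedCut <- substring(nation, 4)

# Drop the leading run of lower-case letters and the space after it
cleaned <- sub("^[a-z]+ ", "", nation)
cleaned
```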

Filter features

#Remove features that will be of no use
featuresToRemove <- c("Born","Matches","Rank")

finalDataFrame <- finalDataFrame[ ,!(names(finalDataFrame) %in% featuresToRemove)]

Reset row numbers so they start at 1

rownames(finalDataFrame) <- 1:nrow(finalDataFrame)

Convert string columns to numeric

#Create vectors of column indices to have their type changed
numericIndices <- c(5:8,10:16)
doublesIndices <- c(9,17:30)

#Loop over columns, not rows. Starting at six keeps the first five columns as type character
#(this includes column 5, Age, even though it is listed in numericIndices)
for(i in 6:ncol(finalDataFrame))
{
  if(i %in% numericIndices)
  {
    #Replace commas with empty to avoid NA coercion
    finalDataFrame[[i]] <- as.numeric(gsub(",","",finalDataFrame[[i]]))
  }

  if(i %in% doublesIndices)
  {
    #Replace commas with empty to avoid NA coercion
    finalDataFrame[[i]] <- as.double(gsub(",","",finalDataFrame[[i]]))
  }
}
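
The comma removal matters because as.numeric() cannot parse thousands separators. A quick sketch with made-up values:

```r
minutes <- c("1,234", "987", "2,610")

# Without cleaning, values containing a comma become NA (with a warning)
raw <- suppressWarnings(as.numeric(minutes))

# Stripping the commas first gives the intended numbers
clean <- as.numeric(gsub(",", "", minutes))
clean
```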

Part 2 (Data exploration)

In this section we will begin by making sure there are no NA values in our final data frame. We will then separate it by league, creating 5 different subsets. From these subsets we will find the average number of goals scored, assists made, and yellow and red cards earned. We will then compare these averages to find which leagues have the highest scoring average, highest assisting average, and highest aggression average.

Confirm there are no NA values in data set

#Find the rows with an NA value in the "Age" column of "finalDataFrame"
missing <- is.na(finalDataFrame$Age)
which(missing)

## integer(0)

#Count the NA values
count <- sum(missing)
count

## [1] 0
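
The check above looks at a single column. A sketch of checking every column at once, shown on a made-up data frame (anyNA() and colSums() are base R functions):

```r
toy <- data.frame(Age = c(21, NA, 30), Goals = c(1, 2, 3))

# TRUE if any cell in the data frame is NA
hasMissing <- anyNA(toy)

# Number of NA values per column
naPerColumn <- colSums(is.na(toy))
naPerColumn
```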

Create subsets of different leagues

#Create league subsets
premierLeagueSubset <- subset(finalDataFrame, League == "Premier League")
laLigaSubset <- subset(finalDataFrame, League == "La Liga")
bundesLigaSubset <- subset(finalDataFrame, League == "Bundesliga")
serieASubset <- subset(finalDataFrame, League == "Serie A")
ligue1Subset <- subset(finalDataFrame, League == "Ligue 1")

leagueSubsets <- list(premierLeagueSubset,laLigaSubset,bundesLigaSubset,serieASubset,ligue1Subset)

Order the rows of each subset by total goals scored, highest first. Next we will determine the features that are of most interest to us and remove the rest.

for(i in 1:length(leagueSubsets))
{
  #Create holder for storing data temporarily
  subsetHolder <- leagueSubsets[[i]]
  
  #Order players by total goals, highest first
  subsetHolder <- subsetHolder[order(-subsetHolder$`Total Goals`), ]
  
  #Eliminate unwanted features
  featuresToKeep <- c(1:7,10:11,15:16,ncol(finalDataFrame))
  subsetHolder <- subsetHolder[featuresToKeep]
  
  #Assign indices accordingly
  rownames(subsetHolder) <- 1:nrow(subsetHolder)
  
  #Create an aggression feature that combines player yellow and red cards
  #Yellow cards are assigned a value of 1, red cards are assigned a value of 2
  subsetHolder$Aggression <- subsetHolder$`Yellow Cards` + (subsetHolder$`Red Cards` * 2)
  
  #Assign holder to data sets in list
  leagueSubsets[[i]] <- subsetHolder
}
#Assign new ordered data sets to subsets
premierLeagueSubset <-  leagueSubsets[[1]]
laLigaSubset <-  leagueSubsets[[2]]
bundesLigaSubset <-  leagueSubsets[[3]]
serieASubset <-  leagueSubsets[[4]]
ligue1Subset <-  leagueSubsets[[5]]
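
A worked example of the Aggression score on made-up card counts, with a yellow card counting 1 and a red card counting 2:

```r
cards <- data.frame(
  Player = c("A", "B", "C"),
  `Yellow Cards` = c(5, 0, 2),
  `Red Cards` = c(0, 1, 1),
  check.names = FALSE  # keep the spaces in the column names
)

# Player C: 2 yellows + 1 red * 2 = 4
cards$Aggression <- cards$`Yellow Cards` + cards$`Red Cards` * 2
cards$Aggression
```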

Visualize player stats in histograms to understand their distribution. We will be using the total player goals, assists, and aggression features.

#Import "ggplot2" library
library(ggplot2)

#Create 5th and 95th percentile variables for convenience
fp <- 0.05
nfp <- 0.95

TrimmedAverage <- function(dataSet, featureNumber,low, high)
{
    #Calculate lower and upper percentiles
    percentileLow <- quantile(dataSet[,featureNumber], low)
    percentileHigh <- quantile(dataSet[,featureNumber], high)

    #Select only rows between these two values
    dataSet <- dataSet[dataSet[,featureNumber] > percentileLow & dataSet[,featureNumber] < percentileHigh,] 
    return(dataSet)
}

#Create reusable function for creating histograms
CreateHistogram <- function(dataSet = premierLeagueSubset, featureNumber = 8,meanOffset = 2.9,medianOffset = 3.55,trimmed = F,low = fp,high = nfp,xlab,ylab,title)
{
  #For calculating trimmed mean
  if(trimmed == T)
  {
    dataSet <- TrimmedAverage(dataSet,featureNumber,low,high)
 
    feature <- dataSet[,featureNumber]
  }
  else
    feature <-dataSet[,featureNumber]
  
  
  #Visualize player goal distribution
  ggplot(dataSet,aes(feature, fill = Club)) + 
  geom_histogram(binwidth = 1) + 
  geom_vline(xintercept = mean(feature),col = 'black', lwd = 2) +
  geom_vline(xintercept = median(feature),col = 'blue', lwd = 2) +
  annotate("text", x = mean(feature) + meanOffset,235, label = paste("Mean =", round(mean(feature),2)), col = "black", size = 4) +
  annotate("text", x = median(feature) + medianOffset, 250, label = paste("Median =", round(median(feature),2)), col = "blue", size = 4) +
  xlab(xlab) +
  ylab(ylab) +
  ggtitle(title) + 
  theme(legend.key.size = unit(.25, 'cm'), legend.key.height = unit(.25, 'cm'), legend.key.width = unit(1, 'cm'), legend.title = element_text(size=14), legend.text =     element_text(size=10)) 
}

Goal Charts:

Print chart

 CreateHistogram(xlab = "Goals scored", ylab = "Players", title = "Total goals per season (Premier League)")

As our data is heavily skewed to one side, we should use a trimmed mean and median instead, which reduces the influence of outliers. By removing the values at or below the 5th percentile and at or above the 95th percentile we can get a more representative picture of these values.
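
To make the trimming concrete, here is a sketch on a made-up, right-skewed vector of goal counts. It mirrors the strict comparisons used in TrimmedAverage, so every value equal to a cut-off is dropped as well; in a goals column where the 5th percentile is 0, that removes all players with no goals. Base R's mean(x, trim = 0.05) is shown for comparison; it trims a fixed fraction of observations from each end instead.

```r
goals <- c(0, 0, 0, 1, 1, 2, 2, 3, 5, 20)

low <- quantile(goals, 0.05)
high <- quantile(goals, 0.95)

# Keep only values strictly between the two percentiles
kept <- goals[goals > low & goals < high]

mean(goals)               # untrimmed mean, pulled up by the outlier
mean(kept)                # mean after percentile trimming
mean(goals, trim = 0.05)  # base R trimmed mean, for comparison
```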

Premier League trimmed:

CreateHistogram(trimmed = T,xlab = "Goals scored", ylab = "Players", title = "Total goals per season (Premier League)",meanOffset = 1.25, medianOffset = 1.525)

Above you can see the mean and median values are closer to each other, resulting in a more representative average. We can assume each league will have a similar distribution. The following chunks plot the trimmed distributions for the other leagues.

La Liga trimmed:

#La Liga
CreateHistogram(dataSet = laLigaSubset,trimmed = T,xlab = "Goals scored", ylab = "Players", title = "Total goals per season (La Liga)",meanOffset = 1.05, medianOffset = 1.275)

Bundesliga trimmed:

#Bundesliga
CreateHistogram(dataSet = bundesLigaSubset,trimmed = T,xlab = "Goals scored", ylab = "Players", title = "Total goals per season (Bundesliga)",meanOffset = 1.25, medianOffset = 1.6)

Serie A trimmed:

#Serie A
CreateHistogram(dataSet = serieASubset,trimmed = T,xlab = "Goals scored", ylab = "Players", title = "Total goals per season (Serie A)",meanOffset = 1.2, medianOffset = 1.7)

Ligue 1 trimmed:

#Ligue 1
CreateHistogram(dataSet = ligue1Subset,trimmed = T,xlab = "Goals scored", ylab = "Players", title = "Total goals per season (Ligue 1)",meanOffset = .85, medianOffset = 1.3)

Assist Charts:

Premier League trimmed:

CreateHistogram(trimmed = T,xlab = "Assists", ylab = "Players", title = "Total assists per season (Premier League)", featureNumber = 9,meanOffset = 1.25, medianOffset = 1.525)

La Liga trimmed:

#La Liga
CreateHistogram(dataSet = laLigaSubset,trimmed = T,xlab = "Assists", ylab = "Players", title = "Total assists per season (La Liga)", featureNumber = 9,meanOffset = 1.05, medianOffset = 1.275)

Bundesliga trimmed:

#Bundesliga
CreateHistogram(dataSet = bundesLigaSubset,trimmed = T,xlab = "Assists", ylab = "Players", title = "Total assists per season (Bundesliga)", featureNumber = 9,meanOffset = 1.25, medianOffset = 1.6)

Serie A trimmed:

#Serie A
CreateHistogram(dataSet = serieASubset,trimmed = T,xlab = "Assists", ylab = "Players", title = "Total assists per season (Serie A)", featureNumber = 9,meanOffset = 1.2, medianOffset = 1.7)

Ligue 1 trimmed:

#Ligue 1
CreateHistogram(dataSet = ligue1Subset,trimmed = T,xlab = "Assists", ylab = "Players", title = "Total assists per season (Ligue 1)", featureNumber = 9,meanOffset = .85, medianOffset = 1.3)

Aggression Charts:

Premier League trimmed:

CreateHistogram(trimmed = T,xlab = "Red + yellow card value", ylab = "Players", title = "Carded fouls per season (Premier League)", featureNumber = 13,meanOffset = 1.25, medianOffset = 1.525)

La Liga trimmed:

#La Liga
CreateHistogram(dataSet = laLigaSubset,trimmed = T,xlab = "Red + yellow card value", ylab = "Players", title = "Carded fouls per season (La Liga)", featureNumber = 13,meanOffset = 1.05, medianOffset = 1.275)

Bundesliga trimmed:

#Bundesliga
CreateHistogram(dataSet = bundesLigaSubset,trimmed = T,xlab = "Red + yellow card value", ylab = "Players", title = "Carded fouls per season (Bundesliga)", featureNumber = 13,meanOffset = 1.25, medianOffset = 1.6)

Serie A trimmed:

#Serie A
CreateHistogram(dataSet = serieASubset,trimmed = T,xlab = "Red + yellow card value", ylab = "Players", title = "Carded fouls per season (Serie A)", featureNumber = 13,meanOffset = 1.2, medianOffset = 1.7)

Ligue 1 trimmed:

#Ligue 1
CreateHistogram(dataSet = ligue1Subset,trimmed = T,xlab = "Red + yellow card value", ylab = "Players", title = "Carded fouls per season (Ligue 1)", featureNumber = 13,meanOffset = .85, medianOffset = 1.3)

Create data frame for all of our data

Create a data frame for storing league goal, assist, and aggression trimmed averages

#Create a list containing each subset
leagueSubsets <- list(premierLeagueSubset,laLigaSubset,bundesLigaSubset,serieASubset,ligue1Subset)

#Pre-allocate one numeric slot per league
leagueAverageGoals <- numeric(length(leagueSubsets))
leagueAverageAssists <- numeric(length(leagueSubsets))
leagueAverageAggression <- numeric(length(leagueSubsets))

for(i in 1:length(leagueSubsets))
{
  goalsHolder <- TrimmedAverage(dataSet = leagueSubsets[[i]],8,fp,nfp)
  assistsHolder <- TrimmedAverage(dataSet = leagueSubsets[[i]],9,fp,nfp)
  aggressionHolder <- TrimmedAverage(dataSet = leagueSubsets[[i]],13,fp,nfp)
  
  leagueAverageGoals[[i]] <- round(mean(goalsHolder$`Total Goals`),2)
  leagueAverageAssists[[i]] <- round(mean(assistsHolder$`Total Assists`),2)
  leagueAverageAggression[[i]] <- round(mean(aggressionHolder$`Aggression`),2)
}

#Create column names for data frame
featureNames <- c("League","Average goals","Average Assists", "Average Aggression")
#Create data frame
leagueAverages <- data.frame(leagueName,leagueAverageGoals,leagueAverageAssists,leagueAverageAggression)

colnames(leagueAverages) <- featureNames
leagueAverages

Create function to represent each feature in bar charts for a visual comparison between leagues

CreateBarChart <- function(dataSet,title,xlab,ylab,xValue,yValue)
{
  ggplot(data=dataSet, aes(x= reorder(dataSet[,xValue],dataSet[,yValue]),y=dataSet[,yValue])) +
  geom_bar(stat="identity", aes(fill=dataSet[,xValue])) +
  geom_text(aes(label=dataSet[,yValue]), vjust=1.6, color="Black", size=3.5) +
  ggtitle(title)+
  xlab(xlab) +
  ylab(ylab) +
  scale_fill_discrete(name = xlab)+
  theme_minimal()
}

Average goals per league:

CreateBarChart(dataSet = leagueAverages,"Average goals scored in top 5 European Leagues (2021-2022)", "League", "Average goals scored",xValue = 1,yValue = 2)

Average assists per league:

CreateBarChart(dataSet = leagueAverages,"Average assists in top 5 European Leagues (2021-2022)", "League", "Average assists",xValue = 1,yValue = 3)

Average amount of aggression per league:

CreateBarChart(dataSet = leagueAverages,"Average amount of aggression in top 5 European Leagues (2021-2022)", "League", "Average amount of aggression",xValue = 1,yValue = 4)

Summary

To explore the data we divided our final data set into five subsets, one per league. We looked at player goal, assist, and aggression averages and concluded that trimmed averages would be more representative. Finally, we put the results into a new data frame and visualized each feature using bar charts. The charts show that the averages are fairly similar in each league, suggesting that the balance of skill between clubs is similar across leagues. The averages varied the most for aggression and the least for goals scored.

La Liga was found to have the highest aggression average (3.87), suggesting that a bet on a player to get booked in this league has a higher chance of success, whereas the Bundesliga had the lowest (2.74), suggesting a lower chance of success. It also suggests that La Liga may play the “dirtiest football”, committing the most rash challenges, and that the Bundesliga may play the “cleanest football”, committing the fewest.

The same logic can be applied to the other features. Serie A had the lowest assists average (1.8) and the Premier League had the highest (2.12), so a bet on assists may have a higher chance of winning in the Premier League and a lower chance in Serie A. It can be hypothesized that Serie A contains more selfish players who like to go for glory, whereas Premier League players are less selfish and help their teammates more often.

Applying this logic again, Serie A had the highest average of goals scored per player (2.56) and La Liga had the lowest (2.19), again suggesting that Serie A has the most selfish players, while this time La Liga has the fewest. Alternatively, it could suggest that La Liga has the strongest defenders, making it harder for opponents to score, and that Serie A has the weakest defenders, making it easier for opponents to score.

Exploring the data in this way gave us a stronger insight into and understanding of each league. While this information may be useful to sports bettors, it doesn’t cover specific player averages, which are more valuable. In the final section we will set out to answer questions using specific player averages. The goal is to better understand the data and provide insightful information that can be used in real life.

Part 3 (Finding insightful information)

In this section we will look for valuable information by answering a series of questions that could hypothetically be applied in real-life scenarios. The questions will be answered by examining specific player information, and they aim to find niche information valuable to sports bettors. They target player information based on player position. Since our final data set combines multiple leagues, any insights gained can be used in competitions where players from different leagues face each other, for example the World Cup, Champions League, or Europa League. The questions are as follows:

Which defenders scored the highest number of goals in a season?
Which midfielders are the most expected to get an assist in a game?
Which players missed the most penalties in a season?
Which forwards were booked the most?

We will begin by creating subsets based on position. Our data set contains 8 positions overall, because some players have been categorized by combined positions such as “DF,MF” (defenders who play high up the pitch) or “MF,DF” (midfielders who play further down the pitch).

#Create subsets based on player position
goalKeepers <- subset(finalDataFrame, Position == "GK")

defenders <- subset(finalDataFrame, Position == "DF")
attackingDefenders <- subset(finalDataFrame, Position =="DF,MF")

midfielders <- subset(finalDataFrame, Position == "MF")
defensiveMidfielders <- subset(finalDataFrame, Position =="MF,DF")
attackingMidfielders <- subset(finalDataFrame, Position =="MF,FW")

forwards <- subset(finalDataFrame, Position == "FW")
defensiveForwards <- subset(finalDataFrame, Position == "FW,MF")

Eliminate unwanted features

featuresToKeep <- c("Player","Club", "Position", "Total Goals","Total Assists","Yellow Cards","Red Cards","Expected Goals per 90", "Expected Assists per 90")

defenders <- defenders[ ,names(defenders) %in% featuresToKeep]

attackingDefenders <- attackingDefenders[ ,names(attackingDefenders) %in% featuresToKeep]

defensiveMidfielders  <- defensiveMidfielders[ ,names(defensiveMidfielders) %in% featuresToKeep]

midfielders <- midfielders[ ,names(midfielders) %in% featuresToKeep]

attackingMidfielders <- attackingMidfielders[ ,names(attackingMidfielders) %in% featuresToKeep]

forwards <- forwards[ ,names(forwards) %in% featuresToKeep]

defensiveForwards <- defensiveForwards[ ,names(defensiveForwards) %in% featuresToKeep]

Create bar charts to visualize data

#Create function
CreateBarChartHorizontal <- function(dataSet,title,xlab,ylab,xValue,yValue,hjust = 1,titleSize = 10)
{
  ggplot(data=dataSet, aes(y= reorder(dataSet[,xValue],dataSet[,yValue]),x=dataSet[,yValue])) +
  geom_bar(stat="identity", aes(fill=dataSet[,xValue])) +
  geom_text(aes(label=dataSet[,yValue]), hjust = hjust, vjust =.5, color= "Black", size= 3.5) +
  ggtitle(title)+
  xlab(xlab) +
  ylab(ylab) +
  scale_fill_discrete(name = ylab)+
  theme(axis.text.x=element_text(size=7),plot.title = element_text(size=titleSize))
}

Which defenders scored the highest number of goals in a season?

#Find top 10 players
defenders <- defenders[order(-defenders$`Total Goals`), ]
rownames(defenders) <- 1:nrow(defenders)

topDefenders <- defenders[1:10,]

CreateBarChartHorizontal(dataSet = topDefenders,"Top 10 goal scoring defenders (2021-2022)","Total Goals","Players",xValue = 1,yValue = 4,hjust = 1.5)
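
One caveat with taking rows 1:10 directly: if a subset has fewer than ten rows, the missing positions come back as rows of NA. A sketch on a made-up three-row data frame shows the difference with head(), which returns at most the requested number of rows:

```r
toy <- data.frame(Player = c("A", "B", "C"), Goals = c(2, 9, 4))
toy <- toy[order(-toy$Goals), ]

padded <- toy[1:10, ]     # 10 rows, the last 7 filled with NA
topRows <- head(toy, 10)  # only the 3 rows that exist
topRows
```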

Which attacking defenders scored the highest number of goals in a season?

attackingDefenders <- attackingDefenders[order(-attackingDefenders$`Total Goals`), ]
rownames(attackingDefenders) <- 1:nrow(attackingDefenders)

topAttackingDefenders <- attackingDefenders[1:10,]

CreateBarChartHorizontal(dataSet = topAttackingDefenders,"Top 10 goal scoring attacking defenders (2021-2022)",  "Total Goals","Players",xValue = 1,yValue = 4,hjust = 1.5)

Which midfielders are the most expected to get an assist in a game?

#Find top 10 players
midfielders <- midfielders[order(-midfielders$`Expected Assists per 90`), ]
rownames(midfielders) <- 1:nrow(midfielders)

topMidfielders <- midfielders[1:10,]

CreateBarChartHorizontal(dataSet = topMidfielders,"Top 10 Midfielders' expected assists per game (2021-2022)", "Expected Assists","Players", xValue = 1,yValue = 9,hjust = 1.25)

Which defensive midfielders are the most expected to get an assist in a game?

defensiveMidfielders <- defensiveMidfielders[order(-defensiveMidfielders$`Expected Assists per 90`), ]
rownames(defensiveMidfielders) <- 1:nrow(defensiveMidfielders)

topDefensiveMidfielders <- defensiveMidfielders[1:10,]

CreateBarChartHorizontal(dataSet = topDefensiveMidfielders,"Top 10 Defensive Midfielders' expected assists per game (2021-2022)", "Expected Assists","Players",xValue = 1,yValue = 9,hjust = 1.25)

Which attacking midfielders are the most expected to get an assist in a game?

attackingMidfielders <- attackingMidfielders[order(-attackingMidfielders$`Expected Assists per 90`), ]
rownames(attackingMidfielders) <- 1:nrow(attackingMidfielders)

topAttackingMidfielders <- attackingMidfielders[1:10,]

CreateBarChartHorizontal(dataSet = topAttackingMidfielders,"Top 10 Attacking Midfielders' expected assists per game (2021-2022)", "Expected Assists","Players",xValue = 1,yValue = 9,hjust = 1.25,titleSize = 8.7)

Which players missed the most penalties?

#Create penalties missed feature for final data set
finalDataFrame$`Penalties Missed` <- finalDataFrame$`Penalty Kicks Attempted` - finalDataFrame$`Penalty Kicks Made`

penaltiesMissed <- finalDataFrame[order(-finalDataFrame$`Penalties Missed`), ]
rownames(penaltiesMissed) <- 1:nrow(penaltiesMissed)

topPenaltiesMissed <- penaltiesMissed[1:10,]

CreateBarChartHorizontal(dataSet = topPenaltiesMissed,"Top 10 players who missed penalties (2021-2022)", "Penalties Missed", "Players",xValue = 1,yValue = ncol(finalDataFrame),hjust = 1.25,titleSize = 8.7)

Which forwards were booked the most?

#Create total bookings feature
forwards$`Total Bookings` <- forwards$`Yellow Cards` + forwards$`Red Cards` 
defensiveForwards$`Total Bookings` <- defensiveForwards$`Yellow Cards` + defensiveForwards$`Red Cards`

forwards <- forwards[order(-forwards$`Total Bookings`), ]
rownames(forwards) <- 1:nrow(forwards)

topforwards <- forwards[1:10,]

CreateBarChartHorizontal(dataSet = topforwards,"Top 10 forwards booked (2021-2022)","Total Bookings", "Players",xValue = 1,yValue = 10,hjust = 1.25,titleSize = 10)

defensiveForwards <- defensiveForwards[order(-defensiveForwards$`Total Bookings`), ]
rownames(defensiveForwards) <- 1:nrow(defensiveForwards)

topdefensiveForwards <- defensiveForwards[1:10,]

CreateBarChartHorizontal(dataSet = topdefensiveForwards,"Top 10 defensive forwards booked (2021-2022)", "Total Bookings","Players",xValue = 1,yValue = 10,hjust = 1.25,titleSize = 8)

Conclusion

We set out to answer four questions, each designed to find niche answers that would be valuable to sports bettors. Using the same data and the same methods, more niche information could be drawn from the data in the future. The results in this section varied much more than those in the exploration section, which suggests they are more informative. A bettor could look at the graphs above and use the information to place bets when clubs from the chosen leagues play each other.

References

Premier League stats | fbref.com. (n.d.). Retrieved November 4, 2022, from https://fbref.com/en/comps/9/Premier-League-Stats