Kanav Duggal

05/10/2020

Introduction

Here I’ve attempted to visualize results from the English Premier League (the most watched football league globally) between 2006 and 2018, using data from Opta and the Premier League’s website (obtained indirectly through Kaggle). I was curious to see whether, in each season, the number of home wins (when a team wins a match at its own stadium against a visiting opponent) outweighed the number of away wins and draws.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)

Data

The raw data included information on each of the 4,560 games played between 2006 and 2018. I reduced the data to just the match result and season to make it easier to plot the graph below. As expected, in every season, home wins outnumbered both away wins and draws. While I would have liked to extend this analysis a little further, many of the factors that contribute to home advantage, such as familiarity with the playing venue, the psychological impact of local crowd support, and environmental factors, are difficult to measure. That said, a study conducted by the Department of Psychology at Harvard University, which analyzed 5,000 EPL games between 1992 and 2006, found that, while the size of the home crowd had some effect on the result, home teams were more likely to receive preferential treatment from referees.
epl = read.csv("/Users/KD/OneDrive - Imperial College London/Modules/Quantitative Methods/Lecture 3/Visualisation Exercise/results.csv", header = TRUE)

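# Keep only the match result and the season
epl_winstats <- epl %>% select(result, season)

# Cross-tabulate result by season, then flatten to one row per (result, season) pair
counts <- table(epl_winstats)
plot_data <- data.frame(counts)

# Grouped bar chart: one bar per result type within each season
ggplot(plot_data, aes(fill = result, x = season, y = Freq)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  labs(title = "English Premier League Results (2006 - 2018)",
       x = "Season", y = "# of Wins/Draws", fill = "Win Type") +
  scale_fill_discrete(labels = c("Away Wins", "Draws", "Home Wins"))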
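
The per-season comparison shown in the chart can also be checked numerically. The following is a minimal sketch rather than part of the original analysis: it uses a small made-up data frame in place of results.csv, and it assumes the result column is coded "H", "D" and "A" for home wins, draws and away wins. The names toy, season_summary and home_share are hypothetical.

```r
library(dplyr)

# Made-up rows standing in for results.csv (not real match data).
# Assumed coding: "H" = home win, "D" = draw, "A" = away win.
toy <- data.frame(
  season = c("2006-2007", "2006-2007", "2006-2007", "2006-2007",
             "2007-2008", "2007-2008", "2007-2008"),
  result = c("H", "H", "A", "D", "H", "D", "H")
)

# Count matches per season and result, then convert counts to within-season shares
season_summary <- toy %>%
  count(season, result, name = "matches") %>%
  group_by(season) %>%
  mutate(share = matches / sum(matches)) %>%
  ungroup()

# Keep only the home-win rows to read off the home-win share for each season
home_share <- season_summary %>% filter(result == "H")
print(home_share)
```

Replacing toy with epl_winstats should give the same summary for the real data, provided the column names and result codes match the assumptions above.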