I am a huge English Premier League (EPL) fan, and my favorite club is Manchester United (United). United plays its home matches at Old Trafford, nicknamed The Theatre of Dreams, and historically has the highest winning percentage at home. Each club plays thirty-eight games in an EPL season, nineteen at home and nineteen away. United has three main rivals: Liverpool (the equivalent of Red Sox/Yankees), Manchester City, and Arsenal. These rivalry games are called derbies in England.
I would expect derby games against rivals to be... different. There is an emotional aspect to sports that is nearly impossible to capture analytically, but my gut tells me that these derby matches at Old Trafford would be “tighter” or lower scoring. To examine this, I used a data set of EPL match results from seasons 2000/01 to 2024/25. The data include qualitative fields, such as who was winning at half time or full time, and quantitative fields such as goals, corners, fouls, and cards. I plotted the number of goals United scored in home derby matches each season and its total home goals each season in a stacked bar plot.
The graph below shows the two goal totals by season, color coded using United’s colors. The red bar shows the total goals scored at home in all matches and the yellow shows the goals scored in derbies. Since there are typically three rivalry home games out of nineteen home games, the baseline is that derby goals should make up about 15% of the season's home total (3/19 ≈ 15.8%). In eighteen of the twenty-five seasons in the data set, derby goals made up less than 15% of the home total, which I labeled “Meets Expectation.” So, there may be something to this notion of rivalry games being closer, tighter, and lower scoring.
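As a rough illustration of the baseline comparison described above, the sketch below computes the derby share of home goals per season and checks it against 3/19. The `home` data frame and its season labels "A" and "B" are made up for the example and are not taken from the real data set; only the column names `Season`, `Derby`, and `FullTimeHomeGoals` match the analysis code.

```r
# Hypothetical home results for two short "seasons" (illustrative values only).
home <- data.frame(
  Season = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Derby = c("yes", "no", "no", "no", "yes", "no", "no", "no"),
  FullTimeHomeGoals = c(1, 3, 2, 4, 2, 1, 1, 0)
)

# Baseline from the text: three derby fixtures out of nineteen home games.
baseline <- 3 / 19

# Total home goals and derby-only home goals per season.
total_goals <- tapply(home$FullTimeHomeGoals, home$Season, sum)
is_derby <- home$Derby == "yes"
derby_goals <- tapply(home$FullTimeHomeGoals[is_derby], home$Season[is_derby], sum)

# Share of each season's home goals that came in derbies.
derby_share <- derby_goals[names(total_goals)] / total_goals

# TRUE where the derby share falls below the baseline.
below_baseline <- derby_share < baseline
print(round(derby_share, 3))
print(below_baseline)
```

Counting the `TRUE` values in `below_baseline` on the real data would give the number of seasons in which derbies were lower scoring than the baseline suggests.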
I encountered three challenges when conducting my analysis. The first two dealt with analyzing the data. First, the type of data was very uniform. Nearly all of the variables are quantitative: goals, fouls, corners, and cards are all numeric, which limited the types of plot I could use. The second challenge was a potential lack of independence in the data. In football, corners lead to goals, and fouls and cards may be a result of game flow and score. The largest challenge I had in this assignment was that my initial plot included geom_text labels of the bar values. When I trelliscoped my initial plot, all of the geom_text values were displayed in every panel, not just that specific season’s values. I could not figure out how to display only the labels for the season shown in each panel. To make the final product more visually pleasing I removed those labels from the final trelliscope output, although this makes the plots less informative. I have included the initial plot below the trelliscope plots.
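One approach that might avoid the label problem is to keep the labels in the same data frame as the bars, so that any per-season subsetting of the plot data also subsets the labels. This is only a sketch under that assumption: it has not been tested with facet_trelliscope, and the `matches` data frame below is made up for illustration.

```r
library(ggplot2)

# Hypothetical per-match data standing in for United's home matches.
matches <- data.frame(
  Season = c("2003/04", "2003/04", "2003/04", "2004/05", "2004/05", "2004/05"),
  Derby = c("no", "no", "yes", "no", "yes", "yes"),
  FullTimeHomeGoals = c(3, 1, 0, 2, 1, 2)
)

# One row per Season x Derby, so each bar segment and its label share a row.
goal_sums <- aggregate(FullTimeHomeGoals ~ Season + Derby,
                       data = matches, FUN = sum)

p <- ggplot(goal_sums, aes(x = Season, y = FullTimeHomeGoals, fill = Derby)) +
  geom_col(position = "stack") +
  # No data = argument here: the labels inherit goal_sums from the plot,
  # and position_stack() centers each label inside its own segment.
  geom_text(aes(label = FullTimeHomeGoals),
            position = position_stack(vjust = 0.5))
```

A faceting call such as the facet_trelliscope(~ Season, ...) used in the final output would then be added to `p`. The design choice is that no layer carries its own separate data frame, which was the case for the two geom_text layers in the initial plot.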
The next level of investigation here would be to look at how United and its rivals finished each year, or at other measures of each team’s quality each year. There are years in this data set that are outliers. For instance, United scored more derby goals but fewer overall goals in 2004/05 than in 2003/04. One could also look at the relationship between the cost of each squad, the money spent on players, and how that translates into goals.
# load packages: dplyr for %>%, filter, group_by, summarise; ggplot2 for plotting;
# trelliscopejs for facet_trelliscope()
library(dplyr)
library(ggplot2)
library(trelliscopejs)

#bring in file
epl <- read.csv("EPL 2020 - 2025.csv")

# keep only matches involving Manchester United
epl_manu <- epl %>%
  filter(AwayTeam == "Man United" | HomeTeam == "Man United")

# total home goals per season
home_goal_sum <- epl_manu %>%
  filter(HomeTeam == "Man United") %>%
  group_by(Season) %>%
  summarise(goals2 = sum(FullTimeHomeGoals))

# home goals in derby matches per season
home_goal_sum_derb <- epl_manu %>%
  filter(HomeTeam == "Man United", Derby == "yes") %>%
  group_by(Season) %>%
  summarise(goals3 = sum(FullTimeHomeGoals))

# stacked bar plot of home goals, one trelliscope panel per season
epl_manu %>%
  filter(HomeTeam == "Man United") %>%
  ggplot(aes(x = Season, y = FullTimeHomeGoals, fill = Derby)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#DA020E", "#FFE500"), labels = c("All matches", "Derbies")) +
  scale_y_continuous(limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
  # labels removed here because every season's values appeared in every panel
  # geom_text(data = home_goal_sum, aes(x = Season, y = goals2, label = goals2, fill = NULL, size = 0.25), nudge_y=1.75, show.legend = FALSE) +
  # geom_text(data = home_goal_sum_derb, aes(x = Season, y = goals3, label = goals3, fill = NULL, size = 0.25), nudge_y=2, show.legend = FALSE) +
  labs(x = "Season", y = "United Home Goals") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(legend.position = "bottom") +
  theme(legend.title = element_blank()) +
  facet_trelliscope(~ Season,
                    name = "Problem 4 - Manchester United Home Goals, All Matches & Derbies",
                    desc = "Seasons 2000/01 through 2024/25",
                    nrow = 1, ncol = 2,
                    path = ".",
                    self_contained = TRUE)
## using data from the first layer
#bring in file
#epl <- read.csv("EPL 2020 - 2025.csv")
epl_manu <- epl %>%
  filter(AwayTeam == "Man United" | HomeTeam == "Man United")

home_goal_sum <- epl_manu %>%
  filter(HomeTeam == "Man United") %>%
  group_by(Season) %>%
  summarise(goals2 = sum(FullTimeHomeGoals))

home_goal_sum_derb <- epl_manu %>%
  filter(HomeTeam == "Man United", Derby == "yes") %>%
  group_by(Season) %>%
  summarise(goals3 = sum(FullTimeHomeGoals))

epl_manu %>%
  filter(HomeTeam == "Man United") %>%
  ggplot(aes(x = Season, y = FullTimeHomeGoals, fill = Derby)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#DA020E", "#FFE500"), labels = c("All matches", "Derbies")) +
  scale_y_continuous(limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
  geom_text(data = home_goal_sum, aes(x = Season, y = goals2, label = goals2, fill = NULL, size = 0.25), nudge_y = 1.75, show.legend = FALSE) +
  geom_text(data = home_goal_sum_derb, aes(x = Season, y = goals3, label = goals3, fill = NULL, size = 0.25), nudge_y = 2, show.legend = FALSE) +
  labs(x = "Season", y = "United Home Goals") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(legend.position = "bottom") +
  theme(legend.title = element_blank())