Introduction

For this project I will implement several kinds of visualizations to help us better understand the data and what it is telling us. A variety of techniques will be used to highlight different aspects of the data set.

Dataset

The chosen dataset is epl2020.csv, which contains match statistics from the 2020 English Premier League Season. Most of the variables are self-explanatory, with the exception of expected goals. Expected goals do not record whether a goal was scored; they measure how likely each shot taken was to go in. In theory, if a team accumulates more expected goals than its opponent in any given match, then it should have a better chance of securing a result. Expected points is another variable that may be new, but we can think of it as an extension of expected goals.
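To make the two ideas concrete, here is a minimal sketch in R. The shot probabilities are invented for illustration and do not come from epl2020.csv. The expected points step assumes that each side's goals follow an independent Poisson distribution with mean equal to its expected goals; that is one common simplification, not necessarily how the xpts column in the dataset was produced.

```r
# Hypothetical per-shot scoring probabilities (not taken from epl2020.csv)
home_shots <- c(0.40, 0.10, 0.05, 0.30)
away_shots <- c(0.08, 0.12, 0.03)

# Expected goals: the sum of the individual shot probabilities
home_xg <- sum(home_shots)
away_xg <- sum(away_shots)

# Assumption: each side's goals are independent Poisson counts with mean xG.
# Scorelines above 10 goals are ignored as negligible.
goals <- 0:10
score_prob <- outer(dpois(goals, home_xg), dpois(goals, away_xg))

# Rows index home goals, columns index away goals
p_home_win <- sum(score_prob[lower.tri(score_prob)])  # home goals > away goals
p_draw     <- sum(diag(score_prob))                   # equal goals

# Expected points for the home side: 3 for a win, 1 for a draw
home_xpts <- 3 * p_home_win + 1 * p_draw
```

Under these assumptions the side with more expected goals also ends up with more expected points, which is why the two variables can be read together.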

Findings

What we found is that expected goals are not always a true indicator of a team’s likelihood of succeeding. We also found interesting information about how certain teams play under certain circumstances. A few of the visualizations also show what the ebbs and flows of a season look like.

Tab 1

This bar chart shows the difference between each team’s actual points and its expected points. What is perhaps most interesting is how greatly Liverpool overperformed relative to their expected points. Based on expected points, Manchester City would have been the champions, but Liverpool’s remarkable gap of about 20 between expected and actual points let them take home the trophy. Judging by the bars, there also appears to be about a 50/50 split between teams that underperformed and teams that overperformed.

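library(dplyr)    # group_by(), summarize(), %>%
library(sqldf)    # SQL joins on data frames
library(ggplot2)  # plotting

# Actual points per team over the season
sum1 <- epl2020 %>%
  group_by(teamId) %>%
  summarize(Point_Total = sum(pts)) %>%
  ungroup()

# Expected points per team over the season
sum2 <- epl2020 %>%
  group_by(teamId) %>%
  summarize(xPoints = sum(xpts)) %>%
  ungroup()

# Join the two summaries on teamId
epl1_inner <- sqldf("select *
                  from sum2 inner join sum1 using(teamId)
                  ")

# Long format: one row per team and point type, sized from the data
df2 <- data.frame(Groups = rep(c("xPoints", "Points"), each = nrow(epl1_inner)),
                  Team = rep(epl1_inner$teamId, 2),
                  Points_Total = c(epl1_inner$xPoints, epl1_inner$Point_Total))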

ggplot(data=df2, aes(x=Team, y=Points_Total, fill=Groups)) +
  geom_bar(stat="identity", position=position_dodge()) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Expected Points vs. Actual Points") +
  xlab("Team") + 
  ylab("Point Totals") 
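The gap between actual and expected points that the bars show can also be computed directly. The following is a minimal sketch on a small invented data frame that only borrows the column names teamId, pts and xpts; the team labels and values are hypothetical.

```r
# Toy stand-in for epl2020 with the same column names (values are invented)
toy <- data.frame(
  teamId = c("A", "A", "B", "B", "C", "C"),
  pts    = c(3, 3, 1, 0, 0, 1),
  xpts   = c(1.8, 2.1, 1.4, 1.2, 0.9, 0.7)
)

# Season totals per team
totals <- aggregate(cbind(pts, xpts) ~ teamId, data = toy, FUN = sum)

# Positive values mean the team earned more points than expected
totals$diff <- totals$pts - totals$xpts

# Share of teams that overperformed
share_over <- mean(totals$diff > 0)
```

Applied to the real season totals, a share close to 0.5 would correspond to the roughly even split described above.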

Tab 2

Next we will look at a heat map of points by match day to see if there are any trends that could point to a team underperforming or overperforming. For example, when we look at a team like Man City, one that underachieved, we can see that they struggled to get points on Fridays and Sundays. Comparing that with the champions, Liverpool, we can see that there was not a single day on which they struggled to get a positive result. Aston Villa tells the story of an inconsistent team: their heat map shows three days on which they played well and another three on which they played poorly.

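# Order the weekdays Monday to Sunday instead of alphabetically
mylevel <- c('Mon', 'Tue','Wed', 'Thu', 'Fri', 'Sat', 'Sun')
epl2020$matchDay <- factor(epl2020$matchDay, levels = mylevel)
# matchDay is already an ordered factor, so refer to the column directly inside aes()
ggplot(data = epl2020, aes(x = matchDay, y = teamId, fill = pts)) +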
  geom_tile(color="black") +
  coord_equal(ratio=1) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title="Heatmap: Points by MatchDay",
       x = "MatchDay",
       y = "Team")
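One thing to keep in mind when reading the heat map: geom_tile draws one tile per row, so when a team plays several matches on the same weekday the tiles are drawn on top of each other. A possible alternative, sketched here on an invented data frame that only reuses the column names, is to average points per team and day first and then plot the summary. This is a suggestion, not what the code above does.

```r
# Toy stand-in for epl2020 (values are invented)
toy <- data.frame(
  teamId   = c("A", "A", "A", "B", "B"),
  matchDay = c("Sat", "Sat", "Sun", "Sat", "Sun"),
  pts      = c(3, 0, 1, 1, 3)
)

# One row per team and weekday, holding the average points earned
avg_pts <- aggregate(pts ~ teamId + matchDay, data = toy, FUN = mean)
```

The resulting avg_pts could be passed to the same ggplot call in place of epl2020, with fill = pts then representing the average rather than a single match.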

Tab 3

Now we will narrow our focus to just the “Big 6”, the most popular and well-known teams in the EPL. This line graph shows how each team’s season went. At first glance it is not hard to see that Liverpool had an exceptional season compared to the rest. We also see that about a third of the way through, Man City and Chelsea were neck and neck in the standings. However, something happened near the middle of the season that either lifted Man City’s trajectory or sank Chelsea’s. Man Utd and Tottenham have similar lines, which may reflect a season-long jockeying for European positions. Lastly, Arsenal appear to have had the most turbulent season, with a rapid start followed by a bad second half.

# Keep only the matches played by the Big Six clubs
big_six <- c("Arsenal", "Liverpool", "Man Utd", "Man City", "Tottenham", "Chelsea")
df3 <- epl2020[epl2020$teamId %in% big_six, ]

ggplot(df3, aes(x=round, y=tot_points, group=teamId)) +
  geom_line(aes(color=teamId), size = 2) +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point(shape=21, size=2, color="black", fill = "white") +
  labs(title="Total Points by Team and Round for the Big Six",
       x = "Match Week",
       y = "Total Points")

Tab 4

Because of Arsenal’s up-and-down season, we are going to examine their expected goal difference over every game of the season. Apart from a stretch of six or so games around match week ten, Arsenal’s season seems to have been a story of playing well one game and poorly the next. We have labeled the best and worst expected goal differentials of their season. Their inconsistency is showcased here, as the best and worst values have only one game between them. It must have been a truly frustrating season to be an Arsenal fan. One possible explanation for this up-and-down season is home field advantage.

library(ggrepel)  # geom_label_repel()

# Keep only Arsenal's matches (the trailing comma selects rows, not columns)
Arsenal <- epl2020[epl2020$teamId == "Arsenal", ]
# Expected goal difference: expected goals for minus expected goals against
Arsenal$Expected_Goal_Difference <- Arsenal$xG - Arsenal$xGA

# Rows holding the season's best and worst expected goal difference
hi_lo <- Arsenal %>%
  filter(Expected_Goal_Difference == min(Expected_Goal_Difference) | Expected_Goal_Difference == max(Expected_Goal_Difference)) %>%
  data.frame()

ggplot(Arsenal, aes(x = round, y = Expected_Goal_Difference)) +
  geom_line(color='black', size=1) +
  geom_point(shape=21, size=4, color='red', fill='white') +
  labs(title="Expected Goal Difference for Arsenal",
       x = "Match Week",
       y = "Expected Goal Difference") +
  geom_point(data = hi_lo, aes(x = round, y = Expected_Goal_Difference), shape=21, size = 4, fill = 'red', color='red') +
  geom_label_repel(aes(label=   ifelse(Expected_Goal_Difference == max(Expected_Goal_Difference) | Expected_Goal_Difference == min(Expected_Goal_Difference), Expected_Goal_Difference, "")),
                   box.padding=1,
                   point.padding=1,
                   size=4,
                   color='Grey50',
                   segment.color='darkblue') +
  theme(plot.title = element_text(hjust = 0.5))

Tab 5

From this pie chart we can see that home field advantage is a possible explanation for why Arsenal’s season went the way it did. Although the scales are not tilted heavily, the majority of the goals in the 2020 EPL season were scored by the home team. In a game where few goals are scored, even the smallest of margins, as the pie chart shows, can make the difference.

ggplot(data = epl2020, aes(x="", y = scored, fill = h_a)) +
  geom_bar(stat="identity", position="fill") +
  coord_polar(theta="y", start=0) +
  labs(fill = "Home(h) and Away(a)", x =NULL, y = NULL, title = "Goals Scored by Home and Away Teams") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) 
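The shares behind the pie chart can also be computed as numbers, which makes the size of the home tilt easier to judge than reading it off the slices. The following is a minimal sketch on an invented data frame that only reuses the column names h_a and scored; the values are hypothetical.

```r
# Toy stand-in for epl2020 (values are invented)
toy <- data.frame(
  h_a    = c("h", "a", "h", "a"),
  scored = c(2, 1, 1, 1)
)

# Total goals scored by home (h) and away (a) sides
goals_by_side <- tapply(toy$scored, toy$h_a, sum)

# Each side's share of all goals
share <- goals_by_side / sum(goals_by_side)
```

On the real data, share[["h"]] would give the fraction of goals scored by home teams.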

Conclusion

In conclusion, we have seen that while expected goals and expected points are good measures of a team’s probability of success, other factors still weigh heavily on the results of games and on the season standings. We have looked at the Premier League season as a whole, and we must remember that there are ups and downs in just about any team’s season and that there is no reason to panic after a run of bad results. From the heat map and pie chart we have gathered that there may be such a thing as home field advantage, and that some teams may not be as good as others when it comes to playing on the big stage in prime time. Lastly, after examining Arsenal’s season, in which their best and worst performances came almost back to back, we have to keep in mind just how unpredictable sports can be.