Introduction

The data set I chose to analyze covers soccer games in La Liga from 2014 to 2020. La Liga is a well-known soccer league in Spain and is the highest level of soccer in the country. It is regarded as one of the big five leagues in Europe, based on its teams' success in European competition and its high level of domestic competition. La Liga includes some of the best-known soccer teams in the world, with decorated histories, such as FC Barcelona and Real Madrid CF.

Dataset

The data set consists of 2660 observations described by 41 variables. The variables describe each game by providing the teams that played, the score, and many other descriptive statistics, such as the number of shots on target and fouls for each team. The visualizations focus on four main variables: match excitement, score, home team rating, and home team goals scored. The purpose is to observe how these variables have changed over the years covered by the data.

Findings

Analyzing this data set helped me gain an understanding of some of the trends in La Liga. Games tend to have an average excitement level, and the home team more frequently performs better than the visitor. Games are usually low scoring, with the home team coming out on top more often than the away team. These trends can be seen in the visualizations that follow.

setwd("U:/")

library(data.table)
library(dplyr)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(plotly)

df <- fread("LaLiga_all_years.csv")

Match Excitement

The analysis of the La Liga data starts with a visualization of the most common match excitement ratings given each year. The top 10 ratings account for a relatively small number of matches, which could mean that match excitement ratings are scattered across many values. Match excitement is a rating given at the end of the match that describes how exciting it was, on a scale from 0 to 10. The least exciting match would receive a 0, while the most exciting match would receive a 10.

In this visualization the most common ratings fall between 3.7 and 6.0. The most common ratings are therefore around 5, which makes sense because that is the middle of the rating scale. The most common rating across the seven years of data is 5.7, which occurred 68 times. The other ratings are close behind, most of them by only a few counts. The plot shows that these ratings have become more frequent in the past two years than in many of the years before them. It is therefore more common for a La Liga game to have an average excitement rating of around 5 than to be very exciting or not exciting at all.

df_reasons <- dplyr::count(df, `Match Excitement`)
df_reasons <- df_reasons[order(df_reasons$n, decreasing = TRUE),]

top_reasons <- df_reasons$`Match Excitement`[1:10]

new_df <- df %>%
  filter(`Match Excitement` %in% top_reasons) %>%
  dplyr::select(year, `Match Excitement`) %>%
  group_by(year, `Match Excitement`) %>%
  dplyr::summarise(n= length(year), .groups = 'keep') %>%
  data.frame()

agg_tot <- new_df %>%
  dplyr::select(Match.Excitement, n) %>%
  group_by(Match.Excitement) %>%
  dplyr::summarise(tot = sum(n), .groups = 'keep') %>%
  data.frame()

new_df$year <- as.factor(new_df$year)

ggplot(new_df, aes(x=`Match.Excitement`, y=n, fill=year))+
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE))+
  coord_flip()+
  labs(title = "Top 10 Match Excitement Ratings by Year", x = "Match Excitement Rating", y = "Count of Ratings")+
  theme_light()+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_fill_brewer(palette="Set3", guide = guide_legend(reverse = TRUE))+
  geom_text(data = agg_tot, aes(x= `Match.Excitement`, y= tot, label = tot, fill = NULL), hjust= -0.2, size=4)+
  scale_x_continuous(breaks = c(3.6,3.8,4,4.2,4.4,4.6,4.8,5,5.2,5.4,5.6,5.8,6,6.2))
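
The "count, sort, keep the most frequent" step behind this chart can be illustrated on a small scale. The following is a minimal sketch in base R, assuming a made-up vector of ratings (the values and the name `ratings` are hypothetical and not taken from the La Liga file); it only shows the selection logic, not the analysis itself.

```r
# Hypothetical excitement ratings; not taken from the La Liga file.
ratings <- c(5.7, 5.7, 5.7, 4.9, 4.9, 6.0, 5.1, 5.1, 5.1, 5.1, 3.7)

# Count each distinct rating and sort from most to least frequent.
rating_counts <- sort(table(ratings), decreasing = TRUE)

# Keep the k most common ratings (k = 10 in the analysis; 2 here for brevity).
k <- 2
top_ratings <- as.numeric(names(rating_counts)[seq_len(k)])
print(top_ratings)
```

One thing to keep in mind with this approach is that ratings tied at the cutoff are kept or dropped based on their position in the sorted counts, so the tenth and eleventh most common ratings may differ by very little.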

Most Common Scores

Following the analysis of match excitement, the next visualization deals with the most common final scores. Unlike the match excitement ratings, this visualization covers a much larger amount of data. Soccer generally produces low total scores. This shows in the visualization: no score has more than 4 combined goals by both teams.

The most common final score is 1-1. In contrast, the least common score in this visualization is 3-0, with the home team winning. The chart shows that close, low-scoring games are the most common outcome. It also shows that home wins are much more common than away wins: there is a clear difference between the number of times a home team wins “1-0” and the number of times an away team wins “0-1”, and likewise between “2-0” and “0-2”. This is a useful lead-in, as the following sections look more closely at data on the home teams.

df_reasons2 <- count(df, `Score`)
df_reasons2 <- df_reasons2[order(df_reasons2$n, decreasing = TRUE),]
top_reasons2 <- df_reasons2$Score[1:10]

scores_df <- df%>%
  filter(Score %in% top_reasons2) %>%
  dplyr::select(year, Score) %>%
  group_by(year, Score) %>%
  dplyr::summarise(n= length(year), .groups = 'keep') %>%
  data.frame()

agg_tot2 <- scores_df %>%
  dplyr::select(Score, n) %>%
  group_by(Score) %>%
  dplyr::summarise(tot = sum(n), .groups = 'keep') %>%
  data.frame()

scores_df$year <- as.factor(scores_df$year)

ggplot(scores_df, aes(x= `Score`, y=n, group=year))+
  geom_line(aes(color=year))+
  labs(title =  "Top 10 Most Common Final Scores by Year", x= "Score", y= "Count of Score")+
  theme_light()+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_point(shape= 21, size=2, color="black", fill="white")+
  scale_color_brewer(palette = "Set1", name= "Year", guide= guide_legend(reverse = TRUE))
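
The home-versus-away comparison read off the chart could also be tallied directly from the score strings. The sketch below is one possible way to do that in base R, assuming scores are written as "home-away" (as in the “1-0” and “0-1” examples above). The vector `scores` is made up for illustration and does not come from the data set.

```r
# Hypothetical scores in "home-away" format; not taken from the La Liga file.
scores <- c("1-0", "1-1", "0-1", "2-0", "2-1", "0-0", "1-0", "0-2")

# Split each score into numeric home and away goals.
parts <- strsplit(scores, "-", fixed = TRUE)
home_goals <- as.integer(sapply(parts, `[`, 1))
away_goals <- as.integer(sapply(parts, `[`, 2))

# Classify each match by comparing the two goal counts, then tabulate.
outcome <- ifelse(home_goals > away_goals, "home win",
                  ifelse(home_goals < away_goals, "away win", "draw"))
outcome_counts <- table(outcome)
print(outcome_counts)
```

Applied to the full `Score` column rather than only the top 10 scores, a tally like this would cover every match instead of just the most common scorelines.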

Home Team Rating

Considering the success of the home teams seen in the last visualization, an interesting topic to look into is the top 20 ratings that home teams have received. A team rating is given to a team based on its performance in the game. The best possible rating is 10, and the worst is 0. The ratings in this visualization range from 7.7 to 10.

The results show how uncommon it is for a team to receive any of the high ratings in this range, especially a rating over 9.0. Some of the years show similar patterns. In the first four seasons (2014-2017), the data is quite similar: the charts look alike, as each year has a few teams that received over 9.0 and similar counts for ratings from 7.7 to 8.9. In the most recent years (2018, 2019, and 2020), fewer teams received a rating in the top 20, and there are very few ratings of 9.0 or higher.

One explanation for why these top 20 ratings are less common in the past 3 years could be that the league is getting stronger from top to bottom. As shown in the match excitement visualization, games are trending toward the average, which could mean that the teams playing each other are relatively equal, leading to average, low-scoring games. This is the opposite of the exciting, high-scoring games that appear to be more common in a couple of years in this data.

home_rating <- count(df, `Home Team Rating`)
home_rating <- home_rating[order(home_rating$`Home Team Rating`, decreasing = TRUE), ]

top_home <- home_rating$`Home Team Rating`[1:20]

home_df <- df %>%
  filter(`Home Team Rating` %in% top_home) %>%
  dplyr::select(`year`, `Home Team Rating`) %>%
  group_by(`year`, `Home Team Rating`)%>%
  dplyr::summarise(n=length(year), .groups = "keep") %>%
  ungroup()  # replaces the deprecated data_frame(); returns an ungrouped tibble

home_df$year <- factor(home_df$year)

x= min(as.numeric(levels(home_df$year)))

y=max(as.numeric(levels(home_df$year)))

home_df$year <- factor(home_df$year, levels = seq(y,x, by=-1))

ggplot(home_df, aes(x= `Home Team Rating`, y= n, fill= year )) +
  geom_bar(stat ="identity", position ="dodge") +
  theme_light()+
  theme(plot.title = element_text(hjust = .5)) +
  scale_x_continuous(breaks = c(7.6,8,8.4,8.8,9.2,9.6,10)) +
  labs(title = "Top 20 Home Team Ratings by Year",
      x= "Home Team Rating",
      y= "Rating Count",
      fill= "Year") +
  scale_fill_brewer(palette = "Spectral") +
  facet_wrap(~year, ncol = 4, nrow = 2)

Count of Home Teams Scoring Between 3 and 10 Goals Each Year

With the decline in high home team ratings in recent years shown in the previous visualization, this visualization looks at how often home teams scored a large number of goals, to see whether the two are related. It shows how many times a home team scored between 3 and 10 goals each year. Scoring 3 or more goals in a soccer game is usually a very good result, which would likely earn the team a high rating.

The data is exactly what was expected. The past 3 years account for the 3 lowest counts of a home team scoring between 3 and 10 goals in a single game. This result was expected because of the decrease in top 20 home team ratings over the same 3 years.
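
Because raw counts depend on how many matches each year contains, one way to check a comparison like this is to look at the share of matches per year rather than the count. The following is a minimal base R sketch under that assumption. The data frame `matches` and its column names are hypothetical and the values are made up.

```r
# Hypothetical toy data; column names and values are illustrative only.
matches <- data.frame(
  year       = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020),
  home_goals = c(3, 1, 0, 4, 2, 3, 1, 0, 1, 5)
)

# Flag matches in which the home team scored between 3 and 10 goals.
matches$high_scoring <- matches$home_goals >= 3 & matches$home_goals <= 10

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of high-scoring home games in each year.
share_by_year <- tapply(matches$high_scoring, matches$year, mean)
print(round(share_by_year, 2))
```

If the shares tell the same story as the counts, the year-to-year comparison does not depend on the number of matches recorded in each year.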

# Home goal totals of interest, listed directly. This avoids relying on
# positions 4:11 of the frequency-ordered counts, which assumes a fixed ordering.
top_goals <- 3:10

goals_df <- df %>%
  filter(`Home Team Goals Scored` %in% top_goals) %>%
  dplyr::select(year, `Home Team Goals Scored`) %>%
  group_by(year) %>%
  dplyr::summarise(n=length(year), .groups = "keep") %>%
  ungroup()  # replaces the deprecated data_frame(); returns an ungrouped tibble

plot_ly(goals_df, labels = ~year, values = ~n) %>%
  add_pie(hole = .65) %>%
  layout(title = "Count of Home Teams Scoring Between 3 and 10 Goals Each Year") %>%
  layout(annotations = list(text = paste0("Occurrence Count:  \n", sum(goals_df$n)),
                            showarrow = FALSE))

Most Common Home Team Rating by Year

Finally, the fifth visualization continues the home team theme. It looks at the top 5 most common home team ratings each year. The data shows that the most common ratings for home teams are well above the midpoint of 5. This is no surprise, as it was shown earlier that home teams win more often than away teams, and winning a game will most likely earn a team a higher rating than the losing team.

This data is very consistent: most of the ratings account for around 20% of the total each year. There are outliers, as in any data, but in many of the years the data is very evenly split. The range of the most common ratings is also very small. The 5 ratings range from 6.0 to 6.6, with 6.5 occurring most frequently. With a large share of the ratings falling within this narrow range, nearby ratings are also likely to occur frequently. Given this data, the visualization supports a reasonable estimate that a home team's rating will fall in or near this range. It does not show much of a trend by year, but it does help predict the home team rating variable.

home_average <- count(df, `Home Team Rating`)
home_average <- home_average[order(home_average$n, decreasing = TRUE),]
top_average <- home_average$`Home Team Rating`[1:5]

average_df <- df %>%
  filter(`Home Team Rating` %in% top_average) %>%
  dplyr::select(`year`, `Home Team Rating`) %>%
  group_by(`year`, `Home Team Rating`) %>%
  dplyr::summarise(n=length(year), .groups = "keep") %>%
  ungroup()  # replaces the deprecated data_frame(); returns an ungrouped tibble

plot_ly(textposition= "inside", labels = ~`Home Team Rating`) %>%
  add_pie(data = average_df[average_df$year ==2014,],  values= ~n, name= "2014",
          title = '2014', domain=list(row=0, column=0)) %>%
  add_pie(data = average_df[average_df$year ==2015,],  values= ~n, name= "2015",
          title = '2015', domain=list(row=0, column=1)) %>%
  add_pie(data = average_df[average_df$year ==2016,],  values= ~n, name= "2016",
          title = '2016', domain=list(row=0, column=2)) %>%
  add_pie(data = average_df[average_df$year ==2017,],  values= ~n, name= "2017",
          title = '2017', domain=list(row=1, column=0)) %>%
  add_pie(data = average_df[average_df$year ==2018,],  values= ~n, name= "2018",
          title = '2018', domain=list(row=1, column=1)) %>%
  add_pie(data = average_df[average_df$year ==2019,],  values= ~n, name= "2019",
          title = '2019', domain=list(row=1, column=2)) %>%
  add_pie(data = average_df[average_df$year ==2020,],  values= ~n, name= "2020",
          title = '2020', domain=list(row=2, column=0)) %>%
  layout(title = "Most Common Home Team Ratings by Year", showlegend = TRUE, grid = list(rows = 3, columns = 3))
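
The percentages shown in each pie are within-year shares of the selected ratings. As a rough illustration of that arithmetic, the base R sketch below converts per-year counts into row percentages. The data frame `rating_counts` and its values are hypothetical and are not taken from the data set.

```r
# Hypothetical per-year counts for two ratings; values are made up.
rating_counts <- data.frame(
  year   = c(2014, 2014, 2015, 2015),
  rating = c(6.2, 6.5, 6.2, 6.5),
  n      = c(10, 30, 20, 20)
)

# Build a year-by-rating table of counts.
count_table <- xtabs(n ~ year + rating, data = rating_counts)

# Convert each row (year) to percentages that sum to 100.
share_table <- prop.table(count_table, margin = 1) * 100
print(share_table)
```

A table like this makes it possible to read the per-year shares as numbers instead of estimating them from the size of the pie slices.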

Conclusion

After analyzing this data set, a lot was learned about La Liga from 2014 to 2020. There are many trends in this very popular soccer league. The first visualization, a horizontal bar chart, shows that match excitement ratings are quite average, with the most common ratings around the midpoint of 5. The line plot that follows shows that the most common scores are low scoring, with the home team winning more frequently than the away team. Next, a trellis chart showed that it is rare for a team to receive a high rating for its performance, especially in more recent years. Keeping the focus on the home team, the donut chart that follows suggests that team rating is related to the number of goals scored, since there was a clear decrease in the number of games with 3 or more home goals in the last 3 years. Finally, the trellis chart in plotly showed that the most common home team ratings were above the midpoint and fell within a narrow range each year.