Project 2 - Champions League

Author

Renato Chavez

Published

April 17, 2023

Champions League Dataset

This dataset explores the statistics of the teams that have played in the Champions League, grouped by country. Instead of looking at each team's data, I will group the teams by their country and analyze their combined performance. As a reminder, the Champions League is the biggest club competition for European teams. The variables that I will be analyzing are: Country (categorical variable), Participated (quantitative variable), Titles (quantitative variable), Played (quantitative variable), Win (quantitative variable), Draw (quantitative variable), Loss (quantitative variable), Goals For (quantitative variable), Goals Against (quantitative variable), Points (quantitative variable), and Goal Difference (quantitative variable). The data covers all games from all Champions League editions from 1955 to 2022. Therefore, as one can imagine, I will perform some cleaning so that I can work with an interesting group of ten countries and the data from their teams.

Why this dataset?

I chose this dataset because this sport is one of my passions, and I found the dataset complete and interesting because it summarizes the teams of the different European countries very well. I thought that this could be an interesting topic even for people who do not follow soccer closely, because they can get a good idea of which countries have the best-performing teams in this competition. There were other datasets on this topic, but analyzing the Champions League data team by team could be difficult to follow for people who do not know the sport. This is why I chose a dataset that analyzes the Champions League performance of the teams grouped by country.

First, I will import the libraries that this project will need.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Now, I will set the working directory to the folder that contains the .csv file the data will come from. The file came from https://www.kaggle.com/datasets/basharalkuwaiti/champions-league-era-stats?select=AllTimeRankingByClub.csv by Bashar Naji.

setwd("/Users/renatochavez/Documents/Montgomery College/Spring 2023/DATA110/Datasets")
byCountry <- read.csv("AllTimeRankingByCountry.csv")
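The variable list above uses names with spaces (for example, Goals For), while the R output refers to them with dots (Goals.For). A minimal sketch of why that could happen, assuming the header of the CSV file uses spaces; the two-line text below is a stand-in for the file, not its real contents:

```r
# Stand-in for the first lines of the file; the header spelling is an assumption.
csv_text <- "Country,Goals For,Goals Against\nSpain,2427,1446"

# read.csv() uses check.names = TRUE by default, so column names are made
# syntactically valid and spaces are replaced with dots.
demo <- read.csv(text = csv_text)
names(demo)

# With check.names = FALSE the header is kept as written.
demo_raw <- read.csv(text = csv_text, check.names = FALSE)
names(demo_raw)
```

Dotted names are easier to type in dplyr and ggplot2 calls, which is one reason to keep the default behavior.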

Using a dplyr command such as filter to clean the data

To clean the data, I will filter the countries and keep only the ones with a team that has won the Champions League at least once. These are the ten countries that we will be working with.

titles <- byCountry
titles1 <- filter(titles, Titles >= 1)
titles1
    X     Country Participated Titles Played Win Draw Loss Goals.For
1   1       Spain          148     19   1349 705  306  338      2427
2   2     England          136     14   1239 655  271  313      2218
3   3     Germany          166      8   1176 554  242  380      2070
4   4       Italy          138     12   1086 508  278  300      1662
5   5      France          115      1    794 331  175  288      1187
6   6    Portugal          106      4    680 280  156  244      1006
7   7 Netherlands           97      6    567 225  145  197       847
8   8    Scotland           78      1    427 184   86  157       636
9  17      Serbia           52      1    265 116   58   91       443
10 18     Romania           73      1    319 106   76  137       415
   Goals.Against  Pts Goal.Diff
1           1446 1716       981
2           1266 1581       952
3           1538 1350       532
4           1176 1294       486
5            981  837       206
6            857  716       149
7            701  595       146
8            541  454        95
9            346  290        97
10           471  288       -56
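Before plotting, it can help to confirm that the columns are internally consistent. The sketch below rebuilds the ten rows from the printed output and checks three relationships. The points rule (2 points per win, 1 per draw) is my inference from the numbers, not something the source states, and the data frame name `cl` is only for illustration:

```r
# Values copied from the printed output above.
cl <- data.frame(
  Country = c("Spain", "England", "Germany", "Italy", "France",
              "Portugal", "Netherlands", "Scotland", "Serbia", "Romania"),
  Played = c(1349, 1239, 1176, 1086, 794, 680, 567, 427, 265, 319),
  Win = c(705, 655, 554, 508, 331, 280, 225, 184, 116, 106),
  Draw = c(306, 271, 242, 278, 175, 156, 145, 86, 58, 76),
  Loss = c(338, 313, 380, 300, 288, 244, 197, 157, 91, 137),
  Goals.For = c(2427, 2218, 2070, 1662, 1187, 1006, 847, 636, 443, 415),
  Goals.Against = c(1446, 1266, 1538, 1176, 981, 857, 701, 541, 346, 471),
  Pts = c(1716, 1581, 1350, 1294, 837, 716, 595, 454, 290, 288),
  Goal.Diff = c(981, 952, 532, 486, 206, 149, 146, 95, 97, -56)
)

# One logical column per relationship; TRUE means the row is consistent.
checks <- with(cl, data.frame(
  Country = Country,
  played_ok = Played == Win + Draw + Loss,
  pts_ok = Pts == 2 * Win + Draw,               # assumed rule: 2 per win, 1 per draw
  goal_diff_ok = Goal.Diff == Goals.For - Goals.Against
))
checks
```

If any value in `checks` were FALSE, that row would deserve a closer look before being used in a plot.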

Then, I will perform a statistical analysis, such as a boxplot.

I will create a boxplot and another simple plot to analyze the data that we are left with. In this boxplot, we can see the countries that have won at least one Champions League title.

ggplot(titles1, aes(x= Country, y = Titles, color = Country)) + 
  geom_boxplot() + 
  xlab("Countries") + 
  ggtitle("Countries that have at least one Champions League winner") +
  guides(color = "none") + 
  coord_flip() +
  theme(axis.text.y = element_text(size = 10))

As a result of this statistical analysis, we see countries like Serbia, Scotland, Romania, and France with only one title each, so we can assume that these countries are unlikely to lead the ranking in the other quantitative variables. On the other hand, the countries with more titles, such as Spain, England, Italy, and Germany, are also the ones with the most important domestic leagues (LaLiga, Premier League, Serie A, etc.).
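Because each country contributes a single row after filtering, each box in the boxplot collapses to one line. A sorted bar chart is one possible alternative for the same comparison. This is only a sketch: the data frame `titles_df` and the plot name `bar_titles` are hypothetical, and the values are copied from the printed table:

```r
library(ggplot2)

# Country and Titles copied from the filtered output shown earlier.
titles_df <- data.frame(
  Country = c("Spain", "England", "Germany", "Italy", "France",
              "Portugal", "Netherlands", "Scotland", "Serbia", "Romania"),
  Titles = c(19, 14, 8, 12, 1, 4, 6, 1, 1, 1)
)

# reorder() sorts the countries by their number of titles,
# so the bars appear in ranked order after coord_flip().
bar_titles <- ggplot(titles_df, aes(x = reorder(Country, Titles), y = Titles)) +
  geom_col() +
  coord_flip() +
  xlab("Countries") +
  ylab("Number of titles") +
  ggtitle("Champions League titles by country")
bar_titles
```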

To explore more of the quantitative variables, I will create this basic plot of the games played and won by these ten countries. I expect the countries with one title to have far fewer games and wins than the ones with the most important leagues.

plot2 <- titles1 %>%
ggplot() + 
  geom_bar(aes(x=Played, y=Win, fill = Country), 
           position = "dodge", stat = "identity") + 
  ggtitle("Games played and won by countries with CL winners") + 
  scale_x_continuous(breaks = c(250, 500, 750, 1000, 1250, 1500)) +
  xlab("Total games played") +
  ylab("Wins") +
  labs(fill = "Countries")
plot2

The result matches what I expected: there is a huge difference between England, Spain, Germany, and Italy and the rest. However, France ranks fifth in both games played and games won, which is unexpected because it has fewer titles than countries like Portugal or the Netherlands. I imagine that this is due to the links between Qatar and Paris Saint-Germain in the last decade, with large investments in this French team helping it appear more often on the biggest stage of soccer.
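Total games and total wins both grow with the number of participations, so a win rate (wins divided by games played) might help separate volume from efficiency. The sketch below is an addition of mine rather than part of the original analysis; the names `games` and `WinRate` are hypothetical, and the values are copied from the filtered table:

```r
# Played and Win copied from the filtered output shown earlier.
games <- data.frame(
  Country = c("Spain", "England", "Germany", "Italy", "France",
              "Portugal", "Netherlands", "Scotland", "Serbia", "Romania"),
  Played = c(1349, 1239, 1176, 1086, 794, 680, 567, 427, 265, 319),
  Win = c(705, 655, 554, 508, 331, 280, 225, 184, 116, 106)
)

# Share of games won, rounded to three decimals.
games$WinRate <- round(games$Win / games$Played, 3)

# Sort from the highest to the lowest win rate.
games[order(-games$WinRate), ]
```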

Creating an interactive final visualization

For this final visualization, I will create an interactive plot of the ten countries with more data about each one, such as goal difference, total points, participations, and titles. The plot has labeled axes, a clear title, and a helpful legend. Hovering the cursor over a colored circle shows that country's information at a glance. The interactivity comes from plotly, which converts the ggplot object into an interactive chart.

# Points and Goal_Difference are not drawn; they are extra aesthetics
# carried along so that ggplotly() can show them on hover.
p <- ggplot(titles1, aes(x = Participated, y = Titles, color = Country,
                         Points = Pts, Goal_Difference = Goal.Diff)) +
  geom_point(size = 3, alpha = 1) +  # fixed size outside aes(); alpha must lie in [0, 1]
  xlim(0, 175) + ylim(0, 20) +
  ggtitle("Performance of countries with clubs that have won the Champions League") +
  xlab("Total participations") +
  ylab("Number of titles") +
  theme_minimal(base_size = 12) + 
  scale_color_brewer(palette = "Set3")
p <- ggplotly(p)
p
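If more control over the hover text is wanted, one option is to build the label by hand and pass it through the `text` aesthetic, then tell `ggplotly()` to show only that label. This is a sketch on a three-row subset copied from the table above; `pts_df`, `hover_label`, `g`, and `fig` are hypothetical names:

```r
library(ggplot2)
library(plotly)

# Three rows copied from the filtered output, enough to illustrate the idea.
pts_df <- data.frame(
  Country = c("Spain", "England", "Germany"),
  Participated = c(148, 136, 166),
  Titles = c(19, 14, 8),
  Pts = c(1716, 1581, 1350)
)

# Build the hover label; <br> starts a new line inside a plotly tooltip.
pts_df$hover_label <- paste0(
  pts_df$Country,
  "<br>Titles: ", pts_df$Titles,
  "<br>Points: ", pts_df$Pts
)

g <- ggplot(pts_df, aes(x = Participated, y = Titles,
                        color = Country, text = hover_label)) +
  geom_point(size = 3)

# tooltip = "text" shows only the hand-built label on hover.
fig <- ggplotly(g, tooltip = "text")
fig
```

Compared with the default tooltip, this avoids showing every mapped aesthetic and lets the labels use reader-friendly names.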

More about the topic and the source…

The author who put this information together, Bashar Naji, did a great job, because the data goes back to 1955. The details are accurate: I compared the data with the official statistics from UEFA (Union of European Football Associations) and they are legitimate. I did not use the goals for and goals against variables as much as the others, even though they were also in the dataset. Having this kind of information in a single file is impressive because it covers decades of results. I am glad that all of us have access to it, because it allows us to better understand what each team and country has achieved throughout the history of the Champions League.

What can be concluded? What could have been better?

After the final visualization, it is clear that the countries with the strongest domestic leagues, England, Spain, Germany, and Italy, have the best performances throughout the history of the competition. France has been following closely in the last decade thanks to the investment from Qatar in Paris Saint-Germain, but it still has only one title. Countries like Portugal and the Netherlands have more titles than France because this dataset covers the competition since it was created, and historically teams like Benfica from Portugal or Ajax from the Netherlands have performed very well in the Champions League. Something that did surprise me is the presence of Scotland, Serbia, and Romania, but I understand that decades ago they had very good teams that were able to win this competition. It is also surprising that Belgium, Sweden, Croatia, and other countries that have had amazing teams have never won the Champions League. This shows how difficult it is to win this competition. Something that I would have loved to include is this data broken down by year, so that I could create an animation of the performance of the teams from each country. That way we could also see how, over the years, the winners come only from England, Spain, Italy, and Germany, so the gap between these countries and the rest becomes larger and larger.