Project 3 - Champions League

Author

Renato Chavez

Published

May 4, 2023

getwd()

[1] "/Users/renatochavez/Documents/Montgomery College/Spring 2023/DATA110"

UEFA Champions League dataset

Source of the dataset: https://www.kaggle.com/datasets/basharalkuwaiti/champions-league-era-stats?select=AllTimeRankingByClub.csv. Bashar Naji, the dataset's author, collected the information from UEFA itself (the European confederation) as public-domain information. The dataset is updated yearly with every new season. UEFA also publishes this historical information here: https://www.uefa.com/uefachampionsleague/history/

I am grateful that I was able to work with a dataset about one of my passions, soccer. It covers the most important club competition in the beautiful game. The UEFA Champions League features the best teams in Europe, and it is, after the World Cup, the greatest trophy that a footballer can aspire to. The dataset contains two categorical variables, Club and Country, and ten quantitative variables that include match results (wins, draws, losses), goals scored and conceded, and titles. Using the information for every club, I would like to answer questions about the title win rates and the game win rates that some teams have. Since there are many teams in this dataset, I will clean it by filtering. Also, in order to create visualizations about the win rates, I need to create new variables using the “mutate” command. Once again, this data has been collected year by year by UEFA: each season, the goals, games, and other information about the teams that play in the Champions League can be tallied. I will attempt to provide historical information about the teams that have participated in the Champions League over the years, including win rate percentages, and classify the teams by the country they are from.

Why did I choose this dataset?

As I mentioned before, soccer is one of my passions. I closely follow European and South American soccer because I enjoy watching and playing this beautiful game. I have had the opportunity to watch a Champions League game in person, and it is an indescribable feeling for any soccer fan. I believe that this dataset is very complete when it comes to information about the teams, and that conclusions can be drawn from it. I will try to explain things better in this project because I understand that not everyone may be familiar with the concepts discussed in these visualizations. Therefore, I would like to explain a few concepts to begin with. UCL is the abbreviation of UEFA Champions League, the competition that we will be discussing. Titles refers to the number of championships that a team has won. Finally, I would like to remind everyone that the Champions League has not always been played in its current 32-team format. In the past, fewer teams participated in the tournament, which made qualifying for it and winning it different.

I will start by importing the required libraries.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

library(treemap)
library(RColorBrewer)
library(viridis)

Loading required package: viridisLite

Attaching package: 'viridis'

The following object is masked from 'package:scales':

    viridis_pal

Next, we will set the working directory to the folder that contains the dataset and read the file.

setwd("/Users/renatochavez/Documents/Montgomery College/Spring 2023/DATA110/Datasets")
byClub <- read_csv("AllTimeRankingByClub5.csv")

Rows: 530 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Club, Country
dbl (10): Participated, Titles, Played, Win, Draw, Loss, Goals_For, Goals_Ag...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Then, I will use the dplyr command “arrange” to order the dataset by country.

byClub1 <- arrange(byClub, Country)

In the following chunk, I use another dplyr command, “mutate”, to create two variables: the games win rate and the title win rate. Both are percentages. The games win rate is the number of games won divided by the total games played, and the title win rate is the number of titles won divided by the number of participations, each multiplied by 100.

byClub2 <- mutate(byClub1, Games_Win_Rate_percentage = (Win / Played) * 100)
## byClub3 <- mutate(byClub2, Title_Win_Rate = percent((Titles / Participated), accuracy = 0.01))
byClub3 <- mutate(byClub2, Title_Win_Rate_percentage = ((Titles / Participated) * 100))
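To make the two formulas concrete, here is a minimal sketch on a toy table. The clubs and counts are hypothetical and are not taken from the dataset; only the column names and the two formulas mirror the chunk above.

```r
library(dplyr)

# Hypothetical clubs and counts, for illustration only (not from the dataset)
toy <- data.frame(
  Club         = c("Club A", "Club B"),
  Participated = c(3, 20),
  Titles       = c(2, 1),
  Played       = c(20, 150),
  Win          = c(10, 60),
  stringsAsFactors = FALSE
)

# Same two formulas as above, created in a single mutate() call
toy_rates <- toy %>%
  mutate(
    Games_Win_Rate_percentage = (Win / Played) * 100,
    Title_Win_Rate_percentage = (Titles / Participated) * 100
  )

toy_rates
```

In this toy table, Club A wins 10 of 20 games (50%) and 2 titles in 3 participations (about 66.7%), which shows how a club with very few participations can end up with a very high title win rate.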

Then, I will use another dplyr command, “filter”, to keep only the teams that have won at least one title (championship), because teams that have won none have a 0% title win rate.

byClub4 <- filter(byClub3, Titles >= 1)

First visualization

Our first visualization will be a treemap. In this treemap, the size of each rectangle represents the team's title win rate.

treemap(byClub4, index="Club", vSize="Title_Win_Rate_percentage", vColor="Country", type="index")

Surprisingly, Nottingham Forest and Aston Villa have the largest rectangles. This is because they have participated in the Champions League only three and two times respectively, and out of those few participations they have won the title twice and once. That gives them a very high title win rate. On the other hand, as expected, the teams that have won this competition the most times, like Real Madrid, AC Milan, or Liverpool, also have large rectangles.

Statistical Analyses

Before my second visualization, I would like to perform this analysis to find trends in the teams' statistics in relation to the total games they have played. It is fairly obvious that the more games a team plays, the more it wins and the more goals it scores, so I would like to look at more interesting curves. Therefore, I decided to look at the relation between draws and goals against, and between losses and goals for.

ggplot(byClub3, aes(Draw, Goals_Against)) + 
  geom_point(aes(size = Participated), alpha = 1/2) + 
  geom_smooth() + 
  ggtitle("Relation between draws and goals against") + 
  xlab("Draws") + 
  ylab("Goals against") +
  scale_size_area()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(byClub3, aes(Loss, Goals_For)) + 
  geom_point(aes(size = Participated), alpha = 1/2) + 
  geom_smooth() + 
  ggtitle("Relation between losses and goals scored") + 
  xlab("Games lost") + 
  ylab("Goals scored") +
  scale_size_area()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

From this analysis, we see that teams with more draws also have more goals scored against them. This was unexpected to me because I would have thought that most draws are goalless matches that end zero to zero. However, it suggests that many draws may end 1-1, 2-2, and so on. We also find that the more games a team loses, the more goals it scores. A likely reason is that a team that loses more games has also played more games overall, and so has had more chances to score.
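One way to check the "more games played" explanation is to compare the correlation of the raw counts with the correlation of the per-game rates. The sketch below uses made-up numbers, not the dataset, and the names Loss_Rate and Goals_For_Per_Game are hypothetical helpers introduced only for this illustration.

```r
# Hypothetical teams, for illustration only (not from the dataset)
toy <- data.frame(
  Played    = c(10, 50, 100, 200),
  Loss      = c(5, 20, 30, 40),
  Goals_For = c(10, 60, 150, 380)
)

# Raw counts: both columns grow with the number of games played
raw_cor <- cor(toy$Loss, toy$Goals_For)

# Per-game rates remove the effect of simply playing more games
toy$Loss_Rate          <- toy$Loss / toy$Played
toy$Goals_For_Per_Game <- toy$Goals_For / toy$Played
rate_cor <- cor(toy$Loss_Rate, toy$Goals_For_Per_Game)

c(raw = raw_cor, per_game = rate_cor)
```

In this toy example the raw counts are positively correlated while the per-game rates are negatively correlated. Whether the real data behaves the same way would have to be checked by applying the same division by Played to byClub3.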

Finally, we will use the dplyr filter once again to work with teams that have played at least 100 games, won at least 50 games, and have a non-negative goal difference. I am filtering the information like this for the second visualization because there are many teams from smaller countries in Europe that have participated only a few times. I would like to see the teams' historical information in the competition, classified by the country they are from.

byClub5 <- filter(byClub3, Played >= 100)
byClub6 <- filter(byClub5, Win >= 50)
byClub7 <- filter(byClub6, Goal_Diff >= 0)
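As a side note, the three successive filters can also be written as a single filter() call, because dplyr combines comma-separated conditions with a logical AND. The sketch below shows this on a hypothetical toy table; the clubs and numbers are invented for illustration.

```r
library(dplyr)

# Hypothetical clubs, for illustration only (not from the dataset)
toy <- data.frame(
  Club      = c("Club A", "Club B", "Club C", "Club D"),
  Played    = c(120, 90, 150, 200),
  Win       = c(60, 55, 40, 100),
  Goal_Diff = c(15, 10, 5, -3),
  stringsAsFactors = FALSE
)

# Step by step, as in the chunk above
step_by_step <- toy %>%
  filter(Played >= 100) %>%
  filter(Win >= 50) %>%
  filter(Goal_Diff >= 0)

# Equivalent single call: conditions separated by commas are ANDed together
combined <- toy %>%
  filter(Played >= 100, Win >= 50, Goal_Diff >= 0)

combined$Club
```

Only Club A passes all three conditions here: Club B has played too few games, Club C has too few wins, and Club D has a negative goal difference.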

Second visualization

p <- ggplot(byClub7, aes(x = Played, y = Win, size = Participated, color = Country,
                         # Extra aesthetics are not drawn; they only show up in the plotly tooltip
                         Team = Club, WinRate = Games_Win_Rate_percentage,
                         TitleWinRate = Title_Win_Rate_percentage)) +
  geom_point(alpha = 1) +  # alpha must be between 0 and 1
  xlim(0, 500) + ylim(0, 300) +
  ggtitle("Champions League teams classified by their country") +
  xlab("Games played in the Champions League") +
  ylab("Games won in the Champions League") +
  theme_minimal(base_size = 10) +
  scale_color_brewer(palette = 'Paired')
# Convert the static ggplot into an interactive plotly chart
p <- ggplotly(p)
p

Some results were unexpected. For example, Manchester City has a very good win rate (54.7%), helped by the fact that they have played only slightly more than 100 games. Their presence here reflects recent success in their domestic league, backed by heavy investment from Abu Dhabi since the club was bought relatively recently by owners from that city. A similar conclusion could be drawn about the team from the French capital, Paris Saint-Germain, which was recently bought by owners from Qatar who invest a lot of money in being successful in their national league, but which has yet to prove itself on the biggest stage, the Champions League. Finally, Real Madrid is without a doubt the king of this competition: they have won the UCL the most times (14), played the most games, and won the most games.

What can be concluded? What could be better?

It has been a lot of fun to create these visualizations and explore this dataset as a soccer fan. I am grateful to Professor Saidi and the class in general because, before I started this class, I would never have imagined being able to create these kinds of visualizations. Taking this class as an elective has been a great choice, as this is a skill that can help me in the future. I will be looking to improve my data visualization and interpretation skills in the near future because, as a Computer Science major, I would like to apply them in my field. Finally, this project made me very happy because I was working with one of my passions, and I also discovered new information that I was not aware of. I have been watching this sport for fifteen years, but clearly this competition has far more history than I was able to witness.