Tidyverse EXTEND

The Tidyverse CREATE assignment was made by Lucas Weyrich, I decided to work with this data set and add additional tidyding and analysis to it. His work was about a dataset of soccer predictions.

In this document, I will demonstrate a few packages of the tidyverse. Specifically, I will use the packages dplyr to transform the data, tidyr to clean the data, and ggplot2 to plot some of the data. Additionally, readr is used to import the data. As demonstrative dataset, we use the soccer performance index (SPI) match data from fivethirtyeight (https://projects.fivethirtyeight.com/soccer-predictions/).

Below, we load the packages and then the data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.3.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
spi = read_csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/spi_matches.csv')
## Rows: 68913 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): league, team1, team2
## dbl  (19): season, league_id, spi1, spi2, prob1, prob2, probtie, proj_score1...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(spi)
## Rows: 68,913
## Columns: 23
## $ season      <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
## $ date        <date> 2016-07-09, 2016-07-10, 2016-07-10, 2016-07-16, 2016-07-1…
## $ league_id   <dbl> 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921…
## $ league      <chr> "FA Women's Super League", "FA Women's Super League", "FA …
## $ team1       <chr> "Liverpool Women", "Arsenal Women", "Chelsea FC Women", "L…
## $ team2       <chr> "Reading", "Notts County Ladies", "Birmingham City", "Nott…
## $ spi1        <dbl> 51.56, 46.61, 59.85, 53.00, 59.43, 50.75, 48.13, 50.62, 48…
## $ spi2        <dbl> 50.42, 54.03, 54.64, 52.35, 60.99, 55.03, 60.15, 52.63, 48…
## $ prob1       <dbl> 0.4389, 0.3572, 0.4799, 0.4289, 0.4124, 0.3821, 0.3082, 0.…
## $ prob2       <dbl> 0.2767, 0.3608, 0.2487, 0.2699, 0.3157, 0.3200, 0.3888, 0.…
## $ probtie     <dbl> 0.2844, 0.2819, 0.2714, 0.3013, 0.2719, 0.2979, 0.3030, 0.…
## $ proj_score1 <dbl> 1.39, 1.27, 1.53, 1.27, 1.45, 1.22, 1.04, 1.31, 1.64, 1.20…
## $ proj_score2 <dbl> 1.05, 1.28, 1.03, 0.94, 1.24, 1.09, 1.20, 1.09, 1.35, 1.45…
## $ importance1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32.4, 53.7, 38.1, …
## $ importance2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 67.7, 22.9, 22.2, …
## $ score1      <dbl> 2, 2, 1, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2, 1, 0, 1, 1, 0, 3, 2…
## $ score2      <dbl> 0, 0, 1, 0, 2, 1, 5, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1…
## $ xg1         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.97, 2.45, 0.85, …
## $ xg2         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.63, 0.77, 2.77, …
## $ nsxg1       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.43, 1.75, 0.17, …
## $ nsxg2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.45, 0.42, 1.25, …
## $ adj_score1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.00, 2.10, 2.10, …
## $ adj_score2  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1.05, 2.10, 1.05, …

As we can see, this is a big data frame with almost 70,000 rows. It contains games outcomes and SPIs, alongside predictions of several seasons and leagues. In order to tailor the dataset a bit more for this demonstration, we use only the 2022 season of the UEFA Champions League. This is a tournament that is played besides the regular leagues for which only the 1st to 4th place (some only 1st and 2nd) of each league qualify to play in, making it a literal Champions League. Below we clean up the dataframe.

cl_2022 = spi %>%
  filter(season == 2022, league == 'UEFA Champions League')
glimpse(cl_2022)
## Rows: 125
## Columns: 23
## $ season      <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022…
## $ date        <date> 2022-09-06, 2022-09-06, 2022-09-06, 2022-09-06, 2022-09-0…
## $ league_id   <dbl> 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818…
## $ league      <chr> "UEFA Champions League", "UEFA Champions League", "UEFA Ch…
## $ team1       <chr> "Borussia Dortmund", "Dinamo Zagreb", "Celtic", "RB Leipzi…
## $ team2       <chr> "FC Copenhagen", "Chelsea", "Real Madrid", "Shakhtar Donet…
## $ spi1        <dbl> 79.92, 63.44, 76.53, 77.81, 69.22, 91.62, 81.13, 76.96, 85…
## $ spi2        <dbl> 56.12, 81.47, 86.58, 60.47, 92.30, 69.23, 79.14, 45.44, 69…
## $ prob1       <dbl> 0.7427, 0.2093, 0.3355, 0.6714, 0.1714, 0.7420, 0.4589, 0.…
## $ prob2       <dbl> 0.0922, 0.5370, 0.4276, 0.1363, 0.6129, 0.0934, 0.2759, 0.…
## $ probtie     <dbl> 0.1651, 0.2537, 0.2370, 0.1923, 0.2157, 0.1646, 0.2652, 0.…
## $ proj_score1 <dbl> 2.28, 0.85, 1.41, 2.10, 0.89, 2.30, 1.40, 2.33, 2.16, 1.35…
## $ proj_score2 <dbl> 0.66, 1.54, 1.62, 0.82, 1.89, 0.68, 1.02, 0.33, 0.87, 1.30…
## $ importance1 <dbl> 77.4, 49.7, 73.2, 71.5, 64.4, 37.3, 76.4, 71.0, 75.7, 78.3…
## $ importance2 <dbl> 52.0, 64.9, 56.9, 53.6, 34.2, 70.6, 75.9, 29.0, 62.3, 82.4…
## $ score1      <dbl> 3, 1, 0, 1, 0, 2, 1, 2, 4, 0, 4, 5, 2, 0, 1, 2, 0, 2, 0, 2…
## $ score2      <dbl> 0, 0, 3, 4, 4, 1, 1, 0, 0, 3, 1, 1, 1, 2, 0, 0, 2, 0, 1, 0…
## $ xg1         <dbl> 2.56, 0.60, 1.05, 1.26, 0.33, 2.25, 1.22, 1.38, 2.06, 0.80…
## $ xg2         <dbl> 0.40, 0.87, 2.74, 1.20, 3.97, 1.31, 0.71, 0.20, 0.11, 1.18…
## $ nsxg1       <dbl> 2.80, 0.62, 1.95, 3.12, 1.18, 1.49, 0.60, 1.22, 2.51, 0.89…
## $ nsxg2       <dbl> 0.57, 1.92, 1.71, 0.33, 2.08, 1.02, 1.82, 0.21, 0.28, 0.93…
## $ adj_score1  <dbl> 2.82, 1.05, 0.00, 1.05, 0.00, 2.10, 1.05, 2.10, 3.96, 0.00…
## $ adj_score2  <dbl> 0.00, 0.00, 2.99, 3.68, 3.68, 1.05, 1.05, 0.00, 0.00, 2.85…

Looking at the new data frame, we can see that it is cleaned up, as it retained only 125 rows, therefore, 125 games. Since we’re working with the best teams of European soccer here, and therefore likely also across the world, it would be interesting to look at a time-series of the average SPI for each game. It should be very high from the beginning, but since the tournament contains a knock-out round, the average SPI should increase over the season. To do this, we’ll create a new column, the average out of both team’s SPIs for each game, and then plot it over time.

cl_2022 = cl_2022 %>%
  mutate(avg_spi = (spi1 + spi2) / 2)

ggplot(cl_2022, aes(date, avg_spi)) +
  geom_line() +
  labs(x = "Date", y = "Average SPI", title = "Average SPI Across CL 2022/23 Season") +
  geom_hline(yintercept = mean(cl_2022$avg_spi), color = "#5755FE", linetype = "dashed") +
  theme_minimal()

While this graph is certainly not pretty, it confirms our idea: average SPI increases over the season. The big gap in the middle is the break between the group phase and the knock-out round, to which not every team advances. Therefore, it makes sense that the average game SPI increases at that point already. Interestingly, the final is not the game with the highest SPI, which appears to be either the quarter or semi final. This shows additionally the importance of seeding in a tournament, which does not take place in the CL, it is drawn completely randomly. Additionally, the horizontal line shows the average SPI of all games, which is just under 80, making it a quite strong tournament.

EXTEND.

I am going to subset some columns to reduce the amount of data on it, and display table using kable function

extend_ucl = subset(cl_2022,select = -c(league_id,importance1,importance2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2,avg_spi))
kable(head(extend_ucl))
season date league team1 team2 spi1 spi2 prob1 prob2 probtie proj_score1 proj_score2 score1 score2
2022 2022-09-06 UEFA Champions League Borussia Dortmund FC Copenhagen 79.92 56.12 0.7427 0.0922 0.1651 2.28 0.66 3 0
2022 2022-09-06 UEFA Champions League Dinamo Zagreb Chelsea 63.44 81.47 0.2093 0.5370 0.2537 0.85 1.54 1 0
2022 2022-09-06 UEFA Champions League Celtic Real Madrid 76.53 86.58 0.3355 0.4276 0.2370 1.41 1.62 0 3
2022 2022-09-06 UEFA Champions League RB Leipzig Shakhtar Donetsk 77.81 60.47 0.6714 0.1363 0.1923 2.10 0.82 1 4
2022 2022-09-06 UEFA Champions League Sevilla FC Manchester City 69.22 92.30 0.1714 0.6129 0.2157 0.89 1.89 0 4
2022 2022-09-06 UEFA Champions League Paris Saint-Germain Juventus 91.62 69.23 0.7420 0.0934 0.1646 2.30 0.68 2 1

Select columns to work with.

extend_ucl1 <- extend_ucl %>%
  select(team1, team2, proj_score1, prob1, prob2, proj_score2, score1, score2)
kable(head(extend_ucl1))
team1 team2 proj_score1 prob1 prob2 proj_score2 score1 score2
Borussia Dortmund FC Copenhagen 2.28 0.7427 0.0922 0.66 3 0
Dinamo Zagreb Chelsea 0.85 0.2093 0.5370 1.54 1 0
Celtic Real Madrid 1.41 0.3355 0.4276 1.62 0 3
RB Leipzig Shakhtar Donetsk 2.10 0.6714 0.1363 0.82 1 4
Sevilla FC Manchester City 0.89 0.1714 0.6129 1.89 0 4
Paris Saint-Germain Juventus 2.30 0.7420 0.0934 0.68 2 1

Filter team1 with “Real Madrid” team.

extend_ucl2 = extend_ucl1 %>%
  filter(team1 == "Real Madrid")
kable(head(extend_ucl2))
team1 team2 proj_score1 prob1 prob2 proj_score2 score1 score2
Real Madrid RB Leipzig 2.00 0.5627 0.2251 1.19 2 0
Real Madrid Shakhtar Donetsk 2.26 0.7256 0.1032 0.72 2 1
Real Madrid Celtic 2.20 0.6543 0.1561 0.98 5 1
Real Madrid Liverpool 1.72 0.4746 0.2929 1.31 1 0
Real Madrid Chelsea 1.57 0.5342 0.2160 0.90 2 0
Real Madrid Manchester City 1.42 0.3310 0.4347 1.66 1 1

Plot of projections of “Real Madrid” team against oponents

ggplot(data = extend_ucl2, aes(x = "Real Madrid", y = team2 ,col = proj_score1), col= red)+ geom_boxplot() +labs(title="Projections Real Madrid vs Oponents",) + theme(plot.title = element_text(hjust=0.5))

 extend_ucl2 %>% group_by("Real Madrid",team2,proj_score2) %>%
  summarise(team1 = n()) %>%
  arrange(desc(team2)) %>%
  ggplot(aes(x = "Real Madrid", y = team2, fill = proj_score2)) +
  geom_bar(position = "dodge", stat="identity")+
  scale_fill_gradient(low = "blue4", high = "springgreen1")+
  theme_light()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+labs(height=10, width=5)+
  coord_flip() + labs(title = "Projections 2 Madrid vs Rivals",x= "Madrid", y = "rivals")
## `summarise()` has grouped output by '"Real Madrid"', 'team2'. You can override
## using the `.groups` argument.

Conclussion.

After working with my own vignette and being able to upload it and to the clone a repositorie, I was able to download a .rmd file from the shared repositorie and work with it. I decided to subset some of the columns in the dataset, to work with only one team, and make a analysis of how “Real Madrid” projections were againts its rivals in the UCL 2022, I created two visualizations, a boxplot displaying the results of project scores 1, and a bar plot with project scores 2.