Tidyverse EXTEND

The Tidyverse CREATE assignment was made by Lucas Weyrich, I decided to work with this data set and add additional tidyding and analysis to it. His work was about a dataset of soccer predictions.

In this document, I will demonstrate a few packages of the tidyverse. Specifically, I will use the packages dplyr to transform the data, tidyr to clean the data, and ggplot2 to plot some of the data. Additionally, readr is used to import the data. As demonstrative dataset, we use the soccer performance index (SPI) match data from fivethirtyeight (https://projects.fivethirtyeight.com/soccer-predictions/).

Below, we load the packages and then the data.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.3.3

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

spi = read_csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/spi_matches.csv')

## Rows: 68913 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): league, team1, team2
## dbl  (19): season, league_id, spi1, spi2, prob1, prob2, probtie, proj_score1...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(spi)

## Rows: 68,913
## Columns: 23
## $ season      <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
## $ date        <date> 2016-07-09, 2016-07-10, 2016-07-10, 2016-07-16, 2016-07-1…
## $ league_id   <dbl> 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921…
## $ league      <chr> "FA Women's Super League", "FA Women's Super League", "FA …
## $ team1       <chr> "Liverpool Women", "Arsenal Women", "Chelsea FC Women", "L…
## $ team2       <chr> "Reading", "Notts County Ladies", "Birmingham City", "Nott…
## $ spi1        <dbl> 51.56, 46.61, 59.85, 53.00, 59.43, 50.75, 48.13, 50.62, 48…
## $ spi2        <dbl> 50.42, 54.03, 54.64, 52.35, 60.99, 55.03, 60.15, 52.63, 48…
## $ prob1       <dbl> 0.4389, 0.3572, 0.4799, 0.4289, 0.4124, 0.3821, 0.3082, 0.…
## $ prob2       <dbl> 0.2767, 0.3608, 0.2487, 0.2699, 0.3157, 0.3200, 0.3888, 0.…
## $ probtie     <dbl> 0.2844, 0.2819, 0.2714, 0.3013, 0.2719, 0.2979, 0.3030, 0.…
## $ proj_score1 <dbl> 1.39, 1.27, 1.53, 1.27, 1.45, 1.22, 1.04, 1.31, 1.64, 1.20…
## $ proj_score2 <dbl> 1.05, 1.28, 1.03, 0.94, 1.24, 1.09, 1.20, 1.09, 1.35, 1.45…
## $ importance1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32.4, 53.7, 38.1, …
## $ importance2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 67.7, 22.9, 22.2, …
## $ score1      <dbl> 2, 2, 1, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2, 1, 0, 1, 1, 0, 3, 2…
## $ score2      <dbl> 0, 0, 1, 0, 2, 1, 5, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1…
## $ xg1         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.97, 2.45, 0.85, …
## $ xg2         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.63, 0.77, 2.77, …
## $ nsxg1       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.43, 1.75, 0.17, …
## $ nsxg2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.45, 0.42, 1.25, …
## $ adj_score1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.00, 2.10, 2.10, …
## $ adj_score2  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1.05, 2.10, 1.05, …

As we can see, this is a big data frame with almost 70,000 rows. It contains games outcomes and SPIs, alongside predictions of several seasons and leagues. In order to tailor the dataset a bit more for this demonstration, we use only the 2022 season of the UEFA Champions League. This is a tournament that is played besides the regular leagues for which only the 1st to 4th place (some only 1st and 2nd) of each league qualify to play in, making it a literal Champions League. Below we clean up the dataframe.

cl_2022 = spi %>%
  filter(season == 2022, league == 'UEFA Champions League')
glimpse(cl_2022)

## Rows: 125
## Columns: 23
## $ season      <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022…
## $ date        <date> 2022-09-06, 2022-09-06, 2022-09-06, 2022-09-06, 2022-09-0…
## $ league_id   <dbl> 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818…
## $ league      <chr> "UEFA Champions League", "UEFA Champions League", "UEFA Ch…
## $ team1       <chr> "Borussia Dortmund", "Dinamo Zagreb", "Celtic", "RB Leipzi…
## $ team2       <chr> "FC Copenhagen", "Chelsea", "Real Madrid", "Shakhtar Donet…
## $ spi1        <dbl> 79.92, 63.44, 76.53, 77.81, 69.22, 91.62, 81.13, 76.96, 85…
## $ spi2        <dbl> 56.12, 81.47, 86.58, 60.47, 92.30, 69.23, 79.14, 45.44, 69…
## $ prob1       <dbl> 0.7427, 0.2093, 0.3355, 0.6714, 0.1714, 0.7420, 0.4589, 0.…
## $ prob2       <dbl> 0.0922, 0.5370, 0.4276, 0.1363, 0.6129, 0.0934, 0.2759, 0.…
## $ probtie     <dbl> 0.1651, 0.2537, 0.2370, 0.1923, 0.2157, 0.1646, 0.2652, 0.…
## $ proj_score1 <dbl> 2.28, 0.85, 1.41, 2.10, 0.89, 2.30, 1.40, 2.33, 2.16, 1.35…
## $ proj_score2 <dbl> 0.66, 1.54, 1.62, 0.82, 1.89, 0.68, 1.02, 0.33, 0.87, 1.30…
## $ importance1 <dbl> 77.4, 49.7, 73.2, 71.5, 64.4, 37.3, 76.4, 71.0, 75.7, 78.3…
## $ importance2 <dbl> 52.0, 64.9, 56.9, 53.6, 34.2, 70.6, 75.9, 29.0, 62.3, 82.4…
## $ score1      <dbl> 3, 1, 0, 1, 0, 2, 1, 2, 4, 0, 4, 5, 2, 0, 1, 2, 0, 2, 0, 2…
## $ score2      <dbl> 0, 0, 3, 4, 4, 1, 1, 0, 0, 3, 1, 1, 1, 2, 0, 0, 2, 0, 1, 0…
## $ xg1         <dbl> 2.56, 0.60, 1.05, 1.26, 0.33, 2.25, 1.22, 1.38, 2.06, 0.80…
## $ xg2         <dbl> 0.40, 0.87, 2.74, 1.20, 3.97, 1.31, 0.71, 0.20, 0.11, 1.18…
## $ nsxg1       <dbl> 2.80, 0.62, 1.95, 3.12, 1.18, 1.49, 0.60, 1.22, 2.51, 0.89…
## $ nsxg2       <dbl> 0.57, 1.92, 1.71, 0.33, 2.08, 1.02, 1.82, 0.21, 0.28, 0.93…
## $ adj_score1  <dbl> 2.82, 1.05, 0.00, 1.05, 0.00, 2.10, 1.05, 2.10, 3.96, 0.00…
## $ adj_score2  <dbl> 0.00, 0.00, 2.99, 3.68, 3.68, 1.05, 1.05, 0.00, 0.00, 2.85…

Looking at the new data frame, we can see that it is cleaned up, as it retained only 125 rows, therefore, 125 games. Since we’re working with the best teams of European soccer here, and therefore likely also across the world, it would be interesting to look at a time-series of the average SPI for each game. It should be very high from the beginning, but since the tournament contains a knock-out round, the average SPI should increase over the season. To do this, we’ll create a new column, the average out of both team’s SPIs for each game, and then plot it over time.

cl_2022 = cl_2022 %>%
  mutate(avg_spi = (spi1 + spi2) / 2)

ggplot(cl_2022, aes(date, avg_spi)) +
  geom_line() +
  labs(x = "Date", y = "Average SPI", title = "Average SPI Across CL 2022/23 Season") +
  geom_hline(yintercept = mean(cl_2022$avg_spi), color = "#5755FE", linetype = "dashed") +
  theme_minimal()

While this graph is certainly not pretty, it confirms our idea: average SPI increases over the season. The big gap in the middle is the break between the group phase and the knock-out round, to which not every team advances. Therefore, it makes sense that the average game SPI increases at that point already. Interestingly, the final is not the game with the highest SPI, which appears to be either the quarter or semi final. This shows additionally the importance of seeding in a tournament, which does not take place in the CL, it is drawn completely randomly. Additionally, the horizontal line shows the average SPI of all games, which is just under 80, making it a quite strong tournament.

EXTEND.

I am going to subset some columns to reduce the amount of data on it, and display table using kable function

extend_ucl = subset(cl_2022,select = -c(league_id,importance1,importance2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2,avg_spi))
kable(head(extend_ucl))

season	date	league	team1	team2	spi1	spi2	prob1	prob2	probtie	proj_score1	proj_score2	score1	score2
2022	2022-09-06	UEFA Champions League	Borussia Dortmund	FC Copenhagen	79.92	56.12	0.7427	0.0922	0.1651	2.28	0.66	3	0
2022	2022-09-06	UEFA Champions League	Dinamo Zagreb	Chelsea	63.44	81.47	0.2093	0.5370	0.2537	0.85	1.54	1	0
2022	2022-09-06	UEFA Champions League	Celtic	Real Madrid	76.53	86.58	0.3355	0.4276	0.2370	1.41	1.62	0	3
2022	2022-09-06	UEFA Champions League	RB Leipzig	Shakhtar Donetsk	77.81	60.47	0.6714	0.1363	0.1923	2.10	0.82	1	4
2022	2022-09-06	UEFA Champions League	Sevilla FC	Manchester City	69.22	92.30	0.1714	0.6129	0.2157	0.89	1.89	0	4
2022	2022-09-06	UEFA Champions League	Paris Saint-Germain	Juventus	91.62	69.23	0.7420	0.0934	0.1646	2.30	0.68	2	1

Select columns to work with.

extend_ucl1 <- extend_ucl %>%
  select(team1, team2, proj_score1, prob1, prob2, proj_score2, score1, score2)
kable(head(extend_ucl1))

team1	team2	proj_score1	prob1	prob2	proj_score2	score1	score2
Borussia Dortmund	FC Copenhagen	2.28	0.7427	0.0922	0.66	3	0
Dinamo Zagreb	Chelsea	0.85	0.2093	0.5370	1.54	1	0
Celtic	Real Madrid	1.41	0.3355	0.4276	1.62	0	3
RB Leipzig	Shakhtar Donetsk	2.10	0.6714	0.1363	0.82	1	4
Sevilla FC	Manchester City	0.89	0.1714	0.6129	1.89	0	4
Paris Saint-Germain	Juventus	2.30	0.7420	0.0934	0.68	2	1

Filter team1 with “Real Madrid” team.

extend_ucl2 = extend_ucl1 %>%
  filter(team1 == "Real Madrid")
kable(head(extend_ucl2))

team1	team2	proj_score1	prob1	prob2	proj_score2	score1	score2
Real Madrid	RB Leipzig	2.00	0.5627	0.2251	1.19	2	0
Real Madrid	Shakhtar Donetsk	2.26	0.7256	0.1032	0.72	2	1
Real Madrid	Celtic	2.20	0.6543	0.1561	0.98	5	1
Real Madrid	Liverpool	1.72	0.4746	0.2929	1.31	1	0
Real Madrid	Chelsea	1.57	0.5342	0.2160	0.90	2	0
Real Madrid	Manchester City	1.42	0.3310	0.4347	1.66	1	1

Plot of projections of “Real Madrid” team against oponents

ggplot(data = extend_ucl2, aes(x = "Real Madrid", y = team2 ,col = proj_score1), col= red)+ geom_boxplot() +labs(title="Projections Real Madrid vs Oponents",) + theme(plot.title = element_text(hjust=0.5))

 extend_ucl2 %>% group_by("Real Madrid",team2,proj_score2) %>%
  summarise(team1 = n()) %>%
  arrange(desc(team2)) %>%
  ggplot(aes(x = "Real Madrid", y = team2, fill = proj_score2)) +
  geom_bar(position = "dodge", stat="identity")+
  scale_fill_gradient(low = "blue4", high = "springgreen1")+
  theme_light()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+labs(height=10, width=5)+
  coord_flip() + labs(title = "Projections 2 Madrid vs Rivals",x= "Madrid", y = "rivals")

## `summarise()` has grouped output by '"Real Madrid"', 'team2'. You can override
## using the `.groups` argument.

Conclussion.

After working with my own vignette and being able to upload it and to the clone a repositorie, I was able to download a .rmd file from the shared repositorie and work with it. I decided to subset some of the columns in the dataset, to work with only one team, and make a analysis of how “Real Madrid” projections were againts its rivals in the UCL 2022, I created two visualizations, a boxplot displaying the results of project scores 1, and a bar plot with project scores 2.

Tidyverse EXTEND

Victor Torres

2024-04-18