In this document, I will demonstrate a few packages of the tidyverse. Specifically, I will use the packages dplyr to transform the data, tidyr to clean the data, and ggplot2 to plot some of the data. Additionally, readr is used to import the data. As demonstrative dataset, we use the soccer performance index (SPI) match data from fivethirtyeight (https://projects.fivethirtyeight.com/soccer-predictions/).
Below, we load the packages and then the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.3.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
spi = read_csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/spi_matches.csv')
## Rows: 68913 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): league, team1, team2
## dbl (19): season, league_id, spi1, spi2, prob1, prob2, probtie, proj_score1...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(spi)
## Rows: 68,913
## Columns: 23
## $ season <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
## $ date <date> 2016-07-09, 2016-07-10, 2016-07-10, 2016-07-16, 2016-07-1…
## $ league_id <dbl> 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921…
## $ league <chr> "FA Women's Super League", "FA Women's Super League", "FA …
## $ team1 <chr> "Liverpool Women", "Arsenal Women", "Chelsea FC Women", "L…
## $ team2 <chr> "Reading", "Notts County Ladies", "Birmingham City", "Nott…
## $ spi1 <dbl> 51.56, 46.61, 59.85, 53.00, 59.43, 50.75, 48.13, 50.62, 48…
## $ spi2 <dbl> 50.42, 54.03, 54.64, 52.35, 60.99, 55.03, 60.15, 52.63, 48…
## $ prob1 <dbl> 0.4389, 0.3572, 0.4799, 0.4289, 0.4124, 0.3821, 0.3082, 0.…
## $ prob2 <dbl> 0.2767, 0.3608, 0.2487, 0.2699, 0.3157, 0.3200, 0.3888, 0.…
## $ probtie <dbl> 0.2844, 0.2819, 0.2714, 0.3013, 0.2719, 0.2979, 0.3030, 0.…
## $ proj_score1 <dbl> 1.39, 1.27, 1.53, 1.27, 1.45, 1.22, 1.04, 1.31, 1.64, 1.20…
## $ proj_score2 <dbl> 1.05, 1.28, 1.03, 0.94, 1.24, 1.09, 1.20, 1.09, 1.35, 1.45…
## $ importance1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32.4, 53.7, 38.1, …
## $ importance2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 67.7, 22.9, 22.2, …
## $ score1 <dbl> 2, 2, 1, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2, 1, 0, 1, 1, 0, 3, 2…
## $ score2 <dbl> 0, 0, 1, 0, 2, 1, 5, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1…
## $ xg1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.97, 2.45, 0.85, …
## $ xg2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.63, 0.77, 2.77, …
## $ nsxg1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.43, 1.75, 0.17, …
## $ nsxg2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.45, 0.42, 1.25, …
## $ adj_score1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.00, 2.10, 2.10, …
## $ adj_score2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1.05, 2.10, 1.05, …
As we can see, this is a big data frame with almost 70,000 rows. It contains games outcomes and SPIs, alongside predictions of several seasons and leagues. In order to tailor the dataset a bit more for this demonstration, we use only the 2022 season of the UEFA Champions League. This is a tournament that is played besides the regular leagues for which only the 1st to 4th place (some only 1st and 2nd) of each league qualify to play in, making it a literal Champions League. Below we clean up the dataframe.
cl_2022 = spi %>%
filter(season == 2022, league == 'UEFA Champions League')
glimpse(cl_2022)
## Rows: 125
## Columns: 23
## $ season <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022…
## $ date <date> 2022-09-06, 2022-09-06, 2022-09-06, 2022-09-06, 2022-09-0…
## $ league_id <dbl> 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818, 1818…
## $ league <chr> "UEFA Champions League", "UEFA Champions League", "UEFA Ch…
## $ team1 <chr> "Borussia Dortmund", "Dinamo Zagreb", "Celtic", "RB Leipzi…
## $ team2 <chr> "FC Copenhagen", "Chelsea", "Real Madrid", "Shakhtar Donet…
## $ spi1 <dbl> 79.92, 63.44, 76.53, 77.81, 69.22, 91.62, 81.13, 76.96, 85…
## $ spi2 <dbl> 56.12, 81.47, 86.58, 60.47, 92.30, 69.23, 79.14, 45.44, 69…
## $ prob1 <dbl> 0.7427, 0.2093, 0.3355, 0.6714, 0.1714, 0.7420, 0.4589, 0.…
## $ prob2 <dbl> 0.0922, 0.5370, 0.4276, 0.1363, 0.6129, 0.0934, 0.2759, 0.…
## $ probtie <dbl> 0.1651, 0.2537, 0.2370, 0.1923, 0.2157, 0.1646, 0.2652, 0.…
## $ proj_score1 <dbl> 2.28, 0.85, 1.41, 2.10, 0.89, 2.30, 1.40, 2.33, 2.16, 1.35…
## $ proj_score2 <dbl> 0.66, 1.54, 1.62, 0.82, 1.89, 0.68, 1.02, 0.33, 0.87, 1.30…
## $ importance1 <dbl> 77.4, 49.7, 73.2, 71.5, 64.4, 37.3, 76.4, 71.0, 75.7, 78.3…
## $ importance2 <dbl> 52.0, 64.9, 56.9, 53.6, 34.2, 70.6, 75.9, 29.0, 62.3, 82.4…
## $ score1 <dbl> 3, 1, 0, 1, 0, 2, 1, 2, 4, 0, 4, 5, 2, 0, 1, 2, 0, 2, 0, 2…
## $ score2 <dbl> 0, 0, 3, 4, 4, 1, 1, 0, 0, 3, 1, 1, 1, 2, 0, 0, 2, 0, 1, 0…
## $ xg1 <dbl> 2.56, 0.60, 1.05, 1.26, 0.33, 2.25, 1.22, 1.38, 2.06, 0.80…
## $ xg2 <dbl> 0.40, 0.87, 2.74, 1.20, 3.97, 1.31, 0.71, 0.20, 0.11, 1.18…
## $ nsxg1 <dbl> 2.80, 0.62, 1.95, 3.12, 1.18, 1.49, 0.60, 1.22, 2.51, 0.89…
## $ nsxg2 <dbl> 0.57, 1.92, 1.71, 0.33, 2.08, 1.02, 1.82, 0.21, 0.28, 0.93…
## $ adj_score1 <dbl> 2.82, 1.05, 0.00, 1.05, 0.00, 2.10, 1.05, 2.10, 3.96, 0.00…
## $ adj_score2 <dbl> 0.00, 0.00, 2.99, 3.68, 3.68, 1.05, 1.05, 0.00, 0.00, 2.85…
Looking at the new data frame, we can see that it is cleaned up, as it retained only 125 rows, therefore, 125 games. Since we’re working with the best teams of European soccer here, and therefore likely also across the world, it would be interesting to look at a time-series of the average SPI for each game. It should be very high from the beginning, but since the tournament contains a knock-out round, the average SPI should increase over the season. To do this, we’ll create a new column, the average out of both team’s SPIs for each game, and then plot it over time.
cl_2022 = cl_2022 %>%
mutate(avg_spi = (spi1 + spi2) / 2)
ggplot(cl_2022, aes(date, avg_spi)) +
geom_line() +
labs(x = "Date", y = "Average SPI", title = "Average SPI Across CL 2022/23 Season") +
geom_hline(yintercept = mean(cl_2022$avg_spi), color = "#5755FE", linetype = "dashed") +
theme_minimal()
While this graph is certainly not pretty, it confirms our idea: average SPI increases over the season. The big gap in the middle is the break between the group phase and the knock-out round, to which not every team advances. Therefore, it makes sense that the average game SPI increases at that point already. Interestingly, the final is not the game with the highest SPI, which appears to be either the quarter or semi final. This shows additionally the importance of seeding in a tournament, which does not take place in the CL, it is drawn completely randomly. Additionally, the horizontal line shows the average SPI of all games, which is just under 80, making it a quite strong tournament.
extend_ucl = subset(cl_2022,select = -c(league_id,importance1,importance2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2,avg_spi))
kable(head(extend_ucl))
| season | date | league | team1 | team2 | spi1 | spi2 | prob1 | prob2 | probtie | proj_score1 | proj_score2 | score1 | score2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | 2022-09-06 | UEFA Champions League | Borussia Dortmund | FC Copenhagen | 79.92 | 56.12 | 0.7427 | 0.0922 | 0.1651 | 2.28 | 0.66 | 3 | 0 |
| 2022 | 2022-09-06 | UEFA Champions League | Dinamo Zagreb | Chelsea | 63.44 | 81.47 | 0.2093 | 0.5370 | 0.2537 | 0.85 | 1.54 | 1 | 0 |
| 2022 | 2022-09-06 | UEFA Champions League | Celtic | Real Madrid | 76.53 | 86.58 | 0.3355 | 0.4276 | 0.2370 | 1.41 | 1.62 | 0 | 3 |
| 2022 | 2022-09-06 | UEFA Champions League | RB Leipzig | Shakhtar Donetsk | 77.81 | 60.47 | 0.6714 | 0.1363 | 0.1923 | 2.10 | 0.82 | 1 | 4 |
| 2022 | 2022-09-06 | UEFA Champions League | Sevilla FC | Manchester City | 69.22 | 92.30 | 0.1714 | 0.6129 | 0.2157 | 0.89 | 1.89 | 0 | 4 |
| 2022 | 2022-09-06 | UEFA Champions League | Paris Saint-Germain | Juventus | 91.62 | 69.23 | 0.7420 | 0.0934 | 0.1646 | 2.30 | 0.68 | 2 | 1 |
extend_ucl1 <- extend_ucl %>%
select(team1, team2, proj_score1, prob1, prob2, proj_score2, score1, score2)
kable(head(extend_ucl1))
| team1 | team2 | proj_score1 | prob1 | prob2 | proj_score2 | score1 | score2 |
|---|---|---|---|---|---|---|---|
| Borussia Dortmund | FC Copenhagen | 2.28 | 0.7427 | 0.0922 | 0.66 | 3 | 0 |
| Dinamo Zagreb | Chelsea | 0.85 | 0.2093 | 0.5370 | 1.54 | 1 | 0 |
| Celtic | Real Madrid | 1.41 | 0.3355 | 0.4276 | 1.62 | 0 | 3 |
| RB Leipzig | Shakhtar Donetsk | 2.10 | 0.6714 | 0.1363 | 0.82 | 1 | 4 |
| Sevilla FC | Manchester City | 0.89 | 0.1714 | 0.6129 | 1.89 | 0 | 4 |
| Paris Saint-Germain | Juventus | 2.30 | 0.7420 | 0.0934 | 0.68 | 2 | 1 |
extend_ucl2 = extend_ucl1 %>%
filter(team1 == "Real Madrid")
kable(head(extend_ucl2))
| team1 | team2 | proj_score1 | prob1 | prob2 | proj_score2 | score1 | score2 |
|---|---|---|---|---|---|---|---|
| Real Madrid | RB Leipzig | 2.00 | 0.5627 | 0.2251 | 1.19 | 2 | 0 |
| Real Madrid | Shakhtar Donetsk | 2.26 | 0.7256 | 0.1032 | 0.72 | 2 | 1 |
| Real Madrid | Celtic | 2.20 | 0.6543 | 0.1561 | 0.98 | 5 | 1 |
| Real Madrid | Liverpool | 1.72 | 0.4746 | 0.2929 | 1.31 | 1 | 0 |
| Real Madrid | Chelsea | 1.57 | 0.5342 | 0.2160 | 0.90 | 2 | 0 |
| Real Madrid | Manchester City | 1.42 | 0.3310 | 0.4347 | 1.66 | 1 | 1 |
ggplot(data = extend_ucl2, aes(x = "Real Madrid", y = team2 ,col = proj_score1), col= red)+ geom_boxplot() +labs(title="Projections Real Madrid vs Oponents",) + theme(plot.title = element_text(hjust=0.5))
extend_ucl2 %>% group_by("Real Madrid",team2,proj_score2) %>%
summarise(team1 = n()) %>%
arrange(desc(team2)) %>%
ggplot(aes(x = "Real Madrid", y = team2, fill = proj_score2)) +
geom_bar(position = "dodge", stat="identity")+
scale_fill_gradient(low = "blue4", high = "springgreen1")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+labs(height=10, width=5)+
coord_flip() + labs(title = "Projections 2 Madrid vs Rivals",x= "Madrid", y = "rivals")
## `summarise()` has grouped output by '"Real Madrid"', 'team2'. You can override
## using the `.groups` argument.