Introduction

I have chosen to work with the soccer data set, and its associated article is titled Club Soccer Predictions. The article basically makes predictions about 40 leagues around the world. A prediction entails the expected standings of teams within a league for the course of the season. However, it is not the prediction part of the data set that interests me the most; it is rather the measurement called the Soccer Power Index (SPI) that is fascinating to me.

I read up on how the index is calculated to see if it made sense to me as an avid viewer. I was immediately hooked by their approach because SPI is cumulative, and it is updated with each game played by the team: It is a self-correcting predictive model.

The Methodology behind SPI as Presented in How Our Club Soccer Predictions Work

At the beginning of a season, the SPI rating of a team is that of the previous season adjusted by a value that corresponds to the off-season investment or transfer activity of the team. In other words, if a lot of money is spent during the summer transfer window, that has an effect of bolstering the SPI of the team at the start of the season. Then, as the season continues and games are played, the performances are taken into account to adjust the SPI rating. The key concept that really convinces me about the merit of this methodology is that the researchers make a distinction between result and performance. As a long-time fan of the game, I am more than aware of the fact that it is quite often the case that the result does not reflect the performance, and the better team ends up drawing or losing the game. So, I would not be on board with a model that only accounts for the result of a game. This model makes use of two indices to adjust the all important SPI throughout the season: the offensive rating and the defensive rating. These ratings reflect the number of goals a team is expected to score and concede each game, and each new game can change the ratings based on both performance and result. That means, not only are the goals actually scored and conceded taken into account, but also the ones that could have occurred are also taken into account, weighed by the context of the game surrounding them.

I realize this has been a long-winded explanation, but the methodology gets a lot more detailed, and the bare minimum needs to be enumerated to contextualize my choices in cleaning and summarizing the data for comparing the top 6 European soccer leagues.

Transformations

Loading and Glimpsing the Data

library("tidyverse")
spi_global_rankings <- read_csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_global_rankings.csv")
glimpse(spi_global_rankings)
## Rows: 643
## Columns: 7
## $ rank      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ prev_rank <dbl> 1, 2, 3, 4, 6, 5, 7, 8, 9, 10, 11, 14, 16, 13, 18, 12, 17, 2…
## $ name      <chr> "Bayern Munich", "Manchester City", "Paris Saint-Germain", "…
## $ league    <chr> "German Bundesliga", "Barclays Premier League", "French Ligu…
## $ off       <dbl> 3.35, 3.02, 3.06, 2.80, 2.66, 2.77, 2.98, 2.30, 2.19, 2.39, …
## $ def       <dbl> 0.45, 0.32, 0.50, 0.47, 0.46, 0.57, 0.73, 0.58, 0.54, 0.67, …
## $ spi       <dbl> 92.88, 92.70, 90.42, 89.02, 87.92, 87.09, 86.31, 81.87, 81.2…

Transforming and Analyzing the Data

Looking at the data, it occurs to me that the season has just started: so for most teams, there is no difference between current rank and previous rank. As such, I will remove the prev_rank column. Additionally, I see that teams from many small leagues appear in the data frame, so I filter the data frame to include only the top 6 European soccer leagues: German Bundesliga, Barclays Premier League, French Ligue 1, Spanish Primera Division, Italy Serie A, Portuguese Liga.

filt_by_league <- spi_global_rankings[spi_global_rankings$league %in% c("German Bundesliga", "Barclays Premier League", "French Ligue 1", "Spanish Primera Division", "Italy Serie A", "Portuguese Liga"), ]
filt_by_league <- filt_by_league[, -2]

Next, I create a new column called goal_diff that calculates the difference between the offensive and defensive ratings (off and def) to see if the values decrease down the column in correspondence with the decreasing SPI values down the column. Theoretically, the correspondence should be there since SPI is calculated using the offensive and defensive ratings.

Finally, I create a new aggregate data frame grouped by league column for calculating the means of the performance metrics (off, def, goal_diff, and spi). It is sorted by the descending spi_mean column.

filt_by_league <- filt_by_league %>% mutate(goal_diff = off - def, .after = def)
mean_df <- filt_by_league %>% group_by(league) %>% summarise_at(.vars = c(-1:-2), .funs = list(mean = mean))
mean_df
## # A tibble: 6 × 5
##   league                   off_mean def_mean goal_diff_mean spi_mean
##   <chr>                       <dbl>    <dbl>          <dbl>    <dbl>
## 1 Barclays Premier League      1.99    0.788          1.20      71.5
## 2 French Ligue 1               1.72    1.01           0.716     61.4
## 3 German Bundesliga            2.02    0.916          1.10      69.1
## 4 Italy Serie A                1.76    1.03           0.737     62.0
## 5 Portuguese Liga              1.45    1.13           0.316     52.5
## 6 Spanish Primera Division     1.83    0.819          1.01      68.0

We can see that there is in fact correspondence between a decrease in goal_diff_mean and a decrease in spi_mean. We can also see that the aggregate data frame mean_df reveals that the strongest European league is the Barclays Premier League, with the German Bundesliga, the Spanish Primera Division, the Italy Serie A, the French Ligue 1, and the Portuguese Liga following in that respective order.

Finally, I want to check one more thing that should be interesting. I want to see if the strongest teams are ordered in correspondence with the order of the strongest leagues: i.e., Barclays Premier League sides should be the stronger than top German Bundesliga sides. To check this, I fetch the top 6 teams in the spi_global_rankings data frame, which is already ordered by descending values of the spi column. I plan to compare the order of these teams to the order of the leagues to get my answer.

teams_vs_leagues_order <- bind_cols(spi_global_rankings[1:6, "league"], mean_df["league"])
colnames(teams_vs_leagues_order) <- c("teams_order", "leagues_order")
teams_vs_leagues_order
## # A tibble: 6 × 2
##   teams_order              leagues_order           
##   <chr>                    <chr>                   
## 1 German Bundesliga        Barclays Premier League 
## 2 Barclays Premier League  French Ligue 1          
## 3 French Ligue 1           German Bundesliga       
## 4 Barclays Premier League  Italy Serie A           
## 5 Spanish Primera Division Portuguese Liga         
## 6 Spanish Primera Division Spanish Primera Division

The result is surprising because the order of the strongest teams is not in correspondence with the order of the strongest leagues.

Conclusions

On reflection, I have realized a couple of problems with my approach to transforming/analyzing the data. Firstly, the top 6 European soccer leagues should not be determined based on reputation. I should have identified the list after my analysis by using the spi_mean values from mean_df; similarly, mean_df should have contained all the leagues, not just the top 6 chosen according to reputation. So, filtering out the smaller leagues early in the transformation process was a mistake. Secondly, mean_df should have contained the mean of the rank column of filt_by_league data frame. This mean rank column likely would have reflected the discrepancy between the order of the leagues of the strongest teams and the mean-SPI-sorted order of the strongest leagues by clarifying that a high number of strong teams can lead to a league being deemed the strongest (as is the case for the Barclays Premier League), even if the strongest team or teams come from a different league.