Introduction

As a soccer fan, I’ve decided to use FiveThirtyEight’s 2022 Club Soccer Predictions. I’ll be using their club rankings dataset (also available on my GitHub) to see whether the offensive and defensive ratings assigned by FiveThirtyEight depend on a team’s league or name. This dataset uses ESPN’s Soccer Power Index (SPI) to assign each team a universal rating that normalizes across different leagues and playing conditions (home vs. away, etc.). This metric can be used to forecast the outcome of matches and seasons for a club. It can also be used to rank clubs from different leagues, where quality of play may differ.

library('dplyr')
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Read in our dataset hosted on GitHub
url_path <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/data/spi_global_rankings.csv"
df <- read.csv(url_path, header = TRUE)

head(df, 10)
##    rank prev_rank                name                   league  off  def   spi
## 1     1         1       Bayern Munich        German Bundesliga 3.38 0.44 93.17
## 2     2         2     Manchester City  Barclays Premier League 2.98 0.35 92.10
## 3     3         3 Paris Saint-Germain           French Ligue 1 3.07 0.51 90.35
## 4     4         4           Liverpool  Barclays Premier League 2.84 0.42 90.03
## 5     5         5         Real Madrid Spanish Primera Division 2.73 0.60 86.17
## 6     6         6           Barcelona Spanish Primera Division 2.48 0.48 85.82
## 7     7         7                Ajax         Dutch Eredivisie 2.88 0.74 85.19
## 8     8         8             Chelsea  Barclays Premier League 2.29 0.53 82.74
## 9     9         9      Internazionale            Italy Serie A 2.50 0.69 82.00
## 10   10        10   Tottenham Hotspur  Barclays Premier League 2.36 0.61 81.93

Filtering our Dataset

I’m primarily a fan of the Italian Serie A, so I’d like to filter our dataset on the league column to keep only the rows for teams that play in Serie A.

# Filter with dplyr: keep only the rows whose league is Serie A
serie_a <- filter(df, league == "Italy Serie A")

# Base R's subset function gives the same rows: https://www.r-bloggers.com/2011/11/r-101-the-subset-function/
serieA <- subset(df, league == "Italy Serie A")

serieA
##     rank prev_rank           name        league  off  def   spi
## 9      9         9 Internazionale Italy Serie A 2.50 0.69 82.00
## 17    17        17       AC Milan Italy Serie A 2.17 0.62 79.08
## 19    19        20         Napoli Italy Serie A 2.23 0.71 78.16
## 30    30        27        AS Roma Italy Serie A 1.92 0.68 73.91
## 32    32        29       Atalanta Italy Serie A 2.05 0.81 72.97
## 44    44        46       Juventus Italy Serie A 1.86 0.81 69.79
## 50    50        49          Lazio Italy Serie A 2.01 0.92 69.66
## 62    62        60     Fiorentina Italy Serie A 1.80 0.87 67.14
## 74    74        76         Torino Italy Serie A 1.66 0.88 64.04
## 89    89        94       Sassuolo Italy Serie A 1.92 1.24 60.84
## 93    93        91         Verona Italy Serie A 1.72 1.08 60.35
## 102  102       114        Udinese Italy Serie A 1.74 1.17 58.69
## 129  129       124        Bologna Italy Serie A 1.56 1.15 55.51
## 162  162       143      Sampdoria Italy Serie A 1.46 1.20 51.71
## 172  172       170          Lecce Italy Serie A 1.36 1.14 50.91
## 177  177       174         Empoli Italy Serie A 1.46 1.28 49.95
## 180  180       171          Monza Italy Serie A 1.42 1.27 49.39
## 183  183       178      Cremonese Italy Serie A 1.43 1.32 48.23
## 187  187       196    Salernitana Italy Serie A 1.45 1.35 48.10
## 196  196       185         Spezia Italy Serie A 1.43 1.38 46.93

Dataset Measurements

The off and def columns represent the average number of goals a club would be expected to score and concede, respectively, against an average team on a neutral field. I’d like to calculate an average goal differential column, avg_diff (goals scored minus goals conceded). I’d also like to see the change in a club’s weekly ranking, given by our calculated rank_diff column (defined as \(\text{previous rank} - \text{new rank}\)). A positive rank_diff means the club moved up the rankings, and a negative value means it dropped.

df <- df %>% mutate(rank_diff = prev_rank - rank)
df <- df %>% mutate(avg_diff = off - def)

head(df, 10)
##    rank prev_rank                name                   league  off  def   spi
## 1     1         1       Bayern Munich        German Bundesliga 3.38 0.44 93.17
## 2     2         2     Manchester City  Barclays Premier League 2.98 0.35 92.10
## 3     3         3 Paris Saint-Germain           French Ligue 1 3.07 0.51 90.35
## 4     4         4           Liverpool  Barclays Premier League 2.84 0.42 90.03
## 5     5         5         Real Madrid Spanish Primera Division 2.73 0.60 86.17
## 6     6         6           Barcelona Spanish Primera Division 2.48 0.48 85.82
## 7     7         7                Ajax         Dutch Eredivisie 2.88 0.74 85.19
## 8     8         8             Chelsea  Barclays Premier League 2.29 0.53 82.74
## 9     9         9      Internazionale            Italy Serie A 2.50 0.69 82.00
## 10   10        10   Tottenham Hotspur  Barclays Premier League 2.36 0.61 81.93
##    rank_diff avg_diff
## 1          0     2.94
## 2          0     2.63
## 3          0     2.56
## 4          0     2.42
## 5          0     2.13
## 6          0     2.00
## 7          0     2.14
## 8          0     1.76
## 9          0     1.81
## 10         0     1.75
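
Since the top ten clubs all have a rank_diff of 0, the sign convention is easier to see on clubs that actually moved. Here is a small sketch using three Serie A rows typed in from the table above. It uses base R only, and the toy data frame and its movement column are hypothetical names added purely for illustration.

```r
# Three rows copied by hand from the Serie A table above
toy <- data.frame(
  name = c("Udinese", "Sampdoria", "Internazionale"),
  rank = c(102, 162, 9),
  prev_rank = c(114, 143, 9),
  off = c(1.74, 1.46, 2.50),
  def = c(1.17, 1.20, 0.69),
  stringsAsFactors = FALSE
)

# Same definitions as the mutate() calls above
toy$rank_diff <- toy$prev_rank - toy$rank
toy$avg_diff <- toy$off - toy$def

# Hypothetical helper column: label the direction of the move
toy$movement <- ifelse(toy$rank_diff > 0, "up",
                       ifelse(toy$rank_diff < 0, "down", "unchanged"))

print(toy)
```

Udinese climbed from 114 to 102, so its rank_diff is positive, while Sampdoria slipped from 143 to 162 and gets a negative value.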

League Averages

Next, I want to see the average offensive, defensive, and SPI ratings of each league included in our dataset.

league_avgs <- df %>% group_by(league)

league_avg_stats <- league_avgs %>% summarize(off=mean(off), def=mean(def), spi=mean(spi))
league_avg_stats
## # A tibble: 36 × 4
##    league                         off   def   spi
##    <chr>                        <dbl> <dbl> <dbl>
##  1 Argentina Primera Division   1.02  1.33   37.9
##  2 Australian A-League          0.832 2.13   20.2
##  3 Austrian T-Mobile Bundesliga 1.39  1.49   43.3
##  4 Barclays Premier League      2.00  0.790  71.5
##  5 Belgian Jupiler League       1.33  1.47   42.8
##  6 Brasileiro Série A           1.31  1.06   51.5
##  7 Chinese Super League         0.584 2.34   13.5
##  8 Danish SAS-Ligaen            1.25  1.49   40.2
##  9 Dutch Eredivisie             1.65  1.30   53.0
## 10 English League Championship  1.23  1.31   43.8
## # … with 26 more rows

I then sort the league summary stats by average SPI in descending order. This gives a sense of overall league strength according to this metric, alongside each league’s average offensive and defensive ratings.

league_avg_stats_sorted <- league_avg_stats %>% arrange(desc(spi))

league_avg_stats_sorted
## # A tibble: 36 × 4
##    league                                     off   def   spi
##    <chr>                                    <dbl> <dbl> <dbl>
##  1 Barclays Premier League                   2.00 0.790  71.5
##  2 German Bundesliga                         2.02 0.918  69.0
##  3 Spanish Primera Division                  1.80 0.794  68.3
##  4 Italy Serie A                             1.76 1.03   61.9
##  5 French Ligue 1                            1.74 1.01   61.7
##  6 UEFA Champions League                     1.72 1.26   56.1
##  7 Dutch Eredivisie                          1.65 1.30   53.0
##  8 Portuguese Liga                           1.45 1.13   52.5
##  9 Brasileiro Série A                        1.31 1.06   51.5
## 10 Mexican Primera Division Torneo Apertura  1.34 1.28   47.1
## # … with 26 more rows
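
A league average can rest on very different numbers of clubs, so it may help to report the club count next to each mean. Below is a rough sketch on just the ten rows shown by head(df, 10) earlier, using base R's aggregate function rather than dplyr so it runs without the CSV or extra packages. The top10 data frame and the n_clubs column are names I made up for this example, and the averages here cover only those ten clubs, not the full leagues.

```r
# The ten rows from head(df, 10), typed in by hand
top10 <- data.frame(
  league = c("German Bundesliga", "Barclays Premier League", "French Ligue 1",
             "Barclays Premier League", "Spanish Primera Division",
             "Spanish Primera Division", "Dutch Eredivisie",
             "Barclays Premier League", "Italy Serie A",
             "Barclays Premier League"),
  off = c(3.38, 2.98, 3.07, 2.84, 2.73, 2.48, 2.88, 2.29, 2.50, 2.36),
  def = c(0.44, 0.35, 0.51, 0.42, 0.60, 0.48, 0.74, 0.53, 0.69, 0.61),
  spi = c(93.17, 92.10, 90.35, 90.03, 86.17, 85.82, 85.19, 82.74, 82.00, 81.93),
  stringsAsFactors = FALSE
)

# Mean of each rating per league
league_summary <- aggregate(cbind(off, def, spi) ~ league, data = top10, FUN = mean)

# Number of clubs behind each mean, matched to the summary rows by league name
club_counts <- table(top10$league)
league_summary$n_clubs <- as.vector(club_counts[league_summary$league])

# Sort by average SPI, highest first
league_summary <- league_summary[order(-league_summary$spi), ]

print(league_summary)
```

In this toy sample, several leagues are represented by a single club, which is exactly the situation where a mean on its own can mislead.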

Plots

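Here I plot each club’s offensive rating against its defensive rating using the ggplot2 library. Keep in mind that def measures goals conceded, so lower values indicate a stronger defense.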

library('ggplot2')

ggplot(df, aes(x=def, y=off)) + geom_point()

Plotting by League

I want to make the same plot as above, but colored by league. This could be a way to see whether there are any trends between leagues in terms of club offensive and defensive performance. I only include the Big 5 leagues (Premier League, Bundesliga, Serie A, Ligue 1, La Liga); otherwise the chart becomes a bit busy.

# Filtering dataframe to the leagues we want
big5_leagues <- c("Italy Serie A",
                  "Barclays Premier League",
                  "German Bundesliga",
                  "Spanish Primera Division",
                  "French Ligue 1")
big5 <- filter(df, league %in% big5_leagues)

ggplot(big5, aes(x=def, y=off, color=league)) + geom_point()

Plotting Just Serie A

Finally, I want to see the offensive vs. defensive ratings for only the teams in Serie A (the top Italian league). The general trend (a negative correlation, where clubs that score more also tend to concede less) follows that of the overall visualization above, but the range of each variable is different in this case. For instance, the lowest defensive ratings (i.e. the fewest expected goals conceded) across all Big 5 leagues are between 0.3 and 0.4 goals, whereas in Serie A the lowest defensive rating is around 0.6.

ggplot(serieA, aes(x=def, y=off)) + geom_point()
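
To put a number on the trend in this plot, one option is the Pearson correlation between def and off. The sketch below types in the Serie A values from the table above so that it runs without downloading the CSV. It uses base R only, and serie_a_toy is a hypothetical name for this example; with the real data, the same calls could be applied to serieA directly.

```r
# Serie A off/def ratings, copied by hand from the table above
serie_a_toy <- data.frame(
  name = c("Internazionale", "AC Milan", "Napoli", "AS Roma", "Atalanta",
           "Juventus", "Lazio", "Fiorentina", "Torino", "Sassuolo",
           "Verona", "Udinese", "Bologna", "Sampdoria", "Lecce",
           "Empoli", "Monza", "Cremonese", "Salernitana", "Spezia"),
  off = c(2.50, 2.17, 2.23, 1.92, 2.05, 1.86, 2.01, 1.80, 1.66, 1.92,
          1.72, 1.74, 1.56, 1.46, 1.36, 1.46, 1.42, 1.43, 1.45, 1.43),
  def = c(0.69, 0.62, 0.71, 0.68, 0.81, 0.81, 0.92, 0.87, 0.88, 1.24,
          1.08, 1.17, 1.15, 1.20, 1.14, 1.28, 1.27, 1.32, 1.35, 1.38),
  stringsAsFactors = FALSE
)

# Pearson correlation between defensive and offensive ratings;
# a negative value matches the downward trend in the scatter plot
off_def_cor <- cor(serie_a_toy$def, serie_a_toy$off)

# Range of the defensive rating, and the club with the lowest (best) value
def_range <- range(serie_a_toy$def)
best_defense <- serie_a_toy[which.min(serie_a_toy$def), ]

print(off_def_cor)
print(def_range)
print(best_defense)
```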

Conclusion/Further Recommendations

One interesting exercise with this dataset would be to check whether the SPI rankings are biased by league. For instance, results from European tournaments, such as the UEFA Champions League or Europa League, could give insight into how leagues perform against each other. If a league’s clubs consistently perform worse against outside competition than their ratings imply, that would suggest the SPI ranking favors that league or its clubs. The same matches could also be used to test how well the SPI metric predicts the outcomes of matches between two teams (in European competitions or otherwise). Another idea would be to try to classify a team’s league based on its offensive, defensive, and SPI ratings.