As a soccer fan, I’ve decided to use FiveThirtyEight’s 2022 Club Soccer Predictions. I’ll be using their club rankings dataset (also available on my GitHub) to see whether the offensive and defensive ratings assigned by FiveThirtyEight depend on a team’s league. The dataset uses ESPN’s Soccer Power Index (SPI) to give every team a universal rating that normalizes across leagues and playing conditions (home vs. away, etc.). This metric can be used to forecast the outcome of a club’s matches and seasons. It can also be used to rank clubs from different leagues, where the quality of play may differ.
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Read in our dataset hosted on GitHub
url_path <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/data/spi_global_rankings.csv"
df <- read.csv(url_path, header = TRUE)
head(df, 10)
## rank prev_rank name league off def spi
## 1 1 1 Bayern Munich German Bundesliga 3.38 0.44 93.17
## 2 2 2 Manchester City Barclays Premier League 2.98 0.35 92.10
## 3 3 3 Paris Saint-Germain French Ligue 1 3.07 0.51 90.35
## 4 4 4 Liverpool Barclays Premier League 2.84 0.42 90.03
## 5 5 5 Real Madrid Spanish Primera Division 2.73 0.60 86.17
## 6 6 6 Barcelona Spanish Primera Division 2.48 0.48 85.82
## 7 7 7 Ajax Dutch Eredivisie 2.88 0.74 85.19
## 8 8 8 Chelsea Barclays Premier League 2.29 0.53 82.74
## 9 9 9 Internazionale Italy Serie A 2.50 0.69 82.00
## 10 10 10 Tottenham Hotspur Barclays Premier League 2.36 0.61 81.93
I’m primarily a fan of the Italian Serie A, so I’d like to filter the dataset on the league column to keep only the teams that play in Serie A.
# Filter with dplyr
serie_a <- filter(df, league == "Italy Serie A")
# The base R subset function gives the same result: https://www.r-bloggers.com/2011/11/r-101-the-subset-function/
serieA <- subset(df, league=="Italy Serie A")
serieA
## rank prev_rank name league off def spi
## 9 9 9 Internazionale Italy Serie A 2.50 0.69 82.00
## 17 17 17 AC Milan Italy Serie A 2.17 0.62 79.08
## 19 19 20 Napoli Italy Serie A 2.23 0.71 78.16
## 30 30 27 AS Roma Italy Serie A 1.92 0.68 73.91
## 32 32 29 Atalanta Italy Serie A 2.05 0.81 72.97
## 44 44 46 Juventus Italy Serie A 1.86 0.81 69.79
## 50 50 49 Lazio Italy Serie A 2.01 0.92 69.66
## 62 62 60 Fiorentina Italy Serie A 1.80 0.87 67.14
## 74 74 76 Torino Italy Serie A 1.66 0.88 64.04
## 89 89 94 Sassuolo Italy Serie A 1.92 1.24 60.84
## 93 93 91 Verona Italy Serie A 1.72 1.08 60.35
## 102 102 114 Udinese Italy Serie A 1.74 1.17 58.69
## 129 129 124 Bologna Italy Serie A 1.56 1.15 55.51
## 162 162 143 Sampdoria Italy Serie A 1.46 1.20 51.71
## 172 172 170 Lecce Italy Serie A 1.36 1.14 50.91
## 177 177 174 Empoli Italy Serie A 1.46 1.28 49.95
## 180 180 171 Monza Italy Serie A 1.42 1.27 49.39
## 183 183 178 Cremonese Italy Serie A 1.43 1.32 48.23
## 187 187 196 Salernitana Italy Serie A 1.45 1.35 48.10
## 196 196 185 Spezia Italy Serie A 1.43 1.38 46.93
The off and def columns represent the average number of goals a club would be expected to score and concede against an average team on a neutral field. I’d like to calculate an average goal differential column, avg_diff (goals scored - goals conceded). I’d also like to see the change in a club’s weekly ranking, given by a calculated rank_diff column defined as \(\text{rank\_diff} = \text{prev\_rank} - \text{rank}\), so a positive value means the club moved up the rankings.
df <- df %>% mutate(rank_diff = prev_rank - rank)
df <- df %>% mutate(avg_diff = off - def)
head(df, 10)
## rank prev_rank name league off def spi
## 1 1 1 Bayern Munich German Bundesliga 3.38 0.44 93.17
## 2 2 2 Manchester City Barclays Premier League 2.98 0.35 92.10
## 3 3 3 Paris Saint-Germain French Ligue 1 3.07 0.51 90.35
## 4 4 4 Liverpool Barclays Premier League 2.84 0.42 90.03
## 5 5 5 Real Madrid Spanish Primera Division 2.73 0.60 86.17
## 6 6 6 Barcelona Spanish Primera Division 2.48 0.48 85.82
## 7 7 7 Ajax Dutch Eredivisie 2.88 0.74 85.19
## 8 8 8 Chelsea Barclays Premier League 2.29 0.53 82.74
## 9 9 9 Internazionale Italy Serie A 2.50 0.69 82.00
## 10 10 10 Tottenham Hotspur Barclays Premier League 2.36 0.61 81.93
## rank_diff avg_diff
## 1 0 2.94
## 2 0 2.63
## 3 0 2.56
## 4 0 2.42
## 5 0 2.13
## 6 0 2.00
## 7 0 2.14
## 8 0 1.76
## 9 0 1.81
## 10 0 1.75
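The top ten clubs all have a rank_diff of 0, so the sign convention is easier to see on clubs that actually moved. The sketch below is only an illustration: it rebuilds a small data frame by hand from five of the Serie A rows printed earlier (the `movers` name is my own, not part of the dataset) and sorts by rank_diff.

```r
library(dplyr)

# Five Serie A rows copied by hand from the table printed earlier
# (only the name, rank, and prev_rank columns are kept)
movers <- data.frame(
  name      = c("Internazionale", "Sassuolo", "Verona", "Udinese", "Sampdoria"),
  rank      = c(9, 89, 93, 102, 162),
  prev_rank = c(9, 94, 91, 114, 143),
  stringsAsFactors = FALSE
)

# Positive rank_diff = the club climbed; negative = the club dropped
movers <- movers %>%
  mutate(rank_diff = prev_rank - rank) %>%
  arrange(desc(rank_diff))

movers
```

In this subset, Udinese is the biggest climber (114 to 102, a rank_diff of 12) and Sampdoria the biggest faller (143 to 162, a rank_diff of -19).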
Next, I want to see the average offensive, defensive, and SPI ratings of each league included in the dataset.
league_avgs <- df %>% group_by(league)
league_avg_stats <- league_avgs %>% summarize(off=mean(off), def=mean(def), spi=mean(spi))
league_avg_stats
## # A tibble: 36 × 4
## league off def spi
## <chr> <dbl> <dbl> <dbl>
## 1 Argentina Primera Division 1.02 1.33 37.9
## 2 Australian A-League 0.832 2.13 20.2
## 3 Austrian T-Mobile Bundesliga 1.39 1.49 43.3
## 4 Barclays Premier League 2.00 0.790 71.5
## 5 Belgian Jupiler League 1.33 1.47 42.8
## 6 Brasileiro Série A 1.31 1.06 51.5
## 7 Chinese Super League 0.584 2.34 13.5
## 8 Danish SAS-Ligaen 1.25 1.49 40.2
## 9 Dutch Eredivisie 1.65 1.30 53.0
## 10 English League Championship 1.23 1.31 43.8
## # … with 26 more rows
Sorting the league summary by average SPI in descending order gives a sense of how the leagues compare overall, alongside their average offensive and defensive ratings.
league_avg_stats_sorted <- league_avg_stats %>% arrange(desc(spi))
league_avg_stats_sorted
## # A tibble: 36 × 4
## league off def spi
## <chr> <dbl> <dbl> <dbl>
## 1 Barclays Premier League 2.00 0.790 71.5
## 2 German Bundesliga 2.02 0.918 69.0
## 3 Spanish Primera Division 1.80 0.794 68.3
## 4 Italy Serie A 1.76 1.03 61.9
## 5 French Ligue 1 1.74 1.01 61.7
## 6 UEFA Champions League 1.72 1.26 56.1
## 7 Dutch Eredivisie 1.65 1.30 53.0
## 8 Portuguese Liga 1.45 1.13 52.5
## 9 Brasileiro Série A 1.31 1.06 51.5
## 10 Mexican Primera Division Torneo Apertura 1.34 1.28 47.1
## # … with 26 more rows
Next, I plot the offensive rating against the defensive rating for every club, using the ggplot2 library.
library('ggplot2')
ggplot(df, aes(x=def, y=off)) + geom_point()
I want to make the same plot as above, but colored by league. This could show whether leagues differ in their clubs’ offensive and defensive ratings. I only include the Big 5 leagues (Premier League, Bundesliga, Serie A, Ligue 1, La Liga); otherwise the chart becomes too busy.
# Filtering dataframe to the leagues we want
big5_leagues <- c("Italy Serie A", "Barclays Premier League", "German Bundesliga",
                  "Spanish Primera Division", "French Ligue 1")
big5 <- filter(df, league %in% big5_leagues)
ggplot(big5, aes(x=def, y=off, color=league)) + geom_point()
Next, I plot the offensive vs. defensive ratings for Serie A clubs only (the top Italian league). The general trend (a negative correlation) follows that of the Big 5 plot above, but the range of each variable is different. For instance, the lowest defensive ratings (i.e., the average expected goals conceded) across all Big 5 leagues are between 0.3 and 0.4 goals, whereas in Serie A the lowest defensive rating is around 0.6.
ggplot(serieA, aes(x=def, y=off)) + geom_point()
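The negative trend and the range of the defensive rating can also be checked numerically rather than read off the chart. The sketch below is self-contained: it retypes the 20 Serie A off and def values from the table printed earlier into a data frame (the `serie_a_ratings` name is my own) and uses base R's `cor()` and `range()`.

```r
# Serie A off/def ratings copied by hand from the table printed earlier,
# in the same row order (Internazionale first, Spezia last)
serie_a_ratings <- data.frame(
  off = c(2.50, 2.17, 2.23, 1.92, 2.05, 1.86, 2.01, 1.80, 1.66, 1.92,
          1.72, 1.74, 1.56, 1.46, 1.36, 1.46, 1.42, 1.43, 1.45, 1.43),
  def = c(0.69, 0.62, 0.71, 0.68, 0.81, 0.81, 0.92, 0.87, 0.88, 1.24,
          1.08, 1.17, 1.15, 1.20, 1.14, 1.28, 1.27, 1.32, 1.35, 1.38)
)

# Pearson correlation between offensive and defensive ratings;
# a negative value matches the downward trend in the scatter plot
off_def_cor <- cor(serie_a_ratings$off, serie_a_ratings$def)

# Smallest and largest defensive rating in Serie A
def_range <- range(serie_a_ratings$def)

off_def_cor
def_range
```

With the full dataset loaded, the same two calls on `serieA$off` and `serieA$def` should give the same result, and running `range(big5$def)` would allow a direct comparison with the Big 5 leagues.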
One interesting exercise with this dataset would be to check whether the SPI rankings are biased by league. For instance, results from European tournaments, such as the UEFA Champions League or Europa League, could give insight into how leagues perform against each other. A gap between a league’s SPI ratings and how its clubs actually perform against outside competition would suggest that the SPI ranking favors certain leagues or clubs. The same results could also be used to evaluate how well the SPI metric predicts the outcomes of matches between two teams (in European competitions or otherwise). Another idea would be to try to classify a team’s league based on its offensive, defensive, and SPI ratings.
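As a rough starting point for the classification idea, here is a minimal nearest-centroid sketch in base R. It is not a tuned model. It assumes the league averages printed in the sorted summary table earlier (top five leagues only, retyped by hand), and the `centroids` data frame and `nearest_league()` helper are hypothetical names introduced just for this illustration.

```r
# League averages copied by hand from the sorted summary table printed earlier
# (top five leagues only)
centroids <- data.frame(
  league = c("Barclays Premier League", "German Bundesliga",
             "Spanish Primera Division", "Italy Serie A", "French Ligue 1"),
  off = c(2.00, 2.02, 1.80, 1.76, 1.74),
  def = c(0.790, 0.918, 0.794, 1.03, 1.01),
  spi = c(71.5, 69.0, 68.3, 61.9, 61.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the league whose average ratings are closest
# (unscaled squared Euclidean distance) to a club's off, def, and spi values
nearest_league <- function(off, def, spi) {
  d2 <- (centroids$off - off)^2 +
        (centroids$def - def)^2 +
        (centroids$spi - spi)^2
  centroids$league[which.min(d2)]
}

nearest_league(1.66, 0.88, 64.04)  # Torino's ratings from the Serie A table
nearest_league(2.50, 0.69, 82.00)  # Internazionale's ratings from the same table
```

With these inputs, Torino is assigned to Serie A but Internazionale is assigned to the Premier League. Unscaled SPI dominates the distance, and a top club sits well above its own league's average, so a serious attempt would need scaled features and a proper classifier evaluated on held-out clubs.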