library(reticulate)
use_python("/usr/local/bin/python3.10")
The data I chose contains all the stats of the English Premier League 2021-22 season. The Premier League is the top tier of England’s football pyramid, with 20 teams battling it out for the honor of being crowned English champions. The data set includes individual player stats: team, jersey number, name, position, number of appearances, number of substitutions, number of goals, number of penalties, and yellow and red cards.
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches
import warnings
import textwrap
The code provided below contains the source of the all_players_stats csv file and how I read it in using pandas.
#file source: https://www.kaggle.com/datasets/azminetoushikwasi/epl-21-22-matches-players/code
path = "Documents/Loyola_DS/DS736_DataVisualization/"
filename = "all_players_stats.csv"
df = pd.read_csv(filename)
The code below shows the first 10 records of the all_players dataset.
df.head(10)
## Team JerseyNo Player ... Penalties YellowCards RedCards
## 0 Arsenal 7 Bukayo Saka ... 2 6.0 0.0
## 1 Arsenal 6 Gabriel ... 0 7.0 1.0
## 2 Arsenal 32 Aaron Ramsdale ... 0 1.0 0.0
## 3 Arsenal 4 Ben White ... 0 3.0 0.0
## 4 Arsenal 8 Martin Odegaard ... 0 4.0 0.0
## 5 Arsenal 34 Granit Xhaka ... 0 10.0 2.0
## 6 Arsenal 35 Gabriel Martinelli ... 1 2.0 1.0
## 7 Arsenal 5 Thomas Partey ... 0 6.0 1.0
## 8 Arsenal 10 Emile Smith Rowe ... 0 1.0 0.0
## 9 Arsenal 3 Kieran Tierney ... 0 0.0 0.0
##
## [10 rows x 10 columns]
The code below shows the columns that the all_players dataset is composed of.
df.columns
## Index(['Team', 'JerseyNo', 'Player', 'Position', 'Apearances', 'Substitutions',
## 'Goals', 'Penalties', 'YellowCards', 'RedCards'],
## dtype='object')
The code below shows the data types of each column in the data frame.
df.dtypes
## Team object
## JerseyNo int64
## Player object
## Position object
## Apearances int64
## Substitutions int64
## Goals int64
## Penalties int64
## YellowCards float64
## RedCards float64
## dtype: object
I decided to add a new column, Avg_GPA (average goals per appearance), to normalize each player's goals by the number of games they appeared in.
df['Avg_GPA'] = df['Goals']/df['Apearances']
Show the first 5 rows of the dataframe with our new Avg_GPA or “Average Goals per Appearance” column.
df.head(5)
## Team JerseyNo Player ... YellowCards RedCards Avg_GPA
## 0 Arsenal 7 Bukayo Saka ... 6.0 0.0 0.300000
## 1 Arsenal 6 Gabriel ... 7.0 1.0 0.135135
## 2 Arsenal 32 Aaron Ramsdale ... 1.0 0.0 0.000000
## 3 Arsenal 4 Ben White ... 3.0 0.0 0.000000
## 4 Arsenal 8 Martin Odegaard ... 4.0 0.0 0.194444
##
## [5 rows x 11 columns]
Review the number of players present in the dataframe:
cnt = df.Player.count()
print("There are ", cnt, "players in the English Premier league")
## There are 623 players in the English Premier league
Review if any columns have null or NA records. We need to verify this before doing our analysis and creating visualizations.
df.isna().sum()
## Team 0
## JerseyNo 0
## Player 0
## Position 0
## Apearances 0
## Substitutions 0
## Goals 0
## Penalties 0
## YellowCards 0
## RedCards 0
## Avg_GPA 54
## dtype: int64
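The 54 missing Avg_GPA values are presumably players with zero appearances and zero goals, since pandas returns NaN for 0/0 (the minimum of Apearances in the summary statistics is 0). The sketch below illustrates this on a made-up toy frame; the numbers and the `Avg_GPA_safe` column name are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last two players never appeared in a match
toy = pd.DataFrame({"Goals": [3, 0, 0, 1], "Apearances": [10, 5, 0, 0]})

# Plain division: 0/0 becomes NaN, while 1/0 becomes inf
toy["Avg_GPA"] = toy["Goals"] / toy["Apearances"]
n_nan = int(toy["Avg_GPA"].isna().sum())
n_inf = int(np.isinf(toy["Avg_GPA"]).sum())

# Treat zero appearances as missing so every such row is NaN, never inf
toy["Avg_GPA_safe"] = toy["Goals"] / toy["Apearances"].replace(0, np.nan)
n_nan_safe = int(toy["Avg_GPA_safe"].isna().sum())

print(toy)
```

pandas skips NaN values by default in methods such as `mean()` and `describe()`, which is why the Avg_GPA count in the summary statistics is lower than the number of rows.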
Retrieve summary statistics for the dataframe.
df.describe()
## JerseyNo Apearances ... RedCards Avg_GPA
## count 623.000000 623.000000 ... 623.000000 569.000000
## mean 22.597111 16.861958 ... 0.086677 0.125530
## std 18.719450 13.950425 ... 0.303568 0.209319
## min 1.000000 0.000000 ... 0.000000 0.000000
## 25% 9.000000 3.000000 ... 0.000000 0.000000
## 50% 18.000000 16.000000 ... 0.000000 0.047619
## 75% 30.000000 27.500000 ... 0.000000 0.166667
## max 97.000000 54.000000 ... 2.000000 2.000000
##
## [8 rows x 8 columns]
Below we will begin to examine the relationship of the columns to one another in our dataframe.
Green boxes, or boxes containing values close to 1, represent a strong positive relationship between columns. Orange or red boxes, or boxes containing values close to -1, represent a strong negative relationship between columns. As seen in the correlogram below, appearances and yellow cards have a fairly strong positive relationship (.68). This is to be expected, as (in most cases) you can’t get a yellow card if you are not playing in the game. Another relationship of note is the moderate negative relationship between appearances and jersey number (-.48): as jersey numbers go up, appearances tend to go down. This is consistent with the traditional notion that starters wear jersey numbers 1-11 while substitutes wear higher numbers.
# Plot
plt.figure(figsize=(12,10), dpi= 80);
corr = df.corr(numeric_only=True);  # correlate numeric columns only; compute once and reuse
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdYlGn', center=0, annot=True);
# Decorations
plt.title('Correlogram of EPL Player Stats', fontsize=22);
plt.xticks(rotation=45, fontsize=12);
plt.yticks(rotation=45, fontsize=12);
plt.show();
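The values in the correlogram are Pearson correlation coefficients, which is the default method of `DataFrame.corr`. As a minimal sketch, assuming two made-up columns `x` and `y` (not taken from the dataset), the coefficient can be computed by hand and checked against pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns, for illustration only
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient from pandas
r_pandas = toy["x"].corr(toy["y"])

print(round(r_manual, 4), round(r_pandas, 4))
```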
Here we can continue to see the correlation and examine the role that team plays in these relationships.
sns.pairplot(df[['Team','Apearances','JerseyNo','Penalties','Avg_GPA']],hue = 'Team')
The ordered bar chart below displays the average number of goals per player for each team (the code takes the mean of each player's Goals within a team, so this is goals per player rather than goals per match). The top 3 teams in the EPL standings for the 2021-2022 season were: 1. Man City, 2. Liverpool, and 3. Chelsea. As we might expect, these three teams (in order of their standings) have the highest averages.
After review I observed the following about the top 3 teams in the league:
* Man City has a majority of their players as Midfielders / Forwards.
* Man City has the most defensive players out of all of the teams in the EPL.
* Chelsea has the most midfielders out of all of the teams in the EPL.
* Liverpool has the most defensive midfielders in the league.
I observed the following regarding the bottom 3 teams in the league:
* Norwich and Burnley seem to have two of the smallest squads in the EPL.
* Norwich has the fewest defensive midfielders.
* Burnley is the only team that has a player who plays both defense and forward.
# ignore warning produced
warnings.filterwarnings("ignore", category=DeprecationWarning);
# Prepare Data
df2 = df.groupby('Team')[['Goals']].mean();  # mean goals per player, by team
df2.sort_values('Goals', inplace=True);
df2.reset_index(inplace=True);
fig, ax = plt.subplots(figsize=(16,10), facecolor='white', dpi= 80);
ax.vlines(x=df2.index, ymin=0, ymax=df2.Goals, color='firebrick', alpha=0.7, linewidth=20);
# Annotate Text
for i, Goals in enumerate(df2.Goals):
    ax.text(i, Goals+0.15, round(Goals, 1), horizontalalignment='center');
# Title, Label, Ticks and Ylim
ax.set_title('Average Goals per Player by Team', fontdict={'size':22});
ax.set(ylabel='Goals per Player', ylim=(0, 5));
plt.xticks(df2.index, df2.Team.str.title(), rotation=60, horizontalalignment='right', fontsize=12);
# Add patches to color the X axis labels
## x-axis tick labels (left to right): Norwich City, Watford, Brighton And Hove Albion, Newcastle United, Burnley, Everton, Leeds United, Wolverhampton Wanderers, Aston Villa, Brentford, Manchester United, Southampton, Crystal Palace, Arsenal, Tottenham Hotspur, West Ham United, Leicester City, Chelsea, Liverpool, Manchester City
p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure);
p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure);
fig.add_artist(p1);
fig.add_artist(p2);
plt.show();
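The aggregation behind the bar chart can be sketched on a toy frame. Taking the mean of Goals within each team gives goals per player, which is a different quantity from the team's total goals. The team labels and numbers below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two teams with different squad sizes
toy = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "B"],
    "Goals": [10, 0, 3, 3, 0],
})

# Mean goals per player within each team, sorted ascending like the chart
per_player = toy.groupby("Team")["Goals"].mean().sort_values()

# Total goals per team, for comparison
totals = toy.groupby("Team")["Goals"].sum()

print(per_player)
print(totals)
```

Because the mean divides by squad size, a team with many low-scoring squad players can rank lower on this chart than its goal total alone would suggest.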
Below is a stacked histogram of each team by position. I wanted to examine the distribution of players across position to see if it varied by team, and if the three most successful teams had similar player compositions.
# Prepare data
x_var = 'Team'
groupby_var = 'Position'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [group[x_var].values.tolist() for _, group in df_agg]  # one list of team names per position
# Draw
plt.figure(figsize=(16,9), dpi= 80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
# Name the third return value hist_patches so it does not shadow the matplotlib.patches import
n, bins, hist_patches = plt.hist(vals, df[x_var].nunique(), stacked=True, density=False, color=colors[:len(vals)], orientation='horizontal')
# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.ylabel(x_var)
plt.xlabel("Frequency")
plt.xlim(0, 40);
plt.show()
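Counts such as "which team has the most midfielders" can also be read from a table rather than from the stacked bars. The sketch below uses `pd.crosstab` on a made-up roster; the team and position labels are hypothetical, and this is one possible way to check the observations, not the method used for the chart above.

```python
import pandas as pd

# Hypothetical toy roster, for illustration only
toy = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B"],
    "Position": ["Forward", "Midfielder", "Midfielder", "Defender", "Midfielder"],
})

# Rows are teams, columns are positions, cells are player counts
counts = pd.crosstab(toy["Team"], toy["Position"])

# Team with the most midfielders, and the squad size of each team
most_midfielders = counts["Midfielder"].idxmax()
squad_sizes = counts.sum(axis=1)

print(counts)
```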
# Draw Plot
plt.figure(figsize=(13,10), dpi= 80);
sns.boxplot(x='Position', y='Goals', data=df, notch=False)
# Add N Obs inside boxplot (optional)
def add_n_obs(df,group_col,y):
    # Median and number of observations of y for each group, keyed by group label
    medians = df.groupby(group_col)[y].median()
    n_obs = df.groupby(group_col)[y].size()
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    # Look counts up by label so they match the order of the boxes on the axis
    for x, xticklabel in enumerate(xticklabels):
        plt.text(x, medians[xticklabel]*1.01, "#obs : "+str(n_obs[xticklabel]), horizontalalignment='center',va='bottom', fontdict={'size':11}, color='black')
add_n_obs(df,group_col='Position',y='Goals')
# Decoration
plt.title('Box Plot of Goals by Position', fontsize=22)
plt.ylim(-1, 35);
plt.xticks(rotation=45);
plt.show();
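The two numbers annotated on each box, the median and the number of observations, can also be tabulated directly. This is a minimal sketch on made-up data; the positions and goal counts are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: four players per position
toy = pd.DataFrame({
    "Position": ["Forward"] * 4 + ["Defender"] * 4,
    "Goals": [0, 2, 4, 10, 0, 0, 1, 1],
})

# Number of players and median goals for each position
summary = toy.groupby("Position")["Goals"].agg(["count", "median"])

print(summary)
```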
It was interesting to see the role that position and team played in goals scored and how that related to the actual EPL standings for that specific year. It was fun to combine my knowledge and assumptions about soccer with the actual data provided for each player. For my analysis, I chose to focus on the team and position level, but in the future it would be fun to dive deeper by using additional aggregations or pairing this data with other data sets for a fuller picture. Future analyses on this data set could include looking at a specific team to see which positions and players made the biggest impact on goals scored, or comparing players across positions to see which players are the most effective in their offensive roles.