library(reticulate)
  use_python("/usr/local/bin/python3.10")

Introduction

The data I chose contains player statistics for the 2021-22 English Premier League season. The Premier League is the top tier of England’s football pyramid, with 20 teams competing to be crowned English champions. The data set includes individual player stats: team, jersey number, name, position, number of appearances, number of substitutions, number of goals, number of penalties, and yellow and red cards.

Data manipulation

Import Block

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches
import warnings
import textwrap

Read in file

The code provided below contains the source of the all_players_stats csv file and how I read it in using pandas.

#file source: https://www.kaggle.com/datasets/azminetoushikwasi/epl-21-22-matches-players/code
path = "Documents/Loyola_DS/DS736_DataVisualization/"
filename = "all_players_stats.csv"

df = pd.read_csv(filename)
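
Note that `path` is defined above but the read uses only `filename`, which works when the CSV sits in the working directory. If the file instead lives in the folder named by `path`, one option is to join the two with `pathlib`. This is a sketch under that assumption; `csv_path` is a name introduced here for illustration.

```python
from pathlib import Path

path = "Documents/Loyola_DS/DS736_DataVisualization/"
filename = "all_players_stats.csv"

# csv_path is a hypothetical name; Path inserts the separator between folder and file
csv_path = Path(path) / filename
print(csv_path.as_posix())

# Reading would then look like: df = pd.read_csv(csv_path)
```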

Review data frame

The code below shows a brief summary (the top 10 records) of the data that can be found in the all_players dataset.

df.head(10)
##       Team  JerseyNo              Player  ... Penalties  YellowCards  RedCards
## 0  Arsenal         7         Bukayo Saka  ...         2          6.0       0.0
## 1  Arsenal         6             Gabriel  ...         0          7.0       1.0
## 2  Arsenal        32      Aaron Ramsdale  ...         0          1.0       0.0
## 3  Arsenal         4           Ben White  ...         0          3.0       0.0
## 4  Arsenal         8     Martin Odegaard  ...         0          4.0       0.0
## 5  Arsenal        34        Granit Xhaka  ...         0         10.0       2.0
## 6  Arsenal        35  Gabriel Martinelli  ...         1          2.0       1.0
## 7  Arsenal         5       Thomas Partey  ...         0          6.0       1.0
## 8  Arsenal        10    Emile Smith Rowe  ...         0          1.0       0.0
## 9  Arsenal         3      Kieran Tierney  ...         0          0.0       0.0
## 
## [10 rows x 10 columns]

The code below shows the columns that the all_players dataset is composed of.

df.columns
## Index(['Team', 'JerseyNo', 'Player', 'Position', 'Apearances', 'Substitutions',
##        'Goals', 'Penalties', 'YellowCards', 'RedCards'],
##       dtype='object')

The code below shows the data types of each column in the data frame.

df.dtypes
## Team              object
## JerseyNo           int64
## Player            object
## Position          object
## Apearances         int64
## Substitutions      int64
## Goals              int64
## Penalties          int64
## YellowCards      float64
## RedCards         float64
## dtype: object

Manipulate columns

I decided to add a new column, Average Goals per Appearance (Avg_GPA), to normalize the number of goals scored by the number of games played.

df['Avg_GPA'] = df['Goals']/df['Apearances']

Show the first 5 rows of the dataframe with our new Avg_GPA, or “Average Goals per Appearance”, column.

df.head(5)
##       Team  JerseyNo           Player  ... YellowCards  RedCards   Avg_GPA
## 0  Arsenal         7      Bukayo Saka  ...         6.0       0.0  0.300000
## 1  Arsenal         6          Gabriel  ...         7.0       1.0  0.135135
## 2  Arsenal        32   Aaron Ramsdale  ...         1.0       0.0  0.000000
## 3  Arsenal         4        Ben White  ...         3.0       0.0  0.000000
## 4  Arsenal         8  Martin Odegaard  ...         4.0       0.0  0.194444
## 
## [5 rows x 11 columns]

Aggregate Values

Review the number of players present in the dataframe:

cnt = df.Player.count()
print(f"There are {cnt} players in the English Premier League")
## There are 623 players in the English Premier League

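Check whether any columns have null or NA records. We need to verify this before doing our analysis and creating visualizations. Only Avg_GPA has missing values: its 54 NaN entries arise where a player with zero appearances produces a 0/0 division.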

df.isna().sum()
## Team              0
## JerseyNo          0
## Player            0
## Position          0
## Apearances        0
## Substitutions     0
## Goals             0
## Penalties         0
## YellowCards       0
## RedCards          0
## Avg_GPA          54
## dtype: int64
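
The missing Avg_GPA values can be made explicit rather than left as a side effect of dividing by zero. The sketch below uses a small made-up frame (the rows are illustrative, not taken from the dataset) and a hypothetical helper, `goals_per_appearance`, that replaces zero appearances with NaN before dividing.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; not taken from all_players_stats.csv
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Goals": [3, 0, 0],
    "Apearances": [10, 5, 0],  # column name spelled as in the dataset
})

def goals_per_appearance(frame):
    """Hypothetical helper: goals per appearance, NaN where a player never appeared."""
    appearances = frame["Apearances"].replace(0, np.nan)
    return frame["Goals"] / appearances

toy["Avg_GPA"] = goals_per_appearance(toy)

# Count the players whose ratio is undefined because they never appeared
n_missing = int(toy["Avg_GPA"].isna().sum())
print(toy)
print("Players without an appearance:", n_missing)
```

Whether to keep these rows as NaN, drop them, or fill them with 0 depends on the question being asked; NaN keeps them out of the mean reported by `describe()`.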

Retrieve summary statistics for the dataframe.

df.describe()
##          JerseyNo  Apearances  ...    RedCards     Avg_GPA
## count  623.000000  623.000000  ...  623.000000  569.000000
## mean    22.597111   16.861958  ...    0.086677    0.125530
## std     18.719450   13.950425  ...    0.303568    0.209319
## min      1.000000    0.000000  ...    0.000000    0.000000
## 25%      9.000000    3.000000  ...    0.000000    0.000000
## 50%     18.000000   16.000000  ...    0.000000    0.047619
## 75%     30.000000   27.500000  ...    0.000000    0.166667
## max     97.000000   54.000000  ...    2.000000    2.000000
## 
## [8 rows x 8 columns]

Reviewing Visualizations

Below we will begin to examine the relationship of the columns to one another in our dataframe.

Correlogram

Green boxes, or boxes containing values close to 1, represent a strong positive relationship between columns. Orange boxes, or boxes containing values close to -1, represent a strong negative relationship between columns. As seen in the correlogram below, Appearances and YellowCards have a strong positive relationship (0.68). This is to be expected since, in most cases, you can’t get a yellow card if you are not playing in the game. Another relationship of note is the negative relationship between Appearances and JerseyNo (-0.48): as jersey numbers go up, appearances tend to go down. This is consistent with the traditional notion that starters wear jersey numbers 1-11 while substitutes wear higher numbers.

# Plot
plt.figure(figsize=(12,10), dpi= 80);
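# Compute the correlation matrix once, on numeric columns only (pandas 2.x raises on text columns otherwise)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdYlGn', center=0, annot=True);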

# Decorations
plt.title('Correlogram of EPL Player Stats', fontsize=22);
plt.xticks(rotation=45, fontsize=12);
plt.yticks(rotation=45, fontsize=12);
plt.show();
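
To make the numbers in the correlogram concrete, the sketch below computes a Pearson correlation matrix on a small made-up frame. The values are illustrative and chosen only to mimic the direction of the relationships discussed above; they are not from the dataset.

```python
import pandas as pd

# Illustrative rows only; not taken from all_players_stats.csv
toy = pd.DataFrame({
    "Team": ["X", "X", "Y", "Y"],
    "JerseyNo": [1, 9, 20, 40],
    "Apearances": [30, 28, 10, 2],
    "YellowCards": [5.0, 6.0, 2.0, 0.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations (the default method)
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

Each cell is the correlation between the row column and the column column, so the matrix is symmetric with 1.0 on the diagonal.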

Correlation Continued

Here we can continue to see the correlation and examine the role that team plays in these relationships.

sns.pairplot(df[['Team','Apearances','JerseyNo','Penalties','Avg_GPA']],hue = 'Team')

Ordered Bar Chart

The ordered bar chart below displays the average number of goals per player for each team (the mean of the Goals column across each squad). The top 3 teams in the EPL standings for the 2021-2022 season were: 1. Man City, 2. Liverpool, and 3. Chelsea. As we might expect, these three teams (in order of their standings) have the highest averages.

After review, I observed the following about the top 3 teams in the league:

* Man City has a majority of its players as midfielders / forwards.
* Man City has the most defensive players of any team in the EPL.
* Chelsea has the most midfielders of any team in the EPL.
* Liverpool has the most defensive midfielders in the league.

I observed the following about the bottom 3 teams in the league:

* Norwich and Burnley appear to have two of the smallest squads in the EPL.
* Norwich has the fewest defensive midfielders.
* Burnley is the only team with a player who plays both defense and forward.

# ignore warning produced
warnings.filterwarnings("ignore", category=DeprecationWarning);

# Prepare Data
# Mean of Goals per team; selecting the column explicitly avoids the numeric_only FutureWarning
df2 = df.groupby('Team')[['Goals']].mean();
df2.sort_values('Goals', inplace=True);
df2.reset_index(inplace=True);

fig, ax = plt.subplots(figsize=(16,10), facecolor='white', dpi= 80);
ax.vlines(x=df2.index, ymin=0, ymax=df2.Goals, color='firebrick', alpha=0.7, linewidth=20);

# Annotate Text
for i, Goals in enumerate(df2.Goals):
    ax.text(i, Goals+0.15, round(Goals, 1), horizontalalignment='center');


# Title, Label, Ticks and Ylim
ax.set_title('Average Goals per Player by Team', fontdict={'size':22});
ax.set(ylabel='Goals per Player', ylim=(0, 5));
plt.xticks(df2.index, df2.Team.str.title(), rotation=60, horizontalalignment='right', fontsize=12);

# Add patches to color the X axis labels
p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure);
p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure);
fig.add_artist(p1);
fig.add_artist(p2);
plt.show();
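
The aggregation behind the bar chart (group by team, average the Goals column, sort ascending) can be sketched on a small made-up frame. The teams and goal counts below are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative rows only; one row per player
toy = pd.DataFrame({
    "Team": ["X", "X", "Y", "Y", "Z"],
    "Goals": [10, 2, 1, 1, 4],
})

# Mean goals per player within each team, ordered from lowest to highest
team_means = (
    toy.groupby("Team")["Goals"]
       .mean()
       .sort_values()
       .reset_index()
)
print(team_means)
```

Because each row is a player, this mean depends on squad size as well as goals scored; dividing a team's total goals by matches played would be a different measure.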

Histogram

Below is a stacked histogram of each team by position. I wanted to examine the distribution of players across positions to see if it varied by team, and whether the three most successful teams had similar player compositions.

# Prepare data
x_var = 'Team'
groupby_var = 'Position'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
# One list of team names per position; groupby iterates in sorted key order
vals = [group[x_var].values.tolist() for _, group in df_agg]

# Draw
plt.figure(figsize=(16,9), dpi= 80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
# Name the returned artists `bars` so the matplotlib.patches import is not shadowed
n, bins, bars = plt.hist(vals, df[x_var].nunique(), stacked=True, density=False, color=colors[:len(vals)], orientation='horizontal')

# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.ylabel(x_var)
plt.xlabel("Frequency")
plt.xlim(0, 40);
plt.show()
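
The counts that the stacked histogram draws can also be read as a table. A minimal sketch with `pd.crosstab` on a made-up frame follows; the position labels here are illustrative and may not match the labels used in the dataset.

```python
import pandas as pd

# Illustrative rows only; one row per player
toy = pd.DataFrame({
    "Team": ["X", "X", "X", "Y", "Y"],
    "Position": ["Defender", "Forward", "Forward", "Defender", "Midfielder"],
})

# Rows are teams, columns are positions, cells are player counts
counts = pd.crosstab(toy["Team"], toy["Position"])
print(counts)
```

A table like this makes it easier to check claims such as "team X has the most forwards" than reading bar segments off the chart.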

Box plot

# Draw Plot
plt.figure(figsize=(13,10), dpi= 80);
sns.boxplot(x='Position', y='Goals', data=df, notch=False)

# Add N Obs inside boxplot (optional)
def add_n_obs(df,group_col,y):
    medians_dict = {grp[0]:grp[1][y].median() for grp in df.groupby(group_col)}
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    n_obs = df.groupby(group_col)[y].size().values
    for (x, xticklabel), n_ob in zip(enumerate(xticklabels), n_obs):
        plt.text(x, medians_dict[xticklabel]*1.01, "#obs : "+str(n_ob), horizontalalignment='center',va='bottom', fontdict={'size':11}, color='black')

add_n_obs(df,group_col='Position',y='Goals')    

# Decoration
plt.title('Box Plot of Goals by Position', fontsize=22)
plt.ylim(-1, 35);
plt.xticks(rotation=45);
plt.show();
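
The `add_n_obs` helper above annotates each box with its group size at the median. The same two quantities can be computed directly with a groupby, sketched here on made-up data (the rows are illustrative, not taken from the dataset).

```python
import pandas as pd

# Illustrative rows only; one row per player
toy = pd.DataFrame({
    "Position": ["Forward", "Forward", "Forward", "Defender", "Defender"],
    "Goals": [12, 5, 1, 0, 2],
})

# Number of players and median goals for each position
summary = toy.groupby("Position")["Goals"].agg(["count", "median"])
print(summary)
```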

Final Thoughts

It was interesting to see the role that position and team played in goals scored and how that related to the actual EPL standings for that season. It was fun to combine my knowledge and assumptions about soccer with the actual data provided for each player. For my analysis, I chose to focus on the team and position level, but in the future it would be fun to dive deeper using additional aggregation or pairing with other data sets for a fuller picture. Future analyses on this data set could include looking at a specific team to see which positions and players made the biggest impact on goals scored in each game, or comparing players across positions to see which are the most effective in their offensive roles.