NBA Data - Python Visualizations

Analysis of NBA Players Dataset

This dataset includes biometric, biographic and basic box score features from the 1996 to 2021 NBA seasons. It includes demographic variables such as height, weight, place of birth, as well as biographical information such as team played for, draft year and draft round. Basic box score statistics include average number of points, rebounds, assists, and games played.

The goal of these visualizations is to discover patterns and correlations between players’ physical and demographic attributes and their game statistics throughout the NBA’s recent history, as the game’s philosophy continues to change.

First Visualization - Height / Average Points Per Game Scatter Plot

This scatter plot shows the relationship between a player’s height and average points per game. No distinct correlation between the two variables is apparent at the NBA level.

# Only pandas and matplotlib are used in the cells below.
import pandas as pd
import matplotlib.pyplot as plt

path = "//apporto.com/dfs/LOYOLA/Users/cmhatton_loyola/Desktop/"

filename = "basketball data.csv"

df = pd.read_csv(path + filename, skiprows = 0)

df2 = df[['player_name', 'player_height','pts',]]

def cm_to_ft_in(height_cm):
    """Convert a height in centimeters to a feet-and-inches label."""
    # Missing or non-numeric heights get a placeholder label.
    if pd.isna(height_cm):
        return "N/A"
    try:
        total_in = height_cm / 2.54
    except TypeError:
        return "N/A"
    # Round before splitting so the inches part can never display as 12.0.
    total_in = round(total_in, 1)
    height_ft = int(total_in // 12)
    height_in = round(total_in % 12, 1)
    return f"{height_ft}'{height_in}\""
  
df2_sorted = df2.sort_values(by='player_height')

plt.figure(figsize=(18, 10))

# A single fixed color is used, so no colormap is needed.
plt.scatter(df2_sorted['player_height'], df2_sorted['pts'],
            c='orange', edgecolors='black', s=50)

plt.title('Average Points Per Game by Player Height', fontsize=18)
plt.xlabel('Player Height (cm)', fontsize=14)
plt.ylabel('Avg PPG', fontsize=14)

plt.show()
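
The visual impression of no clear relationship can be checked with a number. The sketch below computes the Pearson correlation coefficient between two columns with pandas. The toy values and the helper name `height_points_correlation` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def height_points_correlation(frame, height_col="player_height", points_col="pts"):
    """Return the Pearson correlation coefficient between two numeric columns."""
    # Drop rows where either value is missing so both columns stay aligned.
    pairs = frame[[height_col, points_col]].dropna()
    return pairs[height_col].corr(pairs[points_col])


# Made-up heights (cm) and scoring averages, for illustration only.
toy = pd.DataFrame({
    "player_height": [180, 190, 200, 210, 220],
    "pts": [10, 12, 8, 11, 9],
})

r = height_points_correlation(toy)
print(f"Pearson r = {r:.2f}")  # prints "Pearson r = -0.30" for the toy data
```

On the real data the call would be `height_points_correlation(df2)`. A coefficient close to 0 would support the reading of the scatter plot above.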

Second Visualization - Non-USA Draft Picks Bar Chart

This bar chart shows the number of players drafted from outside the US, by team. Bar colors are relative to the mean across teams: green marks teams more than 1% above the mean, light blue marks teams more than 1% below it, and light coral (pink) marks teams within 1% of it.

The San Antonio Spurs have drafted the most players from outside the US since 1996.


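# Keep only the columns needed for the country breakdown.
country_df = df[['player_name', 'team_abbreviation', 'draft_year', 'country']]
# Players whose listed country is anything other than 'USA'.
non_usa_df = country_df[country_df['country'] != 'USA']
# Count rows per team, largest first. If the data has one row per player per
# season, this tallies player-seasons rather than unique players.
count_by_team = non_usa_df.groupby('team_abbreviation').size()
count_by_team = count_by_team.sort_values(ascending=False)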

def pick_colors_acc_to_mean_count(count_by_team):
    colors = []
    avg = count_by_team.mean()
    for each in count_by_team:
        if each > avg*1.01:
            colors.append('green')
        elif each < avg*0.99:
            colors.append('lightblue')
        else:
            colors.append('lightcoral')
    return colors

my_colors1 = pick_colors_acc_to_mean_count(count_by_team)

plt.figure(figsize=(18,10))

plt.bar(count_by_team.index, count_by_team.values, color=my_colors1)

plt.title('Number of Non-USA Players Drafted by Team', size=20)
plt.xlabel('Team Abbreviation', size = 18)
plt.ylabel('Number of Non-USA Players', size = 18)

plt.show()
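
Counting rows and counting players give different totals whenever a player appears on more than one row. The sketch below contrasts the two on a made-up table. It assumes the CSV holds one row per player per season, which the source does not state explicitly, and the names `toy`, `row_counts`, and `unique_counts` are hypothetical.

```python
import pandas as pd

# Made-up rows: each row stands for one player in one season (an assumption).
toy = pd.DataFrame({
    "player_name": ["A", "A", "B", "C", "C", "C", "D"],
    "team_abbreviation": ["SAS", "SAS", "SAS", "DAL", "DAL", "SAS", "SAS"],
    "country": ["France", "France", "Argentina",
                "Germany", "Germany", "Germany", "USA"],
})

non_usa = toy[toy["country"] != "USA"]

# size() counts every row, so a player with two seasons is counted twice.
row_counts = non_usa.groupby("team_abbreviation").size()

# nunique() counts each player name once per team.
unique_counts = non_usa.groupby("team_abbreviation")["player_name"].nunique()

print(row_counts.to_dict())     # {'DAL': 2, 'SAS': 4}
print(unique_counts.to_dict())  # {'DAL': 1, 'SAS': 3}
```

Which count is appropriate depends on the question being asked. The unique count is closer to "number of players", while the row count weights long careers more heavily.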

Third Visualization - Height / Rebounds Line Chart

This line chart shows the average rebounds per game by player height. In contrast to points scored, rebounds show a clear upward trend as player height increases.


# Copy the slice so that adding columns later does not trigger a SettingWithCopyWarning.
rebound_df = df[['player_name', 'player_height','reb', 'season']].copy()

# Used ChatGPT to troubleshoot lambda function to convert height from cm to ft and inches and store the result in a new column called 'player_height_ft'

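# Round each height to the nearest whole inch. Grouping on this number keeps the
# heights in numeric order; grouping on the label strings would sort them
# alphabetically and place 6'10" before 6'2" on the x-axis.
rebound_df['height_in'] = (rebound_df['player_height'] / 2.54).round()

# Rows with a missing height are left out of the groups automatically.
grouped_df = rebound_df.groupby('height_in')['reb'].mean().reset_index()

# Build the feet-and-inches labels after the groups are already in order.
grouped_df['player_height_ft'] = grouped_df['height_in'].apply(lambda x: f"{int(x // 12)}'{int(x % 12)}\"")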

fig = plt.figure(figsize = (18,10))

plt.plot(grouped_df['player_height_ft'], grouped_df['reb'], '-o', linewidth=2)

plt.xlabel('Player Height', fontsize=18 )
plt.ylabel('Avg. Rebounds Per Game', fontsize=18)
plt.title('Average Rebounds per Player Height', fontsize=22)

plt.scatter(grouped_df['player_height_ft'], grouped_df['reb'], color='red')

plt.show()
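
Feet-and-inches labels are strings, so sorting or grouping on them follows alphabetical order rather than height order. The sketch below shows the difference on a few made-up heights. The helper `ft_in_label` is hypothetical and exists only for this illustration.

```python
def ft_in_label(total_inches):
    """Format a whole number of inches as a feet-and-inches label."""
    feet, inches = divmod(int(total_inches), 12)
    return f"{feet}'{inches}\""


# Made-up heights in inches, deliberately out of order.
heights_in = [74, 82, 71, 78]
labels = [ft_in_label(h) for h in heights_in]

# Sorting the label strings compares them character by character.
alphabetical = sorted(labels)

# Sorting the numbers first and labelling afterwards keeps height order.
numeric = [ft_in_label(h) for h in sorted(heights_in)]

print(alphabetical)  # 6'10" lands before 6'2"
print(numeric)       # 5'11", 6'2", 6'6", 6'10"
```

For a line chart, the numeric ordering is the one that makes the trend readable, which is why the sort key should be the height itself and not its label.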

Fourth Visualization - College Donut Chart

This donut chart shows the 10 colleges that have produced the most NBA-drafted players.

The University of Kentucky and Duke University are the two most frequent producers of NBA players.


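# Exclude players with no college before counting. Depending on the pandas version,
# the text 'None' is either kept as a string or read as a missing value, so handle both.
colleges = df['college'].dropna()
college_counts = colleges[colleges != 'None'].value_counts().nlargest(10)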

college_df = pd.DataFrame({'college': college_counts.index,
                           'count': college_counts.values})

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1, 1, 1)

colors_dict = {
    'Duke': 'blue',
    'Kentucky': 'darkblue',
    'Kansas': 'red',
    'North Carolina': 'lightblue',
    'UCLA': 'yellow',
    'Arizona': 'crimson',
    'Michigan': 'gold',
    'Georgia Tech': 'gray',
    'Connecticut': 'navy',
    'Florida': 'orange'
}

def format_pct_count(pct, count):
    return f'{pct:1.1f}%\n({count})'

counts = college_df['count']
labels = college_df['college']
# Fall back to light gray for any college that is missing from colors_dict.
colors = [colors_dict.get(name, 'lightgray') for name in labels]

ax.pie(counts, labels=labels, colors=colors,
       pctdistance=0.85, labeldistance=1.1,
       wedgeprops=dict(edgecolor='white'), textprops={'fontsize': 18},
       autopct=lambda pct: format_pct_count(pct, int(round(pct / 100.0 * sum(counts), 0))),
       startangle=90)

# Draw a white circle over the center of the pie to create the donut hole.
hole = plt.Circle((0,0), 0.3, fc='white')
ax.add_artist(hole)

ax.set_title('Distribution of Drafted Players by College', fontsize = 20)

ax.text(0,0, 'Total from Top Colleges\n' + str(sum(college_df['count'])), size = 12, ha='center', va='center')

ax.axis('equal')

plt.tight_layout()

plt.show()
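
The wedge labels combine a percentage with a count. matplotlib passes only the percentage to the `autopct` callback, so the count has to be recovered as percentage / 100 × total. The sketch below works through that arithmetic without drawing anything. The counts are made up, and this variant of `format_pct_count` takes the total instead of a precomputed count.

```python
def format_pct_count(pct, total):
    """Build a wedge label from a percentage and the total it refers to."""
    # Recover the underlying count from the percentage, rounding to a whole number.
    count = int(round(pct / 100.0 * total))
    return f"{pct:1.1f}%\n({count})"


# Made-up counts for three wedges.
counts = [12, 9, 3]
total = sum(counts)

# Each wedge's percentage, computed the same way a pie chart would.
wedge_labels = [format_pct_count(100.0 * c / total, total) for c in counts]

for label in wedge_labels:
    print(label.replace("\n", " "))
# 50.0% (12)
# 37.5% (9)
# 12.5% (3)
```

Rounding matters here because the percentage is a float, and the product can land slightly above or below the whole number it came from.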

Fifth Visualization - Heatmap

This heatmap shows the average PPG per player for each team, by season, starting in 2001.

It uses 10 teams as a microcosm of the rest of the league.

It shows how a team’s average PPG per player can shift from season to season as its roster changes.


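teams = ['UTA', 'GSW', 'PHX', 'CHI', 'MIL', 'MIA', 'BOS', 'CLE', 'MEM', 'PHI']

# Keep the ten sample teams and the seasons after 2000-2001. The season filter is a
# string comparison, so it relies on the labels sorting in chronological order.
abv_filtered_df = df[df['team_abbreviation'].isin(teams) & (df['season'] > '2000-2001')]

# Mean points per game across each team's players, per season.
pts_df = abv_filtered_df.groupby(['team_abbreviation', 'season'])['pts'].mean().reset_index()

hm_df = pd.pivot_table(pts_df, index='season', columns='team_abbreviation', values='pts')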

# pandas and matplotlib are already imported at the top of the notebook.
import seaborn as sns
from matplotlib.ticker import FuncFormatter

fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1, 1, 1)

# Show colorbar ticks with one decimal place; casting to int would truncate a tick such as 7.5 to 7.
one_decimal_fmt = FuncFormatter(lambda x, p: f'{x:.1f}')

ax = sns.heatmap(hm_df, linewidth = 0.2, annot = True, cmap = 'coolwarm',
                 fmt= '.3f', annot_kws={'size' : 11},
                 cbar_kws = {'format' : one_decimal_fmt, 'orientation':'vertical'})

plt.title('Heatmap of Avg. Points per Player by Team and NBA Season', fontsize=20, pad=15)
plt.xlabel('Team Abbreviation', fontsize=18, labelpad=10)
plt.ylabel('Season', fontsize=18, labelpad=10)

ax.invert_yaxis()

plt.show()
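
The season filter compares strings, which works only as long as the labels sort in chronological order. The sketch below compares that approach with parsing the start year explicitly. It assumes season labels look like '2001-02', which the source does not show, and the helper `season_start_year` is hypothetical.

```python
def season_start_year(season):
    """Return the starting year of a season label such as '2001-02' as an int."""
    return int(season.split("-")[0])


# Made-up sample labels in the assumed 'YYYY-YY' format.
seasons = ["1999-00", "2000-01", "2001-02", "2020-21"]

# String comparison works character by character, left to right.
by_string = [s for s in seasons if s > "2000-2001"]

# Parsing the start year states the cutoff explicitly.
by_year = [s for s in seasons if season_start_year(s) >= 2001]

print(by_string)  # ['2001-02', '2020-21']
print(by_year)    # ['2001-02', '2020-21']
```

Both filters agree on these labels. The parsed version makes the intended cutoff visible and would keep working if the label format changed in a way that broke alphabetical ordering.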

Conclusion

In summary, the relationship between player demographics, physical attributes, and game statistics has fluctuated over time as trends in the game change. This data covers only a few decades of the NBA, and these fluctuations are expected to grow as the game’s landscape changes ever faster.