Introduction

For my data visualizations, I chose to look at Major League Baseball data from 2006-2017 and focused on home runs.

Dataset

There was a lot of information to go through, so I focused my visualizations either on the overall numbers or on the top home run hitters.

Findings

Looking at this data, one may not realize how many home runs there are to

Scatterplot

Home Runs by Ballpark

When comparing home runs by ballpark, I cleaned the data so that where the building stayed the same but the name changed, I renamed the ballpark to its 2017 name. If it was a new structure, I kept it in the data as its own ballpark, but if it was a field that hosted a single game (e.g. Fort Bragg), I removed it.
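As a rough illustration of that cleanup, the sketch below renames a ballpark and then drops venues that appear on very few dates. The rows are made up, and `MIN_DATES` is a hypothetical cutoff that is not part of the original analysis, which used a hand-written list of venues to omit. Because the data has one row per home run, the date count only covers dates on which at least one home run was hit.

```python
import pandas as pd

# Toy stand-in for the home run log: one row per home run (made-up rows).
toy = pd.DataFrame({
    "GAME_DATE": pd.to_datetime([
        "2006-04-03", "2006-04-04", "2012-05-01",
        "2016-07-03", "2017-06-10", "2017-06-11",
    ]),
    "BALLPARK": [
        "Jacobs Field", "Jacobs Field", "Progressive Field",
        "Fort Bragg", "Progressive Field", "Progressive Field",
    ],
})

# Map each historical name to the name the same building carried in 2017.
renames = {"Jacobs Field": "Progressive Field"}
toy["BALLPARK"] = toy["BALLPARK"].replace(renames)

# Count distinct dates with at least one home run at each venue.
dates_per_park = toy.groupby("BALLPARK")["GAME_DATE"].nunique()

# MIN_DATES is a hypothetical cutoff chosen for this toy data only.
MIN_DATES = 2
keep = dates_per_park[dates_per_park >= MIN_DATES].index
cleaned = toy[toy["BALLPARK"].isin(keep)]

print(dates_per_park)
print(cleaned["BALLPARK"].unique())
```

A count-based filter like this is easier to audit than a hard-coded list, but the cutoff would need checking against the real schedule before relying on it.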

One of the interesting things in the data is that ballparks known as “hitter’s parks,” such as Yankee Stadium or Oriole Park at Camden Yards, are colored as you would expect. Citizens Bank Park is usually considered a hitter’s park too, but you can almost see the team’s decline after 2011, when the number of home runs dropped considerably (that’s when I started working there, so I can confirm there were significantly fewer home runs). Other parks, like O.co Coliseum and AT&T Park, clearly show that they are more challenging places to hit home runs.


import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/phill/Anaconda3/Library/plugins/platforms'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import warnings

path = "C:/Users/phill/OneDrive/Documents/Loyola/Data_Visualization/Python/"
filename = "MLBHR_2006-2017.csv"

df = pd.read_csv(path + filename, usecols = ['GAME_DATE', 'BATTER', 'BATTER_TEAM', 'PITCHER', 'PITCHER_TEAM', 'INNING', 'BALLPARK'])

df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'], format="%Y-%m-%d")

df['Year'] = df['GAME_DATE'].dt.year
df['Month'] = df['GAME_DATE'].dt.month
df['MonthName'] = df['GAME_DATE'].dt.strftime('%B')

#Combined Ballparks where it was the same building, but changed names; replaced with the 2017 name
df = df.replace({'Oakland-Alameda County Coliseum': 'O.co Coliseum', 
                 'McAfee Coliseum': 'O.co Coliseum', 
                 'Dolphin Stadium': 'Sun Life Stadium',
                 'Land Shark Stadium':'Sun Life Stadium', 
                 'Jacobs Field': 'Progressive Field', 
                 'U.S. Cellular Field': 'Guaranteed Rate Field', 
                 'Ballpark at Arlington': 'Globe Life Park', 
                 'Ameriquest Field': 'Globe Life Park', 
                 'Rangers Ballpark': 'Globe Life Park'})
                 
x = df.groupby(['Year', 'BALLPARK'])['Year'].count().reset_index(name='count')
x = pd.DataFrame(x)

omit = ['BB&T Ballpark', 'Fort Bragg', 'Hiram Bithorn Stadium', 'Sydney Cricket Ground', 'Tokyo Dome']
x2 = x.loc[~x['BALLPARK'].isin(omit)]
x2 = x2.sort_values(by = ['Year', 'BALLPARK'], ascending = [True, True])

plt.figure(figsize=(20,16))

plt.scatter(x2['Year'], 
            x2['BALLPARK'],
            marker='8', 
            cmap='seismic', 
            c=x2['count'], 
            s=x2['count'], 
            edgecolors='black')
plt.title('Home Runs by Ballpark by Season', fontsize=20)
plt.ylabel('Ballpark', fontsize=18)
plt.xlabel('Season', fontsize=18)

cbar = plt.colorbar()
cbar.set_label('Number of Home Runs', rotation=270, fontsize=16, color='black', labelpad=30)

# Place a tick every 20 home runs, starting at the smallest season total
my_colorbar_ticks = [*range(int(x2['count'].min()), int(x2['count'].max()), 20)]
cbar.set_ticks(my_colorbar_ticks)

# Label each tick with the value it actually marks
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)

plt.xticks(x2['Year'], x2['Year'], fontsize=14, color='black')
plt.yticks(x2['BALLPARK'], x2['BALLPARK'], fontsize=14, color='black')
plt.show()

Bar Charts

Home Runs by Player, Top 250 and Top 15 (2006-2017)

Comparing the two charts, about one-third of the top 250 hitters are above that group's mean for total home runs from 2006-2017. Among the top 15, almost half of the hitters are above their group's mean. However, the mean for the top 15 hitters is more than double the mean for the top 250.
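The shares described above can be checked numerically instead of being read off the bars. The sketch below uses made-up totals and a hypothetical helper, `share_above_mean`, which counts values strictly above the mean. The bar colors in the charts use a 1% band around the mean instead, so the two counts can differ slightly.

```python
import pandas as pd

def share_above_mean(counts: pd.Series) -> float:
    """Return the fraction of values strictly greater than the series mean."""
    return float((counts > counts.mean()).mean())

# Made-up career totals, sorted from highest to lowest.
toy_counts = pd.Series([400, 300, 200, 100, 50, 50, 40, 30, 20, 10])

top_10 = toy_counts.head(10)
top_3 = toy_counts.head(3)

# Mean and share above the mean for each group.
print(top_10.mean(), share_above_mean(top_10))
print(top_3.mean(), share_above_mean(top_3))
```

With the real data, the same helper could be applied to the `COUNT` column of each top-N slice.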

# Count home runs per batter (df has one row per home run)
player_totals = df[['BATTER']].copy()
# Strip stray double quotes so a player is not counted under two spellings
player_totals['BATTER'] = player_totals['BATTER'].astype(str).str.replace('"', '', regex=False)
player_totals2 = player_totals.groupby('BATTER')['BATTER'].count().reset_index(name='COUNT')
# Order batters from most to fewest home runs
player_totals2 = player_totals2.sort_values(by='COUNT', ascending=False).reset_index(drop=True)

def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg = this_data.COUNT.mean()
    for each in this_data.COUNT:
        if each > avg*1.01:
            colors.append('firebrick')
        elif each < avg*0.99:
            colors.append('lightskyblue')
        else:
            colors.append('lightcoral')
    return colors
  
bottom1 = 0
top1 = 249
d1 = player_totals2.loc[bottom1:top1]
my_colors1 = pick_colors_according_to_mean_count(d1)

bottom2 = 0
top2 = 14
d2 = player_totals2.loc[bottom2:top2]
my_colors2 = pick_colors_according_to_mean_count(d2)

Above = mpatches.Patch(color='firebrick', label='Above Average')
At = mpatches.Patch(color='lightcoral', label='Within 1% of the Average')
Below = mpatches.Patch(color='lightskyblue', label='Below Average')

fig = plt.figure(figsize=(18, 16))
fig.suptitle('Home Runs by Player:\n Top ' + str(top1+1) + ' and Top ' + str(top2+1), 
             fontsize=18, fontweight='bold')

ax1 = fig.add_subplot(2, 1, 1)
ax1.bar(d1['BATTER'], d1.COUNT, label='Count', color=my_colors1)
ax1.legend(handles=[Above, At, Below], fontsize=14)
ax1.axhline(d1.COUNT.mean(), color='black', linestyle='dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(False)
ax1.set_title('Top ' + str(top1+1) + ' Home Run Hitters', size=20)
ax1.text(top1-10, d1.COUNT.mean()+5, 'Mean = {:.1f}'.format(d1.COUNT.mean()))
ax1.set_ylabel('Number of Home Runs', fontsize=14)

ax2 = fig.add_subplot(2, 1, 2)
ax2.bar(d2['BATTER'], d2.COUNT, label='Count', color=my_colors2)
ax2.legend(handles=[Above, At, Below], fontsize=14)
ax2.axhline(d2.COUNT.mean(), color='black', linestyle='solid')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_title('Top ' + str(top2+1) + ' Home Run Hitters', size=20)
ax2.text(top2-1, d2.COUNT.mean()+5, 'Mean = {:.1f}'.format(d2.COUNT.mean()))
ax2.set_ylabel('Number of Home Runs', fontsize=14)
ax2.set_ylim(0, d2.COUNT.max()*1.1)
for tick in ax2.get_xticklabels():
    tick.set_rotation(45)

fig.tight_layout(pad=2)

plt.show()

Bump Charts

Top 5 Home Run Hitters, Ranking by Year 2006-2016

For the bump chart, I chose to look at the top 5 hitters from 2006-2016. I removed 2017 because two of the hitters retired following 2016. What I found interesting in this data was how much overlap there was in a given year, with players hitting the same number of home runs.
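The overlap comes from how ties are ranked. The sketch below uses made-up totals to show what `rank(method='min')` does when two hitters finish a season with the same count: both receive the better rank, so their lines meet at the same point on a bump chart. The hitter names and numbers are placeholders.

```python
import pandas as pd

# Made-up season totals: rows are hitters, columns are seasons.
toy_hr = pd.DataFrame(
    {2015: [40, 40, 35], 2016: [30, 42, 41]},
    index=["Hitter A", "Hitter B", "Hitter C"],
)

# method='min' gives tied hitters the best (lowest) rank in their group.
ranked = toy_hr.rank(axis=0, ascending=False, method="min")
print(ranked)

# A season has a tie when at least one rank value repeats.
has_tie = ranked.apply(lambda col: col.duplicated().any())
print(has_tie)
```

In this toy data, Hitter A and Hitter B share rank 1 in 2015 and rank 2 is skipped, while 2016 has no ties.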

It is also interesting to see how Ryan Howard and Edwin Encarnacion switched places from the start of the decade to the end of the decade. That makes what Albert Pujols has done, staying consistent for most of his career, all the more impressive, and it shows how consistently one must play to reach some of those major historical milestones.

I wanted to include the actual number of home runs each year in the ranking, but I kept getting an error that my pixels were too large.
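One possible way to add those numbers is sketched below on made-up data. It places each label at the marker's own data coordinates (season on the x axis, rank on the y axis). Since the failing code is not shown, this is only a guess at a fix: size errors of that kind can appear when a label ends up far outside the plotted range. The hitter names, totals, and the `Agg` backend choice are assumptions made so the sketch runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up season totals: rows are seasons, columns are hitters.
toy_counts = pd.DataFrame(
    {"Hitter A": [40, 30], "Hitter B": [35, 42]},
    index=[2015, 2016],
)
# Rank within each season (across columns), 1 = most home runs.
toy_ranks = toy_counts.rank(axis=1, ascending=False, method="min")

fig, ax = plt.subplots(figsize=(8, 4))
toy_ranks.plot(kind="line", ax=ax, marker="o", markersize=25,
               markerfacecolor="white", linewidth=4)
ax.invert_yaxis()

# Write each season total inside its marker, using data coordinates.
for hitter in toy_counts.columns:
    for season in toy_counts.index:
        ax.text(season, toy_ranks.loc[season, hitter],
                str(toy_counts.loc[season, hitter]),
                ha="center", va="center", fontsize=9)

fig.canvas.draw()  # render once to confirm the figure builds without errors
```

In a notebook, the backend line can be dropped and `plt.show()` called instead of `fig.canvas.draw()`.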

top5 = player_totals2['BATTER'].iloc[:5,].tolist()

bat_year_df = df[['BATTER', 'Year']]
bat_year_df = bat_year_df.sort_values(by = ['BATTER'], ascending=True)
bat_year_df['BATTER'] = bat_year_df['BATTER'].astype(str)
bat_year_df['BATTER'] = bat_year_df['BATTER'].apply(lambda x: x.replace('"', ''))

hr_season = bat_year_df.groupby(['Year', 'BATTER'])['Year'].count().reset_index(name='HR Count')
hr_season2 = hr_season.loc[hr_season['BATTER'].isin(top5)]
hr_season2 = hr_season2.sort_values(by = ['BATTER', 'Year'], ascending=True)
hr_season2 = hr_season2[hr_season2['Year'] != 2017]
hr_season2.reset_index(inplace=True, drop =True)
hr_season3 = hr_season2.pivot(index='BATTER', columns='Year', values='HR Count')

hr_season3_ranked = hr_season3.rank(axis=0, ascending=False, method='min')
hr_season3_ranked = hr_season3_ranked.T

fig = plt.figure(figsize=(18,10), dpi=100)
ax = fig.add_subplot(1, 1, 1)

hr_season3_ranked.plot(kind='line', ax=ax, marker='o', markeredgewidth=1, linewidth=6, 
                   markersize=25,
                   markerfacecolor='white')
ax.invert_yaxis()

x_ticks = np.unique(hr_season2['Year'])
y_ticks = hr_season3_ranked.shape[1]

plt.ylabel('Player Ranking', fontsize=18, labelpad=10)
plt.title('Player Ranking of Home Runs by Year 2006-2016\n Bump Chart', fontsize=18, pad=15)
plt.xticks(x_ticks, fontsize=14)
plt.yticks(range(1, y_ticks+1, 1), fontsize=14)
ax.set_xlabel('Year', fontsize=18)

handles, labels = ax.get_legend_handles_labels()
# Reverse the legend order
handles, labels = handles[::-1], labels[::-1]
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=14,
         labelspacing = 1, 
         markerscale = .5,
         borderpad = 1,
         handletextpad = 0.8)

plt.tight_layout()

plt.show()

Stacked Bar Chart

Total Home Runs by Top 10 Home Run Hitters by Inning (2006-2017)

Looking at the top 10 home run hitters broken down by inning, it is noticeable that more home runs are hit in the first inning and then again in the middle innings. I would think this is more about when a batter is likely to come up to bat, since most of these batters probably hit in the first, second, third, or fourth spot in the lineup.

player_totes = df[['BATTER', 'Year']]
player_totes = pd.DataFrame(player_totes)
player_totes = player_totes.sort_values(by = ['BATTER'])
player_totes['BATTER'] = player_totes['BATTER'].astype(str)
player_totes['BATTER'] = player_totes['BATTER'].apply(lambda x: x.replace('"', ''))
player_totes2 = player_totes.groupby(['Year','BATTER'])['BATTER'].count().reset_index(name='count')
player_totes2 = player_totes2.sort_values(by = ['Year', 'count'], ascending=False)
player_totes2 = player_totes2.reset_index(drop=True)
player_totes2 = player_totes2.rename(columns={"count": "HR Count"})

top10 = player_totals2['BATTER'].iloc[:10,].tolist()

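player_inning = df[['BATTER', 'INNING', 'Year']].copy()
# Strip stray quotes so names match the cleaned names in top10
player_inning['BATTER'] = player_inning['BATTER'].astype(str).str.replace('"', '', regex=False)
# Keep regulation innings only (drop extra innings)
player_inning = player_inning[player_inning['INNING'] < 10]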

stacked_inning = player_inning[player_inning['BATTER'].isin(top10)]
stacked_inning = stacked_inning.groupby(['INNING', 'BATTER'])['BATTER'].count().reset_index(name='HRCount')
stacked_inning_pivot = stacked_inning.pivot(index='BATTER', columns='INNING', values='HRCount')

fig = plt.figure(figsize=(20,12))
ax = fig.add_subplot(1, 1, 1)

stacked_inning_pivot.plot(kind='bar', stacked=True, ax=ax)

plt.ylabel('Total Home Runs', fontsize=18, labelpad=10)
plt.title('Total Home Runs by Top 10 Hitters and Inning (2006-2017)\n Stacked Bar Plot', fontsize=20)
plt.xticks(rotation=45, horizontalalignment='center', fontsize=14)
plt.yticks(fontsize=14)
ax.set_xlabel('Top 10 Home Run Hitters', fontsize=18, labelpad=40)

handles, labels = ax.get_legend_handles_labels()
# Reverse the legend order so it matches the stacking order from top to bottom
handles, labels = handles[::-1], labels[::-1]
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=14,
         labelspacing = 1, 
         markerscale = .5,
         borderpad = 1,
         handletextpad = 0.8)

fig.tight_layout(pad=2)

plt.show()

Pie Chart

Home Runs by Inning (2017)

After looking at the data by batter by inning, I thought it would be easier to view overall percentages by inning in one season. This data mirrors what was broken down by the top hitters, showing that the largest percentages of home runs are hit in the first, third, fourth, fifth, and sixth innings. As with the previous chart, it makes sense that the first inning is high and that the middle innings are higher, because the batters have seen the same pitcher a couple of times by that point. The later innings are most likely lower because batters are facing relief pitchers, who typically throw harder, for the first time. The ninth inning is logically the lowest because closers should not be giving up many runs, and because the home team does not bat in the ninth inning if it is winning.
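The percentages shown on the pie chart can also be computed as a small table, which makes it easier to compare innings directly. The sketch below uses made-up inning values, one per home run, and mirrors the extra-inning filter used for the charts.

```python
import pandas as pd

# Made-up inning values, one entry per home run.
toy_innings = pd.Series([1, 1, 1, 3, 4, 4, 5, 6, 9, 11], name="INNING")

# Drop extra innings, mirroring the INNING < 10 filter used for the charts.
regulation = toy_innings[toy_innings < 10]

# Share of home runs per inning, in percent, ordered by inning.
pct_by_inning = (regulation.value_counts(normalize=True)
                 .sort_index()
                 .mul(100)
                 .round(2))
print(pct_by_inning)
```

Innings with no home runs do not appear in the result at all, so a missing inning here means a count of zero rather than missing data.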

top3_2017 = player_totes2['BATTER'].iloc[:3,].tolist()

player_inning2 = player_inning[player_inning['Year'] == 2017]
player_inning2 = player_inning2.groupby(['INNING'])['INNING'].count().reset_index(name='HRCount')

number_outside_colors = len(player_inning2.INNING.unique())
outside_color_ref_number = np.arange(number_outside_colors)*2

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20b_r")
outer_colors = colormap(outside_color_ref_number)

HR_inning = player_inning2.HRCount.sum()

player_inning2.groupby(['INNING'])['HRCount'].sum().plot(
    kind='pie', radius=1, colors = outer_colors, pctdistance = 0.79, labeldistance = 1.1,
    wedgeprops = dict(edgecolor='w'), textprops={'fontsize':9},
    autopct = lambda p: '{:.2f}%\n({:.0f} Home Runs)'.format(p, (p/100)*HR_inning),
    startangle=90
)

hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('2017 Home Runs by Inning \n (Excluding Extra Innings)', fontsize=16, pad=30)

ax.text(0, 0, 'Home Runs\n' + str(HR_inning), size=16, ha='center', va='center')

ax.axis('equal')
plt.tight_layout()

plt.show()

Conclusion

In conclusion, there is a lot of information you can manipulate and visualize using only a couple of fields in a dataframe.