The data I will be using for my analysis is tennis data from the Kaggle data set ATP Tennis. It contains data on male tennis players from 2000 to 2019. I will be using three CSV files: match.csv, stat.csv, and player.csv. I have joined all three files, and the result contains 20240 rows and 36 columns.
Data Set ATP Tennis
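The join of the three files could look roughly like the following. This is a minimal sketch on toy data: the key columns `match_id` and `player_id`, the join order, and the toy values are all assumptions for illustration, since the real column names are not shown here.

```python
import pandas as pd

# Toy stand-ins for match.csv, stat.csv and player.csv.
# The key columns (match_id, player_id) are assumed, not taken from the real files.
match = pd.DataFrame({"match_id": [1, 2], "year": [2018, 2019]})
stat = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "player_id": [10, 11, 10, 12],
    "pts": [250, 150, 500, 300],
    "winner": [True, False, True, False],
})
player = pd.DataFrame({
    "player_id": [10, 11, 12],
    "name": ["Player A", "Player B", "Player C"],
})

# One row per player per match: start from the stats,
# then attach the match details and the player details.
merged = (
    stat.merge(match, on="match_id", how="left")
        .merge(player, on="player_id", how="left")
)

print(merged.shape)  # (rows, columns) of the joined table
```

Left joins keep every stat row even if a match or player record is missing, so checking `merged.shape` and the number of missing names after the join is a quick sanity test.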
My goal in this research is to see who the best tennis player is and what makes them the best. I will take each player's whole career into account when considering who is the best. I will determine who is the best based on a player's points. Points are awarded to a player after they win a match and are then used in the player's ranking. I will then analyze the top players in more detail to find more insight into them.
#Cleaning for first graph
g1 = merged_df1[['name', 'pts']]
g1_clean = g1.groupby(['name']).agg({'name':['count'], 'pts':['sum','mean']}).reset_index()
g1_clean.columns = ['name', 'name_count', 'sum_pts', 'ave_pts']
g1_clean = g1_clean.sort_values('sum_pts', ascending = False)
g1_clean.reset_index(inplace = True, drop = True)
x = g1_clean.head(10)
#Color picker function
def color_picker(data):
    # Light blue above the mean, dark blue below it, black within 1% of it
    colors = []
    avg = data.sum_pts.mean()
    for each in data.sum_pts:
        if each > avg*1.01:
            colors.append('lightblue')
        elif each < avg*.99:
            colors.append('darkblue')
        else:
            colors.append('black')
    return colors
#First Graph
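import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
bottom1 = 0
top1 = 150
# iloc excludes the end position, so this selects exactly the top 150 rows
d1 = g1_clean.iloc[bottom1:top1]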
my_colors1 = color_picker(d1)
bottom2 = 0
top2 = 10
# iloc excludes the end position, so this selects exactly the top 10 rows
d2 = g1_clean.iloc[bottom2:top2]
my_colors2 = color_picker(d2)
above = mpatches.Patch(color='lightblue', label='Above Average')
at = mpatches.Patch(color='black', label='Within 1% of the Average')
below = mpatches.Patch(color='darkblue', label='Below Average')
fig = plt.figure(figsize = (18, 16))
fig.suptitle('Number of Points Per Player In Millions by Name of Player: \n Top ' + str(top1) + ' and Top ' +str(top2), fontsize=18, fontweight='bold')
ax1 = fig.add_subplot(2,1,1)
ax1.bar(d1.name, d1.sum_pts, label='Sum of Pts', color=my_colors1)
ax1.legend(handles=[above, at, below], fontsize = 14)
plt.axhline(d1.sum_pts.mean(),color = 'black', linestyle = 'dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(False)
ax1.set_title('Top ' + str(top1) + ' Players Points' , size =20)
ax1.text(top1-10, d1.sum_pts.mean()+5, 'Mean = ' + str(d1.sum_pts.mean()), rotation=0, fontsize=14, va='bottom')
ax2 = fig.add_subplot(2,1,2)
ax2.bar(d2.name, d2.sum_pts, label='Sum of Pts', color=my_colors2)
ax2.legend(handles=[above, at, below], fontsize = 14)
plt.axhline(d2.sum_pts.mean(),color = 'black', linestyle = 'solid')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_title('Top ' + str(top2) + ' Players Points' , size =20)
ax2.text(top2-1, d2.sum_pts.mean()+ 5, 'Mean = ' + str(d2.sum_pts.mean()), rotation=0, fontsize=14,va='bottom')
plt.show()
In this graph we see the total number of points each player received from 2000 to 2019. In the second graph we see the top 10 players by points in the data set. The purpose of this graph is to show how far ahead a player is based on points. It also gives the top players legitimacy in being called the best.
Notes from the visualization
From this graph we can see that the first four players have many more points than any other player. The second graph also shows that the top four players stand out from all the others.
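To put a number on "how far ahead" a player is, one option is to compare each total with the group mean and with the next-ranked player. The sketch below uses made-up totals; the column name `sum_pts` mirrors the cleaning code, while the names and values are purely illustrative.

```python
import pandas as pd

# Made-up totals, sorted from highest to lowest like g1_clean.
totals = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "sum_pts": [900, 800, 700, 300, 300],
})

mean_pts = totals["sum_pts"].mean()

# How many times the group mean each player has scored.
totals["ratio_to_mean"] = totals["sum_pts"] / mean_pts

# Lead over the next-ranked player (0 for the last row, which has no one below it).
totals["lead_over_next"] = (
    totals["sum_pts"] - totals["sum_pts"].shift(-1)
).fillna(0)

print(totals)
```

A large value in `lead_over_next` marks the point where the leading group separates from the rest of the field.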
#Cleaning for second graph
g2 = merged_df[['name', 'first_serve_in']]
g2_clean = g2.groupby(['name']).agg({'name':['count'], 'first_serve_in':['sum','mean']}).reset_index()
g2_clean.columns = ['name', 'name_count', 'sum_first_serve_in', 'ave_first_serve_in']
g2_clean = g2_clean.sort_values('sum_first_serve_in', ascending = False)
g2_clean.reset_index(inplace = True, drop = True)
#Color picker function
def color_picker2(data):
    # Same color rules as color_picker, applied to total first serves in
    colors = []
    avg = data.sum_first_serve_in.mean()
    for each in data.sum_first_serve_in:
        if each > avg*1.01:
            colors.append('lightblue')
        elif each < avg*.99:
            colors.append('darkblue')
        else:
            colors.append('black')
    return colors
#Second Graph
bottom3 = 0
top3 = 20
# iloc excludes the end position, so this selects exactly the top 20 rows
d3 = g2_clean.iloc[bottom3:top3]
d3 = d3.sort_values('sum_first_serve_in', ascending=True)
d3.reset_index(inplace=True, drop=True)
my_colors3 = color_picker2(d3)
Above = mpatches.Patch(color='lightblue', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of the Average')
Below = mpatches.Patch(color='darkblue', label= 'Below Average')
fig = plt.figure(figsize=(18, 12))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(d3.name, d3.sum_first_serve_in, color=my_colors3)
for row_counter, value_at_row_counter in enumerate(d3.sum_first_serve_in):
    ax1.text(value_at_row_counter+2, row_counter, str(value_at_row_counter), color='black', size =10, fontweight='bold',ha = 'left', va = 'center', backgroundcolor = 'white')
plt.xlim(0, d3.sum_first_serve_in.max()*1.1)
ax1.legend(loc='lower right', handles=[Above, At, Below], fontsize=14)
plt.axvline(d3.sum_first_serve_in.mean(), color='black', linestyle='dashed' )
ax1.text(d3.sum_first_serve_in.mean()+ 2, 0, 'Mean =' + str(d3.sum_first_serve_in.mean()), rotation=0, fontsize=10)
ax1.set_title('Top ' + str(top3) + ' Players by First Serves In', size=20)
ax1.set_xlabel('First Serves In', fontsize=16)
ax1.set_ylabel('Players', fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
In this graph we see which players have made the most first serves in, in total, over their careers. We are looking at the top 20 players by first serves in.
Notes from the visualization
We can see that first serves matter, but several players with an above-average number of first serves in do not appear in the top ten points graph. This shows that the attribute is important, but a player cannot rely on it alone to win a match.
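Because the chart ranks players by the total number of first serves in, players with more matches have an advantage. The sketch below, on made-up per-match numbers, shows how a ranking by total can differ from a ranking by per-match mean; the column name `first_serve_in` follows the cleaning code, and everything else is illustrative.

```python
import pandas as pd

# Made-up per-match counts of first serves in.
matches = pd.DataFrame({
    "name": ["Player A"] * 4 + ["Player B"] * 2,
    "first_serve_in": [40, 42, 38, 40, 60, 64],
})

# Number of matches, career total, and per-match mean for each player.
summary = matches.groupby("name")["first_serve_in"].agg(["count", "sum", "mean"])

print(summary)
print("Highest total:", summary["sum"].idxmax())
print("Highest mean: ", summary["mean"].idxmax())
```

Here Player A leads on the total only because of playing more matches, while Player B lands more first serves per match.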
#Getting the Variables
stack_df = merged_df1[['name','pts','1','2','3','4','5']]
stack_df = stack_df.sort_values(by='pts', ascending=False)
stack_df = stack_df.fillna(0)
stack_df = stack_df.groupby(['name']).agg({'name':['count'], 'pts':['sum','mean'], '1':['mean'],'2':['mean'],'3':['mean'],'4':['mean'],'5':['mean']}).reset_index()
#Getting the top 10 players
stack_df.columns = ['name', 'name_count', 'sum_pts', 'ave_pts', '1st Set','2nd Set','3rd Set','4th Set','5th Set']
stack_df = stack_df.sort_values('sum_pts', ascending = False)
stack_df.reset_index(inplace = True, drop = True)
stack_df= stack_df.set_index('name')
stack_df=stack_df.drop(columns=['name_count', 'sum_pts','ave_pts'])
stack_df = stack_df.head(10)
stack_df=stack_df.round()
#Making the Graph
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1,1,1)
stack_df.plot(kind='bar', stacked=True, ax=ax)
plt.ylabel('Average Games Won in Set', fontsize = 18, labelpad=10)
plt.title('Average Games Won in a Set for the top 10 Players', fontsize = 18)
plt.yticks(fontsize=14)
ax.set_xlabel('Top 10 Players',fontsize=18)
for c in ax.containers:
    labels = [v.get_height() if v.get_height() > 0 else '' for v in c]
    ax.bar_label(c, labels=labels, label_type='center')
plt.show()
## Information: Stacked Bar Chart
In this graph we see the average number of games a player wins in each set. The point of this graph is to see where the top players struggle in a match.
Notes from the visualization
This graph helps us see where players struggle through their matches. We can see the consistent players and the players who struggle as the match plays out. It also shows us whether a player tends not to win the match in the first three sets.
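One caveat when reading these averages: the cleaning step fills missing values with 0 before averaging. If a missing value means the set was not played (an assumption, since the data dictionary is not shown here), later sets will look weaker than they really were. The sketch below compares both approaches on made-up scores.

```python
import numpy as np
import pandas as pd

# Made-up games won per set; NaN is assumed to mean the set was not played.
sets = pd.DataFrame({
    "name": ["Player A", "Player A", "Player A"],
    "1": [6, 6, 4],
    "2": [6, 3, 6],
    "3": [6, 6, 6],
    "4": [np.nan, 6, 7],
    "5": [np.nan, np.nan, 6],
})
cols = ["1", "2", "3", "4", "5"]

# Unplayed sets counted as zero games won, as happens after fillna(0).
with_zeros = sets.fillna(0).groupby("name")[cols].mean()

# Unplayed sets skipped: mean() ignores NaN by default.
played_only = sets.groupby("name")[cols].mean()

print(with_zeros)
print(played_only)
```

In this toy example the fifth-set average drops from 6.0 to 2.0 once unplayed sets are counted as zeros, so a low bar for a late set may reflect short matches rather than poor play.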
#Cleaning
line_df = merged_df1[['name','pts','year']]
line_df = line_df.sort_values(by= 'pts', ascending=False)
line_df=line_df.dropna()
line_df = line_df.groupby(['name']).agg({'name':['count'], 'pts':['sum','mean'], 'year':['count']}).reset_index()
line_df.columns = ['name', 'name_count', 'pts_sum', 'pts_mean','year_count']
top_five = ["Novak Djokovic", "Roger Federer", "Rafael Nadal", "Andy Murray", "David Ferrer"]
line_df = line_df[line_df["name"].isin(top_five)]
line_df = line_df.sort_values(by = 'pts_sum', ascending=False)
#Getting the colors
number_outside = len(line_df.name.unique())
outside_color_ref = np.arange(number_outside)*4
#Graphing
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1, 1, 1)
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref)
all_pts = line_df.pts_sum.sum()
line_df.groupby(['name'])['pts_sum'].sum().plot(
    kind='pie', radius=1, colors=outer_colors, pctdistance=0.85, labeldistance=1.1,
    wedgeprops=dict(edgecolor='white'), textprops={'fontsize':18},
    autopct=lambda p: '{:.2f}%\n({:.1f}pts)'.format(p,(p/100)*all_pts/1e+6), startangle=90)
ax.yaxis.set_visible(False)
plt.title('Top Five Player Points in Millions', fontsize= 18)
ax.axis('equal')
plt.tight_layout()
plt.show()
## Information: Pie Chart
In the pie chart I took the five highest-scoring players and summed their points to see how far ahead of each other they are. Focusing on only the top five players' points makes the differences among them easier to see than it would be with more players.
Notes from the visualization
We can see that Djokovic, Nadal, and Federer are the highest-scoring players by a significant margin. Of the top five players' points, the fourth-best player, Andy Murray, accounts for only 14.90% of the total, which shows how much better Djokovic, Nadal, and Federer are.
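The percentages on the pie chart are each player's share of the group's combined points. A minimal sketch of that calculation, using made-up totals rather than the real figures:

```python
import pandas as pd

# Made-up point totals for a small group of players.
totals = pd.Series(
    {"Player A": 400, "Player B": 300, "Player C": 200, "Player D": 100},
    name="pts_sum",
)

# Each player's share of the combined points, in percent.
share = (totals / totals.sum() * 100).round(2)

print(share)
```

Because the shares are relative to the selected group only, adding or removing a player changes every percentage on the chart.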
#Cleaning
top_five = ["Novak Djokovic", "Roger Federer", "Rafael Nadal", "Andy Murray", "David Ferrer"]
name_df = merged_df1[merged_df1["name"].isin(top_five)]
name_df = name_df[(name_df["winner"] == True)]
name_df = name_df[['winner','name','year']]
win_df = name_df.groupby(['name','year'])['winner'].sum().reset_index(name = 'tot_wins')
#Graphing
fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot (1, 1, 1)
colormap={'Andy Murray':'blue',
'Roger Federer':'green',
'Novak Djokovic':'black',
'Rafael Nadal':'brown',
'David Ferrer':'red'}
for key, grp in win_df.groupby(['name']):
    # Depending on the pandas version, the key is a string or a one-element tuple
    key_str = key if isinstance(key, str) else key[0]
    grp.plot(ax=ax, kind='line', x='year', y='tot_wins', color=colormap[key_str], label=key_str, marker='8')
plt.title('Total Player Wins per Year', fontsize=18)
ax.set_xlabel('Year', fontsize = 18)
ax.set_ylabel('Total Wins', fontsize = 18,labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y',labelsize=14,rotation=0)
ax.set_xticks(np.arange(2000,2020,1))
plt.show()
In this line graph we see the top five players' total wins per year from 2000 to 2019.
Notes from the visualization
This graph shows us the eras of player dominance. We can see how players tend to win a lot for a period of time. We also see big spikes in a player's wins after a year in which they did not win much.
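The spikes can be measured as the year-over-year change in wins per player. The sketch below uses made-up win counts; the column names `name`, `year`, and `tot_wins` follow `win_df`, while the `change` column is a name introduced here for illustration.

```python
import pandas as pd

# Made-up yearly win totals for two players.
win_counts = pd.DataFrame({
    "name": ["Player A"] * 4 + ["Player B"] * 4,
    "year": [2016, 2017, 2018, 2019] * 2,
    "tot_wins": [60, 20, 55, 58, 30, 35, 33, 40],
})

# Order by player and year so diff() compares consecutive seasons.
win_counts = win_counts.sort_values(["name", "year"])

# Change in wins from the previous year, computed separately per player.
# The first year of each player has no previous season, so it is NaN.
win_counts["change"] = win_counts.groupby("name")["tot_wins"].diff()

print(win_counts)
```

A large positive `change` right after a large negative one matches the pattern described above: a quiet year followed by a strong rebound.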
Through my analysis, I can confidently say that there are three players who could be considered the best: Novak Djokovic, Roger Federer, and Rafael Nadal. Being a tennis fan, I already had a feeling that these players would show up. What I found insightful was that a player's points and rank don’t tell the whole story. As the line graph shows, there are eras of players, and the late 2010s was the era of Djokovic and Nadal battling for wins. One thing that was not represented in the data was injuries, and Nadal has had many. This is why his points are lower than those of Djokovic, who has had close to no injuries. Also, in 2019 we see Djokovic in second place for wins, and if more years were added to the data, we would see Djokovic at number one.