Introduction

The data I will be using for my analysis is tennis data from the Kaggle data set ATP Tennis. It covers male tennis players from 2000 to 2019. I will be using three CSV files: match.csv, stat.csv, and player.csv. I joined all three files, and the result contains 20240 rows and 36 columns.
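As a minimal sketch of how the three files could be joined, assuming that stat.csv links to match.csv through match_id and to player.csv through player_id (the key names are taken from the variable list, and the tiny tables here are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for match.csv, stat.csv and player.csv.
match = pd.DataFrame({
    "match_id": [1, 2],
    "tournament": ["Tournament X", "Tournament Y"],
    "year": [2018, 2019],
})
stat = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "player_id": [10, 20, 10, 30],
    "pts": [1200, 600, 2000, 1200],
    "winner": [True, False, True, False],
})
player = pd.DataFrame({
    "player_id": [10, 20, 30],
    "name": ["Player A", "Player B", "Player C"],
    "hand": ["R", "L", "R"],
})

# stat has one row per player per match, so it is used as the base table.
# validate="many_to_one" raises an error if a key is duplicated on the right side.
merged = (
    stat.merge(match, on="match_id", how="left", validate="many_to_one")
        .merge(player, on="player_id", how="left", validate="many_to_one")
)

print(merged.shape)
```

Using left joins from the per-player table keeps the row count equal to the number of stat rows, which makes it easy to check that nothing was duplicated by the join.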

Data summary

  • match.csv: Contains data about each match, such as the duration of the match, the date, and where the match was played.
  • stat.csv: Contains most of the data and gives very detailed information on what happened during each match's games and sets.
  • player.csv: General information of each player.

Data Set ATP Tennis

Key variables

  • match_id
  • round
  • date
  • avg_minutes_game
  • avg_seconds_point
  • avg_minutes_set
  • tournament
  • year
  • match_minutes
  • player_id
  • name
  • hand
  • country
  • birthday
  • pts
  • rank
  • winner
  • sets
  • 1
  • 2
  • 3
  • 4
  • 5
  • avg_odds
  • max_odds
  • total_pts
  • return_pts
  • aces
  • bp_saved
  • bp_faced
  • second_serve_rtn_won
  • first_serve_in
  • dbl_faults
  • first_serve_per

Research Goal

My goal in this research is to see who the best tennis player is and what makes them the best. I will take each player's whole career into account when considering who is the best. I will decide who is the best based on a player's points. Points are awarded to a player after they win a match and are then used in the player's ranking. I will then analyze the top players in more detail to find further insights.
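The ranking idea above can be sketched with a grouped aggregation. This is only an illustration on made-up rows; the column names name and pts come from the variable list, and the output names matches, sum_pts and ave_pts are my own choices.

```python
import pandas as pd

# Hypothetical rows standing in for the joined data (one row per player per match).
df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player A", "Player C", "Player B", "Player A"],
    "pts": [1000, 500, 2000, 250, 500, 1000],
})

# Named aggregation gives flat, readable column names in one step.
ranking = (
    df.groupby("name")
      .agg(matches=("pts", "size"), sum_pts=("pts", "sum"), ave_pts=("pts", "mean"))
      .sort_values("sum_pts", ascending=False)
      .reset_index()
)

print(ranking)
```

Named aggregation avoids having to rename multi-level columns after the groupby.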

Horizontal Bar Chart

#Cleaning for first graph
g1 = merged_df1[['name', 'pts']]

g1_clean = g1.groupby(['name']).agg({'name':['count'], 'pts':['sum','mean']}).reset_index()

g1_clean.columns = ['name', 'name_count', 'sum_pts', 'ave_pts']

g1_clean = g1_clean.sort_values('sum_pts', ascending = False)

g1_clean.reset_index(inplace = True, drop = True)

x = g1_clean.head(10)

#Color picker function
def color_picker(data):
    colors=[]
    avg = data.sum_pts.mean()
    for each in data.sum_pts:
        if each > avg*1.01:
            colors.append('lightblue')
        elif each < avg*.99:
            colors.append('darkblue')
        else:
            colors.append('black')
    return colors

#First Graph
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

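bottom1 = 0
top1 = 150
# iloc excludes the end position, so this selects exactly top1 rows
# (loc is inclusive and would return one extra row)
d1 = g1_clean.iloc[bottom1:top1]
my_colors1 = color_picker(d1)


bottom2 = 0
top2 = 10
d2 = g1_clean.iloc[bottom2:top2]
my_colors2 = color_picker(d2)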


above = mpatches.Patch(color='lightblue', label='Above Average')
at = mpatches.Patch(color='black', label='Within 1% of the Average')
below = mpatches.Patch(color='darkblue', label='Below Average')

fig = plt.figure(figsize = (18, 16))
fig.suptitle('Number of Points Per Player In Millions by Name of Player: \n Top ' + str(top1) + ' and Top ' +str(top2), fontsize=18, fontweight='bold')

ax1 = fig.add_subplot(2,1,1)
ax1.bar(d1.name, d1.sum_pts, label='Sum of Pts', color=my_colors1)
ax1.legend(handles=[above, at, below], fontsize = 14)
ax1.axhline(d1.sum_pts.mean(), color = 'black', linestyle = 'dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(False)
ax1.set_title('Top ' + str(top1) + ' Players Points' , size =20)
ax1.text(top1-10, d1.sum_pts.mean()+5, 'Mean = ' + str(d1.sum_pts.mean()), rotation=0, fontsize=14, va='bottom')

ax2 = fig.add_subplot(2,1,2)
ax2.bar(d2.name, d2.sum_pts, label='Sum of Pts', color=my_colors2)
ax2.legend(handles=[above, at, below], fontsize = 14)
ax2.axhline(d2.sum_pts.mean(), color = 'black', linestyle = 'solid')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_title('Top ' + str(top2) + ' Players Points' , size =20)
ax2.text(top2-1, d2.sum_pts.mean()+ 5, 'Mean = ' + str(d2.sum_pts.mean()), rotation=0, fontsize=14,va='bottom')

plt.show()

Information: Horizontal Bar Chart

In this graph we see the total number of points each player received from 2000 to 2019. In the second graph we see the top 10 players by points in the data set. The purpose of this graph is to see how far ahead a player is based on points. It also gives the top player legitimacy in being called the best.

Notes from the visualization

  • We can see that the top 4 players are well above the average number of points for all players.
  • From the first graph we can see that Novak Djokovic has far more career points than any other player.

Conclusion: Horizontal Bar Chart

From this graph we can see that the first four players have far more points than any other player. The second graph confirms that the top four players stand out from everyone else.
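The above, below, and within-1% colouring used in the bar chart can also be written as one reusable function. The sketch below is an alternative to the loop-based colour picker, not a replacement for it; the function name color_by_mean and the tolerance parameter are my own.

```python
import numpy as np

def color_by_mean(values, tolerance=0.01):
    """Return one colour per value: above, below, or within `tolerance` of the mean."""
    values = np.asarray(values, dtype=float)
    avg = values.mean()
    conditions = [
        values > avg * (1 + tolerance),  # clearly above the mean
        values < avg * (1 - tolerance),  # clearly below the mean
    ]
    # Anything that matches neither condition is within the tolerance band.
    return np.select(conditions, ["lightblue", "darkblue"], default="black").tolist()

colors = color_by_mean([150, 100, 50])
print(colors)
```

Because the function takes the values directly, the same code works for any column, for example color_by_mean(d1.sum_pts).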

Vertical Bar Chart

#Cleaning for second graph
g2 = merged_df[['name', 'first_serve_in']]

g2_clean = g2.groupby(['name']).agg({'name':['count'], 'first_serve_in':['sum','mean']}).reset_index()

g2_clean.columns = ['name', 'name_count', 'sum_first_serve_in', 'ave_first_serve_in']

g2_clean = g2_clean.sort_values('sum_first_serve_in', ascending = False)

g2_clean.reset_index(inplace = True, drop = True)


#Color picker function
def color_picker2(data):
    colors=[]
    avg = data.sum_first_serve_in.mean()
    for each in data.sum_first_serve_in:
        if each > avg*1.01:
            colors.append('lightblue')
        elif each < avg*.99:
            colors.append('darkblue')
        else:
            colors.append('black')
    return colors
  

#Second Graph
bottom3 = 0
top3 = 20
# iloc excludes the end position, so this selects exactly top3 rows
d3 = g2_clean.iloc[bottom3:top3]
d3 = d3.sort_values('sum_first_serve_in', ascending=True)
d3 = d3.reset_index(drop=True)
my_colors3 = color_picker2(d3)

Above = mpatches.Patch(color='lightblue', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of the Average')
Below = mpatches.Patch(color='darkblue', label='Below Average')

fig = plt.figure(figsize=(18, 12))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(d3.name, d3.sum_first_serve_in, color=my_colors3)
for row_counter, value_at_row_counter in enumerate(d3.sum_first_serve_in):
    ax1.text(value_at_row_counter+2, row_counter, str(value_at_row_counter), color='black', size =10, fontweight='bold',ha = 'left', va = 'center', backgroundcolor = 'white')
plt.xlim(0, d3.sum_first_serve_in.max()*1.1)
ax1.legend(loc='lower right', handles=[Above, At, Below], fontsize=14) 
plt.axvline(d3.sum_first_serve_in.mean(), color='black', linestyle='dashed' )
ax1.text(d3.sum_first_serve_in.mean()+ 2, 0, 'Mean =' + str(d3.sum_first_serve_in.mean()), rotation=0, fontsize=10)
ax1.set_title('Top ' + str(top3) + ' Players by First Serves In', size=20)
ax1.set_xlabel('First Serves In', fontsize=16)
ax1.set_ylabel('Players', fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

Information: Vertical Bar Chart

In this graph we see which players landed the most first serves in over their careers. We are looking at the top 20 players by first serves in.

Notes from the visualization

  • Roger Federer has landed the most first serves in.
  • Fernando Verdasco and Stan Wawrinka are above average in first serves in but are not above average in points.

Conclusion: Vertical Bar Chart

We can see that first serves matter, but some players with an above-average number of first serves in do not appear in the top ten points graph. This shows that the attribute is important but is not something a player can rely on alone to win matches.
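The comparison above, players who are high in first serves in but not high in points, can be checked directly with sets. This is a small sketch on made-up totals; the helper top_names and the numbers are hypothetical, while sum_pts and sum_first_serve_in follow the column names used in the cleaning steps.

```python
import pandas as pd

# Hypothetical career totals per player.
pts = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D"],
    "sum_pts": [900, 700, 300, 100],
})
serves = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D"],
    "sum_first_serve_in": [500, 450, 800, 700],
})

def top_names(frame, column, n):
    """Names of the n players with the largest value in `column`."""
    return set(frame.nlargest(n, column)["name"])

top_points = top_names(pts, "sum_pts", 2)
top_serves = top_names(serves, "sum_first_serve_in", 3)

# Players who are in the serve ranking but not in the points ranking.
serve_only = sorted(top_serves - top_points)
print(serve_only)
```

On the real data the same idea would use n=10 for points and n=20 for first serves in, matching the two charts.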

Stacked Bar Chart

#Getting the Variables

stack_df = merged_df1[['name','pts','1','2','3','4','5']]
stack_df = stack_df.sort_values(by='pts', ascending=False)
stack_df = stack_df.fillna(0)

stack_df = stack_df.groupby(['name']).agg({'name':['count'], 'pts':['sum','mean'], '1':['mean'],'2':['mean'],'3':['mean'],'4':['mean'],'5':['mean']}).reset_index()

#Getting the top 10 players

# Columns '1' to '5' hold games won in each set, so they are labelled as sets
stack_df.columns = ['name', 'name_count', 'sum_pts', 'ave_pts', '1st Set', '2nd Set', '3rd Set', '4th Set', '5th Set']

stack_df = stack_df.sort_values('sum_pts', ascending = False)

stack_df.reset_index(inplace = True, drop = True)

stack_df= stack_df.set_index('name')
stack_df=stack_df.drop(columns=['name_count', 'sum_pts','ave_pts'])
stack_df = stack_df.head(10)

stack_df=stack_df.round() 

#Making the Graph

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1,1,1)
stack_df.plot(kind='bar', stacked=True, ax=ax)

plt.ylabel('Average Games Won in Set', fontsize = 18, labelpad=10)
plt.title('Average Games Won in a Set for the top 10 Players', fontsize = 18)
plt.yticks(fontsize=14)
ax.set_xlabel('Top 10 Players',fontsize=18)

for c in ax.containers:
    labels = [v.get_height() if v.get_height() > 0 else '' for v in c]
    ax.bar_label(c, labels=labels, label_type='center')
    
plt.show()

Information: Stacked Bar Chart

In this graph we see the average number of games a player wins in each set. The point of this graph is to see where the top players struggle in a match.

Notes from the visualization

Conclusion: Stacked Bar Chart

This graph helps us see where players struggle through their matches. We can see the consistent players and the players that struggle as the match plays out. It also shows us when a player does not win the match in the first three sets.
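One detail worth illustrating is how unplayed sets affect these averages. The sketch below uses made-up rows, assumes the columns '1' to '5' hold games won per set, and compares filling missing sets with 0 (as in the cleaning step) against averaging only the sets that were actually played.

```python
import numpy as np
import pandas as pd

sets = ["1", "2", "3", "4", "5"]

# Hypothetical matches; NaN means the set was not played.
df = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "1": [6, 6, 4, 6],
    "2": [6, 3, 6, 7],
    "3": [np.nan, 6, 6, 2],
    "4": [np.nan, np.nan, np.nan, 6],
    "5": [np.nan, np.nan, np.nan, np.nan],
})

# Unplayed sets count as 0 games, which pulls the later-set averages down.
avg_zero_filled = df.fillna(0).groupby("name")[sets].mean()

# mean() skips NaN by default, so only played sets are averaged.
avg_played_only = df.groupby("name")[sets].mean()

# The set in which each player wins the fewest games on average.
weakest_set = avg_played_only.idxmin(axis=1)

print(avg_zero_filled)
print(avg_played_only)
print(weakest_set)
```

Which version is better depends on the question: zero-filling reflects how rarely a player reaches sets four and five, while the played-only average reflects how well they do once they get there.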

Pie Chart

#Cleaning

line_df = merged_df1[['name','pts','year']]

line_df = line_df.sort_values(by= 'pts', ascending=False)
line_df=line_df.dropna()

line_df = line_df.groupby(['name']).agg({'name':['count'], 'pts':['sum','mean'], 'year':['count']}).reset_index()

line_df.columns = ['name', 'name_count', 'pts_sum', 'pts_mean','year_count']

top_five = ["Novak Djokovic", "Roger Federer", "Rafael Nadal", "Andy Murray", "David Ferrer"]
line_df = line_df[line_df["name"].isin(top_five)]

line_df = line_df.sort_values(by = 'pts_sum', ascending=False)

#Getting the colors
number_outside = len(line_df.name.unique())
outside_color_ref = np.arange(number_outside)*4

#Graphing
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref)

all_pts = line_df.pts_sum.sum()

line_df.groupby(['name'])['pts_sum'].sum().plot(
    kind= 'pie',radius=1,colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1,
    wedgeprops = dict(edgecolor='white'), textprops= {'fontsize':18},
    autopct = lambda p: '{:.2f}%\n({:.1f}pts)'.format(p,(p/100)*all_pts/1e+6),startangle=90)

ax.yaxis.set_visible(False)
plt.title('Top Five Player Points in Millions', fontsize= 18)
ax.axis('equal')
plt.tight_layout()

plt.show()

Information: Pie Chart

In the pie chart I took the 5 highest-scoring players and compared each player's total with the combined total to see how far ahead they are of each other. Focusing on only the top five players' points makes the differences among them easier to see than including more players.

Notes from the visualization

Conclusion: Pie Chart

We can see that Djokovic, Nadal, and Federer are the highest-scoring players by a significant percentage. Of the top 5 players' points, the 4th-best player, Andy Murray, holds only 14.90% of the total, which shows how much better Djokovic, Nadal, and Federer are.
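The percentages in the pie chart are each player's total divided by the combined total of the selected players. A minimal sketch of that arithmetic, using made-up totals rather than the real values:

```python
import pandas as pd

# Hypothetical point totals (in millions) for four players.
totals = pd.Series(
    {"Player A": 4.0, "Player B": 3.0, "Player C": 2.0, "Player D": 1.0},
    name="pts_sum",
)

# Share of the combined total, as a percentage.
share = (totals / totals.sum() * 100).round(2)

# How many percentage points each player trails the leader by.
gap_to_leader = share.max() - share

print(share)
print(gap_to_leader)
```

Note that the shares depend on which players are included: adding or removing a player changes the denominator and therefore every percentage.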

Line Graph

#Cleaning

top_five = ["Novak Djokovic", "Roger Federer", "Rafael Nadal", "Andy Murray", "David Ferrer"]
name_df = merged_df1[merged_df1["name"].isin(top_five)]
name_df = name_df[(name_df["winner"] == True)]
name_df = name_df[['winner','name','year']]

win_df = name_df.groupby(['name','year'])['winner'].sum().reset_index(name = 'tot_wins')

#Graphing
fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot (1, 1, 1)

colormap={'Andy Murray':'blue',
         'Roger Federer':'green',
         'Novak Djokovic':'black',
         'Rafael Nadal':'brown',
         'David Ferrer':'red'}

for key, grp in win_df.groupby('name'):
    # Grouping by a single column name (not a list) yields plain string keys,
    # so the key can be used directly for the colour lookup and the legend label.
    grp.plot(ax=ax, kind='line', x='year', y='tot_wins', color=colormap[key], label=key, marker='8')

plt.title('Total Player Wins per Year', fontsize=18)
ax.set_xlabel('Year', fontsize = 18)
ax.set_ylabel('Total Wins', fontsize = 18,labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y',labelsize=14,rotation=0)

ax.set_xticks(np.arange(2000,2020,1))


plt.show()

Information: Line Graph

In this line graph we see the top 5 players' total wins per year from 2000 to 2019.

Notes from the visualization

  • We can see Roger Federer's dominance in wins from 2000 to 2005.
  • We can see Novak Djokovic's dominance from 2011 to 2016.
  • Then from 2016 to 2019 Rafael Nadal took the lead in wins.

Conclusion: Line Graph

This graph shows us the eras of player dominance. We can see how players tend to win a lot for a period of time. We also see sharp rebounds when a player did not win much in the previous year.
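The eras described above can also be read off a table of wins per year. The following is a sketch on made-up rows, assuming the winner column is boolean as in the filter used for the graph; it counts wins per player per year and picks the yearly leader.

```python
import pandas as pd

# Hypothetical match rows (one row per player per match).
df = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player A", "Player B", "Player B", "Player B"],
    "year": [2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "winner": [True, True, True, True, True, True, False],
})

# Count wins per year and player, with one column per player.
wins = (
    df[df["winner"]]
      .groupby(["year", "name"])
      .size()
      .unstack(fill_value=0)
)

# The player with the most wins in each year.
leader = wins.idxmax(axis=1)

print(wins)
print(leader)
```

With the data in this wide shape, wins.plot(marker="o") would draw one line per player, and runs of the same name in leader correspond to the eras seen in the graph.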

Final Thoughts

Through my analysis, I can confidently say that there are three players that could be considered the best. These players are Novak Djokovic, Roger Federer, and Rafael Nadal. Being a tennis fan, I already had a feeling that these players would show up. But what I found insightful was that the points and rank of a player don’t show the whole story of the player. As I noted for the line graph, we can see eras of players, and the late 2010s was the era of Djokovic and Nadal battling for wins. One thing that was not represented in the data was injuries, and Nadal has had many injuries. This is why his points are lower than Djokovic's, who has had close to no injuries. Also, in 2019, we see Djokovic in second for wins, and if more years were added to the data, we would see Djokovic being number one.