Rough Draft of Final Project
The dataset (below) I will be using for my project is titled “NBA Players stats since 1950”, which I obtained from Kaggle. The dataset only goes up to 2017, so we will be missing data from the current and most recent basketball seasons.
Below is a description of the variables kept for this project.
Year - The year/season the player played in
Player - The name of the player
Pos - The position the player played
Age - The player’s age during that year/season
Tm - The team the player played for
G - Total games played
GS - Total games started
MP - Number of minutes played by the player
PER - Player Efficiency Rating
TS. - True Shooting Percentage, which accounts for field goals, 3-point field goals and free throws
TRB. - Total Rebound Percentage, the share of available rebounds the player grabbed while on the floor
AST. - Assist Percentage, the share of teammate field goals the player assisted while on the floor
WS - Win Shares, an estimate of the number of wins contributed by a player
FG - Total number of field goals made
FGA - Total number of field goal attempts
X3P. - 3-point field goal percentage
X2P. - 2-point field goal percentage
eFG. - Effective Field Goal Percentage, a field goal percentage that gives extra weight to made 3-pointers
FT. - Free Throw Percentage
TRB - Number of rebounds
AST - Number of assists
STL - Number of steals
BLK - Number of blocks
TOV - Number of turnovers
PF - Number of personal fouls called on the player
PTS - Number of points scored
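For reference, the two shooting percentages in the list are usually defined by the formulas sketched below. This is a minimal illustration in base R, not part of the project code: the function names `efg_pct` and `ts_pct` and the sample numbers are made up, and the inputs (made 3-pointers, free throw attempts) are box-score totals that are not among the variables kept for this project.

```r
# Effective field goal percentage: a made 3-pointer counts as 1.5 field goals.
efg_pct <- function(fg, x3p, fga) {
  (fg + 0.5 * x3p) / fga
}

# True shooting percentage: points per shooting attempt, where 0.44 is the
# conventional weight applied to free throw attempts.
ts_pct <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Hypothetical single-game line: 8 field goals (2 of them 3-pointers) on
# 16 attempts, 4 free throws on 4 attempts, 22 points in total.
efg_example <- efg_pct(fg = 8, x3p = 2, fga = 16)
ts_example <- ts_pct(pts = 22, fga = 16, fta = 4)
print(c(eFG = efg_example, TS = ts_example))
```

Neither formula uses the free throw percentage directly: eFG. ignores free throws altogether, while TS. folds them in through attempts and points.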
Another adjustment is combining certain positions. The primary categories they are merged into are Center (C), Power Forward (PF), Small Forward (SF), Point Guard (PG) and Shooting Guard (SG). Some players are listed under more than one role, which would otherwise isolate them in their own category. For example, Ed Manning at one point during his career was classified as a Small Forward-Power Forward (SF-PF). In this case, we assign him the first position listed, which is Small Forward (SF from SF-PF). The generic labels Forward (F) and Guard (G) are treated as Power Forward (PF) and Point Guard (PG).
#Making appropriate changes to Player's position if they played more than one.
#Ex) Center-Forward is moved to Center in the next line
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "C-F", "C")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "C-PF", "C")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "C-SF", "C")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "F", "PF")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "F-C", "PF")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "F-G", "PF")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "G", "PG")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "G-F", "PG")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "PF-C", "PF")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "PF-SF", "PF")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "PG-SF", "PG")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "PG-SG", "PG")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "SF-PF", "SF")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "SF-PG", "SF")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "SF-SG", "SF")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "SG-PF", "SG")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "SG-PG", "SG")
Seasons_Stats$Pos = replace(Seasons_Stats$Pos, Seasons_Stats$Pos == "SG-SF", "SG")
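The same recoding rule (keep the first listed position, then map the generic F and G labels) could also be written without one line per combination. The sketch below is only an illustration in base R on a made-up vector `pos`; the names `generic`, `first_pos` and `is_generic` are hypothetical and do not appear in the project code.

```r
# Toy stand-in for the Pos column; the real column is Seasons_Stats$Pos.
pos <- c("C", "C-F", "F", "G-F", "SF-PF", "SG-PG", "PG")

# Generic labels get an explicit mapping, using the same choices as the
# replace() calls: F becomes PF and G becomes PG.
generic <- c(F = "PF", G = "PG")

# Keep only the text before the first hyphen, e.g. "SF-PF" becomes "SF".
first_pos <- sub("-.*$", "", pos)

# Swap generic labels for their primary category; leave the rest untouched.
is_generic <- first_pos %in% names(generic)
first_pos[is_generic] <- generic[first_pos[is_generic]]

print(first_pos)
```

One caveat with this approach: any label not covered by the rule (for example an unexpected spelling) passes through unchanged, so a quick `table()` of the result is still worth checking.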
After evaluating the dataset closely and doing some research, I concluded that more adjustments were needed. In the original data, there are a lot of missing values in the early years of basketball. There are two simple reasons for this. One is that some statistics simply weren’t recorded at the time; the other is rule changes that have affected how the game is played. If you go far enough back, the three-point line didn’t exist. The NBA adopted the three-point line for the 1979-80 season in an effort to reward players for taking shots from far away and to disrupt the domination of players who ruled the paint (the large rectangle under the basket). Since the rule didn’t exist before then, earlier players never had the chance to shoot three-point shots, and their three-point values are NA. I don’t believe this will affect my process. If it does, I will remove those rows and explain here what happened.
My primary goal for this dataset is to answer basketball’s most asked question: Who belongs on the Mount Rushmore of basketball? To answer it, we will rely solely on the players’ performance statistics, without any outside factors (such as attitude, win streaks, miracle plays, etc.). We will use a selection process to decide which players are considered for Mount Rushmore. We start with the overall player pool and create a standard based on the averages across all players, so we can distinguish above-average from below-average performance in a given variable. We will then shift our focus to two primary variables: Points (PTS) and the Player Efficiency Rating (PER). To complete the first round of selection, we keep the players who are above average in points and PER, and then see who appears in that group more than once. We will then take the top performers to the last round, where we select the faces for our Mount Rushmore of Basketball. We will compare the final group against each other and look at a few key aspects of their performance across their careers, such as growth, consistency and plain being the best.
I selected Player Efficiency Rating (PER) because it evaluates a player’s overall performance in a given amount of playing time. PER began when John Hollinger, an ESPN journalist, came up with the idea for a per-minute rating that adds up all of a player’s positive contributions and subtracts the negative ones over the minutes they played. The formula weights each kind of accomplishment differently and adjusts for league and team statistics. I personally believe this is a good representation of how a player is evaluated, which is why I focus on it in portions of my project.
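#Standard of all players across all seasons
#Load the packages used below: dplyr for %>% and summarize_all, ggplot2 for the plots
library(dplyr)
library(ggplot2)
#Mean
Stats_Mean = Seasons_Stats %>%
  summarize_all(mean, na.rm = TRUE)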
Stats_Mean
Year Player Pos Age Tm G GS MP PER
1 1992.595 NA NA 26.66441 NA 50.83711 23.59337 1209.72 12.47907
TS. TRB. AST. WS FG FGA X3P.
1 0.4930013 9.94921 13.00996 2.485796 195.3258 430.6458 0.2487963
X2P. eFG. FT. TRB AST STL BLK
1 0.4453431 0.4506584 0.7192786 224.6374 114.8526 39.89705 24.47026
TOV PF PTS
1 73.93983 116.3392 510.1163
#SD
Seasons_Stats %>%
summarize_all(sd, na.rm = TRUE)
Year Player Pos Age Tm G GS MP PER
1 17.42959 NA NA 3.841892 NA 26.49616 28.63239 941.1466 6.039014
TS. TRB. AST. WS FG FGA X3P.
1 0.09446901 5.040283 9.191843 3.058638 188.1144 397.6247 0.1766832
X2P. eFG. FT. TRB AST STL BLK
1 0.09980326 0.09920031 0.141824 228.1902 135.8639 38.71305 36.93508
TOV PF PTS
1 67.7138 84.79187 492.923
# Distribution of Points in all Seasons
ggplot(Seasons_Stats, aes(PTS)) +
  # after_stat(density) is the current spelling of ..density..
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
  geom_density(alpha = 0.2, fill = "blue") +
  theme_minimal() +
  labs(x = "Points", y = "Density", title = "Distribution of Points Scored across All Seasons")
The graph above illustrates how season point totals are distributed across all seasons. The distribution is strongly right-skewed: the most common season totals are around 100 points, and there are fewer and fewer players with each step to the right. This is because it is hard to score more than 1000 points in a given season, although it does happen. Remarkably, some players have crossed 2000 points in a season. I would expect these players to be candidates for the Mount Rushmore of Basketball, but we will see.
#Round 1 elimination
#Finding Number of players starting with
All_Players = Seasons_Stats %>%
distinct(Player)
#Displays all NBA Players
head(All_Players[order(All_Players$Player),])
[1] "" "A.C. Green" "A.J. Bramlett" "A.J. English"
[5] "A.J. Guyton" "A.J. Hammons"
#Subtract one for a "Ghost" Player
Number_Of_Players = nrow(All_Players)-1
#Eliminating Players below all factors
#Factors not used: Year, Player, Pos, Age, Tm, GS, and TOV. (PF is kept)
Round1_Players = Seasons_Stats %>%
filter(Stats_Mean$G < G,
Stats_Mean$MP < MP,
Stats_Mean$PER < PER,
Stats_Mean$TS. < TS.,
Stats_Mean$TRB. < TRB.,
Stats_Mean$AST. < AST.,
Stats_Mean$WS < WS,
Stats_Mean$FG < FG,
Stats_Mean$FGA < FGA,
Stats_Mean$X3P. < X3P.,
Stats_Mean$X2P. < X2P.,
Stats_Mean$eFG. < eFG.,
Stats_Mean$FT. < FT.,
Stats_Mean$TRB < TRB,
Stats_Mean$AST < AST,
Stats_Mean$STL < STL,
Stats_Mean$BLK < BLK,
Stats_Mean$PF < PF,
Stats_Mean$PTS < PTS)
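The filter above keeps a row only when it beats the all-seasons mean in every listed column. The sketch below shows that same idea in base R on a made-up data frame `toy`, looping over a vector of column names instead of writing one condition per column. The names `toy`, `stat_cols`, `above`, `keep` and `round1_toy` are hypothetical, and only three columns are included to keep the example short.

```r
# Toy stand-in for Seasons_Stats with three of the numeric columns.
toy <- data.frame(
  Player = c("A", "B", "C", "D"),
  G      = c(82, 40, 75, 10),
  PER    = c(25.1, 11.0, 18.4, NA),
  PTS    = c(2100, 300, 1200, 50),
  stringsAsFactors = FALSE
)

stat_cols <- c("G", "PER", "PTS")

# One logical vector per column: TRUE where the row beats that column's mean.
above <- lapply(stat_cols, function(col) {
  toy[[col]] > mean(toy[[col]], na.rm = TRUE)
})

# A row survives only if it is above average in every column.
keep <- Reduce(`&`, above)

# which() drops rows whose combined condition is NA, as dplyr::filter() does.
round1_toy <- toy[which(keep), ]
print(round1_toy$Player)
```

Note that a season with an NA in any filtered column can never pass, which is one reason seasons before the three-point line do not show up after this round when X3P. is part of the filter.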
#Displays Records of all players and years
#Round1_Players
#Displays all Players passing round 1
Round1_Players %>%
distinct(Player)
Player
1 Larry Bird*
2 Alex English*
3 Bob Lanier*
4 Julius Erving*
5 Alvan Adams
6 Albert King
7 Clark Kellogg
8 Alvin Robertson
9 Charles Barkley*
10 Brad Daugherty
11 Clyde Drexler*
12 Michael Jordan*
13 Jack Sikma
14 John Williams
15 Magic Johnson*
16 Karl Malone*
17 Rodney McCray
18 Larry Nance
19 Dominique Wilkins*
20 Derrick Coleman
21 Detlef Schrempf
22 Frank Brickowski
23 Larry Johnson
24 Robert Horry
25 David Robinson*
26 Patrick Ewing*
27 Christian Laettner
28 Tom Gugliotta
29 Juwan Howard
30 Arvydas Sabonis*
31 Kevin Garnett
32 Anthony Mason
33 Hakeem Olajuwon*
34 Shareef Abdur-Rahim
35 Grant Hill
36 Chris Webber
37 Steve Francis
38 Tracy McGrady
39 Glenn Robinson
40 Bonzi Wells
41 Brad Miller
42 Dirk Nowitzki
43 Paul Pierce
44 Andrei Kirilenko
45 Lamar Odom
46 LeBron James
47 Elton Brand
48 Boris Diaw
49 Mike Miller
50 Metta World
51 Carmelo Anthony
52 Luol Deng
53 Matt Barnes
54 Pau Gasol
55 Andray Blatche
56 Kevin Durant
57 David West
58 Al Horford
59 David Lee
60 Josh Smith
61 Amar'e Stoudemire
62 Dwyane Wade
63 Paul George
64 Gerald Wallace
65 Tim Duncan
66 Paul Millsap
67 Nicolas Batum
68 Spencer Hawes
69 Kevin Love
70 DeMarcus Cousins
71 Blake Griffin
72 Jared Sullinger
73 Giannis Antetokounmpo
74 Will Barton
75 Marc Gasol
76 Nikola Jokic
77 Patrick Beverley
78 James Harden
79 Kelly Olynyk
80 Julius Randle
81 Karl-Anthony Towns
82 Russell Westbrook
We started with a total of 3921 Players, and after round 1 of the elimination process we are left with the 82 Players listed above. That is a big drop, from several thousand players to fewer than a hundred. The factors not included in this process were the player’s name, the year, their age, the team they played for, Games Started (GS) and Turnovers (TOV). I left out turnovers because top-performing players get more playing time and therefore more opportunities to turn the ball over than the average player, so the count is not useful at this stage. I also removed Games Started because otherwise the majority of the remaining players would be starters (the players a team begins the game with). It makes sense for a team’s starters to be its top performers, but to avoid favoring solely starters I wanted to remove it. This is also why I kept both Games (G) and Minutes Played (MP): together they are sufficient to filter for players who were critical to their teams, rather than just looking at who started.
#Graph displaying PER versus Points of the players remaining after round 1
Round1_Players %>%
filter(! is.na(PER)) %>%
ggplot(aes(PER, PTS, color = Pos)) +
geom_point() +
theme_bw() +
labs(x="Player Efficiency Rating", y="Number of Points scored by Player", title="Player Efficiency Rating versus Points")
We now begin round 2. Above is a graph of each player season’s Player Efficiency Rating against the points scored in that season, with color indicating the player’s position. There is a clear trend: the more points a player scored, the higher the player efficiency rating tends to be. Centers appear toward the bottom of this trend, while Power Forwards and Small Forwards occupy the same areas. There are very few Shooting Guards and, I believe, four Point Guards on this graph, which suggests that the elite players are mainly from the other three positions. To eliminate more players, I would like to focus on the portion of players with both a high Player Efficiency Rating and high points scored.
(I had a thought here that I could see who showed up multiple times in the graph. So I am currently deciding whether to take the people who showed up multiple times, continue with this idea of a portion, or do a combination of both.)
It was decided that we will evaluate the players based on the top 4x4 sections in the top right corner of the graph. The faces of Mount Rushmore should be the top basketball players, which weighs heavily on the points scored by the player and their Player Efficiency Rating. The “Can’t win if you don’t score” mentality fits this filter perfectly. Next, we will select those who appear more than once after this filtration.
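The draft does not show how the top-right region is cut out of the round 1 data, so the following is only a sketch of one way it could be done in base R: pick a minimum PER and a minimum PTS and keep the seasons that clear both. The cutoff values `per_min` and `pts_min` are placeholders chosen for the toy data, not the thresholds used in the project, and the data frame `round1_toy` is made up.

```r
# Toy stand-in for Round1_Players (one row per player season).
round1_toy <- data.frame(
  Player = c("A", "A", "B", "C", "C"),
  PER    = c(27.3, 24.0, 19.5, 26.1, 25.4),
  PTS    = c(2300, 1900, 2100, 2250, 2050),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL cutoffs for the top-right region of the PER vs. PTS graph;
# the real boundaries would be read off the plot's grid.
per_min <- 25
pts_min <- 2000

# Keep only the seasons that clear both cutoffs.
round2_toy <- subset(round1_toy, PER >= per_min & PTS >= pts_min)
print(round2_toy)
```

A player can contribute several rows here, one for each season that lands in the region, which is what makes the later "appeared more than once" count possible.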
#Round 2 Elimination
#Seeing who had outstanding multiple seasons performances given the range above PTS and PER
table(Round2_Players['Player'])
Amar'e Stoudemire Carmelo Anthony Charles Barkley*
1 2 3
Chris Webber Clyde Drexler* David Robinson*
1 1 3
DeMarcus Cousins Dirk Nowitzki Dominique Wilkins*
1 5 1
Dwyane Wade Elton Brand Giannis Antetokounmpo
1 1 1
Grant Hill Hakeem Olajuwon* James Harden
1 1 1
Julius Erving* Karl-Anthony Towns Karl Malone*
1 1 7
Kevin Durant Kevin Garnett Kevin Love
6 4 1
Larry Bird* LeBron James Magic Johnson*
5 6 1
Michael Jordan* Paul Pierce Russell Westbrook
3 2 1
Tracy McGrady
2
#Removing anyone that was displayed only once
#Player is referenced directly; filter() looks the column up in Round2_Players
once_only = c("Amar'e Stoudemire", "Chris Webber", "Clyde Drexler*", "DeMarcus Cousins",
  "Dominique Wilkins*", "Dwyane Wade", "Elton Brand", "Giannis Antetokounmpo",
  "Grant Hill", "Hakeem Olajuwon*", "James Harden", "Julius Erving*",
  "Karl-Anthony Towns", "Kevin Love", "Magic Johnson*", "Russell Westbrook")
Round2_Players = Round2_Players %>%
  filter(!Player %in% once_only)
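Typing out the names works, but the same step can be derived from the counts themselves, which avoids missing a name if the earlier filters change. The sketch below uses base R on a made-up data frame; `round2_toy`, `season_counts`, `repeat_names` and `round2_repeat` are hypothetical names used only for this illustration.

```r
# Toy stand-in for Round2_Players (one row per qualifying season).
round2_toy <- data.frame(
  Player = c("A", "C", "C", "D", "E", "E", "E"),
  PTS    = c(2300, 2250, 2050, 2400, 2100, 2200, 2500),
  stringsAsFactors = FALSE
)

# Number of qualifying seasons per player.
season_counts <- table(round2_toy$Player)

# Names of players who appear more than once.
repeat_names <- names(season_counts[season_counts > 1])

# Keep every season belonging to those players.
round2_repeat <- round2_toy[round2_toy$Player %in% repeat_names, ]
print(unique(round2_repeat$Player))
```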
#Number of Players who Remains
#nrow(Round2_Players)
#Those that have survived Round 2
Round2_Players %>% distinct(Player)
Player
1 Larry Bird*
2 Charles Barkley*
3 Michael Jordan*
4 Karl Malone*
5 David Robinson*
6 Kevin Garnett
7 Tracy McGrady
8 Dirk Nowitzki
9 Paul Pierce
10 LeBron James
11 Kevin Durant
12 Carmelo Anthony
Narrowing the pool down to just 12 players is significant considering the names that remain. The names that are still here aren’t what makes this surprising; it’s the names that didn’t make it through the rounds. Legendary names like Bill Russell, Kareem Abdul-Jabbar, Magic Johnson, and Kobe Bryant did not make it to this point. Other names such as Michael Jordan, LeBron James and Larry Bird have survived, though. Let’s continue this elimination process. I had originally planned to eliminate players based on their performance against others in the same position. However, there are only 12 players remaining and that no longer seems necessary, so I will skip to comparing the remaining players against each other.
#Final Elimination before selection
#Finding the top 5 rows (player seasons) in each column
#Then combining them into one dataframe with rbind
top4s_list = Round2_Players %>% arrange(desc(PER)) %>% head(5)
a = Round2_Players %>% arrange(desc(TS.)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(TRB.)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(AST.)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(WS)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(FG)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(FGA)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(X3P.)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(X2P.)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(eFG.)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(FT.)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(TRB)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(AST)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(STL)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(BLK)) %>% head(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(TOV)) %>% tail(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(PF)) %>% tail(5)
top4s_list = rbind(top4s_list,a)
a = Round2_Players %>% arrange(desc(PTS)) %>% head(5)
top4s_list = rbind(top4s_list,a)
#Took the tails of both Turnovers (TOV) and Personal Fouls (PF) because a player would like to have the least amount of those as possible, hence them being tails instead of heads.
#Counting how many times a player name appeared in those top 5's
table(top4s_list['Player'])
Carmelo Anthony Charles Barkley* David Robinson* Dirk Nowitzki
1 8 7 7
Karl Malone* Kevin Durant Kevin Garnett Larry Bird*
4 10 7 8
LeBron James Michael Jordan* Tracy McGrady
18 19 1
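The tally above repeats the same three steps for every column: sort, take five rows, append. As a hedged sketch of the same counting scheme in base R, the loop below does it for a vector of column names, with a separate vector for columns where lower is better. Everything here is illustrative: the data frame `toy`, the helper `top_names`, and the choice of `n_top = 2` (to suit the tiny toy data) are assumptions, not the project's code.

```r
# Toy stand-in for Round2_Players (one row per player season).
toy <- data.frame(
  Player = c("A", "A", "B", "B", "C", "C"),
  PER    = c(30, 28, 25, 24, 20, 19),
  PTS    = c(2000, 2500, 2400, 1800, 2600, 1500),
  TOV    = c(300, 280, 150, 160, 200, 210),
  stringsAsFactors = FALSE
)

higher_is_better <- c("PER", "PTS")
lower_is_better  <- c("TOV")
n_top <- 2  # the draft uses 5; 2 suits this small example

# Names attached to the best n rows of one column.
top_names <- function(col, df, n, decreasing) {
  ranked <- df[order(df[[col]], decreasing = decreasing), ]
  head(ranked$Player, n)
}

picks <- c(
  unlist(lapply(higher_is_better, top_names, df = toy, n = n_top, decreasing = TRUE)),
  unlist(lapply(lower_is_better, top_names, df = toy, n = n_top, decreasing = FALSE))
)

# How many times each player appears across all the top lists.
tally <- table(picks)
print(tally)
```

Because the rows are seasons rather than players, one player can fill several of the top slots in a single column, which is how a player can reach a count higher than the number of columns.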
After taking the top 5 of each column and counting how often each name appears, it can be said that the 3 people who should definitely be on the Mount Rushmore of Basketball are Michael Jordan, LeBron James, and Kevin Durant. The last spot comes down to Charles Barkley and Larry Bird, who are tied. I looked at the top 5 of each column instead of the top 4 (the number of heads) to leave room for a comeback in this counting system.
Considering that Larry Bird had 2 more seasons in the 4x4 section than Charles Barkley, he is our 4th head. Statistically, the Mount Rushmore of Basketball should be Michael Jordan, LeBron James, Kevin Durant and Larry Bird.
Here we will evaluate how a player’s position can affect the points scored.
#Made a distinction between the decades
Seasons_Stats_1970 = Seasons_Stats %>%
filter(Year >= 1970, Year < 1980)
Seasons_Stats_1980 = Seasons_Stats %>%
filter(Year >= 1980, Year < 1990)
Seasons_Stats_1990 = Seasons_Stats %>%
filter(Year >= 1990, Year < 2000)
Seasons_Stats_2000 = Seasons_Stats %>%
filter(Year >= 2000, Year < 2010)
Seasons_Stats_2010 = Seasons_Stats %>%
filter(Year >= 2010, Year < 2020)
#Plotted decades into a histogram to see if they had changed over time
#binwidth was inside aes(), where ggplot2 ignores it and falls back to 30 bins; bins = 30 makes that explicit
ggplot(Seasons_Stats_1970) +
  geom_histogram(aes(x = PTS, fill = Pos), bins = 30) +
  labs(x = "Points", y = "Number of Players", title = "Distribution of Points Scored in the 1970s by Position") +
  facet_wrap(vars(Pos)) +
  xlim(0, 2500) +
  ylim(0, 250)
ggplot(Seasons_Stats_1980) +
  geom_histogram(aes(x = PTS, fill = Pos), bins = 30) +
  labs(x = "Points", y = "Number of Players", title = "Distribution of Points Scored in the 1980s by Position") +
  facet_wrap(vars(Pos)) +
  xlim(0, 2500) +
  ylim(0, 250)
ggplot(Seasons_Stats_1990) +
  geom_histogram(aes(x = PTS, fill = Pos), bins = 30) +
  labs(x = "Points", y = "Number of Players", title = "Distribution of Points Scored in the 1990s by Position") +
  facet_wrap(vars(Pos)) +
  xlim(0, 2500) +
  ylim(0, 250)
ggplot(Seasons_Stats_2000) +
  geom_histogram(aes(x = PTS, fill = Pos), bins = 30) +
  labs(x = "Points", y = "Number of Players", title = "Distribution of Points Scored in the 2000s by Position") +
  facet_wrap(vars(Pos)) +
  xlim(0, 2500) +
  ylim(0, 250)
ggplot(Seasons_Stats_2010) +
  geom_histogram(aes(x = PTS, fill = Pos), bins = 30) +
  labs(x = "Points", y = "Number of Players", title = "Distribution of Points Scored in the 2010s by Position") +
  facet_wrap(vars(Pos)) +
  xlim(0, 2500) +
  ylim(0, 250)
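The five decade data frames and five plots could also come from a single decade label on each row. The sketch below is one possible way to do that, shown on a made-up data frame `toy`; the column name `Decade` and the object names are hypothetical. The plotting part only runs if ggplot2 is installed, and it also sets the legend title to "Position".

```r
# Toy stand-in for Seasons_Stats.
toy <- data.frame(
  Year = c(1969, 1970, 1979, 1980, 1995, 2009, 2017),
  Pos  = c("C", "PG", "SF", "C", "SG", "PF", "PG"),
  PTS  = c(900, 1200, 300, 2100, 150, 800, 1700),
  stringsAsFactors = FALSE
)

# Label each row with the first year of its decade, e.g. 1995 becomes 1990.
toy$Decade <- floor(toy$Year / 10) * 10

# Keep 1970 onward, as in the decade filters above.
toy <- toy[toy$Decade >= 1970, ]

# Number of rows that fall in each decade.
decade_sizes <- table(toy$Decade)
print(decade_sizes)

# Optional: one faceted figure instead of five. Skipped when ggplot2 is absent.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = PTS, fill = Pos)) +
    geom_histogram(bins = 30) +
    facet_grid(rows = vars(Decade), cols = vars(Pos)) +
    labs(x = "Points", y = "Number of Players", fill = "Position")
}
```

Putting every decade in one grid gives all panels the same axes automatically, which makes the decade-to-decade comparison easier to read than separate figures with manual limits.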
I decided to look at the decades from 1970 onward because the three-point line arrived at the start of the 1980s (the 1979-80 season), so going further back wouldn’t provide much additional information. The 1970s will represent the earlier years. Looking at all the graphs, every position is right-skewed. The number of players must have increased over the decades, since the bar on the far left grew each decade from the 1980s to the 2000s. One small difference is that after the 1990s, the Center position no longer has a bar beyond 2000 points in a season, when a few players reached that mark before. Other than that, it is difficult to identify any more changes. Something to point out is that within any given decade, the shape of the points distribution is similar across positions. So I would have to say that position doesn’t have an effect on how many points a player scored, since there’s little difference in the distributions other than maybe the tail for the Center position, and I personally feel that is not enough evidence to claim an effect.
My wording is currently kind of sloppy, but I will revise it into more formal and practical sentences. The graphs are my best work, but I still need to fix the legend to state Position instead of just Pos. With this, it might be enough to hit the 8 page requirement, but I’m still not completely certain. I’m sure this is not what a final rough draft looks like, but my final project submission will be much cleaner.
This covers much of what I had originally planned to do. I still want to add the section about the effective field goal percentage and see if it is affected by age.