The dataset that we work with today is a Kaggle dataset about NBA players. Specifically, the entries each give an NBA player within a specific season, from 1996 to 2021. We will focus on seasons from 2005 to 2021 (inclusive), analyzing the data to answer specific questions about player performance, scoring trends, etc.
Here is a sample of some rows and columns of the dataset:
player_name team_abbreviation age player_height player_weight college
3973 Tyronn Lue ATL 29 182.88 80.73938 Nebraska
3974 T.J. Ford MIL 23 182.88 74.84268 Texas
3975 Tayshaun Prince DET 26 205.74 97.52228 Kentucky
3976 Terence Morris ORL 27 205.74 100.24383 Maryland
3977 Theo Ratliff POR 33 208.28 106.59412 Wyoming
It is important to understand that the same athlete may appear multiple times if they were present in multiple seasons! This is incredibly useful for tracking statistics both over time and by athlete.
1. Age and Performance: How does the performance change with players’ age?
A large indicator of professional sports fit-ness is the age of the players. Given that “old age” for am NBA athlete is around 35-40 years old, it would be interesting to understand how a player’s performance tracks as they age.
Above is a plot of the average number of points per game for each age represented in the dataset. Note (for future age discussions, as well) that there are only 3 entries in which the athlete is 43 years old or older. As such, extrapolations cannot effectively be made about this oldest age bracket, as there is very limited data.
However, we can still see a notable (unsteady) incline in average points/game in ages less than 30, and an decline in points for ages greater than 30. Maybe the incline can be attributed to experience gain while the decline can be attributed to a lessening athleticism.
Code
ggplot(data = nba, aes(x = age, y = pts)) +geom_density_2d_filled(bins =7) +scale_fill_manual(values =colorRampPalette(c("grey92", "blue4"))(7),name ="Density" ) +labs(title ="Age vs. Average Points/Game",subtitle ="2d kernel density estimate", x ="Age (years)",y ="Average points" ) +geom_point(data = nba[nba$player_name =="LeBron James", ], aes(x = age, y = pts, color ="LeBron") ) +geom_point(data = nba[nba$player_name =="Reggie Bullock", ], aes(x = age, y = pts, color ="Reggie") ) +scale_color_manual(name ="Players",values =c(LeBron ="red3", Reggie ="orange2") ) +theme_classic()
For convenience, here is a kernel density estimate of the joint distribution of age and average points/game. As a reference, the statistics of LeBron James are added (he is often going to be an outlier in these plots), as well as those of Reggie Bullock, a more “typical” player (according to a Reddit post that I found).
Same deal here, although the decline in rebounds seems to begin earlier and manifest as a more gradual decrease of average rebounds/game, as opposed to the sharper decline in overall points that we saw before.
Code
ggplot(data = nba, aes(x = age, y = reb)) +geom_density_2d_filled(bins =7) +scale_fill_manual(values =colorRampPalette(c("grey92", "purple4"))(7),name ="Density" ) +labs(title ="Age vs. Average Rebounds/Game",subtitle ="2d kernel density estimate", x ="Age (years)",y ="Average rebounds" ) +geom_point(data = nba[nba$player_name =="LeBron James", ], aes(x = age, y = reb, color ="LeBron") ) +geom_point(data = nba[nba$player_name =="Reggie Bullock", ], aes(x = age, y = reb, color ="Reggie") ) +scale_color_manual(name ="Players",values =c(LeBron ="red3", Reggie ="orange2") ) +theme_classic()
As LeBron ages, the number of rebounds he got on average per game stayed roughly the same. However, Reggie improved over time! This is unexpected given the earlier plot that shows gradual decrease in rebounds/game with age, but good for him.
Lastly, we have assists/game by age. Unlike the first two indicators of performance, this one remains largely constant (on average) for most players, at roughly 1.5-2 assists/game across most ages. The initial uptick is likely explainable by an experience increase, and there is not enough data to explain the drop at ages of 40+.
Code
ggplot(data = nba, aes(x = age, y = ast)) +geom_density_2d_filled(bins =7) +scale_fill_manual(values =colorRampPalette(c("grey92", "green4"))(7),name ="Density" ) +labs(title ="Age vs. Average Assists/Game",subtitle ="2d kernel density estimate", x ="Age (years)",y ="Average assists" ) +geom_point(data = nba[nba$player_name =="LeBron James", ], aes(x = age, y = ast, color ="LeBron") ) +geom_point(data = nba[nba$player_name =="Reggie Bullock", ], aes(x = age, y = ast, color ="Reggie") ) +scale_color_manual(name ="Players",values =c(LeBron ="red3", Reggie ="orange2") ) +theme_classic()
And once more, we see LeBron James up high with at least 6-7 assists/game on average (regardless of age), and our benchmark Reggie with a somewhat modest increase.
In conclusion, performance of a player (especially points scored and rebounds per game) typically seems to improve with age for most players until their late 20s, possibly due to an growing level of skill and experience gain. From then, performance largely declines with continuing age, until most careers end by the time the players turn 40.
Of course, some outliers (such as LeBron James) do exist and continue to remain competitive, even as they grow older.
2. Scoring Trends: Who were the top scorers each season, and how did their points per game vary?
Looking at the individual top scorers for each season will take a long time. So instead, some examples for a few seasons will be given, then some aggregate statistics. Lets look at the top 3 scorers for the 2005, 2010, and 2015 seasons:
player_name pts season
1 Kobe Bryant 35.4 2005-06
2 Allen Iverson 33.0 2005-06
3 LeBron James 31.4 2005-06
Code
top2010[, c("player_name", "pts", "season")]
player_name pts season
1 Kevin Durant 27.7 2010-11
2 LeBron James 26.7 2010-11
3 Carmelo Anthony 25.6 2010-11
Code
top2015[, c("player_name", "pts", "season")]
player_name pts season
1 Stephen Curry 30.1 2015-16
2 James Harden 29.0 2015-16
3 Kevin Durant 28.2 2015-16
So, generally, we are looking at people that are scoring between 25 and 35 points/game on average. But we don’t just care about those 3 seasons, we care about all! What would the series of top 3 scorers look like across all 15 seasons?
Code
allseason <-data.frame()for (i in2005:2021) { currseason <- nba |>filter(as.numeric(substr(season, 1, 4)) == i) |>arrange(desc(pts)) |>filter(row_number() <4) |>select(player_name, pts, season) currseason$year <- i currseason$rank <-as.factor(1:3) allseason <-rbind(allseason, currseason)}ggplot(data = allseason, aes(x = year, y = pts, color = rank, group = rank)) +geom_line() +labs(title ="Average points/game for top 3 season scorers", x ="Year", y ="Average points" ) +theme_classic()
It’s interesting to see a dip in average points/game among the best season scorers from 2005 onward, until 2018 again. The number 1 scorer in 2018 was James Harden (with 36.1 points/game), maybe he just had a good year. Same for Kobe Bryant in 2005 (with 35.4 points/game).
There seems to be a somewhat strong correlation between the top three scorers for each season:
And indeed, there measurably is, according to the above correlation matrix. I would expect similar averages year-by-year, but not necessarily correlation between those averages. It’s probably because, in differing seasons, the same athlete can appear in different positions in the top 3 players, allowing cross-contamination of the series.
3. Rookie Impact: What is the average performance of rookie players compared to players in their prime years?
We can define “rookie” to be a player who was first drafted within the season of focus. Then, a player in their “prime” to be a player that is at least 1 year since their drafting year, but no more than 4 (this is a bit unfair, but on average, each player is only around for about 4.5 seasons. This works as an estimate).
Analysis of rookie vs. experienced player performance
Of course we are going to see better performance among the experienced population of players, as defined above. There is a selection bias here in which only the players that survived their rookie season actually made it to the “experienced” pool.
That said, we can partially counter this effect by including only players who appear in at least 4 seasons. This restricts the total player pool to relatively “good” players, but the trajectories by player should still be similar:
The numbers do not change much, and the effect is still the same. The “experienced” players are typically outperforming the rookies by a decent margin. One could argue that this idea reasonably extends to the player pool as a whole
These metrics are actually closer when split by experience level, but still notably different. it’s also worth noting how high the variance of the data is; in all metrics looked at so far, there have been many outliers!
We can safely say that it is likely that player experience (especially rookie vs. non-rookie) does have a meaningful effect on the performance of a player. A more rigorous approach would be to use some statistical test (2-sample t test especially) to demonstrate this phenomenon, but a visual examination is sufficient for now.
As for specific rookie players who scored a disproportionately high number of points/game, we can find that too:
player_name pts team_abbreviation age draft_year
1 Zion Williamson 22.5 NOP 19 2019
2 Luka Doncic 21.2 DAL 20 2018
3 Donovan Mitchell 20.5 UTA 21 2017
4 Kevin Durant 20.3 SEA 19 2007
5 Tyreke Evans 20.1 SAC 20 2009
6 Anthony Edwards 19.3 MIN 19 2020
7 Trae Young 19.1 ATL 20 2018
8 Damian Lillard 19.0 POR 22 2012
9 O.J. Mayo 18.5 MEM 21 2008
10 Kyrie Irving 18.5 CLE 20 2011
4. Career Longevity: What is the average career length of NBA players during this period?
On the topic of rookies vs. experienced players, what is a typical career length for players who’s careers began and ended during this window? How many games will they play?
We need to consider that some players’ careers may not yet have ended by the final year of examination, 2021.
Analysis of NBA career length
Code
overs <- nba |>group_by(player_name) |>summarise(first_season =min(season.num),last_season =max(season.num),score =mean(pts)) |>filter(first_season >=2005, last_season <=2020)overs$career_length <- overs$last_season - overs$first_season +1hist(x = overs$career_length,main ="Histogram of NBA career lengths (n = 1247)",xlab ="Length (seasons)",ylab ="Count")abline(v =mean(overs$career_length), lwd =2, col ="red3")abline(v =median(overs$career_length), lwd =2, col ="purple3")legend(x ="topright",fill =c("red3", "purple3"),legend =c(paste0("Mean: ", trunc(mean(overs$career_length*10))/10),paste0("Median: ", median(overs$career_length)) ),bty ="n")
We can further analyze career length for “high scorers”:
Code
overs <- overs |>arrange(desc(score))hist(x = overs[1:200, ]$career_length,main ="Histogram of NBA career lengths (n = 200)",breaks =max(overs[1:200, ]$career_length) -min(overs[1:200, ]$career_length) +1,xlim =c(min(overs[1:200, ]$career_length), max(overs[1:200, ]$career_length)),xlab ="Length (seasons)",ylab ="Count")abline(v =mean(overs[1:200, ]$career_length), lwd =2, col ="red3")abline(v =median(overs[1:200, ]$career_length), lwd =2, col ="purple3")legend(x ="topright",fill =c("red3", "purple3"),legend =c(paste0("Mean: ", trunc(mean(overs[1:200, ]$career_length*10))/10),paste0("Median: ", median(overs[1:200, ]$career_length)) ),bty ="n")text(x =8,y =26.5,labels ="2005 - 2020 careers of top 200 scorers",xpd =TRUE)
It seem that those who score within the top ~15% of all players tend to have careers that are almost twice as long on average, and three times the length on the median. This makes sense, given that overall performance is the most important aspect of an NBA athlete, and average points scored/game is a strong indicator of high performance.
Note that there are other performance metrics as well; a player who plays defensively may not score as highly on average, but could still have a long ands valuable NBA career.
There seems to be a clear pattern: if you want to stay in the NBA, perform highly (or otherwise score a lot!).
5. All-Star Selections: Which players were selected as All-Stars in the most seasons during this period?
The data for NBA All-star teams was gotten from somewhere on Google, then incorporated into the original dataset. During this period, there were 113 distinct players selected to be All-stars. Some players were selected as All-stars a lot, while most others were only a few times:
Top 10 selected All-star players from 2005 to 2020
# A tibble: 10 × 2
player_name count
<chr> <int>
1 LeBron James 17
2 Dwyane Wade 12
3 Chris Paul 11
4 Kobe Bryant 11
5 Carmelo Anthony 10
6 Chris Bosh 10
7 Dirk Nowitzki 10
8 Kevin Durant 10
9 James Harden 9
10 Russell Westbrook 9
There are, of course, many ways to approach analyzing the All-star players vs. typical players. The easiest way (again) would just be to compare performance metrics of these All-star players compared to others:
Performance metric comparison of All-star and non All-star players
Generally, the metrics of the players selected as All-stars are going to be significantly higher than those not.
Code
boxplot(data = nba, pts ~ allstar,names =c("All-star", "Not All-star"),main ="Boxplots of points/game vs. class",xlab ="Class",ylab ="Average points/game")
Even still, the boxplot above reveals that there are many outliers that do score average points/game within that All-star range. One would figure that performance alone may not dictate All-star lineups as there are many facets of an NBA player other than how they play.
Source Code
---title: "NBA Homework"author: "Griffin Lessinger"format: html: code-fold: true code-tools: true toc: trueeditor: visual---### IntroductionThe dataset that we work with today is a Kaggle dataset about NBA players. Specifically, the entries each give an NBA player within a specific season, from 1996 to 2021. We will focus on seasons from 2005 to 2021 (inclusive), analyzing the data to answer specific questions about player performance, scoring trends, etc.Here is a sample of some rows and columns of the dataset:```{r, message = FALSE, warning = FALSE}library(lubridate)library(ggplot2)library(dplyr)library(readxl)nba <-read.csv("/home/user/School/STAT360/Project 3 (NBA)/all_seasons.csv")nba <- nba[as.numeric(substr(nba$season, 1, 4)) >=2005&as.numeric(substr(nba$season, 1, 4)) <=2021, ]nba[1:5, 2:7]```It is important to understand that the same athlete may appear multiple times if they were present in multiple seasons! This is incredibly useful for tracking statistics both over time *and* by athlete.### 1. Age and Performance: How does the performance change with players’ age?A large indicator of professional sports fit-ness is the age of the players. Given that "old age" for am NBA athlete is around 35-40 years old, it would be interesting to understand how a player's performance tracks as they age.::: {.callout collapse=true title="Average points/game by age"}```{r, cache = TRUE}avebyage.pts <-tapply(nba$pts, nba$age, mean)plot(x =18:44,y = avebyage.pts,frame.plot =FALSE,type ="l",main ="Average points/game by age",xlab ="Age (years)",ylab ="Points/game")```Above is a plot of the average number of points per game for each age represented in the dataset. Note (for future age discussions, as well) that there are only 3 entries in which the athlete is 43 years old or older. As such, extrapolations cannot effectively be made about this oldest age bracket, as there is *very* limited data.However, we can still see a notable (unsteady) incline in average points/game in ages less than 30, and an decline in points for ages greater than 30. Maybe the incline can be attributed to experience gain while the decline can be attributed to a lessening athleticism.```{r, cache = TRUE}ggplot(data = nba, aes(x = age, y = pts)) +geom_density_2d_filled(bins =7) +scale_fill_manual(values =colorRampPalette(c("grey92", "blue4"))(7),name ="Density" ) +labs(title ="Age vs. Average Points/Game",subtitle ="2d kernel density estimate", x ="Age (years)",y ="Average points" ) +geom_point(data = nba[nba$player_name =="LeBron James", ], aes(x = age, y = pts, color ="LeBron") ) +geom_point(data = nba[nba$player_name =="Reggie Bullock", ], aes(x = age, y = pts, color ="Reggie") ) +scale_color_manual(name ="Players",values =c(LeBron ="red3", Reggie ="orange2") ) +theme_classic()```For convenience, here is a kernel density estimate of the joint distribution of age and average points/game. As a reference, the statistics of LeBron James are added (he is often going to be an outlier in these plots), as well as those of Reggie Bullock, a more "typical" player (according to a Reddit post that I found).:::::: {.callout collapse=true title="Average rebounds/game by age"}```{r, cache = TRUE}avebyage.reb <-tapply(nba$reb, nba$age, mean)plot(x =18:44,y = avebyage.reb,frame.plot =FALSE,type ="l",main ="Average rebounds/game by age",xlab ="Age (years)",ylab ="Rebounds/game")```Same deal here, although the decline in rebounds seems to begin earlier and manifest as a more gradual decrease of average rebounds/game, as opposed to the sharper decline in overall points that we saw before.```{r, cache = TRUE}ggplot(data = nba, aes(x = age, y = reb)) +geom_density_2d_filled(bins =7) +scale_fill_manual(values =colorRampPalette(c("grey92", "purple4"))(7),name ="Density" ) +labs(title ="Age vs. Average Rebounds/Game",subtitle ="2d kernel density estimate", x ="Age (years)",y ="Average rebounds" ) +geom_point(data = nba[nba$player_name =="LeBron James", ], aes(x = age, y = reb, color ="LeBron") ) +geom_point(data = nba[nba$player_name =="Reggie Bullock", ], aes(x = age, y = reb, color ="Reggie") ) +scale_color_manual(name ="Players",values =c(LeBron ="red3", Reggie ="orange2") ) +theme_classic()```As LeBron ages, the number of rebounds he got on average per game stayed roughly the same. However, Reggie improved over time! This is unexpected given the earlier plot that shows gradual decrease in rebounds/game with age, but good for him.:::::: {.callout collapse=true title="Average assists/game by age"}```{r, cache = TRUE}avebyage.ast <-tapply(nba$ast, nba$age, mean)plot(x =18:44,y = avebyage.ast,frame.plot =FALSE,type ="l",main ="Average assists/game by age",xlab ="Age (years)",ylab ="Assists/game")```Lastly, we have assists/game by age. Unlike the first two indicators of performance, this one remains largely constant (on average) for most players, at roughly 1.5-2 assists/game across most ages. The initial uptick is likely explainable by an experience increase, and there is not enough data to explain the drop at ages of 40+.```{r, cache = TRUE}ggplot(data = nba, aes(x = age, y = ast)) +geom_density_2d_filled(bins =7) +scale_fill_manual(values =colorRampPalette(c("grey92", "green4"))(7),name ="Density" ) +labs(title ="Age vs. Average Assists/Game",subtitle ="2d kernel density estimate", x ="Age (years)",y ="Average assists" ) +geom_point(data = nba[nba$player_name =="LeBron James", ], aes(x = age, y = ast, color ="LeBron") ) +geom_point(data = nba[nba$player_name =="Reggie Bullock", ], aes(x = age, y = ast, color ="Reggie") ) +scale_color_manual(name ="Players",values =c(LeBron ="red3", Reggie ="orange2") ) +theme_classic()```And once more, we see LeBron James up high with at least 6-7 assists/game on average (regardless of age), and our benchmark Reggie with a somewhat modest increase.:::In conclusion, performance of a player (especially points scored and rebounds per game) typically seems to improve with age for most players until their late 20s, possibly due to an growing level of skill and experience gain. From then, performance largely declines with continuing age, until most careers end by the time the players turn 40.Of course, some outliers (such as LeBron James) do exist and continue to remain competitive, even as they grow older.### 2. Scoring Trends: Who were the top scorers each season, and how did their points per game vary?Looking at the individual top scorers for each season will take a long time. So instead, some examples for a few seasons will be given, then some aggregate statistics. Lets look at the top 3 scorers for the 2005, 2010, and 2015 seasons:::: {.callout collapse=true title="Top Scorers (2005, 2010, 2015)"}```{r, cache = TRUE}top2005 <- nba |>filter(as.numeric(substr(season, 1, 4)) ==2005) |>arrange(desc(pts)) |>filter(row_number() <4)top2010 <- nba |>filter(as.numeric(substr(season, 1, 4)) ==2010) |>arrange(desc(pts)) |>filter(row_number() <4)top2015 <- nba |>filter(as.numeric(substr(season, 1, 4)) ==2015) |>arrange(desc(pts)) |>filter(row_number() <4)top2005[, c("player_name", "pts", "season")]top2010[, c("player_name", "pts", "season")]top2015[, c("player_name", "pts", "season")]```:::So, generally, we are looking at people that are scoring between 25 and 35 points/game on average. But we don't just care about those 3 seasons, we care about all! What would the series of top 3 scorers look like across all 15 seasons?```{r, cache = TRUE}allseason <-data.frame()for (i in2005:2021) { currseason <- nba |>filter(as.numeric(substr(season, 1, 4)) == i) |>arrange(desc(pts)) |>filter(row_number() <4) |>select(player_name, pts, season) currseason$year <- i currseason$rank <-as.factor(1:3) allseason <-rbind(allseason, currseason)}ggplot(data = allseason, aes(x = year, y = pts, color = rank, group = rank)) +geom_line() +labs(title ="Average points/game for top 3 season scorers", x ="Year", y ="Average points" ) +theme_classic()```It's interesting to see a dip in average points/game among the best season scorers from 2005 onward, until 2018 again. The number 1 scorer in 2018 was James Harden (with 36.1 points/game), maybe he just had a good year. Same for Kobe Bryant in 2005 (with 35.4 points/game).There seems to be a somewhat strong correlation between the top three scorers for each season:```{r, cache = TRUE}cormat <-matrix(data =c(allseason[allseason$rank ==1, ]$pts, allseason[allseason$rank ==2, ]$pts, allseason[allseason$rank ==3, ]$pts),nrow =17)cor(cormat)```And indeed, there measurably is, according to the above correlation matrix. I would expect similar averages year-by-year, but not necessarily correlation between those averages. It's probably because, in differing seasons, the same athlete can appear in different positions in the top 3 players, allowing cross-contamination of the series.### 3. Rookie Impact: What is the average performance of rookie players compared to players in their prime years?We can define "rookie" to be a player who was first drafted within the season of focus. Then, a player in their "prime" to be a player that is at least 1 year since their drafting year, but no more than 4 (this is a bit unfair, but on average, each player is only around for about 4.5 seasons. This works as an estimate).:::{.callout collapse=true title="Analysis of rookie vs. experienced player performance"}```{r, warning = FALSE}nba$draft_year.num <-ifelse(nba$draft_year =="Undrafted", NA, as.numeric(nba$draft_year))nba$season.num <-as.numeric(substr(nba$season, 1, 4))nba <- nba |>mutate(status =case_when( season.num == draft_year.num ~"Rookie", season.num >= draft_year.num +1& season.num <= draft_year.num +3~"Experienced", ) )boxplot(data =na.omit(nba), pts ~ status,main ="Average points/game by experience",xlab ="Player status",ylab ="Average points")```Of course we are going to see better performance among the experienced population of players, as defined above. There is a selection bias here in which only the players that *survived* their rookie season actually made it to the "experienced" pool.That said, we can partially counter this effect by including only players who appear in at least 4 seasons. This restricts the total player pool to relatively "good" players, but the trajectories by player should still be similar:```{r}boxplot(data =na.omit(nba)|>group_by(player_name) |>filter(n() >=4) |>ungroup(), pts ~ status,main ="Average points/game by experience",xlab ="Player status",ylab ="Average points")text(x =1.5,y =33,labels ="Only players present in >=4 seasons",xpd =TRUE)```The numbers do not change much, and the effect is still the same. The "experienced" players are typically outperforming the rookies by a decent margin. One could argue that this idea reasonably extends to the player pool as a whole:::{.callout collapse=true title="Other metric comparisons by status"}```{r}boxplot(data =na.omit(nba)|>group_by(player_name) |>filter(n() >=4) |>ungroup(), reb ~ status,main ="Average rebounds/game by experience",xlab ="Player status",ylab ="Average rebounds")text(x =1.5,y =33,labels ="Only players present in >=4 seasons",xpd =TRUE)boxplot(data =na.omit(nba)|>group_by(player_name) |>filter(n() >=4) |>ungroup(), ast ~ status,main ="Average assists/game by experience",xlab ="Player status",ylab ="Average assists")text(x =1.5,y =33,labels ="Only players present in >=4 seasons",xpd =TRUE)```These metrics are actually closer when split by experience level, but still notably different. it's also worth noting how high the variance of the data is; in all metrics looked at so far, there have been *many* outliers!::::::We can safely say that it is likely that player experience (especially rookie vs. non-rookie) does have a meaningful effect on the performance of a player. A more rigorous approach would be to use some statistical test (2-sample t test especially) to demonstrate this phenomenon, but a visual examination is sufficient for now.As for specific rookie players who scored a disproportionately high number of points/game, we can find that too::::{.callout collapse=true title="High-scoring rookies"}```{r}rookies <- nba |>filter(status =="Rookie") |>arrange(desc(pts))rookies[1:10, c("player_name", "pts", "team_abbreviation", "age", "draft_year")]```:::### 4. Career Longevity: What is the average career length of NBA players during this period?On the topic of rookies vs. experienced players, what is a typical career length for players who's careers began and ended during this window? How many games will they play?We need to consider that some players' careers may not yet have ended by the final year of examination, 2021.:::{.callout collapse=true title="Analysis of NBA career length"}```{r}overs <- nba |>group_by(player_name) |>summarise(first_season =min(season.num),last_season =max(season.num),score =mean(pts)) |>filter(first_season >=2005, last_season <=2020)overs$career_length <- overs$last_season - overs$first_season +1hist(x = overs$career_length,main ="Histogram of NBA career lengths (n = 1247)",xlab ="Length (seasons)",ylab ="Count")abline(v =mean(overs$career_length), lwd =2, col ="red3")abline(v =median(overs$career_length), lwd =2, col ="purple3")legend(x ="topright",fill =c("red3", "purple3"),legend =c(paste0("Mean: ", trunc(mean(overs$career_length*10))/10),paste0("Median: ", median(overs$career_length)) ),bty ="n")``` We can further analyze career length for "high scorers":```{r}overs <- overs |>arrange(desc(score))hist(x = overs[1:200, ]$career_length,main ="Histogram of NBA career lengths (n = 200)",breaks =max(overs[1:200, ]$career_length) -min(overs[1:200, ]$career_length) +1,xlim =c(min(overs[1:200, ]$career_length), max(overs[1:200, ]$career_length)),xlab ="Length (seasons)",ylab ="Count")abline(v =mean(overs[1:200, ]$career_length), lwd =2, col ="red3")abline(v =median(overs[1:200, ]$career_length), lwd =2, col ="purple3")legend(x ="topright",fill =c("red3", "purple3"),legend =c(paste0("Mean: ", trunc(mean(overs[1:200, ]$career_length*10))/10),paste0("Median: ", median(overs[1:200, ]$career_length)) ),bty ="n")text(x =8,y =26.5,labels ="2005 - 2020 careers of top 200 scorers",xpd =TRUE)```It seem that those who score within the top ~15% of all players tend to have careers that are almost twice as long on average, and three times the length on the median. This makes sense, given that overall performance is the most important aspect of an NBA athlete, and average points scored/game is a strong indicator of high performance.Note that there are other performance metrics as well; a player who plays defensively may not score as highly on average, but could still have a long ands valuable NBA career.:::There seems to be a clear pattern: if you want to stay in the NBA, perform highly (or otherwise score a lot!).### 5. All-Star Selections: Which players were selected as All-Stars in the most seasons during this period?```{r}allstar <-read.csv("/home/user/School/STAT360/Project 3 (NBA)/all_star.csv")nba$allstar <-ifelse(paste(nba$player_name, nba$season.num) %in%paste(allstar$Player, allstar$Year),"all-star", "not")```The data for NBA All-star teams was gotten from somewhere on Google, then incorporated into the original dataset. During this period, there were 113 distinct players selected to be All-stars. Some players were selected as All-stars *a lot*, while most others were only a few times::::{.callout collapse=true title="Top 10 selected All-star players from 2005 to 2020"}```{r}starplayers <- nba |>filter(allstar =="all-star") |>group_by(player_name) |>summarise(count =n() ) |>arrange(desc(count))starplayers[1:10, ]```:::There are, of course, many ways to approach analyzing the All-star players vs. typical players. The easiest way (again) would just be to compare performance metrics of these All-star players compared to others::::{.callout collapse=true title="Performance metric comparison of All-star and non All-star players"}```{r}ptsbyyear <- nba |>group_by(season.num) |>filter(allstar =="not") |>summarise(pts.ave =mean(pts)) |>pull(pts.ave)ptsbyyear.as <- nba |>group_by(season.num) |>filter(allstar =="all-star") |>summarise(pts.ave =mean(pts)) |>pull(pts.ave)mydata <-data.frame( year <-2005:2021, pts <- ptsbyyear, pts.as <- ptsbyyear.as)ggplot(data = mydata, aes(x = year)) +ylim(0, 25) +geom_line(aes(y = pts, color ="Non All-star")) +geom_line(aes(y = pts.as, color ="All-star")) +scale_color_manual(name ="Class",values =c("Non All-star"="red4", "All-star"="royalblue4") ) +labs(title ="Average points/game",x ="Season",y ="Average points" ) +theme_classic()```Generally, the metrics of the players selected as All-stars are going to be significantly higher than those not.```{r}boxplot(data = nba, pts ~ allstar,names =c("All-star", "Not All-star"),main ="Boxplots of points/game vs. class",xlab ="Class",ylab ="Average points/game")```Even still, the boxplot above reveals that there are many outliers that *do* score average points/game within that All-star range. One would figure that performance alone may not dictate All-star lineups as there are many facets of an NBA player *other* than how they play.:::