Final Project

Introduction

In homework 1 we perform the basic data operations already covered in the tutorials and challenges. The goals are to introduce the dataset and to describe it both in words and with summary statistics.

The dataset link - https://www.kaggle.com/datasets/justinas/nba-players-data

Dataset

We need to load the necessary libraries for our operations.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(corrplot)

## corrplot 0.92 loaded

We can now load the dataset.

basketball_data <- read.csv("all_seasons.csv")

Reading the data and Narrative

We can get the first few rows of the data.

head(basketball_data)

##   no.      player_name team_abbreviation age player_height player_weight
## 1   0 Randy Livingston               HOU  22        193.04      94.80073
## 2   1 Gaylon Nickerson               WAS  28        190.50      86.18248
## 3   2     George Lynch               VAN  26        203.20     103.41898
## 4   3   George McCloud               LAL  30        203.20     102.05820
## 5   4     George Zidek               DEN  23        213.36     119.74829
## 6   5   Gerald Wilkins               ORL  33        198.12     102.05820
##                 college country draft_year draft_round draft_number gp  pts reb
## 1       Louisiana State     USA       1996           2           42 64  3.9 1.5
## 2 Northwestern Oklahoma     USA       1994           2           34  4  3.8 1.3
## 3        North Carolina     USA       1993           1           12 41  8.3 6.4
## 4         Florida State     USA       1989           1            7 64 10.2 2.8
## 5                  UCLA     USA       1995           1           22 52  2.8 1.7
## 6 Tennessee-Chattanooga     USA       1985           2           47 80 10.6 2.2
##   ast net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct  season
## 1 2.4        0.3    0.042    0.071   0.169  0.487   0.248 1996-97
## 2 0.3        8.9    0.030    0.111   0.174  0.497   0.043 1996-97
## 3 1.9       -8.2    0.106    0.185   0.175  0.512   0.125 1996-97
## 4 1.7       -2.7    0.027    0.111   0.206  0.527   0.125 1996-97
## 5 0.3      -14.1    0.102    0.169   0.195  0.500   0.064 1996-97
## 6 2.2       -5.8    0.031    0.064   0.203  0.503   0.143 1996-97

The dataset was created by a basketball fan who wanted to combine his enthusiasm for the sport with analytics. He used the NBA Stats API to pull this dataset together.

The dataset contains one row per player per season, from 1996-97 to 2022-23. It covers physical attributes, season statistics, college, and draft selection, and so sheds light on players’ attributes, game statistics, and career details. Each attribute is described below.

no. - This is just the serial number denoting each row.

player_name - The name of the player. Many players appear in several rows because they played multiple seasons in the NBA.

team_abbreviation - This is the abbreviation of each of the 30 teams in the league. For instance, Boston Celtics is listed as BOS, Los Angeles Lakers are listed as LAL, etc.

age - The age of the player during that season. The range is large. For instance, LeBron James made his debut at the age of 19 and was still active in the 2022-23 season at the age of 38. He and several other players have multiple rows, one for each season they played.

player_height - The height of the player in centimeters (cm). It is expected to stay essentially unchanged from season to season, because players have usually reached their full height by the time they make their debut.

player_weight - The weight of the player in kilograms (kg). Although in the real world weight can fluctuate depending on lifestyle and position, in this dataset it remains consistent.

college - The college the player represented before turning professional. Not every player has a college: some were drafted out of high school, like LeBron James, and others played internationally, like Nikola Jokic. For such players the column holds the value None.

country - Represents the country of origin. While most of the players belong to the USA, there is noticeable international representation. It is an important categorical attribute that can be used for future analysis.

draft_year - The year of the NBA draft in which the player was selected. Because not every NBA player is drafted, some players have the value Undrafted instead of a year in this column.

draft_round - The round in which the player was drafted. The NBA draft has had two rounds since 1989, so a player drafted in 1989 or later has a value of 1 or 2. Earlier drafts had more rounds, so players drafted before 1989 can have a value of 3 or higher (one of the UCLA players shown later in this report has a value of 4). Undrafted players have the value Undrafted.

draft_number - The overall pick number at which the player was selected, not the position within the round. For instance, the first row of the data is a round 2 selection with pick number 42. With 30 teams and two rounds, the last player drafted today is usually the 60th overall pick. As before, undrafted players have the value Undrafted.
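Because the three draft columns mix numbers with the text Undrafted, they are read as character columns. A minimal sketch of one way to make them numeric is shown below; it assumes the placeholder is spelled exactly "Undrafted", uses two rows copied from the output in this report, and the `undrafted` flag and `drafts_clean` name are illustrative only.

```r
library(dplyr)

# Two rows copied from the printed data; draft columns are character
# because undrafted players carry the text "Undrafted".
drafts <- data.frame(
  player_name  = c("George Lynch", "Darrick Martin"),
  draft_year   = c("1993", "Undrafted"),
  draft_round  = c("1", "Undrafted"),
  draft_number = c("12", "Undrafted"),
  stringsAsFactors = FALSE
)

drafts_clean <- drafts %>%
  mutate(
    # Keep the information that the player was never drafted
    undrafted = draft_year == "Undrafted",
    # Turn the placeholder into NA, then convert to integer
    across(c(draft_year, draft_round, draft_number),
           ~ as.integer(na_if(.x, "Undrafted")))
  )

drafts_clean
```

Keeping a separate logical flag means the undrafted group is still easy to select after the conversion.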

gp - The number of games the player appeared in that season. A standard regular season has 82 games per team, which is normally the highest value in this column.

pts - The points per game scored by the player. A player can score through free throws, two-point field goals (including dunks), and three-pointers.

reb - Rebounds per game. There are two types. An offensive rebound is when a player from the attacking team grabs the ball after a teammate misses a shot, which gives the team another chance to score. A defensive rebound is when the defending team secures the ball after the opposing team’s missed shot.

ast - It is assists per game. A player gets an assist if they pass the ball to their teammate who ends up scoring quickly without any dribbling or passing the ball elsewhere.

net_rating - The Net Rating of the player. It measures the player’s impact on team performance as the difference between the team’s offensive rating (points scored per 100 possessions with that player on the floor) and its defensive rating (points allowed per 100 possessions with that player on the floor).

oreb_pct - Offensive rebound percentage. It is the share of available offensive rebounds that the player grabs while on the floor. In simplified form, the formula is player’s offensive rebounds / (team offensive rebounds + opponent’s defensive rebounds), counted over the player’s time on the floor.

dreb_pct - Defensive rebound percentage. Similar to the metric above, it is the share of available defensive rebounds that the player grabs while on the floor. In simplified form, the formula is player’s defensive rebounds / (team defensive rebounds + opponent’s offensive rebounds).

usg_pct - Usage percentage. It is the share of the team’s plays that the player uses while on the floor. A play is used when it ends in a field goal attempt, a trip to the free throw line, or a turnover by that player; assists are not part of the calculation. The commonly published formula is-

((field goals attempted + 0.44 * free throws attempted + turnovers) * (team minutes played / 5)) / (minutes played * (team field goals attempted + 0.44 * team free throws attempted + team turnovers))
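To make the usage formula concrete, here is a small sketch in R. The function name `usage_pct`, its argument names, and the box score numbers are all hypothetical; the dataset already contains the finished value, so this is only meant to show how the pieces fit together.

```r
# Usage percentage as a proportion (multiply by 100 for a percentage).
# All inputs are totals over the same span of games.
usage_pct <- function(fga, fta, tov, mp,
                      team_fga, team_fta, team_tov, team_mp) {
  player_plays <- fga + 0.44 * fta + tov
  team_plays   <- team_fga + 0.44 * team_fta + team_tov
  (player_plays * (team_mp / 5)) / (mp * team_plays)
}

# Hypothetical single-game line: 20 shots, 5 free throws, 3 turnovers
# in 36 of a possible 48 minutes
usage_pct(fga = 20, fta = 5, tov = 3, mp = 36,
          team_fga = 85, team_fta = 25, team_tov = 14, team_mp = 240)
```

A useful sanity check is that a player who is on the floor the whole game and uses exactly one fifth of the team’s plays gets a usage of 0.2.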

ts_pct - True shooting percentage. It measures a player’s overall shooting efficiency, taking two-point field goals, three-pointers, and free throws into account. The formula is-

Total points / (2 * (field goal attempts + 0.44 * free throw attempts))
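As a short worked example, true shooting can be computed as below. The sketch uses the standard definition, in which the 0.44 factor is applied to free throw attempts; the function name `true_shooting` and the input numbers are made up for illustration.

```r
# True shooting as a proportion: points per two "true" shot attempts
true_shooting <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Hypothetical game: 30 points on 20 field goal attempts and 8 free throws
true_shooting(pts = 30, fga = 20, fta = 8)
```

A player who scores only on two-point shots and makes half of them gets exactly 0.5, which matches the usual reading of a shooting percentage.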

ast_pct - Assist percentage. It estimates the percentage of teammates’ field goals that the player assisted while on the floor. The commonly published formula is-

Assists / (((minutes played / (team minutes played / 5)) * team field goals made) - player field goals made)
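A small sketch of the assist percentage calculation follows. It uses the commonly published form, in which the denominator estimates the field goals made by teammates while the player was on the floor; the function name `assist_pct` and the numbers are hypothetical.

```r
# Assist percentage as a proportion
assist_pct <- function(ast, mp, fg, team_mp, team_fg) {
  # Team field goals scaled to the player's share of minutes,
  # minus the player's own makes (a player cannot assist himself)
  teammate_fg_on_floor <- (mp / (team_mp / 5)) * team_fg - fg
  ast / teammate_fg_on_floor
}

# Hypothetical game: 8 assists and 9 made shots in 36 minutes,
# on a team that made 40 field goals in a 240-minute game
assist_pct(ast = 8, mp = 36, fg = 9, team_mp = 240, team_fg = 40)
```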

season - Represents each particular season of the NBA.

Tidying the data

From the dataset, we see no obvious missing values, and all values appear to be in the correct columns. However, we can still improve the appearance of the data by changing the column names.
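The claim about missing values can be checked directly. The sketch below counts both true NA values and the text placeholders None and Undrafted that appear in the printed rows of this report; it runs on three rows copied from that output, and the object names are illustrative.

```r
library(dplyr)

# Three rows copied from the printed data
players <- data.frame(
  player_name = c("George Zidek", "Darrick Martin", "Vlade Divac"),
  college     = c("UCLA", "UCLA", "None"),
  draft_year  = c("1995", "Undrafted", "1989"),
  stringsAsFactors = FALSE
)

# True missing values per column
na_counts <- colSums(is.na(players))

# Text placeholders that stand in for missing information
placeholder_counts <- players %>%
  summarize(across(everything(), ~ sum(.x %in% c("None", "Undrafted"))))

na_counts
placeholder_counts
```

A column can therefore have zero NA values and still contain entries that mean "not available".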

We can use the rename function from dplyr package to change the column names.

basketball_data <- basketball_data %>%
  rename(
    PlayerName = player_name,
    Team = team_abbreviation,
    Height = player_height,
    Weight = player_weight,
    DraftYear = draft_year,
    Round = draft_round,
    Number = draft_number,
    NetRating = net_rating,
    `oreb %` = oreb_pct,
    `dreb %` = dreb_pct,
    `usg %` = usg_pct,
    `ts %` = ts_pct,
    `ast %` = ast_pct,
  )

head(basketball_data)

##   no.       PlayerName Team age Height    Weight               college country
## 1   0 Randy Livingston  HOU  22 193.04  94.80073       Louisiana State     USA
## 2   1 Gaylon Nickerson  WAS  28 190.50  86.18248 Northwestern Oklahoma     USA
## 3   2     George Lynch  VAN  26 203.20 103.41898        North Carolina     USA
## 4   3   George McCloud  LAL  30 203.20 102.05820         Florida State     USA
## 5   4     George Zidek  DEN  23 213.36 119.74829                  UCLA     USA
## 6   5   Gerald Wilkins  ORL  33 198.12 102.05820 Tennessee-Chattanooga     USA
##   DraftYear Round Number gp  pts reb ast NetRating oreb % dreb % usg %  ts %
## 1      1996     2     42 64  3.9 1.5 2.4       0.3  0.042  0.071 0.169 0.487
## 2      1994     2     34  4  3.8 1.3 0.3       8.9  0.030  0.111 0.174 0.497
## 3      1993     1     12 41  8.3 6.4 1.9      -8.2  0.106  0.185 0.175 0.512
## 4      1989     1      7 64 10.2 2.8 1.7      -2.7  0.027  0.111 0.206 0.527
## 5      1995     1     22 52  2.8 1.7 0.3     -14.1  0.102  0.169 0.195 0.500
## 6      1985     2     47 80 10.6 2.2 2.2      -5.8  0.031  0.064 0.203 0.503
##   ast %  season
## 1 0.248 1996-97
## 2 0.043 1996-97
## 3 0.125 1996-97
## 4 0.125 1996-97
## 5 0.064 1996-97
## 6 0.143 1996-97

From the above operation we see that the column names are more visually appealing than before.

Descriptive Statistics

Because many players appear in several seasons, computing statistics on the whole dataset at once is not reasonable: the same player is counted repeatedly, which can give misleading results. A better strategy is to filter on parameters such as season or team to narrow the scope and get more accurate results.

For instance, if we want the mean height of all players in the 2000-01 season, the operation looks like this.

mean_ht_2000 <- basketball_data %>%
  filter(season == "2000-01") %>%
  summarize(mean_height = mean(Height, na.rm = TRUE))

mean_ht_2000

##   mean_height
## 1    200.7522

From above we see that the mean height of the players in that season was 200.7522 cm.

We can compute the same statistic for the most recent season to see if there is a noticeable difference.

mean_ht_2022 <- basketball_data %>%
  filter(season == "2022-23") %>%
  summarize(mean_height = mean(Height, na.rm = TRUE))

mean_ht_2022

##   mean_height
## 1    199.2651

We see that the difference is not sizeable: the mean is only about 1.5 cm lower, which is less than 1 inch.
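The same comparison can be done in one pipeline by grouping on season instead of filtering twice. This is a minimal sketch on made-up heights (the four values below are not from the dataset); only the column names `season` and `Height` follow the report.

```r
library(dplyr)

# Hypothetical heights in cm, two players per season
toy <- data.frame(
  season = c("2000-01", "2000-01", "2022-23", "2022-23"),
  Height = c(198, 204, 196, 202)
)

height_by_season <- toy %>%
  filter(season %in% c("2000-01", "2022-23")) %>%
  group_by(season) %>%
  summarize(mean_height = mean(Height, na.rm = TRUE),
            n_players   = n())

height_by_season

# Change from the earlier to the later season
diff(height_by_season$mean_height)
```

Reporting the number of players next to each mean makes it easier to judge how much weight a difference deserves.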

We can get the median points scored by any team’s players in a particular season. This can also be used to compare individual performances over several seasons.

We compare the median points scored by Boston Celtics’ players in two seasons by grouping the data.

median_points_celtics <- basketball_data %>%
  filter(Team == "BOS" & season %in% c("2014-15", "2019-20")) %>%
  group_by(season) %>%
  summarise(median_points = median(pts, na.rm = TRUE))

median_points_celtics

## # A tibble: 2 × 2
##   season  median_points
##   <chr>           <dbl>
## 1 2014-15          8.65
## 2 2019-20          5.2

The median dropped by 3.45 points per game. This could indicate that the Celtics’ overall offense regressed or, if it stayed similar or improved, that one or two players were responsible for the majority of the scoring.
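One way to probe the second explanation is to look at how concentrated the scoring is. The sketch below uses made-up player values, chosen only so that the medians match the two figures above; the `gp` values and the `top_share` measure (the top scorer’s share of all points) are illustrative assumptions, not results from the dataset.

```r
library(dplyr)

# Hypothetical rosters of four players per season
team <- data.frame(
  season = rep(c("2014-15", "2019-20"), each = 4),
  pts    = c(9, 8.5, 8.8, 8, 25, 5, 5.4, 4),
  gp     = rep(70, 8)
)

concentration <- team %>%
  group_by(season) %>%
  summarize(
    median_points = median(pts),
    # Total points = points per game * games played
    top_share     = max(pts * gp) / sum(pts * gp)
  )

concentration
```

In this toy example the lower median goes together with a much larger share for the top scorer, which is the pattern the second explanation would predict.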

We can get the average physical attributes of a team from any season. For instance, we can get the average height and weight of Golden State Warriors’ players from the 2010-11 season.

phy_gsw <- basketball_data %>%
  filter(season == "2010-11" & Team == "GSW") %>%
  summarize(avg_height = mean(Height, na.rm = TRUE),
            avg_weight = mean(Weight, na.rm = TRUE))

phy_gsw

##   avg_height avg_weight
## 1   200.3457   98.21274

We see that the average height was about 200 cm and weight about 98 kg. We can get this data for any team from any season in the data.

We can find out how many first round draft picks played in the most recent season.

first_round_22 <- basketball_data %>%
  filter(Round == "1" & season == "2022-23") %>%
  summarize(number_of_first_rounders = n())

first_round_22

##   number_of_first_rounders
## 1                      280

We see that 280 players across the 30 teams were former first round draft picks. However, to get more insight we can compute the proportions of first round picks, second round picks, and undrafted players.

players_22 <- basketball_data %>%
  filter(season == "2022-23")

draft_proportions <- players_22 %>%
  group_by(Round) %>%
  summarize(proportion = n() / nrow(players_22))

draft_proportions

## # A tibble: 3 × 2
##   Round     proportion
##   <chr>          <dbl>
## 1 1              0.519
## 2 2              0.226
## 3 Undrafted      0.254

We see that a little more than half of the players active last season were first rounders. This makes sense: of the 60 players drafted each year, 30 are first rounders, and undrafted players and late second rounders are the most likely to be cut, while first rounders are potential starters. This is why the proportion favors first round picks.
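The proportions can also be computed without referring back to the row count of the filtered data, by counting each group and dividing by the total. A minimal sketch on four made-up rows:

```r
library(dplyr)

# Hypothetical draft rounds for four players in one season
toy22 <- data.frame(Round = c("1", "1", "2", "Undrafted"))

draft_share <- toy22 %>%
  count(Round) %>%
  mutate(proportion = n / sum(n))

draft_share
```

Because the denominator is `sum(n)`, the proportions always add up to 1 for whatever subset is passed in.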

We can get the average career stats of LeBron James across all the seasons. We will consider points per game, rebounds, and assists.

lebron_stats <- basketball_data %>%
  filter(PlayerName == "LeBron James")

avg_stats <- lebron_stats %>%
  summarize(
    avg_pts = mean(pts, na.rm = TRUE),
    avg_reb = mean(reb, na.rm = TRUE),
    avg_ast = mean(ast, na.rm = TRUE)
  )

avg_stats

##   avg_pts avg_reb avg_ast
## 1    27.2    7.54    7.34

From the above data, we see that LeBron has impressive mean stats across his long career. This shows his consistency from his debut until now, and it makes a strong case for why he is considered one of the best players in the history of the sport.
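Note that averaging per-game values over seasons gives every season the same weight, regardless of how many games were played. If a true career per-game average is wanted, each season can be weighted by its games played. The sketch below uses two made-up seasons to show the difference.

```r
# Two hypothetical seasons: a full one and an injury-shortened one
seasons <- data.frame(pts = c(20, 30), gp = c(80, 40))

# Mean of the season averages: each season counts equally
simple_mean <- mean(seasons$pts)

# Games-weighted mean: equals total points / total games
weighted_mean <- weighted.mean(seasons$pts, w = seasons$gp)

c(simple = simple_mean, weighted = weighted_mean)
```

The two numbers coincide only when every season has the same number of games.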

We can get the standard deviation of games played by players throughout their careers.

player_games_sd <- basketball_data %>%
  group_by(PlayerName) %>%
  summarize(gp_sd = sd(gp))

head(player_games_sd)

## # A tibble: 6 × 2
##   PlayerName    gp_sd
##   <chr>         <dbl>
## 1 A.C. Green     14.4
## 2 A.J. Bramlett  NA  
## 3 A.J. Guyton    22.2
## 4 A.J. Lawson    NA  
## 5 AJ Green       NA  
## 6 AJ Griffin     NA

Certain players have NA next to their name. This can look like missing data at first, but a closer look shows that these players have only one season in the dataset as of the end of 2022-23, and the standard deviation of a single value is undefined, so sd returns NA. For instance, AJ Griffin was a 2022 draft pick and played 72 games in his only season so far. The other players with NA, from earlier seasons, played just one season in their entire careers.

We can compare the career stats of Kobe Bryant and Stephen Curry, two players who are considered among the greatest of all time and who led their teams to multiple championships.
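The NA values can be reproduced and filtered out with a season count per player. In this sketch, "Player X" and his two seasons are made up; AJ Griffin’s 72 games come from the text above.

```r
library(dplyr)

games <- data.frame(
  PlayerName = c("AJ Griffin", "Player X", "Player X"),
  gp         = c(72, 60, 80)
)

# sd() of a single value is NA
sd(72)

multi_season_sd <- games %>%
  group_by(PlayerName) %>%
  summarize(n_seasons = n(), gp_sd = sd(gp)) %>%
  # Keep only players with at least two seasons
  filter(n_seasons > 1)

multi_season_sd
```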

player_stats <- basketball_data %>%
  filter(PlayerName %in% c("Kobe Bryant", "Stephen Curry")) %>%
  group_by(PlayerName) %>%
  summarise(
    med_oreb = median(`oreb %`, na.rm = TRUE),
    med_dreb = median(`dreb %`, na.rm = TRUE),
    med_usg = median(`usg %`, na.rm = TRUE),
    med_ts = median(`ts %`, na.rm = TRUE),
    med_ast = median(`ast %`, na.rm = TRUE)
  )

player_stats

## # A tibble: 2 × 6
##   PlayerName    med_oreb med_dreb med_usg med_ts med_ast
##   <chr>            <dbl>    <dbl>   <dbl>  <dbl>   <dbl>
## 1 Kobe Bryant     0.035     0.127   0.316  0.548   0.236
## 2 Stephen Curry   0.0225    0.12    0.288  0.617   0.286

We see that their career numbers are close to each other. This is in line with the stats of the other great players in the league.

Research Questions

The following questions can be answered by analyzing the data.

How does a player’s performance change with age?

Because we have entire or partial career stats over multiple seasons, we can get a good idea of performance trends.

Which players have improved or regressed?

As above, we can use similar data to see which players have improved and which have regressed over time. This can also help in detecting down years and career years.

Do players’ physical attributes have an impact?

We are given the heights and weights of the players, which can be used to compare players in similar and different positions. We can also compare specific stats such as rebounds, where size plays a role.

Does draft pick matter?

Because each player falls into a draft category, we can use this data to assess whether first rounders are noticeably more successful than second round picks and undrafted players.

What is the impact of a star player on the team?

Often, due to serious injury, a player misses most or all of a season. We can use this data to see how the team performs with and without the player, and so analyze how vital a certain player is to his team.

Does place of origin matter?

We can infer statistical tendencies among players from the same country. This can help show whether international players have a different playing style from American players.

Descriptive Statistics in Groups

We can find the career stats of all NBA players who previously played at UCLA. UCLA is considered one of the powerhouses of college basketball. It has a rich history and consistently produces NBA-level talent.

We can do this by filtering the data first and then averaging each player’s statistics over his seasons.

ucla_data <- basketball_data %>%
  filter(college == "UCLA")

head(ucla_data)

##   no.     PlayerName Team age Height    Weight college country DraftYear
## 1   4   George Zidek  DEN  23 213.36 119.74829    UCLA     USA      1995
## 2  81     Jack Haley  NJN  33 208.28 109.76926    UCLA     USA      1987
## 3 153    Don MacLean  PHI  27 208.28 106.59412    UCLA     USA      1992
## 4 168    Ed O'Bannon  DAL  24 203.20 100.69742    UCLA     USA      1995
## 5 176 Darrick Martin  LAC  26 180.34  77.11064    UCLA     USA Undrafted
## 6 260     Tyus Edney  SAC  24 177.80  68.94598    UCLA     USA      1995
##       Round    Number gp  pts reb ast NetRating oreb % dreb % usg %  ts % ast %
## 1         1        22 52  2.8 1.7 0.3     -14.1  0.102  0.169 0.195 0.500 0.064
## 2         4        79 20  2.0 1.6 0.3       5.0  0.167  0.284 0.224 0.441 0.088
## 3         1        19 37 10.9 3.8 1.0      -5.9  0.060  0.152 0.264 0.493 0.091
## 4         1         9 64  3.7 2.3 0.6      -8.7  0.060  0.149 0.167 0.399 0.077
## 5 Undrafted Undrafted 82 10.9 1.4 4.1      -4.5  0.016  0.060 0.232 0.539 0.292
## 6         2        47 70  6.9 1.6 3.2      -2.0  0.028  0.067 0.192 0.499 0.264
##    season
## 1 1996-97
## 2 1996-97
## 3 1996-97
## 4 1996-97
## 5 1996-97
## 6 1996-97

From above we see that some players have multiple rows and that is because it shows data from each season. We can take the average career stats and update our data so that every player appears only once.

new_ucla <- ucla_data %>%
  group_by(PlayerName) %>%
  summarize_all(mean, na.rm = TRUE)

## Warning: There were 371 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `Team = (function (x, ...) ...`.
## ℹ In group 1: `PlayerName = "Aaron Holiday"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 370 remaining warnings.

head(new_ucla)

## # A tibble: 6 × 22
##   PlayerName       no.  Team   age Height Weight college country DraftYear Round
##   <chr>          <dbl> <dbl> <dbl>  <dbl>  <dbl>   <dbl>   <dbl>     <dbl> <dbl>
## 1 Aaron Holiday 11400.    NA  24     183.   83.9      NA      NA        NA    NA
## 2 Arron Afflalo  7423     NA  27     196.   96.9      NA      NA        NA    NA
## 3 Baron Davis    4243     NA  27     190.   98.3      NA      NA        NA    NA
## 4 Cedric Bozem…  4444     NA  24     198.   93.9      NA      NA        NA    NA
## 5 Charles O'Ba…   890     NA  23.5   196.   94.8      NA      NA        NA    NA
## 6 Dan Gadzuric   4845     NA  29.5   211.  110.       NA      NA        NA    NA
## # ℹ 12 more variables: Number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
## #   NetRating <dbl>, `oreb %` <dbl>, `dreb %` <dbl>, `usg %` <dbl>,
## #   `ts %` <dbl>, `ast %` <dbl>, season <dbl>

The number of rows has now dropped to 53, which shows that UCLA produced 53 NBA players between 1996-97 and 2022-23. However, this result is not ideal. The warnings and the NA columns appear because mean was also applied to character columns such as Team, college, and season, and columns where an average has little meaning, such as the serial number and age, were averaged as well. We can narrow the data down to the relevant columns.
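The warnings can be avoided by averaging only the numeric columns. A minimal sketch with `across` and `where` is shown below, as an alternative to `summarize_all`; the three rows and the player names are made up for illustration.

```r
library(dplyr)

# Hypothetical rows with one character column and two numeric columns
ucla_toy <- data.frame(
  PlayerName = c("Player A", "Player A", "Player B"),
  Team       = c("DEN", "BOS", "SAC"),
  pts        = c(2.8, 4.0, 6.9),
  reb        = c(1.7, 2.1, 1.6),
  stringsAsFactors = FALSE
)

career_means <- ucla_toy %>%
  group_by(PlayerName) %>%
  # Average numeric columns only; character columns are dropped
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")

career_means
```

Character columns such as `Team` simply do not appear in the result, so no NA columns or warnings are produced.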

new_ucla <- new_ucla %>%
  group_by(PlayerName) %>%
  summarize(
    pts = mean(pts, na.rm = TRUE),
    reb = mean(reb, na.rm = TRUE),
    ast = mean(ast, na.rm = TRUE),
    NetRating = mean(NetRating, na.rm = TRUE),
    oreb = mean(`oreb %`, na.rm = TRUE),
    dreb = mean(`dreb %`, na.rm = TRUE),
    usg = mean(`usg %`, na.rm = TRUE),    
    ts = mean(`ts %`, na.rm = TRUE),      
    ast = mean(`ast %`, na.rm = TRUE)
  )

head(new_ucla)

## # A tibble: 6 × 9
##   PlayerName         pts   reb    ast NetRating   oreb   dreb   usg    ts
##   <chr>            <dbl> <dbl>  <dbl>     <dbl>  <dbl>  <dbl> <dbl> <dbl>
## 1 Aaron Holiday     6.56  1.62 0.173       1.7  0.0168 0.0752 0.178 0.523
## 2 Arron Afflalo    10.7   2.81 0.0993     -1.54 0.0193 0.099  0.171 0.551
## 3 Baron Davis      15.9   3.72 0.355       1.46 0.029  0.0961 0.243 0.498
## 4 Cedric Bozeman    1.1   1    0.075      -5.2  0.016  0.112  0.112 0.312
## 5 Charles O'Bannon  2.6   1.5  0.119       7.7  0.091  0.106  0.172 0.446
## 6 Dan Gadzuric      3.92  3.98 0.0386     -3.28 0.124  0.229  0.158 0.441

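This updated data shows the average career stats of each UCLA player. Note that the ast column in this output holds the assist percentage, not assists per game, because the name ast is assigned twice in the summary and the second assignment replaces the first. Such a table can be helpful both for teams preparing for future NBA drafts and for high school students deciding which college to commit to.

Because the NBA attracts international interest, we can filter the data to see the performance of international players.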
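In the summary above, the name `ast` is used for both assists per game and assist percentage, so only one of them survives. A sketch of the same idea with distinct output names is shown below; the rows are made up, and `ast_pct` is just one possible name for the second column.

```r
library(dplyr)

# Hypothetical rows; check.names = FALSE keeps the "ast %" column name
ucla_toy <- data.frame(
  PlayerName = c("Player A", "Player A", "Player B"),
  ast        = c(3.0, 5.0, 1.0),
  `ast %`    = c(0.20, 0.30, 0.10),
  check.names = FALSE
)

ucla_summary <- ucla_toy %>%
  group_by(PlayerName) %>%
  summarize(
    ast     = mean(ast, na.rm = TRUE),      # assists per game
    ast_pct = mean(`ast %`, na.rm = TRUE)   # assist percentage
  )

ucla_summary
```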

international <- basketball_data %>%
  filter(country != "USA")

head(international)

##   no.       PlayerName Team age Height    Weight        college
## 1  18  Hakeem Olajuwon  HOU  34 213.36 115.66596        Houston
## 2 150  Dikembe Mutombo  ATL  31 218.44 113.39800     Georgetown
## 3 224         Rick Fox  BOS  27 200.66 112.94441 North Carolina
## 4 246       Steve Nash  PHX  23 190.50  88.45044    Santa Clara
## 5 254      Vlade Divac  CHH  29 215.90 117.93392           None
## 6 255 Vitaly Potapenko  CLE  22 208.28 127.00576   Wright State
##                 country DraftYear Round Number gp  pts  reb ast NetRating
## 1               Nigeria      1984     1      1 78 23.2  9.2 3.0       6.5
## 2                 Congo      1991     1      4 80 13.3 11.6 1.4       8.1
## 3                Canada      1991     1     24 76 15.4  5.2 3.8      -6.9
## 4                Canada      1996     1     15 65  3.3  1.0 2.1      -8.5
## 5 Serbia and Montenegro      1989     1     26 81 12.6  9.0 3.7       5.4
## 6               Ukraine      1996     1     12 80  5.8  2.7 0.5      -0.2
##   oreb % dreb % usg %  ts % ast %  season
## 1  0.075  0.206 0.308 0.558 0.158 1996-97
## 2  0.112  0.256 0.179 0.584 0.066 1996-97
## 3  0.047  0.134 0.202 0.551 0.175 1996-97
## 4  0.026  0.083 0.170 0.539 0.301 1996-97
## 5  0.107  0.197 0.194 0.533 0.172 1996-97
## 6  0.106  0.119 0.234 0.486 0.064 1996-97

Several players occur multiple times. We can use the distinct function to keep only the first occurrence of each player, because we are not trying to compute any statistics here.

international <- international %>%
  distinct(PlayerName, .keep_all = TRUE)

head(international)

##   no.       PlayerName Team age Height    Weight        college
## 1  18  Hakeem Olajuwon  HOU  34 213.36 115.66596        Houston
## 2 150  Dikembe Mutombo  ATL  31 218.44 113.39800     Georgetown
## 3 224         Rick Fox  BOS  27 200.66 112.94441 North Carolina
## 4 246       Steve Nash  PHX  23 190.50  88.45044    Santa Clara
## 5 254      Vlade Divac  CHH  29 215.90 117.93392           None
## 6 255 Vitaly Potapenko  CLE  22 208.28 127.00576   Wright State
##                 country DraftYear Round Number gp  pts  reb ast NetRating
## 1               Nigeria      1984     1      1 78 23.2  9.2 3.0       6.5
## 2                 Congo      1991     1      4 80 13.3 11.6 1.4       8.1
## 3                Canada      1991     1     24 76 15.4  5.2 3.8      -6.9
## 4                Canada      1996     1     15 65  3.3  1.0 2.1      -8.5
## 5 Serbia and Montenegro      1989     1     26 81 12.6  9.0 3.7       5.4
## 6               Ukraine      1996     1     12 80  5.8  2.7 0.5      -0.2
##   oreb % dreb % usg %  ts % ast %  season
## 1  0.075  0.206 0.308 0.558 0.158 1996-97
## 2  0.112  0.256 0.179 0.584 0.066 1996-97
## 3  0.047  0.134 0.202 0.551 0.175 1996-97
## 4  0.026  0.083 0.170 0.539 0.301 1996-97
## 5  0.107  0.197 0.194 0.533 0.172 1996-97
## 6  0.106  0.119 0.234 0.486 0.064 1996-97

After removing the repeated rows, there are 413 unique international players over the span of seasons in the dataset.

We can get the count of players from each unique country.

country_count <- international %>%
  count(country)

country_count

##                             country  n
## 1                            Angola  1
## 2                         Argentina 14
## 3                         Australia 22
## 4                           Austria  1
## 5                           Bahamas  3
## 6                            Belize  1
## 7                            Bosnia  2
## 8              Bosnia & Herzegovina  1
## 9            Bosnia and Herzegovina  1
## 10                           Brazil 13
## 11                       Cabo Verde  1
## 12                         Cameroon  4
## 13                           Canada 45
## 14                            China  4
## 15                         Colombia  1
## 16                            Congo  2
## 17                          Croatia 15
## 18                   Czech Republic  4
## 19                              DRC  1
## 20 Democratic Republic of the Congo  4
## 21                          Denmark  1
## 22               Dominican Republic  6
## 23                            Egypt  1
## 24                          England  2
## 25                          Finland  2
## 26                           France 37
## 27                            Gabon  2
## 28                          Georgia  6
## 29                          Germany 13
## 30                            Ghana  1
## 31                    Great Britain  1
## 32                           Greece  9
## 33                           Guinea  1
## 34                            Haiti  2
## 35                             Iran  1
## 36                          Ireland  1
## 37                           Israel  3
## 38                            Italy  7
## 39                          Jamaica  5
## 40                            Japan  3
## 41                           Latvia  6
## 42                        Lithuania 12
## 43                        Macedonia  1
## 44                             Mali  2
## 45                           Mexico  3
## 46                       Montenegro  5
## 47                      Netherlands  2
## 48                      New Zealand  2
## 49                          Nigeria 12
## 50                           Panama  1
## 51                           Poland  4
## 52                         Portugal  1
## 53                      Puerto Rico  6
## 54                           Russia  9
## 55                         Scotland  1
## 56                          Senegal 10
## 57                           Serbia 15
## 58            Serbia and Montenegro  5
## 59                         Slovenia 10
## 60                      South Korea  1
## 61                      South Sudan  3
## 62                            Spain 14
## 63         St. Vincent & Grenadines  1
## 64                            Sudan  1
## 65                       Sudan (UK)  1
## 66                           Sweden  2
## 67                      Switzerland  2
## 68                         Tanzania  1
## 69              Trinidad and Tobago  1
## 70                          Tunisia  1
## 71                           Turkey 12
## 72              U.S. Virgin Islands  1
## 73                US Virgin Islands  1
## 74                             USSR  1
## 75                          Ukraine  8
## 76                   United Kingdom  4
## 77                          Uruguay  1
## 78                        Venezuela  2
## 79                       Yugoslavia  4

The 413 international players are spread across 79 distinct country labels. The actual number of countries is somewhat lower, because some countries appear under more than one label (for example Bosnia, Bosnia & Herzegovina, and Bosnia and Herzegovina, or U.S. Virgin Islands and US Virgin Islands). Canada has the highest count with 45 players. This can be attributed to its proximity to the USA, to having an NBA team of its own in the Toronto Raptors, and to the short distance to cities such as New York City, Boston, Detroit, and Minneapolis.

We can sort the data by points per game to see the highest scoring seasons. This also shows the other stats of those players in potential All-Star or MVP caliber seasons.
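Several labels in the table above look like spelling variants of the same country. A minimal sketch of merging them is shown below. The counts are copied from the table, but the mapping itself is an assumption (for example, that Bosnia means Bosnia and Herzegovina and that DRC means Democratic Republic of the Congo) and should be checked against the players involved before use.

```r
library(dplyr)

# A subset of the counts printed above
country_count <- data.frame(
  country = c("Bosnia", "Bosnia & Herzegovina", "Bosnia and Herzegovina",
              "DRC", "Democratic Republic of the Congo",
              "U.S. Virgin Islands", "US Virgin Islands", "Canada"),
  n = c(2, 1, 1, 1, 4, 1, 1, 45)
)

merged_count <- country_count %>%
  mutate(country = case_when(
    # Assumed equivalences between labels
    country %in% c("Bosnia", "Bosnia & Herzegovina") ~ "Bosnia and Herzegovina",
    country == "DRC"                                 ~ "Democratic Republic of the Congo",
    country == "U.S. Virgin Islands"                 ~ "US Virgin Islands",
    TRUE                                             ~ country
  )) %>%
  group_by(country) %>%
  summarize(n = sum(n), .groups = "drop")

merged_count
```

In the full analysis the same `mutate` step would be applied to the `country` column before counting.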

highest_pts <- basketball_data %>%
  arrange(desc(pts))

head(highest_pts)

##     no.    PlayerName Team age Height    Weight       college  country
## 1 10227  James Harden  HOU  29 195.58  99.79024 Arizona State      USA
## 2  4163   Kobe Bryant  LAL  27 198.12  99.79024          None      USA
## 3 10634  James Harden  HOU  30 195.58  99.79024 Arizona State      USA
## 4 12839   Joel Embiid  PHI  29 213.36 127.00576        Kansas Cameroon
## 5  4302 Allen Iverson  PHI  31 182.88  74.84268    Georgetown      USA
## 6 12740   Luka Doncic  DAL  24 200.66 104.32616          None Slovenia
##   DraftYear Round Number gp  pts  reb ast NetRating oreb % dreb % usg %  ts %
## 1      2009     1      3 78 36.1  6.6 7.5       6.3  0.023  0.157 0.396 0.616
## 2      1996     1     13 80 35.4  5.3 4.5       4.7  0.026  0.127 0.384 0.559
## 3      2009     1      3 68 34.3  6.6 7.5       5.8  0.026  0.139 0.356 0.626
## 4      2014     1      3 66 33.1 10.2 4.2       8.8  0.057  0.243 0.370 0.655
## 5      1996     1      1 72 33.0  3.2 7.4       0.8  0.016  0.071 0.354 0.543
## 6      2018     1      3 66 32.4  8.6 8.0       2.1  0.024  0.224 0.368 0.609
##   ast %  season
## 1 0.394 2018-19
## 2 0.228 2005-06
## 3 0.366 2019-20
## 4 0.233 2022-23
## 5 0.331 2005-06
## 6 0.408 2022-23

We see that the top of this list is filled with popular star players. This is not surprising, because being the key players for their teams contributed to their popularity among the fan bases.

Statistics such as points, rebounds, and assists look great on paper, but they do not always tell the whole story. This is why net rating is important. A player may not have glossy stats, yet a positive net rating shows that he has an impact on the team’s success. Similarly, great stats combined with a poor net rating suggest that the player racks up numbers while his team is losing.

We can group and filter the data accordingly.

all_net <- basketball_data %>%
  group_by(PlayerName) %>%
  summarize(
    avg_net = mean(NetRating, na.rm = TRUE)
  )

head(all_net)

## # A tibble: 6 × 2
##   PlayerName    avg_net
##   <chr>           <dbl>
## 1 A.C. Green      -1.88
## 2 A.J. Bramlett  -32.6 
## 3 A.J. Guyton     -6.7 
## 4 A.J. Lawson    -20.1 
## 5 AJ Green        -4.9 
## 6 AJ Griffin       1.5

This is a list of all players with their average net rating, computed as the unweighted mean of their season values. There are a total of 2551 players in the list, and we can filter those whose average net rating is positive.

positive_net <- all_net %>%
  filter(avg_net > 0)

head(positive_net)

## # A tibble: 6 × 2
##   PlayerName    avg_net
##   <chr>           <dbl>
## 1 AJ Griffin      1.5  
## 2 Aaron Gordon    1    
## 3 Aaron Holiday   1.7  
## 4 Aaron McKie     0.673
## 5 Adam Keefe      1.62 
## 6 Adam Mokoka     5

We get a list of 733 players, which is a little under 30% of all players. This suggests that players with a positive impact are relatively rare even at the highest level. The list is a mix of star players (known for their stats) and less popular players who are good enough not to hurt the team.
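
The averages above give every season equal weight, so a five-game stint counts as much as a full 82-game season. A minimal sketch of a games-weighted alternative is shown below. It uses a small made-up data frame (`toy`) with the same column names as the report; the values are illustrative and not taken from the dataset.

```r
library(dplyr)

# Hypothetical toy data: two players, made-up values
toy <- data.frame(
  PlayerName = c("Player A", "Player A", "Player B", "Player B"),
  gp         = c(80, 5, 40, 40),
  NetRating  = c(2.0, -30.0, 1.0, -3.0)
)

career_net <- toy %>%
  group_by(PlayerName) %>%
  summarise(
    # every season counts equally
    avg_net      = mean(NetRating, na.rm = TRUE),
    # seasons weighted by games played
    weighted_net = weighted.mean(NetRating, w = gp, na.rm = TRUE),
    .groups = "drop"
  )

# Share of players whose games-weighted net rating is positive
share_positive <- mean(career_net$weighted_net > 0)
```

In this toy example Player A has a negative unweighted average but a slightly positive weighted one, because the very poor season covers only five games.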

Visualizations of Above Statistics

While numerical data provides a good amount of information, long tables of numbers are hard to take in at a glance. This is why visualization is important: it is a more convenient way of understanding the insights from the statistical operations conducted above.

We can plot the distribution of physical attributes of all players such as height and weight just to observe the physical diversity across the recent league history.

all_heights <- basketball_data %>%
  group_by(PlayerName) %>%
  summarise(avg_ht = mean(Height, na.rm = TRUE))

ggplot(all_heights, aes(x = avg_ht)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white", aes(y = after_stat(count))) +
  labs(title = "Distribution of Player Heights", x = "Player Height (cm)") +
  scale_x_continuous(breaks = seq(150, 250, by = 10)) +
  theme_minimal()

The plot above shows the distribution of heights of all players over the 27 seasons in the dataset. Most players are around 200 cm tall, which is about 6 feet 7 inches. A small subset of players is above 210 cm, or roughly 6 feet 11 inches. Similarly, not every player is above 6 feet tall: some players fall in the 180 cm interval, which is around 5 feet 11 inches. The wide range in heights shows that a prospect who is not very tall can still play in the league. Notable shorter players such as Chris Paul have built Hall of Fame caliber careers despite giving up height to most opponents.
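
The conversions quoted above can be checked with a small helper. This is only a sketch: `cm_to_ft_in` is a hypothetical function name introduced here, and it rounds to the nearest whole inch using 1 inch = 2.54 cm.

```r
# Convert centimetres to feet and inches, rounded to the nearest inch
cm_to_ft_in <- function(cm) {
  total_in <- round(cm / 2.54)   # nearest whole inch
  feet     <- total_in %/% 12    # whole feet
  inches   <- total_in %% 12     # remaining inches
  paste0(feet, "'", inches, "\"")
}

cm_to_ft_in(c(180, 200, 210, 213.36))
# returns "5'11\"" "6'7\"" "6'11\"" "7'0\""
```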

We can do the same for player weights.

all_weights <- basketball_data %>%
  group_by(PlayerName) %>%
  summarise(avg_wt = mean(Weight, na.rm = TRUE))

ggplot(all_weights, aes(x = avg_wt)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white", aes(y = after_stat(count))) +
  labs(title = "Distribution of Player Weights", x = "Player Weight (kg)") +
  scale_x_continuous(breaks = seq(50, 150, by = 10)) +
  theme_minimal()

From the plot above, most players weigh around 100 kg, or about 220 pounds. This makes sense because the previous plot showed that most players are considerably taller than the average American man. Like heights, weights are diverse, ranging from as light as 75 kg to over 125 kg.

We can categorize all the players in the dataset by draft round, with undrafted players as their own group.

round <- basketball_data %>%
  distinct(PlayerName, Round) %>%
  group_by(Round) %>%
  summarize(player_count = n_distinct(PlayerName))

ggplot(round, aes(x = factor(Round), y = player_count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  labs(title = "Players by Draft Rounds", x = "Draft Round", y = "Player Count") +
  theme_minimal()

About 1000 players in the dataset are first rounders, which is about 40% of all players over the 27 seasons covered. One interesting observation is that undrafted players outnumber second rounders. This was initially unexpected, as undrafted players are often signed as backups or for roster depth. The fact that there are more undrafted players than second rounders shows that players can still reach the league even if they were not one of the 60 picks. There is also a tiny number of third and fourth rounders, but they come from earlier drafts. There are only two rounds as of today.
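
The percentages quoted above can be computed alongside the counts instead of being read off the plot. The sketch below assumes the same `PlayerName` and `Round` columns and uses a made-up data frame (`toy`), so the numbers are illustrative only.

```r
library(dplyr)

# Hypothetical toy data: one row per player-season, made-up values
toy <- data.frame(
  PlayerName = c("A", "A", "B", "C", "D", "D"),
  Round      = c("1", "1", "2", "Undrafted", "Undrafted", "Undrafted")
)

round_share <- toy %>%
  distinct(PlayerName, Round) %>%              # one row per player and round
  count(Round, name = "player_count") %>%      # players per round
  mutate(share = player_count / sum(player_count))
```

Here `share` sums to 1 across rounds, so a value of 0.5 for the undrafted group means half of the distinct players in the toy data were undrafted.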

We can sum the per-game scoring averages of the Boston Celtics players in each season and plot the result to see if there are any trends that can be picked up.

bos <- basketball_data %>%
  filter(Team == "BOS") %>%
  mutate(season = as.character(season),
         season = substr(season, 1, 4)) %>%
  group_by(season) %>%
  summarize(total_pts = sum(pts, na.rm = TRUE))

ggplot(bos, aes(x = factor(season), y = total_pts, group = 1)) +
  geom_point(color = "black") +
  geom_line(color = "black") +
  labs(title = "Sum of Player Points-per-Game Averages, Boston Celtics, by Season",
       x = "Season", y = "Total Points") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 8))

The plot shows a generally upward trend over time: the summed per-game averages of the Celtics roster are higher in recent seasons than in earlier ones. In back-to-back drafts in 2016 and 2017, the Celtics drafted their star players, Jaylen Brown and Jayson Tatum, which likely contributes to the high totals in recent years.
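
Summing per-game averages across a roster overstates team scoring when some players appear in only a few games. A hedged sketch of a games-weighted estimate follows. It assumes each row’s `gp` and `pts` were recorded with the listed team and that the team played `team_games` games; both are assumptions, and `team_games = 82` does not hold for shortened seasons. The data frame `toy` is made up.

```r
library(dplyr)

# Hypothetical toy roster for one season (made-up values)
toy <- data.frame(
  season     = "2020",
  PlayerName = c("A", "B", "C"),
  gp         = c(82, 41, 10),
  pts        = c(20, 10, 5)
)

team_games <- 82  # assumption: a full regular season

team_scoring <- toy %>%
  group_by(season) %>%
  summarise(
    # the quantity plotted in the report: plain sum of per-game averages
    sum_of_averages = sum(pts, na.rm = TRUE),
    # estimated team points per game: total points divided by team games
    est_team_ppg    = sum(pts * gp, na.rm = TRUE) / team_games,
    .groups = "drop"
  )
```

In the toy roster the plain sum is 35, while the games-weighted estimate is lower because two of the three players missed many games.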

We can categorize the players by number of appearances. Here, we place the cuts at 82 (one season), 410 (5 seasons), 820 (10 seasons), and 1230 (15 seasons) games, with a final open-ended category beyond that.

all_games <- basketball_data %>%
  group_by(PlayerName) %>%
  summarize(games = sum(gp, na.rm = TRUE))

all_games <- all_games %>%
  mutate(category = cut(games, breaks = c(0, 82, 410, 820, 1230, Inf),
                        labels = c("0-82", "83-410", "411-820", "821-1230", "1231+")))

ggplot(all_games, aes(x = category, fill = category)) +
  geom_bar() +
  labs(title = "Distribution of Players by Appearances",
       x = "Games Played", y = "Number of Players") +
  theme_minimal()

From the plot above, about half the players play only a season’s worth of games in their career. This shows how hard it is to keep a roster spot in the league. Because of this competition, very few players are able to play for long. It is clear from the plot that the number of players shrinks as appearances increase. Only a tiny subset of players lasted over 1000 games, which shows that they remained good enough as they aged to keep adding appearances to their name.
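
It helps to check how `cut()` treats values that fall exactly on a break. With the default settings the intervals are closed on the right, so 82 games lands in "0-82" and 83 in "83-410", while a total of exactly 0 falls outside the first interval and becomes `NA`. The sketch below uses a made-up vector of game totals.

```r
breaks <- c(0, 82, 410, 820, 1230, Inf)
labels <- c("0-82", "83-410", "411-820", "821-1230", "1231+")

# Hypothetical game totals chosen to sit on or near the breaks
games <- c(0, 82, 83, 1230, 1500)
category <- cut(games, breaks = breaks, labels = labels)

as.character(category)
# returns NA "0-82" "83-410" "821-1230" "1231+"

# Share of (non-missing) players in each category
prop.table(table(category))
```

Passing `include.lowest = TRUE` to `cut()` would place a total of 0 in the first category instead of `NA`.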

Questions Based Analysis

In this section we will tackle the research questions stated above with the help of descriptive statistics and visualizations.

Q 1) Analyzing performance trends with age for certain players.

For instance, we can look at the career trends of LeBron James.

lebron <- basketball_data %>%
  filter(PlayerName == "LeBron James") %>%
  arrange(season)

lebron$season <- as.numeric(substring(lebron$season, 1, 4))

ggplot(lebron, aes(x = season)) +
  geom_line(aes(y = pts, color = "Points"), linewidth = 1) +
  geom_line(aes(y = reb, color = "Rebounds"), linewidth = 1) +
  geom_line(aes(y = ast, color = "Assists"), linewidth = 1) +
  labs(title = "LeBron James Season Stats Over Time",
       x = "Season", y = "Stats") +
  scale_color_manual(values = c("Points" = "blue", "Rebounds" = "green", "Assists" = "red")) +
  theme_minimal() +
  ylim(0, NA)

From the plot above, LeBron has been mostly consistent throughout his career. There are some noticeable highs and lows, but they stay within a respectable range. That is expected, because no player can be at their best in every moment of their career. He averages at least 20 points per game in every season, and between 5 and 10 rebounds and assists per game. Those are impressive numbers to sustain over a 20-year career, and they help explain why many consider him one of the greatest of all time.

We can also have a look at the percentage stats to get a better idea.

ggplot(lebron, aes(x = season)) +
  geom_line(aes(y = `oreb %`, color = "Oreb %"), linewidth = 1, na.rm = TRUE) +
  geom_line(aes(y = `dreb %`, color = "Dreb %"), linewidth = 1, na.rm = TRUE) +
  geom_line(aes(y = `usg %`, color = "Usg %"), linewidth = 1, na.rm = TRUE) +
  geom_line(aes(y = `ts %`, color = "Ts %"), linewidth = 1, na.rm = TRUE) +
  geom_line(aes(y = `ast %`, color = "Ast %"), linewidth = 1, na.rm = TRUE) +
  labs(title = "LeBron James Percentage Stats Over Time",
       x = "Season", y = "Stats") +
  scale_color_manual(values = c("Oreb %" = "purple", "Dreb %" = "orange", "Usg %" = "brown","Ts %" = "pink", "Ast %" = "cyan")) +
  theme_minimal()

LeBron’s offensive rebound percentage is low, close to 0. This makes sense because most missed shots are rebounded by the defense, so offensive rebounds are rare for a forward like him. His defensive rebound % and usage % have remained consistent for his entire career. He had a sharp decline in assist % in 2020 and 2021, which is likely related to the multiple injuries and Covid-19 disruptions that the Los Angeles Lakers suffered in those seasons.

His true shooting percentage (ts %) is high, which shows that he scores efficiently and plays a key role in his teams’ performances.
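
For reference, true shooting percentage is commonly defined as points divided by twice the sum of field goal attempts and 0.44 times free throw attempts. The dataset only ships the finished `ts %` column, so the inputs below (`fga`, `fta`) are hypothetical values used purely to illustrate the formula.

```r
# True shooting percentage
# pts = points, fga = field goal attempts, fta = free throw attempts
true_shooting <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Hypothetical per-game line: 25 points on 18 shot attempts and 6 free throws
true_shooting(pts = 25, fga = 18, fta = 6)
```

The result is a little above 0.60, in the same range as the `ts %` values shown for the top scorers earlier in the report.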

Q 2) Which players have improved/regressed?

We can look at the career trends of two popular players whose performances moved in opposite directions with age. While basketball has plenty of important stats, points scored is still the most crucial because that is how games are decided. It also plays a part in other stats such as assists.

A question arises as to whether it is a good idea to analyze only specific players for career trajectories. After all, there are more than two thousand players, so a few specific trends cannot represent them all. However, plotting the career trend of every single player who has played in the league is a tall task if the code chunk has to be repeated for each one. One alternative is to compare each season’s performance with the previous season, but that will not give the whole story either, because many players plateau at different points in their careers.
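
One way to avoid repeating a chunk per player is to summarise each career with a single number. The sketch below fits a straight line to points per game against season for every player and labels the sign of the slope. This is a swapped-in simplification, not the method used in the report: a linear slope ignores plateaus and peaks, and the data frame `toy` is made up.

```r
library(dplyr)

# Hypothetical toy data: two players over four seasons (made-up values)
toy <- data.frame(
  PlayerName = rep(c("Riser", "Decliner"), each = 4),
  season     = rep(2015:2018, times = 2),
  pts        = c(8, 12, 16, 20, 20, 17, 14, 11)
)

trends <- toy %>%
  group_by(PlayerName) %>%
  filter(n() >= 3) %>%   # need a few seasons to fit a line
  summarise(
    # change in points per game for each additional season
    slope = unname(coef(lm(pts ~ season))["season"]),
    .groups = "drop"
  ) %>%
  mutate(trend = ifelse(slope > 0, "improved",
                        ifelse(slope < 0, "regressed", "flat")))
```

Sorting `trends` by `slope` would then list the strongest improvers and decliners without plotting each player individually.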

First we will analyze the scoring performance of Dwight Howard.

dwight <- basketball_data %>%
  filter(PlayerName == "Dwight Howard") %>%
  arrange(season)

dwight$season <- as.numeric(substring(dwight$season, 1, 4))

ggplot(dwight, aes(x = season, y = pts)) +
  geom_line(color = "blue", linewidth = 1) +
  labs(title = "Dwight Howard's Points Per Game Over Time",
       x = "Season", y = "Points Per Game") +
  theme_minimal() +
  ylim(0, NA)

Dwight Howard was on the rise from his rookie season, improving from around 10 ppg to close to 20 ppg. However, from 2016 onward his scoring declined sharply. Towards the end of the plot we see a steep drop, which suggests that age and injuries derailed his career. After the 2021-22 season, he was out of the NBA.

Now we will analyze the performance of Jimmy Butler who is one of the best players in the league today.

jimmy <- basketball_data %>%
  filter(PlayerName == "Jimmy Butler") %>%
  arrange(season)

jimmy$season <- as.numeric(substring(jimmy$season, 1, 4))

ggplot(jimmy, aes(x = season, y = pts)) +
  geom_line(color = "red", linewidth = 1) +
  labs(title = "Jimmy Butler's Points Per Game Over Time",
       x = "Season", y = "Points Per Game") +
  theme_minimal() +
  ylim(0, NA)

In this plot, we see the opposite of Dwight Howard’s trend. Jimmy Butler has improved massively with time. As a new player he scored modestly, at about 10 points per game or fewer; however, he picked up pace, and in recent seasons he has averaged at least 20 points per game. This places him among the top players in today’s game.

These were just two of the many players in the list. While it is inconvenient to make a plot for every single player, we can see that there is a variety of trends. Age does not affect all players equally, as seen from the above plots. Some improve with time, contrary to popular belief.

Q 3) Impact of Physical Attributes?

We can compare physical attributes such as height and weight with characteristics like points, rebounds, etc.

First we get the data of all players.

players <- basketball_data %>%
  group_by(PlayerName) %>%
  summarise(Height = mean(Height, na.rm = TRUE),
            Weight = mean(Weight, na.rm = TRUE),
            AvgPointsPerGame = mean(pts, na.rm = TRUE),
            AvgReboundsPerGame = mean(reb, na.rm = TRUE),
            AvgAssistsPerGame = mean(ast, na.rm = TRUE),
            AvgNetRating = mean(NetRating, na.rm = TRUE))

head(players)

## # A tibble: 6 × 7
##   PlayerName Height Weight AvgPointsPerGame AvgReboundsPerGame AvgAssistsPerGame
##   <chr>       <dbl>  <dbl>            <dbl>              <dbl>             <dbl>
## 1 A.C. Green   206.  102.              5.78               6.06              0.86
## 2 A.J. Bram…   208.  103.              1                  2.8               0   
## 3 A.J. Guyt…   185.   81.6             3.8                0.7               1.57
## 4 A.J. Laws…   198.   81.2             3.7                1.4               0.1 
## 5 AJ Green     196.   86.2             4.4                1.3               0.6 
## 6 AJ Griffin   198.   99.8             8.9                2.1               1   
## # ℹ 1 more variable: AvgNetRating <dbl>

We begin with the relationship between height and average points per game.

ggplot(players, aes(x = Height, y = AvgPointsPerGame)) +
  geom_point(alpha = 0.5) +
  labs(title = "Height vs. Average Points Per Game",
       x = "Height", y = "Average Points Per Game") +
  theme_minimal()

From the plot we see that most high scorers are between 185 and 215 cm tall, or roughly between 6 feet 1 inch and 7 feet. Because high scorers appear across such a wide range, height is not strongly correlated with scoring. We say “strongly” because players at the extreme heights do not score as well, so height does have some influence on scoring.

We now compare weight with points scored.

ggplot(players, aes(x = Weight, y = AvgPointsPerGame)) +
  geom_point(alpha = 0.5) +
  labs(title = "Weight vs. Average Points Per Game",
       x = "Weight", y = "Average Points Per Game") +
  theme_minimal()

Similarly, most top scorers weigh between 75 and 125 kg. Given this wide range and the weaker scoring of players at the extremes, we can say that weight also has some impact on scoring ability.

We do the same for rebounds and assists.

ggplot(players, aes(x = Height, y = AvgReboundsPerGame)) +
  geom_point(size = 3, color = "skyblue") +
  labs(title = "Height vs. Average Rebounds Per Game",
       x = "Height", y = "Average Rebounds Per Game") +
  theme_minimal()

Unlike points, rebounds are clearly affected by player height. Aside from a few outliers, the highest rebounding averages belong to players who are close to 7 feet tall. This makes sense, because their height and longer wingspan help them secure the ball after a missed shot, whether on defense or after a miss by their own teammate.

ggplot(players, aes(x = Weight, y = AvgReboundsPerGame)) +
  geom_point(size = 3, color = "red") +
  labs(title = "Weight vs. Average Rebounds Per Game",
       x = "Weight", y = "Average Rebounds Per Game") +
  theme_minimal()

However, weight does not have as much of an impact on rebounds as height does. One possible reason is that the same weight means different things at different heights: 100 kg on a 6-foot player is not the same as 100 kg on a 7-foot player.

ggplot(players, aes(x = Height, y = AvgAssistsPerGame)) +
  geom_point(size = 3, color = "green") +
  labs(title = "Height vs. Average Assists Per Game",
       x = "Height", y = "Average Assists Per Game") +
  theme_minimal()

From the graph, it appears that players below the average height (200 cm) perform better in the assists category. This makes sense because smaller players usually handle the ball more on offense, which is where they get to assist the scorer.

ggplot(players, aes(x = Weight, y = AvgAssistsPerGame)) +
  geom_point(size = 3, color = "purple") +
  labs(title = "Weight vs. Average Assists Per Game",
       x = "Weight", y = "Average Assists Per Game") +
  theme_minimal()

Shorter, lighter players usually play Point Guard or Shooting Guard, the positions that handle the ball most on offense. These players are responsible for much of the playmaking. The plot supports this: lighter players tend to have more assists, as that comes with the position.
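
The visual impressions above can be backed with correlation coefficients. The sketch below computes Pearson correlations between the two physical attributes and the three per-game stats. It uses a made-up data frame (`toy_players`) shaped like `players`, so the resulting values are illustrative and say nothing about the real data.

```r
# Hypothetical toy data in the same shape as `players` (made-up values)
toy_players <- data.frame(
  Height             = c(185, 190, 198, 203, 208, 213),
  Weight             = c(82, 88, 95, 104, 110, 118),
  AvgPointsPerGame   = c(12, 9, 14, 8, 11, 10),
  AvgReboundsPerGame = c(2.5, 3.0, 4.5, 6.0, 7.5, 9.0),
  AvgAssistsPerGame  = c(6.0, 4.5, 3.0, 2.0, 1.5, 1.0)
)

# Rows: physical attributes; columns: per-game stats
cor_table <- cor(
  toy_players[, c("Height", "Weight")],
  toy_players[, c("AvgPointsPerGame", "AvgReboundsPerGame", "AvgAssistsPerGame")],
  use = "complete.obs"
)

round(cor_table, 2)
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one; note that Pearson correlation will miss patterns such as scoring that is highest at middle heights.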

Q 4) Impact of Star Player on the team.

We can analyze the stats of both the star player and the team overall to see if there is a strong correlation. For instance, we can use the data of Stephen Curry and the Golden State Warriors. We will analyze the data from 2009 onward, which is when he made his debut.

curry <- basketball_data %>%
  filter(PlayerName == "Stephen Curry")

warriors <- basketball_data %>%
  filter(Team == "GSW" & as.numeric(substring(season, 1, 4)) >= 2009)


total_gsw <- warriors %>%
  group_by(season) %>%
  summarise(total_pts = sum(pts),
            total_reb = sum(reb),
            total_ast = sum(ast))

combined_data <- merge(curry, total_gsw, by = "season", suffixes = c("_curry", "_total_gsw"))

combined_data$season <- as.numeric(substring(combined_data$season, 1, 4))

ggplot(combined_data, aes(x = season)) +
  geom_line(aes(y = pts, color = "Curry"), linewidth = 1) +
  geom_line(aes(y = total_pts, color = "Team"), linewidth = 1) +
  labs(title = "Curry's Points vs Total Team Points",
       x = "Season", y = "Points") +
  scale_color_manual(values = c("Curry" = "blue", "Team" = "orange")) +
  theme_minimal() +
  ylim(0, NA)

The team total was at its peak in Curry’s rookie season, then declined before rising again. Its trend is broadly similar to Curry’s own: during the team’s major decline, his scoring also dipped slightly, and over time both the player and the team improved. This suggests that Curry had an impact on the team’s scoring, although the team total includes his own points, so some similarity is expected.

ggplot(combined_data, aes(x = season)) +
  geom_line(aes(y = reb, color = "Curry"), linewidth = 1) +
  geom_line(aes(y = total_reb, color = "Team"), linewidth = 1) +
  labs(title = "Curry's Rebounds vs Total Team Rebounds",
       x = "Season", y = "Rebounds") +
  scale_color_manual(values = c("Curry" = "blue", "Team" = "orange")) +
  theme_minimal() +
  ylim(0, NA)

Curry’s rebounds have little to no effect on the team total: while the team figure fluctuates, his stays consistent. We saw previously that bigger players are more successful at accumulating rebounds. Because Curry is on the smaller side, at 188 cm and 84 kg, he is not expected to collect many rebounds per game.

ggplot(combined_data, aes(x = season)) +
  geom_line(aes(y = ast, color = "Curry"), linewidth = 1) +
  geom_line(aes(y = total_ast, color = "Team"), linewidth = 1) +
  labs(title = "Curry's Assists vs Total Team Assists",
       x = "Season", y = "Assists") +
  scale_color_manual(values = c("Curry" = "blue", "Team" = "orange")) +
  theme_minimal() +
  ylim(0, NA)

This plot looks much like the one for rebounds. Curry’s assists are not strongly related to the team total. This makes sense because he is the main shooter of the team. With teammates like Klay Thompson and Draymond Green feeding him the ball, his main job is to score accurately, which has worked in the team’s favor: they have won 4 NBA titles with him on the team.
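
Because the team totals include the star player’s own numbers, the two lines in each plot move together partly by construction. A minimal sketch of one way to separate them is to subtract the player from the total and look at his share. The data frame `toy` below is made up and only mirrors the `pts` and `total_pts` columns of `combined_data`.

```r
# Hypothetical toy data in the same shape as `combined_data` (made-up values)
toy <- data.frame(
  season    = 2015:2018,
  pts       = c(24, 30, 25, 26),       # player's points per game
  total_pts = c(110, 118, 113, 112)    # summed roster averages, player included
)

toy$rest_pts  <- toy$total_pts - toy$pts   # roster total without the player
toy$pts_share <- toy$pts / toy$total_pts   # player's share of the total

# Correlation between the player and the rest of the roster
cor(toy$pts, toy$rest_pts)
```

A strong correlation between `pts` and `rest_pts` would be better evidence of a link between the player and his teammates than the similarity of the two original lines.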

Q 5) Significance of Country of Origin?

We can compare performances of players from the top 10 countries to see if there is an advantage of being from a certain country.

average_performance <- basketball_data %>%
  group_by(PlayerName, country) %>%
  summarise(avg_pts = mean(pts, na.rm = TRUE)) %>%
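  group_by(country) %>%
  # median first: summarise() evaluates in order, so once avg_pts is
  # overwritten, median(avg_pts) would simply return the mean
  summarise(median_pts = median(avg_pts, na.rm = TRUE),
            avg_pts = mean(avg_pts, na.rm = TRUE))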

## `summarise()` has grouped output by 'PlayerName'. You can override using the
## `.groups` argument.

# Relabel US Virgin Islands as USA (this runs after aggregation, so two rows can share the "USA" label)
average_performance <- average_performance %>%
  mutate(country = ifelse(country %in% c("US Virgin Islands", "USA"), "USA", country))

top_countries <- average_performance %>%
  top_n(10, wt = avg_pts)

ggplot(top_countries, aes(x = reorder(country, -avg_pts), y = avg_pts)) +
  geom_bar(stat = "identity", fill = "skyblue", position = "dodge", alpha = 0.8) +
  labs(title = "Average Points per Game by Country (Top 10)",
       x = "Country", y = "Average Points per Game") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We compared the top 10 countries, and the bar labeled USA leads by a sizeable margin at about 18 points per game, with Cameroon second at about 13. Note, however, that the relabeling step runs after the country averages are computed, so the “USA” label can also cover the separate US Virgin Islands row, which is built from very few players; averages based on a handful of players are easily inflated by a single star. With that caveat, the plot suggests an edge for American players in offensive performance, which can plausibly be credited to youth basketball and elite coaching across all levels.

Now we compare the average rebounds.

average_performance <- basketball_data %>%
  group_by(PlayerName, country) %>%
  summarise(avg_reb = mean(reb, na.rm = TRUE)) %>%
  group_by(country) %>%
  # median first, before avg_reb is overwritten by its mean
  summarise(median_reb = median(avg_reb, na.rm = TRUE),
            avg_reb = mean(avg_reb, na.rm = TRUE))

## `summarise()` has grouped output by 'PlayerName'. You can override using the
## `.groups` argument.

average_performance <- average_performance %>%
  mutate(country = ifelse(country %in% c("US Virgin Islands", "USA"), "USA", country))

top_reb <- average_performance %>%
  top_n(10, wt = avg_reb)

ggplot(top_reb, aes(x = reorder(country, -avg_reb), y = avg_reb)) +
  geom_bar(stat = "identity", fill = "skyblue", position = "dodge", alpha = 0.8) +
  labs(title = "Average Rebounds per Game by Country (Top 10)",
       x = "Country", y = "Average Rebounds per Game") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Just like points, American players are well ahead in rebounds compared to the next best country, which is Switzerland.

average_performance <- basketball_data %>%
  group_by(PlayerName, country) %>%
  summarise(avg_ast = mean(ast, na.rm = TRUE)) %>%
  group_by(country) %>%
  # median first, before avg_ast is overwritten by its mean
  summarise(median_ast = median(avg_ast, na.rm = TRUE),
            avg_ast = mean(avg_ast, na.rm = TRUE))

## `summarise()` has grouped output by 'PlayerName'. You can override using the
## `.groups` argument.

average_performance <- average_performance %>%
  mutate(country = ifelse(country %in% c("US Virgin Islands", "USA"), "USA", country))

top_ast <- average_performance %>%
  top_n(10, wt = avg_ast)

ggplot(top_ast, aes(x = reorder(country, -avg_ast), y = avg_ast)) +
  geom_bar(stat = "identity", fill = "skyblue", position = "dodge", alpha = 0.8) +
  labs(title = "Average Assists per Game by Country (Top 10)",
       x = "Country", y = "Average Assists per Game") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In assists, the USA remains at the top, but the competition is closer: Russian and Danish players are just as good in assists per game. Since we have seen that American players are better scorers, one possible reading is that international players serve more as assist providers to the main scorers of their teams.
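
The three country comparisons share the same pipeline, and two details of it affect the rankings: the relabeling runs after aggregation, and countries with only one or two players can top the list. A hedged sketch of an alternative is shown below. It relabels before grouping and keeps only countries with a minimum number of players. The data frame `toy` and the threshold `min_players` are hypothetical choices made here, not values from the report.

```r
library(dplyr)

# Hypothetical toy data: one row per player-season (made-up values)
toy <- data.frame(
  PlayerName = c("A", "A", "B", "C", "D", "E"),
  country    = c("USA", "USA", "USA", "US Virgin Islands", "France", "France"),
  pts        = c(10, 14, 4, 19, 8, 6)
)

min_players <- 2  # assumption: hypothetical threshold, tune to the data

country_pts <- toy %>%
  # relabel before grouping so both labels end up in one group
  mutate(country = ifelse(country == "US Virgin Islands", "USA", country)) %>%
  group_by(PlayerName, country) %>%
  summarise(player_pts = mean(pts, na.rm = TRUE), .groups = "drop") %>%
  group_by(country) %>%
  summarise(
    n_players  = n(),                 # players behind each average
    avg_pts    = mean(player_pts),
    median_pts = median(player_pts),
    .groups = "drop"
  ) %>%
  filter(n_players >= min_players) %>%
  slice_max(avg_pts, n = 10)
```

Reporting `n_players` next to each average makes it easier to judge how much weight a country’s bar deserves; the same pattern applies to rebounds and assists by swapping the stat column.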

Critical Reflection

We had a dataset with the season stats of all players from 1996-97 to 2022-23. It first gives a background on each player, such as their physical attributes, college, nationality, and draft pick.

This is followed by each player’s season stats, such as per-game points, rebounds, and assists, and percentage stats such as oreb_pct, dreb_pct, usg_pct, ts_pct, and ast_pct. There is also a unique attribute, net_rating, which puts more emphasis on impact than on raw stats.

We observed that the physical attributes of the players have not changed much over more than two decades, which was expected. One surprising observation was that there are more undrafted players than second rounders.

Our analysis suggested that American players tend to be statistically better than their international counterparts. This can be attributed to the popularity of the sport in the country as well as the ease of access to equipment and coaching camps across all age groups. The consistent preparation as they grow up makes them physically ready to be part of the professional league.

We analyzed the performances of certain individual players who are popular in the press such as LeBron James and Stephen Curry. Players in this category are often the topic of sports shows and online articles. They also sign the biggest contracts and get featured in several endorsements on TV.

A person new to basketball will assume that they must be the top players, and the stats back that assumption. These players have been consistent for several years in a league where most players are out after a season or never get to start a game. They are enjoying the rewards of the effort they put in on the court.

We also saw that age does not treat all players equally. Dwight Howard and Jimmy Butler are two contrasting examples, as their performance trends were the complete opposite of each other. Dwight regressed as expected, but Jimmy only got better with time. Overall, the trends across players vary a lot, but we can assume that only a few players either stay consistent or get better.

While stats and visualizations provide a good level of information to a viewer, certain questions are left unanswered.

How successful was a certain team?

While stats provide technical information about the average game performance of a team, they cannot reveal the overall success of the team. Just by looking at the stats, one cannot conclude that a certain team lifted the championship that year. Upsets have always been a crucial part of basketball, and we have seen several shots at the final buzzer change the outcomes of games.

Reason for poor performance in certain seasons?

Down seasons are a part of an athlete’s life. They are often a symptom of one or more causes, such as injuries, a change in game scheme, or personal reasons. The dataset does not provide this information, so it cannot answer this question.

Impact of team chemistry and locker room culture?

In the end, players are people like everyone else. They can differ widely in social skills, temper, discipline, and so on. While this can be a learning experience for the team and staff as a whole, it brings its own problems. Quarrels among players and coaches can hurt the overall morale of the team, which can lead to poor on-court performance. We often hear certain players described as “locker room cancers” because of their bad reputation in the players’ circle.

Impact of coaching strategies on outcomes?

So far we have only dealt with player performances. Good coaching is very important for being more successful than other teams. Who plays where and how is decided by the coaches. Scheming correctly and leveraging the strengths of the players yields the best results. Coaches of top teams are always under pressure to deliver a successful playoff run or championship, which underlines their crucial role. The dataset unfortunately contains no information about the coaches.

Fan Base and Home Court Advantage?

We have often heard of these terms: having a strong fan base and playing at home. To someone new to sports they may not seem to matter, as their focus will be solely on end results. However, people who follow any kind of sport know the importance of these two factors. Fan support can provide a mental advantage against the opponents, and the increased confidence can help the players be more aggressive, which can lead to more offensive success. Similarly, players feel more comfortable at home because that’s where they play half of their games. Being in a different environment, especially at a rival’s home, can be intimidating, which can make it difficult to perform as well.

Impact of off-court activities?

What a player does in the off-season and on off days plays a large part in his on-court performance. Time spent on activities like team building, workouts, mental health, and recreation can benefit a player’s overall development. However, we often hear about NBA players getting in trouble with the law, overspending, or facing personal controversies, which makes headlines and puts them in a bad spotlight. When this involves a key player such as Ja Morant or Draymond Green, the resulting suspensions can affect the entire team and often coincide with a decline in team performance.

Bibliography

Link for the dataset obtained from Kaggle - https://www.kaggle.com/datasets/justinas/nba-players-data
R Programming Documentation - https://www.r-project.org/other-docs.html
The R for Data Science Book - https://r4ds.had.co.nz/index.html