Homework 1

Introduction

In Homework 1 we perform the basic data operations already covered in the tutorials and challenges. The goals are to introduce the dataset and to describe it both textually and statistically.

Dataset

We need to load the necessary libraries for our operations.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

We can now load the dataset.

basketball_data <- read.csv("all_seasons.csv")

Reading the data and Narrative

We can get the first few rows of the data.

head(basketball_data)

##   no.      player_name team_abbreviation age player_height player_weight
## 1   0 Randy Livingston               HOU  22        193.04      94.80073
## 2   1 Gaylon Nickerson               WAS  28        190.50      86.18248
## 3   2     George Lynch               VAN  26        203.20     103.41898
## 4   3   George McCloud               LAL  30        203.20     102.05820
## 5   4     George Zidek               DEN  23        213.36     119.74829
## 6   5   Gerald Wilkins               ORL  33        198.12     102.05820
##                 college country draft_year draft_round draft_number gp  pts reb
## 1       Louisiana State     USA       1996           2           42 64  3.9 1.5
## 2 Northwestern Oklahoma     USA       1994           2           34  4  3.8 1.3
## 3        North Carolina     USA       1993           1           12 41  8.3 6.4
## 4         Florida State     USA       1989           1            7 64 10.2 2.8
## 5                  UCLA     USA       1995           1           22 52  2.8 1.7
## 6 Tennessee-Chattanooga     USA       1985           2           47 80 10.6 2.2
##   ast net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct  season
## 1 2.4        0.3    0.042    0.071   0.169  0.487   0.248 1996-97
## 2 0.3        8.9    0.030    0.111   0.174  0.497   0.043 1996-97
## 3 1.9       -8.2    0.106    0.185   0.175  0.512   0.125 1996-97
## 4 1.7       -2.7    0.027    0.111   0.206  0.527   0.125 1996-97
## 5 0.3      -14.1    0.102    0.169   0.195  0.500   0.064 1996-97
## 6 2.2       -5.8    0.031    0.064   0.203  0.503   0.143 1996-97

The dataset has one row per player per season, covering every team from the 1996-97 season through 2022-23. Each row records the player’s physical attributes, season statistics, college, and draft selection. Below is a description of every attribute in the data:

no. - This is just the serial number denoting each row.

player_name - The name of the player. A player has one row for each season played, so many names repeat across rows.

team_abbreviation - This is the abbreviation of the player’s team. For instance, the Boston Celtics are listed as BOS and the Los Angeles Lakers as LAL. The league currently has 30 teams, but older abbreviations also appear, such as VAN in the rows above.

age - This represents the age of the player during that season. The age has a large range. For instance, LeBron James made his debut at the age of 19 and was still active in the 2022-23 season at the age of 38. He and several other players have multiple rows under their names, one for each season’s stats.

player_height - This represents the height of the player in centimeters (cm). It is expected to stay the same across seasons because players have usually reached their full height by the time they make their debut.

player_weight - It shows the weight of the player in kilograms (kg). Although in the real world weight can fluctuate depending on lifestyle and position, in this dataset it remains consistent.

college - Shows the college the player represented before turning professional. Not every player has a college name in this column because some were drafted straight out of high school, like LeBron James, and others played internationally, like Nikola Jokic.

country - Represents the country of origin. While most of the players belong to the USA, there is noticeable international representation. It is an important categorical attribute that can be used for future analysis.

draft_year - Represents the year of the NBA draft where the player was selected. Because not every NBA player is drafted, certain players do not have a numerical value in this column.

draft_round - Represents the round in which the player was drafted. The NBA draft has had two rounds since 1989, so a player drafted in 1989 or later has a value of either 1 or 2. Players drafted before 1989 may have a draft_round value of 3 or higher because earlier drafts had more rounds. If the player was undrafted, a numerical value is absent.

draft_number - Shows the overall pick number at which the player was selected, not the position within the round. For example, Randy Livingston in the first row above was pick 42, taken in round 2. With 30 picks per round in recent drafts, the last drafted player is usually the 60th pick overall. As before, undrafted players do not have a numerical value in this column.

gp - This shows the games played by the player in that season. Every team normally plays 82 regular season games, so values are typically 82 or lower.

pts - This represents the points per game scored by the player. A player scores points through free throws (1 point each) and field goals, which include two-point shots such as dunks and three-pointers.

reb - This means rebounds per game. There are two types. An offensive rebound is when a player from the attacking team grabs the ball after a teammate misses a shot, which gives the team another chance to score. A defensive rebound is when the defending team gets the ball after the opposing team’s missed shot.

ast - It is assists per game. A player gets an assist if they pass the ball to their teammate who ends up scoring quickly without any dribbling or passing the ball elsewhere.

net_rating - This is the Net Rating of the player. It is a measure of the player’s impact on team performance. It is the difference between the team’s offensive rating (points scored per 100 possessions with that player on the court) and its defensive rating (points allowed per 100 possessions with that player on the court).

oreb_pct - Offensive Rebound Percentage. It estimates the share of available offensive rebounds that the player grabbed while on the floor. For a whole team the formula is offensive rebounds / (offensive rebounds + opponent’s defensive rebounds); the player version divides the player’s offensive rebounds by the rebounds available during his minutes.

dreb_pct - Defensive Rebound Percentage. Similar to the metric above, it estimates the share of available defensive rebounds that the player grabbed while on the floor. For a whole team the formula is defensive rebounds / (defensive rebounds + opponent’s offensive rebounds).

usg_pct - This is the usage percentage. It estimates the share of the team’s plays that the player used while on the floor, where a play ends in a field goal attempt, a trip to the free throw line, or a turnover. The dataset stores it as a fraction (0.169 means 16.9%). The commonly published formula is:

((field goals attempted + 0.44 * free throws attempted + turnovers) * (team minutes played / 5)) / (minutes played * (team field goals attempted + 0.44 * team free throws attempted + team turnovers))
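
As a minimal sketch, the usage calculation can be written as an R function. It assumes the commonly published form of the formula, in which the player’s plays are scaled by one fifth of the team’s minutes. The function name, its arguments, and the example numbers are hypothetical; none of these box-score inputs are columns in all_seasons.csv.

```r
# Hypothetical helper: these box-score inputs are NOT columns in all_seasons.csv.
usage_pct <- function(fga, fta, tov, mp, team_fga, team_fta, team_tov, team_mp) {
  player_plays <- fga + 0.44 * fta + tov                 # plays used by the player
  team_plays   <- team_fga + 0.44 * team_fta + team_tov  # plays used by the team
  (player_plays * (team_mp / 5)) / (mp * team_plays)
}

# Made-up single-game line: 15 FGA, 5 FTA, 3 TOV in 36 of the team's 240 minutes.
usg_example <- usage_pct(fga = 15, fta = 5, tov = 3, mp = 36,
                         team_fga = 85, team_fta = 25, team_tov = 14, team_mp = 240)
usg_example
```

The result is a fraction on the same scale as the usg_pct column, so it can be compared with the stored values directly.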

ts_pct - True shooting percentage. It is the player’s overall shooting efficiency, taking into account two-point field goals, three-pointers, and free throws. The formula is:

total points / (2 * (field goals attempted + 0.44 * free throws attempted))
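
A short sketch of true shooting percentage as an R function may make the formula easier to check by hand. It uses the commonly published form, which takes free throw attempts in the denominator. The function name, arguments, and the sample stat line are made up for illustration and do not come from the dataset.

```r
# Hypothetical helper; the arguments are illustrative, not dataset columns.
true_shooting <- function(points, fga, fta) {
  points / (2 * (fga + 0.44 * fta))
}

# Made-up game: 30 points on 20 field goal attempts and 10 free throw attempts.
ts_example <- true_shooting(points = 30, fga = 20, fta = 10)
round(ts_example, 3)
```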

ast_pct - Assist percentage. It estimates the share of teammates’ field goals that the player assisted while on the floor. The commonly published formula is:

assists / (((minutes played / (team minutes played / 5)) * team field goals made) - player field goals made)
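
The same idea can be sketched for assist percentage. This assumes the commonly published form: the team’s made field goals are prorated by the player’s share of minutes, the player’s own makes are subtracted, and assists are divided by what remains. All names and numbers below are hypothetical.

```r
# Hypothetical helper; the arguments are illustrative, not dataset columns.
assist_pct <- function(ast, mp, fg, team_fg, team_mp) {
  share_of_minutes <- mp / (team_mp / 5)        # fraction of the game played
  teammate_fg <- share_of_minutes * team_fg - fg  # teammates' makes while on the floor
  ast / teammate_fg
}

# Made-up game: 8 assists and 9 made shots in 36 of 240 team minutes, team made 40 shots.
ast_example <- assist_pct(ast = 8, mp = 36, fg = 9, team_fg = 40, team_mp = 240)
ast_example
```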

season - Represents each particular season of the NBA.

Tidying the data

From the output above, there do not appear to be any missing values and all values appear to be in the correct positions, although undrafted players carry a placeholder instead of a number in the draft columns. We can still make the data easier to read by changing the column names.

We can use the rename function from dplyr package to change the column names.

basketball_data <- basketball_data %>%
  rename(
    PlayerName = player_name,
    Team = team_abbreviation,
    Height = player_height,
    Weight = player_weight,
    DraftYear = draft_year,
    Round = draft_round,
    Number = draft_number,
    NetRating = net_rating,
    `oreb %` = oreb_pct,
    `dreb %` = dreb_pct,
    `usg %` = usg_pct,
    `ts %` = ts_pct,
    `ast %` = ast_pct,
  )

head(basketball_data)

##   no.       PlayerName Team age Height    Weight               college country
## 1   0 Randy Livingston  HOU  22 193.04  94.80073       Louisiana State     USA
## 2   1 Gaylon Nickerson  WAS  28 190.50  86.18248 Northwestern Oklahoma     USA
## 3   2     George Lynch  VAN  26 203.20 103.41898        North Carolina     USA
## 4   3   George McCloud  LAL  30 203.20 102.05820         Florida State     USA
## 5   4     George Zidek  DEN  23 213.36 119.74829                  UCLA     USA
## 6   5   Gerald Wilkins  ORL  33 198.12 102.05820 Tennessee-Chattanooga     USA
##   DraftYear Round Number gp  pts reb ast NetRating oreb % dreb % usg %  ts %
## 1      1996     2     42 64  3.9 1.5 2.4       0.3  0.042  0.071 0.169 0.487
## 2      1994     2     34  4  3.8 1.3 0.3       8.9  0.030  0.111 0.174 0.497
## 3      1993     1     12 41  8.3 6.4 1.9      -8.2  0.106  0.185 0.175 0.512
## 4      1989     1      7 64 10.2 2.8 1.7      -2.7  0.027  0.111 0.206 0.527
## 5      1995     1     22 52  2.8 1.7 0.3     -14.1  0.102  0.169 0.195 0.500
## 6      1985     2     47 80 10.6 2.2 2.2      -5.8  0.031  0.064 0.203 0.503
##   ast %  season
## 1 0.248 1996-97
## 2 0.043 1996-97
## 3 0.125 1996-97
## 4 0.125 1996-97
## 5 0.064 1996-97
## 6 0.143 1996-97

From the above operation we see that the column names are more visually appealing than before.
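
The earlier observation about missing values can be checked rather than judged by eye. The sketch below uses a tiny made-up data frame as a stand-in for basketball_data (the player names and values are hypothetical); the same two calls should work on the real data frame. It counts true NA values per column and, separately, placeholder strings such as "Undrafted", which is.na() does not detect.

```r
# Toy stand-in for basketball_data; every row here is made up for illustration.
toy <- data.frame(
  PlayerName = c("Player A", "Player B", "Player C"),
  Round      = c("1", "Undrafted", "2"),
  pts        = c(10.2, NA, 3.9)
)

# Number of true NA values in each column.
na_counts <- colSums(is.na(toy))
na_counts

# Placeholder strings are not NA, so they have to be counted separately.
undrafted_rows <- sum(toy$Round == "Undrafted")
undrafted_rows
```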

Descriptive Statistics

Because many players appear in multiple seasons, statistics computed over the whole dataset count those players several times and can give a misleading picture. A better strategy is to filter on parameters such as season or team so that each summary covers a well-defined group.

For instance, to get the mean height of all players in the 2000-01 season, we can filter on the season and then summarize:

mean_ht_2000 <- basketball_data %>%
  filter(season == "2000-01") %>%
  summarize(mean_height = mean(Height, na.rm = TRUE))

mean_ht_2000

##   mean_height
## 1    200.7522

From above we see that the mean height of the players in that season was 200.7522 cm.

We can compute the same statistic for the most recent season to see if there is a noticeable difference.

mean_ht_2022 <- basketball_data %>%
  filter(season == "2022-23") %>%
  summarize(mean_height = mean(Height, na.rm = TRUE))

mean_ht_2022

##   mean_height
## 1    199.2651

We see that the difference is not sizeable: the 2022-23 mean is only about 1.5 cm lower, which is less than 1 inch (2.54 cm).
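
The comparison can be made explicit with a few lines of arithmetic. The two means are copied from the outputs above; the variable names are introduced here only for illustration.

```r
# Mean heights (cm) copied from the two outputs above.
mean_ht_2000_cm <- 200.7522
mean_ht_2022_cm <- 199.2651

diff_cm <- mean_ht_2000_cm - mean_ht_2022_cm  # difference in centimeters
diff_in <- diff_cm / 2.54                     # 1 inch = 2.54 cm

round(c(cm = diff_cm, inches = diff_in), 2)
```

The difference comes to about 1.49 cm, or roughly 0.59 inches.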

We can get the median points scored by any team’s players in a particular season. This can also be used to compare individual performances over several seasons.

pts_2014 <- basketball_data %>%
  filter(season == "2014-15" & Team == "BOS") %>%
  summarize(median_points = median(pts, na.rm = TRUE))

pts_2014

##   median_points
## 1          8.65

From above we see that the median points per game among Boston Celtics players in 2014-15 was 8.65.

We can compare it to the same statistic 5 years later, in the 2019-20 season.

median_points_2019 <- basketball_data %>%
  filter(season == "2019-20" & Team == "BOS") %>%
  summarize(median_points = median(pts, na.rm = TRUE))

median_points_2019

##   median_points
## 1           5.2

The median points scored dropped by 3.45. The median reflects the typical player rather than the team total, so this does not by itself show that the Celtics’ offense regressed. It could also mean that the offense stayed similar or improved while one or two players were responsible for the majority of the scoring.

We can get the average physical attributes of a team from any season. For instance, we can get the average height and weight of Golden State Warriors’ players from the 2010-11 season.

phy_gsw <- basketball_data %>%
  filter(season == "2010-11" & Team == "GSW") %>%
  summarize(avg_height = mean(Height, na.rm = TRUE),
            avg_weight = mean(Weight, na.rm = TRUE))

phy_gsw

##   avg_height avg_weight
## 1   200.3457   98.21274

We see that the average height was about 200 cm and weight about 98 kg. We can get this data for any team from any season in the data.

We can find out how many first round draft picks played in the most recent season.

first_round_22 <- basketball_data %>%
  filter(Round == "1" & season == "2022-23") %>%
  summarize(number_of_first_rounders = n())

first_round_22

##   number_of_first_rounders
## 1                      280

We see that 280 players across the 30 teams were former first-round draft picks. To get more insight, we can compute the proportions of first-round picks, second-round picks, and undrafted players.

players_22 <- basketball_data %>%
  filter(season == "2022-23")

draft_proportions <- players_22 %>%
  group_by(Round) %>%
  summarize(proportion = n() / nrow(players_22))

draft_proportions

## # A tibble: 3 × 2
##   Round     proportion
##   <chr>          <dbl>
## 1 1              0.519
## 2 2              0.226
## 3 Undrafted      0.254

We see that a little more than half of the players active last season were first-round picks. Each recent draft produces 30 first-round and 30 second-round picks, so the larger first-round share suggests that first-round picks stay in the league longer. Undrafted players and late second-round picks are more likely to be cut, while first-round picks are more often potential starters.

We can get the average career stats of LeBron James across all the seasons. We will consider points per game, rebounds, and assists.

lebron_stats <- basketball_data %>%
  filter(PlayerName == "LeBron James")

avg_stats <- lebron_stats %>%
  summarize(
    avg_pts = mean(pts, na.rm = TRUE),
    avg_reb = mean(reb, na.rm = TRUE),
    avg_ast = mean(ast, na.rm = TRUE)
  )

avg_stats

##   avg_pts avg_reb avg_ast
## 1    27.2    7.54    7.34

From the above data, we see that LeBron has impressive mean stats across his long career (these are unweighted means of his per-season averages). This shows his consistency from his debut until now, and it makes a strong case for why he is considered one of the best players in the history of the sport.

We can get the standard deviation of games played by players throughout their careers.

player_games_sd <- basketball_data %>%
  group_by(PlayerName) %>%
  summarize(gp_sd = sd(gp))

player_games_sd

## # A tibble: 2,551 × 2
##    PlayerName    gp_sd
##    <chr>         <dbl>
##  1 A.C. Green     14.4
##  2 A.J. Bramlett  NA  
##  3 A.J. Guyton    22.2
##  4 A.J. Lawson    NA  
##  5 AJ Green       NA  
##  6 AJ Griffin     NA  
##  7 AJ Hammons     NA  
##  8 AJ Price       13.6
##  9 Aaron Brooks   16.1
## 10 Aaron Gordon   12.6
## # ℹ 2,541 more rows

From above we see that certain players have NA next to their name. This is not missing data: sd() returns NA when it is given a single value, and these players have only one season in the dataset as of the end of the 2022-23 season. For instance, AJ Griffin was a 2022 draft pick and played 72 games in his lone season so far. The other players with NA likewise have only one season recorded in the dataset.
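
The NA behaviour can be reproduced on a small example. In the sketch below, AJ Griffin’s 72 games come from the text above, while "Player X" and its values are made up for illustration. Adding a season count next to the standard deviation makes the single-season cases explicit, and filtering on that count keeps only players for whom sd() is defined.

```r
library(dplyr)

# Toy rows: AJ Griffin's 72 games are from the text above;
# "Player X" and its values are hypothetical.
toy_games <- data.frame(
  PlayerName = c("AJ Griffin", "Player X", "Player X"),
  gp         = c(72, 64, 80)
)

# Count seasons alongside the standard deviation of games played.
games_summary <- toy_games %>%
  group_by(PlayerName) %>%
  summarize(seasons = n(), gp_sd = sd(gp))

# Keep only players with at least two seasons, where sd() is defined.
multi_season <- games_summary %>%
  filter(seasons > 1)

games_summary
multi_season
```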

We can get the average Net Rating of Kobe Bryant across his career to see the impact he made on the game.

kobe_data <- basketball_data %>%
  filter(PlayerName == "Kobe Bryant")

kobe_net <- kobe_data %>%
  summarize(avg_net_rating = mean(NetRating, na.rm = TRUE))

kobe_net

##   avg_net_rating
## 1            2.6

This shows that, averaged over his seasons, his team outscored opponents by 2.6 points per 100 possessions while he was on the court.

We can get the median percentage stats of Stephen Curry over his career.

curry_data <- basketball_data %>%
  filter(PlayerName == "Stephen Curry")

curry_stats <- curry_data %>%
  summarize(med_oreb = median(`oreb %`, na.rm = TRUE),
            med_dreb = median(`dreb %`, na.rm = TRUE),
            med_usg = median(`usg %`, na.rm = TRUE),
            med_ts = median(`ts %`, na.rm = TRUE),
            med_ast = median(`ast %`, na.rm = TRUE))

curry_stats

##   med_oreb med_dreb med_usg med_ts med_ast
## 1   0.0225     0.12   0.288  0.617  0.2865

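Stephen Curry’s median true shooting percentage of 0.617 and median usage percentage of 0.288 indicate high efficiency on a large offensive workload. A direct comparison with other star players at his position would require the same summary for them, but these numbers are consistent with why many people associated with the game consider him great. He is a vital piece for his team, as its 4 championships with him suggest.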

Research Questions

The following questions can be answered by analysis of the data:

How does a player’s performance change with age?

Because we have entire/partial career stats over multiple seasons, we can get a good idea on the performance trends.

Which players have improved or regressed?

Just like above, we can use similar data to see which players have improved and which have regressed with time. This can also help in detecting down years and career years.

Do players’ physical attributes have an impact?

We are given the heights and weights of the players, which can be used to compare players of similar and different builds. We can also compare specific stats, such as rebounds, where size plays a factor.

Does draft pick matter?

Because each player falls into a certain category of Draft, we can use this data to assess if there is a noticeable success among first rounders compared to second round picks and undrafted players.
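
As a rough sketch of how this question could be approached, the code below groups players by draft round and summarizes points per game. To stay self-contained it uses only the six rows shown in the head() output above, with the renamed columns, so the numbers illustrate the technique and say nothing about the league as a whole. On the full data frame the same pipeline would also produce a group for undrafted players.

```r
library(dplyr)

# The six rows from the head() output above (1996-97 season).
first_six <- data.frame(
  PlayerName = c("Randy Livingston", "Gaylon Nickerson", "George Lynch",
                 "George McCloud", "George Zidek", "Gerald Wilkins"),
  Round      = c("2", "2", "1", "1", "1", "2"),
  pts        = c(3.9, 3.8, 8.3, 10.2, 2.8, 10.6)
)

# Summarize scoring by draft round.
pts_by_round <- first_six %>%
  group_by(Round) %>%
  summarize(players    = n(),
            mean_pts   = mean(pts),
            median_pts = median(pts))

pts_by_round
```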

What is the impact of a star player on the team?

Often, due to serious injury, a player can miss most or all of a season. We can use this data to see how the team performs with and without the player, which helps in analyzing how vital a certain player is to his team.

Is a player’s place of origin significant?

We can look for statistical tendencies among players from the same country. This can help show whether international players tend to have a different playing style compared to American players.