In Homework 1 we perform the basic data-related operations that we have already come across in the tutorials and challenges. The goals are to introduce the dataset and to describe it both in words and with summary statistics.
We need to load the necessary libraries for our operations.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
We can now load the dataset.
basketball_data <- read.csv("all_seasons.csv")
We can get the first few rows of the data.
head(basketball_data)
## no. player_name team_abbreviation age player_height player_weight
## 1 0 Randy Livingston HOU 22 193.04 94.80073
## 2 1 Gaylon Nickerson WAS 28 190.50 86.18248
## 3 2 George Lynch VAN 26 203.20 103.41898
## 4 3 George McCloud LAL 30 203.20 102.05820
## 5 4 George Zidek DEN 23 213.36 119.74829
## 6 5 Gerald Wilkins ORL 33 198.12 102.05820
## college country draft_year draft_round draft_number gp pts reb
## 1 Louisiana State USA 1996 2 42 64 3.9 1.5
## 2 Northwestern Oklahoma USA 1994 2 34 4 3.8 1.3
## 3 North Carolina USA 1993 1 12 41 8.3 6.4
## 4 Florida State USA 1989 1 7 64 10.2 2.8
## 5 UCLA USA 1995 1 22 52 2.8 1.7
## 6 Tennessee-Chattanooga USA 1985 2 47 80 10.6 2.2
## ast net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct season
## 1 2.4 0.3 0.042 0.071 0.169 0.487 0.248 1996-97
## 2 0.3 8.9 0.030 0.111 0.174 0.497 0.043 1996-97
## 3 1.9 -8.2 0.106 0.185 0.175 0.512 0.125 1996-97
## 4 1.7 -2.7 0.027 0.111 0.206 0.527 0.125 1996-97
## 5 0.3 -14.1 0.102 0.169 0.195 0.500 0.064 1996-97
## 6 2.2 -5.8 0.031 0.064 0.203 0.503 0.143 1996-97
The dataset contains one row per player per season for every NBA team from the 1996-97 season through 2022-23. The variables cover physical attributes, season statistics, college, and draft selection. Each attribute is described below.
no. - This is just the serial number denoting each row.
player_name - The name of the player. There are several repeat players in the rows because several players tend to play multiple seasons in the NBA.
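team_abbreviation - This is the abbreviation of the player’s team. For instance, the Boston Celtics are listed as BOS and the Los Angeles Lakers as LAL. The league currently has 30 teams, but the column can contain more than 30 distinct abbreviations because franchises have relocated over the period; the first rows above include VAN for the former Vancouver Grizzlies.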
age - The age of the player during that season. The range is wide: for instance, LeBron James made his debut at the age of 19 and was still active in the 2022-23 season at the age of 38. He and several other players have multiple rows, one showing their stats for each season.
player_height - The height of the player in centimeters (cm). This remains unchanged irrespective of the season because by the time players make their debut they have usually reached their adult height.
player_weight - The weight of the player in kilograms (kg). Although in the real world weight can fluctuate depending on lifestyle and position, in this dataset it remains consistent.
college - Shows the college the player represented before becoming a professional. Not every player has a college name in this column because some were drafted out of high school, like LeBron James, and others played internationally, like Nikola Jokic.
country - Represents the country of origin. While most of the players belong to the USA, there is noticeable international representation. It is an important categorical attribute that can be used for future analysis.
draft_year - Represents the year of the NBA draft where the player was selected. Because not every NBA player is drafted, certain players do not have a numerical value in this column.
draft_round - Represents the round in which the player was drafted. The NBA draft has had two rounds since 1989, so a player drafted from 1989 onward has a value of either 1 or 2. Certain players who were drafted before 1989 may have a draft_round value of 3 because there used to be a 3rd round. If the player was undrafted, a numerical value is absent.
draft_number - Shows the overall pick at which the player was selected, not the pick within the round: in the first rows above, second-round picks carry numbers such as 42, 34, and 47. With 30 picks per round in the current format, the lowest drafted player is the 60th pick overall. As before, undrafted players do not have a numerical value in this column.
gp - The number of games the player appeared in during that season. A team plays 82 games in a standard regular season, so values are normally at most 82.
pts - Points per game scored by the player. Points come from free throws (1 point each), two-point field goals such as layups and dunks, and three-pointers.
reb - Rebounds per game. There are two types. An offensive rebound is when a player on the attacking team grabs the ball after a teammate misses a shot, which gives the team another chance to score. A defensive rebound is when the defending team gets the ball after the opposing team’s missed shot.
ast - Assists per game. A player gets an assist when they pass the ball to a teammate who scores quickly, without extended dribbling or passing the ball elsewhere.
net_rating - The Net Rating of the player. It measures the impact of the player on team performance. It is the difference between the team’s offensive rating (points scored per 100 possessions with that player on the floor) and its defensive rating (points allowed per 100 possessions with that player on the floor).
oreb_pct - Offensive Rebound Percentage. It estimates the percentage of available offensive rebounds that the player grabbed while on the floor. The formula is offensive rebounds / (offensive rebounds + opponent’s defensive rebounds).
dreb_pct - Defensive Rebound Percentage. Like the metric above, it estimates the percentage of available defensive rebounds that the player grabbed while on the floor. The formula is defensive rebounds / (defensive rebounds + opponent’s offensive rebounds).
usg_pct - Usage percentage. It estimates the share of the team’s plays that the player uses while on the floor, where a play is used through a field goal attempt, a free throw attempt, or a turnover. The formula is:
[(field goals attempted + 0.44 * free throws attempted + turnovers) * (team minutes played / 5)] / [minutes played * (team field goals attempted + 0.44 * team free throws attempted + team turnovers)]
ts_pct - True shooting percentage. It is the player’s overall shooting efficiency, taking two-point field goals, three-pointers, and free throws into account. The formula is:
total points / (2 * [field goal attempts + (0.44 * free throw attempts)])
ast_pct - Assist percentage. It estimates the percentage of teammates’ field goals that the player assisted while on the floor. The formula is:
assists / ([(minutes played / (team minutes played / 5)) * team field goals made] - player field goals made)
season - Represents each particular season of the NBA.
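A small worked example can make the three percentage formulas concrete. The sketch below uses the commonly published forms (free throw attempts in true shooting, and a minutes adjustment inside assist percentage). All box-score numbers and the names `player` and `team` are invented for illustration and do not come from the dataset.

```r
# Hypothetical single-game box-score numbers, made up for illustration only.
player <- list(pts = 30, fga = 20, fta = 10, fgm = 10, tov = 3, ast = 8, mp = 36)
team   <- list(fga = 88, fta = 25, fgm = 40, tov = 14, mp = 240)

# True shooting: points divided by twice the estimated shooting possessions.
ts_pct <- player$pts / (2 * (player$fga + 0.44 * player$fta))

# Usage: the player's plays, scaled to the share of team minutes he played,
# relative to all team plays.
usg_pct <- ((player$fga + 0.44 * player$fta + player$tov) * (team$mp / 5)) /
  (player$mp * (team$fga + 0.44 * team$fta + team$tov))

# Assist percentage: assists relative to teammates' made field goals
# while the player was on the floor.
ast_pct <- player$ast /
  ((player$mp / (team$mp / 5)) * team$fgm - player$fgm)

round(c(ts = ts_pct, usg = usg_pct, ast = ast_pct), 3)
```

The results are fractions between 0 and 1, the same scale as the ts_pct, usg_pct, and ast_pct columns in the data.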
In the rows shown, no values are obviously missing and every value sits in the correct column, although undrafted players carry a text placeholder rather than a number in the draft columns. We can still make the data easier to read by changing the column names.
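Looking at the first rows is only a partial check. A minimal sketch of a more systematic one is shown below on a toy data frame; the rows are invented, and the placeholder string "None" for a missing college is an assumption about how the file encodes it ("Undrafted" is the placeholder that appears in the draft columns).

```r
library(dplyr)

# Toy rows standing in for the real data; all values are invented.
toy <- data.frame(
  player_name = c("Player A", "Player B", "Player C"),
  college     = c("UCLA", "None", "Duke"),
  draft_round = c("1", "Undrafted", "2"),
  pts         = c(10.2, NA, 3.8),
  stringsAsFactors = FALSE
)

# Count true NA values in each column.
na_counts <- colSums(is.na(toy))

# Count text placeholders that stand in for missing information.
# "None" is an assumed placeholder; adjust to whatever the file uses.
placeholder_counts <- toy %>%
  summarize(across(where(is.character),
                   ~ sum(.x %in% c("None", "Undrafted"))))

na_counts
placeholder_counts
```

Running the same two steps on the full data frame would show whether any column hides missing information behind a text value.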
We can use the rename function from dplyr package to change the column names.
basketball_data <- basketball_data %>%
  rename(
    PlayerName = player_name,
    Team = team_abbreviation,
    Height = player_height,
    Weight = player_weight,
    DraftYear = draft_year,
    Round = draft_round,
    Number = draft_number,
    NetRating = net_rating,
    `oreb %` = oreb_pct,
    `dreb %` = dreb_pct,
    `usg %` = usg_pct,
    `ts %` = ts_pct,
    `ast %` = ast_pct
  )
head(basketball_data)
## no. PlayerName Team age Height Weight college country
## 1 0 Randy Livingston HOU 22 193.04 94.80073 Louisiana State USA
## 2 1 Gaylon Nickerson WAS 28 190.50 86.18248 Northwestern Oklahoma USA
## 3 2 George Lynch VAN 26 203.20 103.41898 North Carolina USA
## 4 3 George McCloud LAL 30 203.20 102.05820 Florida State USA
## 5 4 George Zidek DEN 23 213.36 119.74829 UCLA USA
## 6 5 Gerald Wilkins ORL 33 198.12 102.05820 Tennessee-Chattanooga USA
## DraftYear Round Number gp pts reb ast NetRating oreb % dreb % usg % ts %
## 1 1996 2 42 64 3.9 1.5 2.4 0.3 0.042 0.071 0.169 0.487
## 2 1994 2 34 4 3.8 1.3 0.3 8.9 0.030 0.111 0.174 0.497
## 3 1993 1 12 41 8.3 6.4 1.9 -8.2 0.106 0.185 0.175 0.512
## 4 1989 1 7 64 10.2 2.8 1.7 -2.7 0.027 0.111 0.206 0.527
## 5 1995 1 22 52 2.8 1.7 0.3 -14.1 0.102 0.169 0.195 0.500
## 6 1985 2 47 80 10.6 2.2 2.2 -5.8 0.031 0.064 0.203 0.503
## ast % season
## 1 0.248 1996-97
## 2 0.043 1996-97
## 3 0.125 1996-97
## 4 0.125 1996-97
## 5 0.064 1996-97
## 6 0.143 1996-97
From the above operation we see that the column names are more visually appealing than before.
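One consequence of names that contain a space or a percent sign is that they must be wrapped in backticks whenever they are used in code. The sketch below illustrates this on an invented two-row data frame, and shows `rename_with()` as one possible way to apply the same renaming pattern to several columns at once.

```r
library(dplyr)

# Invented values, for illustration only.
toy <- data.frame(ts_pct = c(0.50, 0.60), usg_pct = c(0.20, 0.30))

# A non-syntactic name such as `ts %` needs backticks when created...
renamed <- toy %>% rename(`ts %` = ts_pct)

# ...and again every time it is referenced.
mean_ts <- renamed %>%
  summarize(m = mean(`ts %`)) %>%
  pull(m)

# rename_with() applies a function to the column names, here turning
# the "_pct" suffix into " %" for every column.
display <- toy %>% rename_with(~ sub("_pct$", " %", .x))
names(display)
```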
Because many players appear in several seasons, statistics computed over the whole table count those players more than once, which can distort the result. A better strategy is to filter on a parameter such as season or team so that each player is counted once and the results are more accurate.
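The effect of repeated players can be seen on a toy example. In the sketch below, the data and player names are invented; `distinct()` is used as one possible way to keep a single row per player before averaging.

```r
library(dplyr)

# Invented rows: Player A appears in three seasons, Player B in one.
toy <- data.frame(
  PlayerName = c("Player A", "Player A", "Player A", "Player B"),
  season     = c("2000-01", "2001-02", "2002-03", "2000-01"),
  Height     = c(210, 210, 210, 190)
)

# The naive mean counts Player A three times.
naive_mean <- mean(toy$Height)

# Keeping one row per player first gives each player equal weight.
per_player_mean <- toy %>%
  distinct(PlayerName, .keep_all = TRUE) %>%
  summarize(mean_height = mean(Height)) %>%
  pull(mean_height)

c(naive = naive_mean, per_player = per_player_mean)
```

The naive mean is pulled toward the player with the most seasons, which is the distortion that filtering by season avoids.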
For instance, the mean height of all players in the 2000-01 season can be computed as follows:
mean_ht_2000 <- basketball_data %>%
  filter(season == "2000-01") %>%
  summarize(mean_height = mean(Height, na.rm = TRUE))
mean_ht_2000
## mean_height
## 1 200.7522
From above we see that the mean height of players in that season was about 200.75 cm.
We can compute the same statistic for the most recent season to see if there is a noticeable difference.
mean_ht_2022 <- basketball_data %>%
  filter(season == "2022-23") %>%
  summarize(mean_height = mean(Height, na.rm = TRUE))
mean_ht_2022
## mean_height
## 1 199.2651
The difference is small: the 2022-23 mean is only about 1.5 cm lower, which is less than one inch (2.54 cm).
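Instead of filtering one season at a time, the same comparison can be made for all seasons in a single step with `group_by()`. The sketch below uses invented heights, and the column `change_from_first` is a hypothetical name introduced here.

```r
library(dplyr)

# Invented heights for two seasons, for illustration only.
toy <- data.frame(
  season = c("2000-01", "2000-01", "2022-23", "2022-23"),
  Height = c(200, 202, 198, 201)
)

height_by_season <- toy %>%
  group_by(season) %>%
  summarize(mean_height = mean(Height, na.rm = TRUE),
            n_players   = n()) %>%
  # Difference of each season's mean from the earliest season's mean.
  mutate(change_from_first = mean_height - first(mean_height))

height_by_season
```

On the real data this would produce one row per season, which makes a trend easier to see than two isolated values.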
We can compute the median of points per game among any team’s players in a particular season. The same approach can be used to compare performances across several seasons.
pts_2014 <- basketball_data %>%
  filter(season == "2014-15" & Team == "BOS") %>%
  summarize(median_points = median(pts, na.rm = TRUE))
pts_2014
## median_points
## 1 8.65
From above we see that the median points scored by Boston Celtics’ players in 2014-15 was 8.65 per game.
We can compare it to the same statistic five years later, in the 2019-20 season.
median_points_2019 <- basketball_data %>%
  filter(season == "2019-20" & Team == "BOS") %>%
  summarize(median_points = median(pts, na.rm = TRUE))
median_points_2019
## median_points
## 1 5.2
The median dropped by 3.45 points per game. This does not by itself mean that the Boston Celtics’ offense regressed: the median describes the typical player on the roster, so a lower value could also arise if the offense stayed similar or improved while one or two players were responsible for the majority of the scoring.
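The two explanations can be told apart by looking at the mean and the maximum next to the median. The sketch below uses an invented roster for a hypothetical team "AAA" in two made-up seasons: in the second season the median falls even though the mean rises, because two players carry most of the scoring.

```r
library(dplyr)

# Invented points-per-game values for a hypothetical team in two seasons.
toy <- data.frame(
  Team   = "AAA",
  season = c(rep("S1", 5), rep("S2", 5)),
  pts    = c(8, 9, 9, 10, 11,   2, 3, 5, 20, 25)
)

scoring_summary <- toy %>%
  group_by(season) %>%
  summarize(median_pts = median(pts),
            mean_pts   = mean(pts),
            max_pts    = max(pts))

scoring_summary
```

A falling median combined with a rising mean and maximum points to scoring concentrated in a few players rather than to a weaker offense.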
We can get the average physical attributes of a team from any season. For instance, we can get the average height and weight of Golden State Warriors’ players from the 2010-11 season.
phy_gsw <- basketball_data %>%
  filter(season == "2010-11" & Team == "GSW") %>%
  summarize(avg_height = mean(Height, na.rm = TRUE),
            avg_weight = mean(Weight, na.rm = TRUE))
phy_gsw
## avg_height avg_weight
## 1 200.3457 98.21274
We see that the average height was about 200 cm and weight about 98 kg. We can get this data for any team from any season in the data.
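To get these averages for every team and season at once rather than one filter at a time, grouping can replace filtering. The sketch below is one possible approach using `across()`; the teams "AAA" and "BBB" and all values are invented.

```r
library(dplyr)

# Invented rows for two hypothetical teams in one season.
toy <- data.frame(
  season = c("2010-11", "2010-11", "2010-11", "2010-11"),
  Team   = c("AAA", "AAA", "BBB", "BBB"),
  Height = c(200, 204, 196, 198),
  Weight = c(100, 104, 90, 94)
)

phy_by_team <- toy %>%
  group_by(season, Team) %>%
  # Apply the same mean to both columns; .names builds avg_Height, avg_Weight.
  summarize(across(c(Height, Weight),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "avg_{.col}"),
            .groups = "drop")

phy_by_team
```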
We can find out how many first round draft picks played in the most recent season.
first_round_22 <- basketball_data %>%
  filter(Round == "1" & season == "2022-23") %>%
  summarize(number_of_first_rounders = n())
first_round_22
## number_of_first_rounders
## 1 280
We see that 280 players across the 30 teams were former first-round draft picks. For more insight, we can compute the proportions of first-round picks, second-round picks, and undrafted players.
players_22 <- basketball_data %>%
  filter(season == "2022-23")
draft_proportions <- players_22 %>%
  group_by(Round) %>%
  summarize(proportion = n() / nrow(players_22))
draft_proportions
## # A tibble: 3 × 2
## Round proportion
## <chr> <dbl>
## 1 1 0.519
## 2 2 0.226
## 3 Undrafted 0.254
We see that a little more than half of the players active last season were first-rounders. First-round picks make up only 30 of the 60 players drafted each year, but in most cases the undrafted players and late second-rounders are cut, while first-rounders are potential starters. This is why the proportion is in favor of first-round picks.
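The same proportions can also be obtained with `count()` followed by a division by the total, which keeps the raw counts next to the shares. This is a minimal sketch on invented rows, not a replacement for the code above.

```r
library(dplyr)

# Invented draft rounds for six hypothetical players in one season.
toy <- data.frame(
  season = "2022-23",
  Round  = c("1", "1", "1", "2", "Undrafted", "Undrafted")
)

draft_share <- toy %>%
  count(Round, name = "players") %>%
  mutate(proportion = players / sum(players))

draft_share
```

Because the shares are computed from `sum(players)`, they always add up to 1 without referring to a separately stored row count.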
We can get the average career stats of LeBron James across all the seasons. We will consider points per game, rebounds, and assists.
lebron_stats <- basketball_data %>%
  filter(PlayerName == "LeBron James")
avg_stats <- lebron_stats %>%
  summarize(
    avg_pts = mean(pts, na.rm = TRUE),
    avg_reb = mean(reb, na.rm = TRUE),
    avg_ast = mean(ast, na.rm = TRUE)
  )
avg_stats
## avg_pts avg_reb avg_ast
## 1 27.2 7.54 7.34
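From the above data, we see that LeBron has impressive mean stats across his long career. Note that these values are unweighted means of his per-season averages, so each season counts equally regardless of how many games he played. They show his consistency from his debut until now and make a strong case for why he is considered one of the best players in the sport’s history.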
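A mean of per-season averages is not the same as a true per-game career average, because seasons with few games count as much as full seasons. One way to approximate the per-game figure is to weight each season by games played. The sketch below uses an invented three-season career to show the difference.

```r
library(dplyr)

# Invented seasons for a hypothetical player.
toy <- data.frame(
  PlayerName = "Player A",
  season     = c("S1", "S2", "S3"),
  gp         = c(80, 20, 60),
  pts        = c(20, 30, 25)
)

career <- toy %>%
  summarize(
    mean_of_seasons = mean(pts),                  # every season counts equally
    games_weighted  = weighted.mean(pts, w = gp)  # seasons weighted by games played
  )

career
```

Here the short 20-game season inflates the unweighted mean, while the games-weighted value stays closer to what the player scored in a typical game.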
We can get the standard deviation of games played by players throughout their careers.
player_games_sd <- basketball_data %>%
  group_by(PlayerName) %>%
  summarize(gp_sd = sd(gp))
player_games_sd
## # A tibble: 2,551 × 2
## PlayerName gp_sd
## <chr> <dbl>
## 1 A.C. Green 14.4
## 2 A.J. Bramlett NA
## 3 A.J. Guyton 22.2
## 4 A.J. Lawson NA
## 5 AJ Green NA
## 6 AJ Griffin NA
## 7 AJ Hammons NA
## 8 AJ Price 13.6
## 9 Aaron Brooks 16.1
## 10 Aaron Gordon 12.6
## # ℹ 2,541 more rows
From above we see that certain players have NA next to their name. Although this initially looks like missing data, a closer look shows that these players appear in only one season of the dataset, and the standard deviation of a single value is undefined, so sd() returns NA. For instance, AJ Griffin was a 2022 draft pick and had played 72 games in his lone season as of the end of 2022-23. The other players with NA, from earlier seasons, played only one season in the period covered.
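The NA values can be confirmed by reporting the number of seasons next to the standard deviation. The sketch below uses invented rows; the column name `seasons` is introduced here for illustration.

```r
library(dplyr)

# sd() of a single observation is NA, since the sample formula divides by n - 1.
single_value_sd <- sd(72)

# Invented rows: Player A has two seasons, Player B only one.
toy <- data.frame(
  PlayerName = c("Player A", "Player A", "Player B"),
  gp         = c(70, 80, 72)
)

gp_spread <- toy %>%
  group_by(PlayerName) %>%
  summarize(seasons = n(),
            gp_sd   = sd(gp))

gp_spread
```

Every row with `seasons` equal to 1 has an NA standard deviation, so the NA reflects a single observation rather than a gap in the data.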
We can get the average Net Rating of Kobe Bryant across his career to see the impact he made on the game.
kobe_data <- basketball_data %>%
  filter(PlayerName == "Kobe Bryant")
kobe_net <- kobe_data %>%
  summarize(avg_net_rating = mean(NetRating, na.rm = TRUE))
kobe_net
## avg_net_rating
## 1 2.6
This shows that, averaged over his seasons, his team outscored opponents by 2.6 points per 100 possessions while he was on the floor.
We can get the median percentage stats of Stephen Curry over his career.
curry_data <- basketball_data %>%
  filter(PlayerName == "Stephen Curry")
curry_stats <- curry_data %>%
  summarize(med_oreb = median(`oreb %`, na.rm = TRUE),
            med_dreb = median(`dreb %`, na.rm = TRUE),
            med_usg = median(`usg %`, na.rm = TRUE),
            med_ts = median(`ts %`, na.rm = TRUE),
            med_ast = median(`ast %`, na.rm = TRUE))
curry_stats
## med_oreb med_dreb med_usg med_ts med_ast
## 1 0.0225 0.12 0.288 0.617 0.2865
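These medians, in particular a true shooting percentage of 0.617 at a usage rate of 0.288, are consistent with Stephen Curry’s reputation as an elite scorer, although comparing him with other star players at his position would require computing the same statistics for those players. This helps show why he is considered great by many people associated with the game. He is a vital piece for his team, as his 4 championships show.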
The following questions can be answered by analyzing the data:
How does a player’s performance change with age?
Which players have improved or regressed?
Do players’ physical attributes have an impact?
Does draft pick matter?
What impact does a star player have on the team?
How significant is a player’s place of origin?