NBA Players Data 1996 - 2016

Data downloaded from: www.kaggle.com/justinas/nba-players-data/downloads/nba-players-data.zip/2

Reading the Data

We read the NBA players data (a .csv file) from our project directory with read.csv() and then use str() to inspect the structure of the resulting data frame.

nba <- read.csv("datasets/all_seasons.csv")
str(nba)
## 'data.frame':    9561 obs. of  22 variables:
##  $ X                : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ player_name      : Factor w/ 1892 levels "A.C. Green","A.J. Bramlett",..: 310 1250 1248 1241 1240 1239 1226 1225 1224 1219 ...
##  $ team_abbreviation: Factor w/ 36 levels "ATL","BKN","BOS",..: 35 17 12 3 8 33 7 17 11 13 ...
##  $ age              : num  23 27 30 29 22 22 36 26 33 32 ...
##  $ player_height    : num  196 211 208 211 206 ...
##  $ player_weight    : num  90.7 106.6 106.6 111.1 106.6 ...
##  $ college          : Factor w/ 288 levels "Alabama","Alabama-Birmingham",..: 276 163 97 195 159 212 182 232 82 227 ...
##  $ country          : Factor w/ 69 levels "Argentina","Australia",..: 66 66 66 66 66 66 66 66 66 66 ...
##  $ draft_year       : Factor w/ 42 levels "1963","1976",..: 21 17 42 42 21 20 8 42 11 12 ...
##  $ draft_round      : Factor w/ 8 levels "1","2","3","4",..: 2 2 8 8 1 2 2 8 2 1 ...
##  $ draft_number     : Factor w/ 75 levels "1","10","11",..: 53 52 75 75 24 52 30 75 24 16 ...
##  $ gp               : int  41 6 71 74 42 9 70 31 70 82 ...
##  $ pts              : num  4.6 0.3 4.5 7.8 3.7 1.6 3.2 2 11.3 9.9 ...
##  $ reb              : num  1.7 0.8 1.6 4.4 1.6 0.7 2.7 1.2 2.6 4.8 ...
##  $ ast              : num  1.6 0 0.9 1.4 0.5 0.4 0.3 0 4.9 11.4 ...
##  $ net_rating       : num  -11.4 -15.1 0.9 -9 -14.5 -3.5 3.5 -17.1 -3.1 -2 ...
##  $ oreb_pct         : num  0.039 0.143 0.016 0.083 0.109 0.087 0.092 0.109 0.023 0.035 ...
##  $ dreb_pct         : num  0.088 0.267 0.115 0.152 0.118 0.045 0.146 0.152 0.088 0.116 ...
##  $ usg_pct          : num  0.155 0.265 0.151 0.167 0.233 0.135 0.137 0.232 0.192 0.155 ...
##  $ ts_pct           : num  0.486 0.333 0.535 0.542 0.482 0.47 0.555 0.448 0.597 0.525 ...
##  $ ast_pct          : num  0.156 0 0.099 0.101 0.114 0.125 0.034 0.013 0.289 0.464 ...
##  $ season           : Factor w/ 21 levels "1996-97","1997-98",..: 1 1 1 1 1 1 1 1 1 1 ...
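
The Factor columns in the output above reflect the older default of read.csv(). Since R 4.0.0, stringsAsFactors defaults to FALSE and text columns are read as character. The following is a minimal sketch, using a small inline CSV (three rows copied from the output in this report, standing in for the real file), of how factors can be requested explicitly:

```r
# Inline stand-in for datasets/all_seasons.csv (assumption: only four columns,
# three rows copied from the printed output in this report)
csv_text <- "player_name,team_abbreviation,age,season
Chris Robinson,VAN,23,1996-97
Matt Fish,MIA,27,1996-97
Matt Bullard,HOU,30,1996-97"

# With the defaults of R >= 4.0.0, text columns stay character
as_chr <- read.csv(text = csv_text)

# Explicitly ask for factors to match the str() output shown above
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(as_fct)
```

The rest of the analysis works with either representation, because comparisons such as season == "2016-17" behave the same for character and factor columns.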

Now that we know the structure of the data frame, we can use head() to display its first six rows.

head(nba)
##   X     player_name team_abbreviation age player_height player_weight
## 1 0  Chris Robinson               VAN  23        195.58       90.7184
## 2 1       Matt Fish               MIA  27        210.82      106.5941
## 3 2    Matt Bullard               HOU  30        208.28      106.5941
## 4 3    Marty Conlon               BOS  29        210.82      111.1300
## 5 4 Martin Muursepp               DAL  22        205.74      106.5941
## 6 5    Martin Lewis               TOR  22        198.12      102.0582
##                           college country draft_year draft_round
## 1                Western Kentucky     USA       1996           2
## 2       North Carolina-Wilmington     USA       1992           2
## 3                            Iowa     USA  Undrafted   Undrafted
## 4                      Providence     USA  Undrafted   Undrafted
## 5                            None     USA       1996           1
## 6 Seward County Community College     USA       1995           2
##   draft_number gp pts reb ast net_rating oreb_pct dreb_pct usg_pct ts_pct
## 1           51 41 4.6 1.7 1.6      -11.4    0.039    0.088   0.155  0.486
## 2           50  6 0.3 0.8 0.0      -15.1    0.143    0.267   0.265  0.333
## 3    Undrafted 71 4.5 1.6 0.9        0.9    0.016    0.115   0.151  0.535
## 4    Undrafted 74 7.8 4.4 1.4       -9.0    0.083    0.152   0.167  0.542
## 5           25 42 3.7 1.6 0.5      -14.5    0.109    0.118   0.233  0.482
## 6           50  9 1.6 0.7 0.4       -3.5    0.087    0.045   0.135  0.470
##   ast_pct  season
## 1   0.156 1996-97
## 2   0.000 1996-97
## 3   0.099 1996-97
## 4   0.101 1996-97
## 5   0.114 1996-97
## 6   0.125 1996-97

First Question

Does a player’s height affect their offensive and defensive rebound percentages in NBA season 2016-17?

To answer the first question, we first subset the data, keeping only the rows where season is “2016-17”, and save the result in a new data frame:

nba.1617 <- nba[nba$season == "2016-17",]
head(nba.1617)
##         X    player_name team_abbreviation age player_height player_weight
## 9076 9075   Yogi Ferrell               DAL  24        182.88      81.64656
## 9077 9076  Zaza Pachulia               GSW  33        210.82     124.73780
## 9078 9077  Zach Randolph               MEM  35        205.74     117.93392
## 9079 9078    Zach LaVine               MIN  22        195.58      83.91452
## 9080 9079 Tyson Chandler               PHX  34        215.90     108.86208
## 9081 9080   Tyreke Evans               SAC  27        198.12      99.79024
##             college country draft_year draft_round draft_number gp  pts
## 9076        Indiana     USA  Undrafted   Undrafted    Undrafted 46 10.0
## 9077           None Georgia       2003           2           42 70  6.1
## 9078 Michigan State     USA       2001           1           19 73 14.1
## 9079           UCLA     USA       2014           1           13 47 18.9
## 9080           None     USA       2001           1            2 47  8.4
## 9081        Memphis     USA       2009           1            4 40 10.3
##       reb ast net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct  season
## 9076  2.4 3.7       -4.1    0.018    0.091   0.196  0.533   0.232 2016-17
## 9077  5.9 1.9       15.8    0.130    0.225   0.151  0.588   0.131 2016-17
## 9078  8.2 1.7       -1.1    0.113    0.280   0.285  0.490   0.131 2016-17
## 9079  3.4 3.0       -3.6    0.012    0.098   0.218  0.576   0.128 2016-17
## 9080 11.5 0.6       -2.9    0.127    0.339   0.113  0.703   0.033 2016-17
## 9081  3.4 3.1       -8.6    0.017    0.178   0.270  0.501   0.281 2016-17
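
The row filter used above can be sketched on a toy data frame (a few values copied from the outputs in this report). subset() is an equivalent base R spelling, and droplevels() is an optional extra step, not part of the original analysis, that removes factor levels no longer present after filtering:

```r
# Toy stand-in for nba (assumption: three columns only)
toy <- data.frame(
  player_name = c("Chris Robinson", "Matt Fish", "Yogi Ferrell", "Zaza Pachulia"),
  player_height = c(195.58, 210.82, 182.88, 210.82),
  season = factor(c("1996-97", "1996-97", "2016-17", "2016-17"))
)

# Logical indexing, as in the text: keep rows where the condition is TRUE
by_index <- toy[toy$season == "2016-17", ]

# subset() selects the same rows
by_subset <- subset(toy, season == "2016-17")

# Optional: drop factor levels that no longer occur in the filtered data
by_index <- droplevels(by_index)
levels(by_index$season)
```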

Then we build a new data frame that holds the mean offensive and defensive rebound percentages for each player height:

temp <- aggregate.data.frame(list("Offensive Rebounds" = nba.1617$oreb_pct, "Defensive Rebounds" = nba.1617$dreb_pct), list(nba.1617$player_height), mean)
head(temp)
##   Group.1 Offensive.Rebounds Defensive.Rebounds
## 1  175.26         0.01400000         0.08650000
## 2  177.80         0.01450000         0.09200000
## 3  180.34         0.01450000         0.04350000
## 4  182.88         0.01591667         0.09908333
## 5  185.42         0.02633333         0.11080000
## 6  187.96         0.01845000         0.10130000
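
As an alternative spelling, the same grouping can be written with the formula interface of aggregate(), which keeps the original column names instead of Group.1. This is a sketch on a small toy data frame whose values are made up for illustration:

```r
# Toy data (hypothetical values): two players at each of two heights
toy <- data.frame(
  player_height = c(182.88, 182.88, 210.82, 210.82),
  oreb_pct = c(0.018, 0.020, 0.130, 0.120),
  dreb_pct = c(0.091, 0.101, 0.225, 0.235)
)

# Mean of both rebound columns for each distinct height
by_height <- aggregate(cbind(oreb_pct, dreb_pct) ~ player_height,
                       data = toy, FUN = mean)
by_height
```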

After that, we melt temp into long format, so that both rebound variables share a single value column. This makes it easy to plot offensive and defensive rebounds side by side:

library(reshape2)  # melt() is assumed to come from reshape2
nbamelt <- melt(temp, id.vars = "Group.1", measure.vars = c("Offensive.Rebounds", "Defensive.Rebounds"))
head(nbamelt)
##   Group.1           variable      value
## 1  175.26 Offensive.Rebounds 0.01400000
## 2  177.80 Offensive.Rebounds 0.01450000
## 3  180.34 Offensive.Rebounds 0.01450000
## 4  182.88 Offensive.Rebounds 0.01591667
## 5  185.42 Offensive.Rebounds 0.02633333
## 6  187.96 Offensive.Rebounds 0.01845000
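
To make the wide-to-long reshaping concrete, here is a base R sketch that builds the same three-column layout by hand, using the first two rows of temp from the output above. It is an illustration of what the reshaping does, not how melt() is implemented:

```r
# First two rows of temp, copied from the output above
temp_toy <- data.frame(
  Group.1 = c(175.26, 177.80),
  Offensive.Rebounds = c(0.0140, 0.0145),
  Defensive.Rebounds = c(0.0865, 0.0920)
)
measure <- c("Offensive.Rebounds", "Defensive.Rebounds")

# Stack the measured columns: repeat the id column once per variable,
# label each block with its column name, and concatenate the values
long <- data.frame(
  Group.1 = rep(temp_toy$Group.1, times = length(measure)),
  variable = factor(rep(measure, each = nrow(temp_toy)), levels = measure),
  value = unlist(temp_toy[measure], use.names = FALSE)
)
long
```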

Finally, we draw the plot with ggplot(). We use facet_grid() to give each of the two variables its own panel:

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot1 <- ggplot(nbamelt, aes(Group.1, value))+
  geom_col(aes(fill=variable), position = "dodge")+
  labs(title = "Player's Height vs Offensive and Defensive Rebounds", subtitle = "Season : 2016 - 2017",x = "Player's Height", y = "Rebound Percentage", caption = "NBA Players Data 1996 - 2016")+
  theme(legend.position = "bottom",
        axis.text.x = element_text(hjust = 0.5))+
  facet_grid(~variable)+
  geom_line()

plot1

The plots suggest that a player’s height is positively associated with both offensive and defensive rebound percentage. A few heights are outliers that run against the trend, but in general, the taller the player, the higher the share of offensive and defensive rebounds they collect.
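
One way to back up a visual impression like this is a correlation coefficient. The sketch below is only an illustration: it uses the six rows of temp printed earlier, not the full table, so its result says nothing definitive about the whole season. The names temp_head, r_oreb, and r_dreb are introduced here for the example.

```r
# The six rows of temp shown in the head(temp) output
temp_head <- data.frame(
  height = c(175.26, 177.80, 180.34, 182.88, 185.42, 187.96),
  oreb = c(0.01400000, 0.01450000, 0.01450000, 0.01591667, 0.02633333, 0.01845000),
  dreb = c(0.08650000, 0.09200000, 0.04350000, 0.09908333, 0.11080000, 0.10130000)
)

# Pearson correlation between height and each mean rebound percentage
r_oreb <- cor(temp_head$height, temp_head$oreb)
r_dreb <- cor(temp_head$height, temp_head$dreb)
c(offensive = r_oreb, defensive = r_dreb)
```

With the real data, the same cor() calls could be applied to temp$Group.1 and each of the two rebound columns.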

Second Question

In season 2016-17, which college produced the player with the highest Net Rating?

To answer the second question, we reuse the nba.1617 data frame created earlier and sort it by the net_rating column in descending order, using order() with a leading minus sign. We then keep the first 20 rows of the sorted data frame.

nba.1617.net <- nba.1617[order(-nba.1617$net_rating),]
nba.1617.net <- nba.1617.net[1:20,]
nba.1617.net
##         X      player_name team_abbreviation age player_height
## 9168 9167   John Lucas III               MIN  34        180.34
## 9340 9339    Brice Johnson               LAC  23        208.28
## 9263 9262   Pierre Jackson               DAL  25        177.80
## 9335 9334 Chris McCullough               WAS  22        205.74
## 9200 9199     John Jenkins               PHX  26        193.04
## 9454 9453     JaVale McGee               GSW  29        213.36
## 9213 9212    Stephen Curry               GSW  29        190.50
## 9173 9172  Lamar Patterson               ATL  25        195.58
## 9191 9190     Kevin Durant               GSW  28        205.74
## 9504 9503   Draymond Green               GSW  27        200.66
## 9077 9076    Zaza Pachulia               GSW  33        210.82
## 9559 9558      Edy Tavares               CLE  25        220.98
## 9096 9095    Malik Beasley               DEN  20        195.58
## 9351 9350       Chris Paul               LAC  32        182.88
## 9199 9198    Klay Thompson               GSW  27        200.66
## 9157 9156      Jordan Hill               MIN  29        208.28
## 9314 9313        Raul Neto               UTA  25        185.42
## 9437 9436   Andre Iguodala               GSW  33        198.12
## 9548 9547    Fred VanVleet               TOR  23        182.88
## 9269 9268      Patty Mills               SAS  28        182.88
##      player_weight           college    country draft_year draft_round
## 9168      75.29627    Oklahoma State        USA  Undrafted   Undrafted
## 9340     104.32616    North Carolina        USA       2016           1
## 9263      81.64656            Baylor        USA       2013           2
## 9335      97.52228          Syracuse        USA       2015           1
## 9200      97.52228        Vanderbilt        USA       2012           1
## 9454     122.46984            Nevada        USA       2008           1
## 9213      86.18248          Davidson        USA       2009           1
## 9173     102.05820        Pittsburgh        USA       2014           2
## 9191     108.86208             Texas        USA       2007           1
## 9504     104.32616    Michigan State        USA       2012           2
## 9077     124.73780              None    Georgia       2003           2
## 9559     120.20188              None Cabo Verde       2014           2
## 9096      88.90403     Florida State        USA       2016           1
## 9351      79.37860       Wake Forest        USA       2005           1
## 9199      97.52228  Washington State        USA       2011           1
## 9157     108.86208           Arizona        USA       2009           1
## 9314      81.19297              None     Brazil       2013           2
## 9437      97.52228           Arizona        USA       2004           1
## 9548      88.45044     Wichita State        USA  Undrafted   Undrafted
## 9269      83.91452 Saint Mary's (CA)  Australia       2009           2
##      draft_number gp  pts reb ast net_rating oreb_pct dreb_pct usg_pct
## 9168    Undrafted  5  0.4 0.0 0.2       46.9    0.000    0.000   0.166
## 9340           25  3  1.3 1.0 0.3       44.9    0.143    0.222   0.385
## 9263           42  8  4.4 1.1 2.4       31.7    0.011    0.111   0.232
## 9335           29 16  2.3 1.2 0.1       26.8    0.129    0.123   0.198
## 9200           23  4  1.8 0.3 0.3       21.6    0.000    0.111   0.182
## 9454           18 77  6.1 3.2 0.2       18.7    0.156    0.187   0.226
## 9213            7 79 25.3 4.5 6.6       17.2    0.027    0.113   0.292
## 9173           48  5  1.8 1.4 1.2       16.8    0.027    0.133   0.234
## 9191            2 62 25.1 8.3 4.8       16.0    0.023    0.232   0.276
## 9504           35 76 10.2 7.9 7.0       15.9    0.046    0.205   0.160
## 9077           42 70  6.1 5.9 1.9       15.8    0.130    0.225   0.151
## 9559           43  2  4.0 5.5 0.5       15.2    0.160    0.241   0.137
## 9096           19 22  3.8 0.8 0.5       15.0    0.036    0.072   0.222
## 9351            4 61 18.1 5.0 9.2       14.9    0.025    0.149   0.243
## 9199           11 78 22.3 3.7 2.1       14.7    0.022    0.091   0.262
## 9157            8  7  1.7 2.0 0.0       13.9    0.205    0.133   0.180
## 9314           47 40  2.5 0.8 0.9       13.9    0.014    0.084   0.147
## 9437            9 76  7.6 4.0 3.4       13.4    0.030    0.131   0.113
## 9548    Undrafted 37  2.9 1.1 0.9       12.1    0.015    0.130   0.205
## 9269           55 80  9.5 1.8 3.5       12.0    0.015    0.074   0.189
##      ts_pct ast_pct  season
## 9168  0.250   0.200 2016-17
## 9340  0.286   0.167 2016-17
## 9263  0.416   0.322 2016-17
## 9335  0.526   0.036 2016-17
## 9200  0.595   0.083 2016-17
## 9454  0.642   0.032 2016-17
## 9213  0.624   0.287 2016-17
## 9173  0.276   0.250 2016-17
## 9191  0.651   0.218 2016-17
## 9504  0.522   0.272 2016-17
## 9077  0.588   0.131 2016-17
## 9559  0.633   0.071 2016-17
## 9096  0.536   0.098 2016-17
## 9351  0.614   0.444 2016-17
## 9199  0.592   0.090 2016-17
## 9157  0.432   0.000 2016-17
## 9314  0.527   0.140 2016-17
## 9437  0.624   0.169 2016-17
## 9548  0.443   0.180 2016-17
## 9269  0.578   0.236 2016-17
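
A small caveat about the [1:20, ] idiom: if the data frame has fewer than 20 rows, the missing positions come back as all-NA rows, whereas head() simply returns what is available. The sketch below shows this on a three-row toy data frame (values copied from the table above); order(..., decreasing = TRUE) is an equivalent way to write the descending sort.

```r
# Toy data: three players and their net ratings from the table above
toy <- data.frame(
  player_name = c("Stephen Curry", "John Lucas III", "Kevin Durant"),
  net_rating = c(17.2, 46.9, 16.0)
)

# Sort descending by net rating
ranked <- toy[order(toy$net_rating, decreasing = TRUE), ]

top_head <- head(ranked, 20)   # returns the 3 available rows
top_index <- ranked[1:20, ]    # returns 20 rows, padded with NA rows

nrow(top_head)
nrow(top_index)
```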

Then we visualize the data with ggplot(), using geom_point():

ggplot(nba.1617.net, aes(reorder(college,net_rating), net_rating))+
  geom_point(aes(color = net_rating), size = 6)+
  coord_flip()+
  labs(x = "College", y = "Net Rating", title = "Colleges of the Top 20 Players by Net Rating", subtitle = "Season : 2016 - 2017", caption = "NBA Players Data 1996 - 2016") +
  theme(axis.title = element_text(hjust = 0.5, vjust = 1, size = 10, face = "italic"),
        axis.text.y = element_text(size = 10, face = "bold"))

From the plot, we can conclude that in season 2016-17 the college whose player has the highest net rating is Oklahoma State, with a net rating of 46.9. North Carolina follows closely in second place with 44.9.
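
The table above also shows that the two leading players appeared in only 5 and 3 games, so their net ratings rest on very little playing time. As a possible refinement, not part of the original analysis, the ranking could be restricted to players above a minimum number of games. The sketch below uses the first seven rows of the table, and the threshold min_gp is a hypothetical value chosen purely for illustration:

```r
# First seven rows of the sorted table above (selected columns only)
top7 <- data.frame(
  player_name = c("John Lucas III", "Brice Johnson", "Pierre Jackson",
                  "Chris McCullough", "John Jenkins", "JaVale McGee",
                  "Stephen Curry"),
  college = c("Oklahoma State", "North Carolina", "Baylor", "Syracuse",
              "Vanderbilt", "Nevada", "Davidson"),
  gp = c(5, 3, 8, 16, 4, 77, 79),
  net_rating = c(46.9, 44.9, 31.7, 26.8, 21.6, 18.7, 17.2)
)

min_gp <- 20  # hypothetical cutoff, not taken from the original analysis

# Keep players with enough games, then rank by net rating
regulars <- top7[top7$gp >= min_gp, ]
regulars <- regulars[order(regulars$net_rating, decreasing = TRUE), ]
regulars
```

With the real data, the same filter could be applied to nba.1617 before sorting.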

Third Question

In season 2016-17, which colleges have the highest Points, Rebounds, and Assists on average?

To answer the third question, we again use the nba.1617 data frame. We call aggregate.data.frame() with mean to get each college’s average Points, Rebounds, and Assists. We then add a new column, average, holding the mean of those three values, and sort the data frame by that column in descending order with order(-). Finally, we keep the top 10 rows as a new data frame to visualize.

temp2 <- aggregate.data.frame(list("Points" = nba.1617$pts, "Rebounds" = nba.1617$reb, "Assists" = nba.1617$ast), list(nba.1617$college), mean)
temp2$average <- (temp2$Points + temp2$Rebounds + temp2$Assists)/3
temp2 <- temp2[order(-temp2$average),]
temp2 <- temp2[1:10,]
head(temp2)
##            Group.1 Points Rebounds Assists   average
## 4    Arizona State   29.1      8.1    11.2 16.133333
## 23        Davidson   25.3      4.5     6.6 12.133333
## 90 San Diego State   25.5      5.8     3.5 11.600000
## 55        Marshall   17.0     14.1     0.7 10.600000
## 48          Lehigh   23.0      3.6     3.6 10.066667
## 52  Louisiana Tech   18.1      7.7     3.7  9.833333
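
The average column can also be computed with rowMeans(), which avoids spelling out the sum and the division. The sketch below rebuilds the six rows printed above (the name temp2_head is introduced here for the example) and recomputes the column from the three statistics:

```r
# The six rows of temp2 shown above, without the average column
temp2_head <- data.frame(
  Group.1 = c("Arizona State", "Davidson", "San Diego State",
              "Marshall", "Lehigh", "Louisiana Tech"),
  Points = c(29.1, 25.3, 25.5, 17.0, 23.0, 18.1),
  Rebounds = c(8.1, 4.5, 5.8, 14.1, 3.6, 7.7),
  Assists = c(11.2, 6.6, 3.5, 0.7, 3.6, 3.7)
)

# Row-wise mean of the three statistics
temp2_head$average <- rowMeans(temp2_head[, c("Points", "Rebounds", "Assists")])

# Sort descending by the new column
temp2_head[order(temp2_head$average, decreasing = TRUE), ]
```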

We melt the three variables into a single column using melt():

nbamelt2 <- melt(temp2, id.vars = "Group.1", measure.vars = c("Points", "Rebounds", "Assists"))
head(nbamelt2)
##           Group.1 variable value
## 1   Arizona State   Points  29.1
## 2        Davidson   Points  25.3
## 3 San Diego State   Points  25.5
## 4        Marshall   Points  17.0
## 5          Lehigh   Points  23.0
## 6  Louisiana Tech   Points  18.1

We visualize the data with ggplot() and geom_col(), showing all three variables: Points, Rebounds, and Assists.

ggplot(nbamelt2, aes(reorder(Group.1, value), value))+
  geom_col(aes(fill=variable), position = "dodge")+
  coord_flip()+
  labs(title = "Top 10 Colleges by Average Player Stats", subtitle = "Season : 2016 - 2017",x = "College", y = "Points, Rebounds, Assists", caption = "NBA Players Data 1996 - 2016")+
  theme(legend.position = "bottom",
        axis.text.x = element_text(hjust = 0.5))

The plot shows that Arizona State has the highest average of Points, Rebounds, and Assists. Marshall has the highest Rebounds value (14.1), but its other values are comparatively low (0.7 Assists in particular), so it ranks only fourth.