Data downloaded from: www.kaggle.com/justinas/nba-players-data/downloads/nba-players-data.zip/2
We first read the NBA players data (a .csv file) from our directory, then use the function str() to inspect the structure of the resulting data frame.
nba <- read.csv("datasets/all_seasons.csv")
str(nba)
## 'data.frame': 9561 obs. of 22 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ player_name : Factor w/ 1892 levels "A.C. Green","A.J. Bramlett",..: 310 1250 1248 1241 1240 1239 1226 1225 1224 1219 ...
## $ team_abbreviation: Factor w/ 36 levels "ATL","BKN","BOS",..: 35 17 12 3 8 33 7 17 11 13 ...
## $ age : num 23 27 30 29 22 22 36 26 33 32 ...
## $ player_height : num 196 211 208 211 206 ...
## $ player_weight : num 90.7 106.6 106.6 111.1 106.6 ...
## $ college : Factor w/ 288 levels "Alabama","Alabama-Birmingham",..: 276 163 97 195 159 212 182 232 82 227 ...
## $ country : Factor w/ 69 levels "Argentina","Australia",..: 66 66 66 66 66 66 66 66 66 66 ...
## $ draft_year : Factor w/ 42 levels "1963","1976",..: 21 17 42 42 21 20 8 42 11 12 ...
## $ draft_round : Factor w/ 8 levels "1","2","3","4",..: 2 2 8 8 1 2 2 8 2 1 ...
## $ draft_number : Factor w/ 75 levels "1","10","11",..: 53 52 75 75 24 52 30 75 24 16 ...
## $ gp : int 41 6 71 74 42 9 70 31 70 82 ...
## $ pts : num 4.6 0.3 4.5 7.8 3.7 1.6 3.2 2 11.3 9.9 ...
## $ reb : num 1.7 0.8 1.6 4.4 1.6 0.7 2.7 1.2 2.6 4.8 ...
## $ ast : num 1.6 0 0.9 1.4 0.5 0.4 0.3 0 4.9 11.4 ...
## $ net_rating : num -11.4 -15.1 0.9 -9 -14.5 -3.5 3.5 -17.1 -3.1 -2 ...
## $ oreb_pct : num 0.039 0.143 0.016 0.083 0.109 0.087 0.092 0.109 0.023 0.035 ...
## $ dreb_pct : num 0.088 0.267 0.115 0.152 0.118 0.045 0.146 0.152 0.088 0.116 ...
## $ usg_pct : num 0.155 0.265 0.151 0.167 0.233 0.135 0.137 0.232 0.192 0.155 ...
## $ ts_pct : num 0.486 0.333 0.535 0.542 0.482 0.47 0.555 0.448 0.597 0.525 ...
## $ ast_pct : num 0.156 0 0.099 0.101 0.114 0.125 0.034 0.013 0.289 0.464 ...
## $ season : Factor w/ 21 levels "1996-97","1997-98",..: 1 1 1 1 1 1 1 1 1 1 ...
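The str() output shows that draft_number was read as a factor whose levels sort as text ("1", "10", "11") and include "Undrafted". The following is a minimal sketch of one way to get a numeric column instead. It uses a small made-up CSV rather than the real file; the rows and the names csv_text, toy, and draft_num are illustrative assumptions, not part of the dataset.

```r
# Toy CSV standing in for all_seasons.csv (made-up rows, for illustration only)
csv_text <- "player_name,draft_number,pts
Player A,51,4.6
Player B,Undrafted,4.5
Player C,7,25.3"

# Keep text columns as character vectors instead of factors
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(toy)

# Convert draft_number to integer; "Undrafted" cannot be parsed and becomes NA
draft_num <- suppressWarnings(as.integer(toy$draft_number))
draft_num
```

Whether undrafted players should become NA or keep a separate label depends on the question being asked, so this conversion is only one option.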
After inspecting the structure of the data frame, we use the function head() to display its first 6 rows.
head(nba)
## X player_name team_abbreviation age player_height player_weight
## 1 0 Chris Robinson VAN 23 195.58 90.7184
## 2 1 Matt Fish MIA 27 210.82 106.5941
## 3 2 Matt Bullard HOU 30 208.28 106.5941
## 4 3 Marty Conlon BOS 29 210.82 111.1300
## 5 4 Martin Muursepp DAL 22 205.74 106.5941
## 6 5 Martin Lewis TOR 22 198.12 102.0582
## college country draft_year draft_round
## 1 Western Kentucky USA 1996 2
## 2 North Carolina-Wilmington USA 1992 2
## 3 Iowa USA Undrafted Undrafted
## 4 Providence USA Undrafted Undrafted
## 5 None USA 1996 1
## 6 Seward County Community College USA 1995 2
## draft_number gp pts reb ast net_rating oreb_pct dreb_pct usg_pct ts_pct
## 1 51 41 4.6 1.7 1.6 -11.4 0.039 0.088 0.155 0.486
## 2 50 6 0.3 0.8 0.0 -15.1 0.143 0.267 0.265 0.333
## 3 Undrafted 71 4.5 1.6 0.9 0.9 0.016 0.115 0.151 0.535
## 4 Undrafted 74 7.8 4.4 1.4 -9.0 0.083 0.152 0.167 0.542
## 5 25 42 3.7 1.6 0.5 -14.5 0.109 0.118 0.233 0.482
## 6 50 9 1.6 0.7 0.4 -3.5 0.087 0.045 0.135 0.470
## ast_pct season
## 1 0.156 1996-97
## 2 0.000 1996-97
## 3 0.099 1996-97
## 4 0.101 1996-97
## 5 0.114 1996-97
## 6 0.125 1996-97
To answer the first question, we first subset the data, keeping only the rows where the season is "2016-17", and save them in a new data frame:
nba.1617 <- nba[nba$season == "2016-17",]
head(nba.1617)
## X player_name team_abbreviation age player_height player_weight
## 9076 9075 Yogi Ferrell DAL 24 182.88 81.64656
## 9077 9076 Zaza Pachulia GSW 33 210.82 124.73780
## 9078 9077 Zach Randolph MEM 35 205.74 117.93392
## 9079 9078 Zach LaVine MIN 22 195.58 83.91452
## 9080 9079 Tyson Chandler PHX 34 215.90 108.86208
## 9081 9080 Tyreke Evans SAC 27 198.12 99.79024
## college country draft_year draft_round draft_number gp pts
## 9076 Indiana USA Undrafted Undrafted Undrafted 46 10.0
## 9077 None Georgia 2003 2 42 70 6.1
## 9078 Michigan State USA 2001 1 19 73 14.1
## 9079 UCLA USA 2014 1 13 47 18.9
## 9080 None USA 2001 1 2 47 8.4
## 9081 Memphis USA 2009 1 4 40 10.3
## reb ast net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct season
## 9076 2.4 3.7 -4.1 0.018 0.091 0.196 0.533 0.232 2016-17
## 9077 5.9 1.9 15.8 0.130 0.225 0.151 0.588 0.131 2016-17
## 9078 8.2 1.7 -1.1 0.113 0.280 0.285 0.490 0.131 2016-17
## 9079 3.4 3.0 -3.6 0.012 0.098 0.218 0.576 0.128 2016-17
## 9080 11.5 0.6 -2.9 0.127 0.339 0.113 0.703 0.033 2016-17
## 9081 3.4 3.1 -8.6 0.017 0.178 0.270 0.501 0.281 2016-17
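The row labels in the output above (9076, 9077, ...) are the row numbers of the original data frame, because logical indexing keeps them. A minimal sketch on a made-up data frame (the name toy and its values are assumptions for illustration) shows this, along with the equivalent subset() call:

```r
# Made-up data with two seasons (illustration only)
toy <- data.frame(
  player = c("A", "B", "C", "D"),
  season = c("2015-16", "2016-17", "2016-17", "2015-16"),
  pts    = c(10.0, 6.1, 14.1, 8.4)
)

# Logical indexing: keep rows where the condition is TRUE, all columns
by_index <- toy[toy$season == "2016-17", ]

# subset() expresses the same filter without repeating the data frame name
by_subset <- subset(toy, season == "2016-17")

# Row names still refer to positions in the original data frame
rownames(by_index)
```

If fresh row numbers are preferred, `rownames(by_index) <- NULL` resets them.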
Then, we build a new data frame containing the mean offensive and defensive rebound percentages for each player height:
temp <- aggregate.data.frame(list("Offensive Rebounds" = nba.1617$oreb_pct, "Defensive Rebounds" = nba.1617$dreb_pct), list(nba.1617$player_height), mean)
head(temp)
## Group.1 Offensive.Rebounds Defensive.Rebounds
## 1 175.26 0.01400000 0.08650000
## 2 177.80 0.01450000 0.09200000
## 3 180.34 0.01450000 0.04350000
## 4 182.88 0.01591667 0.09908333
## 5 185.42 0.02633333 0.11080000
## 6 187.96 0.01845000 0.10130000
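The grouping column above is named Group.1 because the grouping list was unnamed. As a sketch of an alternative, the formula interface of aggregate() keeps the original column names. The data frame toy below is made up for illustration and is not taken from the dataset.

```r
# Made-up player rows: two players at 180 cm, two at 200 cm, one at 210 cm
toy <- data.frame(
  player_height = c(180, 180, 200, 200, 210),
  oreb_pct      = c(0.01, 0.03, 0.05, 0.07, 0.10),
  dreb_pct      = c(0.08, 0.10, 0.15, 0.17, 0.25)
)

# Mean of both rebound percentages per height; cbind() lists the measured columns
by_height <- aggregate(cbind(oreb_pct, dreb_pct) ~ player_height,
                       data = toy, FUN = mean)
by_height
```

The result has one row per distinct height, with columns player_height, oreb_pct, and dreb_pct.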
After that, we melt the data frame temp into long format, with one row per height and variable, so that we can visualize offensive and defensive rebounds side by side:
# melt() is assumed to come from the reshape2 package, loaded beforehand
nbamelt <- melt(temp, id.vars = 1, measure.vars = c("Offensive.Rebounds", "Defensive.Rebounds"))
head(nbamelt)
## Group.1 variable value
## 1 175.26 Offensive.Rebounds 0.01400000
## 2 177.80 Offensive.Rebounds 0.01450000
## 3 180.34 Offensive.Rebounds 0.01450000
## 4 182.88 Offensive.Rebounds 0.01591667
## 5 185.42 Offensive.Rebounds 0.02633333
## 6 187.96 Offensive.Rebounds 0.01845000
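To make the reshaping step concrete, here is a hand-written base-R sketch of the same wide-to-long conversion. It is not how melt() is implemented; it only reproduces the resulting layout on a made-up two-row table (the names wide and long and the values are assumptions for illustration).

```r
# Made-up wide table: one row per height, one column per measure
wide <- data.frame(
  height             = c(180, 200),
  Offensive.Rebounds = c(0.02, 0.06),
  Defensive.Rebounds = c(0.09, 0.16)
)

# Long table: the id column is repeated once per measure,
# the measure name goes into 'variable', the number into 'value'
long <- data.frame(
  height   = rep(wide$height, times = 2),
  variable = rep(c("Offensive.Rebounds", "Defensive.Rebounds"), each = nrow(wide)),
  value    = c(wide$Offensive.Rebounds, wide$Defensive.Rebounds)
)
long
```

This long layout is what lets a single ggplot() call map variable to fill or to facets.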
As the final step, we build the plot with ggplot(), using facet_grid() to give each of the two variables its own panel:
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot1 <- ggplot(nbamelt, aes(Group.1, value))+
geom_col(aes(fill=variable), position = "dodge")+
labs(title = "Player's Height vs Offensive and Defensive Rebounds", subtitle = "Season : 2016 - 2017",x = "Player's Height", y = "Rebound Percentage", caption = "NBA Players Data 1996 - 2016")+
theme(legend.position = "bottom",
axis.text.x = element_text(hjust = 0.5))+
facet_grid(~variable)+
geom_line()
plot1
The plots suggest that a player's height is related to both the offensive and the defensive rebound percentage. Some heights deviate from the trend, but in general, the taller the player, the higher the offensive and defensive rebound percentages.
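A visual impression like this can be backed by a number. As a rough sketch, the code below computes the Pearson correlation between height and the mean rebound percentages, using only the six rows of temp printed earlier (the six shortest heights), so it is an illustration rather than a result for the whole season. The variable names are assumptions.

```r
# Values copied from the first six rows of temp shown earlier
height   <- c(175.26, 177.80, 180.34, 182.88, 185.42, 187.96)
oreb_avg <- c(0.01400000, 0.01450000, 0.01450000, 0.01591667, 0.02633333, 0.01845000)
dreb_avg <- c(0.08650000, 0.09200000, 0.04350000, 0.09908333, 0.11080000, 0.10130000)

# Pearson correlation between height and each mean rebound percentage
cor_oreb <- cor(height, oreb_avg)
cor_dreb <- cor(height, dreb_avg)
c(offensive = cor_oreb, defensive = cor_dreb)
```

Both coefficients are positive for these six rows. On the full season one would call, for example, `cor(nba.1617$player_height, nba.1617$oreb_pct)` on the player-level data instead.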
To answer the second question, we reuse the nba.1617 data frame created earlier and sort it by the net_rating column with the function order(), negating the column with - to get descending order. After that, we keep the first 20 rows of the sorted data frame.
nba.1617.net <- nba.1617[order(-nba.1617$net_rating),]
nba.1617.net <- nba.1617.net[1:20,]
nba.1617.net
## X player_name team_abbreviation age player_height
## 9168 9167 John Lucas III MIN 34 180.34
## 9340 9339 Brice Johnson LAC 23 208.28
## 9263 9262 Pierre Jackson DAL 25 177.80
## 9335 9334 Chris McCullough WAS 22 205.74
## 9200 9199 John Jenkins PHX 26 193.04
## 9454 9453 JaVale McGee GSW 29 213.36
## 9213 9212 Stephen Curry GSW 29 190.50
## 9173 9172 Lamar Patterson ATL 25 195.58
## 9191 9190 Kevin Durant GSW 28 205.74
## 9504 9503 Draymond Green GSW 27 200.66
## 9077 9076 Zaza Pachulia GSW 33 210.82
## 9559 9558 Edy Tavares CLE 25 220.98
## 9096 9095 Malik Beasley DEN 20 195.58
## 9351 9350 Chris Paul LAC 32 182.88
## 9199 9198 Klay Thompson GSW 27 200.66
## 9157 9156 Jordan Hill MIN 29 208.28
## 9314 9313 Raul Neto UTA 25 185.42
## 9437 9436 Andre Iguodala GSW 33 198.12
## 9548 9547 Fred VanVleet TOR 23 182.88
## 9269 9268 Patty Mills SAS 28 182.88
## player_weight college country draft_year draft_round
## 9168 75.29627 Oklahoma State USA Undrafted Undrafted
## 9340 104.32616 North Carolina USA 2016 1
## 9263 81.64656 Baylor USA 2013 2
## 9335 97.52228 Syracuse USA 2015 1
## 9200 97.52228 Vanderbilt USA 2012 1
## 9454 122.46984 Nevada USA 2008 1
## 9213 86.18248 Davidson USA 2009 1
## 9173 102.05820 Pittsburgh USA 2014 2
## 9191 108.86208 Texas USA 2007 1
## 9504 104.32616 Michigan State USA 2012 2
## 9077 124.73780 None Georgia 2003 2
## 9559 120.20188 None Cabo Verde 2014 2
## 9096 88.90403 Florida State USA 2016 1
## 9351 79.37860 Wake Forest USA 2005 1
## 9199 97.52228 Washington State USA 2011 1
## 9157 108.86208 Arizona USA 2009 1
## 9314 81.19297 None Brazil 2013 2
## 9437 97.52228 Arizona USA 2004 1
## 9548 88.45044 Wichita State USA Undrafted Undrafted
## 9269 83.91452 Saint Mary's (CA) Australia 2009 2
## draft_number gp pts reb ast net_rating oreb_pct dreb_pct usg_pct
## 9168 Undrafted 5 0.4 0.0 0.2 46.9 0.000 0.000 0.166
## 9340 25 3 1.3 1.0 0.3 44.9 0.143 0.222 0.385
## 9263 42 8 4.4 1.1 2.4 31.7 0.011 0.111 0.232
## 9335 29 16 2.3 1.2 0.1 26.8 0.129 0.123 0.198
## 9200 23 4 1.8 0.3 0.3 21.6 0.000 0.111 0.182
## 9454 18 77 6.1 3.2 0.2 18.7 0.156 0.187 0.226
## 9213 7 79 25.3 4.5 6.6 17.2 0.027 0.113 0.292
## 9173 48 5 1.8 1.4 1.2 16.8 0.027 0.133 0.234
## 9191 2 62 25.1 8.3 4.8 16.0 0.023 0.232 0.276
## 9504 35 76 10.2 7.9 7.0 15.9 0.046 0.205 0.160
## 9077 42 70 6.1 5.9 1.9 15.8 0.130 0.225 0.151
## 9559 43 2 4.0 5.5 0.5 15.2 0.160 0.241 0.137
## 9096 19 22 3.8 0.8 0.5 15.0 0.036 0.072 0.222
## 9351 4 61 18.1 5.0 9.2 14.9 0.025 0.149 0.243
## 9199 11 78 22.3 3.7 2.1 14.7 0.022 0.091 0.262
## 9157 8 7 1.7 2.0 0.0 13.9 0.205 0.133 0.180
## 9314 47 40 2.5 0.8 0.9 13.9 0.014 0.084 0.147
## 9437 9 76 7.6 4.0 3.4 13.4 0.030 0.131 0.113
## 9548 Undrafted 37 2.9 1.1 0.9 12.1 0.015 0.130 0.205
## 9269 55 80 9.5 1.8 3.5 12.0 0.015 0.074 0.189
## ts_pct ast_pct season
## 9168 0.250 0.200 2016-17
## 9340 0.286 0.167 2016-17
## 9263 0.416 0.322 2016-17
## 9335 0.526 0.036 2016-17
## 9200 0.595 0.083 2016-17
## 9454 0.642 0.032 2016-17
## 9213 0.624 0.287 2016-17
## 9173 0.276 0.250 2016-17
## 9191 0.651 0.218 2016-17
## 9504 0.522 0.272 2016-17
## 9077 0.588 0.131 2016-17
## 9559 0.633 0.071 2016-17
## 9096 0.536 0.098 2016-17
## 9351 0.614 0.444 2016-17
## 9199 0.592 0.090 2016-17
## 9157 0.432 0.000 2016-17
## 9314 0.527 0.140 2016-17
## 9437 0.624 0.169 2016-17
## 9548 0.443 0.180 2016-17
## 9269 0.578 0.236 2016-17
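The table above mixes players with very different numbers of games played (gp), from 2 to 80. The sketch below repeats the sort-and-take-top step on five rows copied from that table and then applies a games-played filter. The threshold min_games is a hypothetical value chosen for illustration, not something the analysis above uses.

```r
# Five rows copied from the output above, deliberately out of order
top <- data.frame(
  player_name = c("Stephen Curry", "John Lucas III", "Kevin Durant",
                  "Brice Johnson", "JaVale McGee"),
  gp          = c(79, 5, 62, 3, 77),
  net_rating  = c(17.2, 46.9, 16.0, 44.9, 18.7),
  stringsAsFactors = FALSE
)

# Sort by net rating, highest first (same effect as order(-top$net_rating))
ranked <- top[order(top$net_rating, decreasing = TRUE), ]
head(ranked, 3)

# Hypothetical minimum number of games for a rating to count
min_games <- 20
regulars <- ranked[ranked$gp >= min_games, ]
regulars
```

With the filter applied, only the three players with at least 20 games remain, still ordered by net rating.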
Then, we visualize the data with ggplot(), drawing one point per player with geom_point():
ggplot(nba.1617.net, aes(reorder(college,net_rating), net_rating))+
geom_point(aes(color = net_rating), size = 6)+
coord_flip()+
labs(x = "College", y = "Net Rating", title = "Colleges of the Top 20 Players by Net Rating", subtitle = "Season : 2016 - 2017", caption = "NBA Players Data 1996 - 2016") +
theme(axis.title = element_text(hjust = 0.5, vjust = 1, size = 10, face = "italic"),
axis.text.y = element_text(size = 10, face = "bold"))
From the plot, we can conclude that in the 2016-17 season the college whose player has the highest net rating is Oklahoma State, at 46.9, followed closely by North Carolina at 44.9. Both ratings belong to players who appeared in very few games (5 and 3 respectively), so they should be read with caution.
Points, Rebounds, and Assists score in average?
To answer the third question, we again use the nba.1617 data frame created earlier and apply the aggregate.data.frame function with mean to compute each college's average Points, Rebounds, and Assists. We then add a new column holding the average of those three values and sort the data in descending order on that column by negating it inside order(). Finally, we keep the first 10 rows of the sorted data frame for visualization.
temp2 <- aggregate.data.frame(list("Points" = nba.1617$pts, "Rebounds" = nba.1617$reb, "Assists" = nba.1617$ast), list(nba.1617$college), mean)
temp2$average <- (temp2$Points + temp2$Rebounds + temp2$Assists)/3
temp2 <- temp2[order(-temp2$average),]
temp2 <- temp2[1:10,]
head(temp2)
## Group.1 Points Rebounds Assists average
## 4 Arizona State 29.1 8.1 11.2 16.133333
## 23 Davidson 25.3 4.5 6.6 12.133333
## 90 San Diego State 25.5 5.8 3.5 11.600000
## 55 Marshall 17.0 14.1 0.7 10.600000
## 48 Lehigh 23.0 3.6 3.6 10.066667
## 52 Louisiana Tech 18.1 7.7 3.7 9.833333
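The aggregate, average, and sort steps can be checked on a small example. The sketch below uses a made-up data frame (toy, with colleges "A", "B", and "C") and additionally counts players per college, since a college with a single player simply keeps that player's raw numbers. The n_players column is an addition for illustration and is not part of the analysis above.

```r
# Made-up players from three colleges (illustration only)
toy <- data.frame(
  college = c("A", "A", "B", "C", "C"),
  pts     = c(10, 20, 30, 6, 12),
  reb     = c(4, 6, 9, 3, 3),
  ast     = c(1, 3, 12, 0, 6),
  stringsAsFactors = FALSE
)

# Mean points, rebounds, and assists per college
per_college <- aggregate(cbind(pts, reb, ast) ~ college, data = toy, FUN = mean)

# Average of the three means, one value per college
per_college$average <- rowMeans(per_college[, c("pts", "reb", "ast")])

# Number of players behind each mean
counts <- table(toy$college)
per_college$n_players <- as.vector(counts[as.character(per_college$college)])

# Sort descending by the average and keep the top 2
ordered <- per_college[order(-per_college$average), ]
top2 <- head(ordered, 2)
ordered
```

Here college "B" ranks first with an average of 17, but that figure rests on a single player.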
We use the function melt() to stack the three variables into a single value column; the average column is left out of the melted data.
nbamelt2 <- melt(temp2, id.vars = 1, measure.vars = c("Points", "Rebounds", "Assists"))
head(nbamelt2)
## Group.1 variable value
## 1 Arizona State Points 29.1
## 2 Davidson Points 25.3
## 3 San Diego State Points 25.5
## 4 Marshall Points 17.0
## 5 Lehigh Points 23.0
## 6 Louisiana Tech Points 18.1
We visualize the data using ggplot() and geom_col(), showing the 3 variables side by side: Points, Rebounds, and Assists.
ggplot(nbamelt2, aes(reorder(Group.1, value), value))+
geom_col(aes(fill=variable), position = "dodge")+
coord_flip()+
labs(title = "Top 10 Colleges With Highest Player Stats", subtitle = "Season : 2016 - 2017",x = "College", y = "Points, Rebounds, Assists", caption = "NBA Players Data 1996 - 2016")+
theme(legend.position = "bottom",
axis.text.x = element_text(hjust = 0.5))
From the plot, we find that Arizona State is the college with the highest average of Points, Rebounds, and Assists. Marshall has the highest Rebounds score, but its other scores are not as high (its Assists score is only 0.7), so it only reaches 4th place.