This project uses the statistical software R to generate a decision tree that determines NBA players' positions based on various statistical categories. Data from Basketball-Reference.com is used to construct a classification tree. Trees of this type sort categorical data appropriately, here classifying each player's position as one of the following:
• Point Guard
• Shooting Guard
• Small Forward
• Power Forward
• Center
The classification draws on the following statistics, tracked over the course of the season:
• Games played
• Games started
• Minutes per game
• 2pt & 3pt field goals/attempts
• Free throws/attempts
• Assists, turnovers
• Rebounds
• Fouls
• Blocks
• Steals
• Offensive and Defensive rating
To resolve discrepancies in playing time among players, we will use per-100-possessions statistics. The resulting decision tree will show the threshold value of each category associated with each position; these values can be used to determine the benchmarks players must reach at their respective positions.
First, in order to generate a decision tree we need data to feed into it. We will import the data using the ballr package as well as the other necessary dependencies.
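A minimal sketch of that setup, assuming ballr's per-100-possessions scraper; the data frame name `per_100` matches the rpart calls printed later in this document:

```r
library(ballr)       # scrapes player tables from Basketball-Reference
library(rpart)       # fits classification trees
library(rpart.plot)  # draws rpart trees

# Per-100-possessions statistics for the 2018-19 season
per_100 <- NBAPerGameStatisticsPer100Poss(season = 2019)
```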
We will now view the player statistics from the 2018-19 NBA season. It is important to use per-100-possessions statistics to ensure all players are measured equally, giving a representative reading in the decision tree.
Below are some of the entries of the per-100-possessions statistics for players during the season.
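That view is presumably produced by a call like the following, whose output appears in the table below:

```r
head(per_100)  # first few rows of the scraped per-100 table
```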
| rk | player | pos | age | tm | g | gs | mp | fg | fga | fgpercent | x3p | x3pa | x3ppercent | x2p | x2pa | x2ppercent | ft | fta | ftpercent | orb | drb | trb | ast | stl | blk | tov | pf | pts | x | ortg | drtg | link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Álex Abrines | SG | 25 | OKC | 31 | 2 | 588 | 4.4 | 12.5 | 0.357 | 3.3 | 10.1 | 0.323 | 1.2 | 2.4 | 0.500 | 1.0 | 1.0 | 0.923 | 0.4 | 3.4 | 3.8 | 1.6 | 1.3 | 0.5 | 1.1 | 4.2 | 13.1 | NA | 103 | 111 | /players/a/abrinal01.html |
| 2 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 123 | 1.6 | 7.0 | 0.222 | 0.8 | 5.8 | 0.133 | 0.8 | 1.2 | 0.667 | 2.7 | 3.9 | 0.700 | 1.2 | 8.5 | 9.7 | 3.1 | 0.4 | 1.6 | 1.6 | 9.3 | 6.6 | NA | 87 | 116 | /players/a/acyqu01.html |
| 3 | Jaylen Adams | PG | 22 | ATL | 34 | 1 | 428 | 4.1 | 11.9 | 0.345 | 2.7 | 8.0 | 0.338 | 1.4 | 3.9 | 0.361 | 0.8 | 1.0 | 0.778 | 1.2 | 5.3 | 6.5 | 7.0 | 1.5 | 0.5 | 3.0 | 4.9 | 11.7 | NA | 99 | 115 | /players/a/adamsja01.html |
| 4 | Steven Adams | C | 25 | OKC | 80 | 80 | 2669 | 8.4 | 14.1 | 0.595 | 0.0 | 0.0 | 0.000 | 8.4 | 14.1 | 0.596 | 2.6 | 5.1 | 0.500 | 6.8 | 6.5 | 13.3 | 2.2 | 2.0 | 1.3 | 2.4 | 3.6 | 19.4 | NA | 120 | 106 | /players/a/adamsst01.html |
| 5 | Bam Adebayo | C | 21 | MIA | 82 | 28 | 1913 | 7.2 | 12.4 | 0.576 | 0.1 | 0.4 | 0.200 | 7.1 | 12.0 | 0.588 | 4.2 | 5.8 | 0.735 | 4.2 | 11.0 | 15.2 | 4.7 | 1.8 | 1.7 | 3.1 | 5.2 | 18.6 | NA | 120 | 104 | /players/a/adebaba01.html |
| 6 | Deng Adel | SF | 21 | CLE | 19 | 3 | 194 | 2.8 | 9.2 | 0.306 | 1.5 | 5.9 | 0.261 | 1.3 | 3.3 | 0.385 | 1.0 | 1.0 | 1.000 | 0.8 | 4.1 | 4.9 | 1.3 | 0.3 | 1.0 | 1.5 | 3.3 | 8.2 | NA | 85 | 121 | /players/a/adelde01.html |
Let's examine the column names of this data frame using the following function.
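A sketch of the call, assuming the data frame name from above (`names()` would work equally well); its output follows:

```r
colnames(per_100)  # list the 33 columns returned by ballr
```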
## [1] "rk" "player" "pos" "age" "tm"
## [6] "g" "gs" "mp" "fg" "fga"
## [11] "fgpercent" "x3p" "x3pa" "x3ppercent" "x2p"
## [16] "x2pa" "x2ppercent" "ft" "fta" "ftpercent"
## [21] "orb" "drb" "trb" "ast" "stl"
## [26] "blk" "tov" "pf" "pts" "x"
## [31] "ortg" "drtg" "link"
Not all 33 columns are necessarily useful in the construction of the decision tree, such as the following variables:
• player (the name column)
• x
• link
• rk
• tm (team)
These variables are categorical in nature and unique to each player. In addition, the following variables:
• g (games played)
• gs (games started)
• the shooting percentages (fgpercent, x3ppercent, x2ppercent, ftpercent)
• pf (fouls)
will not be used either, for we are only interested in classifying players by position, and they do not add any extra value to our tree.
The following variables will be used in the decision tree:
• fg, fga
• x3p, x3pa
• x2p, x2pa
• ft, fta
• trb
• ast, tov
• blk, stl
• pts
We will construct a new data frame with the statistics of interest to use going forward. Below are some of the entries of our new data frame.
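A sketch of the subsetting step, assuming base-R column selection with the names shown above:

```r
# Keep only the position label and the statistics of interest
keep <- c("pos", "fg", "fga", "x3p", "x3pa", "x2p", "x2pa",
          "ft", "fta", "trb", "ast", "blk", "stl", "tov", "pts")
per_100 <- per_100[, keep]
head(per_100)
```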
| pos | fg | fga | x3p | x3pa | x2p | x2pa | ft | fta | trb | ast | blk | stl | tov | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SG | 4.4 | 12.5 | 3.3 | 10.1 | 1.2 | 2.4 | 1.0 | 1.0 | 3.8 | 1.6 | 0.5 | 1.3 | 1.1 | 13.1 |
| PF | 1.6 | 7.0 | 0.8 | 5.8 | 0.8 | 1.2 | 2.7 | 3.9 | 9.7 | 3.1 | 1.6 | 0.4 | 1.6 | 6.6 |
| PG | 4.1 | 11.9 | 2.7 | 8.0 | 1.4 | 3.9 | 0.8 | 1.0 | 6.5 | 7.0 | 0.5 | 1.5 | 3.0 | 11.7 |
| C | 8.4 | 14.1 | 0.0 | 0.0 | 8.4 | 14.1 | 2.6 | 5.1 | 13.3 | 2.2 | 1.3 | 2.0 | 2.4 | 19.4 |
| C | 7.2 | 12.4 | 0.1 | 0.4 | 7.1 | 12.0 | 4.2 | 5.8 | 15.2 | 4.7 | 1.7 | 1.8 | 3.1 | 18.6 |
| SF | 2.8 | 9.2 | 1.5 | 5.9 | 1.3 | 3.3 | 1.0 | 1.0 | 4.9 | 1.3 | 1.0 | 0.3 | 1.5 | 8.2 |
For the decision tree to work, the position must be stored as a categorical variable, with a character string for each position (see the conversion sketch after this list):
• “C”
• “PF”
• “SF”
• “SG”
• “PG”
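A plausible conversion using base R's factor type. Note that the rpart output below shows a few extra classes beyond these five; those come from combined labels such as "SG-PF" in the raw Basketball-Reference data:

```r
# Store position as a categorical outcome so rpart treats it as a class label
per_100$pos <- as.factor(per_100$pos)
levels(per_100$pos)  # "C", "PF", "PG", "SF", "SG", plus a few combined labels
```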
Now we can begin to construct the decision tree using the 'rpart' package. With our new data frame we can fit a classification tree.
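The printed model calls below reference a formula object named `P100`; a plausible definition uses all of the retained statistics as predictors of position:

```r
# Classification tree: position as a function of the per-100 statistics
P100 <- pos ~ fg + fga + x3p + x3pa + x2p + x2pa + ft + fta +
  trb + ast + blk + stl + tov + pts
tree <- rpart(P100, data = per_100, method = "class")
```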
To verify which variables were used in building the decision tree, we call the following function, which displays the complexity parameter table for the fitted rpart object.
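The output format below matches rpart's `printcp()`, assuming the fitted model is stored in `tree` as sketched above:

```r
printcp(tree)  # CP table plus the variables actually used in the splits
```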
##
## Classification tree:
## rpart(formula = P100, data = per_100, method = "class")
##
## Variables actually used in tree construction:
## [1] ast blk stl trb x3pa
##
## Root node error: 532/708 = 0.751
##
## n= 708
##
## CP nsplit rel error xerror xstd
## 1 0.2030 0 1.000 1.000 0.0216
## 2 0.1635 1 0.797 0.816 0.0244
## 3 0.0714 2 0.633 0.656 0.0250
## 4 0.0564 3 0.562 0.605 0.0249
## 5 0.0357 4 0.506 0.564 0.0247
## 6 0.0320 5 0.470 0.539 0.0246
## 7 0.0132 6 0.438 0.500 0.0242
## 8 0.0100 7 0.425 0.476 0.0240
This confirms that the following variables:
• ast
• blk
• stl
• trb
• x3pa
are used in the construction of the decision tree. The excluded variables are:
• fg, fga
• x3p, x2p, x2pa
• ft, fta
• tov
• pts
To plot the decision tree, we will use the rpart.plot library to draw the tree and label it accordingly.
Using the prp() function we can generate a cleaner rendering of the same decision tree.
Using the fancyRpartPlot() function from the rattle package we can generate a fancier plot of the same tree; all three calls are sketched below.
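A sketch of the plotting calls, assuming the fitted object is named `tree` as above (the rendered plots are not reproduced here):

```r
library(rattle)       # provides fancyRpartPlot()

rpart.plot(tree)      # rpart.plot's default rendering
prp(tree)             # a cleaner boxed layout of the same tree
fancyRpartPlot(tree)  # shaded, annotated version of the same tree
```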
Now, for a more in-depth look at the tree, here are the calculations for each node and its respective split in the decision tree.
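This node-by-node breakdown is the output of calling `summary()` on the fitted tree:

```r
summary(tree)  # per-node class counts, primary splits, and surrogate splits
```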
## Call:
## rpart(formula = P100, data = per_100, method = "class")
## n= 708
##
## CP nsplit rel error xerror xstd
## 1 0.20300752 0 1.0000000 1.0000000 0.02161643
## 2 0.16353383 1 0.7969925 0.8157895 0.02436082
## 3 0.07142857 2 0.6334586 0.6560150 0.02500528
## 4 0.05639098 3 0.5620301 0.6052632 0.02490539
## 5 0.03571429 4 0.5056391 0.5639098 0.02471510
## 6 0.03195489 5 0.4699248 0.5394737 0.02455578
## 7 0.01315789 6 0.4379699 0.5000000 0.02422276
## 8 0.01000000 7 0.4248120 0.4755639 0.02396833
##
## Variable importance
## trb ast x3pa blk x3p x2p x2pa tov stl fga fg ft
## 26 14 13 11 10 6 5 5 3 2 2 2
##
## Node number 1: 708 observations, complexity param=0.2030075
## predicted class=SG expected loss=0.7514124 P(node) =1
## class counts: 120 1 147 1 2 138 119 2 176 1 1
## probabilities: 0.169 0.001 0.208 0.001 0.003 0.195 0.168 0.003 0.249 0.001 0.001
## left son=2 (242 obs) right son=3 (466 obs)
## Primary splits:
## trb < 9.75 to the right, improve=82.01022, (0 missing)
## ast < 5.85 to the left, improve=63.46394, (0 missing)
## blk < 1.45 to the right, improve=45.10988, (0 missing)
## x3pa < 2.15 to the left, improve=42.51604, (0 missing)
## x3p < 0.95 to the left, improve=31.28822, (0 missing)
## Surrogate splits:
## x3pa < 2.95 to the left, agree=0.790, adj=0.384, (0 split)
## blk < 1.45 to the right, agree=0.780, adj=0.355, (0 split)
## x3p < 0.95 to the left, agree=0.766, adj=0.314, (0 split)
## x2p < 7.85 to the right, agree=0.739, adj=0.236, (0 split)
## x2pa < 12.65 to the right, agree=0.706, adj=0.140, (0 split)
##
## Node number 2: 242 observations, complexity param=0.07142857
## predicted class=C expected loss=0.5371901 P(node) =0.3418079
## class counts: 112 1 101 1 0 5 18 0 4 0 0
## probabilities: 0.463 0.004 0.417 0.004 0.000 0.021 0.074 0.000 0.017 0.000 0.000
## left son=4 (76 obs) right son=5 (166 obs)
## Primary splits:
## x3pa < 0.95 to the left, improve=21.17913, (0 missing)
## blk < 1.45 to the right, improve=19.00434, (0 missing)
## x3p < 1.25 to the left, improve=18.71015, (0 missing)
## trb < 11.75 to the right, improve=16.96863, (0 missing)
## x2p < 4.75 to the right, improve=10.83801, (0 missing)
## Surrogate splits:
## x3p < 0.25 to the left, agree=0.913, adj=0.724, (0 split)
## fga < 11.85 to the left, agree=0.723, adj=0.118, (0 split)
## trb < 19.2 to the right, agree=0.715, adj=0.092, (0 split)
## blk < 3.45 to the right, agree=0.715, adj=0.092, (0 split)
## x2pa < 17.05 to the right, agree=0.698, adj=0.039, (0 split)
##
## Node number 3: 466 observations, complexity param=0.1635338
## predicted class=SG expected loss=0.6309013 P(node) =0.6581921
## class counts: 8 0 46 0 2 133 101 2 172 1 1
## probabilities: 0.017 0.000 0.099 0.000 0.004 0.285 0.217 0.004 0.369 0.002 0.002
## left son=6 (141 obs) right son=7 (325 obs)
## Primary splits:
## ast < 5.85 to the right, improve=64.372910, (0 missing)
## trb < 6.95 to the right, improve=21.331720, (0 missing)
## tov < 2.55 to the right, improve=18.635180, (0 missing)
## x3pa < 8.35 to the left, improve= 9.102533, (0 missing)
## x2pa < 8.05 to the right, improve= 7.470095, (0 missing)
## Surrogate splits:
## tov < 3.05 to the right, agree=0.792, adj=0.312, (0 split)
## x2pa < 15 to the right, agree=0.736, adj=0.128, (0 split)
## fg < 10.05 to the right, agree=0.732, adj=0.113, (0 split)
## fga < 23.45 to the right, agree=0.732, adj=0.113, (0 split)
## ft < 3.85 to the right, agree=0.732, adj=0.113, (0 split)
##
## Node number 4: 76 observations
## predicted class=C expected loss=0.1842105 P(node) =0.1073446
## class counts: 62 0 13 0 0 1 0 0 0 0 0
## probabilities: 0.816 0.000 0.171 0.000 0.000 0.013 0.000 0.000 0.000 0.000 0.000
##
## Node number 5: 166 observations, complexity param=0.03195489
## predicted class=PF expected loss=0.4698795 P(node) =0.2344633
## class counts: 50 1 88 1 0 4 18 0 4 0 0
## probabilities: 0.301 0.006 0.530 0.006 0.000 0.024 0.108 0.000 0.024 0.000 0.000
## left son=10 (68 obs) right son=11 (98 obs)
## Primary splits:
## blk < 1.45 to the right, improve=13.836010, (0 missing)
## trb < 11.75 to the right, improve= 8.805977, (0 missing)
## x3pa < 6.25 to the left, improve= 8.205598, (0 missing)
## ast < 4.85 to the left, improve= 6.877132, (0 missing)
## x3p < 2.15 to the left, improve= 5.566296, (0 missing)
## Surrogate splits:
## x3pa < 3.95 to the left, agree=0.669, adj=0.191, (0 split)
## x2p < 9.3 to the right, agree=0.657, adj=0.162, (0 split)
## x3p < 1.55 to the left, agree=0.633, adj=0.103, (0 split)
## trb < 15.1 to the right, agree=0.633, adj=0.103, (0 split)
## fg < 11.95 to the right, agree=0.627, adj=0.088, (0 split)
##
## Node number 6: 141 observations
## predicted class=PG expected loss=0.2269504 P(node) =0.1991525
## class counts: 1 0 2 0 0 109 7 0 22 0 0
## probabilities: 0.007 0.000 0.014 0.000 0.000 0.773 0.050 0.000 0.156 0.000 0.000
##
## Node number 7: 325 observations, complexity param=0.05639098
## predicted class=SG expected loss=0.5384615 P(node) =0.4590395
## class counts: 7 0 44 0 2 24 94 2 150 1 1
## probabilities: 0.022 0.000 0.135 0.000 0.006 0.074 0.289 0.006 0.462 0.003 0.003
## left son=14 (136 obs) right son=15 (189 obs)
## Primary splits:
## trb < 6.95 to the right, improve=23.463410, (0 missing)
## ast < 2.45 to the left, improve= 9.133476, (0 missing)
## blk < 1.35 to the right, improve= 6.243228, (0 missing)
## x3p < 3.85 to the left, improve= 5.850970, (0 missing)
## x3pa < 8.35 to the left, improve= 5.315060, (0 missing)
## Surrogate splits:
## blk < 0.85 to the right, agree=0.658, adj=0.184, (0 split)
## stl < 2.05 to the right, agree=0.618, adj=0.088, (0 split)
## x3pa < 5.45 to the left, agree=0.606, adj=0.059, (0 split)
## fta < 3.85 to the right, agree=0.603, adj=0.051, (0 split)
## x2p < 4.25 to the right, agree=0.600, adj=0.044, (0 split)
##
## Node number 10: 68 observations, complexity param=0.01315789
## predicted class=C expected loss=0.4264706 P(node) =0.0960452
## class counts: 39 1 22 1 0 1 4 0 0 0 0
## probabilities: 0.574 0.015 0.324 0.015 0.000 0.015 0.059 0.000 0.000 0.000 0.000
## left son=20 (51 obs) right son=21 (17 obs)
## Primary splits:
## trb < 11.8 to the right, improve=5.529412, (0 missing)
## ast < 2.35 to the right, improve=5.019608, (0 missing)
## ft < 2.9 to the right, improve=4.812038, (0 missing)
## pts < 18.15 to the right, improve=4.748393, (0 missing)
## x2p < 4.8 to the right, improve=4.001961, (0 missing)
## Surrogate splits:
## tov < 1.85 to the right, agree=0.824, adj=0.294, (0 split)
## x2pa < 8.85 to the right, agree=0.794, adj=0.176, (0 split)
## fga < 9.8 to the right, agree=0.765, adj=0.059, (0 split)
## fta < 2.45 to the right, agree=0.765, adj=0.059, (0 split)
##
## Node number 11: 98 observations
## predicted class=PF expected loss=0.3265306 P(node) =0.1384181
## class counts: 11 0 66 0 0 3 14 0 4 0 0
## probabilities: 0.112 0.000 0.673 0.000 0.000 0.031 0.143 0.000 0.041 0.000 0.000
##
## Node number 14: 136 observations, complexity param=0.03571429
## predicted class=SF expected loss=0.5661765 P(node) =0.1920904
## class counts: 7 0 35 0 1 4 59 1 29 0 0
## probabilities: 0.051 0.000 0.257 0.000 0.007 0.029 0.434 0.007 0.213 0.000 0.000
## left son=28 (40 obs) right son=29 (96 obs)
## Primary splits:
## stl < 1.05 to the left, improve=12.824750, (0 missing)
## ast < 3.45 to the left, improve= 8.090552, (0 missing)
## x2p < 2.65 to the left, improve= 7.358991, (0 missing)
## x2pa < 4.55 to the left, improve= 5.207983, (0 missing)
## x3pa < 9.6 to the right, improve= 4.065126, (0 missing)
## Surrogate splits:
## x3p < 3.45 to the right, agree=0.772, adj=0.225, (0 split)
## x2p < 2.65 to the left, agree=0.772, adj=0.225, (0 split)
## x3pa < 8.95 to the right, agree=0.765, adj=0.200, (0 split)
## x2pa < 4.65 to the left, agree=0.757, adj=0.175, (0 split)
## blk < 0.15 to the left, agree=0.735, adj=0.100, (0 split)
##
## Node number 15: 189 observations
## predicted class=SG expected loss=0.3597884 P(node) =0.2669492
## class counts: 0 0 9 0 1 20 35 1 121 1 1
## probabilities: 0.000 0.000 0.048 0.000 0.005 0.106 0.185 0.005 0.640 0.005 0.005
##
## Node number 20: 51 observations
## predicted class=C expected loss=0.2941176 P(node) =0.0720339
## class counts: 36 0 12 1 0 1 1 0 0 0 0
## probabilities: 0.706 0.000 0.235 0.020 0.000 0.020 0.020 0.000 0.000 0.000 0.000
##
## Node number 21: 17 observations
## predicted class=PF expected loss=0.4117647 P(node) =0.0240113
## class counts: 3 1 10 0 0 0 3 0 0 0 0
## probabilities: 0.176 0.059 0.588 0.000 0.000 0.000 0.176 0.000 0.000 0.000 0.000
##
## Node number 28: 40 observations
## predicted class=PF expected loss=0.425 P(node) =0.05649718
## class counts: 5 0 23 0 1 2 4 0 5 0 0
## probabilities: 0.125 0.000 0.575 0.000 0.025 0.050 0.100 0.000 0.125 0.000 0.000
##
## Node number 29: 96 observations
## predicted class=SF expected loss=0.4270833 P(node) =0.1355932
## class counts: 2 0 12 0 0 2 55 1 24 0 0
## probabilities: 0.021 0.000 0.125 0.000 0.000 0.021 0.573 0.010 0.250 0.000 0.000
• The root node predicts the shooting guard position; the first split, determined to be the best split of the data, is on total rebounds with a threshold of about 9.8. This divides the data into two groups: players above the threshold branch toward center, and players below it branch toward shooting guard.
• The next node separates centers from power forwards by three-point attempts at 0.95; players below that value reach a terminal node classified as centers.
• A decision node at 5.9 assists separates point guards from shooting guards; players above that value reach a terminal node classified as point guards.
• On the high-rebound branch, the determining factor between centers and power forwards is 1.5 blocks; players below that value reach a terminal node classified as power forwards.
• A decision node at 7 rebounds separates small forwards from shooting guards; players below that value reach a terminal node classified as shooting guards.
• The last decision nodes are total rebounds at 12, with centers above that point and power forwards below, and steals at 1.1, with power forwards below and small forwards above.
A player's position can thus be determined from five statistical categories: total rebounds, three-point attempts, assists, blocks, and steals. Each decision node represents the significance of its category and the benchmark for that position. This classification tree can be used to identify players and to help teams determine what a player's goals and focus should be in certain categories. The classification tree could also be extended to a team-based model of the respective statistics, or narrowed down to a specific team or set of players.
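As an illustration of the identification use case, a sketch with a purely hypothetical stat line (object names follow the sketches above):

```r
# A hypothetical big man: many rebounds, almost no three-point attempts
new_player <- data.frame(fg = 7.5, fga = 14.0, x3p = 0.2, x3pa = 0.8,
                         x2p = 7.3, x2pa = 13.2, ft = 3.0, fta = 4.5,
                         trb = 12.5, ast = 2.0, blk = 1.8, stl = 1.0,
                         tov = 2.2, pts = 18.2)
predict(tree, new_player, type = "class")
# trb above 9.8 and x3pa below 0.95 lead to the terminal node labeled "C"
```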