This project uses the statistical software R to generate a decision tree that determines NBA players' positions based on various statistical categories. Data from Basketball-Reference.com is used to construct a classification tree. Trees of this type sort categorical data appropriately, here classifying each player's position as one of the following:
• Point Guard
• Shooting Guard
• Small Forward
• Power Forward
• Center
The classification draws on the following statistics, tracked over the course of the season:
• Games played
• Games started
• Minutes per game
• 2pt & 3pt field goals/attempts
• Free throws/attempts
• Assists, turnovers
• Rebounds
• Fouls
• Blocks
• Steals
• Offensive and Defensive rating
To resolve discrepancies in playing time among players, we will use per-100-possessions statistics. The resulting decision tree will show the threshold value of each category associated with each position; these values can be used to determine the benchmarks players must reach at their respective positions.
First, in order to generate a decision tree we need data to feed into it. We will import the data using the ballr package as well as the other necessary dependencies.
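A minimal sketch of that setup, assuming ballr's per-100-possessions scraper; the data frame name `per_100` matches the rpart calls printed later in this document:

```r
library(ballr)       # scrapes player tables from Basketball-Reference
library(rpart)       # fits classification trees
library(rpart.plot)  # draws rpart trees

# Per-100-possessions statistics for the 2018-19 season
per_100 <- NBAPerGameStatisticsPer100Poss(season = 2019)
```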
We will now view the player statistics from the 2018-19 NBA season. It is important to use per-100-possessions statistics to ensure all players are measured equally, giving a representative reading in the decision tree.
Below are some of the entries of the per-100-possessions statistics for players during the season.
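That view is presumably produced by a call like the following, whose output appears in the table below:

```r
head(per_100)  # first few rows of the scraped per-100 table
```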
| rk | player | pos | age | tm | g | gs | mp | fg | fga | fgpercent | x3p | x3pa | x3ppercent | x2p | x2pa | x2ppercent | ft | fta | ftpercent | orb | drb | trb | ast | stl | blk | tov | pf | pts | x | ortg | drtg | link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Álex Abrines | SG | 25 | OKC | 31 | 2 | 588 | 4.4 | 12.5 | 0.357 | 3.3 | 10.1 | 0.323 | 1.2 | 2.4 | 0.500 | 1.0 | 1.0 | 0.923 | 0.4 | 3.4 | 3.8 | 1.6 | 1.3 | 0.5 | 1.1 | 4.2 | 13.1 | NA | 103 | 111 | /players/a/abrinal01.html |
| 2 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 123 | 1.6 | 7.0 | 0.222 | 0.8 | 5.8 | 0.133 | 0.8 | 1.2 | 0.667 | 2.7 | 3.9 | 0.700 | 1.2 | 8.5 | 9.7 | 3.1 | 0.4 | 1.6 | 1.6 | 9.3 | 6.6 | NA | 87 | 116 | /players/a/acyqu01.html |
| 3 | Jaylen Adams | PG | 22 | ATL | 34 | 1 | 428 | 4.1 | 11.9 | 0.345 | 2.7 | 8.0 | 0.338 | 1.4 | 3.9 | 0.361 | 0.8 | 1.0 | 0.778 | 1.2 | 5.3 | 6.5 | 7.0 | 1.5 | 0.5 | 3.0 | 4.9 | 11.7 | NA | 99 | 115 | /players/a/adamsja01.html |
| 4 | Steven Adams | C | 25 | OKC | 80 | 80 | 2669 | 8.4 | 14.1 | 0.595 | 0.0 | 0.0 | 0.000 | 8.4 | 14.1 | 0.596 | 2.6 | 5.1 | 0.500 | 6.8 | 6.5 | 13.3 | 2.2 | 2.0 | 1.3 | 2.4 | 3.6 | 19.4 | NA | 120 | 106 | /players/a/adamsst01.html |
| 5 | Bam Adebayo | C | 21 | MIA | 82 | 28 | 1913 | 7.2 | 12.4 | 0.576 | 0.1 | 0.4 | 0.200 | 7.1 | 12.0 | 0.588 | 4.2 | 5.8 | 0.735 | 4.2 | 11.0 | 15.2 | 4.7 | 1.8 | 1.7 | 3.1 | 5.2 | 18.6 | NA | 120 | 104 | /players/a/adebaba01.html |
| 6 | Deng Adel | SF | 21 | CLE | 19 | 3 | 194 | 2.8 | 9.2 | 0.306 | 1.5 | 5.9 | 0.261 | 1.3 | 3.3 | 0.385 | 1.0 | 1.0 | 1.000 | 0.8 | 4.1 | 4.9 | 1.3 | 0.3 | 1.0 | 1.5 | 3.3 | 8.2 | NA | 85 | 121 | /players/a/adelde01.html |
Let's examine the column names of this data frame using the following function.
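A sketch of the call, assuming the data frame name from above (`names()` would work equally well); its output follows:

```r
colnames(per_100)  # list the 33 columns returned by ballr
```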
## [1] "rk" "player" "pos" "age" "tm"
## [6] "g" "gs" "mp" "fg" "fga"
## [11] "fgpercent" "x3p" "x3pa" "x3ppercent" "x2p"
## [16] "x2pa" "x2ppercent" "ft" "fta" "ftpercent"
## [21] "orb" "drb" "trb" "ast" "stl"
## [26] "blk" "tov" "pf" "pts" "x"
## [31] "ortg" "drtg" "link"
Not all 33 columns are necessarily useful in the construction of the decision tree, such as the following variables:
• player (the name column)
• x
• link
• rk
• tm (team)
These variables are categorical in nature and unique to each player. In addition, the following variables:
• g (games played)
• gs (games started)
• the shooting percentages (fgpercent, x3ppercent, x2ppercent, ftpercent)
• pf (fouls)
will not be used either, for we are only interested in classifying players by position, and they do not add any extra value to our tree.
The following variables will be used in the decision tree:
• fg, fga
• x3p, x3pa
• x2p, x2pa
• ft, fta
• trb
• ast, tov
• blk, stl
• pts
We will construct a new data frame with the statistics of interest to use going forward. Below are some of the entries of our new data frame.
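A sketch of the subsetting step, assuming base-R column selection with the names shown above:

```r
# Keep only the position label and the statistics of interest
keep <- c("pos", "fg", "fga", "x3p", "x3pa", "x2p", "x2pa",
          "ft", "fta", "trb", "ast", "blk", "stl", "tov", "pts")
per_100 <- per_100[, keep]
head(per_100)
```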
| pos | fg | fga | x3p | x3pa | x2p | x2pa | ft | fta | trb | ast | blk | stl | tov | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SG | 4.4 | 12.5 | 3.3 | 10.1 | 1.2 | 2.4 | 1.0 | 1.0 | 3.8 | 1.6 | 0.5 | 1.3 | 1.1 | 13.1 |
| PF | 1.6 | 7.0 | 0.8 | 5.8 | 0.8 | 1.2 | 2.7 | 3.9 | 9.7 | 3.1 | 1.6 | 0.4 | 1.6 | 6.6 |
| PG | 4.1 | 11.9 | 2.7 | 8.0 | 1.4 | 3.9 | 0.8 | 1.0 | 6.5 | 7.0 | 0.5 | 1.5 | 3.0 | 11.7 |
| C | 8.4 | 14.1 | 0.0 | 0.0 | 8.4 | 14.1 | 2.6 | 5.1 | 13.3 | 2.2 | 1.3 | 2.0 | 2.4 | 19.4 |
| C | 7.2 | 12.4 | 0.1 | 0.4 | 7.1 | 12.0 | 4.2 | 5.8 | 15.2 | 4.7 | 1.7 | 1.8 | 3.1 | 18.6 |
| SF | 2.8 | 9.2 | 1.5 | 5.9 | 1.3 | 3.3 | 1.0 | 1.0 | 4.9 | 1.3 | 1.0 | 0.3 | 1.5 | 8.2 |
For the decision tree to work, the position must be stored as a categorical variable, with a character string for each position (see the conversion sketch after this list):
• “C”
• “PF”
• “SF”
• “SG”
• “PG”
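A plausible conversion using base R's factor type. Note that the rpart output below shows a few extra classes beyond these five; those come from combined labels such as "SG-PF" in the raw Basketball-Reference data:

```r
# Store position as a categorical outcome so rpart treats it as a class label
per_100$pos <- as.factor(per_100$pos)
levels(per_100$pos)  # "C", "PF", "PG", "SF", "SG", plus a few combined labels
```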
Now we can begin to construct the decision tree using the 'rpart' package. With our new data frame we can fit a classification tree.
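The printed model calls below reference a formula object named `P100`; a plausible definition uses all of the retained statistics as predictors of position:

```r
# Classification tree: position as a function of the per-100 statistics
P100 <- pos ~ fg + fga + x3p + x3pa + x2p + x2pa + ft + fta +
  trb + ast + blk + stl + tov + pts
tree <- rpart(P100, data = per_100, method = "class")
```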
To verify which variables were used in building the decision tree, we call the following function, which displays the complexity parameter table for the fitted rpart object.
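The output format below matches rpart's `printcp()`, assuming the fitted model is stored in `tree` as sketched above:

```r
printcp(tree)  # CP table plus the variables actually used in the splits
```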
##
## Classification tree:
## rpart(formula = P100, data = per_100, method = "class")
##
## Variables actually used in tree construction:
## [1] ast blk stl trb x3pa
##
## Root node error: 532/708 = 0.751
##
## n= 708
##
## CP nsplit rel error xerror xstd
## 1 0.2030 0 1.000 1.000 0.0216
## 2 0.1635 1 0.797 0.816 0.0244
## 3 0.0714 2 0.633 0.656 0.0250
## 4 0.0564 3 0.562 0.605 0.0249
## 5 0.0357 4 0.506 0.564 0.0247
## 6 0.0320 5 0.470 0.539 0.0246
## 7 0.0132 6 0.438 0.500 0.0242
## 8 0.0100 7 0.425 0.476 0.0240
This confirms that the following variables:
• ast
• blk
• stl
• trb
• x3pa
are used in the construction of the decision tree. The excluded variables are:
• fg, fga
• x3p, x2p, x2pa
• ft, fta
• tov
• pts
To plot the decision tree, we will use the rpart.plot library to draw the tree and label it accordingly.
Using the prp() function we can generate a cleaner rendering of the same decision tree.
Using the fancyRpartPlot() function from the rattle package we can generate a fancier plot of the same tree; all three calls are sketched below.
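A sketch of the plotting calls, assuming the fitted object is named `tree` as above (the rendered plots are not reproduced here):

```r
library(rattle)       # provides fancyRpartPlot()

rpart.plot(tree)      # rpart.plot's default rendering
prp(tree)             # a cleaner boxed layout of the same tree
fancyRpartPlot(tree)  # shaded, annotated version of the same tree
```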
Now, for a more in-depth look at the tree, here are the calculations for each node and its respective split in the decision tree.
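This node-by-node breakdown is the output of calling `summary()` on the fitted tree:

```r
summary(tree)  # per-node class counts, primary splits, and surrogate splits
```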
## Call:
## rpart(formula = P100, data = per_100, method = "class")
## n= 708
##
## CP nsplit rel error xerror xstd
## 1 0.20300752 0 1.0000000 1.0000000 0.02161643
## 2 0.16353383 1 0.7969925 0.8157895 0.02436082
## 3 0.07142857 2 0.6334586 0.6560150 0.02500528
## 4 0.05639098 3 0.5620301 0.6052632 0.02490539
## 5 0.03571429 4 0.5056391 0.5639098 0.02471510
## 6 0.03195489 5 0.4699248 0.5394737 0.02455578
## 7 0.01315789 6 0.4379699 0.5000000 0.02422276
## 8 0.01000000 7 0.4248120 0.4755639 0.02396833
##
## Variable importance
## trb ast x3pa blk x3p x2p x2pa tov stl fga fg ft
## 26 14 13 11 10 6 5 5 3 2 2 2
##
## Node number 1: 708 observations, complexity param=0.2030075
## predicted class=SG expected loss=0.7514124 P(node) =1
## class counts: 120 1 147 1 2 138 119 2 176 1 1
## probabilities: 0.169 0.001 0.208 0.001 0.003 0.195 0.168 0.003 0.249 0.001 0.001
## left son=2 (242 obs) right son=3 (466 obs)
## Primary splits:
## trb < 9.75 to the right, improve=82.01022, (0 missing)
## ast < 5.85 to the left, improve=63.46394, (0 missing)
## blk < 1.45 to the right, improve=45.10988, (0 missing)
## x3pa < 2.15 to the left, improve=42.51604, (0 missing)
## x3p < 0.95 to the left, improve=31.28822, (0 missing)
## Surrogate splits:
## x3pa < 2.95 to the left, agree=0.790, adj=0.384, (0 split)
## blk < 1.45 to the right, agree=0.780, adj=0.355, (0 split)
## x3p < 0.95 to the left, agree=0.766, adj=0.314, (0 split)
## x2p < 7.85 to the right, agree=0.739, adj=0.236, (0 split)
## x2pa < 12.65 to the right, agree=0.706, adj=0.140, (0 split)
##
## Node number 2: 242 observations, complexity param=0.07142857
## predicted class=C expected loss=0.5371901 P(node) =0.3418079
## class counts: 112 1 101 1 0 5 18 0 4 0 0
## probabilities: 0.463 0.004 0.417 0.004 0.000 0.021 0.074 0.000 0.017 0.000 0.000
## left son=4 (76 obs) right son=5 (166 obs)
## Primary splits:
## x3pa < 0.95 to the left, improve=21.17913, (0 missing)
## blk < 1.45 to the right, improve=19.00434, (0 missing)
## x3p < 1.25 to the left, improve=18.71015, (0 missing)
## trb < 11.75 to the right, improve=16.96863, (0 missing)
## x2p < 4.75 to the right, improve=10.83801, (0 missing)
## Surrogate splits:
## x3p < 0.25 to the left, agree=0.913, adj=0.724, (0 split)
## fga < 11.85 to the left, agree=0.723, adj=0.118, (0 split)
## trb < 19.2 to the right, agree=0.715, adj=0.092, (0 split)
## blk < 3.45 to the right, agree=0.715, adj=0.092, (0 split)
## x2pa < 17.05 to the right, agree=0.698, adj=0.039, (0 split)
##
## Node number 3: 466 observations, complexity param=0.1635338
## predicted class=SG expected loss=0.6309013 P(node) =0.6581921
## class counts: 8 0 46 0 2 133 101 2 172 1 1
## probabilities: 0.017 0.000 0.099 0.000 0.004 0.285 0.217 0.004 0.369 0.002 0.002
## left son=6 (141 obs) right son=7 (325 obs)
## Primary splits:
## ast < 5.85 to the right, improve=64.372910, (0 missing)
## trb < 6.95 to the right, improve=21.331720, (0 missing)
## tov < 2.55 to the right, improve=18.635180, (0 missing)
## x3pa < 8.35 to the left, improve= 9.102533, (0 missing)
## x2pa < 8.05 to the right, improve= 7.470095, (0 missing)
## Surrogate splits:
## tov < 3.05 to the right, agree=0.792, adj=0.312, (0 split)
## x2pa < 15 to the right, agree=0.736, adj=0.128, (0 split)
## fg < 10.05 to the right, agree=0.732, adj=0.113, (0 split)
## fga < 23.45 to the right, agree=0.732, adj=0.113, (0 split)
## ft < 3.85 to the right, agree=0.732, adj=0.113, (0 split)
##
## Node number 4: 76 observations
## predicted class=C expected loss=0.1842105 P(node) =0.1073446
## class counts: 62 0 13 0 0 1 0 0 0 0 0
## probabilities: 0.816 0.000 0.171 0.000 0.000 0.013 0.000 0.000 0.000 0.000 0.000
##
## Node number 5: 166 observations, complexity param=0.03195489
## predicted class=PF expected loss=0.4698795 P(node) =0.2344633
## class counts: 50 1 88 1 0 4 18 0 4 0 0
## probabilities: 0.301 0.006 0.530 0.006 0.000 0.024 0.108 0.000 0.024 0.000 0.000
## left son=10 (68 obs) right son=11 (98 obs)
## Primary splits:
## blk < 1.45 to the right, improve=13.836010, (0 missing)
## trb < 11.75 to the right, improve= 8.805977, (0 missing)
## x3pa < 6.25 to the left, improve= 8.205598, (0 missing)
## ast < 4.85 to the left, improve= 6.877132, (0 missing)
## x3p < 2.15 to the left, improve= 5.566296, (0 missing)
## Surrogate splits:
## x3pa < 3.95 to the left, agree=0.669, adj=0.191, (0 split)
## x2p < 9.3 to the right, agree=0.657, adj=0.162, (0 split)
## x3p < 1.55 to the left, agree=0.633, adj=0.103, (0 split)
## trb < 15.1 to the right, agree=0.633, adj=0.103, (0 split)
## fg < 11.95 to the right, agree=0.627, adj=0.088, (0 split)
##
## Node number 6: 141 observations
## predicted class=PG expected loss=0.2269504 P(node) =0.1991525
## class counts: 1 0 2 0 0 109 7 0 22 0 0
## probabilities: 0.007 0.000 0.014 0.000 0.000 0.773 0.050 0.000 0.156 0.000 0.000
##
## Node number 7: 325 observations, complexity param=0.05639098
## predicted class=SG expected loss=0.5384615 P(node) =0.4590395
## class counts: 7 0 44 0 2 24 94 2 150 1 1
## probabilities: 0.022 0.000 0.135 0.000 0.006 0.074 0.289 0.006 0.462 0.003 0.003
## left son=14 (136 obs) right son=15 (189 obs)
## Primary splits:
## trb < 6.95 to the right, improve=23.463410, (0 missing)
## ast < 2.45 to the left, improve= 9.133476, (0 missing)
## blk < 1.35 to the right, improve= 6.243228, (0 missing)
## x3p < 3.85 to the left, improve= 5.850970, (0 missing)
## x3pa < 8.35 to the left, improve= 5.315060, (0 missing)
## Surrogate splits:
## blk < 0.85 to the right, agree=0.658, adj=0.184, (0 split)
## stl < 2.05 to the right, agree=0.618, adj=0.088, (0 split)
## x3pa < 5.45 to the left, agree=0.606, adj=0.059, (0 split)
## fta < 3.85 to the right, agree=0.603, adj=0.051, (0 split)
## x2p < 4.25 to the right, agree=0.600, adj=0.044, (0 split)
##
## Node number 10: 68 observations, complexity param=0.01315789
## predicted class=C expected loss=0.4264706 P(node) =0.0960452
## class counts: 39 1 22 1 0 1 4 0 0 0 0
## probabilities: 0.574 0.015 0.324 0.015 0.000 0.015 0.059 0.000 0.000 0.000 0.000
## left son=20 (51 obs) right son=21 (17 obs)
## Primary splits:
## trb < 11.8 to the right, improve=5.529412, (0 missing)
## ast < 2.35 to the right, improve=5.019608, (0 missing)
## ft < 2.9 to the right, improve=4.812038, (0 missing)
## pts < 18.15 to the right, improve=4.748393, (0 missing)
## x2p < 4.8 to the right, improve=4.001961, (0 missing)
## Surrogate splits:
## tov < 1.85 to the right, agree=0.824, adj=0.294, (0 split)
## x2pa < 8.85 to the right, agree=0.794, adj=0.176, (0 split)
## fga < 9.8 to the right, agree=0.765, adj=0.059, (0 split)
## fta < 2.45 to the right, agree=0.765, adj=0.059, (0 split)
##
## Node number 11: 98 observations
## predicted class=PF expected loss=0.3265306 P(node) =0.1384181
## class counts: 11 0 66 0 0 3 14 0 4 0 0
## probabilities: 0.112 0.000 0.673 0.000 0.000 0.031 0.143 0.000 0.041 0.000 0.000
##
## Node number 14: 136 observations, complexity param=0.03571429
## predicted class=SF expected loss=0.5661765 P(node) =0.1920904
## class counts: 7 0 35 0 1 4 59 1 29 0 0
## probabilities: 0.051 0.000 0.257 0.000 0.007 0.029 0.434 0.007 0.213 0.000 0.000
## left son=28 (40 obs) right son=29 (96 obs)
## Primary splits:
## stl < 1.05 to the left, improve=12.824750, (0 missing)
## ast < 3.45 to the left, improve= 8.090552, (0 missing)
## x2p < 2.65 to the left, improve= 7.358991, (0 missing)
## x2pa < 4.55 to the left, improve= 5.207983, (0 missing)
## x3pa < 9.6 to the right, improve= 4.065126, (0 missing)
## Surrogate splits:
## x3p < 3.45 to the right, agree=0.772, adj=0.225, (0 split)
## x2p < 2.65 to the left, agree=0.772, adj=0.225, (0 split)
## x3pa < 8.95 to the right, agree=0.765, adj=0.200, (0 split)
## x2pa < 4.65 to the left, agree=0.757, adj=0.175, (0 split)
## blk < 0.15 to the left, agree=0.735, adj=0.100, (0 split)
##
## Node number 15: 189 observations
## predicted class=SG expected loss=0.3597884 P(node) =0.2669492
## class counts: 0 0 9 0 1 20 35 1 121 1 1
## probabilities: 0.000 0.000 0.048 0.000 0.005 0.106 0.185 0.005 0.640 0.005 0.005
##
## Node number 20: 51 observations
## predicted class=C expected loss=0.2941176 P(node) =0.0720339
## class counts: 36 0 12 1 0 1 1 0 0 0 0
## probabilities: 0.706 0.000 0.235 0.020 0.000 0.020 0.020 0.000 0.000 0.000 0.000
##
## Node number 21: 17 observations
## predicted class=PF expected loss=0.4117647 P(node) =0.0240113
## class counts: 3 1 10 0 0 0 3 0 0 0 0
## probabilities: 0.176 0.059 0.588 0.000 0.000 0.000 0.176 0.000 0.000 0.000 0.000
##
## Node number 28: 40 observations
## predicted class=PF expected loss=0.425 P(node) =0.05649718
## class counts: 5 0 23 0 1 2 4 0 5 0 0
## probabilities: 0.125 0.000 0.575 0.000 0.025 0.050 0.100 0.000 0.125 0.000 0.000
##
## Node number 29: 96 observations
## predicted class=SF expected loss=0.4270833 P(node) =0.1355932
## class counts: 2 0 12 0 0 2 55 1 24 0 0
## probabilities: 0.021 0.000 0.125 0.000 0.000 0.021 0.573 0.010 0.250 0.000 0.000
• The root node predicts the shooting guard position; the first split, determined to be the best split of the data, is on total rebounds with a threshold of about 9.8. This divides the data into two groups: players above the threshold branch toward center, and players below it branch toward shooting guard.
• The next node separates centers from power forwards by three-point attempts at 0.95; players below that value reach a terminal node classified as centers.
• A decision node at 5.9 assists separates point guards from shooting guards; players above that value reach a terminal node classified as point guards.
• On the high-rebound branch, the determining factor between centers and power forwards is 1.5 blocks; players below that value reach a terminal node classified as power forwards.
• A decision node at 7 rebounds separates small forwards from shooting guards; players below that value reach a terminal node classified as shooting guards.
• The last decision nodes are total rebounds at 12, with centers above that point and power forwards below, and steals at 1.1, with power forwards below and small forwards above.
A player's position can thus be determined from five statistical categories: total rebounds, three-point attempts, assists, blocks, and steals. Each decision node represents the significance of its category and the benchmark for that position. This classification tree can be used to identify players and to help teams determine what a player's goals and focus should be in certain categories. The classification tree could also be extended to a team-based model of the respective statistics, or narrowed down to a specific team or set of players.
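As an illustration of the identification use case, a sketch with a purely hypothetical stat line (object names follow the sketches above):

```r
# A hypothetical big man: many rebounds, almost no three-point attempts
new_player <- data.frame(fg = 7.5, fga = 14.0, x3p = 0.2, x3pa = 0.8,
                         x2p = 7.3, x2pa = 13.2, ft = 3.0, fta = 4.5,
                         trb = 12.5, ast = 2.0, blk = 1.8, stl = 1.0,
                         tov = 2.2, pts = 18.2)
predict(tree, new_player, type = "class")
# trb above 9.8 and x3pa below 0.95 lead to the terminal node labeled "C"
```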