Evan Parker
2024-12-10
| Variable | Description |
|---|---|
| Rk | Rank (player identification number) |
| Player | Player’s name |
| Pos | Position |
| Age | Player’s age |
| Tm | Team |
| G | Games played |
| GS | Games started |
| MP | Minutes played per game |
| FG | Field goals per game |
| FGA | Field goal attempts per game |
| FG% | Field goal percentage |
| 3P | 3-point field goals per game |
| 3PA | 3-point field goal attempts per game |
| 3P% | 3-point field goal percentage |
| 2P | 2-point field goals per game |
| 2PA | 2-point field goal attempts per game |
| 2P% | 2-point field goal percentage |
| eFG% | Effective field goal percentage |
| FT | Free throws per game |
| FTA | Free throw attempts per game |
| FT% | Free throw percentage |
| ORB | Offensive rebounds per game |
| DRB | Defensive rebounds per game |
| TRB | Total rebounds per game |
| AST | Assists per game |
| STL | Steals per game |
| BLK | Blocks per game |
| TOV | Turnovers per game |
| PF | Personal fouls per game |
| PTS | Points per game |

0 missing values
Team: If a player played for multiple teams in the same season, they have one observation per team, plus an additional observation where the Team variable is TOT (the combined total across teams). The TOT observations were kept and the other observations for the player were dropped.
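The TOT rule can be sketched in a few lines. This is a minimal Python illustration, not the code used for the analysis; it assumes that `Rk` is shared by all of a player's rows (as the "player identification number" description suggests), and the rows shown are made up.

```python
def keep_season_totals(rows):
    """Keep the TOT row for multi-team players and the only row for everyone else."""
    # A player with a TOT row appeared for more than one team that season.
    multi_team_ids = {row["Rk"] for row in rows if row["Tm"] == "TOT"}
    return [
        row for row in rows
        if row["Tm"] == "TOT" or row["Rk"] not in multi_team_ids
    ]


# Hypothetical rows: player 1 changed teams mid-season, player 2 did not.
rows = [
    {"Rk": 1, "Player": "Player A", "Tm": "TOT", "G": 70},
    {"Rk": 1, "Player": "Player A", "Tm": "AAA", "G": 40},
    {"Rk": 1, "Player": "Player A", "Tm": "BBB", "G": 30},
    {"Rk": 2, "Player": "Player B", "Tm": "CCC", "G": 82},
]
season_rows = keep_season_totals(rows)
print([(row["Player"], row["Tm"]) for row in season_rows])
```

After this step every player contributes exactly one observation, so multi-team players are not counted more than once in the clustering.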
Some variables had to be renamed due to limitations of R, as variable names cannot contain symbols such as % or start with a number:
| Original Name | New Name |
|---|---|
| FG% | FGP |
| 3P% | X3PP |
| 2P% | X2PP |
| eFG% | eFGP |
| FT% | FTP |
| 3P | X3P |
| 3PA | X3PA |
| 2P | X2P |
| 2PA | X2PA |
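The renaming can be expressed as a lookup table plus a quick validity check. The sketch below is in Python for illustration; the mapping is copied from the table above, while the regular expression is only a rough stand-in for the naming rule described (the helper names are hypothetical).

```python
import re

# Mapping copied from the renaming table.
RENAMES = {
    "FG%": "FGP", "3P%": "X3PP", "2P%": "X2PP", "eFG%": "eFGP", "FT%": "FTP",
    "3P": "X3P", "3PA": "X3PA", "2P": "X2P", "2PA": "X2PA",
}

# Rough approximation of the rule: start with a letter, then only
# letters, digits, dots, or underscores.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def rename_columns(columns):
    """Apply the renaming table, leaving unlisted columns unchanged."""
    return [RENAMES.get(name, name) for name in columns]


def invalid_names(columns):
    """Return the column names that break the approximate naming rule."""
    return [name for name in columns if not VALID_NAME.fullmatch(name)]


original = ["Rk", "Player", "FG%", "3P", "3PA", "3P%", "eFG%", "PTS"]
renamed = rename_columns(original)
print(invalid_names(original))  # names that needed renaming
print(invalid_names(renamed))   # empty once the table is applied
```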
Can the dataset be split into separate groups using a multitude of different player statistics?
Can we determine any players who are considered outliers in comparison to their group?
The first step is to determine the number of clusters (“K”) needed.
The plots below show that the ideal number of clusters is 2.
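One common way to choose K is the elbow method: fit k-means for several values of K and look for the point where the within-cluster sum of squares (WSS) stops dropping sharply. The source does not say which method or clustering algorithm produced its plots, so the Python sketch below is an assumption-laden illustration on made-up 2-D points, with a deterministic farthest-first initialization chosen only to keep the example reproducible.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means; returns (centers, within-cluster sum of squares)."""
    # Farthest-first initialization: start at the first point, then keep
    # adding the point that is farthest from its nearest chosen center.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group else centers[i]
            for i, group in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wss = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, wss


# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
wss_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

On this toy data the WSS collapses when moving from one cluster to two and barely changes afterwards, which is the "elbow" pattern that would point to K = 2.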
Principal Component Analysis (PCA) is a tool used to reduce the dimensionality of the dataset. Currently there are 27 variables in the dataset, but this can be reduced to a smaller number of principal components.
The plot below shows how many principal components are needed.
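The number of components is usually read from the share of variance each component explains. As a rough Python sketch (not the procedure used for the plot), the code below standardizes a tiny made-up dataset, builds its correlation matrix, and extracts eigenvalues by power iteration with deflation; the 90% threshold and all function names are assumptions for illustration.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        scaled_cols.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def covariance(rows):
    """Covariance matrix of already-centered rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_eigen(mat, iters=500):
    """Largest eigenvalue and its eigenvector via power iteration."""
    p = len(mat)
    v = [1.0 + i for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return 0.0, v
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v


def explained_variance_ratios(rows):
    """Share of total variance captured by each principal component."""
    mat = covariance(standardize(rows))
    p = len(mat)
    total = sum(mat[i][i] for i in range(p))
    ratios = []
    for _ in range(p):
        lam, v = top_eigen(mat)
        ratios.append(lam / total)
        # Deflation: remove the component just found before finding the next.
        mat = [[mat[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return ratios


def components_needed(ratios, threshold=0.9):
    """Smallest number of components whose cumulative share reaches the threshold."""
    cumulative = 0.0
    for count, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return count
    return len(ratios)


# Toy data: the second column is exactly twice the first (fully redundant),
# and the third column is uncorrelated with both.
data = [(1, 2, 1), (2, 4, -2), (3, 6, 2), (4, 8, -2), (5, 10, 1)]
ratios = explained_variance_ratios(data)
print([round(r, 3) for r in ratios], components_needed(ratios))
```

Because one toy column duplicates another, two components carry essentially all of the variance; the same reading applied to the real 27 variables is what a scree-style plot conveys.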
Local Outlier Factor (LOF) is a density-based outlier detection algorithm: it compares the local density around each observation with the density around its nearest neighbors. It is used here to identify outliers within their respective clusters, given specific criteria.
The outlier criterion that I selected is any player who has a combined z-score higher than 3 for the following variables:
| Outlier Criteria |
|---|
| MP |
| PTS |
| FGP |
| X3PP |
| TRB |
| AST |
| STL |
| BLK |
| TOV |
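The source does not say how the per-variable z-scores are combined. The Python sketch below assumes the simplest reading, a plain sum of z-scores across the listed variables, and flags any player whose sum exceeds 3; the player values are invented and only three of the nine variables are used to keep the example short.

```python
from statistics import mean, stdev

# Variables from the outlier criteria table.
CRITERIA = ["MP", "PTS", "FGP", "X3PP", "TRB", "AST", "STL", "BLK", "TOV"]


def combined_z_scores(players, variables=CRITERIA):
    """Sum of per-variable z-scores for each player (assumed combination rule)."""
    stats = {}
    for var in variables:
        values = [p[var] for p in players]
        stats[var] = (mean(values), stdev(values))
    return [
        sum((p[var] - stats[var][0]) / stats[var][1] for var in variables)
        for p in players
    ]


def flag_outliers(players, threshold=3.0, variables=CRITERIA):
    """True for players whose combined z-score is above the threshold."""
    return [z > threshold for z in combined_z_scores(players, variables)]


# Hypothetical players from a single cluster; the last one stands out.
players = [
    {"MP": 20, "PTS": 8, "AST": 2},
    {"MP": 22, "PTS": 9, "AST": 3},
    {"MP": 21, "PTS": 10, "AST": 2},
    {"MP": 23, "PTS": 9, "AST": 3},
    {"MP": 36, "PTS": 30, "AST": 9},
]
subset = ["MP", "PTS", "AST"]
print(flag_outliers(players, variables=subset))
```

Since outliers are judged relative to their group, the function would be applied to each cluster separately rather than to the full dataset.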
LOF also relies on a set number of neighboring observations to determine outliers, so different neighbor counts will produce different results.
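To make the dependence on the neighbor count concrete, here is a compact Python sketch of the standard LOF definition (k-distance, reachability distance, local reachability density) run on made-up 2-D points with two values of k. It is an illustration only, not the implementation behind the results, and it assumes no duplicate points.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor for every point, using k nearest neighbors."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[k - 1]  # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # Neighborhood: every other point within the k-distance (ties included).
        neighbors.append([j for j in range(n) if j != i and dist[i][j] <= kd])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average neighbor density divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data: a 3x3 grid of points plus one isolated point.
points = [(x, y) for x in range(3) for y in range(3)] + [(6, 6)]
scores_k3 = lof_scores(points, k=3)
scores_k5 = lof_scores(points, k=5)
print(round(scores_k3[-1], 2), round(scores_k5[-1], 2))
```

Scores near 1 indicate a point about as dense as its neighbors, while larger scores indicate isolation. The isolated point scores highest under both settings here, but its score changes with k, which is why the neighbor count has to be tuned.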
The ROC curves below show the performance of LOF detection. The higher the area under the curve (AUC), the better the model performs.
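AUC can be computed directly from scores and labels as the probability that a randomly chosen outlier receives a higher score than a randomly chosen non-outlier, with ties counted as half. The Python sketch below assumes the labels come from the z-score criteria and the scores from LOF, which is one reading of the evaluation; the numbers are made up.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one non-outlier")
    # Count outlier/non-outlier pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 marks a player flagged by the z-score criteria,
# and the scores stand in for LOF values.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of flagged and unflagged players, so comparing AUC across neighbor counts gives a way to pick the setting.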
Can the dataset be split into separate groups using a multitude of different player statistics?
Yes! After determining the ideal number of clusters, the dataset was split into 2 separate clusters.
Can we determine any players who are considered outliers in comparison to their group?
Yes! Outliers were defined by predetermined criteria and identified through LOF procedures.
These conclusions apply only to modern NBA players. Older NBA players played a different type of game, which skews certain statistics differently. If the NBA game shifts noticeably, these conclusions should be scrutinized and retested.
These conclusions should not be used for any other basketball league (e.g., WNBA, NCAA), as each league has a different style of play that may render them invalid.