The shot-level data is available here: https://www.kaggle.com/dansbecker/nba-shot-logs (shot_log.csv)

The season totals are available here: https://www.basketball-reference.com/leagues/NBA_2015_totals.html (BBREF_players.csv)

This project attempts to identify patterns and create a framework for evaluating the performance value of basketball players by focusing on their scoring efficiency (ability to generate the most points in the least time).

Load in all required packages

This dataset was posted on Kaggle.com and appears to contain a record of every shot by every player in every game of the 2014-15 season (there was too much data to verify this by hand). This section loads the data and does some basic cleaning. The dataset records the point value of missed shots as well as made ones, so during cleaning I tested two treatments of misses: penalizing them (counting each miss as the negative of its point value) and neutralizing them (setting the point value of every miss to zero). For this project it makes more sense to treat missed shots as zero points, because the cost of a miss to player performance is already reflected in the points-per-shot ratio.
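The two treatments of missed shots can be compared on a toy example. The sketch below is plain Python with made-up rows; the field names `pts_type` and `made` are hypothetical stand-ins, not necessarily the column names in the Kaggle file, and the original notebook's own cleaning code may differ.

```python
# Toy shot records; field names and values are illustrative assumptions.
shots = [
    {"player": "A", "pts_type": 2, "made": True},
    {"player": "A", "pts_type": 3, "made": False},
    {"player": "A", "pts_type": 2, "made": True},
    {"player": "A", "pts_type": 3, "made": True},
]


def shot_value(shot, penalize_misses=False):
    """Return the points credited for one shot attempt."""
    if shot["made"]:
        return shot["pts_type"]
    # A miss counts as the negative of its point value, or as zero.
    return -shot["pts_type"] if penalize_misses else 0


def points_per_shot(shots, penalize_misses=False):
    """Average credited points over all attempts."""
    total = sum(shot_value(s, penalize_misses) for s in shots)
    return total / len(shots)


neutral = points_per_shot(shots)                          # misses count as 0
penalized = points_per_shot(shots, penalize_misses=True)  # misses count as -pts_type
print(neutral, penalized)
```

With misses set to zero, the ratio is simply points scored divided by attempts, so every miss already lowers it; the penalized version subtracts the missed points on top of that.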

The data on in-game performance and shooting tendencies contains a lot of very specific information that could be used to get a high-resolution look at exactly how individuals performed in different situations. Instead of looking for a fine-grained means of quantifying player performance in different situations, this analysis pulls out aggregate results to create a generalized means of ranking players on their ability to generate points, given the limited amount of time they have to handle the ball. Since basketball games are short and high scoring, there is reason to believe that the best (offensive) players should have the highest ratio of points per shot to minutes of ball time. This study looks at the implications of using only this low-resolution artificial ranking system, and possibly compares it to ranking systems that account for more features.

As can be seen from the ggplot2 graphic, more ball time and a lower points-per-shot ratio both result in a lower ranking value. Since taking more shots requires more ball time, the players who tend to have the most ball time and publicity aren't ranked very highly on this metric (Kobe Bryant and LeBron James are both fairly low). The plotly graphic is far more useful and interactive in showing the same information, but unfortunately it is not included in the notebook.
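As a rough illustration of the ranking described above, the following Python sketch computes points per shot divided by minutes of ball time for two made-up players. The function name `efficiency_score`, the field names, and all numbers are assumptions for illustration only; the notebook's exact formula and units may differ.

```python
def efficiency_score(points, shots, ball_minutes):
    """Points per shot divided by minutes of ball time (assumed definition)."""
    if shots == 0 or ball_minutes == 0:
        return None  # undefined for players with no attempts or no ball time
    return (points / shots) / ball_minutes


# Made-up season aggregates for two hypothetical players.
players = {
    "high_usage": {"points": 200, "shots": 200, "ball_minutes": 50.0},
    "low_usage": {"points": 60, "shots": 50, "ball_minutes": 10.0},
}

scores = {name: efficiency_score(**row) for name, row in players.items()}

# Highest score first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy data the high-usage player scores far more total points but lands below the low-usage player, which is the same effect the text notes for high-profile players.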

To get a few more metrics on player performance, the first dataset was joined with data from basketball-reference.com. This dataset contains player-specific information on seasonal performance, including position, age, and games played, much of which could not be obtained from the dataset on in-game shooting. Using both datasets allows for a fuller picture of each player’s performance, and offers a means to check the implications of the rankings and the consistency of the data.
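A join like this usually hinges on matching player names across the two sources. The Python sketch below shows one way it could look, assuming names are the join key and differ only in case and spacing; the field names, the `normalize` helper, and the rows are all hypothetical.

```python
# Per-player summary derived from the shot data (made-up values).
shot_summary = {
    "player a": {"points_per_shot": 1.1},
    "player b": {"points_per_shot": 0.9},
}

# Season totals rows (made-up values; field names are assumptions).
season_totals = [
    {"Player": "Player A", "Pos": "C", "Age": 27, "G": 70},
    {"Player": "Player C", "Pos": "PG", "Age": 24, "G": 55},
]


def normalize(name):
    """Lowercase and collapse whitespace so names compare consistently."""
    return " ".join(name.lower().split())


joined, unmatched = [], []
for row in season_totals:
    key = normalize(row["Player"])
    if key in shot_summary:
        # Combine season-level fields with shot-level aggregates.
        joined.append({**row, **shot_summary[key]})
    else:
        unmatched.append(row["Player"])

print(len(joined), unmatched)
```

Keeping the unmatched names around is a cheap consistency check: a long list suggests the two sources spell names differently.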

This tangent looks at how player performance stacks up when grouping by position, which shows whether the ranking system is dramatically biased towards the skills required for any given position. The ggplot graphic, generated by averaging each metric within position, makes it clear that centers are ranked much higher than other positions when looking at how often they score given the amount of time they handle the ball. The surprise is that centers had a much higher points-per-shot ratio but also less ball time, which contributed to the higher average score and lower ranking.
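The positional averages behind a graphic like this can be sketched as a group-then-average step. The Python below uses made-up rows and hypothetical field names (`pos`, `pps`, `ball_minutes`); it illustrates the aggregation only and does not reproduce the notebook's results.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-player metrics; field names are assumptions.
players = [
    {"pos": "C", "pps": 1.2, "ball_minutes": 10.0},
    {"pos": "C", "pps": 1.0, "ball_minutes": 20.0},
    {"pos": "PG", "pps": 0.9, "ball_minutes": 60.0},
]

# Group player rows by position.
by_position = defaultdict(list)
for player in players:
    by_position[player["pos"]].append(player)

# Average each metric within a position.
position_means = {
    pos: {
        "pps": mean([r["pps"] for r in rows]),
        "ball_minutes": mean([r["ball_minutes"] for r in rows]),
    }
    for pos, rows in by_position.items()
}
print(position_means)
```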

This ggplot graphic shows how the ranking system plays out with respect to a player’s point total, accounting for position and ball time as well. As one would expect, players with high point totals have the highest rankings, and this generally corresponds to more ball time. The graphic also shows that different positions form distinct clusters within the data, something that could be explored further.

The positional rankings form clusters: the rankings for players of the same position are grouped around an expected value (the average), with some standard deviation within the group. Modeling the rankings within each position as a normal distribution yields a set of boundaries for estimating a player’s position from their ranking. For a given ranking, the most confident estimate of position is the one whose distribution has the highest value at that ranking (e.g. if a player’s ranking is less than 1.00, then ‘center’ would be the best estimate for their position).
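The idea of picking the position whose fitted normal distribution is highest at a given ranking can be sketched with Python's standard `statistics.NormalDist`. The ranking samples below are made up (chosen so that centers sit below 1.00, matching the example above), and the function name `most_likely_position` is hypothetical. Comparing densities directly also assumes every position is equally common, which is a simplification.

```python
from statistics import NormalDist

# Made-up ranking values per position, for illustration only.
rankings_by_position = {
    "C": [0.6, 0.8, 0.9, 0.7],
    "PG": [1.4, 1.6, 1.5, 1.7],
}

# Fit one normal distribution (mean and standard deviation) per position.
fitted = {
    pos: NormalDist.from_samples(values)
    for pos, values in rankings_by_position.items()
}


def most_likely_position(ranking):
    """Return the position whose fitted density is highest at this ranking."""
    return max(fitted, key=lambda pos: fitted[pos].pdf(ranking))


print(most_likely_position(0.8), most_likely_position(1.5))
```

The boundary between two positions is the ranking at which their two density curves cross; on either side of it, one position becomes the better guess.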

