NCAA to NBA - Statistical Developments

An empirical study on how certain statistical categories in Basketball develop from the College Level to the NBA Level

Ole Nedderhoff

21.06.2023

Introduction

As the sixth most popular sport in the world (Veroutsos (2022)) and the second most popular sport in the USA (Richter (2022)), Basketball plays a huge role in the international sports business.

Most popular sports in the US in 2022 (Richter (2022))

The NBA and the NCAA in particular are major contributors to this market, as both organizations generate revenues of over one billion US dollars (Gough (2023b)). The NBA is where the most money is made, with an estimated yearly revenue of 10 billion US dollars (Young (2021)). The team with the largest income stream in the 2021/2022 season was the Golden State Warriors, with a revenue of 765 million US dollars (Gough (2023a)). This is plausible, since the Warriors were also the most successful team that season, winning the NBA Championship. More generally, when comparing the league standings with the revenue ranking, a correlation between the two can be identified, as more successful teams with more wins sell more tickets and merchandise. For any NBA team, the goal is therefore to assemble the best possible roster and thereby have the best chance of running a very profitable business. There are several ways to acquire great new players:

  1. You can sign players whose contracts have expired (free agents), which usually costs a lot of money for the best players in the league (up to 50 million US dollars per year (Dunn (2022))).

  2. Another option is to trade some of your own players for players from other teams. A team giving up a great player usually wants a great player in return, however, so many of these deals bring little net improvement.

  3. The third option is to select new players in the yearly draft, where the worst teams get to pick first and the best teams last. This method of building great teams has been the foundation for some of the greatest dynasties in NBA history.

Over the last 20 years, 12 of the 20 Finals MVPs (the best player on the championship-winning team) were drafted by that same team (Wood (2022)). The question, then, is how to determine which players to pick in the draft in order to get great value from them. Sixty players are picked every year, but not all of them go on to have a successful NBA career, as the jump from the amateur college level to the professional level is huge. The question this paper intends to answer is which statistical categories translate best to the NBA: where, and under which circumstances, is the drop-off between an average college career and an average NBA career the smallest? To answer this question, the data set “NBA NCAA Comparisons summary” from data.world was chosen. It contains information on every NBA player since 1947, including their NBA and NCAA statistics.

Theoretical Background

The topic of finding the best possible players in an upcoming draft and determining how to grade them is not new in Basketball or in sports in general. Scouting departments focus entirely on finding the best new players in each year’s draft, and multiple papers have been written on the predictability of an NBA career based on college statistics. Coates and Oguntimein concluded that “Many college level productivity measures do a good job of predicting/explaining NBA career productivity.” (Coates and Oguntimein (2008)) Greene singled out specific statistics that are significant in predicting future NBA success: “Field goal percentage, blocks, and 1st team All-American were significant variables across all three initial analyses” (Greene (2015)). The data set in this paper does not contain information about individual awards, such as 1st team All-American, and is solely focused on individual statistics. Berri, Brook and Fenn determined that predicting an NBA career from college statistics is quite possible, but that NBA executives pay little attention to these predictions and focus more on storylines, such as great team success in a player’s last college season. The authors reason that this recency bias is not relevant to future performance. The one statistical category they bring up in their conclusion is points scored, as this supposedly has the greatest impact on an executive’s decision to draft a player (Berri, Brook, and Fenn (2011)). One factor multiple sources focused on in their research was college experience. It used to be tradition for college players to play at least three or four seasons of amateur Basketball before declaring for the NBA draft.
Over the last 30 years this has changed drastically: very talented young players now usually stay in college for only one season, as they can immediately earn millions of dollars once they join an NBA team. As a result, players with less college experience have, on average, better NBA careers (Zestcott et al. (2020)). Based on this theoretical background, this paper will first concentrate on how field goal percentage, points per game and games played develop, in order to test the findings of the prior researchers; blocks per game, although highlighted by Greene, are not part of the data set used here. Subsequently, other statistical categories will be compared.

Statistical Analysis

Data Understanding

As previously mentioned, the data set used for this research is a .csv file called players.csv from data.world. It has information about every NBA player active between 1947 and 2018. The first step is to look at some basic information about the data set, such as which columns it contains and how many values are missing in each of them.

## X
## active_from
## active_to
## birth_date
## college
## height
## name
## position
## url
## weight
## NBA__3ptapg
## NBA__3ptpct
## NBA__3ptpg
## NBA_efgpct
## NBA_fg.
## NBA_fg_per_game
## NBA_fga_per_game
## NBA_ft.
## NBA_ft_per_g
## NBA_fta_p_g
## NBA_g_played
## NBA_ppg
## NCAA__3ptapg
## NCAA__3ptpct
## NCAA__3ptpg
## NCAA_efgpct
## NCAA_fgapg
## NCAA_fgpct
## NCAA_fgpg
## NCAA_ft
## NCAA_ftapg
## NCAA_ftpg
## NCAA_games
## NCAA_ppg
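
A listing like the one above, together with the number of missing values per column, can be produced with a few base R calls. The following is a minimal sketch on a small made-up data frame: the three rows and their values are hypothetical, and only the column names are taken from the data set. In the actual analysis, `players` would presumably come from something like `read.csv("players.csv")`.

```r
# Toy stand-in for players.csv; all values are invented for illustration.
players <- data.frame(
  name        = c("Player A", "Player B", "Player C"),
  active_from = c(1950, 1990, 2005),
  NBA_ppg     = c(8.1, 12.4, 5.0),
  NBA__3ptpct = c(NA, 0.35, 0.31),
  NCAA_ppg    = c(NA, 18.2, NA)
)

# Column names, as in the listing above
column_names <- colnames(players)

# Number of missing values per column
na_per_column <- colSums(is.na(players))

# Number of rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(players))

print(na_per_column)
print(rows_with_na)
```

Counting NAs per column rather than per row makes it easy to see which groups of columns (here the NCAA and three-point columns) are responsible for most of the gaps.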

Data Preparation

It has to be said that this data set does not contain the statistical categories assists, rebounds, blocks and steals. It only covers the shooting statistics of every player, which will therefore be the focus of this study. The overview of missing values shows that several thousand values are missing in different columns. The NAs in the NBA three-point columns are explained by the fact that the three-point line was not introduced in the NBA until 1979; the affected rows are otherwise still usable for the analysis. The same goes for effective field goal percentage (NBA_efgpct), which weights a made three-pointer more heavily than a made two-pointer and therefore had no meaning before the three-point shot existed. By contrast, there is an overwhelming amount of missing NCAA data. Looking through the data reveals that college statistics are missing for most players who entered the league before 1988. On top of this, many NBA players never played in the NCAA before they came to the NBA: many Europeans choose to stay in Europe for their pre-NBA development, while some highly talented players used to skip the NCAA entirely and were drafted straight out of high school. To prepare the data set for a meaningful analysis, the following choices were made:

  • All rows with a value in the column active_from earlier than 1988 are filtered out.

  • All rows with NA values in any of the NCAA columns, except those concerning three-point shots, are filtered out.

Before the filtering, there were a total of 4576 rows of data; the analysis now focuses on the remaining 1811 rows.
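
The two filter rules can be sketched in base R as follows. This is an illustration on invented toy rows, not the original code; it assumes that the three-point columns can be recognized by the substring `3pt` in their names, as in the column listing.

```r
# Toy stand-in for the full data set; all values are invented for illustration.
players <- data.frame(
  name         = c("A", "B", "C", "D"),
  active_from  = c(1975, 1990, 1995, 2003),
  NCAA_ppg     = c(15.0, 18.2, NA, 11.3),
  NCAA_fgpct   = c(0.48, 0.51, 0.45, 0.47),
  NCAA__3ptpct = c(NA, 0.36, 0.33, NA),
  stringsAsFactors = FALSE
)

# Rule 1: keep only players whose career started in 1988 or later
recent <- players[players$active_from >= 1988, ]

# Rule 2: require complete NCAA data, ignoring the three-point columns
ncaa_cols     <- grep("^NCAA_", colnames(recent), value = TRUE)
required_cols <- ncaa_cols[!grepl("3pt", ncaa_cols)]
keep          <- complete.cases(recent[, required_cols, drop = FALSE])
filtered_players <- recent[keep, ]

print(filtered_players$name)
```

In this toy example, player D stays in the data although his three-point percentage is missing, while player C is removed because a non-three-point NCAA value is missing.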

As discussed in the theoretical background, the first focus will lie on the three statistics points, field goal percentage and games played. Comparing the raw values directly would not be meaningful, since an NBA game (48 minutes) is longer than an NCAA game (40 minutes) and an NBA career usually lasts more seasons than an NCAA career, which is limited to four seasons. Another factor that cannot be overlooked is the era a player plays in, as a different style of Basketball is played in every era. Today’s game is much more focused on the three-point shot, as analytics have made executives and coaches realize that this is the most effective way to play. To limit these distortions, every player is assigned a main decade, namely the decade in which he first started playing, and every statistical category receives a new column containing the player’s value divided by the maximum achieved by any player in that decade.
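
The decade-based scaling described above could look roughly like this. The sketch uses invented values and a hypothetical column name `NBA_ppg_rel` for the scaled result; it assumes that the main decade is derived from `active_from` by integer division.

```r
# Toy data; all values are invented for illustration.
players <- data.frame(
  name        = c("A", "B", "C", "D"),
  active_from = c(1991, 1996, 2004, 2008),
  NBA_ppg     = c(20, 10, 12, 24)
)

# Main decade: the decade in which the player first appeared (1991 -> 1990)
players$decade <- (players$active_from %/% 10) * 10

# Share of the decade maximum, computed separately within each decade
players$NBA_ppg_rel <- ave(
  players$NBA_ppg, players$decade,
  FUN = function(x) x / max(x, na.rm = TRUE)
)

print(players)
```

The same call can be repeated for every NBA and NCAA column, so that each value ends up on a scale from 0 to 1 relative to the best player of the same decade.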

Modeling

Drop-Off Calculation

In the next step, every statistical category is given a new column in which its NBA value is divided by its NCAA value. For example, Player A reached 50 percent of the highest value in points per game in the NCAA and 30 percent of the highest value in points per game in the NBA. His value in the new column “ppg” is then 60 percent (30 / 50), meaning his points-per-game production dropped off by 40 percent. For each of these new columns a summary is computed, and all summaries are stored in the data.frame return_frame. To prevent infinite values, the NBA value is divided by 0.1 whenever the NCAA value is zero.
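
As a worked illustration of this calculation, the sketch below defines a hypothetical helper `drop_off()` that applies the zero guard and reproduces the Player A example. It assumes that the relative values are stored as fractions between 0 and 1; the toy `ratios` data frame is invented and only shows how summaries could be collected in `return_frame`.

```r
# Drop-off ratio: NBA share divided by NCAA share of the decade maximum.
# A zero NCAA value is replaced by 0.1 to avoid dividing by zero,
# following the rule described in the text.
drop_off <- function(nba_rel, ncaa_rel) {
  denominator <- ifelse(ncaa_rel == 0, 0.1, ncaa_rel)
  nba_rel / denominator
}

# Player A from the example: 50 percent in the NCAA, 30 percent in the NBA
player_a <- drop_off(nba_rel = 0.30, ncaa_rel = 0.50)

# A player whose NCAA value is zero falls back to the 0.1 denominator
zero_case <- drop_off(nba_rel = 0.05, ncaa_rel = 0)

# Collect summaries of several ratio columns in one data frame (toy values)
ratios <- data.frame(ppg = c(0.6, 0.4, 0.9), fgpct = c(0.95, 0.88, 1.02))
return_frame <- as.data.frame(sapply(ratios, summary))

print(player_a)
print(return_frame)
```

A value below 1 means production dropped relative to college, a value above 1 means it increased.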

Here are the summaries of all the different statistical categories:

The smallest drop-offs, that is the largest values, are found in the percentage categories (three-point percentage, field goal percentage, free throw percentage). The biggest drop-off is registered in games played, contrary to the theory discussed in the theoretical background. These median values alone, however, do not help to predict which player in the draft will have a smaller statistical drop-off than others. For this, the columns weight, height, position and age when drafted are examined for their effect on the mean of the different drop-off columns. The first step is to create an “age_when_drafted” column. Since all values have to be numeric for the further analysis, the position is converted from (G, G/F, F, F/C, C/F, C) to (1.5, 2.5, 3.5, 4.5, 5).
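
The two derived columns could be created along the following lines. This is a sketch under stated assumptions, not the original code: it assumes that `birth_date` is a string containing a four-digit year, that the draft age can be approximated as `active_from` minus the birth year, and that F/C and C/F share the value 4.5 (the text lists six positions but only five numeric values). The toy rows are invented.

```r
# Toy data; names and values are invented for illustration.
players <- data.frame(
  name        = c("A", "B", "C"),
  birth_date  = c("June 24, 1988", "March 3, 1990", "January 15, 1985"),
  active_from = c(2010, 2009, 2008),
  position    = c("G", "F/C", "C"),
  stringsAsFactors = FALSE
)

# Assumption: every birth_date contains a four-digit year
birth_year <- as.numeric(
  regmatches(players$birth_date, regexpr("[0-9]{4}", players$birth_date))
)

# Approximate draft age as first active season minus birth year
players$age_when_drafted <- players$active_from - birth_year

# Assumption: F/C and C/F are both mapped to 4.5
position_map <- c("G" = 1.5, "G/F" = 2.5, "F" = 3.5,
                  "F/C" = 4.5, "C/F" = 4.5, "C" = 5)
players$position_nr <- unname(position_map[players$position])

print(players[, c("name", "age_when_drafted", "position_nr")])
```

A named lookup vector keeps the mapping explicit and returns NA for any position label that is not listed, which makes unexpected labels easy to spot.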

Correlation Matrix

In the next step, a correlation matrix is produced. This paper looks in particular at the effect of weight, position and age_when_drafted on the other statistical categories.
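
A correlation matrix of this kind can be computed with `cor()` from base R. The sketch below uses randomly generated toy data, so its numbers carry no meaning; the column names follow the ones used elsewhere in this paper, and the choice of Pearson correlation with pairwise deletion is an assumption.

```r
set.seed(1)  # reproducible toy data

# Toy stand-in: three predictors plus two drop-off columns (random values)
n <- 50
info <- data.frame(
  weight           = rnorm(n, mean = 215, sd = 25),
  position_nr      = sample(c(1.5, 2.5, 3.5, 4.5, 5), n, replace = TRUE),
  age_when_drafted = sample(19:24, n, replace = TRUE),
  ppg              = runif(n, 0.1, 1.2),
  fgpct            = runif(n, 0.7, 1.1)
)

# Pearson correlations; pairwise deletion tolerates scattered NAs
cor_matrix <- cor(info, use = "pairwise.complete.obs")

# Keep only predictors (rows) against drop-off columns (columns)
predictors <- c("weight", "position_nr", "age_when_drafted")
outcomes   <- c("ppg", "fgpct")
print(round(cor_matrix[predictors, outcomes], 3))
```

Restricting the printed matrix to predictors against drop-off columns keeps the output readable when many statistical categories are included.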

From this correlation matrix, the following conclusions can be drawn:

  1. Three-Point Attempts per Game shows a weak positive correlation with weight (0.124) and a weak positive correlation with position (0.143). This suggests that players with higher weight and players in the bigger positions tend to have a slightly smaller drop-off in three-point attempts per game.

  2. Three-Point Shooting Percentage has a very weak positive correlation with weight (0.034) and virtually no correlation with position (0.001). These values indicate that weight and position have no meaningful influence on the drop-off in three-point shooting percentage.

  3. Field Goal Percentage is the only statistic that is negatively correlated with all three factors. This indicates that players in smaller positions, lighter players and players who are younger when drafted tend to have a smaller drop-off in their field goal percentage.

  4. Field Goals per Game shows a weak positive correlation with weight (0.086), indicating that players with higher weight tend to have a slightly smaller drop-off in field goals per game.

  5. Field Goal Attempts per Game has a weak positive correlation with weight (0.119) and a weak positive correlation with position (0.088). This suggests that players with higher weight and players in the bigger positions tend to have a slightly smaller drop-off in field goal attempts per game.

  6. Free Throw Percentage has a weak positive correlation with weight (0.056) and a weak positive correlation with position (0.050). This indicates that weight and position have a limited influence on the drop-off of free throw percentage.

  7. Free Throws per Game shows a weak positive correlation with weight (0.072) and a weak positive correlation with position (0.069). This suggests that players with higher weight and players in the bigger positions tend to have a slightly smaller drop-off in free throws per game.

  8. Free Throw Attempts per Game has a weak positive correlation with weight (0.071) and a weak positive correlation with position (0.058). This indicates that players with higher weight and players in the bigger positions tend to have a slightly smaller drop-off in free throw attempts per game.

  9. Number of Games Played has a weak positive correlation with weight (0.094) and a weak positive correlation with position (0.068). This suggests that players with higher weight and players in the bigger positions tend to have a slightly smaller drop-off in games played.

  10. Points per Game has a weak positive correlation with weight (0.098) and a weak positive correlation with position (0.082). This indicates that players with higher weight and players in the bigger positions tend to have a slightly smaller drop-off in points per game.

  11. The draft age has a negative correlation with every single statistical category. This implies that players who are drafted at a younger age tend to have smaller drop-offs in every statistic.

Since most of the correlations are very small, it is hard to say whether there is an actual causal relationship in these cases. The most prominent finding is the confirmation of the previously discussed trend that better players tend to get drafted at a younger age. Games played, points per game and field goals attempted and made show the highest correlation values overall, so the relationship is most visible in these categories.

Cluster Analysis

In the next and final step of the analysis, the focus lies on points per game, as of all the available variables it is the one best suited to describing a player’s production on its own. The columns weight, age_when_drafted and position are again included as the determining variables.

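# Variables used for clustering: points-per-game drop-off plus player attributes
variables <- c("ppg", "weight", "age_when_drafted", "position_nr")
cluster_data <- info[, variables]
# Standardize so that no variable dominates the distance measure
scaled_data <- scale(cluster_data)
k <- 3
# k-means starts from random centers; fix the seed for reproducible clusters
set.seed(42)
kmeans_result <- kmeans(scaled_data, centers = k)
info$cluster <- as.factor(kmeans_result$cluster)
# Mean of each variable per cluster, on the original (unscaled) scale
cluster_means <- aggregate(cluster_data, by = list(info$cluster), FUN = mean)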


library(shiny)
library(ggplot2)

# UI: a drop-down to choose the x-axis variable and an area for the plot
ui <- fluidPage(
  titlePanel("Cluster Analysis"),
  sidebarLayout(
    sidebarPanel(
      selectInput("x_var", "Select Variable for X-axis",
                  choices = colnames(cluster_means)[-1],
                  selected = "age_when_drafted")
    ),
    mainPanel(
      plotOutput("cluster_plot")
    )
  )
)

# Server: scatter plot of the chosen variable against ppg, colored by cluster
server <- function(input, output) {
  output$cluster_plot <- renderPlot({
    x_var <- input$x_var
    # .data[[x_var]] replaces the deprecated aes_string()
    ggplot(info, aes(x = .data[[x_var]], y = ppg, color = cluster)) +
      geom_point() +
      labs(title = "Cluster Analysis",
           x = x_var,
           y = "PPG") +
      theme_grey()
  })
}

shinyApp(ui = ui, server = server)

This cluster analysis shows similar insights as before. With weight on the x-axis, no real trend is visible, as no weight group stands out on average compared with any other. Position tells a similar story, although a slight trend is discernible: there are no large upward outliers at position 2.5 (G/F). A takeaway for NBA teams could therefore be the following: in this data set no G/F has more than doubled his points-per-game production in the NBA, so teams hoping to draft a star player whose production grows in this way might want to stay away from this position. The biggest trend is again connected to the age_when_drafted variable. There are only very few outliers in group one with an age of over 24. As in the correlation matrix, players drafted at a younger age seem to be more capable of increasing their production in the NBA.

Conclusion

After all of this work with the data, its understanding, preparation and modeling, some final conclusions can now be drawn, while also acknowledging some limitations of this research. It was clear from the beginning that the production of an average player is higher in college than in the NBA: only the best college players make it to the NBA at all, so it makes sense that they stand out in college. The main theory that this research supports is that younger players taken in the draft are the ones who turn out to be great in the NBA. The correlation matrix and the cluster analysis both support this thesis, and age when drafted appears to be the only clear indicator in this data set of whether a player will become successful in the NBA. It is of course not a guarantee, as older draftees have shown that there are exceptions to every rule. Weight and position proved to be of little use when trying to predict a player’s future.

Limitations

The first limitation is the absence of the height statistic. The data set stores height as a string in the form “feet-inches”. To convert this string into the metric system, a function was built that splits the string into feet and inches and inserts both values into this formula:

height_cm <- (feet * 12 + inches) * 2.54

Unfortunately, when the code was executed for the entire filtered_players data.frame, only the first value was converted and then assigned to every player, making the value in this column identical for every row. This error could not be fixed, which made it impossible to examine the influence of height on a player’s career.
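
The symptom described here, where every row receives the first player’s value, is typical of using only the first element of the list that `strsplit()` returns. Whether that was the actual cause in this analysis is an assumption. One possible vectorized conversion is sketched below, with a hypothetical helper `height_to_cm()` and invented example strings.

```r
# Convert "feet-inches" strings such as "6-10" to centimetres, vectorized
height_to_cm <- function(height) {
  parts <- strsplit(as.character(height), "-", fixed = TRUE)
  # vapply visits every list element, not just parts[[1]]
  feet   <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
  inches <- vapply(parts, function(p) as.numeric(p[2]), numeric(1))
  (feet * 12 + inches) * 2.54
}

# Invented example values, including a missing height
heights   <- c("6-10", "5-9", "7-0", NA)
height_cm <- height_to_cm(heights)
print(height_cm)
```

Applied to a data frame, this would be used as `filtered_players$height_cm <- height_to_cm(filtered_players$height)`, which yields one converted value per row and keeps missing heights as NA.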

Another limitation is the previously mentioned fact that traditional statistics like assists, rebounds, steals and blocks, as well as advanced statistics such as player efficiency rating, were not included in the data set. It would have been interesting to see how these categories developed from the collegiate to the professional level. Points are the most important statistic in Basketball, but they never tell the full story of how valuable a player actually is.

Bibliography

Berri, David J., Stacey L. Brook, and Aju J. Fenn. 2011. “From College to the Pros: Predicting the NBA Amateur Player Draft.” Journal of Productivity Analysis 35 (February): 25–35. https://doi.org/10.1007/s11123-010-0187-x.
Coates, Dennis, and Babatunde Oguntimein. 2008. “The Length and Success of NBA Careers: Does College Production Predict Professional Outcomes?”
Dunn, Sam. 2022. “Biggest Contracts in NBA History.”
Gough, Christina. 2023a. “National Basketball Association Teams Ranked by Revenue 2021/22 Season.”
———. 2023b. “Revenue of the NCAA from 2012 to 2022, by Segment.”
Greene, Alexander C. 2015. “The Success of NBA Draft Picks: Can College Careers Predict NBA Winners?” https://repository.stcloudstate.edu/stat_etds/4.
Richter, Felix. 2022. “Chart: Which Sports Do Americans Follow? | Statista.” https://www.statista.com/chart/28107/sports-followed-by-americans/.
Veroutsos, Eleni. 2022. “The Most Popular Sports in the World - WorldAtlas.” https://www.worldatlas.com/articles/what-are-the-most-popular-sports-in-the-world.html.
Wood, Robert. 2022. “NBA Finals MVP Winners.”
Young, Jabari. 2021.
Zestcott, Colin A., Jessie Dickens, Noah Bracamonte, Jeff Stone, and C. Keith Harrison. 2020. “One and Done: Examining the Relationship Between Years of College Basketball Experience and Career Statistics in the National Basketball Association.” Journal of Sport and Social Issues 44 (August): 299–315. https://doi.org/10.1177/0193723520919815.