Math Stats Final Project: An Exploration of Basketball Statistics

The goal of my project was to determine whether there are factors that can be used to predict how successful a player will be in the NBA. The factors I tested were the origin of the player (high school, major college conference, mid-major college conference, or international), the statistics of the player in their final pre-NBA season, and the number of years spent in college or an international league. This was worth investigating because it could help NBA general managers and executives decide which players to draft or trade for, and it could help players decide what type of program to attend and for how long.

library(ggplot2)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

The data I utilized was drawn almost exclusively from https://www.basketball-reference.com/. This website contained a complete record of the statistics of all the current NBA players. This included the stats from all their NBA seasons, all their college seasons, and any time they played internationally. Another website I utilized was http://insider.espn.com/nba/hollinger/statistics. There, I gathered all of my data on player efficiency rating (PER). Below is the full data that I gathered.

library(readxl)
## Warning: package 'readxl' was built under R version 3.4.2
Full <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/FinalProject/MathStatsFinalProjectFullData.xlsx")

Next, I filtered my data to only include players who had seen at least 500 minutes of gameplay during the 2017-18 NBA regular season. Per-game statistics built on very few minutes have high variance, and this cutoff keeps those small-sample outliers out of the data. I chose 500 minutes as my cutoff point because that was the number that http://insider.espn.com/nba/hollinger/statistics used to determine whether a player qualified for a PER measurement.

library(readxl)
Filtered <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/FinalProject/FinalProjectFiltered.xlsx")
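The filtering step itself is not shown above, since the filtered data is read from its own spreadsheet. As a minimal sketch of how the same cutoff could be applied in R, the example below uses a small made-up data frame; the names `Full_demo`, `Player`, and `MIN` are assumptions for illustration and may not match the columns in the actual spreadsheet.

```r
# Hypothetical stand-in for the Full data frame.
# The column name MIN (total minutes played) is an assumption.
Full_demo <- data.frame(
  Player = c("A", "B", "C", "D"),
  MIN    = c(1820, 312, 500, 74),
  stringsAsFactors = FALSE
)

# Keep only players with at least 500 total minutes.
Filtered_demo <- subset(Full_demo, MIN >= 500)

Filtered_demo$Player
```

With dplyr loaded, `filter(Full_demo, MIN >= 500)` would express the same condition.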

The first factor that I wanted to test was whether a player’s previous league (high school, major or mid-major college, or international) had an effect on their success in the NBA.

I started by creating a boxplot comparing players’ points per game (ppg) against their previous league.

boxplot(Filtered$`PTS/G`~Filtered$Classification,data=Filtered, main="PPG by Classification", 
   xlab="Classification", ylab="PPG")

This gives us a pretty good idea of the median ppg for each previous league classification, as well as how much the ppg values vary within each classification.

Next, I tested to see if the ppg data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Filtered$`PTS/G`)

qqnorm(Filtered$`PTS/G`)

These plots looked close enough to normal, by my estimation, to run an anova test. The null hypothesis of an anova test is that the means of all the treatments are equal, and the alternate hypothesis is that the means are not all equal. In our case, the null hypothesis is that the mean ppg is the same for every classification. The alternate hypothesis is that at least one classification has a different mean ppg. To run the test, I first needed to set up a linear model. In this case, I set up a model testing how much a player’s ppg is affected by their previous league’s classification.

a <- lm(Filtered$`PTS/G`~Filtered$Classification)
anova(a)
## Analysis of Variance Table
## 
## Response: Filtered$`PTS/G`
##                          Df  Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3    71.2  23.743  0.7553 0.5199
## Residuals               354 11128.4  31.436

The p-value of the test of players’ ppg against previous league classification is 0.5199. This is greater than the significance level of 0.05, so we must fail to reject the null hypothesis. The differences in mean ppg among the previous league classifications are not statistically significant.
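To make the anova table easier to read, here is a short sketch of where its F value and p-value come from. The numbers are copied from the ppg table above; the variable names are only for illustration.

```r
# Values copied from the ppg anova table above.
ms_between <- 23.743   # Mean Sq for Classification
ms_within  <- 31.436   # Mean Sq for Residuals
df_between <- 3
df_within  <- 354

# F is the ratio of the between-group to the within-group mean square.
f_stat <- ms_between / ms_within

# The p-value is the upper-tail probability of the F distribution.
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 4)
round(p_value, 4)
```

Up to rounding, these should match the F value and Pr(>F) columns of the table. An F value below 1 means the classifications differ from each other less than players differ within a classification.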

Next, I created a boxplot comparing players’ assists per game (apg) against their previous league.

boxplot(Filtered$AST~Filtered$Classification,data=Filtered, main="APG by Classification", 
   xlab="Classification", ylab="APG")

This gives us a pretty good idea of the median apg for each previous league classification, as well as how much the apg values vary within each classification.

Next, I tested to see if the apg data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Filtered$AST)

qqnorm(Filtered$AST)

These tests provided evidence against the apg data being normally distributed. Despite this, I still decided to run an anova test. The null hypothesis is that all of the mean apg from each classification are equal. The alternate hypothesis is that not all of the mean apg from each classification are equal. I set up a model testing how much a player’s apg is affected by their previous league’s classification.

b <- lm(Filtered$AST~Filtered$Classification)
anova(b)
## Analysis of Variance Table
## 
## Response: Filtered$AST
##                          Df  Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3    5.28  1.7586  0.5356 0.6581
## Residuals               354 1162.26  3.2832

The p-value of the test of player’s apg against player’s previous league classification is 0.6581. This is greater than the significance level of 0.05, so we must fail to reject the null hypothesis. The difference in the means of player’s apg from different previous league classifications is not statistically significant. The fact that I utilized an anova test, which assumes normality, could have skewed this, but I highly doubt it. I would have liked to do a Spearman Rank correlation test, but since my classification variable was not numeric, that was not a possibility.
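A rank-based option that does work with a categorical grouping variable is the Kruskal-Wallis test, which base R provides as `kruskal.test()`. The sketch below is not part of the original analysis. It runs on simulated, right-skewed values with made-up group labels, only to show the call and what it returns.

```r
set.seed(1)

# Simulated, right-skewed "assists" values for four made-up groups.
# These are NOT the project data.
demo <- data.frame(
  apg   = rexp(80, rate = 0.5),
  class = factor(rep(c("HS", "Major", "MidMajor", "Intl"), each = 20))
)

# Kruskal-Wallis rank sum test: a rank-based analogue of one-way anova
# that does not assume normally distributed data.
kw <- kruskal.test(apg ~ class, data = demo)

kw$statistic   # chi-squared test statistic
kw$parameter   # degrees of freedom = number of groups - 1
kw$p.value
```

On the real data, the call would presumably be `kruskal.test(AST ~ Classification, data = Filtered)`, assuming those column names.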

Next, I created a boxplot comparing players’ rebounds per game (rpg) against their previous league.

boxplot(Filtered$TRB~Filtered$Classification,data=Filtered, main="RPG by Classification", 
   xlab="Classification", ylab="RPG")

This gives us a pretty good idea of the median rpg for each previous league classification, as well as how much the rpg values vary within each classification.

Next, I tested to see if the rpg data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Filtered$TRB)

qqnorm(Filtered$TRB)

These plots looked close enough to normal, by my estimation, to run an anova test. The null hypothesis is that all of the mean rpg from each classification are equal. The alternate hypothesis is that not all of the mean rpg from each classification are equal. I set up a model testing how much a player’s rpg is affected by their previous league’s classification.

c <- lm(Filtered$TRB~Filtered$Classification)
anova(c)
## Analysis of Variance Table
## 
## Response: Filtered$TRB
##                          Df  Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3   32.66  10.887  1.8778  0.133
## Residuals               354 2052.49   5.798

The p-value of the test of players’ rpg against previous league classification is 0.133. This is greater than the significance level of 0.05, so we must fail to reject the null hypothesis. The differences in mean rpg among the previous league classifications are not statistically significant.

Next, I created a boxplot comparing players’ minutes per game (mpg) against their previous league.

boxplot(Filtered$MP~Filtered$Classification,data=Filtered, main="MPG by Classification", 
   xlab="Classification", ylab="MPG")

This gives us a pretty good idea of the median mpg for each previous league classification, as well as how much the mpg values vary within each classification.

Next, I tested to see if the mpg data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Filtered$MP)

qqnorm(Filtered$MP)

These plots looked close enough to normal, by my estimation, to run an anova test. The null hypothesis is that all of the mean mpg from each classification are equal. The alternate hypothesis is that not all of the mean mpg from each classification are equal. I set up a model testing how much a player’s mpg is affected by their previous league’s classification.

d <- lm(Filtered$MP~Filtered$Classification)
anova(d)
## Analysis of Variance Table
## 
## Response: Filtered$MP
##                          Df  Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3   180.5  60.166  1.2649 0.2862
## Residuals               354 16838.5  47.566

The p-value of the test of players’ mpg against previous league classification is 0.2862. This is greater than the significance level of 0.05, so we must fail to reject the null hypothesis. The differences in mean mpg among the previous league classifications are not statistically significant.

Next, I created a boxplot comparing players’ Player Efficiency Rating (PER) against their previous league.

boxplot(Filtered$PER~Filtered$Classification,data=Filtered, main="PER by Classification", 
   xlab="Classification", ylab="PER")

This gives us a pretty good idea of the median PER for each previous league classification, as well as how much the PER values vary within each classification.

Next, I tested to see if the PER data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Filtered$PER)

qqnorm(Filtered$PER)

These plots looked close enough to normal, by my estimation, to run an anova test. The null hypothesis is that all of the mean PER from each classification are equal. The alternate hypothesis is that not all of the mean PER from each classification are equal. I set up a model testing how much a player’s PER is affected by their previous league’s classification.

e <- lm(Filtered$PER~Filtered$Classification)
anova(e)
## Analysis of Variance Table
## 
## Response: Filtered$PER
##                          Df Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3   56.2  18.745  0.9271 0.4277
## Residuals               354 7157.3  20.218

The p-value of the test of players’ PER against previous league classification is 0.4277. This is greater than the significance level of 0.05, so we must fail to reject the null hypothesis. The differences in mean PER among the previous league classifications are not statistically significant.

After conducting all of these tests, none of the statistics I looked at were significantly affected by the player’s previous league classification. Due to this, I would not consider a player’s previous league classification when deciding to draft or trade for them.

The next factor I wanted to test was whether there was a correlation between a player’s performance in their final season prior to the NBA and that player’s performance in their first year in the NBA.

In testing this factor, I looked at the statistics from all the NBA rookies who played at least 500 minutes during the 2017-18 season.

Rookie <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/FinalProject/FinalProjectFilteredRookies.xlsx")

In order to test whether there is a correlation between player’s final pre-NBA performance and Rookie NBA season performance, I performed a series of regression tests.

I started by doing a regression test on a player’s ppg. The null hypothesis of a regression test is that there is no correlation between the two variables being tested while the alternate hypothesis is that there is a correlation between the two variables being tested. In the case of this test, the null hypothesis is that there is no correlation between a player’s final pre-NBA ppg and Rookie season ppg while the alternate hypothesis is that there is a correlation.

plot(Rookie$Collegeppg, Rookie$`PTS/G`, col="blue", main="PPG in NBA Rookie Season Against PPG in Final Season in Previous League", xlab="Final Pre NBA PPG", ylab="Rookie NBA PPG")
f<-lm(Rookie$`PTS/G`~Rookie$Collegeppg)
abline(f)

summary(f)
## 
## Call:
## lm(formula = Rookie$`PTS/G` ~ Rookie$Collegeppg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4199 -2.3057 -0.9033  1.7650 12.5462 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         2.6036     2.2613   1.151   0.2558  
## Rookie$Collegeppg   0.3430     0.1493   2.298   0.0264 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.096 on 44 degrees of freedom
## Multiple R-squared:  0.1071, Adjusted R-squared:  0.08685 
## F-statistic:  5.28 on 1 and 44 DF,  p-value: 0.02638
plot(f)

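The p-value of the regression test I ran was 0.0264, which is less than the significance level of 0.05, so I must reject the null hypothesis. Therefore, the data suggests that there is a correlation between a player’s pre-NBA ppg and Rookie season ppg, although it is a weak one: the R-squared of 0.1071 means pre-NBA ppg explains only about 11% of the variation in Rookie ppg.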
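As a small worked example of how the regression output fits together, the sketch below recomputes the slope’s t value and p-value from the coefficient table for the ppg model. The numbers are copied from the summary above; the variable names are only for illustration.

```r
# Values copied from the ppg regression summary above.
slope     <- 0.3430   # Estimate for Collegeppg
slope_se  <- 0.1493   # Std. Error for Collegeppg
df_resid  <- 44       # residual degrees of freedom
r_squared <- 0.1071   # Multiple R-squared

# t statistic for the null hypothesis that the slope is zero.
t_stat <- slope / slope_se

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)

# In a regression with one predictor, R-squared is the squared
# correlation, so its square root gives the size of the correlation.
r_abs <- sqrt(r_squared)

round(c(t = t_stat, p = p_value, r = r_abs), 4)
```

Up to rounding, the t value and p-value should match the Collegeppg row of the table. The same arithmetic applies to the other regression summaries.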

I then did a regression test on a player’s apg. The null hypothesis is that there is no correlation between a player’s final pre-NBA apg and Rookie season apg while the alternate hypothesis is that there is a correlation.

plot(Rookie$Collegeapg, Rookie$AST, col="blue", main="APG in NBA Rookie Season Against APG in Final Season in Previous League", xlab="Final Pre NBA APG", ylab="Rookie NBA APG")
g<-lm(Rookie$AST~Rookie$Collegeapg)
abline(g)

summary(g)
## 
## Call:
## lm(formula = Rookie$AST ~ Rookie$Collegeapg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7428 -0.7474 -0.1299  0.5631  4.5283 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.1386     0.2989  -0.464    0.645    
## Rookie$Collegeapg   0.7938     0.1009   7.868 6.24e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.119 on 44 degrees of freedom
## Multiple R-squared:  0.5845, Adjusted R-squared:  0.5751 
## F-statistic: 61.91 on 1 and 44 DF,  p-value: 6.244e-10
plot(g)

The p-value of the regression test I ran was 6.24e-10, which is less than the significance level of 0.05, so I must reject the null hypothesis. Therefore, the data suggests that there is a correlation between a player’s pre-NBA apg and Rookie season apg. With an R-squared of 0.5845, pre-NBA apg explains about 58% of the variation in Rookie apg.

I then did a regression test on a player’s rpg. The null hypothesis is that there is no correlation between a player’s final pre-NBA rpg and Rookie season rpg while the alternate hypothesis is that there is a correlation.

plot(Rookie$Collegerpg, Rookie$TRB, col="blue", main="RPG in NBA Rookie Season Against RPG in Final Season in Previous League", xlab="Final Pre NBA RPG", ylab="Rookie NBA RPG")
h<-lm(Rookie$TRB~Rookie$Collegerpg)
abline(h)

summary(h)
## 
## Call:
## lm(formula = Rookie$TRB ~ Rookie$Collegerpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1843 -0.7795 -0.0630  0.7099  3.3659 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.54843    0.46770   1.173    0.247    
## Rookie$Collegerpg  0.49801    0.07447   6.688 3.29e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.243 on 44 degrees of freedom
## Multiple R-squared:  0.5041, Adjusted R-squared:  0.4928 
## F-statistic: 44.72 on 1 and 44 DF,  p-value: 3.287e-08
plot(h)

The p-value of the regression test I ran was 3.29e-08, which is less than the significance level of 0.05, so I must reject the null hypothesis. Therefore, the data suggests that there is a correlation between a player’s pre-NBA rpg and Rookie season rpg. With an R-squared of 0.5041, pre-NBA rpg explains about 50% of the variation in Rookie rpg.

One caveat to these tests is that only the Rookie rebounding data was remotely normally distributed. Strictly speaking, these regression tests assume that the residuals of the model are normally distributed, not the raw data, but a lack of normality there could still have affected the results of the tests.
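One way to probe this caveat is to look at the residuals of each fitted model, since that is where the normality assumption applies. The sketch below is a minimal example on simulated data standing in for the Rookie spreadsheet. The variable names are made up, and the Shapiro-Wilk test is an addition that was not part of the original analysis.

```r
set.seed(42)

# Simulated stand-in for the Rookie data; NOT the project data.
pre_nba <- runif(46, min = 2, max = 12)
rookie  <- 0.5 + 0.5 * pre_nba + rnorm(46, sd = 1.2)
fit     <- lm(rookie ~ pre_nba)

# Residuals: observed values minus the fitted regression line.
res <- resid(fit)

# Visual check: points should fall close to the reference line.
qqnorm(res)
qqline(res)

# Formal check: Shapiro-Wilk test, whose null hypothesis is that
# the residuals come from a normal distribution.
sw <- shapiro.test(res)
sw$p.value
```

For the models above, the same check would be `shapiro.test(resid(f))`, and likewise for `g` and `h`.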

After conducting all of these tests, all three of the statistics I looked at (ppg, apg, and rpg) were significantly associated with the player’s final pre-NBA season performance. While this may seem obvious, I would consider a player’s final pre-NBA season performance when deciding to draft or trade for them.

The last factor that I wanted to test was whether the duration a player played in a league below the NBA had an effect on NBA Rookie season performance.

To begin with, I filtered my data to only include Rookies with 4 or fewer years of experience in a league below the NBA. I did this because there were some players with extensive international experience and I did not want them to skew the data.

Rookie.Filter=filter(Rookie, Collegeyears<4.1)

I started by creating a boxplot comparing the Rookies’ points per game (ppg) against their number of years in a previous league.

boxplot(Rookie.Filter$`PTS/G`~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie PPG by # years in College/International", 
   xlab="# Years in College/International", ylab="Rookie PPG")

This gives us a pretty good idea of the median ppg scored by Rookies for each number of years played in a previous league, as well as how much the ppg values vary within each group.

Next, I tested to see if the ppg data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Rookie.Filter$`PTS/G`)

qqnorm(Rookie.Filter$`PTS/G`)

These tests provided evidence against the ppg data being normally distributed. Despite this, I still decided to run an anova test. The null hypothesis is that all of the mean Rookie ppg from each previous experience level are equal. The alternate hypothesis is that not all of the mean Rookie ppg from each previous experience level are equal. I set up a model testing how much the Rookie’s ppg is affected by their previous experience level.

i <- lm(Rookie.Filter$`PTS/G`~Rookie.Filter$Collegeyears)
anova(i)
## Analysis of Variance Table
## 
## Response: Rookie.Filter$`PTS/G`
##                            Df Sum Sq Mean Sq F value   Pr(>F)   
## Rookie.Filter$Collegeyears  1 149.25 149.249  9.4019 0.003828 **
## Residuals                  41 650.85  15.874                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

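The p-value of the test of Rookies’ ppg against previous experience level is 0.003828. This is less than the significance level of 0.05, so we must reject the null hypothesis. Note that the table shows only 1 degree of freedom for Collegeyears, which means R treated it as a numeric variable. The test is therefore for a linear trend in Rookie ppg across years of experience rather than a comparison of separate group means, and that trend is statistically significant.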
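To compare separate group means rather than a linear trend, the years variable can be wrapped in `factor()`. The sketch below uses simulated data in place of Rookie.Filter (the names `demo`, `years`, and `ppg` are made up) and only shows how the degrees of freedom change between the two models.

```r
set.seed(7)

# Simulated stand-in for Rookie.Filter; NOT the project data.
years <- rep(1:4, times = c(12, 10, 9, 12))
ppg   <- 12 - 1.5 * years + rnorm(length(years), sd = 3)
demo  <- data.frame(years = years, ppg = ppg)

# Numeric predictor: one slope is estimated, so the term has
# 1 degree of freedom and the test is for a linear trend.
trend_fit <- anova(lm(ppg ~ years, data = demo))

# Factor predictor: one mean per group is estimated, so the term has
# (number of groups - 1) degrees of freedom, as in one-way anova.
group_fit <- anova(lm(ppg ~ factor(years), data = demo))

trend_fit$Df
group_fit$Df
```

On the real data, the group-means version would presumably be `lm(Rookie.Filter$`PTS/G` ~ factor(Rookie.Filter$Collegeyears))`.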

Since the Rookie ppg data was not normally distributed, I wanted to also test this hypothesis using the non-parametric Spearman Rank Correlation Test. Here the null hypothesis is that there is no correlation between Rookie ppg and previous experience level while the alternate hypothesis is that there is a correlation.

cor.test( ~ Rookie.Filter$`PTS/G` + Rookie.Filter$Collegeyears, 
         data=Rookie.Filter,
         method = "spearman",
         continuity = FALSE,
         conf.level = 0.95)
## Warning in cor.test.default(x = c(15.8, 10.5, 4.6, 8.2, 20.5, 5.6, 4.2, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Rookie.Filter$`PTS/G` and Rookie.Filter$Collegeyears
## S = 19465, p-value = 0.001482
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.4697518

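The p-value extracted from this test is 0.001482. This is less than the significance level of 0.05, so we must reject the null hypothesis. There is a negative correlation (rho of about -0.47) between Rookie ppg and previous experience level: Rookies with more years in a previous league tended to score fewer points per game.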
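For readers unfamiliar with the Spearman test, the sketch below shows what it computes: the ordinary Pearson correlation applied to the ranks of the data, with tied values sharing an average rank. It uses simulated values rather than the project data, and the variable names are made up.

```r
set.seed(3)

# Simulated stand-in values; NOT the project data.
years <- sample(1:4, 43, replace = TRUE)
ppg   <- 12 - 1.5 * years + rexp(43, rate = 0.3)

# Spearman's rho is Pearson's r computed on the ranks.
rho_builtin <- cor(years, ppg, method = "spearman")
rho_manual  <- cor(rank(years), rank(ppg))

# exact = FALSE asks for the approximate p-value directly, which avoids
# the "Cannot compute exact p-value with ties" warning.
st <- cor.test(years, ppg, method = "spearman", exact = FALSE)

c(rho_builtin, rho_manual, unname(st$estimate))
```

All three values printed at the end should agree. Because the years variable only takes a few distinct values, ties are unavoidable, which is why the warning appears in the output above.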

Next, I created a boxplot comparing the Rookies’ assists per game (apg) against their number of years in a previous league.

boxplot(Rookie.Filter$AST~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie APG by # years in College/International", 
   xlab="# Years in College/International", ylab="Rookie APG")

This gives us a pretty good idea of the median apg of Rookies for each number of years played in a previous league, as well as how much the apg values vary within each group.

Next, I tested to see if the apg data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Rookie.Filter$AST)

qqnorm(Rookie.Filter$AST)

These tests provided evidence against the apg data being normally distributed. Despite this, I still decided to run an anova test. The null hypothesis is that all of the mean Rookie apg from each previous experience level are equal. The alternate hypothesis is that not all of the mean Rookie apg from each previous experience level are equal. I set up a model testing how much the Rookie’s apg is affected by their previous experience level.

j <- lm(Rookie.Filter$AST~Rookie.Filter$Collegeyears)
anova(j)
## Analysis of Variance Table
## 
## Response: Rookie.Filter$AST
##                            Df  Sum Sq Mean Sq F value  Pr(>F)  
## Rookie.Filter$Collegeyears  1  15.418 15.4184  5.9498 0.01913 *
## Residuals                  41 106.248  2.5914                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of the test of Rookies’ apg against previous experience level is 0.01913. This is less than the significance level of 0.05, so we must reject the null hypothesis. Because Collegeyears entered the model as a numeric variable (1 degree of freedom), the significant result is a linear trend in Rookie apg across years of experience rather than a difference between separate group means.

Since the Rookie apg data was not normally distributed, I wanted to also test this hypothesis using the non-parametric Spearman Rank Correlation Test. Here the null hypothesis is that there is no correlation between Rookie apg and previous experience level while the alternate hypothesis is that there is a correlation.

cor.test( ~ Rookie.Filter$AST + Rookie.Filter$Collegeyears, 
         data=Rookie.Filter,
         method = "spearman",
         continuity = FALSE,
         conf.level = 0.95)
## Warning in cor.test.default(x = c(8.2, 1.3, 1.8, 0.7, 3.7, 0.9, 0.8, 1.5, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Rookie.Filter$AST and Rookie.Filter$Collegeyears
## S = 17418, p-value = 0.03955
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.315144

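The p-value extracted from this test is 0.03955. This is less than the significance level of 0.05, so we must reject the null hypothesis. There is a negative correlation (rho of about -0.32) between Rookie apg and previous experience level: Rookies with more years in a previous league tended to record fewer assists per game.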

Next, I created a boxplot comparing the Rookies’ rebounds per game (rpg) against their number of years in a previous league.

boxplot(Rookie.Filter$TRB~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie RPG by # years in College/International", 
   xlab="# Years in College/International", ylab="Rookie RPG")

This gives us a pretty good idea of the median rpg of Rookies for each number of years played in a previous league, as well as how much the rpg values vary within each group.

Next, I tested to see if the rpg data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Rookie.Filter$TRB)

qqnorm(Rookie.Filter$TRB)

These plots looked close enough to normal, by my estimation, to run an anova test. The null hypothesis is that all of the mean Rookie rpg from each previous experience level are equal. The alternate hypothesis is that not all of the mean Rookie rpg from each previous experience level are equal. I set up a model testing how much a Rookie’s rpg is affected by their previous experience level.

k <- lm(Rookie.Filter$TRB~Rookie.Filter$Collegeyears)
anova(k)
## Analysis of Variance Table
## 
## Response: Rookie.Filter$TRB
##                            Df  Sum Sq Mean Sq F value  Pr(>F)  
## Rookie.Filter$Collegeyears  1  18.666 18.6659  6.5428 0.01432 *
## Residuals                  41 116.968  2.8529                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of the test of Rookies’ rpg against previous experience level is 0.01432. This is less than the significance level of 0.05, so we must reject the null hypothesis. Because Collegeyears entered the model as a numeric variable (1 degree of freedom), the significant result is a linear trend in Rookie rpg across years of experience rather than a difference between separate group means.

Next, I created a boxplot comparing the Rookies’ minutes per game (mpg) against their number of years in a previous league.

boxplot(Rookie.Filter$MP~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie MPG by # years in College/International", 
   xlab="# Years in College/International", ylab="Rookie MPG")

This gives us a pretty good idea of the median mpg of Rookies for each number of years played in a previous league, as well as how much the mpg values vary within each group.

Next, I tested to see if the mpg data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Rookie.Filter$MP)

qqnorm(Rookie.Filter$MP)

These tests provided evidence against the mpg data being normally distributed. Despite this, I still decided to run an anova test. The null hypothesis is that all of the mean Rookie mpg from each previous experience level are equal. The alternate hypothesis is that not all of the mean Rookie mpg from each previous experience level are equal. I set up a model testing how much the Rookie’s mpg is affected by their previous experience level.

l <- lm(Rookie.Filter$MP~Rookie.Filter$Collegeyears)
anova(l)
## Analysis of Variance Table
## 
## Response: Rookie.Filter$MP
##                            Df  Sum Sq Mean Sq F value  Pr(>F)   
## Rookie.Filter$Collegeyears  1  332.82  332.82  8.2179 0.00652 **
## Residuals                  41 1660.45   40.50                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of the test of Rookies’ mpg against previous experience level is 0.00652. This is less than the significance level of 0.05, so we must reject the null hypothesis. Because Collegeyears entered the model as a numeric variable (1 degree of freedom), the significant result is a linear trend in Rookie mpg across years of experience rather than a difference between separate group means.

Since the Rookie mpg data was not normally distributed, I wanted to also test this hypothesis using the non-parametric Spearman Rank Correlation Test. Here, the null hypothesis is that there is no correlation between Rookie mpg and previous experience level while the alternate hypothesis is that there is a correlation.

cor.test( ~ Rookie.Filter$MP + Rookie.Filter$Collegeyears, 
         data=Rookie.Filter,
         method = "spearman",
         continuity = FALSE,
         conf.level = 0.95)
## Warning in cor.test.default(x = c(33.7, 24.1, 14.2, 20, 33.4, 14.3, 13.8, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Rookie.Filter$MP and Rookie.Filter$Collegeyears
## S = 18549, p-value = 0.007771
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.4005773

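The p-value extracted from this test is 0.007771. This is less than the significance level of 0.05, so we must reject the null hypothesis. There is a negative correlation (rho of about -0.40) between Rookie mpg and previous experience level: Rookies with more years in a previous league tended to play fewer minutes per game.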

Next, I created a boxplot comparing the Rookies’ Player Efficiency Rating (PER) against their number of years in a previous league.

boxplot(Rookie.Filter$PER~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie PER by # years in College/International", 
   xlab="# Years in College/International", ylab="Rookie PER")

This gives us a pretty good idea of the median PER of Rookies for each number of years played in a previous league, as well as how much the PER values vary within each group.

Next, I tested to see if the PER data was close to being normally distributed using a histogram and a qqnorm chart.

hist(Rookie.Filter$PER)

qqnorm(Rookie.Filter$PER)

These tests provided evidence against the PER data being normally distributed. Despite this, I still decided to run an anova test. The null hypothesis is that all of the mean Rookie PER from each previous experience level are equal. The alternate hypothesis is that not all of the mean Rookie PER from each previous experience level are equal. I set up a model testing how much the Rookie’s PER is affected by their previous experience level.

m <- lm(Rookie.Filter$PER~Rookie.Filter$Collegeyears)
anova(m)
## Analysis of Variance Table
## 
## Response: Rookie.Filter$PER
##                            Df Sum Sq Mean Sq F value  Pr(>F)  
## Rookie.Filter$Collegeyears  1   50.3  50.303  3.6082 0.06455 .
## Residuals                  41  571.6  13.942                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of the test of Rookies’ PER against previous experience level is 0.06455. This is greater than the significance level of 0.05, so we must fail to reject the null hypothesis. Because Collegeyears entered the model as a numeric variable (1 degree of freedom), this means there is no statistically significant linear trend in Rookie PER across years of experience.

Since the Rookie PER data was not normally distributed, I wanted to also test this hypothesis using the non-parametric Spearman Rank Correlation Test. Here, the null hypothesis is that there is no correlation between Rookie PER and previous experience level while the alternate hypothesis is that there is a correlation.

cor.test( ~ Rookie.Filter$PER + Rookie.Filter$Collegeyears, 
         data=Rookie.Filter,
         method = "spearman",
         continuity = FALSE,
         conf.level = 0.95)
## Warning in cor.test.default(x = c(20.03, 18.33, 18, 17.56, 16.74, 16.64, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Rookie.Filter$PER and Rookie.Filter$Collegeyears
## S = 16586, p-value = 0.1026
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.252336

The p-value extracted from this test is 0.1026. This is greater than the significance level of 0.05, so we must fail to reject the null hypothesis. There is no evidence of correlation between Rookie PER and previous experience level.

Based on the anova and correlation tests I ran, all of the statistics in question, with the exception of PER, are associated with the number of years in a previous league. The significant Spearman correlations were all negative, meaning Rookies with fewer years in a previous league tended to post higher ppg, apg, and mpg. This is interesting since PER is a rating based on ppg, apg, rpg, etc. That being said, I would take into consideration the duration of a player’s previous league experience when deciding to draft or trade for them. Furthermore, a player may see this data and consider staying in college or international leagues a different amount of time to optimize success in the NBA.
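Since five statistics were each tested against the same experience variable, one optional extra check is to adjust the p-values for multiple testing. The sketch below applies Holm’s method through base R’s `p.adjust()` to the five p-values copied from the anova tables above. This adjustment was not part of the original analysis, and the vector names are only for illustration.

```r
# p-values copied from the five tests against Collegeyears above.
p_raw <- c(ppg = 0.003828, apg = 0.01913, rpg = 0.01432,
           mpg = 0.00652, PER = 0.06455)

# Holm's step-down adjustment controls the chance of at least one
# false positive across the whole family of tests.
p_holm <- p.adjust(p_raw, method = "holm")

round(p_holm, 4)

# Statistics that remain below 0.05 after adjustment.
names(p_holm)[p_holm < 0.05]
```

Under this adjustment the same four statistics (ppg, apg, rpg, and mpg) stay below 0.05, so the conclusion above does not change.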

As with all sports data, there are almost infinite numbers of questions that could be asked. One I considered testing was if the position of a player had an effect on Rookie season performance. I would also have liked to test data prior to the rule change requiring players to play at least one year in college or internationally prior to becoming eligible for the NBA. The high school experience population would be much larger and could have made a large impact on the results. Lastly, I would have liked to test multiple years’ worth of data to ensure that the current year’s data was not a fluke and I actually found real trends.