The first factor that I wanted to test was whether a player’s previous league (high school, college (major or mid-major), or international) had an effect on their success in the NBA.
I started by creating a boxplot comparing players’ points per game (ppg) against their previous league.
boxplot(Filtered$`PTS/G`~Filtered$Classification,data=Filtered, main="PPG by Classification",
xlab="Classification", ylab="PPG")

This gives us a pretty good idea of the median ppg for each previous league classification, as well as how much the ppg values vary within each classification.
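A boxplot marks each group's median and quartiles rather than its mean, so it can help to print those numbers next to the plot. The following is a minimal sketch on simulated data: the `demo` data frame, its column names, the four classification labels, and every value in it are assumptions for illustration, not the Filtered data.

```r
# Simulated stand-in for the real data (all values are made up).
set.seed(1)
demo <- data.frame(
  Classification = rep(c("High School", "International", "Major", "Mid-Major"),
                       each = 25),
  ppg = round(rexp(100, rate = 1 / 8), 1)
)

# The box shows the median and the interquartile range; the mean is not drawn,
# so it is computed here only for comparison.
group_summary <- aggregate(
  ppg ~ Classification, data = demo,
  FUN = function(x) c(median = median(x), mean = mean(x), iqr = IQR(x))
)
print(group_summary)
```

With skewed data such as per-game scoring, the mean and median of a group can differ noticeably, which is why the two are printed side by side.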
Next, I tested to see if the ppg data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Filtered$`PTS/G`)

qqnorm(Filtered$`PTS/G`)

These plots looked normal enough, by my estimation, to run an ANOVA test. The null hypothesis of an ANOVA test is that the means of all the treatments are equal, and the alternate hypothesis is that the means are not all equal. In our case, the null hypothesis is that the mean ppg is the same for every classification, and the alternate hypothesis is that at least one classification has a different mean ppg. To run the test, I first needed to set up a linear model. In this case, I set up a model testing how much a player’s ppg is affected by their previous league’s classification.
a <- lm(Filtered$`PTS/G`~Filtered$Classification)
anova(a)
## Analysis of Variance Table
##
## Response: Filtered$`PTS/G`
##                          Df  Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3    71.2  23.743  0.7553 0.5199
## Residuals               354 11128.4  31.436
The p-value of the test of players’ ppg against previous league classification is 0.5199. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The difference in mean ppg between players from different previous league classifications is not statistically significant.
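Strictly speaking, the ANOVA normality assumption applies to the model residuals rather than to the raw response. As a minimal sketch of how that check might look, the example below uses simulated data (the `demo` data frame, its `group` and `ppg` columns, and all values are made up for illustration) and also shows one way to read the p-value out of the ANOVA table instead of copying it by hand.

```r
# Simulated stand-in data: four groups of 30 players (values are made up).
set.seed(42)
demo <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), each = 30)),
  ppg = rnorm(120, mean = 8, sd = 5)
)

# One-way ANOVA expressed as a linear model, using data = instead of demo$...
fit <- lm(ppg ~ group, data = demo)

# Check normality on the residuals of the fitted model.
res <- residuals(fit)
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)  # a formal test to go with the visual check

# Pull the p-value for the group term straight from the ANOVA table.
p_anova <- anova(fit)[["Pr(>F)"]][1]
print(p_anova)
```

Passing `data = demo` keeps the formula short, although it changes the labels printed in the ANOVA table compared with writing `demo$ppg ~ demo$group`.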
Next, I created a boxplot comparing players’ assists per game (apg) against their previous league.
boxplot(Filtered$AST~Filtered$Classification,data=Filtered, main="APG by Classification",
xlab="Classification", ylab="APG")

This gives us a pretty good idea of the median apg for each previous league classification, as well as how much the apg values vary within each classification.
Next, I tested to see if the apg data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Filtered$AST)

qqnorm(Filtered$AST)

These plots provided evidence against the apg data being normally distributed. Despite this, I still decided to run an ANOVA test. The null hypothesis is that the mean apg is the same for every classification. The alternate hypothesis is that at least one classification has a different mean apg. I set up a model testing how much a player’s apg is affected by their previous league’s classification.
b <- lm(Filtered$AST~Filtered$Classification)
anova(b)
## Analysis of Variance Table
##
## Response: Filtered$AST
##                          Df  Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3    5.28  1.7586  0.5356 0.6581
## Residuals               354 1162.26  3.2832
The p-value of the test of players’ apg against previous league classification is 0.6581. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The difference in mean apg between players from different previous league classifications is not statistically significant. The fact that I used an ANOVA test, which assumes normality, could have distorted this result, but I highly doubt it. I would have liked to run a Spearman rank correlation test, but since my classification variable is categorical rather than numeric, that was not possible.
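When the grouping variable is categorical, one rank-based option that does not assume normality is the Kruskal-Wallis test, available in base R as `kruskal.test()`. This was not part of the original analysis. The sketch below only shows the call on simulated data: the `demo` data frame, its column names, the classification labels, and all values are assumptions for illustration.

```r
# Simulated, right-skewed stand-in for assists per game (values are made up).
set.seed(7)
demo <- data.frame(
  Classification = factor(rep(c("High School", "International", "Major", "Mid-Major"),
                              each = 20)),
  apg = rexp(80, rate = 1 / 2)
)

# Kruskal-Wallis rank sum test: the null hypothesis is that all groups
# come from the same distribution.
kw <- kruskal.test(apg ~ Classification, data = demo)
print(kw)
```

The result is read the same way as the ANOVA p-value: a value above 0.05 would mean failing to reject the null hypothesis.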
Next, I created a boxplot comparing players’ rebounds per game (rpg) against their previous league.
boxplot(Filtered$TRB~Filtered$Classification,data=Filtered, main="RPG by Classification",
xlab="Classification", ylab="RPG")

This gives us a pretty good idea of the median rpg for each previous league classification, as well as how much the rpg values vary within each classification.
Next, I tested to see if the rpg data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Filtered$TRB)

qqnorm(Filtered$TRB)

These plots looked normal enough, by my estimation, to run an ANOVA test. The null hypothesis is that the mean rpg is the same for every classification. The alternate hypothesis is that at least one classification has a different mean rpg. I set up a model testing how much a player’s rpg is affected by their previous league’s classification.
c <- lm(Filtered$TRB~Filtered$Classification)
anova(c)
## Analysis of Variance Table
##
## Response: Filtered$TRB
##                          Df  Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3   32.66  10.887  1.8778  0.133
## Residuals               354 2052.49   5.798
The p-value of the test of players’ rpg against previous league classification is 0.133. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The difference in mean rpg between players from different previous league classifications is not statistically significant.
Next, I created a boxplot comparing players’ minutes per game (mpg) against their previous league.
boxplot(Filtered$MP~Filtered$Classification,data=Filtered, main="MPG by Classification",
xlab="Classification", ylab="MPG")

This gives us a pretty good idea of the median mpg for each previous league classification, as well as how much the mpg values vary within each classification.
Next, I tested to see if the mpg data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Filtered$MP)

qqnorm(Filtered$MP)

These plots looked normal enough, by my estimation, to run an ANOVA test. The null hypothesis is that the mean mpg is the same for every classification. The alternate hypothesis is that at least one classification has a different mean mpg. I set up a model testing how much a player’s mpg is affected by their previous league’s classification.
d <- lm(Filtered$MP~Filtered$Classification)
anova(d)
## Analysis of Variance Table
##
## Response: Filtered$MP
##                          Df  Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3   180.5  60.166  1.2649 0.2862
## Residuals               354 16838.5  47.566
The p-value of the test of players’ mpg against previous league classification is 0.2862. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The difference in mean mpg between players from different previous league classifications is not statistically significant.
Next, I created a boxplot comparing players’ Player Efficiency Rating (PER) against their previous league.
boxplot(Filtered$PER~Filtered$Classification,data=Filtered, main="PER by Classification",
xlab="Classification", ylab="PER")

This gives us a pretty good idea of the median PER for each previous league classification, as well as how much the PER values vary within each classification.
Next, I tested to see if the PER data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Filtered$PER)

qqnorm(Filtered$PER)

These plots looked normal enough, by my estimation, to run an ANOVA test. The null hypothesis is that the mean PER is the same for every classification. The alternate hypothesis is that at least one classification has a different mean PER. I set up a model testing how much a player’s PER is affected by their previous league’s classification.
e <- lm(Filtered$PER~Filtered$Classification)
anova(e)
## Analysis of Variance Table
##
## Response: Filtered$PER
##                          Df Sum Sq Mean Sq F value Pr(>F)
## Filtered$Classification   3   56.2  18.745  0.9271 0.4277
## Residuals               354 7157.3  20.218
The p-value of the test of players’ PER against previous league classification is 0.4277. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The difference in mean PER between players from different previous league classifications is not statistically significant.
The next factor I wanted to test was whether there was a correlation between a player’s performance in their final season prior to the NBA and that player’s performance in their first year in the NBA.
In testing this factor, I looked at the statistics from all the NBA rookies who played at least 500 minutes during the 2017-18 season.
library(readxl)  # provides read_excel()
Rookie <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/FinalProject/FinalProjectFilteredRookies.xlsx")
To test whether there is a correlation between a player’s final pre-NBA season performance and their Rookie NBA season performance, I performed a series of regression tests.
For ppg, the p-value of the regression test was 0.0264, which is less than the significance level of 0.05, so I reject the null hypothesis of no linear relationship. Therefore, the data suggest that there is a correlation between a player’s pre-NBA ppg and Rookie season ppg.
For apg, the p-value of the regression test was 6.24e-10, which is less than the significance level of 0.05, so I reject the null hypothesis of no linear relationship. Therefore, the data suggest that there is a correlation between a player’s pre-NBA apg and Rookie season apg.
For rpg, the p-value of the regression test was 3.29e-8, which is less than the significance level of 0.05, so I reject the null hypothesis of no linear relationship. Therefore, the data suggest that there is a correlation between a player’s pre-NBA rpg and Rookie season rpg.
One caveat to these tests is that only the Rookie rebounding data was remotely normally distributed. Strictly, the p-values from these regression tests assume normally distributed residuals rather than normally distributed data, but the lack of normality in the data could still have affected the results of the tests.
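The code for these regressions is not shown in the report. As a rough sketch of what one such test might look like, the example below regresses a simulated Rookie statistic on a simulated pre-NBA statistic. The names `pre_ppg`, `rookie_ppg`, and `demo` are hypothetical, the numbers are generated rather than taken from the Rookie data, and the actual analysis may have been set up differently.

```r
# Simulated stand-in data (all values are made up).
set.seed(3)
n <- 40
pre_ppg <- runif(n, min = 5, max = 25)
rookie_ppg <- 2 + 0.3 * pre_ppg + rnorm(n, sd = 3)
demo <- data.frame(pre_ppg, rookie_ppg)

# Simple linear regression of Rookie scoring on pre-NBA scoring.
fit <- lm(rookie_ppg ~ pre_ppg, data = demo)

# p-value for the slope: the null hypothesis is that the slope is zero.
slope_p <- summary(fit)$coefficients["pre_ppg", "Pr(>|t|)"]
print(slope_p)

# The normality assumption behind that p-value concerns the residuals.
res <- residuals(fit)
qqnorm(res)
qqline(res)
```

Checking the residual Q-Q plot after fitting is one way to judge how much the caveat about non-normal data matters for a given model.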
The last factor that I wanted to test was whether the duration a player played in a league below the NBA had an effect on NBA Rookie season performance.
To begin with, I filtered my data to only include Rookies with 4 or fewer years of experience in a league below the NBA. I did this because there were some players with extensive international experience and I did not want them to skew the data.
library(dplyr)  # provides filter()
Rookie.Filter <- filter(Rookie, Collegeyears < 4.1)
I started by creating a boxplot comparing the Rookies’ points per game (ppg) against their number of years in a previous league.
boxplot(Rookie.Filter$`PTS/G`~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie PPG by # years in College/International",
xlab="# Years in College/International", ylab="Rookie PPG")

This gives us a pretty good idea of the median ppg scored by Rookies for each number of years played in a previous league, as well as how much the ppg values vary within each group.
Next, I tested to see if the ppg data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Rookie.Filter$`PTS/G`)

qqnorm(Rookie.Filter$`PTS/G`)

These plots provided evidence against the ppg data being normally distributed. Despite this, I still decided to run an ANOVA test. The null hypothesis is that the mean Rookie ppg is the same at every previous experience level. The alternate hypothesis is that the mean Rookie ppg is not the same at every level. I set up a model testing how much a Rookie’s ppg is affected by their previous experience level.
i <- lm(Rookie.Filter$`PTS/G`~Rookie.Filter$Collegeyears)
anova(i)
## Analysis of Variance Table
##
## Response: Rookie.Filter$`PTS/G`
##                            Df Sum Sq Mean Sq F value   Pr(>F)
## Rookie.Filter$Collegeyears  1 149.25 149.249  9.4019 0.003828 **
## Residuals                  41 650.85  15.874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
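The p-value of the test of Rookies’ ppg against previous experience level is 0.003828. This is less than the significance level of 0.05, so we reject the null hypothesis. Note that Collegeyears entered the model as a numeric variable (1 degree of freedom in the table above), so this F test checks for a linear trend in ppg across years of experience rather than comparing a separate mean for each experience level. That linear trend is statistically significant.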
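One detail in the ANOVA table is worth spelling out: the Collegeyears term has 1 degree of freedom, which is what `lm()` produces for a numeric predictor. Comparing a separate mean for every experience level would require wrapping the variable in `factor()`. The sketch below shows the difference on simulated data: the `demo` data frame, the assumption that years take the values 1 to 4, and all numbers are made up, with only the sample size of 43 chosen to match the residual degrees of freedom in the output.

```r
# Simulated stand-in: 43 rookies with 1 to 4 years of prior experience
# (group sizes and ppg values are made up).
set.seed(11)
demo <- data.frame(
  years = rep(1:4, times = c(15, 10, 8, 10)),
  ppg = rnorm(43, mean = 8, sd = 4)
)

# Numeric predictor: one slope, so the F test is a test of linear trend.
trend_fit <- lm(ppg ~ years, data = demo)
trend_df <- anova(trend_fit)$Df   # 1 and 41

# Factor predictor: one mean per level, the classic one-way ANOVA.
group_fit <- lm(ppg ~ factor(years), data = demo)
group_df <- anova(group_fit)$Df   # 3 and 39

print(trend_df)
print(group_df)
```

Both models are legitimate; they simply answer different questions, so the hypotheses should be worded to match whichever one is fitted.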
Since the Rookie ppg data was not normally distributed, I also wanted to test this hypothesis using the non-parametric Spearman Rank Correlation Test. Here, the null hypothesis is that there is no correlation between Rookie ppg and previous experience level, while the alternate hypothesis is that there is a correlation.
cor.test( ~ Rookie.Filter$`PTS/G` + Rookie.Filter$Collegeyears,
data=Rookie.Filter,
method = "spearman",
continuity = FALSE,
conf.level = 0.95)
## Warning in cor.test.default(x = c(15.8, 10.5, 4.6, 8.2, 20.5, 5.6, 4.2, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Rookie.Filter$`PTS/G` and Rookie.Filter$Collegeyears
## S = 19465, p-value = 0.001482
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4697518
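The p-value from this test is 0.001482. This is less than the significance level of 0.05, so we reject the null hypothesis. The data suggest a negative correlation (rho of about -0.47) between Rookie ppg and previous experience level: Rookies with more years in a previous league tended to score fewer points per game.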
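Spearman's rho is the ordinary Pearson correlation computed on the ranks of the data, which is why it does not depend on normality. The following sketch illustrates this on simulated data (the `years` and `ppg` vectors and their values are made up). It also passes `exact = FALSE` to `cor.test()`, which requests the approximate p-value up front and so avoids the "Cannot compute exact p-value with ties" warning.

```r
# Simulated stand-in data with many ties in years (values are made up).
set.seed(5)
years <- sample(1:4, 43, replace = TRUE)
ppg <- round(12 - 1.5 * years + rnorm(43, sd = 3), 1)

# Spearman's rho from the built-in method...
rho_builtin <- cor(ppg, years, method = "spearman")

# ...and by hand: Pearson correlation of the (tie-averaged) ranks.
rho_manual <- cor(rank(ppg), rank(years))
print(all.equal(rho_builtin, rho_manual))

# The same test as in the report, asking for the approximate p-value directly.
st <- cor.test(ppg, years, method = "spearman", exact = FALSE)
print(st$estimate)
print(st$p.value)
```

The sign of rho gives the direction of the relationship, so it is worth reporting alongside the p-value.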
Next, I created a boxplot comparing the Rookies’ assists per game (apg) against their number of years in a previous league.
boxplot(Rookie.Filter$AST~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie APG by # years in College/International",
xlab="# Years in College/International", ylab="Rookie APG")

This gives us a pretty good idea of the median apg of Rookies for each number of years played in a previous league, as well as how much the apg values vary within each group.
Next, I tested to see if the apg data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Rookie.Filter$AST)

qqnorm(Rookie.Filter$AST)

These plots provided evidence against the apg data being normally distributed. Despite this, I still decided to run an ANOVA test. The null hypothesis is that the mean Rookie apg is the same at every previous experience level. The alternate hypothesis is that the mean Rookie apg is not the same at every level. I set up a model testing how much a Rookie’s apg is affected by their previous experience level.
j <- lm(Rookie.Filter$AST~Rookie.Filter$Collegeyears)
anova(j)
## Analysis of Variance Table
##
## Response: Rookie.Filter$AST
##                            Df  Sum Sq Mean Sq F value  Pr(>F)
## Rookie.Filter$Collegeyears  1  15.418 15.4184  5.9498 0.01913 *
## Residuals                  41 106.248  2.5914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the test of Rookies’ apg against previous experience level is 0.01913. This is less than the significance level of 0.05, so we reject the null hypothesis. Because Collegeyears is numeric, this 1-degree-of-freedom F test checks for a linear trend rather than comparing separate group means; the linear trend in Rookie apg across experience levels is statistically significant.
Since the Rookie apg data was not normally distributed, I also wanted to test this hypothesis using the non-parametric Spearman Rank Correlation Test. Here, the null hypothesis is that there is no correlation between Rookie apg and previous experience level, while the alternate hypothesis is that there is a correlation.
cor.test( ~ Rookie.Filter$AST + Rookie.Filter$Collegeyears,
data=Rookie.Filter,
method = "spearman",
continuity = FALSE,
conf.level = 0.95)
## Warning in cor.test.default(x = c(8.2, 1.3, 1.8, 0.7, 3.7, 0.9, 0.8, 1.5, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Rookie.Filter$AST and Rookie.Filter$Collegeyears
## S = 17418, p-value = 0.03955
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.315144
The p-value from this test is 0.03955. This is less than the significance level of 0.05, so we reject the null hypothesis. The data suggest a negative correlation (rho of about -0.32) between Rookie apg and previous experience level.
Next, I created a boxplot comparing the Rookies’ rebounds per game (rpg) against their number of years in a previous league.
boxplot(Rookie.Filter$TRB~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie RPG by # years in College/International",
xlab="# Years in College/International", ylab="Rookie RPG")

This gives us a pretty good idea of the median rpg of Rookies for each number of years played in a previous league, as well as how much the rpg values vary within each group.
Next, I tested to see if the rpg data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Rookie.Filter$TRB)

qqnorm(Rookie.Filter$TRB)

These plots looked normal enough, by my estimation, to run an ANOVA test. The null hypothesis is that the mean Rookie rpg is the same at every previous experience level. The alternate hypothesis is that the mean Rookie rpg is not the same at every level. I set up a model testing how much a Rookie’s rpg is affected by their previous experience level.
k <- lm(Rookie.Filter$TRB~Rookie.Filter$Collegeyears)
anova(k)
## Analysis of Variance Table
##
## Response: Rookie.Filter$TRB
##                            Df  Sum Sq Mean Sq F value  Pr(>F)
## Rookie.Filter$Collegeyears  1  18.666 18.6659  6.5428 0.01432 *
## Residuals                  41 116.968  2.8529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the test of Rookies’ rpg against previous experience level is 0.01432. This is less than the significance level of 0.05, so we reject the null hypothesis. Because Collegeyears is numeric, this 1-degree-of-freedom F test checks for a linear trend rather than comparing separate group means; the linear trend in Rookie rpg across experience levels is statistically significant.
Next, I created a boxplot comparing the Rookies’ minutes per game (mpg) against their number of years in a previous league.
boxplot(Rookie.Filter$MP~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie MPG by # years in College/International",
xlab="# Years in College/International", ylab="Rookie MPG")

This gives us a pretty good idea of the median mpg of Rookies for each number of years played in a previous league, as well as how much the mpg values vary within each group.
Next, I tested to see if the mpg data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Rookie.Filter$MP)

qqnorm(Rookie.Filter$MP)

These plots provided evidence against the mpg data being normally distributed. Despite this, I still decided to run an ANOVA test. The null hypothesis is that the mean Rookie mpg is the same at every previous experience level. The alternate hypothesis is that the mean Rookie mpg is not the same at every level. I set up a model testing how much a Rookie’s mpg is affected by their previous experience level.
l <- lm(Rookie.Filter$MP~Rookie.Filter$Collegeyears)
anova(l)
## Analysis of Variance Table
##
## Response: Rookie.Filter$MP
##                            Df  Sum Sq Mean Sq F value  Pr(>F)
## Rookie.Filter$Collegeyears  1  332.82  332.82  8.2179 0.00652 **
## Residuals                  41 1660.45   40.50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the test of Rookies’ mpg against previous experience level is 0.00652. This is less than the significance level of 0.05, so we reject the null hypothesis. Because Collegeyears is numeric, this 1-degree-of-freedom F test checks for a linear trend rather than comparing separate group means; the linear trend in Rookie mpg across experience levels is statistically significant.
Since the Rookie mpg data was not normally distributed, I wanted to also test this hypothesis using the non-parametric Spearman Rank Correlation Test. Here, the null hypothesis is that there is no correlation between Rookie mpg and previous experience level while the alternate hypothesis is that there is a correlation.
cor.test( ~ Rookie.Filter$MP + Rookie.Filter$Collegeyears,
data=Rookie.Filter,
method = "spearman",
continuity = FALSE,
conf.level = 0.95)
## Warning in cor.test.default(x = c(33.7, 24.1, 14.2, 20, 33.4, 14.3, 13.8, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Rookie.Filter$MP and Rookie.Filter$Collegeyears
## S = 18549, p-value = 0.007771
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4005773
The p-value from this test is 0.007771. This is less than the significance level of 0.05, so we reject the null hypothesis. The data suggest a negative correlation (rho of about -0.40) between Rookie mpg and previous experience level.
Next, I created a boxplot comparing the Rookies’ Player Efficiency Rating (PER) against their number of years in a previous league.
boxplot(Rookie.Filter$PER~Rookie.Filter$Collegeyears,data=Rookie.Filter, main="Rookie PER by # years in College/International",
xlab="# Years in College/International", ylab="Rookie PER")

This gives us a pretty good idea of the median PER of Rookies for each number of years played in a previous league, as well as how much the PER values vary within each group.
Next, I tested to see if the PER data was close to being normally distributed using a histogram and a qqnorm chart.
hist(Rookie.Filter$PER)

qqnorm(Rookie.Filter$PER)

These plots provided evidence against the PER data being normally distributed. Despite this, I still decided to run an ANOVA test. The null hypothesis is that the mean Rookie PER is the same at every previous experience level. The alternate hypothesis is that the mean Rookie PER is not the same at every level. I set up a model testing how much a Rookie’s PER is affected by their previous experience level.
m <- lm(Rookie.Filter$PER~Rookie.Filter$Collegeyears)
anova(m)
## Analysis of Variance Table
##
## Response: Rookie.Filter$PER
##                            Df Sum Sq Mean Sq F value  Pr(>F)
## Rookie.Filter$Collegeyears  1   50.3  50.303  3.6082 0.06455 .
## Residuals                  41  571.6  13.942
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the test of Rookies’ PER against previous experience level is 0.06455. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. Because Collegeyears is numeric, this 1-degree-of-freedom F test checks for a linear trend rather than comparing separate group means; the linear trend in Rookie PER across experience levels is not statistically significant.
Since the Rookie PER data was not normally distributed, I wanted to also test this hypothesis using the non-parametric Spearman Rank Correlation Test. Here, the null hypothesis is that there is no correlation between Rookie PER and previous experience level while the alternate hypothesis is that there is a correlation.
cor.test( ~ Rookie.Filter$PER + Rookie.Filter$Collegeyears,
data=Rookie.Filter,
method = "spearman",
continuity = FALSE,
conf.level = 0.95)
## Warning in cor.test.default(x = c(20.03, 18.33, 18, 17.56, 16.74, 16.64, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Rookie.Filter$PER and Rookie.Filter$Collegeyears
## S = 16586, p-value = 0.1026
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.252336
The p-value from this test is 0.1026. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The estimated correlation (rho of about -0.25) is not statistically significant, so there is no evidence of a correlation between Rookie PER and previous experience level.