nba <- nba %>%
distinct(Year, Player, Tm, .keep_all = T)
Variables: PPG, Height, MPG
Response: PPG
Explanatory: Height, MPG
df1 <- nba %>%
mutate(PPG = PTS/G,
MPG = MP/G,
height = ((Feet*12) + Inches)) %>%
select(Year, Player, PPG, MPG, height)
df1 %>%
ggplot(mapping = aes(x = MPG, y = PPG)) +
geom_point() +
geom_smooth(method = lm, formula = (y ~ x + 0)) +
geom_smooth(method = lm, formula = (y ~ poly(x,2)), color = "yellow") +
labs(title = "MPG vs PPG")
df1 %>%
ggplot(mapping = aes(x = height, y = PPG)) +
geom_jitter(width = 0.25, alpha = 0.2) +
geom_smooth(method = lm) +
labs(title = "Height vs PPG")
## `geom_smooth()` using formula = 'y ~ x'
MPG vs PPG
Height vs PPG
cor.test(df1$MPG, df1$PPG)
##
## Pearson's product-moment correlation
##
## data: df1$MPG and df1$PPG
## t = 293.26, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8916839 0.8970319
## sample estimates:
## cor
## 0.8943898
cor.test(df1$MPG, df1$PPG, method = "spearman")
## Warning in cor.test.default(df1$MPG, df1$PPG, method = "spearman"): Cannot
## compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: df1$MPG and df1$PPG
## S = 1.0744e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9352376
Both, Pearson’s and Spearman’s Correlation Coefficient, confirm what we say above. As MPG increases PPG does as well. Both coefficient values are rather high and very significant. The Spearman’s coefficient is more significant and higher due to its ability to show monotonic nonlinear relationships better than Pearson’s. Due to this fact and the visual aspect of the graph I would suggest that the relationship between these two variables is not strictly linear.
cor.test(df1$height, df1$PPG)
##
## Pearson's product-moment correlation
##
## data: df1$height and df1$PPG
## t = -7.3384, df = 21509, p-value = 2.239e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06329555 -0.03663517
## sample estimates:
## cor
## -0.04997426
Here the correlation coefficient is slightly negative, but based on the significance test we have enough evidence to suggest that its not equal to 0. Looking at the graph would suggest otherwise. Or at least that there is a better way to model the relationship between these two variables.
Variables: 2P%, 3P%, FT%
Response: 3P%,
Explanatory: 2P%, FT%
df2_ft <- nba %>%
filter(FTA >= 25,`3PA` >= 25) %>%
select(Year, Player, `FT%`, `3P%`)
df2_2p <- nba %>%
filter(`2PA` >=25, `3PA` >= 25) %>%
select(Year, Player, `2P%`, `3P%`)
df2_2p %>%
ggplot(mapping = aes(x = `2P%`, y = `3P%`)) +
geom_point(alpha = 0.2) +
geom_vline(xintercept = 0.5, color = 'orange') +
geom_hline(yintercept = 1/3, color = 'orange') +
geom_smooth(method = lm) +
labs(title = "2P% vs 3P%")
## `geom_smooth()` using formula = 'y ~ x'
df2_ft %>%
ggplot(mapping = aes(x = `FT%`, y = `3P%`)) +
geom_point(alpha = 0.2) +
geom_vline(xintercept = 0.5, color = 'orange') +
geom_hline(yintercept = 1/3, color = 'orange') +
geom_smooth(method = lm) +
labs(title = "FT% vs 3P%")
## `geom_smooth()` using formula = 'y ~ x'
2P% vs 3P%
FT% vs 3P%
cor.test(df2_2p$`2P%`, df2_2p$`3P%`, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: df2_2p$`2P%` and df2_2p$`3P%`
## t = 1.4941, df = 9477, p-value = 0.1352
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.004787457 0.035466135
## sample estimates:
## cor
## 0.01534556
As expected after looking at the graph the correlation coefficient is not statistically significant.
cor.test(df2_ft$`FT%`, df2_ft$`3P%`, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: df2_ft$`FT%` and df2_ft$`3P%`
## t = 22.847, df = 8538, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2199359 0.2599121
## sample estimates:
## cor
## 0.2400258
The graph so a slightly more linear relationship however there is still a lot of data way below the line of best fit on the plot. As mentioned above breaking the data into more granular groups could provide a betting insight into the correlation between these two variables.
Variables: WS, Height, MPG, PPG, STLPG, TOVPG
Response: WS (win shares)
Explanatory: Height, MPG, PPG, STLPG
df3 <- nba %>%
mutate(Height = (Feet*12) + Inches,
STLPG = STL/G,
TOVPG = TOV/G,
PPG = PTS/G,
MPG = MP/G) %>%
select(Year, Player, WS, Height, STLPG, TOVPG, PPG, MPG)
df3 %>%
ggplot(mapping = aes(x = Height, y = WS)) +
geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
geom_smooth(method = lm) +
labs(title = "Height vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'
df3 %>%
ggplot(mapping = aes(x = STLPG, y = WS)) +
geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
geom_smooth(method = lm) +
labs(title = "Steals Per Game vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'
df3 %>%
ggplot(mapping = aes(x = TOVPG, y = WS)) +
geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
geom_smooth(method = lm) +
labs(title = "Turnovers Per Game vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'
df3 %>%
ggplot(mapping = aes(x = PPG, y = WS)) +
geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
geom_smooth(method = lm) +
labs(title = "Points Per Game vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'
df3 %>%
ggplot(mapping = aes(x = MPG, y = WS)) +
geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
geom_smooth(method = lm) +
labs(title = "Minutes Per Game vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'
Height vs WS
Steals Per Game vs WS
Turnovers Per Game vs WS
Points Per Game vs WS
Minutes Per Game vs WS
cor.test(df3$Height, df3$WS)
##
## Pearson's product-moment correlation
##
## data: df3$Height and df3$WS
## t = 7.6552, df = 21509, p-value = 2.012e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03878935 0.06544386
## sample estimates:
## cor
## 0.05212589
Similar to the other height graphs there is evidence to suggest that the correlation coefficient is not zero but looking at the graph we know that it does not make sense to continue looking at this data in this fashion.
cor.test(df3$STLPG, df3$WS)
##
## Pearson's product-moment correlation
##
## data: df3$STLPG and df3$WS
## t = 107.89, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5838187 0.6011623
## sample estimates:
## cor
## 0.5925591
As expected there is a fairly strong positive correlation here.
cor.test(df3$TOVPG, df3$WS)
##
## Pearson's product-moment correlation
##
## data: df3$TOVPG and df3$WS
## t = 110.72, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5939395 0.6109650
## sample estimates:
## cor
## 0.6025208
As discussed above I expected the correlation coefficient to be negative, but there are some other variables that I believe are having a bigger impact on why this variable appears to be positively correlated with win shares.
cor.test(df3$PPG, df3$WS)
##
## Pearson's product-moment correlation
##
## data: df3$PPG and df3$WS
## t = 175.88, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7624764 0.7734398
## sample estimates:
## cor
## 0.7680143
Confirms that the relationship between PPG and WS has a strong positive correlation.
cor.test(df3$MPG, df3$WS)
##
## Pearson's product-moment correlation
##
## data: df3$MPG and df3$WS
## t = 160.27, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7315937 0.7437753
## sample estimates:
## cor
## 0.7377446
Same as PPG, MPG has a pretty strong positive correlation with WS. This is not surprising as we saw earlier in the first set that PPG and MPG have a positive linear relationship between themselves.
Find 95% confidence interval for sample mean for each of the 3 response variables chosen above.
find_ci <- function(data, samp_size = 30, level = 0.95) {
alpha <- 1-level
samp <- sample(data, size = samp_size)
xbar <- mean(samp)
s <- sd(samp)
tcrit <- qt((1-(alpha/2)), samp_size-1)
lb <- xbar - tcrit*(s/sqrt(samp_size))
ub <- xbar + tcrit*(s/sqrt(samp_size))
line1 <- paste("sample mean: ", xbar)
line2 <- paste("95% confidence interval: (", lb, " , ", ub, ")")
line3 <- paste("population mean: ", mean(data, na.rm = T))
cat(line1,"\n",line2,"\n",line3,"\n")
}
find_ci(df1$PPG, 100, 0.95)
## sample mean: 8.27987039789129
## 95% confidence interval: ( 7.10499442969487 , 9.4547463660877 )
## population mean: 8.09764921765621
Initially I felt like this CI was and population mean was quite low, but a vast majority of players in the data set are “role” players who fall into this range rather than being made up of the big stars that are scoring 25+ a game. I would be interested in comparing the positional means to see if there was a significant difference between positions.
find_ci(df2_2p$`3P%`, 100, 0.95)
## sample mean: 0.3413
## 95% confidence interval: ( 0.329987010115885 , 0.352612989884115 )
## population mean: 0.336995674649225
find_ci(df2_ft$`3P%`, 100, 0.95)
## sample mean: 0.33891
## 95% confidence interval: ( 0.325981373630802 , 0.351838626369198 )
## population mean: 0.337211709601874
These values hovering around 33% is really interesting and a significant reason as to why the 3 point shot has become increasingly popular. If a player can average above 33% then the shots expected value is larger than 1 and therefor has a greater long term value than the average player shooting 2 point shots that have an average of around 47% has an expected value less than 1. I would also argue here that the lower bound is the most important part of this CI for the reasons stated above.