Loading Data

nba <- nba %>%
  distinct(Year, Player, Tm, .keep_all = T)

Set 1

Variables: PPG, Height, MPG

Response: PPG

Explanatory: Height, MPG

Create variables

df1 <- nba %>%
  mutate(PPG = PTS/G,
         MPG = MP/G,
         height = ((Feet*12) + Inches)) %>%
  select(Year, Player, PPG, MPG, height)

Plots

df1 %>%
  ggplot(mapping = aes(x = MPG, y = PPG)) +
  geom_point() +
  geom_smooth(method = lm, formula = (y ~ x + 0)) +
  geom_smooth(method = lm, formula = (y ~ poly(x,2)), color = "yellow") +
  labs(title = "MPG vs PPG")

df1 %>%
  ggplot(mapping = aes(x = height, y = PPG)) +
  geom_jitter(width = 0.25, alpha = 0.2) +
  geom_smooth(method = lm) +
  labs(title = "Height vs PPG")
## `geom_smooth()` using formula = 'y ~ x'

MPG vs PPG

  • This graph looks sort of how I expected it to look. If a player is in the game for more minutes it makes sense that they are likely scoring more points. One interesting thing about this plot is the curvature. Its almost seems that as MPG increase PPG increase exponentially. In the future testing for what kind of relationship best models these two variables would be interesting. One thing is clear though: as MPG increases you expect PPG to increase as well.

Height vs PPG

  • This graph is much less informative than the last one. There is a lot of overlapping due to many players sharing equal heights. It would be really interesting to break this plot down by decade or position. This would help give insights on how height has impacts scoring throughout the years or how it has shifted from position to position. For now though the line of best fit indicates a slight negative linear relationship between height and PPG, but I would take that information to the bank.

Correlation Coefficients and Discussion

cor.test(df1$MPG, df1$PPG)
## 
##  Pearson's product-moment correlation
## 
## data:  df1$MPG and df1$PPG
## t = 293.26, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8916839 0.8970319
## sample estimates:
##       cor 
## 0.8943898
cor.test(df1$MPG, df1$PPG, method = "spearman")
## Warning in cor.test.default(df1$MPG, df1$PPG, method = "spearman"): Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  df1$MPG and df1$PPG
## S = 1.0744e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9352376

Both, Pearson’s and Spearman’s Correlation Coefficient, confirm what we say above. As MPG increases PPG does as well. Both coefficient values are rather high and very significant. The Spearman’s coefficient is more significant and higher due to its ability to show monotonic nonlinear relationships better than Pearson’s. Due to this fact and the visual aspect of the graph I would suggest that the relationship between these two variables is not strictly linear.

cor.test(df1$height, df1$PPG)
## 
##  Pearson's product-moment correlation
## 
## data:  df1$height and df1$PPG
## t = -7.3384, df = 21509, p-value = 2.239e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06329555 -0.03663517
## sample estimates:
##         cor 
## -0.04997426

Here the correlation coefficient is slightly negative, but based on the significance test we have enough evidence to suggest that its not equal to 0. Looking at the graph would suggest otherwise. Or at least that there is a better way to model the relationship between these two variables.

Set 2

Variables: 2P%, 3P%, FT%

Response: 3P%,

Explanatory: 2P%, FT%

Create Variables

df2_ft <- nba %>%
  filter(FTA >= 25,`3PA` >= 25) %>%
  select(Year, Player, `FT%`, `3P%`)
df2_2p <- nba %>%
  filter(`2PA` >=25, `3PA` >= 25) %>%
  select(Year, Player, `2P%`, `3P%`)

Plots

df2_2p %>%
  ggplot(mapping = aes(x = `2P%`, y = `3P%`)) +
  geom_point(alpha = 0.2) +
  geom_vline(xintercept = 0.5, color = 'orange') + 
  geom_hline(yintercept = 1/3, color = 'orange') +
  geom_smooth(method = lm) +
  labs(title = "2P% vs 3P%")
## `geom_smooth()` using formula = 'y ~ x'

df2_ft %>%
  ggplot(mapping = aes(x = `FT%`, y = `3P%`)) +
  geom_point(alpha = 0.2) +
  geom_vline(xintercept = 0.5, color = 'orange') + 
  geom_hline(yintercept = 1/3, color = 'orange') +
  geom_smooth(method = lm) +
  labs(title = "FT% vs 3P%")
## `geom_smooth()` using formula = 'y ~ x'

2P% vs 3P%

  • Here we can see the jumbled mess of data all in one central area of the graph. Because of this the graph isn’t very informative. To counteract this in the future I would look to make it more granular by splitting on group, such as year, decade, position, or even team.

FT% vs 3P%

  • The data here are still very bunched up, but a little less than the previous one. This fact is likely due to PFs and Centers that may be terrible 3 point shoots but decent to above average free throw shooters. Once again splitting the data into group would make this graph more informative.

Correlation Coefficients and Discussion

cor.test(df2_2p$`2P%`, df2_2p$`3P%`, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  df2_2p$`2P%` and df2_2p$`3P%`
## t = 1.4941, df = 9477, p-value = 0.1352
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.004787457  0.035466135
## sample estimates:
##        cor 
## 0.01534556

As expected after looking at the graph the correlation coefficient is not statistically significant.

cor.test(df2_ft$`FT%`, df2_ft$`3P%`, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  df2_ft$`FT%` and df2_ft$`3P%`
## t = 22.847, df = 8538, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2199359 0.2599121
## sample estimates:
##       cor 
## 0.2400258

The graph so a slightly more linear relationship however there is still a lot of data way below the line of best fit on the plot. As mentioned above breaking the data into more granular groups could provide a betting insight into the correlation between these two variables.

Set 3

Variables: WS, Height, MPG, PPG, STLPG, TOVPG

Response: WS (win shares)

Explanatory: Height, MPG, PPG, STLPG

Create Variables

df3 <- nba %>%
  mutate(Height = (Feet*12) + Inches, 
         STLPG = STL/G, 
         TOVPG = TOV/G,
         PPG = PTS/G,
         MPG = MP/G) %>%
  select(Year, Player, WS, Height, STLPG, TOVPG, PPG, MPG)

Plots

df3 %>%
  ggplot(mapping = aes(x = Height, y = WS)) +
  geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
  geom_smooth(method = lm) +
  labs(title = "Height vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'

df3 %>%
  ggplot(mapping = aes(x = STLPG, y = WS)) +
  geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
  geom_smooth(method = lm) +
  labs(title = "Steals Per Game vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'

df3 %>%
  ggplot(mapping = aes(x = TOVPG, y = WS)) +
  geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
  geom_smooth(method = lm) +
  labs(title = "Turnovers Per Game vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'

df3 %>%
  ggplot(mapping = aes(x = PPG, y = WS)) +
  geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
  geom_smooth(method = lm) +
  labs(title = "Points Per Game vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'

df3 %>%
  ggplot(mapping = aes(x = MPG, y = WS)) +
  geom_point(mapping = aes(color = WS>0), alpha = 0.1) +
  geom_smooth(method = lm) +
  labs(title = "Minutes Per Game vs Win Shares")
## `geom_smooth()` using formula = 'y ~ x'

Height vs WS

  • Similar to the graph with height in the first set there is not much information to be gained from this graph. Breaking this down by position vs OWS and DWS would be interesting to see if tall PGs or short Centers are more values to winner than the average size for each position.

Steals Per Game vs WS

  • This is what I expected for this graph to look like for the most part. Steals are valuable in giving your own team a “free” position over the other team. Over the long run more possession = more points.

Turnovers Per Game vs WS

  • This graph is the opposite of what I was originally expecting. Initially I was thinking as your turnovers per game increased this would lead to less win shares as you are giving the ball away to the opponent allowing them more opportunities to score. However, I believe there are some underlying factors to why the graph looks like this. In order to accrue more turnovers per game over a period of time that means you are playing more minutes. If you are playing more minutes then you are likely a valuable member of the team and doing other positive things that outweight the negative value that your turnovers bring.

Points Per Game vs WS

  • Pretty straight forward here. More points = more wins. If a player is scoring a lot of points they have a big impact on the success of the team.

Minutes Per Game vs WS

  • As discussed in the turnovers graph if the coach is putting you on the court for more minutes it is likely that you are doing something make an impact of whether the teams wins or not.

Correlation Coefficients and Discussion

cor.test(df3$Height, df3$WS)
## 
##  Pearson's product-moment correlation
## 
## data:  df3$Height and df3$WS
## t = 7.6552, df = 21509, p-value = 2.012e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03878935 0.06544386
## sample estimates:
##        cor 
## 0.05212589

Similar to the other height graphs there is evidence to suggest that the correlation coefficient is not zero but looking at the graph we know that it does not make sense to continue looking at this data in this fashion.

cor.test(df3$STLPG, df3$WS)
## 
##  Pearson's product-moment correlation
## 
## data:  df3$STLPG and df3$WS
## t = 107.89, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5838187 0.6011623
## sample estimates:
##       cor 
## 0.5925591

As expected there is a fairly strong positive correlation here.

cor.test(df3$TOVPG, df3$WS)
## 
##  Pearson's product-moment correlation
## 
## data:  df3$TOVPG and df3$WS
## t = 110.72, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5939395 0.6109650
## sample estimates:
##       cor 
## 0.6025208

As discussed above I expected the correlation coefficient to be negative, but there are some other variables that I believe are having a bigger impact on why this variable appears to be positively correlated with win shares.

cor.test(df3$PPG, df3$WS)
## 
##  Pearson's product-moment correlation
## 
## data:  df3$PPG and df3$WS
## t = 175.88, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7624764 0.7734398
## sample estimates:
##       cor 
## 0.7680143

Confirms that the relationship between PPG and WS has a strong positive correlation.

cor.test(df3$MPG, df3$WS)
## 
##  Pearson's product-moment correlation
## 
## data:  df3$MPG and df3$WS
## t = 160.27, df = 21509, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7315937 0.7437753
## sample estimates:
##       cor 
## 0.7377446

Same as PPG, MPG has a pretty strong positive correlation with WS. This is not surprising as we saw earlier in the first set that PPG and MPG have a positive linear relationship between themselves.

Confidence Intervals

Find 95% confidence interval for sample mean for each of the 3 response variables chosen above.

find_ci <- function(data, samp_size = 30, level = 0.95) {
  alpha <- 1-level
  samp <- sample(data, size = samp_size)
  xbar <- mean(samp)
  s <- sd(samp)
  tcrit <- qt((1-(alpha/2)), samp_size-1)
  lb <- xbar - tcrit*(s/sqrt(samp_size))
  ub <- xbar + tcrit*(s/sqrt(samp_size))
  line1 <- paste("sample mean: ", xbar)
  line2 <- paste("95% confidence interval: (", lb, " , ", ub, ")")
  line3 <- paste("population mean: ", mean(data, na.rm = T))
  cat(line1,"\n",line2,"\n",line3,"\n")
}

PPG (Points Per Game)

find_ci(df1$PPG, 100, 0.95)
## sample mean:  8.27987039789129 
##  95% confidence interval: ( 7.10499442969487  ,  9.4547463660877 ) 
##  population mean:  8.09764921765621

Initially I felt like this CI was and population mean was quite low, but a vast majority of players in the data set are “role” players who fall into this range rather than being made up of the big stars that are scoring 25+ a game. I would be interested in comparing the positional means to see if there was a significant difference between positions.

3P% (3-Point Percentage)

find_ci(df2_2p$`3P%`, 100, 0.95)
## sample mean:  0.3413 
##  95% confidence interval: ( 0.329987010115885  ,  0.352612989884115 ) 
##  population mean:  0.336995674649225
find_ci(df2_ft$`3P%`, 100, 0.95)
## sample mean:  0.33891 
##  95% confidence interval: ( 0.325981373630802  ,  0.351838626369198 ) 
##  population mean:  0.337211709601874

These values hovering around 33% is really interesting and a significant reason as to why the 3 point shot has become increasingly popular. If a player can average above 33% then the shots expected value is larger than 1 and therefor has a greater long term value than the average player shooting 2 point shots that have an average of around 47% has an expected value less than 1. I would also argue here that the lower bound is the most important part of this CI for the reasons stated above.

WS (Win Shares)

find_ci(df3$WS, 100, 0.95)
## sample mean:  2.012 
##  95% confidence interval: ( 1.43527385597427  ,  2.58872614402573 ) 
##  population mean:  2.31958067965227

This is an interesting result. If a team consists of 13 average players you would expect approximately 30 wins. Going to look at the makeup of super successful teams that are way over this mark and the makeup of team that hover around or below this mark would be interesting to see how the great teams are getting such as edge over the “average” team.