We’ll start by reading in three tables of WNBA team data covering the 2000-2016 seasons:
rec <- read.csv('/home/rstudioshared/shared_files/data/wnba_records.csv')
tm <- read.csv('/home/rstudioshared/shared_files/data/wnba_teams.csv')
opp <- read.csv('/home/rstudioshared/shared_files/data/wnba_opponents.csv')
rec has the record for each team-season. The tm data contains statistics for every team, while the opp table has the same statistics for their opponents. Use View() to take a look at these data sets. You can see, for instance, that in 2016 the Sky scored the most points (2930) but only went 18-16, while the Sparks, the eventual champions, allowed the fewest points (2580) and went 26-8.
Does defense win games? Which statistics matter the most? We can join these data sets, create scatter plots, find best-fit lines, and try to answer these and other questions.
Let’s get fancy and join all three tables by embedding one join within another. See if you can figure out how this works. [Note: you’ll get a warning message that you can safely ignore.]
library(dplyr)
full <- left_join(left_join(rec, tm, by=c("Season", "Team")), opp, by=c("Season", "Team"))
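If the nesting is hard to follow, it may help to run the same pattern on a tiny example. The sketch below uses three made-up tables (rec_toy, tm_toy and opp_toy are hypothetical stand-ins, and their columns and values are assumptions, not the real WNBA files) to show how the inner join runs first and how columns that share a name pick up suffixes.

```r
library(dplyr)

# Toy stand-ins for rec, tm and opp; all values are invented for illustration
rec_toy <- data.frame(Season = c(2015, 2016), Team = c("AAA", "AAA"),
                      W = c(20, 18))
tm_toy  <- data.frame(Season = c(2015, 2016), Team = c("AAA", "AAA"),
                      G = c(34, 34), PTS = c(2700, 2650))
opp_toy <- data.frame(Season = c(2015, 2016), Team = c("AAA", "AAA"),
                      G = c(34, 34), PTS = c(2600, 2640))

# The inner left_join() runs first (rec_toy + tm_toy);
# the outer left_join() then adds opp_toy to that result
toy_full <- left_join(left_join(rec_toy, tm_toy, by = c("Season", "Team")),
                      opp_toy, by = c("Season", "Team"))

# Non-key columns that appear on both sides of a join get .x and .y suffixes
names(toy_full)
```

In this toy case G and PTS appear in both tm_toy and opp_toy, so the result carries G.x and PTS.x (team) alongside G.y and PTS.y (opponent). The real tables may share more or fewer column names, so check names(full) rather than relying on this sketch.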
Now take a look at the full data set, paying special attention to how the column names have been altered.
Let’s add a couple of columns that might be useful, points per game and points allowed per game:
full <- full %>% mutate(PPG = PTS.x/G.x, PAPG = PTS.y/G.y)
Let’s look at winning percentage versus points scored per game and versus points allowed per game:
library(ggplot2)
ggplot(full, aes(PPG, Wpercent))+geom_point()+geom_smooth(method="lm")
ggplot(full, aes(PAPG, Wpercent))+geom_point()+geom_smooth(method="lm")
It might be hard to see the relative strengths of these associations in the graphs. We can be more precise by looking at correlations and finding best-fit lines:
cor(full$PPG, full$Wpercent)
cor(full$PAPG, full$Wpercent)
lm(Wpercent ~ PPG, data=full)
lm(Wpercent ~ PAPG, data=full)
According to these best-fit lines, what is the expected winning percentage for a team that scores 80 points per game?
According to these best-fit lines, what is the expected winning percentage for a team that gives up 80 points per game?
According to these best-fit lines, how much does a team’s winning percentage improve by virtue of scoring an additional 10 points per game?
According to these best-fit lines, how much does a team’s winning percentage improve by virtue of allowing 10 fewer points per game?
Are you convinced that either offense or defense is more important? If so, which one and why? If not, what else should be investigated?
According to these best-fit lines, how many additional points per game does a team need to improve its winning percentage by 10%?
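Questions like these come down to reading the intercept and slope off a fitted line. Here is a minimal sketch of the mechanics on made-up numbers (the toy data frame is hypothetical, and it assumes Wpercent is stored as a fraction between 0 and 1); the same calls should work on the models fitted to full.

```r
# Made-up team-seasons, for illustration only (not the WNBA data)
toy <- data.frame(PPG      = c(70, 73, 76, 79, 82, 85),
                  Wpercent = c(0.35, 0.44, 0.47, 0.55, 0.58, 0.66))

fit <- lm(Wpercent ~ PPG, data = toy)

# Pull the intercept and slope out of the fitted model
intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["PPG"])

# Expected winning percentage at 80 PPG, computed two equivalent ways
by_hand    <- intercept + slope * 80
by_predict <- unname(predict(fit, newdata = data.frame(PPG = 80)))

# Change in expected winning percentage for 10 more points per game
gain_per_10 <- slope * 10

# Points per game needed to raise the expected winning percentage by 0.10
ppg_for_10pct <- 0.10 / slope
```

If Wpercent in the real data is on a 0-100 scale instead, the last line would use 10 rather than 0.10.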
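Let’s create several more columns with each team’s per-game advantage (or, in some cases, disadvantage) in blocks, steals, assists, rebounds (offensive, defensive and total) and turnovers. Note that a positive TOVadv means the team committed more turnovers than its opponents.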
full <- full %>% mutate(
  BLKadv = (BLK.x - BLK.y)/G.x,
  STLadv = (STL.x - STL.y)/G.x,
  ASTadv = (AST.x - AST.y)/G.x,
  ORBadv = (ORB.x - ORB.y)/G.x,
  DRBadv = (DRB.x - DRB.y)/G.x,
  TRBadv = (TRB.x - TRB.y)/G.x,
  TOVadv = (TOV.x - TOV.y)/G.x
)
Use some combination of plots, correlations and linear models (lm) to find out which statistical advantage is most associated with winning games.
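Checking the columns one at a time works, but it can be quicker to compute all the correlations in one pass. The sketch below shows one way to do that on made-up values (the toy data frame is hypothetical and includes only three of the advantage columns); swapping in full and the complete list of column names is left to you.

```r
# Made-up values, for illustration only; column names mirror those created above
toy <- data.frame(
  Wpercent = c(0.30, 0.45, 0.50, 0.62, 0.70),
  BLKadv   = c(-0.5, 0.2, -0.1, 0.4, 0.6),
  STLadv   = c(0.3, -0.4, 0.1, 0.2, 0.5),
  TOVadv   = c(1.2, 0.4, -0.2, -0.6, -1.0)
)

adv_cols <- c("BLKadv", "STLadv", "TOVadv")

# Correlation of each advantage column with winning percentage
adv_cor <- sapply(toy[adv_cols], function(col) cor(col, toy$Wpercent))

# Rank by absolute value, since a strong negative association also matters
adv_cor[order(abs(adv_cor), decreasing = TRUE)]
```

Sorting by absolute value keeps a column like TOVadv, where a larger value is bad for the team, from being overlooked just because its correlation is negative.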
Can you find a combination of points scored per game and points allowed per game that has a considerably higher correlation with winning percentage than either statistic on its own? Try to do so.
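If you get stuck, here are two possible approaches, sketched on made-up numbers (the toy data frame and the PDIFF column name are hypothetical): build a single combined statistic by hand, or let lm() choose the weights for both predictors and correlate its fitted values with the outcome.

```r
# Made-up team-seasons, for illustration only (not the WNBA data)
toy <- data.frame(
  PPG      = c(75, 80, 78, 84, 72, 81),
  PAPG     = c(80, 76, 79, 77, 74, 85),
  Wpercent = c(0.38, 0.60, 0.47, 0.68, 0.44, 0.40)
)

# Option 1: a single combined statistic, the average scoring margin
toy$PDIFF <- toy$PPG - toy$PAPG
cor_diff  <- cor(toy$PDIFF, toy$Wpercent)

# Option 2: let lm() choose the weights on PPG and PAPG
fit_both <- lm(Wpercent ~ PPG + PAPG, data = toy)
cor_both <- cor(fitted(fit_both), toy$Wpercent)

# For comparison: each statistic on its own
cor_ppg  <- cor(toy$PPG, toy$Wpercent)
cor_papg <- cor(toy$PAPG, toy$Wpercent)
```

The correlation from the two-predictor model can never be lower in magnitude than either single-statistic correlation; whether it is considerably higher on the real data is the thing to check.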
How do Offensive Rating (ORtg) and Defensive Rating (DRtg) compare to PPG and PAPG in their ability to predict winning percentages? If you have time, look up how these ratings are calculated.