Before we dive into the data, let’s simulate the bean counting we did on Friday. As you may recall, we had a sample of 7 beans: 4 black and 3 white. We are interested in what this might tell us about the population of beans from which these 7 beans were selected. In other words, the rate of black beans we observed, 4/7, is our statistic, and we want to estimate the rate in the population, the parameter.
Since we’re interested in the proportion of black beans in the population, I’ll denote black beans by 1’s and white beans by 0’s and create a vector, beans, to represent the 7 beans we observed. The second line of code below samples 7 beans, with replacement, from that vector.
beans <- c(1,1,1,1,0,0,0)
sample(beans, 7, replace=TRUE)
## [1] 1 0 0 1 1 1 1
We can then count the number of black beans in our sample as:
sum(sample(beans, 7, replace=TRUE))
## [1] 5
Of course, we want to do this many times (even more than the 30 times we did it as a class on Friday). We can create another vector, bean.count, to store our counts of black beans, and then write a for loop that samples and counts beans 1000 times, placing the results into our bean.count vector:
bean.count <- rep(NA, 1000)
for (i in 1:1000){
  bean.count[i] <- sum(sample(beans, 7, replace=TRUE))
}
Here are our results:
table(bean.count)
## bean.count
## 0 1 2 3 4 5 6 7
## 2 24 92 219 292 263 90 18
plot(table(bean.count))
We might also be interested in the 10th and 90th percentile outcomes (or other percentiles), which we can calculate as follows:
quantile(bean.count, c(0.10, 0.90))
## 10% 90%
## 2 6
One of our criteria for assessing the competitiveness of sports leagues was whether the same teams that won in the past keep winning. To compare leagues, I gathered data on NFL, WNBA, NBA, and MLB teams, both halfway through the regular season and after the regular season was complete.
Let’s start by loading that data along with a couple of our favorite packages:
library(dplyr);library(ggplot2)
nfl8 <- read.csv('/home/rstudioshared/shared_files/data/team_season_halves/nfl8.csv')
nfl16 <- read.csv('/home/rstudioshared/shared_files/data/team_season_halves/nfl16.csv')
wnba17 <- read.csv('/home/rstudioshared/shared_files/data/team_season_halves/wnba17.csv')
wnba34 <- read.csv('/home/rstudioshared/shared_files/data/team_season_halves/wnba34.csv')
nba41 <- read.csv('/home/rstudioshared/shared_files/data/team_season_halves/nba41.csv')
nba82 <- read.csv('/home/rstudioshared/shared_files/data/team_season_halves/nba82.csv')
mlb81 <- read.csv('/home/rstudioshared/shared_files/data/team_season_halves/mlb81.csv')
mlb162 <- read.csv('/home/rstudioshared/shared_files/data/team_season_halves/mlb162.csv')
Let’s start with NFL teams. We’ll need to join the data on NFL teams from 8 games into the season with the data from the full 16-game season. It’s worth taking a look at our data at each step.
head(nfl8)
head(nfl16)
nfl <- left_join(nfl8, nfl16, by=c("season", "team"))
head(nfl)
Next, let’s calculate winning percentages for the first and second halves of the season (after removing any rows that didn’t match up):
nfl <- nfl %>% filter(!is.na(wins.y)) %>%
  mutate(wins.z = wins.y - wins.x,        # second-half wins = full-season wins minus first-half wins
         losses.z = losses.y - losses.x,  # second-half losses
         win.percent1 = wins.x/(wins.x + losses.x),
         win.percent2 = wins.z/(wins.z + losses.z))
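As before, it’s worth glancing at the data to confirm the new columns came out as expected:
head(nfl)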
It’s generally a good idea to plot data before calculating summary statistics, so here’s a plot:
nfl %>% group_by(win.percent1, win.percent2) %>%
summarize(num.teams=length(team)) %>%
ggplot(aes(win.percent1, win.percent2, size=num.teams))+
geom_point() + geom_smooth(method="lm") + labs(x="1st Half W%", y="2nd Half W%")
and here’s the correlation between first and second half winning percentages:
nfl %>% summarize(cor(win.percent1, win.percent2))
If we’re interested in how much regression toward the mean to use when predicting second-half winning percentages, we can calculate it using the formula:
\[ Regression\ Opps\ = n \cdot \frac{1-R}{R}\]
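Here n is the number of games used to compute the observed winning percentage (8 for our NFL half seasons) and R is the correlation between first- and second-half winning percentages. As a quick sanity check with a hypothetical correlation of R = 0.4 (an assumed value for illustration, not one from our data), the formula says to regress as if we had added 12 games of league-average play:
# hypothetical R = 0.4 with an n = 8 game half season (illustrative values only)
8*(1 - 0.4)/0.4
## [1] 12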
r <- as.numeric(nfl %>% summarize(cor(win.percent1, win.percent2)))
8*(1-r)/r
But, how much confidence should we have in this number?
The correlation we found might be higher or lower than the “true” correlation, just as our proportion of black beans may be higher or lower than the population proportion. To estimate the uncertainty in this number, we can use bootstrapping.
There are 536 rows in our NFL data set. Let’s sample 536 of them, with replacement, calculate the correlation in that resampled data, and calculate a new number of regression games based on that sample:
nrow(nfl)
rows <- sample(1:nrow(nfl), nrow(nfl), replace=TRUE)
r <- as.numeric(nfl[rows, ] %>% summarize(cor(win.percent1, win.percent2)))
8*(1-r)/r
Of course, just as with our beans, we don’t want to do this only once; we want to do it many times. Let’s repeat this procedure 1000 times, keeping track of the number of regression games we calculate each time.
regression.games <- rep(NA, 1000)
for (i in 1:1000){
  rows <- sample(1:nrow(nfl), nrow(nfl), replace=TRUE)
  r <- as.numeric(nfl[rows, ] %>% summarize(cor(win.percent1, win.percent2)))
  regression.games[i] <- 8*(1-r)/r
}
Finally, we can look at the distribution of regression games calculated from our bootstrapped samples and find the 10th, 50th, and 90th percentile results:
hist(regression.games)
quantile(regression.games, c(0.1, 0.5, 0.9))
Use this code, making edits where needed, to find the number of regression games for each league and to complete your chart.
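For example, here is a sketch of those edits for the WNBA, assuming the wnba17 and wnba34 files have the same season, team, wins, and losses columns as the NFL files. The join now uses the 17- and 34-game files, and n becomes 17, the number of games in the WNBA half season:
wnba <- left_join(wnba17, wnba34, by=c("season", "team")) %>%
  filter(!is.na(wins.y)) %>%
  mutate(wins.z = wins.y - wins.x, losses.z = losses.y - losses.x,
         win.percent1 = wins.x/(wins.x + losses.x),
         win.percent2 = wins.z/(wins.z + losses.z))

regression.games <- rep(NA, 1000)
for (i in 1:1000){
  rows <- sample(1:nrow(wnba), nrow(wnba), replace=TRUE)
  r <- as.numeric(wnba[rows, ] %>% summarize(cor(win.percent1, win.percent2)))
  regression.games[i] <- 17*(1-r)/r  # n = 17 games in the WNBA half season
}
quantile(regression.games, c(0.1, 0.5, 0.9))
The same pattern applies to the NBA (nba41 and nba82, with n = 41) and to MLB (mlb81 and mlb162, with n = 81).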