DATA 606 Data Project Proposal

The Elo rating system, first developed for rating chess players, is widely used across competitive events such as basketball, football, Go, and Scrabble.

This project investigates how well the Elo ratings of basketball teams predict the outcome of a game, compared with home court advantage.

Data Preparation

# set up packages: install each one only if it is missing, then load it
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)


# load data

URL_nba_elo <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"

nba_data <- read.csv(URL_nba_elo)
# rows not flagged as copies (this subset is not used in the steps below)
nbadata <- filter(nba_data, X_iscopy == 0)

nbadata2 <- filter(nba_data, game_location == 'H')
nbadata3 <- filter(nba_data, game_location == 'A')

# split the home games at random into two halves (one TRUE/FALSE draw per row)
index <- sample(c(TRUE, FALSE), nrow(nbadata2), replace=TRUE, prob=c(0.5,0.5))
nbadata2_a <- nbadata2[index,]
nbadata2_b <- nbadata2[!index,]

# split the away games the same way
index <- sample(c(TRUE, FALSE), nrow(nbadata3), replace=TRUE, prob=c(0.5,0.5))
nbadata3_a <- nbadata3[index,]
nbadata3_b <- nbadata3[!index,]

# each sample combines one home half and one away half
sampleA <- rbind(nbadata2_a, nbadata3_a)
sampleB <- rbind(nbadata2_b, nbadata3_b)

The project will use sample A (sampleA).
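A random split like this can be sanity-checked by confirming that the two halves share no rows and together cover the original data. The sketch below is only an illustration: it uses a small made-up data frame (`toy`) with a hypothetical `row_id` column, not the NBA data.

```r
# Toy stand-in for the NBA data; `row_id` is a hypothetical helper column.
set.seed(1)
toy <- data.frame(row_id = 1:10,
                  game_location = rep(c("H", "A"), each = 5))

# One TRUE/FALSE draw per row, so the index length matches the row count.
in_a <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE)
half_a <- toy[in_a, ]
half_b <- toy[!in_a, ]

# The halves should be disjoint and together contain every row exactly once.
no_overlap <- length(intersect(half_a$row_id, half_b$row_id)) == 0
full_cover <- nrow(half_a) + nrow(half_b) == nrow(toy)
```

Both `no_overlap` and `full_cover` should be TRUE for any split built from an index and its negation.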

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Which is a better predictor of whether a team will win a basketball game: elo rating of the competing teams or home court advantage?
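One possible way to frame this comparison, assuming each predictor is turned into a simple yes/no rule, is to measure how often each rule matches the actual result. The sketch below uses a made-up data frame (`toy`) that borrows the column names of the data set; its accuracies say nothing about the real data.

```r
# Made-up games; column names follow the data set used in this proposal.
toy <- data.frame(
  elo_i         = c(1600, 1450, 1500, 1550, 1400, 1520),
  opp_elo_i     = c(1500, 1500, 1480, 1600, 1450, 1490),
  game_location = c("H", "A", "A", "H", "H", "A"),
  game_result   = c("W", "L", "W", "L", "W", "L")
)

# Rule 1: predict a win when the team's Elo exceeds the opponent's.
pred_elo <- ifelse(toy$elo_i > toy$opp_elo_i, "W", "L")

# Rule 2: predict a win when the team plays at home.
pred_home <- ifelse(toy$game_location == "H", "W", "L")

# Share of games in which each rule matches the actual result.
acc_elo <- mean(pred_elo == toy$game_result)
acc_home <- mean(pred_home == toy$game_result)
```

Applied to sample A, the same two accuracies would give a first, rough answer to the research question. A model-based comparison could refine it.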

Cases

What are the cases, and how many are there?

Each case represents a game played by a basketball team. There are 126,314 cases. These cases represent 63,157 games, since there are two cases per game (one for each team competing in the game).
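The relationship between cases and games can be checked directly. The sketch below assumes a `game_id` column that identifies each game and uses a made-up data frame (`toy`) in place of the real data.

```r
# Toy data: each game appears twice, once per competing team.
# `game_id` is assumed to identify a game; the values here are invented.
toy <- data.frame(game_id = c("g1", "g1", "g2", "g2", "g3", "g3"),
                  team_id = c("BOS", "NYK", "LAL", "CHI", "BOS", "LAL"))

n_cases <- nrow(toy)                    # one case per team per game
n_games <- length(unique(toy$game_id))  # distinct games
rows_per_game <- table(toy$game_id)     # should be 2 for every game

# The same arithmetic for the counts quoted above.
games_from_cases <- 126314 / 2
```

Here `n_cases` is 6, `n_games` is 3, and `games_from_cases` is 63157.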

Data collection

Describe the method of data collection.

The data was compiled by fivethirtyeight.com from the third-party source http://www.basketball-reference.com/.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The Complete History of the NBA by Ruben Fischer-Baum and Nate Silver (https://projects.fivethirtyeight.com/complete-history-of-the-nba/#spurs)

The data can be accessed here: https://github.com/fivethirtyeight/data/tree/master/nba-elo

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is game_result: whether the team won or lost the game. It is categorical.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variables are:

elo_i: The Elo rating of the team before the game. It is numerical.
opp_elo_i: The Elo rating of the opposing team before the game. It is numerical.
game_location: Whether the game was an away game or a home game. It is categorical.
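To illustrate how the two rating variables could combine into a single prediction, the sketch below applies the standard Elo expected-score formula with the conventional 400-point scale. The function name `elo_win_prob` is hypothetical, and the formula includes no home-court adjustment. Whether the source data's own forecasts use exactly this form is not stated in this proposal.

```r
# Standard Elo expected score: probability that a team rated `elo`
# beats an opponent rated `opp_elo` (no home-court adjustment).
elo_win_prob <- function(elo, opp_elo) {
  1 / (1 + 10^((opp_elo - elo) / 400))
}

p_even <- elo_win_prob(1500, 1500)  # equal ratings: 0.5
p_fav  <- elo_win_prob(1600, 1500)  # 100-point favourite: about 0.64
p_dog  <- elo_win_prob(1500, 1600)  # the underdog's view: about 0.36
```

The function is vectorised, so it could be called as `elo_win_prob(sampleA$elo_i, sampleA$opp_elo_i)` to get one probability per case.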

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you're comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

game_result

table(sampleA$game_result, useNA = 'ifany')

## 
##     L     W 
## 35014 36359

prop.table(table(sampleA$game_result, useNA='ifany')) * 100

## 
##        L        W 
## 49.05777 50.94223

game_location

table(sampleA$game_location, useNA = 'ifany')

## 
##     A     H     N 
## 32941 38432     0

prop.table(table(sampleA$game_location, useNA='ifany')) * 100

## 
##        A        H        N 
## 46.15331 53.84669  0.00000

Location vs. Result

table(sampleA$game_location, sampleA$game_result)

##    
##         L     W
##   A 20517 12424
##   H 14497 23935
##   N     0     0
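The counts above are easier to compare as win rates within each location. The sketch below re-enters the away and home counts from the table by hand (the empty N row is dropped) and converts them to row proportions. The object names are illustrative only.

```r
# Counts copied from the location-by-result table above.
counts <- matrix(c(20517, 12424,
                   14497, 23935),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(game_location = c("A", "H"),
                                 game_result = c("L", "W")))

# Proportion of losses and wins within each location (rows sum to 1).
row_props <- prop.table(counts, margin = 1)

# Win rate by location.
win_rate <- row_props[, "W"]
```

With these counts the win rate is roughly 0.38 for away games and 0.62 for home games. On the full data frame the same result could be obtained with `prop.table(table(sampleA$game_location, sampleA$game_result), margin = 1)`.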

elo_i

summary(sampleA$elo_i)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1095    1417    1501    1495    1576    1837

ggplot(sampleA, aes(x=elo_i)) + geom_histogram(fill="white", colour="black") + facet_grid(game_result ~.)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

opp_elo_i

summary(sampleA$opp_elo_i)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1092    1416    1500    1495    1575    1837

ggplot(sampleA, aes(x=opp_elo_i)) + geom_histogram(fill="white", colour="black") + facet_grid(game_result ~.)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

elo_i vs opp_elo_i

plot(sampleA$elo_i, sampleA$opp_elo_i)

ggplot(sampleA, aes(x=elo_i, y=opp_elo_i)) + geom_point() + facet_grid(game_result ~ .)