The Elo rating system, first developed for rating chess players, is widely used across competitive events such as basketball, football, Go and Scrabble.
This project investigates how well the Elo ratings of basketball teams predict the outcome of a game, compared with home-court advantage.
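For context, the standard Elo model turns a rating difference into an expected win probability using a logistic curve on a 400-point scale. The sketch below illustrates that formula only; the function name `elo_expected` and the `home_bonus` argument are hypothetical and are not taken from the dataset or from FiveThirtyEight's own code.

```r
# Expected win probability for a team under the standard Elo formula.
# `home_bonus` is a hypothetical rating bonus credited to the team when it
# plays at home; it defaults to 0 (no adjustment).
elo_expected <- function(elo, opp_elo, home_bonus = 0) {
  1 / (1 + 10^((opp_elo - (elo + home_bonus)) / 400))
}

elo_expected(1500, 1500)  # equal ratings give 0.5
elo_expected(1600, 1500)  # a 100-point favourite wins about 64% of the time
```

The two probabilities for a pairing always sum to 1, so `elo_expected(1500, 1600)` is the complement of the second call.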
# set up packages: install each one only if it is missing, then load it
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)
# load data
URL_nba_elo <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
nba_data <- read.csv(URL_nba_elo)
nbadata <- filter(nba_data,X_iscopy == 0)
nbadata2 <- filter(nba_data, game_location == 'H')
nbadata3 <- filter(nba_data, game_location == 'A')
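# split the home games into two random halves (one TRUE/FALSE flag per row)
index <- sample(c(TRUE, FALSE), nrow(nbadata2), replace=TRUE, prob=c(0.5,0.5))
nbadata2_a <- nbadata2[index,]
nbadata2_b <- nbadata2[!index,]
# split the away games the same way
index <- sample(c(TRUE, FALSE), nrow(nbadata3), replace=TRUE, prob=c(0.5,0.5))
nbadata3_a <- nbadata3[index,]
nbadata3_b <- nbadata3[!index,]
# each sample combines one half of the home games with one half of the away games
sampleA <- rbind(nbadata2_a, nbadata3_a)
sampleB <- rbind(nbadata2_b, nbadata3_b)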
The project will use sample A.
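Because the split relies on `sample()`, the counts in sample A change on every run unless a seed is fixed. The following is a minimal sketch of a reproducible random split on a toy data frame; the seed value, the `toy` data frame and the `in_a` flag are illustrative assumptions, not part of the project code.

```r
set.seed(606)  # arbitrary seed, chosen only for illustration

# Stand-in for a filtered data frame such as the home games.
toy <- data.frame(id = 1:10)

# One TRUE/FALSE flag per row, each drawn with probability 0.5.
in_a <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE)
half_a <- toy[in_a, , drop = FALSE]
half_b <- toy[!in_a, , drop = FALSE]

# The two halves are disjoint and together cover every row.
nrow(half_a) + nrow(half_b) == nrow(toy)
```

Drawing one flag per row keeps the assignment of each game independent of the others, so the halves are usually close to, but not exactly, equal in size.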
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Which is a better predictor of whether a team will win a basketball game: elo rating of the competing teams or home court advantage?
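One simple way to frame this comparison is to score two naive prediction rules by their accuracy. The sketch below uses a made-up data frame named `games` with the same column names as the dataset; it assumes `elo_i` and `opp_elo_i` hold the ratings entering the game, and it is an illustration rather than the analysis the project commits to.

```r
# Toy data in the same shape as the project's columns (values are made up).
games <- data.frame(
  elo_i         = c(1600, 1450, 1520, 1480, 1550, 1400),
  opp_elo_i     = c(1500, 1550, 1500, 1530, 1600, 1450),
  game_location = c("H", "A", "A", "H", "H", "A"),
  game_result   = c("W", "L", "W", "W", "L", "L")
)

# Rule 1: predict a win when the team's Elo exceeds the opponent's.
pred_elo  <- ifelse(games$elo_i > games$opp_elo_i, "W", "L")
# Rule 2: predict a win whenever the team plays at home.
pred_home <- ifelse(games$game_location == "H", "W", "L")

# Share of games each rule predicts correctly.
acc_elo  <- mean(pred_elo  == games$game_result)
acc_home <- mean(pred_home == games$game_result)
c(elo = acc_elo, home = acc_home)
```

Replacing `games` with `sampleA` would apply the same two rules to the project data.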
What are the cases, and how many are there? Each case represents a game played by a basketball team. There are 126314 cases, representing 63157 games, since there are two cases per game (one for each team competing in the game).
Describe the method of data collection. The data was compiled by fivethirtyeight.com from the third-party source http://www.basketball-reference.com/.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
The Complete History of the NBA by Ruben Fischer-Baum and Nate Silver (https://projects.fivethirtyeight.com/complete-history-of-the-nba/#spurs)
The data can be accessed here: https://github.com/fivethirtyeight/data/tree/master/nba-elo
What is the response variable, and what type is it (numerical/categorical)?
The response variable is game_result: whether the team won or lost. It is categorical.
What is the explanatory variable, and what type is it (numerical/categorical)?
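The explanatory variables are elo_i (the team's Elo rating entering the game, numerical), opp_elo_i (the opponent's Elo rating entering the game, numerical) and game_location (home, away or neutral, categorical).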
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
table(sampleA$game_result, useNA = 'ifany')
##
##     L     W
## 35014 36359
prop.table(table(sampleA$game_result, useNA='ifany')) * 100
##
##        L        W
## 49.05777 50.94223
table(sampleA$game_location, useNA = 'ifany')
##
##     A     H     N
## 32941 38432     0
prop.table(table(sampleA$game_location, useNA='ifany')) * 100
##
##        A        H        N
## 46.15331 53.84669  0.00000
table(sampleA$game_location, sampleA$game_result)
##
##         L     W
##   A 20517 12424
##   H 14497 23935
##   N     0     0
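The cross-tabulation above is easier to read as win rates within each location. As a small sketch, the counts can be re-entered by hand and converted to row-wise proportions with `prop.table(..., margin = 1)`; the object names `loc_by_result` and `win_rates` are illustrative.

```r
# Counts copied from the cross-tabulation of sampleA above
# (the empty neutral-site row is omitted).
loc_by_result <- matrix(
  c(20517, 12424,
    14497, 23935),
  nrow = 2, byrow = TRUE,
  dimnames = list(game_location = c("A", "H"), game_result = c("L", "W"))
)

# Proportions within each location, so that each row sums to 1.
win_rates <- prop.table(loc_by_result, margin = 1)
round(win_rates * 100, 2)
```

In this sample the home team wins about 62% of its games and the away team about 38%. On the full data the same result comes from `prop.table(table(sampleA$game_location, sampleA$game_result), margin = 1)`.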
summary(sampleA$elo_i)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1095 1417 1501 1495 1576 1837
ggplot(sampleA, aes(x=elo_i)) + geom_histogram(fill="white", colour="black") + facet_grid(game_result ~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(sampleA$opp_elo_i)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1092 1416 1500 1495 1575 1837
ggplot(sampleA, aes(x=opp_elo_i)) + geom_histogram(fill="white", colour="black") + facet_grid(game_result ~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot(sampleA$elo_i, sampleA$opp_elo_i)
ggplot(sampleA, aes(x=elo_i, y=opp_elo_i)) + geom_point() + facet_grid(game_result ~ .)