DATA 607 Final Project Part 2

In the first part of this project, we created a Database of NBA plays through a series of CSVs. In this second Part, I will scrape data from a sports betting website to add to this database.

The website in question is sportsdatabase.com, and the single field I’m after is the Vegas line - over/under value for each game.Over/unders represent a number of points, with a negative value meaning the team is more likely to win. Since this project isn’t about betting on sports and negative being better is counterintuitive, I’m going to reverse that by pulling the away team’s line and inserting into the home team field. The line for the away team is always the inverse of the line for the home team, so I just need to pull the one.

library(tidyverse)
library(rvest)
library(RMySQL)

Redacted code: con = dbConnect(dbDriver(“MySQL”), user=‘root’, password=‘PASSWORD’, dbname=‘cuny’, host=‘127.0.0.1’,port = 3306)

This website is pretty cool in that you can query pretty specific information. Our query is pretty easy - we just want every regular season game for the 2004-2018 seasons.

url = "https://sportsdatabase.com/nba/query?output=default&sdql=date%2C+team%2C+o%3Ateam%2C+line+%40season%3E%3D2004+and+site%3Daway+and+playoffs%3D0+and+season%3C2019&submit=++S+D+Q+L+%21++"

data <- read_html(url)

table_node = html_nodes(data, "#DT_Table")

table = html_table(table_node)

vegas.df = table[[1]]
head(vegas.df)

##       date          team    o:team line
## 1 20041102       Nuggets    Lakers  2.5
## 2 20041102         Kings Mavericks  4.5
## 3 20041102       Rockets   Pistons  6.5
## 4 20041103        Pacers Cavaliers  3.5
## 5 20041103 Seventysixers   Celtics  3.0
## 6 20041103   Supersonics  Clippers  3.0

Let’s update the date column to match the format we use in the other dataset

vegas.df = mutate(vegas.df,date = paste0(substring(date,1,4),'-',substring(date,5,6),'-',substring(date,7,8)))

Now let’s pull the game data from the database. We’ll pull in all regular season games. I decided to just use regular season games for this project, but it would be interesting to add them in, and possibly just add a binary variable indicating if the game is a playoff game.

query <- "select nba_games.id,data_set,date,home_team,away_team,home_score,away_score,
periods,hm.sports_database_name as home_db_name,aw.sports_database_name as away_db_name from nba_games
  left join nba_teams hm on home_team = hm.short_code
  left join nba_teams aw on away_team = aw.short_code
  where data_set like '%Regular%'
  "

res <- dbSendQuery(con,query)

game.df <- dbFetch(res, n=-1)

head(game.df)

##         id                 data_set       date home_team away_team
## 1 20400522 2004-2005 Regular Season 2005-01-14       BOS       ATL
## 2 20400896 2004-2005 Regular Season 2005-03-09       BOS       ATL
## 3 20500460 2005-2006 Regular Season 2006-01-06       BOS       ATL
## 4 20600625 2006-2007 Regular Season 2007-01-24       BOS       ATL
## 5 20600962 2006-2007 Regular Season 2007-03-14       BOS       ATL
## 6 20700068 2007-2008 Regular Season 2007-11-09       BOS       ATL
##   home_score away_score periods home_db_name away_db_name
## 1        106         94       4      Celtics        Hawks
## 2         95         91       4      Celtics        Hawks
## 3         98        103       4      Celtics        Hawks
## 4         76         82       4      Celtics        Hawks
## 5        109         88       4      Celtics        Hawks
## 6        106         83       4      Celtics        Hawks

Let’s start by doing a “full join” and seeing what’s missing on either end. Since I pulled the “away” line, the “o:team” will be the home team

joined.df = full_join(vegas.df,game.df, c('date'='date','o:team' = 'home_db_name', 'team' = 'away_db_name'))

sum(is.na(joined.df$team))

## [1] 0

There were no missing values from the vegas data frame. Let’s see if there were some extra games that weren’t in the original dataset. It says right in the FAQ for the dataset that there were around 47 missing games, so we should expect around that many missing.

sum(is.na(joined.df$id))

## [1] 42

Looks like there are 42 mising. I assume 5 of the missing games were playoff games, which is why we only have 42 missing. Either way we’ll just remove these 42 and move on.

joined.df = filter(joined.df, !is.na(id))
str(joined.df)

## 'data.frame':    18167 obs. of  11 variables:
##  $ date      : chr  "2004-11-02" "2004-11-02" "2004-11-02" "2004-11-03" ...
##  $ team      : chr  "Nuggets" "Kings" "Rockets" "Pacers" ...
##  $ o:team    : chr  "Lakers" "Mavericks" "Pistons" "Cavaliers" ...
##  $ line      : num  2.5 4.5 6.5 3.5 3 3 13 -3 3.5 3.5 ...
##  $ id        : int  20400003 20400002 20400001 20400006 20400007 20400015 20400009 20400010 20400014 20400005 ...
##  $ data_set  : chr  "2004-2005 Regular Season" "2004-2005 Regular Season" "2004-2005 Regular Season" "2004-2005 Regular Season" ...
##  $ home_team : chr  "LAL" "DAL" "DET" "CLE" ...
##  $ away_team : chr  "DEN" "SAC" "HOU" "IND" ...
##  $ home_score: int  89 107 87 104 95 114 91 91 104 93 ...
##  $ away_score: int  78 98 79 109 98 84 103 106 78 92 ...
##  $ periods   : int  4 4 4 6 4 4 4 4 4 4 ...

Let’s tidy this dataset a bit. We don’t need the spelled out names since we have the 3 letter codes. The data set is also doesn’t give any interesting information since you could figure out what season it was by the date, and we already excluded playoff games.

df = select(joined.df,id,date,home_team,away_team,home_score,away_score,periods,line)
head(df)

##         id       date home_team away_team home_score away_score periods
## 1 20400003 2004-11-02       LAL       DEN         89         78       4
## 2 20400002 2004-11-02       DAL       SAC        107         98       4
## 3 20400001 2004-11-02       DET       HOU         87         79       4
## 4 20400006 2004-11-03       CLE       IND        104        109       6
## 5 20400007 2004-11-03       BOS       PHI         95         98       4
## 6 20400015 2004-11-03       LAC       SEA        114         84       4
##   line
## 1  2.5
## 2  4.5
## 3  6.5
## 4  3.5
## 5  3.0
## 6  3.0

Let’s update the nba_games table with the vegas line. You can’t run 18K updates in one statement, but you can do an insert with an “on duplicate key update” clause. This will just update the value (home_team_vegas_line) when it finds a duplicate key. Since these keys were just pulled out of this table they should all be updates.

query <- paste0("insert into nba_games (id,home_team_vegas_line) values ",
     paste0(paste0("(",df$id,",",df$line,")"),collapse=", "),
     " ON DUPLICATE KEY UPDATE home_team_vegas_line = VALUES(home_team_vegas_line)")
             
                
res <- dbSendQuery(con,query)

Let’s see if they all updated.

query <- paste0("select count(*) from nba_games where data_Set like '%Regular%' and home_team_vegas_line is null")
res <- dbSendQuery(con,query)
missing.count <- dbFetch(res, n=-1)
missing.count

##   count(*)
## 1        0

Looks like they’re all there. Now to verify we did this right, we would expect there to be a strong correlation between the score margin (home_team_score - away_team_score) and this line.

df$margin = df$home_score - df$away_score
df$total_score = df$home_score + df$away_score

ggplot(df, aes(y = margin, x =line, color=(total_score))) +
  geom_point(size = 0.1, stroke = 0, shape = 16) + 
  geom_jitter() +
  geom_smooth(method="lm", color="darkred", fill = "pink", level = 0.99) +
  xlim(-40,40) +
  ylim(-40,40)

lm = lm(margin ~ line, data = df)
summary(lm)

## 
## Call:
## lm(formula = margin ~ line, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.432  -7.354   0.096   7.184  54.767 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.21726    0.09596  -2.264   0.0236 *  
## line         1.01103    0.01367  73.980   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.6 on 18165 degrees of freedom
## Multiple R-squared:  0.2315, Adjusted R-squared:  0.2315 
## F-statistic:  5473 on 1 and 18165 DF,  p-value: < 2.2e-16

As you can see, these sportsbooks are pretty good at setting these lines. The slope of the line is very close to 1, with the intercept close to zero.

The odd looking cross shape comes from the fact that you can’t end in a tie. So “margin” is never zero. Because the margin cannot be zero, the line is never set to -0.5 or 0.5.

For part three of this project, I will pull in play-by-play data and attempt to make a model that predicts the outcome of the game at any point, using this line as a starting point.

DATA 607 Final Project Part 2

Steven Ellingson

December 5, 2019