In the first part of this project, we created a database of NBA plays from a series of CSVs. In this second part, I will scrape data from a sports betting website to add to this database.
The website in question is sportsdatabase.com, and the single field I'm after is the Vegas line - the over/under value for each game. Over/unders represent a number of points, with a negative value meaning that team is more likely to win. Since this project isn't about betting on sports and "negative is better" is counterintuitive, I'm going to reverse that by pulling the away team's line and inserting it into the home team field. The line for the away team is always the inverse of the line for the home team, so I only need to pull the one. For example, if the home team's line is -4.5 (home team favored), the away team's line is +4.5; stored against the home team, a positive value then means the home team is favored.
library(tidyverse)
library(rvest)
library(RMySQL)
Redacted code: con = dbConnect(dbDriver("MySQL"), user='root', password='PASSWORD', dbname='cuny', host='127.0.0.1', port = 3306)
This website is pretty cool in that it lets you query fairly specific information. Our query is simple - we just want every regular-season game from the 2004 through 2018 seasons.
url = "https://sportsdatabase.com/nba/query?output=default&sdql=date%2C+team%2C+o%3Ateam%2C+line+%40season%3E%3D2004+and+site%3Daway+and+playoffs%3D0+and+season%3C2019&submit=++S+D+Q+L+%21++"
data <- read_html(url)
table_node = html_nodes(data, "#DT_Table")
table = html_table(table_node)
vegas.df = table[[1]]
head(vegas.df)
## date team o:team line
## 1 20041102 Nuggets Lakers 2.5
## 2 20041102 Kings Mavericks 4.5
## 3 20041102 Rockets Pistons 6.5
## 4 20041103 Pacers Cavaliers 3.5
## 5 20041103 Seventysixers Celtics 3.0
## 6 20041103 Supersonics Clippers 3.0
Let's update the date column to match the format we use in the other dataset.
vegas.df = mutate(vegas.df,date = paste0(substring(date,1,4),'-',substring(date,5,6),'-',substring(date,7,8)))
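An equivalent approach, assuming the scraped dates are eight-digit YYYYMMDD values, would be to parse them as Dates and reformat them (shown here only as a sketch, not run above):
vegas.df = mutate(vegas.df, date = format(as.Date(as.character(date), format = "%Y%m%d"), "%Y-%m-%d"))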
Now let's pull the game data from the database. We'll pull in all regular season games. I decided to use only regular-season games for this project, but it would be interesting to add the playoff games as well, possibly with a binary variable indicating whether a game is a playoff game.
query <- "select nba_games.id,data_set,date,home_team,away_team,home_score,away_score,
periods,hm.sports_database_name as home_db_name,aw.sports_database_name as away_db_name from nba_games
left join nba_teams hm on home_team = hm.short_code
left join nba_teams aw on away_team = aw.short_code
where data_set like '%Regular%'
"
res <- dbSendQuery(con,query)
game.df <- dbFetch(res, n=-1)
head(game.df)
## id data_set date home_team away_team
## 1 20400522 2004-2005 Regular Season 2005-01-14 BOS ATL
## 2 20400896 2004-2005 Regular Season 2005-03-09 BOS ATL
## 3 20500460 2005-2006 Regular Season 2006-01-06 BOS ATL
## 4 20600625 2006-2007 Regular Season 2007-01-24 BOS ATL
## 5 20600962 2006-2007 Regular Season 2007-03-14 BOS ATL
## 6 20700068 2007-2008 Regular Season 2007-11-09 BOS ATL
## home_score away_score periods home_db_name away_db_name
## 1 106 94 4 Celtics Hawks
## 2 95 91 4 Celtics Hawks
## 3 98 103 4 Celtics Hawks
## 4 76 82 4 Celtics Hawks
## 5 109 88 4 Celtics Hawks
## 6 106 83 4 Celtics Hawks
Let's start by doing a "full join" and seeing what's missing on either end. Since I pulled the "away" line, the "o:team" column will be the home team.
joined.df = full_join(vegas.df,game.df, c('date'='date','o:team' = 'home_db_name', 'team' = 'away_db_name'))
sum(is.na(joined.df$team))
## [1] 0
There were no missing values from the Vegas data frame. Let's see if there were some extra games that weren't in the original dataset. The FAQ for the dataset says there are around 47 missing games, so we should expect roughly that many missing here.
sum(is.na(joined.df$id))
## [1] 42
Looks like there are 42 missing. I assume 5 of the missing games were playoff games, which is why we see 42 rather than 47. Either way, we'll just remove these 42 and move on.
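Before dropping them, we could peek at exactly which rows are unmatched on each side using anti_join; a quick sketch using the same keys as the join above (output not shown):
# Games in the database with no matching Vegas line
games_without_line = anti_join(game.df, vegas.df,
                               by = c('date' = 'date', 'home_db_name' = 'o:team', 'away_db_name' = 'team'))
# Scraped lines with no matching game
lines_without_game = anti_join(vegas.df, game.df,
                               by = c('date' = 'date', 'o:team' = 'home_db_name', 'team' = 'away_db_name'))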
joined.df = filter(joined.df, !is.na(id))
str(joined.df)
## 'data.frame': 18167 obs. of 11 variables:
## $ date : chr "2004-11-02" "2004-11-02" "2004-11-02" "2004-11-03" ...
## $ team : chr "Nuggets" "Kings" "Rockets" "Pacers" ...
## $ o:team : chr "Lakers" "Mavericks" "Pistons" "Cavaliers" ...
## $ line : num 2.5 4.5 6.5 3.5 3 3 13 -3 3.5 3.5 ...
## $ id : int 20400003 20400002 20400001 20400006 20400007 20400015 20400009 20400010 20400014 20400005 ...
## $ data_set : chr "2004-2005 Regular Season" "2004-2005 Regular Season" "2004-2005 Regular Season" "2004-2005 Regular Season" ...
## $ home_team : chr "LAL" "DAL" "DET" "CLE" ...
## $ away_team : chr "DEN" "SAC" "HOU" "IND" ...
## $ home_score: int 89 107 87 104 95 114 91 91 104 93 ...
## $ away_score: int 78 98 79 109 98 84 103 106 78 92 ...
## $ periods : int 4 4 4 6 4 4 4 4 4 4 ...
Let's tidy this dataset a bit. We don't need the spelled-out team names since we have the 3-letter codes. The data_set column also doesn't add any interesting information, since you could figure out the season from the date, and we already excluded playoff games.
df = select(joined.df,id,date,home_team,away_team,home_score,away_score,periods,line)
head(df)
## id date home_team away_team home_score away_score periods
## 1 20400003 2004-11-02 LAL DEN 89 78 4
## 2 20400002 2004-11-02 DAL SAC 107 98 4
## 3 20400001 2004-11-02 DET HOU 87 79 4
## 4 20400006 2004-11-03 CLE IND 104 109 6
## 5 20400007 2004-11-03 BOS PHI 95 98 4
## 6 20400015 2004-11-03 LAC SEA 114 84 4
## line
## 1 2.5
## 2 4.5
## 3 6.5
## 4 3.5
## 5 3.0
## 6 3.0
Let's update the nba_games table with the Vegas line. You can't run 18K updates in one statement, but you can do a single insert with an "on duplicate key update" clause, which updates the value (home_team_vegas_line) whenever it hits a duplicate key. Since these ids were just pulled out of this table, every row should resolve to an update.
query <- paste0("insert into nba_games (id,home_team_vegas_line) values ",
paste0(paste0("(",df$id,",",df$line,")"),collapse=", "),
" ON DUPLICATE KEY UPDATE home_team_vegas_line = VALUES(home_team_vegas_line)")
res <- dbSendQuery(con,query)
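If the single statement were ever too large for MySQL's max_allowed_packet, the same upsert could be sent in chunks; a minimal sketch (the 1,000-row chunk size is arbitrary):
# Sketch: send the same "on duplicate key update" insert in chunks of 1,000 rows
chunks = split(df, ceiling(seq_len(nrow(df)) / 1000))
for (chunk in chunks) {
  q = paste0("insert into nba_games (id,home_team_vegas_line) values ",
             paste0("(", chunk$id, ",", chunk$line, ")", collapse = ", "),
             " ON DUPLICATE KEY UPDATE home_team_vegas_line = VALUES(home_team_vegas_line)")
  res = dbSendQuery(con, q)
  dbClearResult(res)
}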
Let’s see if they all updated.
query <- paste0("select count(*) from nba_games where data_Set like '%Regular%' and home_team_vegas_line is null")
res <- dbSendQuery(con,query)
missing.count <- dbFetch(res, n=-1)
missing.count
## count(*)
## 1 0
Looks like they're all there. Now, to verify we did this right, we would expect a strong correlation between the score margin (home_score - away_score) and this line.
df$margin = df$home_score - df$away_score
df$total_score = df$home_score + df$away_score
ggplot(df, aes(y = margin, x = line, color = total_score)) +
geom_jitter(size = 0.1, stroke = 0, shape = 16) +
geom_smooth(method = "lm", color = "darkred", fill = "pink", level = 0.99) +
xlim(-40, 40) +
ylim(-40, 40)
fit = lm(margin ~ line, data = df)
summary(fit)
##
## Call:
## lm(formula = margin ~ line, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.432 -7.354 0.096 7.184 54.767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.21726 0.09596 -2.264 0.0236 *
## line 1.01103 0.01367 73.980 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.6 on 18165 degrees of freedom
## Multiple R-squared: 0.2315, Adjusted R-squared: 0.2315
## F-statistic: 5473 on 1 and 18165 DF, p-value: < 2.2e-16
As you can see, these sportsbooks are pretty good at setting these lines: the slope of the fitted line is very close to 1, and the intercept is close to zero.
The odd-looking cross shape comes from the fact that an NBA game can't end in a tie, so the margin is never zero; and because the margin can never be zero, the line is never set to -0.5 or 0.5.
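That claim is easy to sanity-check from the joined data (a quick sketch; output not shown):
# Both counts should be zero: no game ends level, and no line of -0.5 or 0.5 appears
sum(df$margin == 0)
sum(df$line %in% c(-0.5, 0.5))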
For part three of this project, I will pull in play-by-play data and attempt to make a model that predicts the outcome of the game at any point, using this line as a starting point.