# Variety of libraries that I've used or expected to use throughout the process
library(rvest)
library(dplyr)
library(stringr)
library(stringi)
library(XML)
library(data.table)
library(knitr)
library(tidyr)
The ideal goal is to obtain a data frame that looks something like:
##        TeamName.x     TeamName.y Spread
##  1      Villanova        Radford   23.0
##  2         Purdue   CS Fullerton   20.5
##  3     Texas Tech      SF Austin   11.0
##  4     Wichita St       Marshall   11.5
##  5  West Virginia      Murray St   10.5
##  6        Florida St Bonaventure    5.5
##  7       Arkansas         Butler   -1.5
##  8  Virginia Tech        Alabama    2.5
##  9         Kansas           Penn   14.5
## 10           Duke           Iona   20.0
## 11    Michigan St       Bucknell   14.5
## 12         Auburn Col Charleston    9.0
## 13        Clemson  New Mexico St    4.5
## 14            TCU       Syracuse    4.0
## 15   Rhode Island       Oklahoma    1.5
## 16     Seton Hall       NC State    3.0
## 17       Virginia           UMBC   21.0
## 18     Cincinnati     Georgia St   14.0
## 19      Tennessee      Wright St   11.5
## 20        Arizona        Buffalo    8.5
## 21       Kentucky       Davidson    5.0
## 22       Miami FL Loyola-Chicago    2.0
## 23         Nevada          Texas    0.0
## 24      Creighton      Kansas St    1.0
## 25         Xavier    TX Southern   19.5
## 26 North Carolina       Lipscomb   19.5
## 27       Michigan        Montana   10.0
## 28        Gonzaga UNC Greensboro   13.5
## 29        Ohio St    S Dakota St    8.0
## 30        Houston   San Diego St    4.0
## 31      Texas A&M     Providence    2.5
## 32       Missouri     Florida St   -1.5
The sign of the spread just needs to be consistent: always relative to team x or always relative to team y. I think it would also be fine for the data frame to list only one team per matchup.
Other notes:
The two scrapes (spreads and final scores) can possibly be done in the same task if the site has both.
I’d prefer a website where team names are not abbreviated, to make the eventual join easier.
Here are some websites I’ve found with spread info. I’m open to using different websites as well. Whatever works.
# Links to websites that have ncaa spreads
donURL <- "http://www.donbest.com/ncaab/odds/spreads/"
trURL <- "https://www.teamrankings.com/ncb/odds/"
vegasURL <- "http://www.vegasinsider.com/college-basketball/odds/las-vegas/"
oddssharkURL <- "https://www.oddsshark.com/ncaab/odds"
sbrURL <- "https://www.sportsbookreview.com/betting-odds/ncaa-basketball/pointspread/"
The donbest site is in an HTML table.
# This worked on Thursday, March 14
don_df <- donURL %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  as.data.frame() %>%
  select(X2, X3, X5)
head(don_df)
## X2
## 1 COLLEGE BASKETBALL - Wednesday, March 27th - NIT TOURNAMENT - Quarterfinals - Bottom Teams are Home
## 2 Opener
## 3 +3.5\n -110-3.5\n -110
## 4 +6.5\n -110-6.5\n -110
## 5 COLLEGE BASKETBALL - Wednesday, March 27th - CBI TOURNAMENT - Semifinals - Bottom Teams are home
## 6 Opener
## X3
## 1 COLLEGE BASKETBALL - Wednesday, March 27th - NIT TOURNAMENT - Quarterfinals - Bottom Teams are Home
## 2 Team
## 3 LipscombNC State
## 4 ColoradoTexas
## 5 COLLEGE BASKETBALL - Wednesday, March 27th - CBI TOURNAMENT - Semifinals - Bottom Teams are home
## 6 Team
## X5
## 1 COLLEGE BASKETBALL - Wednesday, March 27th - NIT TOURNAMENT - Quarterfinals - Bottom Teams are Home
## 2 SC
## 3 3542
## 4 1944
## 5 COLLEGE BASKETBALL - Wednesday, March 27th - CBI TOURNAMENT - Semifinals - Bottom Teams are home
## 6 SC
The chunk above worked to some degree on Thursday 3/14. On Friday 3/15, however, the scrape returned two tables of different dimensions, so they can’t be combined into one data frame the way the code above does it. I figured out why: if any games are played after midnight EST, the site sets up a second table for those games, which was the case on Friday 3/15. In that scenario, the chunk below works. Trying again on 3/27, the chunk below does not work, but the chunk above does.
# This works for Friday, March 15
don_raw <- donURL %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
# The object above is a list with two elements, each of which is a table of data
don_df1 <- don_raw[[1]] %>%
  select(X2, X3, X5)
don_df2 <- don_raw[[2]] %>%
  select(X2, X3, X5)
df <- rbind(don_df1, don_df2)
df <- as_tibble(df)
head(df)
Whether a given evening has post-midnight games (and therefore a second table) can be tested with an if statement, or just handled manually.
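One way to avoid the branch altogether is to treat the scrape as a list of however many tables came back, keep the same three columns from each, and stack them. The sketch below is only an illustration: `combine_tables` is a hypothetical helper name, the toy tables are made-up stand-ins for the scraped ones, and it assumes every table has columns named X2, X3, and X5.

```r
# Hypothetical helper: keep the columns of interest from every table in the
# list, then stack the pieces. Works whether the list has one table or two.
combine_tables <- function(tables, cols = c("X2", "X3", "X5")) {
  kept <- lapply(tables, function(tbl) tbl[, cols, drop = FALSE])
  do.call(rbind, kept)
}

# Toy stand-ins for the scraped tables (made-up values, different row counts)
tbl_a <- data.frame(X1 = 1:3, X2 = c("a", "b", "c"), X3 = c("d", "e", "f"),
                    X4 = 0, X5 = c("g", "h", "i"), stringsAsFactors = FALSE)
tbl_b <- data.frame(X1 = 1:2, X2 = c("j", "k"), X3 = c("l", "m"),
                    X4 = 0, X5 = c("n", "o"), stringsAsFactors = FALSE)

one_table  <- combine_tables(list(tbl_a))         # no post-midnight games
two_tables <- combine_tables(list(tbl_a, tbl_b))  # post-midnight games present
```

With the real scrape, the list returned by `html_table()` (the `don_raw` object) would be passed in directly, so the same call would cover both the 3/14 and the 3/15 situations.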
Either way, this data needs some heavy regular expression parsing, particularly the second column of team names, which I’m not 100% sure how to do systematically. That’s the downside of this site. The upside is that it also provides final scores, in the third column, which likewise needs parsing.
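As a rough starting point for that parsing, the sketch below splits the fused team names at the first place where a lowercase letter is immediately followed by a capital, and takes the leading signed number in the odds cell as the spread. The input strings are copied from the output above. It assumes the first number in the cell belongs to the top (first-listed) team, which the page does not state. The boundary rule is a heuristic: it will fail for a top team whose name ends in a capital (UMBC, TCU) or contains an internal capital. It uses base R so it runs without extra packages.

```r
# Values copied from the X3 (teams) and X2 (opener) columns printed above
teams_raw  <- c("Team", "LipscombNC State", "ColoradoTexas")
spread_raw <- c("Opener", "+3.5\n -110-3.5\n -110", "+6.5\n -110-6.5\n -110")

# Heuristic: the two names meet where a lowercase letter is followed by a capital
boundary <- "^(.*?[a-z])([A-Z].*)$"
# Leading signed number; (?s) lets '.' run across the embedded newlines
leading_number <- "(?s)^([+-]?[0-9]+(?:\\.[0-9]+)?).*$"

# Header rows such as "Team" have no such boundary, so drop them first
is_game <- grepl(boundary, teams_raw, perl = TRUE)

parsed <- data.frame(
  TeamName.x = sub(boundary, "\\1", teams_raw[is_game], perl = TRUE),
  TeamName.y = sub(boundary, "\\2", teams_raw[is_game], perl = TRUE),
  # Assumption: the first number in the cell is the top team's spread
  Spread     = as.numeric(sub(leading_number, "\\1", spread_raw[is_game], perl = TRUE)),
  stringsAsFactors = FALSE
)
```

The final scores in the third column (for example 3542) are left out here, because nothing in the string marks where the first score ends.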
The teamrankings site is in an HTML table too, but there are several tables and none of them are relevant to us.
# Attempt to scrape data from teamrankings
tr_raw <- trURL %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE, header = FALSE)
(tr_df1 <- tr_raw[[1]])
## X1 X2
## 1 DePaul vs. Coastal Car (Total) 2.5
## 2 Texas vs. Colorado (Total) 2.0
## 3 NC State vs. Lipscomb (Spread) 1.5
## 4 DePaul vs. Coastal Car (Spread) 1.5
## 5 Texas vs. Colorado (Spread) 1.0
## 6 NC State vs. Lipscomb (Total) 1.0
(tr_df2 <- tr_raw[[2]])
## X1 X2
## 1 Texas vs. Colorado (Total) 9:03PM
## 2 NC State vs. Lipscomb (Spread) 8:52PM
## 3 DePaul vs. Coastal Car (Spread) 8:01PM
(tr_df3 <- tr_raw[[3]])
## X1 X2
## 1 DePaul vs. Coastal Car -8.5
## 2 Texas vs. Colorado -5.0
## 3 NC State vs. Lipscomb -5.0
(tr_df4 <- tr_raw[[4]])
## X1
## 1 No matching games for this category, or odds have not yet been released.
## X2
## 1 No matching games for this category, or odds have not yet been released.
(tr_df5 <- tr_raw[[5]])
## X1 X2
## 1 DePaul vs. Coastal Car 165.0
## 2 NC State vs. Lipscomb 163.5
## 3 Texas vs. Colorado 137.0
(tr_df6 <- tr_raw[[6]])
## X1
## 1 No matching games for this category, or odds have not yet been released.
## X2
## 1 No matching games for this category, or odds have not yet been released.
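For what it is worth, the third table does pair each matchup with a number that looks like a spread. If a listing in that "A vs. B" shape were usable, a rough sketch for turning it into the target layout could look like the following. The values are copied from the tr_raw[[3]] output above. Which team the sign refers to is not stated on the page, so treating the first-listed team as team x is an assumption, and `tr_spreads` is just an illustrative name.

```r
# Values copied from the tr_raw[[3]] output above
tr_df3 <- data.frame(
  X1 = c("DePaul vs. Coastal Car", "Texas vs. Colorado", "NC State vs. Lipscomb"),
  X2 = c(-8.5, -5.0, -5.0),
  stringsAsFactors = FALSE
)

# Split each matchup on the literal " vs. " separator into a two-column matrix
teams <- do.call(rbind, strsplit(tr_df3$X1, " vs. ", fixed = TRUE))

tr_spreads <- data.frame(
  TeamName.x = teams[, 1],   # assumption: first-listed team plays the role of team x
  TeamName.y = teams[, 2],
  Spread     = tr_df3$X2,
  stringsAsFactors = FALSE
)
```

Note that this table abbreviates some names (Coastal Car), which would complicate a later join on team names.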
None of those tables has the data we want. The table that we DO actually want is at the XPath //*[@id="tab-latest-odds"], and I get the feeling the spreads will be easy to parse if we can get it. This site doesn’t give final scores alongside the spreads, but those can be scraped separately, probably without much difficulty.
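A minimal sketch of targeting that node by XPath with rvest is below. `table_by_id` is a hypothetical helper, and the page is a made-up stand-in so the example runs offline. Since `html_nodes("table")` did not return the wanted table, it may be filled in by JavaScript after the page loads. In that case the helper would return NULL (or an empty table) on the real site and a different approach would be needed.

```r
library(rvest)

# Hypothetical helper: return the first table at or under the node with the
# given id, or NULL when no such table exists in the parsed HTML.
table_by_id <- function(page, id) {
  xpath <- sprintf('//*[@id="%s"]/descendant-or-self::table', id)
  node <- html_node(page, xpath = xpath)
  if (inherits(node, "xml_missing")) return(NULL)
  html_table(node)
}

# Made-up page standing in for the real site (structure is an assumption)
toy_page <- read_html('
  <div id="tab-latest-odds">
    <table>
      <tr><th>Matchup</th><th>Spread</th></tr>
      <tr><td>Team A vs. Team B</td><td>-3.5</td></tr>
    </table>
  </div>')

latest  <- table_by_id(toy_page, "tab-latest-odds")
missing <- table_by_id(toy_page, "no-such-id")   # NULL
```

On the real site the call would be `table_by_id(read_html(trURL), "tab-latest-odds")`, with the result checked for NULL before any parsing.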
The vegasinsider site is in an HTML table.
# Attempt to scrape data from vegasinsider
vi_raw <- vegasURL %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
If you look at vi_raw[[5]], vi_raw[[9]], and vi_raw[[11]] you’ll see some useful information, but in dreadfully unhelpful formats. Not my choice right now.
The sportsbookreview site is NOT in an HTML table.
# Attempt to scrape data from sportsbookreview
The oddsshark site is NOT in an HTML table.
# Attempt to scrape data from oddsshark
The downside is that final scores are not available in the same scrape.