# Variety of libraries that I've used or expected to use throughout the process
library(rvest)
library(dplyr)
library(stringr)
library(stringi)
library(XML)
library(data.table)
library(knitr)
library(tidyr)

Goal

  1. Automatically scrape the opening lines of NCAA tournament games from the web.
  2. Scrape the final scores from the web.
  3. Get both into a tidy data frame (or two).

The ideal goal is to obtain a data frame that looks something like:

##        TeamName.x     TeamName.y Spread
## 1       Villanova        Radford   23.0
## 2          Purdue   CS Fullerton   20.5
## 3      Texas Tech      SF Austin   11.0
## 4      Wichita St       Marshall   11.5
## 5   West Virginia      Murray St   10.5
## 6         Florida St Bonaventure    5.5
## 7        Arkansas         Butler   -1.5
## 8   Virginia Tech        Alabama    2.5
## 9          Kansas           Penn   14.5
## 10           Duke           Iona   20.0
## 11    Michigan St       Bucknell   14.5
## 12         Auburn Col Charleston    9.0
## 13        Clemson  New Mexico St    4.5
## 14            TCU       Syracuse    4.0
## 15   Rhode Island       Oklahoma    1.5
## 16     Seton Hall       NC State    3.0
## 17       Virginia           UMBC   21.0
## 18     Cincinnati     Georgia St   14.0
## 19      Tennessee      Wright St   11.5
## 20        Arizona        Buffalo    8.5
## 21       Kentucky       Davidson    5.0
## 22       Miami FL Loyola-Chicago    2.0
## 23         Nevada          Texas    0.0
## 24      Creighton      Kansas St    1.0
## 25         Xavier    TX Southern   19.5
## 26 North Carolina       Lipscomb   19.5
## 27       Michigan        Montana   10.0
## 28        Gonzaga UNC Greensboro   13.5
## 29        Ohio St    S Dakota St    8.0
## 30        Houston   San Diego St    4.0
## 31      Texas A&M     Providence    2.5
## 32       Missouri     Florida St   -1.5

This works as long as the sign (+/-) of the spread variable is applied consistently to either team x or team y. I think the data frame would also be fine if it gave only one team per matchup.

Other notes:
These two scrapes can possibly be done in the same task if the site has both sets of data.
I want to work with a website where team names are not abbreviated, to make the eventual join easier.
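Even with unabbreviated names, two sites may punctuate the same team differently. A minimal sketch of a join key that smooths over this, assuming the differences are limited to case, punctuation, and spacing; `team_key` is a hypothetical helper name, and it will not reconcile true abbreviations such as "Col Charleston":

```r
# Hypothetical helper (not part of the original notebook): normalize a team
# name into a key that can be used for joining two data frames.
team_key <- function(name) {
  key <- tolower(name)
  key <- gsub("-", " ", key, fixed = TRUE)  # treat hyphens as word breaks
  key <- gsub("[^a-z0-9& ]", "", key)       # drop periods, apostrophes, etc.
  key <- gsub("\\s+", " ", key)             # collapse repeated whitespace
  trimws(key)
}

team_key(c("St. Bonaventure", "St Bonaventure", "Loyola-Chicago"))
```

Both data frames could then get a `key` column built this way and be joined on it.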

Here are some websites I’ve found with spread info. I'm open to using different websites as well. Whatever works.

# Links to websites that have ncaa spreads
donURL <- "http://www.donbest.com/ncaab/odds/spreads/"
trURL <- "https://www.teamrankings.com/ncb/odds/"
vegasURL <- "http://www.vegasinsider.com/college-basketball/odds/las-vegas/"
oddssharkURL <- "https://www.oddsshark.com/ncaab/odds"
sbrURL <- "https://www.sportsbookreview.com/betting-odds/ncaa-basketball/pointspread/"

Donbest.com

This website presents its data in an HTML table.

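# This worked on Thursday, March 14
# Note: as.data.frame() works as intended only when the page yields a single table
don_df <- donURL %>% 
            read_html() %>% 
            html_nodes("table") %>%       # grab every <table> on the page
            html_table(fill = TRUE) %>%   # parse each table (returns a list)
            as.data.frame() %>% 
            select(X2, X3, X5)            # opener, team names, score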

head(don_df)
##                                                                                                      X2
## 1 COLLEGE BASKETBALL - Wednesday, March 27th  -  NIT TOURNAMENT - Quarterfinals - Bottom Teams are Home
## 2                                                                                                Opener
## 3                                                        +3.5\n             -110-3.5\n             -110
## 4                                                        +6.5\n             -110-6.5\n             -110
## 5    COLLEGE BASKETBALL - Wednesday, March 27th  -  CBI TOURNAMENT - Semifinals - Bottom Teams are home
## 6                                                                                                Opener
##                                                                                                      X3
## 1 COLLEGE BASKETBALL - Wednesday, March 27th  -  NIT TOURNAMENT - Quarterfinals - Bottom Teams are Home
## 2                                                                                                  Team
## 3                                                                                      LipscombNC State
## 4                                                                                         ColoradoTexas
## 5    COLLEGE BASKETBALL - Wednesday, March 27th  -  CBI TOURNAMENT - Semifinals - Bottom Teams are home
## 6                                                                                                  Team
##                                                                                                      X5
## 1 COLLEGE BASKETBALL - Wednesday, March 27th  -  NIT TOURNAMENT - Quarterfinals - Bottom Teams are Home
## 2                                                                                                    SC
## 3                                                                                                  3542
## 4                                                                                                  1944
## 5    COLLEGE BASKETBALL - Wednesday, March 27th  -  CBI TOURNAMENT - Semifinals - Bottom Teams are home
## 6                                                                                                    SC

The chunk above worked to some degree on Thursday 3/14. But on Friday 3/15 the scrape returned two tables of different dimensions, so they can’t be coerced into one data frame the way the code above does. I figured out why: if games are played after midnight EST, the site sets up a second table for those games, which was the case on Friday 3/15. In that scenario, the chunk below works. Trying again on 3/27, the chunk below does not work, but the chunk above does.

# This works for Friday, March 15
don_raw <- donURL %>% 
            read_html() %>% 
            html_nodes("table") %>% 
            html_table(fill = TRUE)
# This above object is a list with two elements, both of which are basically tables of data

don_df1 <- don_raw[[1]] %>% 
    select(X2, X3, X5)

don_df2 <- don_raw[[2]] %>% 
    select(X2, X3, X5)

df <- rbind(don_df1, don_df2)

df <- as_tibble(df)
head(df)

Whether an evening has post-midnight games (one table or two) can be tested with an if statement, or just handled manually.
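An alternative to branching is to stack however many tables come back. This is a rough sketch in base R, assuming every relevant table carries the columns X2, X3, and X5 as in the output above; `combine_tables` is a hypothetical helper name, not something from rvest or dplyr:

```r
# Sketch: combine a list of scraped tables into one data frame, whether the
# list holds one table or several.
combine_tables <- function(tables, cols = c("X2", "X3", "X5")) {
  # Keep only the tables that actually have the columns we need
  usable <- Filter(function(tbl) all(cols %in% names(tbl)), tables)
  # Pull those columns out of each table, then stack the rows
  picked <- lapply(usable, function(tbl) as.data.frame(tbl)[, cols, drop = FALSE])
  do.call(rbind, picked)
}

# Usage with the list returned by html_table():
# df <- combine_tables(don_raw)
```

Because the helper loops over the list, the same call should cover both the 3/14 case (one table) and the 3/15 case (two tables).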

Either way, this data needs some heavy regular-expression parsing, particularly the second column of team names, which I’m not 100% sure how to parse systematically. That’s the downside of this site. The upside is that it also provides final scores, in the third column, which needs parsing as well.
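One possible starting point for that parsing, sketched in base R. It assumes the two team names are fused with no separator and that the first name ends in a lowercase letter while the second starts with an uppercase one, as in "LipscombNC State". That heuristic will fail on names with internal capitals or all-caps names, so treat it as a rough first pass. `split_teams` and `parse_opener` are hypothetical helper names.

```r
# Split "LipscombNC State" at the first lowercase-to-uppercase boundary.
# Rows without such a boundary (e.g. header rows) come back as NA.
split_teams <- function(x) {
  parts <- regmatches(x, regexec("^(.*?[a-z])([A-Z].*)$", x, perl = TRUE))
  pick <- function(i) {
    vapply(parts, function(p) if (length(p) == 3) p[i] else NA_character_,
           character(1), USE.NAMES = FALSE)
  }
  data.frame(top = pick(2), bottom = pick(3), stringsAsFactors = FALSE)
}

# Pull the first signed number (the top team's opening spread) out of a cell
# like "+3.5\n   -110-3.5\n   -110". Cells not starting with a sign give NA.
parse_opener <- function(x) {
  x <- trimws(x)
  m <- regexpr("^[+-][0-9]+(\\.[0-9]+)?", x)
  out <- rep(NA_real_, length(x))
  out[m != -1] <- as.numeric(regmatches(x, m))
  out
}

split_teams(c("LipscombNC State", "ColoradoTexas", "Team"))
parse_opener(c("+3.5\n             -110-3.5\n             -110", "Opener"))
```

The score column (for example "3542") has no delimiter at all, so splitting it would need an extra assumption about how many digits each score has; that is left out here.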

TeamRankings.com

This website presents its data in HTML tables, but there are several of them and none of the ones scraped below is relevant to us.

# Attempt to scrape data from teamrankings
tr_raw <- trURL %>% 
    read_html() %>% 
    html_nodes("table") %>% 
    html_table(fill = TRUE, header = FALSE)

(tr_df1 <- tr_raw[[1]])
##                                X1  X2
## 1  DePaul vs. Coastal Car (Total) 2.5
## 2      Texas vs. Colorado (Total) 2.0
## 3  NC State vs. Lipscomb (Spread) 1.5
## 4 DePaul vs. Coastal Car (Spread) 1.5
## 5     Texas vs. Colorado (Spread) 1.0
## 6   NC State vs. Lipscomb (Total) 1.0
(tr_df2 <- tr_raw[[2]])
##                                X1     X2
## 1      Texas vs. Colorado (Total) 9:03PM
## 2  NC State vs. Lipscomb (Spread) 8:52PM
## 3 DePaul vs. Coastal Car (Spread) 8:01PM
(tr_df3 <- tr_raw[[3]])
##                       X1   X2
## 1 DePaul vs. Coastal Car -8.5
## 2     Texas vs. Colorado -5.0
## 3  NC State vs. Lipscomb -5.0
(tr_df4 <- tr_raw[[4]])
##                                                                         X1
## 1 No matching games for this category, or odds have not yet been released.
##                                                                         X2
## 1 No matching games for this category, or odds have not yet been released.
(tr_df5 <- tr_raw[[5]])
##                       X1    X2
## 1 DePaul vs. Coastal Car 165.0
## 2  NC State vs. Lipscomb 163.5
## 3     Texas vs. Colorado 137.0
(tr_df6 <- tr_raw[[6]])
##                                                                         X1
## 1 No matching games for this category, or odds have not yet been released.
##                                                                         X2
## 1 No matching games for this category, or odds have not yet been released.

None of the data we want is in those tables. The table that we DO actually want is at the XPath //*[@id="tab-latest-odds"], and I get the feeling the spreads will be easy to parse once we get it. This site doesn’t give final scores along with the spreads, but those can be scraped separately, probably without much difficulty.
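A sketch of how that XPath could be targeted with rvest. The HTML below is a stand-in so the example runs offline: the real page's structure, column names, and the "Team A"/"Team B" values are all assumptions. If the site fills that element with JavaScript, `read_html()` will not see the table and this approach will not work as is.

```r
library(rvest)

# Stand-in HTML with made-up placeholder values; swap in read_html(trURL)
# to try this against the real page.
page <- read_html('<div id="tab-latest-odds"><table>
  <tr><th>Matchup</th><th>Spread</th></tr>
  <tr><td>Team A vs. Team B</td><td>-3.5</td></tr>
</table></div>')

# Select the table at (or inside) the element with that id
latest <- page %>%
  html_node(xpath = '//*[@id="tab-latest-odds"]/descendant-or-self::table') %>%
  html_table()

# Split "X vs. Y" into the two team-name columns from the goal data frame
teams <- strsplit(latest$Matchup, " vs. ", fixed = TRUE)
latest$TeamName.x <- vapply(teams, `[`, character(1), 1)
latest$TeamName.y <- vapply(teams, `[`, character(1), 2)
latest
```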

Vegas Insider

This website presents its data in HTML tables.

# Attempt to scrape data from vegasinsider
vi_raw <- vegasURL %>% 
    read_html() %>% 
    html_nodes("table") %>% 
    html_table(fill = TRUE)

If you look at vi_raw[[5]], vi_raw[[9]], and vi_raw[[11]] you’ll see some useful information, but in dreadfully unhelpful formats. Not my choice right now.

Sportsbook Review

This website's data is NOT in an HTML table.

# Attempt to scrape data from sportsbookreview

Oddsshark

This website's data is NOT in an HTML table.

# Attempt to scrape data from oddsshark

The downside is that final scores are not available in the same scrape.