MLB Odds Analysis

Setup and Load Packages:

knitr::opts_chunk$set(echo = TRUE)

# Load necessary packages:
library(dplyr)
library(ggplot2)
library(readr)
library(knitr)
library(retrosheet)
library(lubridate)
library(kableExtra)
library(bigrquery)
library(DBI)

Introduction

For my Google Data Analytics Capstone Project, I chose track 2. I used the six phases of data analysis for this project (ask, prepare, process, analyze, share, & act). I used several tools to complete this project, specifically spreadsheets (excel), SQL (BigQuery) and R ( R Studio).

Background

In this case study scenario, I have been hired to analyze Major League Baseball odds data for FantasyAce, a website that provides projections, software tools and advice for fantasy sports and sports betting.

Ask

Business Question: How accurate are the lines set by sports books on Major League Baseball games?

Sportsbooks employ skilled handicappers with access to huge amounts of historic data in order to create lines in advance of various games and events. Then they adjust these odds in reaction to the money wagered. For example, a book may post an opening line for runs scored in a game at 8.5. As money is wagered, the lines may begin to change. Because the books keep track of every wager placed by every player, they know who the skilled, winning players are, and they will adjust their lines in reaction to these players. If enough money is wagered on the over by winning professional bettors, the book may adjust the line up to 9. This process creates efficient lines that leverage the historical data generated by the sportsbook, as well as the experience and skill of the handicappers they employ and the professional gamblers who wager on these lines. For this reason, FantasyAce uses these numbers to identify profitable betting opportunities, create more accurate player projections, and generally advise their customers on the sports betting landscape. However, we need to better understand the strengths and weaknesses of the various odds in order to improve our understanding of their predictive value and communicate this information to the customer. Specifically, we want to answer the following questions:

How often does the money line favorite cover?
How often does the total favorite cover?
How often does the run line odds favorite cover?

For the purpose of this analysis, we will define a cover as follows:

money line cover = money line < opponent money line AND runs > opponent runs
over total = (runs + oppRuns) > total
under total = (runs + oppRuns) < total
Run line: (oppRuns - runs) < runLine

Prepare

FantasyAce has provided me with Major League Baseball odds data from online sportsbook ActionSports.com. There are 10 separate csv files, one for each season from 2012 to 2021. The data has one row for each team involved in every regular season game. The columns consist of team, date, and the corresponding odds from the sports book. I uploaded each of these files to BigQuery. The data does not include the game results, so I needed to collect this information and attach it to the odds data. Specifically I needed the number of runs scored by each team. I also needed to calculate the total runs scored by both teams in a game, equal to runs plus opponent runs. I used an R package called “retrosheet” to obtain this game data. This package is a collection of tools used to import baseball data from the website retrosheet.org. The data was obtained free of charge from, and is copyrighted by, retrosheet.

Process

I used excel to open and work with the 10 odds data csv files. I created pivot tables and a plot for each file.

I then uploaded these files to BigQuery. I used SQL to join these tables together and then uploaded these into R Studio.

# Create a query to join odds files:
query <- "SELECT *
FROM
(
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2012`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2013`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2014`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2015`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2016`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2017`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2018`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2019`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2020`
  UNION ALL
  SELECT * FROM `vast-logic-393202.baseballOdds.odds2021`
)"

# Run query
oddsData <- dbGetQuery(con, query)

head(oddsData)

## # A tibble: 6 × 10
##   date       at    team  gameNumber  line runLine runLineOdds total overOdds
##   <date>     <chr> <chr>      <int> <int>   <dbl>       <int> <dbl>    <int>
## 1 2021-05-27 V     COL            1   115     1.5        -220   5       -120
## 2 2021-05-27 H     NYM            1  -125    -1.5         195   5       -120
## 3 2021-05-29 V     COL            1   124     1.5        -230   5       -120
## 4 2021-05-29 H     PIT            1  -134    -1.5         200   5       -120
## 5 2021-06-18 V     TB             1  -101     1.5        -215   7.5     -120
## 6 2021-06-18 H     SEA            1  -109    -1.5         190   7.5     -120
## # ℹ 1 more variable: underOdds <int>

I then used the “retrosheet” package to obtain the necessary game data. I saved this as “gameLogs”.

# Define the range of years to get data for:
years <- 2012:2021

# Create an empty data frame to store the data for all years:
combined_gamelog <- data.frame()

# Loop through each year and fetch the game log data:
for (year in years) {
  gamelog <- getPartialGamelog(year, glFields = c('Date', 'DblHdr', 'VisTm', 'HmTm', 'VisRuns',
                                                  'HmRuns', 'ParkID', 'VisH', 'VisD', 'VisT',
                                                  'VisHR', 'VisRBI', 'VisHBP', 'VisBB', 'VisSB',
                                                  'VisER', 'HmH', 'HmD', 'HmT', 'HmHR',
                                                  'HmRBI', 'HmHBP', 'HmBB', 'HmSB', 'HmER',
                                                  'VisStPchID', 'VisStPchNm', 'HmStPchID', 'HmStPchNm'))
  
# Append the current year's data to the combined data frame:
  combined_gamelog <- rbind(combined_gamelog, gamelog)
}

Analyze

The game logs have data for both teams in each game in a single row. The odds data contains a separate row for each team in every game, so I had to divide the game logs into separate rows for each team. In order to merge this data with the odds data, I completed the following steps:

separate rows into visiting and home team columns
change ‘visiting’ and ‘home’ team to columns to ‘team’ and ‘opponent’
changed the team abbreviations
changed date column from integer to date format
created game number column from double header info
created ‘season’ column by extracting year from date column
selected columns needed for analysis

# Create table for visiting teams:
visitor.gl <- combined_gamelog

# Rename the remaining columns:
colnames(visitor.gl) <- c('date', 'DblHdr', 'team', 'opponent', 'runs',
                          'oppRuns', 'ParkID', 'hits', 'doubles', 'triples',
                          'HR', 'RBI', 'HBP', 'BB', 'SB',
                          'pitcherERs', 'oppHits', 'oppDoubles', 'oppTriples', 'oppHR',
                          'oppRBI', 'oppHBP', 'oppBB', 'oppSB', 'oppPitcherERs',
                          'pitcherID', 'pitcherName', 'oppPitcherID', 'oppPitcherName')

# Create table for home teams:
home.gl <- combined_gamelog

# Rename the remaining columns:
colnames(home.gl) <- c('date', 'DblHdr', 'opponent', 'team', 'oppRuns',
                       'runs', 'ParkID', 'oppHits', 'oppDoubles', 'oppTriples',
                       'oppHR', 'oppRBI', 'oppHBP', 'oppBB', 'oppSB',
                       'oppPitcherERs', 'hits', 'doubles', 'triples', 'HR',
                       'RBI', 'HBP', 'BB', 'SB', 'pitcherERs',
                       'oppPitcherID', 'oppPitcherName', 'pitcherID', 'pitcherName')

# Reorder the columns to match visitor.gl:
home.gl <- home.gl[, c('date', 'DblHdr', 'team', 'opponent', 'runs',
                       'oppRuns', 'ParkID', 'hits', 'doubles', 'triples',
                       'HR', 'RBI', 'HBP', 'BB', 'SB',
                       'pitcherERs', 'oppHits', 'oppDoubles', 'oppTriples', 'oppHR',
                       'oppRBI', 'oppHBP', 'oppBB', 'oppSB', 'oppPitcherERs',
                       'pitcherID', 'pitcherName', 'oppPitcherID', 'oppPitcherName')]


# Combine home and visitor to create full table of games:
gameLogs <- rbind(visitor.gl, home.gl)


# Change team abbreviations to match odds data:
Find <- c('WAS', 'TOR', 'TBA', 'SLN', 'SFN', 'SEA', 'PIT', 'PHI', 'NYA', 'MIN', 
          'MIL', 'MIA', 'LAN', 'KCA', 'COL', 'CHA', 'BOS', 'ATL', 'ANA', 'ARI', 
          'CIN', 'DET', 'CLE', 'TEX', 'OAK', 'BAL', 'SDN', 'HOU', 'NYN', 'CHN')

Replace <- c('WSH', 'TOR', 'TB', 'STL', 'SF', 'SEA', 'PIT', 'PHI', 'NYY', 'MIN', 
             'MIL', 'MIA', 'LAD', 'KC', 'COL', 'CWS', 'BOS', 'ATL', 'LAA', 'ARI', 
             'CIN', 'DET', 'CLE', 'TEX', 'OAK', 'BAL', 'SD', 'HOU', 'NYM', 'CHC')

# Change team abbreviations:
gameLogs$team <- Replace[match(gameLogs$team, Find)]
gameLogs$opponent <- Replace[match(gameLogs$opponent, Find)]

# Change date column from integer to proper date format:
gameLogs$date <- as.Date(as.character(gameLogs$date), format = "%Y%m%d")

# Create gameNumber column (needed to join with odds data) from double header information:
gameLogs <- gameLogs %>%
  mutate(gameNumber = ifelse(DblHdr %in% c(0, 1), 1, 2))

# Create 'season' column by extracting year from date column:
gameLogs$season <- year(gameLogs$date)

# Select columns: 
gameLogs <- gameLogs %>%
  select(date, team, opponent, runs, oppRuns, gameNumber, season)

# Remove unnecessary items:
rm(home.gl, visitor.gl, gamelog, combined_gamelog)

# View gameLogs:
head(gameLogs)

##         date team opponent runs oppRuns gameNumber season
## 1 2012-03-28  SEA      OAK    3       1          1   2012
## 2 2012-03-29  SEA      OAK    1       4          1   2012
## 3 2012-04-04  STL      MIA    4       1          1   2012
## 4 2012-04-05  TOR      CLE    7       4          1   2012
## 5 2012-04-05  BOS      DET    2       3          1   2012
## 6 2012-04-05  WSH      CHC    2       1          1   2012

I then merged this data with our odds data. I also created separate columns for:

opponent money line
opponent run line
opponent run line odds
total runs (runs + opponent runs)

# Merge gameLogs and oddsData:
data <- merge(gameLogs, oddsData, by = c("date", "team", "gameNumber"), all = TRUE)

# Create temporary data frame to add opponent line data:
oddsDataTemp <- data[, c("date", "team", "gameNumber", "opponent", "line", "runLine", "runLineOdds")]

# Merge with original data, matching opponent in original data to team in temporary data:
dataTemp <- merge(data, oddsDataTemp, 
                  by.x = c("date", "opponent", "gameNumber"), 
                  by.y = c("date", "team", "gameNumber"))

# Rename columns:
names(dataTemp)[names(dataTemp) %in% c("line.x", "runLine.x", "runLineOdds.x")] <- c("moneyLine", "runLine", "runLineOdds")
names(dataTemp)[names(dataTemp) %in% c("line.y", "runLine.y", "runLineOdds.y")] <- c("oppMoneyLine", "oppRunLine", "oppRunLineOdds")

# Select columns and assign dataTemp to data: 
data <- dataTemp %>% 
  select(
    date, season, team, opponent, runs, oppRuns,
    moneyLine, runLine, runLineOdds, total, overOdds, underOdds,
    oppMoneyLine, oppRunLine, oppRunLineOdds
  )

# Create 'totalRuns' column :
data$totalRuns <- data$runs + data$oppRuns

# Delete oddsDataTemp:
rm(oddsDataTemp)

# Delete dataTemp:
rm(dataTemp)

# View data:
str(data)

## 'data.frame':    45530 obs. of  16 variables:
##  $ date          : Date, format: "2012-03-28" "2012-03-28" ...
##  $ season        : num  2012 2012 2012 2012 2012 ...
##  $ team          : chr  "SEA" "OAK" "SEA" "OAK" ...
##  $ opponent      : chr  "OAK" "SEA" "OAK" "SEA" ...
##  $ runs          : int  3 1 1 4 4 1 1 3 2 0 ...
##  $ oppRuns       : int  1 3 4 1 1 4 0 2 1 4 ...
##  $ moneyLine     : int  -140 120 -108 -112 160 -180 -111 -144 -139 110 ...
##  $ runLine       : num  -1.5 1.5 1.5 -1.5 1.5 -1.5 -1.5 -1.5 -1.5 1.5 ...
##  $ runLineOdds   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ total         : num  7 7 8.5 8.5 7 7 7 7 6.5 7.5 ...
##  $ overOdds      : int  -110 -110 105 105 -125 -125 -105 -120 110 -125 ...
##  $ underOdds     : int  -110 -110 -125 -125 105 105 -115 100 -130 105 ...
##  $ oppMoneyLine  : int  120 -140 -112 -108 -180 160 -109 124 119 -130 ...
##  $ oppRunLine    : num  1.5 -1.5 -1.5 1.5 -1.5 1.5 1.5 1.5 1.5 -1.5 ...
##  $ oppRunLineOdds: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ totalRuns     : int  4 4 5 5 5 5 1 5 3 4 ...

Before I started my analysis, I used the summary() function to check the data for missing values. There are some NA values in the runLine and runLineOdds columns that I need to account for later on.

summary(data)

##       date                season         team             opponent        
##  Min.   :2012-03-28   Min.   :2012   Length:45530       Length:45530      
##  1st Qu.:2014-06-01   1st Qu.:2014   Class :character   Class :character  
##  Median :2016-08-08   Median :2016   Mode  :character   Mode  :character  
##  Mean   :2016-10-08   Mean   :2016                                        
##  3rd Qu.:2019-04-01   3rd Qu.:2019                                        
##  Max.   :2021-10-03   Max.   :2021                                        
##                                                                           
##       runs           oppRuns         moneyLine          runLine       
##  Min.   : 0.000   Min.   : 0.000   Min.   :-550.00   Min.   :-1.5000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:-141.00   1st Qu.:-1.5000  
##  Median : 4.000   Median : 4.000   Median :-106.00   Median : 1.5000  
##  Mean   : 4.425   Mean   : 4.425   Mean   : -19.42   Mean   : 0.0001  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 126.00   3rd Qu.: 1.5000  
##  Max.   :29.000   Max.   :29.000   Max.   : 425.00   Max.   : 1.5000  
##                                                      NA's   :438      
##   runLineOdds          total           overOdds        underOdds      
##  Min.   :-270.00   Min.   : 4.500   Min.   :-142.0   Min.   :-142.00  
##  1st Qu.:-155.00   1st Qu.: 7.500   1st Qu.:-115.0   1st Qu.:-115.00  
##  Median :-110.00   Median : 8.500   Median :-110.0   Median :-110.00  
##  Mean   : -20.62   Mean   : 8.394   Mean   : -67.9   Mean   : -71.31  
##  3rd Qu.: 135.00   3rd Qu.: 9.000   3rd Qu.:-105.0   3rd Qu.:-105.00  
##  Max.   : 235.00   Max.   :15.000   Max.   : 122.0   Max.   : 122.00  
##  NA's   :9722                                                         
##   oppMoneyLine       oppRunLine      oppRunLineOdds      totalRuns     
##  Min.   :-550.00   Min.   :-1.5000   Min.   :-270.00   Min.   : 1.000  
##  1st Qu.:-141.00   1st Qu.:-1.5000   1st Qu.:-155.00   1st Qu.: 5.000  
##  Median :-106.00   Median : 1.5000   Median :-110.00   Median : 8.000  
##  Mean   : -19.42   Mean   : 0.0001   Mean   : -20.62   Mean   : 8.849  
##  3rd Qu.: 126.00   3rd Qu.: 1.5000   3rd Qu.: 135.00   3rd Qu.:11.000  
##  Max.   : 425.00   Max.   : 1.5000   Max.   : 235.00   Max.   :38.000  
##                    NA's   :438       NA's   :9722

This gave me a sample of 45,530 rows of data representing each team involved in 22,765 games. I selected a sample row from our data to help explain each type of line. I then started my analysis by calculating the count of each type of line, how often the favorite covered this line, and the percentage of time the line covered. I used this data to create a summary table and a plot for each of our line types.

# Select money line example and create table:
moneyLineExample <- kable(data[10, c(3, 4, 7:15)], format = "pipe")

# Perform calculations for money line and generate HTML table:
moneyLineSumTbl <- data %>%
  reframe(
    Money_Line = c("favorites", "dogs", "even"),
    Count = c(
      sum(moneyLine < oppMoneyLine),
      sum(moneyLine > oppMoneyLine),
      sum(moneyLine == oppMoneyLine)
    ),
    Cover = c(
      sum(moneyLine < oppMoneyLine & runs > oppRuns),
      sum(moneyLine > oppMoneyLine & runs > oppRuns),
      "-"
    ),
    Percentage = c(
      round(
        sum(runs > oppRuns & moneyLine < oppMoneyLine, na.rm = TRUE) /
          sum(moneyLine < oppMoneyLine, na.rm = TRUE) * 100,
        1
      ),
      round(
        sum(runs > oppRuns & moneyLine > oppMoneyLine, na.rm = TRUE) /
          sum(moneyLine > oppMoneyLine, na.rm = TRUE) * 100,
        1
      ),
      "-"
    )
  ) %>%
  kable("html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = "striped", position = "center") %>%
  row_spec(0, extra_css = "border-bottom: 2px solid black;") %>%
  add_header_above(c("Money Line Summary" = 4))

# Generate money line plot (win percentage by money line):
moneyLinePlot1 <- data %>%
  group_by(moneyLine) %>%
  summarise(
    Count = n(),
    Avg_Difference = round(mean(runs - oppRuns), 2),
    Std_Deviation = round(sd(runs - oppRuns), 2),
    Percentage = mean(runs > oppRuns, na.rm = TRUE)
  ) %>%
  filter(Count >= 20) %>% 
  ggplot(aes(factor(moneyLine), Percentage)) +
  geom_point(color = "dodgerblue") +
  labs(x = "Money Line", y = "Win Percentage") +
  ggtitle("Win Percentage by Money Line")+
  theme_minimal() +
  theme(panel.background = element_rect(fill = "honeydew", color = "black", linewidth = 2))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10) ) +
  scale_x_discrete(breaks = c(-350, -260, -210, -191, -180, -170, -160, -150, 
                              -140, -130, -120, -110, 100, 110, 120, 130, 140, 
                              150, 160, 173, 195, 245))

#######################################################################################

# Select game total example and create table:
totalExample <- kable(data[10021, c(3, 4, 7:15)], format = "pipe")

# game total summary:
totalSummary <- data %>% 
  filter(runs > oppRuns) %>%
  summarise(count = sum(!is.na(total)),
            totalOverFav = sum(overOdds < underOdds),
            totalUnderFav = sum(overOdds > underOdds),
            totalEven = sum(overOdds == underOdds),
            overCover = sum(overOdds < underOdds & totalRuns > total),
            underCover = sum(underOdds < overOdds & totalRuns < total),
            totalCover = sum((overOdds < underOdds & totalRuns > total) | (underOdds < overOdds & totalRuns < total)),
            notCover = sum((overOdds < underOdds & totalRuns < total) | (underOdds < overOdds & totalRuns > total)),
            push = sum(total == totalRuns),
            overCoverPct = round((overCover / totalOverFav)*100, 1),
            underCoverPct = round((underCover / totalUnderFav)*100, 1),
            coverPct = round((totalCover / (totalCover + notCover + push))*100, 1),
            notCoverPct = round((notCover / (totalCover + notCover + push))*100, 1),
            pushPct = round((push / (totalCover + notCover + push))*100, 1),
            totalLines = totalOverFav + totalUnderFav
  )

# Create game total summary table:
totalSummaryTable <- data.frame(
  Total = c("over", "under", "even", "total*"),
  Count = c(totalSummary$totalOverFav, totalSummary$totalUnderFav, 
            totalSummary$totalEven, totalSummary$totalLines),
  Cover = c(totalSummary$overCover, totalSummary$underCover, "-", totalSummary$totalCover),
  Percentage = c(totalSummary$overCoverPct, totalSummary$underCoverPct, "-", totalSummary$coverPct)
)

# Generate html table:
totalSumTbl <- totalSummaryTable %>%
  kable("html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = "striped", position = "center") %>%
  row_spec(0, extra_css = "border-bottom: 2px solid black;") %>%
  add_header_above(c("Total Summary" = 4))%>%
  add_footnote("excludes even odds.", escape = FALSE)

# Generate total odds plot (cover percentage by total odds)
totalPlot1 <- data %>%
  group_by(overOdds) %>%
  summarise(
    count = n(),
    overCover = sum(totalRuns > total),
    overMiss = sum(totalRuns < total),
    overPush = sum(totalRuns == total)
  ) %>%
  filter(count >= 20) %>%
  transmute(
    "Odds" = overOdds,
    "count" = count,
    "Percentage Covered" = round(overCover / (overCover + overMiss + overPush), 2),
    "Cover" = overCover,
    "Non Cover" = overMiss,
    "Push" = overPush,
    "Favorite" = "Over"  # Add a new column to identify the favorite
  ) %>%
  bind_rows(data %>%
              group_by(underOdds) %>%
              summarise(
                count = n(),
                underCover = sum(totalRuns < total),
                underMiss = sum(totalRuns > total),
                underPush = sum(totalRuns == total)
              ) %>%
              filter(count >= 20) %>%
              transmute(
                "Odds" = underOdds,
                "count" = count,
                "Percentage Covered" = round(underCover / (underCover + underMiss + underPush), 2),
                "Cover" = underCover,
                "Non Cover" = underMiss,
                "Push" = underPush,
                "Favorite" = "Under"  
              )) %>%
  ggplot(aes(as.factor(Odds), `Percentage Covered`, color = Favorite)) +
  geom_point() +
  labs(title = "Cover Percentage by Total Odds") +
  scale_color_manual(values = c("Over" = "dodgerblue", "Under" = "red")) +  
  theme_minimal() +
  theme(panel.background = element_rect(fill = "honeydew", color = "black", linetype = 1))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))+
  scale_x_discrete(labels = function(x) ifelse(seq_along(x) %% 2 == 0, x, ""))+
  xlab("Total Odds") +
  theme(legend.position = c(0.9, 0.8), 
        legend.box.background = element_rect(color = "black", linewidth = 1)
  )

#######################################################################################

# Select run line example and create table:
runLineExample <- kable(data[20187, c(3, 4, 7:15)], format = "pipe")

runLineSummary <-data %>% 
  filter(!is.na(runLineOdds)) %>% 
  summarise(runLineFavCount = sum(runLine < oppRunLine),
            count = n(),
            runLineFavRunLineOddsFavCount = sum(runLine < oppRunLine & runLineOdds < oppRunLineOdds),
            runLineFavRunLineOddsDogCount = sum(runLine < oppRunLine & runLineOdds > oppRunLineOdds),
            runLineFavRunLineOddsFavCov = sum(runLine < oppRunLine & runLineOdds < oppRunLineOdds & 
                                                ((oppRuns - runs) < runLine)), 
            runLineFavRunLineOddsDogCov = sum(runLine < oppRunLine & runLineOdds > oppRunLineOdds & 
                                                ((oppRuns - runs) < runLine)),
            runLineFavRunLineOddsFavCovPct = round(runLineFavRunLineOddsFavCov / runLineFavRunLineOddsFavCount * 100, 1),
            runLineFavRunLineOddsDogCovPct = round(runLineFavRunLineOddsDogCov / runLineFavRunLineOddsDogCount * 100, 1),
            runLineFavTotalCoverPct = round(((runLineFavRunLineOddsFavCov + runLineFavRunLineOddsDogCov) /
                                               (runLineFavRunLineOddsFavCount + runLineFavRunLineOddsDogCount)) * 100, 1),
            runLineDogCount = sum(runLine > oppRunLine),
            runLineDogRunLineOddsFavCount = sum(runLine > oppRunLine & runLineOdds < oppRunLineOdds),
            runLineDogRunLineOddsDogCount = sum(runLine > oppRunLine & runLineOdds > oppRunLineOdds),
            runLineDogRunLineOddsFavCov = sum(runLine > oppRunLine & runLineOdds < oppRunLineOdds & 
                                                ((oppRuns - runs) < runLine)),
            runLineDogRunLineOddsDogCov = sum(runLine > oppRunLine & runLineOdds > oppRunLineOdds & 
                                                ((oppRuns - runs) < runLine)),
            runLineDogRunLineOddsFavCovPct = round(runLineDogRunLineOddsFavCov / runLineDogRunLineOddsFavCount * 100, 1),
            runLineDogRunLineOddsDogCovPct = round(runLineDogRunLineOddsDogCov / runLineDogRunLineOddsDogCount * 100, 1),
            runLineDogTotalCoverPct = round(((runLineDogRunLineOddsFavCov + runLineDogRunLineOddsDogCov) /
                                               (runLineDogRunLineOddsFavCount + runLineDogRunLineOddsDogCount)) * 100, 1),
            runLineOddsCount = sum(!is.na(runLineOdds)),
            runLineOddsFavCount = sum(runLineOdds < oppRunLineOdds),
            runLineOddsDogCount = sum(runLineOdds > oppRunLineOdds),
            runLineOddsCoverCount = sum((runLineOdds < oppRunLineOdds & (oppRuns - runs) < runLine) | 
                                          (runLineOdds > oppRunLineOdds & (oppRuns - runs) < runLine)),
            totalRunLineOddsFavCov = sum(runLineOdds < oppRunLineOdds & (oppRuns - runs) < runLine),
            totalRunLineOddsDogCov = sum(runLineOdds > oppRunLineOdds & (oppRuns - runs) < runLine),
            totalRunLineOddsDogCovPct = round((totalRunLineOddsDogCov / runLineOddsDogCount) * 100, 1),
            totalRunLineOddsFavCovPct = round((totalRunLineOddsFavCov / runLineOddsFavCount) * 100, 1),
            runLineOddsDogCoverPct = round((runLineFavRunLineOddsDogCov + runLineDogRunLineOddsDogCov) /
                                             runLineOddsDogCount * 100, 1)
            
  )

# Create run line summary table:
runLineSummaryTable <- data.frame(
  Run_Line = c("-1.5", "1.5", "-"),
  count = c(runLineSummary$runLineFavCount, runLineSummary$runLineDogCount, runLineSummary$runLineOddsCount),
  oddsFavorite = c(runLineSummary$runLineFavRunLineOddsFavCount, runLineSummary$runLineDogRunLineOddsFavCount,
                   (runLineSummary$runLineFavRunLineOddsFavCount + runLineSummary$runLineDogRunLineOddsFavCount)),
  oddsFavCovers = c(runLineSummary$runLineFavRunLineOddsFavCov, runLineSummary$runLineDogRunLineOddsFavCov, 
                    (runLineSummary$runLineFavRunLineOddsFavCov + runLineSummary$runLineDogRunLineOddsFavCov)),
  oddsFavCoverPct = c(runLineSummary$runLineFavRunLineOddsFavCovPct, runLineSummary$runLineDogRunLineOddsFavCovPct,
                      runLineSummary$totalRunLineOddsFavCovPct),
  oddsDog = c(runLineSummary$runLineFavRunLineOddsDogCount, runLineSummary$runLineDogRunLineOddsDogCount,
              (runLineSummary$runLineFavRunLineOddsDogCount + runLineSummary$runLineDogRunLineOddsDogCount)),
  oddsDogCovers = c(runLineSummary$runLineFavRunLineOddsDogCov, runLineSummary$runLineDogRunLineOddsDogCov,
                    (runLineSummary$runLineFavRunLineOddsDogCov + runLineSummary$runLineDogRunLineOddsDogCov)),
  oddsDogCoverPct = c(runLineSummary$runLineFavRunLineOddsDogCovPct, runLineSummary$runLineDogRunLineOddsDogCovPct,
                      runLineSummary$runLineOddsDogCoverPct),
  totalCoversPct = c(
    round((((runLineSummary$runLineFavRunLineOddsFavCov + runLineSummary$runLineFavRunLineOddsDogCov) / 
              (runLineSummary$runLineFavRunLineOddsFavCount + runLineSummary$runLineFavRunLineOddsDogCount)) * 100), 1),
    round(((runLineSummary$runLineDogRunLineOddsFavCov + runLineSummary$runLineDogRunLineOddsDogCov) /
             (runLineSummary$runLineDogRunLineOddsFavCount + runLineSummary$runLineDogRunLineOddsDogCount)) * 100, 1),
    round(((runLineSummary$runLineDogRunLineOddsFavCov + runLineSummary$runLineDogRunLineOddsDogCov +
              runLineSummary$runLineFavRunLineOddsFavCov + runLineSummary$runLineFavRunLineOddsDogCov) /
             (runLineSummary$runLineDogRunLineOddsFavCount + runLineSummary$runLineDogRunLineOddsDogCount +
                runLineSummary$runLineDogRunLineOddsFavCount + runLineSummary$runLineDogRunLineOddsDogCount)) *100, 1)
  )
)

# Rename columns:
colnames(runLineSummaryTable) <- c("Run Line", "Count", "Favorites", "Cover", "Percentage",
                                   "Dogs", "Cover", "Percentage", "Total Percentage")

# Create html table:
runLineSumTbl <- runLineSummaryTable %>%
  kable("html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = "striped", position = "center") %>%
  row_spec(0, extra_css = "border-bottom: 2px solid black;") %>%
  add_header_above(c("Run Line Summary" = 9))

# Plot percentage of run lines covered by run line odds:
runLinePlot1 <- data %>%
  filter(!is.na(runLineOdds) & runLine < 0) %>% 
  group_by(runLineOdds, runLine) %>%
  summarise(
    Count = n(),
    Percentage = mean(oppRuns - runs < runLine, na.rm = TRUE)) %>%
  filter(Count >= 20) %>%
  transmute(
    "Odds" = runLineOdds,
    "Count" = Count,
    "Percentage Covered" = Percentage,
    "Run Line" = "Favorite"  
  ) %>%
  bind_rows(data %>%
              filter(!is.na(runLineOdds) & runLine > 0) %>% 
              group_by(runLineOdds, runLine) %>%
              summarise(
                Count = n(),
                Percentage = mean(oppRuns - runs < runLine, na.rm = TRUE)) %>%
              filter(Count >= 20) %>%
              transmute(
                "Odds" = runLineOdds,
                "Count" = Count,
                "Percentage Covered" = Percentage,
                "Run Line" = "Dog"  
              )) %>%
  ggplot(aes(as.factor(Odds), `Percentage Covered`, color = `Run Line`)) +
  geom_point() +
  labs(title = "Cover Percentage by Run Line Odds") +
  scale_color_manual(values = c("Favorite" = "dodgerblue", "Dog" = "red")) +  
  theme_minimal() +
  theme(panel.background = element_rect(fill = "honeydew", color = "black", linetype = 1))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))+
  scale_x_discrete(labels = function(x) ifelse(seq_along(x) %% 5 == 0, x, ""))+
  xlab("Run Line Odds") +
  theme(legend.position = c(0.9, 0.8), 
        legend.box.background = element_rect(color = "black", linetype = 1)
  )

I can then present the results of my analysis in a flexdashboard.

View My Dashboard

Act

My recommendations for the client are included in the ‘Summary’ tab of my dashboard.

MLB Odds Analysis

Chris Treasure

2024-03-01