1. Poisson regression model for fitting football scores

General model

The Italian Serie A is considered from the season 2011/2012 to 2017/2018. Every season in the league is constituted by a total of T = 18 temans, playing each other twice in a season (one at home and one visiting). We indicate the number of goals scored by the home and by the visiting team in the gth game as \(y_{g1}\) and \(y_{g2}\) respectively. The elements of the vector of observed counts \(\mathbf{y} = (y_{g1}, y_{g2})\) are modelled as independent Poisson:



\[y_{gj}|\lambda_{gj} \sim Poisson(\lambda_{gj})\]

conditionally on the parameters \(\mathbf{\lambda} = (\lambda_{g1},\lambda_{g2})\). which represent the scoring intensity or the expected goals in the \(gth\) for the team playing at home (\(j = 1\)) and visiting (\(j = 2\)). We also suppose that exists for every team some latent parameters which represent its attack and defense abilities. In that way, the scoring intensity \(\mathbf{\lambda}\) could be explained as a combination of these parameteres and maybe other effects. As it is natural for Poisson distribution, the logarithmic is applied over the \(\mathbf{\lambda}\) parameter as link function for defining the generalized linera model. In such way we’ll have:

\[log(\lambda_{g1}) = \gamma + \eta*(1-I(nv) ) + \alpha_{h(g)} + \beta_{a(g)}\] \[log(\lambda_{g2}) = \gamma - \eta*(1-I(nv) ) + \alpha_{a(g)} + \beta_{h(g)}\]

where \(nv\) indicate *neutral venue" and \(I(nv)\) is an indicatior function over \(nv\) that indicate if the game is played in neutral venue or not. So \(I(nv) = 1\) for a neutral venue game, and \(I(nv) = 0\) otherwise. The nested indexes \(h(g), a(g)\) identify the team playing at home and visiting (away) in the \(gth\) game.

Finally, given \(t \in T\) where \(T\) is the set with all teams available in the data and \(t\) is a team in particular, this model leads to once fitted it to data, we will decribe each team with a pair of numbers \((\alpha_t, \beta_t)\). In few words (we’ll get deeper soon) \(\alpha_t\) represent the strength in attack of the team \(t\) respect the other teams and some constraints, and \(\beta_t\) represent the weakness in deffense of the team \(t\) (also respect the other teams and some constraints). It means that a higher \(\alpha_t\) the better for team \(t\) and a higher \(\beta_t\) the worse for team \(t\).

Interpretation of model parameters

\(\lambda_{g1}\) represents the expected goals of the home team at game \(g\) given the parameters \(\theta_h = (\gamma, eta, nv, \alpha_{h}, \beta_{a})\) (the \(g\) was removed from the notation)



\[\lambda_h = E(y_h|\theta_h)\]

Interpreting the parameters on the log scale is difficult so let’s exponentiate:

\[\lambda_h = E(y_h|\theta_h) = e^{\gamma + \eta*(1-I(nv)) + \alpha_h + \beta_a} = e^{\gamma + \eta*(1-I(nv) )}e^{\alpha_h} e^ {\beta_a}\]

As we’ll see soon, an average team will be represented with parameters \(\alpha_t = 1\) and \(\beta_t = 1\). Let’s interpret the parameters given different scenarios:

  1. An average abilities team against another average abilities team \((\alpha_t = 0, \beta_t = 0)\)

The model is reduced to:

\[E(y_h) = e^{\gamma + \eta*(1-I(nv) )}\] \[E(y_a) = e^{\gamma - \eta*(1-I(nv) )}\]

In case of neutral venue:



\[E(y_h) = E(y_a)= e^{\gamma}\]

\(e^\lambda\) is interpreted as the expected goals when both teams have average abilities and there is not any venue advantage



In case of \(I(nv) = 0\):

\[E(y_h) = e^{\gamma + \eta} = e^{\gamma} e^{\eta} \] \[E(y_a) = e^{\gamma -\eta} = e^{\gamma} \frac{1}{e^{\eta}}\]

The same quantity \(\eta\) is added to the exponent of home team as it is subtracted from visitor team, taking into account the effect of playing at home or playing as visitor. Supposing that \(\eta > 0\) then \(e^\eta > 1\) the neutral expected value for home team will increase, but will decrease for the visitor team. Therefore the quantity \(e^\eta\) expresses the relative change on the average-teams expected goals given the role of the competitors( visitor or home team)

Let’s name this quantity as \(\lambda_0\) expressing the average-teams expected goals for a team playing as home, neutral or away.

\[E(y_{avg}) = e^{\gamma \pm\eta} = \lambda_0\]

  1. Only strength attack is non-zero (non average)

The quantity \(\alpha_h\) was added to the linear combination of parameters. The quantity \(e^{\alpha_h}\) expresses the relative change over the average-teams expected goals \(\lambda_0\) caused by the home team attack strength. If \(\alpha_h > 0 \implies e^{\alpha_h}>1\) meaning that the average-teams expected goals will increase. In opposite case, it will decrese.

\[E(y_h) = e^{\gamma + \eta + \alpha_h} = \lambda_0*e^{\alpha_h}\]

  1. Only defence is non-zero (non average teams)

\(\beta_{a(g)}\) is added to the equation and the effect is similiar to the former one. Be aware that if \(\beta_a > 0 \implies e^{\beta_a} > 1\) and the expected value for the home team will be higher. It means that \(\beta_a\) quantifies the defence weakness. The bigger \(\beta_a\) is, the more goals the allow. Again, the quantity \(e^{\beta_a}\) expresses the relative change over the average-teams expected goals \(\lambda_0\) caused by the opposite team weakness.

\[E(y_h) = e^{\gamma + \eta + \beta_a} = \lambda_0 e^{\beta_a} \]

  1. All paramerters are different from zero. The realtion would be:

\[E(y_h) = e^{\gamma + \eta + \alpha_h + \beta_a} = \lambda_0 e^{\lambda_h + \beta_a} = \lambda_0 e^{\alpha_h}e^{\beta_a} \]

Each effect expresses a relative change over the expected value if itself were equal to zero. For example, if away team have a weakness defence equal to zero (average team) the expected value for home team would be: \(\lambda_0 e^{\alpha_h} = \lambda_0 \lambda_h = \lambda_{0h}\). If we put now a team with \(\beta_a \neq 0\) then \(E(y_h) = \lambda_{0h}e^{\beta_a}\),

Constraints:

Additional constraints or assumptions may be required to make this model identifiable. Suppose the expected value of a game is exactly \(\lambda_*=\lambda_0e^{\alpha_h + \beta_a}\) multiplying by \(\frac{e^{k}}{e^{k}} = e^ke^{-k}\) leads to \(\lambda_* = \lambda_0e^{\alpha_h + k + \beta_a - k}\) grouping we’d have \(\alpha_h(k) = \alpha_h + k\) and \(\beta_{a}(k) = \beta_a - k\) what leads to \(\lambda_* = \lambda_0e^{\alpha_{h}(k) + \beta_a(k)}\). It shows that given a \(\lambda_*\) there are infinite solutions for \(\alpha_h\) and \(\beta_a\).

Let’s talk about two ways of constraint the model.

  1. Corner Constraint: Select a team as baseline. Set the strength and weakness of this baseline-team to zero \(\alpha_{tb} = 0\) and \(\beta_{tb} = 0\). This team will be used as reference and all other team-parameters will be referted as relative changes over this team.

  2. Sum-to-zero (STZ) constraints: The following constraint is imposed over the set of all teams:

\[\sum_{i = 1}^{T} \alpha_i = 0 \; \;\; \;\; \;\; \;\; \;\; \;\; \; \sum_{i = 1}^{T} \beta_i = 0\]



What is equivalent to setting the average-abilities team as baseline. The interpreation of the parameters we did lines up was using the STZ constraint. On another hand, these constraints will be incorporated in future models taking into account that:

\[\alpha_1 = -\sum_{i = 2}^{T} \alpha_i \; \;\; \;\; \;\; \;\; \;\; \;\; \;\beta_1 = - \sum_{i = 1}^{T} \beta_i\]

The STZ constraint leads to interpret all the parameters as deviation respect the average. This is the constraint we will use.

Getting and cleaning the data

This section will be divided in: 1. Download the data 2. Creating a SQLite data base for saving data 3. Cleaning the data

1. Download the data

Let’s get Italian Serie A data from 2011-2012 season to 20517-2018 season from the web http://www.football-data.co.uk/italym.php.

Let’s create a function to download data.

#'@return Full dataframe with all seasons required.
download_data <- function(fromSeason = "1112", toSeason = "1718", serie = "A"){
  
  #'@description Generate format vector to download seasons
  get_seasons <- function(){
  
    # Cut fromSeason, toSeason
    from_headSeason <- substr(fromSeason,1,2)
    to_headSeason <- substr(toSeason, 1, 2)
    
    from_backSeason <- substr(fromSeason, 3, 4)
    to_backSeason <- substr(toSeason, 3, 4)
    
    # Generate seasons
    headSeason <- from_headSeason:to_headSeason
    backSeason <- from_backSeason:to_backSeason
    
    # Paste 
    seasons <- str_c(headSeason, backSeason)
    
    return(seasons)
  }
  
  seasons <- get_seasons()
  if(serie == "A"){serie_csv = "I1.csv"}else if(serie == "B"){serie_csv = "I2.csv"}else{return("error")}
  
  # Optionally put as function argument but not needed as we always use the same 
  dirs <- str_c("http://www.football-data.co.uk/mmz4281", seasons, serie_csv, sep = "/")
  
  # Download dataframes and arrange it in a list
  full_df <- furrr::future_map2_dfr(dirs, seasons, ~ read_csv(.x, col_types = cols(), progress = FALSE) %>%
                   mutate(Season = as.numeric(str_c("20", str_sub(.y, 1, 2))))
  )
  
  # Bind list in only one df
  # full_df <- bind_rows(lOfDfs)
  # Create a id_game
  full_df <- full_df %>% mutate(id_game = 1:n())
  
  return(full_df)
}

# download data
df_full <- download_data()
head(df_full) # inspect df
## # A tibble: 6 x 78
##   Div   Date  HomeTeam AwayTeam  FTHG  FTAG FTR    HTHG  HTAG HTR      HS
##   <chr> <chr> <chr>    <chr>    <int> <int> <chr> <int> <int> <chr> <int>
## 1 I1    09/0~ Milan    Lazio        2     2 D         2     2 D        18
## 2 I1    10/0~ Cesena   Napoli       1     3 A         1     1 D        11
## 3 I1    11/0~ Catania  Siena        0     0 D         0     0 D         9
## 4 I1    11/0~ Chievo   Novara       2     2 D         2     1 H        11
## 5 I1    11/0~ Fiorent~ Bologna      2     0 H         1     0 H        15
## 6 I1    11/0~ Genoa    Atalanta     2     2 D         1     2 A        15
## # ... with 67 more variables: AS <int>, HST <int>, AST <int>, HF <int>,
## #   AF <int>, HC <int>, AC <int>, HY <int>, AY <int>, HR <int>, AR <int>,
## #   B365H <dbl>, B365D <dbl>, B365A <dbl>, BWH <dbl>, BWD <dbl>,
## #   BWA <dbl>, GBH <dbl>, GBD <dbl>, GBA <dbl>, IWH <dbl>, IWD <dbl>,
## #   IWA <dbl>, LBH <dbl>, LBD <dbl>, LBA <dbl>, SBH <dbl>, SBD <dbl>,
## #   SBA <dbl>, WHH <dbl>, WHD <dbl>, WHA <dbl>, SJH <dbl>, SJD <dbl>,
## #   SJA <dbl>, VCH <dbl>, VCD <dbl>, VCA <dbl>, BSH <dbl>, BSD <dbl>,
## #   BSA <dbl>, Bb1X2 <int>, BbMxH <dbl>, BbAvH <dbl>, BbMxD <dbl>,
## #   BbAvD <dbl>, BbMxA <dbl>, BbAvA <dbl>, BbOU <int>, `BbMx>2.5` <dbl>,
## #   `BbAv>2.5` <dbl>, `BbMx<2.5` <dbl>, `BbAv<2.5` <dbl>, BbAH <int>,
## #   BbAHh <dbl>, BbMxAHH <dbl>, BbAvAHH <dbl>, BbMxAHA <dbl>,
## #   BbAvAHA <dbl>, Season <dbl>, PSH <dbl>, PSD <dbl>, PSA <dbl>,
## #   PSCH <dbl>, PSCD <dbl>, PSCA <dbl>, id_game <int>

2. Creating a SQLite data base for saving data

As there are multiple columns sharing information of different tipology, let’s make a little database in SQLite:

createSQLite <- function(dat, root = "~/MEGA/smart_odds/project", dir_db = "data/football_Italy_db.sqlite3"){
  
  olddb <- getwd()
  setwd(root)
  # Games Information
  game_info <- dat %>%
    select(id_game, HomeTeam, AwayTeam, Date, Season, Div)

  # Games numbers
  game_numbers <- dat %>% select(id_game, FTHG:AR)
  # Games market values
  game_market_values <- dat %>% select(id_game, B365H:PSCA, -Season)
  # Create DB
  football_db <- src_sqlite(dir_db, create = TRUE)
  
  copy_to(
    dest = football_db,
    df = as.data.frame(game_info),
    name = "game_info",
    indexes = list("id_game"),
    temporary = FALSE,
    overwrite = TRUE
  )
  
  copy_to(
    dest = football_db,
    df = as.data.frame(game_numbers),
    name = "game_numbers",
    indexes = list("id_game"),
    temporary = FALSE,
    overwrite = TRUE
  )
  
  copy_to(
    dest = football_db,
    df = as.data.frame(game_market_values),
    indexes = list("id_game"),
    name = "game_market_values",
    temporary = FALSE,
    overwrite = TRUE
  )
  
  # Get tables
  SQLtables <- dbListTables(football_db$con)[1:3]
  
  print("Created DB with tables: ")
  print(data_frame(SQLtables))
  
  # Disconnect from database
  dbDisconnect(football_db$con)
  setwd(olddb)
  
}

file_root <- "~/MEGA/smart_odds/project"
db_path <- "data/football_Italy_db.sqlite3"
createSQLite(df_full, root = file_root, dir_db = db_path)
## [1] "Created DB with tables: "
## # A tibble: 3 x 1
##   SQLtables         
##   <chr>             
## 1 df_model          
## 2 game_info         
## 3 game_market_values

3. Cleaning and wrangling

Let’s connect to the database and creating a starter useful dataframe:

df_full_path <- str_c(file_root, db_path, sep = "/")
football_db <- dbConnect(drv = SQLite(), dbname= df_full_path)

dbListTables(football_db)[1:3]
## [1] "df_model"           "game_info"          "game_market_values"
### Get DF from DB
df_info <- tbl(football_db, "game_info")
df_stat <- tbl(football_db, "game_numbers")


# Get data for the model
df <- inner_join(df_info %>%
                   select(id_game, HomeTeam, AwayTeam, Season),
                 df_stat %>%
                   select(id_game, FTHG, FTAG, HST, AST, HR, AR), by = "id_game") %>% 
  select(id_game, HomeTeam, AwayTeam, FTHG, FTAG, Season, HST, AST, HR, AR) %>% 
  collect()

head(df)
## # A tibble: 6 x 10
##   id_game HomeTeam   AwayTeam  FTHG  FTAG Season   HST   AST    HR    AR
##     <int> <chr>      <chr>    <int> <int>  <dbl> <int> <int> <int> <int>
## 1       1 Milan      Lazio        2     2   2011     8     5     0     0
## 2       2 Cesena     Napoli       1     3   2011     3     6     1     0
## 3       3 Catania    Siena        0     0   2011     1     2     0     0
## 4       4 Chievo     Novara       2     2   2011     4     4     1     0
## 5       5 Fiorentina Bologna      2     0   2011     7     2     0     0
## 6       6 Genoa      Atalanta     2     2   2011     4     5     1     0

The column:

  • id_game: identify each game
  • FTHG/FTAG: Full time home/away goals
  • HST/AST: Full time home/away shots-on-targets
  • HR/AR: Full time home/away red cards

Missing Values

Let’s remove all rows with NA values. (We could keep the id_game = 415 but as finally we will use all data we will prefer just remove such row)

df[!complete.cases(df),] # Show rows with missing values
## # A tibble: 6 x 10
##   id_game HomeTeam AwayTeam  FTHG  FTAG Season   HST   AST    HR    AR
##     <int> <chr>    <chr>    <int> <int>  <dbl> <int> <int> <int> <int>
## 1     415 Cagliari Roma         0     3   2012    NA    NA    NA    NA
## 2     761 <NA>     <NA>        NA    NA   2012    NA    NA    NA    NA
## 3     762 <NA>     <NA>        NA    NA   2012    NA    NA    NA    NA
## 4     763 <NA>     <NA>        NA    NA   2012    NA    NA    NA    NA
## 5    1524 <NA>     <NA>        NA    NA   2014    NA    NA    NA    NA
## 6    1905 <NA>     <NA>        NA    NA   2015    NA    NA    NA    NA
df <- drop_na(df) # drops rows with na values
df[!complete.cases(df),] # Na rows removed 
## # A tibble: 0 x 10
## # ... with 10 variables: id_game <int>, HomeTeam <chr>, AwayTeam <chr>,
## #   FTHG <int>, FTAG <int>, Season <dbl>, HST <int>, AST <int>, HR <int>,
## #   AR <int>
# Be carefully because now the order change regard to the id_games

Look-up table for teams

It will be useful to have a look-up table for all the teams. We will identify all teams with a identifier number:

df_all_seasons_teams <- data_frame(teams = unique(c(df$HomeTeam, df$AwayTeam))) %>%
  mutate(teams = factor(teams, levels = teams)) %>%
  mutate(id_team_global = as.numeric(teams)) %>%
  mutate(teams = as.character(teams))


copy_to(
  dest = football_db,
  df = as.data.frame(df_all_seasons_teams),
  indexes = list("teams"),
  name = "look-up-teams",
  temporary = FALSE,
  overwrite = TRUE
)

dbListTables(football_db)[1:4]
## [1] "df_model"           "game_info"          "game_market_values"
## [4] "game_numbers"
head(df_all_seasons_teams)
## # A tibble: 6 x 2
##   teams      id_team_global
##   <chr>               <dbl>
## 1 Milan                   1
## 2 Cesena                  2
## 3 Catania                 3
## 4 Chievo                  4
## 5 Fiorentina              5
## 6 Genoa                   6

Let’s incorporate this identification on the dataset:

df_ready <- df %>%
  inner_join(df_all_seasons_teams, by = c("HomeTeam"="teams")) %>%
  rename(HomeTeam_id = id_team_global) %>%
  inner_join(df_all_seasons_teams, by = c("AwayTeam"="teams")) %>%
  rename(AwayTeam_id = id_team_global) %>%
  select(id_game:AwayTeam, HomeTeam_id, AwayTeam_id, everything())
head(df_ready) 
## # A tibble: 6 x 12
##   id_game HomeTeam AwayTeam HomeTeam_id AwayTeam_id  FTHG  FTAG Season
##     <int> <chr>    <chr>          <dbl>       <dbl> <int> <int>  <dbl>
## 1       1 Milan    Lazio              1          15     2     2   2011
## 2       2 Cesena   Napoli             2          16     1     3   2011
## 3       3 Catania  Siena              3          18     0     0   2011
## 4       4 Chievo   Novara             4          20     2     2   2011
## 5       5 Fiorent~ Bologna            5          14     2     0   2011
## 6       6 Genoa    Atalanta           6          13     2     2   2011
## # ... with 4 more variables: HST <int>, AST <int>, HR <int>, AR <int>

Finally saving the dataframe in the database:

copy_to(
  dest = football_db,
  df = as.data.frame(df_ready),
  indexes = list("id_game"),
  name = "df_model",
  temporary = FALSE,
  overwrite = TRUE
)

Turning off the connection with database:

RSQLite::dbDisconnect(football_db)

Exploratory Data Analasys

Exploring data to get some insights and answering some questions.

file_root <- "~/MEGA/smart_odds/project"
db_path <- "data/football_Italy_db.sqlite3"
# 3. Cleaning and Wrangling
## Connecting with db
df_full_path <- str_c(file_root, db_path, sep = "/")
football_db <- dbConnect(drv = SQLite(), dbname= df_full_path)
df <- tbl(football_db, "df_model") %>% collect()

First we are going to see how a Poisson model fits on data. It’s possible to observe how the home expected values is higher that away expected values.

df_t1 <- bind_rows(df %>% select(team = HomeTeam, goals = FTHG) %>% mutate(role = "home"),
          df %>% select(team = AwayTeam, goals = FTAG) %>% mutate(role = "away"))

df_t1_poiss <- df_t1 %>% group_by(role) %>% summarise(lambda = mean(goals), n = n())

lambda_home <- df_t1_poiss %>% 
  filter(role == "home") %>%
  pull(lambda)

lambda_away <- df_t1_poiss %>% 
  filter(role == "away") %>%
  pull(lambda)  
  
ng <- df_t1_poiss %>% 
  filter(role == "home") %>%
  pull(n)




# Comparing home vs away
df_t1 %>% ggplot() + 
  geom_bar(aes(x = goals, fill = role, y = ..prop..), position = "dodge") +
  geom_vline(xintercept = lambda_home, col = "blue") +
  geom_vline(xintercept = lambda_away, col = "red") +
  annotate("text", x = lambda_home + 0.5, y = 0.4, label = paste("H =", round(lambda_home,2)), col = "blue") + 
  annotate("text", x = lambda_away - 0.5, y = 0.4, label = paste("A =", round(lambda_away,2)), col = "red") +
  ggtitle("Home vs Away goals distribution ") +
  theme(plot.title = element_text(hjust = 0.5))

## Poisson simulations It’s possible to see that poisson distribution fits well the data.

# Comparing simulated from real 
## Simulating some new data
set.seed(345)
df_t1_sim <- data_frame(home_goals = rpois(n = ng, lambda = lambda_home),
                        away_goals = rpois(n = ng, lambda = lambda_away)) %>%
                        gather(role, goals) %>%
  mutate(role = if_else(role == "home_goals", "home", "away"))

# Wrangling
df_t2 <- bind_rows(
  df_t1 %>%
   count(role, goals) %>% 
  mutate(probs = n/ng,
         type = "real"),
# Wrangling
df_t1_sim %>%
   count(role, goals) %>% 
  mutate(probs = n/ng,
         type = "sim"))
# Plot
df_t2 %>% 
  ggplot() + 
  geom_bar(aes(x = as.factor(goals), fill = type, y = probs), stat = "identity", position = "dodge") +
  facet_grid(.~ role, margins = FALSE) +
  ggtitle("Goals versus simulated goals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Goals")

df_participation <- df %>%
  group_by(HomeTeam) %>%
  summarise(participation = length(unique(Season))) %>%
  arrange(participation) %>%
  mutate(ord = 1:length(participation))

ggplot(data = inner_join(df, df_participation, by = "HomeTeam"), 
       aes(x = factor(Season), y = fct_reorder(HomeTeam, ord))) +
  geom_point() +
  labs(x = "Seasons", y = "Teams") +
  ggtitle("Team participation across seasons") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Goals")