Stadiums, home advantages, etc of EPL clubs

A look at the stadiums

colnames(decade_epl)
##  [1] "X"                "Competition_Name" "Gender"           "Country"         
##  [5] "Season_End_Year"  "Round"            "Wk"               "Day"             
##  [9] "Date"             "Time"             "Home"             "HomeGoals"       
## [13] "Away"             "AwayGoals"        "Attendance"       "Venue"           
## [17] "Referee"          "Notes"            "MatchURL"         "Home_xG"         
## [21] "Away_xG"
decade_epl %>% group_by(Venue) %>% count() %>% arrange(desc(n))
## # A tibble: 39 × 2
## # Groups:   Venue [39]
##    Venue                  n
##    <chr>              <int>
##  1 Goodison Park        210
##  2 Anfield              209
##  3 Emirates Stadium     209
##  4 Etihad Stadium       209
##  5 Old Trafford         209
##  6 St. Mary's Stadium   209
##  7 Stamford Bridge      209
##  8 Selhurst Park        208
##  9 King Power Stadium   171
## 10 St. James' Park      171
## # ℹ 29 more rows

Did some clubs have a change of stadium names?

teamsMultipleVenues <- decade_epl %>% group_by(Home) %>% summarise(nVenues = n_distinct(Venue)) %>% filter(nVenues > 1) %>% pull(Home)

multipleStadiums <- decade_epl %>% filter(Home %in% teamsMultipleVenues) %>% dplyr::select(Home, Venue) %>% distinct() %>% group_by(Home) %>% summarise(Venues = paste((Venue), collapse = ', '))
multipleStadiums
## # A tibble: 8 × 2
##   Home           Venues                                                     
##   <chr>          <chr>                                                      
## 1 Crystal Palace Selhurst Park, Goodison Park                               
## 2 Hull City      KCOM Stadium, Kingston Communications Stadium              
## 3 Leicester City King Power Stadium, Vicarage Road Stadium                  
## 4 Newcastle Utd  St. James' Park, St James' Park                            
## 5 Stoke City     Britannia Stadium, Bet365 Stadium                          
## 6 Tottenham      White Hart Lane, Wembley Stadium, Tottenham Hotspur Stadium
## 7 Watford        Vicarage Road Stadium, King Power Stadium                  
## 8 West Ham       Boleyn Ground, London Stadium

Some of these are only slight differences in spelling or abbreviation (e.g. St. James' Park vs St James' Park), and others are genuine changes of stadium. There are also some glaring mistakes in the data that have to be fixed, e.g. Goodison Park listed as Crystal Palace’s home ground, and Vicarage Road Stadium appearing as a Leicester City home ground (with King Power Stadium appearing under Watford).
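One way to surface such errors systematically, rather than by eye, is to flag Home/Venue pairs that make up only a small share of a club's home matches. The following is a minimal base R sketch on made-up data: the `fixtures` data frame, the `threshold` name and the 20% cut-off are all illustrative assumptions, not part of the original analysis.

```r
# Hypothetical toy fixtures (not taken from decade_epl)
fixtures <- data.frame(
  Home  = c(rep("Crystal Palace", 6), rep("Everton", 6)),
  Venue = c(rep("Selhurst Park", 5), "Goodison Park", rep("Goodison Park", 6)),
  stringsAsFactors = FALSE
)

# Count matches per Home/Venue pair and drop pairs that never occur
pair_counts <- as.data.frame(
  table(Home = fixtures$Home, Venue = fixtures$Venue),
  stringsAsFactors = FALSE
)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]

# Share of each club's home matches played at each venue
pair_counts$share <- pair_counts$Freq /
  ave(pair_counts$Freq, pair_counts$Home, FUN = sum)

# Flag pairs below an arbitrary threshold (assumed value, tune as needed)
threshold <- 0.2
suspicious <- pair_counts[pair_counts$share < threshold, ]
print(suspicious)
```

A rare pair is only a candidate for inspection: a short but genuine stay at a temporary ground would be flagged in the same way, so each flagged row still needs a manual check.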

## Cleaning required -- Goodison Park belongs to Everton
decade_epl %>% filter(Venue == 'Goodison Park', Home == 'Crystal Palace') %>% select(Date, Home, Away, Venue, HomeGoals, AwayGoals) %>% ## obvious error -- Should be Selhurst Park
  mutate(Venue = 'Selhurst Park')
##         Date           Home      Away         Venue HomeGoals AwayGoals
## 1 2017-04-26 Crystal Palace Tottenham Selhurst Park         0         1
decade_epl <- decade_epl %>% mutate(Venue = ifelse(Venue == 'Goodison Park' & Home == 'Crystal Palace', 
                                                   'Selhurst Park', Venue))

## Cleaning -- Standardise Hull City
decade_epl %>% filter(Home == 'Hull City') %>% select(Home, Venue) %>% distinct()
##        Home                           Venue
## 1 Hull City                    KCOM Stadium
## 2 Hull City Kingston Communications Stadium
decade_epl <- decade_epl %>% mutate(Venue = ifelse(Home == 'Hull City', 
                                                   'Kingston Communications Stadium', Venue))
## Leicester
decade_epl <- decade_epl %>% mutate(Venue = ifelse(Home == 'Leicester City', 'King Power Stadium', Venue))
## Watford
decade_epl <- decade_epl %>% mutate(Venue = ifelse(Home == 'Watford', 'Vicarage Road Stadium', Venue))
## Newcastle 
decade_epl <- decade_epl %>% mutate(Venue = ifelse(Home == 'Newcastle Utd', "St. James' Park", Venue))
## Stoke
decade_epl <- decade_epl %>% mutate(Venue = ifelse(Home == 'Stoke City', 'Bet365 Stadium', Venue))
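
The repeated `ifelse()` calls above could also be expressed as a single lookup from club to canonical venue. This is a base R sketch on a hypothetical `matches` data frame; `canonical_venue` and `in_lookup` are names introduced here for illustration. Like the code above, it assumes each listed club has exactly one home ground over the period, so clubs that really moved must be left out of the lookup.

```r
# Hypothetical mini example with inconsistent venue spellings
matches <- data.frame(
  Home  = c("Hull City", "Hull City", "Newcastle Utd", "Newcastle Utd", "Arsenal"),
  Venue = c("KCOM Stadium", "Kingston Communications Stadium",
            "St. James' Park", "St James' Park", "Emirates Stadium"),
  stringsAsFactors = FALSE
)

# Canonical venue per single-venue club (names are clubs, values are venues)
canonical_venue <- c(
  "Hull City"     = "Kingston Communications Stadium",
  "Newcastle Utd" = "St. James' Park"
)

# Overwrite Venue only for clubs present in the lookup; other rows are untouched
in_lookup <- matches$Home %in% names(canonical_venue)
matches$Venue[in_lookup] <- unname(canonical_venue[matches$Home[in_lookup]])

print(unique(matches))
```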


## RELOOK AT THE STADIUMS NOW
teamsMultipleVenues <- decade_epl %>% group_by(Home) %>% summarise(nVenues = n_distinct(Venue)) %>% filter(nVenues > 1) %>% pull(Home)
decade_epl %>% filter(Home %in% teamsMultipleVenues) %>% dplyr::select(Home, Venue) %>% distinct() %>% group_by(Home) %>% summarise(Venues = paste((Venue), collapse = ', '))
## # A tibble: 2 × 2
##   Home      Venues                                                     
##   <chr>     <chr>                                                      
## 1 Tottenham White Hart Lane, Wembley Stadium, Tottenham Hotspur Stadium
## 2 West Ham  Boleyn Ground, London Stadium

Only two clubs have really had a change in home venue.

*West Ham: (Wikipedia) moved from the Boleyn Ground to the London Stadium in August 2016. The London Stadium was originally built for the 2012 Olympics, notably as the site of the opening and closing ceremonies.

*Spurs: (tottenhamhotspur.com) left White Hart Lane in May 2017, played home matches at Wembley Stadium throughout the 2017/18 season and for a number of matches in 2018/19, then moved into the Tottenham Hotspur Stadium in April 2019.

We can further verify the changes in the dataset later.

Home teams with the most goals scored and fewest conceded

stadiumGoalsScoredConceded <- decade_epl %>% group_by(Venue) %>% summarise(goalsScored = mean(HomeGoals), goalsConceded = mean(AwayGoals)) %>% arrange(desc(goalsScored))

head(stadiumGoalsScoredConceded)
## # A tibble: 6 × 3
##   Venue                     goalsScored goalsConceded
##   <chr>                           <dbl>         <dbl>
## 1 Etihad Stadium                   2.82         0.794
## 2 Anfield                          2.34         0.852
## 3 Emirates Stadium                 2.04         0.890
## 4 Wembley Stadium                  1.97         0.879
## 5 Tottenham Hotspur Stadium        1.91         1.04 
## 6 Stamford Bridge                  1.89         0.866
stadiumGoalsScoredConceded %>% head(10) %>% 
  ggplot(aes(x = reorder(Venue, -goalsScored))) + 
  geom_point(aes(y = goalsScored, color = 'goalScored')) + 
  geom_point(aes(y = goalsConceded, color = 'goalConceded')) +
  scale_color_manual(values = c('goalScored' = 'blue', 
                                'goalConceded' = 'red'))   + 
  labs(
    title = "Top 10 Venues by Goals Scored",
    x = "Venue",
    y = "Average Goals",
    color = "Legend"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
    plot.title = element_text(hjust = 0.5) # Center the title
  )

Over the past decade, City has averaged almost 3 goals scored per home game (2.82) while conceding about 0.8. Liverpool comes second, scoring almost 0.5 fewer goals per game (2.34) while conceding a similar number (about 0.06 more). On average, City has a goal difference of about +2 at home.

How have home advantages changed over the years?

decade_epl$Date <- as.Date(decade_epl$Date)
n <- 8
decade_epl %>% mutate(GoalDiff = HomeGoals - AwayGoals) %>% filter(Venue %in% head(stadiumGoalsScoredConceded, n)$Venue) %>% ggplot(aes(x = Date, y = GoalDiff, color = Venue)) +
  geom_smooth(se = FALSE)+
    labs(
    title = "Trend of Goal Difference Over Time by Venue",
    x = "Date",
    y = "Goal Difference (Home - Away)",
    color = "Venue"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Notably, because of the stadium changes, the Spurs trend is broken into separate segments for White Hart Lane, Wembley Stadium and Tottenham Hotspur Stadium. Plotting the Home team rather than the Venue as the categorical variable resolves these breaks.

## Plot by Home team instead to resolve multiple stadium names issue
decade_epl <- decade_epl %>% mutate(GoalDiff = HomeGoals - AwayGoals)
topHomeClubs <- decade_epl %>% group_by(Home) %>% summarise(meanGD = mean(GoalDiff)) %>% slice_max(order_by = meanGD, n = 10)
topHomeClubs
## # A tibble: 11 × 2
##    Home            meanGD
##    <chr>            <dbl>
##  1 Manchester City  2.03 
##  2 Liverpool        1.49 
##  3 Arsenal          1.15 
##  4 Chelsea          1.03 
##  5 Tottenham        0.890
##  6 Manchester Utd   0.861
##  7 Brentford        0.474
##  8 Everton          0.383
##  9 Leicester City   0.380
## 10 Nott'ham Forest  0.158
## 11 Stoke City       0.158
decade_epl  %>% filter(Home %in% topHomeClubs$Home) %>% ggplot(aes(x = Date, y = GoalDiff, color = Home)) +
  geom_smooth(se = FALSE) +
    labs(
    title = "Trend of Goal Difference Over Time by Team",
    x = "Date",
    y = "Goal Difference (Home - Away)",
    color = "Team"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Stoke City dropped off after 2018 and was relegated, while newer entrants such as Brentford and Nottingham Forest did well and started overtaking clubs like Chelsea, Everton and Leicester. Liverpool and Man City had similar trajectories: both dipped during 2016 (when Leicester won the title) and then rose to a peak in 2018-19, after which Liverpool slowly declined while City stabilised. Arsenal currently has the steepest upward trajectory, surpassing Liverpool in the latest year in the dataset. Spurs were competing with Liverpool and City at their 2017 peak, but have been falling since.

Impact of Venue changes: Visualisation and Regression Discontinuity Design

West Ham

decade_epl <- decade_epl %>% arrange(Date)
## Observe in the data when the change occurred
important_cols <- c('Date', 'Wk', 'Home', 'Away', 'Venue')

## When did the clubs change stadiums? -- West Ham
westhamStadiums <- decade_epl %>% filter(Home == 'West Ham') %>% select(Venue) %>% distinct() %>% pull()
## all_of() selects columns named in an external character vector without the tidyselect deprecation warning
westhamChangeDate <- decade_epl %>% filter(Venue %in% westhamStadiums) %>% dplyr::select(all_of(important_cols)) %>% 
  mutate(venueChange = Venue != lag(Venue, n = 1)) %>% filter(venueChange) %>% pull(Date)
print(sprintf('West Ham played at new stadium on : %s ', westhamChangeDate))
## [1] "West Ham played at new stadium on : 2016-08-21 "
## Did it come mid season?
decade_epl %>% filter(Home == 'West Ham', Season_End_Year %in% c(2016, 2017)) %>% distinct(Home, Season_End_Year, Venue)
##       Home Season_End_Year          Venue
## 1 West Ham            2016  Boleyn Ground
## 2 West Ham            2017 London Stadium
## Subset to get the seasons before and after i.e. 2016 and 2017
westhamChangeSubset <- decade_epl %>% filter(Home == 'West Ham', Season_End_Year %in% c(2016, 2017)) %>% select(Home, GoalDiff, Date, Venue)

## Visualise 
ggplot(data = westhamChangeSubset, aes(x = Date, y = GoalDiff, color = Venue)) + geom_point() +
  geom_smooth(method = 'lm', formula = y~x) +
  labs(title = 'Change in Home Advantage for West Ham?', 
       subtitle = 'Due to change in home venue', 
       y = 'Goal Difference at home')+
  theme_minimal() 

Regression Discontinuity Design

Assume that the West Ham team before and after the change is not too different. We can then isolate the effect of the stadium change on goal difference, which serves as a proxy for home advantage. Visually, the graph above suggests a drop in average goal difference after the move, and hence a drop in home advantage at the new stadium. Assuming the trend is linear on both sides of the change, we fit a regression to test whether goal difference jumps at the change date.
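
Written out, the specification fitted in the code that follows can be read as the equation below. The symbols are notation introduced here for illustration: $d_i$ stands for `daysSinceChange`, $D_i$ for the `Change` indicator, and $i$ indexes home matches.

```latex
\text{GoalDiff}_i = \beta_0 + \beta_1 D_i + \beta_2 d_i + \beta_3 \, (D_i \times d_i) + \varepsilon_i,
\qquad D_i = \mathbf{1}[\, d_i \ge 0 \,]
```

Under this reading, $\beta_0$ is the fitted goal difference just before the change, $\beta_1$ is the jump at the change date, $\beta_2$ is the slope before the change, and $\beta_2 + \beta_3$ is the slope after it.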

## Data prep for RDD
westhamRDDdata <- westhamChangeSubset %>% mutate(Change = ifelse(Date >= westhamChangeDate, 1, 0), 
                                            daysSinceChange = Date - westhamChangeDate)
rdd_westham <- lm(GoalDiff ~ Change + daysSinceChange + daysSinceChange*Change, 
                  data = westhamRDDdata)
rdd_wh_summary <-summary(rdd_westham)
print(rdd_wh_summary)
## 
## Call:
## lm(formula = GoalDiff ~ Change + daysSinceChange + daysSinceChange * 
##     Change, data = westhamRDDdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5117 -0.4898  0.5115  1.5190  3.6391 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             0.5897444  1.1416443   0.517    0.609
## Change                 -1.1148208  1.3825994  -0.806    0.426
## daysSinceChange         0.0007361  0.0046676   0.158    0.876
## Change:daysSinceChange -0.0015168  0.0067756  -0.224    0.824
## 
## Residual standard error: 1.74 on 34 degrees of freedom
## Multiple R-squared:  0.09401,    Adjusted R-squared:  0.01407 
## F-statistic: 1.176 on 3 and 34 DF,  p-value: 0.3333
## Days in between stadium change 
daysGapWH <- westhamRDDdata %>% dplyr::select(daysSinceChange) %>% filter(daysSinceChange <0) %>% top_n(1) %>% pull()
## Selecting by daysSinceChange

Here, we can see that

  • Intercept = 0.59: right before the stadium change, the fitted average GD is 0.59 (positive for the team)
  • Change = -1.11: the fitted GD drops by about 1.1 at the stadium change, but this is far from statistically significant (p = 0.43)
  • Change:daysSinceChange = -0.0015: the slope moves from around 0 to very slightly negative, but the difference is negligible and statistically insignificant (p = 0.82).
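
To get a feel for why a jump of this size can fail to reach significance, the sketch below simulates data with a known jump and compares the 95% confidence interval for the jump at two sample sizes. Everything here is an assumption made for illustration: the function name `simulate_jump_ci`, the true jump of -1, the noise level of 1.7 (chosen to be close to the residual standard error above) and the sample sizes.

```r
set.seed(1)  # fixed seed so the simulation is reproducible

# Simulate home goal differences around a venue change at day 0 and
# return the 95% confidence interval for the jump at the cutoff.
simulate_jump_ci <- function(n, true_jump = -1, noise_sd = 1.7) {
  days <- runif(n, min = -180, max = 180)   # running variable (days since change)
  post <- as.numeric(days >= 0)             # 1 on or after the change
  gd   <- 0.5 + true_jump * post + 0.001 * days + rnorm(n, sd = noise_sd)
  fit  <- lm(gd ~ post * days)              # separate intercept and slope per side
  confint(fit, "post", level = 0.95)
}

ci_small <- simulate_jump_ci(38)     # about two seasons of home matches
ci_large <- simulate_jump_ci(2000)   # far more data than one club can supply

print(ci_small)
print(ci_large)
```

The interval from the small sample is much wider than the one from the large sample. An interval that contains 0 corresponds to a coefficient that is not significant at the 5% level, which is the situation the real West Ham fit is in.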

Drawbacks: the change happened between seasons, so a lot could have happened during the 103 days in between. Furthermore, the assumption that the team is similar across a season break is unlikely to hold.

We move on to Spurs and repeat the analysis; a mid-season stadium change would make for a cleaner comparison.

Tottenham Hotspur

## When did the clubs change stadiums? -- Spurs
spursStadiums <- decade_epl %>% filter(Home == 'Tottenham') %>% select((Venue)) %>% distinct() %>% pull()
spursChangeDates <- decade_epl %>% filter(Home == 'Tottenham') %>% dplyr::select(all_of(important_cols)) %>% 
  mutate(venueChange = Venue != lag(Venue, n = 1)) %>% filter(venueChange) %>% pull(Date)
print(sprintf('Spurs played at new stadium on : %s ', paste(spursChangeDates, collapse = ', ')))
## [1] "Spurs played at new stadium on : 2017-08-20, 2019-04-03 "
## Observe 
decade_epl %>% filter(Home == 'Tottenham') %>% dplyr::select(all_of(important_cols), Season_End_Year) %>% 
  mutate(venueChange = Venue != lag(Venue, n = 1)) %>% filter(venueChange)
##         Date Wk      Home           Away                     Venue
## 1 2017-08-20  2 Tottenham        Chelsea           Wembley Stadium
## 2 2019-04-03 31 Tottenham Crystal Palace Tottenham Hotspur Stadium
##   Season_End_Year venueChange
## 1            2018        TRUE
## 2            2019        TRUE

The second change (moving from Wembley Stadium to the new Tottenham Hotspur Stadium, built on the White Hart Lane site) happens towards the end of the 2018/19 season, but not at the very end. We include some games from the start of the 2019/20 season to get more data after the stadium change.

spursChangeSubset <- decade_epl %>% filter(Home == 'Tottenham', Season_End_Year == 2019 | (Season_End_Year == 2020 & Wk < 10)) %>%  select(Home, GoalDiff, Date, Venue)

## Visualise 
ggplot(data = spursChangeSubset, aes(x = Date, y = GoalDiff, color = Venue)) + geom_point() +
  geom_smooth(method = 'lm', formula = y~x) +
  labs(title = 'Change in Home Advantage for Spurs?', 
       subtitle = 'Due to change in home venue', 
       y = 'Goal Difference at home')+
  theme_minimal() 

## Data prep for RDD
spursChangeDate <- spursChangeDates[2] ## take the second change date, i.e. the move from Wembley to Tottenham Hotspur Stadium
spursRDDdata <- spursChangeSubset %>% mutate(Change = ifelse(Date >= spursChangeDate, 1, 0), 
                                            daysSinceChange = Date - spursChangeDate)
rdd_spurs <- lm(GoalDiff ~ Change + daysSinceChange + daysSinceChange*Change, 
                  data = spursRDDdata)
rdd_spurs_summary <-summary(rdd_spurs)
print(rdd_spurs_summary)
## 
## Call:
## lm(formula = GoalDiff ~ Change + daysSinceChange + daysSinceChange * 
##     Change, data = spursRDDdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8742 -1.4178  0.0962  1.1026  4.1284 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)             0.958002   1.139021   0.841    0.410
## Change                  0.411421   1.475107   0.279    0.783
## daysSinceChange         0.000882   0.008933   0.099    0.922
## Change:daysSinceChange -0.002750   0.011974  -0.230    0.821
## 
## Residual standard error: 1.885 on 20 degrees of freedom
## Multiple R-squared:  0.01271,    Adjusted R-squared:  -0.1354 
## F-statistic: 0.08585 on 3 and 20 DF,  p-value: 0.967
## Days in between stadium change 
daysGapspurs <- spursRDDdata %>% dplyr::select(daysSinceChange) %>% filter(daysSinceChange <0) %>% top_n(1) %>% pull()
## Selecting by daysSinceChange

The number of days in between the stadium change for Spurs is 32.

The Change coefficient, i.e. the stadium effect on GD, is positive at 0.41, but once again statistically insignificant (p = 0.78).

This is understandable from the POV of football, where there is a lot of variability in goals.

Improving the RDD: adding covariates?

  • proxy for opponent strength: rolling average of opponent’s goal difference prior to match

By adding a control variable, we aim to reduce the residual variance in goal difference, so that the effect of the stadium change can be estimated with more precision.

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Turn into long format first 
long_df <- decade_epl %>% select(Date, Home, Away, HomeGoals, AwayGoals) %>% 
  tidyr::pivot_longer(cols = c(Home, Away), 
                      names_to = 'TeamType', 
                      values_to = 'Team') %>% 
  mutate(
    GoalsScored = ifelse(TeamType == 'Home', HomeGoals, AwayGoals), 
    GoalsConceded = ifelse(TeamType == 'Home', AwayGoals, HomeGoals), 
    GD = GoalsScored - GoalsConceded)

## Get rolling avg for each team 
rolling_avgs <- long_df %>% arrange(Team, Date) %>% ## sort by team, then date, so that each team's previous matches sit in the rows directly above
                distinct(Date, TeamType, Team, GD, .keep_all = TRUE) %>% ## the data contains some duplicate rows -- drop them
  group_by(Team) %>% 
  mutate(RollingGD = rollmean(lag(GD), k = 3, fill = NA, align = 'right')) %>% ## lag first, so the current match is excluded from its own average
  ungroup()
  
## Check that it makes sense 
head(rolling_avgs)
## # A tibble: 6 × 9
##   Date       HomeGoals AwayGoals TeamType Team   GoalsScored GoalsConceded    GD
##   <date>         <int>     <int> <chr>    <chr>        <int>         <int> <int>
## 1 2013-08-17         1         3 Home     Arsen…           1             3    -2
## 2 2013-08-24         1         3 Away     Arsen…           3             1     2
## 3 2013-09-01         1         0 Home     Arsen…           1             0     1
## 4 2013-09-14         1         3 Away     Arsen…           3             1     2
## 5 2013-09-22         3         1 Home     Arsen…           3             1     2
## 6 2013-09-28         1         2 Away     Arsen…           2             1     1
## # ℹ 1 more variable: RollingGD <dbl>
## MERGE BACK INTO ORIGINAL WIDE FORMAT -- MATCH UP THE OPPOSING TEAMS 
decade_epl_RA <- decade_epl %>% left_join(
  rolling_avgs %>% select(Team, Date, RollingGD) %>% rename(HomeRollingGD = RollingGD),  ## now name it as HomeRollingGD so that we can distinguish between the two rollingGDs 
  by = c('Home' = 'Team', 'Date')) %>%  ## join by team name and date -- here we are assuming that date-team is enough as unique identifier
  
  ## repeat leftjoin now for away team 
  left_join(
  rolling_avgs %>% select(Team, Date, RollingGD) %>% rename(AwayRollingGD = RollingGD),
  by = c('Away' = 'Team', 'Date')
  )

## Now, we presumably end up with a lot of NAs - since we are using rolling averages of lag (1) goal difference
decade_epl_RA <- decade_epl_RA %>% select(Date, Home, Away, GoalDiff, Venue, HomeRollingGD, AwayRollingGD, Season_End_Year, Wk)
missingtable <- colSums(is.na(decade_epl_RA))/nrow(decade_epl_RA) * 100 ## percentage missing 
missingtable
##            Date            Home            Away        GoalDiff           Venue 
##        0.000000        0.000000        0.000000        0.000000        0.000000 
##   HomeRollingGD   AwayRollingGD Season_End_Year              Wk 
##        1.913876        1.889952        0.000000        0.000000
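
The lagged rolling average can be checked by hand on a small vector. The sketch below is a base R stand-in for the same idea (it does not call `zoo::rollmean`, and edge handling may differ from zoo's): it takes Arsenal's first six GD values from the `head(rolling_avgs)` output above, shifts them by one match, and averages over a trailing window of 3. The variable names are introduced here for illustration.

```r
# Arsenal's first six goal differences, copied from the head(rolling_avgs) output
gd <- c(-2, 2, 1, 2, 2, 1)

# Shift by one match so each row only sees results from earlier matches
gd_lagged <- c(NA, head(gd, -1))

# Trailing 3-match mean: sides = 1 uses the current and the two previous values
rolling_gd <- as.numeric(stats::filter(gd_lagged, rep(1 / 3, 3), sides = 1))

print(data.frame(gd, gd_lagged, rolling_gd))
```

The first three rows are `NA` because fewer than three earlier matches are available. The fourth row is (-2 + 2 + 1) / 3, i.e. the mean of the three matches played before it, which is why some missing values are expected at the start of each team's series.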

Repeat the RDD with covariates

## Get the proxy for team strength differences 
decade_epl_RA <- decade_epl_RA %>% mutate(RollingGDDiff = HomeRollingGD - AwayRollingGD)

## For West Ham -- but now we use new dataset decade_epl_RA
westhamChangeSubset <- decade_epl_RA %>% filter(Home == 'West Ham', Season_End_Year %in% c(2016, 2017)) %>% select(Home, GoalDiff, Date, Venue, RollingGDDiff)

westhamRDDdata <- westhamChangeSubset %>% mutate(Change = ifelse(Date >= westhamChangeDate, 1, 0), 
                                            daysSinceChange = Date - westhamChangeDate)
rdd_westham_new <- lm(GoalDiff ~ Change + daysSinceChange + daysSinceChange*Change + RollingGDDiff, 
                  data = westhamRDDdata)
rdd_wh_summary_new <-summary(rdd_westham_new)
print(rdd_wh_summary_new)
## 
## Call:
## lm(formula = GoalDiff ~ Change + daysSinceChange + daysSinceChange * 
##     Change + RollingGDDiff, data = westhamRDDdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4445 -0.6286  0.4391  1.5295  3.7740 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             0.3754063  1.2212437   0.307    0.761
## Change                 -0.9133610  1.4608322  -0.625    0.536
## daysSinceChange        -0.0006210  0.0051585  -0.120    0.905
## RollingGDDiff          -0.0879658  0.2324908  -0.378    0.708
## Change:daysSinceChange -0.0003933  0.0072611  -0.054    0.957
## 
## Residual standard error: 1.771 on 32 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1104, Adjusted R-squared:  -0.0008116 
## F-statistic: 0.9927 on 4 and 32 DF,  p-value: 0.4257
## Change in p value for 'Change' parameter
old_p_val <- (rdd_wh_summary$coefficients)['Change', 'Pr(>|t|)']
new_p_val <- (rdd_wh_summary_new$coefficients)['Change', 'Pr(>|t|)']

## Change in adjusted R Squared
rdd_wh_summary$adj.r.squared
## [1] 0.01407455
rdd_wh_summary_new$adj.r.squared
## [1] -0.0008116349

For West Ham, the p value of the Change coefficient rose by 0.11 (from 0.43 to 0.54) once the rolling average was added. Adjusted R squared also fell (from 0.014 to -0.001), so adding the covariate was not fruitful.
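
The drop in adjusted R squared can be reproduced by hand from the two West Ham summaries, using the standard definition with $n$ observations and $p$ predictors (intercept excluded). The sample sizes are inferred here from the residual degrees of freedom and the number of estimated coefficients in the printed output: $n = 34 + 4 = 38$ without the covariate and $n = 32 + 5 = 37$ with it.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

\text{Without covariate } (R^2 = 0.09401,\ n = 38,\ p = 3):\quad
1 - 0.90599 \times \frac{37}{34} \approx 0.0141

\text{With covariate } (R^2 = 0.1104,\ n = 37,\ p = 4):\quad
1 - 0.8896 \times \frac{36}{32} \approx -0.0008
```

The raw R squared rises slightly, but the penalty for the extra parameter (and the one observation lost to missingness) outweighs the gain.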

spursChangeSubset <- decade_epl_RA %>% filter(Home == 'Tottenham', Season_End_Year == 2019 | (Season_End_Year == 2020 & Wk < 10)) %>%  select(Home, GoalDiff, Date, Venue, RollingGDDiff)
spursRDDdata <- spursChangeSubset %>% mutate(Change = ifelse(Date >= spursChangeDate, 1, 0), 
                                            daysSinceChange = Date - spursChangeDate)
rdd_spurs_new <- lm(GoalDiff ~ Change + daysSinceChange + daysSinceChange*Change + RollingGDDiff, 
                  data = spursRDDdata)
rdd_spurs_summary_new <-summary(rdd_spurs_new)
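print(rdd_spurs_summary) ## note: this prints the earlier summary without RollingGDDiff; print rdd_spurs_summary_new to see the covariate model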
## 
## Call:
## lm(formula = GoalDiff ~ Change + daysSinceChange + daysSinceChange * 
##     Change, data = spursRDDdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8742 -1.4178  0.0962  1.1026  4.1284 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)             0.958002   1.139021   0.841    0.410
## Change                  0.411421   1.475107   0.279    0.783
## daysSinceChange         0.000882   0.008933   0.099    0.922
## Change:daysSinceChange -0.002750   0.011974  -0.230    0.821
## 
## Residual standard error: 1.885 on 20 degrees of freedom
## Multiple R-squared:  0.01271,    Adjusted R-squared:  -0.1354 
## F-statistic: 0.08585 on 3 and 20 DF,  p-value: 0.967
rdd_spurs_summary$adj.r.squared
## [1] -0.1353794
rdd_spurs_summary_new$adj.r.squared
## [1] -0.1671256

The same happened for Spurs: adjusted R squared fell from -0.135 to -0.167 after adding the covariate.

Unfortunately, it seems that the main issue with football data still lies in the amount of noise. This makes sense, as football is a complex sport played by 11 players on each side of a big pitch, with tactics, injuries, weather and luck all being major factors in the game.