Null hypothesis: The average goals scored by home and away teams are the same

H1: Home teams, on average, score more than the away team

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")
## Rows: 380 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl  (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
## time  (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Setting up table and sorting home and away into their own column to compare the two

pl2 <- pl |>
  select(FTHG, FTAG) |>
  pivot_longer(cols = c(FTHG, FTAG),
               names_to = "venue",
               values_to = "goals")

comparing how many goals home teams score vs away teams with a box plot. It turned out how you would expect except for a very similar 50% range

pl2 |>
  ggplot() +
  geom_boxplot(mapping =
                 aes(x = goals,
                     y = factor(venue, levels = c("FTAG", "FTHG"),
                                labels = c("Away", "Home")))) +
  labs(title = "Home vs Away Goals Scored",
       x = "Average goals scored per match",
       y = "Venue") +
  theme_minimal()

Gathering statistics before i work on the NP test

pl2 |>
  group_by(venue) |>
  summarize(sd = sd(goals),
            mean = mean(goals))
## # A tibble: 2 × 3
##   venue    sd  mean
##   <chr> <dbl> <dbl>
## 1 FTAG   1.18  1.22
## 2 FTHG   1.42  1.63

I used .2 as a meaningful distance between goals I tried for a few different but everything under was too small to be significant and anything bigger felt like I was not getting everything I could out of the model. .85 power worked for me. 80% power made sense to me for the same reason. I thought .05 made sense for this dataset also.

The power analysis shows that with α = 0.05 and power = 0.85, the study has an 85% probability of detecting a meaningful goal difference of 0.20 goals per match. This does prove to me my data size was enough and I did not need to bootstrap. It asks for 300 and I have more than that along with a controllable risk size. This dataset is more than adequate.

library(pwrss)
## 
## Attaching package: 'pwrss'
## The following object is masked from 'package:stats':
## 
##     power.t.test
mu_away <- pl |> summarize(m = mean(FTAG)) |> pull(m)
sd_pooled <- pl |> summarize(s = sd(c(FTHG, FTAG))) |> pull(s)

test <- pwrss.t.2means(
  mu1 = mu_away,                 
  mu2 = mu_away + 0.20,          
  sd1 = sd_pooled,               
  kappa = 1,
  power = 0.85,
  alpha = 0.05,
  alternative = "greater"        
)
## +--------------------------------------------------+
## |             SAMPLE SIZE CALCULATION              |
## +--------------------------------------------------+
## 
## Welch's T-Test (Independent Samples)
## 
## ---------------------------------------------------
## Hypotheses
## ---------------------------------------------------
##   H0 (Null Claim) : d - null.d >= 0 
##   H1 (Alt. Claim) : d - null.d < 0 
## 
## ---------------------------------------------------
## Results
## ---------------------------------------------------
##   Sample Size            = 630 and 630  <<
##   Type 1 Error (alpha)   = 0.050
##   Type 2 Error (beta)    = 0.150
##   Statistical Power      = 0.85
plot(test)

test
## $parms
## $parms$d
## [1] -0.1512072
## 
## $parms$null.d
## [1] 0
## 
## $parms$var.ratio
## [1] 1
## 
## $parms$n2
## [1] 630
## 
## $parms$n.ratio
## [1] 1
## 
## $parms$alpha
## [1] 0.05
## 
## $parms$alternative
## [1] "one.sided"
## 
## $parms$claim.basis
## [1] "md.pval"
## 
## $parms$ceiling
## [1] TRUE
## 
## $parms$verbose
## [1] 1
## 
## 
## $test
## [1] "t"
## 
## $df
## [1] 1258
## 
## $ncp
## [1] -2.683662
## 
## $null.ncp
## [1] 0
## 
## $t.alpha
## [1] -1.646066
## 
## $power
## [1] 0.850217
## 
## $n
##  n1  n2 
## 630 630 
## 
## attr(,"class")
## [1] "pwrss" "t"     "welch"

2.

This time I wanted to see how shots taken affect match result. The team that takes the most shots does not always win since counter attack with efficient finishing is a very popular tactic in the Premier League. I started this by making a new chart and separates the home ans away team so I can look at each team individually instead of just the home team or away team. This also doubles my observation count.

H0: There is no relationship between shots and match results

H1: The more shots you take the more likely you are to win

pl4 <- pl |>
  transmute(
    Date = if ("Date" %in% names(pl)) Date else NA,
    HomeTeam, AwayTeam, FTHG, FTAG, FTR,
    HS, AS, HST, AST, HC, AC, HY, AY
  ) |>
  pivot_longer(
    cols = c(HomeTeam, AwayTeam),
    names_to = "side",
    values_to = "Team"
  ) |>
  mutate(
    venue = if_else(side == "HomeTeam", "Home", "Away"),
    goals = if_else(venue == "Home", FTHG, FTAG),
    shots = if_else(venue == "Home", HS, AS),
    shots_on_target = if_else(venue == "Home", HST, AST),
    corners = if_else(venue == "Home", HC, AC),
    yellows = if_else(venue == "Home", HY, AY),
    result = case_when(
      venue == "Home" & FTR == "H" ~ "Win",
      venue == "Home" & FTR == "D" ~ "Draw",
      venue == "Home" & FTR == "A" ~ "Loss",
      venue == "Away" & FTR == "A" ~ "Win",
      venue == "Away" & FTR == "D" ~ "Draw",
      venue == "Away" & FTR == "H" ~ "Loss",
      TRUE ~ NA_character_
    )
  ) |>
  select(Date, Team, venue, result, goals, shots, shots_on_target, corners, yellows)

pl4
## # A tibble: 760 × 9
##    Date       Team      venue result goals shots shots_on_target corners yellows
##    <chr>      <chr>     <chr> <chr>  <dbl> <dbl>           <dbl>   <dbl>   <dbl>
##  1 05/08/2022 Crystal … Home  Loss       0    10               2       3       1
##  2 05/08/2022 Arsenal   Away  Win        2    10               2       5       2
##  3 06/08/2022 Fulham    Home  Draw       2     9               3       4       2
##  4 06/08/2022 Liverpool Away  Draw       2    11               4       4       0
##  5 06/08/2022 Bournemo… Home  Win        2     7               3       5       3
##  6 06/08/2022 Aston Vi… Away  Loss       0    15               2       5       3
##  7 06/08/2022 Leeds     Home  Win        2    12               4       6       2
##  8 06/08/2022 Wolves    Away  Loss       1    15               6       4       0
##  9 06/08/2022 Newcastle Home  Win        2    23              10      11       0
## 10 06/08/2022 Nott'm F… Away  Loss       0     5               0       1       3
## # ℹ 750 more rows

Viewing my data set

shots_df <- pl4 |>
  filter(result %in% c("Win", "Loss")) |>
  select(result, shots)

observed_diff <- shots_df |>
  summarize(
    diff = mean(shots[result == "Win"]) -
      mean(shots[result == "Loss"])
  ) |>
  pull(diff)
shots_df
## # A tibble: 586 × 2
##    result shots
##    <chr>  <dbl>
##  1 Loss      10
##  2 Win       10
##  3 Win        7
##  4 Loss      15
##  5 Win       12
##  6 Loss      15
##  7 Win       23
##  8 Loss       5
##  9 Win       18
## 10 Loss      10
## # ℹ 576 more rows
observed_diff
## [1] 3.573379

Some more data from the data set pertaining wins and losses including standard deviation and standard error

sd1 <- shots_df |>
  summarize(
    sd_win  = sd(shots[result == "Win"]),
    sd_loss = sd(shots[result == "Loss"]),
    n_win   = sum(result == "Win"),
    n_loss  = sum(result == "Loss"),
    se_diff = sqrt(sd_win^2 / n_win + sd_loss^2 / n_loss)
  )

sd1
## # A tibble: 1 × 5
##   sd_win sd_loss n_win n_loss se_diff
##    <dbl>   <dbl> <int>  <int>   <dbl>
## 1   5.14    5.12   293    293   0.424

Building a probability density showing the difference between shots in wins and losses

sd_null <- sd1 |> pull(se_diff)
f_sampling <- function(x) dnorm(x, mean = 0, sd = sd_null)
x_max <- max(abs(observed_diff), 1) * 3
ggplot() +
  stat_function(aes(fill = "more extreme samples"),
                fun = f_sampling,
                xlim = c(observed_diff, x_max),
                geom = "area") +
  stat_function(aes(fill = "more extreme samples"),
                fun = f_sampling,
                xlim = c(-x_max, -observed_diff),
                geom = "area") +
  geom_function(xlim = c(-x_max, x_max),
                fun = f_sampling) +
  geom_vline(aes(xintercept = observed_diff,
                 color = paste("observed:", round(observed_diff, 2)))) +
  labs(title = " Shot Differences (Win vs Loss)",
       x = "Difference in Average Shots (Win vs Loss)",
       y = "Probability Density",) +
  scale_fill_manual(values = "lightblue") +
  theme_minimal()

The z stat is really high showing how far the difference between the two values are when compared to the null hypothesis. I did some more research towards this P value since it seems suspisious and it is really 0.0000 so the null hypothesis holds know meaning and we should reject it. There is no relationship between shots and match results is not true. We also see a small SD and an observed difference of roughly 3.6

z_stat <- observed_diff / sd_null
p_value <- 1 - pnorm(z_stat)

z_stat
## [1] 8.431518
p_value
## [1] 0
observed_diff
## [1] 3.573379
sd_null
## [1] 0.4238121
z_stat
## [1] 8.431518