Null hypothesis: The average goals scored by home and away teams are the same
H1: Home teams, on average, score more than the away team
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")
## Rows: 380 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Setting up table and sorting home and away into their own column to compare the two
pl2 <- pl |>
select(FTHG, FTAG) |>
pivot_longer(cols = c(FTHG, FTAG),
names_to = "venue",
values_to = "goals")
comparing how many goals home teams score vs away teams with a box plot. It turned out how you would expect except for a very similar 50% range
pl2 |>
ggplot() +
geom_boxplot(mapping =
aes(x = goals,
y = factor(venue, levels = c("FTAG", "FTHG"),
labels = c("Away", "Home")))) +
labs(title = "Home vs Away Goals Scored",
x = "Average goals scored per match",
y = "Venue") +
theme_minimal()
Gathering statistics before i work on the NP test
pl2 |>
group_by(venue) |>
summarize(sd = sd(goals),
mean = mean(goals))
## # A tibble: 2 × 3
## venue sd mean
## <chr> <dbl> <dbl>
## 1 FTAG 1.18 1.22
## 2 FTHG 1.42 1.63
I used .2 as a meaningful distance between goals I tried for a few different but everything under was too small to be significant and anything bigger felt like I was not getting everything I could out of the model. .85 power worked for me. 80% power made sense to me for the same reason. I thought .05 made sense for this dataset also.
The power analysis shows that with α = 0.05 and power = 0.85, the study has an 85% probability of detecting a meaningful goal difference of 0.20 goals per match. This does prove to me my data size was enough and I did not need to bootstrap. It asks for 300 and I have more than that along with a controllable risk size. This dataset is more than adequate.
library(pwrss)
##
## Attaching package: 'pwrss'
## The following object is masked from 'package:stats':
##
## power.t.test
mu_away <- pl |> summarize(m = mean(FTAG)) |> pull(m)
sd_pooled <- pl |> summarize(s = sd(c(FTHG, FTAG))) |> pull(s)
test <- pwrss.t.2means(
mu1 = mu_away,
mu2 = mu_away + 0.20,
sd1 = sd_pooled,
kappa = 1,
power = 0.85,
alpha = 0.05,
alternative = "greater"
)
## +--------------------------------------------------+
## | SAMPLE SIZE CALCULATION |
## +--------------------------------------------------+
##
## Welch's T-Test (Independent Samples)
##
## ---------------------------------------------------
## Hypotheses
## ---------------------------------------------------
## H0 (Null Claim) : d - null.d >= 0
## H1 (Alt. Claim) : d - null.d < 0
##
## ---------------------------------------------------
## Results
## ---------------------------------------------------
## Sample Size = 630 and 630 <<
## Type 1 Error (alpha) = 0.050
## Type 2 Error (beta) = 0.150
## Statistical Power = 0.85
plot(test)
test
## $parms
## $parms$d
## [1] -0.1512072
##
## $parms$null.d
## [1] 0
##
## $parms$var.ratio
## [1] 1
##
## $parms$n2
## [1] 630
##
## $parms$n.ratio
## [1] 1
##
## $parms$alpha
## [1] 0.05
##
## $parms$alternative
## [1] "one.sided"
##
## $parms$claim.basis
## [1] "md.pval"
##
## $parms$ceiling
## [1] TRUE
##
## $parms$verbose
## [1] 1
##
##
## $test
## [1] "t"
##
## $df
## [1] 1258
##
## $ncp
## [1] -2.683662
##
## $null.ncp
## [1] 0
##
## $t.alpha
## [1] -1.646066
##
## $power
## [1] 0.850217
##
## $n
## n1 n2
## 630 630
##
## attr(,"class")
## [1] "pwrss" "t" "welch"
2.
This time I wanted to see how shots taken affect match result. The team that takes the most shots does not always win since counter attack with efficient finishing is a very popular tactic in the Premier League. I started this by making a new chart and separates the home ans away team so I can look at each team individually instead of just the home team or away team. This also doubles my observation count.
H0: There is no relationship between shots and match results
H1: The more shots you take the more likely you are to win
pl4 <- pl |>
transmute(
Date = if ("Date" %in% names(pl)) Date else NA,
HomeTeam, AwayTeam, FTHG, FTAG, FTR,
HS, AS, HST, AST, HC, AC, HY, AY
) |>
pivot_longer(
cols = c(HomeTeam, AwayTeam),
names_to = "side",
values_to = "Team"
) |>
mutate(
venue = if_else(side == "HomeTeam", "Home", "Away"),
goals = if_else(venue == "Home", FTHG, FTAG),
shots = if_else(venue == "Home", HS, AS),
shots_on_target = if_else(venue == "Home", HST, AST),
corners = if_else(venue == "Home", HC, AC),
yellows = if_else(venue == "Home", HY, AY),
result = case_when(
venue == "Home" & FTR == "H" ~ "Win",
venue == "Home" & FTR == "D" ~ "Draw",
venue == "Home" & FTR == "A" ~ "Loss",
venue == "Away" & FTR == "A" ~ "Win",
venue == "Away" & FTR == "D" ~ "Draw",
venue == "Away" & FTR == "H" ~ "Loss",
TRUE ~ NA_character_
)
) |>
select(Date, Team, venue, result, goals, shots, shots_on_target, corners, yellows)
pl4
## # A tibble: 760 × 9
## Date Team venue result goals shots shots_on_target corners yellows
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 05/08/2022 Crystal … Home Loss 0 10 2 3 1
## 2 05/08/2022 Arsenal Away Win 2 10 2 5 2
## 3 06/08/2022 Fulham Home Draw 2 9 3 4 2
## 4 06/08/2022 Liverpool Away Draw 2 11 4 4 0
## 5 06/08/2022 Bournemo… Home Win 2 7 3 5 3
## 6 06/08/2022 Aston Vi… Away Loss 0 15 2 5 3
## 7 06/08/2022 Leeds Home Win 2 12 4 6 2
## 8 06/08/2022 Wolves Away Loss 1 15 6 4 0
## 9 06/08/2022 Newcastle Home Win 2 23 10 11 0
## 10 06/08/2022 Nott'm F… Away Loss 0 5 0 1 3
## # ℹ 750 more rows
Viewing my data set
shots_df <- pl4 |>
filter(result %in% c("Win", "Loss")) |>
select(result, shots)
observed_diff <- shots_df |>
summarize(
diff = mean(shots[result == "Win"]) -
mean(shots[result == "Loss"])
) |>
pull(diff)
shots_df
## # A tibble: 586 × 2
## result shots
## <chr> <dbl>
## 1 Loss 10
## 2 Win 10
## 3 Win 7
## 4 Loss 15
## 5 Win 12
## 6 Loss 15
## 7 Win 23
## 8 Loss 5
## 9 Win 18
## 10 Loss 10
## # ℹ 576 more rows
observed_diff
## [1] 3.573379
Some more data from the data set pertaining wins and losses including standard deviation and standard error
sd1 <- shots_df |>
summarize(
sd_win = sd(shots[result == "Win"]),
sd_loss = sd(shots[result == "Loss"]),
n_win = sum(result == "Win"),
n_loss = sum(result == "Loss"),
se_diff = sqrt(sd_win^2 / n_win + sd_loss^2 / n_loss)
)
sd1
## # A tibble: 1 × 5
## sd_win sd_loss n_win n_loss se_diff
## <dbl> <dbl> <int> <int> <dbl>
## 1 5.14 5.12 293 293 0.424
Building a probability density showing the difference between shots in wins and losses
sd_null <- sd1 |> pull(se_diff)
f_sampling <- function(x) dnorm(x, mean = 0, sd = sd_null)
x_max <- max(abs(observed_diff), 1) * 3
ggplot() +
stat_function(aes(fill = "more extreme samples"),
fun = f_sampling,
xlim = c(observed_diff, x_max),
geom = "area") +
stat_function(aes(fill = "more extreme samples"),
fun = f_sampling,
xlim = c(-x_max, -observed_diff),
geom = "area") +
geom_function(xlim = c(-x_max, x_max),
fun = f_sampling) +
geom_vline(aes(xintercept = observed_diff,
color = paste("observed:", round(observed_diff, 2)))) +
labs(title = " Shot Differences (Win vs Loss)",
x = "Difference in Average Shots (Win vs Loss)",
y = "Probability Density",) +
scale_fill_manual(values = "lightblue") +
theme_minimal()
The z stat is really high showing how far the difference between the two values are when compared to the null hypothesis. I did some more research towards this P value since it seems suspisious and it is really 0.0000 so the null hypothesis holds know meaning and we should reject it. There is no relationship between shots and match results is not true. We also see a small SD and an observed difference of roughly 3.6
z_stat <- observed_diff / sd_null
p_value <- 1 - pnorm(z_stat)
z_stat
## [1] 8.431518
p_value
## [1] 0
observed_diff
## [1] 3.573379
sd_null
## [1] 0.4238121
z_stat
## [1] 8.431518