library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")
## Rows: 380 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl  (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
## time  (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

H0: All of the “Big 6” clubs have the same mean number of home goals per match.

H1: At least one of the Big 6 clubs has a different mean number of home goals per match.

“Big 6” is a common term in Premier League circles. These six clubs have historically been the most successful, have been the strongest in recent seasons, bring in more money, and invest more money in their squads. The gap between them and the rest of the league is noticeable, which makes them a meaningful group to compare, and six teams are easier to work with than all 20.

big6 <- c("Arsenal", "Chelsea", "Liverpool", "Man City", "Man United", "Tottenham")
b6 <- pl |>
  filter(HomeTeam %in% big6)

ggplot(b6, aes(x = HomeTeam, y = FTHG)) +
  geom_boxplot()

The line inside each box marks that team's median home goals.
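
Because the boxplot is built around medians while the ANOVA that follows compares means, it can help to put both side by side for each team. The sketch below shows one way to do that with `dplyr`. The `toy` data frame and its team names and goal counts are made up purely for illustration; with the real data, `b6` would take its place.

```r
library(dplyr)

# Hypothetical toy data (NOT from E0.csv): two teams, three home matches each
toy <- data.frame(
  HomeTeam = c("Team A", "Team A", "Team A", "Team B", "Team B", "Team B"),
  FTHG     = c(0, 1, 5, 2, 2, 2)
)

# Mean and median home goals per team, plus the number of matches
team_summary <- toy |>
  group_by(HomeTeam) |>
  summarise(
    matches      = n(),
    mean_goals   = mean(FTHG),
    median_goals = median(FTHG)
  )

team_summary
```

In this toy example the two teams share the same mean but not the same median, which is the kind of difference a boxplot alone would not make obvious.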

ano <- aov(FTHG ~ HomeTeam, data = b6)
summary(ano)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## HomeTeam      5  52.95  10.589   4.726 0.000608 ***
## Residuals   108 242.00   2.241                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value of 4.73 means the variation in mean home goals between the Big 6 teams is large relative to the variation within each team's own matches, and the p-value (0.0006) is far below 0.05. We therefore reject the null hypothesis that all Big 6 teams have the same mean home goals: there is strong evidence that at least one Big 6 team scores at home at a different rate.
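
As a check on how the table fits together, the F value and p-value can be rebuilt from the sums of squares and degrees of freedom printed by `summary(ano)`. This is only a sketch that copies the rounded numbers from the output above, so small rounding differences are expected. The degrees of freedom are consistent with 6 teams (6 - 1 = 5) and 114 home matches (114 - 6 = 108).

```r
# Values copied from the summary(ano) table above (rounded as printed)
ss_between <- 52.95   # Sum Sq for HomeTeam
df_between <- 5       # 6 teams - 1
ss_within  <- 242.00  # Sum Sq for Residuals
df_within  <- 108     # 114 home matches - 6 teams

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-team to within-team mean squares
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

f_stat
p_value
```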

Next, I look at the relationship between home shot-on-target rate (shots on target / shots) and home goals for these Big 6 teams. These teams usually prefer to play aggressively and take a lot of shots, which sometimes reduces shot quality. Should they keep sending in shots, or should they be more selective and favor quality shots they know they can put on target?
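
One edge case in this ratio is a match with zero home shots: in R, `0 / 0` evaluates to `NaN`, which would then be dropped from plots with a warning. Whether any such match exists in this data set is not known here, so the sketch below is only a precaution. The `toy` data frame is hypothetical; `HS` and `HST` are the column names used in the data.

```r
library(dplyr)

# Hypothetical toy data (NOT from E0.csv); the second row has zero shots
toy <- data.frame(
  HS  = c(10, 0, 8),  # home shots
  HST = c(4, 0, 2)    # home shots on target
)

# Compute the on-target rate, returning NA instead of NaN when HS is zero
toy <- toy |>
  mutate(home_on_target_rate = if_else(HS > 0, HST / HS, NA_real_))

toy
```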

# Share of home shots that were on target in each match
b6 <- b6 |>
  mutate(home_on_target_rate = HST / HS)
b6 |>
  ggplot() +
  geom_point(mapping = aes(x = home_on_target_rate, y = FTHG)) +
  labs(
    title = "Home shots on target rate vs Home goals by game from Big 6 PL Clubs",
    x = "Home shot on target rate",
    y = "Home goals"
  )

b6 |>
  ggplot(mapping = aes(x = home_on_target_rate, y = FTHG)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue")
## `geom_smooth()` using formula = 'y ~ x'

The slope of the fitted line is about 6.5, which is the expected change in home goals when the on-target rate goes from 0% to 100%. That step is too large to be meaningful for this data set, so it is more useful to read the slope per 10 percentage points: for every 10-point increase in shot-on-target rate, a Big 6 team's expected home goals go up by about 0.65. This plot may suggest that the aggressive, high-shot-volume strategy that many of the Big 6 run is not as effective as hunting for quality chances. In my opinion, the line of best fit does a good job of representing the relationship, and the rescaled slope makes sense in context. Overall, the graph shows that a higher on-target rate goes with more home goals.
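
The rescaling from "per 100%" to "per 10 percentage points" can be checked directly with `lm()`. The sketch below uses a small hypothetical data frame (not the match data) and shows that dividing the slope by 10 gives the same number as refitting with the rate measured in units of 10 percentage points. With the real data, `FTHG ~ home_on_target_rate` fitted on `b6` would take the place of the toy model.

```r
# Hypothetical toy data (NOT from E0.csv), for illustration only
toy <- data.frame(
  rate  = c(0.2, 0.3, 0.4, 0.5, 0.6),  # on-target rate as a proportion
  goals = c(0, 1, 1, 3, 2)             # home goals
)

# Slope per full unit of rate (0% to 100%)
fit_unit   <- lm(goals ~ rate, data = toy)
slope_unit <- coef(fit_unit)[["rate"]]

# Same model with the rate measured in units of 10 percentage points
toy$rate_10pt <- toy$rate * 10
fit_10pt      <- lm(goals ~ rate_10pt, data = toy)
slope_10pt    <- coef(fit_10pt)[["rate_10pt"]]

# The two routes agree: slope per 10 points is the unit slope divided by 10
c(divided = slope_unit / 10, refit = slope_10pt)
```

Only the units of the slope change under this rescaling; the fitted line and the fitted values are the same in both models.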