ddw8

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")

## Rows: 380 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl  (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
## time  (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

H0: All of the “big 6” clubs have the same mean home goals

H1: At least one of the big 6 teams has a different mean home goals

Big 6 is a common term within Premier League circles. These 6 teams have been all historically the most successful, the best in recent times, brings in more money, and uses more money to invest into their squads. There is a noticeable gap so it is significant to look at. This is easier than working with all 20 teams.

big6 <- c("Arsenal", "Chelsea", "Liverpool", "Man City", "Man United", "Tottenham")
b6 <- pl |>
  filter(HomeTeam %in% big6)

ggplot(b6, aes(x = HomeTeam, y = FTHG)) +
  geom_boxplot()

Median

ano <- aov(FTHG ~ HomeTeam, data = b6)
summary(ano)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## HomeTeam      5  52.95  10.589   4.726 0.000608 ***
## Residuals   108 242.00   2.241                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Large F tells me there is a lot of variance between the big 6 teams in terms of goal scoring and the p value is super small so we will be rejecting the null hypothesis that big 6 teams score the same, as we have strong evidence at least one big 6 team scores goals at a different rate.

Looking at the relationship between shot on target rate(shots on target/shots) at home vs home goals for these big 6 teams. These 6 teams usually prefer to play aggressively and take a lot of shots, which sometimes reduces the quality. Should the teams keep sending in shots? Or should they be more particular and favor quality shots they know they can put on target?

b6 <- b6 |>
  mutate(home_on_target_rate =  HST/HS)
b6 |>
  ggplot() +
  geom_point(mapping = aes(x = home_on_target_rate , y = FTHG)) +
  labs(
    title = "Home shots on target rate vs Home goals by game from Big 6 PL Clubs",
    x = "Home shot on target rate",
    y = "Home goals"
  )

b6 |>
  ggplot(mapping = aes(x = home_on_target_rate , y = FTHG)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')

## `geom_smooth()` using formula = 'y ~ x'

slope = 6.5. To make things more applicable to my data set, it is better to look at things by every 10% instead of 100% because, as it is, that slope tells me nothing. So, for every 10% increase in shot on target rate, you can expect that the home teams’ goal count will go up by .65 at home, from the big 6 teams. This plot may suggest that the super aggressive high shot volume strategy that a lot of the big 6 run may not be as effective as a team that hunts for quality chances instead. The line of best fit itself in my opinion, does a good job of representing the correlation, and the equation itself, as explained earlier, makes a lot of sense contextually. This graph clearly shows a better on target rate is favored.

Adding variables to my model to include if the team they face is in the big 6 and home shots. Since I was mainly looking into attacking intensity in the first portion of this assignment, total shots are another indicator of aggression on the other side of the spectrum. Considering the very low p-value, this new model is highly significant, meaning these variables do a better job of explaining home goals than no home goals(the null hypothesis is rejected). A 42 percent adjusted R-squared is also pretty good, especially for sports data, but still leaves a lot of variation, again expected due to the random nature of sports data.

HS p-value alone is also low, making shots a predictor for goals. every one shot results in .074 goals according to the model while big6 team is less relevant.

library(broom)
library(lindia)
library(GGally)
library(ggrepel)
library(boot)
library(ggthemes)
big6 <- c("Arsenal", "Chelsea", "Liverpool", "Man City", "Man United", "Tottenham")

b6 <- b6 |>
  mutate(
    big6_opponent = if_else(AwayTeam %in% big6, 1, 0)
  )
model2 <- lm(FTHG ~ home_on_target_rate + HS + big6_opponent, data = b6)
summary(model2)

## 
## Call:
## lm(formula = FTHG ~ home_on_target_rate + HS + big6_opponent, 
##     data = b6)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8975 -0.8146 -0.1075  0.6255  4.2981 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2.11412    0.54831  -3.856 0.000195 ***
## home_on_target_rate  8.55274    0.96947   8.822 1.93e-14 ***
## HS                   0.07444    0.02182   3.412 0.000904 ***
## big6_opponent        0.22444    0.26795   0.838 0.404058    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.231 on 110 degrees of freedom
## Multiple R-squared:  0.4352, Adjusted R-squared:  0.4198 
## F-statistic: 28.26 on 3 and 110 DF,  p-value: 1.265e-13

model1 <- lm(FTHG ~ home_on_target_rate, data = b6)
summary(model1)$r.squared

## [1] 0.3754178

summary(model1)$adj.r.squared

## [1] 0.3698412

summary(model2)$r.squared

## [1] 0.4352443

summary(model2)$adj.r.squared

## [1] 0.4198419

This graph shows that my strongest predictor is on target rate as it is far above 0 CI as is home shots. Big 6 would be my lowest performing, which lines up with the numbers I already have.

tidy(model2, conf.int = TRUE) |>
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dotted", color = "gray") +
  geom_errorbarh(aes(xmin = conf.low,
                     xmax = conf.high,
                     height = 0.5)) +
  labs(title = "Model 2 Coefficient C.I.")

## Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning in geom_errorbar(..., orientation = orientation): Ignoring unknown
## aesthetics: height

This plot stays near 0 which is good but the line is not flat. There are some outliers I wish the line was smoother but overall, it is fine. I would say mild heterosscedasticy.

gg_resfitted(model2) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Fitted Values")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Very low correlation as shown by the heat map. Multicollinearity is not an issue.

ggcorr(
  select(b6, home_on_target_rate, HS, big6_opponent),
  label = TRUE
) +
  labs(title = "Correlation Heatmap")

The residual histogram is centered towards 0, which is a good sign. The right tail is not symmetric and extends to some outliers, so it is positively skewed. There is some mild concern, but in my opinion not worth changing.

gg_reshist(model2) +
  labs(title = "Residual Histogram")

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The QQ plot is also more of the same pretty on line besides the right side positively skewed

gg_qqplot(model2) +
  labs(title = "QQ Plot of Residuals")

## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggplot2 package.
##   Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

There are a couple of huge outliers that are influential, as well as a couple of more moderate ones, but overall, this is fine, aligns with everything else: there are some mild outliers but nothing serious.

gg_cooksd(model2, threshold = "matlab") +
  labs(title = "Cook's Distance by Observation")

Overall, this model fits fine it is a little right tail skewed positively but overall I am happy with what the diagnostics revealed.

Here is the result of the model

b6 |>
  ggplot(aes(x = home_on_target_rate, y = FTHG)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(
    title = "Home Goals vs Home Shot-on-Target Rate",
    x = "Home Shot-on-Target Rate",
    y = "Full-Time Home Goals"
  )

## `geom_smooth()` using formula = 'y ~ x'

ddw8

2026-03-08