---
title: "dw6"
output: html_document
date: "2026-02-19"
---
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")
## Rows: 380 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pl <- pl |>
  mutate(home_on_target_rate = HST / HS)
I added a column showing how often the home team's shots are on target (shots on target divided by total shots). Putting a shot on target forces the keeper to make a play, so in theory, the more often you put the ball on target, the more goals you will score. Aggressive teams shoot more, which leads to more goals, but those teams also often take a lot of bad shots, which lowers their on-target rate. The next few cells look at how well on-target rate and goals correlate.
pl |>
  ggplot() +
  geom_point(mapping = aes(x = home_on_target_rate, y = FTHG)) +
  labs(
    title = "Home shots on target rate vs Home goals by game",
    x = "Home shot on target rate",
    y = "Home goals"
  )
The way this graph is stretched makes the points look more spread out than they are, but there is still a clear positive correlation between the two variables.
round(cor(pl$home_on_target_rate, pl$FTHG), 2)
## [1] 0.4
The correlation between the variables is 0.40, a moderate positive relationship. In my experience working with sports and football data, a correlation of that size is meaningful.
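As a rough check on whether a correlation of this size could be due to chance, the sketch below converts the rounded correlation (0.4) and the number of matches (380) reported above into a t statistic and a two-sided p-value. The variable names are only illustrative, and because it uses the rounded correlation the result is approximate; running `cor.test()` on the actual columns would give the exact figures.

```r
# Values taken from the output above: rounded correlation and number of rows
r <- 0.4
n <- 380

# t statistic for testing "true correlation = 0", with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

t_stat   # about 8.5
p_value  # far below 0.05
```

A t statistic this large means the correlation is statistically significant as well as practically meaningful.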
t.test(pl$FTHG)$conf.int
## [1] 1.490986 1.777435
## attr(,"conf.level")
## [1] 0.95
mean(pl$FTHG)
## [1] 1.634211
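This shows that we can be 95% confident that the true mean of home goals per game falls between 1.49 and 1.78; the sample mean is 1.63. Put another way, if the sampling were repeated many times, about 95 out of 100 intervals built this way would contain the true mean. I was not convinced that this tells me anything meaningful, so I went and looked for other ways to apply confidence intervals to my variables.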
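To show what `t.test()` is doing when it reports that interval, here is a minimal sketch that rebuilds a 95% interval by hand as mean ± t × standard error. It uses simulated goal counts (`home_goals` is a hypothetical stand-in, not the E0.csv data), so the numbers will not match the output above.

```r
set.seed(42)

# Hypothetical stand-in for pl$FTHG: 380 simulated goal counts
home_goals <- rpois(380, lambda = 1.6)

n      <- length(home_goals)
xbar   <- mean(home_goals)
se     <- sd(home_goals) / sqrt(n)   # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # critical value for a 95% interval

manual_ci  <- c(xbar - t_crit * se, xbar + t_crit * se)
builtin_ci <- as.numeric(t.test(home_goals)$conf.int)

# The hand-built interval should agree with t.test()
all.equal(manual_ci, builtin_ci)
```

The interval gets narrower as the number of games grows, which is why it describes the mean and not the spread of goals in individual games.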
model <- lm(FTHG ~ home_on_target_rate, data = pl)
CI1 <- data.frame(home_on_target_rate = 0.5)
predict(model, CI1, interval = "confidence")
## fit lwr upr
## 1 2.198003 2.012194 2.383811
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -0.08846422 0.6041955
## home_on_target_rate 2.97704296 4.7835048
At a 95% confidence level, the estimated mean goals scored by a home team with a .5 on-target rate falls between 2.01 and 2.38, with a fitted value of 2.20. In other words, for home teams that put half of their shots on target, we are 95% confident that the average number of goals is between 2.01 and 2.38. This is an interval for the average, not for the goals scored in any single game.
CI1 <- data.frame(home_on_target_rate = 0.25)
predict(model, CI1, interval = "confidence")
## fit lwr upr
## 1 1.227934 1.065929 1.389939
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -0.08846422 0.6041955
## home_on_target_rate 2.97704296 4.7835048
This shows that if you put .25 of your shots on target, we are 95% confident that the average goals scored falls between 1.07 and 1.39 (fitted value 1.23). That is a clear difference from the interval at a .5 rate: the two intervals do not overlap.
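The intervals above describe the average goals at a given on-target rate. For a single game, a prediction interval is the more relevant tool, and it is always wider. The sketch below compares the two using simulated data; `sim`, `rate`, `goals` and the coefficients used to generate them are all hypothetical and are not taken from E0.csv.

```r
set.seed(1)

# Hypothetical stand-in data: on-target rate and goals for 380 games
sim <- data.frame(rate = runif(380, min = 0.1, max = 0.7))
sim$goals <- rpois(380, lambda = 0.3 + 3.9 * sim$rate)

sim_model <- lm(goals ~ rate, data = sim)
new_rates <- data.frame(rate = c(0.25, 0.5))

# Interval for the average goals at each rate
conf_band <- predict(sim_model, new_rates, interval = "confidence")

# Interval for the goals in one new game at each rate
pred_band <- predict(sim_model, new_rates, interval = "prediction")

conf_band
pred_band
```

Both rates can be passed in one `newdata` data frame, which avoids refitting or rerunning `predict()` for each value.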
pl <- pl |>
  mutate(home_shot_share = HS / (HS + AS))
Next, I wanted to see how aggressiveness relates to how many corners a team takes. The share of the game's shots taken by the home team gives me an idea of how aggressive that team is being offensively.
pl |>
  ggplot() +
  geom_point(mapping = aes(x = home_shot_share, y = HC)) +
  labs(
    title = "Home shot share vs Home corners",
    x = "Home shot share",
    y = "Home corners"
  )
The correlation between the two variables is a clear positive one. The more aggressive the home team is, the more corners that team tends to take.
round(cor(pl$home_shot_share, pl$HC), 2)
## [1] 0.55
A correlation of .55 is moderately strong, which is meaningful in this context.
CI3 <- lm(HC ~ home_shot_share, data = pl)
summary(CI3)
##
## Call:
## lm(formula = HC ~ home_shot_share, data = pl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8113 -1.6902 -0.2508 1.6986 8.0288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2009 0.4412 0.455 0.649
## home_shot_share 9.9019 0.7672 12.907 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.564 on 378 degrees of freedom
## Multiple R-squared: 0.3059, Adjusted R-squared: 0.3041
## F-statistic: 166.6 on 1 and 378 DF, p-value: < 2.2e-16
confint(CI3)
## 2.5 % 97.5 %
## (Intercept) -0.6666642 1.068443
## home_shot_share 8.3934750 11.410391
confint(CI3, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -0.6666642 1.068443
## home_shot_share 8.3934750 11.410391
According to these results, the slope of 9.9 tells us that for every 10 percentage point increase in shot share, the home team’s corner count goes up by about .99 on average (95% confidence interval for the slope: 8.39 to 11.41). The p-value is nearly 0, which tells us this relationship is statistically significant. The residual range is wide (about -5.8 to 8.0 corners), but that is typical of football, which is unpredictable. An R^2 of 31% is good for sports data.
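The arithmetic behind that interpretation can be checked directly from the printed output. The sketch below rescales the slope and its confidence limits to a 10 percentage point change in shot share, and confirms that R^2 for a one-predictor model is the squared correlation. The numbers are copied from the `summary()` and `confint()` output above; the variable names are only illustrative.

```r
# Copied from the regression output above
slope     <- 9.9019
slope_ci  <- c(8.3934750, 11.410391)
r_squared <- 0.3059

# Effect of a 0.10 (10 percentage point) increase in home shot share
step <- 0.10
effect_per_step    <- slope * step       # about 0.99 corners
effect_ci_per_step <- slope_ci * step    # about 0.84 to 1.14 corners

# With one predictor, sqrt(R^2) recovers the size of the correlation
implied_cor <- sqrt(r_squared)

round(effect_per_step, 2)
round(effect_ci_per_step, 2)
round(implied_cor, 2)   # 0.55, matching cor() above
```

Reporting the effect per 0.10 of shot share is easier to read than the raw slope, since shot share only ranges from 0 to 1.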