Do More Shots and Shots on Target Lead to More Goals?

Author

Christian Tabuku

Introduction

This project uses Premier League match data (380 matches, one row per match) to explore whether taking more shots leads to more goals. The dataset includes both categorical and quantitative variables.

Categorical variables include HomeTeam, AwayTeam, and Referee.
Quantitative variables include goals scored, shots, shots on target, fouls, and corners, each recorded separately for the home and away team.

The main question is:
Do more shots and shots on target lead to more goals?

To answer this, I will fit a linear regression of home goals (FTHG) on home shots (HS) and home shots on target (HST), and I will plot shots against goals.

Source: Football-Data.co.uk

## Load data

library(tidyverse)
soccer <- read_csv("filtered_data.csv")
Rows: 380 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): Date, HomeTeam, AwayTeam, FTR, HTR, Referee
dbl (16): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The next step inspects the variables and checks the structure of the dataset.

## Explore data

glimpse(soccer)
Rows: 380
Columns: 22
$ Date     <chr> "16/08/24", "17/08/24", "17/08/24", "17/08/24", "17/08/24", "…
$ HomeTeam <chr> "Man United", "Ipswich", "Arsenal", "Everton", "Newcastle", "…
$ AwayTeam <chr> "Fulham", "Liverpool", "Wolves", "Brighton", "Southampton", "…
$ FTHG     <dbl> 1, 0, 2, 0, 1, 1, 1, 2, 0, 1, 2, 0, 2, 4, 0, 4, 0, 1, 2, 2, 1…
$ FTAG     <dbl> 0, 2, 0, 3, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 0, 2, 1, 6, 0, 1…
$ FTR      <chr> "H", "A", "H", "A", "H", "D", "A", "H", "A", "D", "H", "A", "…
$ HTHG     <dbl> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 3, 0, 2, 0, 1, 2, 1, 1…
$ HTAG     <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 2, 0, 0…
$ HTR      <chr> "D", "D", "H", "A", "H", "H", "D", "H", "A", "A", "H", "D", "…
$ Referee  <chr> "R Jones", "T Robinson", "J Gillett", "S Hooper", "C Pawson",…
$ HS       <dbl> 14, 7, 18, 9, 3, 14, 14, 9, 10, 7, 14, 14, 18, 14, 5, 13, 11,…
$ AS       <dbl> 10, 18, 9, 10, 19, 13, 15, 14, 11, 15, 11, 18, 10, 1, 23, 10,…
$ HST      <dbl> 5, 2, 6, 1, 1, 8, 3, 5, 3, 3, 5, 2, 6, 5, 1, 7, 3, 4, 4, 8, 7…
$ AST      <dbl> 2, 5, 3, 5, 4, 4, 3, 6, 5, 7, 4, 3, 4, 1, 8, 1, 4, 5, 8, 2, 4…
$ HF       <dbl> 12, 9, 17, 8, 15, 17, 18, 6, 12, 11, 9, 9, 14, 4, 14, 11, 8, …
$ AF       <dbl> 10, 18, 14, 8, 16, 8, 11, 15, 9, 12, 13, 17, 13, 15, 14, 15, …
$ HC       <dbl> 7, 2, 8, 1, 3, 2, 5, 4, 4, 2, 4, 3, 7, 10, 4, 12, 4, 8, 5, 9,…
$ AC       <dbl> 8, 10, 2, 5, 12, 6, 3, 7, 3, 13, 4, 3, 5, 1, 10, 5, 1, 9, 5, …
$ HY       <dbl> 2, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 0, 1, 2, 2, 2, 3…
$ AY       <dbl> 3, 1, 2, 1, 4, 3, 2, 5, 1, 1, 2, 1, 2, 3, 3, 0, 3, 2, 3, 3, 2…
$ HR       <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ AR       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
summary(soccer)
     Date             HomeTeam           AwayTeam              FTHG      
 Length:380         Length:380         Length:380         Min.   :0.000  
 Class :character   Class :character   Class :character   1st Qu.:1.000  
 Mode  :character   Mode  :character   Mode  :character   Median :1.000  
                                                          Mean   :1.513  
                                                          3rd Qu.:2.000  
                                                          Max.   :7.000  
      FTAG           FTR                 HTHG             HTAG       
 Min.   :0.000   Length:380         Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1.000   Class :character   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.000   Mode  :character   Median :1.0000   Median :0.0000  
 Mean   :1.421                      Mean   :0.7526   Mean   :0.6105  
 3rd Qu.:2.000                      3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :6.000                      Max.   :4.0000   Max.   :5.0000  
     HTR              Referee                HS              AS       
 Length:380         Length:380         Min.   : 2.00   Min.   : 1.00  
 Class :character   Class :character   1st Qu.:10.00   1st Qu.: 9.00  
 Mode  :character   Mode  :character   Median :13.00   Median :11.50  
                                       Mean   :13.75   Mean   :12.17  
                                       3rd Qu.:17.00   3rd Qu.:15.00  
                                       Max.   :36.00   Max.   :37.00  
      HST              AST               HF              AF       
 Min.   : 0.000   Min.   : 0.000   Min.   : 2.00   Min.   : 1.00  
 1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.: 8.00   1st Qu.: 9.00  
 Median : 5.000   Median : 4.000   Median :11.00   Median :11.00  
 Mean   : 4.834   Mean   : 4.266   Mean   :10.79   Mean   :11.28  
 3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:13.00   3rd Qu.:14.00  
 Max.   :16.000   Max.   :13.000   Max.   :21.00   Max.   :21.00  
       HC               AC               HY              AY       
 Min.   : 0.000   Min.   : 0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.:1.000   1st Qu.:1.000  
 Median : 5.000   Median : 4.000   Median :2.000   Median :2.000  
 Mean   : 5.426   Mean   : 4.871   Mean   :1.905   Mean   :2.145  
 3rd Qu.: 7.000   3rd Qu.: 7.000   3rd Qu.:3.000   3rd Qu.:3.000  
 Max.   :17.000   Max.   :18.000   Max.   :7.000   Max.   :8.000  
       HR                AR         
 Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :0.00000  
 Mean   :0.06842   Mean   :0.06842  
 3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :2.00000   Max.   :1.00000  

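Next, I remove any rows with missing values in the variables used in the analysis (FTHG, HS, and HST). The summary above reports no NA counts for these columns, so this step is a safeguard.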

## Clean data

soccer_clean <- soccer %>%
  drop_na(FTHG, HS, HST)

## Regression model

model <- lm(FTHG ~ HS + HST, data = soccer_clean)
summary(model)

Call:
lm(formula = FTHG ~ HS + HST, data = soccer_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2305 -0.7294 -0.0739  0.7030  4.0193 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.35931    0.14331   2.507  0.01259 *  
HS          -0.04078    0.01330  -3.067  0.00232 ** 
HST          0.35470    0.02986  11.878  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.042 on 377 degrees of freedom
Multiple R-squared:  0.338, Adjusted R-squared:  0.3345 
F-statistic: 96.25 on 2 and 377 DF,  p-value: < 2.2e-16

Regression Analysis

The regression model is:

FTHG = 0.36 - 0.04(HS) + 0.35(HST)
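
As a rough check on this equation, the sketch below plugs the sample means of HS (13.75) and HST (4.834) from the summary output into the unrounded coefficients. The helper name `predict_home_goals` is made up for illustration and is not part of the original analysis.

```r
# Coefficients copied from the summary(model) output above.
b0  <- 0.35931   # intercept
b_s <- -0.04078  # HS: home shots
b_t <- 0.35470   # HST: home shots on target

# Hypothetical helper (not part of the original analysis):
# predicted home goals for given shot counts.
predict_home_goals <- function(hs, hst) {
  b0 + b_s * hs + b_t * hst
}

# At the sample means of HS and HST, a least-squares fit with an
# intercept should reproduce the sample mean of FTHG (1.513).
at_means <- predict_home_goals(hs = 13.75, hst = 4.834)
print(round(at_means, 3))

# One extra shot that is also on target changes the prediction by b_s + b_t.
on_target_gain <- predict_home_goals(14, 6) - predict_home_goals(13, 5)
print(round(on_target_gain, 3))
```

The prediction at the means comes out at about 1.51 goals, in line with the mean of FTHG, and an extra shot that hits the target is worth roughly 0.31 goals once both coefficients are combined.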

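The coefficient for HS (home shots) is negative (-0.04). Holding shots on target fixed, each additional shot is associated with about 0.04 fewer home goals. Because HST is held constant, the extra shot in this comparison is one that did not count as on target, so the negative sign does not mean that shooting more reduces scoring overall.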

The coefficient for HST (home shots on target) is positive (0.35). Holding total shots fixed, each additional shot on target is associated with about 0.35 more home goals.

Both independent variables are statistically significant at the 0.05 level: p = 0.002 for HS and p < 2e-16 for HST. This means that both shots and shots on target have a statistically significant relationship with home goals.
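
Confidence intervals give a sense of how precisely each coefficient is estimated. The following is a minimal sketch that rebuilds approximate 95% intervals from the estimates and standard errors printed in the model output, assuming the usual t-based interval with 377 residual degrees of freedom.

```r
# Estimates and standard errors copied from the summary(model) output.
est <- c(HS = -0.04078, HST = 0.35470)
se  <- c(HS = 0.01330,  HST = 0.02986)
df_resid <- 377  # residual degrees of freedom reported by summary(model)

# Two-sided 95% critical value from the t distribution.
t_crit <- qt(0.975, df = df_resid)

# Interval = estimate +/- critical value * standard error.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))
```

Neither interval contains zero, which agrees with the p-values. With the fitted object available, `confint(model)` should return the same intervals without the manual arithmetic.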

The adjusted R-squared value is 0.3345, which means that about 33% of the variation in home goals can be explained by shots and shots on target. The remaining two thirds is not explained by the model.
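
The adjusted R-squared and the F-statistic can both be recovered from the multiple R-squared, the number of matches, and the number of predictors. The sketch below does that arithmetic with the rounded values from the output, so small differences in the last digit are expected.

```r
r2 <- 0.338  # Multiple R-squared from the output (rounded)
n  <- 380    # matches in the dataset
p  <- 2      # predictors: HS and HST

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 2))
```

Both results land within rounding of the reported values (0.3345 and 96.25). With only two predictors and 380 matches, the adjustment barely moves R-squared.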

Overall, the results suggest that shots on target are a much stronger predictor of home goals than total shots (t = 11.9 versus t = -3.1). This indicates that accuracy matters more than shot volume.

## Diagnostic plots

plot(model)

## Visualization

ggplot(soccer_clean, aes(x = HS, y = FTHG, color = FTR)) +
  geom_point() +
  labs(
    title = "Shots vs Goals by Match Result",
    x = "Number of Shots",
    y = "Number of Goals",
    caption = "Source: Football-Data.co.uk"
  ) +
  theme_classic()

Final Analysis

Data Cleaning

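The cleaning step drops any row with a missing value in FTHG, HS, or HST. In this dataset no match was affected: the model reports 377 residual degrees of freedom, which equals 380 observations minus 3 estimated coefficients. The regression therefore uses all 380 matches.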
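
Whether a cleaning step like this removes anything can be checked by counting missing values and comparing row counts. Here is a minimal base-R sketch on a made-up data frame called `toy`; the rows are invented for illustration, and on the real data `soccer` would take its place.

```r
# Toy stand-in for the match data; these rows are made up for illustration.
toy <- data.frame(
  FTHG = c(1, 0, 2, NA),
  HS   = c(14, 7, NA, 9),
  HST  = c(5, 2, 6, 1)
)
vars <- c("FTHG", "HS", "HST")

# Missing values per model variable.
na_counts <- colSums(is.na(toy[vars]))
print(na_counts)

# Keep only rows that are complete in the model variables
# (base-R counterpart of drop_na(FTHG, HS, HST)).
toy_clean <- toy[complete.cases(toy[vars]), ]
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

If the dropped-row count is zero, the cleaning step changed nothing and the model is fitted on the full dataset.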

Visualization

The scatter plot shows home shots (HS) against home goals (FTHG), with points colored by the full-time result (FTR). It appears that teams with more shots tend to score more goals. However, the relationship is not perfect, which means other factors also affect goal scoring.

Limitations

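One limitation of this analysis is that shots on target are a subset of total shots, so the two predictors are correlated (multicollinearity). This makes the individual coefficients harder to interpret and helps explain the negative sign on HS. The model also covers only the home team's shots and goals, and other variables such as team strength or tactics were not included.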
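
The overlap between the two predictors can be quantified with their correlation and a variance inflation factor (VIF). The sketch below uses simulated data, not the real matches: shot counts are drawn at random and shots on target are drawn as a subset of them, with an arbitrary 35% accuracy rate chosen only for illustration.

```r
set.seed(1)  # reproducible simulation

# Simulated, not real, data: shots per match, and shots on target drawn
# as a subset of those shots (the 35% accuracy rate is an assumption).
n   <- 380
hs  <- rpois(n, lambda = 13) + 1
hst <- rbinom(n, size = hs, prob = 0.35)

# Correlation between the two predictors.
r <- cor(hs, hst)

# VIF for a two-predictor model: 1 / (1 - R^2), where R^2 comes from
# regressing one predictor on the other.
aux_r2 <- summary(lm(hst ~ hs))$r.squared
vif <- 1 / (1 - aux_r2)

print(round(c(correlation = r, vif = vif), 2))
```

On the real data the same calculation would use `soccer_clean$HS` and `soccer_clean$HST`. A VIF close to 1 would indicate little inflation of the standard errors, while larger values would point to stronger overlap between the predictors.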