Final Project: Defensive Work

Author

Christian Tabuku

Introduction

This project looks at how defensive pressure affects a team’s ability to score in the 2019–2020 La Liga season. Using match‑level data from FootyStats, I focused on two main variables: shots conceded and goals scored. The goal is to see whether teams score less when they face more defensive pressure, and whether this pattern changes between home and away matches. This topic matters because attack and defense are closely connected in soccer, and data can help reveal how one influences the other.

Source: FootyStats (https://footystats.org)

1. Load and Clean the Data

# Section: 1. Setup and Data Import

# Loading the core packages for data manipulation and viz
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Warning: package 'janitor' was built under R version 4.5.3

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(plotly) # Added for the interactivity requirement
Warning: package 'plotly' was built under R version 4.5.3

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
# Reading in the La Liga match data from FootyStats
# I'm using read_csv() as required by the project guidelines
laliga_raw <- read_csv("spain-la-liga-primera-division-2019-to-2020 (1).csv") |> 
  clean_names()
Rows: 180 Columns: 105
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (6): Div, Date, HomeTeam, AwayTeam, FTR, HTR
dbl  (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
time  (1): Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
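
The reshaping step that follows refers to columns such as fthg, ftag, hs, and as, which only exist after clean_names() has converted the raw headers. As a minimal sketch of that mapping, the snippet below applies janitor's make_clean_names() to a handful of the raw headers listed in the column specification. The name_map object is a hypothetical helper used only for illustration.

```r
library(janitor)

# Raw headers as reported in the column specification above
raw_names <- c("HomeTeam", "AwayTeam", "FTHG", "FTAG", "HS", "AS")

# make_clean_names() is the vector-level function that clean_names() applies
# to a data frame's column names
cleaned <- make_clean_names(raw_names)

# Hypothetical lookup table (illustration only): raw header -> cleaned name
name_map <- setNames(cleaned, raw_names)
print(name_map)
```

Checking the mapping this way makes it easier to spot a typo in a column name before the reshaping code fails.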

2. Reshape dataset to team-level

# The raw data is match-based, so I need to reshape it to see how individual 
# teams perform. I'm focusing on goals scored versus shots conceded.
team_data <- laliga_raw |>
  mutate(
    home_goals = fthg,
    away_goals = ftag,
    home_shots_conceded = as,
    away_shots_conceded = hs
  ) |>
  select(home_team, away_team, home_goals, away_goals, 
         home_shots_conceded, away_shots_conceded) |>
  pivot_longer(
    cols = c(home_team, away_team),
    names_to = "location_type",
    values_to = "team_name"
  ) |>
  # Creating final logic for goals and shots based on home/away status
  mutate(
    goals = if_else(location_type == "home_team", home_goals, away_goals),
    shots_conceded = if_else(location_type == "home_team", home_shots_conceded, away_shots_conceded),
    is_home = if_else(location_type == "home_team", "Home", "Away")
  ) |>
  select(team_name, is_home, goals, shots_conceded)
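
To check that the reshaping logic assigns goals and shots conceded to the right side, it can help to run the same steps on a tiny table where the answer is easy to verify by eye. The sketch below uses two invented matches (the team names and numbers are assumptions, not values from the La Liga file) and confirms that each match produces exactly two rows.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two invented matches (illustrative values, NOT from the La Liga file)
toy_matches <- tibble(
  home_team = c("Team A", "Team C"),
  away_team = c("Team B", "Team D"),
  fthg = c(2, 0),   # full-time home goals
  ftag = c(1, 3),   # full-time away goals
  hs   = c(12, 7),  # shots taken by the home side
  as   = c(5, 15)   # shots taken by the away side
)

toy_long <- toy_matches |>
  pivot_longer(
    cols = c(home_team, away_team),
    names_to = "location_type",
    values_to = "team_name"
  ) |>
  mutate(
    goals = if_else(location_type == "home_team", fthg, ftag),
    # A side concedes the shots that its opponent takes
    shots_conceded = if_else(location_type == "home_team", as, hs),
    is_home = if_else(location_type == "home_team", "Home", "Away")
  ) |>
  select(team_name, is_home, goals, shots_conceded)

# Each match should contribute exactly two rows: one Home and one Away
stopifnot(nrow(toy_long) == 2 * nrow(toy_matches))
print(toy_long)
```

In this toy table, Team A scores 2 goals at home and concedes the 5 shots taken by Team B, which is the pattern the full pipeline should reproduce for every match.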

3. Multiple Linear Regression

# I am building a multiple linear regression to see if conceding shots 
# or playing at home/away has a bigger impact on a team's scoring.
goal_model <- lm(goals ~ shots_conceded + is_home, data = team_data)

# Displaying the summary to analyze p-values and R-squared
summary(goal_model)

Call:
lm(formula = goals ~ shots_conceded + is_home, data = team_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5746 -1.0396 -0.0556  0.9234  3.4585 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.021387   0.192699   5.300 2.03e-07 ***
shots_conceded 0.002209   0.013291   0.166    0.868    
is_homeHome    0.511298   0.127135   4.022 7.05e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.161 on 357 degrees of freedom
Multiple R-squared:  0.04571,   Adjusted R-squared:  0.04037 
F-statistic: 8.551 on 2 and 357 DF,  p-value: 0.0002358
# Generating diagnostic plots to check for model assumptions
par(mfrow = c(2, 2))
plot(goal_model)

In this regression analysis, I examined how defensive pressure (measured by shots conceded) and match location (Home vs. Away) influence a team’s scoring output. The model shows that shots conceded are not a meaningful predictor of goals scored: the estimated coefficient is close to zero (0.002) and its p‑value is very high (0.868). In contrast, the home/away variable is highly significant (p < 0.001), suggesting that teams score more on average when playing at home. The estimated coefficient indicates that home teams score roughly 0.51 more goals per match than away teams, holding shots conceded constant.

Although the model’s adjusted R² is low (about 0.04), meaning it explains only a small portion of scoring variation, the diagnostic plots do not reveal major violations of regression assumptions. This suggests the linear model is a reasonable first approximation, but that scoring is influenced by many other factors not captured here. Overall, the results indicate that home advantage has a much stronger association with scoring than defensive pressure in this dataset.
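
To make the phrase "holding shots conceded constant" concrete, the sketch below fits the same model formula to simulated data and compares predictions for a home and an away side that concede the same number of shots. The data are simulated stand-ins (the seed, sample size, and Poisson rates are assumptions chosen for illustration), so the printed numbers will not match the La Liga results.

```r
set.seed(42)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in for team_data (NOT the La Liga values)
n <- 200
sim_data <- data.frame(
  shots_conceded = rpois(n, lambda = 12),
  is_home = rep(c("Away", "Home"), each = n / 2)
)
sim_data$goals <- rpois(n, lambda = ifelse(sim_data$is_home == "Home", 1.5, 1.0))

# Same formula as the model in this section
sim_model <- lm(goals ~ shots_conceded + is_home, data = sim_data)

# 95% confidence intervals for each coefficient
print(confint(sim_model))

# Predicted goals for a home and an away side conceding the same 10 shots
new_rows <- data.frame(shots_conceded = c(10, 10), is_home = c("Home", "Away"))
pred <- predict(sim_model, newdata = new_rows)

# With shots held fixed, the gap equals the is_homeHome coefficient
home_gap <- unname(pred[1] - pred[2])
print(home_gap)
```

Running confint(goal_model) on the real model would give the corresponding intervals for the La Liga data. An interval for shots_conceded that straddles zero would be consistent with the high p‑value reported above.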

5. Final Visualizations

# Viz 1: Interactive Scatter Plot
# I used geom_jitter to handle the discrete nature of goal data.
p1 <- ggplot(team_data, aes(x = shots_conceded, y = goals, color = is_home)) +
  geom_jitter(alpha = 0.6, width = 0.2, height = 0.2) +
  scale_color_manual(values = c("#E63946", "#1D3557")) + # Custom colors
  theme_minimal() + # Non-default theme
  labs(
    title = "Defensive Pressure vs. Scoring Output",
    x = "Shots Conceded per Match",
    y = "Goals Scored",
    caption = "Source: FootyStats"
  )

# Transforming into an interactive plot (Requirement #9g)
ggplotly(p1)

This interactive scatterplot shows how the number of shots a team concedes relates to the number of goals it scores in the 2019–2020 La Liga season. Each point represents one team’s performance in a single match, and the colors separate home and away performances. The regression in Section 3 estimates a slope close to zero for shots conceded (0.002, p = 0.868), so the plot should be read as showing no clear linear trend between shots conceded and goals scored. The difference the model does detect is between home and away performances. The interactivity allows you to hover over points to see the values for each observation, making it easier to compare individual performances and explore differences between home and away matches.
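
One way to judge a trend in a scatterplot like this, rather than estimating it by eye, is to overlay a fitted line for each group. The sketch below adds geom_smooth() with a linear fit to the same kind of jittered plot. It uses simulated stand-in data (the seed and Poisson rates are assumptions for illustration), so the lines it draws say nothing about the La Liga results.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for the simulated points

# Simulated stand-in for team_data (NOT the La Liga values)
sim_data <- data.frame(
  shots_conceded = rpois(120, lambda = 12),
  goals = rpois(120, lambda = 1.3),
  is_home = rep(c("Away", "Home"), times = 60)
)

p_trend <- ggplot(sim_data, aes(x = shots_conceded, y = goals, color = is_home)) +
  geom_jitter(alpha = 0.6, width = 0.2, height = 0.2) +
  # One fitted straight line per group, drawn with a confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  theme_minimal() +
  labs(x = "Shots Conceded per Match", y = "Goals Scored", color = "Venue")

print(p_trend)
```

Nearly flat lines with wide, overlapping bands would point to the same conclusion as a slope that is not statistically distinguishable from zero.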

6. Tableau Visualization

https://public.tableau.com/app/profile/christian.tabuku8661/viz/FinalProject_17788758564050/Feuille1?publish=yes

7. Essay

For this project, I analyzed whether defensive pressure affects offensive performance in professional soccer. I used match‑level data from the 2019–2020 La Liga season, collected from FootyStats. My goal was to see if teams score fewer goals when they concede more shots. After cleaning the dataset, I focused on two main variables — shots conceded and goals scored — and compared home and away performances. I chose this topic because soccer often separates “attack” and “defense,” but in reality, the two are closely connected.

Sports analytics sources support the idea that heavy defensive pressure limits offensive output. Opta’s The Analyst explains that teams who concede many shots usually spend more time defending and less time building attacks. StatsBomb reports show similar patterns: when a team is pushed back and forced to defend constantly, it struggles to transition into dangerous offensive situations. These insights helped me understand that the relationship I was studying reflects real tactical dynamics, not just numbers in a dataset.

My analysis does not show a clear negative relationship between shots conceded and goals scored. In the regression, the coefficient on shots conceded is close to zero (0.002, p = 0.868), so teams that faced more shots did not score measurably less in this dataset. The effect that does stand out is home advantage: home teams scored about 0.51 more goals per match than away teams (p < 0.001).

The project also highlighted the limits of the dataset: it covers 180 matches, it doesn’t include shot quality, tactics, or player conditions, and shots conceded is only a rough proxy for defensive pressure. These limitations may explain why the results do not reproduce what analysts and coaches often say, namely that a team that defends too much loses its ability to attack effectively. This project helped me see how data can test patterns that seem visible on the field.

References

FootyStats. (2020). La Liga 2019–2020 match statistics. Retrieved from https://footystats.org

The Analyst. (2021). How defensive pressure shapes attacking performance in modern football. Opta Sports. Retrieved from https://theanalyst.com

StatsBomb. (2020). Pressure, possession, and chance creation: Tactical trends in top‑flight football. Retrieved from https://statsbomb.com