Project 2: Analysing La Liga Offensive Efficiency

Author

Christian Tabuku

Introduction

Welcome to my DATA 110 project analysis. For this assignment, I decided to explore a passion of mine: soccer. I have sourced a dataset of 180 matches from the 2019-2020 season of the Spanish ‘La Liga’ Primera Division.

My main research question for this study is:

“How does offensive efficiency (the relationship between shots on target and goals) vary between matches played at home versus those played away in the Spanish La Liga?”

I chose this dynamic image from the official FC Barcelona website to capture the energy of Spanish soccer and the moment a goal is celebrated—the ultimate measure of offensive success.

Source: FC Barcelona Official Website (https://www.fcbarcelona.com/)

Phase 1: Loading Libraries and Cleaning Data

In this first stage, I am setting up my R environment by loading the necessary libraries. I will then import the La Liga dataset and perform essential cleaning steps using the tidyverse package.

# --- Phase 1: Loading Libraries and Dataset ---

# Load the required packages for data science and visualization
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
# Import the raw dataset from the CSV file
# We use read_csv to bring the soccer statistics into our environment
soccer_data <- read_csv("spain-la-liga-primera-division-2019-to-2020 (1).csv")
Rows: 180 Columns: 105
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (6): Div, Date, HomeTeam, AwayTeam, FTR, HTR
dbl  (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
time  (1): Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# --- Data Cleaning and Filtering (dplyr) ---

# Following the project guidelines, I am using the dplyr 'select' and 'filter' commands.
# I keep the categorical variables (HomeTeam, AwayTeam, FTR) and 4 quantitative
# variables (FTHG, FTAG, HST, AST), then drop any rows with missing values.
# The file has 180 matches, so my observations remain under 800.
clean_data <- soccer_data %>%
  select(HomeTeam, AwayTeam, FTHG, FTAG, HST, AST, FTR) %>%
  filter(!is.na(FTHG), !is.na(FTAG), !is.na(HST), !is.na(AST))

# Display the first few rows to document the successful cleaning process
head(clean_data)
# A tibble: 6 × 7
  HomeTeam   AwayTeam     FTHG  FTAG   HST   AST FTR  
  <chr>      <chr>       <dbl> <dbl> <dbl> <dbl> <chr>
1 Ath Bilbao Barcelona       1     0     5     2 H    
2 Celta      Real Madrid     1     3     4    11 A    
3 Valencia   Sociedad        1     1     6     3 D    
4 Mallorca   Eibar           2     1     4     5 H    
5 Leganes    Osasuna         0     1     2     2 A    
6 Villarreal Granada         4     4     7     7 D    

Phase 2: Exploratory Data Analysis & Visualization

In this phase, I am exploring the relationship between precision (Shots on Target) and actual scoring (Goals). This helps answer whether teams are more “efficient” when playing in front of their home crowd compared to playing away.

# Rebuild the cleaned data (same select and filter steps as Phase 1)
# so that this phase can be run on its own, then preview the first rows

clean_data <- soccer_data %>%
  select(HomeTeam, AwayTeam, FTHG, FTAG, HST, AST, FTR) %>%
  filter(!is.na(FTHG), !is.na(FTAG), !is.na(HST), !is.na(AST))

# Display results
head(clean_data)
# A tibble: 6 × 7
  HomeTeam   AwayTeam     FTHG  FTAG   HST   AST FTR  
  <chr>      <chr>       <dbl> <dbl> <dbl> <dbl> <chr>
1 Ath Bilbao Barcelona       1     0     5     2 H    
2 Celta      Real Madrid     1     3     4    11 A    
3 Valencia   Sociedad        1     1     6     3 D    
4 Mallorca   Eibar           2     1     4     5 H    
5 Leganes    Osasuna         0     1     2     2 A    
6 Villarreal Granada         4     4     7     7 D    
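
The home versus away comparison can be made concrete with a conversion rate: goals divided by shots on target, computed separately for each side. The sketch below is a minimal base R example under stated assumptions. `sample_matches` is a hypothetical data frame holding only the six rows printed by head(clean_data), and `conversion_rate` is a helper name introduced for illustration, not part of the project code.

```r
# Six matches copied from the head(clean_data) output (illustration only,
# not the full 180-match dataset)
sample_matches <- data.frame(
  HomeTeam = c("Ath Bilbao", "Celta", "Valencia", "Mallorca", "Leganes", "Villarreal"),
  AwayTeam = c("Barcelona", "Real Madrid", "Sociedad", "Eibar", "Osasuna", "Granada"),
  FTHG = c(1, 1, 1, 2, 0, 4),
  FTAG = c(0, 3, 1, 1, 1, 4),
  HST  = c(5, 4, 6, 4, 2, 7),
  AST  = c(2, 11, 3, 5, 2, 7)
)

# Hypothetical helper: total goals divided by total shots on target
conversion_rate <- function(goals, shots_on_target) {
  sum(goals) / sum(shots_on_target)
}

home_rate <- conversion_rate(sample_matches$FTHG, sample_matches$HST)
away_rate <- conversion_rate(sample_matches$FTAG, sample_matches$AST)

# Summary statistics for the four quantitative variables
summary(sample_matches[, c("FTHG", "FTAG", "HST", "AST")])

# Conversion rates side by side
round(c(home = home_rate, away = away_rate), 3)
```

For these six matches the rates are 9/28 (about 0.321) at home and 10/30 (about 0.333) away. Six games are far too few to draw a conclusion; passing the columns of the full clean_data to the same helper would give the figures for all 180 matches.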

Phase 3: Multiple Linear Regression Analysis

In this phase, I use a multiple linear regression model to analyze how different factors influence the number of goals scored by the home team. Specifically, I examine the impact of shots on target (HST), opponent shots on target (AST), and goals conceded (FTAG).

This helps determine whether offensive efficiency is significantly affected by both attacking and defensive variables.

# Phase 3: Multiple Linear Regression

# In this step, I build a regression model to predict home goals (FTHG)
# using three quantitative variables:
# - HST: Home shots on target (offensive strength)
# - AST: Away shots on target (defensive pressure)
# - FTAG: Away goals scored (defensive weakness)

model <- lm(FTHG ~ HST + AST + FTAG, data = clean_data)

# Display the results of the regression model
summary(model)

Call:
lm(formula = FTHG ~ HST + AST + FTAG, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4909 -0.7128 -0.1226  0.6668  2.6359 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.15943    0.22054   0.723    0.471    
HST          0.29412    0.02947   9.980   <2e-16 ***
AST         -0.01076    0.04927  -0.218    0.827    
FTAG         0.08902    0.09366   0.951    0.343    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.048 on 176 degrees of freedom
Multiple R-squared:  0.3694,    Adjusted R-squared:  0.3587 
F-statistic: 34.37 on 3 and 176 DF,  p-value: < 2.2e-16

The regression results provide insight into how different variables affect the number of goals scored by the home team.

First, HST (home shots on target) has a very small p-value (p < 0.001), so it is statistically significant. Its coefficient of about 0.29 means that each additional shot on target is associated with roughly 0.29 more home goals, holding AST and FTAG constant. In other words, the more shots a team puts on target, the more goals it is likely to score.
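
To make the HST coefficient concrete, the sketch below plugs the printed estimates into the regression equation by hand. The helper name `predict_home_goals` and the example inputs are assumptions chosen for illustration; with the fitted model object available, predict(model, newdata = ...) would do the same job.

```r
# Coefficient estimates copied from the summary(model) output
coefs <- c(intercept = 0.15943, HST = 0.29412, AST = -0.01076, FTAG = 0.08902)

# Hypothetical helper: linear prediction of home goals from the three predictors
predict_home_goals <- function(hst, ast, ftag) {
  unname(coefs["intercept"] + coefs["HST"] * hst +
           coefs["AST"] * ast + coefs["FTAG"] * ftag)
}

# Example match profile (values chosen for illustration)
base_case     <- predict_home_goals(hst = 5, ast = 2, ftag = 0)
one_more_shot <- predict_home_goals(hst = 6, ast = 2, ftag = 0)

base_case                  # about 1.61 predicted home goals
one_more_shot - base_case  # equals the HST coefficient, about 0.29
```

The difference between the two predictions is exactly the HST estimate, which is what "holding the other variables constant" means in practice.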

On the other hand, AST (away shots on target, p = 0.827) and FTAG (away goals, p = 0.343) have p-values well above 0.05, so they are not statistically significant predictors in this model. This suggests that the opponent's attacking output has no clear linear association with home team goal scoring in this dataset.

The adjusted R-squared value is approximately 0.36, meaning that about 36% of the variation in home team goals can be explained by the variables included in the model. While this shows a moderate relationship, it also suggests that other factors not included in the model may influence goal scoring.
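
As a quick check on the adjusted R-squared, it can be recomputed from the multiple R-squared, the number of observations, and the number of predictors. This is a small sketch using the rounded values printed in the output, so the result is only accurate to about four decimals.

```r
r_squared <- 0.3694  # Multiple R-squared from the model output
n <- 180             # matches: 176 residual df + 3 predictors + 1 intercept
p <- 3               # predictors: HST, AST, FTAG

# Standard adjustment that penalises the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(adj_r_squared, 4)  # 0.3587, matching the printed value
```

The adjustment is small here because 180 observations is large relative to three predictors.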

Overall, the model indicates that shots on target play a key role in the number of goals the home team scores. Because it only models home goals, it does not by itself compare home and away efficiency.
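
One possible way to bring the home versus away question into the regression itself is to stack home and away observations into a single long table and let the slope for shots on target differ by venue through an interaction term. The sketch below is an assumption about how that could look, not the model used in this project. It runs on a hypothetical `sample_matches` data frame built from the six rows shown by head(clean_data), and the names `long_data` and `venue_model` are introduced for illustration.

```r
# Six matches copied from the head(clean_data) output (illustration only)
sample_matches <- data.frame(
  FTHG = c(1, 1, 1, 2, 0, 4),
  FTAG = c(0, 3, 1, 1, 1, 4),
  HST  = c(5, 4, 6, 4, 2, 7),
  AST  = c(2, 11, 3, 5, 2, 7)
)

# One row per team per match: home rows first, then away rows
long_data <- rbind(
  data.frame(venue = "Home",
             shots_on_target = sample_matches$HST,
             goals = sample_matches$FTHG),
  data.frame(venue = "Away",
             shots_on_target = sample_matches$AST,
             goals = sample_matches$FTAG)
)
long_data$venue <- factor(long_data$venue, levels = c("Home", "Away"))

# The interaction term estimates how the goals-per-shot slope changes away from home
venue_model <- lm(goals ~ shots_on_target * venue, data = long_data)
coef(venue_model)
```

In this setup the `shots_on_target:venueAway` coefficient is the difference between the away slope and the home slope. With only twelve rows the estimates here mean very little, and the two rows from the same match are not independent, so this is a starting point rather than a finished analysis.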

Phase 4: Interactive Visualization

In this phase, I create a simple interactive visualization to better understand the relationship between shots on target and goals. I compare home and away matches to see if teams perform differently depending on where they play.

This helps answer whether offensive efficiency changes when playing at home versus away.

# Phase 4: Interactive Highcharter Visualization

# This chart shows the relationship between shots on target and goals
# for both home and away teams.

# Home: HST vs FTHG
# Away: AST vs FTAG

library(highcharter)

highchart() %>%
  hc_chart(type = "scatter", zoomType = "xy") %>%
  
  # Title and subtitle
  hc_title(text = "Shots on Target vs Goals (Home vs Away)") %>%
  hc_subtitle(text = "Spanish La Liga 2019-2020") %>%
  
  # Axis labels
  hc_xAxis(title = list(text = "Shots on Target")) %>%
  hc_yAxis(title = list(text = "Goals Scored")) %>%
  
  # Home data
  hc_add_series(
    data = clean_data %>% select(x = HST, y = FTHG) %>% list_parse2(),
    name = "Home Matches",
    color = "#1f77b4"
  ) %>%
  
  # Away data
  hc_add_series(
    data = clean_data %>% select(x = AST, y = FTAG) %>% list_parse2(),
    name = "Away Matches",
    color = "#d62728"
  ) %>%
  
  # Tooltip for interactivity
  hc_tooltip(pointFormat = "Shots: {point.x}<br>Goals: {point.y}") %>%
  
  # Caption (data source)
  hc_caption(text = "Source: Football-Data.co.uk")

# Additional Visualization with 3 Colors (ggplot)

library(ggplot2)

ggplot(clean_data, aes(x = HST, y = FTHG, color = FTR)) +
  geom_point(size = 2, alpha = 0.7) +
  
  # Labels and title
  labs(
    title = "Goals vs Shots on Target by Match Result",
    x = "Shots on Target",
    y = "Goals Scored",
    color = "Match Result",
    caption = "Source: Football-Data.co.uk"
  ) +
  
  # Non-default theme
  theme_minimal()

The visualizations show a clear relationship between shots on target and goals scored. In general, as the number of shots on target increases, the number of goals also increases.

The Highcharter graph allows us to compare home and away matches directly. Visually, home teams may perform slightly better, as their data points look more consistent. This suggests that playing at home could have a positive effect on offensive efficiency, although the chart alone does not test this.

The second graph adds more detail by coloring each match by its full-time result (FTR: H = home win, D = draw, A = away win). Because it plots the home team's shots and goals, it shows that matches with more home shots on target tend to end in home wins, while fewer shots are more often associated with draws or away wins.

Overall, these visualizations confirm that shots on target are an important factor in scoring goals, but they also show that performance can vary depending on the match situation.
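
The visual impression of "consistency" can be put into a number by computing the correlation between shots on target and goals separately for home and away sides. The sketch below is a minimal example under stated assumptions: it uses a hypothetical `sample_matches` data frame with only the six rows shown by head(clean_data), and Pearson correlation is just one reasonable choice of measure.

```r
# Six matches copied from the head(clean_data) output (illustration only)
sample_matches <- data.frame(
  FTHG = c(1, 1, 1, 2, 0, 4),
  FTAG = c(0, 3, 1, 1, 1, 4),
  HST  = c(5, 4, 6, 4, 2, 7),
  AST  = c(2, 11, 3, 5, 2, 7)
)

# Pearson correlation between shots on target and goals, by venue
home_cor <- cor(sample_matches$HST, sample_matches$FTHG)
away_cor <- cor(sample_matches$AST, sample_matches$FTAG)

round(c(home = home_cor, away = away_cor), 2)
```

A higher correlation for one side would mean its goals track its shots on target more closely. Swapping in the columns of the full clean_data would show whether the pattern seen in the chart holds across all matches.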

Discussion of Results

The visualizations show that there is a positive relationship between shots on target and goals. Teams that take more accurate shots tend to score more goals.

One interesting observation is that home teams appear slightly more consistent than away teams, which may suggest a home advantage. However, there is still variation, meaning that scoring goals depends on more than just shots on target.

If I had more time, I would include additional variables such as possession or player performance to better understand what influences goal scoring.

References