Final Project Data 110

Author

Ike Charistan

DATA 110 Final Project: Analyzing the 2024–2025 La Liga Season

Introduction

For my DATA 110 final project, I chose to analyze the 2024–2025 La Liga season—Spain’s top-tier football league—because of my passion for the sport and my curiosity about the statistical patterns behind team performances.

Source: www.football-data.co.uk/

About the Dataset

My primary dataset, season-2425.csv, contains match-by-match statistics for every game in the season. It includes:

Categorical variables (e.g., HomeTeam, AwayTeam, FTR (Full-Time Result))

Quantitative variables (e.g., FTHG [Full-Time Home Goals], FTAG [Full-Time Away Goals], HS [Home Shots], AS [Away Shots], HST [Home Shots on Target], AST [Away Shots on Target])

Why This Topic Matters to Me

Football has always been more than just a game to me: it’s a dynamic interplay of strategy, skill, and statistics. With this project, I want to:

* Explore relationships between match outcomes, shot statistics, and disciplinary records (e.g., do more shots always mean more goals?).
* Apply data visualization to uncover trends (e.g., home vs. away performance differences).
* Use regression techniques to see if certain metrics reliably predict wins, draws, or losses.

Ultimately, I hope to blend data science with football analytics to tell a compelling story about team performance in La Liga.

Variables Used
HomeTeam
home_goal
home_shot_targeted
avg_team_goals
home_yellow_card
home_shot_taken

Load the libraries and the data file "season-2425.csv".

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(webshot2)
library(dplyr)
setwd("~/Desktop/Data 110")
match <- read_csv("season-2425.csv")
Rows: 380 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Date, HomeTeam, AwayTeam, FTR, HTR
dbl (16): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, ...
lgl  (1): Referee

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(match)
# A tibble: 6 × 22
  Date     HomeTeam   AwayTeam  FTHG  FTAG FTR    HTHG  HTAG HTR   Referee    HS
  <chr>    <chr>      <chr>    <dbl> <dbl> <chr> <dbl> <dbl> <chr> <lgl>   <dbl>
1 15/08/24 Ath Bilbao Getafe       1     1 D         1     0 H     NA          7
2 15/08/24 Betis      Girona       1     1 D         1     0 H     NA         19
3 16/08/24 Celta      Alaves       2     1 H         0     1 A     NA          6
4 16/08/24 Las Palmas Sevilla      2     2 D         1     1 D     NA         13
5 17/08/24 Osasuna    Leganes      1     1 D         0     1 A     NA         16
6 17/08/24 Valencia   Barcelo…     1     2 A         1     1 D     NA          6
# ℹ 11 more variables: AS <dbl>, HST <dbl>, AST <dbl>, HF <dbl>, AF <dbl>,
#   HC <dbl>, AC <dbl>, HY <dbl>, AY <dbl>, HR <dbl>, AR <dbl>

Data Cleaning

# Give the football-data.co.uk column codes descriptive names
match1 <- match |>
  rename(
    home_goal = FTHG,
    away_goal = FTAG,
    result = FTR,
    home_shot_taken = HS,
    away_shot_taken = AS,
    home_shot_targeted = HST,
    away_shot_targeted = AST,
    home_fouls = HF,
    away_fouls = AF,
    home_corner = HC,
    away_corner = AC,
    home_yellow_card = HY,
    away_yellow_card = AY
  )

# Keep only the columns used in the analysis
match1 <- match1 |>
  select(
    Date, HomeTeam, AwayTeam,
    home_goal, away_goal, result,
    home_shot_taken, away_shot_taken,
    home_shot_targeted, away_shot_targeted,
    home_fouls, away_fouls,
    home_corner, away_corner,
    home_yellow_card, away_yellow_card
  )

head(match1)
# A tibble: 6 × 16
  Date     HomeTeam   AwayTeam  home_goal away_goal result home_shot_taken
  <chr>    <chr>      <chr>         <dbl>     <dbl> <chr>            <dbl>
1 15/08/24 Ath Bilbao Getafe            1         1 D                    7
2 15/08/24 Betis      Girona            1         1 D                   19
3 16/08/24 Celta      Alaves            2         1 H                    6
4 16/08/24 Las Palmas Sevilla           2         2 D                   13
5 17/08/24 Osasuna    Leganes           1         1 D                   16
6 17/08/24 Valencia   Barcelona         1         2 A                    6
# ℹ 9 more variables: away_shot_taken <dbl>, home_shot_targeted <dbl>,
#   away_shot_targeted <dbl>, home_fouls <dbl>, away_fouls <dbl>,
#   home_corner <dbl>, away_corner <dbl>, home_yellow_card <dbl>,
#   away_yellow_card <dbl>
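
The Date column is still stored as text (`<chr>`) after cleaning. If a later step needed to sort or group matches chronologically, one option would be to convert it with base R's `as.Date()`. The sketch below assumes the day/month/two-digit-year layout seen in the preview above; `sample_dates` and `parsed_dates` are illustrative names that are not part of the original analysis.

```r
# Date strings copied from the first rows of the preview (dd/mm/yy layout).
sample_dates <- c("15/08/24", "16/08/24", "17/08/24")

# %d = day, %m = month, %y = two-digit year (00-68 are read as 20xx).
parsed_dates <- as.Date(sample_dates, format = "%d/%m/%y")

print(parsed_dates)

# Once parsed, differences between dates are measured in days.
print(as.numeric(diff(parsed_dates)))
```

On the full data, the same conversion could be applied inside `mutate()`, for example `mutate(Date = as.Date(Date, format = "%d/%m/%y"))`.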

Average Full-Time Home Goals by Team

top_team_scores <- match1 |>
  group_by(HomeTeam) |>
  summarise(avg_team_goals = mean(home_goal)) |>
  arrange(desc(avg_team_goals)) 

top_team_scores 
# A tibble: 20 × 2
   HomeTeam    avg_team_goals
   <chr>                <dbl>
 1 Barcelona            2.74 
 2 Real Madrid          2.37 
 3 Villarreal           2.26 
 4 Ath Madrid           2.21 
 5 Osasuna              1.74 
 6 Ath Bilbao           1.68 
 7 Betis                1.68 
 8 Celta                1.68 
 9 Girona               1.42 
10 Valencia             1.37 
11 Vallecano            1.26 
12 Espanol              1.21 
13 Leganes              1.21 
14 Las Palmas           1.11 
15 Mallorca             1.05 
16 Sociedad             1.05 
17 Sevilla              0.895
18 Getafe               0.789
19 Alaves               0.737
20 Valladolid           0.579
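
The table above counts only goals scored at home. As a rough sketch of how the same idea could be extended to compare home and away scoring, the example below stacks home and away goals into one long table and averages by venue. It uses only the six opening matches shown in the preview, typed in by hand, so the numbers are illustrative rather than season results. The names `sample_matches`, `team_goals`, and `venue_avg` are assumptions made for this example.

```r
library(dplyr)

# Six opening matches copied from the head(match1) preview (not the full season).
sample_matches <- data.frame(
  HomeTeam  = c("Ath Bilbao", "Betis", "Celta", "Las Palmas", "Osasuna", "Valencia"),
  AwayTeam  = c("Getafe", "Girona", "Alaves", "Sevilla", "Leganes", "Barcelona"),
  home_goal = c(1, 1, 2, 2, 1, 1),
  away_goal = c(1, 1, 1, 2, 1, 2)
)

# Stack home and away appearances so each row is one team in one match.
team_goals <- bind_rows(
  sample_matches |>
    select(team = HomeTeam, goals = home_goal) |>
    mutate(venue = "home"),
  sample_matches |>
    select(team = AwayTeam, goals = away_goal) |>
    mutate(venue = "away")
)

# Average goals per match at each venue.
venue_avg <- team_goals |>
  group_by(venue) |>
  summarise(avg_goals = mean(goals), .groups = "drop")

print(venue_avg)
```

Replacing `sample_matches` with `match1` would give the season-wide comparison, and grouping by `team` and `venue` would give it per club.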

Regression Analysis: Do more shots on target predict more goals?

goal_model <- lm(home_goal ~ home_shot_targeted, data = match1)
goal_model

Call:
lm(formula = home_goal ~ home_shot_targeted, data = match1)

Coefficients:
       (Intercept)  home_shot_targeted  
            0.2131              0.2704  
summary(goal_model)

Call:
lm(formula = home_goal ~ home_shot_targeted, data = match1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7282 -0.5651 -0.2131  0.6238  4.0830 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.21309    0.11056   1.927   0.0547 .  
home_shot_targeted  0.27039    0.02112  12.800   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.04 on 378 degrees of freedom
Multiple R-squared:  0.3024,    Adjusted R-squared:  0.3005 
F-statistic: 163.9 on 1 and 378 DF,  p-value: < 2.2e-16

The linear regression equation is:

Estimated home_goal = 0.2131 + 0.2704 × home_shot_targeted

Mathematical Interpretation

If home_shot_targeted = 5: Estimated home_goal = (0.2704 × 5) + 0.2131 = 1.352 + 0.2131 ≈ 1.57 goals

This suggests that with 5 shots on target, the home team is expected to score ~1.6 goals.
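
The same arithmetic can be applied to a range of shot counts at once. The sketch below hard-codes the two coefficients from the summary output rather than refitting the model, so it runs without the CSV; `predict_home_goals` is a hypothetical helper name. With the fitted model in memory, `predict(goal_model, newdata = ...)` should give the same values.

```r
# Coefficients copied from summary(goal_model).
b0 <- 0.21309  # intercept
b1 <- 0.27039  # slope for home_shot_targeted

# Hypothetical helper: expected home goals for a given number of shots on target.
predict_home_goals <- function(shots_on_target) {
  b0 + b1 * shots_on_target
}

# Expected goals for 0 to 8 shots on target.
shots <- 0:8
predictions <- data.frame(
  shots_on_target = shots,
  expected_goals  = round(predict_home_goals(shots), 2)
)

print(predictions)
```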

Interpretation of Coefficients

A. Intercept (0.2131):

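Meaning: If a home team takes zero shots on target, the model still predicts ~0.21 goals on average. This estimate is not statistically significant at the 5% level (p = 0.0547), so it should be read with caution.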

B. Slope (0.2704)

Meaning: For every additional shot on target, the home team’s expected goals increase by ~0.27.

Implications:

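Each additional expected goal corresponds to ~3.7 additional shots on target (since 1 / 0.2704 ≈ 3.7). Because of the intercept, the fitted line reaches 1 expected goal at ~2.9 shots on target (since (1 - 0.2131) / 0.2704 ≈ 2.9).

The slope can be read as a marginal conversion rate: on average, each additional shot on target adds roughly 0.27 goals, i.e. about 27% of a goal.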
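
Two different questions sit behind these implications: how many extra shots on target add one expected goal, and at how many shots on target the fitted line reaches one goal. A small sketch, using the coefficients copied from the summary output (the variable names are illustrative):

```r
# Coefficients copied from summary(goal_model).
b0 <- 0.21309  # intercept
b1 <- 0.27039  # slope for home_shot_targeted

# Extra shots on target associated with one extra expected goal.
shots_per_extra_goal <- 1 / b1

# Shots on target at which the fitted line equals exactly one goal:
# solve 1 = b0 + b1 * x for x.
shots_for_one_goal <- (1 - b0) / b1

print(round(c(per_extra_goal = shots_per_extra_goal,
              for_one_goal   = shots_for_one_goal), 2))
```

The two numbers differ only because the intercept is not zero; if the line passed through the origin they would coincide.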

Analysis of the Regression Line Result

To explore whether shooting accuracy translates into scoring, I conducted a linear regression using full-time home goals (home_goal) as the response variable and home shots on target (home_shot_targeted) as the predictor.

The regression revealed a moderate, statistically significant positive relationship: home teams that register more shots on target tend to score more goals. However, the adjusted R² of 0.3005 means that shots on target account for only about 30% of the variation in home goals, leaving most of it unexplained.
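
In a regression with a single predictor, R² is the square of the Pearson correlation between predictor and response, and the F-statistic is the square of the slope's t value. That gives another way to read the fit. The sketch below works from the figures printed in the summary output (R² = 0.3024, t = 12.800) rather than from the raw data; the variable names are illustrative.

```r
# Figures copied from summary(goal_model).
r_squared <- 0.3024
slope_t   <- 12.800

# With one predictor, the correlation is the square root of R-squared,
# taking the sign of the slope (positive here).
correlation <- sqrt(r_squared)

# With one predictor, F equals t squared (the reported F is 163.9;
# the small gap comes from rounding in the printed t value).
f_from_t <- slope_t^2

# Share of the variation in home goals the model leaves unexplained.
unexplained <- 1 - r_squared

print(round(c(correlation = correlation,
              f_from_t    = f_from_t,
              unexplained = unexplained), 3))
```

A correlation of roughly 0.55 fits the description of a clear but far from complete link between shots on target and goals.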

Final Visualization 1: Top 10 Home Teams by Average Goals

This bar chart highlights the top 10 La Liga home teams by average goals scored during the 2024–2025 season. Four teams average more than two goals per home match: Barcelona (2.74), Real Madrid (2.37), Villarreal (2.26), and Ath Madrid (2.21).

top_10 <- top_team_scores |> slice_head(n = 10)

ggplot(top_10, aes(x = reorder(HomeTeam, avg_team_goals), y = avg_team_goals, fill = avg_team_goals)) +
  geom_col(color = "orange", width = 0.8) +
  coord_flip() +
  scale_fill_gradientn(colors = c("#9FE2BF", "#40E0D0", "#3CB371")) +
  labs(title = "Top 10 Home Teams by Avg Goals Scored",
       x = "Team",
       y = "Average Goals",
       caption = "Source: season-2425.csv") +
  theme_bw()

Final Visualization 2: Interactive Plot (Shots on Target vs Goals)

This scatterplot brings the match data to life, showing how shots on target, goals scored, and yellow cards relate in individual games. Each dot represents one match from the home team's perspective; hover over a point to see the home team and its goals, shots on target, and yellow cards.

# The text aesthetic is not drawn by ggplot2; ggplotly() uses it for the hover tooltip
plot2 <- ggplot(match1, aes(x = home_shot_targeted, y = home_goal, color = home_yellow_card,
    text = paste("Home Team:", HomeTeam,
                 "<br>Goals:", home_goal,
                 "<br>Shots on Target:", home_shot_targeted,
                 "<br>Yellow Cards:", home_yellow_card))) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Interactive: Shots on Target vs Goals (by Yellow Cards)",
       x = "Shots on Target (Home)",
       y = "Full-Time Home Goals",
       color = "Yellow Cards",
       caption = "Source: season-2425.csv") +
  theme_light()

plotly::ggplotly(plot2, tooltip = "text")

Conclusion

The regression analysis confirms a statistically significant relationship between shots on target and goals scored in home matches. The extremely low p-value (p < 2e-16) for home_shot_targeted indicates that the association is very unlikely to be due to chance, and the slope shows that each additional shot on target increases expected goals by about 0.27 on average.

The model explains 30.2% of the variation in goals (R² = 0.3024), suggesting that while shots on target are important, other factors (e.g., shot quality, opponent defense, set pieces) also play a role. The intercept (0.213) implies that even with zero shots on target, teams still have a small chance of scoring (e.g., from own goals), although this estimate is not statistically significant at the 5% level (p = 0.0547).
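
To put a range around both estimates, 95% confidence intervals can be rebuilt from the standard errors in the summary output. The sketch below copies the estimates, standard errors, and residual degrees of freedom from that output instead of refitting the model; the object names are illustrative. With the fitted model in memory, `confint(goal_model)` should give matching intervals.

```r
# Values copied from summary(goal_model).
estimates  <- c(intercept = 0.21309, slope = 0.27039)
std_errors <- c(intercept = 0.11056, slope = 0.02112)
df_resid   <- 378

# Critical t value for a two-sided 95% interval.
t_crit <- qt(0.975, df = df_resid)

# Lower and upper bounds: estimate +/- t_crit * standard error.
ci <- cbind(
  lower = estimates - t_crit * std_errors,
  upper = estimates + t_crit * std_errors
)

print(round(ci, 3))
```

The slope interval stays well above zero, while the intercept interval includes zero, which is consistent with the p-values reported for the two coefficients.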