DATA 110 Final Project: Analyzing the 2024–2025 La Liga Season
Introduction
For my DATA 110 final project, I chose to analyze the 2024–2025 La Liga season—Spain’s top-tier football league—because of my passion for the sport and my curiosity about the statistical patterns behind team performances.
Source: www.football-data.co.uk/
About the Dataset
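My primary dataset, season-2425.csv, contains match-by-match statistics for every game in the season: 380 matches and 22 columns. It includes the match date, the home and away teams, full-time and half-time goals and results, and, for each side, shots, shots on target, fouls, corners, and yellow and red cards.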
Football has always been more than just a game to me: it’s a dynamic interplay of strategy, skill, and statistics. With this project, I want to:
* Explore relationships between match outcomes, shot statistics, and disciplinary records (e.g., do more shots always mean more goals?).
* Apply data visualization to uncover trends (e.g., home vs. away performance differences).
* Use regression techniques to see if certain metrics reliably predict wins, draws, or losses.
Ultimately, I hope to blend data science with football analytics to tell a compelling story about team performance in La Liga.
Variables Used
HomeTeam
home_goal
home_shot_targeted
avg_team_goals
home_yellow_card
home_shot_taken
Load the libraries and the data "season-2425.csv".
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
match <- read_csv("season-2425.csv")
Rows: 380 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Date, HomeTeam, AwayTeam, FTR, HTR
dbl (16): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, ...
lgl (1): Referee
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(match)
# A tibble: 6 × 22
  Date     HomeTeam   AwayTeam  FTHG  FTAG FTR    HTHG  HTAG HTR   Referee    HS
  <chr>    <chr>      <chr>    <dbl> <dbl> <chr> <dbl> <dbl> <chr> <lgl>   <dbl>
1 15/08/24 Ath Bilbao Getafe       1     1 D         1     0 H     NA          7
2 15/08/24 Betis      Girona       1     1 D         1     0 H     NA         19
3 16/08/24 Celta      Alaves       2     1 H         0     1 A     NA          6
4 16/08/24 Las Palmas Sevilla      2     2 D         1     1 D     NA         13
5 17/08/24 Osasuna    Leganes      1     1 D         0     1 A     NA         16
6 17/08/24 Valencia   Barcelo…     1     2 A         1     1 D     NA          6
# ℹ 11 more variables: AS <dbl>, HST <dbl>, AST <dbl>, HF <dbl>, AF <dbl>,
#   HC <dbl>, AC <dbl>, HY <dbl>, AY <dbl>, HR <dbl>, AR <dbl>
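The later code refers to a match1 table with columns such as home_goal and home_shot_targeted, but the step that creates it is not shown in this write-up. Here is a minimal sketch of one way it could be done. It assumes, from the plot labels and the football-data.co.uk column codes, that FTHG, HS, HST, and HY map to home_goal, home_shot_taken, home_shot_targeted, and home_yellow_card. The names match_toy and match1_toy are hypothetical, and the HST and HY values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the loaded data. HomeTeam, FTHG and HS follow the first
# two rows printed above; HST and HY are made-up values for illustration.
match_toy <- data.frame(
  HomeTeam = c("Ath Bilbao", "Betis"),
  FTHG = c(1, 1),
  HS   = c(7, 19),
  HST  = c(3, 6),
  HY   = c(2, 1)
)

# Assumed mapping from the raw column codes to the names used in the analysis
match1_toy <- match_toy |>
  rename(
    home_goal          = FTHG,
    home_shot_taken    = HS,
    home_shot_targeted = HST,
    home_yellow_card   = HY
  )

names(match1_toy)
```

Renaming up front keeps the plotting and modelling code readable, at the cost of one extra lookup step for anyone comparing against the raw CSV.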
The fitted line (home_goal ≈ 0.2131 + 0.2704 × home_shot_targeted) suggests that with 5 shots on target, the home team is expected to score ~1.6 goals.
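The ~1.6 figure can be reproduced from the coefficients reported in this write-up (intercept 0.2131, slope 0.2704). A small sketch follows; predict_home_goals is a hypothetical helper name, not part of the original analysis.

```r
# Coefficients as reported in this write-up
intercept <- 0.2131
slope     <- 0.2704

# Hypothetical helper: expected home goals for a given number of shots on target
predict_home_goals <- function(shots_on_target) {
  intercept + slope * shots_on_target
}

predict_home_goals(5)  # 0.2131 + 0.2704 * 5 = 1.5651
```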
2. Interpretation of Coefficients
A. Intercept (0.2131):
Meaning: If a home team takes zero shots on target, they are still expected to score ~0.21 goals on average.
B. Slope (0.2704):
Meaning: For every additional shot on target, the home team’s expected goals increase by ~0.27.
Implications:
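Each additional expected goal corresponds to about 3.7 additional shots on target (1 / 0.2704 ≈ 3.7). Because the intercept is positive, the fitted line itself reaches 1 expected goal at about 2.9 shots on target ((1 − 0.2131) / 0.2704 ≈ 2.9).
The slope can be read as a marginal conversion rate: each extra shot on target adds roughly 0.27 expected goals, i.e. about 27% of additional shots on target turn into goals.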
Plot related to the Linear Regression.
# Keep eight selected home teams (names as spelled in the dataset)
match2 <- match1 |>
  filter(HomeTeam %in% c("Barcelona", "Celta", "Real Madrid", "Villarreal",
                         "Ath Madrid", "Osasuna", "Ath Bilbao", "Betis"))

# Scatter of shots on target vs goals, with one fitted line per team
ggplot(match2, aes(x = home_shot_targeted, y = home_goal, color = HomeTeam)) +
  geom_point(alpha = 0.6, color = "#32CD32") +  # fixed lime-green points
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Linear Regression: Home Shots on Target vs Goals",
    x = "Shots on Target (Home)",
    y = "Full-Time Home Goals",
    caption = "Source: season-2425.csv"
  ) +
  theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Analysis of the Regression Line Result
To explore whether shooting accuracy translates into scoring, I conducted a linear regression using full-time home goals (home_goal) as the response variable and home shots on target (home_shot_targeted) as the predictor.
The regression revealed a modest but statistically significant relationship: teams that register more shots on target tend to score more goals. However, the adjusted R² value is relatively low, suggesting that shots on target alone don’t account for most of the variation in goals.
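For readers who want to see the shape of this step, here is a minimal sketch of fitting and inspecting such a model with lm(). The data frame toy is made up for illustration and is not taken from season-2425.csv, so its coefficients will not match the ones reported in this write-up.

```r
# Toy data (made up for illustration; NOT taken from season-2425.csv)
toy <- data.frame(
  home_shot_targeted = c(1, 2, 3, 4, 5, 6, 7, 8),
  home_goal          = c(0, 1, 1, 1, 2, 2, 2, 3)
)

# Simple linear regression: goals as a function of shots on target
fit <- lm(home_goal ~ home_shot_targeted, data = toy)

# Pull out the quantities discussed in the text
coefs       <- coef(fit)                  # intercept and slope
fit_summary <- summary(fit)
r2          <- fit_summary$r.squared      # share of variance explained
adj_r2      <- fit_summary$adj.r.squared  # R-squared adjusted for model size
p_slope     <- coef(fit_summary)["home_shot_targeted", "Pr(>|t|)"]

coefs
c(r2 = r2, adj_r2 = adj_r2, p_slope = p_slope)
```

On the real data the same calls would be run with match1 in place of toy, assuming match1 carries the renamed columns used elsewhere in this report.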
Final Visualization 1: Top 10 Home Teams by Average Goals
This bar chart highlights the top 10 La Liga home teams by average goals scored during the 2024–2025 season. It shows that certain teams dominate offensively at home.
# Keep the first 10 rows of top_team_scores
top_10 <- top_team_scores |>
  slice_head(n = 10)

# Horizontal bar chart, ordered by average goals
ggplot(top_10, aes(x = reorder(HomeTeam, avg_team_goals), y = avg_team_goals, fill = avg_team_goals)) +
  geom_col(color = "orange", width = 0.8) +
  coord_flip() +
  scale_fill_gradientn(colors = c("#9FE2BF", "#40E0D0", "#3CB371")) +
  labs(
    title = "Top 10 Home Teams by Avg Goals Scored",
    x = "Team",
    y = "Average Goals",
    caption = "Source: season-2425.csv"
  ) +
  theme_bw()
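The top_team_scores table used in the chart code is not defined in this write-up. One plausible way to build it, sketched here on made-up toy data, is to average home goals per team and sort in descending order so that slice_head(n = 10) picks the highest-scoring teams. The team labels and goal values below are hypothetical.

```r
library(dplyr)

# Toy matches (made up for illustration; NOT taken from season-2425.csv)
toy_matches <- data.frame(
  HomeTeam  = c("A", "A", "B", "B", "C"),
  home_goal = c(3, 1, 0, 2, 4)
)

# Average home goals per team, highest first
top_team_scores <- toy_matches |>
  group_by(HomeTeam) |>
  summarise(avg_team_goals = mean(home_goal), .groups = "drop") |>
  arrange(desc(avg_team_goals))

top_team_scores
```

Sorting before slicing matters: slice_head() simply takes the first rows, so an unsorted table would give an arbitrary ten teams instead of the top ten.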
Final Visualization 2: Interactive Plot (Shots on Target vs Goals)
This scatterplot brings the match data to life, showing how shots on target, goals scored, and yellow cards relate in real games. Each dot represents one home match; hover over a point to see the home team, its goals, its shots on target, and its yellow cards.
# Build the static ggplot; the text aesthetic supplies the hover tooltip
plot2 <- ggplot(match1, aes(
  x = home_shot_targeted, y = home_goal, color = home_yellow_card,
  text = paste("Home Team:", HomeTeam,
               "<br>Goals:", home_goal,
               "<br>Shots on Target:", home_shot_targeted,
               "<br>Yellow Cards:", home_yellow_card)
)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Interactive: Shots on Target vs Goals (by Yellow Cards)",
    x = "Shots on Target (Home)",
    y = "Full-Time Home Goals",
    color = "Yellow Cards",
    caption = "Source: season-2425.csv"
  ) +
  theme_light()

# Convert to an interactive plotly widget
plotly::ggplotly(plot2, tooltip = "text")
Conclusion
The regression analysis confirms a statistically significant relationship between shots on target and goals scored in home matches. The extremely low p-value (p < 2e-16) for home_shot_targeted indicates that this association is very unlikely to be due to chance, and each additional shot on target is associated with an increase of about 0.27 expected goals.
The model explains 30.2% of the variation in goals (R² = 0.3024), suggesting that while shots on target are important, other factors (e.g., shot quality, opponent defense, set pieces) also play a role. The intercept (0.213) is the predicted number of goals at zero shots on target; it implies that teams can still occasionally score without one (e.g., from an opponent's own goal).