Data source URL: https://www.kaggle.com/datasets/sanjeetsinghnaik/la-liga-match-data

Introduction

This project presents a comprehensive exploration of Spain’s top-flight football league, La Liga, one of the world’s best leagues and home to top teams and players. We explore various aspects of La Liga using data visualization techniques to uncover insights about team performance, league dynamics, and historical trends. The visualizations cover total goals scored, match excitement, match results, yellow cards, and the best teams over the years.

Dataset

This dataset contains information from 2,660 La Liga matches played between 2014 and 2020. Each row corresponds to a single match and includes details such as goals, fouls, possession percentages, corners, yellow cards, match excitement, and more. With this information, we can analyze team performances, compare stats, and spot trends over the seasons to better understand how La Liga games unfold.

Findings

The following tabs use data visualizations to show how La Liga matches play out, highlighting important trends, team performances, and key match statistics.

Total goals scored by year with match excitement

The bubble chart shows the total number of goals scored in La Liga each year between 2014 and 2020. It adds a layer of insight by showing the average match excitement, represented by the size and color intensity of the bubbles. 2016 had the highest number of goals (1,118) and also the highest average match excitement, whereas 2019 had the fewest goals (942) with relatively low average match excitement. Goal totals and excitement levels were relatively high from 2015 to 2017, and both declined from 2018 to 2020. This suggests a positive relationship between total goals and average match excitement, although seven yearly data points are too few to establish that the relationship is linear.

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import seaborn as sns
import matplotlib.cm as cm


file_path = "C:\\Users\\chhan\\LaLigaData\\combined_data_laliga.csv"
data = pd.read_csv(file_path)

# combining goals from home and away into one
data['Total Goals'] = data['Home Team Goals Scored'] + data['Away Team Goals Scored']

# Grouping data by year and finding the average excitement in each year through mean
annual_data = data.groupby('year').agg({
    'Total Goals': 'sum',
    'Match Excitement': 'mean'
}).reset_index()

# Adding commas in the label
def add_commas(x, pos):
    return f'{x:,.0f}'

# Displaying the data as a bubble (scatter) plot
plt.figure(figsize=(10, 6))

# Manipulating scatter plot
bubble_sizes = (annual_data['Match Excitement'] - annual_data['Match Excitement'].min() + 0.1) ** 1.3 * 1000
scatter = plt.scatter(x=annual_data['year'], y=annual_data['Total Goals'],
                      s=bubble_sizes,
                      c=annual_data['Match Excitement'], cmap='viridis', alpha=0.6, edgecolors='w', linewidth=0.5)
plt.colorbar(scatter, label='Average Match Excitement')
plt.xlabel('Year')
plt.ylabel('Total Goals Scored')
plt.title('Total Goals Scored by Year with Match Excitement')


# Labelling the highest and lowest yearly goal totals next to their bubbles
max_row = annual_data.loc[annual_data['Total Goals'].idxmax()]
min_row = annual_data.loc[annual_data['Total Goals'].idxmin()]
plt.text(max_row['year'] + 0.2, max_row['Total Goals'],
         f"{int(max_row['Total Goals']):,}", va='center', fontsize=8)
plt.text(min_row['year'] + 0.2, min_row['Total Goals'],
         f"{int(min_row['Total Goals']):,}", va='center', fontsize=8)


formatter = FuncFormatter(add_commas)
plt.gca().yaxis.set_major_formatter(formatter)

plt.grid(True)
plt.show()
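
The positive relationship described for this chart can also be checked numerically. The sketch below computes a Pearson correlation between yearly goal totals and average excitement. The helper name `goals_excitement_correlation` and the toy values are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd


def goals_excitement_correlation(annual: pd.DataFrame) -> float:
    """Pearson correlation between yearly goal totals and mean excitement."""
    return annual["Total Goals"].corr(annual["Match Excitement"])


# Made-up yearly values, for illustration only (not from the La Liga data).
toy_annual = pd.DataFrame({
    "year": [1, 2, 3, 4],
    "Total Goals": [100, 110, 120, 130],
    "Match Excitement": [5.0, 5.5, 6.0, 6.5],
})

r = goals_excitement_correlation(toy_annual)
print(f"Pearson r = {r:.2f}")
```

On the real data, the same function could be called with `annual_data`. With only seven yearly points, the resulting coefficient should be read as suggestive rather than conclusive.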

Top 3 Teams by Goals per Year

For every season from 2014 to 2020, this horizontal bar chart shows the top three scoring teams in La Liga, along with the total number of goals each team scored. Barcelona recorded the highest single-season total in this period, with 116 goals in 2016. Throughout 2014–2020, Barcelona and Real Madrid were clearly the most dominant attacking teams, appearing among the top three scorers every year.

# setting up the data and grouping them by year
home_goals = data.groupby(['year', 'Home Team'])['Home Team Goals Scored'].sum().reset_index()
away_goals = data.groupby(['year', 'Away Team'])['Away Team Goals Scored'].sum().reset_index()

# merging
home_goals.rename(columns={'Home Team': 'Team', 'Home Team Goals Scored': 'Goals'}, inplace=True)
away_goals.rename(columns={'Away Team': 'Team', 'Away Team Goals Scored': 'Goals'}, inplace=True)

# Combining goals
total_goals = pd.concat([home_goals, away_goals])
total_goals_grouped = total_goals.groupby(['year', 'Team'])['Goals'].sum().reset_index()

# Sorting and selecting top 3 for each year
top_teams_per_year = (total_goals_grouped.sort_values(by=['year', 'Goals'], ascending=[True, False])
                      .groupby('year').head(3).sort_values(by=['year', 'Goals'], ascending=[True, False]))

# Plotting the result into bar chart
fig, ax = plt.subplots(figsize=(15, 10))
step = 0.1  # Reduced space between groups of bars
group_width = 0.75  # Adjusted width for each group

# Setting up the colors for bar chart
colors = ['#FFD700', '#C0C0C0', '#CD7F32']  # gold, silver, bronze (as hex codes)
labels = ['1st Highest Scorer', '2nd Highest Scorer', '3rd Highest Scorer']  
years = sorted(top_teams_per_year['year'].unique())

for i, year in enumerate(years):
    year_data = top_teams_per_year[top_teams_per_year['year'] == year]
    for j, row in enumerate(year_data.itertuples()):
        position = i * (group_width + step) + j * (group_width / 3)
        bar = ax.barh(position, row.Goals, height=0.25, color=colors[j], align='center', edgecolor='black', label=labels[j] if i == 0 else "")
        ax.text(row.Goals + 1, position, f"{row.Team} ({row.Goals})", va='center')

# Removing duplicates from legend (legend_labels avoids reusing the name of the labels list above)
handles, legend_labels = ax.get_legend_handles_labels()
by_label = dict(zip(legend_labels, handles))
ax.legend(by_label.values(), by_label.keys())

# Title
ax.set_title('Top 3 Highest Scoring Teams by Year in LaLiga (2014–2020)', fontsize=15, pad=15)

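# Calculate tick positions: bars sit at offsets 0, group_width/3 and 2*group_width/3,
# so the middle bar of each year's group is at group_width/3
year_positions = [i * (group_width + step) + group_width / 3 for i in range(len(years))]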

# Making changes to y axis
ax.set_yticks(year_positions)
ax.set_yticklabels([f'{year}' for year in years])
ax.set_ylabel('Year', labelpad=10, va='bottom', ha='center', fontsize=18)


# making changes to x-axis
ax.set_xlabel('Total Goals Scored', fontsize=18)
ax.set_xlim(0, 150)
plt.tight_layout()
plt.show()
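
The aggregation behind this chart (stack home and away goals, sum per team and year, keep the top three) can be wrapped in one reusable function. This is a minimal sketch assuming the same column names as the dataset; the function name `top_scorers_per_year` and the toy fixtures are hypothetical.

```python
import pandas as pd


def top_scorers_per_year(matches: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return the n highest-scoring teams for each year."""
    # Reshape home and away goals into one long table: year, Team, Goals.
    home = matches[["year", "Home Team", "Home Team Goals Scored"]].rename(
        columns={"Home Team": "Team", "Home Team Goals Scored": "Goals"})
    away = matches[["year", "Away Team", "Away Team Goals Scored"]].rename(
        columns={"Away Team": "Team", "Away Team Goals Scored": "Goals"})
    totals = (pd.concat([home, away], ignore_index=True)
              .groupby(["year", "Team"], as_index=False)["Goals"].sum())
    # Sort by goals within each year, then keep the first n rows per year.
    return (totals.sort_values(["year", "Goals"], ascending=[True, False])
                  .groupby("year").head(n)
                  .reset_index(drop=True))


# Toy fixtures with made-up scores, for illustration only.
toy_matches = pd.DataFrame({
    "year": [2014, 2014, 2014, 2014],
    "Home Team": ["A", "B", "C", "D"],
    "Away Team": ["B", "C", "D", "A"],
    "Home Team Goals Scored": [3, 2, 0, 0],
    "Away Team Goals Scored": [0, 1, 0, 2],
})

top = top_scorers_per_year(toy_matches)
print(top)
```

Teams level on goals are kept in whatever order the sort leaves them, so a tie at third place would need an explicit rule if it mattered.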

La Liga match result distribution

The donut chart shows the distribution of results across the 2,660 games played between 2014 and 2020, with three possible outcomes: home wins, away wins, and draws. Home teams won 45.6% of the games, which points to a significant home-field advantage, whereas away teams won about 28.7% and draws accounted for 25.6%.

# computing for match result
def get_result(row):
    if row["Home Team Goals Scored"] > row["Away Team Goals Scored"]:
        return "Home Win"
    elif row["Home Team Goals Scored"] < row["Away Team Goals Scored"]:
        return "Away Win"
    else:
        return "Draw"

# Creating a Result column to plot in the donut chart
data["Result"] = data.apply(get_result, axis=1)

# Counting different types of result
result_counts = data["Result"].value_counts()
total_games = result_counts.sum()

# adding both percentage and counts in each wedge of the chart to make it more clear
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        count = int(round(pct * total / 100.0))
        return f'{pct:.1f}%\n({count:,})'
    return my_autopct
    
# Plotting a donut chart
fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    result_counts,
    labels=None,  # keep labels out of the wedges
    autopct=make_autopct(result_counts),
    startangle=90,
    wedgeprops=dict(width=0.6)
)

# Adding total number of games in center of the donut
ax.text(0, 0, f"Total Games\n{total_games:,}", ha='center', va='center', fontsize=12, fontweight='bold')

# creating legend
ax.legend(wedges, result_counts.index, title="Match Results", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))

# creating title
ax.set_title("LaLiga Match Result Distribution for all teams", fontsize=14)
plt.tight_layout()
plt.show()
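
The `get_result` function above is applied one row at a time. As a hedged alternative, the same classification can be done on whole columns with `numpy.select`, which tends to be faster on larger tables. The function name `classify_results` and the toy scores are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def classify_results(matches: pd.DataFrame) -> pd.Series:
    """Label each match as 'Home Win', 'Away Win' or 'Draw'."""
    home = matches["Home Team Goals Scored"]
    away = matches["Away Team Goals Scored"]
    # The first matching condition wins; anything else is a draw.
    labels = np.select(
        [home > away, home < away],
        ["Home Win", "Away Win"],
        default="Draw",
    )
    return pd.Series(labels, index=matches.index, name="Result")


# Made-up scores, for illustration only.
toy_scores = pd.DataFrame({
    "Home Team Goals Scored": [2, 0, 1, 3],
    "Away Team Goals Scored": [1, 1, 1, 0],
})

results = classify_results(toy_scores)
shares = results.value_counts(normalize=True) * 100  # percentage per outcome
print(shares.round(1))
```

Using `value_counts(normalize=True)` gives the percentages directly, which is a convenient cross-check against the figures printed inside the wedges.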

Average yellow cards per game by team each year

The heat map displays the average number of yellow cards per game received by each La Liga team from 2014 to 2020. Because La Liga follows a promotion and relegation system, teams with zero values in certain years were not in La Liga during those seasons, so no data is available for them. Top teams like Real Madrid and Barcelona consistently receive fewer yellow cards, which suggests a less aggressive style. In contrast, teams like Getafe, Espanyol, and Sevilla show higher averages, suggesting a more physical or aggressive style of play.

# All teams that appear in the dataset (home or away)
teams = pd.unique(pd.concat([data["Home Team"], data["Away Team"]]))

# data structure
card_stats = []

# using for loop through each team and year
for team in teams:
    for year in sorted(data["year"].unique()):
        matches = data[data["year"] == year]
        home = matches[matches["Home Team"] == team]
        away = matches[matches["Away Team"] == team]

        # counting total yellow cards at home and away
        home_yellows = home["Home Team Yellow Cards"].sum()
        away_yellows = away["Away Team Yellow Cards"].sum()
        total_yellows = home_yellows + away_yellows

        # games_played is kept separate from the donut chart's total_games
        games_played = len(home) + len(away)
        avg_yellows = total_yellows / games_played if games_played > 0 else 0

        card_stats.append({
            "Team": team,
            "Year": year,
            "Avg Yellow Cards": round(avg_yellows, 2)
        })

# Creating Data frame and converting into matrix
yellowcards_df = pd.DataFrame(card_stats)
yellow_card_matrix = yellowcards_df.pivot(index="Team", columns="Year", values="Avg Yellow Cards")

# Plotting the heat map
plt.figure(figsize=(12, 9))
sns.heatmap(
    yellow_card_matrix,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    linewidths=0.7,
    cbar_kws={'label': 'Average Yellow Cards per Game'}
)

plt.title("Average Yellow Cards per Game by Team and Year", fontsize=16)
plt.xlabel("Year")
plt.ylabel("Team")
plt.tight_layout()
plt.show()
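
The code above assigns 0 to seasons in which a team did not play in La Liga, so those cells are colored as if the team had received no cards. A possible alternative, sketched below, builds the same matrix with `pivot_table` and leaves absent seasons as NaN, which `sns.heatmap` masks automatically. The function name `avg_yellow_cards` and the toy values are hypothetical.

```python
import pandas as pd


def avg_yellow_cards(matches: pd.DataFrame) -> pd.DataFrame:
    """Average yellow cards per game as a Team x year matrix (NaN if absent)."""
    # One row per team per match, whether the team played at home or away.
    home = matches[["year", "Home Team", "Home Team Yellow Cards"]].rename(
        columns={"Home Team": "Team", "Home Team Yellow Cards": "Yellows"})
    away = matches[["year", "Away Team", "Away Team Yellow Cards"]].rename(
        columns={"Away Team": "Team", "Away Team Yellow Cards": "Yellows"})
    long_form = pd.concat([home, away], ignore_index=True)
    # The mean over a team's matches equals total cards / games played.
    return long_form.pivot_table(
        index="Team", columns="year", values="Yellows", aggfunc="mean")


# Made-up card counts, for illustration only.
toy_cards = pd.DataFrame({
    "year": [2014, 2014, 2015],
    "Home Team": ["A", "B", "A"],
    "Away Team": ["B", "A", "C"],
    "Home Team Yellow Cards": [2, 4, 1],
    "Away Team Yellow Cards": [3, 0, 5],
})

matrix = avg_yellow_cards(toy_cards)
print(matrix)
```

With this approach, a blank cell means "not in the league that year", while a genuine 0.00 would still be shown and colored.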

Final ranking of the best 5 teams in La Liga each year

The bump chart displays the final league positions of five top La Liga teams (Real Madrid, Barcelona, Atlético Madrid, Valencia, and Sevilla) from 2014 to 2020. Positions are derived from points computed from the match results: 3 for a win, 1 for a draw, and 0 for a loss. Both Barcelona and Real Madrid finished in the top 3 every year, demonstrating their dominance in the league. Atlético Madrid also maintained a strong presence, finishing in the top 3 in all seasons except 2019, when they placed 4th. Sevilla fluctuated between 3rd and 7th, and Valencia between 4th and 13th.

# Computing the points each team earns from a match (3 for a win, 1 for a draw, 0 for a loss)

def calculate_points(row):
    home_team = row["Home Team"]
    away_team = row["Away Team"]
    home_goals = row["Home Team Goals Scored"]
    away_goals = row["Away Team Goals Scored"]
    
    if home_goals > away_goals:
        return {home_team: 3, away_team: 0}
    elif home_goals < away_goals:
        return {home_team: 0, away_team: 3}
    else:
        return {home_team: 1, away_team: 1}

team_year_points = {}

for _, row in data.iterrows():
    year = row["year"]
    points = calculate_points(row)
    for team, pts in points.items():
        team_year_points.setdefault(year, {}).setdefault(team, 0)
        team_year_points[year][team] += pts

# Building a rank table to determine each team's rank in each year
# (no tie-breakers are applied: teams level on points keep their order of first appearance)
ranking_rows = []
for year, team_points in team_year_points.items():
    sorted_teams = sorted(team_points.items(), key=lambda x: x[1], reverse=True)
    for rank, (team, pts) in enumerate(sorted_teams, start=1):
        ranking_rows.append({"Year": year, "Team": team, "Points": pts, "Rank": rank})

ranking_df = pd.DataFrame(ranking_rows)

# Selecting the teams I want to observe
selected_teams = ["REAL MADRID", "ATLETICO MADRID", "BARCELONA", "VALENCIA", "SEVILLA FC"]
filtered_df = ranking_df[ranking_df["Team"].isin(selected_teams)]

# plotting the data into bump chart
bump_data = filtered_df.pivot(index="Year", columns="Team", values="Rank").sort_index()

plt.figure(figsize=(12, 6))
for team in selected_teams:
    plt.plot(bump_data.index, bump_data[team], marker='o', label=team, linewidth=2)
    for x, y in zip(bump_data.index, bump_data[team]):
        plt.text(x, y, str(int(y)), fontsize=8, ha='center', va='bottom')
plt.gca().invert_yaxis()  # rank 1 at the top
plt.title("LaLiga Final Rankings by Year for the Best 5 Teams", fontsize=14)
plt.xlabel("Year")
plt.ylabel("Final League Rank")
plt.legend(title="Team")
plt.grid(True, linestyle="--", alpha=0.3)
plt.tight_layout()
plt.show()
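
The ranking code above orders teams by points only, so teams level on points are not separated by any rule. The sketch below shows one way to build the table with column operations and break ties by overall goal difference. This is a simplifying assumption: La Liga's actual rules consider head-to-head results before overall goal difference. The function name `league_table` and the toy fixtures are hypothetical.

```python
import pandas as pd


def league_table(matches: pd.DataFrame) -> pd.DataFrame:
    """Rank teams per year by points, then by overall goal difference."""
    hg = matches["Home Team Goals Scored"]
    ag = matches["Away Team Goals Scored"]
    # Points and goal difference from each side's point of view.
    home = pd.DataFrame({
        "Year": matches["year"], "Team": matches["Home Team"],
        "Points": (hg > ag) * 3 + (hg == ag) * 1, "GD": hg - ag,
    })
    away = pd.DataFrame({
        "Year": matches["year"], "Team": matches["Away Team"],
        "Points": (ag > hg) * 3 + (ag == hg) * 1, "GD": ag - hg,
    })
    table = (pd.concat([home, away], ignore_index=True)
             .groupby(["Year", "Team"], as_index=False)[["Points", "GD"]].sum()
             .sort_values(["Year", "Points", "GD"],
                          ascending=[True, False, False]))
    # After sorting, the position within each year is the rank.
    table["Rank"] = table.groupby("Year").cumcount() + 1
    return table.reset_index(drop=True)


# Made-up fixtures in which every team ends on 3 points, for illustration only.
toy_fixtures = pd.DataFrame({
    "year": [2014, 2014, 2014],
    "Home Team": ["A", "B", "C"],
    "Away Team": ["B", "C", "A"],
    "Home Team Goals Scored": [2, 1, 3],
    "Away Team Goals Scored": [0, 0, 2],
})

table = league_table(toy_fixtures)
print(table)
```

In the toy fixtures all three teams finish level on points, so the order is decided entirely by goal difference. The output could be filtered with `isin(selected_teams)` and pivoted in the same way as `ranking_df`.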

Conclusion

In this project, we analyzed La Liga performance between 2014 and 2020 through data and visual representations. Barcelona and Real Madrid were the most consistent performers, securing top-three positions in both goal scoring and the league table, closely followed by Atlético Madrid. Getafe, Sevilla, and Espanyol were among the teams receiving the most yellow cards, suggesting a more aggressive playing style, whereas Real Madrid and Barcelona played with more discipline.

The analysis shows that home teams won the largest share of matches (45.6%), highlighting a strong home-field advantage. Additionally, goal scoring peaked in 2016, which also had the highest average match excitement, making it one of the most exciting seasons of the period.