Final Data 110 Project

Essay Part A

For my final project I will be examining soccer data. The source for this dataset can be found here Link to source. This dataset explores soccer data from different soccer leagues across North America, South America, Europe, and Asia. Variables included in this dataset are Seasons, league, home teams, away teams, home goals, away goals, wins, and other match result data. The data was collected from official football websites, sports news portals, and other reputable football data providers. The reason I am interested in working with this dataset is because I am beginning to play this sport with my friends. At first, I was not a soccer fan, but after playing a couple games with my friends I discovered how amazing and fun the sport can be. Plus, It’s one of the most popular sports in the world and I thought this project could give me more insights into the world of soccer. For this particular project I will be comparing soccer data between Spain and France to identify which country seems to be better at the sport. To accomplish this I will examine total home goals and wins based on different leagues and teams. To better understand league names, I have included the names that each league will be referred as here:

French Division 1 or “French Ligue 1” will be called “Fra1”

French Division 2 or “French Ligue 2” will be called “Fra2”

Spanish Division 1 or “La Liga” will be called “Spa1”

Spanish Division 2 or “Segunda División B” will be called “Spa2”

Loading Data

library(readr)
setwd("/Users/mikea/Desktop/Data 110 ")
football_data <- read_csv("ALLLEAGUESDATA.csv")

Loading Library’s

I also wanted to check all the soccer leagues that were included in this dataset.

library(tidyverse)
library(plotly)
unique(football_data$League)
##  [1] "Eng1" "Eng2" "Eng3" "Eng4" "Eng5" "Ita1" "Ita2" "Spa1" "Spa2" "Fra1"
## [11] "Fra2" "Ger1" "Ger2" "Ger3" "Aus1" "Aus2" "Bel1" "Bel2" "Bul1" "Bul2"
## [21] "Cro1" "Cro2" "Czh1" "Czh2" "Den1" "Den2" "Gre1" "Hun1" "Hun2" "Ire1"
## [31] "Ire2" "Ned1" "Ned2" "Nor1" "Nor2" "Pol1" "Pol2" "Por1" "Por2" "Rom1"
## [41] "Rom2" "Rus1" "Sco1" "Sco2" "Ser1" "Ser2" "Slk1" "Slk2" "Slo1" "Slo2"
## [51] "Swe1" "Swe2" "Swi1" "Tur1" "Ukr1" "Wel1" "Bra1" "Bra2" "Arg1" "Jap1"
## [61] "Jap2" "Fin1" "USA1" "Chi1" "Chi2" "Kor1" "Aut1" "Che1" "Ice1" "Isr1"
## [71] "Mex1"

Let’s Identify The Top Teams Per League and Season

Great! Now that I know that we have data on the two desired countries (Spain and France) it’s time to filter for the data needed. My goal is to identify the top teams based on total soccer goals in their respected soccer leagues and seasons.

# Filter for the leagues
filtered_data <- football_data %>%
  filter(League %in% c("Spa1", "Spa2", "Fra1", "Fra2"))

# Create a new column for combined team names
filtered_data <- filtered_data %>%
  mutate(Team = ifelse(is.na(`Home Team`), `Away Team`, `Home Team`))

# Group the data by league, season, and team, and calculate total goals
grouped_data <- filtered_data %>%
  group_by(League, Season, Team) %>%
  summarize(TotalGoals = sum(`Home Goals` + `Away Goals`), .groups = "drop")

# Arrange the data by league, season, and total goals in descending order
arranged_data <- grouped_data %>%
  arrange(League, Season, desc(TotalGoals))

# Take the top 5 teams from each league per season
top_teams <- arranged_data %>%
  group_by(League, Season) %>%
  top_n(5, TotalGoals)

# Print the top teams instead of using head() to further analyze the data.
print(top_teams)
## # A tibble: 86 × 4
## # Groups:   League, Season [16]
##    League Season Team        TotalGoals
##    <chr>   <dbl> <chr>            <dbl>
##  1 Fra1       20 PSG                 55
##  2 Fra1       20 AS Monaco           51
##  3 Fra1       20 Nice                48
##  4 Fra1       20 Amiens              40
##  5 Fra1       20 Rennes              39
##  6 Fra1       21 Montpellier         69
##  7 Fra1       21 Brestois            65
##  8 Fra1       21 Lyon                65
##  9 Fra1       21 AS Monaco           64
## 10 Fra1       21 Nimes               61
## # ℹ 76 more rows

Results

Great! I now have my top 5 teams for each league and season based on goals scored. Note: some teams have the same amount of goals scored and as such were also included. For example, for French Division 2 there are 6 teams instead of 5. I decided to keep them within the dataset to more accurately identify those top performing teams.

Plot 1: Average Goals Per League By Seasons

First, Lets identify which league seems to be performing better based on average goals scored. This plot accurately tracks the average goals per league and by season.

library(ggplot2)
library(plotly)

# Calculate the average goals per season and league
average_goals <- grouped_data %>%
  group_by(Season, League) %>%
  summarize(AverageGoals = round(mean(TotalGoals)), .groups = "drop")

# Plot the average goals per season
plot <- ggplot(average_goals, aes(x = as.character(Season), y = AverageGoals, color = League, text = paste("League: ", League, "<br>Season: ", Season, "<br>Average Goals: ", AverageGoals))) +
  geom_point() +
  geom_line(aes(group = League)) +
  xlab("Season") +
  ylab("Average Goals") +
  ggtitle("Average Goals per Season") +
  theme_bw()

# Convert the plot to interactive with plotly
plotly_plot <- ggplotly(plot, tooltip = "text")

# Display the plot
plotly_plot

Results

It appears that in the 2020 season, the performance for French Division 1 was relativity low. However, in the 2021-2023 seasons their performance skyrocketed and quickly became a top scoring league. Spanish division 1 has remained constant and has performed relativity well throughout these past 3 years. Spanish Division 2 has had ups and downs in terms of goals scored per season. Lastly, it seems that French Division 2 is a under performing league in terms of goals scored compared to the other leagues.

Plot 2: Team Results

Great! Now let’s visually identify the teams that have scored the most. I have already tracked the goals scored by leagues. Now it’s time to identify the top performing teams for each league by the amount of goals scored.

Let’s start with French Division 1 and analyze the data.

library(plotly)

# Select the desired league
selected_league <- "Fra1"

# Filter the data for the selected league
filtered_data <- football_data %>%
  filter(League == selected_league)

# Create a new column for combined team names
filtered_data <- filtered_data %>%
  mutate(Team = ifelse(is.na(`Home Team`), `Away Team`, `Home Team`))

# Group the data by season and team, and calculate total goals
grouped_data <- filtered_data %>%
  group_by(Season, Team) %>%
  summarize(TotalGoals = sum(`Home Goals` + `Away Goals`), .groups = "drop")

# Create a plotly bar plot with team names as hover text
plot <- plot_ly(grouped_data, x = ~Season, y = ~TotalGoals, color = ~Team, colors = "Set1",
                type = "bar", text = ~paste("Team:", Team, "<br>", "Total Goals:", TotalGoals)) %>%
  layout(xaxis = list(title = "Season"), yaxis = list(title = "Total Goals"),
         title = paste("Total Goals by Team for", selected_league, "League"),
         showlegend = FALSE)

# Display the plot
plot

Repeat for French Division 2

library(plotly)

# Select the desired league
selected_league <- "Fra2"

# Filter the data for the selected league
filtered_data <- football_data %>%
  filter(League == selected_league)

# Create a new column for combined team names
filtered_data <- filtered_data %>%
  mutate(Team = ifelse(is.na(`Home Team`), `Away Team`, `Home Team`))

# Group the data by season and team, and calculate total goals
grouped_data <- filtered_data %>%
  group_by(Season, Team) %>%
  summarize(TotalGoals = sum(`Home Goals` + `Away Goals`), .groups = "drop")

# Create a plotly bar plot with team names as hover text
plot <- plot_ly(grouped_data, x = ~Season, y = ~TotalGoals, color = ~Team, colors = "Set1",
                type = "bar", text = ~paste("Team:", Team, "<br>", "Total Goals:", TotalGoals)) %>%
  layout(xaxis = list(title = "Season"), yaxis = list(title = "Total Goals"),
         title = paste("Total Goals by Team for", selected_league, "League"),
         showlegend = FALSE)

# Display the plot
plot

Repeat for Spanish Division 1

library(plotly)

# Select the desired league
selected_league <- "Spa1"

# Filter the data for the selected league
filtered_data <- football_data %>%
  filter(League == selected_league)

# Create a new column for combined team names
filtered_data <- filtered_data %>%
  mutate(Team = ifelse(is.na(`Home Team`), `Away Team`, `Home Team`))

# Group the data by season and team, and calculate total goals
grouped_data <- filtered_data %>%
  group_by(Season, Team) %>%
  summarize(TotalGoals = sum(`Home Goals` + `Away Goals`), .groups = "drop")

# Create a plotly bar plot with team names as hover text
plot <- plot_ly(grouped_data, x = ~Season, y = ~TotalGoals, color = ~Team, colors = "Set1",
                type = "bar", text = ~paste("Team:", Team, "<br>", "Total Goals:", TotalGoals)) %>%
  layout(xaxis = list(title = "Season"), yaxis = list(title = "Total Goals"),
         title = paste("Total Goals by Team for", selected_league, "League"),
         showlegend = FALSE)

# Display the plot
plot

Lastly, Repeat for Spanish Division 2

library(plotly)

# Select the desired league
selected_league <- "Spa2"

# Filter the data for the selected league
filtered_data <- football_data %>%
  filter(League == selected_league)

# Create a new column for combined team names
filtered_data <- filtered_data %>%
  mutate(Team = ifelse(is.na(`Home Team`), `Away Team`, `Home Team`))

# Group the data by season and team, and calculate total goals
grouped_data <- filtered_data %>%
  group_by(Season, Team) %>%
  summarize(TotalGoals = sum(`Home Goals` + `Away Goals`), .groups = "drop")

# Create a plotly bar plot with team names as hover text
plot <- plot_ly(grouped_data, x = ~Season, y = ~TotalGoals, color = ~Team, colors = "Set1",
                type = "bar", text = ~paste("Team:", Team, "<br>", "Total Goals:", TotalGoals)) %>%
  layout(xaxis = list(title = "Season"), yaxis = list(title = "Total Goals"),
         title = paste("Total Goals by Team for", selected_league, "League"),
         showlegend = FALSE)

# Display the plot
plot

Results From the Four Plots

As we can see, we have successfully identified team performance by goals for each specific league per season. Thus, tracking team performance throughout the various seasons.

Top Teams

Great! Now let’s visually identify the teams that have scored the most.

library(ggplot2)
library(plotly)

# Create the bar plot
bar_plot <- ggplot(top_teams, aes(x = Season, y = TotalGoals, fill = League, text = paste("Team:", Team, "<br>", "Goals:", TotalGoals))) +
  geom_bar(stat = "identity") +
  labs(x = "Season", y = "Total Goals", title = "Top Teams by Season and Total Goals") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  facet_wrap(~ League, scales = "free", ncol = 2)

# Convert to interactive plot using plotly
interactive_plot <- ggplotly(bar_plot, tooltip = "text")

# Display the interactive plot
interactive_plot

Results

Using plotly, we can hover over and identify the top scoring teams for each league and season. Which could prove useful for anyone trying to identify the reason for these changes. For example, did a team acquire a new coach or player that resulted in more goals scored?

Correlation Testing

Now I wanted to determine the relationship between a team scoring in the first half and the likelihood of them winning the match. I also added a regression line for each league.

library(ggplot2)
library(dplyr)

# Filter the football data for the desired leagues
filtered_data <- football_data %>%
  filter(League %in% c("Spa1", "Spa2", "Fra1", "Fra2"))

# Calculate the total goals in the first half for each team
first_half_goals <- filtered_data %>%
  group_by(`Home Team`) %>%
  summarize(TotalFirstHalfGoals = sum(`Home FH`))

# Calculate the total wins for each team
team_wins <- filtered_data %>%
  group_by(`Home Team`) %>%
  summarize(TotalWins = sum(ifelse(Winner == "H", 1, 0)))

# Merge the two dataframes
merged_data <- inner_join(first_half_goals, team_wins, by = "Home Team")

# Add the League column to the merged dataframe
merged_data <- merged_data %>%
  left_join(filtered_data %>% distinct(`Home Team`, League), by = "Home Team")

# Calculate the win percentage based on scoring in the first half
merged_data <- merged_data %>%
  mutate(WinPercentage = TotalWins / TotalFirstHalfGoals * 100)

# Create a scatter plot
scatter_plot <- ggplot(merged_data, aes(x = TotalFirstHalfGoals, y = WinPercentage, color = League)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Total Goals in First Half") +
  ylab("Win Percentage") +
  ggtitle("Goals Scored in the First Half vs. Win Percentage") +
  theme_bw()

# Display the scatter plot
scatter_plot

Correlation Testing Part 2

In this plot, I added plotly to identify all the team names and used only one regression line for all the leagues.

library(plotly)
library(dplyr)

# Filter the football data for the desired leagues
filtered_data <- football_data %>%
  filter(League %in% c("Spa1", "Spa2", "Fra1", "Fra2"))

# Calculate the total goals in the first half for each team
first_half_goals <- filtered_data %>%
  group_by(`Home Team`) %>%
  summarize(TotalFirstHalfGoals = sum(`Home FH`))

# Calculate the total wins for each team
team_wins <- filtered_data %>%
  group_by(`Home Team`) %>%
  summarize(TotalWins = sum(ifelse(Winner == "H", 1, 0)))

# Merge the two dataframes
merged_data <- inner_join(first_half_goals, team_wins, by = "Home Team")

# Add the League column to the merged dataframe
merged_data <- merged_data %>%
  left_join(filtered_data %>% distinct(`Home Team`, League), by = "Home Team")

# Calculate the win percentage based on scoring in the first half
merged_data <- merged_data %>%
  mutate(WinPercentage = TotalWins / TotalFirstHalfGoals * 100)

# Convert TotalFirstHalfGoals and WinPercentage to numeric
merged_data$TotalFirstHalfGoals <- as.numeric(merged_data$TotalFirstHalfGoals)
merged_data$WinPercentage <- as.numeric(merged_data$WinPercentage)

# Create a scatter plot
scatter_plot <- ggplot(merged_data, aes(x = TotalFirstHalfGoals, y = WinPercentage, color = League, text = paste("Team: ", `Home Team`, "<br>Goals: ", sprintf("%.2f", TotalFirstHalfGoals), "<br>Win Percentage: ", sprintf("%.2f", WinPercentage)))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, aes(group = 1)) +
  xlab("Total Goals in First Half") +
  ylab("Win Percentage") +
  ggtitle("Goals Scored in The First Half vs. Win Percentage") +
  theme_bw() +
  theme(plot.title = element_text(size = 10))

# Convert the scatter plot to interactive using plotly
scatter_plot <- ggplotly(scatter_plot, tooltip = "text")

# Display the scatter plot
scatter_plot

Correlation Results

# Calculate the correlation
correlation <- cor(merged_data$TotalFirstHalfGoals, merged_data$WinPercentage)

# Print the correlation result
cat("Correlation between Total Goals in the First Half and Win Percentage:", correlation, "\n")
## Correlation between Total Goals in the First Half and Win Percentage: -0.2760539

As we can see, there is no significant relationship between the two. Which makes sense, because even if teams score in the first half they might lose the game by the second half.

Essay Part B

Upon further research, I discovered that French Division 1 or “French Ligue 1” was founded in 1932 and their top performance team as of date is PSG with 11 championship wins. French Division 2 also known as “Ligue 2” was founded just after in 1933. Additionally, The top three teams earn promotion to Ligue 1 and three lowest teams are relegated to Championnat National, the third division. Spanish Division 1 or “La Liga”, is the first national football league in Spain that was founded in 1929. Since 1997 there are 20 clubs participating in the top league of La Liga. Lastly, Spanish Division 2 or “Segunda División B” was founded to replace the 3rd division league as the third tier. The 3rd division has since been the fourth tier. I have included sources in case anyone wanted to dive deeper and look at more facts for each league.

Link for French Ligue 1

Link for French Ligue 2

Link for La liga

Link for Segunda División B

Part C

The goal of this project was to identify the top teams and leagues for Spain and France. As shown above, it would seem that France overall has performed better in terms of goals scored. We can also identify that French Division 2 and Spanish division 2 had less goals scored compared to there division 1 leagues. Which makes sense given that these teams are not as competitive as division 1 teams. I think if I were to do this project again I would like to use other datasets that include player, coach, and team budget data. I would like to dive deeper and explore if teams had any player/coach changes that influenced the amount of goals they scored for a season.

Thank you!