Introduction

This dataset comes from the Attendance table on Wikipedia's FIFA World Cup page. I scrape the table, clean it so the figures are reliable, and analyze how attendance has changed from one tournament to the next. Summary statistics and interactive visualizations are used to communicate the trends and make the data easier to explore.

Loading Required Libraries

library(rvest)
library(tidyverse)
library(janitor)
library(scales)
library(plotly)
library(viridis)
library(DT)

Reading the FIFA World Cup attendance data

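# The "#Attendance" fragment only marks the section; the whole page is downloaded
cor_link <- "https://en.wikipedia.org/wiki/FIFA_World_Cup#Attendance"
col_page <- read_html(cor_link)
# Take the fourth <table> on the page (position-dependent) and parse it into a data frame
col_table <- col_page %>%
  html_nodes("table") %>%
  .[4] %>%
  html_table() %>%
  .[[1]]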
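
The hard-coded index `.[4]` depends on the current layout of the Wikipedia page, so an edit to the page could make it pick a different table. A minimal sketch of a more defensive lookup, which selects the first table whose header mentions attendance, might look like this. The helper name `find_attendance_table` and the inline demo HTML are illustrative assumptions, not part of the original analysis, and the sketch assumes that only one table on the page has such a header.

```r
library(rvest)

# Hypothetical helper: return the first table whose column names
# mention "attendance", instead of relying on a fixed position.
find_attendance_table <- function(page) {
  tables <- html_table(html_elements(page, "table"))
  has_attendance <- vapply(
    tables,
    function(tbl) any(grepl("attendance", names(tbl), ignore.case = TRUE)),
    logical(1)
  )
  if (!any(has_attendance)) stop("No table with an attendance column found")
  tables[[which(has_attendance)[1]]]
}

# Tiny made-up page, used only to demonstrate the helper.
demo_page <- minimal_html('
  <table><tr><th>Team</th><th>Titles</th></tr><tr><td>A</td><td>1</td></tr></table>
  <table><tr><th>Year</th><th>Total attendance</th></tr><tr><td>2000</td><td>1,000</td></tr></table>
')
demo_table <- find_attendance_table(demo_page)
names(demo_table)
```

On the real page, the same helper could be called as `find_attendance_table(col_page)` in place of the positional lookup.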

Clean column names

fifa_data <- clean_names(col_table)

Function to clean numeric columns

# Strip everything except digits (thousands separators, symbols, letters),
# then convert to numeric; cells with no digits become NA
clean_numeric <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}
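
One caveat: because `clean_numeric()` keeps every digit, a cell that carries a numeric footnote marker such as `[12]` would have those digits appended to the value. Whether the scraped table contains such markers is an assumption here, not something the original analysis checks. A hedged variant, with the hypothetical name `clean_numeric_safe` and made-up input strings, could drop bracketed markers first:

```r
# Hypothetical variant of clean_numeric(): remove bracketed footnote
# markers such as "[12]" before stripping the remaining non-digits.
clean_numeric_safe <- function(x) {
  x <- gsub("\\[[^]]*\\]", "", x)    # remove "[...]" footnote markers
  as.numeric(gsub("[^0-9]", "", x))  # keep digits only; "" becomes NA
}

# Made-up example strings, not values from the dataset.
example_cells <- c("1,234,567", "45,678[a]", "45,678[12]", "n/a")
cleaned <- clean_numeric_safe(example_cells)
cleaned
```

With the original `clean_numeric()`, the third string would come out as 4567812 rather than 45678.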

Clean all numeric columns

fifa_data <- fifa_data %>%
  mutate(across(c(totalattendance, averageattendance, 
                  highest_attendances, matches), clean_numeric))

Convert year to numeric

fifa_data <- fifa_data %>%
  mutate(year = as.numeric(gsub("[^0-9]", "", year)))

Remove rows with NA values

fifa_data_clean <- fifa_data %>%
  filter(complete.cases(.))
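
Since `complete.cases()` discards a row when any column is missing, it can be worth looking at what is being dropped before trusting the summary. The sketch below uses a made-up stand-in data frame (`toy`, with placeholder values rather than real attendance figures) to show one way of keeping the discarded rows for inspection.

```r
library(dplyr)

# Made-up stand-in for fifa_data; the values are placeholders, not real figures.
toy <- tibble(
  year            = c(1, 2, 3, NA),
  totalattendance = c(100, NA, 300, NA),
  matches         = c(10, 20, 30, NA)
)

# Rows that the complete.cases() filter would drop, kept for inspection.
dropped <- toy %>% filter(!complete.cases(.))
# Rows that survive the filter.
kept <- toy %>% filter(complete.cases(.))

nrow(dropped)
nrow(kept)
```

Applied to the real data, the same pattern with `fifa_data` in place of `toy` would show whether only header or placeholder rows are removed, or actual tournaments.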

Create a summary table

summary_stats <- fifa_data_clean %>%
  summarise(
    Total_Events = n(),
    Avg_Total_Attendance = mean(totalattendance),
    Max_Total_Attendance = max(totalattendance),
    Min_Total_Attendance = min(totalattendance),
    Avg_Matches_Per_Event = mean(matches)
  )

Display formatted summary

datatable(summary_stats %>%
  mutate(across(where(is.numeric), ~format(., big.mark = ",", scientific = FALSE))),
  options = list(dom = 't'))

1. Total Attendance Over Time

p1 <- ggplot(fifa_data_clean, aes(x = year, y = totalattendance)) +
  geom_line(color = "#2C3E50", linewidth = 1) +
  geom_point(aes(color = hosts), size = 3) +
  scale_y_continuous(labels = comma_format()) +
  theme_minimal() +
  labs(title = "FIFA World Cup Total Attendance Over Time",
       subtitle = "Showing progression from 1930 to present",
       x = "Year",
       y = "Total Attendance",
       color = "Host Country") +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

ggplotly(p1)

2. Bubble Plot

# Create bubble plot - using fifa_data_clean
bubble_plot <- ggplot(fifa_data_clean, 
       aes(x = year, 
           y = totalattendance,
           size = averageattendance,
           color = matches,
           text = paste("Year:", year,
                       "<br>Host:", hosts,
                       "<br>Matches:", matches,
                       "<br>Total Attendance:", scales::comma(totalattendance),
                       "<br>Average Attendance:", scales::comma(averageattendance)))) +
  geom_point(alpha = 0.6) +
  scale_size_continuous(range = c(5, 20)) +
  scale_color_viridis() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "FIFA World Cup Attendance Analysis",
       subtitle = "Bubble size represents average attendance per match",
       x = "Year",
       y = "Total Attendance",
       color = "Matches",
       size = "Avg. Attendance") +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    legend.position = "right"
  )
# Convert to interactive plot
ggplotly(bubble_plot, tooltip = "text")

Conclusion

The first plot shows that the 1994 World Cup in the United States drew the highest total attendance, approximately 3.6 million spectators. At the other end, the first tournament, held in Uruguay in 1930, drew only about 600,000, among the lowest totals in the dataset. Part of this gap may reflect the transportation of the time: in the 1930s, long-distance travel was expensive, so crowds were probably limited to local residents and affluent visitors. This explanation is speculative, since I have not tested it against the data, and a pattern in the plot does not by itself establish a cause.

I also looked at how the growing number of matches relates to total attendance. In the bubble plot, the 1994 tournament in the United States again stands out with the highest total, around 3.6 million, while the 1930 tournament in Uruguay sits near the bottom at about 600,000. Several factors likely contribute to these trends, and the Wikipedia page may offer further context on them.
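
The statements in this section can be checked against the cleaned data rather than read off the plots. The sketch below shows one way to do that, using a made-up stand-in for `fifa_data_clean` (`toy_clean`, with placeholder hosts and numbers, not real figures); the column names follow the ones used earlier in this analysis.

```r
library(dplyr)

# Made-up stand-in for fifa_data_clean; hosts and numbers are placeholders.
toy_clean <- tibble(
  year            = c(1, 2, 3, 4),
  hosts           = c("A", "B", "C", "D"),
  matches         = c(18, 32, 52, 64),
  totalattendance = c(600, 1500, 3600, 3400)
)

# Tournaments with the highest and lowest total attendance.
highest <- toy_clean %>% slice_max(totalattendance, n = 1)
lowest  <- toy_clean %>% slice_min(totalattendance, n = 1)

# Linear association between number of matches and total attendance.
match_cor <- cor(toy_clean$matches, toy_clean$totalattendance)

highest
lowest
match_cor
```

Running the same calls on `fifa_data_clean` would confirm which host and year sit at each extreme. A positive correlation would be consistent with more matches drawing larger totals, though it would not show that the match count alone explains the growth.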