This dataset comes from the Attendance section of the Wikipedia page on the FIFA World Cup. I scrape the attendance table, clean it so the figures are reliable, and then analyze how attendance has changed across tournaments. Summary statistics and interactive visualizations are used to communicate the main trends and what they might imply.
library(rvest)
library(tidyverse)
library(janitor)
library(scales)
library(plotly)
library(viridis)
library(DT)
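# Scrape the attendance table from the Wikipedia page
cor_link <- "https://en.wikipedia.org/wiki/FIFA_World_Cup#Attendance"
col_page <- read_html(cor_link)
# The code assumes the attendance table is the fourth <table> on the page;
# this index may shift whenever the page is edited
col_table <- col_page %>%
  html_nodes("table") %>%
  .[4] %>%
  html_table() %>%
  .[[1]]
# Standardize the column names (lowercase, no spaces or symbols)
fifa_data <- clean_names(col_table)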
# Function to clean numeric columns: strip every non-digit character
# (such as thousands separators) and convert the result to numeric
clean_numeric <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}
fifa_data <- fifa_data %>%
  mutate(across(c(totalattendance, averageattendance,
                  highest_attendances, matches), clean_numeric))
fifa_data <- fifa_data %>%
  mutate(year = as.numeric(gsub("[^0-9]", "", year)))
# Keep only rows with no missing values
fifa_data_clean <- fifa_data %>%
  filter(complete.cases(.))
# Summary statistics across all tournaments
summary_stats <- fifa_data_clean %>%
  summarise(
    Total_Events = n(),
    Avg_Total_Attendance = mean(totalattendance),
    Max_Total_Attendance = max(totalattendance),
    Min_Total_Attendance = min(totalattendance),
    Avg_Matches_Per_Event = mean(matches)
  )
datatable(
  summary_stats %>%
    mutate(across(where(is.numeric), ~format(., big.mark = ",", scientific = FALSE))),
  options = list(dom = 't')
)
# Line plot of total attendance by year, with points colored by host
p1 <- ggplot(fifa_data_clean, aes(x = year, y = totalattendance)) +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(color = "#2C3E50", linewidth = 1) +
  geom_point(aes(color = hosts), size = 3) +
  scale_y_continuous(labels = comma_format()) +
  theme_minimal() +
  labs(title = "FIFA World Cup Total Attendance Over Time",
       subtitle = "Showing progression from 1930 to present",
       x = "Year",
       y = "Total Attendance",
       color = "Host Country") +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )
ggplotly(p1)
# Create bubble plot - using fifa_data_clean
bubble_plot <- ggplot(fifa_data_clean,
                      aes(x = year,
                          y = totalattendance,
                          size = averageattendance,
                          color = matches,
                          text = paste("Year:", year,
                                       "<br>Host:", hosts,
                                       "<br>Matches:", matches,
                                       "<br>Total Attendance:", scales::comma(totalattendance),
                                       "<br>Average Attendance:", scales::comma(averageattendance)))) +
  geom_point(alpha = 0.6) +
  scale_size_continuous(range = c(5, 20)) +
  scale_color_viridis() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "FIFA World Cup Attendance Analysis",
       subtitle = "Bubble size represents average attendance per match",
       x = "Year",
       y = "Total Attendance",
       color = "Matches",
       size = "Avg. Attendance") +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    legend.position = "right"
  )
# Convert to interactive plot
ggplotly(bubble_plot, tooltip = "text")
The first plot shows that total attendance peaked at the 1994 World Cup in the United States, with approximately 3.6 million spectators. In contrast, the inaugural 1930 tournament, hosted by Uruguay, drew only about 600,000, one of the lowest totals in the series. The gap may partly reflect the transportation of the time: in the 1930s, long-distance travel was likely expensive and accessible mainly to the affluent, so crowds would have been largely local. I must emphasize that this is speculative, as I have not examined anything beyond the attendance figures themselves, and correlation does not imply causation.
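The highest and lowest totals can be read off the data directly rather than from the plot. The sketch below is a minimal example, assuming a data frame with `year`, `hosts`, and `totalattendance` columns like `fifa_data_clean`. The `sample_attendance` data frame is a hypothetical hand-entered stand-in with only four tournaments, so that the snippet runs without scraping; its figures should be checked against the scraped table.

```r
library(dplyr)

# Hypothetical stand-in for fifa_data_clean: a few hand-entered rows,
# not the scraped table. Replace with fifa_data_clean in the real analysis.
sample_attendance <- data.frame(
  year = c(1930, 1950, 1994, 2014),
  hosts = c("Uruguay", "Brazil", "United States", "Brazil"),
  totalattendance = c(590549, 1045246, 3587538, 3429873)
)

# Tournament with the highest total attendance
highest <- sample_attendance %>% slice_max(totalattendance, n = 1)

# Tournament with the lowest total attendance
lowest <- sample_attendance %>% slice_min(totalattendance, n = 1)

# Stack both rows, labeled by which extreme they represent
extremes <- bind_rows(highest = highest, lowest = lowest, .id = "extreme")
print(extremes)
```

Running the same two `slice_*` calls on `fifa_data_clean` would confirm which host and year sit at each extreme in the full table, including tournaments that are not in this small sample.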
I also looked at how the growing number of matches per tournament relates to total attendance. As expected, the same extremes appear in the bubble plot: the 1994 tournament in the United States remains the highest at around 3.6 million, while the 1930 tournament in Uruguay sits at about 600,000. Various factors may contribute to these trends, and further exploration of the Wikipedia page could provide additional insight into the influences at play.
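One way to separate the effect of more matches from the effect of larger crowds is to look at attendance per match alongside the correlation between matches and total attendance. This is a rough sketch, assuming `matches` and `totalattendance` columns as in `fifa_data_clean`. The `sample_attendance` data frame and the `attendance_per_match` column are hypothetical names introduced for illustration, and the four hand-entered rows are far too few to support real conclusions.

```r
library(dplyr)

# Hypothetical stand-in for fifa_data_clean: a few hand-entered rows,
# not the scraped table. Replace with fifa_data_clean in the real analysis.
sample_attendance <- data.frame(
  year = c(1930, 1950, 1994, 2014),
  hosts = c("Uruguay", "Brazil", "United States", "Brazil"),
  matches = c(18, 22, 52, 64),
  totalattendance = c(590549, 1045246, 3587538, 3429873)
)

# Attendance per match removes the effect of tournament size
per_match <- sample_attendance %>%
  mutate(attendance_per_match = totalattendance / matches) %>%
  arrange(desc(attendance_per_match))

# Pearson correlation between number of matches and total attendance
match_cor <- cor(sample_attendance$matches, sample_attendance$totalattendance)

print(per_match)
print(match_cor)
```

A strong positive correlation would suggest that much of the growth in total attendance tracks the expansion of the tournament, while the per-match column shows whether individual crowds also grew.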