The data for the final project was downloaded from Tidy Tuesday (which is a weekly social data project in R). The data set I chose is the FIFA World Cup data set from the 29-11-2022 week. The data can be found using this link. The reason I chose this data set is that it is an easily understood topic and a great source for visualizations. The data comprises of FIFA World Cup individual matches and the entire cup. Hence, there are two data sets which are incorporated in the analysis.
The first ever FIFA World Cup was hosted by Uruguay in 1930 while the most recent was in end of 2022 in Qatar. This data is dated prior to the 2022 World Cup. Hence it includes data from 1930 to 2018 World Cup hosted by Russia. My analysis focuses on covering the following major themes and areas of interest:
My aim in this project is to visually analyze these areas of interest and draw insights on the interaction of the above mentioned variables.
I upload the data using the Tidy Tuesday library. I then convert it into a data table. Once data is uploaded I have the world cup matches data with 900 observations (one observation for each match) & a World Cups data table with 21 observations (one for each cup).
wc_1 <- tt_load('2022-11-29')
wc_2 <- tt_load('2022-11-29')
# Convert it to a data table object
wc_1 <- as.data.table(wc_1$wcmatches)
wc_2 <- as.data.table(wc_2$worldcups)
Before analyzing, I want to know my data better. This enables me to understand the variables, their distribution and structure of the data. It also allows me to proceed with Data cleaning.
glimpse(wc_1)
## 900 Rows & 15 columns
glimpse(wc_2)
## 21 Rows & 10 columns
str(wc_1)
str(wc_2)
summary(wc_1)
summary(wc_2)
skim(wc_1)
skim(wc_2)
Data cleaning is one of the most important aspects of the analysis. Using the data table package I have to rename some countries because they have changed names. For example West Germany is now Germany. So according to FIFA rules, the names of certain countries are changed. This is done for both data tables and for all variables of concern.
I then change variable names for better understanding, for example home score is changed to home goals. This is because the term goals is better understood in the context of football.
I then remove columns that I am not interested in and create variables of interest. For example the data include home and away goals, but not total goals. Hence, I create this important variable as it will be a major part of visual analysis.
Lastly I check the data types using “str” function. I convert year from numeric to factor. This is because the year is identifying the World Cup and important for visualization.
# Replace obsolete country names to current day names according to Fifa Standards
wc_1[home_team %in% c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"),
home_team := c("Czech Republic", "Germany", "Russia", "Serbia", "Serbia")[match(home_team, c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"))]]
wc_1[away_team %in% c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"),
away_team := c("Czech Republic", "Germany", "Russia", "Serbia", "Serbia")[match(away_team, c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"))]]
wc_2[winner %in% c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"),
winner := c("Czech Republic", "Germany", "Russia", "Serbia", "Serbia")[match(winner, c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"))]]
wc_2[second %in% c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"),
second := c("Czech Republic", "Germany", "Russia", "Serbia", "Serbia")[match(second, c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"))]]
wc_2[third %in% c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"),
third := c("Czech Republic", "Germany", "Russia", "Serbia", "Serbia")[match(third, c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"))]]
wc_2[fourth %in% c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"),
fourth := c("Czech Republic", "Germany", "Russia", "Serbia", "Serbia")[match(fourth, c("Czechoslovakia", "West Germany", "Soviet Union", "Yugoslavia", "FR Yugoslavia"))]]
data.table::setnames(wc_1,'home_score','home_goals')
data.table::setnames(wc_1,'away_score','away_goals')
wc_1 <- wc_1[, -10]
# Adding a new column that shows total goals scored in a particular match
wc_1 <- wc_1[, `:=` (
total_goals = home_goals + away_goals)]
# Check data types and make relevant conversions
str(wc_2$goals_scored)
str(wc_1$total_goals)
str(wc_2$year)
# Change year from numeric to factor variable
set(wc_2, j = "year", value = as.factor(wc_2$year))
After the cleaning is complete, I can perform data transformation. Here I create Hosts who won the World Cup. Then I check which countries ever participated in a world cup. The goal variable is transformed to average goals per game. Likewise, attendance is transformed into an average according to games. Then I create attendance categories according to attendance distribution of the data. Lastly, I create a data table only for Winners of the World Cup since 1930.
# Viewing the hosts that won the world cup
host_win = wc_2[host == winner, .(year, host), by = year][, year := NULL][]
host_win = unique(host_win)
host_win
## year host
## 1: 1930 Uruguay
## 2: 1934 Italy
## 3: 1966 England
## 4: 1974 Germany
## 5: 1978 Argentina
## 6: 1998 France
# This shows that a host has only won 5 times. The last time was France in 1998.
# Which countries participated in world cups from 1930 to 2018 (inclusive)
wc_teams = unique(c(wc_1[, home_team], wc_1[, away_team]))
wc_teams
## [1] "France" "Belgium" "Brazil"
## [4] "Peru" "Argentina" "Chile"
## [7] "Bolivia" "Paraguay" "Uruguay"
## [10] "Austria" "Czech Republic" "Egypt"
## [13] "Italy" "Netherlands" "Germany"
## [16] "Cuba" "Hungary" "Spain"
## [19] "Switzerland" "Mexico" "England"
## [22] "Sweden" "Scotland" "South Korea"
## [25] "Colombia" "Russia" "Bulgaria"
## [28] "North Korea" "Portugal" "Israel"
## [31] "Morocco" "El Salvador" "Zaire"
## [34] "East Germany" "Poland" "Serbia"
## [37] "Haiti" "Australia" "Iran"
## [40] "Algeria" "Honduras" "Canada"
## [43] "Northern Ireland" "Denmark" "Iraq"
## [46] "United Arab Emirates" "United States" "Costa Rica"
## [49] "Cameroon" "Republic of Ireland" "Norway"
## [52] "Nigeria" "Romania" "Saudi Arabia"
## [55] "Greece" "Jamaica" "South Africa"
## [58] "Japan" "Croatia" "China PR"
## [61] "Tunisia" "Senegal" "Slovenia"
## [64] "Ecuador" "Turkey" "Trinidad and Tobago"
## [67] "Angola" "Togo" "Ivory Coast"
## [70] "Ghana" "Ukraine" "New Zealand"
## [73] "Slovakia" "Bosnia and Herzegovina" "Iceland"
## [76] "Panama" "Dutch West Indies" "Wales"
## [79] "Kuwait"
length(wc_teams)
## [1] 79
## 79 teams have participated
# Create average goals per game column
wc_2 <- wc_2[, `:=` (
average_goals = goals_scored/games)]
# Round the average goals column to 2 decimal places
set(wc_2, j = "average_goals", value = round(wc_2$average_goals, 2))
# Create average attendance per world cup according to games played
wc_2 <- wc_2[, `:=` (
average_attendance = attendance/games)]
# Round the newly created column to 0 decimal places
set(wc_2, j = "average_attendance", value = round(wc_2$average_attendance, 0))
#describe(wc_2$average_attendance)
# Shows lowest, highest, percentiles, mean, median for all 21 values and none are missing
wc_2 <- wc_2 %>%
mutate(attendance_categories = case_when(average_attendance > 50459 ~ "Very High Attendance",
average_attendance > 44676 ~ "High Attendance",
average_attendance > 33875 ~ "Relatively Normal Attendance",
average_attendance >= 23235 ~ "Relatively Low Attendance"))
# Add binary variable for attendance categories
wc_2[, very_high_binary := as.factor(ifelse(wc_2$attendance_categories == "Very High Attendance", 1,0))]
## Data table for winners
num_of_win <- c(2,4,5,1,2,2,1,4)
winner <- c("Uruguay", "Italy","Brazil", "England","Argentina", "France", "Spain", "Germany")
winners_table <- data.table(num_of_win, winner)
winners_table[order(-num_of_win),]
## num_of_win winner
## 1: 5 Brazil
## 2: 4 Italy
## 3: 4 Germany
## 4: 2 Uruguay
## 5: 2 Argentina
## 6: 2 France
## 7: 1 England
## 8: 1 Spain
I create a custom function which applies a custom theme on my ggplots. This enables the visualizations to be organized and tidy. The theme I create has custom colors using the color Hexa. This allows me to customize each color. I specify various aspects of the theme such as title font, axis titles and background colors.
theme_wc <- function(){
theme_bw() +
theme(
plot.background = element_rect(fill = "#d3d3d3"),
panel.background = element_rect(fill = "#f2ded6", color = 'purple'),
plot.caption = element_text(hjust = 0, face = "italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_text(color = "#a1a1a1", size = 12),
axis.text.y = element_text(color = "#7b7b7b", size = 12),
axis.title.x = element_text(color = "#1c4959", size = 14, face = "bold"),
axis.title.y = element_text(color = "#1c4959", size = 14, face = "bold"),
plot.title = element_text(color = "#0e242c", size = 16, face = "bold", hjust = 0.5),
legend.text = element_text(color = "#FF4500", size = 12),
legend.title = element_text(color = "#FF4500", size = 14, face = "bold")
)
}
I create two density plots to visualize how my data is spread. I want to see how my teams variables is spread across the data set. The variables of interest are teams and games. How are the variables distributed across the data. As seen for most world cup events, teams were below 20. However, a significant density can be observed from 30 and above showing that 32 teams are now playing the World Cup. As more events take place with 32 teams, this density is expected to be the highest in the future.
## Exploring Density for numeric variables
density_1 <- ggplot(wc_2) +
aes(x = teams) +
geom_density(adjust = 0.5, fill = "#89cff0") +
labs(
x = "Teams",
y = "Density",
title = "Density Plot for Teams "
) +
theme_wc()
density_1
As expected, the density of Games follows the same pattern as teams. As teams have increased so have the number of games. Th density for games is highest for above 60 games.
density_2 <- ggplot(wc_2) +
aes(x = games) +
geom_density(adjust = 0.5, fill = "#19664E") +
labs(
x = "Games",
y = "Density",
title = "Density Plot for Games "
) +
theme_wc()
density_2
Correlation of numeric variables is conducted using ggpairs function. This allows me to compare my numeric variables and their correlation or interactivity. The numeric variables of interest are teams, games, goals and attendance as seen below.
# Correlation between key variables
corr <- wc_2 %>% select(c("teams","games","goals_scored","attendance"))
# I use ggpairs to allow for visualizing correlation between teams, games, goals_scored & attendance
corr_gg <- ggpairs(corr)
corr_gg
1-A. Teams participating & Games played Table
This table shows the World Cup events from 1930. It shows the number of teams competing and games played. As seen the data is moving in the same direction. As teams particpation increases games played also increases.
# Games played and average number of teams
ratings_table <- wc_2[,list( host, games, teams = (mean(teams))), by = year]
knitr::kable(ratings_table, caption="Games played and Teams")
| year | host | games | teams |
|---|---|---|---|
| 1930 | Uruguay | 18 | 13 |
| 1934 | Italy | 17 | 16 |
| 1938 | France | 18 | 15 |
| 1950 | Brazil | 22 | 13 |
| 1954 | Switzerland | 26 | 16 |
| 1958 | Sweden | 35 | 16 |
| 1962 | Chile | 32 | 16 |
| 1966 | England | 32 | 16 |
| 1970 | Mexico | 32 | 16 |
| 1974 | Germany | 38 | 16 |
| 1978 | Argentina | 38 | 16 |
| 1982 | Spain | 52 | 24 |
| 1986 | Mexico | 52 | 24 |
| 1990 | Italy | 52 | 24 |
| 1994 | USA | 52 | 24 |
| 1998 | France | 64 | 32 |
| 2002 | Japan, South Korea | 64 | 32 |
| 2006 | Germany | 64 | 32 |
| 2010 | South Africa | 64 | 32 |
| 2014 | Brazil | 64 | 32 |
| 2018 | Russia | 64 | 32 |
1-B. Teams participating & Games played Graph
g_1 <- ggplot(wc_2) +
aes(x = year, y = teams, fill = teams, size = games) +
geom_point(shape = "circle filled", colour = "#112446") +
scale_fill_gradient(low = "#F7FBFF", high = "#08306B") +
labs(
x = "Year of World Cup",
y = "Teams Playing World Cup",
title = "Teams & Games Played ",
subtitle = "Years"
) +
theme(panel.background = element_rect(fill = '#ffe4c4', color = 'purple'),
panel.grid.major = element_line(color = '#faebd7', linetype = 'dotted'),
panel.grid.minor = element_line(color = '#008000', size = 2))+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
theme_wc()
g_1
The x axis is representing years the world cup was played while the y axis is showing teams particpation. Another important aspect is the size of the visualization which reoresents the games played. The circles get bigger as years and teams both increase.
g_2 <- ggplot(wc_2) +
aes(x = year, fill = goals_scored, weight = goals_scored) +
geom_bar() +
scale_fill_distiller(palette = "Blues", direction = 1) +
labs(
x = "World Cup Year",
y = "Total Goals Scored",
title = "Goals Scored in World Cups",
fill = "Goals Scored"
) +
scale_y_continuous(breaks = seq(0,150, by = 25)) +
theme_wc() +
theme(legend.position = "top") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
g_2
The most fascinating moment of a football match is when a goal is scored. So, which world cup saw the most goals from 1930 to 2018? As the bar chart shows, Goals scored are steadily increasing with some exceptions. Interestingly, 2018 world cup in Russia saw a slight fip comapred to 2014 world cup.
g_3 <- ggplot(wc_2, aes(x = year, y = attendance, fill = attendance)) + geom_bar(stat = 'identity') +
theme_wc() +
transition_states(year) +
labs(title = 'Attendance at World Cup Events',
subtitle = 'year: {closest_state}',
y = 'Attendance') +
theme(legend.position = "top" , legend.title = element_blank())
animate(g_3, end_pause = 10, fps = 5)
g_3
The attendance at an event speaks volumes of the event success or faulure. Hence, I incorporate attendace in the visual analysis. Furthermore, I animate the visualization so that the over all trend can be captured. It shows clearly that as time passes, world cup attendance increases. However 2018 saw less than 2014 in terms of attendance.
g_4 <- ggplot(wc_2) +
aes(x = year, y = average_attendance, fill = average_goals) +
geom_col() +
scale_fill_distiller(palette = "YlOrRd", direction = 1) +
labs(
x = "Average Attendance",
y = "World Cup Year",
title = "Average Attendance by year",
subtitle = "Color shows average goals ",
fill = "Average Goals "
) +
coord_flip() +
theme_wc()
g_4
One of the reasons why attendance will go up is because more games are being played. Therefore, I use my transformed variable which shows how average attendance varies per world cup event. Moreover, in this horizontal bar chart years are on the vertical while average attendance is on the horizontal. This shows that some years actually saw a drop in average attendance despite an increase in over all attendance. Moreover, average goals per game is also included in terms of intensity of color of the bar. As seen older world cups saw higher average goals per game (as seen in red). More recent world cups have lower goals per game played.
colors_pie_chart <- c("red", "blue", "green", "yellow", "purple", "orange", "pink", "brown")
g_5 <- ggplot(winners_table, aes(x = "", y = num_of_win, fill = winner)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
ggtitle("World cup winners") +
labs(fill = "") +
scale_fill_manual(values = colors_pie_chart) +
geom_text(aes(label = num_of_win), position = position_stack(vjust = 0.5), size=13)+
theme_wc()
g_5
A pie chart is a great measure to show the share or portion. Hence, I include the Winners data table for this visualization. Three large observations can be noticed immediately: Brazil at 5 World Cups and Germany & Italy at 4 . Then I notice some countries 2 & 1 world cup each.
g_6 <- ggplot(wc_2) +
aes(x = attendance_categories, y = goals_scored, fill = attendance_categories) +
geom_boxplot() +
geom_jitter() +
scale_fill_hue(direction = 1) +
labs(x = "Attendance category", y = "Goals Scored",
title = "Goals by attendance category ", subtitle = "Boxplot", fill = "Attendance category") +
theme_wc()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
g_6
An important transformed variables is attendance category. These categories were made keeping in mind the distribution of the attendance variable. The categories and goals scored are important variables that can be compared. I use box plot to signify the distribution across categories. Hence, it can be observed that the lower the attendance category, the lesser the goals scored are by distribution. The categories are colored for easier readability.
#rm(mds_df)
# MDS data frame
mds_df <- data.frame(matrix(ncol = ncol(wc_2), nrow = 0))
# Relevant column names from wc_2 data table
colnames(mds_df) <- colnames(wc_2)
# Select the World Cup events by years
set.seed(1234)
for (i in 1930:2018){
temp_df <- wc_2[year == i]
mds_df <- rbind(mds_df, temp_df[sample(nrow(temp_df), 1),])
}
# put the names of the games as row names
mds_df <- column_to_rownames(mds_df, var = 'attendance')
# filter numeric variables
mds_df <- mds_df[, lapply(mds_df, is.numeric) == TRUE]
# MDS Analysis
mds_df <- cmdscale(dist(scale(mds_df)))
mds_df <- as.data.frame(mds_df)
mds_df$attendance <- rownames(mds_df)
rownames(mds_df) <- NULL
# Create MDS plot uisng #5d0d41 color
g_7 <- ggplot(data = mds_df, aes(x = V1, y = V2, label = attendance)) +
geom_text_repel(colour="#5d0d41") +
labs( x = '', y = '', title = 'MDS for World Cups (1930-2018)') +
theme(axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank()) +
theme_wc()
g_7
MDS allows for relative distances to be incorporated. I use the years and audience as dimensions and create a visualization for MDS where the attendance is mapped and labelled as seen above.
g_8 <- ggplot(wc_1) +
aes(x = dayofweek, y = total_goals, fill = dayofweek) +
geom_col() +
scale_fill_viridis_d(option = "viridis", direction = 1) +
labs(
x = "Day of Week",
y = "Goals Scored",
title = "Day and Goals Scored ",
fill = "Day of Week"
) +
coord_flip() +
theme_wc()
g_8
Does goal scoring vary according to the day of week. As seen the games played on weekends have seen higher goals scored. Moreover, games on Sunday have a significantly higher number of goals scored compared to other days. The days are colored for easy readability.
g_9 <- ggplot(wc_1) +
aes(x = country, y = home_goals, fill = country) +
geom_col() +
scale_fill_hue(direction = 1) +
labs(
x = "Wolrd Cup Host",
y = "Host Goals",
title = "Host Goals",
fill = "Wolrd Cup Host"
) +
coord_flip() +
theme_bw()
g_9
Lastly, I also wanted to see how hosts have performed by goals scored.Hence, as seen above some host did tremendously well by scoring a high number of goals compared to other hosting countries. Germany, France, Mexico & Brazil have been high scoring hosts while some countries like Uruguay, Japan and England have been reletavily lower scoring hosts.