In the world of sport, success on the international stage has often been associated with the resources and support available. Since I was young, I have been passionate about soccer, and it has become one of my favorite hobbies. As a personal project, I wanted to investigate whether soccer performance is affected by wealth.
I used GDP per capita as the measure of wealth and the FIFA men's soccer national team rankings as the measure of performance. The GDP per capita dataset was retrieved from the World Bank website (linked: GDP Per Capita 2021) as a CSV file. The FIFA men's soccer national team rankings dataset was scraped from the FIFA website (linked: Men's Ranking) as a CSV file.
Data preparation occurred both in the CSV files and in R.
Since I was going to use the country names as the join variable, I made sure that the country names were spelled correctly and in the same manner in both CSV files. I also removed some unnecessary columns, such as past GDP data, because this analysis focuses on static data from 2021.
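A quick way to find names that still do not match is an anti-join in each direction. The sketch below uses two small made-up tables (`toy_gdp` and `toy_rank` are hypothetical names, and the rows only illustrate the kind of spelling differences that can occur between sources); it is not the check that was actually run for this project.

```r
library(dplyr)

# Hypothetical toy tables, for illustration only
toy_gdp <- data.frame(
  Country = c("Brazil", "Korea, Rep.", "United States"),
  stringsAsFactors = FALSE
)
toy_rank <- data.frame(
  Country = c("Brazil", "Korea Republic", "USA"),
  stringsAsFactors = FALSE
)

# Rows of the first table that have no partner in the second table
unmatched_gdp  <- anti_join(toy_gdp, toy_rank, by = "Country")
unmatched_rank <- anti_join(toy_rank, toy_gdp, by = "Country")

print(unmatched_gdp)
print(unmatched_rank)
```

Any country listed in either result would be silently dropped by an inner join, so both results should be empty before joining.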
In R, the only cleaning needed was to get rid of NA values for GDP, as certain values were not available from the World Bank. The two datasets were joined with the inner_join function, and only the relevant columns were selected.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
file_path_rankings <- "C:/Users/Rishabh/Desktop/Data Analytics Projects/R Projects/Wealth vs. FIFA Rankings/FIFA Socccer men's ranking - 23 December 2021.csv"
file_path_GDP <- "C:/Users/Rishabh/Desktop/Data Analytics Projects/R Projects/Wealth vs. FIFA Rankings/GDP per Capital - 2021.csv"
data_rankings <- read.csv(file_path_rankings)
data_GDP <- read.csv(file_path_GDP)
# Join on country name, keep the relevant columns, then drop rows with missing values
merged_data <- inner_join(data_GDP, data_rankings, by = "Country") %>%
  select(Country, Continent, Rank, GDP_per_Capita_2021) %>%
  na.omit()
glimpse(merged_data)
## Rows: 195
## Columns: 4
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "American Sam…
## $ Continent <chr> "Asia", "Europe", "Africa", "Australia/Oceania", "…
## $ Rank <int> 149, 66, 29, 190, 155, 126, 127, 5, 92, 200, 35, 3…
## $ GDP_per_Capita_2021 <dbl> 368.75, 6492.87, 3690.63, 15743.31, 42137.33, 1953…
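# Scatter plot of rank against GDP per capita, with a linear fit.
# The y-axis is reversed so that better (lower-numbered) ranks sit at the top.
ggplot(data = merged_data) +
  geom_point(mapping = aes(x = GDP_per_Capita_2021, y = Rank,
                           color = Continent)) +
  geom_smooth(mapping = aes(x = GDP_per_Capita_2021, y = Rank),
              method = "lm", se = FALSE, color = "black") +
  scale_y_reverse() +
  # Note: these limits drop countries above $50K from the points and the fit
  scale_x_continuous(
    breaks = c(0, 5000, 10000, 15000, 20000, 25000, 30000),
    labels = c("$0", "$5K", "$10K", "$15K", "$20K", "$25K", "$30K"),
    limits = c(0, 50000))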
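One caveat about the plot code above: in ggplot2, limits set on a scale remove out-of-range rows before the regression line is fitted, whereas coord_cartesian() only zooms the view and keeps every row in the fit. The sketch below shows both options on made-up numbers (the `toy` data frame and its values are hypothetical, not taken from the datasets used here).

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  gdp  = c(500, 2000, 8000, 20000, 45000, 60000, 90000),
  rank = c(150, 120, 60, 40, 100, 20, 10)
)

# Option 1: limits on the scale. Rows with gdp > 50000 are dropped
# (with a warning) from both the points and the fitted line.
p_limits <- ggplot(toy, aes(x = gdp, y = rank)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_y_reverse() +
  scale_x_continuous(limits = c(0, 50000))

# Option 2: zoom only. The line is fitted on all rows, and the
# rows above 50000 are simply outside the visible window.
p_zoom <- ggplot(toy, aes(x = gdp, y = rank)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_y_reverse() +
  coord_cartesian(xlim = c(0, 50000))

# The same difference expressed with plain lm() fits
n_dropped   <- sum(toy$gdp > 50000)
fit_all     <- lm(rank ~ gdp, data = toy)
fit_trimmed <- lm(rank ~ gdp, data = subset(toy, gdp <= 50000))
```

Comparing coef(fit_all) with coef(fit_trimmed) shows how much the slope depends on the rows that the scale limits would remove.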
The data points in the scatter plot are very spread out. The regression line sits in the middle of the plot with the slightest upward tilt. Neither the scatter plot nor the regression line shows a clear, strong trend. Because the y-axis is reversed (better ranks are at the top), the slight upward tilt suggests a weak positive relationship between wealth and ranking. Let's calculate the correlation coefficient to get more insight into the relationship.
correlation <- cor(merged_data$GDP_per_Capita_2021, merged_data$Rank)
print(correlation)
## [1] -0.1748845
Disclaimer
The cor() function reports how the two raw variables, GDP per capita and Rank, move together. Because a lower rank number means a better team, the negative coefficient of -0.1749 means that higher GDP per capita goes with a numerically lower, and therefore better, rank. In terms of wealth versus performance, it should be read as a positive relationship of 0.1749.
A correlation coefficient of 0.1749 indicates a weak positive linear relationship between the two variables, GDP and Ranking. In this context, it means that as GDP per capita increases, the Rank tends to get slightly better, but the relationship is not strong.
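The sign flip described in the disclaimer can be checked directly: negating the rank, so that larger means better, changes only the sign of the coefficient. The sketch below uses made-up numbers (the `gdp` and `rank` vectors are hypothetical). It also computes Spearman's rank correlation, which may be a reasonable alternative here because FIFA rank is an ordinal variable; that is a suggestion, not part of the original analysis.

```r
# Hypothetical toy values, for illustration only
gdp  <- c(500, 2000, 8000, 20000, 45000)
rank <- c(150, 120, 60, 40, 100)

# Pearson correlation on the raw rank numbers (lower number = better team)
r_raw <- cor(gdp, rank)

# Negating rank makes "larger = better", which only flips the sign
r_flipped <- cor(gdp, -rank)

# Spearman's rho uses only the ordering of the values
rho <- cor(gdp, rank, method = "spearman")

print(c(raw = r_raw, flipped = r_flipped, spearman = rho))
```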
Looking at the relationship between GDP and Ranking from a broader perspective may give us a better or alternative insight, since we did not find any strong trend on an individual basis. We will use a stacked bar chart for this.
First, we will classify the countries as low income, lower-middle income, upper-middle income and high income based on their GDP per capita, using the thresholds from the World Bank's "New World Bank country classifications by income level: 2021-2022". Note that the World Bank defines these thresholds on GNI per capita, so applying them to GDP per capita is an approximation.
Second, we will categorize the FIFA ranking of the countries as high rank (1-50), middle rank (51-150) and low rank (> 150). These cutoffs are my own choice.
The categorization was undertaken by mutating the dataset in the code below:
# Define the thresholds for classifying income levels and national team ranking
low_income_threshold <- 1045
lower_middle_income_threshold <- 4095
upper_middle_income_threshold <- 12695
high_rank <- 50
middle_rank <- 150
# Add the "Income_Category" and "Rank_Category" columns to the dataset based on the thresholds
merged_data_category <- merged_data %>%
  mutate(Income_Category = case_when(
    GDP_per_Capita_2021 <= low_income_threshold ~ "low income",
    GDP_per_Capita_2021 > low_income_threshold & GDP_per_Capita_2021 <= lower_middle_income_threshold ~ "lower-middle income",
    GDP_per_Capita_2021 > lower_middle_income_threshold & GDP_per_Capita_2021 <= upper_middle_income_threshold ~ "upper-middle income",
    GDP_per_Capita_2021 > upper_middle_income_threshold ~ "high income"
  )) %>%
  mutate(Rank_Category = case_when(
    Rank <= high_rank ~ "high rank",
    Rank > high_rank & Rank <= middle_rank ~ "middle rank",
    Rank > middle_rank ~ "low rank"
  ))
# Print the first few rows of the modified dataset
head(merged_data_category)
## Country Continent Rank GDP_per_Capita_2021 Income_Category
## 1 Afghanistan Asia 149 368.75 low income
## 2 Albania Europe 66 6492.87 upper-middle income
## 3 Algeria Africa 29 3690.63 lower-middle income
## 4 American Samoa Australia/Oceania 190 15743.31 high income
## 5 Andorra Europe 155 42137.33 high income
## 6 Angola Africa 126 1953.53 lower-middle income
## Rank_Category
## 1 middle rank
## 2 middle rank
## 3 high rank
## 4 low rank
## 5 low rank
## 6 middle rank
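The same binning can also be written with base R's cut(), which returns a factor whose levels follow the order of the thresholds. This is only an alternative sketch, not the code used above; the vector names are hypothetical, and the input values are the first four rows of the output shown above.

```r
# Thresholds as used in the analysis above
income_breaks <- c(-Inf, 1045, 4095, 12695, Inf)
income_labels <- c("low income", "lower-middle income",
                   "upper-middle income", "high income")

rank_breaks <- c(0, 50, 150, Inf)
rank_labels <- c("high rank", "middle rank", "low rank")

# First four rows from the output above:
# Afghanistan, Albania, Algeria, American Samoa
gdp_values  <- c(368.75, 6492.87, 3690.63, 15743.31)
rank_values <- c(149, 66, 29, 190)

# right = TRUE makes each interval (lower, upper], matching the <= comparisons
income_category <- cut(gdp_values, breaks = income_breaks,
                       labels = income_labels, right = TRUE)
rank_category <- cut(rank_values, breaks = rank_breaks,
                     labels = rank_labels, right = TRUE)

print(data.frame(gdp_values, income_category, rank_values, rank_category))
```

Because the result is a factor with levels in threshold order, plots built from it list the categories from low to high income instead of alphabetically.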
# Create a summary dataset with counts for each combination of Rank_Category and Income_Category
summary_data <- merged_data_category %>%
  count(Rank_Category, Income_Category) %>%
  rename(Count = n)
# Create a stacked bar chart with the count centered in each segment
# (rank categories appear in alphabetical order because they are character values)
ggplot(data = summary_data, aes(x = Rank_Category, y = Count, fill = Income_Category)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = Count), size = 3, position = position_stack(vjust = 0.5)) +
  labs(x = "FIFA Rank Category", y = "Count of Countries", fill = "Income Category") +
  theme_minimal()
# summary_data from the previous chunk is reused here
# Create a grouped bar chart with the count shown on top of each bar
# (rank categories appear in alphabetical order because they are character values)
ggplot(data = summary_data, aes(x = Rank_Category, y = Count, fill = Income_Category)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.8) +
  geom_text(aes(label = Count), size = 3, position = position_dodge(width = 0.8), vjust = -0.5) +
  labs(x = "FIFA Rank Category", y = "Count of Countries", fill = "Income Category") +
  theme_minimal()
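The rank categories are stored as character values, so ggplot2 places them on the x-axis in alphabetical order (high, low, middle). Converting the column to a factor with explicit levels puts them in a natural order. The sketch below uses made-up counts (`toy_summary` and its numbers are hypothetical, not the real results).

```r
library(ggplot2)

# Hypothetical counts, for illustration only
toy_summary <- data.frame(
  Rank_Category   = c("high rank", "low rank", "middle rank"),
  Income_Category = c("high income", "low income", "low income"),
  Count           = c(3, 1, 4),
  stringsAsFactors = FALSE
)

# Explicit factor levels control the order of the bars on the x-axis
toy_summary$Rank_Category <- factor(
  toy_summary$Rank_Category,
  levels = c("high rank", "middle rank", "low rank")
)

# geom_col() is shorthand for geom_bar(stat = "identity")
p_ordered <- ggplot(toy_summary,
                    aes(x = Rank_Category, y = Count, fill = Income_Category)) +
  geom_col(position = "dodge") +
  theme_minimal()
```

The same conversion could be applied to Rank_Category and Income_Category in summary_data before plotting.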
Interpretation of Bar Charts
High income countries are spread fairly evenly across all rank categories, with the highest presence in the high rank category. There are no low income countries in the high rank category.
20 of the 22 low income countries fall in the middle rank category.
More than half of the lower-middle income countries fall in the middle rank category.
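Statements such as "20 of 22" are shares within an income category, and these are easier to compare across categories than raw counts because the categories differ in size. A minimal sketch of that calculation with dplyr, on made-up counts (`toy_counts` and its numbers are hypothetical, not the real results):

```r
library(dplyr)

# Hypothetical counts, for illustration only
toy_counts <- data.frame(
  Income_Category = c("low income", "low income",
                      "high income", "high income", "high income"),
  Rank_Category   = c("middle rank", "low rank",
                      "high rank", "middle rank", "low rank"),
  Count           = c(8, 2, 5, 3, 2),
  stringsAsFactors = FALSE
)

# Share of each rank category within its income category
shares <- toy_counts %>%
  group_by(Income_Category) %>%
  mutate(Share = Count / sum(Count)) %>%
  ungroup()

print(shares)
```

Applied to summary_data, the Share column would give the proportion of each income group that lands in each rank category.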