Countries differ substantially in their economies, racial diversity, culture, currency, and more. This study explores the disparities among countries, regions, and continents, with a specific emphasis on economics. According to data from the World Population Review website, the United States has the highest GDP in the world, followed by China and then Germany. This topic holds personal significance for me because I have always been curious about the extent of economic disparities among countries, regions, and continents.
Data
The dataset used in this analysis was obtained through web scraping by Aditya Kishor and contains information on all countries. I focus on its economic variables: the original dataset has 64 variables, and this analysis concentrates on eleven (11) of them:
“country”: The name of the country.
“region”: The region to which the country belongs.
“continent”: The name of the continent to which the country belongs.
“central_government_debt_pct_gdp”: The percentage of a country’s GDP that is represented by the central government’s debt.
“gdp”: Gross Domestic Product.
“self_employed_pct”: The percentage of the total workforce that is self-employed.
“tax_revenue_pct_gdp”: Proportion of a country’s GDP that is collected in the form of taxes.
“unemployment_pct”: Percentage of the labor force that is actively seeking employment but currently unemployed.
“vulnerable_employment_pct”: The percentage of the total workforce engaged in vulnerable employment.
“health_expenditure_capita”: Health expenditure per capita for the country.
“health_expenditure_pct_gdp”: The percentage of a country’s GDP that is allocated to health expenditure.
Load Libraries
library(tidyverse)
library(plotly)     # ggplotly() for the interactive scatterplots
library(ggfortify)  # assumed: supplies the autoplot() method for lm objects used in the diagnostics
library(treemap)
Load Data
# set working directory
setwd("C:/Users/kmerv_6exilcx/Dropbox/SPRING 2024/Data 110/Final Project")
countries <- read_csv('AllCountries.csv')
# display the first six rows
head(countries)
# A tibble: 6 × 64
country country_long currency capital_city region continent demonym latitude
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… Islamic Sta… Afghan … Kabul South… Asia Afghan 33
2 Albania Republic of… Albania… Tirana South… Europe Albani… 41
3 Algeria People's De… Algeria… Algiers North… Africa Algeri… 28
4 Andorra Principalit… Euro Andorra la … South… Europe Andorr… 42.5
5 Angola People's Re… Angolan… Luanda Middl… Africa Angolan -12.5
6 Antigua … Antigua and… East Ca… Saint John's Carib… Americas Antigu… 17.0
# ℹ 56 more variables: longitude <dbl>, agricultural_land <dbl>,
# forest_area <dbl>, land_area <dbl>, rural_land <dbl>, urban_land <dbl>,
# central_government_debt_pct_gdp <dbl>, expense_pct_gdp <dbl>, gdp <dbl>,
# inflation <dbl>, self_employed_pct <dbl>, tax_revenue_pct_gdp <dbl>,
# unemployment_pct <dbl>, vulnerable_employment_pct <dbl>,
# electricity_access_pct <dbl>, alternative_nuclear_energy_pct <dbl>,
# electricty_production_coal_pct <dbl>, …
# select only the columns I will work with
data <- countries |>
  select("country", "region", "continent", "gdp", "central_government_debt_pct_gdp",
         "self_employed_pct", "tax_revenue_pct_gdp", "unemployment_pct",
         "vulnerable_employment_pct", "health_expenditure_capita",
         "health_expenditure_pct_gdp")
# display the first six rows
head(data)
# A tibble: 6 × 11
country region continent gdp central_government_d…¹ self_employed_pct
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan South… Asia 1.46e10 NA 84.3
2 Albania South… Europe 1.89e10 82.4 53.0
3 Algeria North… Africa 1.92e11 NA 30.5
4 Andorra South… Europe 3.35e 9 NA NA
5 Angola Middl… Africa 1.07e11 NA 65.9
6 Antigua and… Carib… Americas 1.76e 9 NA NA
# ℹ abbreviated name: ¹central_government_debt_pct_gdp
# ℹ 5 more variables: tax_revenue_pct_gdp <dbl>, unemployment_pct <dbl>,
# vulnerable_employment_pct <dbl>, health_expenditure_capita <dbl>,
# health_expenditure_pct_gdp <dbl>
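Since the preview above already shows NA values in several of the selected columns, it can help to count missing values per column before fitting any model. The following is a minimal base R sketch; `toy` is a hypothetical stand-in for `data`, and its values are placeholders rather than figures from the dataset.

```r
# Hypothetical stand-in for `data`; values are placeholders for illustration only
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  gdp = c(1.0e10, 2.5e11, NA, 4.0e9),
  unemployment_pct = c(5.2, NA, NA, 11.3)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows that are complete for a chosen set of model variables
n_complete <- sum(complete.cases(toy[, c("gdp", "unemployment_pct")]))
print(n_complete)
```

On the real data, replacing `toy` with `data` would show how many rows each model can actually use.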
Statistical analysis
Scatterplot of Central Government Debt vs Unemployment Rate
plot1 <- ggplot(data, aes(x = central_government_debt_pct_gdp, y = unemployment_pct,
                          color = continent, group = continent, size = gdp,
                          text = paste("country:", country))) +
  theme_light(base_size = 12, base_family = "serif") +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE, lty = 5, linewidth = 0.2) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "Percentage of central government debt (% GDP)", y = "Unemployment rate (%)",
       title = "Scatterplot of Central Government Debt vs Unemployment Rate",
       caption = "Source: Web Scraping") +
  scale_x_log10()
plot1 <- ggplotly(plot1, tooltip = c("x", "y", "color", "size", "text"))
Warning: Transformation introduced infinite values in continuous x-axis
Transformation introduced infinite values in continuous x-axis
Warning: The following aesthetics were dropped during statistical transformation: size,
text
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
plot1
For Africa, unemployment rates tend to decrease as central government debt as a percentage of GDP increases, whereas the trend is opposite for other continents.
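The direction of each continent's trend line can also be checked numerically. Because the x-axis uses a log10 scale, the smoothing lines are fitted on log10 of the debt percentage, and non-positive values cannot be displayed (which is what the transformation warning refers to). The sketch below is one way to compute a slope per continent under those assumptions; `toy` and its column `debt_pct` are hypothetical, with placeholder values.

```r
# Hypothetical stand-in for `data`; values are placeholders for illustration only
toy <- data.frame(
  continent = rep(c("X", "Y"), each = 4),
  debt_pct = c(10, 100, 1000, 0, 10, 100, 1000, NA),
  unemployment_pct = c(9, 7, 5, 6, 2, 4, 6, 3)
)

# Keep rows a log10 axis can display: positive, non-missing debt and non-missing unemployment
ok <- !is.na(toy$debt_pct) & toy$debt_pct > 0 & !is.na(toy$unemployment_pct)
toy_ok <- toy[ok, ]

# Fit unemployment ~ log10(debt) separately for each continent and keep the slope
slopes <- sapply(split(toy_ok, toy_ok$continent), function(d) {
  coef(lm(unemployment_pct ~ log10(debt_pct), data = d))[[2]]
})
print(slopes)
```

A negative slope corresponds to a downward trend line and a positive slope to an upward one; the sign alone says nothing about statistical significance.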
Multiple Linear Model: Use central_government_debt_pct_gdp and region to predict unemployment_pct
model1 <- lm(unemployment_pct ~ central_government_debt_pct_gdp + region, data = data)
summary(model1)
Call:
lm(formula = unemployment_pct ~ central_government_debt_pct_gdp +
region, data = data)
Residuals:
Min 1Q Median 3Q Max
-7.9858 -1.8882 -0.3264 1.6729 11.3266
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.665582 2.710376 1.352 0.17955
central_government_debt_pct_gdp -0.003443 0.005387 -0.639 0.52434
regionCaribbean 6.815943 3.051133 2.234 0.02791 *
regionCentral America 2.665708 3.184408 0.837 0.40470
regionCentral Asia 2.321176 3.473804 0.668 0.50568
regionEastern Africa 2.224384 3.008607 0.739 0.46158
regionEastern Asia 1.276799 3.484413 0.366 0.71488
regionEastern Europe 1.030185 2.948318 0.349 0.72758
regionMelanesia -0.879034 3.299212 -0.266 0.79050
regionMiddle Africa 6.890515 3.476257 1.982 0.05044 .
regionNorthern Africa 9.573156 3.296172 2.904 0.00461 **
regionNorthern America 1.061888 3.808973 0.279 0.78103
regionNorthern Europe 1.536658 2.975026 0.517 0.60673
regionPolynesia -0.484447 4.660740 -0.104 0.91744
regionSouth-Eastern Asia -1.019942 3.106901 -0.328 0.74344
regionSouth America 3.846752 3.106954 1.238 0.21883
regionSouthern Africa 18.766632 3.300721 5.686 1.53e-07 ***
regionSouthern Asia 3.480795 3.008545 1.157 0.25028
regionSouthern Europe 5.932474 3.021928 1.963 0.05265 .
regionWestern Africa 0.433056 3.475588 0.125 0.90111
regionWestern Asia 5.820431 2.947769 1.975 0.05132 .
regionWestern Europe 1.284558 3.050919 0.421 0.67471
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.805 on 92 degrees of freedom
(80 observations deleted due to missingness)
Multiple R-squared: 0.5679, Adjusted R-squared: 0.4693
F-statistic: 5.758 on 21 and 92 DF, p-value: 1.428e-09
The trend in the residual plots suggests a possible violation of the constant-variance assumption. Observations 38, 62, and 85 stand out in these plots with high scale-location values, so they may have a notable effect on the fit.
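To see which countries the flagged observation numbers refer to, it helps to know that `lm()` drops incomplete rows but keeps the original row names on the residuals. The sketch below illustrates this on a hypothetical data frame `toy` with placeholder values; it is not the model fitted above.

```r
# Hypothetical data with missing values; values are placeholders for illustration only
toy <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, NA, 10.1, 11.8)
)

# Rows 3 and 4 are dropped because of NA (default na.action is na.omit)
fit <- lm(y ~ x, data = toy)

# Residuals keep the original row names, so observation labels in
# diagnostic plots point back to rows of the data frame passed to lm()
kept_rows <- as.integer(names(residuals(fit)))
print(kept_rows)

# Row with the largest absolute standardized residual, looked up in the original data
worst <- kept_rows[which.max(abs(rstandard(fit)))]
print(toy[worst, ])
```

Assuming the labels in the plots are row numbers of `data`, something like `data[c(38, 62, 85), c("country", "region")]` would identify the flagged countries.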
Distribution of Self-Employment Percentage by region
# Create the boxplot using ggplot2
ggplot(data, aes(x = region, y = self_employed_pct, fill = continent)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Region", y = "Self-Employed %",
       title = "Self-Employed Percentage Distribution by Region",
       caption = "Source: Web Scraping") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 80, hjust = 1)) +
  theme(axis.text = element_text(size = 10)) +
  theme(plot.title = element_text(size = 14, face = "bold")) +
  theme(legend.position = "bottom")
From the boxplots, we can see that the median self-employed percentage varies significantly across regions and continents. Europe generally exhibits lower percentages of self-employment compared to other continents, while Africa demonstrates higher percentages.
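The medians read off the boxplots can be tabulated directly. This is a minimal base R sketch on a hypothetical data frame `toy` with placeholder values; the region names `R1` and `R2` are invented for illustration.

```r
# Hypothetical stand-in for `data`; values are placeholders for illustration only
toy <- data.frame(
  region = c("R1", "R1", "R1", "R2", "R2"),
  self_employed_pct = c(10, 20, NA, 60, 80)
)

# Median self-employment per region (the formula interface drops NA rows by default)
medians <- aggregate(self_employed_pct ~ region, data = toy, FUN = median)

# Sort regions from highest to lowest median
medians <- medians[order(medians$self_employed_pct, decreasing = TRUE), ]
print(medians)
```

Applied to `data`, the same two steps would give a ranked table to compare against the visual impression from the plot.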
Scatterplot of Self Employment Percentage vs Vulnerable Employment Percentage
plot2 <- ggplot(data, aes(x = self_employed_pct, y = vulnerable_employment_pct,
                          color = continent, group = continent,
                          text = paste("country:", country))) +
  theme_minimal(base_size = 12, base_family = "serif") +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE, lty = 5, linewidth = 0.2) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "Percentage of Self-Employment", y = "Vulnerable Employment Rate (%)",
       title = "Scatterplot of Percentage of Self-Employment to Vulnerable Employment Rate",
       caption = "Source: Web Scraping")
plot2 <- ggplotly(plot2, tooltip = c("x", "y", "color", "text"))
Warning: The following aesthetics were dropped during statistical transformation: text
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
plot2
The scatterplot shows a strong positive, nearly linear relationship between the percentage of self-employment and the vulnerable employment rate: countries with a higher share of self-employed workers also have a higher share of workers in vulnerable employment.
Linear Model: Use self_employed_pct to predict vulnerable_employment_pct
model2 <- lm(vulnerable_employment_pct ~ self_employed_pct, data = data)
summary(model2)
Call:
lm(formula = vulnerable_employment_pct ~ self_employed_pct, data = data)
Residuals:
Min 1Q Median 3Q Max
-9.2802 -0.8938 0.2453 1.2903 3.5948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.929713 0.278432 -14.11 <2e-16 ***
self_employed_pct 1.019118 0.005628 181.06 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.041 on 177 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.9946, Adjusted R-squared: 0.9946
F-statistic: 3.278e+04 on 1 and 177 DF, p-value: < 2.2e-16
Equation of the model: vulnerable_employment_pct = −3.929713 + 1.019118 × self_employed_pct
The slope estimate is approximately 1.019, indicating that each one-percentage-point increase in self-employment is associated with an increase of about 1.019 percentage points in vulnerable employment.
The intercept estimate is approximately −3.930, the predicted vulnerable employment percentage when self-employment is zero. A negative percentage is not meaningful, so the intercept should not be interpreted literally.
The p-value (< 2.2e-16) is very low, indicating that the model as a whole is statistically significant. Moreover, the adjusted R-squared of 0.9946 indicates that approximately 99.46% of the variance in the vulnerable employment percentage is explained by the percentage of self-employed individuals.
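As a worked example of the fitted equation, the sketch below plugs the coefficients reported in the model summary into a small helper function. The name `predict_vulnerable` is hypothetical and introduced only for illustration; with the fitted object available, `predict(model2, newdata = ...)` would do the same job.

```r
# Coefficients copied from the summary(model2) output
intercept <- -3.929713
slope <- 1.019118

# Hypothetical helper: predicted vulnerable employment (%) for a given self-employment (%)
predict_vulnerable <- function(self_employed_pct) {
  intercept + slope * self_employed_pct
}

# -3.929713 + 1.019118 * 50 = 47.026187
print(predict_vulnerable(50))

# The function is vectorized, so several values can be predicted at once
print(predict_vulnerable(c(10, 30)))
```

For a country with 50% self-employment, the model predicts about 47% vulnerable employment.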
Diagnostic plots model2
autoplot(model2, 1:4, nrow=2, ncol=2)
The almost horizontal pattern in the Residuals vs. Fitted plot suggests that the assumption of constant variance is reasonable. Observation 73 stands out slightly in both the Residuals vs. Fitted and Normal Q-Q plots, which suggests that it may have some influence on the model’s fit or deviate from the assumed distribution.
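The influence of a single observation, as discussed for observation 73, can be quantified with Cook's distance, which is what the fourth diagnostic panel displays. The sketch below uses a hypothetical data frame `toy` with a deliberately planted outlier; the 4/n cutoff is a common rule of thumb, not a formal test.

```r
# Hypothetical data; the last point is a deliberate outlier for illustration
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 20)
)

fit <- lm(y ~ x, data = toy)

# Cook's distance for every observation used in the fit
cd <- cooks.distance(fit)

# Row with the largest influence on the fitted line
most_influential <- as.integer(names(which.max(cd)))
print(most_influential)

# Observations above a rule-of-thumb cutoff of 4 / n
threshold <- 4 / nrow(toy)
flagged <- names(cd)[cd > threshold]
print(flagged)
```

Running `cooks.distance(model2)` in the same way would show whether observation 73 exceeds whatever cutoff is chosen.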
Mean GDP and Tax Revenue as a Percentage of GDP by Region
# group the data by region and calculate the mean values of gdp and tax_revenue_pct_gdp within each region
data1 <- data |>
  group_by(region) |>
  summarize(tax_pct_gdp = mean(tax_revenue_pct_gdp, na.rm = TRUE),
            gdp = mean(gdp, na.rm = TRUE))
# Treemap (treemap() draws with grid graphics, not ggplot2, so no ggplot theme is added)
treemap(data1, index = "region", vSize = "gdp",
        title = "Mean GDP and Tax Revenue as a Percentage of GDP by Region",
        vColor = "tax_pct_gdp", type = "manual", palette = "YlGnBu")
From the treemap, we observe that certain regions, such as Northern America, have a high mean GDP while tax revenue is a small share of it. Conversely, a region like Australia and New Zealand has a relatively smaller mean GDP, yet tax revenue makes up a larger share of it. Additionally, mean GDPs for regions in Africa are notably low, represented by extremely small rectangles, in stark contrast with regions in the Americas and Asia.
The treemap is consistent with the ranking from the World Population Review website: Northern America, which includes the United States, appears as the region with the largest mean GDP. Eastern Asia, where China is located, appears to be second, possibly because of China’s economic weight, and Western Europe, which includes Germany, also ranks high. Because the treemap shows regional means rather than country values, it does not by itself rank individual countries.
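Regional means, regional totals, and country-level rankings can tell different stories, so it is worth checking the country ranking separately from the treemap. The sketch below shows the distinction on a hypothetical data frame `toy`; the countries, regions, and GDP figures are invented placeholders.

```r
# Hypothetical stand-in for `data`; values are placeholders for illustration only
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  region = c("R1", "R1", "R1", "R2", "R2"),
  gdp = c(100, 2, 3, 40, 50)
)

# Mean and total GDP per region
region_mean <- tapply(toy$gdp, toy$region, mean, na.rm = TRUE)
region_total <- tapply(toy$gdp, toy$region, sum, na.rm = TRUE)
print(region_mean)
print(region_total)

# Country with the highest GDP
top_country <- toy$country[which.max(toy$gdp)]
print(top_country)
```

In this toy example the region with the highest mean GDP is not the region containing the top country, which is why a country-level sort of `data` by `gdp` is the more direct check of the ranking.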
Health Expenditure per Capita in countries
Health expenditure per capita is the average amount of money spent on healthcare services per person within a specific population or region. It is calculated by dividing the total healthcare expenditure by the total population.
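The definition above amounts to a single division. As a small sketch, the hypothetical helper `per_capita` below applies it to made-up totals; the numbers are placeholders, not values from the dataset.

```r
# Hypothetical helper: expenditure per person from a total and a population
per_capita <- function(total_expenditure, population) {
  stopifnot(all(population > 0))  # a per-person figure needs a positive population
  total_expenditure / population
}

# Placeholder figures: 5 billion spent across 10 million people
print(per_capita(5e9, 1e7))
```

With these placeholder figures the result is 500 per person, in whatever currency unit the total is expressed in.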
Visualization from Tableau Public: https://public.tableau.com/app/profile/merveille.kuendzong/viz/HealthExpenditureCapitainCountries/health_exp_capita
This visualization indicates that countries in Africa exhibit the lowest values of Health Expenditure per Capita compared to other continents, with Burundi having the lowest value globally and the United States having the highest.
Conclusion
In summary, our analysis reveals substantial economic disparities across regions and continents. We found a nearly perfect linear relationship between self-employment and vulnerable employment (adjusted R-squared of 0.9946). The boxplots show clear differences in median self-employment percentages, with Europe generally showing lower rates than other continents and Africa higher rates. Additionally, the scatterplot suggests a distinct trend for Africa, where unemployment rates tend to decrease as central government debt as a percentage of GDP increases, contrary to other continents.
Also, we observed contrasting patterns in GDP and tax revenue percentages, indicating diverse economic structures. Regions in Africa face distinct challenges with lower GDPs, while others like Northern America demonstrate robust economic performance. Moreover, we noticed stark differences in Health Expenditure per Capita, with African countries exhibiting the lowest values globally, and the United States leading with the highest.