Analysis of the Economies of Countries Around the World

Author

Merveille Kuendzong

Published

May 5, 2024

Source: https://puassets.s3.amazonaws.com/wp-content/uploads/2015/11/globe-world-currency-economy-money-travel.jpg

Introduction and Background Research

Countries around the world differ widely in their economies, racial diversity, culture, currency, and more. This study explores the disparities among countries, regions, and continents, with a specific emphasis on economics. According to the World Population Review website, the United States has the highest GDP in the world, followed by China and then Germany. The topic matters to me personally because I have always been curious about how large the economic disparities among countries, regions, and continents really are.

Data

The dataset used in this analysis was obtained through web scraping by Aditya Kishor and contains information on countries worldwide. I focus on the economic variables only: the original dataset has 64 variables, and this analysis uses eleven (11) of them:

“country”: The name of the country.
“region”: The region to which the country belongs.
“continent”: The name of the continent to which the country belongs.
“central_government_debt_pct_gdp”: The central government’s debt as a percentage of the country’s GDP.
“gdp”: Gross Domestic Product.
“self_employed_pct”: The percentage of the total workforce that is self-employed.
“tax_revenue_pct_gdp”: Tax revenue collected, as a percentage of the country’s GDP.
“unemployment_pct”: The percentage of the labor force that is actively seeking employment but currently unemployed.
“vulnerable_employment_pct”: The percentage of the total workforce engaged in vulnerable employment.
“health_expenditure_capita”: Health expenditure per capita for the country.
“health_expenditure_pct_gdp”: Health expenditure as a percentage of the country’s GDP.

Load Libraries

library(tidyverse)

Warning: package 'dplyr' was built under R version 4.3.2

library(ggplot2)
library(ggfortify)
library(RColorBrewer)
library(plotly)

Warning: package 'plotly' was built under R version 4.3.2

library(treemap)

Warning: package 'treemap' was built under R version 4.3.2

Load Data

# set working directory
setwd("C:/Users/kmerv_6exilcx/Dropbox/SPRING 2024/Data 110/Final Project")
countries <- read_csv('AllCountries.csv')

# display the first six rows
head(countries)

# A tibble: 6 × 64
  country   country_long currency capital_city region continent demonym latitude
  <chr>     <chr>        <chr>    <chr>        <chr>  <chr>     <chr>      <dbl>
1 Afghanis… Islamic Sta… Afghan … Kabul        South… Asia      Afghan      33  
2 Albania   Republic of… Albania… Tirana       South… Europe    Albani…     41  
3 Algeria   People's De… Algeria… Algiers      North… Africa    Algeri…     28  
4 Andorra   Principalit… Euro     Andorra la … South… Europe    Andorr…     42.5
5 Angola    People's Re… Angolan… Luanda       Middl… Africa    Angolan    -12.5
6 Antigua … Antigua and… East Ca… Saint John's Carib… Americas  Antigu…     17.0
# ℹ 56 more variables: longitude <dbl>, agricultural_land <dbl>,
#   forest_area <dbl>, land_area <dbl>, rural_land <dbl>, urban_land <dbl>,
#   central_government_debt_pct_gdp <dbl>, expense_pct_gdp <dbl>, gdp <dbl>,
#   inflation <dbl>, self_employed_pct <dbl>, tax_revenue_pct_gdp <dbl>,
#   unemployment_pct <dbl>, vulnerable_employment_pct <dbl>,
#   electricity_access_pct <dbl>, alternative_nuclear_energy_pct <dbl>,
#   electricty_production_coal_pct <dbl>, …

# select only the columns I will work with
data <- countries |>
  select("country", "region", "continent", "gdp", "central_government_debt_pct_gdp", "self_employed_pct", "tax_revenue_pct_gdp", "unemployment_pct", "vulnerable_employment_pct", "health_expenditure_capita", "health_expenditure_pct_gdp")

# display the first six rows
head(data)

# A tibble: 6 × 11
  country      region continent     gdp central_government_d…¹ self_employed_pct
  <chr>        <chr>  <chr>       <dbl>                  <dbl>             <dbl>
1 Afghanistan  South… Asia      1.46e10                   NA                84.3
2 Albania      South… Europe    1.89e10                   82.4              53.0
3 Algeria      North… Africa    1.92e11                   NA                30.5
4 Andorra      South… Europe    3.35e 9                   NA                NA  
5 Angola       Middl… Africa    1.07e11                   NA                65.9
6 Antigua and… Carib… Americas  1.76e 9                   NA                NA  
# ℹ abbreviated name: ¹central_government_debt_pct_gdp
# ℹ 5 more variables: tax_revenue_pct_gdp <dbl>, unemployment_pct <dbl>,
#   vulnerable_employment_pct <dbl>, health_expenditure_capita <dbl>,
#   health_expenditure_pct_gdp <dbl>

Statistical analysis

Scatterplot of Central Government Debt vs Unemployment Rate

plot1 <- ggplot(data, aes(x = central_government_debt_pct_gdp, y = unemployment_pct, color=continent, group = continent, size = gdp, text = paste("country:", country))) +
  theme_light(base_size = 12, base_family = "serif") + 
  geom_point(alpha = 0.5) +
  geom_smooth(method=lm, se=FALSE, lty = 5, linewidth = 0.2) +
  scale_color_brewer(palette = "Set1") +
  labs(x="Percentage of central government debt (% GDP)", 
       y="Unemployment rate (%)",
       title = "Scatterplot of Central Government Debt vs Unemployment Rate",
       caption = "Source: Web Scraping") +
  scale_x_log10() 

plot1 <- ggplotly(plot1, tooltip = c("x", "y", "color", "size", "text"))

Warning: Transformation introduced infinite values in continuous x-axis
Transformation introduced infinite values in continuous x-axis

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 82 rows containing non-finite values (`stat_smooth()`).

Warning: The following aesthetics were dropped during statistical transformation: size,
text
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

plot1

For Africa, the fitted trend line suggests that unemployment rates tend to decrease as central government debt (as a percentage of GDP) increases, whereas the trend lines for the other continents run in the opposite direction.
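The direction of each trend line can also be read off numerically. The sketch below is a minimal illustration on made-up data (the `toy` data frame, its two groups "A" and "B", and its column names are assumptions, not values from AllCountries.csv). It fits a separate line per group against log10 of debt, since the plot uses scale_x_log10() and geom_smooth() fits on the transformed scale.

```r
# Toy data (made-up numbers, not from AllCountries.csv)
toy <- data.frame(
  continent = rep(c("A", "B"), each = 4),
  debt_pct  = c(20, 40, 60, 80, 20, 40, 60, 80),
  unemp_pct = c(12, 10, 9, 7, 4, 5, 7, 8)
)

# Fit unemployment ~ log10(debt) separately for each group and keep the slope
slopes <- sapply(split(toy, toy$continent), function(d) {
  coef(lm(unemp_pct ~ log10(debt_pct), data = d))[["log10(debt_pct)"]]
})

# A negative slope means unemployment falls as debt rises; positive means it rises
slopes
```

In this toy example group "A" gets a negative slope and group "B" a positive one, which is the kind of contrast described above.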

Multiple Linear Model: Use central_government_debt_pct_gdp and region to predict unemployment_pct

model1 <- lm(unemployment_pct ~ central_government_debt_pct_gdp + region, data = data)
summary(model1)


Call:
lm(formula = unemployment_pct ~ central_government_debt_pct_gdp + 
    region, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9858 -1.8882 -0.3264  1.6729 11.3266 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      3.665582   2.710376   1.352  0.17955    
central_government_debt_pct_gdp -0.003443   0.005387  -0.639  0.52434    
regionCaribbean                  6.815943   3.051133   2.234  0.02791 *  
regionCentral America            2.665708   3.184408   0.837  0.40470    
regionCentral Asia               2.321176   3.473804   0.668  0.50568    
regionEastern Africa             2.224384   3.008607   0.739  0.46158    
regionEastern Asia               1.276799   3.484413   0.366  0.71488    
regionEastern Europe             1.030185   2.948318   0.349  0.72758    
regionMelanesia                 -0.879034   3.299212  -0.266  0.79050    
regionMiddle Africa              6.890515   3.476257   1.982  0.05044 .  
regionNorthern Africa            9.573156   3.296172   2.904  0.00461 ** 
regionNorthern America           1.061888   3.808973   0.279  0.78103    
regionNorthern Europe            1.536658   2.975026   0.517  0.60673    
regionPolynesia                 -0.484447   4.660740  -0.104  0.91744    
regionSouth-Eastern Asia        -1.019942   3.106901  -0.328  0.74344    
regionSouth America              3.846752   3.106954   1.238  0.21883    
regionSouthern Africa           18.766632   3.300721   5.686 1.53e-07 ***
regionSouthern Asia              3.480795   3.008545   1.157  0.25028    
regionSouthern Europe            5.932474   3.021928   1.963  0.05265 .  
regionWestern Africa             0.433056   3.475588   0.125  0.90111    
regionWestern Asia               5.820431   2.947769   1.975  0.05132 .  
regionWestern Europe             1.284558   3.050919   0.421  0.67471    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.805 on 92 degrees of freedom
  (80 observations deleted due to missingness)
Multiple R-squared:  0.5679,    Adjusted R-squared:  0.4693 
F-statistic: 5.758 on 21 and 92 DF,  p-value: 1.428e-09

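Equation: unemployment_pct = 3.665582 − 0.003443 × central_government_debt_pct_gdp + (coefficient of the country's region), where the region term is zero for the reference region, which is the one omitted from the coefficient table.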
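As a worked example of how the equation is used, the sketch below plugs a debt level into it for two regions, with coefficients copied from the summary(model1) output. The helper name `predict_unemp` and the debt value of 50% are assumptions chosen for illustration only.

```r
# Coefficients copied from the summary(model1) output
intercept   <- 3.665582
slope_debt  <- -0.003443
region_coef <- c("Southern Africa" = 18.766632, "Western Europe" = 1.284558)

# Hypothetical helper: fitted unemployment (%) for a debt level (% of GDP) and a region
predict_unemp <- function(debt_pct_gdp, region) {
  intercept + slope_debt * debt_pct_gdp + region_coef[[region]]
}

predict_unemp(50, "Southern Africa")  # about 22.26
predict_unemp(50, "Western Europe")   # about 4.78
```

The two predictions differ by the gap between the region coefficients, while the debt term contributes only about -0.17 percentage points at this debt level.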

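The overall p-value of 1.428e-09 is very low, indicating that the model as a whole is statistically significant in explaining the variance in unemployment rates. The debt coefficient itself is not significant (p = 0.524), so the explanatory power comes mainly from the region terms, most notably Southern Africa and Northern Africa.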

The adjusted R-squared indicates that the model explains about 46.93% of the variation in unemployment rates, after adjusting for the number of predictors.
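The adjusted value can be reproduced from the other numbers in the summary output. The sketch below uses the standard adjustment formula, 1 − (1 − R²)(n − 1)/(n − p − 1), with the multiple R-squared, the residual degrees of freedom, and the 21 predictors reported above. The variable names are illustrative.

```r
# Values read from the summary(model1) output
r2       <- 0.5679   # Multiple R-squared
df_resid <- 92       # residual degrees of freedom (n - p - 1)
p        <- 21       # number of predictors (debt + 20 region dummies)

# Number of observations actually used in the fit
n <- df_resid + p + 1   # 114

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)   # 0.4693
```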

Diagnostic plots model1

autoplot(model1, 1:4, nrow=2, ncol=2)

Warning: Removed 62 rows containing missing values (`geom_line()`).

Warning: Removed 1 rows containing missing values (`geom_segment()`).

The trend in the residual plots may suggest a violation of the constant-variance assumption. The plots also label observations 38, 62, and 85 as the most extreme points, as shown by their high scale-location values, so these countries may have a notable effect on the fit.
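One common way to follow up on labeled observations is Cook's distance, which is also one of the four diagnostic panels. The sketch below is a minimal illustration on made-up data, not the model above. The 4/n cutoff is a widely used rule of thumb and an assumption here, not a threshold taken from this analysis.

```r
# Toy data (made-up): a near-linear trend with one extreme point at the end
x <- 1:10
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 40)

fit <- lm(y ~ x)

# Cook's distance measures how much the fitted values change when a point is dropped
cd <- cooks.distance(fit)

# Flag points above the rule-of-thumb cutoff 4/n
flagged <- which(cd > 4 / length(cd))
flagged
```

In this toy example only the last observation is flagged. Applying the same two calls to a fitted model object would list its influential rows in the same way.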

Distribution of Self-Employment Percentage by region

# Create the boxplot using ggplot2
ggplot(data, aes(x = region, y = self_employed_pct, fill = continent)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Region", y = "Self-Employed %", title = "Self-Employed Percentage Distribution by Region", caption = "Source: Web Scraping") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 80, hjust = 1)) +  
  theme(axis.text = element_text(size = 10)) + 
  theme(plot.title = element_text(size = 14, face = "bold")) + 
  theme(legend.position = "bottom")

Warning: Removed 15 rows containing non-finite values (`stat_boxplot()`).

From the boxplots, we can see that the median self-employed percentage varies considerably across regions and continents. Regions in Europe generally show lower percentages of self-employment than those on other continents, while regions in Africa show higher percentages.
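The medians drawn in the boxplots can also be tabulated directly. The sketch below is a minimal base R illustration on made-up data (the `toy` data frame and the region labels "R1" and "R2" are assumptions). It shows the grouping step only and is not the code used for the figure.

```r
# Toy data (made-up numbers), including one missing value
toy <- data.frame(
  region            = c("R1", "R1", "R1", "R2", "R2", "R2"),
  self_employed_pct = c(10, 14, NA, 60, 70, 80)
)

# Median self-employment per region; the formula interface drops NA rows by default
medians <- aggregate(self_employed_pct ~ region, data = toy, FUN = median)
medians
```

Here "R1" has a median of 12 and "R2" a median of 70, mirroring the center lines a boxplot would draw for each group.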

Scatterplot of Self Employment Percentage vs Vulnerable Employment Percentage

plot2 <- ggplot(data, aes(x = self_employed_pct, y = vulnerable_employment_pct, color=continent, group = continent, text = paste("country:", country))) +
  theme_minimal(base_size = 12, base_family = "serif") + 
  geom_point(alpha = 0.5) +
  geom_smooth(method=lm, se=FALSE, lty = 5, linewidth = 0.2) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "Percentage of Self-Employment", 
       y = "Vulnerable Employment Rate (%)",
       title = "Scatterplot of Percentage of Self-Employment to Vulnerable Employment Rate",
       caption = "Source: Web Scraping")

plot2 <- ggplotly(plot2, tooltip = c("x", "y", "color", "text"))

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 15 rows containing non-finite values (`stat_smooth()`).

Warning: The following aesthetics were dropped during statistical transformation: text
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

plot2

The scatterplot shows a strong positive linear relationship between the percentage of self-employment and the vulnerable employment rate: in countries where a larger share of the workforce is self-employed, a larger share of jobs is also vulnerable.

Linear Model: Use self_employed_pct to predict vulnerable_employment_pct

model2 <- lm(vulnerable_employment_pct ~ self_employed_pct, data = data)
summary(model2)


Call:
lm(formula = vulnerable_employment_pct ~ self_employed_pct, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.2802 -0.8938  0.2453  1.2903  3.5948 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3.929713   0.278432  -14.11   <2e-16 ***
self_employed_pct  1.019118   0.005628  181.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.041 on 177 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared:  0.9946,    Adjusted R-squared:  0.9946 
F-statistic: 3.278e+04 on 1 and 177 DF,  p-value: < 2.2e-16

Equation of the model: vulnerable_employment_pct = −3.929713 + 1.019118 × self_employed_pct

The slope coefficient (estimate) for the percentage of self-employed individuals is approximately 1.019118, indicating that for every one-percentage-point increase in the percentage of self-employed individuals, the vulnerable employment percentage increases by approximately 1.02 percentage points.

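The intercept (estimate) is approximately -3.929713, representing the estimated vulnerable employment percentage when the percentage of self-employed individuals is zero. A negative percentage is not possible, so the intercept should be read as a fitting constant rather than as a realistic prediction.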
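To make the fitted line concrete, the sketch below evaluates it at two self-employment shares using the coefficients from the summary(model2) output. The helper name `predict_vulnerable` and the input values are assumptions for illustration.

```r
# Coefficients copied from the summary(model2) output
intercept <- -3.929713
slope     <- 1.019118

# Hypothetical helper: fitted vulnerable employment (%) for a self-employment share (%)
predict_vulnerable <- function(self_employed_pct) {
  intercept + slope * self_employed_pct
}

predict_vulnerable(0)    # equals the intercept, about -3.93
predict_vulnerable(50)   # about 47.03
```

At a self-employment share of 50% the fitted vulnerable employment rate is close to 47%, while the value at 0% falls below zero and lies outside the range where the line is meaningful.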

The p-value (< 2.2e-16) is very low, indicating that the model as a whole is statistically significant in explaining the variance in the vulnerable employment rate. Moreover, the adjusted R-squared value of 0.9946 indicates that approximately 99.46% of the variance in the vulnerable employment percentage (almost all of it) can be explained by the percentage of self-employed individuals.
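The test statistics in the summary are linked by simple arithmetic, which the sketch below checks using the rounded values printed above. In a regression with a single predictor, the t value is the estimate divided by its standard error, and the F-statistic equals the square of that t value. Small differences from the printed numbers come from rounding.

```r
# Values read from the summary(model2) coefficient table
estimate  <- 1.019118
std_error <- 0.005628

# t value = estimate / standard error (printed as 181.06)
t_value <- estimate / std_error
t_value

# With one predictor, F equals t squared (printed as 3.278e+04)
f_value <- 181.06^2
f_value
```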

Diagnostic plots model2

autoplot(model2, 1:4, nrow=2, ncol=2)

The almost horizontal pattern in the Residuals vs. Fitted plot is consistent with the assumptions of linearity and constant variance. Observation 73 stands out in both the Residuals vs. Fitted and Normal Q-Q plots, which suggests that it may have some influence on the model’s fit or may deviate from the assumed distribution.

Mean GDP and Tax Revenue as a Percentage of GDP by Region

# group the data by region and calculate the mean of gdp and tax_revenue_pct_gdp within each region
data1 <- data|>
  group_by(region)|>
  summarize(tax_pct_gdp = mean(tax_revenue_pct_gdp, na.rm = TRUE), gdp = mean(gdp, na.rm = TRUE)) 

# Treemap 
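# rectangle size shows mean GDP; color shows mean tax revenue (% of GDP)
# treemap() is not a ggplot2 object, so no ggplot theme is added to it
treemap(data1, index="region", vSize="gdp", title = "Mean GDP and Tax Revenue as a Percentage of GDP by Region", vColor="tax_pct_gdp", type="manual",  palette="YlGnBu")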

From the treemap, we observe that certain regions, such as Northern America, have a high mean GDP but tax revenue that is only a small percentage of it. Conversely, regions like Australia and New Zealand have a relatively smaller mean GDP, yet tax revenue equal to a larger percentage of it. Additionally, mean GDPs for regions in Africa are notably low, represented by extremely small squares, in stark contrast with regions in the Americas and Asia.

The treemap is consistent with the World Population Review ranking. Northern America, which includes the United States, has the highest mean GDP of any region. Eastern Asia, where China is situated, appears to be the region with the second-highest mean GDP, possibly due to China’s significant economic influence. Western Europe, which includes Germany, a leading economic powerhouse in the region, also ranks highly. Because the treemap shows regional means rather than individual countries, it supports the country ranking only indirectly.

Health Expenditure per Capita in countries

Health expenditure per capita is the average amount of money spent on healthcare services per person within a specific population or region. It is calculated by dividing the total healthcare expenditure by the total population.
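The calculation can be sketched in a few lines. All figures below are hypothetical and chosen only to make the arithmetic easy to follow; the currency unit is also an assumption, since the dataset description does not state it. The second quantity shows the related share-of-GDP measure used elsewhere in this analysis.

```r
# Hypothetical figures for one country (not taken from the dataset)
total_health_expenditure <- 5e9    # total health spending
population               <- 2e6    # number of residents
gdp                      <- 1e11   # gross domestic product, same currency unit

# Per-capita spending: total expenditure divided by population
health_expenditure_capita <- total_health_expenditure / population
health_expenditure_capita    # 2500

# Spending as a percentage of GDP
health_expenditure_pct_gdp <- 100 * total_health_expenditure / gdp
health_expenditure_pct_gdp   # 5
```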

Visualization from Tableau Public: https://public.tableau.com/app/profile/merveille.kuendzong/viz/HealthExpenditureCapitainCountries/health_exp_capita

This visualization indicates that countries in Africa exhibit the lowest values of Health Expenditure per Capita compared to other continents, with Burundi having the lowest value globally and the United States having the highest.

Conclusion

In summary, our analysis reveals substantial economic disparities across regions and continents. We found correlations between variables such as self-employment and vulnerable employment. The boxplots illustrate significant variations in median self-employed percentages, with Europe generally showing lower rates compared to other continents, while Africa tends to have higher rates. Additionally, our findings indicate a unique trend for Africa, where unemployment rates decrease with increasing central government debt as a percentage of GDP, contrary to other continents.

Also, we observed contrasting patterns in GDP and tax revenue percentages, indicating diverse economic structures. Regions in Africa face distinct challenges with lower GDPs, while others like Northern America demonstrate robust economic performance. Moreover, we noticed stark differences in Health Expenditure per Capita, with African countries exhibiting the lowest values globally, and the United States leading with the highest.

Bibliography:

https://worldpopulationreview.com/countries/by-gdp