Project 2: World GDP and Population Growth, 2012-2016

Author

Michael Desir

Visualizing Growth in Population and GDP for 31 Countries, 2012-2016

The dynamics of our world are dependent on currency

What is this all about?

A country’s place on the world stage depends largely on what it produces and how consistently it can do so. One popular measure of production is a nation’s gross domestic product (GDP). GDP is a comprehensive measure of the value of all final goods and services produced within a country in a given period. It includes personal consumption, business investment, government spending, and net exports. In sum, GDP measures how much a country brings to the table. This project uses the following variables from a World Bank World Economic Indicators dataset:

year: Ranging between 2012 and 2016
country_name: From a group of 31 countries
country_code: A nation’s unique 3-letter ISO3C designation
region: World Bank regional classifications
population_total_millions: A nation’s total population, in millions
gdp_current_us_dollars_billions: A nation’s GDP in current (nominal) US dollars, in billions
military_spending_gdp_percentage: The percentage of a nation’s GDP that is attributed to military spending

Dataset: World Bank World Economic Indicators, https://databank.worldbank.org/embed/Population-and-GDP-by-Country/id/29c4df41

This dataset required mild cleaning. The column names needed to be stripped of economic codes and simplified for processing. To add a regional classification for each country, I merged in data from the Nations dataset (also from the World Bank) using a left join that matched every row for a country in my dataset to the region assigned to it in the Nations dataset. I then removed the aggregate (regional and world) rows so I could focus on individual countries. Because my dataset came in with metadata in the first and second columns, the cleaned columns were read in as text and had to be converted to numeric. After that, I could start designing models.

Essay Source 1: Kimberly Amadeo, https://www.thebalancemoney.com/components-of-gdp-explanation-formula-and-chart-3306015

Essay Source 2: U.S. Bureau of Economic Analysis, https://www.bea.gov/data/gdp/gross-domestic-product

Essay Source 3: Mark Mazzetti and David D. Kirkpatrick, https://www.nytimes.com/2015/03/26/world/middleeast/al-anad-air-base-houthis-yemen.html

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(ggalluvial)
library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(lubridate)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(extrafont)

Registering fonts with R

library(leaflet)
library(sf)

Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE

library(knitr)

setwd("C:/Users/desir_7411ic3/Desktop/Montgomery College/DATA110/Project 2/dataset")
gdp1 <- read_csv("Data_Main.csv")

New names:
Rows: 191 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(15): Time Code, Country Name, Country Code, ...5, GDP (current US$) [NY... dbl
(1): Time
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...5`

setwd("C:/Users/desir_7411ic3/Desktop/Montgomery College/DATA110/DATASETS-20240830T194929Z-001/DATASETS")
nations <- read_csv("nations.csv")

Rows: 5275 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): iso2c, iso3c, country, region, income
dbl (5): year, gdp_percap, population, birth_rate, neonat_mortal_rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Dataset Management

Restructure columns

gdp2 <- gdp1 %>%
  select(-"...5",-"Time Code",-"GDP per capita (constant LCU) [NY.GDP.PCAP.KN]") %>%
  rename("Year" = "Time") %>%
  rename("GDP current US dollars billions" = "GDP (current US$) [NY.GDP.MKTP.CD]") %>%
  rename("GDP per capita  thousands" = "GDP per capita (current US$) [NY.GDP.PCAP.CD]") %>%
  rename("Population total millions" = "Population, total [SP.POP.TOTL]") %>%
  rename("GDP per capita growth percentage" = "GDP per capita growth (annual %) [NY.GDP.PCAP.KD.ZG]") %>%
  rename("GDP per capita PPP intnl dollars thousands" = "GDP per capita, PPP (current international $) [NY.GDP.PCAP.PP.CD]") %>%
  rename("GDP PPP intnl dollars billions" = "GDP, PPP (current international $) [NY.GDP.MKTP.PP.CD]") %>%
  rename("GDP growth percentage" = "GDP growth (annual %) [NY.GDP.MKTP.KD.ZG]") %>%
  rename("GDP current local currency billions" = "GDP (current LCU) [NY.GDP.MKTP.CN]") %>%
  rename("Government spending GDP percentage" = "General government final consumption expenditure (% of GDP) [NE.CON.GOVT.ZS]") %>%
  rename("Military spending GDP percentage" = "Military expenditure (% of GDP) [MS.MIL.XPND.GD.ZS]") %>%
  select("Year","Country Name","Country Code","Population total millions", everything())

# Standardize column names: lower case, underscores instead of spaces
names(gdp2) <- tolower(names(gdp2))
names(gdp2) <- gsub(" ","_",names(gdp2))
# Drop the first row
gdp2 <- gdp2[-1, ]
head(gdp2,10)

# A tibble: 10 × 13
    year country_name country_code population_total_mil…¹ gdp_current_us_dolla…²
   <dbl> <chr>        <chr>        <chr>                  <chr>                 
 1  2012 China        CHN          1354.2                 8532                  
 2  2012 India        IND          1274.5                 1828                  
 3  2012 Indonesia    IDN          250.2                  918                   
 4  2012 Korea, Rep.  KOR          50.2                   1278                  
 5  2012 Saudi Arabia SAU          30.8                   742                   
 6  2012 Turkiye      TUR          75.2                   881                   
 7  2012 United King… GBR          63.7                   2707                  
 8  2012 United Stat… USA          313.9                  16254                 
 9  2012 Brunei Daru… BRN          0.4                    19                    
10  2012 Israel       ISR          7.9                    262                   
# ℹ abbreviated names: ¹population_total_millions,
#   ²gdp_current_us_dollars_billions
# ℹ 8 more variables: gdp_per_capita__thousands <chr>,
#   gdp_per_capita_growth_percentage <chr>,
#   gdp_per_capita_ppp_intnl_dollars_thousands <chr>,
#   gdp_ppp_intnl_dollars_billions <chr>, gdp_growth_percentage <chr>,
#   gdp_current_local_currency_billions <chr>, …

Join regions to GDP data

regions <- nations %>%
  select("iso3c","region") %>%
  rename("country_code" = "iso3c") %>%
  distinct(.keep_all = TRUE)
gdp_regions <- merge(gdp2, regions, by="country_code",all.x = TRUE)

Separate countries from regions/world in dataset

main <- gdp_regions[!is.na(gdp_regions$region),]
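One way to double-check this step is to look at which rows the left join left without a region before discarding them. The sketch below uses small made-up stand-ins (`gdp_toy`, `regions_toy`) for `gdp2` and `regions`; the `WLD` row is only an illustrative example of an aggregate, not a row confirmed to be in the dataset.

```r
# Toy stand-ins for gdp2 and regions (illustrative rows, not the real data)
gdp_toy <- data.frame(
  country_code = c("CHN", "IND", "WLD"),
  country_name = c("China", "India", "World"),
  stringsAsFactors = FALSE
)
regions_toy <- data.frame(
  country_code = c("CHN", "IND"),
  region = c("East Asia & Pacific", "South Asia"),
  stringsAsFactors = FALSE
)

# Left join: every row of gdp_toy is kept; unmatched codes get region = NA
joined <- merge(gdp_toy, regions_toy, by = "country_code", all.x = TRUE)

# Rows without a region are the ones the filter on is.na(region) removes
dropped <- joined[is.na(joined$region), ]
kept    <- joined[!is.na(joined$region), ]

print(dropped$country_name)
print(nrow(kept))
```

A real country whose code happened to be missing from the lookup would be dropped in exactly the same way, so inspecting `dropped` before filtering is a cheap safeguard.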

Light reorganization

main <- main %>%
  select("year","country_code","country_name","region",everything()) %>%
  arrange(country_name)

Convert character columns to numeric

# Columns that were read in as character and should be numeric
numeric_cols <- c("population_total_millions",
                  "gdp_current_us_dollars_billions",
                  "gdp_per_capita__thousands",
                  "gdp_per_capita_growth_percentage",
                  "gdp_per_capita_ppp_intnl_dollars_thousands",
                  "gdp_ppp_intnl_dollars_billions",
                  "gdp_growth_percentage",
                  "gdp_current_local_currency_billions",
                  "government_spending_gdp_percentage",
                  "military_spending_gdp_percentage")
main <- main %>%
  mutate(across(all_of(numeric_cols), as.numeric))
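Because `as.numeric()` turns any text it cannot parse into `NA`, it can be worth checking which values were lost in the conversion. The following is a minimal sketch on a toy vector; the `".."` placeholder is an assumption about how missing values might be encoded in the export, not something confirmed for this file.

```r
# Toy character column standing in for one of the columns read as <chr>;
# ".." is an assumed missing-value placeholder
raw <- c("1354.2", "1274.5", "..", "250.2")

# Convert, silencing the coercion warning so it can be inspected explicitly
converted <- suppressWarnings(as.numeric(raw))

# Values that were present as text but could not be parsed as numbers
unparsed <- raw[!is.na(raw) & is.na(converted)]

print(unparsed)
print(sum(is.na(converted)))
```

Running the same check on each converted column would show whether the `NA` values come from genuine gaps in the data or from formatting that needs cleaning first.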

Plots and Linear Regression Model

Filter linear regression model data for 2016 only

main_2016 <- main %>%
  filter(year == 2016) %>%
  select(year,country_name,country_code,region,population_total_millions,gdp_current_us_dollars_billions) %>%
  # GDP in billions / 1000 gives trillions of US dollars (despite the name gdp_mill)
  mutate(gdp_mill = gdp_current_us_dollars_billions / 1000)

Linear regression model

reg_model <- lm(gdp_mill ~ population_total_millions, data=main_2016)
summary(reg_model)


Call:
lm(formula = gdp_mill ~ population_total_millions, data = main_2016)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9499 -0.6233 -0.5638 -0.2956 16.3759 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)   
(Intercept)               0.578671   0.658537   0.879  0.38678   
population_total_millions 0.005727   0.001829   3.131  0.00396 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.353 on 29 degrees of freedom
Multiple R-squared:  0.2526,    Adjusted R-squared:  0.2268 
F-statistic:   9.8 on 1 and 29 DF,  p-value: 0.00396

Equation: GDP = 0.0057(population) + 0.579

(Please note that population was passed to the model in millions and GDP in trillions of US dollars, that is, billions divided by 1,000, so the results should be interpreted accordingly.)
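To make the units concrete: `gdp_mill` is GDP in billions divided by 1000, so fitted values come out in trillions of US dollars, while population is in millions. The sketch below plugs a hypothetical country of 100 million people into the fitted equation, using the coefficients copied from the model summary.

```r
# Coefficients copied from the summary(reg_model) output
intercept <- 0.578671   # trillions of US dollars
slope     <- 0.005727   # trillions of US dollars per million people

# Hypothetical country with 100 million people
population_millions <- 100
predicted_gdp_trillions <- intercept + slope * population_millions

print(predicted_gdp_trillions)          # 1.151371
print(predicted_gdp_trillions * 1000)   # the same value in billions: 1151.371
```

In other words, the model predicts roughly 5.7 billion US dollars of additional GDP for each additional million people.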

The slope for population has a p-value of 0.004, well below the conventional 0.05 threshold, which indicates that the observed association between population and GDP is unlikely to be due to chance alone.

This linear regression model provides insight into the relationship between a country’s population and its gross domestic product (GDP). The adjusted R-squared of 0.2268 means that population accounts for only about 23% of the variation in GDP, so the correlation is weak yet existent. The p-value of 0.00396 suggests that the model is statistically significant. It should be noted, however, that most points sit in a relatively small region of the graph, and the notable outliers (the United States, China, and India) may have an outsized influence on the fit.
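The summary statistics can be cross-checked by hand. As a small sketch, the code below recomputes the adjusted R-squared from the multiple R-squared and the sample size (31 countries, inferred from the 29 residual degrees of freedom plus two estimated coefficients), and derives the correlation coefficient that a one-predictor model implies.

```r
# Values copied from the summary(reg_model) output
r_squared <- 0.2526
n <- 31   # 29 residual degrees of freedom + 2 estimated coefficients
p <- 1    # one predictor (population)

# In simple linear regression, |r| is the square root of R-squared
r <- sqrt(r_squared)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(r, 2))              # 0.5
print(round(adj_r_squared, 4))  # 0.2268, matching the reported value
```

A correlation of about 0.5 sits at the boundary of what is often called weak to moderate, which fits the cautious reading given above.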

Compare 2016 population to GDP values

temp <- main_2016 %>%
  filter(population_total_millions < 300)

reg <- ggplot(temp,
              aes(population_total_millions,
                  gdp_current_us_dollars_billions,
                  color=region,
                  label = country_name)
              ) + 
  geom_smooth(color = "purple4", fill="lightgray") + 
  geom_point(size = 2) +
  labs(x = "Country Population in Millions",
       y = "Country GDP in Billions $",
       title = "Relating Population to GDP, 2016",
       subtitle = "The nations of the United States, China and India were removed from this plot because they were outliers such that all other points were indistinguishable.",
       caption = "World Bank World Economic Indicators",
       color = "Region") +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "aliceblue"),
    plot.title = element_text(color = "maroon",family="Times New Roman",face="bold"),
    panel.background = element_rect(fill = "#696969"),
    axis.text = element_text(color = "#3c4142"),
    axis.title = element_text(color = "#3c4142", size = 12),
    legend.text = element_text(color = "#3c4142"),
    legend.title = element_text(color = "#3c4142")
  ) +
  scale_color_manual(values=c("lightblue3", "maroon","green4","orange2","pink","purple"))
# The labs() caption is lost when the plot is passed to ggplotly(),
# so it is added back as a plotly annotation
ggplotly(reg) %>%
  layout(margin = list(l = 50, r = 50, b = 100, t = 50),
         annotations = list(x = 1, y = -0.3,
                            text = "From World Bank World Economic Indicators",
                            xref = 'paper', yref = 'paper',
                            xanchor = 'right', yanchor = 'auto',
                            xshift = 0, yshift = 0,
                            font = list(size = 11),
                            showarrow = FALSE)
         )

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

The nations of the United States, China and India were removed from this plot because they were outliers such that all other points were indistinguishable.

This plot is consistent with the weak correlation found by the linear regression model: across the full range shown, there is little correlation between a country’s population and its gross domestic product. What it does show, however, is that most of the countries in this dataset had relatively similar GDP values across populations of up to roughly 265 million people. There is also an observable correlation between the two variables among countries with 0-100 million people.

Create new dataset combining main with geo-data

mapped <- main_2016
mapped["latitude"] <- c(33.93911,
                        25.930414,
                        23.684994,
                        4.535277,
                        35.86166,
                        26.820553,
                        22.396428,
                        20.593684,
                        -0.789275,
                        32.427908,
                        33.223191,
                        31.046051,
                        30.585164,
                        35.907757,
                        29.31166,
                        33.854721,
                        4.210484,
                        43.750298,
                        21.512583,
                        30.3894,
                        12.879721,
                        25.354826,
                        23.885942,
                        1.352083,
                        -30.559482,
                        7.873054,
                        15.870032,
                        38.963745,
                        23.424076,
                        55.378051,
                        37.09024
                        )
mapped["longitude"] <- c(67.709953,
                         50.637773,
                         90.356331,
                         114.727669,
                         104.195397,
                         30.802498,
                         114.109497,
                         78.96288,
                         113.921327,
                         53.688046,
                         43.679291,
                         34.851612,
                         36.238414,
                         127.766922,
                         47.481766,
                         35.862285,
                         101.975766,
                         7.412841,
                         55.923255,
                         69.3532,
                         121.774017,
                         51.183884,
                         45.079162,
                         103.819836,
                         22.937506,
                         80.771797,
                         100.992541,
                         35.243322,
                         53.847818,
                         -3.435973,
                         -95.712891
                         )
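Assigning coordinates by position works only as long as the row order of `main_2016` matches the order of the two vectors. A more defensive variant, sketched below under the assumption that the coordinates are kept in a small lookup table keyed by `country_code`, joins them on the code instead. The data frames `countries` and `coords` are toy stand-ins, and the population figures are illustrative.

```r
# Toy stand-in for main_2016 (population values are illustrative)
countries <- data.frame(
  country_code = c("USA", "GBR", "IND"),
  population_total_millions = c(323, 66, 1325),
  stringsAsFactors = FALSE
)

# Coordinates keyed by ISO3 code (a subset of the values listed above)
coords <- data.frame(
  country_code = c("IND", "GBR", "USA"),
  latitude  = c(20.593684, 55.378051, 37.09024),
  longitude = c(78.96288, -3.435973, -95.712891),
  stringsAsFactors = FALSE
)

# Join on the code, so row order no longer matters
mapped_toy <- merge(countries, coords, by = "country_code", all.x = TRUE)

# A country missing from the lookup would show up as NA instead of shifting the rest
stopifnot(!anyNA(mapped_toy$latitude))
print(mapped_toy)
```

With this layout, re-sorting the data or adding a country cannot silently pair a nation with another nation's coordinates.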

Create GIS map using 2016 data with a pop-up

map_popup <- paste0(
  "<b>Country: </b>", mapped$country_name, "<br>",
  "<b>Population (millions): </b>", mapped$population_total_millions, "<br>",
  "<b>GDP ($ billions): </b>", mapped$gdp_current_us_dollars_billions, "<br>",
  "<b>Region: </b>", mapped$region, "<br>"
)
map_w_popup <- leaflet() |>
  setView(lat = 18, lng = 80, zoom = 2) |>
  addProviderTiles("Esri.WorldImagery") |>
  addCircles(
    data = mapped,
    color = "#3e095e",
    popup = map_popup
  )

Assuming "longitude" and "latitude" are longitude and latitude, respectively

map_w_popup

First, this map exposes the bias in this dataset. What I didn’t notice at the outset was that the providers of this dataset did not note why or how the data was collected or how it should be used. The map makes it evident that virtually all of the data points are in Asia and the Middle East, with only a few notable world powers (and Monaco) outside these regions. This selection is not representative of the world GDP rankings at any time in history.

What is notably missing from this map is variation in the size of the bubbles based on some variable, like population or GDP. Every formula I tried that leaflet would accept made the United States’ circle approximately the size of the plot and rendered the rest of the pop-ups useless.
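One possible way around the sizing problem, offered here only as a sketch and not as what the map above does, is to switch from `addCircles()` (radius in meters) to `addCircleMarkers()` (radius in pixels) and to compress the range with a square-root scale. The helper `scale_radius()` and the data frame `toy` are hypothetical, and the GDP figures are illustrative.

```r
# Toy stand-in for `mapped`: GDP in billions of US dollars (illustrative values)
toy <- data.frame(
  country_name = c("United States", "India", "Brunei Darussalam"),
  latitude  = c(37.09024, 20.593684, 4.535277),
  longitude = c(-95.712891, 78.96288, 114.727669),
  gdp_billions = c(18000, 2300, 11),
  stringsAsFactors = FALSE
)

# Map GDP to a pixel radius between min_px and max_px.
# The square root compresses large values so that circle area,
# rather than radius, roughly tracks GDP.
scale_radius <- function(x, min_px = 4, max_px = 30) {
  min_px + (max_px - min_px) * sqrt(x / max(x, na.rm = TRUE))
}
toy$radius_px <- scale_radius(toy$gdp_billions)

# Pixel radii keep the largest bubble capped at max_px at every zoom level
if (requireNamespace("leaflet", quietly = TRUE)) {
  bubble_map <- leaflet::leaflet(toy) |>
    leaflet::addTiles() |>
    leaflet::addCircleMarkers(
      lng = ~longitude, lat = ~latitude,
      radius = ~radius_px,
      popup = ~country_name
    )
}
```

The bounds of 4 and 30 pixels are arbitrary choices; a logarithmic scale would flatten the differences further if the square root still lets the largest economies dominate.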

Plot Middle East military spending

military <- main %>%
  filter(region == "Middle East & North Africa")
military <- military[order(military$year),]
sau <- military %>%
  filter(country_code == "SAU")
omn <- military %>%
  filter(country_code == "OMN")
are <- military %>%
  filter(country_code == "ARE")
kwt <- military %>%
  filter(country_code == "KWT")
isr <- military %>%
  filter(country_code == "ISR")
lbn <- military %>%
  filter(country_code == "LBN")

paints <- c("#87cefa", "red","green","orange","pink","purple")
text_col <- list(color = "black",fontWeight = "bold")
highchart() |>
  hc_title(text = "Middle East Military Spending % of GDP",
           style = list(color = "black",fontWeight = "bold",fontSize=20)) |>
  hc_subtitle(text = "This plot visualizes changes in military spending as a percentage of total GDP <br>for the Middle East's biggest spenders between 2012 and 2016.",align = "right") |>
  hc_yAxis(title = list(text = "Military Spending % of GDP",
                        style = text_col)) |>
  hc_caption(text = "From World Bank World Economic Indicators",align="right") |>
  hc_add_series(data = sau$military_spending_gdp_percentage,
                name = "Saudi Arabia",
                type = "line",
                yAxis = 0) |>
  hc_add_series(data = omn$military_spending_gdp_percentage,
                name = "Oman",
                type = "line",
                yAxis = 0) |>
  hc_add_series(data = are$military_spending_gdp_percentage,
                name = "United Arab Emirates",
                type = "line",
                yAxis = 0) |>
  hc_add_series(data = kwt$military_spending_gdp_percentage,
                name = "Kuwait",
                type = "line",
                yAxis = 0) |>
  hc_add_series(data = isr$military_spending_gdp_percentage,
                name = "Israel",
                type = "line",
                yAxis = 0) |>
  hc_add_series(data = lbn$military_spending_gdp_percentage,
                name = "Lebanon",
                type = "line",
                yAxis = 0) |>
  hc_xAxis(categories = sau$year,
           tickInterval = 1,
           title = list(text = "Year")) |>
  hc_colors(paints) |>
  hc_legend(
    backgroundColor = "black",
    borderRadius = 11,
    itemStyle = list(
      color = "snow"
    )
  ) |>
  hc_add_theme(hc_theme(chart = list(backgroundColor = 'snow')))
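The six per-country subsets and six `hc_add_series()` calls could likely be collapsed into a single call by letting highcharter group the rows itself. The sketch below shows the idea with `hchart()` and `hcaes(group = ...)` on a toy data frame; the country names and percentages are made up, and styling is left out.

```r
# Toy stand-in for the `military` data frame (percentages are made up)
military_toy <- data.frame(
  year = rep(2012:2014, times = 2),
  country_name = rep(c("Country A", "Country B"), each = 3),
  military_spending_gdp_percentage = c(7.5, 8.0, 9.1, 10.2, 11.0, 12.3),
  stringsAsFactors = FALSE
)

if (requireNamespace("highcharter", quietly = TRUE)) {
  library(highcharter)
  # One line series per country, with year on the x-axis
  spending_chart <- hchart(
    military_toy, "line",
    hcaes(x = year,
          y = military_spending_gdp_percentage,
          group = country_name)
  ) |>
    hc_title(text = "Military Spending % of GDP")
}
```

Because each point carries its own year, a country with a missing year would not be shifted against the x-axis, which can happen when bare value vectors are matched to `sau$year` by position.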

This plot shows that Saudi Arabia and Oman led the region in military spending as a share of national GDP. Saudi Arabia’s share peaked above the rest in 2015, when it led a military intervention in Yemen. Oman was forced to invest a large amount of money relative to its GDP because of Yemen’s civil war and the resulting instability on their shared border.

In conclusion: This project brought me into the study of economic growth. Though this dataset was not as comprehensive as I had hoped, I was able to find a weak-yet-existent correlation between a country’s population and its GDP. I was also able to study changes in the Middle East’s military spending as a percentage of gross domestic product throughout a tumultuous period. I wish that I had been able to find a formula for my GIS plot that could accurately and intuitively represent national GDPs.