World Population Dataset

Discussion’s author: Matthew Roland

Context:

As of June 2019, the global population stood at approximately 7.58 billion. India’s population was projected to surpass China’s by 2030, which would make it the most populous country. While most of the world’s most populous countries are growing, Russia and Japan are projected to see their populations decline by 2050.

Content:

The dataset provides population figures for 234 countries and territories at selected years from 1970 to 2022, along with attributes such as area, density, continent, capital name, and growth rate.

Questions:

  1. How have population rates changed by year, and are there variations by country or region?
  2. Can we find associations between the trends in population data and other country-specific datasets, such as GDP, poverty, and environmental data?

The discussion suggests an interest in understanding population trends over time and also potentially exploring relationships between population and other country-specific metrics.

Analysis:

# tidyverse attaches dplyr, tidyr, ggplot2, purrr, readr, and stringr
library(tidyverse)
library(scales)
url1 <- "https://raw.githubusercontent.com/hbedros/data607_prj2/main/df3/world_population.csv"

world_data <- read_csv(url1)
## Rows: 234 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): CCA3, Country/Territory, Capital, Continent
## dbl (13): Rank, 2022 Population, 2020 Population, 2015 Population, 2010 Popu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# rename columns to consistent names
world_data <- world_data %>%
  rename(
    rank = Rank,
    cca3 = CCA3,
    country = `Country/Territory`,
    capital = Capital,
    continent = Continent,
    pop_2022 = `2022 Population`,
    pop_2020 = `2020 Population`,
    pop_2015 = `2015 Population`,
    pop_2010 = `2010 Population`,
    pop_2000 = `2000 Population`,
    pop_1990 = `1990 Population`,
    pop_1980 = `1980 Population`,
    pop_1970 = `1970 Population`,
    area = `Area (km²)`,
    density = `Density (per km²)`,
    growth_rate = `Growth Rate`,
    world_pop_percentage = `World Population Percentage`
  )

# remove rows with any missing values
world_data <- world_data %>%
  drop_na()

# Remove duplicates
world_data <- world_data %>%
  distinct()

1. How have population rates changed by year, and are there variations by country or region?

tidy_data <- world_data %>%
  select(country, continent, starts_with("pop_")) %>%
  pivot_longer(
    cols = starts_with("pop_"),
    names_to = "year",
    names_prefix = "pop_",
    names_transform = list(year = as.numeric),
    values_to = "population"
  )

head(tidy_data)
ggplot(tidy_data, aes(x = year, y = population, color = continent)) +
  geom_line(aes(group = country), alpha = 0.6) +
  labs(
    title = "Population Change by Year and Continent",
    x = "Year",
    y = "Population",
    color = "Continent"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)

# Let's plot top 10 countries by 2022 population for clarity
top_countries <- world_data %>%
  arrange(desc(pop_2022)) %>%
  slice(1:10) %>%
  pull(country)

filtered_data <- tidy_data %>%
  filter(country %in% top_countries)

ggplot(filtered_data, aes(x = year, y = population, color = country)) +
  geom_line(aes(group = country), alpha = 0.8) +
  labs(
    title = "Population Change by Year for Top 10 Countries",
    x = "Year",
    y = "Population",
    color = "Country"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_y_continuous(labels = scales::comma)

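# Total percentage change in population from the earliest year (1970) to the latest (2022).
# The endpoints are selected by year rather than by row order, so the result
# does not depend on how the rows happen to be sorted.
growth_data <- tidy_data %>%
  group_by(country, continent) %>%
  summarize(
    growth_rate = (population[which.max(year)] - population[which.min(year)]) /
      population[which.min(year)] * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(growth_rate))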

head(growth_data)
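The growth rate above is a total percentage change over the whole 1970 to 2022 span, which is easy to misread as a multiple or as a yearly rate. The following is a minimal sketch of how the three quantities relate. The helper names (`pct_change`, `growth_multiple`, `annual_rate`) and the population figures are hypothetical and are not part of the original analysis.

```r
# Hypothetical helpers for illustration; not part of the original analysis.

# Total percentage change between a start and an end value.
pct_change <- function(start, end) {
  (end - start) / start * 100
}

# Multiple of the starting value implied by a percentage change.
growth_multiple <- function(pct) {
  1 + pct / 100
}

# Compound annual growth rate (in percent) over a span of years.
annual_rate <- function(start, end, years) {
  ((end / start)^(1 / years) - 1) * 100
}

# Made-up population growing from 10 million to 40 million over 52 years.
start_pop <- 10e6
end_pop <- 40e6

total_pct <- pct_change(start_pop, end_pop)        # 300
multiple <- growth_multiple(total_pct)             # 4
yearly_pct <- annual_rate(start_pop, end_pop, 52)  # roughly 2.7
```

In this made-up case a 300% increase means the population is four times its starting size, which corresponds to compound growth of roughly 2.7% per year over 52 years.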
# Unweighted mean of the country-level growth rates within each continent:
continent_growth <- growth_data %>%
  group_by(continent) %>%
  summarize(
    avg_growth_rate = mean(growth_rate, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(-avg_growth_rate)

print(continent_growth)
## # A tibble: 6 × 2
##   continent     avg_growth_rate
##   <chr>                   <dbl>
## 1 Africa                  300. 
## 2 Asia                    300. 
## 3 South America           145. 
## 4 North America           142. 
## 5 Oceania                 113. 
## 6 Europe                   30.2
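The table above averages country-level rates, so a small territory counts as much as a large country. That can differ noticeably from the growth of a continent's total population. The sketch below contrasts the two calculations on a made-up table; the continents, countries, and figures in `toy` are invented for illustration and are not taken from the dataset.

```r
library(dplyr)

# Made-up figures for illustration only; not taken from the dataset.
toy <- tibble(
  continent = c("A", "A", "B", "B"),
  country   = c("A1", "A2", "B1", "B2"),
  pop_1970  = c(1e6, 100e6, 5e6, 5e6),
  pop_2022  = c(5e6, 150e6, 10e6, 15e6)
)

comparison <- toy %>%
  mutate(growth_rate = (pop_2022 - pop_1970) / pop_1970 * 100) %>%
  group_by(continent) %>%
  summarize(
    # Unweighted mean: every country counts equally
    mean_of_rates = mean(growth_rate),
    # Aggregate: growth of the continent's summed population
    aggregate_rate = (sum(pop_2022) - sum(pop_1970)) / sum(pop_1970) * 100,
    .groups = "drop"
  )

print(comparison)
```

In this toy table, continent A has a mean of country rates of 225% but an aggregate rate of about 53%, because its small, fast-growing country dominates the unweighted mean. Continent B gives 150% under both calculations because its two countries start at the same size.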
# now let's visualize the growth
# Boxplot for growth rate by continent
ggplot(growth_data, aes(x = continent, y = growth_rate, fill = continent)) +
  geom_boxplot() +
  labs(
    title = "Variation in Growth Rate by Continent",
    x = "Continent",
    y = "Growth Rate (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

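Interpretation of Population Growth Rate by Continent (1970 to 2022):

  • Africa & Asia Lead: Countries on both continents show the highest growth, averaging about a 300% increase, which corresponds to a population roughly four times its 1970 size.
  • Moderate Growth in Americas & Oceania: Average growth ranges from 113% to 145%, meaning populations slightly more than doubled.
  • Europe Lags Behind: Europe’s average growth is the slowest at 30.2%, a small increase compared with the other continents.
  • Caveat: These figures are unweighted averages of country-level growth rates, so small countries count as much as large ones and the values are not the growth of each continent’s total population.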
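The second question, on associations with other country-level data, is not explored in the analysis above. One possible starting point, sketched below under stated assumptions, is to join an external table on the ISO alpha-3 code (the `cca3` column) and check for unmatched countries before computing any association. The `gdp` table, its `gdp_per_capita` column, and all values here are hypothetical stand-ins, not real data.

```r
library(dplyr)

# Stand-in for the population data: only the columns needed here, made-up values.
pop_toy <- tibble(
  cca3 = c("AAA", "BBB", "CCC", "DDD"),
  growth_rate_pct = c(300, 150, 30, 90)
)

# Hypothetical external table keyed by the same ISO alpha-3 code; values are made up.
gdp <- tibble(
  cca3 = c("AAA", "BBB", "CCC"),
  gdp_per_capita = c(1500, 9000, 42000)
)

# Keep every country from the population table, matched or not.
joined <- pop_toy %>%
  left_join(gdp, by = "cca3")

# Countries with no match in the external table, worth reviewing before analysis.
unmatched <- joined %>%
  filter(is.na(gdp_per_capita))

# Spearman rank correlation on the matched rows only.
rho <- cor(
  joined$growth_rate_pct, joined$gdp_per_capita,
  method = "spearman", use = "complete.obs"
)
```

A rank correlation is used in this sketch because population and GDP figures tend to be heavily skewed; any such association is descriptive and does not by itself indicate causation.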