World Population Dataset

Discussion’s author: Matthew Roland

Context:

As of June 2019, the global population stood at approximately 7.58 billion. India’s population is on track to surpass China’s by 2030, which would make India the most populous country. While most of the world’s most populous countries are still growing, the populations of Russia and Japan are projected to decline by 2050.

Content:

The dataset provides historical population details for all countries, including metrics like area size, continent, capital name, and growth rate.

Questions:

How have population rates changed by year, and are there variations by country or region?
Can we find associations between the trends in population data and other country-specific datasets, such as GDP, poverty, and environmental data?

The discussion suggests an interest in understanding population trends over time and also potentially exploring relationships between population and other country-specific metrics.

Analysis:

library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(purrr)
library(scales)

url1 <- "https://raw.githubusercontent.com/hbedros/data607_prj2/main/df3/world_population.csv"

world_data <- read_csv(url1)

## Rows: 234 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): CCA3, Country/Territory, Capital, Continent
## dbl (13): Rank, 2022 Population, 2020 Population, 2015 Population, 2010 Popu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# rename columns to consistent names
world_data <- world_data %>%
  rename(
    rank = Rank,
    cca3 = CCA3,
    country = `Country/Territory`,
    capital = Capital,
    continent = Continent,
    pop_2022 = `2022 Population`,
    pop_2020 = `2020 Population`,
    pop_2015 = `2015 Population`,
    pop_2010 = `2010 Population`,
    pop_2000 = `2000 Population`,
    pop_1990 = `1990 Population`,
    pop_1980 = `1980 Population`,
    pop_1970 = `1970 Population`,
    area = `Area (km²)`,
    density = `Density (per km²)`,
    growth_rate = `Growth Rate`,
    world_pop_percentage = `World Population Percentage`
  )

# remove rows with any missing values
world_data <- world_data %>%
  drop_na()

# Remove duplicates
world_data <- world_data %>%
  distinct()

1. How have population rates changed by year, and are there variations by country or region?

tidy_data <- world_data %>%
  select(country, continent, starts_with("pop_")) %>%
  gather(key = "year", value = "population", -country, -continent) %>%
  mutate(year = as.numeric(str_extract(year, "\\d{4}")))

head(tidy_data)

ggplot(tidy_data, aes(x = year, y = population, color = continent)) +
  geom_line(aes(group = country), alpha = 0.6) +
  labs(
    title = "Population Change by Year and Continent",
    x = "Year",
    y = "Population",
    color = "Continent"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)

# Let's plot top 10 countries by 2022 population for clarity
top_countries <- world_data %>%
  arrange(desc(pop_2022)) %>%
  slice(1:10) %>%
  pull(country)

filtered_data <- tidy_data %>%
  filter(country %in% top_countries)

ggplot(filtered_data, aes(x = year, y = population, color = country)) +
  geom_line(aes(group = country), alpha = 0.8) +
  labs(
    title = "Population Change by Year for Top 10 Countries",
    x = "Year",
    y = "Population",
    color = "Country"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_y_continuous(labels = scales::comma)

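# let's calculate the growth rate:
# percent change in population from the earliest year (1970) to the latest (2022).
# Years are matched explicitly so the result does not depend on row order.
growth_data <- tidy_data %>%
  group_by(country, continent) %>%
  summarize(
    growth_rate = (population[year == max(year)] - population[year == min(year)]) /
      population[year == min(year)] * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(growth_rate))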

head(growth_data)

# Checking the average growth rate by continent:
continent_growth <- growth_data %>%
  group_by(continent) %>%
  summarize(
    avg_growth_rate = mean(growth_rate, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(-avg_growth_rate)

print(continent_growth)

## # A tibble: 6 × 2
##   continent     avg_growth_rate
##   <chr>                   <dbl>
## 1 Africa                  300. 
## 2 Asia                    300. 
## 3 South America           145. 
## 4 North America           142. 
## 5 Oceania                 113. 
## 6 Europe                   30.2

# now let's visualize the growth
# Boxplot for growth rate by continent
ggplot(growth_data, aes(x = continent, y = growth_rate, fill = continent)) +
  geom_boxplot() +
  labs(
    title = "Variation in Growth Rate by Continent",
    x = "Continent",
    y = "Growth Rate (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation of the Box Plot on Population Growth Rate by Continent:

Africa & Asia Lead: Both continents show the highest growth. Their average country-level growth rate from 1970 to 2022 is about 300%, which means that, on average, a country’s population roughly quadrupled.
Moderate Growth in Americas & Oceania: Average growth ranges from 113% to 145%, meaning populations more than doubled.
Europe Lags Behind: Europe’s average growth is the slowest at 30.2%, a small increase compared with the other continents.
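
The figures above are means of country-level growth rates, which is not the same as the growth of a continent's total population. The sketch below uses a hypothetical two-country continent (`toy_growth`, with made-up numbers) to show how the two measures can diverge and how a percent increase converts to a multiplier. It is only an illustration, not a result from the dataset.

```r
library(dplyr)

# Hypothetical data: one small fast-growing country, one large slow-growing one
toy_growth <- tibble::tibble(
  country   = c("Small", "Large"),
  continent = "Toyland",
  pop_1970  = c(1e5, 1e8),
  pop_2022  = c(1e6, 1.5e8)
)

toy_summary <- toy_growth %>%
  # Percent change per country: (new - old) / old * 100
  mutate(growth_pct = (pop_2022 - pop_1970) / pop_1970 * 100) %>%
  group_by(continent) %>%
  summarise(
    # Mean of the country-level rates (what continent_growth reports)
    mean_country_growth = mean(growth_pct),
    # Growth of the continent's total population
    aggregate_growth = (sum(pop_2022) - sum(pop_1970)) / sum(pop_1970) * 100,
    .groups = "drop"
  ) %>%
  # A p% increase multiplies the population by 1 + p / 100
  mutate(mean_multiplier = 1 + mean_country_growth / 100)

print(toy_summary)
```

Here the small country dominates the mean even though it barely affects the total, so the mean rate is far higher than the aggregate rate. The same effect may be present in the continent averages above.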

2. Can we find associations between the trends in population data and other country-specific datasets, such as GDP, poverty, and environmental data?

I will use GDP data from an external website (the link is not preserved here). A copy of the CSV is read from the project repository below.

url2 <- "https://raw.githubusercontent.com/hbedros/data607_prj2/main/df3/gdp_csv.csv"

world_gdp <- read_csv(url2)

## Rows: 11507 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country Name, Country Code
## dbl (2): Year, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Filtering data for the specified years
world_gdp <- world_gdp %>%
  filter(Year %in% c(1970, 1980, 1990, 2000, 2010, 2015, 2020, 2022)) %>%
  select(`Country Name`, `Country Code`, Year, Value)

head(world_gdp)  # Display the first few rows for verification

# Extracting unique values of the 'Country Name' column
unique_countries <- unique(world_gdp$`Country Name`)

# Create a named list that maps each continent to its countries
country_to_continent <- list(
  
  "Asia" = c("Afghanistan", "Armenia", "Azerbaijan", "Bahrain", "Bangladesh", "Bhutan", 
             "Brunei Darussalam", "Cambodia", "China", "Cyprus", "Georgia", "Hong Kong SAR, China",
             "India", "Indonesia", "Iran, Islamic Rep.", "Iraq", "Israel", "Japan", 
             "Jordan", "Kazakhstan", "Korea, Rep.", "Kuwait", "Kyrgyz Republic", "Lao PDR", 
             "Lebanon", "Macao SAR, China", "Malaysia", "Maldives", "Mongolia", "Myanmar",
             "Nepal", "Oman", "Pakistan", "Palestine", "Philippines", "Qatar", "Russian Federation", 
             "Saudi Arabia", "Singapore", "Sri Lanka", "Syrian Arab Republic", "Taiwan", "Tajikistan", 
             "Thailand", "Timor-Leste", "Turkey", "Turkmenistan", "United Arab Emirates", "Uzbekistan", "Vietnam", "Yemen, Rep."),
  
  # Greenland is listed only under North America so that each country maps to a single continent
  "Europe" = c("Albania", "Andorra", "Austria", "Belarus", "Belgium", "Bosnia and Herzegovina",
               "Bulgaria", "Croatia", "Czech Republic", "Denmark", "Estonia", "Faroe Islands", 
               "Finland", "France", "Germany", "Greece", "Hungary", "Iceland",
               "Ireland", "Isle of Man", "Italy", "Kosovo", "Latvia", "Liechtenstein", "Lithuania",
               "Luxembourg", "Macedonia, FYR", "Malta", "Moldova", "Monaco", "Montenegro", "Netherlands",
               "Norway", "Poland", "Portugal", "Romania", "San Marino", "Serbia", "Slovak Republic", 
               "Slovenia", "Spain", "Sweden", "Switzerland", "Ukraine", "United Kingdom", "Vatican City"),
  
  "Africa" = c("Algeria", "Angola", "Benin", "Botswana", "Burkina Faso", "Burundi", 
               "Cabo Verde", "Cameroon", "Central African Republic", "Chad", 
               "Comoros", "Congo, Dem. Rep.", "Congo, Rep.", "Djibouti", "Egypt, Arab Rep.", 
               "Equatorial Guinea", "Eritrea", "Eswatini", "Ethiopia", "Gabon", "Gambia, The", 
               "Ghana", "Guinea", "Guinea-Bissau", "Ivory Coast", "Kenya", "Lesotho", 
               "Liberia", "Libya", "Madagascar", "Malawi", "Mali", "Mauritania", "Mauritius", 
               "Morocco", "Mozambique", "Namibia", "Niger", "Nigeria", "Rwanda", "Sao Tome and Principe", 
               "Senegal", "Seychelles", "Sierra Leone", "Somalia", "South Africa", "South Sudan", 
               "Sudan", "Tanzania", "Togo", "Tunisia", "Uganda", "Zambia", "Zimbabwe"),

  "Oceania" = c("Australia", "Fiji", "Kiribati", "Marshall Islands", "Micronesia, Fed. Sts.", 
                "Nauru", "New Zealand", "Palau", "Papua New Guinea", "Samoa", 
                "Solomon Islands", "Tonga", "Tuvalu", "Vanuatu"),

  "North America" = c("Antigua and Barbuda", "Bahamas, The", "Barbados", "Belize", "Canada", 
                      "Costa Rica", "Cuba", "Dominica", "Dominican Republic", "El Salvador", 
                      "Greenland", "Grenada", "Guatemala", "Haiti", "Honduras", "Jamaica", 
                      "Mexico", "Nicaragua", "Panama", "Puerto Rico", "Saint Kitts and Nevis", 
                      "Saint Lucia", "Saint Vincent and the Grenadines", "Trinidad and Tobago", 
                      "United States"),

  "South America" = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", 
                      "Guyana", "Paraguay", "Peru", "Suriname", "Uruguay", "Venezuela, RB")
)

# Adding a 'continents' column to the world_gdp dataframe based on the country_to_continent list
world_gdp <- world_gdp %>%
  mutate(continents = map_chr(`Country Name`, ~ {
    continents <- names(which(sapply(country_to_continent, function(countries) .x %in% countries)))
    if(length(continents) == 0) return(NA_character_)  # If the country isn't in our list, it'll return NA
    paste(continents, collapse=", ")  # Combine continent names into a single string if there are multiple matches
  }))


# Removing rows with NA in the 'continents' column
world_gdp <- world_gdp %>%
  filter(!is.na(continents))
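
Because the mapping is maintained by hand, two things can go wrong silently: a name listed under two continents gets a combined label such as "Europe, North America", and a name spelled differently from the GDP file is dropped by the NA filter. The sketch below shows one way to check for both, using a hypothetical miniature list (`toy_map`) and a hypothetical GDP table (`toy_gdp`) in place of the real objects.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical miniature of country_to_continent, with one deliberate duplicate
toy_map <- list(
  "Europe"        = c("France", "Greenland"),
  "North America" = c("Canada", "Greenland")
)

# Flatten the list into a two-column lookup table: continent, country
lookup <- enframe(toy_map, name = "continent", value = "country") %>%
  unnest(country)

# Countries that appear under more than one continent
dupes <- lookup %>%
  count(country) %>%
  filter(n > 1) %>%
  pull(country)

# Hypothetical GDP names; "World" stands in for an aggregate or misspelled row
toy_gdp <- tibble(`Country Name` = c("France", "Canada", "World"))

# Names in the GDP data that the lookup does not cover
unmatched <- setdiff(toy_gdp$`Country Name`, lookup$country)

print(dupes)
print(unmatched)
```

With a clean lookup table like this, the continent could also be attached with a `left_join()` on the country name instead of `map_chr()`, which is a design choice rather than a requirement.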

# Calculating the sum of GDP by continent and year
sum_gdp_by_continent <- world_gdp %>%
  group_by(continents, Year) %>%
  summarise(sum_gdp = sum(Value, na.rm = TRUE), .groups = "drop") %>%
  rename(continent = continents)

# View the resulting dataframe
head(sum_gdp_by_continent)

# Total population per continent for the years used in the GDP comparison (1970 to 2015)
sum_world_data <- world_data %>%
  group_by(continent) %>%
  summarise(
    pop_1970 = sum(pop_1970, na.rm = TRUE),
    pop_1980 = sum(pop_1980, na.rm = TRUE),
    pop_1990 = sum(pop_1990, na.rm = TRUE),
    pop_2000 = sum(pop_2000, na.rm = TRUE),
    pop_2010 = sum(pop_2010, na.rm = TRUE),
    pop_2015 = sum(pop_2015, na.rm = TRUE)
  )

# reshape the df from wide to long
long_sum_world_data <- sum_world_data %>%
  pivot_longer(
    cols = starts_with("pop_"), 
    names_to = "Year", 
    values_to = "Population",
    names_transform = list(Year = ~substr(., 5, 8))
  )

long_sum_world_data$Year <- as.numeric(long_sum_world_data$Year)

pop_gdp_data <- left_join(sum_gdp_by_continent, long_sum_world_data, by = c("continent", "Year"))

# View the merged data
head(pop_gdp_data)

# Clean the df from NA values
pop_gdp_data <- filter(pop_gdp_data, !is.na(Population), !is.na(sum_gdp))
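
One simple way to put a number on the association is a per-continent Pearson correlation between population and GDP. The sketch below runs on a hypothetical table (`toy_pop_gdp`) with the same column names as `pop_gdp_data`; the values are made up. With only a handful of years, and with both series trending over time, a high coefficient should be read with caution and says nothing about causation.

```r
library(dplyr)

# Hypothetical stand-in for pop_gdp_data (same column names, made-up values)
toy_pop_gdp <- tibble::tibble(
  continent  = rep(c("A", "B"), each = 4),
  Year       = rep(c(1970, 1980, 1990, 2000), times = 2),
  Population = c(10, 20, 30, 40, 5, 6, 7, 8),
  sum_gdp    = c(100, 200, 300, 400, 80, 60, 40, 20)
)

# Pearson correlation between population and GDP within each continent
cor_by_continent <- toy_pop_gdp %>%
  group_by(continent) %>%
  summarise(
    n_years     = n(),
    pop_gdp_cor = cor(Population, sum_gdp),
    .groups     = "drop"
  )

print(cor_by_continent)
```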

# Plot pop_gdp_data to see whether there is a correlation between GDP growth and population growth
# The data we have only covers the following years: 1970, 1980, 1990, 2000, 2010, 2015
# Both series are plotted in their raw units, so they share one y axis despite different scales
ggplot(data = pop_gdp_data, aes(x = Year)) +
  geom_line(aes(y = Population, color = "Population"), linewidth = 1) +
  geom_line(aes(y = sum_gdp, color = "GDP"), linewidth = 1) +
  labs(title = "Population and GDP Trends by Continent from 1970 to 2015",
       y = "Value (raw units)",
       x = "Year",
       color = "Metric") +
  theme_minimal() +
  facet_wrap(~continent)

Conclusion:
The plots aren’t clear because population and GDP are on very different scales. I tried facet_wrap to fix it, but it didn’t help, because faceting separates the continents rather than the two metrics. Any suggestions on how to handle this effectively are very much appreciated.
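
One possible way around the scale problem is to index both series to their first year (first year = 100), so that the plot compares relative growth instead of raw magnitudes. The sketch below assumes a table shaped like `pop_gdp_data` and uses a hypothetical stand-in (`toy`) with made-up values. A log-scaled y axis, or faceting by metric with `scales = "free_y"`, are alternatives.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for pop_gdp_data (same column names, made-up values)
toy <- tibble::tibble(
  continent  = rep(c("A", "B"), each = 3),
  Year       = rep(c(1970, 1990, 2010), times = 2),
  Population = c(100, 150, 200, 50, 60, 75),
  sum_gdp    = c(1e9, 3e9, 8e9, 2e9, 3e9, 5e9)
)

indexed <- toy %>%
  # One row per continent, year, and metric
  pivot_longer(c(Population, sum_gdp), names_to = "metric", values_to = "value") %>%
  group_by(continent, metric) %>%
  arrange(Year, .by_group = TRUE) %>%
  # Rescale each series so that its first year equals 100
  mutate(index = value / first(value) * 100) %>%
  ungroup()

p <- ggplot(indexed, aes(x = Year, y = index, color = metric)) +
  geom_line(linewidth = 1) +
  facet_wrap(~continent) +
  labs(y = "Index (first year = 100)", x = "Year", color = "Metric") +
  theme_minimal()

print(p)
```

After indexing, both lines start at 100 in every panel, so a steeper line simply means faster relative growth for that metric.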

Project 2 - Part C

Haig Bedros

2023-10-08