The World Population Dataset provides population data for countries and territories from 1970 to 2022. It includes key variables such as area, population density per square kilometer, growth rate, and the percentage share of the global population. Understanding how to work with population data is essential for making accurate predictions and building effective models.
Both tidyr and dplyr are part of the tidyverse and play crucial roles in data manipulation and preparation.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(downloader)
## Warning: package 'downloader' was built under R version 4.3.3
Load the untidy dataset
data <- read.csv(url("https://raw.githubusercontent.com/Shriyanshh/Project-2---Data-Transformation/refs/heads/main/world_population.csv"))
# Get the number of rows and columns
dim(data)
## [1] 234 17
# Display the structure
str(data)
## 'data.frame': 234 obs. of 17 variables:
## $ Rank : int 36 138 34 213 203 42 224 201 33 140 ...
## $ CCA3 : chr "AFG" "ALB" "DZA" "ASM" ...
## $ Country.Territory : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Capital : chr "Kabul" "Tirana" "Algiers" "Pago Pago" ...
## $ Continent : chr "Asia" "Europe" "Africa" "Oceania" ...
## $ X2022.Population : int 41128771 2842321 44903225 44273 79824 35588987 15857 93763 45510318 2780469 ...
## $ X2020.Population : int 38972230 2866849 43451666 46189 77700 33428485 15585 92664 45036032 2805608 ...
## $ X2015.Population : int 33753499 2882481 39543154 51368 71746 28127721 14525 89941 43257065 2878595 ...
## $ X2010.Population : int 28189672 2913399 35856344 54849 71519 23364185 13172 85695 41100123 2946293 ...
## $ X2000.Population : int 19542982 3182021 30774621 58230 66097 16394062 11047 75055 37070774 3168523 ...
## $ X1990.Population : int 10694796 3295066 25518074 47818 53569 11828638 8316 63328 32637657 3556539 ...
## $ X1980.Population : int 12486631 2941651 18739378 32886 35611 8330047 6560 64888 28024803 3135123 ...
## $ X1970.Population : int 10752971 2324731 13795915 27075 19860 6029700 6283 64516 23842803 2534377 ...
## $ Area..km.. : int 652230 28748 2381741 199 468 1246700 91 442 2780400 29743 ...
## $ Density..per.km.. : num 63.1 98.9 18.9 222.5 170.6 ...
## $ Growth.Rate : num 1.026 0.996 1.016 0.983 1.01 ...
## $ World.Population.Percentage: num 0.52 0.04 0.56 0 0 0.45 0 0 0.57 0.03 ...
# Preview of the data frame
head(data)
## Rank CCA3 Country.Territory Capital Continent X2022.Population
## 1 36 AFG Afghanistan Kabul Asia 41128771
## 2 138 ALB Albania Tirana Europe 2842321
## 3 34 DZA Algeria Algiers Africa 44903225
## 4 213 ASM American Samoa Pago Pago Oceania 44273
## 5 203 AND Andorra Andorra la Vella Europe 79824
## 6 42 AGO Angola Luanda Africa 35588987
## X2020.Population X2015.Population X2010.Population X2000.Population
## 1 38972230 33753499 28189672 19542982
## 2 2866849 2882481 2913399 3182021
## 3 43451666 39543154 35856344 30774621
## 4 46189 51368 54849 58230
## 5 77700 71746 71519 66097
## 6 33428485 28127721 23364185 16394062
## X1990.Population X1980.Population X1970.Population Area..km..
## 1 10694796 12486631 10752971 652230
## 2 3295066 2941651 2324731 28748
## 3 25518074 18739378 13795915 2381741
## 4 47818 32886 27075 199
## 5 53569 35611 19860 468
## 6 11828638 8330047 6029700 1246700
## Density..per.km.. Growth.Rate World.Population.Percentage
## 1 63.0587 1.0257 0.52
## 2 98.8702 0.9957 0.04
## 3 18.8531 1.0164 0.56
## 4 222.4774 0.9831 0.00
## 5 170.5641 1.0100 0.00
## 6 28.5466 1.0315 0.45
The dataset was initially tidied by transforming it into a long format, which made it easier to handle and visualize. Additionally, the column names were standardized to enhance clarity and ensure consistency across the dataset.
# The dataset contains regional population data from 1970 to 2022.
# To tidy the data, I will convert the year columns into a single column, transforming it into a long dataset format.
# This will allow for easier visualization of population trends over time, by year and country.
# First, I will rename the column headers to make them more descriptive and standardized.
data <- data %>%
rename(
"Country/Territory" = Country.Territory, # Renaming the column for country or territory names
"2022" = X2022.Population, # Renaming columns to show populations in respective years
"2020" = X2020.Population,
"2015" = X2015.Population,
"2010" = X2010.Population,
"2000" = X2000.Population,
"1990" = X1990.Population,
"1980" = X1980.Population,
"1970" = X1970.Population,
"Area (km)" = Area..km.., # Renaming area column to indicate it's in square kilometers
"Density per km" = Density..per.km.., # Clarifying density column to show it's population per square kilometer
"Growth Rate" = Growth.Rate, # Keeping the growth rate column name as is
"World Population Percentage" = World.Population.Percentage # Renaming to show percentage of world population
)
# Next, I will collapse the population columns from different years into a single "Year" column.
# This transforms the dataset into a long format, making it easier to analyze population changes over time.
world_pop <- data %>%
pivot_longer(`2022`:`1970`, names_to = "Year", values_to = "Population")
# Display the first few rows of the transformed dataset to confirm changes
head(world_pop)
## # A tibble: 6 × 11
## Rank CCA3 `Country/Territory` Capital Continent `Area (km)` `Density per km`
## <int> <chr> <chr> <chr> <chr> <int> <dbl>
## 1 36 AFG Afghanistan Kabul Asia 652230 63.1
## 2 36 AFG Afghanistan Kabul Asia 652230 63.1
## 3 36 AFG Afghanistan Kabul Asia 652230 63.1
## 4 36 AFG Afghanistan Kabul Asia 652230 63.1
## 5 36 AFG Afghanistan Kabul Asia 652230 63.1
## 6 36 AFG Afghanistan Kabul Asia 652230 63.1
## # ℹ 4 more variables: `Growth Rate` <dbl>, `World Population Percentage` <dbl>,
## # Year <chr>, Population <int>
Statistical summaries were generated to identify the countries with the highest and lowest growth rates. The dataset was then visualized to display population trends over time and across different continents. Specifically, graphs were created to showcase the top 10 and bottom 10 countries based on population growth rates, as well as population sizes for the year 2022.
# Calculating statistical summaries for growth rates and populations
# Summarizes the dataset by calculating the average, minimum, and maximum growth rates
# Also identifies the smallest and largest populations in the dataset
world_pop %>%
summarize(
average_growth_rate = mean(`Growth Rate`), # Calculating the average growth rate
min_growth_rate = min(`Growth Rate`), # Finding the minimum growth rate
max_growth_rate = max(`Growth Rate`), # Finding the maximum growth rate
smallest_population = min(Population), # Finding the smallest population
largest_population = max(Population) # Finding the largest population
)
## # A tibble: 1 × 5
## average_growth_rate min_growth_rate max_growth_rate smallest_population
## <dbl> <dbl> <dbl> <int>
## 1 1.01 0.912 1.07 510
## # ℹ 1 more variable: largest_population <int>
# Extracting the names of countries with the highest growth rates
# Sorting the dataset in descending order based on growth rates, and then pulling the country/territory names
countries_with_highest_growth_rate <- world_pop %>%
arrange(desc(`Growth Rate`)) %>%
pull(`Country/Territory`)
# Removing duplicate results as the dataset contains 8 separate entries per country for each year
# Selecting every 8th value to represent a unique country/territory
countries_with_highest_growth_rate <- countries_with_highest_growth_rate[seq(1, 80, 8)]
# Extracting the highest growth rates from the dataset
# Sorting the dataset in descending order of growth rates and pulling the corresponding growth rate values
highest_growths <- world_pop %>%
arrange(desc(`Growth Rate`)) %>%
pull(`Growth Rate`)
# Similarly, removing duplicates by selecting every 8th value from the sorted list
highest_growths <- highest_growths[seq(1, 80, 8)]
Here, we begin by loading the world_map data frame using
st_read(). Next, two groups of countries are identified:
those with the highest and lowest population growth rates, which are
stored in the countries_with_highest_growth_rate and
countries_with_lowest_growth_rate variables, respectively.
For the lowest growth rates, we select every 8th country from a sorted
list, up to 80 entries.
The top_and_bottom data frame augments the
world_map data by adding a fill column, which
is used to color-code countries based on their growth rate
classification. Finally, we generate a plot using ggplot2.
The plot visually distinguishes countries with the highest and lowest
growth rates using different colors and labels them by name. The plot is
saved as a PNG file in the current working directory for future
reference or reporting.
# Get the current working directory
current_wd <- getwd()
# Download the ZIP file containing shapefiles to the current working directory
download.file("https://github.com/autistic96/project-2/archive/refs/heads/main.zip",
paste0(current_wd, "/map_shapefiles.zip"), mode = "wb")
# Unzip the downloaded ZIP file to a new folder called "map_shapefiles_folder"
unzip("map_shapefiles.zip", exdir = "map_shapefiles_folder")
# Unzip the internal ZIP file (within the first unzip) containing the actual shapefiles
unzip("map_shapefiles_folder/project-2-main/map_shapefiles.zip", exdir = "map_shapefiles_folder")
# Define the path to the shapefile (the ".shp" file)
shp_path <- "map_shapefiles_folder/map_shapefiles"
# Read the shapefile into an sf (simple feature) object using st_read from the sf package
world_map = st_read(shp_path)
## Reading layer `ne_10m_admin_0_countries' from data source
## `C:\Users\16462\Desktop\data607\project2\3\map_shapefiles_folder\map_shapefiles'
## using driver `ESRI Shapefile'
## Simple feature collection with 258 features and 168 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -90 xmax: 180 ymax: 83.6341
## Geodetic CRS: WGS 84
# Assign the highest growth rates to their corresponding countries/territories
# Names are set to the country/territory names
names(highest_growths) = countries_with_highest_growth_rate
highest_growths
## Moldova Poland Niger Syria Slovakia DR Congo Mayotte Chad
## 1.0691 1.0404 1.0378 1.0376 1.0359 1.0325 1.0319 1.0316
## Angola Mali
## 1.0315 1.0314
# Extract the countries with the lowest growth rates by sorting the data and pulling the relevant names
countries_with_lowest_growth_rate <- world_pop %>%
arrange(`Growth Rate`) %>%
pull(`Country/Territory`)
# Select every 8th country to avoid duplicates (since data spans multiple years)
countries_with_lowest_growth_rate <- countries_with_lowest_growth_rate[seq(1, 80, 8)]
# Pull the lowest growth rates and match them to the countries with the lowest rates
lowest_growths <- world_pop %>%
arrange(`Growth Rate`) %>%
pull(`Growth Rate`)
# Select every 8th value to remove duplicates (similar to above)
lowest_growths <- lowest_growths[seq(1, 80, 8)]
# Assign names to the lowest growth rates (country/territory names)
names(lowest_growths) <- countries_with_lowest_growth_rate
lowest_growths
## Ukraine Lebanon American Samoa
## 0.9120 0.9816 0.9831
## Bulgaria Lithuania Latvia
## 0.9849 0.9869 0.9876
## Bosnia and Herzegovina Marshall Islands Serbia
## 0.9886 0.9886 0.9897
## Croatia
## 0.9927
# Verify the country lists for highest and lowest growth rates
countries_with_lowest_growth_rate
## [1] "Ukraine" "Lebanon" "American Samoa"
## [4] "Bulgaria" "Lithuania" "Latvia"
## [7] "Bosnia and Herzegovina" "Marshall Islands" "Serbia"
## [10] "Croatia"
countries_with_highest_growth_rate
## [1] "Moldova" "Poland" "Niger" "Syria" "Slovakia" "DR Congo"
## [7] "Mayotte" "Chad" "Angola" "Mali"
# Add a 'fill' column to the world_map data to color-code countries
# Countries with the highest growth rates are colored blue, lowest in red, others in white
top_and_bottom <- world_map %>%
mutate(fill = case_when(
`NAME` %in% countries_with_highest_growth_rate ~ "blue", # High growth rates colored blue
`NAME` %in% countries_with_lowest_growth_rate ~ "red", # Low growth rates colored red
TRUE ~ "white" # All other countries colored white
))
# Generate the plot using ggplot2
# geom_sf is used for drawing the map, and geom_sf_text adds country labels with check_overlap to avoid overlapping text
p <- ggplot(data = top_and_bottom) +
geom_sf(aes(fill = fill)) + # Fill the map based on the 'fill' column
geom_sf_text(aes(label = NAME), check_overlap = TRUE) + # Add country names as labels, avoiding overlaps
ggtitle("Map of World") + # Add a title to the plot
scale_fill_identity() # Use the specified colors without any additional mapping
# Save the plot as a PNG file in the current working directory
ggsave("top_and_bottom_10_with_labels.png", plot = p, width = 44, height = 40)
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
# Plot of the top 10 countries/territories with the highest population growth rate
# Filter the dataset for 2022 and only include countries with the highest growth rates
# Create a bar plot of population for the top 10 countries with the highest growth rates in 2022
world_pop %>%
filter(Year == "2022" & `Country/Territory` %in% countries_with_highest_growth_rate) %>%
ggplot(aes(x = reorder(`Country/Territory`, -Population), y = Population)) +
geom_bar(stat="identity") + # Use geom_bar to create a bar plot
ggtitle("Top 10 Countries (Highest Growth Rate) in 2022") + # Add plot title
xlab("Country/Territory") + # Label for the x-axis
ylab("Population") + # Label for the y-axis
theme_minimal() + # Apply a minimal theme for better aesthetics
coord_flip() # Flip coordinates to make bars horizontal
# Plot of bottom 10 countries/territories with the lowest population growth rate
# Similar process to the highest growth rate plot, but for the lowest growth rate countries
world_pop %>%
filter(Year == "2022" & `Country/Territory` %in% countries_with_lowest_growth_rate) %>%
ggplot(aes(x = reorder(`Country/Territory`, Population), y = Population)) +
geom_bar(stat="identity") + # Use geom_bar for a bar plot
ggtitle("Bottom 10 Countries (Lowest Growth Rate) in 2022") + # Add plot title
xlab("Country/Territory") + # Label for the x-axis
ylab("Population") + # Label for the y-axis
theme_minimal() + # Apply a minimal theme
coord_flip() # Flip coordinates for horizontal bars
# Filter the dataset for the most recent year (2022) and arrange by population in descending order
recent_pop_data <- world_pop %>%
filter(Year == 2022) %>%
arrange(desc(Population))
# Display the top 10 countries/territories with the largest populations in 2022
head(recent_pop_data, n = 10)
## # A tibble: 10 × 11
## Rank CCA3 `Country/Territory` Capital Continent `Area (km)`
## <int> <chr> <chr> <chr> <chr> <int>
## 1 1 CHN China Beijing Asia 9706961
## 2 2 IND India New Delhi Asia 3287590
## 3 3 USA United States Washington, D.C. North America 9372610
## 4 4 IDN Indonesia Jakarta Asia 1904569
## 5 5 PAK Pakistan Islamabad Asia 881912
## 6 6 NGA Nigeria Abuja Africa 923768
## 7 7 BRA Brazil Brasilia South America 8515767
## 8 8 BGD Bangladesh Dhaka Asia 147570
## 9 9 RUS Russia Moscow Europe 17098242
## 10 10 MEX Mexico Mexico City North America 1964375
## # ℹ 5 more variables: `Density per km` <dbl>, `Growth Rate` <dbl>,
## # `World Population Percentage` <dbl>, Year <chr>, Population <int>
# Display the bottom 10 countries/territories with the smallest populations in 2022
tail(recent_pop_data, n = 10)
## # A tibble: 10 × 11
## Rank CCA3 `Country/Territory` Capital Continent `Area (km)`
## <int> <chr> <chr> <chr> <chr> <int>
## 1 225 NRU Nauru Yaren Oceania 21
## 2 226 WLF Wallis and Futuna Mata-Utu Oceania 142
## 3 227 TUV Tuvalu Funafuti Oceania 26
## 4 228 BLM Saint Barthelemy Gustavia North America 21
## 5 229 SPM Saint Pierre and Miquelon Saint-Pierre North America 242
## 6 230 MSR Montserrat Brades North America 102
## 7 231 FLK Falkland Islands Stanley South America 12173
## 8 232 NIU Niue Alofi Oceania 260
## 9 233 TKL Tokelau Nukunonu Oceania 12
## 10 234 VAT Vatican City Vatican City Europe 1
## # ℹ 5 more variables: `Density per km` <dbl>, `Growth Rate` <dbl>,
## # `World Population Percentage` <dbl>, Year <chr>, Population <int>
# Plot of population growth over the years for all countries/territories
# Asia shows the biggest increase in population over time
# Convert the Year column from character to numeric for plotting
world_pop$Year <- as.numeric(world_pop$Year)
# Group the data by Year and Continent, then sum the population for each group
world_pop_summary <- world_pop %>%
group_by(Year, Continent) %>%
summarise(Total_Population = sum(Population))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Create a line plot showing total population over time by continent
ggplot(data = world_pop_summary, aes(x = Year, y = Total_Population, color = Continent)) +
geom_line(linewidth = 1) + # Use geom_line to create a line graph with specified line width
ggtitle("World Population Over Time by Continent") + # Add plot title
xlab("Year") + # Label for the x-axis
ylab("Total Population") + # Label for the y-axis
theme_minimal() # Apply a minimal theme for clean presentation
After tidying and analyzing the World Population Dataset, several key insights became clear. We identified countries with notably high population growth rates, as well as others experiencing low or even negative growth. This information can be invaluable for policymakers in these regions as they plan for future demographic challenges. Additionally, visualizations of the 10 countries with the highest and lowest growth rates, along with their population sizes for 2022, provided a snapshot of global population dynamics, highlighting the disparities between nations.
We also examined population trends over time by continent. The line graph revealed that Asia has seen the most substantial population growth over the years. This trend could have significant socio-economic impacts, such as increased demand for resources and potential pressure on public services in densely populated areas.