#World Population Dataset
##Intro
The World Population Dataset reports population figures for 234 countries and territories at eight points in time between 1970 and 2022. It also includes each entry's area, density per km², growth rate, and share of the world population. It's important to understand how to work with population data so that we can build accurate predictions and models.
The tidyr and dplyr packages used below are both part of the tidyverse.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
library(downloader)
Load the untidy dataset
data <- read.csv(url("https://raw.githubusercontent.com/autistic96/project-2/main/world_population.csv"))
# Get the number of rows and columns
dim(data)
## [1] 234 17
# Display the structure
str(data)
## 'data.frame': 234 obs. of 17 variables:
## $ Rank : int 36 138 34 213 203 42 224 201 33 140 ...
## $ CCA3 : chr "AFG" "ALB" "DZA" "ASM" ...
## $ Country.Territory : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Capital : chr "Kabul" "Tirana" "Algiers" "Pago Pago" ...
## $ Continent : chr "Asia" "Europe" "Africa" "Oceania" ...
## $ X2022.Population : int 41128771 2842321 44903225 44273 79824 35588987 15857 93763 45510318 2780469 ...
## $ X2020.Population : int 38972230 2866849 43451666 46189 77700 33428485 15585 92664 45036032 2805608 ...
## $ X2015.Population : int 33753499 2882481 39543154 51368 71746 28127721 14525 89941 43257065 2878595 ...
## $ X2010.Population : int 28189672 2913399 35856344 54849 71519 23364185 13172 85695 41100123 2946293 ...
## $ X2000.Population : int 19542982 3182021 30774621 58230 66097 16394062 11047 75055 37070774 3168523 ...
## $ X1990.Population : int 10694796 3295066 25518074 47818 53569 11828638 8316 63328 32637657 3556539 ...
## $ X1980.Population : int 12486631 2941651 18739378 32886 35611 8330047 6560 64888 28024803 3135123 ...
## $ X1970.Population : int 10752971 2324731 13795915 27075 19860 6029700 6283 64516 23842803 2534377 ...
## $ Area..km.. : int 652230 28748 2381741 199 468 1246700 91 442 2780400 29743 ...
## $ Density..per.km.. : num 63.1 98.9 18.9 222.5 170.6 ...
## $ Growth.Rate : num 1.026 0.996 1.016 0.983 1.01 ...
## $ World.Population.Percentage: num 0.52 0.04 0.56 0 0 0.45 0 0 0.57 0.03 ...
# Preview of the data frame
head(data)
## Rank CCA3 Country.Territory Capital Continent X2022.Population
## 1 36 AFG Afghanistan Kabul Asia 41128771
## 2 138 ALB Albania Tirana Europe 2842321
## 3 34 DZA Algeria Algiers Africa 44903225
## 4 213 ASM American Samoa Pago Pago Oceania 44273
## 5 203 AND Andorra Andorra la Vella Europe 79824
## 6 42 AGO Angola Luanda Africa 35588987
## X2020.Population X2015.Population X2010.Population X2000.Population
## 1 38972230 33753499 28189672 19542982
## 2 2866849 2882481 2913399 3182021
## 3 43451666 39543154 35856344 30774621
## 4 46189 51368 54849 58230
## 5 77700 71746 71519 66097
## 6 33428485 28127721 23364185 16394062
## X1990.Population X1980.Population X1970.Population Area..km..
## 1 10694796 12486631 10752971 652230
## 2 3295066 2941651 2324731 28748
## 3 25518074 18739378 13795915 2381741
## 4 47818 32886 27075 199
## 5 53569 35611 19860 468
## 6 11828638 8330047 6029700 1246700
## Density..per.km.. Growth.Rate World.Population.Percentage
## 1 63.0587 1.0257 0.52
## 2 98.8702 0.9957 0.04
## 3 18.8531 1.0164 0.56
## 4 222.4774 0.9831 0.00
## 5 170.5641 1.0100 0.00
## 6 28.5466 1.0315 0.45
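As a quick sanity check on how the columns relate, the density values shown above appear to be the 2022 population divided by the area. The short sketch below recomputes it for the first two rows, with the numbers copied by hand from the preview; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the head(data) output above (first two rows)
pop_2022 <- c(Afghanistan = 41128771, Albania = 2842321)
area_km2 <- c(Afghanistan = 652230, Albania = 28748)

# Density as people per square kilometre, rounded like the dataset
density <- round(pop_2022 / area_km2, 4)
density
```

If the recomputed values match the `Density..per.km..` column, that supports reading the column as a 2022 figure rather than an average over all years.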
##Tidying the dataset
The dataset was first tidied to transform it into a long format, making it easier to manage and visualize. The names of the columns were standardized for better clarity.
# Tidying plan, following Matthew's suggestion on the discussion board:
# the dataset details regional populations for selected years from 1970 to 2022.
# To tidy it, collapse the year columns into a single column, making the dataset long.
# From there, we can easily graph population by year, and potentially by country.
# First, rename the columns to make them clearer
data <- data %>%
  rename("Country/Territory" = Country.Territory,
         "2022" = X2022.Population, "2020" = X2020.Population,
         "2015" = X2015.Population, "2010" = X2010.Population,
         "2000" = X2000.Population, "1990" = X1990.Population,
         "1980" = X1980.Population, "1970" = X1970.Population,
         "Area (km)" = Area..km.., "Density per km" = Density..per.km..,
         "Growth Rate" = Growth.Rate,
         "World Population Percentage" = World.Population.Percentage)
# Make a long dataset by collapsing the years into a single column
world_pop <- data %>%
pivot_longer(`2022`:`1970`, names_to = "Year", values_to = "Population")
head(world_pop)
## # A tibble: 6 × 11
## Rank CCA3 `Country/Territory` Capital Continent `Area (km)` `Density per km`
## <int> <chr> <chr> <chr> <chr> <int> <dbl>
## 1 36 AFG Afghanistan Kabul Asia 652230 63.1
## 2 36 AFG Afghanistan Kabul Asia 652230 63.1
## 3 36 AFG Afghanistan Kabul Asia 652230 63.1
## 4 36 AFG Afghanistan Kabul Asia 652230 63.1
## 5 36 AFG Afghanistan Kabul Asia 652230 63.1
## 6 36 AFG Afghanistan Kabul Asia 652230 63.1
## # ℹ 4 more variables: `Growth Rate` <dbl>, `World Population Percentage` <dbl>,
## # Year <chr>, Population <int>
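The rename and pivot steps above could also be combined into a single call. The sketch below is one possible alternative, not the approach used in this report: it runs on a small toy data frame (the name `wide` and its two rows are stand-ins), selects the year columns by pattern, and extracts the year as an integer while pivoting.

```r
library(tidyr)

# Toy stand-in for the wide data, using the original column naming
wide <- data.frame(
  CCA3 = c("AFG", "ALB"),
  X2022.Population = c(41128771L, 2842321L),
  X2020.Population = c(38972230L, 2866849L)
)

long <- pivot_longer(
  wide,
  cols = matches("^X[0-9]{4}\\.Population$"),        # every year column
  names_to = "Year",
  names_pattern = "^X([0-9]{4})\\.Population$",      # keep only the 4-digit year
  names_transform = list(Year = as.integer),         # store Year as a number
  values_to = "Population"
)
long
```

Storing `Year` as a number at this stage would avoid the later conversion before plotting, at the cost of a slightly denser `pivot_longer()` call.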
##Analysis
Statistical summaries were generated to identify countries with the highest and lowest growth rates. The dataset was then visualized to display population trends over the years and across continents. Specifically, graphs were plotted to highlight the top and bottom 10 countries based on their population growth rates, as well as population sizes in 2022.
# Statistics of Growths and Populations
world_pop %>%
  summarize(average_growth_rate = mean(`Growth Rate`),
            min_growth_rate = min(`Growth Rate`),
            max_growth_rate = max(`Growth Rate`),
            smallest_population = min(Population),
            largest_population = max(Population))
## # A tibble: 1 × 5
## average_growth_rate min_growth_rate max_growth_rate smallest_population
## <dbl> <dbl> <dbl> <int>
## 1 1.01 0.912 1.07 510
## # ℹ 1 more variable: largest_population <int>
countries_with_highest_growth_rate <- world_pop %>%
arrange(desc(`Growth Rate`)) %>%
pull(`Country/Territory`)
# Each country appears 8 times (once per year) with the same growth rate,
# so keep every 8th entry of the first 80 to get the top 10 distinct countries
countries_with_highest_growth_rate <- countries_with_highest_growth_rate[seq(1, 80, 8)]
highest_growths <- world_pop %>%
arrange(desc(`Growth Rate`)) %>%
pull(`Growth Rate`)
highest_growths <- highest_growths[seq(1, 80, 8)]
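Picking every 8th element works because each country contributes exactly eight rows. A less position-dependent option is to reduce the data to one row per country first and then take the largest values. The following is a minimal sketch on toy data (the data frame `toy` and its columns `country` and `growth` are hypothetical stand-ins for `Country/Territory` and `Growth Rate`).

```r
library(dplyr)

# Toy long data: each country repeated once per year with the same growth rate
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 2),
  growth = rep(c(1.03, 0.98, 1.05), each = 2),
  Year = rep(c("2022", "2020"), times = 3)
)

top2 <- toy %>%
  distinct(country, growth) %>%   # one row per country
  slice_max(growth, n = 2)        # highest growth rates first
top2
```

The same pattern with `slice_min()` would give the lowest growth rates, and it would keep working if the number of years per country changed.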
Next, we load the world_map data frame using st_read(). Two sets of countries are identified: those with the highest and those with the lowest population growth rates, stored in countries_with_highest_growth_rate and countries_with_lowest_growth_rate, respectively. In both cases, the long data is sorted by growth rate and every 8th entry of the first 80 is kept, because each country appears once per year. The top_and_bottom data frame augments world_map with a fill column that color-codes countries by growth rate: blue for the highest, red for the lowest, and white for all others. Finally, a map is drawn with ggplot2, labelled with country names, and saved as a PNG file in the current working directory.
current_wd <- getwd()
# Download the ZIP file to the current working directory
download.file("https://github.com/autistic96/project-2/archive/refs/heads/main.zip",
              file.path(current_wd, "map_shapefiles.zip"),
              mode = "wb")
# Unzip the ZIP file
unzip("map_shapefiles.zip", exdir = "map_shapefiles_folder")
unzip("map_shapefiles_folder/project-2-main/map_shapefiles.zip", exdir = "map_shapefiles_folder")
# Path to the shapefile within the unzipped folder
shp_path <- "map_shapefiles_folder/map_shapefiles"
world_map <- st_read(shp_path)
## Reading layer `ne_10m_admin_0_countries' from data source
## `C:\Users\Guestperson\Desktop\project-2\map_shapefiles_folder\map_shapefiles'
## using driver `ESRI Shapefile'
## Simple feature collection with 258 features and 168 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -90 xmax: 180 ymax: 83.6341
## Geodetic CRS: WGS 84
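Because the archive is downloaded every time the report is run, it may be worth skipping the download when the file is already present. The helper below is a hypothetical sketch (`download_once` is not part of base R or of this project); it takes the fetch function as an argument so that the example can run offline with a stand-in, and the example URL is a placeholder.

```r
# Hypothetical helper: download only when the destination file is missing
download_once <- function(url, dest, fetch = utils::download.file) {
  if (!file.exists(dest)) {
    fetch(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Offline demonstration with a stand-in for download.file()
calls <- 0
fake_fetch <- function(url, dest, mode) {
  calls <<- calls + 1
  writeLines("placeholder", dest)
}

dest <- file.path(tempdir(), "map_shapefiles_demo.zip")
unlink(dest)  # start from a clean state
download_once("https://example.com/archive.zip", dest, fetch = fake_fetch)
download_once("https://example.com/archive.zip", dest, fetch = fake_fetch)
calls  # the second call is skipped because the file already exists
```

In the report itself, the helper would be called with the real archive URL and the default `fetch`, leaving the unzip steps unchanged.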
# Top 10 highest growth rates ordered from high to low by country/territory
names(highest_growths) <- countries_with_highest_growth_rate
highest_growths
## Moldova Poland Niger Syria Slovakia DR Congo Mayotte Chad
## 1.0691 1.0404 1.0378 1.0376 1.0359 1.0325 1.0319 1.0316
## Angola Mali
## 1.0315 1.0314
countries_with_lowest_growth_rate <- world_pop %>%
arrange(`Growth Rate`) %>%
pull(`Country/Territory`)
countries_with_lowest_growth_rate <- countries_with_lowest_growth_rate[seq(1, 80, 8)]
lowest_growths <- world_pop %>%
arrange(`Growth Rate`) %>%
pull(`Growth Rate`)
lowest_growths <- lowest_growths[seq(1, 80, 8)]
# Bottom 10 lowest growth rates ordered from low to high by country/territory
names(lowest_growths) <- countries_with_lowest_growth_rate
lowest_growths
## Ukraine Lebanon American Samoa
## 0.9120 0.9816 0.9831
## Bulgaria Lithuania Latvia
## 0.9849 0.9869 0.9876
## Bosnia and Herzegovina Marshall Islands Serbia
## 0.9886 0.9886 0.9897
## Croatia
## 0.9927
countries_with_lowest_growth_rate
## [1] "Ukraine" "Lebanon" "American Samoa"
## [4] "Bulgaria" "Lithuania" "Latvia"
## [7] "Bosnia and Herzegovina" "Marshall Islands" "Serbia"
## [10] "Croatia"
countries_with_highest_growth_rate
## [1] "Moldova" "Poland" "Niger" "Syria" "Slovakia" "DR Congo"
## [7] "Mayotte" "Chad" "Angola" "Mali"
# Add a fill column to your world_map data
top_and_bottom <- world_map %>%
mutate(fill = case_when(
`NAME` %in% countries_with_highest_growth_rate ~ "blue",
`NAME` %in% countries_with_lowest_growth_rate ~ "red",
TRUE ~ "white"
))
# Generate the plot
p <- ggplot(data = top_and_bottom) +
geom_sf(aes(fill = fill)) +
geom_sf_text(aes(label = NAME), check_overlap = TRUE) + # Add labels
ggtitle("Map of World") +
scale_fill_identity()
# Save the plot
ggsave("top_and_bottom_10_with_labels.png", plot = p, width = 44, height = 40)
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
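Matching on `NAME` relies on the shapefile spelling each country exactly as the population dataset does, which may not hold for entries such as "DR Congo". One way to reduce that risk is to match on three-letter codes instead. The sketch below uses a plain data frame as a stand-in for the map attributes; the `ADM0_A3` column and the example spellings are assumptions that should be checked against `names(world_map)` before use.

```r
library(dplyr)

# Hypothetical stand-in for the attribute table of world_map
map_attrs <- data.frame(
  NAME = c("Dem. Rep. Congo", "Ukraine", "France"),
  ADM0_A3 = c("COD", "UKR", "FRA")
)

# Codes that would come from the CCA3 column of the population data
top_codes <- c("COD")
bottom_codes <- c("UKR")

coloured <- map_attrs %>%
  mutate(fill = case_when(
    ADM0_A3 %in% top_codes ~ "blue",     # highest growth rates
    ADM0_A3 %in% bottom_codes ~ "red",   # lowest growth rates
    TRUE ~ "white"                       # everything else
  ))
coloured
```

With name matching, a country whose spelling differs between the two sources would silently stay white, so comparing the number of colored countries against the expected 20 is a useful check either way.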
# Plot of the top 10 countries/territories with the highest population growth rate
world_pop %>%
filter(Year == "2022" & `Country/Territory` %in% countries_with_highest_growth_rate) %>%
ggplot(aes(x = reorder(`Country/Territory`, -Population), y = Population)) +
geom_bar(stat="identity") +
ggtitle("Top 10 Countries (Highest Growth Rate) in 2022") +
xlab("Country/Territory") +
ylab("Population") +
theme_minimal() +
coord_flip()
# Plot of bottom 10 countries/territories with the lowest population growth rate
world_pop %>%
filter(Year == "2022" & `Country/Territory` %in% countries_with_lowest_growth_rate) %>%
ggplot(aes(x = reorder(`Country/Territory`, Population), y = Population)) +
geom_bar(stat="identity") +
ggtitle("Bottom 10 Countries (Lowest Growth Rate) in 2022") +
xlab("Country/Territory") +
ylab("Population") +
theme_minimal() +
coord_flip()
recent_pop_data <- world_pop %>%
filter(Year == "2022") %>%
arrange(desc(Population))
# Top 10 Largest population in 2022 ordered from high to low
head(recent_pop_data, n = 10)
## # A tibble: 10 × 11
## Rank CCA3 `Country/Territory` Capital Continent `Area (km)`
## <int> <chr> <chr> <chr> <chr> <int>
## 1 1 CHN China Beijing Asia 9706961
## 2 2 IND India New Delhi Asia 3287590
## 3 3 USA United States Washington, D.C. North America 9372610
## 4 4 IDN Indonesia Jakarta Asia 1904569
## 5 5 PAK Pakistan Islamabad Asia 881912
## 6 6 NGA Nigeria Abuja Africa 923768
## 7 7 BRA Brazil Brasilia South America 8515767
## 8 8 BGD Bangladesh Dhaka Asia 147570
## 9 9 RUS Russia Moscow Europe 17098242
## 10 10 MEX Mexico Mexico City North America 1964375
## # ℹ 5 more variables: `Density per km` <dbl>, `Growth Rate` <dbl>,
## # `World Population Percentage` <dbl>, Year <chr>, Population <int>
# Bottom 10 Smallest Population
tail(recent_pop_data, n = 10)
## # A tibble: 10 × 11
## Rank CCA3 `Country/Territory` Capital Continent `Area (km)`
## <int> <chr> <chr> <chr> <chr> <int>
## 1 225 NRU Nauru Yaren Oceania 21
## 2 226 WLF Wallis and Futuna Mata-Utu Oceania 142
## 3 227 TUV Tuvalu Funafuti Oceania 26
## 4 228 BLM Saint Barthelemy Gustavia North America 21
## 5 229 SPM Saint Pierre and Miquelon Saint-Pierre North America 242
## 6 230 MSR Montserrat Brades North America 102
## 7 231 FLK Falkland Islands Stanley South America 12173
## 8 232 NIU Niue Alofi Oceania 260
## 9 233 TKL Tokelau Nukunonu Oceania 12
## 10 234 VAT Vatican City Vatican City Europe 1
## # ℹ 5 more variables: `Density per km` <dbl>, `Growth Rate` <dbl>,
## # `World Population Percentage` <dbl>, Year <chr>, Population <int>
# Plot of total population by year for each continent
# Asia has the biggest increase in population
world_pop$Year <- as.numeric(world_pop$Year)
# Group by Year and Continent and then sum the Population
world_pop_summary <- world_pop %>%
group_by(Year, Continent) %>%
summarise(Total_Population = sum(Population))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Create the ggplot line graph
ggplot(data = world_pop_summary, aes(x = Year, y = Total_Population, color = Continent)) +
geom_line(linewidth = 1) +
ggtitle("World Population Over Time by Continent") +
xlab("Year") +
ylab("Total Population") +
theme_minimal()
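To put a number on which continent grew the most, the yearly totals can be reduced to the change between the first and last year. The sketch below runs on toy values (the data frame `toy` is a stand-in for `world_pop`) and also passes `.groups = "drop"`, which avoids the grouping message. The `as.numeric()` conversion is a defensive choice on my part, since continent totals can exceed the 32-bit integer range; whether that matters may depend on the R version.

```r
library(dplyr)

# Toy stand-in for the long population data
toy <- data.frame(
  Year = c(2020, 2020, 2020, 2022, 2022, 2022),
  Continent = c("Asia", "Asia", "Europe", "Asia", "Asia", "Europe"),
  Population = c(10L, 20L, 6L, 11L, 22L, 5L)
)

# Total population per year and continent, returned ungrouped
summary_tbl <- toy %>%
  group_by(Year, Continent) %>%
  summarise(Total_Population = sum(as.numeric(Population)), .groups = "drop")

# Change between the earliest and latest year for each continent
increase <- summary_tbl %>%
  group_by(Continent) %>%
  summarise(
    Increase = Total_Population[which.max(Year)] - Total_Population[which.min(Year)],
    .groups = "drop"
  ) %>%
  arrange(desc(Increase))
increase
```

Applied to `world_pop`, the first row of `increase` would name the continent with the largest absolute gain over the period covered by the data.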
##Conclusion
After tidying and analyzing the World Population Dataset, several key insights emerged. Growth rates vary widely: in this dataset they range from 0.912 (Ukraine) to about 1.07 (Moldova), where a value below 1 corresponds to a shrinking population. This information could be crucial for policymakers in these nations. We also plotted the 10 countries with the highest and the 10 with the lowest growth rates, along with their population sizes in 2022. These visualizations give a snapshot of global population dynamics and how much they differ from one nation to another.
Furthermore, we looked at population trends over time by continent. The line graph showed that Asia has had the largest increase in population over the period. This could have various socio-economic implications, such as increased demand for resources and potential strain on public services in densely populated regions.