#World Population Dataset

##Intro

The World Population Dataset contains population figures for 234 countries and territories at eight points in time between 1970 and 2022. It also includes variables such as area, density per km², growth rate, and each country's percentage of the world population. It's important to understand how to work with population data so that we can build accurate models and predictions.

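tidyr and dplyr are both part of the tidyverse, so loading tidyverse attaches both. The sf package is loaded to read the map shapefile used later.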

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
library(downloader)

Load the untidy dataset

data <- read.csv(url("https://raw.githubusercontent.com/autistic96/project-2/main/world_population.csv"))

# Get the number of rows and columns
dim(data)
## [1] 234  17
# Display the structure
str(data)
## 'data.frame':    234 obs. of  17 variables:
##  $ Rank                       : int  36 138 34 213 203 42 224 201 33 140 ...
##  $ CCA3                       : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ Country.Territory          : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Capital                    : chr  "Kabul" "Tirana" "Algiers" "Pago Pago" ...
##  $ Continent                  : chr  "Asia" "Europe" "Africa" "Oceania" ...
##  $ X2022.Population           : int  41128771 2842321 44903225 44273 79824 35588987 15857 93763 45510318 2780469 ...
##  $ X2020.Population           : int  38972230 2866849 43451666 46189 77700 33428485 15585 92664 45036032 2805608 ...
##  $ X2015.Population           : int  33753499 2882481 39543154 51368 71746 28127721 14525 89941 43257065 2878595 ...
##  $ X2010.Population           : int  28189672 2913399 35856344 54849 71519 23364185 13172 85695 41100123 2946293 ...
##  $ X2000.Population           : int  19542982 3182021 30774621 58230 66097 16394062 11047 75055 37070774 3168523 ...
##  $ X1990.Population           : int  10694796 3295066 25518074 47818 53569 11828638 8316 63328 32637657 3556539 ...
##  $ X1980.Population           : int  12486631 2941651 18739378 32886 35611 8330047 6560 64888 28024803 3135123 ...
##  $ X1970.Population           : int  10752971 2324731 13795915 27075 19860 6029700 6283 64516 23842803 2534377 ...
##  $ Area..km..                 : int  652230 28748 2381741 199 468 1246700 91 442 2780400 29743 ...
##  $ Density..per.km..          : num  63.1 98.9 18.9 222.5 170.6 ...
##  $ Growth.Rate                : num  1.026 0.996 1.016 0.983 1.01 ...
##  $ World.Population.Percentage: num  0.52 0.04 0.56 0 0 0.45 0 0 0.57 0.03 ...
# Preview of the data frame
head(data)
##   Rank CCA3 Country.Territory          Capital Continent X2022.Population
## 1   36  AFG       Afghanistan            Kabul      Asia         41128771
## 2  138  ALB           Albania           Tirana    Europe          2842321
## 3   34  DZA           Algeria          Algiers    Africa         44903225
## 4  213  ASM    American Samoa        Pago Pago   Oceania            44273
## 5  203  AND           Andorra Andorra la Vella    Europe            79824
## 6   42  AGO            Angola           Luanda    Africa         35588987
##   X2020.Population X2015.Population X2010.Population X2000.Population
## 1         38972230         33753499         28189672         19542982
## 2          2866849          2882481          2913399          3182021
## 3         43451666         39543154         35856344         30774621
## 4            46189            51368            54849            58230
## 5            77700            71746            71519            66097
## 6         33428485         28127721         23364185         16394062
##   X1990.Population X1980.Population X1970.Population Area..km..
## 1         10694796         12486631         10752971     652230
## 2          3295066          2941651          2324731      28748
## 3         25518074         18739378         13795915    2381741
## 4            47818            32886            27075        199
## 5            53569            35611            19860        468
## 6         11828638          8330047          6029700    1246700
##   Density..per.km.. Growth.Rate World.Population.Percentage
## 1           63.0587      1.0257                        0.52
## 2           98.8702      0.9957                        0.04
## 3           18.8531      1.0164                        0.56
## 4          222.4774      0.9831                        0.00
## 5          170.5641      1.0100                        0.00
## 6           28.5466      1.0315                        0.45
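
The `X` prefixes and dots in column names such as `X2022.Population` come from `read.csv()`, which by default passes headers through `make.names()` to make them syntactically valid. The sketch below shows the effect on a tiny inline CSV. The header text and values are hypothetical stand-ins, since the raw file's headers are not shown here.

```r
# Toy CSV (hypothetical headers and values, not the real file)
csv_text <- "Country/Territory,2022 Population\nTestland,100"

# Default: check.names = TRUE rewrites headers into syntactic R names
default_names <- read.csv(text = csv_text)

# check.names = FALSE keeps the headers exactly as written in the file
raw_names <- read.csv(text = csv_text, check.names = FALSE)

names(default_names)
names(raw_names)
```

Keeping the original headers can make a later `rename()` step shorter, at the cost of needing backticks around non-syntactic names.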

##Tidying the dataset

The dataset was first tidied to transform it into a long format, making it easier to manage and visualize. The names of the columns were standardized for better clarity.

# Tidying plan, adapted from Matthew's post on the discussion board:
# This dataset details regional populations from 1970 to 2022. To tidy it, I would collapse the year variables into a single column, effectively making it a long dataset. From there, we can easily graph population by year, and potentially by country.

# I will first rename the column names to make them more clear
data <- data %>%
  rename(
    "Country/Territory" = Country.Territory,
    "2022" = X2022.Population, "2020" = X2020.Population,
    "2015" = X2015.Population, "2010" = X2010.Population,
    "2000" = X2000.Population, "1990" = X1990.Population,
    "1980" = X1980.Population, "1970" = X1970.Population,
    "Area (km)" = Area..km.., "Density per km" = Density..per.km..,
    "Growth Rate" = Growth.Rate,
    "World Population Percentage" = World.Population.Percentage
  )

# Make a long dataset by collapsing the years into a single column
world_pop <- data %>%
  pivot_longer(`2022`:`1970`, names_to = "Year", values_to = "Population")

head(world_pop)
## # A tibble: 6 × 11
##    Rank CCA3  `Country/Territory` Capital Continent `Area (km)` `Density per km`
##   <int> <chr> <chr>               <chr>   <chr>           <int>            <dbl>
## 1    36 AFG   Afghanistan         Kabul   Asia           652230             63.1
## 2    36 AFG   Afghanistan         Kabul   Asia           652230             63.1
## 3    36 AFG   Afghanistan         Kabul   Asia           652230             63.1
## 4    36 AFG   Afghanistan         Kabul   Asia           652230             63.1
## 5    36 AFG   Afghanistan         Kabul   Asia           652230             63.1
## 6    36 AFG   Afghanistan         Kabul   Asia           652230             63.1
## # ℹ 4 more variables: `Growth Rate` <dbl>, `World Population Percentage` <dbl>,
## #   Year <chr>, Population <int>
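
As an alternative to renaming the year columns first, `pivot_longer()` can extract the year from the original column names and convert it to an integer in one step. This is a minimal sketch on made-up data, assuming the columns follow the `X<year>.Population` pattern shown by `str(data)`; the data frame `raw` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the raw data frame (made-up values, two year columns only)
raw <- tibble(
  Country.Territory = c("Country A", "Country B"),
  X2022.Population  = c(120L, 300L),
  X1970.Population  = c(100L, 310L)
)

long <- raw %>%
  pivot_longer(
    cols            = matches("^X[0-9]{4}\\.Population$"),
    names_to        = "Year",
    # Keep only the four-digit year captured by the parentheses
    names_pattern   = "^X([0-9]{4})\\.Population$",
    # Store Year as an integer instead of a character string
    names_transform = list(Year = as.integer),
    values_to       = "Population"
  )

long
```

With `Year` already numeric, no later `as.numeric()` conversion is needed before plotting, although filters would then compare against `2022` rather than `"2022"`.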

##Analysis

Statistical summaries were generated to identify countries with the highest and lowest growth rates. The dataset was then visualized to display population trends over the years and across continents. Specifically, graphs were plotted to highlight the top and bottom 10 countries based on their population growth rates, as well as population sizes in 2022.

# Statistics of Growths and Populations
world_pop %>%
  summarize(
    average_growth_rate = mean(`Growth Rate`),
    min_growth_rate = min(`Growth Rate`),
    max_growth_rate = max(`Growth Rate`),
    smallest_population = min(Population),
    largest_population = max(Population)
  )
## # A tibble: 1 × 5
##   average_growth_rate min_growth_rate max_growth_rate smallest_population
##                 <dbl>           <dbl>           <dbl>               <int>
## 1                1.01           0.912            1.07                 510
## # ℹ 1 more variable: largest_population <int>
countries_with_highest_growth_rate <- world_pop %>% 
  arrange(desc(`Growth Rate`)) %>% 
  pull(`Country/Territory`)

# Each country appears 8 times (once per year) with the same growth rate,
# so take every 8th entry of the first 80 sorted rows to get 10 distinct countries
countries_with_highest_growth_rate <- countries_with_highest_growth_rate[seq(1, 80, 8)]

highest_growths <- world_pop %>% 
  arrange(desc(`Growth Rate`)) %>% 
  pull(`Growth Rate`)

highest_growths <- highest_growths[seq(1, 80, 8)]
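
Selecting every 8th row depends on each country contributing exactly eight consecutive rows. A possibly more robust variant, sketched here on made-up data, first collapses the long table to one row per country and then ranks it with `slice_max()` and `slice_min()`. The toy table and its column names (`country`, `year`, `growth`) are hypothetical.

```r
library(dplyr)

# Toy long-format data (made-up values): each country repeats once per year
toy_pop <- tibble(
  country = rep(c("A", "B", "C"), each = 2),
  year    = rep(c(1970L, 2022L), times = 3),
  growth  = rep(c(1.03, 0.98, 1.01), each = 2)
)

# One row per country, regardless of how many years each country has
per_country <- toy_pop %>% distinct(country, growth)

# Highest and lowest growth rates, ties broken by row order
top2    <- per_country %>% slice_max(growth, n = 2, with_ties = FALSE)
bottom2 <- per_country %>% slice_min(growth, n = 2, with_ties = FALSE)

top2
bottom2
```

The same pattern should carry over to `world_pop` by swapping in `` `Country/Territory` `` and `` `Growth Rate` `` with `n = 10`.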

The next chunk downloads the map shapefiles and loads them into the world_map data frame using st_read(). Two sets of countries are then identified: those with the highest and those with the lowest population growth rates, named countries_with_highest_growth_rate and countries_with_lowest_growth_rate, respectively. Because each country appears in eight rows of the long data (one per year), every 8th row of the first 80 sorted rows is selected, which yields 10 distinct countries. The top_and_bottom data frame augments the world_map data with a fill column that color-codes countries by growth rate: blue for the highest, red for the lowest, and white for all others. Finally, a plot is generated using ggplot2 with countries labeled by name, and it is saved as a PNG file in the current working directory.

current_wd <- getwd()

# Download the ZIP file to the current working directory
download.file("https://github.com/autistic96/project-2/archive/refs/heads/main.zip",paste0(current_wd, "/map_shapefiles.zip"), 
              mode = "wb")

# Unzip the ZIP file
unzip("map_shapefiles.zip", exdir = "map_shapefiles_folder")

# The repository archive contains a second ZIP that holds the shapefiles; unzip it too
unzip("map_shapefiles_folder/project-2-main/map_shapefiles.zip", exdir = "map_shapefiles_folder")

# Path to the shapefile within the unzipped folder
shp_path <- "map_shapefiles_folder/map_shapefiles"



world_map <- st_read(shp_path)
## Reading layer `ne_10m_admin_0_countries' from data source 
##   `C:\Users\Guestperson\Desktop\project-2\map_shapefiles_folder\map_shapefiles' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 258 features and 168 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.6341
## Geodetic CRS:  WGS 84
# Top 10 highest growth rates ordered from high to low by country/territory
names(highest_growths) <- countries_with_highest_growth_rate
highest_growths
##  Moldova   Poland    Niger    Syria Slovakia DR Congo  Mayotte     Chad 
##   1.0691   1.0404   1.0378   1.0376   1.0359   1.0325   1.0319   1.0316 
##   Angola     Mali 
##   1.0315   1.0314
countries_with_lowest_growth_rate <- world_pop %>% 
  arrange(`Growth Rate`) %>% 
  pull(`Country/Territory`)

countries_with_lowest_growth_rate <- countries_with_lowest_growth_rate[seq(1, 80, 8)]

lowest_growths <- world_pop %>% 
  arrange(`Growth Rate`) %>% 
  pull(`Growth Rate`)

lowest_growths <- lowest_growths[seq(1, 80, 8)]

# Bottom 10 lowest growth rates ordered from low to high by country/territory
names(lowest_growths) <- countries_with_lowest_growth_rate
lowest_growths
##                Ukraine                Lebanon         American Samoa 
##                 0.9120                 0.9816                 0.9831 
##               Bulgaria              Lithuania                 Latvia 
##                 0.9849                 0.9869                 0.9876 
## Bosnia and Herzegovina       Marshall Islands                 Serbia 
##                 0.9886                 0.9886                 0.9897 
##                Croatia 
##                 0.9927
countries_with_lowest_growth_rate
##  [1] "Ukraine"                "Lebanon"                "American Samoa"        
##  [4] "Bulgaria"               "Lithuania"              "Latvia"                
##  [7] "Bosnia and Herzegovina" "Marshall Islands"       "Serbia"                
## [10] "Croatia"
countries_with_highest_growth_rate
##  [1] "Moldova"  "Poland"   "Niger"    "Syria"    "Slovakia" "DR Congo"
##  [7] "Mayotte"  "Chad"     "Angola"   "Mali"
# Add a fill column to your world_map data
top_and_bottom <- world_map %>% 
  mutate(fill = case_when(
    `NAME` %in% countries_with_highest_growth_rate ~ "blue",
    `NAME` %in% countries_with_lowest_growth_rate ~ "red",
    TRUE ~ "white"
  ))

# Generate the plot
p <- ggplot(data = top_and_bottom) + 
  geom_sf(aes(fill = fill)) + 
  geom_sf_text(aes(label = NAME), check_overlap = TRUE) +  # Add labels
  ggtitle("Map of World") +
  scale_fill_identity()

# Save the plot
ggsave("top_and_bottom_10_with_labels.png", plot = p, width = 44, height = 40)
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
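
Matching on `NAME` only colors a country when the map and the population table spell its name identically, which may not hold for entries such as "DR Congo". One hedged alternative is to match on three-letter codes instead. The sketch below uses plain toy tables; the map-side column name `ADM0_A3` and the spellings shown are assumptions, so check `names(world_map)` and the actual values before relying on them.

```r
library(dplyr)

# Toy stand-ins (hypothetical rows): the map abbreviates a name that the
# population table spells differently, but both share a three-letter code
map_attrs <- tibble(
  NAME    = c("Dem. Rep. Congo", "Ukraine", "Canada"),
  ADM0_A3 = c("COD", "UKR", "CAN")
)
highest_names <- c("DR Congo")  # name as spelled in the population table
highest_codes <- c("COD")       # CCA3 codes of the highest-growth countries
lowest_codes  <- c("UKR")       # CCA3 codes of the lowest-growth countries

# Name matching silently misses the differently spelled country
by_name <- map_attrs %>%
  mutate(fill = case_when(NAME %in% highest_names ~ "blue", TRUE ~ "white"))

# Code matching does not depend on spelling
by_code <- map_attrs %>%
  mutate(fill = case_when(
    ADM0_A3 %in% highest_codes ~ "blue",
    ADM0_A3 %in% lowest_codes  ~ "red",
    TRUE                       ~ "white"
  ))

by_name
by_code
```

In the real data, the code vectors could be pulled from the `CCA3` column of `world_pop` in the same way the country names were pulled.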
# Plot of the top 10 countries/territories with the highest population growth rate
world_pop %>%
  filter(Year == "2022" & `Country/Territory` %in% countries_with_highest_growth_rate) %>%
  ggplot(aes(x = reorder(`Country/Territory`, -Population), y = Population)) + 
  geom_bar(stat="identity") +
  ggtitle("Top 10 Countries (Highest Growth Rate) in 2022") +
  xlab("Country/Territory") +
  ylab("Population") +
  theme_minimal() +
  coord_flip()

# Plot of bottom 10 countries/territories with the lowest population growth rate
world_pop %>%
  filter(Year == "2022" & `Country/Territory` %in% countries_with_lowest_growth_rate) %>%
  ggplot(aes(x = reorder(`Country/Territory`, Population), y = Population)) + 
  geom_bar(stat="identity") +
  ggtitle("Bottom 10 Countries (Lowest Growth Rate) in 2022") +
  xlab("Country/Territory") +
  ylab("Population") +
  theme_minimal() +
  coord_flip()

# Keep only the 2022 rows (Year is still a character column here) and sort by population
recent_pop_data <- world_pop %>%
  filter(Year == "2022") %>%
  arrange(desc(Population))

# Top 10 Largest population in 2022 ordered from high to low
head(recent_pop_data, n = 10)
## # A tibble: 10 × 11
##     Rank CCA3  `Country/Territory` Capital          Continent     `Area (km)`
##    <int> <chr> <chr>               <chr>            <chr>               <int>
##  1     1 CHN   China               Beijing          Asia              9706961
##  2     2 IND   India               New Delhi        Asia              3287590
##  3     3 USA   United States       Washington, D.C. North America     9372610
##  4     4 IDN   Indonesia           Jakarta          Asia              1904569
##  5     5 PAK   Pakistan            Islamabad        Asia               881912
##  6     6 NGA   Nigeria             Abuja            Africa             923768
##  7     7 BRA   Brazil              Brasilia         South America     8515767
##  8     8 BGD   Bangladesh          Dhaka            Asia               147570
##  9     9 RUS   Russia              Moscow           Europe           17098242
## 10    10 MEX   Mexico              Mexico City      North America     1964375
## # ℹ 5 more variables: `Density per km` <dbl>, `Growth Rate` <dbl>,
## #   `World Population Percentage` <dbl>, Year <chr>, Population <int>
# Bottom 10 Smallest Population
tail(recent_pop_data, n = 10)
## # A tibble: 10 × 11
##     Rank CCA3  `Country/Territory`       Capital      Continent     `Area (km)`
##    <int> <chr> <chr>                     <chr>        <chr>               <int>
##  1   225 NRU   Nauru                     Yaren        Oceania                21
##  2   226 WLF   Wallis and Futuna         Mata-Utu     Oceania               142
##  3   227 TUV   Tuvalu                    Funafuti     Oceania                26
##  4   228 BLM   Saint Barthelemy          Gustavia     North America          21
##  5   229 SPM   Saint Pierre and Miquelon Saint-Pierre North America         242
##  6   230 MSR   Montserrat                Brades       North America         102
##  7   231 FLK   Falkland Islands          Stanley      South America       12173
##  8   232 NIU   Niue                      Alofi        Oceania               260
##  9   233 TKL   Tokelau                   Nukunonu     Oceania                12
## 10   234 VAT   Vatican City              Vatican City Europe                  1
## # ℹ 5 more variables: `Density per km` <dbl>, `Growth Rate` <dbl>,
## #   `World Population Percentage` <dbl>, Year <chr>, Population <int>
# Plot of total population by year for each continent
# Asia has the biggest increase in population

world_pop$Year <- as.numeric(world_pop$Year)

# Group by Year and Continent and then sum the Population
world_pop_summary <- world_pop %>%
  group_by(Year, Continent) %>%
  summarise(Total_Population = sum(Population))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Create the ggplot line graph
ggplot(data = world_pop_summary, aes(x = Year, y = Total_Population, color = Continent)) +
  geom_line(linewidth = 1) +
  ggtitle("World Population Over Time by Continent") +
  xlab("Year") +
  ylab("Total Population") +
  theme_minimal()
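
The line graph gives a visual impression of which continent grew the most; the change can also be computed directly. This is a minimal sketch on made-up totals, assuming a table shaped like world_pop_summary (columns `Year`, `Continent`, `Total_Population`). The continent labels and numbers below are hypothetical.

```r
library(dplyr)

# Toy summary table (made-up totals) with the same columns as world_pop_summary
toy_summary <- tibble(
  Year             = rep(c(1970, 2022), times = 2),
  Continent        = rep(c("X", "Y"), each = 2),
  Total_Population = c(100, 250, 80, 120)
)

# Compare the last year with the first year for each continent
continent_change <- toy_summary %>%
  group_by(Continent) %>%
  summarise(
    absolute_increase = Total_Population[Year == max(Year)] -
      Total_Population[Year == min(Year)],
    relative_increase = Total_Population[Year == max(Year)] /
      Total_Population[Year == min(Year)],
    .groups = "drop"
  ) %>%
  arrange(desc(absolute_increase))

continent_change
```

Reporting both the absolute and the relative change helps, because the continent that adds the most people is not necessarily the one that grows fastest in proportion to its size.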

##Conclusion

After tidying and analyzing the World Population Dataset, several key insights emerged. Growth rates vary widely: the highest is 1.0691 (Moldova) and the lowest is 0.9120 (Ukraine), and all of the bottom 10 countries have a growth rate below 1, which indicates a shrinking population. This information could be crucial for policymakers in these nations. We also plotted the 2022 population sizes of the 10 countries with the highest growth rates and the 10 with the lowest. These visualizations give a snapshot of global population dynamics and of how much they differ from one nation to another.

Furthermore, we looked at population trends over time by continent. The line graph showed that Asia has had the largest increase in population over the years. This could have various socio-economic implications, such as increased demand for resources and potential strain on public services in densely populated regions.