Overview

For this analysis, I cleaned and analyzed a messy “untidy” data set from the World Bank. The data set includes a vast array of data in various hard to analyze formats. For the sake of this analysis, we’ll focus our efforts on cleaning and prepping the data and then analyzing the trends in urban population growth across different countries using world development indicators from 2014 to 2023.

The dataset utilized for this project contains various economic and social indicators for multiple countries. However, for this analysis, we will concentrate on the urban population growth indicator (SP.URB.GROW), which tracks the annual growth rate of urban populations as a percentage.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(stringr)

Importing and Cleaning the Data

First, we’ll need to import the data set.

world_df <- read_csv("https://raw.githubusercontent.com/awrubes/Data607_Project2/main/worldindicators.csv")
## Rows: 1085 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Country Name, Country Code, Series Name, Series Code, 2014 [YR2014...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(world_df)
## # A tibble: 6 × 14
##   `Country Name` `Country Code` `Series Name`      `Series Code` `2014 [YR2014]`
##   <chr>          <chr>          <chr>              <chr>         <chr>          
## 1 Argentina      ARG            Adolescent fertil… SP.ADO.TFRT   67.791         
## 2 Argentina      ARG            Agriculture, fore… NV.AGR.TOTL.… 6.712703514285…
## 3 Argentina      ARG            Annual freshwater… ER.H2O.FWTL.… 12.90753424657…
## 4 Argentina      ARG            Births attended b… SH.STA.BRTC.… 99.6           
## 5 Argentina      ARG            CO2 emissions (me… EN.ATM.CO2E.… 4.209111894913…
## 6 Argentina      ARG            Contraceptive pre… SP.DYN.CONU.… ..             
## # ℹ 9 more variables: `2015 [YR2015]` <chr>, `2016 [YR2016]` <chr>,
## #   `2017 [YR2017]` <chr>, `2018 [YR2018]` <chr>, `2019 [YR2019]` <chr>,
## #   `2020 [YR2020]` <chr>, `2021 [YR2021]` <chr>, `2022 [YR2022]` <chr>,
## #   `2023 [YR2023]` <chr>

As you can see, before being able to perform any analysis, the data requires significant cleaning and preparation to ensure accuracy and usability. The key steps in the data cleaning process are outlined below:

  1. Redundant Column Removal:
    The Country Code column was removed as it was redundant for the analysis. The Country Name column sufficiently identified each country.

  2. Missing Data:
    Rows that contained missing values across all columns were filtered out.

  3. Removal of Non-Data Rows:
    The data set contained two non-data rows at the end, which were removed.

  4. Pivoting the Data:
    The data set, which was originally in wide format with each year represented as a separate column, was transformed into long format using the pivot_longer() function. This step made it easier to work with time-series data and perform calculations.

  5. Filtering and Reformatting:
    Some entries contained placeholder values like “…” which were filtered out. Additionally, the year values were reformatted to only include the four-digit year, simplifying further analysis.

  6. Focus on Urban Growth:
    I filtered the data set to focus exclusively on the urban growth rate indicator (SP.URB.GROW). This allowed for a targeted analysis of urban population growth trends.

#get rid of country code, is redundant
world_df_col <- world_df %>%
    select(-`Country Code`)

#remove rows missing values
world_df_missing <- world_df_col %>%
  filter(!if_all(everything(), is.na))

#remove last two rows
world_df_last <- world_df_missing[-((nrow(world_df_missing)-1):nrow(world_df_missing)), ]

#pivot longer
world_df_pivot <- world_df_last %>%
  pivot_longer(
    cols = `2014 [YR2014]`:`2023 [YR2023]`,
    names_to="Year",
    values_to="Indicator"
  )

#filter out "..." values and reformat the years
world_df_filter <- world_df_pivot %>%
  filter(!grepl("\\.\\.", Indicator))%>%
  mutate(
    Year = sub("^([0-9]{4}).*", "\\1", Year)
  )

#look at urban growth, filter out unneeded columns
world_df_urban <- world_df_filter %>%
  filter(grepl("^SP\\.URB\\.GROW$", `Series Code`))

head(world_df_urban)
## # A tibble: 6 × 5
##   `Country Name` `Series Name`                     `Series Code` Year  Indicator
##   <chr>          <chr>                             <chr>         <chr> <chr>    
## 1 Argentina      Urban population growth (annual … SP.URB.GROW   2014  1.239638…
## 2 Argentina      Urban population growth (annual … SP.URB.GROW   2015  1.215797…
## 3 Argentina      Urban population growth (annual … SP.URB.GROW   2016  1.192602…
## 4 Argentina      Urban population growth (annual … SP.URB.GROW   2017  1.170195…
## 5 Argentina      Urban population growth (annual … SP.URB.GROW   2018  1.147602…
## 6 Argentina      Urban population growth (annual … SP.URB.GROW   2019  1.125019…

After making these necessary steps to prep and clean the data for analysis, we can look more closely at the urban growth rates by country. I though it would be interesting to calculate the average growth rate for each country and compare the top 5 and bottom 5 countries to see how drastically different those averages might be.

# calculate average growth rate by country making sure columns are correct type
world_df_num <- world_df_urban %>%
  mutate(Indicator_Numeric = as.numeric(Indicator))%>%
  select(-Indicator)%>%
  mutate(`Country Name` = trimws(`Country Name`))

avg_growth_by_country <- world_df_num %>%
  group_by(`Country Name`) %>%
  summarize(Average_Growth = base::mean(Indicator_Numeric, na.rm=TRUE))%>%
  arrange(desc(Average_Growth))

head(avg_growth_by_country)
## # A tibble: 6 × 2
##   `Country Name` Average_Growth
##   <chr>                   <dbl>
## 1 China                    2.31
## 2 India                    2.29
## 3 Indonesia                2.11
## 4 Saudi Arabia             1.86
## 5 Turkiye                  1.80
## 6 Australia                1.54
# Extract top 5 countries
top_5_countries <- avg_growth_by_country %>%
  arrange(desc(Average_Growth)) %>%
  slice_head(n = 5)

# Extract bottom 5 countries
bottom_5_countries <- avg_growth_by_country %>%
  arrange(Average_Growth) %>%
  slice_head(n = 5)

Visualization and Analysis

The analysis revealed significant variation in urban growth rates across countries, with some nations experiencing rapid urbanization while others exhibiting much slower growth.

The top five countries with the highest average urban growth rates and the bottom five countries with the lowest rates are highlighted below. These countries represent the extremes in urbanization trends and provide interesting insight into the global patterns of urban population shifts.

# Combine for easy plotting
top_bottom_countries <- bind_rows(top_5_countries, bottom_5_countries)

ggplot(top_bottom_countries, aes(x = reorder(`Country Name`, Average_Growth), y = Average_Growth, fill = Average_Growth > 1)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip coordinates 
  labs(x = "Country", y = "Average Urban Growth Rate (Annual %)", title = "Top 5 and Bottom 5 Countries by Urban Growth Rate") +
  theme_minimal() +
  scale_fill_manual(values = c("red", "green"), guide = "none")