For this analysis, I cleaned and analyzed a messy “untidy” data set from the World Bank. The data set includes a vast array of data in various hard to analyze formats. For the sake of this analysis, we’ll focus our efforts on cleaning and prepping the data and then analyzing the trends in urban population growth across different countries using world development indicators from 2014 to 2023.
The dataset utilized for this project contains various economic and
social indicators for multiple countries. However, for this analysis, we
will concentrate on the urban population growth indicator
(SP.URB.GROW
), which tracks the annual growth rate of urban
populations as a percentage.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(stringr)
First, we’ll need to import the data set.
world_df <- read_csv("https://raw.githubusercontent.com/awrubes/Data607_Project2/main/worldindicators.csv")
## Rows: 1085 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Country Name, Country Code, Series Name, Series Code, 2014 [YR2014...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(world_df)
## # A tibble: 6 × 14
## `Country Name` `Country Code` `Series Name` `Series Code` `2014 [YR2014]`
## <chr> <chr> <chr> <chr> <chr>
## 1 Argentina ARG Adolescent fertil… SP.ADO.TFRT 67.791
## 2 Argentina ARG Agriculture, fore… NV.AGR.TOTL.… 6.712703514285…
## 3 Argentina ARG Annual freshwater… ER.H2O.FWTL.… 12.90753424657…
## 4 Argentina ARG Births attended b… SH.STA.BRTC.… 99.6
## 5 Argentina ARG CO2 emissions (me… EN.ATM.CO2E.… 4.209111894913…
## 6 Argentina ARG Contraceptive pre… SP.DYN.CONU.… ..
## # ℹ 9 more variables: `2015 [YR2015]` <chr>, `2016 [YR2016]` <chr>,
## # `2017 [YR2017]` <chr>, `2018 [YR2018]` <chr>, `2019 [YR2019]` <chr>,
## # `2020 [YR2020]` <chr>, `2021 [YR2021]` <chr>, `2022 [YR2022]` <chr>,
## # `2023 [YR2023]` <chr>
As you can see, before being able to perform any analysis, the data requires significant cleaning and preparation to ensure accuracy and usability. The key steps in the data cleaning process are outlined below:
Redundant Column Removal:
The Country Code
column was removed as it was redundant for
the analysis. The Country Name
column sufficiently
identified each country.
Missing Data:
Rows that contained missing values across all columns were filtered
out.
Removal of Non-Data Rows:
The data set contained two non-data rows at the end, which were
removed.
Pivoting the Data:
The data set, which was originally in wide format with each year
represented as a separate column, was transformed into long format using
the pivot_longer()
function. This step made it easier to
work with time-series data and perform calculations.
Filtering and Reformatting:
Some entries contained placeholder values like “…” which were filtered
out. Additionally, the year values were reformatted to only include the
four-digit year, simplifying further analysis.
Focus on Urban Growth:
I filtered the data set to focus exclusively on the urban growth rate
indicator (SP.URB.GROW
). This allowed for a targeted
analysis of urban population growth trends.
#get rid of country code, is redundant
world_df_col <- world_df %>%
select(-`Country Code`)
#remove rows missing values
world_df_missing <- world_df_col %>%
filter(!if_all(everything(), is.na))
#remove last two rows
world_df_last <- world_df_missing[-((nrow(world_df_missing)-1):nrow(world_df_missing)), ]
#pivot longer
world_df_pivot <- world_df_last %>%
pivot_longer(
cols = `2014 [YR2014]`:`2023 [YR2023]`,
names_to="Year",
values_to="Indicator"
)
#filter out "..." values and reformat the years
world_df_filter <- world_df_pivot %>%
filter(!grepl("\\.\\.", Indicator))%>%
mutate(
Year = sub("^([0-9]{4}).*", "\\1", Year)
)
#look at urban growth, filter out unneeded columns
world_df_urban <- world_df_filter %>%
filter(grepl("^SP\\.URB\\.GROW$", `Series Code`))
head(world_df_urban)
## # A tibble: 6 × 5
## `Country Name` `Series Name` `Series Code` Year Indicator
## <chr> <chr> <chr> <chr> <chr>
## 1 Argentina Urban population growth (annual … SP.URB.GROW 2014 1.239638…
## 2 Argentina Urban population growth (annual … SP.URB.GROW 2015 1.215797…
## 3 Argentina Urban population growth (annual … SP.URB.GROW 2016 1.192602…
## 4 Argentina Urban population growth (annual … SP.URB.GROW 2017 1.170195…
## 5 Argentina Urban population growth (annual … SP.URB.GROW 2018 1.147602…
## 6 Argentina Urban population growth (annual … SP.URB.GROW 2019 1.125019…
After making these necessary steps to prep and clean the data for analysis, we can look more closely at the urban growth rates by country. I though it would be interesting to calculate the average growth rate for each country and compare the top 5 and bottom 5 countries to see how drastically different those averages might be.
# calculate average growth rate by country making sure columns are correct type
world_df_num <- world_df_urban %>%
mutate(Indicator_Numeric = as.numeric(Indicator))%>%
select(-Indicator)%>%
mutate(`Country Name` = trimws(`Country Name`))
avg_growth_by_country <- world_df_num %>%
group_by(`Country Name`) %>%
summarize(Average_Growth = base::mean(Indicator_Numeric, na.rm=TRUE))%>%
arrange(desc(Average_Growth))
head(avg_growth_by_country)
## # A tibble: 6 × 2
## `Country Name` Average_Growth
## <chr> <dbl>
## 1 China 2.31
## 2 India 2.29
## 3 Indonesia 2.11
## 4 Saudi Arabia 1.86
## 5 Turkiye 1.80
## 6 Australia 1.54
# Extract top 5 countries
top_5_countries <- avg_growth_by_country %>%
arrange(desc(Average_Growth)) %>%
slice_head(n = 5)
# Extract bottom 5 countries
bottom_5_countries <- avg_growth_by_country %>%
arrange(Average_Growth) %>%
slice_head(n = 5)
The analysis revealed significant variation in urban growth rates across countries, with some nations experiencing rapid urbanization while others exhibiting much slower growth.
The top five countries with the highest average urban growth rates and the bottom five countries with the lowest rates are highlighted below. These countries represent the extremes in urbanization trends and provide interesting insight into the global patterns of urban population shifts.
# Combine for easy plotting
top_bottom_countries <- bind_rows(top_5_countries, bottom_5_countries)
ggplot(top_bottom_countries, aes(x = reorder(`Country Name`, Average_Growth), y = Average_Growth, fill = Average_Growth > 1)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip coordinates
labs(x = "Country", y = "Average Urban Growth Rate (Annual %)", title = "Top 5 and Bottom 5 Countries by Urban Growth Rate") +
theme_minimal() +
scale_fill_manual(values = c("red", "green"), guide = "none")