Project_2_Birth_Rate_Data

Author

Khandker Qaiduzzaman

Objective

Transform the wide-format birth rate dataset into a tidy (long) format using the tidyr and dplyr packages in R. Once the data has been cleaned up, examine birth rate trends over time and across nations to find trends and variations.

Approach

Dataset description

I’ll use a dataset with crude birth rates for every nation in the world for this project. The dataset was acquired via Kaggle and comes from World Bank data. The dataset is posted by Brandon Chanderban in the discussion post. The dataset can be accessed here:

https://raw.githubusercontent.com/NafeesKhandker/Project-2-Tidy-Format-Transformation/refs/heads/main/BirthRates%20of%20Countries%20(World%20Bank%20Data).csv

Currently, the dataset is kept in a wide format in which a year is represented by each column and a country by each row. Each year’s birth rate values are kept in its own column (e.g., 1960, 1961, 1962, etc.).

The dataset does not adhere to tidy data principles because the year variable is embedded within column headers rather than being represented as a single column. Every variable should have its own column and every observation should have its own row in a neat dataset.

The dataset contains the following key variables:

  • Country Name: Name of the country
  • Country Code: Standardized country abbreviation
  • Indicator Name: Name of the indicator (Crude Birth Rate)
  • Indicator Code: Code representing the indicator
  • 1960–2023 columns: Birth rate values for each year

The year columns must be reshaped into a single column using pivot_longer() in order to prepare the dataset for analysis.

Anticipated Challenges

Managing the numerous year columns in this dataset, which span several decades, is the primary challenge. The corresponding birth rate values must be maintained while these columns are transformed into a single variable that represents the year.

Missing values in specific country-year combinations could present another difficulty. To maintain the accuracy and interpretability of the analysis, these missing values must be handled carefully during the tidying process.

Implementation of Data Import

The following code shows the data in a tabular format. As we can see a significant amount of work is needed to tidy this data. It has metadata in the first rows. The actual meaningful data starts from the 5th row. Also, there are some null values present in this data which needs to be handled. The years should be listed as rows rather than columns so that the statistical analysis becomes easier.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(readr)
library(gt)

url <- "https://raw.githubusercontent.com/NafeesKhandker/Project-2-Tidy-Format-Transformation/refs/heads/main/Birth%20Rates%20of%20Countries%20(World%20Bank%20Data).csv"

df <- read_csv(url, skip = 4, show_col_types = FALSE) %>%
  select(-any_of(c("Indicator Name", "Indicator Code")))

df |> 
  head(n = 5) |> 
  gt()
Country Name Country Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
Aruba ABW 33.88300 32.83100 31.64900 30.41600 29.14700 27.88900 26.66300 25.50300 24.59200 23.73500 22.97400 22.31300 21.76600 21.49200 21.38200 21.3930 21.48500 21.73900 21.92000 21.99300 21.93100 21.73000 21.47600 21.38500 21.18300 20.91900 20.61700 20.26900 19.82100 19.18400 18.66200 17.72200 16.44300 16.12600 15.43100 15.99100 16.15300 16.38800 15.07800 14.36100 14.42700 13.73900 12.99200 12.62100 11.92100 12.34800 13.05500 12.96200 12.74800 12.35000 12.19300 12.24600 12.72300 13.31600 13.53300 12.42800 12.30000 11.53000 9.88100 9.13800 8.10200 7.19300 NA
Africa Eastern and Southern AFE 47.43855 47.53055 47.59756 47.63614 47.64548 47.66766 47.69789 47.69133 47.69102 47.72112 47.67313 47.64967 47.47074 47.22113 47.07547 46.9502 46.79183 46.63214 46.51202 46.47196 46.33961 46.23755 46.15826 46.13473 46.13520 46.14379 46.06749 45.83058 45.33736 44.81301 44.23072 43.84232 43.34168 42.96601 42.53329 42.48572 42.13563 41.57346 41.12879 40.89482 40.52824 40.34121 40.04732 39.75014 39.57589 39.40739 39.23711 39.00052 38.85169 38.36494 37.94026 37.48399 36.92130 36.44714 36.02832 35.61331 35.18902 34.89254 34.61102 34.34145 33.91675 33.54627 NA
Afghanistan AFG 50.34000 50.44300 50.57000 50.70300 50.83100 50.87200 50.98600 51.08100 51.14800 51.19500 51.12200 51.16300 51.10900 51.11400 51.13500 51.0180 50.93500 50.92100 50.81600 50.73700 50.48200 50.26400 50.13800 50.13900 50.23500 50.55300 50.72800 50.84500 50.98000 51.16200 51.42300 51.78800 51.94800 52.03800 52.17400 52.07300 51.87300 51.40000 50.88000 50.35100 49.66400 48.97900 48.20100 47.35000 46.33000 45.26300 44.72100 43.85800 41.50600 41.15700 40.60200 39.85500 40.00900 39.60100 39.10500 38.80300 37.93600 37.34200 36.92700 36.46600 36.05100 35.84200 NA
Africa Western and Central AFW 47.32548 47.42105 47.52922 47.53103 47.51192 47.46857 47.44364 47.42593 47.42235 47.41269 47.41411 47.52970 47.57899 47.63283 47.81713 47.9115 47.86907 47.96894 48.03727 47.93830 47.77071 47.51406 47.25192 47.11112 46.70656 46.20665 45.72924 45.34627 45.00171 44.92848 44.67619 44.47423 44.30932 44.16810 43.94269 43.73024 43.49103 43.21922 43.02697 43.17424 43.19955 43.07550 42.92712 42.74688 42.50272 42.42154 42.19330 41.94301 41.75479 41.50376 41.21963 40.89424 40.41643 39.85651 39.33535 38.85921 38.39310 37.88166 37.44709 37.02783 36.61573 36.23703 NA
Angola AGO 51.02600 51.28200 51.31600 51.32300 51.28200 51.28200 51.29500 51.31400 51.34800 51.35300 51.26700 50.69800 50.47400 50.46700 50.47200 50.4690 50.51400 50.52300 50.61600 50.73200 50.89200 51.10900 51.30700 51.61000 51.93500 52.13600 52.19000 52.14600 51.97300 51.69900 51.34400 50.92600 50.37400 49.89300 49.55000 49.18500 48.86000 48.41200 48.00900 47.77300 47.64700 47.57400 47.44800 47.22600 47.09900 46.94400 46.64300 46.29000 45.88900 45.49500 44.97000 44.36400 43.86000 43.28200 42.67600 42.02000 41.37700 40.81000 40.23600 39.72500 39.27100 38.80900 NA

Transform the Data from Wide to Tidy Format

df_tidy <- df %>%
  pivot_longer(
    cols = -c(`Country Name`, `Country Code`),
    names_to = "Year",
    values_to = "Birth_Rate"
  ) %>%
  mutate(Year = as.numeric(Year))

df_tidy |> 
  head(n = 5) |>
  gt()
Country Name Country Code Year Birth_Rate
Aruba ABW 1960 33.883
Aruba ABW 1961 32.831
Aruba ABW 1962 31.649
Aruba ABW 1963 30.416
Aruba ABW 1964 29.147

Country-Level Birth Rate Data Availability Summary

This table summarizes the availability of birth rate data for each country in the dataset. For every country, the first and last years with recorded birth rate values are identified, and the total number of years in that period is calculated. The number of years with available birth rate data is then counted and used to compute the percentage of data availability. This summary helps evaluate the completeness of each country’s data and supports decisions about whether a country should be retained or excluded from further analysis based on the defined availability threshold.

df_country_stats <- df_tidy %>%
  group_by(`Country Name`) %>%
  summarise(
    first_year = min(Year[!is.na(Birth_Rate)]),
    last_year = max(Year[!is.na(Birth_Rate)]),
    total_years = last_year - first_year + 1,
    available_years = sum(!is.na(Birth_Rate)),
    availability_pct = available_years / total_years,
    .groups = "drop"
  )
Warning: There were 2 warnings in `summarise()`.
The first warning was:
ℹ In argument: `first_year = min(Year[!is.na(Birth_Rate)])`.
ℹ In group 184: `Country Name = "Not classified"`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
df_country_stats |> 
  head(n = 5) |> 
  gt()
Country Name first_year last_year total_years available_years availability_pct
Afghanistan 1960 2021 62 62 1
Africa Eastern and Southern 1960 2021 62 62 1
Africa Western and Central 1960 2021 62 62 1
Albania 1960 2021 62 62 1
Algeria 1960 2021 62 62 1

Filtering Countries Based on Data Availability Threshold

In this step, countries with sufficient birth rate data are selected for further analysis. Using the previously calculated data availability statistics, only countries with at least 30% of birth rate observations available within their observed time range are retained. The names of these valid countries are extracted and used to filter the original dataset. The resulting table contains birth rate records only for countries that meet the minimum data availability threshold, ensuring that subsequent analysis and imputation are performed on datasets with adequate information.

valid_countries <- df_country_stats %>%
  filter(availability_pct >= 0.30) %>%
  pull(`Country Name`)

df_filtered <- df_tidy %>%
  filter(`Country Name` %in% valid_countries)

df_filtered |> 
  head(n = 5) |> 
  gt()
Country Name Country Code Year Birth_Rate
Aruba ABW 1960 33.883
Aruba ABW 1961 32.831
Aruba ABW 1962 31.649
Aruba ABW 1963 30.416
Aruba ABW 1964 29.147

Imputing Missing Birth Rate Values

This step addresses the remaining missing birth rate values in the filtered dataset using interpolation and directional filling techniques. For each country, the data is first sorted chronologically by year. Missing values within the observed range are then estimated using linear interpolation, which calculates intermediate values based on surrounding data points. After interpolation, any remaining missing values at the beginning or end of a country’s time series are filled using forward and backward filling (downup). This process produces a more complete dataset while preserving the overall trend of birth rates over time for each country.

df_imputed <- df_filtered %>%
  group_by(`Country Name`) %>%
  arrange(Year) %>%
  mutate(
    Birth_Rate = approx(Year, Birth_Rate, Year, rule = 1)$y
  ) %>%
  fill(Birth_Rate, .direction = "downup") %>%
  ungroup()

df_imputed |> 
  head(n = 5) |> 
  gt()
Country Name Country Code Year Birth_Rate
Aruba ABW 1960 33.88300
Africa Eastern and Southern AFE 1960 47.43855
Afghanistan AFG 1960 50.34000
Africa Western and Central AFW 1960 47.32548
Angola AGO 1960 51.02600

Summary Statistics of Imputed Birth Rates by Country

This step calculates key summary statistics for the imputed birth rate dataset. For each country, the mean, minimum, and maximum birth rates are computed across all years. The results are then sorted in descending order of mean birth rate to highlight countries with the highest average values. The table is formatted with clear labels and units, providing an easily interpretable overview of birth rate trends and variability across countries. This summary supports both visual analysis and comparison between countries.

summary_table <- df_imputed %>%
  group_by(`Country Name`) %>%
  summarise(
    Mean_Birth_Rate = mean(Birth_Rate, na.rm = TRUE),
    Min_Birth_Rate = min(Birth_Rate, na.rm = TRUE),
    Max_Birth_Rate = max(Birth_Rate, na.rm = TRUE),
    .groups = "drop"
  )


summary_sorted <- summary_table %>%
  arrange(desc(Mean_Birth_Rate))

summary_sorted |> 
  head(n = 5) |>
  gt() |>
  tab_header(
    title = "Summary Statistics of Birth Rates by Country",
    subtitle = "Imputed table"
  ) |>
  cols_label(
    `Country Name` = "Country",
    Mean_Birth_Rate = "Mean Birth Rate (per 1,000 people)",
    Min_Birth_Rate = "Minimum Birth Rate",
    Max_Birth_Rate = "Maximum Birth Rate"
  )
Summary Statistics of Birth Rates by Country
Imputed table
Country Mean Birth Rate (per 1,000 people) Minimum Birth Rate Maximum Birth Rate
Niger 53.13573 45.293 58.121
Chad 48.49152 43.357 53.521
Angola 48.23556 38.809 52.190
Mali 47.70270 41.643 50.654
Afghanistan 47.47814 35.842 52.174

Interactive Horizontal Bar Chart of Mean Birth Rates by Country

This step visualizes the mean birth rates across countries using an interactive horizontal bar chart. Countries are sorted in descending order of mean birth rate, and each bar is color-coded for clarity. Hovering over a bar displays the country name and its exact mean birth rate per 1,000 people. The interactive chart allows scrolling and panning to accommodate the large number of countries, making it easy to compare birth rates and identify countries with the highest and lowest averages.

library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
plot_ly(
  data = summary_sorted,
  x = ~Mean_Birth_Rate,
  y = ~reorder(`Country Name`, Mean_Birth_Rate),
  type = "bar",
  orientation = "h",
  marker = list(color = "#2C7FB8"),
  hovertemplate = paste(
    "<b>%{y}</b><br>",
    "Mean Birth Rate: %{x:.2f} per 1,000 people",
    "<extra></extra>"
  )
) %>%
  layout(
    title = list(
      text = "Mean Birth Rate by Country",
      x = 0.5
    ),
    xaxis = list(
      title = "Mean Birth Rate (per 1,000 people)"
    ),
    yaxis = list(
      title = "Country",
      automargin = TRUE,
      categoryorder = "total ascending"
    ),
    height = 1500,
    margin = list(l = 150, r = 40, t = 80, b = 60)
  )
Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()

Conclusion

The analysis provides a comprehensive overview of global birth rate trends. After filtering countries with insufficient data and imputing missing values, summary statistics revealed differences in mean, minimum, and maximum birth rates across countries. The interactive bar chart highlights countries with the highest and lowest average birth rates, making comparisons straightforward. Overall, the workflow ensures reliable, complete, and interpretable data for understanding global birth rate patterns.

References

  1. OpenAI. (2026, March 8). ChatGPT conversation with K. M. Qaiduzzaman on birth rate data analysis in R. OpenAI. https://chat.openai.com/