Project_2_Birth_Rate_Data

Author

Khandker Qaiduzzaman

Objective

Transform the wide-format birth rate dataset into a tidy (long) format using the tidyr and dplyr packages in R. Once the data has been cleaned up, examine birth rate trends over time and across nations to find trends and variations.

Approach

Dataset description

I’ll use a dataset with crude birth rates for every nation in the world for this project. The dataset was acquired via Kaggle and comes from World Bank data. The dataset is posted by Brandon Chanderban in the discussion post. The dataset can be accessed here:

https://raw.githubusercontent.com/NafeesKhandker/Project-2-Tidy-Format-Transformation/refs/heads/main/BirthRates%20of%20Countries%20(World%20Bank%20Data).csv

Currently, the dataset is kept in a wide format in which a year is represented by each column and a country by each row. Each year’s birth rate values are kept in its own column (e.g., 1960, 1961, 1962, etc.).

The dataset does not adhere to tidy data principles because the year variable is embedded within column headers rather than being represented as a single column. Every variable should have its own column and every observation should have its own row in a neat dataset.

The dataset contains the following key variables:

Country Name: Name of the country
Country Code: Standardized country abbreviation
Indicator Name: Name of the indicator (Crude Birth Rate)
Indicator Code: Code representing the indicator
1960–2023 columns: Birth rate values for each year

The year columns must be reshaped into a single column using pivot_longer() in order to prepare the dataset for analysis.

Anticipated Challenges

Managing the numerous year columns in this dataset, which span several decades, is the primary challenge. The corresponding birth rate values must be maintained while these columns are transformed into a single variable that represents the year.

Missing values in specific country-year combinations could present another difficulty. To maintain the accuracy and interpretability of the analysis, these missing values must be handled carefully during the tidying process.

Implementation of Data Import

The following code shows the data in a tabular format. As we can see a significant amount of work is needed to tidy this data. It has metadata in the first rows. The actual meaningful data starts from the 5th row. Also, there are some null values present in this data which needs to be handled. The years should be listed as rows rather than columns so that the statistical analysis becomes easier.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(readr)
library(gt)

url <- "https://raw.githubusercontent.com/NafeesKhandker/Project-2-Tidy-Format-Transformation/refs/heads/main/Birth%20Rates%20of%20Countries%20(World%20Bank%20Data).csv"

df <- read_csv(url, skip = 4, show_col_types = FALSE) %>%
  select(-any_of(c("Indicator Name", "Indicator Code")))

df |> 
  head(n = 5) |> 
  gt()

Country Name	Country Code	1960	1961	1962	1963	1964	1965	1966	1967	1968	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	1979	1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
Aruba	ABW	33.88300	32.83100	31.64900	30.41600	29.14700	27.88900	26.66300	25.50300	24.59200	23.73500	22.97400	22.31300	21.76600	21.49200	21.38200	21.3930	21.48500	21.73900	21.92000	21.99300	21.93100	21.73000	21.47600	21.38500	21.18300	20.91900	20.61700	20.26900	19.82100	19.18400	18.66200	17.72200	16.44300	16.12600	15.43100	15.99100	16.15300	16.38800	15.07800	14.36100	14.42700	13.73900	12.99200	12.62100	11.92100	12.34800	13.05500	12.96200	12.74800	12.35000	12.19300	12.24600	12.72300	13.31600	13.53300	12.42800	12.30000	11.53000	9.88100	9.13800	8.10200	7.19300	NA
Africa Eastern and Southern	AFE	47.43855	47.53055	47.59756	47.63614	47.64548	47.66766	47.69789	47.69133	47.69102	47.72112	47.67313	47.64967	47.47074	47.22113	47.07547	46.9502	46.79183	46.63214	46.51202	46.47196	46.33961	46.23755	46.15826	46.13473	46.13520	46.14379	46.06749	45.83058	45.33736	44.81301	44.23072	43.84232	43.34168	42.96601	42.53329	42.48572	42.13563	41.57346	41.12879	40.89482	40.52824	40.34121	40.04732	39.75014	39.57589	39.40739	39.23711	39.00052	38.85169	38.36494	37.94026	37.48399	36.92130	36.44714	36.02832	35.61331	35.18902	34.89254	34.61102	34.34145	33.91675	33.54627	NA
Afghanistan	AFG	50.34000	50.44300	50.57000	50.70300	50.83100	50.87200	50.98600	51.08100	51.14800	51.19500	51.12200	51.16300	51.10900	51.11400	51.13500	51.0180	50.93500	50.92100	50.81600	50.73700	50.48200	50.26400	50.13800	50.13900	50.23500	50.55300	50.72800	50.84500	50.98000	51.16200	51.42300	51.78800	51.94800	52.03800	52.17400	52.07300	51.87300	51.40000	50.88000	50.35100	49.66400	48.97900	48.20100	47.35000	46.33000	45.26300	44.72100	43.85800	41.50600	41.15700	40.60200	39.85500	40.00900	39.60100	39.10500	38.80300	37.93600	37.34200	36.92700	36.46600	36.05100	35.84200	NA
Africa Western and Central	AFW	47.32548	47.42105	47.52922	47.53103	47.51192	47.46857	47.44364	47.42593	47.42235	47.41269	47.41411	47.52970	47.57899	47.63283	47.81713	47.9115	47.86907	47.96894	48.03727	47.93830	47.77071	47.51406	47.25192	47.11112	46.70656	46.20665	45.72924	45.34627	45.00171	44.92848	44.67619	44.47423	44.30932	44.16810	43.94269	43.73024	43.49103	43.21922	43.02697	43.17424	43.19955	43.07550	42.92712	42.74688	42.50272	42.42154	42.19330	41.94301	41.75479	41.50376	41.21963	40.89424	40.41643	39.85651	39.33535	38.85921	38.39310	37.88166	37.44709	37.02783	36.61573	36.23703	NA
Angola	AGO	51.02600	51.28200	51.31600	51.32300	51.28200	51.28200	51.29500	51.31400	51.34800	51.35300	51.26700	50.69800	50.47400	50.46700	50.47200	50.4690	50.51400	50.52300	50.61600	50.73200	50.89200	51.10900	51.30700	51.61000	51.93500	52.13600	52.19000	52.14600	51.97300	51.69900	51.34400	50.92600	50.37400	49.89300	49.55000	49.18500	48.86000	48.41200	48.00900	47.77300	47.64700	47.57400	47.44800	47.22600	47.09900	46.94400	46.64300	46.29000	45.88900	45.49500	44.97000	44.36400	43.86000	43.28200	42.67600	42.02000	41.37700	40.81000	40.23600	39.72500	39.27100	38.80900	NA

Transform the Data from Wide to Tidy Format

df_tidy <- df %>%
  pivot_longer(
    cols = -c(`Country Name`, `Country Code`),
    names_to = "Year",
    values_to = "Birth_Rate"
  ) %>%
  mutate(Year = as.numeric(Year))

df_tidy |> 
  head(n = 5) |>
  gt()

Country Name	Country Code	Year	Birth_Rate
Aruba	ABW	1960	33.883
Aruba	ABW	1961	32.831
Aruba	ABW	1962	31.649
Aruba	ABW	1963	30.416
Aruba	ABW	1964	29.147

Country-Level Birth Rate Data Availability Summary

This table summarizes the availability of birth rate data for each country in the dataset. For every country, the first and last years with recorded birth rate values are identified, and the total number of years in that period is calculated. The number of years with available birth rate data is then counted and used to compute the percentage of data availability. This summary helps evaluate the completeness of each country’s data and supports decisions about whether a country should be retained or excluded from further analysis based on the defined availability threshold.

df_country_stats <- df_tidy %>%
  group_by(`Country Name`) %>%
  summarise(
    first_year = min(Year[!is.na(Birth_Rate)]),
    last_year = max(Year[!is.na(Birth_Rate)]),
    total_years = last_year - first_year + 1,
    available_years = sum(!is.na(Birth_Rate)),
    availability_pct = available_years / total_years,
    .groups = "drop"
  )

Warning: There were 2 warnings in `summarise()`.
The first warning was:
ℹ In argument: `first_year = min(Year[!is.na(Birth_Rate)])`.
ℹ In group 184: `Country Name = "Not classified"`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

df_country_stats |> 
  head(n = 5) |> 
  gt()

Country Name	first_year	last_year	total_years	available_years	availability_pct
Afghanistan	1960	2021	62	62	1
Africa Eastern and Southern	1960	2021	62	62	1
Africa Western and Central	1960	2021	62	62	1
Albania	1960	2021	62	62	1
Algeria	1960	2021	62	62	1

Filtering Countries Based on Data Availability Threshold

In this step, countries with sufficient birth rate data are selected for further analysis. Using the previously calculated data availability statistics, only countries with at least 30% of birth rate observations available within their observed time range are retained. The names of these valid countries are extracted and used to filter the original dataset. The resulting table contains birth rate records only for countries that meet the minimum data availability threshold, ensuring that subsequent analysis and imputation are performed on datasets with adequate information.

valid_countries <- df_country_stats %>%
  filter(availability_pct >= 0.30) %>%
  pull(`Country Name`)

df_filtered <- df_tidy %>%
  filter(`Country Name` %in% valid_countries)

df_filtered |> 
  head(n = 5) |> 
  gt()

Country Name	Country Code	Year	Birth_Rate
Aruba	ABW	1960	33.883
Aruba	ABW	1961	32.831
Aruba	ABW	1962	31.649
Aruba	ABW	1963	30.416
Aruba	ABW	1964	29.147

Imputing Missing Birth Rate Values

This step addresses the remaining missing birth rate values in the filtered dataset using interpolation and directional filling techniques. For each country, the data is first sorted chronologically by year. Missing values within the observed range are then estimated using linear interpolation, which calculates intermediate values based on surrounding data points. After interpolation, any remaining missing values at the beginning or end of a country’s time series are filled using forward and backward filling (downup). This process produces a more complete dataset while preserving the overall trend of birth rates over time for each country.

df_imputed <- df_filtered %>%
  group_by(`Country Name`) %>%
  arrange(Year) %>%
  mutate(
    Birth_Rate = approx(Year, Birth_Rate, Year, rule = 1)$y
  ) %>%
  fill(Birth_Rate, .direction = "downup") %>%
  ungroup()

df_imputed |> 
  head(n = 5) |> 
  gt()

Country Name	Country Code	Year	Birth_Rate
Aruba	ABW	1960	33.88300
Africa Eastern and Southern	AFE	1960	47.43855
Afghanistan	AFG	1960	50.34000
Africa Western and Central	AFW	1960	47.32548
Angola	AGO	1960	51.02600

Summary Statistics of Imputed Birth Rates by Country

This step calculates key summary statistics for the imputed birth rate dataset. For each country, the mean, minimum, and maximum birth rates are computed across all years. The results are then sorted in descending order of mean birth rate to highlight countries with the highest average values. The table is formatted with clear labels and units, providing an easily interpretable overview of birth rate trends and variability across countries. This summary supports both visual analysis and comparison between countries.

summary_table <- df_imputed %>%
  group_by(`Country Name`) %>%
  summarise(
    Mean_Birth_Rate = mean(Birth_Rate, na.rm = TRUE),
    Min_Birth_Rate = min(Birth_Rate, na.rm = TRUE),
    Max_Birth_Rate = max(Birth_Rate, na.rm = TRUE),
    .groups = "drop"
  )


summary_sorted <- summary_table %>%
  arrange(desc(Mean_Birth_Rate))

summary_sorted |> 
  head(n = 5) |>
  gt() |>
  tab_header(
    title = "Summary Statistics of Birth Rates by Country",
    subtitle = "Imputed table"
  ) |>
  cols_label(
    `Country Name` = "Country",
    Mean_Birth_Rate = "Mean Birth Rate (per 1,000 people)",
    Min_Birth_Rate = "Minimum Birth Rate",
    Max_Birth_Rate = "Maximum Birth Rate"
  )

Country	Mean Birth Rate (per 1,000 people)	Minimum Birth Rate	Maximum Birth Rate
Summary Statistics of Birth Rates by Country
Imputed table
Niger	53.13573	45.293	58.121
Chad	48.49152	43.357	53.521
Angola	48.23556	38.809	52.190
Mali	47.70270	41.643	50.654
Afghanistan	47.47814	35.842	52.174

Interactive Horizontal Bar Chart of Mean Birth Rates by Country

This step visualizes the mean birth rates across countries using an interactive horizontal bar chart. Countries are sorted in descending order of mean birth rate, and each bar is color-coded for clarity. Hovering over a bar displays the country name and its exact mean birth rate per 1,000 people. The interactive chart allows scrolling and panning to accommodate the large number of countries, making it easy to compare birth rates and identify countries with the highest and lowest averages.

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

plot_ly(
  data = summary_sorted,
  x = ~Mean_Birth_Rate,
  y = ~reorder(`Country Name`, Mean_Birth_Rate),
  type = "bar",
  orientation = "h",
  marker = list(color = "#2C7FB8"),
  hovertemplate = paste(
    "<b>%{y}</b><br>",
    "Mean Birth Rate: %{x:.2f} per 1,000 people",
    "<extra></extra>"
  )
) %>%
  layout(
    title = list(
      text = "Mean Birth Rate by Country",
      x = 0.5
    ),
    xaxis = list(
      title = "Mean Birth Rate (per 1,000 people)"
    ),
    yaxis = list(
      title = "Country",
      automargin = TRUE,
      categoryorder = "total ascending"
    ),
    height = 1500,
    margin = list(l = 150, r = 40, t = 80, b = 60)
  )

Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()

Conclusion

The analysis provides a comprehensive overview of global birth rate trends. After filtering countries with insufficient data and imputing missing values, summary statistics revealed differences in mean, minimum, and maximum birth rates across countries. The interactive bar chart highlights countries with the highest and lowest average birth rates, making comparisons straightforward. Overall, the workflow ensures reliable, complete, and interpretable data for understanding global birth rate patterns.

References

OpenAI. (2026, March 8). ChatGPT conversation with K. M. Qaiduzzaman on birth rate data analysis in R. OpenAI. https://chat.openai.com/