Transform the wide-format birth rate dataset into a tidy (long) format using the tidyr and dplyr packages in R. Once the data has been cleaned up, examine birth rate trends over time and across nations to find trends and variations.
Approach
Dataset description
I’ll use a dataset with crude birth rates for every nation in the world for this project. The dataset was acquired via Kaggle and comes from World Bank data. The dataset is posted by Brandon Chanderban in the discussion post. The dataset can be accessed here:
Currently, the dataset is kept in a wide format in which a year is represented by each column and a country by each row. Each year’s birth rate values are kept in its own column (e.g., 1960, 1961, 1962, etc.).
The dataset does not adhere to tidy data principles because the year variable is embedded within column headers rather than being represented as a single column. Every variable should have its own column and every observation should have its own row in a neat dataset.
The dataset contains the following key variables:
Country Name: Name of the country
Country Code: Standardized country abbreviation
Indicator Name: Name of the indicator (Crude Birth Rate)
Indicator Code: Code representing the indicator
1960–2023 columns: Birth rate values for each year
The year columns must be reshaped into a single column using pivot_longer() in order to prepare the dataset for analysis.
Anticipated Challenges
Managing the numerous year columns in this dataset, which span several decades, is the primary challenge. The corresponding birth rate values must be maintained while these columns are transformed into a single variable that represents the year.
Missing values in specific country-year combinations could present another difficulty. To maintain the accuracy and interpretability of the analysis, these missing values must be handled carefully during the tidying process.
Implementation of Data Import
The following code shows the data in a tabular format. As we can see a significant amount of work is needed to tidy this data. It has metadata in the first rows. The actual meaningful data starts from the 5th row. Also, there are some null values present in this data which needs to be handled. The years should be listed as rows rather than columns so that the statistical analysis becomes easier.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Country-Level Birth Rate Data Availability Summary
This table summarizes the availability of birth rate data for each country in the dataset. For every country, the first and last years with recorded birth rate values are identified, and the total number of years in that period is calculated. The number of years with available birth rate data is then counted and used to compute the percentage of data availability. This summary helps evaluate the completeness of each country’s data and supports decisions about whether a country should be retained or excluded from further analysis based on the defined availability threshold.
Warning: There were 2 warnings in `summarise()`.
The first warning was:
ℹ In argument: `first_year = min(Year[!is.na(Birth_Rate)])`.
ℹ In group 184: `Country Name = "Not classified"`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
df_country_stats |>head(n =5) |>gt()
Country Name
first_year
last_year
total_years
available_years
availability_pct
Afghanistan
1960
2021
62
62
1
Africa Eastern and Southern
1960
2021
62
62
1
Africa Western and Central
1960
2021
62
62
1
Albania
1960
2021
62
62
1
Algeria
1960
2021
62
62
1
Filtering Countries Based on Data Availability Threshold
In this step, countries with sufficient birth rate data are selected for further analysis. Using the previously calculated data availability statistics, only countries with at least 30% of birth rate observations available within their observed time range are retained. The names of these valid countries are extracted and used to filter the original dataset. The resulting table contains birth rate records only for countries that meet the minimum data availability threshold, ensuring that subsequent analysis and imputation are performed on datasets with adequate information.
This step addresses the remaining missing birth rate values in the filtered dataset using interpolation and directional filling techniques. For each country, the data is first sorted chronologically by year. Missing values within the observed range are then estimated using linear interpolation, which calculates intermediate values based on surrounding data points. After interpolation, any remaining missing values at the beginning or end of a country’s time series are filled using forward and backward filling (downup). This process produces a more complete dataset while preserving the overall trend of birth rates over time for each country.
Summary Statistics of Imputed Birth Rates by Country
This step calculates key summary statistics for the imputed birth rate dataset. For each country, the mean, minimum, and maximum birth rates are computed across all years. The results are then sorted in descending order of mean birth rate to highlight countries with the highest average values. The table is formatted with clear labels and units, providing an easily interpretable overview of birth rate trends and variability across countries. This summary supports both visual analysis and comparison between countries.
Interactive Horizontal Bar Chart of Mean Birth Rates by Country
This step visualizes the mean birth rates across countries using an interactive horizontal bar chart. Countries are sorted in descending order of mean birth rate, and each bar is color-coded for clarity. Hovering over a bar displays the country name and its exact mean birth rate per 1,000 people. The interactive chart allows scrolling and panning to accommodate the large number of countries, making it easy to compare birth rates and identify countries with the highest and lowest averages.
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
Conclusion
The analysis provides a comprehensive overview of global birth rate trends. After filtering countries with insufficient data and imputing missing values, summary statistics revealed differences in mean, minimum, and maximum birth rates across countries. The interactive bar chart highlights countries with the highest and lowest average birth rates, making comparisons straightforward. Overall, the workflow ensures reliable, complete, and interpretable data for understanding global birth rate patterns.
References
OpenAI. (2026, March 8). ChatGPT conversation with K. M. Qaiduzzaman on birth rate data analysis in R. OpenAI. https://chat.openai.com/