Project 2 — Data Tidying

Author

Guibril Ramde

Published

March 8, 2026

Approach

In this project, I selected the Worldwide Birth Rate dataset from Discussion 5A. The dataset is not initially organized according to tidy data principles, as the values for different years are stored across multiple columns rather than in a single variable column. Therefore, the first step of the analysis will be to clean and restructure the dataset using tidy data techniques.

Using tools from the tidyverse, I will transform the dataset into a tidy format where each variable is stored in its own column and each observation is represented by a single row. This restructuring will allow the data to be analyzed more effectively.

After tidying the dataset, I will analyze birth rate patterns across different countries. The analysis will focus on comparing birth rates between countries, examining how birth rates have evolved over time, and exploring trends in population growth, particularly during the last ten years. The goal is to identify key patterns and insights regarding global birth rate trends and their potential impact on population growth.

Finally, I will use visualization tools such as ggplot2 to present the findings and highlight important trends in the data.

Code
#library(dplyr)
#library(tidyr)
library(tidyverse)
library(readr)


get_url <- "https://raw.githubusercontent.com/japhet125/Project2-Data-Science/refs/heads/main/API_SP.DYN.CBRT.IN_DS2_en_csv_v2_5607611.csv"

get_data <- read_csv(get_url, skip = 4)
get_data
# A tibble: 266 × 67
   `Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
   <chr>          <chr>          <chr>            <chr>             <dbl>  <dbl>
 1 Aruba          ABW            Birth rate, cru… SP.DYN.CBRT.IN     33.9   32.8
 2 Africa Easter… AFE            Birth rate, cru… SP.DYN.CBRT.IN     47.4   47.5
 3 Afghanistan    AFG            Birth rate, cru… SP.DYN.CBRT.IN     50.3   50.4
 4 Africa Wester… AFW            Birth rate, cru… SP.DYN.CBRT.IN     47.3   47.4
 5 Angola         AGO            Birth rate, cru… SP.DYN.CBRT.IN     51.0   51.3
 6 Albania        ALB            Birth rate, cru… SP.DYN.CBRT.IN     41.1   40.3
 7 Andorra        AND            Birth rate, cru… SP.DYN.CBRT.IN     NA     NA  
 8 Arab World     ARB            Birth rate, cru… SP.DYN.CBRT.IN     47.6   47.6
 9 United Arab E… ARE            Birth rate, cru… SP.DYN.CBRT.IN     41.8   41.4
10 Argentina      ARG            Birth rate, cru… SP.DYN.CBRT.IN     23.8   23.6
# ℹ 256 more rows
# ℹ 61 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
#   `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, …
Code
colnames(get_data)
 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "2019"          
[65] "2020"           "2021"           "2022"          

Data Source

The dataset originates from the World Bank and was obtained from Kaggle.
It contains birth rate statistics (births per 1,000 people) for countries
around the world from 1960 to 2022.

The dataset is stored locally as:

`API_SP.DYN.CBRT.IN_DS2_en_csv_v2_5607611.csv`

Showing the raw structure

Code
glimpse(get_data)
Rows: 266
Columns: 67
$ `Country Name`   <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan"…
$ `Country Code`   <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB…
$ `Indicator Name` <chr> "Birth rate, crude (per 1,000 people)", "Birth rate, …
$ `Indicator Code` <chr> "SP.DYN.CBRT.IN", "SP.DYN.CBRT.IN", "SP.DYN.CBRT.IN",…
$ `1960`           <dbl> 33.88300, 47.43855, 50.34000, 47.32548, 51.02600, 41.…
$ `1961`           <dbl> 32.83100, 47.53055, 50.44300, 47.42105, 51.28200, 40.…
$ `1962`           <dbl> 31.64900, 47.59756, 50.57000, 47.52922, 51.31600, 39.…
$ `1963`           <dbl> 30.41600, 47.63614, 50.70300, 47.53103, 51.32300, 38.…
$ `1964`           <dbl> 29.14700, 47.64548, 50.83100, 47.51192, 51.28200, 36.…
$ `1965`           <dbl> 27.88900, 47.66766, 50.87200, 47.46857, 51.28200, 35.…
$ `1966`           <dbl> 26.66300, 47.69789, 50.98600, 47.44364, 51.29500, 34.…
$ `1967`           <dbl> 25.50300, 47.69133, 51.08100, 47.42593, 51.31400, 33.…
$ `1968`           <dbl> 24.59200, 47.69102, 51.14800, 47.42235, 51.34800, 33.…
$ `1969`           <dbl> 23.73500, 47.72112, 51.19500, 47.41269, 51.35300, 33.…
$ `1970`           <dbl> 22.97400, 47.67313, 51.12200, 47.41411, 51.26700, 32.…
$ `1971`           <dbl> 22.31300, 47.64967, 51.16300, 47.52970, 50.69800, 31.…
$ `1972`           <dbl> 21.76600, 47.47074, 51.10900, 47.57899, 50.47400, 31.…
$ `1973`           <dbl> 21.49200, 47.22113, 51.11400, 47.63283, 50.46700, 30.…
$ `1974`           <dbl> 21.38200, 47.07547, 51.13500, 47.81713, 50.47200, 30.…
$ `1975`           <dbl> 21.39300, 46.95020, 51.01800, 47.91150, 50.46900, 29.…
$ `1976`           <dbl> 21.48500, 46.79183, 50.93500, 47.86907, 50.51400, 29.…
$ `1977`           <dbl> 21.73900, 46.63214, 50.92100, 47.96894, 50.52300, 28.…
$ `1978`           <dbl> 21.92000, 46.51202, 50.81600, 48.03727, 50.61600, 27.…
$ `1979`           <dbl> 21.99300, 46.47196, 50.73700, 47.93830, 50.73200, 27.…
$ `1980`           <dbl> 21.93100, 46.33961, 50.48200, 47.77071, 50.89200, 26.…
$ `1981`           <dbl> 21.73000, 46.23755, 50.26400, 47.51406, 51.10900, 26.…
$ `1982`           <dbl> 21.47600, 46.15826, 50.13800, 47.25192, 51.30700, 26.…
$ `1983`           <dbl> 21.38500, 46.13473, 50.13900, 47.11112, 51.61000, 26.…
$ `1984`           <dbl> 21.18300, 46.13520, 50.23500, 46.70656, 51.93500, 26.…
$ `1985`           <dbl> 20.91900, 46.14379, 50.55300, 46.20665, 52.13600, 26.…
$ `1986`           <dbl> 20.61700, 46.06749, 50.72800, 45.72924, 52.19000, 25.…
$ `1987`           <dbl> 20.26900, 45.83058, 50.84500, 45.34627, 52.14600, 25.…
$ `1988`           <dbl> 19.82100, 45.33736, 50.98000, 45.00171, 51.97300, 25.…
$ `1989`           <dbl> 19.18400, 44.81301, 51.16200, 44.92848, 51.69900, 24.…
$ `1990`           <dbl> 18.66200, 44.23072, 51.42300, 44.67619, 51.34400, 24.…
$ `1991`           <dbl> 17.72200, 43.84232, 51.78800, 44.47423, 50.92600, 23.…
$ `1992`           <dbl> 16.44300, 43.34168, 51.94800, 44.30932, 50.37400, 23.…
$ `1993`           <dbl> 16.12600, 42.96601, 52.03800, 44.16810, 49.89300, 22.…
$ `1994`           <dbl> 15.43100, 42.53329, 52.17400, 43.94269, 49.55000, 22.…
$ `1995`           <dbl> 15.99100, 42.48572, 52.07300, 43.73024, 49.18500, 21.…
$ `1996`           <dbl> 16.15300, 42.13563, 51.87300, 43.49103, 48.86000, 20.…
$ `1997`           <dbl> 16.38800, 41.57346, 51.40000, 43.21922, 48.41200, 19.…
$ `1998`           <dbl> 15.07800, 41.12879, 50.88000, 43.02697, 48.00900, 18.…
$ `1999`           <dbl> 14.36100, 40.89482, 50.35100, 43.17424, 47.77300, 17.…
$ `2000`           <dbl> 14.42700, 40.52824, 49.66400, 43.19955, 47.64700, 17.…
$ `2001`           <dbl> 13.73900, 40.34121, 48.97900, 43.07550, 47.57400, 16.…
$ `2002`           <dbl> 12.99200, 40.04732, 48.20100, 42.92712, 47.44800, 15.…
$ `2003`           <dbl> 12.62100, 39.75014, 47.35000, 42.74688, 47.22600, 14.…
$ `2004`           <dbl> 11.92100, 39.57589, 46.33000, 42.50272, 47.09900, 13.…
$ `2005`           <dbl> 12.34800, 39.40739, 45.26300, 42.42154, 46.94400, 13.…
$ `2006`           <dbl> 13.05500, 39.23711, 44.72100, 42.19330, 46.64300, 12.…
$ `2007`           <dbl> 12.96200, 39.00052, 43.85800, 41.94301, 46.29000, 12.…
$ `2008`           <dbl> 12.74800, 38.85169, 41.50600, 41.75479, 45.88900, 11.…
$ `2009`           <dbl> 12.35000, 38.36494, 41.15700, 41.50376, 45.49500, 11.…
$ `2010`           <dbl> 12.19300, 37.94026, 40.60200, 41.21963, 44.97000, 11.…
$ `2011`           <dbl> 12.24600, 37.48399, 39.85500, 40.89424, 44.36400, 12.…
$ `2012`           <dbl> 12.72300, 36.92130, 40.00900, 40.41643, 43.86000, 12.…
$ `2013`           <dbl> 13.31600, 36.44714, 39.60100, 39.85651, 43.28200, 12.…
$ `2014`           <dbl> 13.53300, 36.02832, 39.10500, 39.33535, 42.67600, 12.…
$ `2015`           <dbl> 12.42800, 35.61331, 38.80300, 38.85921, 42.02000, 11.…
$ `2016`           <dbl> 12.30000, 35.18902, 37.93600, 38.39310, 41.37700, 11.…
$ `2017`           <dbl> 11.53000, 34.89254, 37.34200, 37.88166, 40.81000, 10.…
$ `2018`           <dbl> 9.88100, 34.61102, 36.92700, 37.44709, 40.23600, 10.5…
$ `2019`           <dbl> 9.13800, 34.34145, 36.46600, 37.02783, 39.72500, 10.3…
$ `2020`           <dbl> 8.10200, 33.91675, 36.05100, 36.61573, 39.27100, 10.2…
$ `2021`           <dbl> 7.19300, 33.54627, 35.84200, 36.23703, 38.80900, 10.2…
$ `2022`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Code
head(get_data)
# A tibble: 6 × 67
  `Country Name`  `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
  <chr>           <chr>          <chr>            <chr>             <dbl>  <dbl>
1 Aruba           ABW            Birth rate, cru… SP.DYN.CBRT.IN     33.9   32.8
2 Africa Eastern… AFE            Birth rate, cru… SP.DYN.CBRT.IN     47.4   47.5
3 Afghanistan     AFG            Birth rate, cru… SP.DYN.CBRT.IN     50.3   50.4
4 Africa Western… AFW            Birth rate, cru… SP.DYN.CBRT.IN     47.3   47.4
5 Angola          AGO            Birth rate, cru… SP.DYN.CBRT.IN     51.0   51.3
6 Albania         ALB            Birth rate, cru… SP.DYN.CBRT.IN     41.1   40.3
# ℹ 61 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
#   `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
Code
dim(get_data)
[1] 266  67

Interpretation of the data:

The dataset contains 266 countries with birth rate observations from 1960 to 2022. Each year is represented as a separate column, which creates a wide data structure that is not suitable for analysis.

I will need ggplot to visualize the dataset

3.2 Data Import and Tidying

reshaping data from wide to longer

Code
clean_data <- get_data %>%
  pivot_longer(
    cols = `1960`:`2022`,
    names_to = "year",
    values_to = "birth_rate"
  )
clean_data
# A tibble: 16,758 × 6
   `Country Name` `Country Code` `Indicator Name`         `Indicator Code` year 
   <chr>          <chr>          <chr>                    <chr>            <chr>
 1 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1960 
 2 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1961 
 3 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1962 
 4 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1963 
 5 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1964 
 6 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1965 
 7 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1966 
 8 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1967 
 9 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1968 
10 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1969 
# ℹ 16,748 more rows
# ℹ 1 more variable: birth_rate <dbl>

NORMALIZE VARIABLE STRUCTURE

Code
rename_data <- clean_data %>%
  rename(
    country_name = `Country Name`,
    country_code = `Country Code`,
    indicator_name = `Indicator Name`,
    indicator_code = `Indicator Code`
    
    
  )
rename_data
# A tibble: 16,758 × 6
   country_name country_code indicator_name      indicator_code year  birth_rate
   <chr>        <chr>        <chr>               <chr>          <chr>      <dbl>
 1 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1960        33.9
 2 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1961        32.8
 3 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1962        31.6
 4 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1963        30.4
 5 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1964        29.1
 6 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1965        27.9
 7 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1966        26.7
 8 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1967        25.5
 9 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1968        24.6
10 Aruba        ABW          Birth rate, crude … SP.DYN.CBRT.IN 1969        23.7
# ℹ 16,748 more rows
Code
rename_data <- rename_data %>%
  mutate(
    year = as.integer(year)
  )

HANDLING MISSING DATA

Code
tidy_data <- rename_data %>%
  filter(!is.na(birth_rate))

3.3 Analysis

Summary tables of the average birth rate per country with the highest birth rate form 1960 to 2022

Code
tidy_data %>%
  group_by(country_name) %>%
  summarise(avg_birth_rate = mean(birth_rate, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(avg_birth_rate)) %>%
  head(10)
# A tibble: 10 × 2
   country_name avg_birth_rate
   <chr>                 <dbl>
 1 Niger                  53.3
 2 Chad                   48.6
 3 Angola                 48.4
 4 Mali                   47.8
 5 South Sudan            47.8
 6 Afghanistan            47.7
 7 Somalia                47.4
 8 Malawi                 47.1
 9 Uganda                 47.0
10 Yemen, Rep.            46.3

Visualization

Code
tidy_data %>%
  group_by(country_name) %>%
  summarise(avg_birth_rate = mean(birth_rate, na.rm = TRUE)) %>%
  slice_max(avg_birth_rate, n =10) %>%
  ggplot(aes(x = reorder(country_name, avg_birth_rate), y = avg_birth_rate)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(
    title = "Top 10 Countries With Highest Birth Rates",
    x = "Country",
    y = "Average Birth Rate"
  ) +
  theme_minimal()

The chart identifies the countries with the highest average birth rates. These countries are often located in developing regions where population growth tends to be higher.

Visualizations with axis labels, titles and legends

Code
tidy_data %>%
  mutate(year = as.numeric(year)) %>%
  filter(year >= 2013) %>%
  group_by(year) %>%
  summarise(avg_birth_rate = mean(birth_rate, na.rm = TRUE)) %>%
  drop_na(avg_birth_rate) %>%
  ggplot(aes(x = year, y = avg_birth_rate)) +
  geom_line(color = "red", linewidth = 1) +
  geom_point() +
  labs(
    title = "Average Global Birth Rate In The Last 10 Years",
    x = "Year",
    y = "Birth Rate (per 1,000 people)"
  ) +
  theme_minimal()

The last decade provides insight into recent demographic trends. Many countries continue to experience declining birth rates due to urbanization, delayed family planning, and economic factors.

Comparing Selected Countries Over Time

Code
tidy_data %>%
  mutate(year = as.numeric(year)) %>%
  filter(country_name %in% c("United States", "China", "Russian Federation", "Brazil", "India"), !is.na(birth_rate)) %>%
  ggplot(aes(x = year, y = birth_rate, color = country_name)) +
  geom_line(size = 1) +
  geom_point() +
  labs(
    title = "Birth Rate Comparison Between Selected Countries",
    x = "Year",
    y = "Birth Rate"
  ) +
  theme_minimal()

This visualization compares birth rate trends among five major countries. While most countries show a gradual decline in birth rates over time, the rate of decline differs across regions. Developed countries such as the United States and Russia show relatively stable but lower birth rates, while emerging economies such as India and Brazil and China show stronger declines over the decades.

Conclusion

The dataset originates from the World Bank and was obtained through Kaggle. It provides birth rate statistics for countries worldwide from 1960 to 2022.

Before tidying, the dataset was stored in a wide format where each year was represented as a separate column. After applying tidy data principles, the dataset was transformed into a long format where each observation
represents a country-year pair.

The analysis shows that global birth rates have generally declined over time. This trend may be influenced by factors such as economic development,urbanization, increased education, and greater access to family planning.

However, some regions, particularly parts of Africa, still exhibit relatively high birth rates. Countries such as Niger and Chad continue to experience higher population growth compared to developed countries.