Project 2 Code Base Submission Dataset 1

Author

Long Lin

Overview

For this project, I used three datasets from the Week 5 Discussion 5A post. I prepared each one by creating a .csv file and importing the data, then tidied the data and performed an analysis. I also made sure that the code in the Quarto Markdown file is reproducible in a clean environment. I followed a process similar to the one from Assignment 5A with the Airline Delays, since that assignment is very similar to this one.

Dataset 1:

Birth Rate by Countries posted by Brandon Chanderban

source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Project%202/Birth_Rates_of_Countries.csv

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# tidyr and dplyr are already attached by library(tidyverse)
library(gt)

birth_rate_url <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Project%202/Birth_Rates_of_Countries.csv"

birth_rate_df <- read_csv(
  file = birth_rate_url,
  show_col_types = FALSE,
  progress = FALSE,
  skip = 4
)

head(birth_rate_df)
# A tibble: 6 × 67
  `Country Name`  `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
  <chr>           <chr>          <chr>            <chr>             <dbl>  <dbl>
1 Aruba           ABW            Birth rate, cru… SP.DYN.CBRT.IN     33.9   32.8
2 Africa Eastern… AFE            Birth rate, cru… SP.DYN.CBRT.IN     47.4   47.5
3 Afghanistan     AFG            Birth rate, cru… SP.DYN.CBRT.IN     50.3   50.4
4 Africa Western… AFW            Birth rate, cru… SP.DYN.CBRT.IN     47.3   47.4
5 Angola          AGO            Birth rate, cru… SP.DYN.CBRT.IN     51.0   51.3
6 Albania         ALB            Birth rate, cru… SP.DYN.CBRT.IN     41.1   40.3
# ℹ 61 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
#   `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …

I converted the data from a wide format (one column per year) to a long format (one row per country and year) using pivot_longer in the following code chunk.

long_birth_rate_df <- birth_rate_df |>
  pivot_longer(
    cols = `1960`:`2022`,  # all year columns, selected as a range
    names_to = "Year",
    values_to = "Birth Rate"
  )
head(long_birth_rate_df, 70)
# A tibble: 70 × 6
   `Country Name` `Country Code` `Indicator Name`         `Indicator Code` Year 
   <chr>          <chr>          <chr>                    <chr>            <chr>
 1 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1960 
 2 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1961 
 3 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1962 
 4 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1963 
 5 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1964 
 6 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1965 
 7 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1966 
 8 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1967 
 9 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1968 
10 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1969 
# ℹ 60 more rows
# ℹ 1 more variable: `Birth Rate` <dbl>
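
The reshaping step can be illustrated on a much smaller table. The following is a minimal sketch, assuming a made-up tibble named wide_toy (hypothetical values, not the World Bank data). It also uses the optional names_transform argument of pivot_longer so that the year is stored as an integer rather than as text, which is one possible way to avoid converting it later.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per country, one column per year
wide_toy <- tibble(
  country = c("A", "B"),
  `2000` = c(10.5, 20.1),
  `2001` = c(10.2, NA)
)

long_toy <- wide_toy |>
  pivot_longer(
    cols = -country,                            # every column except the identifier
    names_to = "year",
    names_transform = list(year = as.integer),  # store the year as a number
    values_to = "birth_rate"
  )

long_toy
```

Each country now contributes one row per year, so two countries and two year columns give four rows.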

I replaced missing values with 0 using the following code chunk. A 0 produced this way is a placeholder for a missing measurement, not an observed birth rate.

long_birth_rate_remove_na_df <- long_birth_rate_df |>
  mutate(`Birth Rate` = replace_na(`Birth Rate`, 0))

head(long_birth_rate_remove_na_df, 70)
# A tibble: 70 × 6
   `Country Name` `Country Code` `Indicator Name`         `Indicator Code` Year 
   <chr>          <chr>          <chr>                    <chr>            <chr>
 1 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1960 
 2 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1961 
 3 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1962 
 4 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1963 
 5 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1964 
 6 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1965 
 7 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1966 
 8 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1967 
 9 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1968 
10 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1969 
# ℹ 60 more rows
# ℹ 1 more variable: `Birth Rate` <dbl>
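
Filling with 0 is only one way to handle missing values, and it treats a missing year as if the rate were truly zero. The sketch below compares it with dropping the missing rows instead. It assumes a small tibble named toy whose first three values are copied from the US table for 2019 to 2021, with 2022 assumed to be missing; the names toy, zero_filled, and observed_only are illustrative only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Small example: three observed years and one missing year (assumption)
toy <- tibble(
  year = 2019:2022,
  birth_rate = c(11.4, 10.9, 11.0, NA)
)

# Zero-filling treats the missing year as a real value of 0
zero_filled <- toy |>
  mutate(birth_rate = replace_na(birth_rate, 0))

# Dropping the missing year keeps only observed values
observed_only <- toy |>
  drop_na(birth_rate)

mean(zero_filled$birth_rate)    # pulled down by the artificial 0
mean(observed_only$birth_rate)  # mean of the three observed years
```

Which option is appropriate depends on the analysis; for summaries and trend plots, keeping or dropping the missing values usually avoids an artificial dip.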

I filtered the data to keep only the United States using the following code chunk. Then I displayed the US-only data in a table.

US_only_df <- long_birth_rate_remove_na_df |>
  filter(`Country Name` == 'United States')
head(US_only_df, 70)
# A tibble: 63 × 6
   `Country Name` `Country Code` `Indicator Name`         `Indicator Code` Year 
   <chr>          <chr>          <chr>                    <chr>            <chr>
 1 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1960 
 2 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1961 
 3 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1962 
 4 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1963 
 5 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1964 
 6 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1965 
 7 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1966 
 8 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1967 
 9 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1968 
10 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1969 
# ℹ 53 more rows
# ℹ 1 more variable: `Birth Rate` <dbl>
US_only_df |>
  gt() |>
  cols_hide(columns = c(`Country Name`, `Country Code`, `Indicator Name`, `Indicator Code`)) |>
  tab_header(
    title = "Birth rates in the US (per 1,000 people)"
  )
Birth rates in the US (per 1,000 people)
Year Birth Rate
1960 23.7
1961 23.3
1962 22.4
1963 21.7
1964 21.1
1965 19.4
1966 18.4
1967 17.8
1968 17.6
1969 17.9
1970 18.4
1971 17.2
1972 15.6
1973 14.8
1974 14.8
1975 14.6
1976 14.6
1977 15.1
1978 15.0
1979 15.6
1980 15.9
1981 15.8
1982 15.9
1983 15.6
1984 15.6
1985 15.8
1986 15.6
1987 15.7
1988 16.0
1989 16.4
1990 16.7
1991 16.2
1992 15.8
1993 15.4
1994 15.0
1995 14.6
1996 14.4
1997 14.2
1998 14.3
1999 14.2
2000 14.4
2001 14.1
2002 14.0
2003 14.1
2004 14.0
2005 14.0
2006 14.3
2007 14.3
2008 14.0
2009 13.5
2010 13.0
2011 12.7
2012 12.6
2013 12.4
2014 12.5
2015 12.4
2016 12.2
2017 11.8
2018 11.6
2019 11.4
2020 10.9
2021 11.0
2022 0.0

I created a line chart for the US birth rate data using the following code chunk. This was done to make it easier to see trends.

# ggplot2 is already attached by library(tidyverse)

ggplot(US_only_df, aes(x = as.numeric(Year), y = `Birth Rate`)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(
    title = "US Birth Rate Trends Over Time",
    subtitle = "Annual births per 1,000 persons",
    x = "Year",
    y = "Birth Rate"
  )

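From this plot, we can see that the US crude birth rate trended downward between 1960 and 2021, falling from 23.7 to 11.0 births per 1,000 people. The 2022 value appears as 0 only because the missing value for that year was replaced with 0; it is not an observed rate.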
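
The downward trend can also be summarized with numbers. The following is a rough sketch, not part of the original analysis: it assumes a small data frame named us_decades holding the values for every tenth year copied from the US table above, and fits a straight line with lm to estimate the average change per year. A linear fit is a simplification, since the decline was not steady across decades.

```r
# Decade values copied from the US table above (births per 1,000 people)
us_decades <- data.frame(
  year = c(1960, 1970, 1980, 1990, 2000, 2010, 2020),
  birth_rate = c(23.7, 18.4, 15.9, 16.7, 14.4, 13.0, 10.9)
)

# Fit a straight line; the slope is the average change per year
trend_fit <- lm(birth_rate ~ year, data = us_decades)
slope_per_year <- unname(coef(trend_fit)["year"])

# Total change between the first and last decade in the sample
total_change <- us_decades$birth_rate[7] - us_decades$birth_rate[1]

slope_per_year
total_change
```

A negative slope and a negative total change are both consistent with the decline visible in the line chart.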