Project 2 Code Base Submission Dataset 1

Author

Long Lin

Overview

For this project, I used three datasets from the Week 5 Discussion 5A post. I prepared each dataset by creating a .csv file and importing the data, then tidied the data and performed an analysis. I also made sure that the code in the Quarto Markdown file is reproducible in a clean environment. The process follows the one we used in Assignment 5A with the Airline Delays data, since that assignment is very similar to this one.

Dataset 1:

Birth Rate by Countries posted by Brandon Chanderban

source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Project%202/Birth_Rates_of_Countries.csv

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gt)  # tidyr and dplyr are already attached by library(tidyverse)

birth_rate_url <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Project%202/Birth_Rates_of_Countries.csv"

birth_rate_df <- read_csv(
  file = birth_rate_url,
  show_col_types = FALSE,
  progress = FALSE,
  skip = 4
)

head(birth_rate_df)

# A tibble: 6 × 67
  `Country Name`  `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
  <chr>           <chr>          <chr>            <chr>             <dbl>  <dbl>
1 Aruba           ABW            Birth rate, cru… SP.DYN.CBRT.IN     33.9   32.8
2 Africa Eastern… AFE            Birth rate, cru… SP.DYN.CBRT.IN     47.4   47.5
3 Afghanistan     AFG            Birth rate, cru… SP.DYN.CBRT.IN     50.3   50.4
4 Africa Western… AFW            Birth rate, cru… SP.DYN.CBRT.IN     47.3   47.4
5 Angola          AGO            Birth rate, cru… SP.DYN.CBRT.IN     51.0   51.3
6 Albania         ALB            Birth rate, cru… SP.DYN.CBRT.IN     41.1   40.3
# ℹ 61 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
#   `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …

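The `skip = 4` argument tells `read_csv()` to discard the first four lines of the file before reading the header row. The sketch below illustrates that behavior on a small inline string. The four metadata lines are hypothetical placeholders, not the actual contents of the source file; only the Aruba values are copied from the output above.

```r
library(readr)

# Hypothetical miniature file: four placeholder lines, then the header and one
# data row. The placeholder lines are an assumption made for illustration.
raw_text <- paste(
  "metadata line 1",
  "metadata line 2",
  "metadata line 3",
  "metadata line 4",
  "Country Name,Country Code,1960,1961",
  "Aruba,ABW,33.9,32.8",
  sep = "\n"
)

# I() marks the string as literal data rather than a file path.
toy_df <- read_csv(I(raw_text), skip = 4, show_col_types = FALSE)

names(toy_df)
```

Without `skip`, the first placeholder line would be treated as the header and the columns would be misread.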
I converted the data from a wide format to a long format using pivot_longer in the following code chunk.

long_birth_rate_df <- birth_rate_df |>
  pivot_longer(
    cols = c("1960":"2022"),
    names_to = "Year",
    values_to = "Birth Rate"
  )
head(long_birth_rate_df, 70)

# A tibble: 70 × 6
   `Country Name` `Country Code` `Indicator Name`         `Indicator Code` Year 
   <chr>          <chr>          <chr>                    <chr>            <chr>
 1 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1960 
 2 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1961 
 3 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1962 
 4 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1963 
 5 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1964 
 6 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1965 
 7 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1966 
 8 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1967 
 9 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1968 
10 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1969 
# ℹ 60 more rows
# ℹ 1 more variable: `Birth Rate` <dbl>
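To make the reshaping step easier to follow, here is a minimal sketch of the same wide-to-long conversion on a two-row, two-year subset copied from the `head()` output above. The object names `toy_wide` and `toy_long` are made up for this illustration. The `names_transform` argument is an optional addition, not part of the original pipeline; it stores `Year` as an integer so that later steps would not need `as.numeric(Year)`.

```r
library(tidyr)
library(tibble)

# Two rows from the head() output, limited to two year columns.
toy_wide <- tribble(
  ~`Country Name`, ~`Country Code`, ~`1960`, ~`1961`,
  "Aruba",         "ABW",           33.9,    32.8,
  "Albania",       "ALB",           41.1,    40.3
)

toy_long <- toy_wide |>
  pivot_longer(
    cols = `1960`:`1961`,                      # every year column
    names_to = "Year",                         # column names become values of Year
    values_to = "Birth Rate",                  # cell values go into Birth Rate
    names_transform = list(Year = as.integer)  # optional: integer instead of character
  )

toy_long
```

Each country now contributes one row per year, so the two wide rows become four long rows.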

I replaced missing values with 0 using the following code chunk. Note that a 0 in the result marks a missing observation, not an actual birth rate.

long_birth_rate_remove_na_df <- long_birth_rate_df |>
  mutate(`Birth Rate` = replace_na(`Birth Rate`, 0))

head(long_birth_rate_remove_na_df, 70)

# A tibble: 70 × 6
   `Country Name` `Country Code` `Indicator Name`         `Indicator Code` Year 
   <chr>          <chr>          <chr>                    <chr>            <chr>
 1 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1960 
 2 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1961 
 3 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1962 
 4 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1963 
 5 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1964 
 6 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1965 
 7 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1966 
 8 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1967 
 9 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1968 
10 Aruba          ABW            Birth rate, crude (per … SP.DYN.CBRT.IN   1969 
# ℹ 60 more rows
# ℹ 1 more variable: `Birth Rate` <dbl>
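Because a zero is treated as a real observation by summary functions, zero-filling can distort later calculations. The sketch below compares zero-filling with dropping missing rows, using the last three US values from the table later in this report. It assumes the 2022 value is missing in the source file, which is what the 0.0 in that table suggests. The names `toy_us`, `zero_filled_mean`, and `observed_mean` are illustrative only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Last three US rows; 2022 is assumed to be NA in the raw data.
toy_us <- tibble(
  Year = c("2020", "2021", "2022"),
  `Birth Rate` = c(10.9, 11.0, NA)
)

# Zero-filling: the placeholder 0 is averaged in as if it were observed.
zero_filled_mean <- toy_us |>
  mutate(`Birth Rate` = replace_na(`Birth Rate`, 0)) |>
  summarise(mean_rate = mean(`Birth Rate`)) |>
  pull(mean_rate)

# Alternative: drop rows with a missing rate and average only observed values.
observed_mean <- toy_us |>
  drop_na(`Birth Rate`) |>
  summarise(mean_rate = mean(`Birth Rate`)) |>
  pull(mean_rate)

c(zero_filled = zero_filled_mean, observed = observed_mean)
```

Which approach is appropriate depends on the analysis; for trend plots and averages, dropping or keeping `NA` is usually the safer choice.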

I filtered the data to US-only rows using the following code chunk, then displayed the result in a table.

US_only_df <- long_birth_rate_remove_na_df |>
  filter(`Country Name` == 'United States')
head(US_only_df, 70)

# A tibble: 63 × 6
   `Country Name` `Country Code` `Indicator Name`         `Indicator Code` Year 
   <chr>          <chr>          <chr>                    <chr>            <chr>
 1 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1960 
 2 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1961 
 3 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1962 
 4 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1963 
 5 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1964 
 6 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1965 
 7 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1966 
 8 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1967 
 9 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1968 
10 United States  USA            Birth rate, crude (per … SP.DYN.CBRT.IN   1969 
# ℹ 53 more rows
# ℹ 1 more variable: `Birth Rate` <dbl>

US_only_df |>
  gt() |>
  cols_hide(columns = c(`Country Name`, `Country Code`, `Indicator Name`, `Indicator Code`)) |>
  tab_header(
    title = "Birth rates in the US (per 1,000 people)"
  )

Birth rates in the US (per 1,000 people)
Year	Birth Rate
1960	23.7
1961	23.3
1962	22.4
1963	21.7
1964	21.1
1965	19.4
1966	18.4
1967	17.8
1968	17.6
1969	17.9
1970	18.4
1971	17.2
1972	15.6
1973	14.8
1974	14.8
1975	14.6
1976	14.6
1977	15.1
1978	15.0
1979	15.6
1980	15.9
1981	15.8
1982	15.9
1983	15.6
1984	15.6
1985	15.8
1986	15.6
1987	15.7
1988	16.0
1989	16.4
1990	16.7
1991	16.2
1992	15.8
1993	15.4
1994	15.0
1995	14.6
1996	14.4
1997	14.2
1998	14.3
1999	14.2
2000	14.4
2001	14.1
2002	14.0
2003	14.1
2004	14.0
2005	14.0
2006	14.3
2007	14.3
2008	14.0
2009	13.5
2010	13.0
2011	12.7
2012	12.6
2013	12.4
2014	12.5
2015	12.4
2016	12.2
2017	11.8
2018	11.6
2019	11.4
2020	10.9
2021	11.0
2022	0.0

I created a line chart for the US birth rate data using the following code chunk. This was done to make it easier to see trends.

library(ggplot2)

ggplot(US_only_df, aes(x = as.numeric(Year), y = `Birth Rate`)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(
    title = "US Birth Rate Trends Over Time",
    subtitle = "Annual births per 1,000 persons",
    x = "Year",
    y = "Birth Rate"
  )

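From this plot, we can see that the US crude birth rate followed an overall downward trend, falling from 23.7 per 1,000 people in 1960 to 11.0 in 2021. The 0.0 shown for 2022 is a missing value that was replaced with 0, not an observed rate.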
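The size of the decline can be quantified with a short calculation. The sketch below uses three values copied from the US table above and excludes the zero placeholder before comparing the first and last observed years. The names `us_points`, `observed`, and `percent_change` are illustrative and do not appear in the original analysis.

```r
library(dplyr)
library(tibble)

# Endpoint values copied from the US table; 2022 holds the zero placeholder.
us_points <- tibble(
  Year = c(1960, 2021, 2022),
  `Birth Rate` = c(23.7, 11.0, 0.0)
)

# Keep only years with an observed (non-placeholder) rate.
observed <- us_points |>
  filter(`Birth Rate` > 0)

first_rate <- observed$`Birth Rate`[which.min(observed$Year)]
last_rate  <- observed$`Birth Rate`[which.max(observed$Year)]

# Relative change from the first to the last observed year, in percent.
percent_change <- 100 * (last_rate - first_rate) / first_rate
round(percent_change, 1)
```

Under these values, the crude birth rate in 2021 is roughly 54% lower than in 1960.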