DATA607: TidyVerse

TidyVerse Assignment: Collaborating Around a Code Project with GitHub.

Overview

In this assignment we practice collaborating around a code project with GitHub. We will create an example using one or more TidyVerse packages and demonstrate how to use their capabilities.

We will use a births dataset from fivethirtyeight.com. This dataset contains daily U.S. births data for 1994-2003, provided by the Centers for Disease Control and Prevention’s (CDC’s) National Center for Health Statistics (NCHS). We will load the data from FiveThirtyEight’s GitHub repository.

Load Library

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)

Load Data

us_births <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv", header = TRUE)
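The call above uses base R's read.csv(). As a minimal sketch of the TidyVerse alternative, readr::read_csv() returns a tibble and lets the column types be declared up front. To keep the example self-contained it parses three rows copied from the head() output shown further down instead of downloading the file; the name births_sample is only illustrative.

```r
library(readr)

# Three rows copied from the head() output of us_births, as literal CSV text.
csv_text <- "year,month,date_of_month,day_of_week,births
1994,1,1,6,8096
1994,1,2,7,7772
1994,1,3,1,10142"

# I() marks the string as literal data rather than a file path (readr >= 2.0).
# Every column in this file is a whole number, so integer is used as the default type.
births_sample <- read_csv(I(csv_text), col_types = cols(.default = col_integer()))
births_sample
```

Passing the GitHub URL from the read.csv() call in place of I(csv_text) should read the full file the same way.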

Header Definition Table.

View Data

View the dataset

dim_desc(us_births)

## [1] "[3,652 x 5]"

head(us_births)

##   year month date_of_month day_of_week births
## 1 1994     1             1           6   8096
## 2 1994     1             2           7   7772
## 3 1994     1             3           1  10142
## 4 1994     1             4           2  11248
## 5 1994     1             5           3  11053
## 6 1994     1             6           4  11406

tail(us_births)

##      year month date_of_month day_of_week births
## 3647 2003    12            26           5  10218
## 3648 2003    12            27           6   8646
## 3649 2003    12            28           7   7645
## 3650 2003    12            29           1  12823
## 3651 2003    12            30           2  14438
## 3652 2003    12            31           3  12374

There are 3,652 observations (rows) and 5 variables (columns).

Now, I will transform the dataset using functions from the dplyr and ggplot2 packages.

dplyr::summarise() & base::sum()

What was the total number of births in the US from 1994 to 2003?

us_births %>% summarise(total_births = sum(births))

##   total_births
## 1     39722137

dplyr::group_by(), dplyr::summarise() & base::sum()

What was the total number of births in the US in each year from 1994 to 2003?

Totalbirth<-us_births %>%
  group_by(year) %>%
  summarise(total_births = sum(births))
Totalbirth

## # A tibble: 10 × 2
##     year total_births
##    <int>        <int>
##  1  1994      3952767
##  2  1995      3899589
##  3  1996      3891494
##  4  1997      3880894
##  5  1998      3941553
##  6  1999      3959417
##  7  2000      4058814
##  8  2001      4025933
##  9  2002      4021726
## 10  2003      4089950
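For a grouped sum like this one, dplyr also offers count() with a weight as a shorthand. The sketch below, which uses six daily rows copied from the head() and tail() output above rather than the full dataset, compares the two forms; births_sample, by_group and by_count are illustrative names.

```r
library(dplyr)
library(tibble)

# Six daily rows copied from the head() and tail() output of us_births.
births_sample <- tibble(
  year   = c(1994, 1994, 1994, 2003, 2003, 2003),
  births = c(8096, 7772, 10142, 12823, 14438, 12374)
)

# Explicit form: group, then summarise.
by_group <- births_sample %>%
  group_by(year) %>%
  summarise(total_births = sum(births))

# Shorthand: count() with wt sums the weight column within each group.
by_count <- births_sample %>%
  count(year, wt = births, name = "total_births")

# The two columns of totals should match.
identical(by_group$total_births, by_count$total_births)
```

The explicit form is easier to extend with further summary columns, while count() is more compact when only one weighted total is needed.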

ggplot2::ggplot()

The bar chart below shows the total number of births in the US in each year from 1994 to 2003.

ggplot(Totalbirth, aes(x = year, y = total_births)) +
  geom_bar(stat = "identity", width = 0.5, color = "black", fill = "lightblue") +
  labs(x = "Year", y = "Total Births", title = "Total Birth in Each Year 1994-2003") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 8)) +
  geom_label(aes(label = total_births), position = position_dodge(width = 0.1), size = 3,
             label.padding = unit(0.1, "lines"), label.size = 0.09, inherit.aes = TRUE)
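As a possible variation on the chart above, geom_col() draws bars from precomputed values without needing stat = "identity", and format() with big.mark makes the seven-digit labels easier to read. This is only a sketch: the data frame totals is rebuilt from the Totalbirth output so the example runs on its own, and the styling choices are assumptions, not part of the original plot.

```r
library(ggplot2)

# Yearly totals copied from the Totalbirth output above.
totals <- data.frame(
  year = 1994:2003,
  total_births = c(3952767, 3899589, 3891494, 3880894, 3941553,
                   3959417, 4058814, 4025933, 4021726, 4089950)
)

# factor(year) gives one labelled tick per bar instead of a continuous axis.
p <- ggplot(totals, aes(x = factor(year), y = total_births)) +
  geom_col(width = 0.5, color = "black", fill = "lightblue") +
  geom_text(aes(label = format(total_births, big.mark = ",")),
            vjust = -0.4, size = 3) +
  labs(x = "Year", y = "Total births", title = "Total births per year, 1994-2003")
p
```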

dplyr::group_by(), dplyr::summarise() & base::mean()

What was the average number of daily births in the US in each calendar month, 1994-2003?

avg_month <- us_births %>%
  group_by(month) %>%
  summarize(Average = mean(births))
avg_month

## # A tibble: 12 × 2
##    month Average
##    <int>   <dbl>
##  1     1  10427.
##  2     2  10703.
##  3     3  10716.
##  4     4  10618.
##  5     5  10809.
##  6     6  10988.
##  7     7  11286.
##  8     8  11374.
##  9     9  11466.
## 10    10  10899.
## 11    11  10572.
## 12    12  10651.
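The table is easier to scan with month names and a sort order. A minimal sketch, assuming the rounded averages printed above are accurate enough for ranking (avg_month_sample and ranked are illustrative names):

```r
library(dplyr)
library(tibble)

# Monthly averages copied (rounded) from the avg_month output above.
avg_month_sample <- tibble(
  month   = 1:12,
  Average = c(10427, 10703, 10716, 10618, 10809, 10988,
              11286, 11374, 11466, 10899, 10572, 10651)
)

ranked <- avg_month_sample %>%
  mutate(month_name = month.abb[month]) %>%  # month.abb is a base R constant
  arrange(desc(Average))                     # busiest months first

head(ranked, 3)
```

On these figures the three busiest months are September, August and July.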

dplyr::filter(), dplyr::group_by(), dplyr::summarise() & dplyr::mutate()

What were the monthly birth rates in a given year (for example, 2000)?

births_2000 <- us_births %>%
  filter(year == 2000) %>%
  group_by(month) %>%
  summarise(births = sum(births)) %>%
  # The CDC defines the birth rate as live births in a year divided by the midyear
  # resident population, expressed per 1,000 population. The figure computed here is
  # a monthly rate instead: births in each month per 10,000 population, using the
  # 281.4 million residents counted by the 2000 U.S. Census as the denominator.
  mutate(birth_rate = round((births / 281400000) * 10000, 2))
births_2000

## # A tibble: 12 × 3
##    month births birth_rate
##    <int>  <int>      <dbl>
##  1     1 330108       11.7
##  2     2 317377       11.3
##  3     3 340553       12.1
##  4     4 317180       11.3
##  5     5 341207       12.1
##  6     6 341206       12.1
##  7     7 348975       12.4
##  8     8 360080       12.8
##  9     9 347609       12.4
## 10    10 343921       12.2
## 11    11 333811       11.9
## 12    12 336787       12.0

max(births_2000$birth_rate, na.rm = FALSE )

## [1] 12.8
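Because the calculation above multiplies by 10,000, each value reads as births in that month per 10,000 population. As a cross-check, the sketch below computes the conventional annual rate per 1,000 population from the same monthly totals, and uses slice_max() to return the peak month together with its rate rather than the rate alone. The names population_2000, births_2000_sample, annual_rate_per_1000 and peak_month are illustrative.

```r
library(dplyr)
library(tibble)

population_2000 <- 281400000  # 2000 Census count quoted in the code comment above

# Monthly totals for 2000, copied from the births_2000 output above.
births_2000_sample <- tibble(
  month  = 1:12,
  births = c(330108, 317377, 340553, 317180, 341207, 341206,
             348975, 360080, 347609, 343921, 333811, 336787)
)

# Annual rate: all births in the year per 1,000 population.
annual_rate_per_1000 <- round(
  sum(births_2000_sample$births) / population_2000 * 1000, 2
)
annual_rate_per_1000

# Monthly rate per 10,000 population, keeping only the month with the highest value.
peak_month <- births_2000_sample %>%
  mutate(rate_per_10000 = round(births / population_2000 * 10000, 2)) %>%
  slice_max(rate_per_10000, n = 1)
peak_month
```

The annual figure comes out at about 14.42 births per 1,000 population, and the peak month is August.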

ggplot2::ggplot()

The bar chart below shows the birth rate for each month of 2000.

ggplot(births_2000, aes(x = factor(month), y = birth_rate)) +
  geom_bar(stat = "identity", width = 0.5, color = "black", fill = "lightblue") +
  scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  labs(x = "Month", y = "Births per 10,000 population", title = "Birth Rate Per Month in Year 2000") +
  geom_label(aes(label=birth_rate), position = position_dodge(width = 0.1), size = 3, label.padding = unit(0.1, "lines"), label.size = 0.09, inherit.aes = TRUE)

Based on our dataset from the Centers for Disease Control and Prevention (CDC), the birth rate in the United States in January 2000 was approximately 11.73 births per 10,000 population, compared with approximately 11.97 in December 2000. The highest monthly rate was in August, at approximately 12.80 births per 10,000 population. Because the calculation multiplies monthly births by 10,000, these figures are monthly rates per 10,000 residents, not the annual rate per 1,000 population that the CDC reports. Birth rates also vary by state and demographic group, so these are overall national estimates.

Conclusion

In this assignment we loaded a dataset from fivethirtyeight.com and used TidyVerse packages, specifically dplyr and ggplot2, to demonstrate their capabilities. We used dplyr to manipulate our data with the filter(), mutate() and summarise() functions. These combine with group_by(), which allowed us to perform each operation “by group”. We also used ggplot2 to represent our data visually, since charts make yearly and monthly patterns easier to see than tables of numbers.