In this assignment, we practice collaborating on a code project with GitHub. We will create an example using one or more TidyVerse packages and demonstrate how to use their capabilities.
We will use a births dataset from fivethirtyeight.com. It contains U.S. births data for 1994-2003, provided by the Centers for Disease Control and Prevention’s (CDC’s) National Center for Health Statistics (NCHS). We will load the data from FiveThirtyEight’s GitHub repository.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
us_births <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv", header = TRUE)
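Since this is a TidyVerse example, the same file could also be read with readr's read_csv(), which returns a tibble and lets the column types be declared up front. The sketch below is an illustration rather than the code used in this report: it parses a three-row inline sample copied from the first rows of the file so that it runs without a network connection, and the names sample_csv and births_sample are hypothetical.

```r
library(readr)

# Hypothetical inline sample: the header plus the first three rows of the births file.
sample_csv <- "year,month,date_of_month,day_of_week,births
1994,1,1,6,8096
1994,1,2,7,7772
1994,1,3,1,10142
"

# I() marks the string as literal data rather than a file path (readr 2.0 or later).
# The compact col_types string "iiiii" declares five integer columns.
births_sample <- read_csv(I(sample_csv), col_types = "iiiii")
births_sample
```

Passing the GitHub URL from the read.csv() call in place of I(sample_csv) should read the full file the same way.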
View the dataset
dim_desc(us_births)
## [1] "[3,652 x 5]"
head(us_births)
## year month date_of_month day_of_week births
## 1 1994 1 1 6 8096
## 2 1994 1 2 7 7772
## 3 1994 1 3 1 10142
## 4 1994 1 4 2 11248
## 5 1994 1 5 3 11053
## 6 1994 1 6 4 11406
tail(us_births)
## year month date_of_month day_of_week births
## 3647 2003 12 26 5 10218
## 3648 2003 12 27 6 8646
## 3649 2003 12 28 7 7645
## 3650 2003 12 29 1 12823
## 3651 2003 12 30 2 14438
## 3652 2003 12 31 3 12374
There are 3,652 observations (rows) and 5 variables (columns), one row for each day from January 1, 1994 to December 31, 2003.
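The date is stored as three separate integer columns, and the coding of day_of_week is not stated in the output. As a rough check, the sketch below rebuilds a Date column with mutate() and compares its ISO weekday number with day_of_week. It uses only the first three rows from the head() output, and the interpretation that 1 means Monday and 7 means Sunday is an assumption inferred from those rows, not something the data source states here. The names births_head, iso_weekday and matches are hypothetical.

```r
library(dplyr)
library(tibble)

# First three rows of us_births, copied from the head() output.
births_head <- tibble(
  year = c(1994L, 1994L, 1994L),
  month = c(1L, 1L, 1L),
  date_of_month = 1:3,
  day_of_week = c(6L, 7L, 1L),
  births = c(8096L, 7772L, 10142L)
)

births_head <- births_head %>%
  mutate(
    # Combine the three date parts into a single Date column.
    date = as.Date(ISOdate(year, month, date_of_month)),
    # "%u" is the ISO weekday number: 1 = Monday, ..., 7 = Sunday.
    iso_weekday = as.integer(format(date, "%u")),
    # TRUE where the dataset's coding agrees with the ISO numbering.
    matches = iso_weekday == day_of_week
  )
births_head
```

If matches is TRUE for every row of the full dataset as well, day_of_week can be read as an ISO weekday number.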
Now, I will summarise the dataset with functions from the dplyr package and plot the results with ggplot2.
What was the total number of births in the US from 1994 to 2003?
us_births %>% summarise(total_births = sum(births))
## total_births
## 1 39722137
What was the total number of births in the US in each year from 1994 to 2003?
Totalbirth <- us_births %>%
  group_by(year) %>%
  summarise(total_births = sum(births))
Totalbirth
## # A tibble: 10 × 2
## year total_births
## <int> <int>
## 1 1994 3952767
## 2 1995 3899589
## 3 1996 3891494
## 4 1997 3880894
## 5 1998 3941553
## 6 1999 3959417
## 7 2000 4058814
## 8 2001 4025933
## 9 2002 4021726
## 10 2003 4089950
The following graph shows the total number of births in the US for each year from 1994 to 2003.
ggplot(Totalbirth, aes(x = year, y = total_births)) +
  geom_bar(stat = "identity", width = 0.5, color = "black", fill = "lightblue") +
  labs(x = "Year", y = "Total Births", title = "Total Births in Each Year, 1994-2003") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 8)) +
  geom_label(aes(label = total_births), position = position_dodge(width = 0.1), size = 3, label.padding = unit(0.1, "lines"), label.size = 0.09, inherit.aes = TRUE)
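As a possible alternative, ggplot2 also offers geom_col(), which takes bar heights directly from the y aesthetic and so behaves like geom_bar(stat = "identity"). The sketch below is not the plot produced in this report. It rebuilds the yearly totals by hand from the Totalbirth table so that it runs on its own, and the data frame name yearly and the comma formatting of the y axis are illustrative choices.

```r
library(ggplot2)

# Yearly totals copied from the Totalbirth table.
yearly <- data.frame(
  year = 1994:2003,
  total_births = c(3952767, 3899589, 3891494, 3880894, 3941553,
                   3959417, 4058814, 4025933, 4021726, 4089950)
)

# geom_col() uses the y values as bar heights, like geom_bar(stat = "identity").
p <- ggplot(yearly, aes(x = factor(year), y = total_births)) +
  geom_col(width = 0.5, color = "black", fill = "lightblue") +
  # Print axis values such as 4,000,000 instead of 4e+06.
  scale_y_continuous(labels = function(v) format(v, big.mark = ",", scientific = FALSE)) +
  labs(x = "Year", y = "Total Births", title = "Total Births per Year, 1994-2003")
p
```

Treating year as a factor gives one labelled tick per bar, which avoids fractional year labels on the x axis.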
What was the average number of births per day in the US for each calendar month from 1994 to 2003?
avg_month <- us_births %>%
  group_by(month) %>%
  summarize(Average = mean(births))
avg_month
## # A tibble: 12 × 2
## month Average
## <int> <dbl>
## 1 1 10427.
## 2 2 10703.
## 3 3 10716.
## 4 4 10618.
## 5 5 10809.
## 6 6 10988.
## 7 7 11286.
## 8 8 11374.
## 9 9 11466.
## 10 10 10899.
## 11 11 10572.
## 12 12 10651.
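To see at a glance which months have the highest averages, the table can be sorted with arrange() and desc(). The sketch below is illustrative only: it re-enters the rounded averages from the avg_month output by hand so that it runs on its own, and the names avg_month_sample, month_name and ranked are hypothetical.

```r
library(dplyr)
library(tibble)

# Average daily births per calendar month, copied (rounded) from the avg_month output.
avg_month_sample <- tibble(
  month = 1:12,
  Average = c(10427, 10703, 10716, 10618, 10809, 10988,
              11286, 11374, 11466, 10899, 10572, 10651)
)

ranked <- avg_month_sample %>%
  mutate(month_name = month.abb[month]) %>%  # month.abb is a base R constant
  arrange(desc(Average))                     # highest average first

head(ranked, 3)  # the three months with the highest averages
```

With these values, the late-summer months (July to September) come out on top and January comes last.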
What were the monthly birth rates in a given year (for example, 2000)?
# As per the CDC, the birth rate is the number of live births in a year divided by
# the midyear resident population, expressed per 1,000 population.
# Here each monthly total is divided by the 2000 Census count (281.4 million) and
# scaled by 10,000, so birth_rate is monthly births per 10,000 population.
births_2000 <- us_births %>%
  filter(year == 2000) %>%
  group_by(month) %>%
  summarise(births = sum(births)) %>%
  mutate(birth_rate = round((births / 281400000) * 10000, 2))
births_2000
## # A tibble: 12 × 3
## month births birth_rate
## <int> <int> <dbl>
## 1 1 330108 11.7
## 2 2 317377 11.3
## 3 3 340553 12.1
## 4 4 317180 11.3
## 5 5 341207 12.1
## 6 6 341206 12.1
## 7 7 348975 12.4
## 8 8 360080 12.8
## 9 9 347609 12.4
## 10 10 343921 12.2
## 11 11 333811 11.9
## 12 12 336787 12.0
max(births_2000$birth_rate, na.rm = FALSE )
## [1] 12.8
The following graph shows the birth rate per month in the year 2000.
ggplot(births_2000, aes(x = factor(month), y = birth_rate)) +
  geom_bar(stat = "identity", width = 0.5, color = "black", fill = "lightblue") +
  scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  labs(x = "Month", y = "Births per 10,000 population", title = "Birth Rate Per Month in Year 2000") +
  geom_label(aes(label = birth_rate), position = position_dodge(width = 0.1), size = 3, label.padding = unit(0.1, "lines"), label.size = 0.09, inherit.aes = TRUE)
Based on our data set from the Centers for Disease Control and Prevention (CDC), there were approximately 11.73 births per 10,000 population in the United States in January 2000, compared with approximately 11.97 in December 2000. The highest monthly figure was in August, at approximately 12.80 births per 10,000 population. Because these are monthly totals scaled by 10,000, they are not directly comparable to the CDC’s annual birth rate, which is expressed per 1,000 population. It’s important to note that birth rates can vary by state and demographic group, so this is an overall national estimate.
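The mutate() call for births_2000 multiplies by 10,000 and works on monthly totals, so its values are monthly births per 10,000 population. For comparison, a conventional annual rate per 1,000 population can be sketched from numbers already in this report: the 2000 total from the yearly table and the 281.4 million Census count. Using the Census count in place of a midyear population estimate is an approximation, and the variable names below are hypothetical.

```r
# Values taken from this report: total births in 2000 from the yearly table,
# and the 2000 Census count as a stand-in for the midyear resident population.
births_2000_total <- 4058814
population_2000 <- 281400000

# Annual births per 1,000 population.
annual_rate <- round(births_2000_total / population_2000 * 1000, 2)
annual_rate

# The January figure as computed for births_2000 (330,108 births, per 10,000 population).
jan_per_10k <- round(330108 / population_2000 * 10000, 2)
jan_per_10k
```

Under these assumptions the annual rate comes to about 14.42 births per 1,000 population, while the January figure reproduces the 11.73 shown in the table.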
In this assignment, we loaded a dataset from fivethirtyeight.com and used TidyVerse packages, specifically dplyr and ggplot2, to demonstrate their capabilities. We used dplyr to manipulate our data with the filter(), summarise() and mutate() functions. Combined with group_by(), these allowed us to perform each operation “by group”. We also used ggplot2 to visualise the results, because patterns are quicker to spot in a chart than in a table of numbers.