I downloaded the Popular_Baby_Names data set from data.gov. It is based on baby names collected in New York City, and its access and use information specifically states that it is intended for public use.

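# Packages used below: pipes and grouping, plotting, and table printing
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)

baby_names <- read.csv(file = "/Users/brettmcgillivary/Documents/Coursework/R save files/Popular_Baby_Names.csv", header=TRUE, sep=",")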

I changed some header names to make columns easier to read.

names(baby_names)[1] <- 'Birth_Year'
names(baby_names)[4] <- 'Child_First_Name'

Baby Names (tail) Table

       Birth_Year  Gender  Ethnicity                   Child_First_Name  Count  Rank
19413  2015        MALE    WHITE NON HISPANIC          Joel              32     72
19414  2016        FEMALE  HISPANIC                    Alayna            10     74
19415  2015        FEMALE  HISPANIC                    Yaritza           12     79
19416  2015        MALE    WHITE NON HISPANIC          Mendel            42     64
19417  2016        MALE    ASIAN AND PACIFIC ISLANDER  Isaac             21     48
19418  2015        FEMALE  WHITE NON HISPANIC          Alessia           12     81

Slightly longer printing of data

# R indexes rows from 1, so the first 20 rows are 1:20
kable(baby_names[1:20, 1:6]) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Birth_Year  Gender  Ethnicity  Child_First_Name  Count  Rank
2011        FEMALE  HISPANIC   GERALDINE         13     75
2011        FEMALE  HISPANIC   GIA               21     67
2011        FEMALE  HISPANIC   GIANNA            49     42
2011        FEMALE  HISPANIC   GISELLE           38     51
2011        FEMALE  HISPANIC   GRACE             36     53
2011        FEMALE  HISPANIC   GUADALUPE         26     62
2011        FEMALE  HISPANIC   HAILEY            126    8
2011        FEMALE  HISPANIC   HALEY             14     74
2011        FEMALE  HISPANIC   HANNAH            17     71
2011        FEMALE  HISPANIC   HAYLEE            17     71
2011        FEMALE  HISPANIC   HAYLEY            13     75
2011        FEMALE  HISPANIC   HAZEL             10     78
2011        FEMALE  HISPANIC   HEAVEN            15     73
2011        FEMALE  HISPANIC   HEIDI             15     73
2011        FEMALE  HISPANIC   HEIDY             16     72
2011        FEMALE  HISPANIC   HELEN             13     75
2011        FEMALE  HISPANIC   IMANI             11     77
2011        FEMALE  HISPANIC   INGRID            11     77
2011        FEMALE  HISPANIC   IRENE             11     77
2011        FEMALE  HISPANIC   IRIS              10     78

Birth Year vs Gender - Bar Chart

# Count the rows (name records) for each year and gender
df1 <- baby_names %>%
  group_by(Birth_Year, Gender) %>%
  count() %>%
  ungroup() %>%
  rename(Count = n)

ggplot(df1,aes(x = Birth_Year,y = Count, fill = Gender))+
  ggtitle("Count vs Birth_Year")+
  xlab("Year") + ylab("Count")+
  theme(
    plot.title = element_text(color="red", size=14, face="bold"),
    axis.title.x = element_text(color="blue", size=14, face="bold"),
    axis.title.y = element_text(color="#993333", size=14, face="bold"))+
  geom_bar(stat='identity',position = position_dodge(width = 0.9))

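I found it interesting that there was a sharp decline starting in 2015 and that, generally speaking, the ratio of males to females in the city is pretty even. Note that the bars count rows in the data set (one per name record), not babies, so the drop is in the number of name records and is only an indirect indication of birth rates.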
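The chart above is built with count(), which tallies rows rather than babies. One way to check the trend against actual births would be to sum the Count column instead. The following is a minimal sketch of the difference, assuming the renamed columns used in this report; the sample_names data frame and its values are made up for illustration and are not taken from the data set.

```r
library(dplyr)

# Hypothetical sample rows with the same column names as baby_names
sample_names <- data.frame(
  Birth_Year = c(2014, 2014, 2014, 2015, 2015),
  Gender     = c("FEMALE", "MALE", "MALE", "FEMALE", "MALE"),
  Count      = c(10, 20, 30, 15, 25)
)

# Rows per year and gender (the quantity the bar chart plots)
rows_per_group <- sample_names %>%
  count(Birth_Year, Gender, name = "Rows")

# Babies per year and gender (sum of the Count column)
births_per_group <- sample_names %>%
  group_by(Birth_Year, Gender) %>%
  summarise(Births = sum(Count), .groups = "drop")

print(rows_per_group)
print(births_per_group)
```

Passing births_per_group to the same ggplot call, with y = Births, would then plot babies per year instead of name records per year.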

Line Graph

# Count the rows (name records) for each year and ethnicity
df2 <- baby_names %>%
  group_by(Birth_Year, Ethnicity) %>% count() %>% ungroup() %>% as.data.frame()
# Convert the factor to character so the labels can be rewritten
df2$Ethnicity <- as.character(df2$Ethnicity)

# Expand the truncated ethnicity labels so both spellings match
df2 <- df2 %>%
  mutate(Ethnicity = ifelse(Ethnicity == 'ASIAN AND PACI', 'ASIAN AND PACIFIC ISLANDER', Ethnicity)) %>%
  mutate(Ethnicity = ifelse(Ethnicity == 'BLACK NON HISP', 'BLACK NON HISPANIC', Ethnicity)) %>%
  mutate(Ethnicity = ifelse(Ethnicity == 'WHITE NON HISP', 'WHITE NON HISPANIC', Ethnicity))

ggplot(df2, aes(x = Birth_Year, y = n , color = Ethnicity)) + 
  ggtitle("Birth_Year vs Count with Ethnicity dimension")+
  xlab("Year") + ylab("Count")+
  geom_line()

It is interesting that the decline was even across ethnicities. Researching relevant news, I found multiple articles that attributed the decline in birth rates to a marked reduction in teen pregnancies and an increase in “induced terminations”.

Issues

I ran into multiple issues with the plots. I found styling the plots far easier than creating the plots themselves. The line plot took me about 3 days to complete because I couldn’t figure out a way to convert the ethnicity data so I could mutate it. In the original data, several ethnicities appear under two labels; for example, “ASIAN AND PACIFIC ISLANDER” and “ASIAN AND PACI” both refer to the same group. Eventually, I was able to convert the column to characters and then mutate the labels so they matched.
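The convert-then-mutate fix described here can also be applied before counting, so that each year ends up with a single row per ethnicity. The following is a minimal sketch, not the code used for the plot above: sample_names and the full_labels lookup vector are hypothetical names introduced for illustration, and the rows are made up. In R 4.0 and later, read.csv returns character columns by default, so the as.character() step only matters when the column is a factor.

```r
library(dplyr)

# Hypothetical sample rows; Ethnicity is a factor, as older R versions produce
sample_names <- data.frame(
  Birth_Year = c(2012, 2012, 2013, 2013),
  Ethnicity  = factor(c("ASIAN AND PACI", "WHITE NON HISP",
                        "ASIAN AND PACIFIC ISLANDER", "WHITE NON HISPANIC"))
)

# Lookup from each truncated label to its full spelling
full_labels <- c(
  "ASIAN AND PACI" = "ASIAN AND PACIFIC ISLANDER",
  "BLACK NON HISP" = "BLACK NON HISPANIC",
  "WHITE NON HISP" = "WHITE NON HISPANIC"
)

cleaned <- sample_names %>%
  mutate(
    # Factor -> character so the labels can be rewritten freely
    Ethnicity = as.character(Ethnicity),
    # Replace truncated labels; leave all other labels unchanged
    Ethnicity = ifelse(Ethnicity %in% names(full_labels),
                       unname(full_labels[Ethnicity]), Ethnicity)
  )

# Count rows per year and ethnicity after the labels agree
counts <- cleaned %>% count(Birth_Year, Ethnicity)
print(counts)
```

Keeping the label pairs in one named vector means a newly discovered truncated label only needs one extra entry rather than another mutate step.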