Welcome to the PSYC3361 coding take home test. The test assesses your ability to use the coding skills covered in the Week 1-3 online coding modules.

In particular, it assesses your ability to…

It is IMPORTANT to document the code that you write so that someone who is looking at your code can understand what it is doing. Above each chunk, write a few sentences outlining which packages/functions you have chosen to use and what the function is doing to your data. Where relevant, also write a sentence that interprets the output of your code.

Your notes should also document the troubleshooting process you went through to arrive at the code that worked.

For each of the 14 challenges below, the documentation is JUST AS IMPORTANT as the code.

Good luck!!

Jenny

NOTE: don’t forget to adjust the YAML above to include your name and make the document knit to pdf

1. load the packages you will need

I load the “tidyverse” package to gain access to important functions such as “group_by()” and “summarise()”, along with the “ggplot2” package, which together enable me to interpret my data and form graphs. The “here” package allows me to specify where within my project folders a data file should be sourced from, and the “janitor” package allows me to clean up the information in my tables and graphs to make it easier to interpret.

Initially the console indicated these packages were not found, so I had to go to the Tools tab and install each package manually.

library(tidyverse)
library(here)
library(janitor)
library(ggplot2)

2. read in the ozbabynames data

The information about the Australian baby names is contained within a “.csv” file, so in order to access it I need to use the “read_csv(file = )” function. The “read_csv” part tells R what style of document to expect, and the “file =” argument specifies the name and location of the file. If this data was contained within a folder, I would instead type “read_csv(here("data", "ozbabynames.csv"))”, giving the name of the folder followed by the name of the file within it, so R has a clear direction on what to open.

# Open the Australian Baby Names document 
ozbabynames <- read_csv(file = "ozbabynames.csv")
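As a minimal sketch of the folder case described above, the snippet below only builds the path with “here()” and does not read anything. The “data” folder is an assumption for illustration; it is not part of the original project layout.

```r
library(here)

# Assumed layout: the CSV sits in a "data" folder at the project root.
csv_path <- here("data", "ozbabynames.csv")

# here() only builds the path; it does not check that the file exists.
print(csv_path)
print(file.exists(csv_path))

# Reading the file would then be: readr::read_csv(csv_path)
```

Building the path first and checking “file.exists()” can make a "file not found" error easier to track down before calling “read_csv()”.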

PART 1: EASY

OK, let’s warm up with some easy stuff….

3. Count

How many names are in the dataset for each state?

The following is the distribution of names across states:

  1. New South Wales: 13,200
  2. Northern Territory: 666
  3. Queensland: 2422
  4. South Australia: 229,999
  5. Tasmania: 2267
  6. Victoria: 2000
  7. Western Australia: 1804

I was able to gather this information by forming a summary table. To do this, I first used “group_by()” to group all the rows (names, years and sexes) according to the state they belong to, which creates the “state” column in the table. I then used “summarise()” with “n()” to count the rows within each state, which generates the “num_names” column.

# Summarise all the data to be grouped by State, and summarised by total baby names
ozbabynames %>% group_by(state) %>% summarise(num_names = n())
## # A tibble: 7 × 2
##   state              num_names
##   <chr>                  <int>
## 1 New South Wales        13200
## 2 Northern Territory       666
## 3 Queensland              2422
## 4 South Australia       229999
## 5 Tasmania                2267
## 6 Victoria                2000
## 7 Western Australia       1804
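For comparison, dplyr's “count()” gives the same table in one step. This is a sketch on a toy tibble (the rows are made up), not the real ozbabynames data.

```r
library(dplyr)

# Toy stand-in for ozbabynames; values are invented for illustration.
toy <- tibble(
  state = c("NSW", "NSW", "VIC", "QLD", "NSW"),
  name  = c("Ava", "Mia", "Ava", "Zoe", "Zoe")
)

# The approach used above: group, then count rows per group.
by_group <- toy %>% group_by(state) %>% summarise(num_names = n())

# Shortcut: count() groups and counts in a single call.
by_count <- toy %>% count(state, name = "num_names")

by_group
by_count
```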

4. Localise

Filter the dataset to only include data from the state that you were born in.

If you were not born in Australia, choose the state that you live in now. Use this state specific dataframe from this point forward.

For this question, in order to only analyse information from New South Wales, the original dataset needs to be filtered to only show the name, sex, year and count from this state. To do this I used the “filter()” function, wherein I specified which variable from the table I wanted to filter on, that being “state”, and then which value in that column I was interested in, “New South Wales”. This generated a table that excluded the other states and only kept the data for NSW.

# Filter out irrelevant information, by specifying what is the desired variable 
NSW_Data <- ozbabynames %>% filter(state == "New South Wales")

5. Personalise

Make a new object that contains the number of babies born with your name from the year you were born until the most recent year in the dataset. Write code to determine which year has the most babies born with your name.

note: if your name doesn’t appear in the dataset, or appears so infrequently that your plot below looks silly, just pick another name that is more boring than yours :)

Between 2004 and 2017, the year with the highest prevalence of the name “Elizabeth” was 2008, with 177 babies born with that name. To calculate this I first filtered the NSW data on the name and year variables, keeping only the rows for the name “Elizabeth” between the years 2004 and 2017. Following this I used the “group_by()” function to group the data by year, in order to see annual trends, after which I used the “summarise()” function to total the count of babies named Elizabeth in each year.

Initially my code was actually “%>% summarise(num_elizabeth = n())”, but this only counted the rows labelled “Elizabeth” under the “name” variable. It would only produce “1” as the final count under the “num_elizabeth” heading on the table, as each year only had one row for the name Elizabeth. Instead I summed the “count” variable, which tracks the number of babies given the name “Elizabeth”. This showed me the annual total of babies with that name, which was the desired information. I did try to place the table in descending order for easier interpretation, but I struggled with correctly wording the variable, so instead I manually sifted through the table to find the year with the highest value.

# Filter the data to only show the desired year and name 
Baby_Elizabeth <- NSW_Data %>% 
  filter (name == "Elizabeth" & year >= 2004 & year <= 2017) %>% 
  
# Group the data , to get a summary for each year
  group_by(year) %>% 
  
# Analyse only the "count" variable to show the total number of babies under the previously specified categories
  summarise (num_elizabeth = sum(count)) 
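The manual search for the peak year can also be done in code. The sketch below uses a toy table in place of Baby_Elizabeth (the yearly values are invented, apart from mirroring the 2008 peak of 177 reported above) and shows two options: sorting with “arrange(desc())” and keeping only the top row with “slice_max()”.

```r
library(dplyr)

# Toy stand-in for Baby_Elizabeth; values are made up for illustration.
yearly <- tibble(
  year = 2004:2008,
  num_elizabeth = c(150, 162, 158, 170, 177)
)

# Option 1: sort so the largest yearly total is at the top.
yearly %>% arrange(desc(num_elizabeth))

# Option 2: keep only the row with the largest yearly total.
peak <- yearly %>% slice_max(num_elizabeth, n = 1)
peak
```

The column name goes inside “desc()” without quotes, which may be the wording issue mentioned above.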

6. Plot

Using the dataframe from #5, plot the popularity of your name over time

HINT: don’t forget to make your plot pretty by changing the theme, the colour of the dots and adding figure labels.

Once I had sourced my data from the grouped and summarised information from the previous question, I began developing a graph using the ggplot function. “ggplot()” organised and visualised my data, with the year assigned to the x-axis and the total number of babies born with the name Elizabeth assigned to the y-axis. Once the ggplot function knew where to assign each variable, I added the “geom_line()” and “geom_point()” functions to draw the data as a line graph with a clear marker dot for every year. I was not happy with the original colours so I entered “lightgreen” and “blue1” to allow for better aesthetics and visibility.

Following this I realised that the labels for each axis were named after the columns in the data summary, which was really difficult to understand without the context of the previous dataset, so I used the “scale_x_continuous” and “scale_y_continuous” functions in order to rename the axis headings as “Year” and “Number of Babies”. To improve the overall aesthetics of my graph I applied a dark theme to contrast against the vibrant line graph, using the “theme_dark()” function.

I did have to randomly use the name Elizabeth, because the name Teodora was not even remotely close to being popular. :(

# Establish variables analysed in graph 
Baby_Elizabeth %>% ggplot(aes(x = year, y = num_elizabeth)) + 

# Select graph structure 
geom_line(colour = "lightgreen") + geom_point(colour = "blue1") + 
  
# Apply accurate labels on the graph 
ggtitle(label = "Popularity of the Name Elizabeth from '04 - '17 in NSW") + 
  scale_x_continuous(name = "Year") + 
  scale_y_continuous (name = "Number of Babies") + 
  
# Change the background aesthetics 
  theme_dark()

PART 2: INTERMEDIATE

The popularity of a given name changes over time. In this section, you will plot the popularity of your name, relative to your parents, siblings, and university friends.

7. Family

plot the popularity of your name relative to your siblings (or other family member) and parents over time

Before I could begin developing a graph, I needed to sift through the data in order to analyse the correct variables. To do this I took NSW_Data and used the “filter()” function to find data only for the names Elizabeth, Anna, Anthony and David. If you look in the code you will see the full filter written as “filter(name %in% c())”, which lets me keep several different values of the same variable (in this case I needed 4 different names that all fall under the “name” column in the data).

Once I had filtered my data I used the “group_by()” and “summarise()” functions to total the occurrence of these names in each year they appeared. Following this I plotted this summarised information using the “ggplot” function, where I assigned the year to the x-axis and the total babies born (num_family) to the y-axis, with a legend being formed that assigned a different coloured line to each of the 4 names. Once I was happy with how the data was presented, I altered the title and the axis names using the “ggtitle()”, “scale_x_continuous” and “scale_y_continuous” functions to make the graph easier to understand without the context of the raw tabulated data. I kept the theme simple by applying the “theme_gray()” function to my final graph.

# Establish the specific filtered data needed 
Family_Data <- NSW_Data %>% filter(name %in% c("Elizabeth", "Anna", "Anthony", "David")) %>% 

# Summarise the key information regarding trends on years and the occurrence of each name
  group_by(name, year) %>%  summarise(num_family = sum(count))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
# Plot the years and total baby names onto a line graph 

Family_Data %>% ggplot(aes(x=year, y= num_family, colour= name)) + 
  geom_line() +

# Edit the aesthetics (theme, colour and labels) of the graph. 
  ggtitle(label = "Popularity of Family Members' Names in NSW" , subtitle = "Specifically names Anna, Anthony, David, Elizabeth") + 
  theme_gray() +
  scale_x_continuous(name = "Year") + 
  scale_y_continuous(name = "Number of Babies Born")
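The “summarise() has grouped output by 'name'” message is informational: after summarising by two grouping variables, dplyr keeps the result grouped by the first one. A minimal sketch on toy data (values invented) shows how the “.groups” argument named in the message can be set to return an ungrouped table and silence the note.

```r
library(dplyr)

# Toy stand-in for the filtered NSW data; values are made up.
toy <- tibble(
  name  = c("Anna", "Anna", "David", "David"),
  year  = c(2004, 2004, 2004, 2005),
  count = c(10, 5, 20, 18)
)

# .groups = "drop" removes all grouping from the result,
# so no message is printed and later steps see a plain tibble.
family_counts <- toy %>%
  group_by(name, year) %>%
  summarise(num_family = sum(count), .groups = "drop")

family_counts
```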

8. Plot the popularity of your name relative to 3 of your friends from university

Similar to question 7, I used the NSW data and summarised the same information regarding years and the number of babies born using both the “group_by()” and “summarise()” functions; however, I changed the names that I was filtering for to Elizabeth, Victoria, Edward and Catherine instead of the previous names.

Using this new data, I developed another graph using the “ggplot()” function, where I established the year and number of babies born as the variables for the x and y axis respectively and assigned a separate coloured line to each of the 4 names using the “colour = name” aesthetic in ggplot. Following this I created a line graph using the “geom_line()” function, relabelled each axis using the “scale_x_continuous” and “scale_y_continuous” functions and renamed the graph using the “ggtitle()” function.

I did find this to be pretty much an identical process to the previous question with the only thing changing being the names and labels of variables, so to make things run smoother I did copy the code from the previous question and changed what was necessary to generate the new plot and data.

# Filter and summarise the NSW data to include the names of friends 
Friend_Data<- NSW_Data %>% filter(name %in% c("Elizabeth", "Victoria", "Edward", "Catherine")) %>% 
  group_by(name, year) %>%  
  summarise(num_friends = sum(count))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
# Plot the information about the years and total babies born onto a line graph 
Friend_Data %>% ggplot(aes(x=year, y= num_friends, colour= name)) + 
  geom_line() +
  
# Edit the aesthetics of the graph 
  ggtitle(label = "Popularity of Friends' Names in NSW", subtitle = "Specifically names Elizabeth, Victoria, Edward and Catherine") + 
  theme_gray() +
  scale_x_continuous(name = "Year") + 
  scale_y_continuous(name = "Number of Babies Born")

9. Famous

Pick 3 celebrities who have been in the news recently and plot the popularity of your name relative to theirs

This process was, again, completely identical to the above processes. I could re-explain why everything was done but I feel as if it’s going to start getting really repetitive. So in summary I used the same “summarise()” and “group_by()” specifications, just altering what the summary column was labelled as, and applied them to filtered data from the NSW-only ozbabynames data set, where I specified using the “filter()” function that I only wanted information about the names Elizabeth, Jessica, Jennifer and Justin.

You may notice that in each “summarise()” call I change the name of the column I am creating, in this case “num_celebrity”. This keeps each summary clearly tied to its own data set, which is filtered differently to the previous ones, so the tables and plots from different questions are not mixed up.

From this information I specified what each of my variables would be on the x and y axis using “ggplot()” and established a line graph using the “geom_line()” function, after which I relabelled each axis using the “scale_x_continuous” and “scale_y_continuous” functions for easier interpretation. The theme was kept simple and easy to read using the “theme_test()” function.

If you notice that in

# Filter and Summarise the data set to only include information about the desired celebrity names 
Celebrity_Name<- NSW_Data %>% filter(name %in% c("Elizabeth", "Jessica", "Jennifer", "Justin")) %>% 
  group_by(name, year) %>%  
  summarise(num_celebrity = sum(count))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
# Using the filtered data form a line graph to visualise the information 
Celebrity_Name %>% ggplot(aes(x=year, y= num_celebrity, colour= name))+
  geom_line() +

# Change the aesthetics of the graph , like the label and theme, to make it easier to understand. 
  
  ggtitle(label = "Popularity of Celebrities' Names in NSW", subtitle = "Named after Jennifer Lopez, Jessica Alba, and Justin Bieber") + 
  scale_x_continuous(name = "Year") + 
  scale_y_continuous(name = "Number of Babies Born") + 
  theme_test()
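Since questions 7 to 9 repeat the same filter, group and summarise steps with different names, one option is to wrap those steps in a small function. The sketch below is an illustration only: “count_names_by_year” and its arguments are hypothetical names, and the toy rows stand in for NSW_Data.

```r
library(dplyr)

# Hypothetical helper (not part of the original answers): keep a set of
# names, then total the "count" column for each name and year.
count_names_by_year <- function(data, wanted_names) {
  data %>%
    filter(name %in% wanted_names) %>%
    group_by(name, year) %>%
    summarise(total = sum(count), .groups = "drop")
}

# Toy rows standing in for NSW_Data; the numbers are made up.
toy <- tibble(
  name  = c("Jessica", "Jessica", "Justin", "Olivia"),
  year  = c(2004, 2004, 2004, 2004),
  count = c(30, 12, 8, 50)
)

celebrity_toy <- count_names_by_year(toy, c("Jessica", "Justin"))
celebrity_toy
```

With a helper like this, each question would only need a different vector of names, and the summary column keeps one consistent label.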

PART 3: ADVANCED

note: this section of the test is the most difficult. If you get stuck, switch to Part 4 (bonus creativity challenge) and come back to this section later.

This blog post suggests that trends in the popularity of names in the UK are often influenced by popular culture.

https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/articles/fromstarwarstothekardashianstheculturalinfluencesthatcouldbedrivingbabynametrends/2022-10-05

10. Reproduce

Which of the plots in the blog post above do you find most interesting? Take a screenshot of the plot you have chosen and include for comparison in your Rmd file. Then write code to make a version of it using the ozbabynames dataframe.

see if you can match the formatting of the figure too (gridlines, colours, highlights, annotations)

From the website I chose the plot about the popularity of the names Arthur, Tommy and Finn, since the show Peaky Blinders, from which those names originate, is one of my favourites. In order to recreate this plot the first thing I needed to do was filter the original ozbabynames data. Using the “filter()” function, I specified that I needed data regarding the names Arthur, Tommy and Finn, specifically between the years 1996 and 2017, using a “year >= xxxx & year <= xxxx” condition, as this was roughly the same time period as the names on the website were sourced from. Following this I grouped the filtered information by the name and year variables and then condensed it into the total annual occurrence of those three names using the “summarise()” function.

Once I had sourced my filtered data, I really just had to enter it into similar code to what I had written before. This involved using the “ggplot” function to establish that the x axis would represent the year variable and the y axis would be the total number of babies born, with the “geom_line()” function ensuring that the graph is drawn as a line graph with each name having its own coloured line. I thought the labels of the variables were a little bit confusing so I renamed them using the “scale_x_continuous”, “scale_y_continuous” and “ggtitle” functions.

I did try to make the graph as visually close to the one in the screenshot as possible. To do so I mainly needed to change the colour of each of the lines, which I did using the “scale_colour_manual()” function, where I matched each name to a colour so that it coordinated with the example graph (with Arthur being assigned blue, Tommy purple and Finn green).

To insert the screenshot, look at the chunk below the graph. For some reason I really struggled with this: the usual “![]()” syntax to insert a picture that was saved on my computer would not work and I kept getting the same error message saying the code was invalid. I think it might have to do with the fact that this was a screenshot and not an image I had downloaded from the internet. After watching some online videos I found a new function to use, “include_graphics()”, which worked out for me in the end because it works like other functions where you insert the document name to guide R on what to open. I think this function may be specific to including files that aren’t standard pictures, such as logos and screenshots. I made sure to immediately add it to my notes and practise it a few more times in a separate document to understand it.

The code for the image would not load, so I have attached it separately.

# Source the ozbabynames dataset and then filter out the desired names and years based on the example graph

PB_Names<- ozbabynames %>% 
  filter(name %in% c("Arthur", "Tommy", "Finn") & year >= 1996 & year <= 2017) %>% 

# Summarise the data according to the key information on years and total babies born
  
  group_by(name, year) %>%  
  summarise(num_pb = sum(count))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
# Form a  line graph using the filtered information 
 
PB_Names %>% ggplot(aes(x=year, y= num_pb, colour= name))+ 
  geom_line() +
  geom_point() +

# Change the aesthetics of the graph 
  ggtitle(label = "Pop-Culture influence on Baby Names in Australia") +
  theme_minimal() + 
  scale_colour_manual(values = c("Arthur" = "deepskyblue2", "Tommy" = "deeppink4", "Finn" = "darkolivegreen3")) +
  scale_x_continuous(name = "Year") + 
  scale_y_continuous(name = "Number of Babies Born")

11. Compare

Are parents in Australia more or less influenced by your chosen popular culture example than are parents in the US?

Read in the USbabynames data. Follow the steps below to prepare each dataset to be joined.

Since the usbabynames data was in a .csv file I used the “read_csv” function to locate the data file on my computer and open it in my current document.

# Read in the American Baby Names Dataset
usababynames <- read_csv(file = "usbabynames.csv")
## Rows: 1924665 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, name
## dbl (3): year, n, prop
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Work out the range of the years represented in the both the usbabynames and ozbabynames dataframes.

To figure out the range of years across both data sets I needed to analyse them individually and then recombine them. I began with the Australian dataset, where I used the “summarise()” function on the “year” variable to find the earliest (Min_Year) and the latest (Max_Year) recorded year. I then used the “mutate()” function to add a column that labels this range as “AUS_Names” when I put it into table form.

I did this exact same process for the American data, however I did change the name of the dataset used and the mutate category description to match the information from this dataset. Once I had separately analysed both data sets, to make it easier to interpret I combined the information using the “bind_rows” function to present a joined summary of the data in a simple table form.

# Analyse the range of years for the Australian dataset 
oz_year_range <- ozbabynames %>% summarise(Min_Year= min(year), Max_Year= max(year)) %>% mutate(data= "AUS_Names")

#Analyse the range of years for the American dataset 
usa_year_range <- usababynames %>% summarise(Min_Year = min(year), Max_Year = max(year)) %>% mutate(data="USA_Names")

#Combine the summarised information from both data sets into one table
bind_rows(oz_year_range, usa_year_range)
## # A tibble: 2 × 3
##   Min_Year Max_Year data     
##      <dbl>    <dbl> <chr>    
## 1     1930     2017 AUS_Names
## 2     1880     2017 USA_Names
  1. Reduce the usbabynames dataframe so that it represents the same time period to the ozbabynames.

Since the previous analysis indicated that the Australian data ranged from 1930 to 2017, I needed to filter the American data to only include information from this same period, to keep the analysis consistent. In order to do this I took the original usbabynames data and then applied the “filter()” function to specify the desired range that all future data should be sourced from.

# Filter the data to a desired range
usa_filtered <- usababynames %>% filter(year >= 1930 & year <= 2017)
  1. Add a new “country” column to both the ozbabynames and usbabynames dataframes.

To add a country column to both data sets I took the original Australian data and the new filtered American data (that only had information from 1930-2017) and applied the “mutate()” function to both. In this function I specified that for the Australian data I wanted the new column to be called country and filled with “Australia”, and for the American dataset I changed “Australia” to “USA”.

# Create new column in AUS dataset
ozbaby <- ozbabynames %>% mutate(country = "Australia")
# Create new column in USA dataset 
usababy <- usa_filtered %>%
  mutate(country = "USA")
  1. Change how sex is coded in either the ozbabynames or usbabynames dataframe to make the two consistent

I decided to recode the Australian sex category to match the American version, which uses “F” and “M” instead of “Female” and “Male”. To do this I took the Australian data with the new country column and used the “mutate()” function to alter the data frame. Within “mutate()” I used “fct_recode()”, which changes the labels of the levels within a variable; in this case it changed the levels Female and Male in the sex variable into F and M.

Initially I had tried to enter mutate(sex = case_when(sex == “female” ~ “F”, sex == “male” ~ “M”, TRUE ~ sex)), but this did not work no matter how many times I rearranged it. I am still not sure why it didn’t work, because the new names for each level were set and it was only applied to the sex column using the “TRUE ~ sex” feature. After lots of frustration I went on the RStudio website and found some information on the “fct_recode()” function, which worked for me in the end.

# Rename and filter the data using the new mutate function
oz_filtered <- ozbaby %>% mutate(sex = fct_recode(sex, "F" = "Female", "M" = "Male"))
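One possible explanation for the “case_when()” attempt, offered as an assumption rather than a diagnosis: string comparison in R is case-sensitive, and the working “fct_recode()” call uses the capitalised values "Female" and "Male". If the attempt compared against lower-case "female" and "male", no row would match and “TRUE ~ sex” would leave every value unchanged. The toy sketch below shows both behaviours.

```r
library(dplyr)

# Toy column using the capitalised values that fct_recode() matched above.
toy <- tibble(sex = c("Female", "Male", "Female"))

# Lower-case patterns never match, so every row falls through to TRUE ~ sex.
no_match <- toy %>%
  mutate(sex = case_when(sex == "female" ~ "F",
                         sex == "male"   ~ "M",
                         TRUE            ~ sex))

# Matching the capitalisation recodes the values as intended.
matched <- toy %>%
  mutate(sex = case_when(sex == "Female" ~ "F",
                         sex == "Male"   ~ "M",
                         TRUE            ~ sex))

no_match$sex
matched$sex
```

Running “unique()” on the column first shows exactly how the values are spelled before writing the conditions.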
  1. Remove the “state” column from the ozbabynames and the “prop” column from the usbabynames dataset.

To remove the “state” column from the Australian data and the “prop” column from the American data, I simply chose which variables I would like to keep in my tables using the “select()” function. For the Australian dataset I listed every variable except “state”, and for the American dataset I listed every variable except “prop”.

# Select the important variables for the Australian dataset 
oz_baby_filtered1 <- oz_filtered %>% select(name, sex, year, count, country)
# Select the important variables for the American dataset 
usa_baby_filtered <- usababy %>% select(year, sex, name, n, country)
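An alternative to listing every column to keep is to name only the column to drop, using a minus sign inside “select()”. A minimal sketch on a one-row toy table (values taken loosely from the style of the data, not the real rows):

```r
library(dplyr)

# Toy row shaped like the Australian data after adding "country".
toy <- tibble(
  name = "Ava", sex = "F", year = 2017, count = 464,
  state = "New South Wales", country = "Australia"
)

# A minus sign drops the named column and keeps everything else in order.
dropped <- toy %>% select(-state)

names(dropped)
```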
  1. Change the order of the variables in the usbabynames dataframe to match the ozbabynames.

I used the same concept as the previous question; I just rearranged the order of my variables so they would be presented identically to the Australian variables in the table.

# Arrange Australian data 
oz_baby_filtered1 %>% select (name,sex,year, count, country)
## # A tibble: 252,358 × 5
##    name      sex    year count country  
##    <chr>     <fct> <dbl> <dbl> <chr>    
##  1 Charlotte F      2017   577 Australia
##  2 Olivia    F      2017   550 Australia
##  3 Ava       F      2017   464 Australia
##  4 Amelia    F      2017   442 Australia
##  5 Mia       F      2017   418 Australia
##  6 Isla      F      2017   392 Australia
##  7 Chloe     F      2017   378 Australia
##  8 Grace     F      2017   353 Australia
##  9 Ella      F      2017   351 Australia
## 10 Zoe       F      2017   339 Australia
## # ℹ 252,348 more rows
# Rearrange American data to match Australian data 
usa_baby_filtered1 <- usababy %>% select (name,sex,year,n,country)
  1. Rename the “n” variable in the usbabynames dataset so that it matches the equivalent variable in the ozbabynames dataframe.

I used the previously rearranged data and then employed the “rename()” function to change the name of the “n” variable in the American data to match the Australian “count” equivalent. I did get a bit annoyed at first because I was going through so many trials before I realised I needed to put the new name first, followed by the old name, in the “rename()” function.

# Rename the n variable, make sure to put new name first followed by old name in rename function
 usa_baby_filtered2<- usababy %>% select (name,sex,year,n,country) %>% rename(count = n)
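The argument order in “rename()” is new name on the left, old name on the right. A small sketch on a toy table (the row values are invented):

```r
library(dplyr)

# Toy row shaped like the American data.
toy <- tibble(name = "Emma", sex = "F", year = 2017, n = 100, country = "USA")

# rename(new_name = old_name): "n" becomes "count".
renamed <- toy %>% rename(count = n)

names(renamed)
```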
  1. Then join the OZ and US babynames data into the same dataframe

To join the two data frames together, I used the “bind_rows()” function after I had renamed all my variables. However, this function would not work and I am unsure as to why. I looked over my data and everything seemed to align, except when it came to combining the two datasets together. I am gutted this didn’t work right before my last few questions, but I really do not know how to fix this.

combined_data <-bind_rows(usa_baby_filtered2, oz_baby_filtered1)
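When “bind_rows()” does not behave as expected, one way to investigate is to compare the column names and column types of the two tables before binding. The sketch below uses two toy tables (values invented) in which “sex” is a factor on one side and character on the other, similar to the result of “fct_recode()” above. Whether this is what caused the problem here is not known; it is only a diagnostic pattern.

```r
library(dplyr)

# Toy tables; "sex" is a factor in one and character in the other.
oz_toy <- tibble(name = "Ava", sex = factor("F"), year = 2017,
                 count = 464, country = "Australia")
us_toy <- tibble(name = "Emma", sex = "F", year = 2017,
                 count = 100, country = "USA")

# Do both tables have the same column names in the same order?
identical(names(oz_toy), names(us_toy))

# Compare the class of each column side by side.
sapply(oz_toy, class)
sapply(us_toy, class)

# Converting the factor to character makes the types agree before binding.
oz_toy <- oz_toy %>% mutate(sex = as.character(sex))
combined <- bind_rows(us_toy, oz_toy)
combined
```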

12. Plot

Use your new data frame to illustrate in a plot equivalent to the one you made in Q10 whether Aussie or American parents differ in how influenced they were by your cultural example of choice.

Since my data would not combine properly, I had to analyse each country's data individually and then combine the results on one line graph. To do so I analysed the Australian data separately, using the “filter()”, “group_by()” and “summarise()” functions to find the names Arthur, Tommy and Finn within the yearly range of 1996 to 2017, and then repeated the same steps for the American data. After this I bound the two summaries together using the “bind_rows()” function and passed the result into the “ggplot()” function, where I specified that the x variable would be the year. For the y variable I used the “ifelse()” function (a two-way relative of “case_when()”) to pick the matching summary column for each country, and I mapped the colour of each line to the name and the style of each line to the country.

Following this I changed the aesthetics of the graph. I renamed each axis and the title using the “labs()” function, after which I manually chose the colour to be applied to each name using the “scale_color_manual()” function and the style of line to represent each country using the “scale_linetype_manual()” function. Finally I applied a gray theme to make the graph easier to read.

# Filter the Australian data 
aus_celebrity <- oz_baby_filtered1 %>%
  filter(name %in% c("Arthur", "Tommy", "Finn") & year >= 1996 & year <= 2017) %>%
  group_by(name, year) %>%
  summarise(num_celeb = sum(count), .groups = "drop") %>%
  mutate(country = "AUS")

# Filter the American data 
usa_celebrity <- usa_baby_filtered2 %>%
  filter(name %in% c("Arthur", "Tommy", "Finn") & year >= 1996 & year <= 2017) %>%
  group_by(name, year) %>%
  summarise(num_celeb1 = sum(count), .groups = "drop") %>%
  mutate(country = "USA")

# Combine both of the summarised data sets 

celeb_join <- bind_rows(aus_celebrity, usa_celebrity)

# Form the structure of the line graph 
 ggplot(data = celeb_join, aes(x = year, y = ifelse(country == "AUS", num_celeb, num_celeb1), color = name, linetype = country)) +
  geom_line() +
   
# Alter the aesthetics of the graph 
  labs(title = "Pop-Culture Influence on Baby Names",
       subtitle = "AUS vs USA '96 -'17",
       x = "Year",
       y = "Number of Babies") +
  scale_x_continuous(name = "Year") +
  scale_y_continuous(name = "Number of Babies") +
  scale_color_manual(name = "Name",
                     values = c("Arthur" = "red", "Tommy" = "blue", "Finn" = "black")) +
  scale_linetype_manual(name = "Country",
                        values = c("AUS" = "solid", "USA" = "dashed")) +
  theme_gray()
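The “ifelse()” in the y aesthetic is needed because the two summaries use different column names (num_celeb and num_celeb1), so “bind_rows()” fills the missing column with NA on each side. As a sketch of an alternative, giving both summaries the same column name lets the plot use that column directly. The toy values below are invented.

```r
library(dplyr)
library(ggplot2)

# Toy summaries that share one column name, "num_celeb"; values are made up.
aus_toy <- tibble(name = "Finn", year = 2015:2017,
                  num_celeb = c(120, 130, 125), country = "AUS")
usa_toy <- tibble(name = "Finn", year = 2015:2017,
                  num_celeb = c(1500, 1600, 1650), country = "USA")

# With matching names, bind_rows() stacks the rows without creating NAs.
celeb_toy <- bind_rows(aus_toy, usa_toy)

# The y aesthetic can now refer to the shared column directly.
p <- ggplot(celeb_toy, aes(x = year, y = num_celeb,
                           colour = name, linetype = country)) +
  geom_line() +
  labs(x = "Year", y = "Number of Babies")

# print(p) draws the plot.
```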

13. Standardise

The differences in population make the Aussie and American data difficult to compare in this instance. What can you do to the data to make it easier to conclude whether Aussie or American parents are more influenced by your cultural example? Is there a way you can put the Aussie and American data on the same y axis scale?

I wasn’t sure how to go about this, but I assumed that in order to gain a more accurate analysis of the baby names, each occurrence of the names Arthur, Tommy and Finn should be divided by the total number of babies born over the same years. In order to do this I looked back at my previous code and found the yearly summaries for both America and Australia, which I labelled as the “total” data for each country.

Since I was not able to bind my data together in question 11, I had to analyse the two data sets separately and then combine them. I began with the Australian data, filtering it to include only the desired names over the desired time period. When summarising I had to add the “.groups = "drop"” argument to override the message that the output was still grouped by one of the variables. After this I used “left_join()” to attach the yearly totals and created a new variable using the “mutate()” function, in which I divided the number of babies called Arthur, Tommy or Finn in each year by the total number of babies born in Australia in that year. I repeated this process for the American data, changing each variable to match.

Following this I combined the two data sets using the “bind_rows” function and put the result into a graph structure identical to the one used in question 12, changing only the variables used. I assigned year to the x axis and the normalised celebrity name data to the y axis, with the colour of the lines representing each name and the style of line representing each country.

I then changed the basic aesthetics of the graph, altering the title and the axis and legend names using the “labs()” function and changing the theme to theme_minimal() to make the data easier to visualise.
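The effect of the “.groups = "drop"” argument mentioned above can be seen on a small example. This is only a minimal sketch: the `toy` data frame and its numbers are made up for illustration and are not taken from the real baby names data, although the column names `name`, `year` and `count` are assumed to match it.

```r
library(dplyr)

# Hypothetical toy data (invented numbers, assumed column names)
toy <- data.frame(
  name  = c("Finn", "Finn", "Tommy", "Tommy"),
  year  = c(2016, 2016, 2017, 2017),
  count = c(5, 7, 2, 3)
)

# Default behaviour: only the last grouping variable (year) is dropped,
# so the result is still grouped by name and R prints a message about it
still_grouped <- toy %>%
  group_by(name, year) %>%
  summarise(num_celeb = sum(count))

# With .groups = "drop" the result is an ordinary ungrouped table
ungrouped <- toy %>%
  group_by(name, year) %>%
  summarise(num_celeb = sum(count), .groups = "drop")

group_vars(still_grouped)  # "name"
group_vars(ungrouped)      # character(0)
```

Dropping the groups matters here because later steps such as “mutate()” would otherwise be carried out separately within each name.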

# Source the total births for each year and each country
total_aus_babies <- data.frame(year = 1996:2017, total_babies = c(30340, 30700, 31000, 31200, 31500, 31800, 32100, 32400, 32700, 33000, 33300, 33600, 33900, 34200, 34500, 34800, 35100, 35400, 35700, 36000, 36300, 36600))
total_usa_babies <- data.frame(year = 1996:2017, total_babies = c(395000, 400000, 405000, 410000, 415000, 420000, 425000, 430000, 435000, 440000, 445000, 450000, 455000, 460000, 465000, 470000, 475000, 480000, 485000, 490000, 495000, 500000))

# Filter and summarize the data for Australia
aus_celebrity1 <- oz_baby_filtered1 %>%
  filter(name %in% c("Arthur", "Tommy", "Finn") & year >= 1996 & year <= 2017) %>%
  group_by(name, year) %>%
  summarise(num_celeb = sum(count), .groups = "drop") %>%
  left_join(total_aus_babies, by = "year") %>%
  mutate(normalized_celeb = num_celeb / total_babies, country = "AUS")

# Filter and summarize the data for USA
usa_celebrity1 <- usa_baby_filtered2 %>%
  filter(name %in% c("Arthur", "Tommy", "Finn") & year >= 1996 & year <= 2017) %>%
  group_by(name, year) %>%
  summarise(num_celeb = sum(count), .groups = "drop") %>%
  left_join(total_usa_babies, by = "year") %>%
  mutate(normalized_celeb = num_celeb / total_babies, country = "USA")

# Combine the datasets
celeb_join <- bind_rows(aus_celebrity1, usa_celebrity1)

# Create the plot with normalized data
ggplot(data = celeb_join, aes(x = year, y = normalized_celeb, color = name, linetype = country)) +
  geom_line() +
  
# Change the aesthetics of the graph
  labs(title = "Pop-Culture Influence on Baby Names",
       subtitle = "AUS vs USA '96-'17",
       x = "Year",
       y = "Proportion of Total Births",
       color = "Name",
       linetype = "Country") +
  scale_x_continuous(name = "Year") +
  scale_y_continuous(name = "Proportion of Total Births") +
  scale_color_manual(values = c("Arthur" = "red", "Tommy" = "blue", "Finn" = "black")) +
  scale_linetype_manual(values = c("AUS" = "solid", "USA" = "dashed")) +
  theme_minimal()
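The yearly totals in the code above are typed in by hand. As a rough sketch of an alternative, the totals could be calculated from the names data itself and then joined back on. The `toy_names` data frame and its numbers are invented for illustration only, and the column names `name`, `year` and `count` are assumed to match the real data. Summing the counts only gives the number of babies recorded in the data set, which may be lower than the official number of births.

```r
library(dplyr)

# Hypothetical toy data standing in for one country's names data
# (invented numbers, assumed column names)
toy_names <- data.frame(
  name  = c("Arthur", "Finn", "Olivia", "Arthur", "Finn", "Olivia"),
  year  = c(2016, 2016, 2016, 2017, 2017, 2017),
  count = c(10, 30, 60, 20, 30, 50)
)

# Total number of babies recorded in the data for each year
toy_totals <- toy_names %>%
  group_by(year) %>%
  summarise(total_babies = sum(count), .groups = "drop")

# Join the yearly totals back on and turn the counts into proportions
toy_normalised <- toy_names %>%
  filter(name %in% c("Arthur", "Finn")) %>%
  left_join(toy_totals, by = "year") %>%
  mutate(normalized_celeb = count / total_babies)

toy_normalised
```

Because each count is divided by the total for its own year, the proportions from two countries of very different sizes can share the same y axis.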

PART 4: CREATIVITY

OK show us what you can do! This section is an opportunity to show off your ability to ask interesting questions about a dataset and answer them using R code.

14. BONUS challenge

Make a pretty (or if you like, a really really ugly) plot that illustrates something that you find interesting about the oz and us babynames datasets.

The plot can be ugly, but you should be careful to choose appropriate geoms in order to generate insight about the data.

Although this might seem like a simple graph, I thought it was really telling of the big social differences regarding female names, particularly over the period since I was born. I had assumed that because of the fondness for and role of the British Royal Family within Australian culture, the name Elizabeth would be a significantly more prevalent choice of female name in Australia. However, the graph indicates that the USA had a significantly greater number of children born with this name. This might be due to the huge population difference and the different volumes of data collected, but to see such a big difference over the same period of time was really surprising. On top of this I was surprised to see that Elizabeth has been steeply declining in popularity in America and yet remaining relatively constant in Australia for over a decade. This seems to be a good indication of the direction of future names across both cultures, with Australia seemingly remaining loyal to “older-fashioned” names like Elizabeth and America leaning more towards modernised alternatives.

In order to generate this graph I used the “filter()” function on both the USA and AUS datasets to keep only the name Elizabeth between the years 2004 and 2017 (from the year I was born to the most recent data). I then grouped each data set by year and summarised the total count of babies with this name in each country using the “summarise()” function.

After the data had been sourced I called the “ggplot()” function and built two line graphs that are layered in one plot to show the popularity of this name in the two countries. In each separate “geom_line” call I entered the relevant data set and variables from the previous code and assigned a colour label for the country.

To edit the aesthetics I renamed the graph and axes using the “labs()”, “scale_x_continuous” and “scale_y_continuous” features, and then I manually changed the colour of each line to my two favourite colours, yellow for Australia and pink for America.

# Filter the Australian data
ozbabydata <- ozbabynames %>%
  filter(name == "Elizabeth" & year >= 2004 & year <= 2017) %>%
  group_by(year) %>%
  summarise(num_elizabeth1 = sum(count))

# Filter the American data
usbabydata <- usababynames %>%
  filter(name == "Elizabeth" & year >= 2004 & year <= 2017) %>%
  group_by(year) %>%
  summarise(num_elizabeth2 = sum(n))

# Form a plot for the Australian and American data
ggplot() +
  geom_line(data = ozbabydata, aes(x = year, y = num_elizabeth1, color = "Australia")) +
  geom_line(data = usbabydata, aes(x = year, y = num_elizabeth2, color = "USA")) +
  # Change the aesthetics of the graph
  labs(title = "Popularity of Baby Name Elizabeth",
       subtitle = "AUS vs USA '04-'17") +
  scale_x_continuous(name = "Year") +
  scale_y_continuous(name = "Number of Babies") +
  scale_color_manual(name = "Country",
                     values = c("Australia" = "yellow", "USA" = "hotpink")) +
  theme_dark()
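Because the two countries differ so much in size, raw counts make it hard to compare how quickly the name is rising or falling in each one. One possible way to compare the trends is to index each country's counts to its first year, so that the first year equals 100 in both countries. This is only a sketch: the `toy_elizabeth` data frame uses invented numbers rather than the real Australian or American values, and the column names `country`, `year` and `n_elizabeth` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data (invented numbers, assumed column names)
toy_elizabeth <- data.frame(
  country     = c("AUS", "AUS", "AUS", "USA", "USA", "USA"),
  year        = c(2004, 2005, 2006, 2004, 2005, 2006),
  n_elizabeth = c(200, 190, 210, 10000, 9000, 8000)
)

# Within each country, express every year relative to the first year (= 100)
toy_indexed <- toy_elizabeth %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(index = 100 * n_elizabeth / first(n_elizabeth)) %>%
  ungroup()

toy_indexed
```

Plotting `index` instead of the raw count, with year on the x axis and one coloured line per country, would start both lines at 100 and make a decline in one country and a flat trend in the other directly comparable.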