Source file ⇒ CSC303_Project_Final.Rmd
Task
The task of the BabyNames project was to create a graph showing the ups and downs in the popularity of names of interest to me.
This task requires finding not only the count of babies given a certain name, but also the percentage of all babies born in a particular year who were given that name. This matters because popularity by year can’t be determined from the count alone: the total number of births can change substantially from one year to the next. There could be more babies named “Shae” in 1990 than in 1980, but that does not mean the name became more popular, because far more babies may have been born overall in 1990 than in 1980. To truly judge the popularity of a name within a year, you must express its count as a share of that year’s births.
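To make the arithmetic concrete, here is a small sketch with invented numbers (they are not taken from BabyNames, and the data frame `births` is purely illustrative). It shows a name whose count rises while its share of births falls.

```r
# Hypothetical counts, invented only to illustrate the arithmetic.
births <- data.frame(
  year      = c(1980, 1990),
  nameTotal = c(200, 300),          # babies given the name that year
  total     = c(1000000, 2000000)   # all babies born that year
)

# Share of each year's babies given the name, per 1,000 births.
births$per_thousand <- births$nameTotal / births$total * 1000

print(births)
```

In this made-up case the count grows from 200 to 300, yet the rate per 1,000 births drops, so judging by the count alone would point the wrong way.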
The assigned task wasn’t as interesting or challenging as I thought it could be, so I decided to extend it. I was intrigued by the book’s example of how Prince’s name became popular after he released his first hit album. That got me thinking: to what extent does pop culture influence how people name their children? One of the biggest influences on modern culture is the film industry, so I decided to look at popular movies from each decade to see which character names influenced the number of children given that name after the release.
This new task involves graphing the popularity of each chosen name over time, with a vertical line marking the year its movie was released.
Analysis
In order to show on the graph which movie each name came from, we have to reassign the names/labels.
library(ggplot2)

movie_names <- c(
  'Wendy' = "Peter Pan: Wendy",
  'Tiffany' = "Breakfast at Tiffany's: Tiffany",
  'Logan' = "Logan's Run: Logan",
  'Samantha' = "16 Candles: Samantha",
  'Mia' = "Pulp Fiction: Mia"
)
# as_labeller() turns the lookup vector into a labeller for facet_wrap();
# the older function(variable, value) form is deprecated in ggplot2.
movie_labeller <- as_labeller(movie_names)
Next, we pair each name with the year its movie was released, so that each graph can show a vertical line (the x-intercept) at that year.
movieName <- c("Wendy", "Tiffany", "Logan", "Samantha", "Mia")
ym <- data.frame(year_movie = c(1953, 1961, 1976, 1984, 1994),
                 name = movieName,
                 stringsAsFactors = FALSE)  # keep name as character for the join
We want to find the percentage of babies given that name during a year, not just the count. This will require making new variables and joining tables.
We need to create a variable that shows the total number of babies born in each year, call it totalBabies.
We need a variable that is the number of babies given each name within each year, call it totalName.
Lastly, we need to make a new data table that contains name, year, year_movie, nameTotal, and total. We do this by joining totalName to totalBabies to ym and calling the resulting table PopNames.
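library(dplyr)

# Total number of babies born in each year.
totalBabies <-
  BabyNames %>%
  group_by(year) %>%
  summarise(total = sum(count))
# Number of babies given each name in each year, summed over all rows
# for that name and year.
totalName <-
  BabyNames %>%
  group_by(name, year) %>%
  summarise(nameTotal = sum(count))
# Attach the yearly totals, then the release years; the second join keeps
# only the five movie names.
PopNames <-
  totalName %>%
  inner_join(totalBabies, by = "year") %>%
  inner_join(ym, by = "name")
Now we can find the percentage of babies given each of the names from the movies and plot each movie/name.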
PopNames %>%
  filter(name %in% movieName) %>%
  # This filters to only the names that we are looking at.
  group_by(name, year, year_movie, nameTotal, total) %>%
  # Here we group the data by the five variables that we need.
  summarise(namepercent = (nameTotal / total) * 1000) %>%
  # This is where we find the share of all babies given that name within
  # the year, expressed per 1,000 births (a true percentage would use 100).
  ggplot(aes(x = year, y = namepercent)) +
  # We graph by year and share.
  ggtitle("Percent of Babies Given Names Popularized by Movies") +
  theme(text = element_text(size = 11.5), plot.title = element_text(hjust = 0.5)) +
  # Give the graph a title and adjust the size of the text used.
  geom_line() +
  geom_vline(aes(xintercept = year_movie)) +
  # Make a line graph and then create a vertical line showing when
  # the movie associated with that name was released.
  labs(x = "Year", y = "Proportion Out of 1000") +
  facet_wrap(~ name, labeller = movie_labeller) +
  # The labeller changes the labels on each graph to the names
  # assigned in the first code chunk.
  # The scale function below sets the scale of the x axis so that each
  # graph starts at 1940 and ends in 2010 with 10 year increments.
  scale_x_continuous(limits = c(1940, 2010), breaks = seq(1940, 2010, 10))
Discussion
Looking at the graphs for the chosen names and movies, you can conclude that in some cases the film industry greatly affects the names given to babies.
In the project above I manually went through popular movies, found names that seemed distinctive, and then checked whether they boomed after the release of the movie. This leaves room for a more in-depth future study.
In a possible future study you could let R find the popular movies and popular names instead of doing it manually. This would start by scraping the Internet Movie Database (IMDb) to collect popular movies from each decade and then the names of the important characters in each movie. You would join this to the BabyNames data table. The next step would be to write something that analyzes each name from each movie in each decade and decides how many of them became “popular.” Because “popular” is not a numerical variable, you would need to decide what counts as popular. For example, if fewer than 1,000 babies are given the name after the release, it should not be designated as popular. You would also have to compare the periods before and after the movie’s release. For example, a name could count as becoming popular if the number of babies given that name in the 20 years after the release is at least triple the number in the 20 years before.
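The popularity rule described above could be sketched roughly as follows. This is only one possible reading of it: the function `became_popular`, its argument names, and the toy data are all hypothetical, and it assumes a data frame for a single name with columns year and nameTotal. The release year itself is left out of both windows, which is a choice made here and not something the rule specifies.

```r
# Hypothetical helper: decides whether one name "became popular" after a
# movie release, using raw counts as in the rule described above.
# `counts` is assumed to have columns year and nameTotal for a single name.
became_popular <- function(counts, release_year, window = 20,
                           min_after = 1000, min_ratio = 3) {
  before <- sum(counts$nameTotal[counts$year >= release_year - window &
                                 counts$year <  release_year])
  after  <- sum(counts$nameTotal[counts$year >  release_year &
                                 counts$year <= release_year + window])
  # Too few babies after the release: not popular, whatever the ratio.
  if (after < min_after) return(FALSE)
  # A name unseen before the release but common after it counts as popular.
  if (before == 0) return(TRUE)
  after / before >= min_ratio
}

# Made-up counts for one name, used only to exercise the function.
toy <- data.frame(year = 1970:1989,
                  nameTotal = c(rep(50, 10), rep(200, 10)))
became_popular(toy, release_year = 1980, window = 9)
```

The same function could be given rates per 1,000 births in place of counts, which would fit the argument made in the Task section that counts alone can mislead when total births change.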