For my project, I chose to examine how US state names have been used as baby names over time. I used RStudio to conduct my research and analyze my findings.
To begin, I predicted that fewer and fewer babies were given US state names as time went on and that, in general, more females than males were named after states.
Both data sets I needed were already available in R: baby names (the babynames package) and state names (the built-in state.name vector). So, I loaded the packages I would be using and viewed the data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(babynames)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ stringr 1.4.0
## ✓ tidyr 1.0.2 ✓ forcats 0.4.0
## ✓ readr 1.3.1
## ── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
View(babynames)
data("babynames")
state.name
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
as.data.frame(state.name) -> stateNames
Next, I intersected the names in the two data sets to identify the overlap.
intersect(babynames$name, stateNames$state.name) -> overlap
View(overlap)
From this, I was able to identify that babies born between 1880 and 2017 were named after the following states: Georgia, Virginia, Missouri, Nevada, Florida, Indiana, Arizona, Tennessee, Texas, Washington, Maryland, Louisiana, Iowa, Maine, Kansas, Oklahoma, Nebraska, Montana, Alaska, California, Illinois, Utah, Vermont, Wyoming, Colorado, Alabama, Michigan, Hawaii.
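As a small illustration of the intersection step, here is a sketch on a made-up vector of names (`sample_names` is hypothetical and not part of my data). It also shows one reason some states cannot appear in the overlap: assuming given names in babynames are single words, a state name containing a space has nothing to match.

```r
# Hypothetical sample of baby names, used only for illustration
sample_names <- c("Mary", "Georgia", "John", "Virginia", "Dakota")

# intersect() keeps the unique values that appear in both vectors,
# in the order they occur in the first vector
shared <- intersect(sample_names, state.name)
print(shared)

# State names containing a space cannot match a one-word given name
# (assumption: every name in babynames is a single word)
multi_word_states <- state.name[grepl(" ", state.name)]
print(multi_word_states)
```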
To break the data down by sex and year, I knew I also had to inner join the two data sets. I also renamed the first column of the stateNames data set for my future convenience.
as.data.frame(stateNames) -> stateNamesDf
colnames(stateNamesDf)[1] <- "name"
babynames %>%
inner_join(stateNamesDf) -> babyStateName
## Joining, by = "name"
## Warning: Column `name` joining character vector and factor, coercing into
## character vector
colnames(stateNames)[1] -> name
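The join can also be written so that the key column is named explicitly and stored as character from the start, which should avoid both the "Joining, by" message and the factor coercion warning. This is only a sketch: `state_tbl`, `toy_babies`, and `state_babies` are hypothetical names, and the toy rows are invented to mimic the shape of babynames.

```r
library(dplyr)

# state.name is a built-in character vector; tibble() keeps it as character
state_tbl <- tibble(name = state.name)

# Hypothetical rows shaped like babynames (values are invented)
toy_babies <- tibble(
  year = c(1900, 1900, 1900),
  sex  = c("F", "F", "M"),
  name = c("Georgia", "Mary", "Texas"),
  n    = c(120, 900, 8)
)

# Naming the key with `by` makes the join column explicit;
# only rows whose name is also a state name are kept
state_babies <- inner_join(toy_babies, state_tbl, by = "name")
print(state_babies)
```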
I then created a visualization to identify and display the top ten state names used as baby names.
babyStateName %>%
group_by(name) %>%
count(sort = TRUE) %>%
head(10) %>%
ggplot(aes(reorder(name,n), n)) + geom_col() + coord_flip()
As you can see, these were the ten most common state names used as baby names between 1880 and 2017, listed from most to least common: Virginia, Georgia, Nevada, Washington, Maryland, Montana, Arizona, Indiana, Florida, Tennessee. Note that count() tallies rows here, so the ranking reflects how many year-and-sex records each name appears in rather than the total number of babies given that name.
To test the first part of my hypothesis, I needed to break these results down by year. So, I created a line chart showing the yearly proportion of babies given each of the top ten state names.
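To rank names by total births instead of by number of records, the n column can be passed to count() as a weight. The sketch below contrasts the two on invented data (`toy`, `by_rows`, and `by_births` are hypothetical names, and the numbers are made up); I have not rerun my top ten this way, so the real ranking may differ.

```r
library(dplyr)

# Toy data in the same shape as babynames (values are made up)
toy <- tibble(
  year = c(1900, 1901, 1902, 1900),
  sex  = c("F", "F", "F", "F"),
  name = c("Nevada", "Nevada", "Nevada", "Virginia"),
  n    = c(5, 6, 7, 500)
)

# Unweighted: number of rows (year/sex records) per name
by_rows <- toy %>% count(name, sort = TRUE, name = "records")

# Weighted: total babies per name, summing the n column
by_births <- toy %>% count(name, wt = n, sort = TRUE, name = "births")

print(by_rows)
print(by_births)
```

In this toy example Nevada leads by records while Virginia leads by births, which is why the choice of measure matters for a "most popular" ranking.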
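# babynames has one row per year, sex, and name, so group by name and sex;
# otherwise the male and female rows for a name are joined into one zigzag line
babynames %>%
filter(name %in% c("Virginia", "Georgia", "Nevada", "Washington", "Maryland", "Montana", "Arizona", "Indiana", "Florida", "Tennessee")) %>%
ggplot(aes(year, prop, color = name, group = interaction(name, sex))) + geom_line()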
From this data visualization I found support for the first part of my hypothesis: fewer and fewer babies were given US state names as time went on. The one possible exception is the name Georgia, which appears to be on the rise again during the early 2000s.
The part of the data that stood out to me the most is the peak of the name Virginia in the 1920s. Future researchers may delve deeper into this finding and try to explain the ‘why?’ of the phenomenon. What was the significance of the name Virginia at the time? Was it purely a coincidence?
I then continued to test my hypothesis by breaking down my findings by sex. I created a new visualization, much like the previous one, to display this information.
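The overall trend could also be checked with a single number per year rather than by reading ten separate lines. The sketch below defines a hypothetical helper, `yearly_share()`, that computes the share of recorded babies in each year whose name is in a given set. It is shown on invented data, and it assumes the data frame has `year`, `name`, and `n` columns like babynames.

```r
library(dplyr)

# Hypothetical helper: share of recorded babies per year whose name
# is in `names_of_interest` (assumes columns year, name, n)
yearly_share <- function(data, names_of_interest) {
  data %>%
    group_by(year) %>%
    summarise(share = sum(n[name %in% names_of_interest]) / sum(n))
}

# Invented rows, used only to show the calculation
toy <- tibble(
  year = c(1900, 1900, 2000, 2000),
  name = c("Virginia", "Mary", "Virginia", "Emma"),
  n    = c(10, 90, 5, 95)
)

shares <- yearly_share(toy, c("Virginia"))
print(shares)
```

On the real data this would be called as `yearly_share(babynames, overlap)`. Since babynames only records names with enough occurrences in a year, the denominator is recorded babies rather than all births.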
babynames %>%
filter(name %in% c("Virginia", "Georgia", "Nevada", "Washington", "Maryland", "Montana", "Arizona", "Indiana", "Florida", "Tennessee")) %>%
ggplot(aes(year, prop, color = name)) + geom_line() + facet_wrap(~sex, scales = "free_y")
I found that the name Washington was popular for males in the late 1800s. More recently, the name Montana peaked for males in the late 1900s. Additionally, I determined that the peak in the name Virginia was primarily female. Looking at the proportions on the y-axis of each graph, I noticed that the female panel has much larger values. Although the male graph has peaks similar to those of the female graph, the proportions are much smaller.
As interesting as these findings were, I knew that I could test the second part of my hypothesis with a much cleaner data visualization. So, I created a column graph to show more effectively the difference in the number of females versus males named after states.
# geom_col() stacks the n values, so each bar's height is the total count for that sex
babyStateName %>%
ggplot(aes(x = sex, y = n)) + geom_col()
Overall, this visualization helped me put the number of females versus males named after states into perspective and supported the second part of my hypothesis: more females were named after states than males.
In conclusion, my findings supported my hypothesis: fewer and fewer babies were given US state names as time went on and, in general, more females were named after states than males.