Color-Specific Baby Names

Hypothesis: Parents do not choose a baby’s name based on how common a color is, or on the color itself.

To test this hypothesis, I analyzed the name trends (if any) for the basic rainbow color names in the babynames dataset (Red, Orange, Yellow, Green, Blue, Violet). I then cross-referenced babynames with a second dataset that lists shades of color by name, grouped by the color type itself.

Required R Packages

This analysis uses R and RStudio with the babynames and tidyverse packages. It also uses the scales package to replace the scientific notation on the count axis of one of the visuals.

Analysis and Code

To begin, I plotted the popularity, if any, of the basic rainbow color names over time. The code below filters on sex == "M", so the plot covers male babies only. From this visual, I concluded that baby names based on the basic rainbow colors (Red, Orange, Yellow, Green, Blue, Violet) have declined sharply and have not been popular for quite some time.

library(babynames)
library(tidyverse)
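# Basic rainbow color names to look up
rainbow <- c("Red", "Orange", "Yellow", "Green", "Blue", "Violet")
# Keep male records whose name is one of the rainbow colors
babynames %>% 
  filter(sex == "M" & name %in% rainbow) -> rainbowNames
# Color by name so each name gets its own line instead of one zigzag line
ggplot(rainbowNames, aes(year, prop, color = name)) + geom_line()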

To analyze this further, I found the total number of uses of these color names among male babies in the babynames dataset. The total is only 3,783, and the highest single-year count (n) is 48, for “Green” in 1927. Both numbers are very low.

rainbowNames %>% 
  summarise(total = sum(n))

## # A tibble: 1 x 1
##   total
##   <int>
## 1  3783

rainbowNames %>% 
  arrange(desc(n))

## # A tibble: 268 x 5
##     year sex   name      n      prop
##    <dbl> <chr> <chr> <int>     <dbl>
##  1  1927 M     Green    48 0.0000413
##  2  1916 M     Green    47 0.0000509
##  3  1914 M     Green    46 0.0000673
##  4  1886 M     Green    45 0.000378 
##  5  1920 M     Green    44 0.0000400
##  6  1918 M     Green    43 0.000041 
##  7  1917 M     Green    42 0.0000438
##  8  1919 M     Green    42 0.0000414
##  9  1921 M     Green    42 0.0000369
## 10  1922 M     Green    42 0.0000373
## # … with 258 more rows

To test my hypothesis further, I loaded a second dataset of color names to cross-reference with babynames for any overlap. I found the dataset on https://data.world/.

df <- read.csv("https://query.data.world/s/3hl25k2iqijv37rxwpb4voyofdh5yx", header=TRUE, stringsAsFactors=FALSE)

After importing the dataset into R, I compared the name column of babynames with the Color.Name column of the color dataset and kept only the names that appear in both, using the intersect() function. These overlapping names are the ones analyzed in the rest of this report. I called the resulting vector colorsList.

intersect(babynames$name, df$Color.Name) -> colorsList
colorsList

##  [1] "Rose"      "Myrtle"    "Pearl"     "Olive"     "Ruby"      "Violet"   
##  [7] "Veronica"  "Iris"      "Amber"     "Magnolia"  "Cherry"    "Jasper"   
## [13] "Pink"      "Ivory"     "Coral"     "Lemon"     "Almond"    "Olivine"  
## [19] "Gray"      "Red"       "Auburn"    "Carmine"   "Snow"      "Ceil"     
## [25] "Silver"    "Vanilla"   "Fawn"      "White"     "Emerald"   "Jasmine"  
## [31] "Ginger"    "Onyx"      "Salmon"    "Lion"      "Pear"      "Orchid"   
## [37] "Bubbles"   "Ruddy"     "Jonquil"   "Scarlet"   "Jade"      "Ebony"    
## [43] "Cerise"    "Rajah"     "Buff"      "Capri"     "Flame"     "Jet"      
## [49] "Teal"      "Sand"      "Beaver"    "Sienna"    "Tangerine" "Cadet"    
## [55] "Cinnamon"  "Amethyst"  "Blue"      "Aqua"      "Burgundy"  "Indigo"   
## [61] "Topaz"     "Sepia"     "Denim"     "Crimson"   "Saffron"   "Shadow"   
## [67] "Azure"     "Mahogany"  "Sangria"   "Tan"       "Turquoise" "Champagne"
## [73] "Copper"    "Lilac"     "Sapphire"  "Wisteria"  "Beige"     "Magenta"  
## [79] "Sunset"    "Cyan"      "Xanadu"    "Umber"     "Tangelo"   "Bronze"   
## [85] "Maize"     "Platinum"  "Harlequin" "Thistle"   "Linen"     "Cobalt"   
## [91] "Amaranth"  "Cerulean"
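
One caveat about intersect() is that it compares strings exactly, so a color name that differs from a baby name only in capitalization or surrounding whitespace would not be matched. I have not checked whether the color dataset contains such entries. The sketch below uses made-up vectors (not the real data) and a hypothetical helper, tidy_name(), to show how the two sides could be normalized before intersecting:

```r
# Made-up example vectors; these are NOT the real babynames or data.world values.
baby_names  <- c("Rose", "Ruby", "Violet", "Jasper")
color_names <- c("rose", " Ruby", "VIOLET", "Navy Blue")

# Hypothetical helper: trim whitespace and capitalize only the first letter,
# matching the capitalization visible in the colorsList output above.
tidy_name <- function(x) {
  x <- trimws(x)
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

# Exact comparison finds nothing because of case and whitespace differences
exact_match <- intersect(baby_names, color_names)

# Normalizing the color names first recovers the three shared names
tidy_match <- intersect(baby_names, tidy_name(color_names))
```

With the real data, the equivalent call would be intersect(babynames$name, tidy_name(df$Color.Name)). It may or may not return more names than colorsList.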

This intersection gave me a variety of results, but I wanted to narrow the list further. To do so, I totaled the counts for each name and sex combination in colorsList across all years and kept the 15 largest totals. I called this visual chart1.

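babynames %>% 
  filter(name %in% colorsList) %>% 
  # Total count per name and sex across all years
  group_by(name, sex) %>% 
  summarise(total = sum(n)) %>% 
  ungroup() %>% 
  # Keep the 15 largest totals
  arrange(desc(total)) %>% 
  head(15) %>% 
  # Horizontal bars, ordered by total
  ggplot(aes(reorder(name, total), total)) + geom_col() + coord_flip() -> chart1
chart1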

The visual that I got from the code was not very clean. To make it easier to read, I added a theme, axis titles and a chart title, and I used the scales package to replace the scientific notation on the count axis with comma-separated numbers. Note that the axes are swapped by the coord_flip() function in the code above: the x aesthetic (name) is drawn vertically and the y aesthetic (total) horizontally, which is why the code below uses xlab() for the names and scale_y_continuous() for the totals.

# Install scales once from the console before knitting;
# calling install.packages() while knitting fails because no CRAN mirror is set.
# install.packages("scales")

library(scales)
chart1 + theme_light() + 
  xlab("Name") + 
  ylab("Total Number of Names Over Time") + 
  ggtitle("Most Popular Baby Names of Colors from 1880-2017") +
  scale_y_continuous(labels = comma)

These visuals suggest that the color names parents give their children are not necessarily the most common colors. To look at this more closely, and to compare male and female names across 1880-2017, I ranked the color names for the year 1880, the year 1949 (the midpoint of the range is 1948.5), and the final year, 2017.

babynames %>% 
  filter(name %in% colorsList) %>% 
  filter(year == 1880, sex =="F") %>% 
  arrange(desc(prop)) %>% 
  mutate(rank = row_number()) %>% 
  mutate(percent = (prop * 100)) -> rankedColorsF1880
rankedColorsF1880

## # A tibble: 11 x 7
##     year sex   name         n      prop  rank percent
##    <dbl> <chr> <chr>    <int>     <dbl> <int>   <dbl>
##  1  1880 F     Rose       700 0.00717       1 0.717  
##  2  1880 F     Myrtle     615 0.00630       2 0.630  
##  3  1880 F     Pearl      569 0.00583       3 0.583  
##  4  1880 F     Olive      224 0.00229       4 0.229  
##  5  1880 F     Ruby        92 0.000943      5 0.0943 
##  6  1880 F     Violet      42 0.000430      6 0.0430 
##  7  1880 F     Veronica    14 0.000143      7 0.0143 
##  8  1880 F     Iris        11 0.000113      8 0.0113 
##  9  1880 F     Amber        9 0.0000922     9 0.00922
## 10  1880 F     Magnolia     8 0.0000820    10 0.00820
## 11  1880 F     Cherry       6 0.0000615    11 0.00615

babynames %>% 
  filter(name %in% colorsList) %>% 
  filter(year == 1880, sex =="M") %>% 
  arrange(desc(prop)) %>% 
  mutate(rank = row_number()) %>% 
  mutate(percent = (prop * 100)) -> rankedColorsM1880
rankedColorsM1880

## # A tibble: 7 x 7
##    year sex   name       n      prop  rank percent
##   <dbl> <chr> <chr>  <int>     <dbl> <int>   <dbl>
## 1  1880 M     Jasper    98 0.000828      1 0.0828 
## 2  1880 M     Pearl     62 0.000524      2 0.0524 
## 3  1880 M     Pink      30 0.000253      3 0.0253 
## 4  1880 M     Ivory      8 0.0000676     4 0.00676
## 5  1880 M     Rose       7 0.0000591     5 0.00591
## 6  1880 M     Ruby       6 0.0000507     6 0.00507
## 7  1880 M     Myrtle     5 0.0000422     7 0.00422

During the year 1880, the male and female tables overlapped in four names: “Rose,” “Myrtle,” “Pearl” and “Ruby.” This was interesting to note. “Myrtle” had the second-highest count for females (615) and the seventh-highest for males (5). “Myrtle” is not a common color name, so it seems more likely that parents chose it because they liked the name itself. “Rose” had 700 instances for females and 7 for males, “Pearl” had 569 instances for females compared to 62 for males, and “Ruby” had 92 instances for females and 6 for males.

Comparing males and females in 1880, color names were much more common among females. The male table lists only seven names, against eleven for females. There was some overlap between the two tables, but the overlapping names were too rare among males (62 instances at most) to be considered “popular.”
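
The ranking pipeline is repeated almost unchanged for each year and sex. As a sketch only, the steps could be wrapped in a small function. The name rank_names() and its arguments are my own hypothetical choices, not part of the original analysis, and the toy data frame copies four rows from the 1880 tables above so the example runs without the babynames package:

```r
library(dplyr)

# Hypothetical helper: rank the names in `keep` for one year and sex,
# using the same steps as the pipelines in this report.
rank_names <- function(data, keep, yr, sx) {
  data %>%
    filter(name %in% keep, year == yr, sex == sx) %>%
    arrange(desc(prop)) %>%
    mutate(rank = row_number(),
           percent = prop * 100)
}

# Toy data in the same shape as babynames (four rows from the 1880 tables)
toy <- tibble(
  year = 1880,
  sex  = c("F", "F", "M", "M"),
  name = c("Pearl", "Rose", "Jasper", "Pearl"),
  n    = c(569L, 700L, 98L, 62L),
  prop = c(0.00583, 0.00717, 0.000828, 0.000524)
)

# Female ranking for 1880: Rose first, then Pearl
ranked <- rank_names(toy, c("Rose", "Pearl", "Jasper"), 1880, "F")
```

With the real data, a call such as rank_names(babynames, colorsList, 1949, "M") should give the same table as the longer pipeline for that year and sex.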

babynames %>% 
  filter(name %in% colorsList) %>% 
  filter(year == 1949, sex =="F") %>% 
  arrange(desc(prop)) %>% 
  mutate(rank = row_number()) %>% 
  mutate(percent = (prop * 100)) -> rankedColorsF1949
rankedColorsF1949

## # A tibble: 20 x 7
##     year sex   name         n       prop  rank  percent
##    <dbl> <chr> <chr>    <int>      <dbl> <int>    <dbl>
##  1  1949 F     Rose      5370 0.00306        1 0.306   
##  2  1949 F     Ruby      2576 0.00147        2 0.147   
##  3  1949 F     Veronica  1424 0.000811       3 0.0811  
##  4  1949 F     Iris       920 0.000524       4 0.0524  
##  5  1949 F     Ginger     665 0.000379       5 0.0379  
##  6  1949 F     Pearl      575 0.000328       6 0.0328  
##  7  1949 F     Myrtle     523 0.000298       7 0.0298  
##  8  1949 F     Violet     483 0.000275       8 0.0275  
##  9  1949 F     Cherry     292 0.000166       9 0.0166  
## 10  1949 F     Olive       97 0.0000553     10 0.00553 
## 11  1949 F     Amber       93 0.0000530     11 0.00530 
## 12  1949 F     Ivory       53 0.0000302     12 0.00302 
## 13  1949 F     Coral       47 0.0000268     13 0.00268 
## 14  1949 F     Magnolia    43 0.0000245     14 0.00245 
## 15  1949 F     Scarlet     29 0.0000165     15 0.00165 
## 16  1949 F     Fawn        26 0.0000148     16 0.00148 
## 17  1949 F     Jade        16 0.00000911    17 0.000911
## 18  1949 F     Silver       6 0.00000342    18 0.000342
## 19  1949 F     Ceil         5 0.00000285    19 0.000285
## 20  1949 F     Emerald      5 0.00000285    20 0.000285

babynames %>% 
  filter(name %in% colorsList) %>% 
  filter(year == 1949, sex =="M") %>% 
  arrange(desc(prop)) %>% 
  mutate(rank = row_number()) %>% 
  mutate(percent = (prop * 100)) -> rankedColorsM1949
rankedColorsM1949

## # A tibble: 15 x 7
##     year sex   name        n       prop  rank  percent
##    <dbl> <chr> <chr>   <int>      <dbl> <int>    <dbl>
##  1  1949 M     Jasper    193 0.000107       1 0.0107  
##  2  1949 M     Carmine   133 0.0000738      2 0.00738 
##  3  1949 M     Ivory     123 0.0000683      3 0.00683 
##  4  1949 M     Gray       32 0.0000178      4 0.00178 
##  5  1949 M     Rose       24 0.0000133      5 0.00133 
##  6  1949 M     Ruby       23 0.0000128      6 0.00128 
##  7  1949 M     Pearl      20 0.0000111      7 0.00111 
##  8  1949 M     Lemon      14 0.00000777     8 0.000777
##  9  1949 M     Iris       11 0.0000061      9 0.00061 
## 10  1949 M     Cherry     10 0.00000555    10 0.000555
## 11  1949 M     Auburn      9 0.00000499    11 0.000499
## 12  1949 M     Almond      8 0.00000444    12 0.000444
## 13  1949 M     Pink        8 0.00000444    13 0.000444
## 14  1949 M     Ruddy       7 0.00000388    14 0.000388
## 15  1949 M     Coral       6 0.00000333    15 0.000333

During the year 1949, seven names appeared for both males and females: “Rose,” “Ruby,” “Iris,” “Pearl,” “Cherry,” “Ivory” and “Coral.” None of these names is in the original list of rainbow names (Red, Orange, Yellow, Green, Blue, Violet). All of them were more numerous among females except “Ivory,” which had 123 instances for males and 53 for females.
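
The overlap between the two tables can also be computed rather than read off by eye. The sketch below types in the name columns from the two 1949 tables above and reuses intersect(). The vector names are my own:

```r
# Names copied from the 1949 tables above, in ranked order
female_1949 <- c("Rose", "Ruby", "Veronica", "Iris", "Ginger", "Pearl",
                 "Myrtle", "Violet", "Cherry", "Olive", "Amber", "Ivory",
                 "Coral", "Magnolia", "Scarlet", "Fawn", "Jade", "Silver",
                 "Ceil", "Emerald")
male_1949 <- c("Jasper", "Carmine", "Ivory", "Gray", "Rose", "Ruby", "Pearl",
               "Lemon", "Iris", "Cherry", "Auburn", "Almond", "Pink",
               "Ruddy", "Coral")

# Names given to both sexes in 1949, in the order of the female ranking
shared_1949 <- intersect(female_1949, male_1949)
shared_1949
```

With the ranking tables already in memory, the same result should come from intersect(rankedColorsF1949$name, rankedColorsM1949$name).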

babynames %>% 
  filter(name %in% colorsList) %>% 
  filter(year == 2017, sex =="F") %>% 
  arrange(desc(prop)) %>% 
  mutate(rank = row_number()) %>% 
  mutate(percent = (prop * 100)) -> rankedColorsF2017
rankedColorsF2017

## # A tibble: 49 x 7
##     year sex   name         n     prop  rank percent
##    <dbl> <chr> <chr>    <int>    <dbl> <int>   <dbl>
##  1  2017 F     Violet    4699 0.00251      1  0.251 
##  2  2017 F     Ruby      3540 0.00189      2  0.189 
##  3  2017 F     Jade      2725 0.00145      3  0.145 
##  4  2017 F     Jasmine   2256 0.00120      4  0.120 
##  5  2017 F     Rose      2059 0.00110      5  0.110 
##  6  2017 F     Iris      1969 0.00105      6  0.105 
##  7  2017 F     Sienna    1392 0.000742     7  0.0742
##  8  2017 F     Olive     1241 0.000662     8  0.0662
##  9  2017 F     Veronica   817 0.000436     9  0.0436
## 10  2017 F     Magnolia   808 0.000431    10  0.0431
## # … with 39 more rows

babynames %>% 
  filter(name %in% colorsList) %>% 
  filter(year == 2017, sex =="M") %>% 
  arrange(desc(prop)) %>% 
  mutate(rank = row_number()) %>% 
  mutate(percent = (prop * 100)) -> rankedColorsM2017
rankedColorsM2017

## # A tibble: 32 x 7
##     year sex   name        n      prop  rank percent
##    <dbl> <chr> <chr>   <int>     <dbl> <int>   <dbl>
##  1  2017 M     Jasper   2083 0.00106       1 0.106  
##  2  2017 M     Onyx      188 0.0000958     2 0.00958
##  3  2017 M     Gray      144 0.0000733     3 0.00734
##  4  2017 M     Denim     141 0.0000718     4 0.00718
##  5  2017 M     Jet       131 0.0000667     5 0.00667
##  6  2017 M     Carmine   125 0.0000637     6 0.00637
##  7  2017 M     Indigo     52 0.0000265     7 0.00265
##  8  2017 M     Ivory      48 0.0000244     8 0.00244
##  9  2017 M     Jade       40 0.0000204     9 0.00204
## 10  2017 M     Crimson    36 0.0000183    10 0.00183
## # … with 22 more rows

During the year 2017, “Jade” was the only name that appeared in the top ten for both males and females (the tables above show only the first ten rows of each ranking). This is interesting, considering that the tables for the earlier years had more names in common.

Another fascinating aspect of the 2017 tables is that they contain more distinct color names than either earlier year: 49 for females and 32 for males, compared with 20 and 15 in 1949. With few exceptions, we do not see the stereotypically common color names here. In addition, the names at the top of these lists may not have been chosen by parents because they are colors. Most of the names in these tables sound unusual and outside-the-box to me.

Looking back at the list I compiled with the intersect() function, the only names from my original rainbow list that appear are “Red,” “Blue” and “Violet.” Other plain color words such as “White” and “Gray” also appear. Of these, “Red,” “White” and “Blue” do not show up in any of the rows displayed for 1880, 1949 or 2017. “Gray” appears only for males and with small counts (32 in 1949 and 144 in 2017). “Violet,” which is also a flower name, is the one clear exception: it ranks first among the female color names in 2017. On the whole, this supports the position that the color itself is not significant to parents when choosing a name for their baby.

Halle Brennan

2/20/2020
