You probably know a Kaitlyn; of course you do. The name Kaitlyn, along with its derivations, became extremely popular from the late 1980s to 2000. This analysis shows three distinct periods: 1. a steady rise in usage between 1980 and 1990; 2. consistently high usage in the 1990s, peaking in 2000, when the combined spellings would have been the most popular girl's name; 3. a rapid and steady decline in usage after 2000 that continues through 2014, the most recent year in the data.
This dataset was retrieved from The US Baby Names database at https://www.kaggle.com/kaggle/us-baby-names on 2016-08-18.
rm(list=ls())
#setwd("~/Analytics Course/Kaggle/US Baby Names/output")
setwd("C:/Users/Kier/Documents/Analytics Course/Kaggle/US Baby Names/Katelyns")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tibble)
# Data acquired from https://www.kaggle.com/kaggle/us-baby-names
# on 2016-08-18.
# This analysis uses the National Names dataset
national_names_raw <- read.csv("NationalNames.csv")
names_full <- national_names_raw
kaitlyn_derivations <- c("Katelyn", "Katelin", "Katelynn",
"Katelynne", "Katelan", "Catelyn",
"Catelin", "Katelynd", "Katelen",
"Katelind", "Katelyne", "Catelynn",
"Katelinn", "Katelund", "Kateline",
"Catelynne", "Katelon", "Katelina",
"Caitlin", "Kaitlyn", "Caitlyn",
"Kaitlin", "Caitilin", "Kaitlan",
"Kaitlynn", "Caitlan", "Caitlynn",
"Kaitlen", "Caitlen", "Kaithlyn",
"Caitland", "Kaitland", "Caitlain",
"Kaitlynne", "Caitlynne", "Caithlin",
"Caitlinn", "Kaityln", "Kaithlin",
"Kaitlinn", "Kaitlyne", "Kaitlynd",
"Kaitlain", "Kaitlind", "Caitlynd",
"Caitlyne", "Kaitlon", "Kaitylyn",
"Caitline", "Kaitelyn", "Caitylyn",
"Caityln", "Kaitleen", "Kaitelynn",
"Caitelyn", "Caitilyn", "Kaithlynn")
length(kaitlyn_derivations)
## [1] 57
There are 57 derivations of the name Kaitlyn in this list.
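The list above is typed in by hand. As a rough sketch of how candidate spellings might be found instead, a regular expression can be run over the names in the data. The pattern and the toy name pool below are assumptions made for illustration, not the method used to build the list. The pattern deliberately over-matches (it picks up Kathleen, for example), so the results would still need a manual review.

```r
# Toy name pool (hypothetical); a real run might use unique(names_full$Name).
candidate_pool <- c("Kaitlyn", "Katelyn", "Caitlin", "Katherine", "Emily", "Kathleen")

# Hypothetical pattern: C or K, "a", optional "i", "t", then an "l" and an "n" later on.
kaitlyn_pattern <- "^[CK]ai?t[a-z]*l[a-z]*n[a-z]*$"

# Keep only the names that match the pattern.
candidates <- candidate_pool[grepl(kaitlyn_pattern, candidate_pool)]
print(candidates)
```

Tightening the pattern trades false positives for missed spellings, which is one reason a hand-curated list can be the safer choice.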
kaitlyns <- names_full %>%
filter(Gender == "F" & Name %in% kaitlyn_derivations)
k2 <- kaitlyns %>%
group_by(Year) %>%
summarise(sum_of_names = sum(Count))
ggplot(k2, aes(Year, sum_of_names)) + geom_line() +
labs(title="Number of baby girls named Kaitlyn through history",
y="Number of Babies Named") + geom_smooth()
Note:
The rapid growth from 1980 to 1990;
Sustained high usage from 1990 to 2000;
The rapid decline after 2000.
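The yearly totals behind this plot come from a filter, group, and sum pipeline. A minimal sketch of that pipeline on a made-up data frame follows; the counts are invented and only the column names mirror NationalNames.csv.

```r
library(dplyr)

# Toy stand-in for NationalNames.csv (made-up counts, same column names).
toy_names <- data.frame(
  Name   = c("Kaitlyn", "Katelyn", "Emily", "Kaitlyn", "Caitlin", "Kaitlyn"),
  Year   = c(1990, 1990, 1990, 1991, 1991, 1991),
  Gender = c("F", "F", "F", "F", "F", "M"),
  Count  = c(10, 5, 20, 12, 7, 1),
  stringsAsFactors = FALSE
)
spellings <- c("Kaitlyn", "Katelyn", "Caitlin")

# Keep girls with a matching spelling, then add up the counts per year.
yearly_totals <- toy_names %>%
  filter(Gender == "F", Name %in% spellings) %>%
  group_by(Year) %>%
  summarise(sum_of_names = sum(Count))
print(yearly_totals)
```

In this toy data the single male Kaitlyn row and the Emily row are dropped by the filter, so the totals are 15 for 1990 and 19 for 1991.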
k1 <- kaitlyns %>%
group_by(Name) %>%
summarise(sub_total = sum(Count)) %>%
arrange(-sub_total)
head(k1,10)
## # A tibble: 10 x 2
## Name sub_total
## <fctr> <int>
## 1 Kaitlyn 158601
## 2 Katelyn 127809
## 3 Caitlin 110413
## 4 Kaitlin 56678
## 5 Caitlyn 50566
## 6 Katelynn 28857
## 7 Kaitlynn 15248
## 8 Katelin 8808
## 9 Caitlynn 4644
## 10 Catelyn 2112
Get data only for the growth years
# Full set for 1980 to 1990 (no Gender filter here, so counts include both sexes)
names_80_90 <- names_full %>%
filter(Year >= 1980 & Year <= 1990) %>%
group_by(Year, Name) %>%
summarise(sum_of_names = sum(Count)) %>%
arrange(Year, -sum_of_names)
# Kaitlyn data for 1980 to 1990
kaitlyn_80_90 <- names_80_90 %>%
filter(Name %in% kaitlyn_derivations) %>%
arrange(-sum_of_names)
head(kaitlyn_80_90, 10)
## Source: local data frame [10 x 3]
## Groups: Year [6]
##
## Year Name sum_of_names
## <int> <fctr> <int>
## 1 1988 Caitlin 7269
## 2 1989 Caitlin 7072
## 3 1990 Caitlin 7045
## 4 1987 Caitlin 5016
## 5 1990 Katelyn 4495
## 6 1990 Kaitlyn 4318
## 7 1989 Katelyn 4044
## 8 1986 Caitlin 3897
## 9 1985 Caitlin 3619
## 10 1988 Katelyn 3520
Interestingly, Caitlin is the leading spelling in every year shown here (1985 to 1990), though Katelyn and Kaitlyn are closing the gap by 1990.
k89_1 <- kaitlyn_80_90 %>%
group_by(Year) %>%
summarise(total = sum(sum_of_names)) %>%
arrange(Year)
ggplot(k89_1, aes(Year, total)) + geom_line() + geom_smooth()
print(paste0("Peak yearly usage between 1980 and 1990 was ",
round(max(k89_1$total)/min(k89_1$total)*100, 2), "% of the lowest yearly usage"))
## [1] "Peak yearly usage between 1980 and 1990 was 2180.97% of the lowest yearly usage"
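The figure printed above is the largest yearly total divided by the smallest, times 100. That is a ratio expressed as a percentage, which is always 100 points higher than the percentage increase (so 2180.97% as a ratio corresponds to an increase of 2080.97%). A small sketch of the difference, using made-up totals rather than the real ones:

```r
# Made-up yearly totals, only to show the arithmetic.
first_total <- 1000
last_total  <- 21810

ratio_pct  <- last_total / first_total * 100        # last total as a percentage of the first
growth_pct <- (last_total / first_total - 1) * 100  # percentage increase from first to last

print(c(ratio_pct = ratio_pct, growth_pct = growth_pct))
```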
This plot shows strong, steady growth, with little variation over this period.
From 1980 to 1990, use of the name Kaitlyn and its derivations rose roughly 22-fold, reaching 22,985 girls given the name in 1990.
The decade from 1990 to 2000 saw high, sustained usage, peaking in 2000.
kaitlyn_90_00 <- kaitlyns %>%
filter(Year >= 1990 & Year <= 2000) %>%
group_by(Year) %>%
summarise(sum_of_names = sum(Count))
# Plot out usage from 1990 to 2000
ggplot(kaitlyn_90_00, aes(Year, sum_of_names)) + geom_line() + geom_smooth() +
labs(title="Usage of Kaitlyn from 1990 to 2000",
x="Year", y="Usage")
Notice that the name settles into fairly consistent usage over the decade, with modest growth. The range between low and high is only about 4,000 (22,626 to 26,789). Average usage over this period was about 25,320 per year.
range(kaitlyn_90_00$sum_of_names)
## [1] 22626 26789
mean(kaitlyn_90_00$sum_of_names)
## [1] 25319.73
sd(kaitlyn_90_00$sum_of_names)
## [1] 1463.496
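One way to put a number on "fairly consistent" is the coefficient of variation, the standard deviation divided by the mean. This is an added illustration rather than part of the original analysis; the inputs are copied from the output above.

```r
# Summary statistics copied from the output above.
low  <- 22626
high <- 26789
avg  <- 25319.73
sdev <- 1463.496

spread <- high - low          # width of the range
cv_pct <- sdev / avg * 100    # coefficient of variation, in percent

print(spread)
print(round(cv_pct, 1))
```

A yearly standard deviation of under 6% of the mean is consistent with reading this decade as a plateau.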
Now let’s look at the period from 2000 to 2014 (the most recent data).
kaitlyn_00_14 <- kaitlyns %>%
filter(Year >= 2000) %>%
group_by(Year) %>%
summarise(sum_of_names = sum(Count))
ggplot(kaitlyn_00_14, aes(Year, sum_of_names)) + geom_line() + geom_smooth()
A long, sustained decline with little variation.
# Get all the female names for 1990, order from highest usage to lowest
year_1990 <- national_names_raw %>%
filter(Gender == "F" & (Year == 1990)) %>%
group_by(Year, Name) %>%
summarise(sum_of_name = sum(Count)) %>%
arrange(-sum_of_name, Name)
(kaitlyns_1990 <- year_1990 %>%
filter(Name %in% kaitlyn_derivations) %>%
summarise(total_kaitlyns_1990 = sum(sum_of_name)) %>%
select(-Year) %>%
unlist())
## total_kaitlyns_1990
## 22985
dispersion_1990 <- year_1990 %>%
filter(Name %in% kaitlyn_derivations) %>%
arrange(-sum_of_name)
length(unique(dispersion_1990$Name))
## [1] 38
head(dispersion_1990, 10)
## Source: local data frame [10 x 3]
## Groups: Year [1]
##
## Year Name sum_of_name
## <int> <fctr> <int>
## 1 1990 Caitlin 7025
## 2 1990 Katelyn 4485
## 3 1990 Kaitlyn 4312
## 4 1990 Kaitlin 3472
## 5 1990 Caitlyn 1585
## 6 1990 Katelynn 695
## 7 1990 Katelin 409
## 8 1990 Kaitlynn 340
## 9 1990 Kaitlan 107
## 10 1990 Caitlan 88
ranked_names_1990 <- tibble::rownames_to_column(year_1990, var = "rank")
rn_1990 <- ranked_names_1990 %>%
mutate(rank = as.integer(rank))
(top_twenty_1990 <- rn_1990 %>%
filter(rank <= 20))
## Source: local data frame [20 x 4]
## Groups: Year [1]
##
## rank Year Name sum_of_name
## <int> <int> <fctr> <int>
## 1 1 1990 Jessica 46466
## 2 2 1990 Ashley 45549
## 3 3 1990 Brittany 36535
## 4 4 1990 Amanda 34406
## 5 5 1990 Samantha 25864
## 6 6 1990 Sarah 25808
## 7 7 1990 Stephanie 24856
## 8 8 1990 Jennifer 22221
## 9 9 1990 Elizabeth 20742
## 10 10 1990 Lauren 20498
## 11 11 1990 Megan 20255
## 12 12 1990 Emily 19358
## 13 13 1990 Nicole 17950
## 14 14 1990 Kayla 17536
## 15 15 1990 Amber 15863
## 16 16 1990 Rachel 15703
## 17 17 1990 Courtney 15377
## 18 18 1990 Danielle 14330
## 19 19 1990 Heather 14217
## 20 20 1990 Melissa 13996
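The rank column above is taken from row names after sorting, which works only as long as the row order is preserved. A sketch of an alternative that computes the rank explicitly with dplyr's `min_rank()` is shown here on made-up counts; it is a possible variation, not the approach used in this analysis.

```r
library(dplyr)

# Toy yearly counts (made-up numbers, deliberately unsorted).
toy_year <- data.frame(
  Name = c("Amanda", "Jessica", "Brittany", "Ashley"),
  sum_of_name = c(34, 46, 36, 45),
  stringsAsFactors = FALSE
)

# Rank 1 goes to the largest count; ties would share the lower rank.
ranked <- toy_year %>%
  mutate(rank = min_rank(desc(sum_of_name))) %>%
  arrange(rank)

top_two <- ranked %>% filter(rank <= 2)
print(top_two)
```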
top_twenty_1990$Name <- ordered(x=top_twenty_1990$Name, levels = top_twenty_1990$Name)
ggplot(top_twenty_1990,aes(x=Name, y=sum_of_name))+
geom_bar(stat = "identity") +
labs(title="Top 20 girl names of 1990",
y="Number of girls with name") +
geom_hline(yintercept = kaitlyns_1990, color = "red") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
annotate("text", x="Nicole", y=27000,
label = "Red line represents number of Kaitlyns born in 1990",
color = "red")
The number of girls named Kaitlyn, including derivations, was 22,985 in 1990. As a single name, that total would have ranked 8th in 1990: a very popular name.
Even though the average usage in the top ten is 30,294, there is a break of about 8,500 between #4 (Amanda, 34,406) and #5 (Samantha, 25,864).
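The 8th-place ranking and the top-ten average can be checked directly from the table above. The sketch below copies the ten largest counts for 1990 and the combined Kaitlyn total from the output; the variable names are introduced only for this illustration.

```r
# Top-ten counts for 1990, copied from the table above.
top_ten_1990 <- c(Jessica = 46466, Ashley = 45549, Brittany = 36535,
                  Amanda = 34406, Samantha = 25864, Sarah = 25808,
                  Stephanie = 24856, Jennifer = 22221, Elizabeth = 20742,
                  Lauren = 20498)
kaitlyn_total_1990 <- 22985

# Rank = one more than the number of names with a strictly larger count.
combined_rank <- sum(top_ten_1990 > kaitlyn_total_1990) + 1
print(combined_rank)

# Average count across the top ten names.
print(mean(top_ten_1990))
```

Seven names have larger counts than the combined total, which places it 8th, just ahead of Jennifer.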
1995 falls in the middle of the sustained usage period.
Get all the female names for 1995, ordered from highest usage to lowest.
year_1995 <- national_names_raw %>%
filter(Gender == "F" & (Year == 1995)) %>%
group_by(Year, Name) %>%
summarise(sum_of_name = sum(Count)) %>%
arrange(-sum_of_name, Name)
(kaitlyns_1995 <- year_1995 %>%
filter(Name %in% kaitlyn_derivations) %>%
summarise(total_kaitlyns_1995 = sum(sum_of_name)) %>%
select(-Year) %>%
unlist())
## total_kaitlyns_1995
## 26450
dispersion_1995 <- year_1995 %>%
filter(Name %in% kaitlyn_derivations) %>%
arrange(-sum_of_name)
length(unique(dispersion_1995$Name))
## [1] 39
head(dispersion_1995, 10)
## Source: local data frame [10 x 3]
## Groups: Year [1]
##
## Year Name sum_of_name
## <int> <fctr> <int>
## 1 1995 Kaitlyn 7387
## 2 1995 Katelyn 5573
## 3 1995 Caitlin 4382
## 4 1995 Kaitlin 3602
## 5 1995 Caitlyn 2300
## 6 1995 Katelynn 1238
## 7 1995 Kaitlynn 698
## 8 1995 Katelin 437
## 9 1995 Caitlynn 160
## 10 1995 Kaitlan 83
ranked_names_1995 <- tibble::rownames_to_column(year_1995, var = "rank")
rn_1995 <- ranked_names_1995 %>%
mutate(rank = as.integer(rank))
(top_twenty_1995 <- rn_1995 %>%
filter(rank <= 20))
## Source: local data frame [20 x 4]
## Groups: Year [1]
##
## rank Year Name sum_of_name
## <int> <int> <fctr> <int>
## 1 1 1995 Jessica 27938
## 2 2 1995 Ashley 26603
## 3 3 1995 Emily 24377
## 4 4 1995 Samantha 21646
## 5 5 1995 Sarah 21365
## 6 6 1995 Taylor 20424
## 7 7 1995 Hannah 17012
## 8 8 1995 Brittany 16477
## 9 9 1995 Amanda 16344
## 10 10 1995 Elizabeth 16183
## 11 11 1995 Kayla 16083
## 12 12 1995 Rachel 16041
## 13 13 1995 Megan 15529
## 14 14 1995 Alexis 14330
## 15 15 1995 Lauren 13444
## 16 16 1995 Stephanie 12979
## 17 17 1995 Courtney 12771
## 18 18 1995 Jennifer 12685
## 19 19 1995 Nicole 12276
## 20 20 1995 Victoria 12250
top_twenty_1995$Name <- ordered(x=top_twenty_1995$Name, levels = top_twenty_1995$Name)
ggplot(top_twenty_1995,aes(x=Name, y=sum_of_name))+
geom_bar(stat = "identity") +
labs(title="Top 20 girl names of 1995",
y="Number of girls with name") +
geom_hline(yintercept = kaitlyns_1995, color = "red") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
annotate("text", x="Kayla", y=24000,
label = "Red line represents number of Kaitlyns born in 1995",
color = "red")
The number of girls named Kaitlyn, including derivations, was 26,450 in 1995. As a single name, that total would have ranked 3rd in 1995: a very popular name.
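The same steps are repeated for 1990, 1995, and 2000. One way to avoid the repetition is a small helper that returns the combined total and its would-be rank for any year. The function name and the toy data below are hypothetical, and the sketch assumes one row per name, year, and gender.

```r
# Hypothetical helper: combined total and would-be rank of a set of spellings in one year.
combined_rank_for_year <- function(names_df, year, spellings) {
  girls <- names_df[names_df$Gender == "F" & names_df$Year == year, ]
  combined <- sum(girls$Count[girls$Name %in% spellings])
  # Rank = one more than the number of single names with a strictly larger count.
  list(total = combined, rank = sum(girls$Count > combined) + 1)
}

# Toy data with made-up counts.
toy <- data.frame(
  Name   = c("Emily", "Hannah", "Kaitlyn", "Katelyn", "Caitlin", "Emily", "Kaitlyn"),
  Year   = c(2000, 2000, 2000, 2000, 2000, 1995, 1995),
  Gender = "F",
  Count  = c(100, 80, 50, 40, 30, 90, 60),
  stringsAsFactors = FALSE
)
spellings <- c("Kaitlyn", "Katelyn", "Caitlin")

res_2000 <- combined_rank_for_year(toy, 2000, spellings)
res_1995 <- combined_rank_for_year(toy, 1995, spellings)
print(res_2000)
print(res_1995)
```

With the real data, the call would presumably take `national_names_raw` and `kaitlyn_derivations` in place of the toy inputs.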
2000 marks the end of the sustained usage period.
Get all the female names for 2000, ordered from highest usage to lowest.
year_2000 <- national_names_raw %>%
filter(Gender == "F" & (Year == 2000)) %>%
group_by(Year, Name) %>%
summarise(sum_of_name = sum(Count)) %>%
arrange(-sum_of_name, Name)
(kaitlyns_2000 <- year_2000 %>%
filter(Name %in% kaitlyn_derivations) %>%
summarise(total_kaitlyns_2000 = sum(sum_of_name)) %>%
select(-Year) %>%
unlist())
## total_kaitlyns_2000
## 26789
dispersion_2000 <- year_2000 %>%
filter(Name %in% kaitlyn_derivations) %>%
arrange(-sum_of_name)
length(unique(dispersion_2000$Name))
## [1] 37
head(dispersion_2000, 10)
## Source: local data frame [10 x 3]
## Groups: Year [1]
##
## Year Name sum_of_name
## <int> <fctr> <int>
## 1 2000 Kaitlyn 8757
## 2 2000 Katelyn 5501
## 3 2000 Caitlin 4102
## 4 2000 Caitlyn 2835
## 5 2000 Kaitlin 2135
## 6 2000 Katelynn 1388
## 7 2000 Kaitlynn 802
## 8 2000 Katelin 409
## 9 2000 Caitlynn 274
## 10 2000 Catelyn 91
ranked_names_2000 <- tibble::rownames_to_column(year_2000, var = "rank")
rn_2000 <- ranked_names_2000 %>%
mutate(rank = as.integer(rank))
(top_twenty_2000 <- rn_2000 %>%
filter(rank <= 20))
## Source: local data frame [20 x 4]
## Groups: Year [1]
##
## rank Year Name sum_of_name
## <int> <int> <fctr> <int>
## 1 1 2000 Emily 25952
## 2 2 2000 Hannah 23073
## 3 3 2000 Madison 19967
## 4 4 2000 Ashley 17995
## 5 5 2000 Sarah 17687
## 6 6 2000 Alexis 17627
## 7 7 2000 Samantha 17264
## 8 8 2000 Jessica 15704
## 9 9 2000 Elizabeth 15088
## 10 10 2000 Taylor 15078
## 11 11 2000 Lauren 14172
## 12 12 2000 Alyssa 13552
## 13 13 2000 Kayla 13310
## 14 14 2000 Abigail 13087
## 15 15 2000 Brianna 12873
## 16 16 2000 Olivia 12852
## 17 17 2000 Emma 12540
## 18 18 2000 Megan 11433
## 19 19 2000 Grace 11283
## 20 20 2000 Victoria 10922
top_twenty_2000$Name <- ordered(x=top_twenty_2000$Name, levels = top_twenty_2000$Name)
ggplot(top_twenty_2000,aes(x=Name, y=sum_of_name))+
geom_bar(stat = "identity") +
labs(title="Top 20 girl names of 2000",
y="Number of girls with name") +
geom_hline(yintercept = kaitlyns_2000, color = "red") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
annotate("text", x="Kayla", y=24000,
label = "Red line represents number of Kaitlyns born in 2000",
color = "red")
The number of girls named Kaitlyn, including derivations, was 26,789 in 2000. As a single name, that total would have ranked 1st in 2000: the most popular name.
We can assume that Kaitlyn was among the ten most used girls' names during the entire decade from 1990 to 2000. It was the most popular name in 2000 and the 3rd most popular in 1995. If you meet a Kaitlyn, chances are that she was born between 1985 (the second half of the growth period) and 2000 (the end of the sustained period). As of this writing (2016), she would be between 16 and 31 years old.
An older Kaitlyn would be considered an early adopter, and a younger Kaitlyn would be considered a late adopter.