The babynames dataset records US baby names from 1880 to 2017, covering every name with at least five uses in a given year. Using this resource, I want to determine how frequent the rare first letters are.


Let’s start by loading our packages in R.

library(tidyverse)
library(babynames)
library(ggthemes)

The first step is to determine which first letters are rarest across all the names.

baby_first_letter <- babynames %>% 
  mutate(first_letter = substr(name, 1, 1)) %>% 
  arrange(desc(first_letter))

baby_first_letter %>% 
  group_by(first_letter) %>% 
  summarize(letter_count = sum(n)) %>% 
  arrange(letter_count) %>% 
  knitr::kable()
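
| first_letter | letter_count |
|:-------------|-------------:|
| U            |       147264 |
| X            |       238807 |
| Q            |       266889 |
| Y            |       836514 |
| Z            |      1506279 |
| O            |      2126638 |
| I            |      3397960 |
| V            |      4776946 |
| F            |      5171385 |
| N            |      7784994 |
| H            |      8273725 |
| W            |      8366114 |
| P            |      9274037 |
| G            |     10775475 |
| T            |     14334373 |
| B            |     16678231 |
| K            |     17006684 |
| E            |     17033760 |
| L            |     18942067 |
| S            |     21373830 |
| R            |     23702794 |
| D            |     24240271 |
| C            |     25533863 |
| A            |     28855232 |
| M            |     32864210 |
| J            |     44612175 |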
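
Raw totals are hard to compare on their own, so it can help to express each letter as a share of all recorded births. The following is a minimal sketch of that idea, not part of the original analysis: `toy_names` is a made-up stand-in with the same `name` and `n` columns as `babynames`, and its counts are invented purely for illustration.

```r
library(dplyr)

# Hypothetical toy data in the same shape as babynames (counts are invented)
toy_names <- tibble(
  name = c("Ursula", "Xavier", "Quinn", "Mary", "John", "James"),
  n    = c(10, 20, 30, 400, 300, 240)
)

letter_share <- toy_names %>%
  mutate(first_letter = substr(name, 1, 1)) %>%          # R strings are 1-indexed
  group_by(first_letter) %>%
  summarize(letter_count = sum(n), .groups = "drop") %>%
  mutate(share = letter_count / sum(letter_count)) %>%   # fraction of all births
  arrange(share)                                          # rarest letters first

letter_share
```

Swapping `toy_names` for `babynames` should keep the ranking from the table above and add a `share` column next to each count.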

Let’s load the ‘state names’ files from Kaggle:



```r
# Yearly totals per first letter, drawn as one line per letter
baby_first_letter %>% 
  # filter(first_letter %in% c("U", "Z", "Q", "X", "Y")) %>% 
  group_by(first_letter, year) %>% 
  summarize(letter_count = sum(n), .groups = "drop") %>% 
  ggplot(aes(year, letter_count, color = first_letter)) + 
  geom_line()
```
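
With all 26 letters on one linear axis, the rarest letters end up squeezed against the bottom of the plot. One option, sketched here on the assumption that the five rarest letters from the table (U, X, Q, Y, Z) are the ones of interest, is to filter to those letters before plotting. The helper `count_letters_by_year()` and the `toy_names` rows are hypothetical and exist only for this illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: yearly totals for a chosen set of first letters.
# `data` is assumed to have `name`, `year`, and `n` columns, like babynames.
count_letters_by_year <- function(data, keep) {
  data %>%
    mutate(first_letter = substr(name, 1, 1)) %>%
    filter(first_letter %in% keep) %>%
    group_by(first_letter, year) %>%
    summarize(letter_count = sum(n), .groups = "drop")
}

# Invented toy rows, only to show the shape of the result
toy_names <- tibble(
  year = c(1880, 1880, 1880, 1880, 1881, 1881, 1881),
  name = c("Ursula", "Una", "Zelda", "Mary", "Ursula", "Zelda", "Mary"),
  n    = c(5, 7, 8, 100, 6, 9, 110)
)

rare_by_year <- count_letters_by_year(toy_names, keep = c("U", "X", "Q", "Y", "Z"))

# One line per rare letter; common letters such as M are filtered out
rare_plot <- ggplot(rare_by_year, aes(year, letter_count, color = first_letter)) +
  geom_line()
```

Calling the helper on `babynames` instead of `toy_names` would apply the same filter to the full dataset.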

