options(stringsAsFactors = FALSE)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
The names data come from the IPUMS 1% samples of the 1850, 1860, and 1870 censuses. This is not as good as having the earlier censuses, but it at least tells us the birth years of the people represented.
names <- read.csv("usa_00005.csv")
What birth years are represented? Keep in mind that this is not really representative, since we haven’t taken the variable person weights into account. But it does show what we would expect: a declining number of people from earlier birth years, due to increasing population and mortality. The spikes show us that the data entry was fudged in many places, probably because ages were rounded or estimated.
names %>%
group_by(BIRTHYR) %>%
summarize(count = n()) %>%
ggplot(aes(x = BIRTHYR, y = count)) +
geom_bar(stat = "identity")
For now we don’t want to mess with person weights, so let’s just look at the 1850 census.
c_1850 <- names %>%
filter(YEAR == 1850)
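If the person weights do turn out to matter later, a weighted count is a small change to the pipeline above. The sketch below is only an illustration: it assumes the extract includes the IPUMS person-weight variable `PERWT` (not checked here), and it runs on a made-up toy data frame rather than the real file.

```r
library(dplyr)

# Toy stand-in for the IPUMS extract; the PERWT column is an assumption.
toy <- data.frame(
  YEAR    = c(1850, 1850, 1860, 1860),
  BIRTHYR = c(1820, 1820, 1820, 1830),
  PERWT   = c(100, 100, 101, 99)
)

# Raw row counts next to the weighted estimate of people per birth year.
weighted_counts <- toy %>%
  group_by(BIRTHYR) %>%
  summarize(count = n(), weighted = sum(PERWT))

weighted_counts
```

Summing the weights rather than counting rows is what would let the three census samples be compared on the same footing.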
We need to clean up the names to remove initials, etc. Here is a function to do that. (A more sophisticated function would take into account people with initials for first names.)
clean_names <- function(name) {
  require(stringr)
  # word() keeps only the first word of each name; tolower() normalizes case
  return( name %>% word() %>% tolower() )
}
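As a rough idea of what the more sophisticated version might look like, here is a sketch that skips leading initials and keeps the first token longer than one letter. The function name `clean_names_skip_initials` is hypothetical, it uses only base R, and it has not been run against the census data, so treat it as a starting point rather than a tested cleaner.

```r
# Hypothetical variant of clean_names(): skip initials such as "J." and
# return the first token with more than one letter, or NA if there is none.
clean_names_skip_initials <- function(name) {
  # Lowercase, trim, and split each name on whitespace
  parts <- strsplit(trimws(tolower(name)), "\\s+")
  vapply(parts, function(p) {
    # Drop punctuation so that "j." becomes "j"
    p <- gsub("[^a-z]", "", p)
    # Keep only tokens that are longer than a single letter
    full <- p[nchar(p) > 1]
    if (length(full) > 0) full[1] else NA_character_
  }, character(1))
}

clean_names_skip_initials(c("J. William", "Mary A.", "J. B.", "Sarah"))
```

Names made up entirely of initials come back as `NA`, which makes them easy to filter out or count separately.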
Let’s find out the number of unique names in the 1850 census:
c_1850$name <- clean_names(c_1850$NAMEFRST)
## Loading required package: stringr
length(unique(c_1850$name))
## [1] 14056
Now let’s create statistics for each name by birth year:
n_1850 <- c_1850 %>%
group_by(name, BIRTHYR) %>%
summarize(count = n()) %>%
arrange(desc(count))
n_1850 %>%
head(n = 50)
## Source: local data frame [50 x 3]
## Groups: name
##
## name BIRTHYR count
## 1 mary 1848 452
## 2 mary 1850 436
## 3 mary 1846 434
## 4 mary 1847 424
## 5 mary 1844 413
## 6 mary 1849 406
## 7 mary 1845 403
## 8 mary 1843 378
## 9 mary 1842 369
## 10 mary 1830 359
## 11 mary 1840 355
## 12 mary 1838 346
## 13 mary 1832 344
## 14 mary 1836 344
## 15 john 1848 343
## 16 john 1846 342
## 17 mary 1841 326
## 18 john 1849 322
## 19 john 1842 321
## 20 john 1845 320
## 21 mary 1820 319
## 22 mary 1834 317
## 23 john 1820 314
## 24 john 1844 311
## 25 john 1847 309
## 26 mary 1833 307
## 27 john 1838 302
## 28 mary 1837 302
## 29 mary 1825 301
## 30 mary 1839 300
## 31 mary 1828 298
## 32 john 1840 295
## 33 john 1843 292
## 34 mary 1835 280
## 35 mary 1827 273
## 36 john 1850 271
## 37 john 1825 269
## 38 john 1841 269
## 39 john 1836 259
## 40 mary 1826 258
## 41 william 1842 250
## 42 john 1828 247
## 43 mary 1831 245
## 44 john 1832 241
## 45 john 1834 241
## 46 sarah 1848 240
## 47 john 1835 236
## 48 william 1845 236
## 49 mary 1822 235
## 50 john 1826 234
Now we can plot the occurrence of any given name by birth year. Keep in mind that this is the number of people alive in 1850, uncorrected for mortality. This is probably too far from your period to tell us anything useful.
n_1850 %>%
filter(name == "mary") %>%
ggplot(aes(x = BIRTHYR, y = count)) +
geom_line()
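Because the raw counts mostly track how many people from each birth year were still alive in 1850, one possible adjustment is to look at each name’s share of its birth-year cohort instead. This is a sketch on made-up numbers, not something computed from the census file, and it does not correct for differences in mortality between people with different names.

```r
library(dplyr)

# Toy counts in the same shape as n_1850 (name, BIRTHYR, count).
toy_counts <- data.frame(
  name    = c("mary", "john", "mary", "john", "sarah"),
  BIRTHYR = c(1848, 1848, 1849, 1849, 1849),
  count   = c(30, 20, 10, 5, 5)
)

# Share of each name among everyone recorded with that birth year.
shares <- toy_counts %>%
  group_by(BIRTHYR) %>%
  mutate(share = count / sum(count)) %>%
  ungroup()

shares
```

Plotting `share` instead of `count` on the y-axis would show whether a name is becoming more or less common relative to its cohort.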
Another approach is to create a list of names and then get the counts for each combination of name and year.
interesting_names <- c(
"patience",
"virtue",
"honor",
"chastity",
"prudence"
)
n_1850 %>%
filter(name %in% interesting_names)
## Source: local data frame [82 x 3]
## Groups: name
##
## name BIRTHYR count
## 1 patience 1831 3
## 2 prudence 1832 3
## 3 patience 1782 2
## 4 patience 1792 2
## 5 patience 1795 2
## 6 patience 1796 2
## 7 patience 1800 2
## 8 patience 1802 2
## 9 patience 1824 2
## 10 patience 1830 2
## 11 patience 1840 2
## 12 prudence 1786 2
## 13 prudence 1798 2
## 14 prudence 1810 2
## 15 prudence 1816 2
## 16 prudence 1821 2
## 17 prudence 1823 2
## 18 prudence 1836 2
## 19 honor 1815 1
## 20 honor 1820 1
## 21 honor 1830 1
## 22 honor 1831 1
## 23 patience 1767 1
## 24 patience 1772 1
## 25 patience 1780 1
## 26 patience 1784 1
## 27 patience 1785 1
## 28 patience 1786 1
## 29 patience 1791 1
## 30 patience 1797 1
## 31 patience 1798 1
## 32 patience 1801 1
## 33 patience 1803 1
## 34 patience 1805 1
## 35 patience 1809 1
## 36 patience 1811 1
## 37 patience 1815 1
## 38 patience 1819 1
## 39 patience 1820 1
## 40 patience 1822 1
## 41 patience 1825 1
## 42 patience 1827 1
## 43 patience 1828 1
## 44 patience 1834 1
## 45 patience 1835 1
## 46 patience 1836 1
## 47 patience 1838 1
## 48 patience 1839 1
## 49 patience 1841 1
## 50 patience 1843 1
## 51 patience 1845 1
## 52 prudence 1768 1
## 53 prudence 1780 1
## 54 prudence 1788 1
## 55 prudence 1792 1
## 56 prudence 1793 1
## 57 prudence 1794 1
## 58 prudence 1796 1
## 59 prudence 1797 1
## 60 prudence 1800 1
## 61 prudence 1802 1
## 62 prudence 1803 1
## 63 prudence 1806 1
## 64 prudence 1808 1
## 65 prudence 1809 1
## 66 prudence 1812 1
## 67 prudence 1814 1
## 68 prudence 1815 1
## 69 prudence 1819 1
## 70 prudence 1820 1
## 71 prudence 1824 1
## 72 prudence 1825 1
## 73 prudence 1828 1
## 74 prudence 1831 1
## 75 prudence 1833 1
## 76 prudence 1834 1
## 77 prudence 1837 1
## 78 prudence 1838 1
## 79 prudence 1841 1
## 80 prudence 1842 1
## 81 prudence 1847 1
## 82 prudence 1849 1
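Note that the filter only returns combinations that actually occur, so names with no matches (here "virtue" and "chastity") and years with no matches simply drop out. If explicit zeros are wanted for every combination, one option is `complete()` from tidyr. The following is a sketch on toy data; the `wanted` vector and the toy counts are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy counts in the same shape as n_1850 (name, BIRTHYR, count).
toy_counts <- data.frame(
  name    = c("patience", "patience", "prudence"),
  BIRTHYR = c(1830, 1832, 1831),
  count   = c(2, 1, 3),
  stringsAsFactors = FALSE
)
wanted <- c("patience", "prudence", "virtue")

# Ungroup first, then add a zero-count row for every missing name/year pair.
filled <- toy_counts %>%
  ungroup() %>%
  complete(name = wanted,
           BIRTHYR = full_seq(BIRTHYR, 1),
           fill = list(count = 0))

filled
```

The explicit zeros matter mostly for plotting, where a missing year would otherwise be drawn as a straight line between its neighbors.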
The short answer is that the 1850 census is too far from the time period that you are interested in, but unfortunately it’s the first census for which IPUMS has the data available.