The New Bedford Whaling Museum just released a database of crewmember information. It seems like a good opportunity to give a more code-heavy narrative of how you might look for patterns in a new dataset with interesting historical categories.
First, load in some packages. Hadley Wickham's ggplot2 and plyr make R much more consistent and easy to use: it's always worth including them.
require(plyr)
## Loading required package: plyr
require(ggplot2)
## Loading required package: ggplot2
require(reshape2)
## Loading required package: reshape2
Then, load in the data. I could do this from the web, but I've put it in a folder here.
crews = read.csv("~/shipping/Crewlist.csv")
The first thing to do is use the summary() function, which figures out what the different columns in your database are and gives appropriate descriptions of the types of data in each. For numbers, it gives averages; for categorical data (called 'factors') in R, it lists the most common elements.
summary(crews)
## LastName FirstName Vessel
## Smith : 2320 John :10229 America : 786
## Brown : 1308 William: 5556 Elizabeth : 751
## Williams: 1207 James : 4256 Pacific : 719
## Francis : 896 Joseph : 4205 Canton II : 706
## Johnson : 814 Manuel : 3276 Sarah : 702
## Davis : 762 Charles: 3257 George and Susan: 659
## (Other) :116290 (Other):92818 (Other) :119274
## Rig ApproximateDeparture FullName
## Ship :54502 9/30/1857 : 273 John Williams : 173
## Bark :53523 10/10/1855: 249 John Smith : 162
## Schr : 8888 : 237 Joseph Francis : 109
## Brig : 5538 10/1/1855 : 235 Manuel Francis : 106
## Bark : 409 7/24/1844 : 210 William Smith : 92
## (Other): 517 10/1/1860 : 208 Francis, Manuel: 90
## NA's : 220 (Other) :122185 (Other) :122865
## Age Height Skin Hair
## Min. : 2 :91579 :72785 :73933
## 1st Qu.:21 5'7 : 4443 Light :22108 Dark :16010
## Median :23 5'8 : 4388 Dark :21532 Brown :11563
## Mean :25 5'6 : 4309 Black : 3282 Black : 7676
## 3rd Qu.:27 5'9 : 2791 B : 1400 Light : 7641
## Max. :82 5'5 : 2481 Brown : 365 B : 3019
## NA's :71838 (Other):13606 (Other): 2125 (Other): 3755
## Eye Residence Rank Lay.
## :123493 :69274 :58308 :67191
## Brown : 36 New Bedford: 8585 Greenhand :19416 1-190 : 3895
## Black : 32 New York : 2093 Seaman : 9326 1-200 : 3424
## Blue : 15 Boston : 1486 Boatsteerer: 7150 1-150 : 3048
## Dark : 7 Brava : 1412 Ordinary : 5042 1-185 : 2978
## Light : 6 Dartmouth : 1265 Master : 4474 1-175 : 2838
## (Other): 8 (Other) :39482 (Other) :19881 (Other):40223
## Remarks Voyage.number VesselNumber
## :112981 Min. : 4 Min. : 1
## DID NOT RETURN FROM VOYAGE: 820 1st Qu.: 3711 1st Qu.: 211
## CROSSED OUT : 671 Median : 7399 Median : 424
## AND BOATSTEERER : 204 Mean : 7416 Mean : 550
## Came home : 193 3rd Qu.:10919 3rd Qu.: 637
## CREW LIST FROM MICROFILM : 178 Max. :17143 Max. :2909
## (Other) : 8550 NA's :7 NA's :27
There's some good descriptive data about people, which suggests a chance for something about bodies–measurements, physical descriptions, and ages all have interesting interplays. That will be particularly valuable if we can tie it in to some other sorts of information. Before I get into that, there are couple variables that I just want to see fuller counts on: table() in R gives the best way to do that. I'm interested in names because I could link them up to census information and because they provide some clues to ethnicity;
sort(table(crews$Remarks), decreasing = T)[1:19]
##
## DID NOT RETURN FROM VOYAGE
## 112981 820
## CROSSED OUT AND BOATSTEERER
## 671 204
## Came home CREW LIST FROM MICROFILM
## 193 178
## AND SHIPKEEPER AND CARPENTER
## 146 115
## AS PER AGREEMENT came home
## 78 75
## AND BLACKSMITH DID NOT SHIP
## 57 53
## AND SEAMAN DESERTED
## 42 41
## AND GREENHAND Run away
## 40 37
## AND 1/2 SLUSH AND HALF SLUSH
## 35 32
## 1902
## 31
sort(table(crews$LastName), decreasing = T)[1:45]
##
## Smith Brown Williams Francis Johnson Davis Joseph Allen
## 2320 1308 1207 896 814 762 760 697
## Silva Gifford Lewis Antone Jones Tripp Wilson Thomas
## 696 650 605 573 557 547 534 532
## Clark Thompson King Perry Howland Sylvia Taber Baker
## 530 509 508 489 477 472 462 443
## Fisher Kanaka White Lopes Robinson Taylor Briggs Silvia
## 438 400 396 372 372 368 363 358
## Adams Wood Chase Enos Martin Miller Rogers Russell
## 351 346 343 343 337 336 326 325
## Jackson Reed Green Hathaway Sherman
## 306 306 299 299 292
sort(table(crews$FirstName), decreasing = T)[1:45]
##
## John William James Joseph Manuel Charles
## 10229 5556 4256 4205 3276 3257
## George Thomas Henry Antone Frank Edward
## 3168 2598 2275 1960 1357 1185
## Samuel Peter William H. Robert Francis David
## 1128 1124 1094 994 962 766
## Benjamin Jose Daniel George W. Richard Charles H.
## 743 724 686 685 634 577
## Andrew Antonio Alexander John H. Michael Isaac
## 566 516 502 481 440 436
## Albert Stephen George H. Jacob John W. John A.
## 424 410 384 379 349 335
## James H. Lewis Wm Joe Alfred John C.
## 320 311 308 289 282 281
## Charles W. Frederick John M.
## 262 260 252
sort(table(crews$Residence), decreasing = T)[1:45]
##
## New Bedford New York
## 69274 8585 2093
## Boston Brava Dartmouth
## 1486 1412 1265
## Azores Germany Fairhaven
## 1185 1007 993
## Fayal New Bedford, Ma Cape Verde
## 948 888 855
## Nantucket Westport Philadelphia
## 818 752 709
## Rochester Edgartown Flores
## 445 425 420
## Pico, Azores New York City Falmouth
## 406 400 385
## Mattapoisett Sao Jorge, Azores Saint Helena
## 372 355 353
## Providence Fall River Boston, Ma
## 323 300 272
## Brava, Cape Verde Tiverton Chilmark
## 263 256 253
## Sao Miguel, Azores Marion Wareham
## 245 239 237
## Fogo Tisbury Baltimore
## 236 220 200
## Lowell Brooklyn Albany
## 175 174 173
## Sandwich Santo Antao, Cape Verde Sao Nicolau, Cape Verde
## 173 165 163
## New York, Ny Acushnet England
## 161 155 154
Residence is extremely valuable, but unstructured. Most of the names are straight Angle, but a few last names (Lopes, Silva, Sylvia) seem to capture Portuguese speakers, and (“Kanaka” is going to catch Polynesians)[http://en.wikipedia.org/wiki/Kanaka]. The residences also look somewhat useful, and could let us start to bootstrap up by looking for, say, Cape Verdean names that live in New Bedford. (Can we learn anything new about Cape Verdean ranks or desertion patterns?)
But initially, I'm finding the personal data more interesting. So I'll do a few quick transformation to make sure that data is in a sensible form. The voyage date is a string in the csv, as is the height: a few lines of code will clear that up.
The only hard part turns out to be turning text like 5'6 ¾" into numbers. I'm going to go with a 95% solution here that produces a couple mistakes, but most of the time multiplies the first number by 12, then adds any inches and fractions to the end.
crews$date = as.Date(crews$ApproximateDeparture, format = "%d/%m/%Y")
# This doesn't work perfectly yet: certain formats have ages dropped
pullHeights = function(string) {
string = gsub("[Ff]t.?", "'", string) #change foot sign to inch sign
string = gsub("[\"]", "", string) #Ignore inch marks
string = gsub("[Ii]n.?", "", string) #Ignore inch marks
heights = strsplit(as.character(string), "[' ,]")[[1]]
if (length(strings) == 0) {
return(NA)
}
if (length(strings) == 1) {
strings[2:3] = "0"
}
if (length(strings) == 2) {
strings[3] = "0"
}
strings[strings == ""] = "0" #No data is a 'zero'
strings[1] = paste0("12*", strings[1]) #multiply feet by twelve.
try({
return(eval(parse(text = paste(strings[1:3], collapse = "+"))))
})
return(NA)
}
# crews$height = laply(crews$Height,pullHeights) Here's an easier
# approximation for those at home.
crews$height = as.numeric(gsub("\\D.*", "", as.character(crews$Height))) * 12
crews$Age = as.numeric(as.character(crews$Age))
crews$Age[crews$Age <= 9] = NA
Just do some survey plots on the data we have.
ggplot(crews[crews$height > 3 * 12 & crews$height < 7 * 12, ], aes(y = height/12,
x = Age)) + geom_point(size = 1, alpha = 0.2, position = position_jitter(h = 1/12,
w = 1), color = "red") + labs(title = "Height in feet") + geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 91716 rows containing missing values (stat_smooth).
## Warning: Removed 91716 rows containing missing values (geom_point).
ggplot(crews[crews$height > 3 * 12 & crews$height < 7 * 12, ], aes(y = height/12,
x = Age)) + labs(title = "Height in feet") + geom_hex() + scale_fill_gradientn("Number of sailors",
trans = "log10", colours = heat.colors(10))
## Warning: Removed 91716 rows containing missing values (stat_hexbin).
crews[crews$height/12 > 6.5 & !is.na(crews$height), 5:13]
## [1] ApproximateDeparture FullName Age
## [4] Height Skin Hair
## [7] Eye Residence Rank
## <0 rows> (or 0-length row.names)
ggplot(crews[crews$height > 3 * 12 & crews$height < 7 * 12, ]) + geom_point(aes(x = height/12,
y = date), size = 2, alpha = 0.5) + labs(title = "Height in feet")
## Warning: Removed 110587 rows containing missing values (geom_point).
ggplot(crews[crews$Hair != "", ]) + geom_bar(aes(x = Hair)) + coord_flip()
Hair and Skin color are one of the most interesting interactions here. The late 19th century is a period before racial identities have solidified, so the logbooks use a complicated array of vocabulary to describe skin and hair. It is obviously racialized, but in complicated ways. Looking at the interaction of these two variables lets us begin
tabbed = table(crews$Skin, crews$Hair)
tabbed = tabbed[rownames(tabbed) != "", colnames(tabbed) != ""]
tabbed = tabbed[rowSums(tabbed) > 25, colSums(tabbed) > 25]
hairs = names(sort(prcomp(scale(tabbed))$rotation[, 1]))
skins = names(sort(prcomp(scale(t(tabbed)))$rotation[, 1]))
physicalInteractions = melt(tabbed)
names(physicalInteractions) = c("Skin", "Hair", "Number")
physicalInteractions = physicalInteractions[physicalInteractions$Hair != "" &
physicalInteractions$Skin != "" & physicalInteractions$Number > 0, ]
ggplot(physicalInteractions) + geom_point(aes(x = Hair, y = Skin, size = Number),
alpha = 0.3) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_size_continuous(trans = "sqrt", range = c(1, 20)) + labs("Skin and hair interactions in the New Bedford Whaling database")
proportions = melt(scale(t(scale(tabbed))))
names(proportions) = c("Hair", "Skin", "Representation")
proportions$Skin = factor(proportions$Skin, levels = skins)
proportions$Hair = factor(proportions$Hair, levels = hairs)
# Doing a merge eliminates the grid-spots for which there are no actual
# individuals
proportions = merge(proportions, physicalInteractions, all.x = T)
proportions$exists = !is.na(proportions$Number)
ggplot(proportions) + geom_tile(aes(x = Skin, y = Hair, fill = Representation,
lwd = sqrt(Number)), color = "green") + scale_fill_gradient2("Scale over-\nrepresentation") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + coord_equal() +
labs(title = "Relative over- or under-representation of different hair-skin combinations\nin crew manifest descriptions")