Whaling Crewlists

The New Bedford Whaling Museum just released a database of crewmember information. It seems like a good opportunity to give a more code-heavy narrative of how you might look for patterns in a new dataset with interesting historical categories.

First, load in some packages. Hadley Wickham's ggplot2 and plyr make R much more consistent and easy to use: it's always worth including them.

require(plyr)

## Loading required package: plyr

require(ggplot2)

## Loading required package: ggplot2

require(reshape2)

## Loading required package: reshape2

Then, load in the data. I could do this from the web, but I've put it in a folder here.

crews = read.csv("~/shipping/Crewlist.csv")

The first thing to do is use the summary() function, which figures out what the different columns in your database are and gives appropriate descriptions of the types of data in each. For numbers, it gives averages; for categorical data (called 'factors') in R, it lists the most common elements.

summary(crews)

##      LastName        FirstName                  Vessel      
##  Smith   :  2320   John   :10229   America         :   786  
##  Brown   :  1308   William: 5556   Elizabeth       :   751  
##  Williams:  1207   James  : 4256   Pacific         :   719  
##  Francis :   896   Joseph : 4205   Canton II       :   706  
##  Johnson :   814   Manuel : 3276   Sarah           :   702  
##  Davis   :   762   Charles: 3257   George and Susan:   659  
##  (Other) :116290   (Other):92818   (Other)         :119274  
##       Rig        ApproximateDeparture            FullName     
##  Ship   :54502   9/30/1857 :   273    John Williams  :   173  
##  Bark   :53523   10/10/1855:   249    John Smith     :   162  
##  Schr   : 8888             :   237    Joseph Francis :   109  
##  Brig   : 5538   10/1/1855 :   235    Manuel Francis :   106  
##  Bark   :  409   7/24/1844 :   210    William Smith  :    92  
##  (Other):  517   10/1/1860 :   208    Francis, Manuel:    90  
##  NA's   :  220   (Other)   :122185    (Other)        :122865  
##       Age            Height           Skin            Hair      
##  Min.   : 2             :91579          :72785          :73933  
##  1st Qu.:21      5'7    : 4443   Light  :22108   Dark   :16010  
##  Median :23      5'8    : 4388   Dark   :21532   Brown  :11563  
##  Mean   :25      5'6    : 4309   Black  : 3282   Black  : 7676  
##  3rd Qu.:27      5'9    : 2791   B      : 1400   Light  : 7641  
##  Max.   :82      5'5    : 2481   Brown  :  365   B      : 3019  
##  NA's   :71838   (Other):13606   (Other): 2125   (Other): 3755  
##       Eye               Residence              Rank            Lay.      
##         :123493              :69274              :58308          :67191  
##  Brown  :    36   New Bedford: 8585   Greenhand  :19416    1-190 : 3895  
##  Black  :    32   New York   : 2093   Seaman     : 9326    1-200 : 3424  
##  Blue   :    15   Boston     : 1486   Boatsteerer: 7150    1-150 : 3048  
##  Dark   :     7   Brava      : 1412   Ordinary   : 5042    1-185 : 2978  
##  Light  :     6   Dartmouth  : 1265   Master     : 4474    1-175 : 2838  
##  (Other):     8   (Other)    :39482   (Other)    :19881   (Other):40223  
##                        Remarks       Voyage.number    VesselNumber 
##                            :112981   Min.   :    4   Min.   :   1  
##  DID NOT RETURN FROM VOYAGE:   820   1st Qu.: 3711   1st Qu.: 211  
##  CROSSED OUT               :   671   Median : 7399   Median : 424  
##  AND BOATSTEERER           :   204   Mean   : 7416   Mean   : 550  
##  Came home                 :   193   3rd Qu.:10919   3rd Qu.: 637  
##  CREW LIST FROM MICROFILM  :   178   Max.   :17143   Max.   :2909  
##  (Other)                   :  8550   NA's   :7       NA's   :27

There's some good descriptive data about people, which suggests a chance for something about bodies–measurements, physical descriptions, and ages all have interesting interplays. That will be particularly valuable if we can tie it in to some other sorts of information. Before I get into that, there are couple variables that I just want to see fuller counts on: table() in R gives the best way to do that. I'm interested in names because I could link them up to census information and because they provide some clues to ethnicity;

sort(table(crews$Remarks), decreasing = T)[1:19]

## 
##                            DID NOT RETURN FROM VOYAGE 
##                     112981                        820 
##                CROSSED OUT            AND BOATSTEERER 
##                        671                        204 
##                  Came home   CREW LIST FROM MICROFILM 
##                        193                        178 
##             AND SHIPKEEPER              AND CARPENTER 
##                        146                        115 
##           AS PER AGREEMENT                  came home 
##                         78                         75 
##             AND BLACKSMITH               DID NOT SHIP 
##                         57                         53 
##                 AND SEAMAN                   DESERTED 
##                         42                         41 
##              AND GREENHAND                   Run away 
##                         40                         37 
##              AND 1/2 SLUSH             AND HALF SLUSH 
##                         35                         32 
##                       1902 
##                         31

sort(table(crews$LastName), decreasing = T)[1:45]

## 
##    Smith    Brown Williams  Francis  Johnson    Davis   Joseph    Allen 
##     2320     1308     1207      896      814      762      760      697 
##    Silva  Gifford    Lewis   Antone    Jones    Tripp   Wilson   Thomas 
##      696      650      605      573      557      547      534      532 
##    Clark Thompson     King    Perry  Howland   Sylvia    Taber    Baker 
##      530      509      508      489      477      472      462      443 
##   Fisher   Kanaka    White    Lopes Robinson   Taylor   Briggs   Silvia 
##      438      400      396      372      372      368      363      358 
##    Adams     Wood    Chase     Enos   Martin   Miller   Rogers  Russell 
##      351      346      343      343      337      336      326      325 
##  Jackson     Reed    Green Hathaway  Sherman 
##      306      306      299      299      292

sort(table(crews$FirstName), decreasing = T)[1:45]

## 
##       John    William      James     Joseph     Manuel    Charles 
##      10229       5556       4256       4205       3276       3257 
##     George     Thomas      Henry     Antone      Frank     Edward 
##       3168       2598       2275       1960       1357       1185 
##     Samuel      Peter William H.     Robert    Francis      David 
##       1128       1124       1094        994        962        766 
##   Benjamin       Jose     Daniel  George W.    Richard Charles H. 
##        743        724        686        685        634        577 
##     Andrew    Antonio  Alexander    John H.    Michael      Isaac 
##        566        516        502        481        440        436 
##     Albert    Stephen  George H.      Jacob    John W.    John A. 
##        424        410        384        379        349        335 
##   James H.      Lewis         Wm        Joe     Alfred    John C. 
##        320        311        308        289        282        281 
## Charles W.  Frederick    John M. 
##        262        260        252

sort(table(crews$Residence), decreasing = T)[1:45]

## 
##                                     New Bedford                New York 
##                   69274                    8585                    2093 
##                  Boston                   Brava               Dartmouth 
##                    1486                    1412                    1265 
##                  Azores                 Germany               Fairhaven 
##                    1185                    1007                     993 
##                   Fayal         New Bedford, Ma              Cape Verde 
##                     948                     888                     855 
##               Nantucket                Westport            Philadelphia 
##                     818                     752                     709 
##               Rochester               Edgartown                  Flores 
##                     445                     425                     420 
##            Pico, Azores           New York City                Falmouth 
##                     406                     400                     385 
##            Mattapoisett       Sao Jorge, Azores            Saint Helena 
##                     372                     355                     353 
##              Providence              Fall River              Boston, Ma 
##                     323                     300                     272 
##       Brava, Cape Verde                Tiverton                Chilmark 
##                     263                     256                     253 
##      Sao Miguel, Azores                  Marion                 Wareham 
##                     245                     239                     237 
##                    Fogo                 Tisbury               Baltimore 
##                     236                     220                     200 
##                  Lowell                Brooklyn                  Albany 
##                     175                     174                     173 
##                Sandwich Santo Antao, Cape Verde Sao Nicolau, Cape Verde 
##                     173                     165                     163 
##            New York, Ny                Acushnet                 England 
##                     161                     155                     154

Residence is extremely valuable, but unstructured. Most of the names are straight Angle, but a few last names (Lopes, Silva, Sylvia) seem to capture Portuguese speakers, and (“Kanaka” is going to catch Polynesians)[http://en.wikipedia.org/wiki/Kanaka]. The residences also look somewhat useful, and could let us start to bootstrap up by looking for, say, Cape Verdean names that live in New Bedford. (Can we learn anything new about Cape Verdean ranks or desertion patterns?)

But initially, I'm finding the personal data more interesting. So I'll do a few quick transformation to make sure that data is in a sensible form. The voyage date is a string in the csv, as is the height: a few lines of code will clear that up.

The only hard part turns out to be turning text like 5'6 ¾" into numbers. I'm going to go with a 95% solution here that produces a couple mistakes, but most of the time multiplies the first number by 12, then adds any inches and fractions to the end.

crews$date = as.Date(crews$ApproximateDeparture, format = "%d/%m/%Y")

# This doesn't work perfectly yet: certain formats have ages dropped
pullHeights = function(string) {
    string = gsub("[Ff]t.?", "'", string)  #change foot sign to inch sign
    string = gsub("[\"]", "", string)  #Ignore inch marks
    string = gsub("[Ii]n.?", "", string)  #Ignore inch marks

    heights = strsplit(as.character(string), "[' ,]")[[1]]
    if (length(strings) == 0) {
        return(NA)
    }
    if (length(strings) == 1) {
        strings[2:3] = "0"
    }
    if (length(strings) == 2) {
        strings[3] = "0"
    }
    strings[strings == ""] = "0"  #No data is a 'zero'
    strings[1] = paste0("12*", strings[1])  #multiply feet by twelve.
    try({
        return(eval(parse(text = paste(strings[1:3], collapse = "+"))))
    })
    return(NA)
}

# crews$height = laply(crews$Height,pullHeights) Here's an easier
# approximation for those at home.
crews$height = as.numeric(gsub("\\D.*", "", as.character(crews$Height))) * 12
crews$Age = as.numeric(as.character(crews$Age))
crews$Age[crews$Age <= 9] = NA

Just do some survey plots on the data we have.


ggplot(crews[crews$height > 3 * 12 & crews$height < 7 * 12, ], aes(y = height/12, 
    x = Age)) + geom_point(size = 1, alpha = 0.2, position = position_jitter(h = 1/12, 
    w = 1), color = "red") + labs(title = "Height in feet") + geom_smooth()

## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.

## Warning: Removed 91716 rows containing missing values (stat_smooth).

## Warning: Removed 91716 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-2


ggplot(crews[crews$height > 3 * 12 & crews$height < 7 * 12, ], aes(y = height/12, 
    x = Age)) + labs(title = "Height in feet") + geom_hex() + scale_fill_gradientn("Number of sailors", 
    trans = "log10", colours = heat.colors(10))

## Warning: Removed 91716 rows containing missing values (stat_hexbin).

plot of chunk unnamed-chunk-2


crews[crews$height/12 > 6.5 & !is.na(crews$height), 5:13]

## [1] ApproximateDeparture FullName             Age                 
## [4] Height               Skin                 Hair                
## [7] Eye                  Residence            Rank                
## <0 rows> (or 0-length row.names)


ggplot(crews[crews$height > 3 * 12 & crews$height < 7 * 12, ]) + geom_point(aes(x = height/12, 
    y = date), size = 2, alpha = 0.5) + labs(title = "Height in feet")

## Warning: Removed 110587 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-2

ggplot(crews[crews$Hair != "", ]) + geom_bar(aes(x = Hair)) + coord_flip()

plot of chunk unnamed-chunk-2

Hair and Skin color are one of the most interesting interactions here. The late 19th century is a period before racial identities have solidified, so the logbooks use a complicated array of vocabulary to describe skin and hair. It is obviously racialized, but in complicated ways. Looking at the interaction of these two variables lets us begin

tabbed = table(crews$Skin, crews$Hair)
tabbed = tabbed[rownames(tabbed) != "", colnames(tabbed) != ""]
tabbed = tabbed[rowSums(tabbed) > 25, colSums(tabbed) > 25]
hairs = names(sort(prcomp(scale(tabbed))$rotation[, 1]))
skins = names(sort(prcomp(scale(t(tabbed)))$rotation[, 1]))
physicalInteractions = melt(tabbed)
names(physicalInteractions) = c("Skin", "Hair", "Number")
physicalInteractions = physicalInteractions[physicalInteractions$Hair != "" & 
    physicalInteractions$Skin != "" & physicalInteractions$Number > 0, ]
ggplot(physicalInteractions) + geom_point(aes(x = Hair, y = Skin, size = Number), 
    alpha = 0.3) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    scale_size_continuous(trans = "sqrt", range = c(1, 20)) + labs("Skin and hair interactions in the New Bedford Whaling database")

plot of chunk HairandSkin

proportions = melt(scale(t(scale(tabbed))))
names(proportions) = c("Hair", "Skin", "Representation")
proportions$Skin = factor(proportions$Skin, levels = skins)
proportions$Hair = factor(proportions$Hair, levels = hairs)
# Doing a merge eliminates the grid-spots for which there are no actual
# individuals
proportions = merge(proportions, physicalInteractions, all.x = T)
proportions$exists = !is.na(proportions$Number)
ggplot(proportions) + geom_tile(aes(x = Skin, y = Hair, fill = Representation, 
    lwd = sqrt(Number)), color = "green") + scale_fill_gradient2("Scale over-\nrepresentation") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + coord_equal() + 
    labs(title = "Relative over- or under-representation of different hair-skin combinations\nin crew manifest descriptions")

plot of chunk scaled