In this first experiment with linking the Byers Company sample to U.S. Census records, I limited the scope of who I searched for and where I looked for them. First, I limited the Byers sample to workers who were employed at the Pittsburgh plant and worked in 1935 or later (determined by their last year at the company). I also had to drop two records with missing gender information, leaving 927 records to search for in the census. Because these workers are all associated with the Pittsburgh plant, I searched for matching records only in Allegheny County.

Finding Potential Records

Here, I present the initial “blocking” stage that is essential for any record linkage procedure: generating a basic set of potential matches. For each Byers record, I select census records in Allegheny County with the same sex and a birth year within two years (earlier or later). Then, after processing the first and last names from each source (dropping initials, Jr./Sr. suffixes, and non-letter characters, and converting to uppercase), I compute Jaro-Winkler similarity scores (represented as 1 - JW distance, so larger values indicate greater similarity). I keep comparisons with first name scores >= 0.75 and last name scores >= 0.8. I also flag “exact matches”, where birth year, first name, and last name are identical between the Byers record and the census record.
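The blocking rules described above can be sketched in Python as follows. This is a minimal illustration, not the code used for this analysis: the record fields (`first`, `last`, `sex`, `birth_year`), the helper names, the toy records, and the Winkler prefix weight `p=0.1` are all assumptions, and the name-cleaning rules are one plausible reading of the description.

```python
import re

def clean_name(name):
    """Uppercase a name; drop initials, Jr./Sr. suffixes, and non-letters."""
    text = re.sub(r"[.,\s]+", " ", name.upper())  # punctuation/space runs -> one space
    text = re.sub(r"[^A-Z ]", "", text)            # delete any other non-letter
    tokens = [t for t in text.split() if len(t) > 1 and t not in {"JR", "SR"}]
    return " ".join(tokens)

def jaro(s, t):
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if not s or not t:
        return 0.0
    if s == t:
        return 1.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)

def block_candidates(byers, census, first_min=0.75, last_min=0.80, year_window=2):
    """Return candidate pairs that pass the sex, birth-year, and name filters."""
    pairs = []
    for b in byers:
        b_first, b_last = clean_name(b["first"]), clean_name(b["last"])
        for c in census:
            if c["sex"] != b["sex"]:
                continue
            if abs(c["birth_year"] - b["birth_year"]) > year_window:
                continue
            c_first, c_last = clean_name(c["first"]), clean_name(c["last"])
            first_score = jaro_winkler(b_first, c_first)
            last_score = jaro_winkler(b_last, c_last)
            if first_score >= first_min and last_score >= last_min:
                exact = (b_first == c_first and b_last == c_last
                         and b["birth_year"] == c["birth_year"])
                pairs.append({"byers_id": b["id"], "census_id": c["id"],
                              "first_score": first_score,
                              "last_score": last_score, "exact": exact})
    return pairs

# Made-up toy records, for illustration only.
byers = [{"id": 1, "first": "John Q.", "last": "Smith, Jr.", "sex": "M", "birth_year": 1910}]
census = [
    {"id": "a", "first": "JOHN", "last": "SMITH", "sex": "M", "birth_year": 1910},
    {"id": "b", "first": "JON", "last": "SMYTH", "sex": "M", "birth_year": 1912},
    {"id": "c", "first": "JOHN", "last": "SMITH", "sex": "M", "birth_year": 1915},
    {"id": "d", "first": "MARY", "last": "SMITH", "sex": "F", "birth_year": 1910},
]
pairs = block_candidates(byers, census)
for p in pairs:
    print(p["census_id"], round(p["first_score"], 3), round(p["last_score"], 3), p["exact"])
```

In this toy input, census record `a` is an exact match and `b` passes on similarity alone, while `c` falls outside the two-year window and `d` has a different sex. Comparing every Byers record with every census record is quadratic; for a full county it would be worth grouping the census records by sex and birth year first.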

Summarizing Potential Matches

Let’s start by breaking down the Byers records by how many potential matches they were associated with:

## # A tibble: 39 x 3
##    `Potential Matches` `Number of Byers Records` `% of Sample`
##                  <dbl>                     <int>         <dbl>
##  1                   0                       191          20.6
##  2                   1                       218          23.5
##  3                   2                       146          15.7
##  4                   3                        84           9.1
##  5                   4                        51           5.5
##  6                   5                        34           3.7
##  7                   6                        22           2.4
##  8                   7                        25           2.7
##  9                   8                        19           2  
## 10                   9                        22           2.4
## # … with 29 more rows

These results are encouraging, as only about 21% of records (191 of 927) had no potential match at all. Now, let’s consider how common exact matches were:

## # A tibble: 8 x 3
##   `Exact Matches` `Number of Byers Records` `% of Sample`
##             <dbl>                     <int>         <dbl>
## 1               0                       789          85.1
## 2               1                       117          12.6
## 3               2                         9           1  
## 4               3                         6           0.6
## 5               4                         3           0.3
## 6               5                         1           0.1
## 7               6                         1           0.1
## 8               7                         1           0.1

About one-eighth of the records I searched for (12.6%) have exactly one exact match. These are high-confidence links that require little additional effort (depending on how common the names are). However, about 2% of the records have multiple exact matches as a result of common names. In the absence of other kinds of information, we would need to throw these matches out because we can’t tell which is correct. Breaking these ties will be an important step to explore.
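As a rough illustration of how the two summary tables and the tie rule could be computed, here is a minimal Python sketch. It assumes a list of candidate pairs, each carrying a hypothetical `byers_id` and an `exact` flag, plus the full list of Byers IDs so that records with zero matches are counted. The function names and the toy input are invented for the example.

```python
from collections import Counter

def match_distribution(all_ids, pairs, exact_only=False):
    """Rows of (matches per record, number of records, % of sample)."""
    per_record = Counter(
        p["byers_id"] for p in pairs if p["exact"] or not exact_only
    )
    # Records absent from per_record have zero matches.
    dist = Counter(per_record.get(i, 0) for i in all_ids)
    return [(k, dist[k], round(100 * dist[k] / len(all_ids), 1)) for k in sorted(dist)]

def unique_exact_links(pairs):
    """Keep exact pairs whose Byers record has exactly one exact match."""
    exact = [p for p in pairs if p["exact"]]
    n_exact = Counter(p["byers_id"] for p in exact)
    return [p for p in exact if n_exact[p["byers_id"]] == 1]

# Made-up toy input, for illustration only.
all_ids = [1, 2, 3, 4]
pairs = [
    {"byers_id": 1, "census_id": "a", "exact": True},
    {"byers_id": 1, "census_id": "b", "exact": False},
    {"byers_id": 2, "census_id": "c", "exact": True},
    {"byers_id": 2, "census_id": "d", "exact": True},
    {"byers_id": 3, "census_id": "e", "exact": False},
]

print(match_distribution(all_ids, pairs))                   # potential matches
print(match_distribution(all_ids, pairs, exact_only=True))  # exact matches
print(unique_exact_links(pairs))                            # ties dropped
```

Here record 2 has two exact matches, so it is excluded from the accepted links until some other variable can break the tie.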

Who didn’t get any matches?

Now, let’s consider the Byers records that didn’t get any potential matches in Allegheny County in the 1940 Census. First, let’s break the Byers records down by sex, since married women’s last name changes could be one reason for not finding good matches:

## # A tibble: 4 x 5
## # Groups:   Sex [2]
##   Sex   `Any Match?` `N of records` `% of records` `% of sex`
##   <chr> <chr>                 <int>          <dbl>      <dbl>
## 1 F     No                        6            0.6       18.2
## 2 F     Yes                      27            2.9       81.8
## 3 M     No                      185           20         20.7
## 4 M     Yes                     709           76.5       79.3

Here, we see that women make up a small portion of this part of the Byers sample (33 of 927 records), and they were no more likely than men to get no matches based on name and birth year (18.2% versus 20.7%). Another potential explanation is that some workers joined the Byers workforce well after 1940 and were not living in Allegheny County during the 1940 Census:

## # A tibble: 2 x 2
##   Matches    `% Start After 1940`
##   <chr>                     <dbl>
## 1 1+ Matches                 73.5
## 2 No Matches                 77

While workers with no matches were slightly more likely to have started after 1940 (77% versus 73.5%), it’s not a large difference. Finally, let’s try breaking down workers who started after 1940 by their birthplace. If they were born outside Pennsylvania, it’s possible that they did not live in Allegheny County during the 1940 Census, in which case matching them is not possible.

## # A tibble: 4 x 4
## # Groups:   Matches [2]
##   Matches    Birthplace   `N of records after 1940` `% of records after 1940`
##   <chr>      <chr>                            <int>                     <dbl>
## 1 1+ Matches Elsewhere                          145                      21.1
## 2 1+ Matches Pennsylvania                       396                      57.6
## 3 No Matches Elsewhere                           77                      11.2
## 4 No Matches Pennsylvania                        70                      10.2

This hunch seems to be somewhat correct. Among workers who started after 1940, about half of those without any potential matches were born outside Pennsylvania and about half were born in the state. By contrast, only about 27% of those with at least one potential match were born outside Pennsylvania, so out-of-staters are clearly overrepresented among the matchless cases.
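The within-group shares behind this comparison can be recomputed directly from the counts in the table above. The short Python check below uses only those four counts; the variable and function names are illustrative.

```python
# Counts of workers who started after 1940, copied from the table above.
counts = {
    ("1+ Matches", "Elsewhere"): 145,
    ("1+ Matches", "Pennsylvania"): 396,
    ("No Matches", "Elsewhere"): 77,
    ("No Matches", "Pennsylvania"): 70,
}

def pct_born_elsewhere(group):
    """Percent of a match group born outside Pennsylvania."""
    total = sum(n for (g, _), n in counts.items() if g == group)
    return round(100 * counts[(group, "Elsewhere")] / total, 1)

for group in ("1+ Matches", "No Matches"):
    print(group, pct_born_elsewhere(group))
# prints:
# 1+ Matches 26.8
# No Matches 52.4
```

Note that the percentages in the table itself use all 688 post-1940 workers as the denominator, whereas these shares are computed within each match group (541 and 147 records).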

Examining the Raw Results

Here is the full table of potential matches identified between the Byers records and the Census records, with race and birthplace included for comparison purposes (IPUMS codes for birthplace are available here; select “detailed codes”). Variable names in all uppercase are from the Byers file, while all-lowercase names are from the IPUMS census microdata.

And here is the full set of Byers information for records without any potential matches.