Technical notes

The OCC algorithms with formal evaluations are OCC_title, OCC_titleSyntaxI, OCC_titleSyntaxU, OCC_syntax, OCC_FsT_nobreaks, OCC, OCC_fullBody, OCC_wikidata.

Summary statistics

The obituaries and coded OCCs

  • There were a total of 1000 hand-coded documents, with 1,302 total codes.
  • There were 116 true OCC codes, with a minimum frequency of 1 (there were 42 codes with this frequency) and a maximum frequency of 151 (001).
  • We’re analyzing 60,847 obituaries.

The following simply shows the number of obituaries per quarter in the entire dataset.

And the following over the course of a couple years, 2011-2012:

Here’s a histogram of the frequency of OCC codes in the hand-coded set.

The following table shows the frequency of each OCC code in the hand-coded set. This table excludes the codes which account for less than 1% of documents.

OCC.code Total.true.codes
001 151
220 105
285 100
275 80
272 52
270 49
043 48
s043 44
210 39
306 30
204 29
271 29
281 28
260 27
186 24
980 23
023 22
003 21
002 20
283 19
274 18
280 18
012 17
s220 16
211 15
161 14
276 13
001a 12
263 11
470 11
165 10
291 10

The various algorithms

Brief algorithm descriptions

OCC

This algorithm looks in both the title and the first sentence. It first removes any instances of the subject’s name and converts the whole sentence to lower case. It then matches against a dictionary of terms, ignoring word order and allowing a 1-word “buffer” for 2-word terms and a 2-word buffer for 3-word terms. For example, the term “marketing assistant” would match “assistant of marketing activities” (one intervening word) but not “assistant to the marketing department” (two intervening words). We ignore terms which are subsets of other matched terms, for instance ignoring the “president” in “vice president”.
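The matching rule can be sketched in Python as follows. This is only an illustration of the rule as described, not our production code: the function names, the tokenizer, the treatment of terms longer than three words (no buffer) and the word-set test for "subset" terms are all assumptions.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_matches(term, tokens):
    """True if all words of `term` occur, in any order, inside one window.

    The window is the term length plus the buffer: 1 extra word for
    2-word terms, 2 extra words for 3-word terms, none otherwise
    (the last case is an assumption).
    """
    words = term.lower().split()
    buffer = len(words) - 1 if len(words) in (2, 3) else 0
    window = len(words) + buffer
    needed = set(words)
    for start in range(max(1, len(tokens) - window + 1)):
        if needed <= set(tokens[start:start + window]):
            return True
    return False


def match_terms(text, name, dictionary):
    """Remove the name, match every dictionary term, drop subset terms."""
    cleaned = re.sub(re.escape(name), " ", text, flags=re.IGNORECASE)
    tokens = tokenize(cleaned)
    hits = [term for term in dictionary if term_matches(term, tokens)]
    # Drop a hit whose words are a strict subset of another hit's words,
    # e.g. "president" when "vice president" also matched.
    return [
        term for term in hits
        if not any(set(term.split()) < set(other.split()) for other in hits)
    ]
```

With this sketch, `term_matches("marketing assistant", tokenize("assistant of marketing activities"))` is true, while the same term does not match "assistant to the marketing department".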

OCC_FsT_nobreaks

This algorithm is the same as OCC, except it does not allow any buffer words.

OCC_fullBody

This algorithm looks in the full body, excluding the title, using our curated dictionary. It does not allow buffer words, in order to improve performance (there are already far too many false positives).

OCC_syntax

This one isolates a specific grammatical construction which occurs often in the first sentence of the obituary. That is the appositional modifier, or “APPOS”. The appos is a noun which immediately follows another, typically within a comma-delimited phrase. In the example below, “Archbishop” is the appositional modifier of James Peter Davis.

James Peter Davis, Archbishop of Santa Fe from 1964 to 1974, died Friday.

The algorithm then looks this word (or words, if it’s a compound noun) up in our dictionary and records a match.

OCC_title

This algorithm looks exclusively in the description field of the title, which automatically excludes the deceased’s name. It again uses our curated dictionary.

OCC_wikidata

If there is a Wikidata entry which matches the deceased’s name exactly, we look at all occupations (P106) of the individual and match the labels of these occupations against our dictionary. We then code the individual with ALL of these occupations, as well as any superclass (P279) of each occupation. For example, Martin Luther King Jr. (Q8027) is listed as having the occupation preacher (Q432386), which is a subclass of religious servant (Q4504549), which in turn is a subclass of cleric (Q2259532). Beyond this the superclasses become much more general and typically non-occupational (e.g. believer); these are filtered out.
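The superclass expansion can be sketched as a simple graph walk. The sketch below uses a small hand-written stand-in for the P279 relation rather than live Wikidata queries, and it assumes that "filtering" means keeping only labels found in our dictionary; both are assumptions made for illustration, as is the edge from cleric to believer.

```python
from collections import deque


def occupations_with_superclasses(occupations, subclass_of, dictionary):
    """Collect the given occupations plus all their superclasses.

    `subclass_of` maps a label to its direct superclasses (a toy
    stand-in for P279). Labels missing from `dictionary` are dropped,
    which is how this sketch filters non-occupational classes.
    """
    seen = set()
    queue = deque(occupations)
    while queue:
        label = queue.popleft()
        if label in seen:
            continue
        seen.add(label)
        queue.extend(subclass_of.get(label, []))
    return {label for label in seen if label in dictionary}


# Toy graph built from the example in the text.
SUBCLASS_OF = {
    "preacher": ["religious servant"],
    "religious servant": ["cleric"],
    "cleric": ["believer"],  # illustrative edge only
}
DICTIONARY = {"preacher", "religious servant", "cleric"}  # hypothetical subset

codes = occupations_with_superclasses(["preacher"], SUBCLASS_OF, DICTIONARY)
```

Here `codes` contains preacher, religious servant and cleric, while believer is reached by the walk but dropped by the dictionary filter.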

OCC_titleSyntaxI and OCC_titleSyntaxU

These are just the intersection (OCC_titleSyntaxI) and union (OCC_titleSyntaxU) of OCC_title and OCC_syntax, the highest precision and most straightforward use of our vocabulary.
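As a minimal sketch, assuming each algorithm yields a set of OCC codes per obituary (an assumption about the data layout, and the function name is hypothetical), the two combinations are plain set operations:

```python
def combine_title_syntax(title_codes, syntax_codes):
    """Combine the per-obituary codes from OCC_title and OCC_syntax."""
    title, syntax = set(title_codes), set(syntax_codes)
    return {
        "OCC_titleSyntaxI": title & syntax,  # codes both algorithms agree on
        "OCC_titleSyntaxU": title | syntax,  # codes found by either algorithm
    }


combined = combine_title_syntax({"220", "285"}, {"220"})
```

The intersection can only shrink the set of guesses, which is consistent with it having the highest precision and fewest machine codes in the evaluation below.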

Algorithm performance

The table below shows some summary statistics regarding the different algorithms, giving a high-level view of how well they performed.

  • truePos counts the number of true codes which were correctly guessed
  • falsePos counts the number of machine guesses which were not correct, so that truePos + falsePos = Total.machine.codes
  • Precision is the proportion of machine guesses which were correct, truePos / Total.machine.codes
  • Recall is the proportion of true codes which were correctly guessed; in the tables below it is computed as truePos / (truePos + falseNeg)

Algorithm truePos falsePos falseNeg NobitsCoded Total.machine.codes Total.true.codes
OCC 837 (0.72) 332 (0.28) 291 (0.22) 849 1169 1302
OCC_FsT_nobreaks 827 (0.71) 331 (0.29) 297 (0.23) 845 1158 1302
OCC_fullBody 989 (0.27) 2612 (0.73) 291 (0.22) 979 3601 1302
OCC_syntax 478 (0.78) 132 (0.22) 280 (0.22) 531 610 1302
OCC_title 473 (0.8) 119 (0.2) 270 (0.21) 548 592 1302
OCC_titleSyntaxI 256 (0.87) 38 (0.13) 139 (0.11) 282 294 1302
OCC_titleSyntaxU 695 (0.77) 213 (0.23) 295 (0.23) 725 908 1302
OCC_wikidata 282 (0.27) 755 (0.73) 280 (0.22) 432 1037 1302

The numbers in parentheses in the truePos and falsePos columns are proportions of all machine guesses (Total.machine.codes), so the truePos proportion is the precision. The number in parentheses in the falseNeg column is the proportion of all true codes (Total.true.codes) which were missed by machine coding.

Algorithm truePosProp falsePosProp Precision Recall Total.true.codes
OCC 0.7159966 0.2840034 0.7159966 0.7420213 1302
OCC_FsT_nobreaks 0.7141623 0.2858377 0.7141623 0.7357651 1302
OCC_fullBody 0.2746459 0.7253541 0.2746459 0.7726562 1302
OCC_syntax 0.7836066 0.2163934 0.7836066 0.6306069 1302
OCC_title 0.7989865 0.2010135 0.7989865 0.6366083 1302
OCC_titleSyntaxI 0.8707483 0.1292517 0.8707483 0.6481013 1302
OCC_titleSyntaxU 0.7654185 0.2345815 0.7654185 0.7020202 1302
OCC_wikidata 0.2719383 0.7280617 0.2719383 0.5017794 1302
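The Precision and Recall columns above can be reproduced from the raw counts. The short Python sketch below shows the arithmetic for two of the algorithms; the function name is just for illustration, and the recall formula is the one that matches the reported figures.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from raw counts, matching the reported columns."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# (truePos, falsePos, falseNeg) taken from the counts table above.
counts = {
    "OCC": (837, 332, 291),
    "OCC_titleSyntaxI": (256, 38, 139),
}
scores = {name: precision_recall(*c) for name, c in counts.items()}
```

For OCC this gives 837 / 1169 for precision and 837 / 1128 for recall, which agree with the 0.7159966 and 0.7420213 reported above.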

[not started] Typical performance on the same task in other work

Fine-grained accuracy of our best algorithm

We check the success of our machine coding of OCC against the set of 1000 obituaries which we have already carefully hand-coded.

It’s interesting to look at our overall accuracy on just the more common codes, as it’s evident in the figure above that those had the best performance.

What patterns are present in the New York Times?

We focus here on those occupations which are correctly identified more than 90% of the time. That is, more than 90% of the machine’s guesses of the given occupation were correct (truePos / Total.machine.codes > 0.9). We also exclude codes that are exceedingly rare (< 1% of all hand-coded documents). This narrows the set to the following:

OCC.code truePos falsePos falseNeg NobitsCoded Total.machine.codes Total.true.codes Pop.Prop.Guess Pop.Prop.True
220 47 4 51 849 51 105 0.0600707 0.105
210 32 3 3 849 35 39 0.0412250 0.039
271 13 1 11 849 14 29 0.0164900 0.029
281 24 2 4 849 26 28 0.0306243 0.028
186 19 0 5 849 19 24 0.0223793 0.024
283 17 1 2 849 18 19 0.0212014 0.019
274 18 1 0 849 19 18 0.0223793 0.018
161 11 1 1 849 12 14 0.0141343 0.014
470 6 0 2 849 6 11 0.0070671 0.011
291 8 0 1 849 8 10 0.0094229 0.010

To explore the possibility that the trend, or lack of one, could be related to the differing numbers of obituaries over time, we can look at the proportion of documents in each time period which were given that code.

The following plot asks the simple question of whether the proportion we code is variable over time. This could indicate some time-bias in our coding algorithm.

Corrections via Hopkins and King (2010) analysis

NOTE: Although this is quite close, the derivation below rests on a faulty assumption.

In particular, for disjoint events \(A, B, C\) such that \(A\cup B\cup C = E\), the whole sample space, it’s true that \(P(X\wedge A)+P(X\wedge B)+P(X\wedge C)=P(X)\), as assumed below. But crucially, the events \(\{j'\in\hat{D}\}\) are not disjoint, because \(\hat{D}\) often contains multiple codes. This needs to be corrected.

It seems the correction is to amend our expansion using inclusion-exclusion. We can still do everything else (although it won’t be a simple matrix multiplication and will probably take me some time…)

\[ \begin{split} P(j\in D) &= \sum_{k} P(j\in D | k\in\hat{D}) P(k\in\hat{D}) \\ &- \sum_{k_1 < k_2} P(j\in D | k_1\in\hat{D} \wedge k_2\in\hat{D} ) P(k_1\in\hat{D} \wedge k_2\in\hat{D}) \\ &+ \sum_{k_1 < k_2 < k_3} P(j\in D | k_1\in\hat{D} \wedge k_2\in\hat{D} \wedge k_3\in\hat{D} ) P(k_1\in\hat{D} \wedge k_2\in\hat{D} \wedge k_3\in\hat{D}) \\ &- \dots \end{split} \]
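As a sanity check on this kind of expansion, here is a small Python sketch on made-up data. It sums over unordered sets of machine codes and compares the result with the directly counted proportion. The documents, codes and function names are invented for illustration, and the identity relies on every document receiving at least one machine code.

```python
from itertools import combinations


def prob(docs, pred):
    """Empirical probability of `pred` over the documents."""
    return sum(1 for d in docs if pred(d)) / len(docs)


def inclusion_exclusion(docs, j, codes):
    """P(j in D) via alternating sums over unordered sets of machine codes."""
    total = 0.0
    for size in range(1, len(codes) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(codes, size):
            # Joint probability: j is a true code AND all of `subset` were guessed.
            total += sign * prob(
                docs, lambda d, s=set(subset): j in d[0] and s <= d[1]
            )
    return total


# Each document is (true codes D, machine codes D-hat); made-up data.
docs = [
    ({"220"}, {"220", "s220"}),
    ({"001"}, {"001"}),
    ({"220"}, {"001", "220"}),
    ({"001", "220"}, {"s220"}),
]
codes = ["001", "220", "s220"]

direct = prob(docs, lambda d: "220" in d[0])
naive = sum(prob(docs, lambda d, k=k: "220" in d[0] and k in d[1]) for k in codes)
corrected = inclusion_exclusion(docs, "220", codes)
```

On this toy data the naive single sum overcounts (1.25 against a true proportion of 0.75), while the alternating sum recovers 0.75.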

Summary of the method

What follows is a quick summary of the method used by Hopkins and King (2010). They assume that misclassifications are systematic. That is, if we observe in our hand-coded ‘golden’ test set that 17% of the obituaries we code as doctors are actually lawyers, a similar rate will hold in the larger set (or in arbitrary subsets). We can use this assumption, that the true occupation carries information about the probabilities of misclassification, when computing population proportions. This can be seen by decomposing probabilities.

Let \(j\) represent an occupation, let \(D\) be the set of true occupations, and \(\hat{D}\) be the set of occupations our algorithm codes. Then \(P(j\in D) = \sum_{j'} P(j\in D | j'\in\hat{D}) P(j'\in\hat{D})\). We then estimate \(P(j\in D | j'\in\hat{D})\) from our hand-coded data, assuming that these conditional probabilities are roughly constant over the set of obituaries. This gives us an estimate of the true proportions \(P(j\in D)\), for the population and for subsets, from the proportions of our machine codes \(P(j'\in\hat{D})\).
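The decomposition as written here can be sketched in a few lines of Python. This is an illustration on made-up doctor/lawyer data, not the code used for the results and not a full implementation of Hopkins and King (2010). The function names and the data layout (pairs of true and machine code sets) are assumptions.

```python
from collections import Counter, defaultdict


def estimate_conditionals(hand_coded):
    """Estimate P(j in D | j' in D-hat) from (true_codes, machine_codes) pairs."""
    guess_counts = Counter()
    joint_counts = defaultdict(Counter)
    for true_codes, machine_codes in hand_coded:
        for guess in machine_codes:
            guess_counts[guess] += 1
            for true in true_codes:
                joint_counts[guess][true] += 1
    return {
        guess: {true: n / guess_counts[guess] for true, n in joint_counts[guess].items()}
        for guess in guess_counts
    }


def adjust(machine_props, conditionals):
    """Apply P(j in D) = sum over j' of P(j in D | j' in D-hat) * P(j' in D-hat)."""
    adjusted = Counter()
    for guess, prop in machine_props.items():
        # Guesses never seen in the hand-coded set contribute nothing.
        for true, cond in conditionals.get(guess, {}).items():
            adjusted[true] += cond * prop
    return dict(adjusted)


# Made-up hand-coded set: one of three "doctor" guesses is really a lawyer.
hand_coded = [
    ({"doctor"}, {"doctor"}),
    ({"lawyer"}, {"doctor"}),
    ({"lawyer"}, {"lawyer"}),
    ({"doctor"}, {"doctor"}),
]
conditionals = estimate_conditionals(hand_coded)
adjusted = adjust({"doctor": 0.6, "lawyer": 0.4}, conditionals)
```

In this toy case a raw split of 0.6 doctors and 0.4 lawyers is adjusted to 0.4 and 0.6, because a third of the doctor guesses in the hand-coded set turn out to be lawyers.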

Results of adjustments

I’ve used this method to adjust our coded population proportions. Here’s a small sample of the modified population proportions:

OCC Modified Raw
001 0.2055 0.2429
001a 0.0211 0.0000
002 0.0281 0.0113
003 0.0331 0.0284
004 0.0052 0.0069
005 0.0021 0.0010
006 0.0058 0.0010
012 0.0264 0.0229
013 0.0016 0.0005
021 0.0068 0.0042

And those which changed the most through this procedure:

OCC Modified Raw
220 0.1512 0.0629
s043 0.0580 0.1437
s220 0.0260 0.0844
001 0.2055 0.2429
043 0.0851 0.0545

There are some caveats to this method. First, using it gives zero proportion for many actually coded categories.

There are an astounding 148 codes attributed in the entire set which don’t show up in our hand-coding set at all, either in the true or attributed codes. There are even 8 codes which we use in the hand-coding set which never show up in the machine-coding of the larger set. Thus we can only use this method to infer population proportions for the most prevalent occupations. The second main caveat is that we don’t know whether these mis-codings were systematic. And even if they are systematic, we’re not sure if our estimated \(P(j\in D | j'\in\hat{D})\) should be consistent when selecting subsets on covariates, particularly by time.

Can we get immediate improvements by combining OCCs into larger supergroups?

We will now combine reasonably similar groups of OCCs into the same group by the following specification:

Group OCC
CEO 1, 001a, 1a
ADMIN 2, 4-42
LEGISLATOR 3
DIPLOMAT 43, s043
BIZOP 50-73
FINANCE 80-95
MATH 100-124
ARCHITECT 130, 131, s130
ENGINEER 132-156
SCIENTIST 160-196, s160
COUNSELOR 200-202
CLERGY 204, 205, 206
LAWYER 210, 214, 215
JUDGE 211
PROF 220, s220
TEACHER 230-234
EDUC OTHER 240-255
ARTIST 260
ACTOR 270
DIRECTOR 271
ATHLETE 272
DANCER 274
MUSICIAN 275
CLOWN 276
ANNOUNCER 280
NEWS 281-283
PHOTO 291, 292
AUTHOR 284, 285
ARTS OTHER 263, 286-296
DOCTOR 300-307, 312, s300
NURSE 311, 313-355, 360-365
COP 370-395
SALES 470-496
SECRETARY 500-593
FARM 600-613
BLUE COLLAR 620-975
PERSONAL 400-465
MILITARY 980, 981, 982, s980
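A lookup from OCC code to supergroup can be sketched as below. Only a handful of the groups are included to keep it short. The function names are hypothetical, and two details are assumptions: purely numeric codes are compared as integers (so "001" falls under "1"), and where specifications overlap the first matching group wins.

```python
# A subset of the specification above, in the same "code, code, lo-hi" notation.
GROUPS = [
    ("CEO", "1, 001a, 1a"),
    ("ADMIN", "2, 4-42"),
    ("DIPLOMAT", "43, s043"),
    ("PROF", "220, s220"),
    ("NEWS", "281-283"),
]


def parse_spec(spec):
    """Split a specification into numeric ranges and literal codes."""
    ranges, literals = [], set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            ranges.append((int(lo), int(hi)))
        elif part.isdigit():
            ranges.append((int(part), int(part)))
        else:
            literals.add(part)  # e.g. "001a" or "s043"
    return ranges, literals


def supergroup(code, groups=GROUPS):
    """Return the first group whose specification covers `code`, else None."""
    for name, spec in groups:
        ranges, literals = parse_spec(spec)
        if code in literals:
            return name
        if code.isdigit() and any(lo <= int(code) <= hi for lo, hi in ranges):
            return name
    return None
```

For example, `supergroup("001")` gives CEO, `supergroup("s043")` gives DIPLOMAT, and a code outside this subset, such as "003", returns `None` here.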

Performance of algorithms

The same tables of summary statistics, recomputed on the combined groups, are reproduced here for comparison.

Algorithm truePos falsePos falseNeg NobitsCoded Total.machine.codes Total.true.codes
OCC 881 (0.74) 311 (0.26) 235 (0.18) 849 1192 1289
OCC_FsT_nobreaks 872 (0.74) 309 (0.26) 240 (0.19) 845 1181 1289
OCC_fullBody 1058 (0.28) 2727 (0.72) 209 (0.16) 979 3785 1289
OCC_syntax 511 (0.83) 105 (0.17) 236 (0.18) 531 616 1289
OCC_title 506 (0.83) 100 (0.17) 230 (0.18) 548 606 1289
OCC_titleSyntaxI 270 (0.9) 30 (0.1) 122 (0.09) 282 300 1289
OCC_titleSyntaxU 740 (0.81) 175 (0.19) 239 (0.19) 725 915 1289
OCC_wikidata 320 (0.33) 645 (0.67) 239 (0.19) 432 965 1289

Algorithm truePosProp falsePosProp Precision Recall Total.true.codes
OCC 0.7390940 0.2609060 0.7390940 0.7894265 1289
OCC_FsT_nobreaks 0.7383573 0.2616427 0.7383573 0.7841727 1289
OCC_fullBody 0.2795244 0.7204756 0.2795244 0.8350434 1289
OCC_syntax 0.8295455 0.1704545 0.8295455 0.6840696 1289
OCC_title 0.8349835 0.1650165 0.8349835 0.6875000 1289
OCC_titleSyntaxI 0.9000000 0.1000000 0.9000000 0.6887755 1289
OCC_titleSyntaxU 0.8087432 0.1912568 0.8087432 0.7558733 1289
OCC_wikidata 0.3316062 0.6683938 0.3316062 0.5724508 1289

Performance within groups