The OCC algorithms with formal evaluations are OCC_title, OCC_titleSyntaxI, OCC_titleSyntaxU, OCC_syntax, OCC_FsT_nobreaks, OCC, OCC_fullBody, OCC_wikidata.
The following simply shows the number of obituaries per quarter in the entire dataset.
And the following over the course of a couple years, 2011-2012:
Here’s a histogram of the frequency of OCC codes in the hand-coded set.
The following table shows the frequency of each OCC code in the hand-coded set. This table excludes the codes which account for less than 1% of documents.
| | OCC.code | Total.true.codes |
|---|---|---|
| 581 | 001 | 151 |
| 639 | 220 | 105 |
| 660 | 285 | 100 |
| 654 | 275 | 80 |
| 652 | 272 | 52 |
| 650 | 270 | 49 |
| 599 | 043 | 48 |
| 691 | s043 | 44 |
| 637 | 210 | 39 |
| 666 | 306 | 30 |
| 635 | 204 | 29 |
| 651 | 271 | 29 |
| 657 | 281 | 28 |
| 648 | 260 | 27 |
| 631 | 186 | 24 |
| 689 | 980 | 23 |
| 594 | 023 | 22 |
| 584 | 003 | 21 |
| 583 | 002 | 20 |
| 659 | 283 | 19 |
| 653 | 274 | 18 |
| 656 | 280 | 18 |
| 588 | 012 | 17 |
| 694 | s220 | 16 |
| 638 | 211 | 15 |
| 621 | 161 | 14 |
| 655 | 276 | 13 |
| 582 | 001a | 12 |
| 649 | 263 | 11 |
| 678 | 470 | 11 |
| 623 | 165 | 10 |
| 662 | 291 | 10 |
This algorithm looks in both the title and the first sentence. It first removes any instances of the person’s name and converts the whole sentence to lower case. It then uses a dictionary of terms, allowing a 1-word “buffer” for 2-word terms and a 2-word buffer for 3-word terms, ignoring word order. For example, the term “marketing assistant” would match “assistant of marketing activities” but not “assistant to the marketing department”. We ignore terms that are subsets of other matched terms, for instance ignoring the “president” in “vice president”.
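The matching rule can be sketched as follows. This is a minimal illustration, not the code used for the evaluation. It assumes that a "buffer" of n-1 words means all n words of a term must fall inside a window of 2n-1 consecutive tokens, and it approximates the subset rule by comparing the word sets of matched terms. The function names and the tokenizer are hypothetical.

```python
import re

def tokenize(text):
    # Lower-case the text and split it into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

def strip_name(text, name):
    # Remove every occurrence of the person's name, ignoring case.
    return re.sub(re.escape(name), " ", text, flags=re.IGNORECASE)

def term_matches(term, tokens):
    # Assumption: an n-word term may be spread over n + (n - 1) tokens,
    # i.e. a 1-word buffer for 2-word terms, a 2-word buffer for 3-word terms.
    words = term.lower().split()
    window = len(words) + (len(words) - 1)
    for start in range(len(tokens)):
        chunk = tokens[start:start + window]
        if all(w in chunk for w in words):  # word order is ignored
            return True
    return False

def match_terms(text, name, dictionary):
    tokens = tokenize(strip_name(text, name))
    matched = [t for t in dictionary if term_matches(t, tokens)]
    # Simplification: drop a matched term whose words are a proper subset
    # of another matched term (e.g. "president" inside "vice president").
    kept = []
    for t in matched:
        words = set(t.split())
        if not any(words < set(other.split()) for other in matched):
            kept.append(t)
    return kept
```

With the example from the text, `match_terms("Jane Doe, assistant of marketing activities", "Jane Doe", ["marketing assistant"])` returns the term, while the "assistant to the marketing department" variant returns an empty list.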
This algorithm is the same as OCC, except it does not allow any buffer words.
This algorithm looks in the full body, excluding the title, using our curated dictionary. It does not allow buffer words, to improve performance (there are already far too many false positives).
This algorithm isolates a specific grammatical construction that often occurs in the first sentence of an obituary: the appositional modifier, or “APPOS”. The appos is a noun which immediately follows another, typically within a comma-delimited phrase. In the example below, “Archbishop” is the appositional modifier of James Peter Davis.
James Peter Davis, Archbishop of Santa Fe from 1964 to 1974, died Friday.
The algorithm then looks this word (or words, if it’s a compound noun) up in our dictionary and records a match.
This algorithm looks exclusively in the description field of the title, which automatically excludes the obituarized’s name. It again uses our curated dictionary.
If there is a Wikidata entry which matches the obituarized’s name exactly, we look at all occupations (P106) of the individual and match the labels for these occupations against our dictionary. We then code the individual with ALL of these occupations, as well as any superclass (P279) of each occupation. For example, Martin Luther King Jr. (Q8027) is listed as having the occupation preacher (Q432386), which is a subclass of religious servant (Q4504549), which in turn is a subclass of cleric (Q2259532). Beyond this the superclasses become much more general and typically non-occupational (e.g. believer); these are filtered out.
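The superclass expansion amounts to a graph traversal. The sketch below uses a small in-memory stand-in for Wikidata rather than a live query. The QIDs and labels are taken from the example above, except `Q_BELIEVER`, which is a placeholder id, and the edge from cleric to it, which is assumed for illustration. All variable and function names are hypothetical.

```python
from collections import deque

# Toy stand-in for Wikidata statements (no network access).
OCCUPATION = {"Q8027": ["Q432386"]}        # P106: person -> occupations
SUBCLASS_OF = {                            # P279: class -> superclasses
    "Q432386": ["Q4504549"],
    "Q4504549": ["Q2259532"],
    "Q2259532": ["Q_BELIEVER"],            # placeholder id, assumed edge
}
LABEL = {
    "Q432386": "preacher",
    "Q4504549": "religious servant",
    "Q2259532": "cleric",
    "Q_BELIEVER": "believer",
}
NON_OCCUPATIONAL = {"believer"}            # labels filtered out

def occupation_labels(person_qid):
    # Breadth-first walk from each listed occupation up the P279 chain.
    seen = set()
    queue = deque(OCCUPATION.get(person_qid, []))
    while queue:
        qid = queue.popleft()
        if qid in seen:
            continue
        seen.add(qid)
        queue.extend(SUBCLASS_OF.get(qid, []))
    labels = {LABEL[q] for q in seen if q in LABEL}
    return sorted(labels - NON_OCCUPATIONAL)
```

The returned labels would then be matched against the curated dictionary to obtain OCC codes.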
These are just the intersection (OCC_titleSyntaxI) and union (OCC_titleSyntaxU) of OCC_title and OCC_syntax, the highest precision and most straightforward use of our vocabulary.
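Per obituary, the two combined codings are plain set operations. The snippet below is only an illustration: the code sets are made up and the function name is hypothetical.

```python
def combine(title_codes, syntax_codes):
    # Intersection keeps only codes both algorithms agree on;
    # union keeps every code proposed by either algorithm.
    intersection = title_codes & syntax_codes
    union = title_codes | syntax_codes
    return intersection, union

# Made-up codes for a single obituary.
both, either = combine({"220", "210"}, {"220", "001"})
```

Here `both` is `{"220"}` and `either` is `{"001", "210", "220"}`.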
The table below shows some summary statistics regarding the different algorithms, giving a high-level view of how well they performed.
| Algorithm | truePos | falsePos | falseNeg | NobitsCoded | Total.machine.codes | Total.true.codes |
|---|---|---|---|---|---|---|
| OCC | 837 (0.72) | 332 (0.28) | 291 (0.22) | 849 | 1169 | 1302 |
| OCC_FsT_nobreaks | 827 (0.71) | 331 (0.29) | 297 (0.23) | 845 | 1158 | 1302 |
| OCC_fullBody | 989 (0.27) | 2612 (0.73) | 291 (0.22) | 979 | 3601 | 1302 |
| OCC_syntax | 478 (0.78) | 132 (0.22) | 280 (0.22) | 531 | 610 | 1302 |
| OCC_title | 473 (0.8) | 119 (0.2) | 270 (0.21) | 548 | 592 | 1302 |
| OCC_titleSyntaxI | 256 (0.87) | 38 (0.13) | 139 (0.11) | 282 | 294 | 1302 |
| OCC_titleSyntaxU | 695 (0.77) | 213 (0.23) | 295 (0.23) | 725 | 908 | 1302 |
| OCC_wikidata | 282 (0.27) | 755 (0.73) | 280 (0.22) | 432 | 1037 | 1302 |
The numbers in parentheses in the truePos and falsePos columns give each count as a proportion of all machine codes (Total.machine.codes), that is, the share of guesses that were correct or incorrect. The number in parentheses in the falseNeg column gives the proportion of all true codes (Total.true.codes) that were missed by the machine coding.
| Algorithm | truePosProp | falsePosProp | Precision | Recall | Total.true.codes |
|---|---|---|---|---|---|
| OCC | 0.7159966 | 0.2840034 | 0.7159966 | 0.7420213 | 1302 |
| OCC_FsT_nobreaks | 0.7141623 | 0.2858377 | 0.7141623 | 0.7357651 | 1302 |
| OCC_fullBody | 0.2746459 | 0.7253541 | 0.2746459 | 0.7726562 | 1302 |
| OCC_syntax | 0.7836066 | 0.2163934 | 0.7836066 | 0.6306069 | 1302 |
| OCC_title | 0.7989865 | 0.2010135 | 0.7989865 | 0.6366083 | 1302 |
| OCC_titleSyntaxI | 0.8707483 | 0.1292517 | 0.8707483 | 0.6481013 | 1302 |
| OCC_titleSyntaxU | 0.7654185 | 0.2345815 | 0.7654185 | 0.7020202 | 1302 |
| OCC_wikidata | 0.2719383 | 0.7280617 | 0.2719383 | 0.5017794 | 1302 |
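As a rough check on how the columns relate, the sketch below recomputes Precision and Recall from the raw counts. It assumes Precision = truePos / (truePos + falsePos) and Recall = truePos / (truePos + falseNeg), which reproduces the reported OCC values. The function names are illustrative.

```python
def precision(true_pos, false_pos):
    # Share of machine codes that are correct.
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    # Share of (found or missed) true codes that were found.
    return true_pos / (true_pos + false_neg)

# Counts for the OCC algorithm from the summary statistics.
occ_precision = precision(837, 332)
occ_recall = recall(837, 291)
```

Note that under this definition the Recall denominator is truePos + falseNeg rather than Total.true.codes.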
We check the success of our machine coding of OCC against the set of 1000 obituaries which we have already carefully hand-coded.
It’s interesting to look at our overall accuracy on just the more common codes, as it’s evident in the figure above that those had the best performance.
We focus here on those occupations which are correctly identified more than 90% of the time. That is, more than 90% of the machine’s attributions of the given occupation are true positives. We also won’t consider codes that are exceedingly rare (< 1% of all those hand-coded). This narrows to the following:
| | OCC.code | truePos | falsePos | falseNeg | NobitsCoded | Total.machine.codes | Total.true.codes | Pop.Prop.Guess | Pop.Prop.True | Total.true.codes.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 639 | 220 | 47 | 4 | 51 | 849 | 51 | 105 | 0.0600707 | 0.105 | 105 |
| 637 | 210 | 32 | 3 | 3 | 849 | 35 | 39 | 0.0412250 | 0.039 | 39 |
| 651 | 271 | 13 | 1 | 11 | 849 | 14 | 29 | 0.0164900 | 0.029 | 29 |
| 657 | 281 | 24 | 2 | 4 | 849 | 26 | 28 | 0.0306243 | 0.028 | 28 |
| 631 | 186 | 19 | 0 | 5 | 849 | 19 | 24 | 0.0223793 | 0.024 | 24 |
| 659 | 283 | 17 | 1 | 2 | 849 | 18 | 19 | 0.0212014 | 0.019 | 19 |
| 653 | 274 | 18 | 1 | 0 | 849 | 19 | 18 | 0.0223793 | 0.018 | 18 |
| 621 | 161 | 11 | 1 | 1 | 849 | 12 | 14 | 0.0141343 | 0.014 | 14 |
| 678 | 470 | 6 | 0 | 2 | 849 | 6 | 11 | 0.0070671 | 0.011 | 11 |
| 662 | 291 | 8 | 0 | 1 | 849 | 8 | 10 | 0.0094229 | 0.010 | 10 |
To explore the possibility that the trend, or lack of one, could be related to the differing numbers of obituaries over time, we can look at the proportion of documents coded in that time period with that code.
The following plot asks the simple question of whether the proportion we code is variable over time. This could indicate some time-bias in our coding algorithm.
In particular, for disjoint events \(A, B, C\) such that \(A\cup B\cup C = E\), the whole sample space, it’s true that \(P(X\wedge A)+P(X\wedge B)+P(X\wedge C)=P(X)\), as assumed below. But crucially, the events \(\{j'\in\hat{D}\}\) are not disjoint, because \(\hat{D}\) often contains multiple codes. This needs to be corrected.
It seems the correction is to amend our expansion. We can still do everything else (although it won’t be a simple matrix multiplication and will probably take me some time…)
\[ \begin{split} P(j\in D) &= \sum_{k} P(j\in D \mid k\in\hat{D})\, P(k\in\hat{D}) \\ &- \sum_{k_1 < k_2} P(j\in D \mid k_1\in\hat{D} \wedge k_2\in\hat{D} )\, P(k_1\in\hat{D} \wedge k_2\in\hat{D}) \\ &+ \sum_{k_1 < k_2 < k_3} P(j\in D \mid k_1\in\hat{D} \wedge k_2\in\hat{D} \wedge k_3\in\hat{D} )\, P(k_1\in\hat{D} \wedge k_2\in\hat{D} \wedge k_3\in\hat{D}) \\ &- \dots \end{split} \]
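A small numerical check of the expansion above, on made-up data, shows why the correction matters. Each toy document carries a set of machine codes and a flag for whether \(j\in D\). Each product \(P(j\in D \mid \cdot)P(\cdot)\) is computed as the joint probability it equals, and every sum runs over unordered sets of distinct codes. The sketch assumes every document has at least one machine code, so that the events cover the whole sample space. All names and data are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# (machine codes D-hat, whether j is among the true codes D); made-up data.
docs = [
    ({"220"}, True),
    ({"220", "210"}, True),
    ({"210"}, False),
    ({"220", "210", "001"}, True),
    ({"001"}, False),
]

def p_joint(codes, docs):
    # P(j in D and every code in `codes` is in D-hat).
    hits = sum(1 for dhat, j_true in docs if j_true and set(codes) <= dhat)
    return Fraction(hits, len(docs))

def all_codes(docs):
    return sorted(set().union(*(dhat for dhat, _ in docs)))

def naive_sum(docs):
    # First term only: overcounts when documents carry several codes.
    return sum(p_joint([k], docs) for k in all_codes(docs))

def inclusion_exclusion(docs):
    codes = all_codes(docs)
    total = Fraction(0)
    for r in range(1, len(codes) + 1):
        sign = 1 if r % 2 == 1 else -1   # alternate +, -, +, ...
        for combo in combinations(codes, r):
            total += sign * p_joint(combo, docs)
    return total

direct = Fraction(sum(1 for _, j_true in docs if j_true), len(docs))
```

On this toy data the direct proportion is 3/5, the naive sum gives 6/5, and the alternating expansion recovers 3/5.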
What follows is a quick summary of the method used by Hopkins and King (2010). They surmise that misclassifications are systematic. That is, if we observe in our hand-coded ‘golden’ test set that 17% of the time we think they are doctors they are actually lawyers, this will be similar in the larger set (or in arbitrary subsets). We can use this assumption, that true occupation gives some information about the probabilities of misclassification, when computing population proportions. This can be seen through decomposing probabilities.
Let \(j\) represent an occupation, let \(D\) be the set of true occupations, and \(\hat{D}\) be the set of occupations our algorithm codes. Then \(P(j\in D) = \sum_{j'} P(j\in D | j'\in\hat{D}) P(j'\in\hat{D})\). We will then estimate \(P(j\in D | j'\in\hat{D})\) from our hand-coded data, assuming that these conditional probabilities are roughly constant over the set of obituaries. This gives us an estimate, for the population (and subsets), of the true proportions \(P(j\in D)\) from the machine-coded proportions \(P(j'\in\hat{D})\).
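The decomposition above can be sketched as a table lookup followed by a weighted sum. This is a simplified illustration, not the code used to produce the results here. It assumes each obituary has exactly one true code and one machine code, so the machine-code events are disjoint. The data and names are made up.

```python
from collections import Counter, defaultdict

def conditional_table(pairs):
    # pairs: (true_code, machine_code) from the hand-coded set.
    # Returns cond[k][j], an estimate of P(true = j | machine = k).
    by_machine = defaultdict(Counter)
    for true_code, machine_code in pairs:
        by_machine[machine_code][true_code] += 1
    return {
        k: {j: n / sum(counts.values()) for j, n in counts.items()}
        for k, counts in by_machine.items()
    }

def corrected_proportions(cond, machine_props):
    # P(j in D) = sum over k of P(j in D | k in D-hat) * P(k in D-hat).
    estimate = defaultdict(float)
    for k, q in machine_props.items():
        # Machine codes never seen in the hand-coded set contribute nothing.
        for j, p in cond.get(k, {}).items():
            estimate[j] += p * q
    return dict(estimate)

# Made-up hand-coded sample: 1 in 6 machine "doctor" codes is truly a lawyer.
hand_coded = (
    [("doctor", "doctor")] * 5
    + [("lawyer", "doctor")] * 1
    + [("lawyer", "lawyer")] * 4
)
cond = conditional_table(hand_coded)
adjusted = corrected_proportions(cond, {"doctor": 0.6, "lawyer": 0.4})
```

In this toy case a raw 60/40 split between doctors and lawyers is adjusted to 50/50, because some machine-coded doctors are reassigned to lawyers.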
I’ve used this method to adjust our coded population proportions. Here’s a small sample of the modified population proportions:
| OCC | Modified | Raw |
|---|---|---|
| 001 | 0.2055 | 0.2429 |
| 001a | 0.0211 | 0.0000 |
| 002 | 0.0281 | 0.0113 |
| 003 | 0.0331 | 0.0284 |
| 004 | 0.0052 | 0.0069 |
| 005 | 0.0021 | 0.0010 |
| 006 | 0.0058 | 0.0010 |
| 012 | 0.0264 | 0.0229 |
| 013 | 0.0016 | 0.0005 |
| 021 | 0.0068 | 0.0042 |
And those which changed the most through this procedure:
| | OCC | Modified | Raw |
|---|---|---|---|
| 62 | 220 | 0.1512 | 0.0629 |
| 126 | s043 | 0.0580 | 0.1437 |
| 129 | s220 | 0.0260 | 0.0844 |
| 1 | 001 | 0.2055 | 0.2429 |
| 18 | 043 | 0.0851 | 0.0545 |
There are some caveats to this method. First, using it gives zero proportion for many actually coded categories.
There are an astounding 148 codes attributed in the entire set which don’t show up in our hand-coding set at all, either in the true or attributed codes. There are even 8 codes which we use in the hand-coding set which never show up in the machine-coding of the larger set. Thus we can only use this method to infer population proportions for the most prevalent occupations. The second main caveat is that we don’t know whether these mis-codings were systematic. And even if they are systematic, we’re not sure if our estimated \(P(j\in D | j'\in\hat{D})\) should be consistent when selecting subsets on covariates, particularly by time.
We will now combine reasonably similar groups of OCCs into the same group by the following specification:
| Algorithm | truePos | falsePos | falseNeg | NobitsCoded | Total.machine.codes | Total.true.codes |
|---|---|---|---|---|---|---|
| OCC | 881 (0.74) | 311 (0.26) | 235 (0.18) | 849 | 1192 | 1289 |
| OCC_FsT_nobreaks | 872 (0.74) | 309 (0.26) | 240 (0.19) | 845 | 1181 | 1289 |
| OCC_fullBody | 1058 (0.28) | 2727 (0.72) | 209 (0.16) | 979 | 3785 | 1289 |
| OCC_syntax | 511 (0.83) | 105 (0.17) | 236 (0.18) | 531 | 616 | 1289 |
| OCC_title | 506 (0.83) | 100 (0.17) | 230 (0.18) | 548 | 606 | 1289 |
| OCC_titleSyntaxI | 270 (0.9) | 30 (0.1) | 122 (0.09) | 282 | 300 | 1289 |
| OCC_titleSyntaxU | 740 (0.81) | 175 (0.19) | 239 (0.19) | 725 | 915 | 1289 |
| OCC_wikidata | 320 (0.33) | 645 (0.67) | 239 (0.19) | 432 | 965 | 1289 |
| Algorithm | truePosProp | falsePosProp | Precision | Recall | Total.true.codes |
|---|---|---|---|---|---|
| OCC | 0.7390940 | 0.2609060 | 0.7390940 | 0.7894265 | 1289 |
| OCC_FsT_nobreaks | 0.7383573 | 0.2616427 | 0.7383573 | 0.7841727 | 1289 |
| OCC_fullBody | 0.2795244 | 0.7204756 | 0.2795244 | 0.8350434 | 1289 |
| OCC_syntax | 0.8295455 | 0.1704545 | 0.8295455 | 0.6840696 | 1289 |
| OCC_title | 0.8349835 | 0.1650165 | 0.8349835 | 0.6875000 | 1289 |
| OCC_titleSyntaxI | 0.9000000 | 0.1000000 | 0.9000000 | 0.6887755 | 1289 |
| OCC_titleSyntaxU | 0.8087432 | 0.1912568 | 0.8087432 | 0.7558733 | 1289 |
| OCC_wikidata | 0.3316062 | 0.6683938 | 0.3316062 | 0.5724508 | 1289 |