Technical notes

The OCC algorithms with formal evaluations are OCC_title, OCC_titleSyntaxI, OCC_titleSyntaxU, OCC_syntax, OCC_FsT_nobreaks, OCC, OCC_fullBody, OCC_wikidata.

Summary statistics

The obituaries and coded OCCs

There were a total of 1000 hand-coded documents, with 1,302 total codes.
There were 116 true OCC codes, with a minimum frequency of 1 (there were 42 codes with this frequency) and a maximum frequency of 151 (001).
We’re analyzing 60,847 obituaries.

The following simply shows the number of obituaries per quarter in the entire dataset.

And the following over the course of a couple years, 2011-2012:

Here’s a histogram of the frequency of OCC codes in the hand-coded set.

The following table shows the frequency of each OCC code in the hand-coded set. This table excludes the codes which account for less than 1% of documents.

	OCC.code	Total.true.codes
581	001	151
639	220	105
660	285	100
654	275	80
652	272	52
650	270	49
599	043	48
691	s043	44
637	210	39
666	306	30
635	204	29
651	271	29
657	281	28
648	260	27
631	186	24
689	980	23
594	023	22
584	003	21
583	002	20
659	283	19
653	274	18
656	280	18
588	012	17
694	s220	16
638	211	15
621	161	14
655	276	13
582	001a	12
649	263	11
678	470	11
623	165	10
662	291	10

The various algorithms

Brief algorithm descriptions

OCC

This algorithm looks in both the title and first sentence. It first removes any instances of the subject’s name and converts the whole sentence to lower case. It then uses a dictionary of terms, allowing a 1-word “buffer” for 2-word terms, and a 2-word buffer for 3-word terms, ignoring order. For example, the term “marketing assistant” would match with “assistant of marketing activities” but not “assistant to the marketing department”. We ignore terms which are subsets of other matched terms, for instance ignoring the “president” in “vice president”.
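The buffered matching described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `term_matches`, the whitespace tokenization, and the reading of "buffer" as extra words allowed inside a sliding window are all assumptions.

```python
def term_matches(term: str, text: str) -> bool:
    """Return True if every word of `term` occurs, in any order, inside a
    window of `text` holding at most `buffer` extra words (assumed reading)."""
    term_words = term.lower().split()
    tokens = text.lower().split()
    # Buffer sizes from the description: 1 for 2-word terms, 2 for 3-word terms.
    buffer = {2: 1, 3: 2}.get(len(term_words), 0)
    window = len(term_words) + buffer
    for start in range(len(tokens)):
        chunk = tokens[start:start + window]
        if all(word in chunk for word in term_words):
            return True
    return False


print(term_matches("marketing assistant", "assistant of marketing activities"))
print(term_matches("marketing assistant", "assistant to the marketing department"))
```

The first call finds both words within a three-word window, while the second has two words between them and fails. Punctuation handling and the removal of sub-terms such as “president” inside “vice president” are left out of this sketch.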

OCC_FsT_nobreaks

This algorithm is the same as OCC, except it does not allow any buffer words.

OCC_fullBody

This algorithm looks in the full body, excluding the title, using our curated dictionary. It does not allow buffer words, to improve performance (there are already far too many false positives).

OCC_syntax

This one isolates a specific grammatical construction which occurs often in the first sentence of the obituary. That is the appositional modifier, or “APPOS”. The appos is a noun which immediately follows another, typically within a comma-delimited phrase. In the example below, “Archbishop” is the appositional modifier of James Peter Davis.

James Peter Davis, Archbishop of Santa Fe from 1964 to 1974, died Friday.

The algorithm then looks this word (or words, if it’s a compound noun) up in our dictionary and records a match.

OCC_title

This algorithm looks exclusively in the description field of the title, which automatically excludes the obituarized’s name. It again uses our curated dictionary.

OCC_wikidata

If there is a Wikidata entry which matches the obituarized’s name exactly, we look at all occupations (P106) of the individual, and match the labels for these occupations against our dictionary. We then code them as ALL of these occupations, as well as any superclass (P279) of each occupation. For example, Martin Luther King Jr. (Q8027) is listed as having the occupation preacher (Q432386), which is a subclass of religious servant (Q4504549), which in turn is a subclass of cleric (Q2259532). After this the superclasses get much more general, and typically nonoccupational (e.g. believer); these are filtered out.
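The superclass expansion can be sketched on a toy graph. This is only an illustration: the graph contains just the chain from the example, the link from cleric to believer and the stop-list `NON_OCCUPATIONAL` are assumptions, and no real Wikidata query is made.

```python
# Toy subclass-of (P279) graph built from the example in the text.
# The cleric -> believer link is an illustrative assumption.
SUBCLASS_OF = {
    "preacher": ["religious servant"],
    "religious servant": ["cleric"],
    "cleric": ["believer"],
}
# Hypothetical stop-list of labels considered non-occupational.
NON_OCCUPATIONAL = {"believer"}


def expand_occupations(occupations):
    """Return the given occupations plus all their superclasses,
    skipping (and not traversing past) filtered labels."""
    seen = set()
    stack = list(occupations)
    while stack:
        label = stack.pop()
        if label in seen or label in NON_OCCUPATIONAL:
            continue
        seen.add(label)
        stack.extend(SUBCLASS_OF.get(label, []))
    return seen


print(sorted(expand_occupations(["preacher"])))
```

Each label in the returned set would then be matched against the dictionary, as the paragraph above describes.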

OCC_titleSyntaxI and OCC_titleSyntaxU

These are just the intersection (OCC_titleSyntaxI) and union (OCC_titleSyntaxU) of the codes from OCC_title and OCC_syntax, the two highest-precision and most straightforward uses of our vocabulary.

Algorithm performance

The table below shows some summary statistics regarding the different algorithms, giving a high-level view of how well they performed.

truePos counts the number of true codes which were correctly guessed
falsePos counts the number of machine-assigned codes which were incorrect
falseNeg counts the number of true codes which the machine missed
Precision is the proportion of machine-assigned codes which were correct, truePos / (truePos + falsePos)
Recall is computed in the tables below as truePos / (truePos + falseNeg)

Algorithm	truePos	falsePos	falseNeg	NobitsCoded	Total.machine.codes	Total.true.codes
OCC	837 (0.72)	332 (0.28)	291 (0.22)	849	1169	1302
OCC_FsT_nobreaks	827 (0.71)	331 (0.29)	297 (0.23)	845	1158	1302
OCC_fullBody	989 (0.27)	2612 (0.73)	291 (0.22)	979	3601	1302
OCC_syntax	478 (0.78)	132 (0.22)	280 (0.22)	531	610	1302
OCC_title	473 (0.8)	119 (0.2)	270 (0.21)	548	592	1302
OCC_titleSyntaxI	256 (0.87)	38 (0.13)	139 (0.11)	282	294	1302
OCC_titleSyntaxU	695 (0.77)	213 (0.23)	295 (0.23)	725	908	1302
OCC_wikidata	282 (0.27)	755 (0.73)	280 (0.22)	432	1037	1302

The numbers in parentheses in the truePos and falsePos columns give the proportion of all machine-assigned codes (Total.machine.codes) which were correct and incorrect, respectively. The number in parentheses in the falseNeg column gives the proportion of the true codes (Total.true.codes) which were missed by machine coding.

Algorithm	truePosProp	falsePosProp	Precision	Recall	Total.true.codes
OCC	0.7159966	0.2840034	0.7159966	0.7420213	1302
OCC_FsT_nobreaks	0.7141623	0.2858377	0.7141623	0.7357651	1302
OCC_fullBody	0.2746459	0.7253541	0.2746459	0.7726562	1302
OCC_syntax	0.7836066	0.2163934	0.7836066	0.6306069	1302
OCC_title	0.7989865	0.2010135	0.7989865	0.6366083	1302
OCC_titleSyntaxI	0.8707483	0.1292517	0.8707483	0.6481013	1302
OCC_titleSyntaxU	0.7654185	0.2345815	0.7654185	0.7020202	1302
OCC_wikidata	0.2719383	0.7280617	0.2719383	0.5017794	1302
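The Precision and Recall columns above are consistent with the usual formulas applied to the counts in the first table. The sketch below checks this for the first row of counts (837, 332, 291), which corresponds to OCC; the formulas are inferred from the reported numbers rather than taken from the original code, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Counts from the first row of the summary table.
precision, recall = precision_recall(837, 332, 291)
print(round(precision, 7), round(recall, 7))
```

Note that the recall denominator here is truePos + falseNeg (1128 for this row), not Total.true.codes (1302).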

[not started] Typical performance on the same task in other work

Fine-grained accuracy of our best algorithm

We check the success of our machine coding of OCC against the set of 1000 obituaries which we have already carefully hand-coded.

It’s interesting to look at our overall accuracy on just the more common codes, as it’s evident in the figure above that those had the best performance.

What patterns are present in the New York Times?

We focus here on those occupations which are correctly identified more than 90% of the time. That is, more than 90% of the machine’s guesses of the given occupation are true positives. We also won’t consider codes that are exceedingly rare (< 1% of all those hand-coded). This narrows to the following:

	OCC.code	truePos	falsePos	falseNeg	NobitsCoded	Total.machine.codes	Total.true.codes	Pop.Prop.Guess	Pop.Prop.True
639	220	47	4	51	849	51	105	0.0600707	0.105
637	210	32	3	3	849	35	39	0.0412250	0.039
651	271	13	1	11	849	14	29	0.0164900	0.029
657	281	24	2	4	849	26	28	0.0306243	0.028
631	186	19	0	5	849	19	24	0.0223793	0.024
659	283	17	1	2	849	18	19	0.0212014	0.019
653	274	18	1	0	849	19	18	0.0223793	0.018
621	161	11	1	1	849	12	14	0.0141343	0.014
678	470	6	0	2	849	6	11	0.0070671	0.011
662	291	8	0	1	849	8	10	0.0094229	0.010

To explore the possibility that the trend, or lack of one, could be related to the differing numbers of obituaries over time, we can look at the proportion of documents coded in that time period with that code.

The following plot asks the simple question of whether the proportion we code is variable over time. This could indicate some time-bias in our coding algorithm.

Corrections via Hopkins and King (2010) analysis

NOTE: Although the method below is quite close, it rests on a faulty assumption.

In particular, for disjoint events \(A, B, C\) such that \(A\cup B\cup C = E\), the whole sample space, it’s true that \(P(X\wedge A)+P(X\wedge B)+P(X\wedge C)=P(X)\), as assumed below. But crucially, the events \(\{j'\in\hat{D}\}\) are not disjoint, because \(\hat{D}\) often contains multiple codes. This needs to be corrected.

It seems the correction is to amend our expansion. We can still do everything else (although it won’t be a simple matrix multiplication and will probably take me some time…)

\[ \begin{split} P(j\in D) &= \sum_{k} P(j\in D | k\in\hat{D}) P(k\in\hat{D}) \\ &- \sum_{k_1 < k_2} P(j\in D | k_1\in\hat{D} \wedge k_2\in\hat{D} ) P(k_1\in\hat{D} \wedge k_2\in\hat{D}) \\ &+ \sum_{k_1 < k_2 < k_3} P(j\in D | k_1\in\hat{D} \wedge k_2\in\hat{D} \wedge k_3\in\hat{D} ) P(k_1\in\hat{D} \wedge k_2\in\hat{D} \wedge k_3\in\hat{D}) \\ &- \cdots \end{split} \]
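A small numeric check of this expansion, on invented toy data, shows the overcount from the naive sum and how the alternating terms remove it. The documents, codes, and function names below are hypothetical, and the sketch assumes every document receives at least one machine code (otherwise the expansion only recovers the probability that j is in D and some code was assigned). Each term is computed as a joint probability, which equals the conditional times the marginal.

```python
from itertools import combinations

# Hypothetical toy data: (true codes D, machine codes D_hat) per document.
docs = [
    ({"220"}, {"220", "s220"}),
    ({"220"}, {"220"}),
    ({"210"}, {"210", "220"}),
    ({"210"}, {"210"}),
]


def joint(j, ks):
    """Estimate P(j in D and every k in ks is in D_hat) by counting."""
    hits = sum(1 for d, d_hat in docs if j in d and set(ks) <= d_hat)
    return hits / len(docs)


def naive(j, codes):
    """First-order sum only; overcounts when D_hat holds several codes."""
    return sum(joint(j, (k,)) for k in codes)


def inclusion_exclusion(j, codes):
    """Alternating sum over all non-empty subsets of machine codes."""
    total = 0.0
    for size in range(1, len(codes) + 1):
        sign = 1 if size % 2 == 1 else -1
        total += sign * sum(joint(j, ks) for ks in combinations(codes, size))
    return total


codes = sorted(set().union(*(d_hat for _, d_hat in docs)))
print(naive("220", codes), inclusion_exclusion("220", codes))
```

In this toy set code 220 is a true code in half of the documents; the naive sum gives 0.75 because the first document is counted once for 220 and once for s220, while the alternating sum gives 0.5.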

Summary of the method

What follows is a quick summary of the method used by Hopkins and King (2010). They surmise that misclassifications are systematic. That is, if we observe in our hand-coded ‘golden’ test set that 17% of the people we code as doctors are actually lawyers, this rate will be similar in the larger set (or in arbitrary subsets). We can use this assumption, that true occupation gives some information about the probabilities of misclassification, when computing population proportions. This can be seen through decomposing probabilities.

Let \(j\) represent an occupation, let \(D\) be the set of true occupations, and \(\hat{D}\) be the set of occupations our algorithm codes. Then \(P(j\in D) = \sum_{j'} P(j\in D | j'\in\hat{D}) P(j'\in\hat{D})\). We will then estimate \(P(j\in D | j'\in\hat{D})\) from our hand-coded data, assuming that these conditional probabilities are roughly constant over the set of obituaries. This gives us an estimate of the true proportions \(P(j\in D)\), for the population or any subset of it, from the machine-coded proportions \(P(j'\in\hat{D})\).
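The decomposition above can be sketched for the simplest case, where each document has exactly one true code and one machine code. The data and function names below are invented for illustration; this is not the code used to produce the results that follow.

```python
from collections import Counter, defaultdict

# Hypothetical hand-coded pairs (true code, machine code), one code per document.
hand_coded = [("doctor", "doctor"), ("doctor", "doctor"), ("lawyer", "doctor"),
              ("lawyer", "lawyer"), ("lawyer", "lawyer")]
# Hypothetical machine codes for the larger, uncoded population.
population = ["doctor"] * 6 + ["lawyer"] * 4


def conditional_table(pairs):
    """Estimate P(true = j | machine = k) from the hand-coded pairs."""
    by_machine = defaultdict(Counter)
    for true, machine in pairs:
        by_machine[machine][true] += 1
    return {k: {j: n / sum(c.values()) for j, n in c.items()}
            for k, c in by_machine.items()}


def adjusted_proportions(pairs, machine_codes):
    """Sum P(true = j | machine = k) * P(machine = k) over machine codes k."""
    cond = conditional_table(pairs)
    raw = Counter(machine_codes)
    total = len(machine_codes)
    adjusted = Counter()
    for k, n in raw.items():
        # Machine codes never seen in the hand-coded set contribute nothing.
        for j, p in cond.get(k, {}).items():
            adjusted[j] += p * n / total
    return dict(adjusted)


print(adjusted_proportions(hand_coded, population))
```

Here one third of the machine's "doctor" codes are really lawyers in the hand-coded set, so the raw proportions of 0.6 doctors and 0.4 lawyers are adjusted to roughly 0.4 and 0.6.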

Results of adjustments

I’ve used this method to adjust our coded population proportions. Here’s a small sample of the modified population proportions:

OCC	Modified	Raw
001	0.2055	0.2429
001a	0.0211	0.0000
002	0.0281	0.0113
003	0.0331	0.0284
004	0.0052	0.0069
005	0.0021	0.0010
006	0.0058	0.0010
012	0.0264	0.0229
013	0.0016	0.0005
021	0.0068	0.0042

And those which changed the most through this procedure:

	OCC	Modified	Raw
62	220	0.1512	0.0629
126	s043	0.0580	0.1437
129	s220	0.0260	0.0844
1	001	0.2055	0.2429
18	043	0.0851	0.0545

There are some caveats to this method. First, using it gives zero proportion for many actually coded categories.

There are an astounding 148 codes attributed in the entire set which don’t show up in our hand-coding set at all, either in the true or attributed codes. There are even 8 codes which we use in the hand-coding set which never show up in the machine-coding of the larger set. Thus we can only use this method to infer population proportions for the most prevalent occupations. The second main caveat is that we don’t know whether these mis-codings were systematic. And even if they are systematic, we’re not sure if our estimated \(P(j\in D | j'\in\hat{D})\) should be consistent when selecting subsets on covariates, particularly by time.

Can we get immediate improvements by combining OCCs into larger supergroups?

We will now combine reasonably similar groups of OCCs into the same group by the following specification:

Group	OCC
CEO	1, 001a, 1a
ADMIN	2, 4-42
LEGISLATOR	3
DIPLOMAT	43, s043
BIZOP	50-73
FINANCE	80-95
MATH	100-124
ARCHITECT	130, 131, s130
ENGINEER	132-156
SCIENTIST	160-196, s160
COUNSELOR	200-202
CLERGY	204, 205, 206
LAWYER	210, 214, 215
JUDGE	211
PROF	220, s220
TEACHER	230-234
EDUC OTHER	240-255
ARTIST	260
ACTOR	270
DIRECTOR	271
ATHLETE	272
DANCER	274
MUSICIAN	275
CLOWN	276
ANNOUNCER	280
NEWS	281-283
PHOTO	291, 292
AUTHOR	284, 285
ARTS OTHER	263, 286-296
DOCTOR	300-307, 312, s300
NURSE	311, 313-355, 360-365
COP	370-395
SALES	470-496
SECRETARY	500-593
FARM	600-613
BLUE COLLAR	620-975
PERSONAL	400-465
MILITARY	980, 981, 982, s980
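A specification like this can be turned into a lookup from OCC code to supergroup. The sketch below uses a subset of the rows above and makes several assumptions that are not stated in the source: leading zeros are ignored when comparing codes, a range such as 4-42 covers only plain numeric codes (prefixed or suffixed variants must be listed explicitly), and when two groups list the same code the first one wins. The names `normalize`, `build_lookup`, and `supergroup` are hypothetical.

```python
import re

# A subset of the specification above.
SPEC = {
    "CEO": "1, 001a, 1a",
    "ADMIN": "2, 4-42",
    "LEGISLATOR": "3",
    "DIPLOMAT": "43, s043",
    "PROF": "220, s220",
    "MILITARY": "980, 981, 982, s980",
}


def normalize(code: str) -> str:
    """Drop leading zeros from the numeric part; keep an s-prefix or letter suffix."""
    match = re.fullmatch(r"(s?)0*(\d+)([a-z]?)", code.strip())
    if match is None:
        raise ValueError(f"unrecognized code: {code!r}")
    return "".join(match.groups())


def build_lookup(spec):
    """Expand each comma-separated item (single code or low-high range)."""
    lookup = {}
    for group, items in spec.items():
        for item in items.split(","):
            item = item.strip()
            if "-" in item:
                low, high = (int(part) for part in item.split("-"))
                codes = [str(n) for n in range(low, high + 1)]
            else:
                codes = [normalize(item)]
            for code in codes:
                lookup.setdefault(code, group)  # assumed: first group listed wins
    return lookup


LOOKUP = build_lookup(SPEC)


def supergroup(code: str):
    """Return the supergroup for an OCC code, or None if it is not covered."""
    return LOOKUP.get(normalize(code))


print(supergroup("001"), supergroup("s043"), supergroup("023"))
```

Codes outside the chosen subset, such as 600, return None here; with the full specification they would fall into their listed groups.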

Performance of algorithms

I’ll reproduce here exactly the same tables of summary statistics for comparison.

Algorithm	truePos	falsePos	falseNeg	NobitsCoded	Total.machine.codes	Total.true.codes
OCC	881 (0.74)	311 (0.26)	235 (0.18)	849	1192	1289
OCC_FsT_nobreaks	872 (0.74)	309 (0.26)	240 (0.19)	845	1181	1289
OCC_fullBody	1058 (0.28)	2727 (0.72)	209 (0.16)	979	3785	1289
OCC_syntax	511 (0.83)	105 (0.17)	236 (0.18)	531	616	1289
OCC_title	506 (0.83)	100 (0.17)	230 (0.18)	548	606	1289
OCC_titleSyntaxI	270 (0.9)	30 (0.1)	122 (0.09)	282	300	1289
OCC_titleSyntaxU	740 (0.81)	175 (0.19)	239 (0.19)	725	915	1289
OCC_wikidata	320 (0.33)	645 (0.67)	239 (0.19)	432	965	1289

Algorithm	truePosProp	falsePosProp	Precision	Recall	Total.true.codes
OCC	0.7390940	0.2609060	0.7390940	0.7894265	1289
OCC_FsT_nobreaks	0.7383573	0.2616427	0.7383573	0.7841727	1289
OCC_fullBody	0.2795244	0.7204756	0.2795244	0.8350434	1289
OCC_syntax	0.8295455	0.1704545	0.8295455	0.6840696	1289
OCC_title	0.8349835	0.1650165	0.8349835	0.6875000	1289
OCC_titleSyntaxI	0.9000000	0.1000000	0.9000000	0.6887755	1289
OCC_titleSyntaxU	0.8087432	0.1912568	0.8087432	0.7558733	1289
OCC_wikidata	0.3316062	0.6683938	0.3316062	0.5724508	1289

Occupations in the New York Times obituaries

Alec McGail