A common problem in contingency table analysis is to find relationships between variables in an \(I \times J\) contingency table, where either \(I\) or \(J\), or both, are greater than 2. We partition, or decompose, a \(5 \times 2\) contingency table of psychiatric patients cross-classified as to their diagnostic category and whether they were prescribed drugs. We will discover that for some diagnostic groups, psychiatrists prescribe drugs more often than not. For other groups they prescribe, or do not prescribe drugs, about equally. And for one group, psychiatrists are unlikely to prescribe drugs.
We are presented with a \(5 \times 2\) contingency table presented in Alan Agresti’s Categorical Data Analysis (1990, 2nd ed.). The problem is to partition, or decompose, the table in a statistically rigorous way to “describe similarities and differences among the diagnoses in terms of the relative frequencies of the prescribed drugs,” (Agresti, page 72). The decomposition involves the partitioning of the contingency table and its corresponding Likelihood Ratio Chi-Square statistic, LR \(\chi^{2}\), into orthogonal, additive components (Agresti, pp 50-54). The advantage to partitioning a contingency table into orthogonal components is that independent inferences can be drawn for each component involved in the partitioning. “A [correct] partitioning may show than an association primarily reflects differences between certain categories or groupings of categories,” (Agresti, page 50). Rules for partitioning the table are provided in the Appendix.
We partition (decompose) a \(5 \times 2\) contingency table with data from Alan Agresti’s Categorical Data Analysis (1990, 2nd ed.) – Table 3.10, Problem 3.6, page 72. The echo = TRUE option has been retained so that the reader may easily reproduce the data. The data are displayed with marginal sums added to the table.
Ag.3.10.table.entries <- c(105,12,18,47,0,8,2,19,52,13)
Ag.3.10.table <- as.table(matrix(Ag.3.10.table.entries, nrow = 5, byrow = FALSE, dimnames = list(Diagnosis = c('Schizophrenia', 'Affective.Disorder', 'Neurosis', 'Personality.Disorder', 'Special.Symptoms'), Drugs.Rx = c('Yes', 'No'))))
addmargins(Ag.3.10.table) # add marginal sums to table
## Drugs.Rx
## Diagnosis Yes No Sum
## Schizophrenia 105 8 113
## Affective.Disorder 12 2 14
## Neurosis 18 19 37
## Personality.Disorder 47 52 99
## Special.Symptoms 0 13 13
## Sum 182 94 276
Two hundred seventy-six (276) psychiatric patients were cross classified as to their diagnosis in one of five psychiatric groups: (1) Schizophrenia, (2) Affective Disorder, (3) Neurosis, (4) Personality Disorder, and (5) Special Symptoms and as to whether (or not) they were prescribed drugs in their treatment regimens.
We examine the relationship between the patients’ diagnostic class Diagnosis and whether or not drugs were prescribed Drugs.Rx. We use the Likelihood Ratio \(\chi^{2}\) statistic to test for independence between Diagnosis and Drugs.Rx. Independent partitionings of \(\chi^{2}\) have the property that their LR values and degrees of freedom are additive (Agresti, 1990, pp 50-51). One way to do this is with the loglm() function from the MASS package. First, load the MASS package (Venables and Ripley, 2002). Next, perform a ‘global’ \(\chi^{2}\) test for the hypothesis of independence (no association) between Diagnosis and Drugs.Rx. The null hypothesis for this contingency table states that Diagnosis and Drugs.Rx are independent.
library(MASS) # Venables and Ripley
Ag.3.10.global.loglm <- loglm( ~ Diagnosis + Drugs.Rx, data = Ag.3.10.table)
Ag.3.10.global.loglm
## Call:
## loglm(formula = ~Diagnosis + Drugs.Rx, data = Ag.3.10.table)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 96.54 4 0
## Pearson 84.19 4 0
We reject the null hypothesis that Diagnosis and Drugs.Rx are independent; the LR \(\chi^{2}\) value is 96.54 on 4 df, p << 0.0001. [Note: Ignore the Pearson \(\chi^{2}\) value in these analyses.]
The loglinear analysis reveals a strong relationship between Diagnosis and Drugs.Rx. However, we wish to ascertain more specifically which diagnostic categories, or groupings of diagnostic categories, account for the relationship. We partition (decompose) the table in a statistically rigorous way to “describe similarities and differences among the diagnoses in terms of the relative frequencies of the prescribed drugs,” (Agresti, page 72). The decomposition involves the partitioning of the contingency table and its corresponding Likelihood Ratio \(\chi^{2}\) statistic into independent (orthogonal), additive components (Agresti, pp 50-54). The advantage to this is that independent inferences can be drawn for each component involved in the partitioning. “A [correct] partitioning may show that an association primarily reflects differences between certain categories or groupings of categories,” (Agresti, page 50). Rules for partitioning the table are provided in Agresti (page 53).
We now search for subtables that might be homogeneous, and which can then be combined (collapsed). For example, identify two rows of this table (i.e., two psychiatric diagnostic groups) that appear to have comparable proportions (percentages) of cases classified as Yes, or alternatively as No. Prepare a table comprised of percentages that, for each diagnostic group, sum to 100% across the two categories of whether or not drugs were prescribed (Yes or No).
# compute proportions across columns
Ag.3.10.prop.mar.1.table <- prop.table(Ag.3.10.table, margin = 1)
# transform to percentages
Ag.3.10.percent.mar.1.table <- 100*Ag.3.10.prop.mar.1.table
# present ‘percentages’ table
round(Ag.3.10.percent.mar.1.table, 1) # Percentage
## Drugs.Rx
## Diagnosis Yes No
## Schizophrenia 92.9 7.1
## Affective.Disorder 85.7 14.3
## Neurosis 48.6 51.4
## Personality.Disorder 47.5 52.5
## Special.Symptoms 0.0 100.0
For Neurosis, 48.6% of patients were prescribed drugs (Yes) while 51.4% were not (No). For Personality.Disorder, 47.5% were prescribed drugs while 52.5% were not. The percentages of patients who were prescribed drugs for Neurosis (48.6%) and Personality.Disorder (47.5%) appear to be comparable. From the original \(5 \times 2\) table of observed frequencies, extract the following \(2 \times 2\) subtable.
Ag.3.10.rows.34.table <- as.table(Ag.3.10.table[3:4,])
Ag.3.10.rows.34.table
## Drugs.Rx
## Diagnosis Yes No
## Neurosis 18 19
## Personality.Disorder 47 52
We now perform a \(\chi^{2}\) test for independence between Diagnosis and Drugs.Rx for these two diagnostic classes alone.
Ag.3.10.rows.34.loglm <- loglm( ~ Diagnosis + Drugs.Rx, data = Ag.3.10.rows.34.table)
Ag.3.10.rows.34.loglm
## Call:
## loglm(formula = ~Diagnosis + Drugs.Rx, data = Ag.3.10.rows.34.table)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 0.01487 1 0.9029
## Pearson 0.01488 1 0.9029
The LR \(\chi^{2}\) = 0.015 on 1 df, \(p = 0.90\). Therefore, we do not reject the null hypothesis. We conclude that Diagnosis and Drugs.Rx are independent (homogeneous) for these two diagnostic classes.
Now continue the search for other homogeneous patterns in the original \(5 \times 2\) table. Extract the \(2 \times 2\) subtable considering only those cases associated with Schizophrenia and Affective.Disorder and again test for independence.
## Create rows 1 & 2 subtable
Ag.3.10.rows.12.table <- as.table(Ag.3.10.table[1:2,])
Ag.3.10.rows.12.table # Observed
## Drugs.Rx
## Diagnosis Yes No
## Schizophrenia 105 8
## Affective.Disorder 12 2
## Test for independence in this 2 x 2 sub-table
Ag.3.10.rows.12.loglm <- loglm( ~ Diagnosis + Drugs.Rx, data = Ag.3.10.rows.12.table)
Ag.3.10.rows.12.loglm
## Call:
## loglm(formula = ~Diagnosis + Drugs.Rx, data = Ag.3.10.rows.12.table)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 0.7530 1 0.3855
## Pearson 0.8917 1 0.3450
The LR \(\chi^{2}\) = 0.75 on 1 df, \(p = 0.39\). Do not reject the null hypothesis of independence.
We have identified two \(2 \times 2\) subtables from the original that are homogeneous, one for Neurosis and Personality.Disorder and one for Schizophrenia and Affective.Disorder. When this occurs the counts in the subtable can be combined or collapsed, i.e., summed over its margins, without loss of information. The original \(5 \times 2\) table can now be collapsed (combined) into a \(3 \times 2\) table.
apply(Ag.3.10.table[1:2,], 2, sum)
## Yes No
## 117 10
apply(Ag.3.10.table[3:4,], 2, sum)
## Yes No
## 65 71
Ag.3.10.table[5,]
## Yes No
## 0 13
From the results above, construct the next table.
Ag.3.10.collapsed.entries <- c(117, 65, 0, 10, 71, 13)
Ag.3.10.collapsed.mat <- matrix(Ag.3.10.collapsed.entries, nrow = 3, byrow = FALSE, dimnames = list(Diagnosis = c('Schiz.or.Aff.Dis', 'Neur.or.Pers.Dis', 'Special.Symptoms'), Drugs.Rx = c('Yes', 'No')))
Ag.3.10.collapsed.table <- as.table(Ag.3.10.collapsed.mat)
Ag.3.10.collapsed.table # Observed
## Drugs.Rx
## Diagnosis Yes No
## Schiz.or.Aff.Dis 117 10
## Neur.or.Pers.Dis 65 71
## Special.Symptoms 0 13
The observed counts for Schizophrenia and Affective.Disorder are now combined into a single category now labeled Schiz.or.Aff.Dis. Similarly, the observed counts for Neurosis and Personality.Disorder are combined into a single category now labeled Neur.or.Pers.Dis. Since the counts associated with Special.Symptoms have not been used in a previous sub-table, they are retained in the table here.
Test for independence between Diagnosis and Drugs.Rx in the combined table. Actually, we are less concerned with independence here; we compute the LR \(\chi^{2}\) statistic to perform certain checks in our logic and to complete the steps for the partitioning.
Ag.3.10.collapsed.loglm <- loglm( ~ Diagnosis + Drugs.Rx, data = Ag.3.10.collapsed.table)
Ag.3.10.collapsed.loglm
## Call:
## loglm(formula = ~Diagnosis + Drugs.Rx, data = Ag.3.10.collapsed.table)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 95.77 2 0
## Pearson 83.88 2 0
As the LR \(\chi^{2}\) = 95.77 on 2 df, we reject the null hypothesis for independence, \(p \lt 0.0001\).
Summarizing to this point:
with respect to the original table, Diagnosis and Drugs.Rx are not independent (not homogeneous);
patients diagnosed with either Schizophrenia or Affective Disorder are homogeneous;
patients diagnosed with Neurosis or Personality Disorder are homogeneous; and
from the combined (collapsed) table, Diagnosis and Drugs.Rx are not homogeneous.
When the partitioning is performed correctly, the LR \(\chi^{2}\) values of the sub-tables sum, exactly, to the LR \(\chi^{2}\) value for the original table (Agresti, 1990, pp. 50-51). Similarly, the degrees of freedom associated with each test sum to the degrees of freedom associated with the test from the original table.
Add the three LR \(\chi^{2}\) values associated with the three sub-tables . . .
Ag.3.10.rows.12.loglm$lr + Ag.3.10.rows.34.loglm$lr + Ag.3.10.collapsed.loglm$lr
## [1] 96.54
… and compare to LR \(\chi^{2}\) value for the original \(5 \times 2\) table, on 4 degrees of freedom.
Ag.3.10.global.loglm$lr
## [1] 96.54
Perform a check to determine whether the two quantities are equivalent to 12 decimal places.
# check accuracy to 12 decimal places
round(Ag.3.10.rows.12.loglm$lr + Ag.3.10.rows.34.loglm$lr + Ag.3.10.collapsed.loglm$lr, 12) == round(Ag.3.10.global.loglm$lr, 12)
## [1] TRUE
They are equal. Also, the degrees of freedom for each component are, respectively, 1, 1, and 2, which sum to 4 degrees of freedom associated with the original table.
Psychiatric patients were relatively more or less likely to be prescribed drugs depending on their respective diagnoses. Patients diagnosed with Schizophrenia or Affective Disorder were more likely to be prescribed drugs than not (92% vs. 8%). Patients diagnosed with Neurosis or Personality Disorder, were about equally likely to be prescribed drugs or not (48% vs. 52%). Patients with Special Symptoms were not likely to be prescribed drugs; in fact, no drugs were prescribed for these patients in this sample (0.0% vs. 100.0%).
The Likelihood Ratio (LR) Chi-Square test value has the form
\[ \mbox{LR} \hspace{3 mm} \chi^{2} = 2 \cdot \sum_{I} \sum_{J} O_{ij} \cdot log (\frac{O_{ij}}{E_{ij}} ) \]
where \(O_{ij}\) and \(E_{ij}\) are Observed and Expected values, \(log\) denotes the natural (Naperian) logarithm, and the symbol \(\sum_{I} \sum_{J}\) indicates summation over the (i,j) cells in a given table, \(i = 1, ..., I\) (number of rows in the table) and \(j = 1, ..., J\) (number of columns in the table).
The Pearson Chi-Squared test value has the form
\[ \mbox{Pearson} \hspace{3 mm} \chi^{2} = \sum_{I} \sum_{J} \frac {(O_{ij} - E_{ij})^{2}}{E_{ij}} \]
where \(O_{ij}\) and \(E_{ij}\) are Observed and Expected values, and the symbol \(\sum_{I} \sum_{J}\) indicates summation over the (i,j) cells in a given table, \(i = 1, ..., I\) (number of rows in the table) and \(j = 1, ..., J\) (number of columns in the table).
Here is a brief list of some of the Rules for Partitioning the contingency table (Agresti, page 53):
The degrees of freedom for the sub-tables must sum to the degrees of freedom for the original table.
Each cell count in the original table must be a cell count in one and only one sub-table.
Each marginal total of the original table must be a marginal total for one and only one sub-table.
Agresti A, (1990) Categorical Data Analysis, 2nd ed., Wiley, New York.
Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/
Venables WN & Ripley BD (2002) Modern Applied Statistics with S Fourth Edition. Springer, New York. ISBN 0-387-95457-0
Fox J & Weisberg S (2011). An {R} Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion