Abstract

A common problem in contingency table analysis is to find relationships between variables in an \(I \times J\) contingency table, where either \(I\) or \(J\), or both, are greater than 2. We partition, or decompose, a \(5 \times 2\) contingency table of psychiatric patients cross-classified as to their diagnostic category and whether they were prescribed drugs. We will discover that for some diagnostic groups, psychiatrists prescribe drugs more often than not. For other groups they prescribe, or do not prescribe drugs, about equally. And for one group, psychiatrists are unlikely to prescribe drugs.

Introduction

We are presented with a \(5 \times 2\) contingency table presented in Alan Agresti’s Categorical Data Analysis (1990, 2nd ed.). The problem is to partition, or decompose, the table in a statistically rigorous way to “describe similarities and differences among the diagnoses in terms of the relative frequencies of the prescribed drugs,” (Agresti, page 72). The decomposition involves the partitioning of the contingency table and its corresponding Likelihood Ratio Chi-Square statistic, LR \(\chi^{2}\), into orthogonal, additive components (Agresti, pp 50-54). The advantage to partitioning a contingency table into orthogonal components is that independent inferences can be drawn for each component involved in the partitioning. “A [correct] partitioning may show than an association primarily reflects differences between certain categories or groupings of categories,” (Agresti, page 50). Rules for partitioning the table are provided in the Appendix.

Analysis

We partition (decompose) a \(5 \times 2\) contingency table with data from Alan Agresti’s Categorical Data Analysis (1990, 2nd ed.) – Table 3.10, Problem 3.6, page 72. The echo = TRUE option has been retained so that the reader may easily reproduce the data. The data are displayed with marginal sums added to the table.

Ag.3.10.table.entries <- c(105,12,18,47,0,8,2,19,52,13)
Ag.3.10.table <- as.table(matrix(Ag.3.10.table.entries, nrow = 5, byrow = FALSE, dimnames = list(Diagnosis = c('Schizophrenia', 'Affective.Disorder', 'Neurosis', 'Personality.Disorder', 'Special.Symptoms'), Drugs.Rx = c('Yes', 'No'))))
addmargins(Ag.3.10.table) # add marginal sums to table
##                       Drugs.Rx
## Diagnosis              Yes  No Sum
##   Schizophrenia        105   8 113
##   Affective.Disorder    12   2  14
##   Neurosis              18  19  37
##   Personality.Disorder  47  52  99
##   Special.Symptoms       0  13  13
##   Sum                  182  94 276

Two hundred seventy-six (276) psychiatric patients were cross classified as to their diagnosis in one of five psychiatric groups: (1) Schizophrenia, (2) Affective Disorder, (3) Neurosis, (4) Personality Disorder, and (5) Special Symptoms and as to whether (or not) they were prescribed drugs in their treatment regimens.

We examine the relationship between the patients’ diagnostic class Diagnosis and whether or not drugs were prescribed Drugs.Rx. We use the Likelihood Ratio \(\chi^{2}\) statistic to test for independence between Diagnosis and Drugs.Rx. Independent partitionings of \(\chi^{2}\) have the property that their LR values and degrees of freedom are additive (Agresti, 1990, pp 50-51). One way to do this is with the loglm() function from the MASS package. First, load the MASS package (Venables and Ripley, 2002). Next, perform a ‘global’ \(\chi^{2}\) test for the hypothesis of independence (no association) between Diagnosis and Drugs.Rx. The null hypothesis for this contingency table states that Diagnosis and Drugs.Rx are independent.

library(MASS) # Venables and Ripley
Ag.3.10.global.loglm <- loglm( ~ Diagnosis + Drugs.Rx, data = Ag.3.10.table) 
Ag.3.10.global.loglm
## Call:
## loglm(formula = ~Diagnosis + Drugs.Rx, data = Ag.3.10.table)
## 
## Statistics:
##                    X^2 df P(> X^2)
## Likelihood Ratio 96.54  4        0
## Pearson          84.19  4        0

We reject the null hypothesis that Diagnosis and Drugs.Rx are independent; the LR \(\chi^{2}\) value is 96.54 on 4 df, p << 0.0001. [Note: Ignore the Pearson \(\chi^{2}\) value in these analyses.]

The loglinear analysis reveals a strong relationship between Diagnosis and Drugs.Rx. However, we wish to ascertain more specifically which diagnostic categories, or groupings of diagnostic categories, account for the relationship. We partition (decompose) the table in a statistically rigorous way to “describe similarities and differences among the diagnoses in terms of the relative frequencies of the prescribed drugs,” (Agresti, page 72). The decomposition involves the partitioning of the contingency table and its corresponding Likelihood Ratio \(\chi^{2}\) statistic into independent (orthogonal), additive components (Agresti, pp 50-54). The advantage to this is that independent inferences can be drawn for each component involved in the partitioning. “A [correct] partitioning may show that an association primarily reflects differences between certain categories or groupings of categories,” (Agresti, page 50). Rules for partitioning the table are provided in Agresti (page 53).

We now search for subtables that might be homogeneous, and which can then be combined (collapsed). For example, identify two rows of this table (i.e., two psychiatric diagnostic groups) that appear to have comparable proportions (percentages) of cases classified as Yes, or alternatively as No. Prepare a table comprised of percentages that, for each diagnostic group, sum to 100% across the two categories of whether or not drugs were prescribed (Yes or No).

# compute proportions across columns
Ag.3.10.prop.mar.1.table <- prop.table(Ag.3.10.table, margin = 1)
 
# transform to percentages
Ag.3.10.percent.mar.1.table <- 100*Ag.3.10.prop.mar.1.table
 
# present ‘percentages’ table
round(Ag.3.10.percent.mar.1.table, 1) # Percentage
##                       Drugs.Rx
## Diagnosis                Yes    No
##   Schizophrenia         92.9   7.1
##   Affective.Disorder    85.7  14.3
##   Neurosis              48.6  51.4
##   Personality.Disorder  47.5  52.5
##   Special.Symptoms       0.0 100.0

For Neurosis, 48.6% of patients were prescribed drugs (Yes) while 51.4% were not (No). For Personality.Disorder, 47.5% were prescribed drugs while 52.5% were not. The percentages of patients who were prescribed drugs for Neurosis (48.6%) and Personality.Disorder (47.5%) appear to be comparable. From the original \(5 \times 2\) table of observed frequencies, extract the following \(2 \times 2\) subtable.

Ag.3.10.rows.34.table <- as.table(Ag.3.10.table[3:4,])
Ag.3.10.rows.34.table
##                       Drugs.Rx
## Diagnosis              Yes No
##   Neurosis              18 19
##   Personality.Disorder  47 52

We now perform a \(\chi^{2}\) test for independence between Diagnosis and Drugs.Rx for these two diagnostic classes alone.

Ag.3.10.rows.34.loglm <- loglm( ~ Diagnosis + Drugs.Rx, data = Ag.3.10.rows.34.table) 
Ag.3.10.rows.34.loglm
## Call:
## loglm(formula = ~Diagnosis + Drugs.Rx, data = Ag.3.10.rows.34.table)
## 
## Statistics:
##                      X^2 df P(> X^2)
## Likelihood Ratio 0.01487  1   0.9029
## Pearson          0.01488  1   0.9029

The LR \(\chi^{2}\) = 0.015 on 1 df, \(p = 0.90\). Therefore, we do not reject the null hypothesis. We conclude that Diagnosis and Drugs.Rx are independent (homogeneous) for these two diagnostic classes.

Now continue the search for other homogeneous patterns in the original \(5 \times 2\) table. Extract the \(2 \times 2\) subtable considering only those cases associated with Schizophrenia and Affective.Disorder and again test for independence.

## Create rows 1 & 2 subtable
Ag.3.10.rows.12.table <- as.table(Ag.3.10.table[1:2,])
Ag.3.10.rows.12.table # Observed
##                     Drugs.Rx
## Diagnosis            Yes  No
##   Schizophrenia      105   8
##   Affective.Disorder  12   2
## Test for independence in this 2 x 2 sub-table
Ag.3.10.rows.12.loglm <- loglm( ~ Diagnosis + Drugs.Rx, data = Ag.3.10.rows.12.table) 
Ag.3.10.rows.12.loglm
## Call:
## loglm(formula = ~Diagnosis + Drugs.Rx, data = Ag.3.10.rows.12.table)
## 
## Statistics:
##                     X^2 df P(> X^2)
## Likelihood Ratio 0.7530  1   0.3855
## Pearson          0.8917  1   0.3450

The LR \(\chi^{2}\) = 0.75 on 1 df, \(p = 0.39\). Do not reject the null hypothesis of independence.

We have identified two \(2 \times 2\) subtables from the original that are homogeneous, one for Neurosis and Personality.Disorder and one for Schizophrenia and Affective.Disorder. When this occurs the counts in the subtable can be combined or collapsed, i.e., summed over its margins, without loss of information. The original \(5 \times 2\) table can now be collapsed (combined) into a \(3 \times 2\) table.

apply(Ag.3.10.table[1:2,], 2, sum)
## Yes  No 
## 117  10
apply(Ag.3.10.table[3:4,], 2, sum)
## Yes  No 
##  65  71
Ag.3.10.table[5,]
## Yes  No 
##   0  13

From the results above, construct the next table.

Ag.3.10.collapsed.entries <- c(117, 65, 0, 10, 71, 13) 
Ag.3.10.collapsed.mat <- matrix(Ag.3.10.collapsed.entries, nrow = 3, byrow = FALSE, dimnames = list(Diagnosis = c('Schiz.or.Aff.Dis', 'Neur.or.Pers.Dis', 'Special.Symptoms'), Drugs.Rx = c('Yes', 'No')))
Ag.3.10.collapsed.table <- as.table(Ag.3.10.collapsed.mat) 
Ag.3.10.collapsed.table # Observed
##                   Drugs.Rx
## Diagnosis          Yes  No
##   Schiz.or.Aff.Dis 117  10
##   Neur.or.Pers.Dis  65  71
##   Special.Symptoms   0  13

The observed counts for Schizophrenia and Affective.Disorder are now combined into a single category now labeled Schiz.or.Aff.Dis. Similarly, the observed counts for Neurosis and Personality.Disorder are combined into a single category now labeled Neur.or.Pers.Dis. Since the counts associated with Special.Symptoms have not been used in a previous sub-table, they are retained in the table here.

Test for independence between Diagnosis and Drugs.Rx in the combined table. Actually, we are less concerned with independence here; we compute the LR \(\chi^{2}\) statistic to perform certain checks in our logic and to complete the steps for the partitioning.

Ag.3.10.collapsed.loglm <- loglm( ~ Diagnosis + Drugs.Rx, data = Ag.3.10.collapsed.table)
Ag.3.10.collapsed.loglm
## Call:
## loglm(formula = ~Diagnosis + Drugs.Rx, data = Ag.3.10.collapsed.table)
## 
## Statistics:
##                    X^2 df P(> X^2)
## Likelihood Ratio 95.77  2        0
## Pearson          83.88  2        0

As the LR \(\chi^{2}\) = 95.77 on 2 df, we reject the null hypothesis for independence, \(p \lt 0.0001\).

Summarizing to this point:

When the partitioning is performed correctly, the LR \(\chi^{2}\) values of the sub-tables sum, exactly, to the LR \(\chi^{2}\) value for the original table (Agresti, 1990, pp. 50-51). Similarly, the degrees of freedom associated with each test sum to the degrees of freedom associated with the test from the original table.

Add the three LR \(\chi^{2}\) values associated with the three sub-tables . . .

Ag.3.10.rows.12.loglm$lr + Ag.3.10.rows.34.loglm$lr + Ag.3.10.collapsed.loglm$lr
## [1] 96.54

… and compare to LR \(\chi^{2}\) value for the original \(5 \times 2\) table, on 4 degrees of freedom.

Ag.3.10.global.loglm$lr
## [1] 96.54

Perform a check to determine whether the two quantities are equivalent to 12 decimal places.

# check accuracy to 12 decimal places
round(Ag.3.10.rows.12.loglm$lr + Ag.3.10.rows.34.loglm$lr + Ag.3.10.collapsed.loglm$lr, 12) == round(Ag.3.10.global.loglm$lr, 12)
## [1] TRUE

They are equal. Also, the degrees of freedom for each component are, respectively, 1, 1, and 2, which sum to 4 degrees of freedom associated with the original table.

Summary and Interpretation

Psychiatric patients were relatively more or less likely to be prescribed drugs depending on their respective diagnoses. Patients diagnosed with Schizophrenia or Affective Disorder were more likely to be prescribed drugs than not (92% vs. 8%). Patients diagnosed with Neurosis or Personality Disorder, were about equally likely to be prescribed drugs or not (48% vs. 52%). Patients with Special Symptoms were not likely to be prescribed drugs; in fact, no drugs were prescribed for these patients in this sample (0.0% vs. 100.0%).

Appendix

Chi-Square Formulas

The Likelihood Ratio (LR) Chi-Square test value has the form

\[ \mbox{LR} \hspace{3 mm} \chi^{2} = 2 \cdot \sum_{I} \sum_{J} O_{ij} \cdot log (\frac{O_{ij}}{E_{ij}} ) \]

where \(O_{ij}\) and \(E_{ij}\) are Observed and Expected values, \(log\) denotes the natural (Naperian) logarithm, and the symbol \(\sum_{I} \sum_{J}\) indicates summation over the (i,j) cells in a given table, \(i = 1, ..., I\) (number of rows in the table) and \(j = 1, ..., J\) (number of columns in the table).

The Pearson Chi-Squared test value has the form

\[ \mbox{Pearson} \hspace{3 mm} \chi^{2} = \sum_{I} \sum_{J} \frac {(O_{ij} - E_{ij})^{2}}{E_{ij}} \]

where \(O_{ij}\) and \(E_{ij}\) are Observed and Expected values, and the symbol \(\sum_{I} \sum_{J}\) indicates summation over the (i,j) cells in a given table, \(i = 1, ..., I\) (number of rows in the table) and \(j = 1, ..., J\) (number of columns in the table).

Rules for Partitioning

Here is a brief list of some of the Rules for Partitioning the contingency table (Agresti, page 53):

  1. The degrees of freedom for the sub-tables must sum to the degrees of freedom for the original table.

  2. Each cell count in the original table must be a cell count in one and only one sub-table.

  3. Each marginal total of the original table must be a marginal total for one and only one sub-table.

References

Agresti A, (1990) Categorical Data Analysis, 2nd ed., Wiley, New York.

Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/

Venables WN & Ripley BD (2002) Modern Applied Statistics with S Fourth Edition. Springer, New York. ISBN 0-387-95457-0

Fox J & Weisberg S (2011). An {R} Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion

Author Comments

The data for this problem come from Alan Agresti’s Categorical Data Analysis (1990, 2nd ed.).

This document is written in R Markdown and knitr with an HTML output. The .Rmd file that produced this document is available to readers by request to W Greg Alvord at greg.alvord@nih.gov. The .pdf output corresponding to this is also available.