Abstract

A common problem in contingency table analysis involves finding relationships between variables in an \(I \times J\) contingency table, where either \(I\) or \(J\), or both, are greater than 2. We partition (decompose) a \(3 \times 3\) contingency table of psychiatrists cross-classified by their School of psychiatric thought (Eclectic, Medical, or Psychoanalytic) and their opinions about the Origin of schizophrenia (Biogenic, Environmental, or a Combination of the two). The decomposition partitions the contingency table into orthogonal, additive components, each associated with its own Likelihood Ratio Chi-Square statistic. We analyze the problem in detail.

Introduction

We are presented with the \(3 \times 3\) contingency table from Alan Agresti’s Categorical Data Analysis (1990, 2nd ed., p 52, Table 3.5). The table cross-classifies 282 psychiatrists by their School of psychiatric thought (Eclectic, Medical, or Psychoanalytic) and their opinions about the Origin of schizophrenia (Biogenic, Environmental, or as a Combination of the two). The problem is to partition, or decompose, the table in a statistically rigorous way such that certain categories, or groups of categories, better explain the underlying relationships. The decomposition involves the partitioning of the contingency table and its corresponding Likelihood Ratio Chi-Square statistic, LR \(\chi^{2}\), also designated as \(G^{2}\), into orthogonal, additive components (Agresti, pp 50-54). The advantage that accrues from partitioning a contingency table into orthogonal components is that independent inferences can be drawn for each component involved in the partitioning. “A [correct] partitioning may show that an association primarily reflects differences between certain categories or groupings of categories,” (Agresti, p. 50). Rules for partitioning the table are provided in the Appendix.

Analysis

We partition (decompose) a \(3 \times 3\) contingency table with data from Alan Agresti’s Categorical Data Analysis (1990, 2nd ed.) – Table 3.5, page 52. The echo = TRUE option has been retained so that the reader may easily see how the data are entered and prepared in an R table.

Ag.3.5.table.entries <- c(90,12,78,  13,1,6,  19,13,50)
Ag.3.5 <- as.table(matrix(Ag.3.5.table.entries, nrow = 3, byrow = TRUE, dimnames = list(School = c('Eclectic', 'Medical', 'Psychoanalytic'), Origin = c('Biogenic', 'Environmental', 'Combination'))))
Ag.3.5
##                 Origin
## School           Biogenic Environmental Combination
##   Eclectic             90            12          78
##   Medical              13             1           6
##   Psychoanalytic       19            13          50
addmargins(Ag.3.5) # add marginal sums to table
##                 Origin
## School           Biogenic Environmental Combination Sum
##   Eclectic             90            12          78 180
##   Medical              13             1           6  20
##   Psychoanalytic       19            13          50  82
##   Sum                 122            26         134 282

Table Ag.3.5 cross-classifies a sample of 282 psychiatrists according to their School of psychiatric thought by their opinions regarding the Origin of schizophrenia. We use the Likelihood Ratio \(\chi^{2}\) statistic to test for independence between School and Origin. One way to do this is with the loglm() function from the MASS package. First, load the MASS package (Venables and Ripley, 2002). Next, perform a ‘global’ \(\chi^{2}\) test, or \(G^{2}\), for the hypothesis of independence (no association) between School and Origin. The null hypothesis \(H_{0}\) for this table states that the variables School and Origin are independent; that is, \(H_{0}\) states that School and Origin are unrelated statistically.

library(MASS) # Venables and Ripley
Ag.3.5.loglm <- loglm( ~ School + Origin, data = Ag.3.5) 
Ag.3.5.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.5)
## 
## Statistics:
##                    X^2 df  P(> X^2)
## Likelihood Ratio 23.04  4 0.0001245
## Pearson          22.38  4 0.0001685

We reject the null hypothesis that School and Origin are independent: \(G^{2} = 23.04\) on 4 df, \(p = 0.00012 \approx 0.0001\). [Note: Ignore the Pearson \(\chi^{2}\) value in these analyses; the exact additive partitioning used here applies to \(G^{2}\).]
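As a cross-check that does not depend on MASS, the global \(G^{2}\) can be reproduced in base R from the observed counts and the expected counts under independence. This is only a minimal sketch; the object names `tab`, `expected`, `G2`, `dof` and `p.value` are illustrative and are not part of the original analysis.

```r
# Observed counts from Table 3.5 (rows = School, columns = Origin)
tab <- matrix(c(90, 12, 78,
                13,  1,  6,
                19, 13, 50),
              nrow = 3, byrow = TRUE,
              dimnames = list(School = c("Eclectic", "Medical", "Psychoanalytic"),
                              Origin = c("Biogenic", "Environmental", "Combination")))

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Likelihood ratio statistic (valid here because no observed count is zero),
# its degrees of freedom, and the upper-tail chi-square p-value
G2 <- 2 * sum(tab * log(tab / expected))
dof <- (nrow(tab) - 1) * (ncol(tab) - 1)
p.value <- pchisq(G2, df = dof, lower.tail = FALSE)

round(c(G2 = G2, df = dof, p.value = p.value), 5)
```

The values should agree with the Likelihood Ratio line of the loglm() output above.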

The loglinear analysis reveals a strong relationship between School and Origin. However, we wish to ascertain more specifically which Schools of psychiatric thought, or groupings of schools of thought, and which categories of Origin account for the relationship.

We now search for subtables that might be homogeneous, and which might then be combined (collapsed) so that a clearer picture can emerge. But where do we start? It requires some thought; there is no prescribed method that will work every time. In an \(I \times J\) table such as we have here, we might try to identify, say, two columns (of the larger table) that appear to have comparable proportions (percentages) of cases down two or more rows. (Alternatively, we might try to identify, say, two rows that appear to have comparable proportions of cases across two or more columns.) For example, in selecting two columns, we might choose the Biogenic and Environmental levels of Origin and try to determine whether the percentages of two or more rows, i.e., levels of School, might be comparable. So, we prepare a table of percentages that, for each School, sum to 100% across the two categories of Origin. Here is the code chunk that accomplishes this.

# compute proportions across columns
Ag.3.5.prop.mar.1.cols.12.table <- prop.table(Ag.3.5[,1:2], margin = 1)
 
# transform to percentages
Ag.3.5.percent.mar.1.cols.12.table <- 100*Ag.3.5.prop.mar.1.cols.12.table
 
# present ‘percentages’ table
round(Ag.3.5.percent.mar.1.cols.12.table, 1) # Percentage
##                 Origin
## School           Biogenic Environmental
##   Eclectic           88.2          11.8
##   Medical            92.9           7.1
##   Psychoanalytic     59.4          40.6

Restricting our attention to this \(3 \times 2\) subtable, we see that under the Biogenic level of Origin the percentage for the Eclectic school of thought is 88.2% and the percentage for the Medical school of thought is 92.9%, as compared to 11.8% and 7.1%, respectively under Environmental. This suggests that Eclectic and Medical schools of thought are approximately comparable. On the other hand, the percentage for the Psychoanalytic school of thought (under Biogenic) is only 59.4% compared to 40.6% (under Environmental). Taken together, these observations suggest that the \(2 \times 2\) subtable comprised of the Eclectic and Medical schools of thought with Biogenic and Environmental origins might be homogeneous. In this next code chunk we display table Ag.3.6.A to correspond to the first subtable presented in Agresti, page 52. We then calculate \(G^{2}\) for this \(2 \times 2\) subtable.

Ag.3.6.A <- Ag.3.5[1:2, 1:2]
Ag.3.6.A
##           Origin
## School     Biogenic Environmental
##   Eclectic       90            12
##   Medical        13             1
Ag.3.6.A.loglm <- loglm( ~ School + Origin, data = Ag.3.6.A) 
Ag.3.6.A.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.6.A)
## 
## Statistics:
##                     X^2 df P(> X^2)
## Likelihood Ratio 0.2942  1   0.5875
## Pearson          0.2643  1   0.6072

\(G^{2} = 0.29\) with \(p = 0.59\) on 1 degree of freedom, indicating that this subtable is homogeneous. We do not reject the null hypothesis of independence.

When an \(I \times J\) table is homogeneous, i.e., when it satisfies the null hypothesis \(H_{0}\) of independence, its row and column marginal totals contain the information that lies within the individual cells. [See Technical Note 1 in the Appendix.] Hence, in a problem such as this one, we can compute the row marginal totals of table Ag.3.6.A and use them as entries in a newly constructed subtable for further analysis. [See Rules for Partitioning in the Appendix.]

margin.table(Ag.3.6.A, margin = 1)
## School
## Eclectic  Medical 
##      102       14
Ag.3.6.B <- as.table(matrix(c(margin.table(Ag.3.6.A, margin = 1), Ag.3.5[1:2,3]), nrow = 2, byrow = FALSE, dimnames = list(School = c('Ecl', 'Med'), Origin = c('Bio+Env', 'Com'))))  
Ag.3.6.B
##       Origin
## School Bio+Env Com
##    Ecl     102  78
##    Med      14   6
Ag.3.6.B.loglm <- loglm( ~ School + Origin, data = Ag.3.6.B) 
Ag.3.6.B.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.6.B)
## 
## Statistics:
##                    X^2 df P(> X^2)
## Likelihood Ratio 1.359  1   0.2437
## Pearson          1.314  1   0.2517

We conclude that table Ag.3.6.B is homogeneous; \(G^{2} = 1.36\) on 1 degree of freedom, \(p = 0.24\).

Summarizing to this point, we have found that tables Ag.3.6.A and Ag.3.6.B are homogeneous. Since both tables Ag.3.6.A and Ag.3.6.B compare Eclectic and Medical Schools across all three categories of Origin, it would have been possible for us to have started our analysis with our attention restricted to this \(2 \times 3\) subtable. We do that here.

Ag.3.5.Ecl.Med.loglm <- loglm( ~ School + Origin, data = Ag.3.5[1:2,])
Ag.3.5.Ecl.Med.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.5[1:2, ])
## 
## Statistics:
##                    X^2 df P(> X^2)
## Likelihood Ratio 1.653  2   0.4376
## Pearson          1.625  2   0.4437

\(G^{2} = 1.65\) on 2 degrees of freedom, \(p = 0.4376 \approx 0.44\), indicating that the subtable comprising the first two rows of table Ag.3.5 is homogeneous. As tables Ag.3.6.A and Ag.3.6.B are orthogonal in the parameter space, this value of \(G^{2} = 1.65\) is the sum of the \(G^{2}\) statistics from Ag.3.6.A.loglm and Ag.3.6.B.loglm, 0.29 + 1.36.
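The additivity claimed here can be checked numerically without loglm(). The following is a sketch in base R; the helper `g2()` and the objects `rows12`, `sub.A` and `sub.B` are hypothetical names introduced only for this illustration.

```r
# Likelihood ratio statistic for a two-way table of (non-zero) counts
g2 <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  2 * sum(tab * log(tab / expected))
}

# First two rows of Table 3.5 (Eclectic, Medical)
rows12 <- matrix(c(90, 12, 78,
                   13,  1,  6), nrow = 2, byrow = TRUE)

# Subtable A: Biogenic vs Environmental
sub.A <- rows12[, 1:2]

# Subtable B: (Biogenic + Environmental) vs Combination
sub.B <- cbind(rowSums(rows12[, 1:2]), rows12[, 3])

round(c(A = g2(sub.A), B = g2(sub.B),
        A.plus.B = g2(sub.A) + g2(sub.B),
        rows.1.2 = g2(rows12)), 4)
```

The last two entries should coincide, which is the additive property of \(G^{2}\) for this pair of subtables.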

We found earlier that table Ag.3.6.A was homogeneous. This allowed us to compute its row marginal totals (margin = 1) and enter them into table Ag.3.6.B. As table Ag.3.6.A is homogeneous, we can also compute its column marginal totals (margin = 2) and enter them into a new table.

margin.table(Ag.3.6.A, margin = 2)
## Origin
##      Biogenic Environmental 
##           103            13

We use this information to construct table Ag.3.6.C, keeping in mind the Rules for Partitioning (Appendix).

Ag.3.6.C <- as.table(matrix(c(margin.table(Ag.3.6.A, margin = 2), Ag.3.5[3, 1:2]), nrow = 2, byrow = TRUE, dimnames = list(School = c('Ecl+Med', 'Psy'), Origin = c('Bio', 'Env'))))  
Ag.3.6.C
##          Origin
## School    Bio Env
##   Ecl+Med 103  13
##   Psy      19  13

We compute the loglm model for table Ag.3.6.C.

Ag.3.6.C.loglm <- loglm( ~ School + Origin, data = Ag.3.6.C) 
Ag.3.6.C.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.6.C)
## 
## Statistics:
##                    X^2 df  P(> X^2)
## Likelihood Ratio 12.95  1 0.0003194
## Pearson          14.99  1 0.0001082

For table Ag.3.6.C, \(G^{2} = 12.95\) on 1 df, \(p = 0.0003194 \approx 0.0003\). We reject the null hypothesis \(H_{0}\) of independence and conclude that for this subtable, School and Origin are related. One can see from table Ag.3.6.C that psychiatrists from the combined Eclectic and Medical schools, Ecl+Med, are much more inclined to attribute the cause of schizophrenia to Biogenic (Bio) origins than to Environmental (Env) origins, by a ratio of 103:13. On the other hand, psychiatrists from the Psychoanalytic (Psy) school are only slightly more inclined (if at all) to attribute the cause of schizophrenia to Biogenic (Bio) origins rather than to Environmental (Env) origins, with a ratio of only 19:13.
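The two ratios quoted above are sample odds, and their quotient is a sample odds ratio. The sketch below merely carries out that arithmetic; the odds ratio is a descriptive summary added for illustration and is not part of the original analysis, and the names `tab.C`, `odds` and `odds.ratio` are hypothetical.

```r
# Table Ag.3.6.C as given in the text
tab.C <- matrix(c(103, 13,
                   19, 13), nrow = 2, byrow = TRUE,
                dimnames = list(School = c("Ecl+Med", "Psy"),
                                Origin = c("Bio", "Env")))

# Odds of a Biogenic (rather than Environmental) opinion within each row
odds <- tab.C[, "Bio"] / tab.C[, "Env"]

# Sample odds ratio comparing Ecl+Med with Psy; algebraically 103 / 19
odds.ratio <- odds[["Ecl+Med"]] / odds[["Psy"]]

round(odds, 2)
round(odds.ratio, 2)
```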

Our last three tests on \(2 \times 2\) tables have consumed 3 of the 4 degrees of freedom available in our original global test. Hence, one \(2 \times 2\) table remains that is orthogonal to the other three. Following the Rules for Partitioning, we continue to collapse information in previous tables to construct a new table, Ag.3.6.D.

Ag.3.6.D <- as.table(matrix(c(margin.table(Ag.3.6.C, margin = 1), margin.table(Ag.3.6.B, margin = 2)[2], Ag.3.5[3, 3]), nrow = 2, byrow = FALSE, dimnames = list(School = c('Ecl+Med', 'Psy'), Origin = c('Bio+Env', 'Com'))))  
Ag.3.6.D
##          Origin
## School    Bio+Env Com
##   Ecl+Med     116  84
##   Psy          32  50

We compute the loglm model for table Ag.3.6.D.

Ag.3.6.D.loglm <- loglm( ~ School + Origin, data = Ag.3.6.D) 
Ag.3.6.D.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.6.D)
## 
## Statistics:
##                    X^2 df P(> X^2)
## Likelihood Ratio 8.430  1 0.003690
## Pearson          8.397  1 0.003759

For table Ag.3.6.D, \(G^{2} = 8.43\) on 1 df, \(p = 0.00369 \approx 0.0037\). We reject the null hypothesis \(H_{0}\) of independence and conclude that for this subtable, School and Origin are related.

If we have followed the Rules for Partitioning correctly in our decomposition of the original \(3 \times 3\) table, the LR \(\chi^{2}\) (\(G^{2}\)) statistics for the subtables form strictly orthogonal components and, therefore, must sum (exactly) to the LR \(\chi^{2}\) statistic obtained from the original table. Also, the degrees of freedom associated with the subtables sum to the degrees of freedom available from the analysis of the original table. Subtables Ag.3.6.A, Ag.3.6.B, Ag.3.6.C and Ag.3.6.D each consume one (1) degree of freedom; they sum to the degrees of freedom (4) available from the analysis of table Ag.3.5.

This next set of code chunks confirms that the LR \(\chi^{2}\) statistics associated with the four subtables sum to the LR \(\chi^{2}\) statistic obtained in the loglinear analysis of the original table Ag.3.5. The LR \(\chi^{2}\) statistic for table Ag.3.5 is

Ag.3.5.loglm$lrt # likelihood ratio statistic stored in the loglm fit
## [1] 23.04

The LR \(\chi^{2}\) statistics for the four subtables are 0.2942, 1.3588, 12.9529, and 8.4303, respectively. These sum to

Ag.3.6.A.loglm$lrt + Ag.3.6.B.loglm$lrt + Ag.3.6.C.loglm$lrt + Ag.3.6.D.loglm$lrt
## [1] 23.04

A check to confirm that these two statistics are identical to 12 decimal places follows.

round(Ag.3.6.A.loglm$lrt + Ag.3.6.B.loglm$lrt + Ag.3.6.C.loglm$lrt + Ag.3.6.D.loglm$lrt, 12) == round(Ag.3.5.loglm$lrt, 12)
## [1] TRUE
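For readers without the fitted loglm objects at hand, the whole partition can be rebuilt and summarized in base R. The sketch below is one possible way to do so, not the method used above: `g2()`, `parts` and `partition` are hypothetical names, and the final comparison uses all.equal(), which tests equality up to a small numerical tolerance instead of rounding both sides.

```r
# Likelihood ratio statistic for a two-way table of (non-zero) counts
g2 <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  2 * sum(tab * log(tab / expected))
}

# Table 3.5 (rows = School, columns = Origin)
tab <- matrix(c(90, 12, 78,
                13,  1,  6,
                19, 13, 50), nrow = 3, byrow = TRUE)

# The four 2 x 2 subtables of the partition, built from the original counts
parts <- list(
  A = tab[1:2, 1:2],
  B = cbind(rowSums(tab[1:2, 1:2]), tab[1:2, 3]),
  C = rbind(colSums(tab[1:2, 1:2]), tab[3, 1:2]),
  D = rbind(c(sum(tab[1:2, 1:2]), sum(tab[1:2, 3])),
            c(sum(tab[3, 1:2]),   tab[3, 3]))
)

# One row per component: statistic, degrees of freedom, p-value
G2.parts <- sapply(parts, g2)
partition <- data.frame(G2 = G2.parts, df = 1,
                        p.value = pchisq(G2.parts, df = 1, lower.tail = FALSE))
round(partition, 4)

# Tolerance-based comparison of the summed components with the global statistic
isTRUE(all.equal(sum(G2.parts), g2(tab)))
```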

Summary and Interpretation

We evaluate the original \(3 \times 3\) table of 282 psychiatrists and find a significant relationship between the school of psychiatric thought (eclectic, medical, or psychoanalytic) in which they were trained and their opinions regarding the origin of schizophrenia (biogenic, environmental, or a combination of the two). The first subtable (Ag.3.6.A) compares the eclectic and medical schools of thought on whether the origin of schizophrenia is biogenic or environmental. The second subtable (Ag.3.6.B) compares the eclectic and medical schools of thought on whether the origin of schizophrenia is due to a combination of biogenic and environmental factors as opposed to either of these alone. Tables Ag.3.6.A and Ag.3.6.B are both homogeneous, and because they are orthogonal components, their \(G^{2}\) statistics sum to the \(G^{2}\) statistic obtained from the test for independence applied to the first two rows of original table Ag.3.5. Statistically, the eclectic and medical schools of thought view the origin of schizophrenia similarly. Hence, we combine the frequencies from the eclectic and medical schools of thought and compare them to those of the psychoanalytic school.

The third subtable Ag.3.6.C compares the combined group of eclectic and medical schools to the psychoanalytic school on just the biogenic and environmental classifications of origin. This subtable reveals a significant relationship between school and origin showing the combined group of eclectic and medical psychiatrists as more likely to ascribe the origin of schizophrenia to biogenic (as opposed to environmental) factors than psychiatrists of the psychoanalytic school. The fourth subtable Ag.3.6.D also reveals a significant relationship. The psychoanalytic school psychiatrists are more likely than the eclectic and medical psychiatrists to ascribe the origin of schizophrenia to a combination of biogenic and environmental factors rather than to either of these factors alone.
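The direction of the two significant associations can be read from row percentages of subtables Ag.3.6.C and Ag.3.6.D. The sketch below is purely descriptive and was not part of the original analysis; the object names `tab.C`, `tab.D`, `pct.C` and `pct.D` are hypothetical.

```r
# Subtables Ag.3.6.C and Ag.3.6.D as given in the text
tab.C <- matrix(c(103, 13,
                   19, 13), nrow = 2, byrow = TRUE,
                dimnames = list(School = c("Ecl+Med", "Psy"),
                                Origin = c("Bio", "Env")))
tab.D <- matrix(c(116, 84,
                   32, 50), nrow = 2, byrow = TRUE,
                dimnames = list(School = c("Ecl+Med", "Psy"),
                                Origin = c("Bio+Env", "Com")))

# Row percentages: each row sums to 100
pct.C <- round(100 * prop.table(tab.C, margin = 1), 1)
pct.D <- round(100 * prop.table(tab.D, margin = 1), 1)

pct.C
pct.D
```

In pct.C the Bio column is larger for Ecl+Med than for Psy, and in pct.D the Com column is larger for Psy than for Ecl+Med, which matches the interpretation given above.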

Appendix

Technical Note 1

[Agresti, p. 47] In two-way contingency tables with multinomial sampling, the null hypothesis of statistical independence is \(H_{0}: \pi_{ij} = \pi_{i+} \pi_{+j}\) for all \(i\) and \(j\). To test \(H_{0}\), we could use the Pearson, or the Likelihood Ratio, \(\chi^{2}\) statistic with \(n_{ij}\) in place of \(n_{i}\) and \(m_{ij} = n \pi_{ij} = n \pi_{i+} \pi_{+j}\) in place of \(m_{i}\). Here, \(m_{ij}\) is the expected value of \(n_{ij}\) under the null hypothesis.
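To make the note concrete, the sketch below computes estimated expected counts for subtable Ag.3.6.A by substituting sample proportions for \(\pi_{i+}\) and \(\pi_{+j}\). That substitution, and the names `tab.A`, `pi.row`, `pi.col` and `m.hat`, are assumptions made for this illustration.

```r
# Subtable Ag.3.6.A as given in the text
tab.A <- matrix(c(90, 12,
                  13,  1), nrow = 2, byrow = TRUE,
                dimnames = list(School = c("Eclectic", "Medical"),
                                Origin = c("Biogenic", "Environmental")))

n <- sum(tab.A)
pi.row <- rowSums(tab.A) / n   # sample estimates of pi_{i+}
pi.col <- colSums(tab.A) / n   # sample estimates of pi_{+j}

# Estimated expected counts under independence: m_ij = n * pi_{i+} * pi_{+j}
m.hat <- n * outer(pi.row, pi.col)

round(m.hat, 2)           # fitted counts built from the margins alone
round(tab.A - m.hat, 2)   # differences from the observed counts
```

The fitted counts reproduce the row and column totals of the observed table, and for this homogeneous subtable they lie close to the observed cells.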

Chi-Square Formulas

The Likelihood Ratio (LR) Chi-Square test value has the form

\[ \mbox{LR} \hspace{3 mm} \chi^{2} = 2 \cdot \sum_{I} \sum_{J} O_{ij} \cdot \log \left( \frac{O_{ij}}{E_{ij}} \right) \]

where \(O_{ij}\) and \(E_{ij}\) are Observed and Expected values, \(\log\) denotes the natural (Napierian) logarithm, and the symbol \(\sum_{I} \sum_{J}\) indicates summation over the (i,j) cells in a given table, \(i = 1, ..., I\) (number of rows in the table) and \(j = 1, ..., J\) (number of columns in the table).

The Pearson Chi-Squared test value has the form

\[ \mbox{Pearson} \hspace{3 mm} \chi^{2} = \sum_{I} \sum_{J} \frac {(O_{ij} - E_{ij})^{2}}{E_{ij}} \]

where \(O_{ij}\) and \(E_{ij}\) are Observed and Expected values, and the symbol \(\sum_{I} \sum_{J}\) indicates summation over the (i,j) cells in a given table, \(i = 1, ..., I\) (number of rows in the table) and \(j = 1, ..., J\) (number of columns in the table).
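The Pearson formula can be applied directly to Table 3.5 as a check on the Pearson line of the global loglm() output. This is a minimal sketch with illustrative object names (`tab`, `expected`, `X2`, `X2.builtin`); the comparison with chisq.test() from base R is an addition not used in the analysis above.

```r
# Observed counts from Table 3.5 (rows = School, columns = Origin)
tab <- matrix(c(90, 12, 78,
                13,  1,  6,
                19, 13, 50), nrow = 3, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson chi-squared statistic, term by term from the formula
X2 <- sum((tab - expected)^2 / expected)

# Same statistic from chisq.test(); the warning it raises about small
# expected counts (one expected count is below 5) is suppressed here
X2.builtin <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)

round(c(by.hand = X2, chisq.test = X2.builtin), 2)
```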

Rules for Partitioning

Here is a brief list of some of the Rules for Partitioning the contingency table (Agresti, page 53):

  1. The degrees of freedom for the sub-tables must sum to the degrees of freedom for the original table.

  2. Each cell count in the original table must be a cell count in one and only one sub-table.

  3. Each marginal total of the original table must be a marginal total for one and only one sub-table.
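The three rules can be checked mechanically for the partition used in this document. The sketch below rests on a hypothetical encoding, introduced here only for illustration: each subtable lists which original rows and columns (numbered 1 to 3 in the order of Table 3.5) are pooled into each of its own rows and columns. It reflects one reading of the rules and is not code from Agresti or from the analysis above.

```r
# Hypothetical encoding of subtables Ag.3.6.A to Ag.3.6.D
subtables <- list(
  A = list(rows = list(1, 2),       cols = list(1, 2)),
  B = list(rows = list(1, 2),       cols = list(c(1, 2), 3)),
  C = list(rows = list(c(1, 2), 3), cols = list(1, 2)),
  D = list(rows = list(c(1, 2), 3), cols = list(c(1, 2), 3))
)

# Rule 1: subtable degrees of freedom must sum to (3 - 1) * (3 - 1) = 4
df.sub <- sapply(subtables,
                 function(s) (length(s$rows) - 1) * (length(s$cols) - 1))
rule1 <- sum(df.sub) == 4

# Rules 2 and 3: count how often each original cell and each original
# marginal total appears, unpooled, in some subtable
cell.hits <- matrix(0, nrow = 3, ncol = 3)
row.hits <- numeric(3)
col.hits <- numeric(3)
for (s in subtables) {
  spans.all.rows <- setequal(unlist(s$rows), 1:3)
  spans.all.cols <- setequal(unlist(s$cols), 1:3)
  for (r in s$rows) {
    # an original row total is a single row summed over all three columns
    if (length(r) == 1 && spans.all.cols) row.hits[r] <- row.hits[r] + 1
    for (k in s$cols) {
      # an original cell is a single row crossed with a single column
      if (length(r) == 1 && length(k) == 1) {
        cell.hits[r, k] <- cell.hits[r, k] + 1
      }
    }
  }
  for (k in s$cols) {
    # an original column total is a single column summed over all three rows
    if (length(k) == 1 && spans.all.rows) col.hits[k] <- col.hits[k] + 1
  }
}
rule2 <- all(cell.hits == 1)
rule3 <- all(row.hits == 1) && all(col.hits == 1)

c(rule1 = rule1, rule2 = rule2, rule3 = rule3)
```

Changing the encoding, for example by pooling different rows in subtable C, gives a quick way to see whether an alternative decomposition would still satisfy the three rules.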

References

Agresti A (1990). Categorical Data Analysis, 2nd ed. Wiley, New York.

R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL http://www.R-project.org/

Venables WN & Ripley BD (2002). Modern Applied Statistics with S, Fourth Edition. Springer, New York. ISBN 0-387-95457-0

Fox J & Weisberg S (2011). An R Companion to Applied Regression, Second Edition. Sage, Thousand Oaks, CA. URL http://socserv.socsci.mcmaster.ca/jfox/Books/Companion

Author Comments

The data for this problem come from Alan Agresti’s Categorical Data Analysis (1990, 2nd ed.), page 52.

This document is written in R Markdown and knitr with an HTML output. The .Rmd file that produced this document is available to readers by request to W Greg Alvord at greg.alvord@nih.gov. The .pdf output corresponding to this is also available.