A common problem in contingency table analysis involves finding relationships between variables in an \(I \times J\) contingency table, where either \(I\) or \(J\), or both, are greater than 2. We partition (decompose) a \(3 \times 3\) contingency table of psychiatrists cross-classified as to their School of psychiatric thought (Eclectic, Medical, or Psychoanalytic) vs. their opinions about the Origin of schizophrenia (Biogenic, Environmental, or a Combination of the two). The decomposition partitions the contingency table into orthogonal, additive components, each associated with its own Likelihood Ratio Chi-Square statistic. We analyze the problem in detail.
We are presented with the \(3 \times 3\) contingency table from Alan Agresti’s Categorical Data Analysis (1990, 2nd ed., p 52, Table 3.5). The table cross-classifies 282 psychiatrists by their School of psychiatric thought (Eclectic, Medical, or Psychoanalytic) and their opinions about the Origin of schizophrenia (Biogenic, Environmental, or as a Combination of the two). The problem is to partition, or decompose, the table in a statistically rigorous way such that certain categories, or groups of categories, better explain the underlying relationships. The decomposition involves the partitioning of the contingency table and its corresponding Likelihood Ratio Chi-Square statistic, LR \(\chi^{2}\), also designated as \(G^{2}\), into orthogonal, additive components (Agresti, pp 50-54). The advantage that accrues from partitioning a contingency table into orthogonal components is that independent inferences can be drawn for each component involved in the partitioning. “A [correct] partitioning may show that an association primarily reflects differences between certain categories or groupings of categories,” (Agresti, p. 50). Rules for partitioning the table are provided in the Appendix.
We partition (decompose) a \(3 \times 3\) contingency table with data from Alan Agresti’s Categorical Data Analysis (1990, 2nd ed.), Table 3.5, page 52. The echo = TRUE option has been retained so that the reader may easily see how the data are entered and prepared in an R table.
Ag.3.5.table.entries <- c(90,12,78, 13,1,6, 19,13,50)
Ag.3.5 <- as.table(matrix(Ag.3.5.table.entries, nrow = 3, byrow = TRUE, dimnames = list(School = c('Eclectic', 'Medical', 'Psychoanalytic'), Origin = c('Biogenic', 'Environmental', 'Combination'))))
Ag.3.5
## Origin
## School Biogenic Environmental Combination
## Eclectic 90 12 78
## Medical 13 1 6
## Psychoanalytic 19 13 50
addmargins(Ag.3.5) # add marginal sums to table
## Origin
## School Biogenic Environmental Combination Sum
## Eclectic 90 12 78 180
## Medical 13 1 6 20
## Psychoanalytic 19 13 50 82
## Sum 122 26 134 282
Table Ag.3.5 cross-classifies a sample of 282 psychiatrists according to their School of psychiatric thought by their opinions regarding the Origin of schizophrenia. We use the Likelihood Ratio \(\chi^{2}\) statistic to test for independence between School and Origin. One way to do this is with the loglm() function from the MASS package. First, load the MASS package (Venables and Ripley, 2002). Next, perform a ‘global’ \(\chi^{2}\) test, or \(G^{2}\), for the hypothesis of independence (no association) between School and Origin. The null hypothesis \(H_{0}\) for this table states that the variables School and Origin are independent; that is, \(H_{0}\) states that School and Origin are unrelated statistically.
library(MASS) # Venables and Ripley
Ag.3.5.loglm <- loglm( ~ School + Origin, data = Ag.3.5)
Ag.3.5.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.5)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 23.04 4 0.0001245
## Pearson 22.38 4 0.0001685
We reject the null hypothesis that School and Origin are independent; \(G^{2} = 23.04\) on 4 df, \(p = 0.00012 \approx 0.0001\). [Note: Ignore the Pearson \(\chi^{2}\) value in these analyses.]
The loglinear analysis reveals a strong relationship between School and Origin. However, we wish to ascertain more specifically which Schools of psychiatric thought, or groupings of schools of thought, and which categories of Origin account for the relationship.
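As a cross-check on the loglm() result, the global \(G^{2}\) can also be computed directly from the observed and expected counts in base R. The following is a minimal sketch rather than part of the original analysis; the object names (counts, expected, G2, df, p.value) are illustrative only.

```r
# Counts from Table 3.5 (rows = School, columns = Origin)
counts <- matrix(c(90, 12, 78,
                   13,  1,  6,
                   19, 13, 50),
                 nrow = 3, byrow = TRUE)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Likelihood ratio statistic (assumes every observed count is positive)
G2 <- 2 * sum(counts * log(counts / expected))

# Degrees of freedom and upper-tail chi-square p-value
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p.value <- pchisq(G2, df = df, lower.tail = FALSE)

round(c(G2 = G2, df = df, p.value = p.value), 4)
```

The value of G2 should agree with the Likelihood Ratio line of the loglm() output above, to the precision printed there.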
We now search for subtables that might be homogeneous, and which might then be combined (collapsed) so that a clearer picture can emerge. But where do we start? It requires some thought; there is no prescribed method that will work every time. In an \(I \times J\) table such as we have here, we might try to identify, say, two columns (of the larger table) that appear to have comparable proportions (percentages) of cases down two or more rows. (Alternatively, we might try to identify, say, two rows that appear to have comparable proportions of cases across two or more columns.) For example, in selecting two columns, we might choose the Biogenic and Environmental levels of Origin and try to determine whether the percentages of two or more rows, i.e., levels of School, might be comparable. So, we prepare a table of percentages that, for each School, sum to 100% across the two categories of Origin. Here is the code chunk that accomplishes this.
# compute proportions across columns
Ag.3.5.prop.mar.1.cols.12.table <- prop.table(Ag.3.5[,1:2], margin = 1)
# transform to percentages
Ag.3.5.percent.mar.1.cols.12.table <- 100*Ag.3.5.prop.mar.1.cols.12.table
# present ‘percentages’ table
round(Ag.3.5.percent.mar.1.cols.12.table, 1) # Percentage
## Origin
## School Biogenic Environmental
## Eclectic 88.2 11.8
## Medical 92.9 7.1
## Psychoanalytic 59.4 40.6
Restricting our attention to this \(3 \times 2\) subtable, we see that under the Biogenic level of Origin the percentage for the Eclectic school of thought is 88.2% and the percentage for the Medical school of thought is 92.9%, as compared to 11.8% and 7.1%, respectively, under Environmental. This suggests that the Eclectic and Medical schools of thought are approximately comparable. On the other hand, the percentage for the Psychoanalytic school of thought (under Biogenic) is only 59.4% compared to 40.6% (under Environmental). Taken together, these observations suggest that the \(2 \times 2\) subtable comprised of the Eclectic and Medical schools of thought with Biogenic and Environmental origins might be homogeneous. In this next code chunk we display table Ag.3.6.A to correspond to the first subtable presented in Agresti, page 52. We then calculate \(G^{2}\) for this \(2 \times 2\) subtable.
Ag.3.6.A <- Ag.3.5[1:2, 1:2]
Ag.3.6.A
## Origin
## School Biogenic Environmental
## Eclectic 90 12
## Medical 13 1
Ag.3.6.A.loglm <- loglm( ~ School + Origin, data = Ag.3.6.A)
Ag.3.6.A.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.6.A)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 0.2942 1 0.5875
## Pearson 0.2643 1 0.6072
\(G^{2} = 0.29\) on 1 degree of freedom, \(p = 0.59\). We do not reject the null hypothesis of independence, and so we treat this subtable as homogeneous.
When an \(I \times J\) table is homogeneous, i.e., when it satisfies the null hypothesis \(H_{0}\) of independence, its row and column marginal totals contain the information that lies within the individual cells. [See Technical Note 1 in the Appendix.] Hence, in a problem such as this one, we can compute the marginal totals across the rows of table Ag.3.6.A, and allow them to be the entries in a newly constructed subtable to be used for further analysis. [See Rules for Partitioning in the Appendix.]
margin.table(Ag.3.6.A, margin = 1)
## School
## Eclectic Medical
## 102 14
Ag.3.6.B <- as.table(matrix(c(margin.table(Ag.3.6.A, margin = 1), Ag.3.5[1:2,3]), nrow = 2, byrow = FALSE, dimnames = list(School = c('Ecl', 'Med'), Origin = c('Bio+Env', 'Com'))))
Ag.3.6.B
## Origin
## School Bio+Env Com
## Ecl 102 78
## Med 14 6
Ag.3.6.B.loglm <- loglm( ~ School + Origin, data = Ag.3.6.B)
Ag.3.6.B.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.6.B)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 1.359 1 0.2437
## Pearson 1.314 1 0.2517
We conclude that table Ag.3.6.B is homogeneous; \(G^{2} = 1.36\) on 1 degree of freedom, \(p = 0.24\).
Summarizing to this point, we have found that tables Ag.3.6.A and Ag.3.6.B are homogeneous. Since these two tables together compare the Eclectic and Medical Schools across all three categories of Origin, it would have been possible for us to have started our analysis with our attention restricted to this \(2 \times 3\) subtable. We do that here.
Ag.3.5.Ecl.Med.loglm <- loglm( ~ School + Origin, data = Ag.3.5[1:2,])
Ag.3.5.Ecl.Med.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.5[1:2, ])
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 1.653 2 0.4376
## Pearson 1.625 2 0.4437
\(G^{2} = 1.65\) on 2 degrees of freedom, \(p = 0.4376 \approx 0.44\), indicating that the subtable comprising the first two rows of table Ag.3.5 is homogeneous. As tables Ag.3.6.A and Ag.3.6.B are orthogonal in the parameter space, this value of \(G^{2} = 1.65\) equals the sum of the \(G^{2}\) statistics from Ag.3.6.A.loglm and Ag.3.6.B.loglm, 0.29 + 1.36.
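The additivity just described can be checked numerically without loglm(). The sketch below is an illustration in base R only; the helper g2() and the object names ecl.med, sub.A and sub.B are hypothetical and simply mirror the first two rows of table Ag.3.5 and the two subtables built from them.

```r
# Likelihood ratio statistic for a two-way table (assumes all counts > 0)
g2 <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  2 * sum(tab * log(tab / expected))
}

# First two rows of Table 3.5: Eclectic and Medical schools
ecl.med <- matrix(c(90, 12, 78,
                    13,  1,  6),
                  nrow = 2, byrow = TRUE)

# Subtable A: Biogenic vs. Environmental
sub.A <- ecl.med[, 1:2]

# Subtable B: (Biogenic + Environmental) vs. Combination
sub.B <- cbind(rowSums(sub.A), ecl.med[, 3])

g2.A <- g2(sub.A)
g2.B <- g2(sub.B)
g2.rows <- g2(ecl.med)

# The two 1-df components should add up to the 2-df statistic
all.equal(g2.A + g2.B, g2.rows)
```

all.equal() compares with a small numerical tolerance, which is usually a safer choice than `==` for floating-point sums.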
We found earlier that table Ag.3.6.A was homogeneous. This allowed us to sum across its columns, giving one marginal total for each School, and enter those totals into table Ag.3.6.B. As table Ag.3.6.A is homogeneous, we can likewise sum down its rows, giving one marginal total for each Origin, and enter those totals into a new table.
margin.table(Ag.3.6.A, margin = 2)
## Origin
## Biogenic Environmental
## 103 13
We use this information to construct table Ag.3.6.C, keeping in mind the Rules for Partitioning (Appendix).
Ag.3.6.C <- as.table(matrix(c(margin.table(Ag.3.6.A, margin = 2), Ag.3.5[3, 1:2]), nrow = 2, byrow = TRUE, dimnames = list(School = c('Ecl+Med', 'Psy'), Origin = c('Bio', 'Env'))))
Ag.3.6.C
## Origin
## School Bio Env
## Ecl+Med 103 13
## Psy 19 13
We compute the loglm model for table Ag.3.6.C.
Ag.3.6.C.loglm <- loglm( ~ School + Origin, data = Ag.3.6.C)
Ag.3.6.C.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.6.C)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 12.95 1 0.0003194
## Pearson 14.99 1 0.0001082
For table Ag.3.6.C, \(G^{2} = 12.95\) on 1 df, \(p = 0.0003194 \approx 0.0003\). We reject the null hypothesis \(H_{0}\) of independence and conclude that for this subtable, School and Origin are related. One can see from table Ag.3.6.C that psychiatrists from the combined Eclectic and Medical schools, Ecl+Med, are much more inclined to attribute the cause of schizophrenia to Biogenic (Bio) origins than to Environmental (Env) origins, by a ratio of 103:13. On the other hand, psychiatrists from the Psychoanalytic (Psy) school are only slightly more inclined (if at all) to attribute the cause of schizophrenia to Biogenic (Bio) origins rather than to Environmental (Env) origins, with a ratio of only 19:13.
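The two ratios quoted above are sample odds, and their quotient is the sample odds ratio for this subtable. The following sketch, with illustrative names tab.C, odds.bio and odds.ratio that do not appear in the original analysis, shows one way to compute them in base R.

```r
# Subtable C: combined Eclectic + Medical vs. Psychoanalytic,
# Biogenic vs. Environmental
tab.C <- matrix(c(103, 13,
                   19, 13),
                nrow = 2, byrow = TRUE,
                dimnames = list(School = c("Ecl+Med", "Psy"),
                                Origin = c("Bio", "Env")))

# Odds of a Biogenic (rather than Environmental) opinion within each row
odds.bio <- tab.C[, "Bio"] / tab.C[, "Env"]

# Sample odds ratio: Ecl+Med odds relative to Psy odds
odds.ratio <- unname(odds.bio["Ecl+Med"] / odds.bio["Psy"])

round(odds.bio, 2)
round(odds.ratio, 2)
```

Here the odds are about 7.9 for Ecl+Med and about 1.5 for Psy, so the sample odds ratio is 103/19, roughly 5.4.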
Our three tests on \(2 \times 2\) tables have consumed 3 of the 4 degrees of freedom available in our original global test. Hence, one \(2 \times 2\) table remains that is orthogonal to the other three. Following the Rules for Partitioning, we continue to collapse information in previous tables to construct a new table, Ag.3.6.D.
Ag.3.6.D <- as.table(matrix(c(margin.table(Ag.3.6.C, margin = 1), margin.table(Ag.3.6.B, margin = 2)[2], Ag.3.5[3, 3]), nrow = 2, byrow = FALSE, dimnames = list(School = c('Ecl+Med', 'Psy'), Origin = c('Bio+Env', 'Com'))))
Ag.3.6.D
## Origin
## School Bio+Env Com
## Ecl+Med 116 84
## Psy 32 50
We compute the loglm model for table Ag.3.6.D.
Ag.3.6.D.loglm <- loglm( ~ School + Origin, data = Ag.3.6.D)
Ag.3.6.D.loglm
## Call:
## loglm(formula = ~School + Origin, data = Ag.3.6.D)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 8.430 1 0.003690
## Pearson 8.397 1 0.003759
For table Ag.3.6.D, \(G^{2} = 8.43\) on 1 df, \(p = 0.00369 \approx 0.0037\). We reject the null hypothesis \(H_{0}\) of independence and conclude that for this subtable, School and Origin are related.
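To see the direction of this association, the row percentages of the fourth subtable can be inspected in the same way as for the first pair of columns earlier. This is a small illustrative sketch; tab.D and pct.D are hypothetical names, and the counts are those displayed in table Ag.3.6.D.

```r
# Subtable D: (Biogenic + Environmental) vs. Combination,
# for the combined Eclectic + Medical group and the Psychoanalytic school
tab.D <- matrix(c(116, 84,
                   32, 50),
                nrow = 2, byrow = TRUE,
                dimnames = list(School = c("Ecl+Med", "Psy"),
                                Origin = c("Bio+Env", "Com")))

# Row percentages: each School sums to 100% across the two Origin groups
pct.D <- 100 * prop.table(tab.D, margin = 1)
round(pct.D, 1)
```

About 42% of the Ecl+Med group choose Combination (84 of 200), compared with about 61% of the Psy group (50 of 82).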
If we have followed the Rules for Partitioning correctly in our decomposition of the original \(3 \times 3\) table, the LR \(\chi^{2}\) (\(G^{2}\)) statistics for the subtables form strictly orthogonal components and, therefore, must sum (exactly) to the LR \(\chi^{2}\) statistic obtained from the original table. Also, the degrees of freedom associated with the subtables sum to the degrees of freedom available from the analysis of the original table. Subtables Ag.3.6.A, Ag.3.6.B, Ag.3.6.C and Ag.3.6.D each consume one (1) degree of freedom; they sum to the degrees of freedom (4) available from the analysis of table Ag.3.5.
This next set of code chunks confirms that the LR \(\chi^{2}\) statistics associated with the four subtables sum to the LR \(\chi^{2}\) statistic obtained in the loglinear analysis of the original table Ag.3.5. The LR \(\chi^{2}\) statistic for table Ag.3.5 is
Ag.3.5.loglm$lr
## [1] 23.04
The LR \(\chi^{2}\) statistics for the four subtables are 0.2942, 1.3588, 12.9529, and 8.4303, respectively. These sum to
Ag.3.6.A.loglm$lr + Ag.3.6.B.loglm$lr + Ag.3.6.C.loglm$lr + Ag.3.6.D.loglm$lr
## [1] 23.04
A check to confirm that these two statistics are identical to 12 decimal places follows.
round(Ag.3.6.A.loglm$lr + Ag.3.6.B.loglm$lr + Ag.3.6.C.loglm$lr + Ag.3.6.D.loglm$lr, 12) == round(Ag.3.5.loglm$lr, 12)
## [1] TRUE
We evaluate the original \(3 \times 3\) table of 282 psychiatrists and find a significant relationship between the school of psychiatric thought (eclectic, medical, or psychoanalytic) in which they were trained and their opinions regarding the origin of schizophrenia (biogenic, environmental, or a combination of the two). The first subtable (Ag.3.6.A) compares the eclectic and medical schools of thought on whether the origin of schizophrenia is biogenic or environmental. The second subtable (Ag.3.6.B) compares the eclectic and medical schools of thought on whether the origin of schizophrenia is due to a combination of biogenic and environmental factors as opposed to either of these alone. Neither subtable shows a significant association, and because the two are orthogonal components, their \(G^{2}\) statistics sum to the \(G^{2}\) statistic obtained from the test for independence applied to the first two rows of the original table Ag.3.5. Statistically, the eclectic and medical schools of thought view the origin of schizophrenia similarly. Hence, we combine the frequencies from the eclectic and medical schools of thought and compare them to those of the psychoanalytic school.
The third subtable Ag.3.6.C compares the combined group of eclectic and medical schools to the psychoanalytic school on just the biogenic and environmental classifications of origin. This subtable reveals a significant relationship between school and origin showing the combined group of eclectic and medical psychiatrists as more likely to ascribe the origin of schizophrenia to biogenic (as opposed to environmental) factors than psychiatrists of the psychoanalytic school. The fourth subtable Ag.3.6.D also reveals a significant relationship. The psychoanalytic school psychiatrists are more likely than the eclectic and medical psychiatrists to ascribe the origin of schizophrenia to a combination of biogenic and environmental factors rather than to either of these factors alone.
[Agresti, p. 47] In two-way contingency tables with multinomial sampling, the null hypothesis of statistical independence is \(H_{0}: \pi_{ij} = \pi_{i+} \pi_{+j}\) for all \(i\) and \(j\). To test \(H_{0}\), we could use the Pearson, or the Likelihood Ratio, \(\chi^{2}\) statistic with \(n_{ij}\) in place of \(n_{i}\) and \(m_{ij} = n \pi_{ij} = n \pi_{i+} \pi_{+j}\) in place of \(m_{i}\). Here, \(m_{ij}\) is the expected value of \(n_{ij}\) under the null hypothesis.
The Likelihood Ratio (LR) Chi-Square test value has the form
\[ \mbox{LR} \hspace{3 mm} \chi^{2} = 2 \cdot \sum_{I} \sum_{J} O_{ij} \cdot \log \left( \frac{O_{ij}}{E_{ij}} \right) \]
where \(O_{ij}\) and \(E_{ij}\) are Observed and Expected values, \(\log\) denotes the natural (Napierian) logarithm, and the symbol \(\sum_{I} \sum_{J}\) indicates summation over the \((i,j)\) cells in a given table, \(i = 1, ..., I\) (number of rows in the table) and \(j = 1, ..., J\) (number of columns in the table).
The Pearson Chi-Squared test value has the form
\[ \mbox{Pearson} \hspace{3 mm} \chi^{2} = \sum_{I} \sum_{J} \frac {(O_{ij} - E_{ij})^{2}}{E_{ij}} \]
where \(O_{ij}\) and \(E_{ij}\) are Observed and Expected values, and the symbol \(\sum_{I} \sum_{J}\) indicates summation over the (i,j) cells in a given table, \(i = 1, ..., I\) (number of rows in the table) and \(j = 1, ..., J\) (number of columns in the table).
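Both formulas translate directly into a few lines of base R. The function below is a sketch for illustration; lr.and.pearson() is a hypothetical helper, not a function from MASS or any other package, and it assumes every observed count is positive so that the logarithm is defined.

```r
# Compute the LR and Pearson chi-square statistics for a two-way table
lr.and.pearson <- function(observed) {
  # Expected counts under independence
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  c(G2 = 2 * sum(observed * log(observed / expected)),
    X2 = sum((observed - expected)^2 / expected))
}

# Full 3 x 3 table (Table 3.5)
full <- matrix(c(90, 12, 78,
                 13,  1,  6,
                 19, 13, 50),
               nrow = 3, byrow = TRUE)

# Subtable C: (Eclectic + Medical, Psychoanalytic) by (Biogenic, Environmental)
sub.C <- matrix(c(103, 13,
                   19, 13),
                nrow = 2, byrow = TRUE)

stats.full <- lr.and.pearson(full)
stats.C <- lr.and.pearson(sub.C)

round(rbind(full = stats.full, sub.C = stats.C), 2)
```

The results should match the Likelihood Ratio and Pearson lines that loglm() printed for tables Ag.3.5 and Ag.3.6.C.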
Here is a brief list of some of the Rules for Partitioning the contingency table (Agresti, page 53):
The degrees of freedom for the sub-tables must sum to the degrees of freedom for the original table.
Each cell count in the original table must be a cell count in one and only one sub-table.
Each marginal total of the original table must be a marginal total for one and only one sub-table.
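The four subtables used in this analysis follow a regular pattern: the subtable for row \(i\) and column \(j\) pools all earlier rows and all earlier columns. Assuming that pattern, the construction can be sketched for a general \(I \times J\) table as follows. The helpers g2() and partition.2x2() are hypothetical names introduced here for illustration, and this is only one of the partitions that satisfy the rules above.

```r
# Likelihood ratio statistic for a two-way table (assumes all counts > 0)
g2 <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  2 * sum(tab * log(tab / expected))
}

# Build the (I - 1)(J - 1) 2 x 2 subtables by pooling earlier rows and columns
partition.2x2 <- function(tab) {
  parts <- list()
  for (i in 2:nrow(tab)) {
    for (j in 2:ncol(tab)) {
      earlier.rows <- 1:(i - 1)
      earlier.cols <- 1:(j - 1)
      parts[[paste0("i", i, ".j", j)]] <- matrix(
        c(sum(tab[earlier.rows, earlier.cols]), sum(tab[earlier.rows, j]),
          sum(tab[i, earlier.cols]),            tab[i, j]),
        nrow = 2, byrow = TRUE)
    }
  }
  parts
}

counts <- matrix(c(90, 12, 78,
                   13,  1,  6,
                   19, 13, 50),
                 nrow = 3, byrow = TRUE)

parts <- partition.2x2(counts)
components <- sapply(parts, g2)

# One degree of freedom per subtable; components should sum to the global G^2
length(parts) == (nrow(counts) - 1) * (ncol(counts) - 1)
all.equal(sum(components), g2(counts))
```

For Table 3.5 the list elements i2.j2, i2.j3, i3.j2 and i3.j3 reproduce the counts in Ag.3.6.A, Ag.3.6.B, Ag.3.6.C and Ag.3.6.D, respectively.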
Agresti A, (1990) Categorical Data Analysis, 2nd ed., Wiley, New York.
R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/
Venables WN & Ripley BD (2002) Modern Applied Statistics with S Fourth Edition. Springer, New York. ISBN 0-387-95457-0
Fox J & Weisberg S (2011). An R Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion