Before proceeding with this tutorial, we need to install the following required packages:
MASS
rcompanion
lsr
vcd
DescTools
The following chunk checks each package’s availability and installs any that are missing:
required_packages <- c('MASS', 'rcompanion', 'lsr', 'vcd', 'DescTools')
for (p in required_packages) {
  # require() returns FALSE if the package cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p, dep = TRUE)
    library(p, character.only = TRUE)  # load the freshly installed package
  }
}
In the previous tutorial, we saw several measures of correlation between two continuous (numerical) variables. However, none of them is defined when the data are categorical, so something else is needed here.
Here, we will introduce different measures of association between two categorical variables. First, we will introduce Pearson’s chi-squared test, along with two statistics derived from it, Cramer’s V and the contingency coefficient C.
Contingency tables (also called crosstabs or two-way tables) are used in statistics to summarize the relationship between categorical variables. A contingency table is a special type of frequency distribution table in which two variables are shown simultaneously.
R
For this tutorial, we will use the built-in dataset Cars93 from the package MASS. This dataset contains information about 93 cars on sale in the US in 1993, including 27 features (some of which are categorical).
library('MASS')
head(Cars93)
## Manufacturer Model Type Min.Price Price Max.Price MPG.city
## 1 Acura Integra Small 12.9 15.9 18.8 25
## 2 Acura Legend Midsize 29.2 33.9 38.7 18
## 3 Audi 90 Compact 25.9 29.1 32.3 20
## 4 Audi 100 Midsize 30.8 37.7 44.6 19
## 5 BMW 535i Midsize 23.7 30.0 36.2 22
## 6 Buick Century Midsize 14.2 15.7 17.3 22
## MPG.highway AirBags DriveTrain Cylinders EngineSize
## 1 31 None Front 4 1.8
## 2 25 Driver & Passenger Front 6 3.2
## 3 26 Driver only Front 6 2.8
## 4 26 Driver & Passenger Front 6 2.8
## 5 30 Driver only Rear 4 3.5
## 6 31 Driver only Front 4 2.2
## Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1 140 6300 2890 Yes 13.2
## 2 200 5500 2335 Yes 18.0
## 3 172 5500 2280 Yes 16.9
## 4 172 5500 2535 Yes 21.1
## 5 208 5700 2545 Yes 21.1
## 6 110 5200 2565 No 16.4
## Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1 5 177 102 68 37 26.5
## 2 5 195 115 71 38 30.0
## 3 5 180 102 67 37 28.0
## 4 6 193 106 70 37 31.0
## 5 4 186 109 69 39 27.0
## 6 6 189 105 69 41 28.0
## Luggage.room Weight Origin Make
## 1 11 2705 non-USA Acura Integra
## 2 15 3560 non-USA Acura Legend
## 3 14 3375 non-USA Audi 90
## 4 17 3405 non-USA Audi 100
## 5 13 3640 non-USA BMW 535i
## 6 16 2880 USA Buick Century
To see how many types of car there are and how many cars of each type, we use the table function. To convert the counts into fractions, we can use the function prop.table.
table(Cars93$Type)
##
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
prop.table(table(Cars93$Type))
##
## Compact Large Midsize Small Sporty Van
## 0.17204301 0.11827957 0.23655914 0.22580645 0.15053763 0.09677419
Here, as an example of a contingency table, we will look at the types of cars with respect to their origin. To do this, we can use the function table again, but now with two arguments.
table(Cars93$Type, Cars93$Origin)
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
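The same two-way table can be extended with margin totals or converted to proportions; here is a small sketch using the base-R helpers addmargins and prop.table (the margin argument selects row-wise or column-wise normalization):
tab <- table(Cars93$Type, Cars93$Origin)
addmargins(tab)              # append row and column totals
prop.table(tab, margin = 1)  # proportions within each Type (rows sum to 1)
prop.table(tab, margin = 2)  # proportions within each Origin (columns sum to 1)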
Pearson’s chi-squared test (\(\chi^2\)) is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is the most widely used of many chi-squared tests.
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
Pearson’s chi-squared test is used to assess three types of comparison: goodness of fit, homogeneity, and independence. For all three tests, the computation is based on the same statistic.
The value of the test statistic is
\[ \chi^2 = \sum_{i = 1}^n \frac{(O_i - E_i)^2}{E_i} = N \sum_{i = 1}^n \frac{(O_i/N - p_i)^2}{p_i} \]
where \(O_i\) is the observed count in cell \(i\), \(E_i = N p_i\) is the expected count under the null hypothesis, \(p_i\) is the theoretical probability of cell \(i\), \(N\) is the total number of observations, and \(n\) is the number of cells in the table.
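To make the formula concrete, here is a minimal sketch that computes the statistic by hand for the Type/Origin table; under the independence hypothesis, the expected count of a cell is its row total times its column total, divided by \(N\):
O <- table(Cars93$Type, Cars93$Origin)   # observed counts
N <- sum(O)
E <- outer(rowSums(O), colSums(O)) / N   # expected counts under independence
sum((O - E)^2 / E)                       # ~14.08, matching chisq.test() below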
R
We can perform the chi-squared test in R using the function chisq.test().
chisq.test(Cars93$Type, Cars93$Origin)
## Warning in chisq.test(Cars93$Type, Cars93$Origin): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: Cars93$Type and Cars93$Origin
## X-squared = 14.08, df = 5, p-value = 0.01511
Here, we have a \(\chi^2\) value of \(14.08\). Since the p-value is less than the significance level of \(0.05\), we can reject the null hypothesis of independence and conclude that the two variables are, in fact, associated.
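The warning emitted above (“Chi-squared approximation may be incorrect”) typically appears when some expected cell counts are small (a common rule of thumb is below 5), which makes the chi-squared approximation unreliable. We can inspect the expected counts directly:
res <- chisq.test(Cars93$Type, Cars93$Origin)
res$expected           # expected counts under independence
any(res$expected < 5)  # TRUE: the Van row, for instance, has expected counts below 5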
A problem with Pearson’s \(\chi^2\) statistic is that its maximum value depends on the sample size and on the dimensions of the contingency table, so values obtained in different situations are not directly comparable. To overcome this problem, the statistic can be standardized to lie between 0 and 1, making it independent of the sample size as well as the dimension of the contingency table. Several coefficients have been defined for this purpose, and we will consider some of them in the following sections.
Originally, Pearson’s contingency coefficient is calculated as:
\[ C = \sqrt{\frac{\chi^2}{\chi^2 + n}} \]
with \(n\) being the total number of observations. However, \(C\) cannot reach its theoretical maximum of 1, so a corrected version is often used:
\[ C_{corr} = \frac{C}{C_{max}} = \sqrt{\frac{\min(k,l)}{\min(k,l) - 1}} \sqrt{\frac{\chi^2}{\chi^2 + n}} \]
with
\[ C = \sqrt{\frac{\chi^2}{\chi^2 + n}} \text{ and } C_{max} = \sqrt{\frac{\min(k,l) - 1}{\min(k,l)}} \]
where \(k\) and \(l\) are the numbers of rows and columns of the contingency table.
R
To calculate the (corrected) contingency coefficient, we can use the function ContCoef() from the package DescTools. This function calculates either the original or the corrected contingency coefficient depending on whether the parameter correct is set to FALSE or TRUE.
library('DescTools')
ContCoef(Cars93$Type, Cars93$Origin, correct = FALSE)
## [1] 0.3626145
ContCoef(Cars93$Type, Cars93$Origin, correct = TRUE)
## [1] 0.5128144
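We can verify both values directly from the formulas above, reusing the chi-squared statistic:
chi2 <- unname(chisq.test(Cars93$Type, Cars93$Origin)$statistic)
n <- nrow(Cars93)             # 93 observations
k <- nlevels(Cars93$Type)     # 6 row categories
l <- nlevels(Cars93$Origin)   # 2 column categories
C <- sqrt(chi2 / (chi2 + n))           # ~0.3626
C / sqrt((min(k, l) - 1) / min(k, l))  # corrected: ~0.5128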
For a \(k \times l\) contingency table, \(n(\min(k,l) - 1)\) is the maximal value of the \(\chi^2\) statistic, so dividing \(\chi^2\) by this maximal value yields a scaled version with maximum value \(1\). This idea is used by Cramer’s \(V\), defined as follows:
\[ V = \sqrt{\frac{\chi^2}{n(\min(k,l) - 1)}} \]
R
To calculate Cramer’s V in R, we can use the function cramerV() from the package rcompanion. In contrast to the function cramersV() from the lsr package, cramerV() offers an option to correct for bias. Besides these, we can also use the functions assocstats and xtabs from the package vcd. For example:
library('rcompanion')
cramerV(Cars93$Type, Cars93$Origin, bias.correct = FALSE)
## Cramer V
## 0.3891
and
library('lsr')
cramersV(Cars93$Type, Cars93$Origin)
## Warning in chisq.test(...): Chi-squared approximation may be incorrect
## [1] 0.3890967
and
library('vcd')
## Loading required package: grid
assocstats(xtabs(~Cars93$Type + Cars93$Origin))
## X^2 df P(> X^2)
## Likelihood Ratio 18.362 5 0.0025255
## Pearson 14.080 5 0.0151101
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.363
## Cramer's V : 0.389
yields the same result for Cramer’s V, approximately 0.389. The bias-corrected version from the rcompanion package gives:
cramerV(Cars93$Type, Cars93$Origin, bias.correct = TRUE)
## Cramer V
## 0.3132
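As a sanity check, the uncorrected value can also be reproduced directly from the definition of \(V\):
tab <- table(Cars93$Type, Cars93$Origin)
chi2 <- unname(chisq.test(tab)$statistic)
sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))  # ~0.3891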
Several other coefficients have been defined for measuring the strength of association between two discrete (categorical) variables, some of which also build on the chi-squared statistic; the phi coefficient is one example. Note that the phi coefficient is only defined for \(2 \times 2\) contingency tables.
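Since Man.trans.avail and Origin in Cars93 are both binary, they form a \(2 \times 2\) table on which the phi coefficient is defined; a quick sketch using assocstats from vcd again:
library('vcd')
assocstats(xtabs(~ Man.trans.avail + Origin, data = Cars93))  # Phi is reported for 2 x 2 tables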
With Cramer’s V, we lose valuable information due to its symmetry. For example, consider the following simple dataset:
| x | y |
|---|---|
| A | m |
| A | n |
| A | m |
| B | p |
| B | p |
| B | q |
We can see that if the value of \(y\) is known, the value of \(x\) is fully determined; but even when the value of \(x\) is known, we cannot determine the value of \(y\). This asymmetry is lost when we use Cramer’s V. As a result, we need another coefficient that preserves this information for any pair of variables, and this is exactly what Theil’s U offers.
The uncertainty coefficient (also called the entropy coefficient or Theil’s U) is a measure of nominal association based on the concept of information entropy.
Suppose we have samples of two discrete random variables, \(X\) and \(Y\). From the joint distribution \(p_{X,Y}(x,y)\) we can calculate the conditional distributions \(p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}\) and \(p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)}\); by computing various entropies, we can determine the degree of association between the two variables.
The entropy of a single distribution is given as:
\[ H(X) = -\sum_{x} p_X(x)\log p_X(x) \] while the conditional entropy is given as:
\[ H(X|Y) = -\sum_{x,y} p_{X,Y}(x,y) \log p_{X|Y}(x|y) \]
The uncertainty coefficient or proficiency is defined as:
\[ U(X|Y) = \frac{H(X) - H(X|Y)}{H(X)} = \frac{I(X;Y)}{H(X)} \] which tells us: given \(Y\), what fraction of the bits of \(X\) can we predict.
As mentioned previously, the uncertainty coefficient is not symmetric with respect to the roles of \(X\) and \(Y\). The roles can be reversed, and a symmetric measure can then be defined as a weighted average of the two:
\[ U(X,Y) = \frac{H(X)U(X|Y) + H(Y)U(Y|X)}{H(X) + H(Y)} = 2 \left[ \frac{H(X) + H(Y) - H(X,Y)}{H(X) + H(Y)} \right] \]
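Before turning to a package, the definitions above can be checked by hand; here is a minimal sketch computing the entropies and all three uncertainty coefficients for Type and Origin directly from the joint distribution:
p_xy <- prop.table(table(Cars93$Type, Cars93$Origin))  # joint distribution
p_x <- rowSums(p_xy)                                   # marginal of Type
p_y <- colSums(p_xy)                                   # marginal of Origin
H <- function(p) -sum(p[p > 0] * log(p[p > 0]))        # entropy; zero cells contribute nothing
I_xy <- H(p_x) + H(p_y) - H(p_xy)                      # mutual information I(X;Y)
I_xy / H(p_x)                 # U(Type | Origin), ~0.0566
I_xy / H(p_y)                 # U(Origin | Type), ~0.1425
2 * I_xy / (H(p_x) + H(p_y))  # symmetric U, ~0.0810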
R
To calculate the uncertainty coefficient in R, we will use the function UncertCoef from the package DescTools. It measures the proportion of uncertainty in the column variable that is explained by the row variable, and has interfaces for a table, a matrix, a data.frame and for single vectors:
UncertCoef(x, y = NULL, direction = c("symmetric", "row", "column"), conf.level = NA, p.zero.correction = 1/sum(x)^2, ...)
with
- x: a numeric vector, a factor, matrix or data frame
- y: NULL (default) or a vector, an ordered vector, matrix or data frame with dimensions compatible with x
- direction: direction of the calculation; one of symmetric (the default), row or column, where row calculates UncertCoef(R|C) (“column dependent”) and column calculates UncertCoef(C|R)
- conf.level: confidence level of the interval; if set to NA (the default), no confidence interval will be calculated
- p.zero.correction: slightly nudges zero values so that the logarithm can be calculated

Coming back to the previous example, we will calculate the association between Type and Origin in the dataset Cars93. First, we will calculate \(U(Origin|Type)\). Noting that the function UncertCoef accepts different input forms (vectors, matrices, tables), we can call it as follows:
UncertCoef(table(Cars93$Type, Cars93$Origin), direction = "column")
## [1] 0.1425076
UncertCoef(Cars93$Type, Cars93$Origin, direction = "column")
## [1] 0.1425076
which yield the same result. In the first call, we use the function table to create a contingency table of the two variables; the direction "column" requests the uncertainty coefficient of the column variable (Origin) given the row variable (Type), i.e. \(U(Origin|Type)\). Similarly, to calculate \(U(Type|Origin)\), we can do as follows:
UncertCoef(table(Cars93$Type, Cars93$Origin), direction = "row")
## [1] 0.05661699
UncertCoef(Cars93$Type, Cars93$Origin, direction = "row")
## [1] 0.05661699
Finally, to calculate the symmetric measure of the uncertainty coefficient, we can write
UncertCoef(table(Cars93$Type, Cars93$Origin), direction = "symmetric")
## [1] 0.08103823
UncertCoef(Cars93$Type, Cars93$Origin, direction = "symmetric")
## [1] 0.08103823
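Coming back to the toy dataset from the motivation, we can see the asymmetry in action (the comments give the values implied by the definitions; DescTools applies a small zero-cell correction, so results may differ in the last digits):
x <- c('A', 'A', 'A', 'B', 'B', 'B')
y <- c('m', 'n', 'm', 'p', 'p', 'q')
UncertCoef(x, y, direction = "row")     # U(x|y): y determines x exactly, so essentially 1
UncertCoef(x, y, direction = "column")  # U(y|x): ~0.52, since x does not determine y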
Because it is asymmetric, this coefficient preserves information about the relations between features that the symmetric measures discard.
In these two tutorials, we have seen measures of correlation between two continuous (numerical) variables and measures of association between two discrete (categorical) variables. In the next tutorial, we will mix things up, i.e., consider the correlation between a continuous feature and a categorical feature.
Signorell, Andri, et al. 2019. DescTools: Tools for Descriptive Statistics. https://cran.r-project.org/package=DescTools.
Heumann, Christian, Michael Schomaker, and Shalabh. 2016. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. Springer. doi:10.1007/978-3-319-46162-5.
Deryto, Lukasz. 2018. “Contingency Tables in R.” DataCamp, December. https://www.datacamp.com/community/tutorials/contingency-tables-r.
Outside Two Standard Deviations. 2018. “An Overview of Correlation Measures Between Categorical and Continuous Variables.” Medium, September. https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365.
Mangiafico, Salvatore. 2019. An R Companion for the Handbook of Biological Statistics.
Meyer, David, Achim Zeileis, and Kurt Hornik. 2017. vcd: Visualizing Categorical Data.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth edition. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.
Wikipedia contributors. 2019a. “Cramér’s V.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Cram%C3%A9r%27s_V&oldid=930122832.
———. 2019b. “Pearson’s Chi-Squared Test.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Pearson%27s_chi-squared_test&oldid=929754098.
———. 2019c. “Uncertainty Coefficient.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Uncertainty_coefficient&oldid=918027595.
Zychlinski, Shaked. 2018. “The Search for Categorical Correlation.” Towards Data Science, February. https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9.