Before proceeding with this tutorial, we need to install the following required packages:
MASS
rcompanion
lsr
vcd
DescTools
The following chunk checks each package’s availability and installs any that are missing:
required_packages <- c('MASS', 'rcompanion', 'lsr', 'vcd', 'DescTools')
for (p in required_packages) {
  # require() returns FALSE if the package cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p, dep = TRUE)
    library(p, character.only = TRUE)  # load the freshly installed package
  }
}
In the previous tutorial, we saw several measures of correlation between two continuous (numerical) variables. However, none of them is defined when the data are categorical, so something else is needed here.
Here, we will introduce different measures of association between two categorical variables. First, we will introduce Pearson’s chi-squared test, along with two statistics derived from it, Cramer’s V and the contingency coefficient C.
Contingency tables (also called crosstabs or two-way tables) are used in statistics to summarize the relationship between categorical variables. A contingency table is a special type of frequency distribution table in which two variables are shown simultaneously.
R
For this tutorial, we will use the built-in dataset Cars93 from the package MASS. This dataset contains information about 93 cars on sale in the US in 1993, including 27 features (some of which are categorical).
library('MASS')
head(Cars93)
## Manufacturer Model Type Min.Price Price Max.Price MPG.city
## 1 Acura Integra Small 12.9 15.9 18.8 25
## 2 Acura Legend Midsize 29.2 33.9 38.7 18
## 3 Audi 90 Compact 25.9 29.1 32.3 20
## 4 Audi 100 Midsize 30.8 37.7 44.6 19
## 5 BMW 535i Midsize 23.7 30.0 36.2 22
## 6 Buick Century Midsize 14.2 15.7 17.3 22
## MPG.highway AirBags DriveTrain Cylinders EngineSize
## 1 31 None Front 4 1.8
## 2 25 Driver & Passenger Front 6 3.2
## 3 26 Driver only Front 6 2.8
## 4 26 Driver & Passenger Front 6 2.8
## 5 30 Driver only Rear 4 3.5
## 6 31 Driver only Front 4 2.2
## Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1 140 6300 2890 Yes 13.2
## 2 200 5500 2335 Yes 18.0
## 3 172 5500 2280 Yes 16.9
## 4 172 5500 2535 Yes 21.1
## 5 208 5700 2545 Yes 21.1
## 6 110 5200 2565 No 16.4
## Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1 5 177 102 68 37 26.5
## 2 5 195 115 71 38 30.0
## 3 5 180 102 67 37 28.0
## 4 6 193 106 70 37 31.0
## 5 4 186 109 69 39 27.0
## 6 6 189 105 69 41 28.0
## Luggage.room Weight Origin Make
## 1 11 2705 non-USA Acura Integra
## 2 15 3560 non-USA Acura Legend
## 3 14 3375 non-USA Audi 90
## 4 17 3405 non-USA Audi 100
## 5 13 3640 non-USA BMW 535i
## 6 16 2880 USA Buick Century
To see how many types of car there are and how many cars of each type, we use the table function. To convert the counts into fractions, we can use the function prop.table.
table(Cars93$Type)
##
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
prop.table(table(Cars93$Type))
##
## Compact Large Midsize Small Sporty Van
## 0.17204301 0.11827957 0.23655914 0.22580645 0.15053763 0.09677419
Here, as an example of a contingency table, we will look at the types of cars with respect to their origin. To do this, we can use the function table again, but now with two arguments.
table(Cars93$Type, Cars93$Origin)
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
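The same two-way table can be extended with margin totals or converted to proportions; here is a small sketch using the base-R helpers addmargins and prop.table (the margin argument selects row-wise or column-wise normalization):
tab <- table(Cars93$Type, Cars93$Origin)
addmargins(tab)              # append row and column totals
prop.table(tab, margin = 1)  # proportions within each Type (rows sum to 1)
prop.table(tab, margin = 2)  # proportions within each Origin (columns sum to 1)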
Pearson’s chi-squared test (\(\chi^2\)) is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is the most widely used of many chi-squared tests.
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
Pearson’s chi-squared test is used to assess three types of comparison: goodness of fit, homogeneity, and independence. For all three tests, the computation is based on the same statistic.
The value of the test statistic is
\[ \chi^2 = \sum_{i = 1}^n \frac{(O_i - E_i)^2}{E_i} = N \sum_{i = 1}^n \frac{(O_i/N - p_i)^2}{p_i} \]
where \(O_i\) is the observed count in cell \(i\), \(E_i = N p_i\) is the expected count under the null hypothesis, \(p_i\) is the theoretical probability of cell \(i\), \(N\) is the total number of observations, and \(n\) is the number of cells in the table.
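To make the formula concrete, here is a minimal sketch that computes the statistic by hand for the Type/Origin table; under the independence hypothesis, the expected count of a cell is its row total times its column total, divided by \(N\):
O <- table(Cars93$Type, Cars93$Origin)   # observed counts
N <- sum(O)
E <- outer(rowSums(O), colSums(O)) / N   # expected counts under independence
sum((O - E)^2 / E)                       # ~14.08, matching chisq.test() below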
R
We can perform the chi-squared test in R using the function chisq.test().
chisq.test(Cars93$Type, Cars93$Origin)
## Warning in chisq.test(Cars93$Type, Cars93$Origin): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: Cars93$Type and Cars93$Origin
## X-squared = 14.08, df = 5, p-value = 0.01511
Here, we have a \(\chi^2\) value of \(14.08\). Since the p-value is less than the significance level of \(0.05\), we can reject the null hypothesis of independence and conclude that the two variables are, in fact, associated.
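The warning emitted above (“Chi-squared approximation may be incorrect”) typically appears when some expected cell counts are small (a common rule of thumb is below 5), which makes the chi-squared approximation unreliable. We can inspect the expected counts directly:
res <- chisq.test(Cars93$Type, Cars93$Origin)
res$expected           # expected counts under independence
any(res$expected < 5)  # TRUE: the Van row, for instance, has expected counts below 5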
A problem with Pearson’s \(\chi^2\) statistic is that its maximum value depends on the sample size and on the dimensions of the contingency table, so values obtained in different situations are not directly comparable. To overcome this problem, the statistic can be standardized to lie between 0 and 1, making it independent of the sample size as well as the dimension of the contingency table. Several coefficients have been defined for this purpose, and we will consider some of them in the following sections.
Originally, Pearson’s contingency coefficient is calculated as:
\[ C = \sqrt{\frac{\chi^2}{\chi^2 + n}} \]
with \(n\) being the total number of observations. However, \(C\) cannot reach its theoretical maximum of 1, so a corrected version is often used:
\[ C_{corr} = \frac{C}{C_{max}} = \sqrt{\frac{\min(k,l)}{\min(k,l) - 1}} \sqrt{\frac{\chi^2}{\chi^2 + n}} \]
with
\[ C = \sqrt{\frac{\chi^2}{\chi^2 + n}} \text{ and } C_{max} = \sqrt{\frac{\min(k,l) - 1}{\min(k,l)}} \]
where \(k\) and \(l\) are the numbers of rows and columns of the contingency table.
R
To calculate the (corrected) contingency coefficient, we can use the function ContCoef() from the package DescTools. This function calculates either the original or the corrected contingency coefficient depending on whether the parameter correct is set to FALSE or TRUE.
library('DescTools')
ContCoef(Cars93$Type, Cars93$Origin, correct = FALSE)
## [1] 0.3626145
ContCoef(Cars93$Type, Cars93$Origin, correct = TRUE)
## [1] 0.5128144
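We can verify both values directly from the formulas above, reusing the chi-squared statistic:
chi2 <- unname(chisq.test(Cars93$Type, Cars93$Origin)$statistic)
n <- nrow(Cars93)             # 93 observations
k <- nlevels(Cars93$Type)     # 6 row categories
l <- nlevels(Cars93$Origin)   # 2 column categories
C <- sqrt(chi2 / (chi2 + n))           # ~0.3626
C / sqrt((min(k, l) - 1) / min(k, l))  # corrected: ~0.5128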
For a \(k \times l\) contingency table, \(n(\min(k,l) - 1)\) is the maximal value of the \(\chi^2\) statistic, so dividing \(\chi^2\) by this maximal value yields a scaled version with maximum value \(1\). This idea is used by Cramer’s \(V\), defined as follows:
\[ V = \sqrt{\frac{\chi^2}{n(\min(k,l) - 1)}} \]
R
To calculate Cramer’s V in R, we can use the function cramerV() from the package rcompanion. In contrast to the function cramersV() from the lsr package, cramerV() offers an option to correct for bias. Besides these, we can also use the functions assocstats and xtabs from the package vcd. For example:
library('rcompanion')
cramerV(Cars93$Type, Cars93$Origin, bias.correct = FALSE)
## Cramer V
## 0.3891
and
library('lsr')
cramersV(Cars93$Type, Cars93$Origin)
## Warning in chisq.test(...): Chi-squared approximation may be incorrect
## [1] 0.3890967
and
library('vcd')
## Loading required package: grid
assocstats(xtabs(~Cars93$Type + Cars93$Origin))
## X^2 df P(> X^2)
## Likelihood Ratio 18.362 5 0.0025255
## Pearson 14.080 5 0.0151101
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.363
## Cramer's V : 0.389
yields the same result for Cramer’s V, approximately 0.389. The bias-corrected version from the rcompanion package gives:
cramerV(Cars93$Type, Cars93$Origin, bias.correct = TRUE)
## Cramer V
## 0.3132
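As a sanity check, the uncorrected value can also be reproduced directly from the definition of \(V\):
tab <- table(Cars93$Type, Cars93$Origin)
chi2 <- unname(chisq.test(tab)$statistic)
sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))  # ~0.3891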
Several other coefficients have been defined for measuring the strength of association between two discrete (categorical) variables, some of which also build on the chi-squared statistic; the phi coefficient is one example. Note that the phi coefficient is only defined for \(2 \times 2\) contingency tables.
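Since Man.trans.avail and Origin in Cars93 are both binary, they form a \(2 \times 2\) table on which the phi coefficient is defined; a quick sketch using assocstats from vcd again:
library('vcd')
assocstats(xtabs(~ Man.trans.avail + Origin, data = Cars93))  # Phi is reported for 2 x 2 tables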
With Cramer’s V, we lose valuable information due to its symmetry. For example, consider the following simple dataset:
| x | y |
|---|---|
| A | m |
| A | n |
| A | m |
| B | p |
| B | p |
| B | q |
We can see that if the value of \(y\) is known, the value of \(x\) is fully determined; but even when the value of \(x\) is known, we cannot determine the value of \(y\). This asymmetry is lost when we use Cramer’s V. As a result, we need another coefficient that preserves this information for any pair of variables, and this is exactly what Theil’s U offers.
The uncertainty coefficient (also called the entropy coefficient or Theil’s U) is a measure of nominal association based on the concept of information entropy.
Suppose we have samples of two discrete random variables, \(X\) and \(Y\). From the joint distribution \(p_{X,Y}(x,y)\) we can calculate the conditional distributions \(p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}\) and \(p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)}\); by computing various entropies, we can determine the degree of association between the two variables.
The entropy of a single distribution is given as:
\[ H(X) = -\sum_{x} p_X(x)\log p_X(x) \] while the conditional entropy is given as:
\[ H(X|Y) = -\sum_{x,y} p_{X,Y}(x,y) \log p_{X|Y}(x|y) \]
The uncertainty coefficient or proficiency is defined as:
\[ U(X|Y) = \frac{H(X) - H(X|Y)}{H(X)} = \frac{I(X;Y)}{H(X)} \] which tells us: given \(Y\), what fraction of the bits of \(X\) can we predict.
As mentioned previously, the uncertainty coefficient is not symmetric with respect to the roles of \(X\) and \(Y\). The roles can be reversed, and a symmetric measure can then be defined as a weighted average of the two:
\[ U(X,Y) = \frac{H(X)U(X|Y) + H(Y)U(Y|X)}{H(X) + H(Y)} = 2 \left[ \frac{H(X) + H(Y) - H(X,Y)}{H(X) + H(Y)} \right] \]
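Before turning to a package, the definitions above can be checked by hand; here is a minimal sketch computing the entropies and all three uncertainty coefficients for Type and Origin directly from the joint distribution:
p_xy <- prop.table(table(Cars93$Type, Cars93$Origin))  # joint distribution
p_x <- rowSums(p_xy)                                   # marginal of Type
p_y <- colSums(p_xy)                                   # marginal of Origin
H <- function(p) -sum(p[p > 0] * log(p[p > 0]))        # entropy; zero cells contribute nothing
I_xy <- H(p_x) + H(p_y) - H(p_xy)                      # mutual information I(X;Y)
I_xy / H(p_x)                 # U(Type | Origin), ~0.0566
I_xy / H(p_y)                 # U(Origin | Type), ~0.1425
2 * I_xy / (H(p_x) + H(p_y))  # symmetric U, ~0.0810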
R
To calculate the uncertainty coefficient in R, we will use the function UncertCoef from the package DescTools. It measures the proportion of uncertainty in the column variable that is explained by the row variable, and has interfaces for a table, a matrix, a data.frame and for single vectors:
UncertCoef(x, y = NULL, direction = c("symmetric", "row", "column"), conf.level = NA, p.zero.correction = 1/sum(x)^2, ...)
with
- x: a numeric vector, a factor, matrix or data frame
- y: NULL (default) or a vector, an ordered vector, matrix or data frame with dimensions compatible with x
- direction: direction of the calculation; one of symmetric (the default), row or column, where row calculates UncertCoef(R|C) (“column dependent”) and column calculates UncertCoef(C|R)
- conf.level: confidence level of the interval; if set to NA (the default), no confidence interval will be calculated
- p.zero.correction: slightly nudges zero values so that the logarithm can be calculated

Coming back to the previous example, we will calculate the association between Type and Origin in the dataset Cars93. First, we will calculate \(U(Origin|Type)\). Noting that the function UncertCoef accepts different input forms (vectors, matrices, tables), we can call it as follows:
UncertCoef(table(Cars93$Type, Cars93$Origin), direction = "column")
## [1] 0.1425076
UncertCoef(Cars93$Type, Cars93$Origin, direction = "column")
## [1] 0.1425076
which yield the same result. In the first call, we use the function table to create a contingency table of the two variables; the direction "column" requests the uncertainty coefficient of the column variable (Origin) given the row variable (Type), i.e. \(U(Origin|Type)\). Similarly, to calculate \(U(Type|Origin)\), we can do as follows:
UncertCoef(table(Cars93$Type, Cars93$Origin), direction = "row")
## [1] 0.05661699
UncertCoef(Cars93$Type, Cars93$Origin, direction = "row")
## [1] 0.05661699
Finally, to calculate the symmetric measure of the uncertainty coefficient, we can write
UncertCoef(table(Cars93$Type, Cars93$Origin), direction = "symmetric")
## [1] 0.08103823
UncertCoef(Cars93$Type, Cars93$Origin, direction = "symmetric")
## [1] 0.08103823
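Coming back to the toy dataset from the motivation, we can see the asymmetry in action (the comments give the values implied by the definitions; DescTools applies a small zero-cell correction, so results may differ in the last digits):
x <- c('A', 'A', 'A', 'B', 'B', 'B')
y <- c('m', 'n', 'm', 'p', 'p', 'q')
UncertCoef(x, y, direction = "row")     # U(x|y): y determines x exactly, so essentially 1
UncertCoef(x, y, direction = "column")  # U(y|x): ~0.52, since x does not determine y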
Because it is asymmetric, this coefficient preserves information about the relations between features that the symmetric measures discard.
In these two tutorials, we have seen measures of correlation between two continuous (numerical) variables and measures of association between two discrete (categorical) variables. In the next tutorial, we will mix things up, i.e., consider the correlation between a continuous feature and a categorical feature.
Signorell, Andri, et al. 2019. DescTools: Tools for Descriptive Statistics. https://cran.r-project.org/package=DescTools.
Heumann, Christian, Michael Schomaker, and Shalabh. 2016. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. Springer. doi:10.1007/978-3-319-46162-5.
Deryto, Lukasz. 2018. “Contingency Tables in R.” DataCamp, December. https://www.datacamp.com/community/tutorials/contingency-tables-r.
Outside Two Standard Deviations. 2018. “An Overview of Correlation Measures Between Categorical and Continuous Variables.” Medium, September. https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365.
Mangiafico, Salvatore. 2019. An R Companion for the Handbook of Biological Statistics.
Meyer, David, Achim Zeileis, and Kurt Hornik. 2017. vcd: Visualizing Categorical Data.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth edition. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.
Wikipedia contributors. 2019a. “Cramér’s V.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Cram%C3%A9r%27s_V&oldid=930122832.
———. 2019b. “Pearson’s Chi-Squared Test.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Pearson%27s_chi-squared_test&oldid=929754098.
———. 2019c. “Uncertainty Coefficient.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Uncertainty_coefficient&oldid=918027595.
Zychlinski, Shaked. 2018. “The Search for Categorical Correlation.” Towards Data Science, February. https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9.