contingency

How to make a table

First, let’s get some data. The MASS package contains data on 93 cars on sale in the USA in 1993. They are stored in the Cars93 object, which records 27 features for each car, some of them categorical. Let’s load the MASS package and look at the types of vehicles included in Cars93:

library(MASS)

## Warning: package 'MASS' was built under R version 3.6.2

Cars93$Type

##  [1] Small   Midsize Compact Midsize Midsize Midsize Large   Large   Midsize
## [10] Large   Midsize Compact Compact Sporty  Midsize Van     Van     Large  
## [19] Sporty  Large   Compact Large   Small   Small   Compact Van     Midsize
## [28] Sporty  Small   Large   Small   Small   Compact Sporty  Sporty  Van    
## [37] Midsize Large   Small   Sporty  Sporty  Small   Compact Small   Small  
## [46] Sporty  Midsize Midsize Midsize Midsize Midsize Large   Small   Small  
## [55] Compact Van     Sporty  Compact Midsize Sporty  Midsize Small   Midsize
## [64] Small   Compact Van     Midsize Compact Midsize Van     Large   Sporty 
## [73] Small   Compact Sporty  Midsize Large   Compact Small   Small   Small  
## [82] Compact Small   Small   Sporty  Midsize Van     Small   Van     Compact
## [91] Sporty  Compact Midsize
## Levels: Compact Large Midsize Small Sporty Van

There are six types of cars. The table function tells us how many of each type we have:

table(Cars93$Type)

## 
## Compact   Large Midsize   Small  Sporty     Van 
##      16      11      22      21      14       9

To convert the counts into fractions, pass the table to prop.table:

prop.table(table(Cars93$Type))

## 
##    Compact      Large    Midsize      Small     Sporty        Van 
## 0.17204301 0.11827957 0.23655914 0.22580645 0.15053763 0.09677419

The same works for the origin of the cars:

table(Cars93$Origin)

## 
##     USA non-USA 
##      48      45

prop.table(table(Cars93$Origin))

## 
##      USA  non-USA 
## 0.516129 0.483871
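As a quick check of what prop.table does, the sketch below recomputes the fractions by dividing the counts by their total and then rounds them to percentages. The variable names counts, by_hand and pct are illustrative only and do not appear elsewhere in this tutorial.

```r
library(MASS)  # provides Cars93

counts <- table(Cars93$Type)

# prop.table() on a one-way table is the counts divided by their total
by_hand <- counts / sum(counts)
all.equal(as.vector(by_hand), as.vector(prop.table(counts)))

# Percentages rounded to one decimal place, largest group first
pct <- round(100 * by_hand, 1)
sort(pct, decreasing = TRUE)
```

Rounding only at the end keeps the stored fractions exact, which matters if they are reused in later calculations.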

How to make a contingency table

Great, we saw that our dataset contains a similar number of US and non-US cars and that the most prevalent types are Midsize and Small. However, maybe US and non-US cars differ in type?

Let’s look at the types of cars with respect to their origin. We can use table again, this time with two arguments. The first becomes the row variable and the second becomes the column variable:

table(Cars93$Type, Cars93$Origin)

##          
##           USA non-USA
##   Compact   7       9
##   Large    11       0
##   Midsize  10      12
##   Small     7      14
##   Sporty    8       6
##   Van       5       4

The table above shows the joint distribution of two categorical variables (Type and Origin). Such tables are called contingency tables.
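The table printed above does not say which variable is in the rows and which is in the columns. As a minimal sketch, the dimensions can be labelled with the dnn argument of table, or the same table can be built with the formula interface of xtabs. The object names tab_named and tab_formula are illustrative.

```r
library(MASS)  # provides Cars93

# dnn names the dimensions, so the printed table says which variable is which
tab_named <- table(Cars93$Type, Cars93$Origin, dnn = c("Type", "Origin"))
tab_named

# The formula interface of xtabs() builds the same table and takes the
# dimension names from the column names of the data frame
tab_formula <- xtabs(~ Type + Origin, data = Cars93)
tab_formula

# Individual cells can be looked up by level name
tab_named["Small", "non-USA"]
```

Labelled dimensions are mostly a readability aid; the counts are identical either way.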

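How to get marginals from a contingency table

Marginal counts are the row and column totals of the contingency table. We store the table in tab1 and then sum over its rows and columns:

(tab1 <- table(Cars93$Type, Cars93$Origin))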

##          
##           USA non-USA
##   Compact   7       9
##   Large    11       0
##   Midsize  10      12
##   Small     7      14
##   Sporty    8       6
##   Van       5       4

rowSums(tab1)

## Compact   Large Midsize   Small  Sporty     Van 
##      16      11      22      21      14       9

colSums(tab1)

##     USA non-USA 
##      48      45
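If the totals should be shown together with the cell counts, one option is addmargins from base R, which appends a Sum row and a Sum column. The sketch below also uses margin.table, which returns a single marginal at a time. The names tab and with_totals are illustrative.

```r
library(MASS)  # provides Cars93

tab <- table(Cars93$Type, Cars93$Origin, dnn = c("Type", "Origin"))

# addmargins() appends a "Sum" row and a "Sum" column to the table
with_totals <- addmargins(tab)
with_totals

# margin.table() returns one marginal at a time: 1 = rows, 2 = columns
margin.table(tab, 1)
margin.table(tab, 2)
```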

How to get percentages from a contingency table

prop.table(table(Cars93$Type, Cars93$Origin))

##          
##                  USA    non-USA
##   Compact 0.07526882 0.09677419
##   Large   0.11827957 0.00000000
##   Midsize 0.10752688 0.12903226
##   Small   0.07526882 0.15053763
##   Sporty  0.08602151 0.06451613
##   Van     0.05376344 0.04301075

# convert into percentage
prop.table(table(Cars93$Type, Cars93$Origin))*100

##          
##                 USA   non-USA
##   Compact  7.526882  9.677419
##   Large   11.827957  0.000000
##   Midsize 10.752688 12.903226
##   Small    7.526882 15.053763
##   Sporty   8.602151  6.451613
##   Van      5.376344  4.301075

Notice that this is a joint distribution, from which we can see that, for example, about 7.5% of the cars are small and of American origin.

More often, we are interested in the distribution of one variable within groups defined by another. Here, the distribution of car types among US and (separately) non-US cars seems interesting. To get it, we use the margin argument of the prop.table function. It tells whether the grouping variable is in the rows (margin=1) or in the columns (margin=2):

prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100

##          
##                 USA   non-USA
##   Compact 14.583333 20.000000
##   Large   22.916667  0.000000
##   Midsize 20.833333 26.666667
##   Small   14.583333 31.111111
##   Sporty  16.666667 13.333333
##   Van     10.416667  8.888889

Now we can easily see that small cars are about twice as frequent among the non-USA cars as among the USA cars in our dataset.

Also notice that the percentages add up to 100 within each column, whereas in the joint distribution table (the one without the margin argument) it is the whole table that sums to 100.

(tab2 <- prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100)

##          
##                 USA   non-USA
##   Compact 14.583333 20.000000
##   Large   22.916667  0.000000
##   Midsize 20.833333 26.666667
##   Small   14.583333 31.111111
##   Sporty  16.666667 13.333333
##   Van     10.416667  8.888889

colSums(tab2)

##     USA non-USA 
##     100     100
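For contrast, here is a minimal sketch with margin=1, which treats each car type as a group and answers a different question: within a given type, what share of cars is of US or non-US origin? The names tab and by_type are illustrative.

```r
library(MASS)  # provides Cars93

tab <- table(Cars93$Type, Cars93$Origin, dnn = c("Type", "Origin"))

# margin = 1 treats each row (car type) as a group
by_type <- prop.table(tab, margin = 1) * 100
round(by_type, 1)

# Each row now sums to 100 instead of each column
rowSums(by_type)
```

Which margin to use depends on which variable is regarded as the grouping variable, so it is worth checking the row or column sums before interpreting the percentages.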

Chi-squared test

The most common question that arises from contingency tables is whether the row and column variables are independent. The most basic way to answer it is to run a chi-squared test.

chisq.test(Cars93$Type, Cars93$Origin)

## Warning in chisq.test(Cars93$Type, Cars93$Origin): Chi-squared approximation may
## be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  Cars93$Type and Cars93$Origin
## X-squared = 14.08, df = 5, p-value = 0.01511

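Apparently they are not: with a p-value of about 0.015, independence is rejected at the 5% level. However, we also got the "Chi-squared approximation may be incorrect" warning. The chi-squared statistic follows the chi-squared distribution only approximately, and the approximation becomes unreliable when expected cell counts are small. Here the expected counts for the Van type are below 5 in both columns.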
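The warning can be examined by looking at the counts expected under independence, which the object returned by chisq.test stores in its expected component. The sketch below assumes the common rule of thumb that expected counts below 5 are a cause for concern, and shows one possible workaround, a Monte Carlo p-value via the simulate.p.value argument. The seed and the number of replicates B are arbitrary choices made for this illustration.

```r
library(MASS)  # provides Cars93

tab <- table(Cars93$Type, Cars93$Origin, dnn = c("Type", "Origin"))

# The test object stores the counts expected under independence
res <- suppressWarnings(chisq.test(tab))
round(res$expected, 2)

# Cells with an expected count below 5 (a common rule of thumb)
which(res$expected < 5, arr.ind = TRUE)

# One option: a Monte Carlo p-value that does not rely on the approximation
set.seed(1)  # arbitrary seed, only for reproducibility
mc <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
mc$p.value
```

The simulated p-value varies slightly with the seed and with B, so it should be read as an estimate.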

An alternative is the so-called G-test. Its statistic is also approximately chi-squared distributed, but for small samples this approximation is considered closer than the one the chi-squared test uses. For the G-test we can use the GTest function from the DescTools package. The result agrees with the chi-squared test: Type and Origin are not independent.

library(DescTools)

## Warning: package 'DescTools' was built under R version 3.6.2

GTest(Cars93$Type, Cars93$Origin)

## 
##  Log likelihood ratio (G-test) test of independence without correction
## 
## data:  Cars93$Type and Cars93$Origin
## G = 18.362, X-squared df = 5, p-value = 0.002526
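To show where the G statistic comes from, the sketch below recomputes it by hand from the counts in the contingency table, using the usual definition G = 2 * sum(O * log(O / E)), where O are observed and E are expected counts. It does not use DescTools. The object names are illustrative, and treating empty cells as contributing zero is the standard convention assumed here.

```r
# Counts copied from the contingency table above (rows: Type, columns: Origin)
observed <- matrix(
  c(7, 9,  11, 0,  10, 12,  7, 14,  8, 6,  5, 4),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Type   = c("Compact", "Large", "Midsize", "Small", "Sporty", "Van"),
    Origin = c("USA", "non-USA")
  )
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# G = 2 * sum(O * log(O / E)); empty cells contribute 0 by convention
nonzero <- observed > 0
G <- 2 * sum(observed[nonzero] * log(observed[nonzero] / expected[nonzero]))

# Degrees of freedom and p-value from the chi-squared distribution
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(G, df = df, lower.tail = FALSE)

c(G = G, df = df, p_value = p_value)
```

The values should match the GTest output above up to rounding.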
