First, let’s get some data. The MASS package contains data about 93 cars on sale in the USA in 1993. They are stored in the Cars93 object and include 27 features for each car, some of which are categorical. So let’s load the MASS package and look at the types of vehicles included in Cars93:
library(MASS)
## Warning: package 'MASS' was built under R version 3.6.2
Cars93$Type
## [1] Small Midsize Compact Midsize Midsize Midsize Large Large Midsize
## [10] Large Midsize Compact Compact Sporty Midsize Van Van Large
## [19] Sporty Large Compact Large Small Small Compact Van Midsize
## [28] Sporty Small Large Small Small Compact Sporty Sporty Van
## [37] Midsize Large Small Sporty Sporty Small Compact Small Small
## [46] Sporty Midsize Midsize Midsize Midsize Midsize Large Small Small
## [55] Compact Van Sporty Compact Midsize Sporty Midsize Small Midsize
## [64] Small Compact Van Midsize Compact Midsize Van Large Sporty
## [73] Small Compact Sporty Midsize Large Compact Small Small Small
## [82] Compact Small Small Sporty Midsize Van Small Van Compact
## [91] Sporty Compact Midsize
## Levels: Compact Large Midsize Small Sporty Van
We have 6 types of cars there. The table function tells us how many cars of each type we have:
table(Cars93$Type)
##
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
To convert the counts into fractions, wrap the table in prop.table:
prop.table(table(Cars93$Type))
##
## Compact Large Midsize Small Sporty Van
## 0.17204301 0.11827957 0.23655914 0.22580645 0.15053763 0.09677419
The same with the origin of cars:
table(Cars93$Origin)
##
## USA non-USA
## 48 45
prop.table(table(Cars93$Origin))
##
## USA non-USA
## 0.516129 0.483871
Great, we saw that our dataset contains a similar number of US and non-US cars and that the most prevalent types are Midsize and Small. However, maybe US and non-US cars differ in type?
Let’s look at the types of cars with respect to their origin. We can use table again, but now with two arguments. The first becomes the row variable and the second becomes the column variable:
table(Cars93$Type, Cars93$Origin)
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
The table above shows the joint distribution of two categorical variables (Type and Origin). Such tables are called contingency tables.
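As a small sketch of two equivalent ways to build the same contingency table, the block below labels the dimensions by passing named arguments to table and, alternatively, uses the formula interface of xtabs. The object names tab_named and tab_formula are illustrative only and do not appear elsewhere in this text.

```r
library(MASS)  # provides the Cars93 data set

# Named arguments become dimension labels, so the printed table
# shows which variable is in the rows and which is in the columns.
tab_named <- table(Type = Cars93$Type, Origin = Cars93$Origin)
tab_named

# xtabs builds the same table from a formula and a data frame.
tab_formula <- xtabs(~ Type + Origin, data = Cars93)
tab_formula
```

Both calls count the same cells; the labelled version is mainly easier to read once several tables are on screen.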
(tab1<-table(Cars93$Type, Cars93$Origin))
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
rowSums(tab1)
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
colSums(tab1)
## USA non-USA
## 48 45
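If it helps to see the row and column totals next to the counts themselves, one option is addmargins from base R, which appends a Sum row and a Sum column to a table. This is a minimal sketch; the names tab and with_totals are chosen here for illustration.

```r
library(MASS)  # provides the Cars93 data set

tab <- table(Cars93$Type, Cars93$Origin)

# Append marginal totals: the last column holds row sums,
# the last row holds column sums, and the corner holds the grand total.
with_totals <- addmargins(tab)
with_totals
```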
prop.table(table(Cars93$Type, Cars93$Origin))
##
## USA non-USA
## Compact 0.07526882 0.09677419
## Large 0.11827957 0.00000000
## Midsize 0.10752688 0.12903226
## Small 0.07526882 0.15053763
## Sporty 0.08602151 0.06451613
## Van 0.05376344 0.04301075
# convert into percentage
prop.table(table(Cars93$Type, Cars93$Origin))*100
##
## USA non-USA
## Compact 7.526882 9.677419
## Large 11.827957 0.000000
## Midsize 10.752688 12.903226
## Small 7.526882 15.053763
## Sporty 8.602151 6.451613
## Van 5.376344 4.301075
Notice that this is a joint probability distribution, from which we can see that, for example, about 7.5% of cars are small and of American origin.
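To make the meaning of a joint proportion concrete, the sketch below recomputes one cell by hand: each entry is simply the cell count divided by the total number of cars, so all entries together sum to 1. The variable names tab, joint and n are illustrative.

```r
library(MASS)  # provides the Cars93 data set

tab   <- table(Cars93$Type, Cars93$Origin)
joint <- prop.table(tab)   # no margin argument: divide by the grand total
n     <- sum(tab)          # total number of cars

# The joint proportion of small US cars, computed two ways.
tab["Small", "USA"] / n
joint["Small", "USA"]

# All cells of a joint distribution add up to 1.
sum(joint)
```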
More often, we are interested in the distribution of one variable within groups defined by another. Here, the distribution of car types among US and (separately) non-US cars seems interesting. To get it, we pass the margin argument to the prop.table function. It says whether the grouping variable is in the rows (margin=1) or in the columns (margin=2):
prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100
##
## USA non-USA
## Compact 14.583333 20.000000
## Large 22.916667 0.000000
## Midsize 20.833333 26.666667
## Small 14.583333 31.111111
## Sporty 16.666667 13.333333
## Van 10.416667 8.888889
Now we can easily see that small cars are about twice as frequent in the non-USA part of our dataset (31.1%) as in the USA part (14.6%).
Also notice that the percentages add up to 100 within each column, whereas in the joint distribution table (the one without the margin argument) the whole table summed to 100.
(tab2<-prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100)
##
## USA non-USA
## Compact 14.583333 20.000000
## Large 22.916667 0.000000
## Midsize 20.833333 26.666667
## Small 14.583333 31.111111
## Sporty 16.666667 13.333333
## Van 10.416667 8.888889
colSums(tab2)
## USA non-USA
## 100 100
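For comparison, here is a sketch of the other direction, margin=1, which treats the row variable (Type) as the grouping variable and gives the split between US and non-US cars within each type. The name by_type is illustrative.

```r
library(MASS)  # provides the Cars93 data set

tab <- table(Cars93$Type, Cars93$Origin)

# margin = 1: percentages are computed within each row (car type).
by_type <- prop.table(tab, margin = 1) * 100
round(by_type, 1)

# Now the rows, not the columns, add up to 100.
rowSums(by_type)
```

Read this way, the table shows that every Large car in the dataset is of US origin, which is harder to spot in the column-wise version.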
The most common question that arises from contingency tables is whether the row and column variables are independent. The most basic way to answer it is to run a chi-squared test.
chisq.test(Cars93$Type, Cars93$Origin)
## Warning in chisq.test(Cars93$Type, Cars93$Origin): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Cars93$Type and Cars93$Origin
## X-squared = 14.08, df = 5, p-value = 0.01511
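Apparently they are not: the p-value of about 0.015 is below the usual 0.05 threshold, so we reject independence. However, we also got the "Chi-squared approximation may be incorrect" warning. The test statistic follows a chi-squared distribution only approximately, and the approximation is unreliable when expected cell counts are small, as they are for some cells here.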
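The warning is easier to understand after looking at the expected counts under independence. The sketch below computes them by hand as (row total × column total) / grand total, compares them with the expected component returned by chisq.test, and rebuilds the Pearson statistic from them. The names tab, expected_by_hand, test and x2 are illustrative, and suppressWarnings is used only to keep the output short.

```r
library(MASS)  # provides the Cars93 data set

tab <- table(Cars93$Type, Cars93$Origin)

# Expected count in each cell if Type and Origin were independent.
expected_by_hand <- outer(rowSums(tab), colSums(tab)) / sum(tab)
round(expected_by_hand, 2)

# chisq.test stores the same matrix in its result.
test <- suppressWarnings(chisq.test(tab))
round(test$expected, 2)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
x2 <- sum((tab - expected_by_hand)^2 / expected_by_hand)
x2

# Cells with a small expected count are what make the approximation doubtful.
min(expected_by_hand)
```

The smallest expected counts fall below 5, the conventional rule-of-thumb limit for this approximation.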
Another option is the so-called G-test. Its statistic is also approximately chi-squared distributed, and for small samples this approximation can be closer than the one the chi-squared test uses. For the G-test we can use the GTest function from the DescTools package. The conclusion is the same as before: Type and Origin are not independent.
library(DescTools)
## Warning: package 'DescTools' was built under R version 3.6.2
GTest(Cars93$Type, Cars93$Origin)
##
## Log likelihood ratio (G-test) test of independence without correction
##
## data: Cars93$Type and Cars93$Origin
## G = 18.362, X-squared df = 5, p-value = 0.002526
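To show what the G statistic measures, here is a minimal base-R sketch that recomputes it without DescTools, using the standard formula G = 2 · Σ O · log(O / E) with the same expected counts as the chi-squared test. Cells with an observed count of zero contribute zero, which is an assumption made explicit in the code. The names obs, expected, terms, g, df and p_value are illustrative.

```r
library(MASS)  # provides the Cars93 data set

obs      <- table(Cars93$Type, Cars93$Origin)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Per-cell contributions O * log(O / E); an empty cell gives 0 * -Inf = NaN,
# so it is set to 0 by convention (the limit of x * log(x) as x -> 0).
terms <- obs * log(obs / expected)
terms[obs == 0] <- 0

g  <- 2 * sum(terms)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability of the chi-squared distribution with df degrees of freedom.
p_value <- pchisq(g, df = df, lower.tail = FALSE)

g
df
p_value
```

These values should agree with the GTest output above, up to rounding.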