Welcome. If you’re reading this, you probably attended a workshop or talk I’ve given on this topic. If you’re here by accident, let me fill you in.
I wanted to write a plain English, quick start guide for anyone working with data sets where many, most or all of the features are discrete (nominal, ordinal, frequency, counts, events). The reason I wanted to write it is that nothing like it existed when I really needed it some years ago. I had to strain every synapse to get to grips with the subject at the time. Perhaps that was a blessing in disguise, as the work paid off in many more ways than I had expected. Anyway, I hope this helps anyone out there who just wants to develop some intuition in this area of statistics, without getting bogged down in notation, theory and proofs.
These notes are for you if:
I’m also trying to keep the code really simple, so someone who wants to learn just enough R can pick up the basics by osmosis.
Note, I hope this is a living document which I can expand as time allows. Check back in a few months. The last date I updated should always be visible in the header.
The focus is on visual analytics, not theories and formulas! My intention is to turn you on to this topic, not send you running.
If you want to know more you can follow the links to the vignette, video and book that have been the most useful sources of information for me.
When I say ‘plain English,’ there’s no offence intended towards Michael Friendly, whose work is extremely accessible and well explained. I wouldn’t be doing this without his book in hand. What I want to do here is present a version of things that is as math and formula free as I can make it.
The examples are all in R. These are the libraries we’ll be using. That’s all I really want to say about R.
# I'm going to load all the R libraries here
# If you stop, quit R and come back later...
# ...you need to re-run this chunk.
library(knitr)
library(lattice)
library(ggplot2)
library(vcd)
library(vcdExtra)
library(ca)
library(randomForest)
library(gmodels)
library(mvtnorm)
library(ROSE)
The code examples are mostly based on material contained in the book Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data by Michael Friendly. Michael Friendly is a pioneer in this field and has contributed to the development of modules and libraries for SAS and R.
You can also find a much shorter tutorial on these topics in the long form help for the “vcd” package for R: vcd vignette
This video is a much more in depth lecture, by Michael Friendly, on topics covered in the book.
Any data professional with some practical experience of Machine Learning (ML), possibly crossing over from another discipline. I assume you have an understanding of the ML project life-cycle: data preparation, exploratory analysis, model building and evaluation, and reporting results. I also assume some basic statistical knowledge, the kind most people will have met in high school.
You’ll get the most benefit if you already know some R. If you don’t, maybe this will help you to learn some. The code examples are mostly just a few lines each.
Bear in mind, this is just an overview! We will only introduce formal statistics and tests where necessary, and will not develop any mathematical notation, theory or proofs. There is lots of material freely available prepared by real experts, rather than people like me. Head over to www.google.com now, if that’s what you’re looking for.
For now, I am not covering explanatory and predictive modeling. You are probably already familiar with logistic regression for binary classification. There are related or equivalent techniques for other kinds of classification (ordinal, polytomous/multi-class and counts). Even though these topics fall squarely under both DDA and ML, it’s a massive topic that requires a whole course/module to explain. I’ll get around to it, but I don’t have time right now.
I’m not going to cover R at all. I assume that you have enough experience using a computer, at least to locate, download and install some software, and execute some lines of code that have already been written for you.
I have been working with data since I first looked at some SQL scripts and a Microsoft SQL Server 2000 and realised it was doing something incredibly powerful while being very easy to understand. After that I travelled a very winding road, but most recently I have taken a career break to study. I have an MSc (Distinction) in Business Intelligence and am now researching a PhD in Machine Learning for Human-in-the-Loop Processes.
This branch of stats piqued my interest at three distinct points in my life:
My undergraduate degree in Microbiology - using techniques such as Most Probable Number to estimate the number of bacterial cells in a test tube. You certainly can’t count them all!
My career as a data professional in the tertiary (higher) education sector.
The kinds of questions I used to work on were:
Here are just some of the factors/features to be utilised in the research questions above:
The types of questions we did not need to answer were “what is the average height/weight of students across different groups?” So it was obvious that regular descriptive and quantitative stats (i.e. the normal distribution) were not the right tools for this work.
Tables and cross tabs were no use with so many dimensions. Pivot tables were a little better, but couldn’t help with significance testing and predictive analytics. Most importantly, I needed visual tools to communicate findings with the senior management team. I couldn’t expect those guys to squint over a page of tables and numbers with me, looking for clues. This is when I came upon Michael Friendly’s work on visualising categorical data. This was exactly what I was looking for at the time, so I was highly motivated to learn more about it.
If you want to run the code chunks and see the results in the R Studio UI you need access to a machine with R + R Studio and the source code for this file. After you’ve downloaded it to your working directory, just ditch all my text and move all the code chunks to a new .R file with:
# Don't run this if you've already got the code in a separate file
# knitr::purl("DDAR_Display_Page.Rmd")
Predictably, the term refers to analysis of discrete data! DDA is a set of tools and techniques that you would apply to achieve the same objectives as any quantitative statistical analysis but with these data types:
The most powerful set of tools we have is our eyes and brains. Humans are pretty damn impressive when it comes to pattern matching. The trick is to choose the right chart that reveals the right information for the task in hand. A chart is often acceptable as statistical evidence, provided it is enhanced with robust statistical information (confidence intervals, significance test results, etc). More importantly, a good chart really doesn’t need much explaining.
The next three charts are examples of what we tend to reach for when we have a mix of numeric and discrete data. However, you won’t see these again beyond this introduction because they are not the best tools for DDA.
# set the random number generator same every time
set.seed(123)
# a discrete value to match the groups
n <- 100
grp <- rep(c(1, 2), each = n)
# a random discrete value
rnd <- sample(c(1, 2)
, replace = TRUE
, size = 200)
numvars1 <- rmvnorm(n, c(0, 0)
, sigma = matrix(c(3,2,2,3)
, ncol=2))
numvars2 <- rmvnorm(n, c(1, 1)
, sigma = matrix(c(1,0.5,0.5,4)
, ncol=2))
numvars <- rbind(numvars1, numvars2)
# a discrete value based on a soft linear boundary
noise <- rnorm(2*n, mean = 0, sd = 0.5)
fam <- ifelse(rowSums(numvars) + noise > 3, 1, 2)
# plot, using graphic options to separate the discrete classes
plot(numvars, col = fam, pch = rnd, cex = grp
, main = "scatterplot with size, shape and colour options"
, sub = "typical options for stratifying numeric data by discrete data"
, xlab = "x"
, ylab = "y")
# take row sums then table-apply the sum by each group
grp_total <- tapply(rowSums(numvars), grp, sum)
rnd_total <- tapply(rowSums(numvars), rnd, sum)
fam_total <- tapply(rowSums(numvars), fam, sum)
# totals barplot
barplot(c(grp_total, rnd_total, fam_total)
, col="lightgray", border = 1:2
, main="barplot with border options"
, sub = "numeric data stratified on groups")
# classic box plot
# note that you can't tell
# that grp is the actual groups
# and rnd is a random labelling
boxplot(rowSums(numvars)~grp*rnd, border=1:2
, main = "boxplot distribution analysis by groups"
, sub = "typical options for stratifying numeric data by discrete data"
, xlab = "grp1 grp2 rnd1 rnd2"
, ylab = "row sum value")
The plots above place all the focus on the numeric features, slicing them into groups by the various discrete features. We would like charts where the focus is on the discrete features; something that provides more information about the discrete features themselves.
# length will count the members
grp_count <- tapply(rowSums(numvars), grp, length)
rnd_count <- tapply(rowSums(numvars), rnd, length)
fam_count <- tapply(rowSums(numvars), fam, length)
# counts barplot
barplot(c(grp_count, rnd_count, fam_count)
, col="lightgray", border = 1:2
, main="barplot with border options"
, sub="counts by groups")
By counting the group members, the numeric variable is out of the picture entirely. The only trace of it is that the fam group was based on a linear combination of the simulated data. The grp group was fixed at exactly n members each, while rnd was drawn at random with even odds. It’s very easy, using this visual reference, to link the count data to the information we already know.
I will provide much more interesting examples a little later on.
Specifically, contingency tables or crosstabs. The categorical features form the axes, and each cell contains the total number of individuals at the intersection of each feature. Here are some examples:
# In 2D - collapse over department
margin.table(UCBAdmissions, 2:1)
## Admit
## Gender Admitted Rejected
## Male 1198 1493
## Female 557 1278
# 3D table - Admit on rows (D1), Gender on columns (D2)
# and Department on levels (D3)
UCBAdmissions # Department has 6 levels
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
# pivot the department onto columns
aperm(UCBAdmissions, c(2, 3, 1))
## , , Admit = Admitted
##
## Dept
## Gender A B C D E F
## Male 512 353 120 138 53 22
## Female 89 17 202 131 94 24
##
## , , Admit = Rejected
##
## Dept
## Gender A B C D E F
## Male 313 207 205 279 138 351
## Female 19 8 391 244 299 317
# flatten to a crosstab in default order
structable(UCBAdmissions)
## Gender Male Female
## Admit Dept
## Admitted A 512 89
## B 353 17
## C 120 202
## D 138 131
## E 53 94
## F 22 24
## Rejected A 313 19
## B 207 8
## C 205 391
## D 279 244
## E 138 299
## F 351 317
# flatten to a crosstab with pivot
structable(aperm(UCBAdmissions, c(2, 3, 1)))
## Dept A B C D E F
## Gender Admit
## Male Admitted 512 353 120 138 53 22
## Rejected 313 207 205 279 138 351
## Female Admitted 89 17 202 131 94 24
## Rejected 19 8 391 244 299 317
# More D
structable(Survived~., data=Titanic)
## Survived No Yes
## Class Sex Age
## 1st Male Child 0 5
## Adult 118 57
## Female Child 0 1
## Adult 4 140
## 2nd Male Child 0 11
## Adult 154 14
## Female Child 0 13
## Adult 13 80
## 3rd Male Child 35 13
## Adult 387 75
## Female Child 17 14
## Adult 89 76
## Crew Male Child 0 0
## Adult 670 192
## Female Child 0 0
## Adult 3 20
# More D - another pivot
structable(Survived+Sex~., data=Titanic)
## Survived No Yes
## Sex Male Female Male Female
## Class Age
## 1st Child 0 0 5 1
## Adult 118 4 57 140
## 2nd Child 0 0 11 13
## Adult 154 13 14 80
## 3rd Child 35 17 13 14
## Adult 387 89 75 76
## Crew Child 0 0 0 0
## Adult 670 3 192 20
Tables like this have several disadvantages. Staring at numbers is not very informative, unless there are obvious large or small values, and things start to get tricky in more dimensions.
Adding row, column and table totals can be informative but only works well in 2D.
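For a quick look at the totals themselves, addmargins() does the job in one line. A minimal example on the same 2-D table:
# add row, column and grand totals to the 2-D table
addmargins(margin.table(UCBAdmissions, 2:1))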
# An old fashioned crosstab - 2D
CrossTable(margin.table(UCBAdmissions, 2:1)
, prop.chisq = FALSE
, format = "SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 4526
##
## | Admit
## Gender | Admitted | Rejected | Row Total |
## -------------|-----------|-----------|-----------|
## Male | 1198 | 1493 | 2691 |
## | 44.519% | 55.481% | 59.456% |
## | 68.262% | 53.879% | |
## | 26.469% | 32.987% | |
## -------------|-----------|-----------|-----------|
## Female | 557 | 1278 | 1835 |
## | 30.354% | 69.646% | 40.544% |
## | 31.738% | 46.121% | |
## | 12.307% | 28.237% | |
## -------------|-----------|-----------|-----------|
## Column Total | 1755 | 2771 | 4526 |
## | 38.776% | 61.224% | |
## -------------|-----------|-----------|-----------|
##
##
A huge range of DDA research questions are ultimately answered with a \(\chi^2\) test. It’ll happen at some point under the hood of most of the functions used in this code. That’s simply because tables are the main data structures for DDA and most tasks boil down to comparing expected values with actual, observed values.
This test instantly tells you if numbers in table cells are what you expect them to be. You might expect all the numbers to be proportional to the row and column totals, or you might have some very specific numbers you expect. Either way, a \(\chi^2\) test reveals whether your observed numbers are near enough to your expected numbers that you can put any small differences down to some random sampling error.
This is good news, right? Nearly everyone who’s done any stats has run a \(\chi^2\) test at some point and can make sense of the results. A significant result in a \(\chi^2\) test means it’s very improbable that the data matches what we expected. For reasons that I’m not going into now, you can never really prove anything with absolute, unequivocal certainty with statistics. So, we always phrase such tests in terms of the reverse of what we’re looking for. Essentially, it’s more robust if we can demonstrate that some opposing assumption is very likely to be wrong, because it makes it more probable that we’re right. This opposite assumption that we state at the beginning of a test is called the null hypothesis, or simply the null. We like to give it the symbol \(H_0\).
Here is a null for when we want to show that a category of feature A is much more popular: \(H_0\): all categories of feature A are chosen with equal frequency.
And here’s one we would use to show that there is an interaction between features A and B, where certain categories in A and B tend to go together, while other combinations don’t get chosen very frequently: \(H_0\): features A and B are independent, so each cell count is proportional to its row and column totals.
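Here’s a minimal sketch of what both of these tests look like in R. The counts are made up, purely for illustration:
# null 1: made-up counts for a single feature A with four categories
A <- c(a1 = 40, a2 = 22, a3 = 18, a4 = 20)
chisq.test(A)                             # H0: all categories equally popular
chisq.test(A, p = c(0.4, 0.2, 0.2, 0.2))  # H0: specific expected proportions
# null 2: a made-up 2-way table of counts for features A and B
AB <- matrix(c(30, 10, 5, 25), nrow = 2
             , dimnames = list(A = c("a1", "a2"), B = c("b1", "b2")))
chisq.test(AB)                            # H0: A and B are independent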
Note, we will touch on statistical distributions later, but not the \(\chi^2\) distribution. This distribution comes up all the time in statistical testing, but doesn’t really model anything found in the real world that we actually want to measure.
Most people with some stats knowledge are aware of the log odds from logistic regression. The odds itself is the ratio \(o = \frac{p}{1-p}\) for \(p\), the proportion of yes and \(1-p\), the proportion of no in any binary situation, such as a coin flip. The log odds is simply \(\log(\frac{p}{1-p})\).
The odds is converted back to a probability by \(p = \frac{o}{1+o}\) which means we can connect the number of times something has happened back to the probability that it happens, and compare these probabilities across categories to see if they show any patterns.
An odds ratio is simply a ratio of odds from two binary systems \(\frac{o_1}{o_2}\) and the log odds ratio is \(\log(\frac{o_1}{o_2})\). There’s a pattern here:
\[\log(\frac{a}{b})\]
Let’s think about why we like to take the log of a fraction. First of all, I’m going to set two conditions: \(a\) and \(b\) are always positive, and \(b\) can never be \(0\). Now, think about what happens when 1. \(a\) and \(b\) are equal, 2. \(a\) gets much bigger than \(b\), and 3. \(b\) gets much bigger than \(a\).
p <- c(0.05, .1, .25, .50, .75, .9, .95)
odds <- p / (1 - p)
logodds <- log(odds)
logits <- data.frame(p, odds, logodds)
logits
## p odds logodds
## 1 0.05 0.05263158 -2.944439
## 2 0.10 0.11111111 -2.197225
## 3 0.25 0.33333333 -1.098612
## 4 0.50 1.00000000 0.000000
## 5 0.75 3.00000000 1.098612
## 6 0.90 9.00000000 2.197225
## 7 0.95 19.00000000 2.944439
a <- c(1000, 1000, 1)
b <- c(1000, 1, 1000)
log(a/b)
## [1] 0.000000 6.907755 -6.907755
Hopefully you can see that while fractions and ratios are centred on \(1\) when \(a = b\), they have this nasty habit of flying off towards \(\infty\) as \(a\) gets much bigger, while squashing closer and closer to \(0\) as \(b\) gets much bigger. Comparing counts or ratios of \(a:b\) over the levels of a third variable \(c\) becomes very problematic, because we would have to handle two of the most difficult situations in computing - potentially infinite and infinitesimal numbers - in the same operation. That’s really unhelpful.
Here are the top three reasons the log odds ratio is so useful in DDA:
The log function is very special for lots of other reasons but, in DDA especially, it is very helpful to have this neat way to manage ratios.
That’s all I want to say about the log ratio and the log odds ratio right now. We’ll come back to it in later sections.
There are several distributions to help with the most common DDA research questions; the ones used below are the binomial, Poisson, geometric and negative binomial. You can recognise them because a random variable from any of these can only be a natural number \(\{ 0,1,2,\dots \}\) (no negative numbers).
You probably have data matching one of these distributions if you have data that looks like this:
seed <- 125
set.seed(seed)
# the first number is the number of samples
# the other numbers are the parameters of the distribution
rbinom(10, size = 10, prob = 0.2)
## [1] 3 1 1 1 4 5 2 1 2 2
rpois(10, lambda = 5)
## [1] 1 7 4 2 5 4 4 9 6 6
rgeom(10, prob = 0.5)
## [1] 2 0 0 0 0 0 0 2 0 0
size <- 2
prob <- 0.2
count <- table(rnbinom(1000, size = size, prob = prob))
freq <- count/sum(count)
waittime <- as.integer(names(count))
xyplot(freq~waittime
, xlab = "Wait Time (minutes)"
, ylab = "% of total"
, type = c("h", "p") # this will draw lollipops
, pch = 16, lwd = 2
, main = paste("Negative Binomial. Rate ="
, size
, ", Dispersion ="
, 1/prob)
, sub = paste("Data from R's pseudo-random\n number generator with seed = ", seed))
The main reason we like to use distributions in data analysis, where possible, is that so many clever people have spent the last two or three hundred years researching how they work! This means that all the most accurate, provable and supported tooling for statistics is based on distributions. What we try to do is match the data we have to a known distribution. If we find a good match, we can assume the mathematical properties of the distribution will apply to the real world data. This is what people mean by parametric tests: using all the tried and tested distributional properties to make assumptions about our real-world data.
However, distributions are not real world data. They are just mathematical constructions. They model many things in the real world but you need to be pretty strict about how and when to use them.
On a project using real-world, not simulated data, there simply might not be a distribution that fits and there are hundreds of non-parametric tools that make a model of the data directly from the data itself. That’s what a lot of people think of as Machine Learning (though such definitions are highly debatable).
Anyway, the moral of the story: use a distribution when it makes sense. That’s often a good place to start with 1-dimensional data (just a bag of integers, as above). If none fits, then move on and find another method to analyse your data. Data in more than one dimension, with lots of categorical features, needs a lot more work.
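If you want to check how well a candidate distribution fits a bag of integers like the ones above, the vcd package has goodfit(). Here’s a minimal sketch on simulated data; swap in your own vector of counts:
# fit a Poisson distribution to a vector of counts and test the fit
x <- rpois(500, lambda = 3)         # simulated counts, for illustration only
fit <- goodfit(x, type = "poisson")
summary(fit)                        # goodness-of-fit test
plot(fit)                           # hanging rootogram of observed vs fitted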
This is a non-exhaustive list of common statistical tasks and questions that require specific DDA techniques.
We don’t have time to develop fluency in these tasks in this workshop, but if this is something useful for you, there are some “building block” skills that are worth spending time on:
The most useful data structures for DDA in R are not data frames, but n-dimensional arrays, n-way tables and matrices, and sometimes data frames with a frequency column (for the count data).
It’s good to become fluent with functions for manipulating table data - pivoting and collapsing dimensions, and converting between data structures to suit your analysis. Useful functions to get to know are table(), addmargins(), prop.table(), ftable(), xtabs(), xtable(), structable(), apply(), margin.table(), collapse.table(), aggregate(), aperm(), expand.dft() and subset(), plus integer indexing to re-order categories and pivot dimensions. Most of the specialised objects have as.table() and as.data.frame() methods that generate the most intuitive shape for the output data structure.
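Here’s a small example of one round trip between these structures, using the built-in UCBAdmissions table. Nothing clever, just the moves you’ll make over and over:
# 3-way table (array) -> frequency-form data frame -> 2-way table
ucb_df <- as.data.frame(UCBAdmissions)          # one row per cell, plus Freq
ucb_2d <- xtabs(Freq ~ Gender + Admit, ucb_df)  # collapse over Dept
round(prop.table(ucb_2d, 1), 2)                 # row proportions
aperm(ucb_2d, 2:1)                              # pivot the dimensions
ftable(UCBAdmissions)                           # flatten the full 3-way table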
Library “vcd” comes with its own plotting system. It is based on the grid package but is different from R base graphics, ggplot2 and lattice! It’s easy to get started, but a bit of effort is required to learn all the tweaks for making publication ready graphical output. However, interpreting these area-based plots is really intuitive, and it’s worth the effort to learn so that results can be communicated succinctly (a picture tells a thousand words).
Exploring relationships between categorical features. Mosaic plots are very powerful for visualising categorical data. Area and shading can be used in many ways to highlight the important interactions. These visual techniques remain intuitive up to four or five categorical dimensions. This allows you to identify very complex interactions.
Note, the ‘strucplot framework’ that is the heart of the vcd package requires learning a new approach to structuring data and writing plotting routines compared to previous experience with R. However, it’s worth the effort for the intuitive views of complex relationships in the data.
The exploratory data analysis phase of any project. Do this early on in the project once you’ve gathered and cleaned up your data enough to start sense-making. It doesn’t make any sense to reach for more complex tools before you know what you’ve got.
Spending some time “getting to know” a new dataset and developing some intuition about important features and interactions will pay off by revealing a better strategy when it comes to modelling.
The HairEyeColor dataset ships with base R. It is a survey of 592 statistics students at the University of Delaware in 1974. It is a 3-D array cross-tabulating the observations. The features and their levels are as follows:
No | Name | Levels |
---|---|---|
1 | Hair | Black, Brown, Red, Blond |
2 | Eye | Brown, Blue, Hazel, Green |
3 | Sex | Male, Female |
For now, we will ignore the sex of each individual by collapsing the 3-D table down to 2-D, and re-order the table from dark to light eye colour (hair is already in that order).
# ignore gender for now
haireye <- margin.table(HairEyeColor, 1:2)
# re-arrange order of eye colour from dark to light
# it's already correct for hair
haireye <- as.table(haireye[, c("Brown", "Hazel", "Green", "Blue")])
haireye
## Eye
## Hair Brown Hazel Green Blue
## Black 68 15 5 20
## Brown 119 54 29 84
## Red 26 14 14 17
## Blond 7 10 16 94
Big numbers and small numbers stand out but the table still has 16 cells which is a lot to take in at once. It’s not immediately obvious what the relationships are.
A naive approach might be to produce a bar plot of the counts by groups.
# non-intuitive views
# this is what the Frequency form looks like
as.data.frame(haireye)
## Hair Eye Freq
## 1 Black Brown 68
## 2 Brown Brown 119
## 3 Red Brown 26
## 4 Blond Brown 7
## 5 Black Hazel 15
## 6 Brown Hazel 54
## 7 Red Hazel 14
## 8 Blond Hazel 10
## 9 Black Green 5
## 10 Brown Green 29
## 11 Red Green 14
## 12 Blond Green 16
## 13 Black Blue 20
## 14 Brown Blue 84
## 15 Red Blue 17
## 16 Blond Blue 94
# You're forced to favour one variable over the other
# Does eye colour depend on hair colour
barplot(haireye, beside = TRUE, legend = TRUE)
# Or does hair colour depend on eye colour
# pivot the other way
barplot(aperm(haireye, 2:1), beside = TRUE, legend = TRUE)
It’s hard to compare bars across groups, because your eye has to jump to find each member. A more intuitive view would preserve the table structure and not condition one feature on the other:
tile(haireye) # vcd package
Hopefully, you can see the value of this simple concept: put the categorical features on the axes and plot a shape whose area is proportional to the count in each cell. A pattern of some kind is self-evident. Let’s formally report it with a \(\chi^2\) test. The null in this case is just like our second example above.
chisq.test(haireye)
##
## Pearson's Chi-squared test
##
## data: haireye
## X-squared = 138.29, df = 9, p-value < 2.2e-16
With such a small p-value, we have very strong evidence to reject the null hypothesis that there is no relationship between hair colour and eye colour. We can now assume they are interacting, but what exactly is the pattern? The only thing the test tells us is there is a very significant difference between actual and expected counts.
Let’s use a mosaic plot to visualise expected counts as if there were no interaction between the features. Here’s what no interaction looks like:
# visualising expected counts for independent features
expected <- independence_table(haireye)
round(expected, 1)
## Eye
## Hair Brown Hazel Green Blue
## Black 40.1 17.0 11.7 39.2
## Brown 106.3 44.9 30.9 103.9
## Red 26.4 11.2 7.7 25.8
## Blond 47.2 20.0 13.7 46.1
mosaic(expected
, shade = TRUE
, main="Expected frequencies"
, labeling = labeling_values
, value_type = "expected"
, gp_text = gpar(fontface = 1))
Note that the areas are proportional to the counts. The grid lines between the tiles are dead straight because each cell is proportional to the marginal totals.
Here’s what the actual counts look like:
# visualising the actual, observed counts
mosaic(haireye
, gp = shading_Friendly # shade = TRUE
, main="Actual frequencies"
, labeling = labeling_values
, value_type = "observed"
, gp_text = gpar(fontface = 1)
, rot_labels = c(top = -20))
Here the grid lines between cells are no longer continuous because the cells are out of proportion with the marginal totals. Unexpectedly large cells show that the combination is over-represented and small cells the reverse.
The beauty of the mosaic chart is the use of shading to highlight the cells that differ the most from the expected counts. We refer to these differences as the residuals. This shading scheme uses bold fills for statistically significant residuals but also adds an outline colour to highlight the sign of even the non-significant residuals. This is visually very powerful. There is a clear “opposite corner” pattern.
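If you want to see the numbers driving the shading, they are just the Pearson residuals from the \(\chi^2\) test:
# the shading is based on the Pearson residuals:
# (observed - expected) / sqrt(expected)
round(chisq.test(haireye)$residuals, 2)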
Let’s take this up to all three dimensions now.
# showing all three dimensions
HEC <- HairEyeColor[, c("Brown", "Hazel", "Green", "Blue"), ]
mosaic(HEC
, gp = shading_Friendly # shade = TRUE
, main="Actual frequencies"
, labeling = labeling_values
, value_type = "observed"
, gp_text = gpar(fontface = 1), rot_labels = c(right = -45))
# with a different pivot
mosaic(aperm(HEC, 3:1)
, gp = shading_Friendly # shade = TRUE
, main="Actual frequencies"
, labeling = labeling_values
, value_type = "observed"
, gp_text = gpar(fontface = 1), rot_labels = c(right = -45))
We can still see the general opposite corner pattern, and now some additional detail regarding gender.
This type of plot is still fairly intuitive at 4 dimensions. Take a look at the Titanic dataset. Who died and who survived? Did it depend on some combination of passenger class, gender and age?
mosaic(aperm(Titanic, c(4, 2, 1, 3))
, gp = shading_Friendly # shade = TRUE
, main="Who Died and Who Survived the Titanic?"
, labeling = labeling_values
, value_type = "observed"
, gp_text = gpar(fontface = 1), rot_labels = c(right = -45))
More dimensions can be added by other means. A cotabplot allows you to stratify over a further dimension. It looks just like facet_wrap() and facet_grid() in ggplot2, or a lattice/trellis plot.
In this example, with just default settings, two categories have been stratified. You could add further dimensions if you have more and still make sense of the plots.
# This doesn't appear to render well on the HTML output
# note, this is all the defaults except the shading!
cotabplot(Titanic, gp = shading_Friendly)
There is a whole methodology around loglinear modeling which allows you to make statistical hypothesis tests about complex n-way relationships between features. We’ll leave that for another time.
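If you want a tiny taste before then, here is a hedged sketch, assuming the MASS package is installed, of a loglinear test of mutual independence on the hair, eye and sex table from above:
# loglinear model of mutual independence: Hair, Eye and Sex unrelated
# a significant lack of fit means at least one interaction term is needed
MASS::loglm(~ Hair + Eye + Sex, data = HEC)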
After model training. Use this method to deepen your understanding of how well your model is performing.
This task can be guided by two things. Firstly, if you’ve done a good exploratory analysis, you will have developed an understanding of which features could be important. Secondly, several well known ML models can output feature importance information in their own right, saving you all the hard work.
However, these feature importance measures tend to be crude. The feature is important, but why? All feature importance really tells you is that the model would be less accurate if you excluded that feature. It doesn’t tell you anything about how that feature is helping or hindering.
Before moving on, let’s develop some intuition about this and show how the vcd package can help.
We will do some ML and train a random forest model to predict whether families in Slovenia should receive a state sponsored nursery place. The target variable is decision.
set.seed(54321)
nursery <- read.csv(gzfile("nursery.csv.gz"))
summary(nursery)
## parents has_nurs form children
## great_pret :4320 critical :2592 complete :3238 1 :3238
## pretentious:4320 improper :2592 completed :3240 2 :3240
## usual :4318 less_proper:2592 foster :3240 3 :3240
## proper :2590 incomplete:3240 more:3240
## very_crit :2592
## housing finance social
## convenient:4318 convenient:6478 nonprob :4319
## critical :4320 inconv :6480 problematic :4320
## less_conv :4320 slightly_prob:4319
##
##
## health decision
## not_recom :4320 not_recom :4320
## priority :4320 priority :4266
## recommended:4318 spec_prior:4044
## very_recom: 328
##
All the features are categorical. Random forest ought to perform very well on this without any effort.
# train test split
idx <- sample(c(TRUE, FALSE)
, size = nrow(nursery)
, prob = c(0.7, 0.3)
, replace = TRUE)
nurs_train <- nursery[idx, ]
nurs_test <- nursery[!idx, ]
# train rf accepting all the defaults
nurs_rf <- randomForest(decision~., data=nurs_train)
# get the predictions on new data
nurs_preds <- predict(nurs_rf, newdata = nurs_test)
# confusion matrix
confmat <- with(nurs_test, table(nurs_preds, decision))
confmat
## decision
## nurs_preds not_recom priority spec_prior very_recom
## not_recom 1319 0 0 0
## priority 0 1232 13 50
## spec_prior 0 24 1142 0
## very_recom 0 0 0 57
Indeed, the model performs well and needed no effort to tune. Ideally, I would like to visualise the confusion matrix directly, but it turns out to be a bit of a fussy task in R. I’ve looked online for any other way to do it, and I can only find some customised ggplot examples. Let’s compare the different plotting options from R:
# base
# there are no words to describe how horrible this is
image(confmat)
# lattice
# wrong diagonal aspect, horrible colours
levelplot(confmat)
# same
levelplot(t(confmat))
# ggplot
# too much fuss
# default colours not subtle enough to show error
# also the wrong diagonal
# First, convert to frequency form
confmat_freq <- as.data.frame(confmat)
# Next, tell ggplot what to do
ggplot(aes(x=nurs_preds
, y=decision
, fill=Freq)
, data=confmat_freq) +
geom_tile()
I’m not sure if Michael Friendly and the programmers of vcd meant for it to be used this way, but strucplot options are ideal for confusion matrices.
# strucplot sieve
# the shading is a bit too dense on the diagonal
# there are lots of options
# it might be possible to customise
sieve(confmat, labeling = labeling_values)
# strucplot tile - I like this one best
tile(confmat, labeling = labeling_values)
There is a little bit of confusion among the classes priority and spec_prior. The model also struggles with the minority very_recom class, preferring to label these as priority.
One of the most useful features of random forest models is the ability to report variable importance. This measures, for each feature, how much worse the model would do if that feature wasn’t available at training time. There is a custom plot function for this.
varImpPlot(nurs_rf)
We can see from this plot that the health variable seems to hold a majority of the predictive power for this model. Almost all the information is in this and the next three most important features.
What do we do now we know that? We investigate further! There is always a next question. Now we’ve become expert in handling 3-D tables, let’s stratify (slice) the confusion matrix by this health variable and see what’s going on.
# unfortunately, there is no cotabplot option for tile
# this has to be done manually to make any sense
# from the summary, health can take one of three values
# not_recom, priority, recommended
cm_3d <- with(nurs_test
, table(nurs_preds, decision, health))
re <- cm_3d[,,"recommended"]
pr <- cm_3d[,,"priority"]
nr <- cm_3d[,,"not_recom"]
tile(re, labeling = labeling_values)
tile(pr, labeling = labeling_values)
tile(nr, labeling = labeling_values)
This last plot is very striking. One level of the health feature has all the information required to completely classify one of the class labels. It’s not surprising that health is by far the most important variable! However, it turns out to be very uninteresting from a ML point of view. A very simple program could do almost as well as this random forest.
A situation like this is quite unusual in the real world, and truthfully, should be identified during exploratory analysis. The sensible thing to do is probably to filter out all the instances in this health category and build a new model. However, it serves as a good example to show that confusion matrix results are not as informative as they first appear.
In binary classification, this could never happen. There would be no ML to do if one feature was a perfect representation of the class labels. Furthermore, problems are often organised so that the two classes are balanced and this should remain the case in the confusion matrix of a well performing model. How could this slicing technique help?
When you are evaluating your classification model performance with a confusion matrix and you have a reason to believe that one or more categorical features contributed to good/bad performance. You need to investigate further to improve model performance or correct for any bias.
You want to use a non-parametric model, such as a tree based method, but your situation is sensitive to unbalanced loss. For example, a medical or financial model where either false positives or false negatives are especially costly. You need a detailed understanding of features and categories where the model is more or less confident.
We will now do a random forest binary classification. The german data set is a credit rating data set, where individuals are given either a good or bad rating. Incorrect good predictions cost the business much more, because of the risk of loan defaults. Incorrect bad results only result in a small reduction of revenue from loan interest payments in the long term.
Let’s refer to correct identification of a bad result as a true positive outcome. When there is a high proportion of bad instances that are incorrectly classified as good, this gives rise to a high false negative rate and is very undesirable.
We could crudely force the model to be more cautious overall, so that it makes more false positive results (incorrect bad ratings). However, it would be better if we could develop a more targeted strategy to improve the false negative rate.
german <- read.csv(gzfile("german.csv.gz"))
summary(german)
## chk dur crhis pps amt
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972
## Max. :72.0 A46 : 50 Max. :18424
## (Other): 55
## svng emp rate pers debt res
## A61:603 A71: 62 Min. :1.000 A91: 50 A101:907 Min. :1.000
## A62:103 A72:172 1st Qu.:2.000 A92:310 A102: 41 1st Qu.:2.000
## A63: 63 A73:339 Median :3.000 A93:548 A103: 52 Median :3.000
## A64: 48 A74:174 Mean :2.973 A94: 92 Mean :2.845
## A65:183 A75:253 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
##
## prop age plans hous creds
## A121:282 Min. :19.00 A141:139 A151:179 Min. :1.000
## A122:232 1st Qu.:27.00 A142: 47 A152:713 1st Qu.:1.000
## A123:332 Median :33.00 A143:814 A153:108 Median :1.000
## A124:154 Mean :35.55 Mean :1.407
## 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :75.00 Max. :4.000
##
## job deps tel foreign rating
## A171: 22 Min. :1.000 A191:596 A201:963 bad :300
## A172:200 1st Qu.:1.000 A192:404 A202: 37 good:700
## A173:630 Median :1.000
## A174:148 Mean :1.155
## 3rd Qu.:1.000
## Max. :2.000
##
set.seed(54321)
# train test split
idx <- sample(c(TRUE, FALSE)
, size = nrow(german)
, prob = c(0.7, 0.3)
, replace = TRUE)
german_train <- german[idx, ]
german_test <- german[!idx, ]
# make the training data balanced in case over fits good
# originally 300 bad, 700 good
german_train <- ovun.sample(rating~.
, data=german_train
, method = "over"
, p = 0.5)$data
# restore the factor levels to correct order
german_train$rating <- relevel(german_train$rating, "bad")
# train rf accepting all the defaults
german_rf <- randomForest(rating~., data=german_train)
# get the predictions on new data
german_preds <- predict(german_rf, newdata = german_test)
# confusion matrix
confmat <- with(german_test, table(german_preds, rating))
confmat
## rating
## german_preds bad good
## bad 34 28
## good 42 185
The confusion matrix looks ok, but how does it really do, in light of our high cost of false negatives?
# accuracy
sum(diag(confmat))/sum(confmat)
## [1] 0.7577855
# False Negative Rate FN / (TP + FN)
confmat[2, 1]/sum(confmat[, 1])
## [1] 0.5526316
# False Ommision Rate FN / (TN + FN)
confmat[2, 1]/sum(confmat[2, ])
## [1] 0.185022
So it’s not a great model, but let’s assume it’s better than what the company was doing before and move on. These numbers are useful but we want an intuitive visual representation for this, so what works best?
A rose plot is a surprisingly intuitive choice for viewing a binary confusion matrix, which is a \(2 \times 2\) table. Rose plots are a twist on the pie chart. Pie charts divide a circle into segments whose internal angle \(\theta\) is proportional to the category count. A rose plot does the reverse. The segments all have angle \(\theta = \frac{2 \pi}{N}\), for \(N\) segments. For a binary matrix with \(4\) options, \(\theta = \frac{\pi}{2} = 90^o\). Then, the radius of each segment varies with the square root of the count in each cell, so that \(\mathit{area} \propto \mathit{count}\). Because the radius only grows as \(\sqrt{\mathit{count}}\), the rose plot will gracefully handle numbers that differ by one or two orders of magnitude.
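Here’s a quick numeric check of that scaling, just to illustrate the point (not part of the plotting code):
# radius scaling when area is proportional to the count
counts <- c(1, 10, 100, 1000)
round(sqrt(counts / max(counts)), 2)  # 0.03 0.10 0.32 1.00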
Most rose plot options available in R are basically bar charts that are transformed into polar coordinates. Even for a \(2 \times 2\) matrix, this would be fiddly to code because it requires converting the matrix to a frequency data frame and then forcing the plot to arrange the four cell values in a way that preserves the confusion matrix shape.
The vcd package contains a specialised fourfold plot whose purpose is to compare several \(2 \times 2\) tables over the levels of a third feature. It is intended for a different purpose than what we want to do, but the behaviour can be easily changed with a parameter to the plot function call that makes it perfectly suited for our needs.
# fourfold plot preserves the table structure it was given
# however, default balances the areas to compare odds ratios no matter the individual count values
# that's not what we want
fourfold(confmat)
# std = "all.max" suppresses balancing the areas
# behaves like a rose plot
fourfold(confmat, std = "all.max")
The rose plot makes it visually clear that the model is great at classifying true negatives, but it is predicting too many negatives overall and misclassifying more true positives than it correctly identifies. Such a high FNR is not acceptable. Why is the model not doing well?
If a categorical feature \(c\) has a high importance, it would suggest that the model has encoded how the ratio of good and bad ratings is different over the different levels of \(c\) and can make use of that information in predicting.
Let’s use feature importance to break down the results. This data set has a mix of numeric and categorical features.
varImpPlot(german_rf)
First thing to note is that there is no elbow point where we could safely make the model simpler by just picking the top performing features.
Secondly, and very fortunately for us, several categorical features are considered important, including the top one. Three of them are in the top six. Of course, it doesn’t matter if categorical variables don’t have high importance. Any numeric variable could be binned/discretised, and then have this technique applied.
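As a quick illustration of that last point, here is a sketch of discretising the numeric amt feature (quartile bins, chosen arbitrarily) and stratifying the confusion matrix on it. I’m calling fourfold() directly here; once my_rose() is defined below you could use that instead:
# bin the loan amount into quartiles, then stratify the confusion matrix
amt_bin <- cut(german_test$amt
               , breaks = quantile(german_test$amt, probs = seq(0, 1, 0.25))
               , include.lowest = TRUE)
confmat_a <- table(german_preds, german_test$rating, amt_bin)
fourfold(confmat_a, std = "all.max")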
As we want to customise the fourfold plot a little, we can make a new function to save repeating the code each time.
# customise the fourfold plot
my_rose <- function(cm) {
fourfold(cm
, std = "all.max" # mimic a 4 way rose plot
, conf_level = 0 # suppress confint tests, not useful here
, color = c("#FF0000", "#000080") # need to force colours due to suppressing confint test
)
}
# while we're at it, let's make a function to get the numeric statistics that we want from the conf mat
my_confmat_stats <- function(cm) {
cf_stats <- list()
for (i in 1:(dim(cm)[3])) {
category <- dimnames(cm)[[3]][i]
acc <- sum(diag(cm[ , , i]))/sum(cm[ , , i])
# False Negative Rate FN / (TP + FN)
fnr <- cm[2, 1, i]/sum(cm[ , 1, i])
# False Ommision Rate FN / (TN + FN)
fomr <- cm[2, 1, i]/sum(cm[2, , i])
# collect
cf_stats[[category]] <-
c(acc=acc, fnr=fnr, fomr=fomr)
}
return(t(as.data.frame(cf_stats)))
}
We’ll now make rose plots of the confusion matrix, stratified by the top categorical features. Try to relate the numeric stats to the patterns in the plots.
# stratify the conf mat by chk
confmat_c <- with(german_test
, table(german_preds
, rating
, chk))
# view the stats then plot
my_confmat_stats(confmat_c)
## acc fnr fomr
## A11 0.6777778 0.3333333 0.2708333
## A12 0.6212121 0.6363636 0.2978723
## A13 0.8421053 1.0000000 0.1111111
## A14 0.8859649 1.0000000 0.1140351
my_rose(confmat_c)
Here we can see a pattern. When an instance has chk \(\in \{ \mathrm{A13}, \mathrm{A14} \}\), the model almost always classifies it as good. One instance in each case was incorrectly classified as bad. Instances having chk \(\in \{ \mathrm{A11}, \mathrm{A12} \}\) have a rather high FNR (0.33 and 0.64 respectively).
We now proceed with the next two most important categorical features, but leave any further diagnostics as talking points.
confmat_p <- with(german_test
, table(german_preds
, rating
, pps))
my_confmat_stats(confmat_p)
## acc fnr fomr
## A40 0.7702703 0.4166667 0.1886792
## A41 0.7619048 0.7500000 0.1666667
## A410 0.5000000 NaN 0.0000000
## A42 0.6530612 0.8461538 0.2682927
## A43 0.8409091 0.4705882 0.1095890
## A44 0.5000000 1.0000000 0.5000000
## A45 0.6000000 1.0000000 0.4000000
## A46 0.7058824 0.3750000 0.3000000
## A48 1.0000000 NaN 0.0000000
## A49 0.7200000 0.5000000 0.1666667
my_rose(confmat_p)
confmat_h <- with(german_test
, table(german_preds
, rating
, crhis))
my_confmat_stats(confmat_h)
## acc fnr fomr
## A30 0.8000000 0.0000000 0.0000000
## A31 0.7142857 0.4000000 0.2222222
## A32 0.7152318 0.5909091 0.2241379
## A33 0.7916667 0.2000000 0.0625000
## A34 0.8222222 0.8125000 0.1547619
my_rose(confmat_h)
confmat_e <- with(german_test
, table(german_preds
, rating
, emp))
my_confmat_stats(confmat_e)
## acc fnr fomr
## A71 0.5714286 0.8000000 0.36363636
## A72 0.7674419 0.5000000 0.24242424
## A73 0.7019231 0.7037037 0.22619048
## A74 0.7777778 0.2222222 0.05405405
## A75 0.8513514 0.4736842 0.14516129
my_rose(confmat_e)
The dur feature is integer valued, with 22 unique levels. This technique scales usefully even up to this number of strata! While the exact details would take some time to analyse fully, there are obvious visible patterns.
confmat_d <- with(german_test
, table(german_preds
, rating
, dur))
my_rose(confmat_d)
This technique can be scaled to more dimensions, with some sacrifices, using log odds ratios. It is a little less intuitive at first, but can be very useful once you have a clear understanding of what the log odds ratio is telling you.
Recall from the introduction how this statistic was defined. The formula can be rearranged such that
\[\phi = \frac{\mathrm{TP} \times \mathrm{TN}}{\mathrm{FP} \times \mathrm{FN}}\]
The log odds ratio, therefore, grows larger as the model becomes more accurate. A model that does no better than a random guess has a log odds ratio that is not significantly different from zero. Note, this approach is not robust to very unbalanced classes.
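To connect the formula to some real numbers, here’s a quick check against the \(2 \times 2\) confusion matrix from earlier; the hand calculation and vcd’s loddsratio() should agree:
# odds ratio and log odds ratio computed by hand from the 2x2 confusion matrix
or <- (confmat[1, 1] * confmat[2, 2]) / (confmat[1, 2] * confmat[2, 1])
c(odds_ratio = or, log_odds_ratio = log(or))
loddsratio(confmat)  # the same quantity, computed by vcd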
The log odds ratio of a stratified confusion matrix can easily be plotted to see which categories of the stratifying factor give better accuracy. This will scale up to two additional dimensions, using position, colour and plot character. Such plots can be challenging to interpret and only work well if the features have two or three levels each at the most. However, under those circumstances, interesting interactions may be revealed.
Note, all the following plots are suppressing the confidence interval. The reason for this is that the data set is rather small and slicing along multiple dimensions results in low counts. Large standard errors can dominate the plot.
plot(loddsratio(confmat_c), confidence = FALSE)
plot(loddsratio(confmat_e), confidence = FALSE)
# higher dimensional
confmat_ce <- with(german_test
, table(german_preds
, rating
, chk
, emp))
plot(loddsratio(confmat_ce), confidence = FALSE)
# back to one stratum, but many levels
plot(loddsratio(confmat_d), confidence = FALSE)
In Machine Learning, the standard way to assess the quality of a supervised classification model is with a confusion matrix. This is a two way table comparing True and Predicted class labels. If you are trying to classify ordered labels, the relationship between the labels matters. Standard accuracy measures penalise near disagreements as if they were full errors. This leads to accuracy estimates that are far too conservative; you might throw out a fairly good model when there is a good chance you can improve it.
Let’s take a step back and recap confusion matrices: when classes are well balanced, it’s easy to measure accuracy over a confusion matrix as \(\frac{\sum(\textit{diag})}{\sum(\textit{total})}\). When they’re out of balance you can have a situation such as this:
The ratio of class 1 : class 2 is 9:1. Predict every instance as class 1 and you will get 90% accuracy. This is not a useful result!
So we use Cohen’s \(\kappa\) statistic, which corrects for the chance agreement implied by how often each class occurs in the data. \(\kappa = 1\) means perfect agreement, while \(\kappa = 0\) means no better than chance agreement. The above example would score \(\kappa = 0\).
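Here’s that degenerate situation as a small sketch, with made-up counts in a 9:1 imbalance and every instance predicted as class 1:
# a degenerate confusion matrix: all 100 instances predicted as class1
cm <- matrix(c(90, 10, 0, 0), nrow = 2, byrow = TRUE
             , dimnames = list(predicted = c("class1", "class2")
                               , actual = c("class1", "class2")))
sum(diag(cm)) / sum(cm)                            # naive accuracy = 0.9
p_o <- sum(diag(cm)) / sum(cm)                     # observed agreement
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # chance agreement
(p_o - p_e) / (1 - p_e)                            # Cohen's kappa = 0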
That’s all well and good, but when class labels are related to each other by a natural order, such as [least, middle, most], you have an additional complication. You want your model to predict as well as possible, but an off-by-one error isn’t as bad as a full opposite error. Also, the predicted class distribution might be biased, appearing flattened or polarised (different skew and kurtosis) compared to the true distribution.
If you can detect this kind of bias, you can correct it! This is where agreement plots can help. They are based on statistical techniques for comparing two raters, such as two doctors, two film critics, two wine critics etc, to validate ratings and analyse these kinds of bias.
After training an ordinal classification model, when you want to evaluate its performance.
In the following example we compare two sets of diagnoses for the same patients suspected of suffering from Multiple Sclerosis. The diagnoses were made by two neurologists, one in Winnipeg and the other in New Orleans:
# combine results from two cities (same raters are involved in all cases)
MSP <- margin.table(MSPatients, 1:2)
On this dataset, we can’t say which is the true and which is the predicted label because they’re both independent raters, but for now let’s treat the Winnipeg Neurologist as the true class and the New Orleans Neurologist as the predicted class. The highest accuracy we could expect to get with a constant model that always makes the same rating is the proportion of the majority true class. This would give a score of \(\kappa = 0\):
margin.table(MSP, 2)
## Winnipeg Neurologist
## Certain Probable Possible Doubtful
## 95 66 22 35
props <- prop.table(margin.table(MSP, 2))
props
## Winnipeg Neurologist
## Certain Probable Possible Doubtful
## 0.4357798 0.3027523 0.1009174 0.1605505
# The baseline accuracy is
props[which(props == max(props))]
## Certain
## 0.4357798
Without looking at the confusion matrix, let’s take a naive look at the accuracy:
# basic accuracy
sum(diag(MSP))/sum(MSP)
## [1] 0.4449541
On the face of it, this looks like a terrible result! Is New Orleans Neurologist just predicting the most frequent Winnipeg class?
MSP
## Winnipeg Neurologist
## New Orleans Neurologist Certain Probable Possible Doubtful
## Certain 43 8 0 1
## Probable 36 22 7 0
## Possible 12 27 8 10
## Doubtful 4 9 7 24
There are definitely some problems with the inter-rater agreement but it doesn’t look like random guessing. Why is accuracy so low? Does class imbalance play a part? What is the \(\kappa\) score?
# cohen's kappa - the function in the vcd package provides weights and confidence intervals for this statistic
confint(Kappa(MSP, weights = "Fleiss-Cohen"))
##
## Kappa lwr upr
## Unweighted 0.1728083 0.3411072
## Weighted 0.4987456 0.6785713
Unweighted \(\kappa\), the standard measure, also shows that this is not a very good result.
However, Fleiss-Cohen weighted \(\kappa\) gives partial credit to near disagreements, and taking this into account gives a much better result. Off-by-n errors appear to be common, so these raters are closer than it first appears.
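To see why, it helps to look at the Fleiss-Cohen (quadratic) weights themselves. Here’s a sketch of the standard weight matrix for a four-level scale, which is my understanding of what the vcd function applies:
# Fleiss-Cohen weights: w_ij = 1 - (i - j)^2 / (k - 1)^2
# full agreement scores 1, off-by-one still scores 0.89, opposite ends score 0
k <- 4
w <- 1 - (abs(outer(1:k, 1:k, "-"))^2) / (k - 1)^2
round(w, 2)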
An agreement plot is the best way to visualise a situation like this because it shows how off-by-n errors and other bias could be affecting the outcome.
op <- par(mar = c(4, 3, 4, 1) + .1)
B <- agreementplot(MSP
, main = "MS Patient Ratings"
, xlab_rot = -20
)
This plot needs some explaining. The rectangle outlines show the maximum possible agreement, given the marginal totals. The black rectangles indicate where there is full agreement between raters. The shaded areas indicate the off-by-\(n\) disagreements. Note, if there were more categories, there would be more grades of shading.
Ideally, we want the black area to fill the outline and it’s not very close in this chart. However, the shaded area is coming much closer to filling the available space.
unlist(B)[1 : 2]
## Bangdiwala Bangdiwala_Weighted
## 0.2472555 0.7503234
The Bangdiwala statistic is calculated from the ratio of these shaded rectangles to the maximal outline. The large gap between the weighted and unweighted statistics shows that near disagreements are very common; when they are taken into account, the raters are much better aligned than either the naive accuracy measure or even \(\kappa\) indicated.
There is another feature of this plot, which is the relationship of the rectangles to the diagonal line. This indicates how well aligned the raters are in using each category. In this chart, we can see how the Winnipeg Neurologist rates cases as “Certain” or “Probable” much more frequently than the New Orleans Neurologist. Neither the \(\kappa\) nor the Bangdiwala statistic can detect this. There are statistical tests available, but this chart shows the problem directly.
With a little intervention, guided by this information, these two raters could become much better aligned.
Most Machine Learning practitioners are familiar with dimension reduction. It can radically simplify and compress a dataset, making it easier to handle and train models on. Many dimension reduction techniques also double up as unsupervised learning methods i.e. clustering. This is useful as an end in itself, as well as being an additional tool for exploratory data analysis: multidimensional data is projected into 2-D which makes it intuitive to visualise.
If you already know something about Principal Components Analysis, this is similar, but not quite the same. Remember, in DDA, the “dimensions” we want to reduce or cluster are not only the features, but also the categories within each feature. PCA identifies new dimensions that capture the most variance. The equivalent concept in Correspondence Analysis is the “inertia” captured by the cells of the table with the largest and smallest counts. The only way to really describe what’s happening is to see some examples.
Correspondence Analysis is an advanced topic. In particular, the way you pivot the multidimensional array controls the analysis. This requires some understanding of the data manipulation and modeling techniques. For today, we will just focus on interpreting the plots.
You can use Correspondence Analysis during the exploratory data analysis phase of any project. Do this early on in the project once you’ve gathered and cleaned up your data enough to start sense-making.
This is a good technique for feature engineering. If several features are well clustered, you can replace them with the real valued dimensions that this method returns as output. This may convert several categorical dimensions into just a couple of numeric ones that might better suit the next stage of your project. The following two situations are immediately applicable:
You could use it as a stand alone model. This is an unsupervised clustering method for categorical features. Use it where you don’t have a specific class label, but just want to find out the associations between all the features.
The audience viewing data comes from Nielsen Media Research, for the week starting November 6, 1995.
It is a 3-D array cross-tabulating the viewing figures for three networks, between 8-11pm, Monday to Friday. The features and their levels are as follows:
No | Name | Levels |
---|---|---|
1 | Day | Monday, Tuesday, Wednesday, Thursday, Friday |
2 | Time | 8, 9, 10 |
3 | Network | ABC, CBS, NBC |
data("TV", package = "vcdExtra")
# The original data is collected in 15 minutes slices.
# Convert 3-D array to a frequency data frame
# This has a row for each cell of the array
# and a new column for the cell value
TV.df <- as.data.frame.table(TV)
# Convert it into hourly slices
levels(TV.df$Time) <- rep(c("8", "9", "10"), c(4,4,3))
# Convert frequency data back to 3-D array, now with just 3 time levels
TV3 <- xtabs(Freq~Day+Time+Network, TV.df)
# create a multiple correspondence analysis
TV3.mca <- mjca(TV3)
# the plot function uses all base R plot stuff
# but needs a bit of manipulation
cols <- c("blue", "black", "red")
# "blank plot"
res <- plot(TV3.mca, labels=0, pch='.', cex.lab=1.2)
# combine Dims, factor names and levels
coords <- data.frame(res$cols, TV3.mca$factors)
# hard-coded from known number of levels
# day, time, network
nlev <- c(5,3,3)
# everything needs to be in semantic order
coords <- coords[ order(coords[,"factor"], coords[,"level"]), ]
# quick fix for ordering e.g. day of week, not alphabetical
coords$order <- c(5, 1, 4, 2, 3, 6, 7, 8, 11, 9, 10)
coords <- coords[order(coords[, "order"]), ]
# place the points with separate plot chars and colours
points(coords[,1:2], pch=rep(16:18, nlev), col=rep(cols, nlev), cex=1.2)
# place the text
pos <- c(1,4,3)
text(coords[,1:2], labels=coords$level, col=rep(cols, nlev), pos=rep(pos,nlev), cex=1.1, xpd=TRUE)
# join things in sequence
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Day", lty=1, lwd=1, col="blue")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Time", lty=1, lwd=1, col="red")
# add segments from the origin to the channels
nw <- subset(coords, factor=="Network")
segments(0, 0, nw[,"Dim1"], nw[, "Dim2"], col = "black", lwd = 0.5, lty = 3)
# add a legend
legend("topright", legend=c("Day", "Network", "Time"),
title="Factor", title.col="black",
col=cols, text.col=cols, pch=16:18,
bg="gray95")
This plot reveals an incredibly strong effect relating NBC and Thursday nights. This effect apparently accounts for most of the inertia captured by Dimension 1. What’s going on there?
margin.table(TV3, 1)
## Day
## Monday Tuesday Wednesday Thursday Friday
## 8399 8081 5929 8987 6214
margin.table(TV3, c(1, 3))
## Network
## Day ABC CBS NBC
## Monday 2847 2923 2629
## Tuesday 3110 2403 2568
## Wednesday 2434 1283 2212
## Thursday 1766 1335 5886
## Friday 2737 1479 1998
Viewing figures are at their highest on Thursdays, and on that night NBC dramatically outnumbers the other two channels; on most other nights its figures are comparable to theirs. On Thursdays, everyone appears to be glued to NBC all evening. A little digging and some domain knowledge (showing my age, here) lets us give this effect a name: this is the “Friends, Seinfeld and ER” effect! These were three of the most popular shows of that era.
The 9pm slot clusters closer to the other four days than 8pm and 10pm. What’s happening there?
margin.table(TV3, 2)
## Time
## 8 9 10
## 12757 13596 11257
t(TV3[4,,]) # Thursday
## Time
## Network 8 9 10
## ABC 735 682 349
## CBS 680 385 270
## NBC 1927 1858 2101
It turns out that 9pm is the most viewed time slot throughout the week, independent of day or network. Yet notice how the NBC audience dips just a little in that time slot on Thursday, even though it still far outnumbers either of the other two channels. This dip runs counter to the “Friends, Seinfeld and ER” effect, which has completely dominated the analysis, and so it too has been captured on Dimension 1.
Dimension 2 is left to capture, as well as it can, the differences between the ABC and CBS schedules on the other nights, but there is undoubtedly more detail to untangle.
Now that we have identified this difficulty, we can tackle it. We’re going to use simple (2-way) Correspondence Analysis to do so. Using the pivoting function structable(), we effectively create a new feature that combines Time and Day. Counter-intuitively, this gives the Correspondence Analysis more flexibility along the combined dimension. The resulting plot has a point for each time slot over the week, and it teases apart the effect of people channel hopping from the “Friends, Seinfeld and ER” effect. It turns out to be very easy to do (see the code for yourself).
# Flatten to 2-D by stacking Time onto Day
# Note: The data shaping choice here controls the specifics of the analysis.
TV3s <- as.matrix(structable(Network~Time+Day, TV3))
# Create the Correspondence Analysis objects
TV3s.ca <- ca(TV3s)
# Generate the plot
res <- plot(TV3s.ca)
# add some segments from the origin to make things clearer
segments(0, 0, res$cols[,1], res$cols[,2], col = "red", lwd = 1)
segments(0, 0, res$rows[,1], res$rows[,2], col = "blue", lwd = 0.5, lty = 3)
As you can see, the small channel hop away from NBC on Thursday has moved onto Dimension 2, along with the other variations in within-evening viewing patterns. On the previous plot there was a very slight affinity between 8pm and CBS, and between 9pm and ABC. Did you notice? Look again. On this new plot we can clearly see something of great interest at 8pm on Monday, but not later on the same evening and not at the same time later in the week. We can also see several 9pm slots clustered around ABC.
An interesting feature of this new analysis is that distance from the origin reflects the size of the residual: slots whose audience deviates most strongly from the overall pattern sit furthest out. What does this mean? Well, for example, from 10pm the viewing figures dwindle, independent of day and channel, as people go to bed or go out for a late drink. Yet on Thursday at 10pm NBC has a massive audience relative to what is normally expected for that time slot during the rest of the week. This inertia pushes the point furthest from the origin of all the nightly time slots. Monday at 8pm on CBS is notable for the same reason. Why not Monday at 9pm and ABC? There are other nights when ABC is popular at 9pm, making this combination less unusual.
t(TV3[1,,]) # Monday
## Time
## Network 8 9 10
## ABC 536 1401 910
## CBS 1167 967 789
## NBC 858 946 825
TV3[,2,1] # Every day, 9pm, ABC
## Monday Tuesday Wednesday Thursday Friday
## 1401 1205 1022 682 907
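To connect the plot back to the numbers, we can look at the Pearson residuals of the flattened table directly. This is a quick sanity check of my own rather than part of the original analysis: the slots the plot pushes furthest from the origin (Thursday 10pm for NBC, Monday 8pm for CBS) should also carry the largest positive residuals.
# Pearson residuals of the flattened Time_Day x Network table (my own check)
resids <- chisq.test(TV3s)$residuals
round(resids["10_Thursday", ], 1)
round(resids["8_Monday", ], 1)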
If the next stage of your project requires real-valued features for modeling (such as a distance- or similarity-based model), you can access the coordinates of each point via the plot object and use them to replace the original features in your data set (there is a small sketch of this after the coordinate listings below). This is more powerful than converting to an integer series or dummy variables, because so much more information is retained. For example, it’s very tempting to convert [Monday, Tuesday, Wednesday, Thursday, Friday] to [1,2,3,4,5]. However, we have seen in this case that this retains none of the information about the feature interactions. Likewise, [8,9,10] seems obvious to leave as numbers, but you lose the information about the audience figures being higher at 9pm than at any other time. Also, what’s the natural order for [ABC,CBS,NBC]? Alphabetical? Think again!
res$rows
## Dim1 Dim2
## 8_Monday 0.15812304 -0.455606847
## 8_Tuesday -0.02638318 -0.010526784
## 8_Wednesday 0.25070348 0.001493149
## 8_Thursday -0.34416868 -0.074239579
## 8_Friday 0.13899420 0.195131693
## 9_Monday 0.24587580 0.032709280
## 9_Tuesday 0.23549271 -0.032622366
## 9_Wednesday 0.07374072 0.228909131
## 9_Thursday -0.46679506 0.046282352
## 9_Friday 0.23682706 0.094854237
## 10_Monday 0.16432975 -0.070170394
## 10_Tuesday 0.33565719 -0.012832280
## 10_Wednesday -0.12640134 0.130600285
## 10_Thursday -0.74421704 -0.028465081
## 10_Friday 0.13391483 0.086194069
res$cols
## Dim1 Dim2
## ABC 0.2489731 0.170454834
## CBS 0.2711437 -0.225264013
## NBC -0.3769860 -0.004916094
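Here is the sketch promised above. It is my own illustration, not from the book: join the CA row coordinates back onto a frequency data frame keyed on the same Time_Day label, so that two numeric features stand in for the categorical slot.
# build a lookup table of CA coordinates, keyed by the Time_Day label
ca_coords <- data.frame(slot = rownames(res$rows), res$rows)
# recreate the same key on a frequency data frame of the hourly table
TV3.df <- as.data.frame.table(TV3)
TV3.df$slot <- paste(TV3.df$Time, TV3.df$Day, sep = "_")
# each row now carries Dim1/Dim2 instead of needing dummy variables for Time and Day
TV3.num <- merge(TV3.df, ca_coords, by = "slot")
head(TV3.num)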
We’ll take a look at one more of these, from the 4-D Titanic data set again.
# one more mca from a 4-D dataset
# this is simpler than it looks
# all the magic happens here
titanic.mca <- mjca(Titanic)
# saving the plot object supplies the coordinate positions
res <- plot(titanic.mca, labels=0, pch='.', cex.lab=1.2)
# extract factor names and levels
coords <- data.frame(res$cols, titanic.mca$factors)
# everything else is handling base R plotting stuff
cols <- c("blue", "red", "brown", "black")
nlev <- c(4,2,2,2)
points(coords[,1:2], pch=rep(16:19, nlev), col=rep(cols, nlev), cex=1.2)
pos <- c(3,1,1,3)
text(coords[,1:2], labels=coords$level, col=rep(cols, nlev), pos=rep(pos,nlev), cex=1.1, xpd=TRUE)
coords <- coords[ order(coords[,"factor"], coords[,"Dim1"]), ]
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Class", lty=1, lwd=2, col="blue")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Sex", lty=1, lwd=2, col="red")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Age", lty=1, lwd=2, col="brown")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Survived", lty=1, lwd=2, col="black")
legend("topleft", legend=c("Class", "Sex", "Age", "Survived"),
title="Factor", title.col="black",
col=cols, text.col=cols, pch=16:19,
bg="gray95", cex=1.2)
Note that the lines joining the two categories of each binary feature all pass through the origin.
# Simple CA needs a 2-way table
# pivot table of Titanic
Titanic_piv <- structable(~Class+Sex+Survived+Age, Titanic)
Titanic_piv
## Sex Male Female
## Age Child Adult Child Adult
## Class Survived
## 1st No 0 118 0 4
## Yes 5 57 1 140
## 2nd No 0 154 0 13
## Yes 11 14 13 80
## 3rd No 35 387 17 89
## Yes 13 75 14 76
## Crew No 0 670 0 3
## Yes 0 192 0 20
# all the work happens here
Titanic.ca <- ca(as.matrix(Titanic_piv))
# plot and save the object to get the coords for line segments
res <- plot(Titanic.ca)
segments(0, 0, res$cols[,1], res$cols[,2], col = "red", lwd = 1)
segments(0, 0, res$rows[,1], res$rows[,2], col = "blue", lwd = 0.5, lty = 3)
As you can see, Correspondence Analysis is an extremely powerful dimension reduction and clustering tool for categorical data.
Count, or frequency, data has many and varied applications, from event monitoring to text analysis. One-way analysis uses distributions to compare different samples and to monitor for changes in event frequency.
For the first task we will look at two common distributions for count data and compare two samples.
The Poisson distribution has just one parameter \(\lambda\) for the rate, and both the mean and variance are equal to this parameter.
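As a quick sanity check (my own addition, not from the book), you can simulate a large Poisson sample and confirm that the mean and the variance both sit close to \(\lambda\):
# simulate a Poisson sample; the mean and variance should both be near lambda
set.seed(54321)
x <- rpois(100000, lambda = 5)
c(mean = mean(x), variance = var(x))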
set.seed(12345)
KL <- expand.grid(k = 0 : 20, lambda = c(1, 5, 10, 20))
pois_df <- data.frame(KL, prob = dpois(KL$k, KL$lambda))
pois_df$lambda = factor(pois_df$lambda)
xyplot(prob ~ k | lambda, data = pois_df,
type = c("h", "p"), pch = 16, lwd = 2
, cex = 1.25, layout = c(4, 1)
, main = expression(paste("Poisson Distribution by ", lambda))
, xlab = list("Number of events (k)", cex = 1.25)
, ylab = list("Probability", cex = 1.25))
The negative binomial distribution has two parameters, which R calls size and prob. You can also think of it as having a mean (rate) plus a dispersion: the mean is \(\frac{\mathrm{size}\,(1-\mathrm{prob})}{\mathrm{prob}}\) and the variance is always greater than the mean. It can look like a right-skewed Poisson distribution and behaves as if a Poisson process is being sampled over a period of time while the rate is non-constant.
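Again as a quick check of my own (not from the book), here is the mean/variance relationship for the size/prob parameterisation used by dnbinom() and rnbinom():
# theoretical mean and variance for a negative binomial with these parameters
size <- 5; prob <- 0.4
c(mean = size * (1 - prob) / prob,
  variance = size * (1 - prob) / prob^2) # the variance is always larger than the mean
# a large simulated sample agrees closely
set.seed(54321)
x <- rnbinom(100000, size = size, prob = prob)
c(mean = mean(x), variance = var(x))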
XN <- expand.grid(k = 0 : 20, n = c(1, 5, 10, 20), p = c(0.4, 0.2))
nbin_df <- data.frame(XN, prob = dnbinom(XN$k, XN$n, XN$p))
nbin_df$n <- factor(nbin_df$n)
nbin_df$p <- factor(nbin_df$p)
xyplot(prob ~ k | n + p, data = subset(nbin_df, p == 0.4)
, main = "Neg. binom by size and prob"
, xlab = list("Number of failures (k)", cex = 1.25)
, ylab = list("Probability", cex = 1.25)
, type = c("h", "p"), pch = 16, lwd = 2
, strip = strip.custom(strip.names = TRUE)
)
xyplot(prob ~ k | n + p, data = subset(nbin_df, p == 0.2)
, main = "Neg. binom by size and prob"
, xlab = list("Number of failures (k)", cex = 1.25)
, ylab = list("Probability", cex = 1.25)
, type = c("h", "p"), pch = 16, lwd = 2
, strip = strip.custom(strip.names = TRUE)
)
However, Poisson and negative binomial data can look very similar. Identifying the best match and estimating the parameters is essential for accurate event monitoring and reporting results.
The following technique has been used to determine whether documents were written by the same or different authors. Distributions of marker words are fitted to text that is known to come from one author.
Any previously unseen text can then be tested against these known distributions to determine whether it was written by the same author.
In this simple case study, we will use the Federalist dataset to analyse how many times the word ‘may’ appears per 200-word block. This is a marker word that could help identify an author by how often they use it. Most studies of this kind would compare results across many such tests with different words.
Notes:
data("Federalist", package = "vcd")
Federalist
## nMay
## 0 1 2 3 4 5 6
## 156 63 29 8 4 1 1
sum(Federalist) # number of blocks
## [1] 262
sum(expand.dft(as.data.frame(Federalist))$nMay) # number of occurrences
## [1] 172
mean(expand.dft(as.data.frame(Federalist))$nMay)
## [1] 0.6564885
var(expand.dft(as.data.frame(Federalist))$nMay)
## [1] 1.007985
The descriptive stats show a variance slightly higher than the mean, but we don’t know whether this difference is significant. What’s the right test to use here?
A very naive approach is to take the mean and visually compare a Poisson distribution with this value for \(\lambda\) against the actual data.
# expected number of blocks = total blocks x Poisson probability for each count
barplot(sqrt(sum(Federalist) * dpois(0:6, mean(expand.dft(as.data.frame(Federalist))$nMay)
))
, names.arg = 0:6
, main = "Number of text blocks by Number of Occurrences\nnaive expected values"
, ylab = "sqrt(Naive Expected)"
, xlab = "Occurrences of 'may'"
, col = "lightblue")
The actual data looks a bit like this distribution. Could this be a good fit? How can we easily compare them?
barplot(sqrt(Federalist)
, main = "Number of text blocks by Number of Occurrences\nactual values"
, xlab = "Occurences of 'may'"
, ylab = "Sqrt(Actual)"
, col = "lightgreen"
)
Use the goodfit() function in the vcd package to estimate the parameters and compare the fit of different distributions. It provides hanging rootograms, which are intuitive views of discrete goodness-of-fit testing.
First, a Poisson fit using maximum likelihood - exactly what we’ve just done, but with bells on it:
Fed_fit0 <- goodfit(Federalist, type = "poisson")
# This will show the rate parameter, estimated by Maximum Likelihood
# This estimate is the same as the naive mean. This is to be expected when fitting a Poisson.
unlist(Fed_fit0$par)
## lambda
## 0.6564885
# This will show a Chi-Square Goodness of fit test against the expected values of a poisson distribution with this parameter
summary(Fed_fit0)
##
## Goodness-of-fit test for poisson distribution
##
## X^2 df P(> X^2)
## Likelihood Ratio 25.24312 5 0.0001250511
The \(\chi^2\) test is very strong evidence to reject the null hypothesis: this sample is not from a Poisson distribution with rate equal to the sample mean. On its own, though, the test doesn’t tell us much about what caused this result. The rootogram plot makes it very clear what the problem is by showing how the data diverge from the expected counts.
plot(Fed_fit0, type = "hanging", shade = TRUE)
By hanging each bar top down from the expected distribution, we just need to look at how the bottoms of the bars miss the \(y = 0\) reference line. This is visually much more intuitive than comparing the tops of the bars to a curve. The contribution of each bar to the \(\chi^2\) statistic is also made clear by the shading.
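To see why the hanging version is easier to read, you can draw the same fit in the other classic rootogram styles. This is an addition of mine; I’m assuming the “standing” and “deviation” types offered by vcd’s rootogram plotting for goodfit objects.
# the same Poisson fit as a standing rootogram (bars sit on the x-axis, compared against the expected values)
plot(Fed_fit0, type = "standing", shade = TRUE)
# and as a deviation rootogram (only the gaps between observed and expected are drawn)
plot(Fed_fit0, type = "deviation", shade = TRUE)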
We can try fitting a Poisson using a different estimation method, minimising the \(\chi^2\) errors instead:
Fed_fit0 <- goodfit(Federalist, type = "poisson", method = "MinChisq")
# This has found a different value for the rate parameter.
unlist(Fed_fit0$par)
## lambda
## 0.8236976
# This will show a Chi-Square Goodness of fit test against the expected values of a poisson distribution with this parameter
summary(Fed_fit0)
##
## Goodness-of-fit test for poisson distribution
##
## X^2 df P(> X^2)
## Pearson 46.86767 5 6.045556e-09
The \(\chi^2\) test is also very strong evidence against the null.
plot(Fed_fit0, type = "hanging", shade = TRUE)
The rootogram shows that the residuals have just been shuffled around so they balance better by sign. Clearly Poisson is not the right choice.
Next, fit a negative binomial distribution:
Fed_fit1 <- goodfit(Federalist, type = "nbinomial")
# This will show the size and prob parameters, estimated by Maximum Likelihood
# size is the dispersion parameter
# the implied mean, size * (1 - prob) / prob, is about 0.657 - very close to the Poisson lambda (the sample mean)
unlist(Fed_fit1$par)
## size prob
## 1.1863297 0.6437582
summary(Fed_fit1)
##
## Goodness-of-fit test for nbinomial distribution
##
## X^2 df P(> X^2)
## Likelihood Ratio 1.964028 4 0.7423751
plot(Fed_fit1, type = "hanging", shade = TRUE)
We have estimated the parameters, and the \(\chi^2\) goodness-of-fit test result is not significant. We have no evidence against the null hypothesis that this sample comes from a negative binomial distribution with the estimated parameters. The plot speaks for itself now that the bars drop close to the \(y = 0\) reference line, leaving only small residuals.
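If you want the numbers behind that plot, the expected counts are stored on the goodfit object. This little table is my own addition, using the count, observed and fitted components of the fit:
# observed vs expected number of blocks under the fitted negative binomial
data.frame(nMay = Fed_fit1$count,
observed = Fed_fit1$observed,
expected = round(Fed_fit1$fitted, 1))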
We will now assess a different body of text.
# Some dummy data
nonFederalist <- structure(c(135L, 71L, 36L, 17L, 8L, 4L, 1L), .Dim = 7L, .Dimnames = structure(list(
c("0", "1", "2", "3", "4", "5", "6")), .Names = "nMay"), class = "table")
nonFederalist
## nMay
## 0 1 2 3 4 5 6
## 135 71 36 17 8 4 1
sum(nonFederalist) # number of blocks
## [1] 272
sum(expand.dft(as.data.frame(nonFederalist))$nMay) # number of occurrences
## [1] 252
mean(expand.dft(as.data.frame(nonFederalist))$nMay)
## [1] 0.9264706
var(expand.dft(as.data.frame(nonFederalist))$nMay)
## [1] 1.470588
barplot(sqrt(nonFederalist)
, main = "Number of text blocks by Number of Occurrences\nunseen text"
, xlab = "Occurences of 'may'"
, ylab = "Sqrt(Actual)"
, col = "lightgreen"
)
The descriptive statistics are a bit larger than for the Federalist sample, but is this difference significant? We can test it properly using the parameters from the goodfit object we just created:
# we pass in our fitted parameter list this time
Fed_fit3 <- goodfit(nonFederalist, type = "nbinomial", par = Fed_fit1$par)
summary(Fed_fit3)
##
## Goodness-of-fit test for nbinomial distribution
##
## X^2 df P(> X^2)
## Pearson 22.41764 6 0.001016934
## Likelihood Ratio 19.51130 6 0.003381872
The \(\chi^2\) result is significant: strong evidence that this new text does not come from the same distribution. However, on its own, the test can’t provide much more detail.
plot(Fed_fit3, type = "hanging", shade = TRUE)
The rootogram hangs the frequency bars off the expected distribution, which this time is fixed at the parameters we estimated on the original text. We can now see from the pattern of residuals that the word ‘may’ seems to appear more frequently in this text. Specifically, the tail with 3-5 occurrences per 200 words is fatter, while there are fewer blocks where the word is absent altogether.
In our rush to use the latest developments and technologies in Machine Learning and Deep Learning, it’s easy to forget that none of it would be possible without the pioneering statistical research of previous decades. Most of this work is still relevant to everyday problems in Machine Learning practice, and it’s always worth reviewing the foundations before reaching for the newest tools.
Above all, never under-estimate the power of visual analytics. A well thought out plot can save you paragraphs of explanation.