Welcome. If you’re reading this, you probably attended a workshop or talk I’ve given on this topic. If you’re here by accident, let me fill you in.
I wanted to write a plain English, quick start guide for anyone working with data sets where many, most or all of the features are discrete (nominal, ordinal, frequency, counts, events). The reason I wanted to write it is that nothing like it existed when I really needed it some years ago. I had to strain every synapse to get to grips with the subject at the time. Perhaps that was a blessing in disguise, as the work paid off in many more ways than I had expected. Anyway, I hope this helps anyone out there who just wants to develop some intuition in this area of statistics, without getting bogged down in notation, theory and proofs.
These notes are for you if:
I’m also trying to keep the code really simple, so someone who wants to learn just enough R can pick up the basics by osmosis.
Note, I hope this is a living document which I can expand as time allows. Check back in a few months. The last date I updated should always be visible in the header.
The focus is on visual analytics, not theories and formulas! My intention is to turn you on to this topic, not send you running.
If you want to know more you can follow the links to the vignette, video and book that have been the most useful sources of information for me.
When I say ‘plain English,’ there’s no offence intended towards Michael Friendly, whose work is extremely accessible and well explained. I wouldn’t be doing this without his book in hand. What I want to do here is present a version of things that is as math and formula free as I can make it.
The examples are all in R. These are the libraries we’ll be using. That’s all I really want to say about R.
# I'm going to load all the R libraries here
# If you stop, quit R and come back later...
# ...you need to re-run this chunk.
library(knitr)
library(lattice)
library(ggplot2)
library(vcd)
library(vcdExtra)
library(ca)
library(randomForest)
library(gmodels)
library(mvtnorm)
library(ROSE)
The code examples are mostly based on material contained in the book Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data by Michael Friendly. Michael Friendly is a pioneer in this field and has contributed to the development of modules and libraries for SAS and R.
You can also find a much shorter tutorial on these topics in the long form help for the “vcd” package for R: vcd vignette
This video is a much more in depth lecture, by Michael Friendly, on topics covered in the book.
Any data professional with some practical experience of Machine Learning (ML), possibly crossing over from another discipline. I assume you have an understanding of the ML project life-cycle: data preparation, exploratory analysis, model building and evaluation, and reporting results. I also assume some basic statistical knowledge, the kind most people will have met in high school.
You’ll get the most benefit if you already know some R. If you don’t, maybe this will help you to learn some. The code examples are mostly just a few lines each.
Bear in mind, this is just an overview! We will only introduce formal statistics and tests where necessary, and will not develop any mathematical notation, theory or proofs. There is lots of material freely available prepared by real experts, rather than people like me. Head over to www.google.com now, if that’s what you’re looking for.
For now, I am not covering explanatory and predictive modeling. You are probably already familiar with logistic regression for binary classification. There are related or equivalent techniques for other kinds of classification (ordinal, polytomous/multi-class and counts). Even though these topics fall squarely under both DDA and ML, it’s a massive topic that requires a whole course/module to explain. I’ll get around to it, but I don’t have time right now.
I’m not going to cover R at all. I assume that you have enough experience using a computer, at least to locate, download and install some software, and execute some lines of code that have already been written for you.
I have been working with data since I first looked at some SQL scripts and a Microsoft SQL Server 2000 and realised it was doing something incredibly powerful while being very easy to understand. After that I travelled a very winding road, but most recently I have taken a career break to study. I have an MSc (Distinction) in Business Intelligence and am now researching a PhD in Machine Learning for Human-in-the-Loop Processes.
This branch of stats piqued my interest at three distinct points in my life:
My undergraduate degree in Microbiology - using techniques such as Most Probable Number to estimate the number of bacterial cells in a test tube. You certainly can’t count them all!
My career as a data professional in the tertiary (higher) education sector.
The kinds of questions I used to work on were:
Here are just some of the factors/features to be utilised in the research questions above:
The types of questions we did not need to answer were “what is the average height/weight of students across different groups?” So it was obvious that regular descriptive and quantitative stats (i.e. the normal distribution) were not the right tools for this work.
Tables and cross tabs were no use with so many dimensions. Pivot tables were a little better, but couldn’t help with significance testing and predictive analytics. Most importantly, I needed visual tools to communicate findings with the senior management team. I couldn’t expect those guys to squint over a page of tables and numbers with me, looking for clues. This is when I came upon Michael Friendly’s work on visualising categorical data. This was exactly what I was looking for at the time, so I was highly motivated to learn more about it.
If you want to run the code chunks and see the results in the R Studio UI you need access to a machine with R + R Studio and the source code for this file. After you’ve downloaded it to your working directory, just ditch all my text and move all the code chunks to a new .R file with:
# Don't run this if you've already got the code in a separate file
# knitr::purl("DDAR_Display_Page.Rmd")
Predictably, the term refers to analysis of discrete data! DDA is a set of tools and techniques that you would apply to achieve the same objectives as any quantitative statistical analysis but with these data types:
The most powerful set of tools we have is our eyes and brains. Humans are pretty damn impressive when it comes to pattern matching. The trick is to choose the right chart that reveals the right information for the task in hand. A chart is often acceptable as statistical evidence, provided it is enhanced with robust statistical information (confidence intervals, significance test results, etc). More importantly, a good chart really doesn’t need much explaining.
The next three charts are examples of what we tend to reach for when we have a mix of numeric and discrete data. However, you won’t see these again beyond this introduction because they are not the best tools for DDA.
# set the random number generator same every time
set.seed(123)
# a discrete value to match the groups
n <- 100
grp <- rep(c(1, 2), each = n)
# a random discrete value
rnd <- sample(c(1, 2)
, replace = TRUE
, size = 200)
numvars1 <- rmvnorm(n, c(0, 0)
, sigma = matrix(c(3,2,2,3)
, ncol=2))
numvars2 <- rmvnorm(n, c(1, 1)
, sigma = matrix(c(1,0.5,0.5,4)
, ncol=2))
numvars <- rbind(numvars1, numvars2)
# a discrete value based on a soft linear boundary
noise <- rnorm(2*n, mean = 0, sd = 0.5)
fam <- ifelse(rowSums(numvars) + noise > 3, 1, 2)
# plot, using graphic options to separate the discrete classes
plot(numvars, col = fam, pch = rnd, cex = grp
, main = "scatterplot with size, shape and colour options"
, sub = "typical options for stratifying numeric data by discrete data"
, xlab = "x"
, ylab = "y")
# take row sums then table-apply the sum by each group
grp_total <- tapply(rowSums(numvars), grp, sum)
rnd_total <- tapply(rowSums(numvars), rnd, sum)
fam_total <- tapply(rowSums(numvars), fam, sum)
# totals barplot
barplot(c(grp_total, rnd_total, fam_total)
, col="lightgray", border = 1:2
, main="barplot with border options"
, sub = "numeric data stratified on groups")
# classic box plot
# note that you can't tell
# that grp is the actual groups
# and rnd is a random labelling
boxplot(rowSums(numvars)~grp*rnd, border=1:2
, main = "boxplot distribution analysis by groups"
, sub = "typical options for stratifying numeric data by discrete data"
, xlab = "grp1 grp2 rnd1 rnd2"
, ylab = "row sum value")
The plots above place all the focus on the numeric features, slicing them into groups by the various discrete features. We would like charts where the focus is on the discrete features; something that provides more information about the discrete features themselves.
# length will count the members
grp_count <- tapply(rowSums(numvars), grp, length)
rnd_count <- tapply(rowSums(numvars), rnd, length)
fam_count <- tapply(rowSums(numvars), fam, length)
# counts barplot
barplot(c(grp_count, rnd_count, fam_count)
, col="lightgray", border = 1:2
, main="barplot with border options"
, sub="counts by groups")
By counting the group members, the numeric variable is out of the picture entirely. The only trace of it is that the fam group was based on a linear combination of the simulated data. The grp group was fixed at exactly n members each, while rnd was drawn at random with even odds. It’s very easy, using this visual reference, to link the count data to the information we already know.
I will provide much more interesting examples a little later on.
Specifically, contingency tables or crosstabs. The categorical features form the axes, and each cell contains the total number of individuals at the intersection of each feature. Here are some examples:
# In 2D - collapse over department
margin.table(UCBAdmissions, 2:1)
## Admit
## Gender Admitted Rejected
## Male 1198 1493
## Female 557 1278
# 3D table - Admit on rows (D1), Gender on columns (D2)
# and Department on levels (D3)
UCBAdmissions # Department has 6 levels
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
# pivot the department onto columns
aperm(UCBAdmissions, c(2, 3, 1))
## , , Admit = Admitted
##
## Dept
## Gender A B C D E F
## Male 512 353 120 138 53 22
## Female 89 17 202 131 94 24
##
## , , Admit = Rejected
##
## Dept
## Gender A B C D E F
## Male 313 207 205 279 138 351
## Female 19 8 391 244 299 317
# flatten to a crosstab in default order
structable(UCBAdmissions)
## Gender Male Female
## Admit Dept
## Admitted A 512 89
## B 353 17
## C 120 202
## D 138 131
## E 53 94
## F 22 24
## Rejected A 313 19
## B 207 8
## C 205 391
## D 279 244
## E 138 299
## F 351 317
# flatten to a crosstab with pivot
structable(aperm(UCBAdmissions, c(2, 3, 1)))
## Dept A B C D E F
## Gender Admit
## Male Admitted 512 353 120 138 53 22
## Rejected 313 207 205 279 138 351
## Female Admitted 89 17 202 131 94 24
## Rejected 19 8 391 244 299 317
# More D
structable(Survived~., data=Titanic)
## Survived No Yes
## Class Sex Age
## 1st Male Child 0 5
## Adult 118 57
## Female Child 0 1
## Adult 4 140
## 2nd Male Child 0 11
## Adult 154 14
## Female Child 0 13
## Adult 13 80
## 3rd Male Child 35 13
## Adult 387 75
## Female Child 17 14
## Adult 89 76
## Crew Male Child 0 0
## Adult 670 192
## Female Child 0 0
## Adult 3 20
# More D - another pivot
structable(Survived+Sex~., data=Titanic)
## Survived No Yes
## Sex Male Female Male Female
## Class Age
## 1st Child 0 0 5 1
## Adult 118 4 57 140
## 2nd Child 0 0 11 13
## Adult 154 13 14 80
## 3rd Child 35 17 13 14
## Adult 387 89 75 76
## Crew Child 0 0 0 0
## Adult 670 3 192 20
Tables like this have several disadvantages. Staring at numbers is not very informative, unless there are obvious large or small values, and things start to get tricky in more dimensions.
Adding row, column and table totals can be informative but only works well in 2D.
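For a quick look at the totals themselves, addmargins() does the job in one line. A minimal example on the same 2-D table:
# add row, column and grand totals to the 2-D table
addmargins(margin.table(UCBAdmissions, 2:1))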
# An old fashioned crosstab - 2D
CrossTable(margin.table(UCBAdmissions, 2:1)
, prop.chisq = FALSE
, format = "SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 4526
##
## | Admit
## Gender | Admitted | Rejected | Row Total |
## -------------|-----------|-----------|-----------|
## Male | 1198 | 1493 | 2691 |
## | 44.519% | 55.481% | 59.456% |
## | 68.262% | 53.879% | |
## | 26.469% | 32.987% | |
## -------------|-----------|-----------|-----------|
## Female | 557 | 1278 | 1835 |
## | 30.354% | 69.646% | 40.544% |
## | 31.738% | 46.121% | |
## | 12.307% | 28.237% | |
## -------------|-----------|-----------|-----------|
## Column Total | 1755 | 2771 | 4526 |
## | 38.776% | 61.224% | |
## -------------|-----------|-----------|-----------|
##
##
A huge range of DDA research questions are ultimately answered with a \(\chi^2\) test. It’ll happen at some point under the hood of most of the functions used in this code. That’s simply because tables are the main data structures for DDA and most tasks boil down to comparing expected values with actual, observed values.
This test instantly tells you if numbers in table cells are what you expect them to be. You might expect all the numbers to be proportional to the row and column totals, or you might have some very specific numbers you expect. Either way, a \(\chi^2\) test reveals whether your observed numbers are near enough to your expected numbers that you can put any small differences down to some random sampling error.
This is good news, right? Nearly everyone who’s done any stats has run a \(\chi^2\) test at some point and can make sense of the results. A significant result in a \(\chi^2\) test means it’s very improbable that the data matches what we expected. For reasons that I’m not going into now, you can never really prove anything with absolute, unequivocal certainty with statistics. So, we always phrase such tests in terms of the reverse of what we’re looking for. Essentially, it’s more robust if we can demonstrate that some opposing assumption is very likely to be wrong, because it makes it more probable that we’re right. This opposite assumption that we state at the beginning of a test is called the null hypothesis, or simply the null. We like to give it the symbol \(H_0\).
Here is a null for when we want to show that a category of feature A is much more popular: \(H_0\): all categories of feature A are chosen with equal frequency.
And here’s one we would use to show that there is an interaction between features A and B, where certain categories in A and B tend to go together, while other combinations don’t get chosen very frequently: \(H_0\): features A and B are independent, so each cell count is proportional to its row and column totals.
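Here’s a minimal sketch of what both of these tests look like in R. The counts are made up, purely for illustration:
# null 1: made-up counts for a single feature A with four categories
A <- c(a1 = 40, a2 = 22, a3 = 18, a4 = 20)
chisq.test(A)                             # H0: all categories equally popular
chisq.test(A, p = c(0.4, 0.2, 0.2, 0.2))  # H0: specific expected proportions
# null 2: a made-up 2-way table of counts for features A and B
AB <- matrix(c(30, 10, 5, 25), nrow = 2
             , dimnames = list(A = c("a1", "a2"), B = c("b1", "b2")))
chisq.test(AB)                            # H0: A and B are independent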
Note, we will touch on statistical distributions later, but not the \(\chi^2\) distribution. This distribution comes up all the time in statistical testing, but doesn’t really model anything found in the real world that we actually want to measure.
Most people with some stats knowledge are aware of the log odds from logistic regression. The odds itself is the ratio \(o = \frac{p}{1-p}\) for \(p\), the proportion of yes and \(1-p\), the proportion of no in any binary situation, such as a coin flip. The log odds is simply \(\log(\frac{p}{1-p})\).
The odds is converted back to a probability by \(p = \frac{o}{1+o}\) which means we can connect the number of times something has happened back to the probability that it happens, and compare these probabilities across categories to see if they show any patterns.
An odds ratio is simply a ratio of odds from two binary systems \(\frac{o_1}{o_2}\) and the log odds ratio is \(\log(\frac{o_1}{o_2})\). There’s a pattern here:
\[\log(\frac{a}{b})\]
Let’s think about why we like to take the log of a fraction. First of all, I’m going to set two conditions: \(a\) and \(b\) are always positive, and \(b\) can never be \(0\). Now, think about what happens when 1. \(a\) and \(b\) are equal, 2. \(a\) gets much bigger than \(b\), and 3. \(b\) gets much bigger than \(a\).
p <- c(0.05, .1, .25, .50, .75, .9, .95)
odds <- p / (1 - p)
logodds <- log(odds)
logits <- data.frame(p, odds, logodds)
logits
## p odds logodds
## 1 0.05 0.05263158 -2.944439
## 2 0.10 0.11111111 -2.197225
## 3 0.25 0.33333333 -1.098612
## 4 0.50 1.00000000 0.000000
## 5 0.75 3.00000000 1.098612
## 6 0.90 9.00000000 2.197225
## 7 0.95 19.00000000 2.944439
a <- c(1000, 1000, 1)
b <- c(1000, 1, 1000)
log(a/b)
## [1] 0.000000 6.907755 -6.907755
Hopefully you can see that while fractions and ratios are centred on \(1\) when \(a = b\), they have this nasty habit of flying off towards \(\infty\) as \(a\) gets much bigger, while squashing closer and closer to \(0\) as \(b\) gets much bigger. Comparing counts or ratios of \(a:b\) over the levels of a third variable \(c\) becomes very problematic, because we would have to handle two of the most difficult situations in computing - potentially infinite and infinitesimal numbers - in the same operation. That’s really unhelpful.
Here are the top three reasons the log odds ratio is so useful in DDA:
The log function is very special for lots of other reasons but, in DDA especially, it is very helpful to have this neat way to manage ratios.
That’s all I want to say about the log ratio and the log odds ratio right now. We’ll come back to it in later sections.
There are several distributions to help with the most common DDA research questions; the ones used below are the binomial, Poisson, geometric and negative binomial. You can recognise them because a random variable from any of these can only be a natural number \(\{ 0,1,2,\dots \}\) (no negative numbers).
You probably have data matching one of these distributions if you have data that looks like this:
seed <- 125
set.seed(seed)
# the first number is the number of samples
# the other numbers are the parameters of the distribution
rbinom(10, size = 10, prob = 0.2)
## [1] 3 1 1 1 4 5 2 1 2 2
rpois(10, lambda = 5)
## [1] 1 7 4 2 5 4 4 9 6 6
rgeom(10, prob = 0.5)
## [1] 2 0 0 0 0 0 0 2 0 0
size <- 2
prob <- 0.2
count <- table(rnbinom(1000, size = size, prob = prob))
freq <- count/sum(count)
waittime <- as.integer(names(count))
xyplot(freq~waittime
, xlab = "Wait Time (minutes)"
, ylab = "% of total"
, type = c("h", "p") # this will draw lollipops
, pch = 16, lwd = 2
, main = paste("Negative Binomial. Rate ="
, size
, ", Dispersion ="
, 1/prob)
, sub = paste("Data from R's pseudo-random\n number generator with seed = ", seed))
The main reason we like to use distributions in data analysis, where possible, is that so many clever people have spent the last two or three hundred years researching how they work! This means that all the most accurate, provable and supported tooling for statistics is based on distributions. What we try to do is match the data we have to a known distribution. If we find a good match, we can assume the mathematical properties of the distribution will apply to the real world data. This is what people mean by parametric tests: using all the tried and tested distributional properties to make assumptions about our real-world data.
However, distributions are not real world data. They are just mathematical constructions. They model many things in the real world but you need to be pretty strict about how and when to use them.
On a project using real-world, not simulated data, there simply might not be a distribution that fits and there are hundreds of non-parametric tools that make a model of the data directly from the data itself. That’s what a lot of people think of as Machine Learning (though such definitions are highly debatable).
Anyway, the moral of the story: use a distribution when it makes sense. That’s often a good place to start with 1-dimensional data (just a bag of integers, as above). If none fits, then move on and find another method to analyse your data. Data in more than one dimension, with lots of categorical features, needs a lot more work.
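If you want to check how well a candidate distribution fits a bag of integers like the ones above, the vcd package has goodfit(). Here’s a minimal sketch on simulated data; swap in your own vector of counts:
# fit a Poisson distribution to a vector of counts and test the fit
x <- rpois(500, lambda = 3)         # simulated counts, for illustration only
fit <- goodfit(x, type = "poisson")
summary(fit)                        # goodness-of-fit test
plot(fit)                           # hanging rootogram of observed vs fitted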
This is a non-exhaustive list of common statistical tasks and questions that require specific DDA techniques.
We don’t have time to develop fluency in these tasks in this workshop, but if this is something useful for you, there are some “building block” skills that are worth spending time on:
The most useful data structures for DDA in R are not data frames, but n-dimensional arrays, n-way tables and matrices, and sometimes data frames with a frequency column (for the count data).
It’s good to become fluent with functions for manipulating table data - pivoting and collapsing dimensions, and converting between data structures to suit your analysis. Useful functions to get to know are table(), addmargins(), prop.table(), ftable(), xtabs(), xtable(), structable(), apply(), margin.table(), collapse.table(), aggregate(), aperm(), expand.dft() and subset(), plus integer indexing to re-order categories and pivot dimensions. Most of the specialised objects have as.table() and as.data.frame() methods that generate the most intuitive shape for the output data structure.
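Here’s a small example of one round trip between these structures, using the built-in UCBAdmissions table. Nothing clever, just the moves you’ll make over and over:
# 3-way table (array) -> frequency-form data frame -> 2-way table
ucb_df <- as.data.frame(UCBAdmissions)          # one row per cell, plus Freq
ucb_2d <- xtabs(Freq ~ Gender + Admit, ucb_df)  # collapse over Dept
round(prop.table(ucb_2d, 1), 2)                 # row proportions
aperm(ucb_2d, 2:1)                              # pivot the dimensions
ftable(UCBAdmissions)                           # flatten the full 3-way table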
Library “vcd” comes with its own plotting system. It is based on the grid package but is different from R base graphics, ggplot2 and lattice! It’s easy to get started, but a bit of effort is required to learn all the tweaks for making publication ready graphical output. However, interpreting these area-based plots is really intuitive, and it’s worth the effort to learn so that results can be communicated succinctly (a picture tells a thousand words).
Exploring relationships between categorical features. Mosaic plots are very powerful for visualising categorical data. Area and shading can be used in many ways to highlight the important interactions. These visual techniques remain intuitive up to four or five categorical dimensions. This allows you to identify very complex interactions.
Note, the ‘strucplot framework’ that is the heart of the vcd package requires learning a new approach to structuring data and writing plotting routines compared to previous experience with R. However, it’s worth the effort for the intuitive views of complex relationships in the data.
The exploratory data analysis phase of any project. Do this early on in the project once you’ve gathered and cleaned up your data enough to start sense-making. It doesn’t make any sense to reach for more complex tools before you know what you’ve got.
Spending some time “getting to know” a new dataset and developing some intuition about important features and interactions will pay off by revealing a better strategy when it comes to modelling.
The HairEyeColor dataset ships with base R. It is a survey of 592 statistics students at the University of Delaware in 1974. It is a 3-D array cross-tabulating the observations. The features and their levels are as follows:
No | Name | Levels |
---|---|---|
1 | Hair | Black, Brown, Red, Blond |
2 | Eye | Brown, Blue, Hazel, Green |
3 | Sex | Male, Female |
For now, we will ignore the sex of each individual by collapsing the 3-D table down to 2-D, and re-order the table from dark to light eye colour (hair is already in that order).
# ignore gender for now
haireye <- margin.table(HairEyeColor, 1:2)
# re-arrange order of eye colour from dark to light
# it's already correct for hair
haireye <- as.table(haireye[, c("Brown", "Hazel", "Green", "Blue")])
haireye
## Eye
## Hair Brown Hazel Green Blue
## Black 68 15 5 20
## Brown 119 54 29 84
## Red 26 14 14 17
## Blond 7 10 16 94
Big numbers and small numbers stand out but the table still has 16 cells which is a lot to take in at once. It’s not immediately obvious what the relationships are.
A naive approach might be to produce a bar plot of the counts by groups.
# non-intuitive views
# this is what the Frequency form looks like
as.data.frame(haireye)
## Hair Eye Freq
## 1 Black Brown 68
## 2 Brown Brown 119
## 3 Red Brown 26
## 4 Blond Brown 7
## 5 Black Hazel 15
## 6 Brown Hazel 54
## 7 Red Hazel 14
## 8 Blond Hazel 10
## 9 Black Green 5
## 10 Brown Green 29
## 11 Red Green 14
## 12 Blond Green 16
## 13 Black Blue 20
## 14 Brown Blue 84
## 15 Red Blue 17
## 16 Blond Blue 94
# You're forced to favour one variable over the other
# Does eye colour depend on hair colour
barplot(haireye, beside = TRUE, legend = TRUE)
# Or does hair colour depend on eye colour
# pivot the other way
barplot(aperm(haireye, 2:1), beside = TRUE, legend = TRUE)
It’s hard to compare bars across groups, because your eye has to jump to find each member. A more intuitive view would preserve the table structure and not condition one feature on the other:
tile(haireye) # vcd package
Hopefully, you can see the value of this simple concept: put the categorical features on the axes and plot a shape whose area is proportional to the count in each cell. A pattern of some kind is self-evident. Let’s formally report it with a \(\chi^2\) test. The null in this case is just like our second example above.
chisq.test(haireye)
##
## Pearson's Chi-squared test
##
## data: haireye
## X-squared = 138.29, df = 9, p-value < 2.2e-16
With such a small p-value, we have very strong evidence to reject the null hypothesis that there is no relationship between hair colour and eye colour. We can now assume they are interacting, but what exactly is the pattern? The only thing the test tells us is there is a very significant difference between actual and expected counts.
Let’s use a mosaic plot to visualise expected counts as if there were no interaction between the features. Here’s what no interaction looks like:
# visualising expected counts for independent features
expected <- independence_table(haireye)
round(expected, 1)
## Eye
## Hair Brown Hazel Green Blue
## Black 40.1 17.0 11.7 39.2
## Brown 106.3 44.9 30.9 103.9
## Red 26.4 11.2 7.7 25.8
## Blond 47.2 20.0 13.7 46.1
mosaic(expected
, shade = TRUE
, main="Expected frequencies"
, labeling = labeling_values
, value_type = "expected"
, gp_text = gpar(fontface = 1))
Note that the areas are proportional to the counts. The grid lines between the tiles are dead straight because each cell is proportional to the marginal totals.
Here’s what the actual counts look like:
# visualising the actual, observed counts
mosaic(haireye
, gp = shading_Friendly # shade = TRUE
, main="Actual frequencies"
, labeling = labeling_values
, value_type = "observed"
, gp_text = gpar(fontface = 1)
, rot_labels = c(top = -20))
Here the grid lines between cells are no longer continuous because the cells are out of proportion with the marginal totals. Unexpectedly large cells show that the combination is over-represented and small cells the reverse.
The beauty of the mosaic chart is the use of shading to highlight the cells that differ the most from the expected counts. We refer to these differences as the residuals. This shading scheme uses bold fills for statistically significant residuals but also adds an outline colour to highlight the sign of even the non-significant residuals. This is visually very powerful. There is a clear “opposite corner” pattern.
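If you want to see the numbers driving the shading, they are just the Pearson residuals from the \(\chi^2\) test:
# the shading is based on the Pearson residuals:
# (observed - expected) / sqrt(expected)
round(chisq.test(haireye)$residuals, 2)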
Let’s take this up to all three dimensions now.
# showing all three dimensions
HEC <- HairEyeColor[, c("Brown", "Hazel", "Green", "Blue"), ]
mosaic(HEC
, gp = shading_Friendly # shade = TRUE
, main="Actual frequencies"
, labeling = labeling_values
, value_type = "observed"
, gp_text = gpar(fontface = 1), rot_labels = c(right = -45))
# with a different pivot
mosaic(aperm(HEC, 3:1)
, gp = shading_Friendly # shade = TRUE
, main="Actual frequencies"
, labeling = labeling_values
, value_type = "observed"
, gp_text = gpar(fontface = 1), rot_labels = c(right = -45))
We can still see the general opposite corner pattern, and now some additional detail regarding gender.
This type of plot is still fairly intuitive at 4 dimensions. Take a look at the Titanic dataset. Who died and who survived? Did it depend on some combination of passenger class, gender and age?
mosaic(aperm(Titanic, c(4, 2, 1, 3))
, gp = shading_Friendly # shade = TRUE
, main="Who Died and Who Survived the Titanic?"
, labeling = labeling_values
, value_type = "observed"
, gp_text = gpar(fontface = 1), rot_labels = c(right = -45))
More dimensions can be added by other means. A cotabplot allows you to stratify over a further dimension. It looks just like facet_wrap() and facet_grid() in ggplot2, or a lattice/trellis plot.
In this example, with just default settings, two categories have been stratified. You could add further dimensions if you have more and still make sense of the plots.
# This doesn't appear to render well on the HTML output
# note, this is all the defaults except the shading!
cotabplot(Titanic, gp = shading_Friendly)
There is a whole methodology around loglinear modeling which allows you to make statistical hypothesis tests about complex n-way relationships between features. We’ll leave that for another time.
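If you want a tiny taste before then, here is a hedged sketch, assuming the MASS package is installed, of a loglinear test of mutual independence on the hair, eye and sex table from above:
# loglinear model of mutual independence: Hair, Eye and Sex unrelated
# a significant lack of fit means at least one interaction term is needed
MASS::loglm(~ Hair + Eye + Sex, data = HEC)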
After model training. Use this method to deepen your understanding of how well your model is performing.
This task can be guided by two things. Firstly, if you’ve done a good exploratory analysis, you will have developed an understanding of which features could be important. Secondly, several well known ML models can output feature importance information in their own right, saving you all the hard work.
However, these feature importance measures tend to be crude. The feature is important, but why? All feature importance really tells you is that the model would be less accurate if you excluded that feature. It doesn’t tell you anything about how that feature is helping or hindering.
Before moving on, let’s develop some intuition about this and show how the vcd package can help.
We will do some ML and train a random forest model to predict whether families in Slovenia should receive a state sponsored nursery place. The target variable is decision.
set.seed(54321)
nursery <- read.csv(gzfile("nursery.csv.gz"))
summary(nursery)
## parents has_nurs form children
## great_pret :4320 critical :2592 complete :3238 1 :3238
## pretentious:4320 improper :2592 completed :3240 2 :3240
## usual :4318 less_proper:2592 foster :3240 3 :3240
## proper :2590 incomplete:3240 more:3240
## very_crit :2592
## housing finance social
## convenient:4318 convenient:6478 nonprob :4319
## critical :4320 inconv :6480 problematic :4320
## less_conv :4320 slightly_prob:4319
##
##
## health decision
## not_recom :4320 not_recom :4320
## priority :4320 priority :4266
## recommended:4318 spec_prior:4044
## very_recom: 328
##
All the features are categorical. Random forest ought to perform very well on this without any effort.
# train test split
idx <- sample(c(TRUE, FALSE)
, size = nrow(nursery)
, prob = c(0.7, 0.3)
, replace = TRUE)
nurs_train <- nursery[idx, ]
nurs_test <- nursery[!idx, ]
# train rf accepting all the defaults
nurs_rf <- randomForest(decision~., data=nurs_train)
# get the predictions on new data
nurs_preds <- predict(nurs_rf, newdata = nurs_test)
# confusion matrix
confmat <- with(nurs_test, table(nurs_preds, decision))
confmat
## decision
## nurs_preds not_recom priority spec_prior very_recom
## not_recom 1319 0 0 0
## priority 0 1232 13 50
## spec_prior 0 24 1142 0
## very_recom 0 0 0 57
Indeed, the model performs well and needed no effort to tune. Ideally, I would like to visualise the confusion matrix directly, but it turns out to be a bit of a fussy task in R. I’ve looked online for any other way to do it, and I can only find some customised ggplot examples. Let’s compare the different plotting options from R:
# base
# there are no words to describe how horrible this is
image(confmat)
# lattice
# wrong diagonal aspect, horrible colours
levelplot(confmat)
# same
levelplot(t(confmat))
# ggplot
# too much fuss
# default colours not subtle enough to show error
# also the wrong diagonal
# First, convert to frequency form
confmat_freq <- as.data.frame(confmat)
# Next, tell ggplot what to do
ggplot(aes(x=nurs_preds
, y=decision
, fill=Freq)
, data=confmat_freq) +
geom_tile()
I’m not sure if Michael Friendly and the programmers of vcd meant for it to be used this way, but strucplot options are ideal for confusion matrices.
# strucplot sieve
# the shading is a bit too dense on the diagonal
# there are lots of options
# it might be possible to customise
sieve(confmat, labeling = labeling_values)
# strucplot tile - I like this one best
tile(confmat, labeling = labeling_values)
There is a little bit of confusion among the classes priority and spec_prior. The model also struggles with the minority very_recom class, preferring to label these as priority.
One of the most useful features of random forest models is the ability to report variable importance. This measures, for each feature, how much worse the model would do if that feature wasn’t available at training time. There is a custom plot function for this.
varImpPlot(nurs_rf)
We can see from this plot that the health variable seems to hold a majority of the predictive power for this model. Almost all the information is in this and the next three most important features.
What do we do now we know that? We investigate further! There is always a next question. Now we’ve become expert in handling 3-D tables, let’s stratify (slice) the confusion matrix by this health variable and see what’s going on.
# unfortunately, there is no cotabplot option for tile
# this has to be done manually to make any sense
# from the summary, health can take one of three values
# not_recom, priority, recommended
cm_3d <- with(nurs_test
, table(nurs_preds, decision, health))
re <- cm_3d[,,"recommended"]
pr <- cm_3d[,,"priority"]
nr <- cm_3d[,,"not_recom"]
tile(re, labeling = labeling_values)
tile(pr, labeling = labeling_values)
tile(nr, labeling = labeling_values)
This last plot is very striking. One level of the health feature has all the information required to completely classify one of the class labels. It’s not surprising that health is by far the most important variable! However, it turns out to be very uninteresting from a ML point of view. A very simple program could do almost as well as this random forest.
A situation like this is quite unusual in the real world, and truthfully, should be identified during exploratory analysis. The sensible thing to do is probably to filter out all the instances in this health category and build a new model. However, it serves as a good example to show that confusion matrix results are not as informative as they first appear.
In binary classification, this could never happen. There would be no ML to do if one feature was a perfect representation of the class labels. Furthermore, problems are often organised so that the two classes are balanced and this should remain the case in the confusion matrix of a well performing model. How could this slicing technique help?
When you are evaluating your classification model performance with a confusion matrix and you have a reason to believe that one or more categorical features contributed to good/bad performance. You need to investigate further to improve model performance or correct for any bias.
You want to use a non-parametric model, such as a tree based method, but your situation is sensitive to unbalanced loss. For example, a medical or financial model where either false positives or false negatives are especially costly. You need a detailed understanding of features and categories where the model is more or less confident.
We will now do a random forest binary classification. The german data set is a credit rating data set, where individuals are given either a good or bad rating. Incorrect good predictions cost the business much more, because of the risk of loan defaults. Incorrect bad results only result in a small reduction of revenue from loan interest payments in the long term.
Let’s refer to correct identification of a bad result as a true positive outcome. When there is a high proportion of bad instances that are incorrectly classified as good, this gives rise to a high false negative rate and is very undesirable.
We could crudely force the model to be more cautious overall, so that it makes more false positive results (incorrect bad ratings). However, it would be better if we could develop a more targeted strategy to improve the false negative rate.
german <- read.csv(gzfile("german.csv.gz"))
summary(german)
## chk dur crhis pps amt
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972
## Max. :72.0 A46 : 50 Max. :18424
## (Other): 55
## svng emp rate pers debt res
## A61:603 A71: 62 Min. :1.000 A91: 50 A101:907 Min. :1.000
## A62:103 A72:172 1st Qu.:2.000 A92:310 A102: 41 1st Qu.:2.000
## A63: 63 A73:339 Median :3.000 A93:548 A103: 52 Median :3.000
## A64: 48 A74:174 Mean :2.973 A94: 92 Mean :2.845
## A65:183 A75:253 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
##
## prop age plans hous creds
## A121:282 Min. :19.00 A141:139 A151:179 Min. :1.000
## A122:232 1st Qu.:27.00 A142: 47 A152:713 1st Qu.:1.000
## A123:332 Median :33.00 A143:814 A153:108 Median :1.000
## A124:154 Mean :35.55 Mean :1.407
## 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :75.00 Max. :4.000
##
## job deps tel foreign rating
## A171: 22 Min. :1.000 A191:596 A201:963 bad :300
## A172:200 1st Qu.:1.000 A192:404 A202: 37 good:700
## A173:630 Median :1.000
## A174:148 Mean :1.155
## 3rd Qu.:1.000
## Max. :2.000
##
set.seed(54321)
# train test split
idx <- sample(c(TRUE, FALSE)
, size = nrow(german)
, prob = c(0.7, 0.3)
, replace = TRUE)
german_train <- german[idx, ]
german_test <- german[!idx, ]
# make the training data balanced in case over fits good
# originally 300 bad, 700 good
german_train <- ovun.sample(rating~.
, data=german_train
, method = "over"
, p = 0.5)$data
# restore the factor levels to correct order
german_train$rating <- relevel(german_train$rating, "bad")
# train rf accepting all the defaults
german_rf <- randomForest(rating~., data=german_train)
# get the predictions on new data
german_preds <- predict(german_rf, newdata = german_test)
# confusion matrix
confmat <- with(german_test, table(german_preds, rating))
confmat
## rating
## german_preds bad good
## bad 34 28
## good 42 185
The confusion matrix looks ok, but how does it really do, in light of our high cost of false negatives?
# accuracy
sum(diag(confmat))/sum(confmat)
## [1] 0.7577855
# False Negative Rate FN / (TP + FN)
confmat[2, 1]/sum(confmat[, 1])
## [1] 0.5526316
# False Ommision Rate FN / (TN + FN)
confmat[2, 1]/sum(confmat[2, ])
## [1] 0.185022
So it’s not a great model, but let’s assume it’s better than what the company was doing before and move on. These numbers are useful but we want an intuitive visual representation for this, so what works best?
A rose plot is a surprisingly intuitive choice for viewing a binary confusion matrix, which is a \(2 \times 2\) table. Rose plots are a twist on the pie chart. Pie charts divide a circle into segments whose internal angle \(\theta\) is proportional to the category count. A rose plot does the reverse. The segments all have angle \(\theta = \frac{2 \pi}{N}\), for \(N\) segments. For a binary matrix with \(4\) options, \(\theta = \frac{\pi}{2} = 90^o\). Then, the radius of each segment varies with the square root of the count in each cell, so that \(\mathit{area} \propto \mathit{count}\). Because the radius only grows as \(\sqrt{\mathit{count}}\), the rose plot will gracefully handle numbers that differ by one or two orders of magnitude.
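Here’s a quick numeric check of that scaling, just to illustrate the point (not part of the plotting code):
# radius scaling when area is proportional to the count
counts <- c(1, 10, 100, 1000)
round(sqrt(counts / max(counts)), 2)  # 0.03 0.10 0.32 1.00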
Most rose plot options available in R are basically bar charts that are transformed into polar coordinates. Even for a \(2 \times 2\) matrix, this would be fiddly to code because it requires converting the matrix to a frequency data frame and then forcing the plot to arrange the four cell values in a way that preserves the confusion matrix shape.
The vcd package contains a specialised fourfold plot whose purpose is to compare several \(2 \times 2\) tables over the levels of a third feature. It is intended for a different purpose than what we want to do, but the behaviour can be easily changed with a parameter to the plot function call that makes it perfectly suited for our needs.
# fourfold plot preserves the table structure it was given
# however, default balances the areas to compare odds ratios no matter the individual count values
# that's not what we want
fourfold(confmat)
# std = "all.max" suppresses balancing the areas
# behaves like a rose plot
fourfold(confmat, std = "all.max")
The rose plot makes it visually clear that the model is great at classifying true negatives, but it is predicting too many negatives overall and misclassifying more true positives than it correctly identifies. Such a high FNR is not acceptable. Why is the model not doing well?
If a categorical feature \(c\) has a high importance, it would suggest that the model has encoded how the ratio of good and bad ratings is different over the different levels of \(c\) and can make use of that information in predicting.
Let’s use feature importance to break down the results. This data set has a mix of numeric and categorical features.
varImpPlot(german_rf)
First thing to note is that there is no elbow point where we could safely make the model simpler by just picking the top performing features.
Secondly, and very fortunately for us, several categorical features are considered important, including the top one. Three of them are in the top six. Of course, it doesn’t matter if categorical variables don’t have high importance. Any numeric variable could be binned/discretised, and then have this technique applied.
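As a quick illustration of that last point, here is a sketch of discretising the numeric amt feature (quartile bins, chosen arbitrarily) and stratifying the confusion matrix on it. I’m calling fourfold() directly here; once my_rose() is defined below you could use that instead:
# bin the loan amount into quartiles, then stratify the confusion matrix
amt_bin <- cut(german_test$amt
               , breaks = quantile(german_test$amt, probs = seq(0, 1, 0.25))
               , include.lowest = TRUE)
confmat_a <- table(german_preds, german_test$rating, amt_bin)
fourfold(confmat_a, std = "all.max")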
As we want to customise the fourfold plot a little, we can make a new function to save repeating the code each time.
# customise the fourfold plot
my_rose <- function(cm) {
fourfold(cm
, std = "all.max" # mimic a 4 way rose plot
, conf_level = 0 # suppress confint tests, not useful here
, color = c("#FF0000", "#000080") # need to force colours due to suppressing confint test
)
}
# while we're at it, let's make a function to get the numeric statistics that we want from the conf mat
my_confmat_stats <- function(cm) {
cf_stats <- list()
for (i in 1:(dim(cm)[3])) {
category <- dimnames(cm)[[3]][i]
acc <- sum(diag(cm[ , , i]))/sum(cm[ , , i])
# False Negative Rate FN / (TP + FN)
fnr <- cm[2, 1, i]/sum(cm[ , 1, i])
# False Ommision Rate FN / (TN + FN)
fomr <- cm[2, 1, i]/sum(cm[2, , i])
# collect
cf_stats[[category]] <-
c(acc=acc, fnr=fnr, fomr=fomr)
}
return(t(as.data.frame(cf_stats)))
}
We’ll now make rose plots of the confusion matrix, stratified by the top categorical features. Try to relate the numeric stats to the patterns in the plots.
# stratify the conf mat by chk
confmat_c <- with(german_test
, table(german_preds
, rating
, chk))
# view the stats then plot
my_confmat_stats(confmat_c)
## acc fnr fomr
## A11 0.6777778 0.3333333 0.2708333
## A12 0.6212121 0.6363636 0.2978723
## A13 0.8421053 1.0000000 0.1111111
## A14 0.8859649 1.0000000 0.1140351
my_rose(confmat_c)
Here we can see a pattern. When an instance has chk \(\in \{ \mathrm{A13}, \mathrm{A14} \}\), the model almost always classifies it as good. One instance in each case was incorrectly classified as bad. Instances having chk \(\in \{ \mathrm{A11}, \mathrm{A12} \}\) have a rather high FNR (0.33 and 0.64 respectively).
We now proceed with the next two most important categorical features, but leave any further diagnostics as talking points.
confmat_p <- with(german_test
, table(german_preds
, rating
, pps))
my_confmat_stats(confmat_p)
## acc fnr fomr
## A40 0.7702703 0.4166667 0.1886792
## A41 0.7619048 0.7500000 0.1666667
## A410 0.5000000 NaN 0.0000000
## A42 0.6530612 0.8461538 0.2682927
## A43 0.8409091 0.4705882 0.1095890
## A44 0.5000000 1.0000000 0.5000000
## A45 0.6000000 1.0000000 0.4000000
## A46 0.7058824 0.3750000 0.3000000
## A48 1.0000000 NaN 0.0000000
## A49 0.7200000 0.5000000 0.1666667
my_rose(confmat_p)
confmat_h <- with(german_test
, table(german_preds
, rating
, crhis))
my_confmat_stats(confmat_h)
## acc fnr fomr
## A30 0.8000000 0.0000000 0.0000000
## A31 0.7142857 0.4000000 0.2222222
## A32 0.7152318 0.5909091 0.2241379
## A33 0.7916667 0.2000000 0.0625000
## A34 0.8222222 0.8125000 0.1547619
my_rose(confmat_h)
confmat_e <- with(german_test
, table(german_preds
, rating
, emp))
my_confmat_stats(confmat_e)
## acc fnr fomr
## A71 0.5714286 0.8000000 0.36363636
## A72 0.7674419 0.5000000 0.24242424
## A73 0.7019231 0.7037037 0.22619048
## A74 0.7777778 0.2222222 0.05405405
## A75 0.8513514 0.4736842 0.14516129
my_rose(confmat_e)
The dur feature is integer valued, with 22 unique levels. This technique scales usefully even up to this number of strata! While the exact details would take some time to analyse fully, there are obvious visible patterns.
confmat_d <- with(german_test
, table(german_preds
, rating
, dur))
my_rose(confmat_d)
This technique can be scaled to more dimensions, with some sacrifices, using log odds ratios. It is a little less intuitive at first, but can be very useful once you have a clear understanding of what the log odds ratio is telling you.
Recall from the introduction how this statistic was defined. The formula can be rearranged such that
\[\phi = \frac{\mathrm{TP} \times \mathrm{TN}}{\mathrm{FP} \times \mathrm{FN}}\]
The log odds ratio, therefore, grows larger as the model becomes more accurate. A model that does no better than a random guess has a log odds ratio that is not significantly different from zero. Note, this approach is not robust to very unbalanced classes.
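To connect the formula to some real numbers, here’s a quick check against the \(2 \times 2\) confusion matrix from earlier; the hand calculation and vcd’s loddsratio() should agree:
# odds ratio and log odds ratio computed by hand from the 2x2 confusion matrix
or <- (confmat[1, 1] * confmat[2, 2]) / (confmat[1, 2] * confmat[2, 1])
c(odds_ratio = or, log_odds_ratio = log(or))
loddsratio(confmat)  # the same quantity, computed by vcd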
The log odds ratio of a stratified confusion matrix can easily be plotted to see which categories of the stratifying factor give better accuracy. This will scale up to two additional dimensions, using position, colour and plot character. Such plots can be challenging to interpret and only work well if the features have two or three levels each at the most. However, under those circumstances, interesting interactions may be revealed.
Note, all the following plots are suppressing the confidence interval. The reason for this is that the data set is rather small and slicing along multiple dimensions results in low counts. Large standard errors can dominate the plot.
plot(loddsratio(confmat_c), confidence = FALSE)
plot(loddsratio(confmat_e), confidence = FALSE)
# higher dimensional
confmat_ce <- with(german_test
, table(german_preds
, rating
, chk
, emp))
plot(loddsratio(confmat_ce), confidence = FALSE)
# back to one stratum, but many levels
plot(loddsratio(confmat_d), confidence = FALSE)
In Machine Learning, the standard way to assess the quality of a supervised classification model is with a confusion matrix. This is a two way table comparing True and Predicted class labels. If you are trying to classify ordered labels, the relationship between the labels matters. Standard accuracy measures penalise near disagreements as if they were full errors. This leads to accuracy estimates that are far too conservative; you might throw out a fairly good model when there is a good chance you can improve it.
Let’s take a step back and recap confusion matrices: when classes are well balanced, it’s easy to measure accuracy over a confusion matrix as \(\frac{\sum(\textit{diag})}{\sum(\textit{total})}\). When they’re out of balance you can have a situation such as this:
The ratio of class 1 : class 2 is 9:1. Predict every instance as class 1 and you will get 90% accuracy. This is not a useful result!
So we use Cohen’s \(\kappa\) statistic, which corrects for the chance agreement implied by how often each class occurs in the data. \(\kappa = 1\) means perfect agreement, while \(\kappa = 0\) means no better than chance agreement. The above example would score \(\kappa = 0\).
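Here’s that degenerate situation as a small sketch, with made-up counts in a 9:1 imbalance and every instance predicted as class 1:
# a degenerate confusion matrix: all 100 instances predicted as class1
cm <- matrix(c(90, 10, 0, 0), nrow = 2, byrow = TRUE
             , dimnames = list(predicted = c("class1", "class2")
                               , actual = c("class1", "class2")))
sum(diag(cm)) / sum(cm)                            # naive accuracy = 0.9
p_o <- sum(diag(cm)) / sum(cm)                     # observed agreement
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # chance agreement
(p_o - p_e) / (1 - p_e)                            # Cohen's kappa = 0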
That’s all well and good, but when class labels are related to each other by a natural order, such as [least, middle, most], you have an additional complication. You want your model to predict as well as possible, but an off-by-one error isn’t as bad as a full opposite error. Also, the predicted class distribution might be biased, appearing flattened or polarised (different skew and kurtosis) compared to the true distribution.
If you can detect this kind of bias, you can correct it! This is where agreement plots can help. They are based on statistical techniques for comparing two raters, such as two doctors, two film critics, two wine critics etc, to validate ratings and analyse these kinds of bias.
After training an ordinal classification model, when you want to evaluate its performance.
In the following example we compare two sets of diagnoses for the same patients suspected of suffering from Multiple Sclerosis. The diagnoses were made by two neurologists, one in Winnipeg and the other in New Orleans:
# combine results from two cities (same raters are involved in all cases)
MSP <- margin.table(MSPatients, 1:2)
On this dataset, we can’t say which is the true and which is the predicted label because they’re both independent raters, but for now let’s treat the Winnipeg Neurologist as the true class and the New Orleans Neurologist as the predicted class. The highest accuracy we could expect to get with a constant model that always makes the same rating is the proportion of the majority true class. This would give a score of \(\kappa = 0\):
margin.table(MSP, 2)
## Winnipeg Neurologist
## Certain Probable Possible Doubtful
## 95 66 22 35
props <- prop.table(margin.table(MSP, 2))
props
## Winnipeg Neurologist
## Certain Probable Possible Doubtful
## 0.4357798 0.3027523 0.1009174 0.1605505
# The baseline accuracy is
props[which(props == max(props))]
## Certain
## 0.4357798
Without looking at the confusion matrix, let’s take a naive look at the accuracy:
# basic accuracy
sum(diag(MSP))/sum(MSP)
## [1] 0.4449541
On the face of it, this looks like a terrible result! Is New Orleans Neurologist just predicting the most frequent Winnipeg class?
MSP
## Winnipeg Neurologist
## New Orleans Neurologist Certain Probable Possible Doubtful
## Certain 43 8 0 1
## Probable 36 22 7 0
## Possible 12 27 8 10
## Doubtful 4 9 7 24
There are definitely some problems with the inter-rater agreement but it doesn’t look like random guessing. Why is accuracy so low? Does class imbalance play a part? What is the \(\kappa\) score?
# cohen's kappa - the function in the vcd package provides weights and confidence intervals for this statistic
confint(Kappa(MSP, weights = "Fleiss-Cohen"))
##
## Kappa lwr upr
## Unweighted 0.1728083 0.3411072
## Weighted 0.4987456 0.6785713
Unweighted \(\kappa\), the standard measure, also shows that this is not a very good result.
However, Fleiss-Cohen weighted \(\kappa\) gives partial credit to near disagreements, and taking this into account gives a much better result. Off-by-n errors appear to be common, so these raters are closer than it first appears.
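To see why, it helps to look at the Fleiss-Cohen (quadratic) weights themselves. Here’s a sketch of the standard weight matrix for a four-level scale, which is my understanding of what the vcd function applies:
# Fleiss-Cohen weights: w_ij = 1 - (i - j)^2 / (k - 1)^2
# full agreement scores 1, off-by-one still scores 0.89, opposite ends score 0
k <- 4
w <- 1 - (abs(outer(1:k, 1:k, "-"))^2) / (k - 1)^2
round(w, 2)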
An agreement plot is the best way to visualise a situation like this because it shows how off-by-n errors and other bias could be affecting the outcome.
op <- par(mar = c(4, 3, 4, 1) + .1)
B <- agreementplot(MSP
, main = "MS Patient Ratings"
, xlab_rot = -20
)
This plot needs some explaining. The rectangle outlines show the maximum possible agreement, given the marginal totals. The black rectangles indicate where there is full agreement between raters. The shaded areas indicate the off-by-\(n\) disagreements. Note, if there were more categories, there would be more grades of shading.
Ideally, we want the black area to fill the outline and it’s not very close in this chart. However, the shaded area is coming much closer to filling the available space.
unlist(B)[1 : 2]
## Bangdiwala Bangdiwala_Weighted
## 0.2472555 0.7503234
The Bangdiwala statistic is calculated from the ratio of these shaded rectangles to the maximal outline. The large gap between the weighted and unweighted statistics shows that near disagreements are very common; when they are taken into account, the raters are much better aligned than either the naive accuracy measure or even \(\kappa\) indicated.
There is another feature of this plot, which is the relationship of the rectangles to the diagonal line. This indicates how well aligned the raters are in using each category. In this chart, we can see how the Winnipeg Neurologist rates cases as “Certain” or “Probable” much more frequently than the New Orleans Neurologist. Neither the \(\kappa\) nor the Bangdiwala statistic can detect this. There are statistical tests available, but this chart shows the problem directly.
With a little intervention, guided by this information, these two raters could become much better aligned.
Most Machine Learning practitioners are familiar with dimension reduction. It can radically simplify and compress a dataset, making it easier to handle and train models on. Many dimension reduction techniques also double up as unsupervised learning methods i.e. clustering. This is useful as an end in itself, as well as being an additional tool for exploratory data analysis: multidimensional data is projected into 2-D which makes it intuitive to visualise.
If you already know something about Principal Components Analysis, this is similar, but not quite the same. Remember, in DDA, the “dimensions” we want to reduce or cluster are not only the features, but also the categories within each feature. PCA identifies new dimensions that capture the most variance. The equivalent concept in Correspondence Analysis is the “inertia” captured by the cells of the table with the largest and smallest counts. The only way to really describe what’s happening is to see some examples.
Correspondence Analysis is an advanced topic. In particular, the way you pivot the multidimensional array controls the analysis. This requires some understanding of the data manipulation and modeling techniques. For today, we will just focus on interpreting the plots.
You can use Correspondence Analysis during the exploratory data analysis phase of any project. Do this early on in the project once you’ve gathered and cleaned up your data enough to start sense-making.
This is a good technique for feature engineering. If several features are well clustered, you can replace them with the real valued dimensions that this method returns as output. This may convert several categorical dimensions into just a couple of numeric ones that might better suit the next stage of your project. The following two situations are immediately applicable:
You could use it as a stand alone model. This is an unsupervised clustering method for categorical features. Use it where you don’t have a specific class label, but just want to find out the associations between all the features.
The audience viewing data comes from Nielsen Media Research, for the week starting November 6, 1995.
It is a 3-D array cross-tabulating the viewing figures for three networks, between 8-11pm, Monday to Friday. The features and their levels are as follows:
No | Name | Levels |
---|---|---|
1 | Day | Monday, Tuesday, Wednesday, Thursday, Friday |
2 | Time | 8, 9, 10 |
3 | Network | ABC, CBS, NBC |
data("TV", package = "vcdExtra")
# The original data is collected in 15 minutes slices.
# Convert 3-D array to a frequency data frame
# This has a row for each cell of the array
# and a new column for the cell value
TV.df <- as.data.frame.table(TV)
# Convert it into hourly slices
levels(TV.df$Time) <- rep(c("8", "9", "10"), c(4,4,3))
# Convert frequency data back to 3-D array, now with just 3 time levels
TV3 <- xtabs(Freq~Day+Time+Network, TV.df)
# create a multiple correspondence analysis
TV3.mca <- mjca(TV3)
# the plot function uses all base R plot stuff
# but needs a bit of manipulation
cols <- c("blue", "black", "red")
# "blank plot"
res <- plot(TV3.mca, labels=0, pch='.', cex.lab=1.2)
# combine Dims, factor names and levels
coords <- data.frame(res$cols, TV3.mca$factors)
# hard-coded from known number of levels
# day, time, network
nlev <- c(5,3,3)
# everything needs to be in semantic order
coords <- coords[ order(coords[,"factor"], coords[,"level"]), ]
# quick fix for ordering e.g. day of week, not alphabetical
coords$order <- c(5, 1, 4, 2, 3, 6, 7, 8, 11, 9, 10)
coords <- coords[order(coords[, "order"]), ]
# place the points with separate plot chars and colours
points(coords[,1:2], pch=rep(16:18, nlev), col=rep(cols, nlev), cex=1.2)
# place the text
pos <- c(1,4,3)
text(coords[,1:2], labels=coords$level, col=rep(cols, nlev), pos=rep(pos,nlev), cex=1.1, xpd=TRUE)
# join things in sequence
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Day", lty=1, lwd=1, col="blue")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Time", lty=1, lwd=1, col="red")
# add segments from the origin to the channels
nw <- subset(coords, factor=="Network")
segments(0, 0, nw[,"Dim1"], nw[, "Dim2"], col = "black", lwd = 0.5, lty = 3)
# add a legend
legend("topright", legend=c("Day", "Network", "Time"),
title="Factor", title.col="black",
col=cols, text.col=cols, pch=16:18,
bg="gray95")
This plot reveals an incredibly strong effect relating NBC and Thursday nights. This effect apparently accounts for most of the inertia captured by Dimension 1. What’s going on there?
margin.table(TV3, 1)
## Day
## Monday Tuesday Wednesday Thursday Friday
## 8399 8081 5929 8987 6214
margin.table(TV3, c(1, 3))
## Network
## Day ABC CBS NBC
## Monday 2847 2923 2629
## Tuesday 3110 2403 2568
## Wednesday 2434 1283 2212
## Thursday 1766 1335 5886
## Friday 2737 1479 1998
Viewing figures are at their highest on Thursdays, and on that night NBC dramatically outnumbers the other two channels; on most other nights its figures are comparable to theirs. On Thursdays, everyone appears to be glued to NBC all evening. A little digging and some domain knowledge (showing my age, here) lets us give this effect a name: this is the “Friends, Seinfeld and ER” effect! These were three of the most popular shows of that era.
The 9pm slot clusters closer to the other four days than 8pm and 10pm. What’s happening there?
margin.table(TV3, 2)
## Time
## 8 9 10
## 12757 13596 11257
t(TV3[4,,]) # Thursday
## Time
## Network 8 9 10
## ABC 735 682 349
## CBS 680 385 270
## NBC 1927 1858 2101
It turns out that 9pm is the most viewed time slot throughout the week, independent of day or network. Yet notice how the NBC audience dips just a little in that time slot on Thursday, even though it still far outnumbers either of the other two channels. This dip runs counter to the “Friends, Seinfeld and ER” effect, which has completely dominated the analysis, and so it too has been captured on Dimension 1.
Dimension 2 is left to capture, as well as it can, the differences between the ABC and CBS schedules on the other nights, but there is undoubtedly more detail to untangle.
Now that we have identified this difficulty, we can tackle it. We’re going to use simple (2-way) Correspondence Analysis to do so. Using the pivoting function structable(), we effectively create a new feature that combines Time and Day. Counter-intuitively, this gives the Correspondence Analysis more flexibility along the combined dimension. The resulting plot has a point for each time slot over the week, and it teases apart the effect of people channel hopping from the “Friends, Seinfeld and ER” effect. It turns out to be very easy to do (see the code for yourself).
# Flatten to 2-D by stacking Time onto Day
# Note: The data shaping choice here controls the specifics of the analysis.
TV3s <- as.matrix(structable(Network~Time+Day, TV3))
# Create the Correspondence Analysis objects
TV3s.ca <- ca(TV3s)
# Generate the plot
res <- plot(TV3s.ca)
# add some segments from the origin to make things clearer
segments(0, 0, res$cols[,1], res$cols[,2], col = "red", lwd = 1)
segments(0, 0, res$rows[,1], res$rows[,2], col = "blue", lwd = 0.5, lty = 3)
As you can see, the small channel hop away from NBC on Thursday has moved onto Dimension 2, along with the other variations in within-evening viewing patterns. On the previous plot there was a very slight affinity between 8pm and CBS, and between 9pm and ABC. Did you notice? Look again. On this new plot we can clearly see something of great interest at 8pm on Monday, but not later on the same evening and not at the same time later in the week. We can also see several 9pm slots clustered around ABC.
An interesting feature of this new analysis is that distance from the origin reflects the size of the residual: slots whose audience deviates most strongly from the overall pattern sit furthest out. What does this mean? Well, for example, from 10pm the viewing figures dwindle, independent of day and channel, as people go to bed or go out for a late drink. Yet on Thursday at 10pm NBC has a massive audience relative to what is normally expected for that time slot during the rest of the week. This inertia pushes the point furthest from the origin of all the nightly time slots. Monday at 8pm on CBS is notable for the same reason. Why not Monday at 9pm and ABC? There are other nights when ABC is popular at 9pm, making this combination less unusual.
t(TV3[1,,]) # Monday
## Time
## Network 8 9 10
## ABC 536 1401 910
## CBS 1167 967 789
## NBC 858 946 825
TV3[,2,1] # Every day, 9pm, ABC
## Monday Tuesday Wednesday Thursday Friday
## 1401 1205 1022 682 907
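To connect the plot back to the numbers, we can look at the Pearson residuals of the flattened table directly. This is a quick sanity check of my own rather than part of the original analysis: the slots the plot pushes furthest from the origin (Thursday 10pm for NBC, Monday 8pm for CBS) should also carry the largest positive residuals.
# Pearson residuals of the flattened Time_Day x Network table (my own check)
resids <- chisq.test(TV3s)$residuals
round(resids["10_Thursday", ], 1)
round(resids["8_Monday", ], 1)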
If the next stage of your project requires real-valued features for modeling (such as a distance- or similarity-based model), you can access the coordinates of each point via the plot object and use them to replace the original features in your data set (there is a small sketch of this after the coordinate listings below). This is more powerful than converting to an integer series or dummy variables, because so much more information is retained. For example, it’s very tempting to convert [Monday, Tuesday, Wednesday, Thursday, Friday] to [1,2,3,4,5]. However, we have seen in this case that this retains none of the information about the feature interactions. Likewise, [8,9,10] seems obvious to leave as numbers, but you lose the information about the audience figures being higher at 9pm than at any other time. Also, what’s the natural order for [ABC,CBS,NBC]? Alphabetical? Think again!
res$rows
## Dim1 Dim2
## 8_Monday 0.15812304 -0.455606847
## 8_Tuesday -0.02638318 -0.010526784
## 8_Wednesday 0.25070348 0.001493149
## 8_Thursday -0.34416868 -0.074239579
## 8_Friday 0.13899420 0.195131693
## 9_Monday 0.24587580 0.032709280
## 9_Tuesday 0.23549271 -0.032622366
## 9_Wednesday 0.07374072 0.228909131
## 9_Thursday -0.46679506 0.046282352
## 9_Friday 0.23682706 0.094854237
## 10_Monday 0.16432975 -0.070170394
## 10_Tuesday 0.33565719 -0.012832280
## 10_Wednesday -0.12640134 0.130600285
## 10_Thursday -0.74421704 -0.028465081
## 10_Friday 0.13391483 0.086194069
res$cols
## Dim1 Dim2
## ABC 0.2489731 0.170454834
## CBS 0.2711437 -0.225264013
## NBC -0.3769860 -0.004916094
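Here is the sketch promised above. It is my own illustration, not from the book: join the CA row coordinates back onto a frequency data frame keyed on the same Time_Day label, so that two numeric features stand in for the categorical slot.
# build a lookup table of CA coordinates, keyed by the Time_Day label
ca_coords <- data.frame(slot = rownames(res$rows), res$rows)
# recreate the same key on a frequency data frame of the hourly table
TV3.df <- as.data.frame.table(TV3)
TV3.df$slot <- paste(TV3.df$Time, TV3.df$Day, sep = "_")
# each row now carries Dim1/Dim2 instead of needing dummy variables for Time and Day
TV3.num <- merge(TV3.df, ca_coords, by = "slot")
head(TV3.num)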
We’ll take a look at one more of these, from the 4-D Titanic data set again.
# one more mca from a 4-D dataset
# this is simpler than it looks
# all the magic happens here
titanic.mca <- mjca(Titanic)
# saving the plot object supplies the coordinate positions
res <- plot(titanic.mca, labels=0, pch='.', cex.lab=1.2)
# extract factor names and levels
coords <- data.frame(res$cols, titanic.mca$factors)
# everything else is handling base R plotting stuff
cols <- c("blue", "red", "brown", "black")
nlev <- c(4,2,2,2)
points(coords[,1:2], pch=rep(16:19, nlev), col=rep(cols, nlev), cex=1.2)
pos <- c(3,1,1,3)
text(coords[,1:2], labels=coords$level, col=rep(cols, nlev), pos=rep(pos,nlev), cex=1.1, xpd=TRUE)
coords <- coords[ order(coords[,"factor"], coords[,"Dim1"]), ]
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Class", lty=1, lwd=2, col="blue")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Sex", lty=1, lwd=2, col="red")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Age", lty=1, lwd=2, col="brown")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Survived", lty=1, lwd=2, col="black")
legend("topleft", legend=c("Class", "Sex", "Age", "Survived"),
title="Factor", title.col="black",
col=cols, text.col=cols, pch=16:19,
bg="gray95", cex=1.2)
Note that the lines joining the two categories of each binary feature all pass through the origin.
# Simple CA needs a 2-way table
# pivot table of Titanic
Titanic_piv <- structable(~Class+Sex+Survived+Age, Titanic)
Titanic_piv
## Sex Male Female
## Age Child Adult Child Adult
## Class Survived
## 1st No 0 118 0 4
## Yes 5 57 1 140
## 2nd No 0 154 0 13
## Yes 11 14 13 80
## 3rd No 35 387 17 89
## Yes 13 75 14 76
## Crew No 0 670 0 3
## Yes 0 192 0 20
# all the work happens here
Titanic.ca <- ca(as.matrix(Titanic_piv))
# plot and save the object to get the coords for line segments
res <- plot(Titanic.ca)
segments(0, 0, res$cols[,1], res$cols[,2], col = "red", lwd = 1)
segments(0, 0, res$rows[,1], res$rows[,2], col = "blue", lwd = 0.5, lty = 3)
As you can see, Correspondence Analysis is an extremely powerful dimension reduction and clustering tool for categorical data.
Count, or frequency, data has many and varied applications, from event monitoring to text analysis. One-way analysis uses distributions to compare different samples and to monitor for changes in event frequency.
For the first task we will look at two common distributions for count data and compare two samples.
The Poisson distribution has just one parameter \(\lambda\) for the rate, and both the mean and variance are equal to this parameter.
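As a quick sanity check (my own addition, not from the book), you can simulate a large Poisson sample and confirm that the mean and the variance both sit close to \(\lambda\):
# simulate a Poisson sample; the mean and variance should both be near lambda
set.seed(54321)
x <- rpois(100000, lambda = 5)
c(mean = mean(x), variance = var(x))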
set.seed(12345)
KL <- expand.grid(k = 0 : 20, lambda = c(1, 5, 10, 20))
pois_df <- data.frame(KL, prob = dpois(KL$k, KL$lambda))
pois_df$lambda = factor(pois_df$lambda)
xyplot(prob ~ k | lambda, data = pois_df,
type = c("h", "p"), pch = 16, lwd = 2
, cex = 1.25, layout = c(4, 1)
, main = expression(paste("Poisson Distribution by ", lambda))
, xlab = list("Number of events (k)", cex = 1.25)
, ylab = list("Probability", cex = 1.25))
The negative binomial distribution has two parameters, which R calls size and prob. You can also think of it as having a mean (rate) plus a dispersion: the mean is \(\frac{\mathrm{size}\,(1-\mathrm{prob})}{\mathrm{prob}}\) and the variance is always greater than the mean. It can look like a right-skewed Poisson distribution and behaves as if a Poisson process is being sampled over a period of time while the rate is non-constant.
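Again as a quick check of my own (not from the book), here is the mean/variance relationship for the size/prob parameterisation used by dnbinom() and rnbinom():
# theoretical mean and variance for a negative binomial with these parameters
size <- 5; prob <- 0.4
c(mean = size * (1 - prob) / prob,
  variance = size * (1 - prob) / prob^2) # the variance is always larger than the mean
# a large simulated sample agrees closely
set.seed(54321)
x <- rnbinom(100000, size = size, prob = prob)
c(mean = mean(x), variance = var(x))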
XN <- expand.grid(k = 0 : 20, n = c(1, 5, 10, 20), p = c(0.4, 0.2))
nbin_df <- data.frame(XN, prob = dnbinom(XN$k, XN$n, XN$p))
nbin_df$n <- factor(nbin_df$n)
nbin_df$p <- factor(nbin_df$p)
xyplot(prob ~ k | n + p, data = subset(nbin_df, p == 0.4)
, main = "Neg. binom by size and prob"
, xlab = list("Number of failures (k)", cex = 1.25)
, ylab = list("Probability", cex = 1.25)
, type = c("h", "p"), pch = 16, lwd = 2
, strip = strip.custom(strip.names = TRUE)
)
xyplot(prob ~ k | n + p, data = subset(nbin_df, p == 0.2)
, main = "Neg. binom by size and prob"
, xlab = list("Number of failures (k)", cex = 1.25)
, ylab = list("Probability", cex = 1.25)
, type = c("h", "p"), pch = 16, lwd = 2
, strip = strip.custom(strip.names = TRUE)
)
However, Poisson and negative binomial data can look very similar. Identifying the best match and estimating the parameters is essential for accurate event monitoring and reporting results.
The following technique has been used to determine whether documents were written by the same or different authors. Distributions of marker words are fitted to text that is known to come from one author.
Any previously unseen text can then be tested against these known distributions to determine whether it was written by the same author.
In this simple case study, we will use the Federalist dataset to analyse how many times the word ‘may’ appears per 200-word block. This is a marker word that could help identify an author by how often they use it. Most studies of this kind would compare results across many such tests with different words.
Notes:
data("Federalist", package = "vcd")
Federalist
## nMay
## 0 1 2 3 4 5 6
## 156 63 29 8 4 1 1
sum(Federalist) # number of blocks
## [1] 262
sum(expand.dft(as.data.frame(Federalist))$nMay) # number of occurrences
## [1] 172
mean(expand.dft(as.data.frame(Federalist))$nMay)
## [1] 0.6564885
var(expand.dft(as.data.frame(Federalist))$nMay)
## [1] 1.007985
The descriptive stats show a variance slightly higher than the mean, but we don’t know whether this difference is significant. What’s the right test to use here?
A very naive approach is to take the mean and visually compare a Poisson distribution with this value for \(\lambda\) against the actual data.
# expected number of blocks = total blocks x Poisson probability for each count
barplot(sqrt(sum(Federalist) * dpois(0:6, mean(expand.dft(as.data.frame(Federalist))$nMay)
))
, names.arg = 0:6
, main = "Number of text blocks by Number of Occurrences\nnaive expected values"
, ylab = "sqrt(Naive Expected)"
, xlab = "Occurrences of 'may'"
, col = "lightblue")
The actual data looks a bit like this distribution. Could this be a good fit? How can we easily compare them?
barplot(sqrt(Federalist)
, main = "Number of text blocks by Number of Occurrences\nactual values"
, xlab = "Occurences of 'may'"
, ylab = "Sqrt(Actual)"
, col = "lightgreen"
)
Use the goodfit() function in the vcd package to estimate the parameters and compare the fit of different distributions. It provides hanging rootograms, which are intuitive views of discrete goodness-of-fit testing.
First, a Poisson fit using maximum likelihood - exactly what we’ve just done, but with bells on it:
Fed_fit0 <- goodfit(Federalist, type = "poisson")
# This will show the rate parameter, estimated by Maximum Likelihood
# This estimate is the same as the naive mean. This is to be expected when fitting a Poisson.
unlist(Fed_fit0$par)
## lambda
## 0.6564885
# This will show a Chi-Square Goodness of fit test against the expected values of a poisson distribution with this parameter
summary(Fed_fit0)
##
## Goodness-of-fit test for poisson distribution
##
## X^2 df P(> X^2)
## Likelihood Ratio 25.24312 5 0.0001250511
The \(\chi^2\) test is very strong evidence to reject the null hypothesis: this sample is not from a Poisson distribution with rate equal to the sample mean. On its own, though, the test doesn’t tell us much about what caused this result. The rootogram plot makes it very clear what the problem is by showing how the data diverge from the expected counts.
plot(Fed_fit0, type = "hanging", shade = TRUE)
By hanging each bar top down from the expected distribution, we just need to look at how the bottoms of the bars miss the \(y = 0\) reference line. This is visually much more intuitive than comparing the tops of the bars to a curve. The contribution of each bar to the \(\chi^2\) statistic is also made clear by the shading.
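To see why the hanging version is easier to read, you can draw the same fit in the other classic rootogram styles. This is an addition of mine; I’m assuming the “standing” and “deviation” types offered by vcd’s rootogram plotting for goodfit objects.
# the same Poisson fit as a standing rootogram (bars sit on the x-axis, compared against the expected values)
plot(Fed_fit0, type = "standing", shade = TRUE)
# and as a deviation rootogram (only the gaps between observed and expected are drawn)
plot(Fed_fit0, type = "deviation", shade = TRUE)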
We can try fitting a Poisson using a different estimation method, minimising the \(\chi^2\) errors instead:
Fed_fit0 <- goodfit(Federalist, type = "poisson", method = "MinChisq")
# This has found a different value for the rate parameter.
unlist(Fed_fit0$par)
## lambda
## 0.8236976
# This will show a Chi-Square Goodness of fit test against the expected values of a poisson distribution with this parameter
summary(Fed_fit0)
##
## Goodness-of-fit test for poisson distribution
##
## X^2 df P(> X^2)
## Pearson 46.86767 5 6.045556e-09
The \(\chi^2\) test is also very strong evidence against the null.
plot(Fed_fit0, type = "hanging", shade = TRUE)
The rootogram shows that the residuals have just been shuffled around so they balance better by sign. Clearly Poisson is not the right choice.
Next, fit a negative binomial distribution:
Fed_fit1 <- goodfit(Federalist, type = "nbinomial")
# This will show the size and prob parameters, estimated by Maximum Likelihood
# size is the dispersion parameter
# the implied mean, size * (1 - prob) / prob, is about 0.657 - very close to the Poisson lambda (the sample mean)
unlist(Fed_fit1$par)
## size prob
## 1.1863297 0.6437582
summary(Fed_fit1)
##
## Goodness-of-fit test for nbinomial distribution
##
## X^2 df P(> X^2)
## Likelihood Ratio 1.964028 4 0.7423751
plot(Fed_fit1, type = "hanging", shade = TRUE)
We have estimated the parameters, and the \(\chi^2\) goodness-of-fit test result is not significant. We have no evidence against the null hypothesis that this sample comes from a negative binomial distribution with the estimated parameters. The plot speaks for itself now that the bars drop close to the \(y = 0\) reference line, leaving only small residuals.
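If you want the numbers behind that plot, the expected counts are stored on the goodfit object. This little table is my own addition, using the count, observed and fitted components of the fit:
# observed vs expected number of blocks under the fitted negative binomial
data.frame(nMay = Fed_fit1$count,
observed = Fed_fit1$observed,
expected = round(Fed_fit1$fitted, 1))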
We will now assess a different body of text.
# Some dummy data
nonFederalist <- structure(c(135L, 71L, 36L, 17L, 8L, 4L, 1L), .Dim = 7L, .Dimnames = structure(list(
c("0", "1", "2", "3", "4", "5", "6")), .Names = "nMay"), class = "table")
nonFederalist
## nMay
## 0 1 2 3 4 5 6
## 135 71 36 17 8 4 1
sum(nonFederalist) # number of blocks
## [1] 272
sum(expand.dft(as.data.frame(nonFederalist))$nMay) # number of occurrences
## [1] 252
mean(expand.dft(as.data.frame(nonFederalist))$nMay)
## [1] 0.9264706
var(expand.dft(as.data.frame(nonFederalist))$nMay)
## [1] 1.470588
barplot(sqrt(nonFederalist)
, main = "Number of text blocks by Number of Occurrences\nunseen text"
, xlab = "Occurences of 'may'"
, ylab = "Sqrt(Actual)"
, col = "lightgreen"
)
The descriptive statistics are a bit larger than for the Federalist sample, but is this difference significant? We can test it properly using the parameters from the goodfit object we just created:
# we pass in our fitted parameter list this time
Fed_fit3 <- goodfit(nonFederalist, type = "nbinomial", par = Fed_fit1$par)
summary(Fed_fit3)
##
## Goodness-of-fit test for nbinomial distribution
##
## X^2 df P(> X^2)
## Pearson 22.41764 6 0.001016934
## Likelihood Ratio 19.51130 6 0.003381872
The \(\chi^2\) result is significant: strong evidence that this new text does not come from the same distribution. However, on its own, the test can’t provide much more detail.
plot(Fed_fit3, type = "hanging", shade = TRUE)
The rootogram hangs the frequency bars off the expected distribution, which this time is fixed at the parameters we estimated on the original text. We can now see from the pattern of residuals that the word ‘may’ seems to appear more frequently in this text. Specifically, the tail with 3-5 occurrences per 200 words is fatter, while there are fewer blocks where the word is absent altogether.
In our rush to use the latest developments and technologies in Machine Learning and Deep Learning, it’s easy to forget that none of it would be possible without the pioneering statistical research of previous decades. Most of this work is still relevant to everyday problems in Machine Learning practice, and it’s always worth reviewing the foundations before reaching for the newest tools.
Above all, never under-estimate the power of visual analytics. A well thought out plot can save you paragraphs of explanation.