Introduction

In this blog post, we illustrate the role of jittering in the graphical analysis of datasets through an example from financial markets. We argue that jittering can reveal useful information about the distribution of the data and the relationship between variables.

The goal is not to focus on the command syntax of the ggplot2 library, in which this example is implemented, but rather on the interpretation of the graphical output.

Example

The example uses the liquidity scores for a fixed income portfolio as of Jan 31, 2020, from two vendors, B and T. We have anonymized the vendors and the specific securities prior to loading the data. However, we have not changed the liquidity score values in any other way.

First, we load the data file and describe the summary statistics. Each observation in the data file is a position in the portfolio. For each position, we provide four types of information:

  1. An anonymized identifier of the position
  2. The market value of the position
  3. The bond sector classification, in three fields of a classification hierarchy
  4. The liquidity score from each of the vendors B and T
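
library(tidyverse)  # read_csv, dplyr verbs, ggplot2, forcats
library(knitr)      # kable
library(skimr)      # skim, yank

# Read the position-level data, forcing the market value column to be numeric
draw <- read_csv("Blog2_Jitter_Example_Data.csv",
                 col_types = cols(MarketValueDirtyLocal = col_double()))

# Show the first four positions with their classification and vendor scores
slice(draw, 1:4) %>%
  dplyr::select(ID, Sector, Class, SubClass, BSector, BScore, TScore) %>% kable()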
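
| ID     | Sector            | Class            | SubClass       | BSector          | BScore | TScore |
|:-------|:------------------|:-----------------|:---------------|:-----------------|-------:|-------:|
| Bond_1 | Corporate         | Investment Grade | Unsecured      | Corporate Debt   |     66 |      1 |
| Bond_2 | Corporate         | Investment Grade | Unsecured      | Corporate Debt   |     76 |      4 |
| Bond_3 | Comm. Real Estate | Large Loan       | Multi-Borrower | Securitized Debt |     NA |     NA |
| Bond_4 | Corporate         | Investment Grade | Unsecured      | Corporate Debt   |     42 |      1 |
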
# Summarize the numeric columns, keeping missing counts, coverage, minimum, median and maximum
skimr::skim(draw) %>% yank("numeric") %>%
  dplyr::select(-mean, -p25, -p75, -sd, -hist) %>% kable()
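
| skim_variable         | n_missing | complete_rate |      p0 |     p50 |     p100 |
|:----------------------|----------:|--------------:|--------:|--------:|---------:|
| MarketValueDirtyLocal |         1 |     0.9974093 | -320735 | 1434048 | 37306509 |
| BScore                |        30 |     0.9222798 |       6 |      86 |      100 |
| In_B                  |         0 |     1.0000000 |       0 |       1 |        1 |
| In_T                  |         3 |     0.9922280 |       1 |     192 |      383 |
| TScore                |       154 |     0.6010363 |       1 |       4 |       10 |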

The portfolio has 386 total positions.
Vendor B provides liquidity scores for 92.2% of them, while Vendor T covers 60.1%.
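
The coverage figures follow directly from the missing-value counts in the summary table. As a minimal sketch, using only the counts reported there (the variable names are illustrative):

```r
# Counts taken from the summary table above
n_positions <- 386   # total positions in the portfolio
missing_b   <- 30    # positions without a Vendor B score
missing_t   <- 154   # positions without a Vendor T score

# Coverage is the share of positions that do have a score
coverage_b <- 1 - missing_b / n_positions
coverage_t <- 1 - missing_t / n_positions

# Express as percentages rounded to one decimal place
print(round(100 * c(B = coverage_b, T = coverage_t), 1))
```

The result matches the complete_rate column of the summary table, which reports the same shares as fractions.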

Using Box Plots with and without Jitter

Consider the question of how vendor B and vendor T calculate liquidity.
Do the outputs of the two models agree to a sufficient degree that we can deem them to be equivalent? Do the scores of the two vendor models have the same meaning?

To answer these questions, we could build a formal regression model, but data visualization will turn out to be sufficient for an initial analysis.

First, the scores from each vendor need to be understood clearly:

Vendor B defines its scores on an ordinal scale from 1 to 100, where 1 means the least liquid and 100 the most liquid. Vendor T uses a different ordinal scale, from 1 to 10, where again 1 means the least liquid and 10 the most liquid.

We conclude that linear regression might not be the most appropriate approach, since neither vendor score is a continuous variable.
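
For ordinal scores, a rank-based measure is one common alternative to a linear fit. The sketch below is not part of the original analysis: it uses made-up scores and base R's cor() with method = "spearman" to show that a rank correlation depends only on the ordering of the scores, not on the spacing of either scale.

```r
# Hypothetical scores for six bonds (illustrative values only, not from the portfolio)
t_score <- c(1, 2, 4, 7, 10, NA)       # Vendor T style scale, 1 to 10
b_score <- c(10, 35, 60, 90, 100, 50)  # Vendor B style scale, 1 to 100

# use = "complete.obs" drops the pair with a missing T-score
rho_spearman <- cor(t_score, b_score, method = "spearman", use = "complete.obs")
r_pearson    <- cor(t_score, b_score, method = "pearson",  use = "complete.obs")

# The two orderings agree perfectly in this toy data, so the rank correlation is 1,
# while the Pearson correlation is below 1 because the relation is not linear
print(c(spearman = rho_spearman, pearson = r_pearson))
```

A rank correlation summarizes the trend in a single number, but it says nothing about the spread of B-scores within each T-score level, which is why the graphical analysis remains useful.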

We can use a grouped box plot to search for a trend in the distribution of the vendor score data.

The first box plot below displays the Vendor T scores on the horizontal axis from 1 to 10.
Note that the subset of observations with no score from Vendor T is placed in the NA box plot on the rightmost side. On the vertical axis, the distribution of Vendor B scores is shown.
At each value of Vendor T score, we see the box plot of Vendor B scores.

# yy is a numeric sort key for the T-score levels; missing scores get 11 so they sort last
draw %>% mutate(yy = ifelse(is.na(TScore), 11, as.integer(TScore))) %>%
  ggplot(aes(x = fct_reorder(as.factor(TScore), yy), y = BScore)) +
  geom_boxplot() +
  xlab("T-Score") +
  ylab("B-Score") +
  ggtitle("Liquidity Scores Without Jitter: Vendor T vs. B")

A few conclusions are clearly observed:

  1. The median vendor B score appears to increase with the vendor T score.
  2. The variation in the vendor B score, as measured by the interquartile range, appears to be greatest for low T-scores and to decrease at higher T-scores.
  3. The greatest variation in vendor B scores in a single boxplot category is for bonds with no defined T-score.
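
These observations can also be checked numerically by computing the median and interquartile range of the B-scores within each T-score group. The sketch below uses a small made-up data frame in place of draw, so the numbers are illustrative only and were chosen to mimic the pattern described above.

```r
# Toy stand-in for the portfolio data (hypothetical values, not the real scores)
toy <- data.frame(
  TScore = c(1, 1, 1, 1, 5, 5, 5, 5, NA, NA, NA, NA),
  BScore = c(10, 30, 50, 90, 70, 75, 80, 85, 20, 60, 100, 100)
)

# addNA() keeps the missing T-scores as their own group, like the NA box in the plot
grp <- addNA(factor(toy$TScore))

# Median and interquartile range of the B-scores within each T-score group
med <- tapply(toy$BScore, grp, median, na.rm = TRUE)
iqr <- tapply(toy$BScore, grp, IQR, na.rm = TRUE)

print(data.frame(TScore   = names(med),
                 median_B = as.vector(med),
                 iqr_B    = as.vector(iqr)))
```

In this toy data the median rises from the T-score 1 group to the T-score 5 group, the interquartile range shrinks, and the NA group has the widest spread.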

The box plot alone does not answer two subtler questions, which jitter can help with:

  1. Do some T-Scores have a large concentration of observations?
  2. Is the distribution of T-scores lumpy or smooth?

The jittered version of the box plot lets us answer these questions.

draw %>% mutate(yy = ifelse(is.na(TScore), 11, as.integer(TScore))) %>%
  ggplot(aes(x = fct_reorder(as.factor(TScore), yy), y = BScore)) +
  geom_boxplot() +
  # Overlay every observation, nudged by a small random offset so that ties become visible
  geom_jitter(alpha = 0.3, color = "red",
              position = position_jitter(width = 0.2, height = 0.1)) +
  xlab("T-Score") +
  ylab("B-Score") +
  ggtitle("Liquidity Scores With Jitter: Vendor T vs. B")
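
Conceptually, jittering adds a small random offset to each point so that identical score pairs no longer sit on top of each other. The sketch below imitates this in base R, assuming uniform noise; the score values are made up, and the width and height mirror the 0.2 and 0.1 passed to position_jitter() above. It is a simplified illustration rather than ggplot2's internal code.

```r
set.seed(1)  # make the random offsets reproducible

# Hypothetical score pairs; several bonds share exactly the same values
t_score <- c(1, 1, 1, 1, 4, 4)
b_score <- c(66, 66, 66, 42, 76, 76)

# Without jitter, the six bonds collapse onto three distinct plot positions
n_distinct_raw <- nrow(unique(data.frame(t_score, b_score)))

# Uniform noise within +/- width horizontally and +/- height vertically
width  <- 0.2
height <- 0.1
x_jit <- t_score + runif(length(t_score), -width, width)
y_jit <- b_score + runif(length(b_score), -height, height)

# With jitter, each bond gets its own position, so overlapping points become visible
n_distinct_jit <- nrow(unique(data.frame(x_jit, y_jit)))
print(c(raw = n_distinct_raw, jittered = n_distinct_jit))
```

Keeping the offsets well below the gap between adjacent score values means a jittered point cannot be mistaken for a neighboring score.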

We see that the B-Scores for the lowest T-Score (1) span most of the scale, reaching as high as 92. We also see that a large number of points fall in the NA box plot. The distribution in the NA box plot is highly left-skewed, with a concentration at B-Score = 100.

Thus, we see that the count of observations within each T-Score level is lumpy. We also see that the B-Score distribution within each T-Score level is highly skewed, despite the positive trend in the B-Score medians with increasing T-Scores.
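
The lumpiness of the counts can also be confirmed with a simple tabulation. The sketch below runs on a made-up vector of T-scores rather than on the real data frame, so the counts are illustrative only.

```r
# Hypothetical T-scores, with NA for positions that Vendor T does not cover
t_score <- c(1, 1, 1, 1, 2, 4, 4, 10, NA, NA, NA)

# useNA = "ifany" adds a separate count for the missing scores
counts <- table(t_score, useNA = "ifany")
print(counts)

# Share of positions in each level, to make concentrations easy to spot
print(round(prop.table(counts), 2))
```

On the real data, dplyr::count(draw, TScore) would produce the equivalent tabulation, including a row for the missing T-scores.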

Conclusion

Returning to our initial questions, the two vendors' models do not appear equivalent.
Despite the trend between their scores, there are structural differences: the scores differ in granularity, the relationship between them is heteroskedastic, and there is an unexplained population of bonds with no T-Score. These variations can be traced in part to differences in how the vendors designed their models and in how much of each asset type their liquidity models cover. Jitter has been helpful in detecting these anomalies.