In this blog, we illustrate the role of jittering in the graphical analysis of datasets through an example from financial markets. We argue that jittering can provide useful information about the distribution of the data and the relationship between variables.
The goal is not to focus on the command syntax of the ggplot2 library, in which this example is implemented, but rather on the interpretation of the graphical output.
The example uses the liquidity scores for a fixed income portfolio as of Jan 31, 2020 from two vendors, B and T. We have anonymized the vendors and the specific securities prior to loading the data, but we have not changed the liquidity score values in any other way.
First we load the data file and compute summary statistics. Each observation in the data file is a position in the portfolio. For each position, we have four types of information: an anonymized security identifier, sector classification fields, the market value of the position, and the liquidity scores from vendors B and T.
library(tidyverse)   # readr, dplyr, ggplot2, forcats
library(knitr)       # kable() for formatting tables
library(skimr)       # skim() summary statistics

draw = read_csv("Blog2_Jitter_Example_Data.csv",
                col_types = cols(MarketValueDirtyLocal = col_double()))

slice(draw, 1:4) %>%
  dplyr::select(ID, Sector, Class, SubClass, BSector, BScore, TScore) %>%
  kable()
| ID | Sector | Class | SubClass | BSector | BScore | TScore |
|---|---|---|---|---|---|---|
| Bond_1 | Corporate | Investment Grade | Unsecured | Corporate Debt | 66 | 1 |
| Bond_2 | Corporate | Investment Grade | Unsecured | Corporate Debt | 76 | 4 |
| Bond_3 | Comm. Real Estate | Large Loan | Multi-Borrower | Securitized Debt | NA | NA |
| Bond_4 | Corporate | Investment Grade | Unsecured | Corporate Debt | 42 | 1 |
skimr::skim(draw) %>%
  yank("numeric") %>%
  dplyr::select(-mean, -p25, -p75, -sd, -hist) %>%
  kable()
| skim_variable | n_missing | complete_rate | p0 | p50 | p100 |
|---|---|---|---|---|---|
| MarketValueDirtyLocal | 1 | 0.9974093 | -320735 | 1434048 | 37306509 |
| BScore | 30 | 0.9222798 | 6 | 86 | 100 |
| In_B | 0 | 1.0000000 | 0 | 1 | 1 |
| In_T | 3 | 0.9922280 | 1 | 192 | 383 |
| TScore | 154 | 0.6010363 | 1 | 4 | 10 |
The portfolio has 386 total positions.
Vendor B provides liquidity scores for 92.2% of the portfolio, while Vendor T covers 60.1%.
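These coverage figures can be read off the complete_rate column of the skim output above, or recomputed directly from the missing values; a minimal sketch using dplyr, with the same draw data frame:

```r
# Recompute the coverage of each vendor's score from the missing values;
# this should reproduce the complete_rate column of the skim output above.
draw %>%
  summarise(
    n_positions = n(),
    coverage_B  = mean(!is.na(BScore)),
    coverage_T  = mean(!is.na(TScore))
  )
```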
Consider the question of how vendor B and vendor T calculate liquidity.
Do the outputs of the two models agree to a sufficient degree that we can deem them to be equivalent? Do the scores of the two vendor models have the same meaning?
In order to answer these questions, we could build a formal regression model, but data visualization will turn out to be sufficient for an initial analysis.
First, the scores from each vendor need to be understood clearly:
Vendor B defines its scores on an ordinal scale from 1 to 100, where 1 means the least liquid and 100 the most liquid. Vendor T uses a different ordinal scale, from 1 to 10; again, 1 means the least liquid and 10 the most liquid in the Vendor T model.
We conclude that linear regression might not be the most appropriate approach, since neither vendor's score is a continuous variable.
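If a single numeric summary of the association is still wanted, a rank-based measure such as Spearman's correlation uses only the ordering of the scores, not their spacing. This is not part of the original analysis, just a minimal sketch assuming the draw data frame loaded above:

```r
# Spearman's rank correlation between the two ordinal scores;
# "pairwise.complete.obs" drops positions missing either vendor's score.
cor(draw$BScore, draw$TScore,
    method = "spearman", use = "pairwise.complete.obs")
```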
We can use a grouped box plot to search for a trend in the distribution of the vendor score data.
The first box plot below displays the Vendor T scores on the horizontal axis from 1 to 10.
Note that the subset of observations with no score from Vendor T is placed in the NA boxplot on the rightmost side. On the vertical axis, the distribution of Vendor B scores is shown.
At each value of Vendor T score, we see the box plot of Vendor B scores.
# Order the T-Score factor levels numerically (1 through 10); positions with
# no T-Score fall into the NA box that ggplot places on the far right.
draw %>%
  mutate(yy = ifelse(is.na(TScore), 11, as.integer(TScore))) %>%
  ggplot(aes(x = fct_reorder(as.factor(TScore), yy), y = BScore)) +
  geom_boxplot() +
  xlab("T-Score") +
  ylab("B-Score") +
  ggtitle("Liquidity Scores Without Jitter: Vendor T vs. B")
A few conclusions can be read off this plot directly:

- The median B-Score rises as the T-Score increases, so the two models broadly agree on the ordering of liquidity.
- A sizable group of positions receives no T-Score at all and falls into the NA box on the right.

What we don't know, but where jitter can be helpful, are more subtle questions:

- How many observations sit at each T-Score level?
- How are the B-Scores distributed within each level, beyond the quartiles that the boxes summarize?

The jittered version of the boxplot below lets us answer these questions.
# Same plot, with the individual observations overlaid as jittered points.
# width and height control how far each point is randomly displaced from its
# true position, and alpha = .3 keeps dense clusters readable.
draw %>%
  mutate(yy = ifelse(is.na(TScore), 11, as.integer(TScore))) %>%
  ggplot(aes(x = fct_reorder(as.factor(TScore), yy), y = BScore)) +
  geom_boxplot() +
  geom_jitter(alpha = .3, color = "red",
              position = position_jitter(width = 0.2, height = 0.1)) +
  xlab("T-Score") +
  ylab("B-Score") +
  ggtitle("Liquidity Scores with Jitter: Vendor T vs. B")
We see that the distribution of B-Scores at the lowest T-Score level (T-Score = 1) ranges from 1 to 92. We also see that a large number of points fall into the NA boxplot; that distribution is heavily left-skewed, with a concentration at B-Score = 100.
Thus we see both that the count of observations within each T-Score level is lumpy and that the B-Score distribution within each level is highly skewed, despite the positive trend in the B-Score medians with increasing T-Scores.
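The lumpiness of the counts and the trend in the medians can also be checked numerically; a minimal sketch, again assuming the draw data frame from above:

```r
# Count of positions and median B-Score at each T-Score level
# (positions with no T-Score are grouped under NA).
draw %>%
  group_by(TScore) %>%
  summarise(
    n_positions   = n(),
    median_BScore = median(BScore, na.rm = TRUE)
  )
```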
Returning to our initial questions, the two vendors' models do not appear equivalent.
Despite the existence of a trend between their scores, there are structural differences in the scores' granularity, heteroskedasticity in the inter-score relationship, and an unexplained population of bonds with an NA T-Score. These differences can be traced in part to how the vendors designed their models and to the degree of coverage, by asset type, of their liquidity models. Jitter has been helpful in detecting these anomalies.