Can a log transformation act as a zoom-in for a data visualization?

Let's see it with a graph of claims by age at an insurance company


In some insurance products age really matters: it can make the difference between a risky client and a profitable one. For this reason, it is important for insurance companies to have a clear understanding of the age distribution of their client portfolio.

In the following graph, the Y axis represents the total dollar amount of claims filed by clients, the X axis represents each day from 2010 to 2016, and the color of each dot indicates the average age of all the clients who made a claim on that particular day. For example, the last day recorded in the dataset sits at the far right of the X axis. If many clients made high-value claims on that day, the dot would sit near the top of the Y axis; if most of those clients were old, the dot would be light blue rather than the darker shade that represents a lower mean age.
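The construction described above can be sketched in R. This is only an illustration under stated assumptions: the original dataset is not available here, so the `claims` data frame, its column names (`date`, `amount`, `age`) and the derived `daily` table are all hypothetical and filled with synthetic values.

```r
library(ggplot2)

set.seed(1)

# Hypothetical claim-level data: one row per claim (synthetic values)
claims <- data.frame(
  date   = as.Date("2010-01-01") + sample(0:9, 50, replace = TRUE),
  amount = round(runif(50, min = 50, max = 5000), 2),
  age    = sample(18:80, 50, replace = TRUE)
)

# One row per day: total claimed amount and mean age of the claimants
daily_sum <- aggregate(amount ~ date, data = claims, FUN = sum)
daily_age <- aggregate(age ~ date, data = claims, FUN = mean)
daily <- merge(daily_sum, daily_age, by = "date")
names(daily) <- c("date", "total_claims", "mean_age")

# Day on X, daily total on Y, mean age mapped to the color of each dot
p <- ggplot(daily, aes(x = date, y = total_claims, colour = mean_age)) +
  geom_point()
```

With ggplot2's default continuous color scale, lower values are drawn in dark blue and higher values in light blue, which matches the reading of the colors given above.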



As the graph shows, the daily sum of claims can be separated into two groups. On some days the total doesn't even reach 10k (first group). The second group is more dispersed, ranging from around 10k to 40k, with most days falling near 20k.

Simply by looking at the graph, it's noticeable that the dots of the second group are a medium-dark shade. This means that the clients' ages are well distributed, so the average falls at an intermediate point between the extremes. In the first group, things get more complicated because the dots are too close together to read.

One solution is simply to discard the values of the second group and plot only those of the first, but what if the analyst wants to keep all of the data on the graph? Rescaling might be the answer, so let's apply the scale_y_log10 function from the ggplot2 package to the Y axis.
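A minimal sketch of that step follows. It does not use the article's data: the `daily` data frame, its column names and the two simulated groups of days are assumptions made only to have something to plot.

```r
library(ggplot2)

set.seed(42)
n_days <- 200

# Hypothetical daily summary with a low group (under 10k) and a high group (10k to 40k)
daily <- data.frame(
  date         = seq(as.Date("2010-01-01"), by = "day", length.out = n_days),
  total_claims = c(runif(n_days / 2, min = 500, max = 9000),
                   runif(n_days / 2, min = 10000, max = 40000)),
  mean_age     = runif(n_days, min = 20, max = 70)
)

# Original view: linear Y axis
base_plot <- ggplot(daily, aes(x = date, y = total_claims, colour = mean_age)) +
  geom_point()

# Same data, same points, but the Y axis is drawn on a log10 scale
log_plot <- base_plot + scale_y_log10()
```

Printing `base_plot` and `log_plot` one after the other shows the effect: no observation is dropped, only the spacing along the Y axis changes. Note that a log scale needs strictly positive values, so days with a total of zero would have to be handled separately.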



After applying the log scale, the first cluster becomes much more interpretable, allowing the analyst to read it and draw conclusions. As the visualization shows, the first group contains a considerable number of dark points, which makes sense because young people tend to make lower-value claims, making them profitable.
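The "zoom" effect can be checked with a little arithmetic. The sketch below assumes, purely for illustration, a Y axis that runs from 1k to 40k, and compares how much of that axis the sub-10k region occupies on a linear scale and on a log10 scale.

```r
# Illustrative axis limits and the boundary between the two groups (assumed values)
axis_min <- 1000
boundary <- 10000
axis_max <- 40000

# Share of the axis length taken by the 1k-10k region on a linear scale
linear_share <- (boundary - axis_min) / (axis_max - axis_min)

# Share of the axis length taken by the same region on a log10 scale
log_share <- (log10(boundary) - log10(axis_min)) /
  (log10(axis_max) - log10(axis_min))

round(c(linear = linear_share, log10 = log_share), 2)
```

Under these assumptions the low-value region grows from roughly 23% of the axis to roughly 62%, which is why the first group spreads out and becomes readable while the second group is compressed.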

The upward trend in the daily sum of claims also becomes clearer after applying the log, once again helping the analyst to read the data.

Before filtering the data and discarding observations, why not try rescaling it to see if it gets more readable? Maybe the graph can be zoomed in!