Abstract
This is a short introduction to the art of visualizing data. It should help the attentive reader learn how to create appealing, parsimonious, and informative graphs that provide insights into the nature of the data at hand. The guide contains many examples of bad and good graphs in order to make the advantages of good practices clearer.
An effective data visualization should be clear, concise, and visually appealing. The key design principles underlying such a visualization include:
These are very useful sources on depicting data: Tufte (2001) and Zelazny (2001). These videos are valuable too: Karl W. Broman: Creating effective figures and tables and Darkhorse Analytics: Data looks better naked series.
The color wheel is a tool that helps designers and artists understand color relationships and create harmonious color schemes. It consists of a circle with colors arranged in a specific order, typically starting with red, then moving through orange, yellow, green, blue, and purple. The color wheel can be divided into different segments based on color relationships, such as complementary colors, analogous colors, or triadic colors.
The color wheel was invented in 1666 by Isaac Newton (1643–1727), who mapped the color spectrum onto a circle. It is the basis of color theory, because it shows the relationship between colors.
Source: DataNovia
There are several online tools that can be used to choose the appropriate colors for your graphs:
You can use this feature to select colors for your designs and see how they relate to each other on the color wheel.
In order to be clear and concise the plots must contain only relevant information. Data-to-ink ratio is a design principle suggested by Edward Tufte emphasizing the importance of displaying only necessary information in a graph. The goal is to minimize the amount of ink used to display non-essential information (chartjunk), while still conveying the important data effectively. This can improve readability and reduce clutter in a visualization.
How to maximize data-to-ink ratio?
It can be useful to think of the ink that your printer will have to consume in order to plot your graph. If you want to save money on ink, but still need to print the graph, think how to consume as little ink as possible.
Consider the example graph below. It is evidently overloaded with unnecessary “ink”. First, it has a grey background. Second, it contains gridlines that are not needed. Thus, its data-to-ink ratio is very low.
After removing these unnecessary elements, we obtain the following graph. It is still far from perfect, but much better than the previous one.
Edward Tufte also introduced the notion of chartjunk: all visual elements in a graph that are not necessary to understand the information shown or that distract the viewer from it. Several examples of chartjunk are presented below.
The north-west panel has an unnecessary background, and its bars are both colored and contain a grid. A graph without a background and with monochrome bars would be much more appropriate. The north-east panel also has a superfluous background and, in addition, contains a three-dimensional graph where none is needed, since it displays only two dimensions: time and the values of the variables. In the south-west panel, the perspective merely distracts and, moreover, gives a biased impression of the magnitude of the two variables displayed. The area diagrams are not needed here either: both variables could have been depicted using simple lines. Such graph elements are to be avoided at any cost.
A necessary step of any data-based research is exploratory analysis. It makes it possible to investigate properties of the data that would otherwise remain hidden. During such an analysis, various anomalies in the data can be found, including outliers, skewness of the distribution, etc.
Any graph must satisfy the following requirements. Its title must be informative and interesting. It should not only describe what is shown in the graph but also, especially in presentations, tell a story. For example, instead of “Dynamics of housing prices in 1990-2020” it could be “Falling prices after a decade of steady growth”. The axes (especially the vertical one) must be labeled. The legend must be informative and must not intersect with the lines. All numbers must be placed horizontally. One must also avoid rotating axis labels, since rotated values are more difficult for the reader to grasp.
The standard graphs must have the following format. The upper panel shows the general format, while the lower panel contains more specific titles and labels.
The graphs should have the following elements:
* Informative title
* Legend
* Axis description
* Horizontally placed axis labels
* Sources of data
Avoid such graphs! They can be improved by correcting and introducing title, legend, and axis labels and by removing unnecessary elements.
The north-west panel contains no descriptions. In the north-east panel, the legend is misplaced, for it intersects with the lines. In the south-west panel, the labels of the vertical axis are not horizontal and, hence, are more difficult to read. In the south-east panel, the grid lines are superfluous. They could be acceptable had they been plotted at wider intervals, in a lighter color, and dotted or dashed.
Pie diagrams are a popular means of illustrating the structure of a whole. However, they should be avoided: they distort the perception of relative sizes, because the human eye cannot correctly evaluate the size of the segments.
Compare a pie diagram and a barplot. Both show the same thing: the distribution of Russian exports in 2021 by partner country. In the pie diagram, it appears that exports to China are twice as large as exports to the Netherlands. However, the barplot makes it clear that they differ by less than 40%. The pie diagram requires many colors, while the barplot is more parsimonious, since it uses only one. Moreover, in the pie diagram the names of the countries overlap, whereas in the barplot they can be read easily.
Sources: UN Comtrade and own representation
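The comparison above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the shares are hypothetical numbers, not the UN Comtrade figures, and the helper names `relative_difference` and `text_barplot` are invented for this sketch. It computes by how much one value exceeds another and prints a sorted, single-symbol bar chart, which makes such differences readable at a glance.

```python
def relative_difference(a, b):
    """Return by how many percent a exceeds b."""
    return (a - b) / b * 100.0


def text_barplot(shares, width=40):
    """Render sorted, single-symbol horizontal bars as plain text."""
    largest = max(shares.values())
    label_width = max(len(name) for name in shares)
    lines = []
    for name, value in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        # Bar length is proportional to the value; the largest bar fills `width`.
        bar = "#" * round(value / largest * width)
        lines.append(f"{name:<{label_width}} {bar} {value:.1f}")
    return "\n".join(lines)


# Hypothetical export shares in percent (NOT the UN Comtrade figures).
shares = {"Netherlands": 10.5, "China": 14.0, "Germany": 6.0, "Turkey": 5.5}
print(text_barplot(shares))
print(f"China exceeds the Netherlands by {relative_difference(14.0, 10.5):.0f}%")
```

Sorting the bars and using one symbol for all of them mirrors the advice in the text: the order and the length carry the information, so no color is needed.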
Both charts have a small problem. The largest category is the rest of the world (ROW). Typically, you would try to avoid having a dominant category about which you cannot say much. Therefore, we combined all the EU member states into an EU aggregate. In this way, we managed to reduce the rest-of-the-world category.
Sources: UN Comtrade and own calculation
In the new graph, we managed to dramatically reduce the ROW, or “Other” category. Now, the largest category is the European Union.
In a cross-section, if you compare different objects, use the same color for all of them. Painting each bar in a different color is superfluous. Consider, for instance, the following chart.
Here, the frequencies of three groups of dwellings (with one, two, and three rooms) are plotted, each in a different color. The color is an additional element that repeats information already known from the labels of the vertical axis. Therefore, this element is superfluous and must be removed. In addition, the labels on the horizontal axis are placed vertically, which makes them more difficult to read. And, of course, the box around the graph should be removed without mercy.
The improved graph could look as follows.
In addition to all the changes mentioned above, I changed the color of the label for 3-room dwellings to make it more visible on the dark background and rounded the percentage shares displayed in the labels to make them more uniform. Strictly speaking, the percentage sign in the labels can be dropped too, since this information is already reflected in the horizontal axis label.
However, a different color can be useful if you want to stress some object of interest, for instance a particular country. In the graph below, I show the percentage change in real housing prices in the 3rd quarter of 2022 compared to the 2nd quarter. The graph is aimed at a German audience. Therefore, I need to stress Germany. I do so by plotting the corresponding bar in a different color.
Source: OECD and own representation
Although the different color for Germany does not carry any additional information, it helps the audience to see immediately the object they are most interested in.
For count data, one can use a dotplot. An advantage of the dotplot is that the number of dots is equal to the represented value, which makes the diagram easier to grasp. Below, I show the monthly number of passengers in the 10 largest airports worldwide.
Source: Civil Aviation Administration of China
It can be seen that the largest airports are similar in terms of their capacity: Each month they serve between six and nine million passengers.
If you want to provide additional information, for example, the country where the airport is located, you can place the flags of the corresponding countries in the diagram.
Source: Civil Aviation Administration of China
It can be seen immediately that half of the 10 largest airports are located in the USA.
Sometimes, for illustrative purposes, it may be necessary to plot images instead of lines or bars. Below, we represent the gender structure of the population of two imaginary countries, A and B. Let one male or female image represent one million persons. If much more populous countries are considered, the value of one image can be increased to 10 or even 100 million.
It can be seen that population in country B is more than two times smaller than in country A and that country B has relatively more females (56%) than country A (41%). The images are taken from a list of Egyptian hieroglyphs.
The following graph also uses images of the objects under discussion. It compares the number of cats and dogs per 10 persons in several countries. Since the ratios are seldom integer numbers, the fractional part is displayed as a part of the corresponding pet. The images of the pets are borrowed from the isotypes of Gerd Arntz.
It is important to choose an appropriate scale for the ratios. If the number of pets per person is displayed, the ratio is going to be smaller than one. If, however, pets per 100 persons are considered, the ratio will be close to an integer but too high to be grasped at a single glance. Therefore, here I show the number of cats and dogs per 10 persons.
Source: FEDIAF Annual Report 2024 and own calculations
As can be seen in the figure, the inhabitants of Austria, France, and Switzerland are predominantly “cat people,” while Spaniards and especially Portuguese are “dog people.” In the UK, however, the numbers of dogs and cats are balanced. Germany and Russia are also close to equal ratios of cats and dogs. Turkey has very few pets relative to the number of people compared to other countries considered here.
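The effect of the scale discussed above can be checked with a small sketch. The pet and population counts are hypothetical (they are not taken from the FEDIAF report), and the function names `pets_per` and `icons` are invented for illustration. The sketch computes the ratio per 1, 10, and 100 persons and splits it into whole icons plus the fraction of one more icon, which is how the fractional pets in such a figure can be derived.

```python
import math


def pets_per(pets, population, per=10):
    """Number of pets per `per` persons."""
    return pets / population * per


def icons(ratio):
    """Split a ratio into whole icons and the fraction of one more icon."""
    fraction, whole = math.modf(ratio)
    return int(whole), fraction


# Hypothetical country: 15 million cats and 80 million inhabitants.
cats, population = 15_000_000, 80_000_000
for per in (1, 10, 100):
    ratio = pets_per(cats, population, per)
    whole, fraction = icons(ratio)
    print(f"per {per:>3} persons: {ratio:6.2f} -> "
          f"{whole} whole icons + {fraction:.2f} of an icon")
```

With these made-up numbers, the per-person scale yields less than one icon and the per-100 scale yields far too many icons to count, while the per-10 scale gives a handful of icons.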
Correct identification of the true distribution is of utmost importance, since it can affect the conclusions drawn from the analysis of the data.
Typically, the distribution of both numeric and nominal data is analyzed using histograms. In the case of numeric data, an intermediate step is needed: the whole range of values is broken down into several groups, and the observations are assigned to them.
Source: SOEP and own calculations
Here, we consider the monthly income per capita. We see that the vast majority of people earn less than 5000 euros per month. There is somebody with an income of around 15,000 euros a month. Apparently, the sample does not cover the whole range of the distribution, since the very rich people are missing. However, it seems to cover the “normal” people.
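The intermediate binning step behind a histogram of numeric data can be sketched as follows. This is a minimal illustration, assuming equal-width bins; the incomes are hypothetical values rather than the SOEP data, and `histogram_counts` is a name invented for this sketch.

```python
def histogram_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width bins and count observations."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; there is no range to split")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum lies on the right edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical monthly incomes in euros (not the SOEP data).
incomes = [800, 1200, 1500, 1800, 2100, 2500, 2600, 3200, 4800, 15000]
edges, counts = histogram_counts(incomes, n_bins=5)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:>7.0f} to {right:>7.0f}: {'#' * count}")
```

A single very large value stretches the range, so most observations end up in the first bin. This is the same pattern as in the income histogram described above.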
One useful diagram for examining the distribution of data is the boxplot. It visualizes such important descriptive statistics as the median (the bold black line) as well as the first (Q1) and the third (Q3) quartiles, represented by the lower and the upper border of the box, respectively. In addition, it depicts outliers as dots above Q3 + 1.5 × IQR and below Q1 − 1.5 × IQR, where IQR = Q3 − Q1 stands for the interquartile range.
Source: SOEP and own calculations
In the above figure, we can see that the median is around 2500 euros and that the outliers start at 5000 euros. Therefore, we can remove them and make another boxplot without outliers.
Source: SOEP and own calculations
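The statistics behind a boxplot and the outlier rule can be sketched with the Python standard library. The incomes are hypothetical, and the choice of the "inclusive" quartile method is an assumption: plotting packages use slightly different quartile definitions, so their boxes may differ a little from these numbers.

```python
import statistics


def boxplot_stats(values):
    """Return the quartiles, the 1.5*IQR fences, and the points outside them."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}


# Hypothetical monthly incomes in euros (not the SOEP data).
incomes = [800, 1200, 1500, 1800, 2100, 2500, 2600, 3200, 4800, 15000]
stats = boxplot_stats(incomes)
print(stats)

# Dropping the outliers gives the data for a second, outlier-free boxplot.
without_outliers = [v for v in incomes if v not in stats["outliers"]]
print(without_outliers)
```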
The boxplot allows us to compare the distribution of a certain feature across different groups: e.g., periods, places, or genders. Below, we compare the monthly incomes of males and females.
Source: SOEP and own calculations
We can see that in Germany females earn substantially less than males. The median female income is about 1500 euros, while that of males is around 2500 euros, that is, more than 50% higher.
In addition, using a boxplot we can compare the sizes of both groups.
Source: SOEP and own calculations
It can be seen that both groups (females and males) are of comparable size, although males are a bit more numerous.
Another interesting type of graph representing the distribution is the violin chart, which is a hybrid between the boxplot and the empirical density. The black box is the same as the box in the normal boxplot, and the white line corresponds to the median. The gray area around the box is the empirical density function.
As can be seen, each violin diagram is symmetrical and its shape resembles that of a string instrument, hence the name of the diagram. However, as we are striving for parsimonious graphs without loss of information, it is better to drop one of the sides, which is just a mirror image of the remaining side.
Source: SOEP and own calculations
This violin chart basically confirms our previous conclusions about the differences in income between the two genders. In addition, we can see that the income variable for females has an asymmetric distribution. It looks as if the density were truncated at zero.
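The empirical density that forms the side of a violin can be approximated with a kernel density estimate. The sketch below is one possible way to do it, not necessarily the method used for the figure: it assumes Gaussian kernels and the normal-reference rule of thumb for the bandwidth, and the incomes are hypothetical.

```python
import math
import statistics


def gaussian_kde(sample, bandwidth=None):
    """Return a function estimating the density of `sample` with Gaussian kernels."""
    n = len(sample)
    if bandwidth is None:
        # Normal-reference rule of thumb: h = 1.06 * sd * n^(-1/5).
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        return total / (n * bandwidth * math.sqrt(2 * math.pi))

    return density


# Hypothetical monthly incomes in thousands of euros (not the SOEP data).
incomes = [0.8, 1.2, 1.5, 1.8, 2.1, 2.5, 2.6, 3.2, 4.8]
density = gaussian_kde(incomes)

# One side of the violin: the width at each income level is the density there.
for level in range(0, 7):
    print(f"{level} | {'#' * round(60 * density(level))}")
```

Note that a plain Gaussian kernel also assigns some density to negative incomes. When the variable cannot fall below zero, the estimate is usually cut off at that boundary, which can produce the truncated look mentioned above.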
We can also use a series of empirical density functions for different groups of observations. For example, it can be applied to the monthly data, if there is a clear seasonal pattern. The graph below depicts the distribution of the monthly number of deaths in Germany between 1991 and 2023.
Source: Destatis and own calculations
It can be seen that the number of deaths is lowest in summer. It is much higher in winter, but the range is also wider in the winter months. Two points show the number of deaths during the COVID-19 pandemic years. They are always located to the right of the mode, with 2021 generally being the year with the larger number of deaths, especially in January.
Sometimes you have several groups with many features. In these cases, it is better to use only up to 3 groups and 10–12 characteristics to guarantee visibility.
Suppose you want to compare these groups and see differences between them. In such cases, a radar diagram (spider chart or a star plot) can be useful.
A radar diagram is a graphical representation of multivariate data. It consists of a number of axes that radiate from a central point, with each axis representing a different variable. Data are plotted as a series of points on each axis, and the shape of the resulting polygon provides a visual representation of the data. Radar diagrams are useful for comparing the relative strengths and weaknesses of different variables across multiple categories. They can be particularly effective for displaying data with a cyclical or circular nature.
Sources: idealista and own calculations
In the above example, we see that regulated dwellings offered for rent in Catalonia in 2020-2022 were more expensive, older, located closer to the city center of Barcelona and in larger buildings than unregulated dwellings.
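The geometry of a radar diagram can be sketched as follows: each variable gets its own axis at an equally spaced angle, and each value becomes a polygon vertex at the corresponding distance from the center. The feature values below are hypothetical and assumed to be already scaled to the interval from 0 to 1. The orientation (first axis pointing up, the rest following clockwise) is also an assumption of this sketch.

```python
import math


def radar_vertices(values):
    """Map values (one per axis, scaled to [0, 1]) to x-y polygon vertices."""
    n = len(values)
    vertices = []
    for k, r in enumerate(values):
        # Assumption: the first axis points up, the others follow clockwise.
        theta = math.pi / 2 - 2 * math.pi * k / n
        vertices.append((r * math.cos(theta), r * math.sin(theta)))
    return vertices


# Hypothetical, already normalized characteristics of one group of dwellings.
features = {"price": 0.8, "age": 0.6, "distance to center": 0.3, "building size": 0.7}
for name, (x, y) in zip(features, radar_vertices(list(features.values()))):
    print(f"{name:<20} x={x:+.2f} y={y:+.2f}")
```

Connecting the vertices in order, and closing the polygon back to the first vertex, gives the shape for one group. A second group is drawn as another polygon on the same axes.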
A large part of empirical work is about finding relationships. Therefore, it is important to have adequate tools at hand. If we have two variables, a scatterplot can be useful. With more variables, a scatterplot matrix comes into play. However, for a compact representation of relationships other tools may be needed.
The simplest way to represent a relationship between two numeric variables is a scatterplot. Each axis displays the values of the corresponding variable. One can immediately see the strength and direction of the relationship. Any extreme values can be easily detected.
Consider a simple scatterplot. It shows the relationship between the vacancy rate (the share of dwellings standing empty) and the price of building plots at the level of German Kreise. There are more than 400 Kreise in Germany, including both cities (kreisfreie Staedte) and countryside (Landkreise). The lower the vacancy rate, the tighter the housing market and, therefore, the higher the prices.
Sources: Destatis, empirica, IVD, and own calculations (German Kreise in 2020)
Indeed, there is a cloud of points stretching from north-west to south-east, pointing to a negative relationship between vacancy rates and land prices.
However, we have more data and would like to reflect them in the same figure. It is likely that the size of the population is related to vacancy rates and real estate prices. The larger the population of a region, the more amenities it has and the more people it attracts. This reduces vacancy rates, since the demand for housing grows faster than its supply. Therefore, in the following figure we make the size of the circles proportional to the population of the corresponding Kreise.
Sources: Destatis, empirica, IVD, and own calculations (German Kreise in 2020)
Larger circles are located in the north-west part of the graph. This means that more populous regions have tighter housing markets: they have fewer vacant dwellings and higher land prices. Nevertheless, there are some relatively big regions with higher vacancies and lower prices. These can be big rural areas or cities located in depressed areas.
We know, though, that squares reflect relative sizes better than circles. Therefore, in the next graph we replace circles by squares.
Sources: Destatis, empirica, IVD, and own calculations (German Kreise in 2020)
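When marker size encodes a third variable such as population, it is usually the area of the marker, not its side or diameter, that should be proportional to the value. The sketch below shows the square-root scaling this implies. The populations are hypothetical, and `marker_side` and its `max_side` parameter are names invented for this illustration.

```python
import math


def marker_side(value, max_value, max_side=20.0):
    """Side length (or diameter) that makes the marker AREA proportional to value."""
    return max_side * math.sqrt(value / max_value)


# Hypothetical populations of three regions (not the Destatis data).
populations = {"A": 1_000_000, "B": 250_000, "C": 40_000}
largest = max(populations.values())
for name, pop in populations.items():
    side = marker_side(pop, largest)
    print(f"{name}: side {side:.1f}, area {side ** 2:.0f}")
```

Scaling the side directly with the value would exaggerate differences: a region with a quarter of the population would then get a marker with only one sixteenth of the area.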
In addition, we can take advantage of color in order to reflect an additional feature of the data. We know whether each region is an urban or a rural area. This is a nominal variable, and it can be represented by color. The rural areas are shown in green as an allusion to green fields, while the urban areas are painted brown, given that they produce lots of pollution.
Sources: Destatis, empirica, IVD, and own calculations (German Kreise in 2020)
We can see that vacancy rates tend to be the lowest and the land prices are the highest in the big cities, while small rural regions are characterized by a more relaxed real-estate market with lower prices.
An indispensable tool for the analysis of relationships is the computation of correlation coefficients. If there are multiple variables, it is useful to estimate a correlation matrix summarizing the correlations between pairs of variables. However, a correlation matrix as a table of numbers is rather unspectacular. Therefore, it makes sense to visualize the matrix, as shown below.
Correlations between food prices
Sources: Numbeo and own calculations
The size and color of the cells reflect the intensity and direction of the correlation. Blue corresponds to a positive correlation, while red corresponds to a negative correlation. If a correlation coefficient is not statistically significant, it is not displayed in the graph. Thus, the human eye can immediately recognize patterns in the matrix. There is only one major issue: when there are too many variables (say, more than 10), the visualized correlation matrix becomes a mess.
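The numbers behind such a figure can be sketched in plain Python. The example computes Pearson correlation coefficients for every pair of variables and maps each coefficient to a color (sign) and a size (absolute value), as described above. The prices are hypothetical rather than the Numbeo data, the function names are invented for this sketch, and the significance test that hides insignificant cells is deliberately left out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def cell(r):
    """Color encodes the sign of r; size grows with its absolute value."""
    color = "blue" if r > 0 else "red"
    return color, round(abs(r), 2)


# Hypothetical food prices in five cities (not the Numbeo data).
prices = {
    "bread": [1.0, 1.2, 1.5, 1.9, 2.0],
    "milk": [0.8, 0.9, 1.1, 1.3, 1.5],
    "eggs": [3.0, 2.8, 2.9, 2.2, 2.0],
}
matrix = correlation_matrix(prices)
for a in prices:
    print(a, [cell(matrix[a][b]) for b in prices])
```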
Line diagrams are especially useful for depicting continuous data. They are also widely used to show time series. Plotting a single time series does not pose any difficulties. It becomes much more interesting if more than one time series is to be depicted. First, it is advisable to plot no more than five variables in the same figure. Second, care must be taken if the units of measurement of the variables differ.
The plot below shows the dynamics of prices for land and for flats in multi-family houses per square meter in the top 7 German cities (Berlin, Cologne, Duesseldorf, Frankfurt am Main, Hamburg, Munich, and Stuttgart).
Source: IVD and own calculation.
The square-meter prices for land plots are on average five times lower than the prices for flats. Therefore, the dynamics of the land prices are hardly discernible in the figure above. There are different solutions to this issue. The first is to plot each time series in a separate graph. Assume, however, that for some reason (for example, in order to save space) we need to plot them in the same figure. Hence, the second option is to add another vertical axis, like this.
Source: IVD and own calculation
As seen, both time series evolve in a very similar way.
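Adding a second vertical axis amounts to a linear rescaling: the range of the "small" series is stretched onto the range of the "large" one, so that both lines use the full height of the plot. The sketch below illustrates this under stated assumptions: the prices are hypothetical (not the IVD data), and `to_primary_axis` is a name invented for this example.

```python
def to_primary_axis(value, secondary_range, primary_range):
    """Linearly map a value from the secondary axis range onto the primary one."""
    s_lo, s_hi = secondary_range
    p_lo, p_hi = primary_range
    return p_lo + (value - s_lo) / (s_hi - s_lo) * (p_hi - p_lo)


# Hypothetical prices in euros per square meter (not the IVD data).
flats = [2000, 3000, 4500, 6000]
land = [400, 650, 900, 1200]

primary = (min(flats), max(flats))
secondary = (min(land), max(land))
rescaled_land = [to_primary_axis(v, secondary, primary) for v in land]
print(rescaled_land)
```

After the rescaling, the land series spans the same vertical range as the flats series. The tick labels of the second axis are then the original land prices placed at the rescaled positions.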
We can stress the correspondence between the lines and the axes using color. Let us draw each vertical axis in the color of the respective time series.
Source: IVD and own calculation
The third possibility is to use one vertical axis but break it at the intermediate values, between the “large” and the “small” variables.
Source: IVD and own calculation
In this particular case, the second option appears to be the best: both lines are clearly distinguishable. In the third case, both are clearly visible, but in less detail than in the second case. The overall comovement of the two time series can be seen, but not the smaller fluctuations, such as the decline of flat prices and the increase of land prices.