Some Thoughts on Data Visualization

Data visualization is the reason I became interested in data science. I have an education in fine art but have always felt out of place in the art world, and my career has gravitated toward more technical jobs. So when I discovered data science as a field of study, I was initially (and still am) most fascinated by the intersection of art and science that is data visualization. I hadn’t really considered data visualization a statistical method until I learned more about it. I never realized how important it was to visualize data not just to communicate a message, but also to understand your data, gain new insight into it, and assess the validity of your models.

My initial thoughts about data visualization were much more influenced by my knowledge of art and design principles. I still think those are important considerations in developing a good data visualization, but I’ve since learned that there are many statistical considerations as well. Color, composition, line, shape, and texture all play a role, but the data types, the number of variables you want to show in one plot, the type of audience you are presenting to, the extent of their statistical knowledge, and the insight you are trying to gain from the visualization are paramount. There is also a considerable amount of new research being done to understand how human beings interpret what they see and the impact that has on the effectiveness of your visualization.

Data Types

Certain data types just don’t work with certain types of plots. Understanding what type of data you have, and which plots work for that type, is probably the first thing to think about when planning a visualization. Bar graphs, for instance, only work with discrete variables on the x-axis, while histograms are the continuous equivalent: they discretize a continuous variable by binning the data into ranges. This r-statistics.co ‘Masterlist’ is a good resource. It is organized by the statistical relationship you are trying to show, which makes it easier for the numbers people to find an appropriate visualization. This plot type resource from THE R GRAPH GALLERY, on the other hand, might make it easier for the more visual thinkers to find appropriate plots.
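To make the bar graph versus histogram distinction concrete, here is a minimal sketch in plain Python of the counting that happens behind each plot. The function names (category_counts, histogram_bins) and the sample values are my own illustrations and not part of any plotting library, and equal-width binning is just one common choice among several.

```python
from collections import Counter


def category_counts(labels):
    """Count each discrete category: these counts are the bar heights of a bar graph."""
    return Counter(labels)


def histogram_bins(values, n_bins):
    """Bin a continuous variable into n_bins equal-width ranges.

    Returns (edges, counts), where edges has n_bins + 1 entries.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so there is no range to bin")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


if __name__ == "__main__":
    # Discrete data: one bar per category.
    print(category_counts(["cat", "dog", "cat", "bird"]))

    # Continuous data: made-up measurements grouped into four ranges.
    edges, counts = histogram_bins([1.0, 1.5, 2.0, 2.5, 3.0, 4.5, 5.0], 4)
    for left, right, n in zip(edges, edges[1:], counts):
        print(f"{left:.1f} to {right:.1f}: {'#' * n}")
```

Changing the number of bins changes the shape the histogram shows, which is one reason it is worth trying a few bin counts before settling on one.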

The Audience

Another major consideration that often gets forgotten is your audience. If you are publishing a web or print article meant for mass circulation to the general public, you are most likely going to want a very different graphic than if you are presenting to the board of directors of a major corporation, or to a group of scientists or mathematicians at a conference or in a published journal. A graphic for the general public calls for more color and less detail, while one for scientists or mathematicians calls for more detailed information and fewer frills. A graphic for a board of directors, not surprisingly, falls somewhere in between. The same ordering applies to statistical knowledge: the level you can expect, and so can successfully represent in a graphic, is much lower for a general audience than for board members, which in turn is lower than what you would expect from scientists or mathematicians.

‘Sketching’ Your Data

I wrote in my previous post about using statistics to enhance stories, and that includes visualizations. Good data visualizations are like works of art and can tell a story all by themselves, often without needing any further explanation. However, the plot you would make to tell a story is a lot different from the plot you would generate to test whether your model residuals are nearly normal and homoscedastic, to see what type of distribution fits your data, or to check whether there appears to be a linear relationship between your target and your predictor variables. These latter types of visualizations may never be seen by anyone but you, and their design is to a statistician or data scientist what a sketch is to an artist. They allow you to see the shapes and relationships within your data in ways that you wouldn’t be able to with tables of numbers alone.
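As a rough illustration of what such a ‘sketch’ can look like, the Python below fits a simple least-squares line and draws a crude text plot of the residuals around zero. Everything here is an assumption made for illustration: the helper names (fit_line, residuals, sketch_residuals) and the data are made up, and in practice you would more likely use your plotting library’s residual plot than a text one.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor. Returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y, slope, intercept):
    """Observed value minus fitted value for each observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def sketch_residuals(resid, width=20):
    """One text row per observation: '|' marks zero, '*' marks the residual."""
    scale = max(abs(r) for r in resid) or 1.0  # avoid dividing by zero on a perfect fit
    rows = []
    for r in resid:
        pos = round((r / scale) * width)  # ranges from -width to +width
        row = [" "] * (2 * width + 1)
        row[width] = "|"
        row[width + pos] = "*"
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    # Made-up data, listed in order of increasing x.
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2.1, 3.9, 6.2, 7.8, 10.5, 11.4, 14.9, 15.2]
    slope, intercept = fit_line(x, y)
    for row in sketch_residuals(residuals(x, y, slope, intercept)):
        print(row)
```

If the stars fan out further from the zero line as you read down the rows, the spread of the residuals is growing with x, which hints at heteroscedasticity. A quick look like this is only for you, so it does not need to be pretty.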

Published on RPubs at: http://rpubs.com/betsyrosalen/DATA621_Blog_Post_3