Data Visualisation Workshop - Intro and Design

EPA Victoria - June 2018

Dr James Baglin, School of Science, Mathematical Sciences, RMIT University

Last updated: 24 June, 2018

Workshop Overview

Facilitator

  • Dr James Baglin
  • Senior Lecturer, School of Science, Mathematical Sciences
  • Email: james.baglin@rmit.edu.au
  • Phone: 03 9925 6618
  • Teaching: Statistics and Data Visualisation - Master of Analytics, Master of Statistics and Operations Research and Master of Data Science
  • Research areas: Higher education, educational technology, statistics education and data visualisation

Workshop Outline

  • Slides are available at http://rpubs.com/jbaglin/

      1. Intro and Design (9:00am - 10:30am)
      • Break (15 minutes)
      2. Theory (10:45am - 12:30pm)
      • Lunch (1 hour)
      3. Colour (1:30pm - 3:00pm)
      • Break (15 minutes)
      4. Methods (3:15pm - 5:00pm)

Format

  • Each module will include the following:
    • Presentation
    • Discussion
    • Brief Exercise
    • Short Quiz

Design

Activity 1 - Turning Tables

Why do we visualise data?

  • Andy Kirk (2012) defined data visualisation as “the representation and presentation of data that exploits our visual perception abilities in order to amplify cognition” (p.17)
  • Why do we visualise?
    • Exploration - identifying interesting and important features (Buja et al. 2009)
    • Assist informal inference (Bakker et al. 2008)
    • Assist in teaching statistics (Paparistodemou and Meletiou-Mavrotheris 2008)
    • Exploits our visual processing power to rapidly process, compare and identify trends or interesting features in the data (Kirk 2012)
    • Makes data more appealing

Why do we visualise data? Cont.

The greatest value of a picture is when it forces us to notice what we never expected to see - John W Tukey, Exploratory Data Analysis (1977)



The Power of Data Visualisation

  • Data visualisation has power and we need to respect it
  • Tal and Wansink (2016) found that the mere presence of a “trivial” graph had a significant impact on the persuasiveness of claims about the efficacy of a medication.
  • The effect was greater in people who had a strong belief in science.

Some Revision

  • Before we begin to dig deeper, we need to review levels of measurement:

    • Categorical or Nominal (Qualitative): Categorical variables are grouping variables, or categories if you will. Examples include binary variables (e.g. yes/no, male/female) and multinomial variables (e.g. religious affiliation, hair colour, ethnicity, suburb).
    • Ordinal (Qualitative): Ordinal data has a rank order by which it can be sorted, but the differences between the ranks are not relative or measurable. Therefore, ordinal data is not strictly quantitative. For example, consider the 1st, 2nd and 3rd place in a race. We know who was faster or slower, but we have no idea by how much.

Some Revision Cont.

  • Levels of measurement continued:

    • Interval (Quantitative): An interval variable is similar to an ordinal variable except that the intervals between the values of the interval scale are equally spaced. Interval variables have an arbitrary zero point and therefore no meaningful ratios. For example, think about our calendar year and Celsius scale.
    • Ratio (Quantitative): A ratio variable is similar to an interval variable; however, there is an absolute zero point and ratios are meaningful. Examples include time given in seconds, length in centimetres, and heart beats per minute. A value of 0 implies the absence of the quantity being measured.
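
As a small illustration of why ratios are not meaningful on an interval scale, the sketch below compares the same two temperatures in Celsius (interval) and in Kelvin (ratio). It is written in Python using only the language itself; the function name `celsius_to_kelvin` and the two example temperatures are assumptions made for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert from an interval scale (Celsius) to a ratio scale (Kelvin)."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0  # two example temperatures in degrees Celsius

# The arithmetic says "twice as hot", but the Celsius zero point is
# arbitrary, so this ratio has no physical meaning.
celsius_ratio = warm_c / cool_c

# Kelvin has an absolute zero, so this ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences are meaningful on both scales and agree with each other.
celsius_diff = warm_c - cool_c
kelvin_diff = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(f"Celsius ratio: {celsius_ratio:.3f}")
print(f"Kelvin ratio:  {kelvin_ratio:.3f}")
print(f"Difference:    {celsius_diff:.1f} C, {kelvin_diff:.1f} K")
```

The differences match on both scales, while only the Kelvin ratio describes how much warmer the second temperature actually is.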

A Layered Grammar of Graphics

  • Wickham (2010) proposed the layered grammar of graphics, which built upon the original Grammar of Graphics first proposed by Wilkinson (2005).
  • The idea was to build a grammar that could describe any data visualisation as succinctly as possible.
  • This is a big idea because it allows us to move away from a narrow list of methods into unlimited possibilities.

A Layered Grammar of Graphics Cont.

  • Wickham proposed that a graphic is made up of the following components…
    • a default dataset and set of mappings from variables to aesthetics,
    • one or more layers, with each layer having
      • one geometric object,
      • one statistical transformation,
      • one position adjustment,
      • and optionally, one dataset and set of aesthetic mappings,
    • one scale for each aesthetic mapping used,
    • a coordinate system,
    • a facet specification.
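
The component list above can be read as a data structure. The sketch below is an illustration in Python (standard library only), not Wickham's implementation. The class and field names (`Layer`, `Plot`, `is_complete`) and the scale labels are hypothetical, and the example plot is only loosely modelled on the air quality visualisation used in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Layer:
    geom: str                                  # one geometric object, e.g. "point"
    stat: str = "identity"                     # one statistical transformation
    position: str = "identity"                 # one position adjustment
    data: Optional[str] = None                 # optional layer-specific dataset
    mapping: Optional[Dict[str, str]] = None   # optional layer-specific mappings

@dataclass
class Plot:
    data: str                                  # default dataset
    mapping: Dict[str, str]                    # default aesthetic -> variable mappings
    layers: List[Layer] = field(default_factory=list)
    scales: Dict[str, str] = field(default_factory=dict)  # one per mapped aesthetic
    coord: str = "cartesian"                   # coordinate system
    facet: Optional[str] = None                # facet specification

    def mapped_aesthetics(self):
        """Every aesthetic used by the defaults or by any layer."""
        used = set(self.mapping)
        for layer in self.layers:
            used.update(layer.mapping or {})
        return used

    def is_complete(self):
        """At least one layer, and one scale for each mapped aesthetic."""
        return bool(self.layers) and self.mapped_aesthetics() <= set(self.scales)

# Hypothetical specification: raw points plus a smoothed trend line.
air_quality_plot = Plot(
    data="airquality",
    mapping={"x": "date", "y": "Ozone"},
    layers=[Layer(geom="point"), Layer(geom="line", stat="smooth")],
    scales={"x": "date", "y": "continuous"},
)
print(air_quality_plot.is_complete())
```

Both layers inherit the default dataset and mappings, which is why neither needs to repeat them.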

Layers

  • Any graphic can be thought of as a series of layers…

  • Put them together and we create a graph…

Layers Cont.

Data

  • Layers are composed of data, aesthetic mappings, statistical transformations, geometric objects and optional position adjustments.
  • Data are obvious…
##     Ozone Solar.R Wind Temp Month Day       date
## 1      41     190  7.4   67     5   1 1973-05-01
## 2      36     118  8.0   72     5   2 1973-05-02
## 3      12     149 12.6   74     5   3 1973-05-03
## 4      18     313 11.5   62     5   4 1973-05-04
## 5      NA      NA 14.3   56     5   5 1973-05-05
## 6      28      NA 14.9   66     5   6 1973-05-06
## 7      23     299  8.6   65     5   7 1973-05-07
## 8      19      99 13.8   59     5   8 1973-05-08
## 9       8      19 20.1   61     5   9 1973-05-09
## 10     NA     194  8.6   69     5  10 1973-05-10
## 11      7      NA  6.9   74     5  11 1973-05-11
## 12     16     256  9.7   69     5  12 1973-05-12
## 13     11     290  9.2   66     5  13 1973-05-13
## 14     14     274 10.9   68     5  14 1973-05-14
## 15     18      65 13.2   58     5  15 1973-05-15
## 16     14     334 11.5   64     5  16 1973-05-16
## 17     34     307 12.0   66     5  17 1973-05-17
## 18      6      78 18.4   57     5  18 1973-05-18
## 19     30     322 11.5   68     5  19 1973-05-19
## 20     11      44  9.7   62     5  20 1973-05-20
## 21      1       8  9.7   59     5  21 1973-05-21
## 22     11     320 16.6   73     5  22 1973-05-22
## 23      4      25  9.7   61     5  23 1973-05-23
## 24     32      92 12.0   61     5  24 1973-05-24
## 25     NA      66 16.6   57     5  25 1973-05-25
## 26     NA     266 14.9   58     5  26 1973-05-26
## 27     NA      NA  8.0   57     5  27 1973-05-27
## 28     23      13 12.0   67     5  28 1973-05-28
## 29     45     252 14.9   81     5  29 1973-05-29
## 30    115     223  5.7   79     5  30 1973-05-30
## 31     37     279  7.4   76     5  31 1973-05-31
## 32     NA     286  8.6   78     6   1 1973-06-01
## 33     NA     287  9.7   74     6   2 1973-06-02
## 34     NA     242 16.1   67     6   3 1973-06-03
## 35     NA     186  9.2   84     6   4 1973-06-04
## 36     NA     220  8.6   85     6   5 1973-06-05
## 37     NA     264 14.3   79     6   6 1973-06-06
## 38     29     127  9.7   82     6   7 1973-06-07
## 39     NA     273  6.9   87     6   8 1973-06-08
## 40     71     291 13.8   90     6   9 1973-06-09
## 41     39     323 11.5   87     6  10 1973-06-10
## 42     NA     259 10.9   93     6  11 1973-06-11
## 43     NA     250  9.2   92     6  12 1973-06-12
## 44     23     148  8.0   82     6  13 1973-06-13
## 45     NA     332 13.8   80     6  14 1973-06-14
## 46     NA     322 11.5   79     6  15 1973-06-15
## 47     21     191 14.9   77     6  16 1973-06-16
## 48     37     284 20.7   72     6  17 1973-06-17
## 49     20      37  9.2   65     6  18 1973-06-18
## 50     12     120 11.5   73     6  19 1973-06-19
## 51     13     137 10.3   76     6  20 1973-06-20
## 52     NA     150  6.3   77     6  21 1973-06-21
## 53     NA      59  1.7   76     6  22 1973-06-22
## 54     NA      91  4.6   76     6  23 1973-06-23
## 55     NA     250  6.3   76     6  24 1973-06-24
## 56     NA     135  8.0   75     6  25 1973-06-25
## 57     NA     127  8.0   78     6  26 1973-06-26
## 58     NA      47 10.3   73     6  27 1973-06-27
## 59     NA      98 11.5   80     6  28 1973-06-28
## 60     NA      31 14.9   77     6  29 1973-06-29
## 61     NA     138  8.0   83     6  30 1973-06-30
## 62    135     269  4.1   84     7   1 1973-07-01
## 63     49     248  9.2   85     7   2 1973-07-02
## 64     32     236  9.2   81     7   3 1973-07-03
## 65     NA     101 10.9   84     7   4 1973-07-04
## 66     64     175  4.6   83     7   5 1973-07-05
## 67     40     314 10.9   83     7   6 1973-07-06
## 68     77     276  5.1   88     7   7 1973-07-07
## 69     97     267  6.3   92     7   8 1973-07-08
## 70     97     272  5.7   92     7   9 1973-07-09
## 71     85     175  7.4   89     7  10 1973-07-10
## 72     NA     139  8.6   82     7  11 1973-07-11
## 73     10     264 14.3   73     7  12 1973-07-12
## 74     27     175 14.9   81     7  13 1973-07-13
## 75     NA     291 14.9   91     7  14 1973-07-14
## 76      7      48 14.3   80     7  15 1973-07-15
## 77     48     260  6.9   81     7  16 1973-07-16
## 78     35     274 10.3   82     7  17 1973-07-17
## 79     61     285  6.3   84     7  18 1973-07-18
## 80     79     187  5.1   87     7  19 1973-07-19
## 81     63     220 11.5   85     7  20 1973-07-20
## 82     16       7  6.9   74     7  21 1973-07-21
## 83     NA     258  9.7   81     7  22 1973-07-22
## 84     NA     295 11.5   82     7  23 1973-07-23
## 85     80     294  8.6   86     7  24 1973-07-24
## 86    108     223  8.0   85     7  25 1973-07-25
## 87     20      81  8.6   82     7  26 1973-07-26
## 88     52      82 12.0   86     7  27 1973-07-27
## 89     82     213  7.4   88     7  28 1973-07-28
## 90     50     275  7.4   86     7  29 1973-07-29
## 91     64     253  7.4   83     7  30 1973-07-30
## 92     59     254  9.2   81     7  31 1973-07-31
## 93     39      83  6.9   81     8   1 1973-08-01
## 94      9      24 13.8   81     8   2 1973-08-02
## 95     16      77  7.4   82     8   3 1973-08-03
## 96     78      NA  6.9   86     8   4 1973-08-04
## 97     35      NA  7.4   85     8   5 1973-08-05
## 98     66      NA  4.6   87     8   6 1973-08-06
## 99    122     255  4.0   89     8   7 1973-08-07
## 100    89     229 10.3   90     8   8 1973-08-08
## 101   110     207  8.0   90     8   9 1973-08-09
## 102    NA     222  8.6   92     8  10 1973-08-10
## 103    NA     137 11.5   86     8  11 1973-08-11
## 104    44     192 11.5   86     8  12 1973-08-12
## 105    28     273 11.5   82     8  13 1973-08-13
## 106    65     157  9.7   80     8  14 1973-08-14
## 107    NA      64 11.5   79     8  15 1973-08-15
## 108    22      71 10.3   77     8  16 1973-08-16
## 109    59      51  6.3   79     8  17 1973-08-17
## 110    23     115  7.4   76     8  18 1973-08-18
## 111    31     244 10.9   78     8  19 1973-08-19
## 112    44     190 10.3   78     8  20 1973-08-20
## 113    21     259 15.5   77     8  21 1973-08-21
## 114     9      36 14.3   72     8  22 1973-08-22
## 115    NA     255 12.6   75     8  23 1973-08-23
## 116    45     212  9.7   79     8  24 1973-08-24
## 117   168     238  3.4   81     8  25 1973-08-25
## 118    73     215  8.0   86     8  26 1973-08-26
## 119    NA     153  5.7   88     8  27 1973-08-27
## 120    76     203  9.7   97     8  28 1973-08-28
## 121   118     225  2.3   94     8  29 1973-08-29
## 122    84     237  6.3   96     8  30 1973-08-30
## 123    85     188  6.3   94     8  31 1973-08-31
## 124    96     167  6.9   91     9   1 1973-09-01
## 125    78     197  5.1   92     9   2 1973-09-02
## 126    73     183  2.8   93     9   3 1973-09-03
## 127    91     189  4.6   93     9   4 1973-09-04
## 128    47      95  7.4   87     9   5 1973-09-05
## 129    32      92 15.5   84     9   6 1973-09-06
## 130    20     252 10.9   80     9   7 1973-09-07
## 131    23     220 10.3   78     9   8 1973-09-08
## 132    21     230 10.9   75     9   9 1973-09-09
## 133    24     259  9.7   73     9  10 1973-09-10
## 134    44     236 14.9   81     9  11 1973-09-11
## 135    21     259 15.5   76     9  12 1973-09-12
## 136    28     238  6.3   77     9  13 1973-09-13
## 137     9      24 10.9   71     9  14 1973-09-14
## 138    13     112 11.5   71     9  15 1973-09-15
## 139    46     237  6.9   78     9  16 1973-09-16
## 140    18     224 13.8   67     9  17 1973-09-17
## 141    13      27 10.3   76     9  18 1973-09-18
## 142    24     238 10.3   68     9  19 1973-09-19
## 143    16     201  8.0   82     9  20 1973-09-20
## 144    13     238 12.6   64     9  21 1973-09-21
## 145    23      14  9.2   71     9  22 1973-09-22
## 146    36     139 10.3   81     9  23 1973-09-23
## 147     7      49 10.3   69     9  24 1973-09-24
## 148    14      20 16.6   63     9  25 1973-09-25
## 149    30     193  6.9   70     9  26 1973-09-26
## 150    NA     145 13.2   77     9  27 1973-09-27
## 151    14     191 14.3   75     9  28 1973-09-28
## 152    18     131  8.0   76     9  29 1973-09-29
## 153    20     223 11.5   68     9  30 1973-09-30

Mapping Aesthetics

  • Visual properties of geometric objects, such as position, colour, shape and size, are referred to as aesthetics.
  • The process of assigning variables from a dataset to the aesthetics is known as mapping.
  • For example, in the air quality visualisation, x = date and y = Ozone.
  • Points are geometric objects, and the position at which they are drawn on the plot is determined by the mappings.
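
To make the idea of a mapping concrete, the sketch below applies a mapping to the first three rows of the air quality data. It is a minimal Python illustration under stated assumptions: the function name `apply_mapping` and the second mapping (Temp, Ozone and Wind) are hypothetical and are not taken from the workshop material.

```python
# First three rows of the air quality data (selected columns only).
rows = [
    {"Ozone": 41, "Wind": 7.4, "Temp": 67, "date": "1973-05-01"},
    {"Ozone": 36, "Wind": 8.0, "Temp": 72, "date": "1973-05-02"},
    {"Ozone": 12, "Wind": 12.6, "Temp": 74, "date": "1973-05-03"},
]

# A mapping assigns a variable (column name) to each aesthetic.
mapping = {"x": "date", "y": "Ozone"}

def apply_mapping(rows, mapping):
    """Return one dict of aesthetic values per observation."""
    return [
        {aesthetic: row[variable] for aesthetic, variable in mapping.items()}
        for row in rows
    ]

points = apply_mapping(rows, mapping)

# The same observations under a different (hypothetical) mapping.
points_by_temp = apply_mapping(rows, {"x": "Temp", "y": "Ozone", "colour": "Wind"})

for point in points:
    print(point)
```

The data do not change between the two calls; only the assignment of variables to aesthetics does.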

Geometric Objects

  • Geometric objects are used to represent data or statistical transformations of the data.
  • We are already familiar with many common geometric objects.
    • Box plots
    • Histograms
    • Bar charts
    • Scatter plots, etc.

Statistical Transformations

  • Many visualisations use statistical summaries of the raw data.
  • Examples of statistical transformations include the following:
    • quartiles of box plots
    • means
    • error bars/confidence intervals
    • binning in histograms and dot plots
    • tallies, counts, proportions, percentages in bar charts
    • lines of best fit for linear regression.
  • Statistical transformations are the reason why statistics is so important for data visualisation.
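
A few of these transformations can be sketched with Python's standard library. The values below are the non-missing Ozone readings from the first ten rows of the air quality data; the bin width of 10 is an arbitrary assumption. Note that quartile definitions differ between software packages, so these quartiles may not match those drawn by a particular box plot function.

```python
import statistics
from collections import Counter

# Non-missing Ozone values from rows 1 to 10 of the air quality data.
ozone = [41, 36, 12, 18, 28, 23, 19, 8]

# Mean: a single summary value, e.g. for a bar or point with error bars.
mean_ozone = statistics.mean(ozone)

# Quartiles: the three cut points summarised by a box plot.
quartiles = statistics.quantiles(ozone, n=4)

# Binning: counts per interval, as drawn by a histogram.
bin_width = 10  # assumed width, chosen for illustration
bin_counts = Counter((value // bin_width) * bin_width for value in ozone)

print("mean:", mean_ozone)
print("quartiles:", quartiles)
for lower in sorted(bin_counts):
    print(f"[{lower}, {lower + bin_width}): {bin_counts[lower]}")
```

In each case the geometric object is drawn from the summary, not from the raw observations.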

Statistical Transformations Cont.

  • The smoothed trend line in the Air Quality visualisation was estimated using a non-parametric, locally weighted regression model
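
The idea behind locally weighted regression can be sketched as follows: for each x value, fit a straight line by weighted least squares, giving nearby observations more weight than distant ones. This is a simplified Python illustration, not the routine that produced the trend line in the slides. It assumes a local straight-line fit, tricube weights and evaluation at the observed x values only; the function names and the `span` default are hypothetical.

```python
import math

def tricube(u):
    """Tricube weight: 1 at u = 0, falling smoothly to 0 once |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def local_fit(xs, ys, x0, span=0.75):
    """Fit a weighted straight line around x0 and return its value at x0."""
    # Window: the fraction `span` of the observations nearest to x0.
    k = min(len(xs), max(2, math.ceil(span * len(xs))))
    bandwidth = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
    weights = [tricube((x - x0) / bandwidth) for x in xs]

    # Weighted least squares for a line, centred on the weighted means.
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)

def smooth(xs, ys, span=0.75):
    """Evaluate the local fit at every observed x value."""
    return [local_fit(xs, ys, x0, span) for x0 in xs]

# Days in May 1973 with a non-missing Ozone value (rows 1 to 9 of the data).
days = [1, 2, 3, 4, 6, 7, 8, 9]
ozone = [41, 36, 12, 18, 28, 23, 19, 8]
trend = smooth(days, ozone)
for day, value in zip(days, trend):
    print(f"day {day}: {value:.1f}")
```

A smaller `span` follows the data more closely, while a larger one gives a smoother line.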

Scales

  • Scales are used to control the mapping between a variable and an aesthetic.
  • We are all familiar with x and y axis position aesthetics and scales, but other aesthetics and scales such as colour, shape and size can be mapped to a variable.
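
A continuous scale can be thought of as a function from data values to aesthetic values. The sketch below is a minimal Python illustration of a linear scale. The function name `linear_scale` and the output ranges (400 pixels, point sizes 1 to 6) are assumptions; the input ranges are the minimum and maximum of Ozone and Temp in the air quality data.

```python
def linear_scale(domain, output_range):
    """Build a function mapping data values in `domain` onto `output_range`."""
    d_min, d_max = domain
    r_min, r_max = output_range

    def scale(value):
        fraction = (value - d_min) / (d_max - d_min)
        return r_min + fraction * (r_max - r_min)

    return scale

# Position scale: Ozone (1 to 168) mapped to a vertical position in pixels.
y_scale = linear_scale((1, 168), (0, 400))

# Size scale: Temp (56 to 97) mapped to a point size.
size_scale = linear_scale((56, 97), (1.0, 6.0))

print(y_scale(41))     # vertical position for Ozone = 41
print(size_scale(67))  # point size for Temp = 67
```

The same construction works for any aesthetic whose values can be interpolated, which is why position, size and continuous colour scales behave alike.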

Coordinate System

  • A coordinate system determines how position aesthetics are mapped onto the plane of the plot.
  • Common systems include the following:

    • Cartesian
    • Transformed
    • Polar
    • Map
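
The sketch below shows how the same positions can be drawn under two of these systems. It is a minimal Python illustration; the function names and the example values are hypothetical.

```python
import math

def polar_to_cartesian(angle, radius):
    """Convert a polar position (angle in radians, radius) to Cartesian (x, y)."""
    return (radius * math.cos(angle), radius * math.sin(angle))

def log10_transform(x, y):
    """A transformed coordinate system: log10 applied to the y axis only."""
    return (x, math.log10(y))

# Four example values placed at equal angles around a circle,
# with the value itself used as the radius.
values = [10, 20, 30, 40]
for index, value in enumerate(values):
    angle = 2 * math.pi * index / len(values)
    x, y = polar_to_cartesian(angle, value)
    print(f"value {value}: x = {x:.1f}, y = {y:.1f}")

# On a log10 axis, equal ratios become equal distances.
print(log10_transform(1, 10), log10_transform(2, 100))
```

Only the coordinate system changes here; the data and the mappings stay the same.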

Faceting

  • Faceting is a powerful way to break a data visualisation into small multiples. This process is also known as latticing or trellising.
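
Faceting amounts to splitting the data by a variable and drawing the same plot for each subset. The sketch below is a minimal Python illustration using six rows from the air quality data, split by Month; the function name `facet` is hypothetical.

```python
from collections import defaultdict

# Six rows from the air quality data (selected columns only).
rows = [
    {"Month": 5, "Day": 1, "Ozone": 41},
    {"Month": 5, "Day": 2, "Ozone": 36},
    {"Month": 6, "Day": 7, "Ozone": 29},
    {"Month": 6, "Day": 9, "Ozone": 71},
    {"Month": 7, "Day": 1, "Ozone": 135},
    {"Month": 7, "Day": 2, "Ozone": 49},
]

def facet(rows, variable):
    """Split rows into one panel per distinct value of `variable`."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[variable]].append(row)
    return dict(panels)

panels = facet(rows, "Month")
for month in sorted(panels):
    ozone = [row["Ozone"] for row in panels[month]]
    print(f"Month {month}: {ozone}")
```

Each panel would then be drawn with the same mappings, geometric objects and scales, which is what makes small multiples easy to compare.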

Design Process

Kirk’s (2012) visual design process…

  • Objectives:
    • Strive for form and function
    • Justify everything you do
    • Keep it accessible and intuitive
    • Avoid deceiving the viewer
  • Purpose - Why are you visualising the data?
  • Editorial Focus - What is the narrative?
  • Options - What decisions need to be made?
  • Methods - What method should you use? Existing or new?
  • Construction - Getting the job done. What technology should I use?
  • Evaluation - Did you achieve your goals?

Trifecta Check-up

Kaiser Fung, author of Junk Charts, provides a very simple and powerful framework, called the Trifecta Check-up, to use when evaluating a data visualisation.

Trifecta Check-up by Kaiser Fung

Critique

  • Critique the following data visualisation by Greg Laden according to the Trifecta check-up

Arctic Death Spiral

  • You can read Kaiser’s review here.

Critiques

According to Kaiser’s Trifecta check-up, there are eight possible critiques for a data visualisation.

Type      Description
Q         Poor question
D         Poor data
V         Poor visuals
QD        Poor question and problematic data, but visuals are OK
QV        Data are good, but question and visuals are out of sync
DV        Good question, but issues with data and visuals
QDV       The data visualisation fails everything
Trifecta  Q, D and V are in sync. Good data visualisation
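
The eight types follow from three yes/no judgements, one each for the question, the data and the visuals. The sketch below enumerates them in Python; the function name `trifecta_type` is hypothetical and the code is only an illustration of the table above, not part of Kaiser Fung's framework.

```python
from itertools import product

def trifecta_type(question_ok, data_ok, visual_ok):
    """Return the critique type for the three Trifecta judgements."""
    code = ""
    if not question_ok:
        code += "Q"
    if not data_ok:
        code += "D"
    if not visual_ok:
        code += "V"
    return code or "Trifecta"

# Three yes/no judgements give 2 x 2 x 2 = 8 possible outcomes.
all_types = [trifecta_type(q, d, v) for q, d, v in product([True, False], repeat=3)]
print(all_types)
```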

Spit and Polish

  • Don’t be that visualisation…

The US is an outlier - Vox

Spit and Polish Cont.

  • Always check for common issues including the following:

    • Check for human errors!
    • Do you need a title?
    • Is your text readable?
    • Have you checked spelling (and grammar)?
    • Do you need to include a data source?
    • Have you included descriptive labels?
    • Is your visualisation clutter-free?
    • Are you using colour appropriately?
    • Is the visualisation the right size?

Quick Quiz 1

References


Bakker, A., P. Kent, J. Derry, R. Noss, and C. Hoyles. 2008. “Statistical inference at work: Statistical process control as an example.” Statistics Education Research Journal 7 (2): 130–45. http://www.stat.auckland.ac.nz/serj.

Buja, A., D. Cook, H. Hofmann, M. Lawrence, E.-K. Lee, D. F. Swayne, and H. Wickham. 2009. “Statistical inference for exploratory data analysis and model diagnostics.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906): 4361–83. doi:10.1098/rsta.2009.0120.

Kirk, A. 2012. Data visualization: a successful design process. Birmingham, UK: Packt Publishing Ltd.

Paparistodemou, E., and M. Meletiou-Mavrotheris. 2008. “Developing young students’ informal inference skills in data analysis.” Statistics Education Research Journal 7 (2): 83–106. http://www.stat.auckland.ac.nz/serj.

Tal, A., and B. Wansink. 2016. “Blinded with science: Trivial graphs and formulas increase ad persuasiveness and belief in product efficacy.” Public Understanding of Science 25 (1): 117–25. doi:10.1177/0963662514549688.

Tukey, J. W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley.

Wickham, H. 2010. “A layered grammar of graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. doi:10.1198/jcgs.2009.07098.

Wilkinson, L. 2005. The Grammar of Graphics. Statistics and Computing. New York: Springer-Verlag. doi:10.1007/0-387-28695-0.