PSYC40940: Foundations of Data Visualisation and ggplot2

General outline for weeks 2 – 5

Week 2: Principles of data visualisation
Week 3: Grammar of graphics; aesthetics and attributes
Week 4: Major visualisation tools
Week 5: Customising visualisations (scales, themes, and labels)

Overview

Motivation / Relevance
ggplot2 teaser
Principles of data visualisation

Download the “Exercises” folder from the NOW Learning Room (week 2) and move the files into your R project folder.

The data

Picture naming task
Written and spoken responses
Manipulation: Prior familiarisation with most-common name for a picture

Exploring data

Rows: 8,670
Columns: 17
$ ppt_id            <dbl> 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, …
$ ppt_vocab         <dbl> 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.…
$ image_id          <chr> "almond.jpg", "ambulance.jpg", "aubergine.jpg", "austronaut.jpg", "bagpipes.jpg", "basketbal…
$ resp              <chr> "almond", "ambulance", "aubergine", "austronaut", "bagpipes", "basketball", "binoculars", "b…
$ name_familiarised <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, …
$ modality          <chr> "speech", "speech", "speech", "speech", "speech", "speech", "speech", "speech", "speech", "s…
$ rt                <dbl> 1312, 1057, 967, 1148, 1100, 1295, 1205, 2470, 1292, 1012, 1476, 2241, 3120, 1401, 1232, 119…
$ dur               <dbl> 649, 544, 800, 680, 587, 452, 693, 618, 721, 740, 407, 873, 3510, 708, 691, 738, 287, 542, 9…
$ spell_div         <dbl> 0.52909473, 0.39199841, 1.04293875, 1.23439860, 0.88868655, 0.44764042, 1.16363177, 0.416013…
$ name_div          <dbl> 3.7269149, 0.1407271, 3.7693227, 2.9867884, 0.2224148, 0.7472810, 0.2578952, 2.8129985, 4.49…
$ aoa               <dbl> 7.67, 6.16, NA, 6.28, NA, 5.30, 6.79, 4.63, 8.72, 5.20, 9.76, 8.22, 12.25, 5.84, 9.61, 6.18,…
$ freq              <dbl> 1.0769429, 4.8264469, NA, NA, 1.9932336, 2.7816910, 2.6863808, 5.5655792, 2.3297058, 3.51928…
$ nsyl              <dbl> 2, 3, 4, 3, 3, 3, 4, 2, 2, 3, 2, 4, 3, 2, 2, 4, 1, 3, 2, 3, 2, 1, 1, 2, 2, 3, 2, 2, 3, 3, 2,…
$ nchar             <dbl> 6, 9, 9, 9, 8, 10, 10, 7, 7, 8, 4, 10, 8, 8, 8, 11, 5, 9, 9, 10, 7, 6, 3, 7, 6, 8, 7, 7, 8, …
$ nphon             <dbl> 5, 9, NA, NA, 7, 9, 10, 6, 5, 7, 3, 10, 8, 5, 7, 8, 3, 8, 6, 8, 4, NA, 3, 5, 5, NA, 6, 6, 7,…
$ cat               <chr> "is natural", "is manmade", "is natural", NA, "is manmade", "is manmade", "is manmade", "is …
$ semcat            <dbl> -0.12349131, -0.44492604, -0.72806943, NA, 0.05315648, 0.48631844, 1.55619767, 0.33721132, 1…

Exploring data

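# One row per participant-picture combination
d_ppt_pic <- distinct(d_spellname, ppt_id, image_id)
# Number of distinct pictures named by each participant
count(d_ppt_pic, ppt_id)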

# A tibble: 72 × 2
   ppt_id     n
    <dbl> <int>
 1      1   141
 2      2   133
 3      3   142
 4      4   139
 5      5   136
 6      6   133
 7      7   137
 8      8   141
 9      9   142
10     10   136
# ℹ 62 more rows
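
To see what the two calls above do, here is a minimal sketch on a made-up tibble. The object `toy` and its values are hypothetical and are not the course data. `distinct()` keeps one row per participant-picture combination, and `count()` then tallies how many pictures each participant has.

```r
library(dplyr)

# Hypothetical toy data: participant 1 has two rows for "cat.jpg"
toy <- tibble(
  ppt_id   = c(1, 1, 1, 2, 2),
  image_id = c("cat.jpg", "cat.jpg", "dog.jpg", "cat.jpg", "dog.jpg")
)

# Keep one row per unique participant-picture pair (drops the repeated row)
toy_ppt_pic <- distinct(toy, ppt_id, image_id)

# Count the number of distinct pictures per participant
n_pics <- count(toy_ppt_pic, ppt_id)
n_pics
```

Without the `distinct()` step, `count()` would tally rows rather than pictures, so repeated trials for the same picture would inflate the count.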

What is data visualisation?

Graphical representation of data
Graphical data analysis
What do we want to know?
What do we want to communicate?
What do people take away from your visualisation?
Exploratory plots (for small specialist audience)
Explanatory plots: inform and persuade wider audience

Building up a plot

# Mean reaction time per participant, keeping vocabulary score and modality
d_vocab <- summarise(d_spellname, 
                      rt = mean(rt),
                      .by = c(ppt_id, ppt_vocab, modality)) 
glimpse(d_vocab, width = 120)

Rows: 72
Columns: 4
$ ppt_id    <dbl> 40, 41, 42, 43, 44, 45, 46, 47, 48, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 62, 63, 64, 65, 66, …
$ ppt_vocab <dbl> 0.9500, 1.0000, 0.9250, 0.9000, 0.9750, 1.0000, 0.9250, 0.9625, 0.9750, 0.8250, 0.9375, 0.8875, 0.97…
$ modality  <chr> "speech", "speech", "speech", "speech", "speech", "speech", "speech", "speech", "speech", "speech", …
$ rt        <dbl> 1358.0756, 1273.9292, 1392.7664, 740.0988, 1188.0392, 1341.9835, 1519.9000, 1512.6337, 1479.2810, 12…
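
The aggregation step above can be tried on a small made-up example. This is a sketch only: `toy` and its values (including the modality label "writing") are hypothetical, and `.by` assumes dplyr 1.1.0 or later. The `na.rm = TRUE` argument is an addition that is not in the slide code; it is only needed if `rt` contains missing values.

```r
library(dplyr)

# Hypothetical toy data: two participants, two trials each, one missing RT
toy <- tibble(
  ppt_id   = c(1, 1, 2, 2),
  modality = c("speech", "speech", "writing", "writing"),
  rt       = c(1000, 1200, 900, NA)
)

# One mean RT per participant; modality is carried along as a grouping column
toy_means <- summarise(toy,
                       rt = mean(rt, na.rm = TRUE),  # without na.rm, participant 2 would be NA
                       .by = c(ppt_id, modality))
toy_means
```

Listing `modality` in `.by` does not create extra rows here because each participant has only one modality; it simply keeps the column in the summary so it can be mapped to an aesthetic later.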

Building up a plot

ggplot(data = d_vocab, 
       mapping = aes(x = ppt_vocab, 
                     y = rt))

Building up a plot

ggplot(data = d_vocab, 
       mapping = aes(x = ppt_vocab, 
                     y = rt))  +
  geom_point()

Building up a plot

ggplot(data = d_vocab, 
       mapping = aes(x = ppt_vocab, 
                     y = rt)) +
  geom_point() +
  stat_smooth(method = "lm")

Building up a plot

ggplot(data = d_vocab, 
       mapping = aes(x = ppt_vocab, 
                     y = rt,
                     colour = modality)) +
  geom_point() +
  stat_smooth(method = "lm")

Building up a plot

ggplot(data = d_vocab, 
       mapping = aes(x = ppt_vocab, 
                     y = rt, 
                     colour = modality,
                     linetype = modality))  +
  geom_point(alpha = .25) +
  stat_smooth(method = "lm", se = TRUE, fullrange = TRUE) +
  scale_y_continuous(labels = scales::comma) +
  ggthemes::theme_clean() +
  ggthemes::scale_color_colorblind() +
  labs(y = "Average reaction time (in msecs)", 
       x = "Vocabulary score",
       colour = "Response modality",
       linetype = "Response modality") +
  theme(legend.position = "top",
        legend.justification = "right",
        axis.title = element_text(hjust = 0))

Creating an exploratory plot

Open RMarkdown document 1_scatterplots.Rmd

Why data visualisation?

“[data visualization] forces us to notice what we never expected to see.” (Tukey 1977)

exploring structures in the data
relationship between variables
distribution of data
develop an understanding of patterns (beyond means and SDs)
selecting appropriate stats
prevent wrong conclusions about data / theory

Anscombe’s quartet (Anscombe 1973)

Data set	Mean (x)	SD (x)	Mean (y)	SD (y)	Correlation (x, y)	Intercept (y ~ x)	Slope (y ~ x)
1	9	3.32	7.5	2.03	0.82	3	0.5
2	9	3.32	7.5	2.03	0.82	3	0.5
3	9	3.32	7.5	2.03	0.82	3	0.5
4	9	3.32	7.5	2.03	0.82	3	0.5
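
The summary statistics in this table can be reproduced from the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`). The sketch below is one way to do it; the object name `quartet_stats` and the column names are illustrative choices.

```r
# Built-in data set: 11 observations for each of the four x-y pairs
data(anscombe)

# Compute the same summaries for each of the four data sets
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # simple linear regression of y on x
  data.frame(
    data_set    = i,
    mean_x      = mean(x),
    sd_x        = sd(x),
    mean_y      = mean(y),
    sd_y        = sd(y),
    correlation = cor(x, y),
    intercept   = unname(coef(fit)[1]),
    slope       = unname(coef(fit)[2])
  )
}))

# Rounded to two decimals, the four rows are indistinguishable
round(quartet_stats, 2)
```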

Anscombe’s quartet
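
Plotting the four data sets side by side shows how different they are despite the near-identical summary statistics. The following is a sketch using the built-in `anscombe` data frame; the reshaping step and the object names `quartet_long` and `p_quartet` are illustrative choices.

```r
library(ggplot2)

# Reshape the wide built-in data (x1..x4, y1..y4) into long format
quartet_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    data_set = paste("Data set", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# One panel per data set, each with the same fitted regression line
p_quartet <- ggplot(quartet_long, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, fullrange = TRUE) +
  facet_wrap(~ data_set)
p_quartet
```

The regression line is almost the same in every panel, but only in the first panel does a straight line look like a reasonable description of the points.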

The datasaurus dozen

Matejka and Fitzmaurice (2017): see link

Open RMarkdown document: 2_tdd.Rmd

Principles of data visualisation

No “one-size-fits-all” method
Some methods are more informative than others
Maximise what we can learn from data
Going beyond summary statistics
Descriptive summary statistics condense what we want to communicate but may conceal / obscure important patterns
Visualisation helps us to understand patterns, structures, relationships
Prevent wrong conclusions about data / theory

Principles of data visualisation

Hartwig and Dearing (1979):

Skepticism: any visualization might obscure or misrepresent data
Openness: there might be patterns and structures that we were not expecting

Tufte (1983):

Above all else show the data
Avoid distorting what the data have to say
Present many numbers in a small space
Encourage the eye to compare different pieces of data
Reveal data at several levels of detail, from broad overview to fine structures

6 plots of the same data

Obscuring data and misleading information

Open RMarkdown document 3_scatterplots.Rmd

Principles of data visualisation

Edward Tufte’s principles emphasise clarity, precision, and efficiency in the visual display of information. Tufte’s principles guide us to create visualizations that are:

Clear
Honest
Efficient
Insightful

Principles of data visualisation

Principle 1: Show the Data

Focus on the data itself
Avoid unnecessary decoration
Let the data tell the story

Principle 2: Maximize Data-Ink Ratio

Minimize non-essential elements
Every visual element should serve a purpose

Principle 3: Avoid Chartjunk

Eliminate decorative elements that obscure the message
Simplicity and clarity are key
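
In ggplot2, principles 1 to 3 often come down to plotting the raw observations and trimming theme elements that carry no information. The sketch below uses the built-in `mtcars` data as a stand-in; which elements count as non-essential is a judgement call, and the choices here are only one possibility.

```r
library(ggplot2)

# Show the data: one point per observation, default theme
p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Reduce non-data ink: lighter theme, no minor gridlines, informative labels
p_lean <- p_base +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p_lean
```

Both plots show exactly the same points; only the amount of ink spent on things other than data differs.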

Principles of data visualisation

Principle 4: Use Small Multiples

Repeat charts across categories for comparison
Supports pattern recognition

Principle 5: Encourage Visual Comparisons

Design graphics to make comparisons easy
Align scales and axes
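
Small multiples correspond to facets in ggplot2. The sketch below uses the built-in `mtcars` data as a stand-in and splits one scatterplot into a panel per number of cylinders. By default `facet_wrap()` uses fixed scales, so all panels share the same x and y axes, which is what makes comparison across panels easy.

```r
library(ggplot2)

# One panel per value of cyl; scales = "fixed" (the default) aligns the axes
p_multiples <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, labeller = label_both) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p_multiples
```

Setting `scales = "free"` would let each panel choose its own axis range, which can reveal within-panel detail but makes comparisons between panels harder.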

Principles of data visualisation

Principle 6: Integrate Words, Numbers, and Images

Labels should be clear and close to the data
Avoid legends that require back-and-forth viewing
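
One way to follow this principle in ggplot2 is to write the label next to each line instead of using a legend. The sketch below uses the built-in `Orange` data (growth of five orange trees) as a stand-in; the nudge and axis expansion values are arbitrary choices to make room for the text.

```r
library(ggplot2)

# Plain data frame copy of the built-in data
orange <- as.data.frame(Orange)

# Last observation of each tree: this is where its label will be placed
last_obs <- do.call(rbind, lapply(split(orange, orange$Tree), function(d) {
  d[which.max(d$age), ]
}))

p_direct <- ggplot(orange, aes(x = age, y = circumference, group = Tree)) +
  geom_line() +
  # Label each line at its right-hand end instead of using a legend
  geom_text(data = last_obs, aes(label = paste("Tree", Tree)),
            hjust = 0, nudge_x = 20) +
  # Extra space on the right so the labels are not cut off
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.15))) +
  labs(x = "Age (days)", y = "Circumference (mm)")
p_direct
```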

Principles of data visualisation

Principle 7: Content Over Decoration

Focus on substance, not style
The story should come from the data

Principles of data visualisation

Principle 8: Use Multivariate Displays

Show multiple variables when appropriate
Balance complexity with readability
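
In ggplot2, additional variables are shown by mapping them to further aesthetics. The sketch below uses the built-in `mtcars` data as a stand-in and displays four variables in a single scatterplot; whether four is still readable depends on the data and the audience.

```r
library(ggplot2)

# x and y positions plus colour (cylinders) and point size (horsepower)
p_multi <- ggplot(mtcars, aes(x = wt, y = mpg,
                              colour = factor(cyl),
                              size = hp)) +
  geom_point(alpha = 0.6) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders", size = "Horsepower")
p_multi
```

Each added aesthetic costs the reader some effort, so a variable that does not help answer the question at hand is usually better left out or moved to facets.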

Principles of data visualisation

Principle 9: Avoid Distorting the Data

Maintain proportionality and scale
Avoid misleading visuals
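
A common way to lose proportionality is a bar chart whose y-axis does not start at zero. The sketch below uses two made-up condition means (hypothetical values, not from the course data) and compares the true ratio of the two means with the ratio of the visible bar heights after truncating the axis.

```r
library(ggplot2)

# Hypothetical mean reaction times (ms) for two conditions
toy_rt <- data.frame(condition = c("A", "B"), rt = c(980, 1020))

# Proportional version: bars start at zero
p_honest <- ggplot(toy_rt, aes(x = condition, y = rt)) +
  geom_col() +
  labs(x = "Condition", y = "Mean reaction time (ms)")

# Distorted version: zooming in hides most of each bar
p_truncated <- p_honest +
  coord_cartesian(ylim = c(950, 1030))

# Ratio of the two means in the data
true_ratio <- 1020 / 980

# Ratio of the visible bar heights when the axis starts at 950
drawn_ratio <- (1020 - 950) / (980 - 950)

c(true_ratio = true_ratio, drawn_ratio = drawn_ratio)
```

In the data, condition B is about 4% slower than A, but in the truncated plot its bar is more than twice as tall. If the small difference matters, points with an interval are usually a better choice than truncated bars.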

Principles of data visualisation

Clarity: Avoid clutter; make the message obvious
Accuracy: Represent data truthfully
Efficiency: Use the right chart for the right data
Consistency: Use consistent scales, colors, and labels
Accessibility: Consider colorblind-friendly palettes and readable fonts
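
For the accessibility point, one option that needs no extra packages is the Okabe-Ito palette that base R provides through `palette.colors()` (R 4.0.0 or later). The sketch below uses the built-in `mtcars` data as a stand-in and, as an additional design choice, maps the grouping variable to shape as well as colour, so the groups remain distinguishable without colour.

```r
library(ggplot2)

# Colourblind-friendly Okabe-Ito palette from base R (nine colours, first is black)
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))

p_access <- ggplot(mtcars, aes(x = wt, y = mpg,
                               colour = factor(cyl),
                               shape = factor(cyl))) +
  geom_point(size = 3) +
  # Use three of the palette colours, skipping black
  scale_colour_manual(values = okabe_ito[2:4]) +
  # Larger base font size for readability
  theme_minimal(base_size = 14) +
  # Same title for both aesthetics so they share one legend
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders", shape = "Cylinders")
p_access
```

`ggthemes::scale_color_colorblind()` or `scale_colour_viridis_d()` are alternatives that serve the same purpose.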

What’s wrong with these?

See Gong and Liu (2022) (now retracted), Rubiah et al. (2024), and Ke (2024) for examples.

Check Figures in van Lieburg et al. (2023); code and data are HERE.

Reading

Next week we will continue with data visualisation in ggplot2. For fundamentals of data visualisation in ggplot2, see Andrews (2021), Wickham (2010, 2016), and Wickham and Grolemund (2016).

And for principles of data visualisation see this book: Tufte (2001)

Homework

Identify a dataset for the formative assessment.

On Teams, share a poor data visualisation (from a published research paper, news website, social media, etc.) and explain why it is poor. Which principle(s) of data visualisation does it violate?

References

Andrews, Mark. 2021. Doing Data Science in R: An Introduction for Social Scientists. SAGE Publications Ltd.

Anscombe, Francis J. 1973. “Graphs in Statistical Analysis.” The American Statistician 27: 17–21.

Gong, Ruyao, and Binghong Liu. 2022. “[Retracted] Monitoring of Sports Health Indicators Based on Wearable Nanobiosensors.” Advances in Materials Science and Engineering 2022 (1): 3802603. https://doi.org/10.1155/2022/3802603.

Hartwig, Frederick, and Brian E. Dearing. 1979. Exploratory Data Analysis. 16. Sage.

Ke, Y. 2024. “Examining Simultaneous Pausing on the Cognitive Writing Process: A Micro-Formative Writing Assessment.” Current Psychology 43 (1): 39–50.

Matejka, Justin, and George Fitzmaurice. 2017. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics Through Simulated Annealing.” In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 1290–94.

Rubiah, R., I. N. S. Degeng, P. Setyosari, and D. Kuswandi. 2024. “The Effect of Problem-Based Learning Assisted with Concept Mapping Founded on Cognitive Style on the Creativity of Writing Exposition Text.” Creativity Studies 17 (2): 419–34.

Tufte, Edward R. 1983. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.

———. 2001. The Visual Display of Quantitative Information. 2nd ed. Cheshire, CT: Graphics Press.

Tukey, John W. 1977. Exploratory Data Analysis. Vol. 2.

van Lieburg, R., E. Sijyeniyo, R. J. Hartsuiker, and Sarah Bernolet. 2023. “The Development of Abstract Syntactic Representations in Beginning L2 Learners of Dutch.” Journal of Cultural Cognitive Science 7: 289–309. https://doi.org/10.1007/s41809-023-00131-5.

Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28.

———. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.