Data visualization is a field of statistics that encompasses the creation of visual displays of data, such as graphs, and the efficient communication of that data through visual techniques. One of the most important aspects of data visualization is the manipulation of color. Color plays a role in much of everyday life, and colors often carry meaning, whether that is a red light telling a driver to stop or the colors on a flag symbolizing a specific country. Color is aesthetically pleasing, but it is also informative. In data visualization in particular, color is strategically embedded in graphical displays to communicate the data at hand. Often, the color choice will elicit emotions from the viewer. For example, according to a staff writer at Independence University, blue evokes calm and has a serene feel to it. Red, on the other hand, is aggressive and demands attention. These emotions can ultimately affect a viewer’s decision. Thus, color plays a vital role in professions that rely heavily on decision-making based on data visualization. These decisions can heavily impact society, clients, or patients, so it is crucial that the professional interpret the visual image in a sound manner.

Markus Christen, Peter Brugger and Sara Irina Fabrikant of the University of Zurich conducted an experiment to determine the impact of color on the emotions of professionals who make decisions based on data-driven images. There were three groups in total: neuroimaging experts who interpret neuroimages, geovisualization experts who interpret geographic images, and lay individuals who are in neither profession. Neuroimaging scans are essentially images of a patient’s brain, and they are crucial for diagnosing and treating patients. Geographic images, in contrast, are satellite images of a specific geographic region. They are important because they show impacts such as climate change in specific areas and lead geoscientists to make geospatially driven decisions for that region. The researchers assessed the individuals’ trust in the information portrayed by the visualization and their actual interpretation of the data in the visualization. Ultimately, the researchers hoped to determine the impact of color on the responses of professionals and lay people when interpreting images both inside and outside their own field.

The researchers derived four hypotheses for their research. First, they hypothesized that experts in their own domain have the lowest variability in both trust and interpretation. In other words, they believed that when experts interpreted images in their own field, their answers would not vary much because they would have the knowledge to interpret and trust the image in similar ways. Second, they hypothesized that response variability would decrease with increasing experience in visualization practices. Similar to the first hypothesis, those experienced in a particular field would give similar responses. Third, it was hypothesized that domain experts’ opinions within their own data context would not influence response variability. This hypothesis relates to the context behind an image, such as a neuroimaging expert’s opinion on whether a patient being brain dead equates to the patient’s actual death. Lastly, the researchers hypothesized that geovis experts would show greater awareness of design principles than neuroimaging experts.

To begin the study, the researchers emailed published authors in the neuroimaging field and in the geographic information visualization (geovis) field. Those approached completed an expert survey they were sent via email. This ensured credibility and expertise amongst the professionals chosen. For the lay individuals, the researchers approached residents of the United States on Amazon Mechanical Turk. In total, four hundred and eighty-six individuals participated in the study. Next, the neuroimaging and geovis groups were pair-matched based on years of experience, age, gender, whether the participant is a consumer or producer of the specific images, and whether the participant’s lab produces images. Each lay participant was placed into a triad with a neuroimaging expert and a geovis expert, matched on gender and age.

There were five images per domain. For the neuroimaging domain, five PET images from the Coma Science Group at the University Hospital Liege were color manipulated with rainbow color scales and black backgrounds. The images portrayed the brain of a healthy individual, a locked-in person, an individual in a minimally conscious state, an individual in a vegetative state and a brain-dead individual. For the geographic visualization domain, the researchers simulated five future land use scenarios for the Netherlands in 2030 containing multiple land use categories, such as residential, industrial, recreational and nature. They simulated these hypothetical situations using ESRI’s ArcMap 10.2 geographical information system. The researchers used this system to ensure that the pixel values corresponded to the color scale so that the image was clear and distinguishable.

library(ggplot2)
library(patchwork)

# Reshape the volcano elevation matrix into long format (Var1, Var2, value)
surface <- reshape2::melt(volcano)

colors2 <- c("#FF0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000066", "Black")
g41 <- ggplot(surface) +
  geom_tile(aes(Var1, Var2, fill = value)) +
  # rev() puts black at the low end and red at the high end of the scale
  scale_fill_gradientn(colours = rev(colors2)) +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.title.y = element_blank(), axis.text.y = element_blank()) +
  ggtitle("Rainbow Color Scale")

#g4 <- ggplot(surface) +
  #geom_tile( aes(Var1, Var2, fill = value)) +#, color = "white", alpha = 0.5) + 
  #scale_fill_distiller(palette="Spectral", 
                       #limits = c(94, 200)) + 
  #theme(axis.title.x = element_blank(), axis.text.x = element_blank(), 
        #axis.title.y = element_blank(), axis.text.y = element_blank())

g5 <- ggplot(surface) +
  geom_tile(aes(Var1, Var2, fill = value)) +
  scale_fill_gradient2(low = "black", mid = "red",
                       high = "yellow", midpoint = 143) +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.title.y = element_blank(), axis.text.y = element_blank()) +
  ggtitle("Heated Body Scale")

g6 <- ggplot(surface) +
  geom_tile(aes(Var1, Var2, fill = value)) +
  scale_fill_distiller(palette = "Greens") +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.title.y = element_blank(), axis.text.y = element_blank()) +
  ggtitle("Univariate Color Intensity Change")

g7 <- ggplot(surface) +
  geom_tile(aes(Var1, Var2, fill = value)) +
  scale_fill_distiller(palette = "RdBu") +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.title.y = element_blank(), axis.text.y = element_blank()) +
  ggtitle("Bivariate Blue-Red-White")

g8 <- g41 + g5 + g6 + g7
g8

Creation of the Four Scales Used in the Experiment

The researchers used four different color scales for their experiment: a rainbow color scale, a heated body scale, a univariate color intensity change and a bivariate blue-red-white. They manipulated the neuroimages or geographical map images with these different scales and collected data based on the change in color for the respective images. I created a visualization of the four scales to illustrate them in a hypothetical situation. The data set I used, volcano, ships with R and contains topographic information about the volcano Mt Eden. Because the research paper is focused on geovis experts and neuroimaging experts, I decided to choose a data set that was geared more towards one of the fields in question, and that happened to be the geovis field. Although it is not an exact map image the participants were asked to examine, this visualization can give a similar feel for what the individuals looked at.

The four graphs are tile maps, each filled with a specific color gradient that corresponds to one of the scales the researchers used. There is a legend on the side of each map to indicate the gradient used on that map. The maps were strategically placed next to one another in two rows and two columns so the viewer could easily compare the scales. Additionally, the viewer would be able to experience an approximation of the scenarios the participants analyzed. One critique that may be made of my visualization is that the axis titles and scales were removed. However, I removed them because I did not want to take away from the color scale. The variables used in the data set were named V1-V50, giving little to no information. I therefore thought it was best to leave them out and simply have a tile map with topographic information that focuses on the color scales and not the content, especially because this data set does not relate to the actual experiment.

To conduct the experiment, the neuroimaging experts performed the trials containing neuroimages first, then proceeded to complete the trials for geographical map images. Conversely, geovisualization experts completed the trials for geographical map images first, followed by the neuroimages. The lay participants received images in a random sequence. In the image portion of the experiment, the participants were first asked a question regarding their attitude towards the context of the visualization. In the neuroimaging domain, participants were asked whether they agreed, disagreed or had no opinion on the statement: “The state of brain death equals the death of a person” (Christen, Brugger, Fabrikant). In the geovis domain, the participants were given the same options, but on the statement: “The climate change we experience now is caused by human activities” (Christen, Brugger, Fabrikant).

Following the investigation of attitudes, trust and data interpretation were examined. For trust, the participants had to rate how much they trusted the image at hand to provide sound evidence in support of the statement given. They rated the image on an 8-point Likert scale. For data interpretation, the participants were asked to rank three images of intermediate brain states in between images of two extreme states, such as a healthy brain and the brain of a brain-dead individual. The same scenarios were given for geovis images as well, but with geographical questions regarding maps.

Following the experiment, participants were prompted to answer questions on their experience with creating visualizations in their area of expertise. Lay individuals answered questions regarding their experience with general scientific visualizations. The participants were questioned on their training, current occupation, years of experience in the domain, professional relationship to image creation and usage, size of the lab they work in, whether or not the lab produces images and whether the person uses or creates images. Ultimately, this section of questions established expertise and credibility following the experiment. Additionally, the participants were shown the four color scales that were used in the experiment. They were asked which of the four scales should be used in their specific field to show an increase in a statistical parameter of any type. Participants were also asked which one of the color scales should serve as a standard in their respective field.

Additionally, the participants who indicated they were involved in the production of images for their field were asked questions regarding the creation of images. The image creators were asked which software their specific professions and labs use, the types of images they created and the purposes for each, which techniques they used, how the software enhances the image and what type of post-processing of the image takes place, if applicable.

These questions were most likely asked by the researchers to determine whether the different software and techniques used could affect the visual portrayal of data and, in turn, the emotions evoked by the image. It is often difficult to produce similar results on different software platforms, especially when one has more advanced image enhancement. Enhanced images are often easier to distinguish, and they are vital to professional imaging in order to ensure the interpreter can view the image clearly. The expert interpreting the image needs a clear image with enhanced colors in order to make a fair judgement or decision based on the visualization. Therefore, the researchers had a valid point when they asked producers these specific questions. Different techniques and software could also affect the producers’ own responses. Familiarity with their own work in creating data visualizations, and with the colors they often use, could potentially decrease their trust in the images shown in the experiment.

After the completion of the experiment, the results were compiled. The researchers computed trust variability, interpretation variability and an overall response variability. The overall response variability was computed by using the sum of the trust and interpretation variability. However, the three different variabilities were computed separately for each data context, neuroimaging and geovisualization imaging.
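The summation described above can be sketched in R as follows. This is only an illustration: the per-participant values and the column names (`context`, `trust_var`, `interp_var`) are made up, and the paper’s actual variability measure and data are not reproduced here.

```r
# Hypothetical per-participant variability scores (illustrative values only,
# not data from the study).
scores <- data.frame(
  context    = c("neuroimaging", "neuroimaging", "geovis", "geovis"),
  trust_var  = c(4, 5, 3, 2),
  interp_var = c(3, 2, 3, 4)
)

# Overall response variability = trust variability + interpretation variability
scores$overall_var <- scores$trust_var + scores$interp_var

# Summarize each measure separately for each data context
context_means <- aggregate(
  cbind(trust_var, interp_var, overall_var) ~ context,
  data = scores,
  FUN = mean
)
context_means
```

Keeping `context` as a grouping column, rather than building one table per context, makes it easy to compute the three measures separately for the neuroimaging and geovis contexts in a single call.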

The first hypothesis made by the researchers was not confirmed. The data indicated that the response variability of neuroimaging experts was greater than that of both geovis experts and lay individuals. This effect was driven by the trust variability rather than the interpretation variability. The mean scores for the trust variability were: neuroimaging experts = 4.34, geovis experts = 3.16, lay individuals = 3.00; p < 0.001. The mean scores of neuroimaging experts and geovis experts differ by 1.18. This is relatively large, and the difference from lay individuals is even larger at 1.34. Thus, it can be concluded that the trust responses of neuroimaging experts varied more than those of the other two groups.

In contrast, the mean scores for the interpretation variability were: neuroimaging experts = 3.00, geovis experts = 2.70, lay individuals = 2.74. In this case, the neuroimaging experts’ mean score differs from those of geovis experts and lay individuals by 0.30 and 0.26, respectively. These differences are much smaller than the differences in trust variability, so the researchers were right to conclude that the rejection of the hypothesis was due to the trust variability rather than the interpretation variability.

However, in the field of geographic visualization, the geovis experts displayed high interpretation variability. The mean scores for the interpretation variability were: neuroimaging experts = 3.09, geovis experts = 3.09, lay individuals = 2.81; p < 0.05. Although the geovis experts’ mean score matches that of the neuroimaging experts, the scores for geovis experts and lay individuals differ by 0.28. Interestingly enough, the non-experts, lay individuals, seemed to be the least affected by the changes in color scales. This group had the lowest variability in both their trust and interpretation responses.
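The differences quoted in the preceding paragraphs can be checked with a few lines of R. The means are the ones reported in the text; the data frame layout and column names are my own, and I am assuming that the first two sets of means belong to the neuroimaging data context and the third to the geographic one.

```r
# Mean variability scores as reported in the text
# (column names and layout are illustrative, not taken from the paper).
means <- data.frame(
  group      = c("neuroimaging experts", "geovis experts", "lay individuals"),
  trust      = c(4.34, 3.16, 3.00),  # trust variability
  interp     = c(3.00, 2.70, 2.74),  # interpretation variability
  interp_geo = c(3.09, 3.09, 2.81)   # interpretation variability, geographic context
)

# Gap between the neuroimaging experts (row 1) and each of the other groups
trust_gap  <- means$trust[1]  - means$trust[2:3]
interp_gap <- means$interp[1] - means$interp[2:3]

# Gap between geovis experts and lay individuals in the geographic context
geo_gap <- means$interp_geo[2] - means$interp_geo[3]

round(trust_gap, 2)
round(interp_gap, 2)
round(geo_gap, 2)
```

The rounded gaps come out to 1.18 and 1.34 for trust, 0.30 and 0.26 for interpretation, and 0.28 for the geographic context, matching the figures discussed above.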

The researchers suggested that one possible explanation for these irregular results is exposure to domain-specific visualization standards. Manipulating the images with non-typical color scales could increase interpretation variability in domain-specific experts. In turn, it could decrease their trust in that specific image, which could explain the high variability in both interpretation and trust for neuroimaging experts. However, neuroimaging experts trusted the color scale when interpreting geographic data visualizations, but in this field outside of their domain, their interpretation variability did not decrease in accordance with their trust variability. This finding reiterates the importance of implementing domain-specific data visualization standards, especially when using color. It seems that neuroimaging has set standards, and when the color scale is changed, the experts’ variability increases. Professions other than neuroscience can incorporate this result into their data visualization to ensure that the experts interpreting the images have an overarching set of standards on which to base their decisions. This way, bias will be decreased, and if there is collaboration, the two groups will be using similar decision-making techniques derived from the data visualization standards.

Next, the researchers evaluated the trust rating values and the interpretation variability between the four color scales used for geovis experts and neuroimaging experts. It was found that geovis experts’ responses aligned with visualization principles recommended by cartographic design theory. That is, the color scale displaying decreasing data magnitudes with less intensity obtained the highest trust ratings, and the least preferred was the rainbow scale. Conversely, neuroimaging experts trusted the heated body scale the most, which is a commonly used scale in the neuroimaging field. In both cases, the suitability ratings coincide with typical color scale uses in their respective field.

Although trust ratings differ from the trust variability computed in the experiment, it is important to note the differences and evaluate them. Often the two responses differed and showed interesting patterns. For example, participants in both fields showed little trust in the rainbow scale prior to examining the scans in the experiment; however, in the actual experiment they exhibited much more trust in this scale than their earlier trust ratings suggested. Neuroimaging experts generally work with heated body scales, which are relatively similar to a rainbow scale, but geovis experts are often trained to avoid the rainbow scale, so this finding is unusual for those in the geographic visualization field. As explained, neuroimaging experts generally use the heated body scale, and their trust ratings align with this, but their trust variability does not agree as readily. In contrast, geovis experts trusted the heated body scale the most, even though this scale is scarcely used in the geographic visualization field. This result is irregular because these experts are trained under cartographic design theory, which explains that one should assign darker color intensities to convey greater data magnitudes. Essentially, geovis experts are trained to have a decreasing color intensity scale for their images. However, this is the opposite of the heated body scale, which does not vary in value, but rather has a gradient scale of different hues.
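The cartographic principle mentioned above, darker intensities for greater magnitudes, can be sketched in base R. This is only a rough illustration under my own assumptions: the two endpoint colors of the green ramp are arbitrary choices, the lightness formula is a common weighted-RGB approximation, and the volcano data stands in for a real map.

```r
# Light-to-dark green ramp; the two endpoint colors are illustrative choices.
pal <- colorRampPalette(c("#E5F5E0", "#00441B"))(9)

# Approximate perceived lightness of each color (0 = black, 255 = white),
# using a weighted sum of the red, green and blue channels.
rgb_values <- col2rgb(pal)
lightness <- colSums(rgb_values * c(0.299, 0.587, 0.114))

# Lightness should fall at every step, so higher values get darker colors
all(diff(lightness) < 0)

# Apply the ramp to the volcano elevations: higher ground is drawn darker
image(volcano, col = pal, axes = FALSE, main = "Darker = greater elevation")
```

Checking the lightness numerically, rather than judging the ramp by eye, is a simple way to confirm that a scale really is ordered from light to dark before it is applied to a map.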

knitr::include_graphics("journal.pone.0246479.g002.PNG")

Box Plots of Variability for Participants

Not only did this research paper discuss the importance of color and how different hues and intensities evoke different emotions, but the researchers also created data visualizations to represent the data they collected throughout the experiment. Their first visualization, Figure 2, included three separate graphs for overall response variability, trust variability and interpretation variability. Each graph contained six side by side boxplots, three for the geovis data context and three for the neuroimaging context. Within each context, the three boxplots represented neuroimaging experts, geovis experts, and lay individuals. Each of these three groups of participants was denoted by a different color and was labeled as such in the legend.

The researchers chose a different hue for each group: yellow, brown and blue for neuroimaging experts, geovis experts and lay individuals, respectively. This was an appropriate choice because the groups are categories rather than values, and the different hues make for a clear distinction between them. Additionally, the deliberate choice to create multiple side by side boxplots was a positive attribute of this visualization, as it allowed for easy comparison between the three participant groups. Faceting the three separate graphs, one for each variability measure, also made it easy to tell which measure was in question. The strategic placement of the three graphs meant the participant groups could be compared across the different variability measures, and not just within one. However, the labeling of the geographic and neuroimaging data contexts could have been improved. At first glance, it is slightly difficult to distinguish between the two because the labels are fairly small, and a graph with six boxplots can easily get confusing. The researchers appropriately placed the boxplots side by side; had they stacked one data context over the other, it would have been difficult to compare the two. Perhaps the researchers could have enlarged the data context labels or even added a dashed line between the two data contexts in order to clearly separate them within each graph.

knitr::include_graphics("journal.pone.0246479.g003.PNG")

Box Plots of Expert Responses in Regards to Color Scales

Secondly, similar to the first data visualization, in Figure 3 the researchers constructed two separate graphs with eight side by side boxplots in each one. The graphs were split between neuroimaging experts and geovis experts. Because this data visualization only represented expert opinions on color, the lay individuals’ responses were not included. In each graph, there were two boxplots, one for suitability ratings and one for trust ratings, for each of the four color scales. The color scales were distinguished by different hues and labeled as such. Additionally, the means of the trust and suitability ratings were denoted with either a filled-in triangle or an outlined triangle, and this too was labeled. The researchers were right to add clear labels because multiple variables are represented in the graphs, and the labels give a clear indication of what each aspect of the graph is. Also, similar to the first data visualization, the researchers’ choice of side by side boxplots was appropriate because it allowed for easy comparison between the two ratings across the different color scales. Rather than keeping the mean a straight line and labeling and grouping the boxplots by rating, the researchers made good use of shape by denoting the different means with triangles. It was easy to compare the two ratings for a certain color scale when they were right next to each other. The color choice, however, could have been improved. The researchers chose light blue, light green, dark blue and dark green to represent the four color scales. Even though these are four different colors, it would have been more effective to choose four completely different hues on the color wheel rather than two shades each of two hues. A light blue and a dark blue can easily be mistaken for different values in the same category, but this is not the case because the four color scales are separate entities.
They could also have reinforced the distinction between suitability and trust ratings by choosing different shades of one hue for each specific color scale. In doing this, they would be able to show that within the same category, the color scale, there are two separate values. This change would also help the comparison of the two ratings within a certain color scale because the viewer would be able to clearly see different color values and different mean markers on the side by side boxplots.

data1 <- c(8, 6, 6, 4, 4, 2, 2, 2)
data2 <- c(8, 8, 8, 7, 7, 5, 5, 1)
data3 <- c(8, 8, 8, 6, 6, 5.7, 2, 2)
data4 <- c(8, 7, 6.5, 6, 6, 5, 2, 1)
data5 <- c(8, 6, 6, 4, 4, 2, 2, 2)
data6 <- c(8, 7, 6.5, 6, 6, 4, 2, 1)
data7 <- c(8, 6, 6, 4, 4, 2, 2, 2)
data8 <- c(8, 7, 6.5, 6, 6, 3, 1, 1)

rating <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Create data for the second graph (geovis experts)
data9 <- c(8, 6, 4, 4, 3, 3, 3, 2)
data10 <- c(8, 7, 6, 6, 6, 5, 2, 1)
data11 <- c(8, 6, 6, 4, 4, 2, 2, 2)
data12 <- c(8, 8, 8, 7, 7, 5, 5, 1)
data13 <- c(8, 8, 8, 6, 6, 5.7, 2, 2)
data14 <- c(8, 7, 6, 6, 6, 5, 2, 1)
data15 <- c(8, 7, 6, 6, 6, 5, 2, 2)
data16 <- c(8, 8, 8, 6, 6, 5.7, 2, 1)

library(tidyverse)

#Create data frame
df <- data.frame(data1, data2, data3, data4, data5, data6, data7, data8,
                 rating)


df4 <- data.frame(data9, data10, data11, data12, data13, data14, data15, data16,
                  rating)

# Reshape to long format: one row per (boxplot, value) pair
library(reshape2)
df3 <- melt(df, id.vars = c("rating"))

colors <- c("navy", "cyan", "darkgreen", "chartreuse3", "purple",
            "darkmagenta", "darkorange3", "orange")
# First Plot 
g1 <- ggplot(df3, aes(variable, value)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(fill = colors) +
  scale_y_continuous(limits = c(0, 8)) + ggtitle("Neuroimaging Experts") +
  labs(y = "Rating") +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank()) +
  # Mark each mean with a filled diamond; geom = "point" avoids the default
  # pointrange geom, which expects ymin/ymax values that are not supplied
  stat_summary(fun = "mean", geom = "point", fill = "black", shape = 23, size = 3)

# Second Plot
df5 <- melt(df4, id.vars = c("rating"))
g2 <- ggplot(df5, aes(variable, value)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(fill = colors) +
  scale_y_continuous(limits = c(0, 8)) + ggtitle("Geovis Experts") +
  labs(y = "Rating") +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank()) +
  # Mark each mean with a filled diamond (see the first plot)
  stat_summary(fun = "mean", geom = "point", fill = "black", shape = 23, size = 3)


# Place the two plots side by side with patchwork
library(patchwork)
g3 <- g1 + g2
g3

Recreation of Figure 3

In response to the eight side by side boxplots, I created a similar visualization, but added improvements to it as I saw fit after critiquing the authors’ original graphs. Because there was no usable data available to accurately recreate the visualization, I decided to create sixteen different vectors that would look similar to those on the original graphs. The sixteen vectors were split into two separate data frames, a neuroimaging data frame and a geovis data frame. From the two data frames, I created two graphs with eight side by side boxplots each and then placed them side by side. The largest critique I have of my own work is that several of the boxplots are not exactly the same as the originals, for example in the mean shown by the diamond, but it was simply a recreation and not an exact replica. The most impactful changes I made in my recreation were changing the colors of the last four boxplots on each graph and changing the value of the second boxplot in each color scale category for each graph. On the original visualization, the researchers used different variations of blues and greens, which could confuse the viewer into believing they were different values of the same category. This is not the case though, as there are four different categories being analyzed: rainbow color scale, heated body scale, univariate color intensity change and bivariate blue-red-white. I decided to change the dark blue and dark green to purple and orange to indicate that these are separate categories and not values of the same category. Also, making the trust ratings boxplot a lighter shade of the same color allows the viewer to easily compare the two ratings.

In conclusion, the first hypothesis made by the researchers was not supported. The neuroimaging experts had a larger response variability than both geovis experts and lay individuals. This went against the initial hypothesis that in their own field, neuroimaging experts would have a low response variability. Although it cannot be explained why this effect occurred, it could be due to the fact that experts in their field are inclined to trust their background knowledge and training. The trust rating results indicate that they trust the heated body scale, while their trust variability shows that they trust the rainbow scale, which is relatively similar to the heated body scale. In other words, neuroimaging experts indicated that they trusted a heated body scale, but when they analyzed actual images in the experiment, they trusted the rainbow scale more. In their field the heated body scale is often used, so their stated ratings align with their real life applications. Because of these findings, it is only fair to make the judgment that in professional fields, there should be a set of standards when it comes to color usage in data visualization. It is essential that experts who are interpreting images have little to no bias regarding color. Color evokes emotions, which can in turn skew one’s judgement when making a decision. When professionals are making decisions for patients or on societal issues, it is vital that the decision be made with no bias. Therefore, a standard set of guidelines for color usage in data visualization for a particular field could prove to be useful. Experts across the field would be more likely to trust and interpret the image correctly because a set standard would be in place and they would be able to recognize the context of the data visualization with ease.

References

How to use color in design to evoke powerful emotions. (n.d.). Retrieved from https://www.independence.edu/blog/using-color-in-design-to-evoke-emotion

Christen, M., Brugger, P., Fabrikant, S. (2021). Susceptibility of domain experts to color manipulation indicate a need for design principles in data visualization. Retrieved March 3, 2021, from https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0246479

Paper on Academic Research for Color in Data Visualization

Madison Nguyen

3/19/2021
