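This section contains some thoughts on what makes a good data visualization. Most of them come from books and posts written by others, but I take responsibility for putting them here.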


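11.1 Labeling

Everything on your graph should be labeled, including the following:

  • title - a clear, short title that tells the reader what they are looking at: Relationship between experience and wages by gender

  • subtitle - an optional second title (in a smaller font) that gives additional information: Years 2016-2018

  • caption - source attribution for the data: Source: US Department of Labor - www.bls.gov/bls/blswage.htm

  • axis labels - labels for the x and y axes

    • short but descriptive
    • include units of measurement
      • Engine displacement (cu. in.)
      • Survival time (days)
      • Patient age (years)
  • legend - short, informative title and labels

    • Male and Female - not 0 and 1!
  • lines and bars - label any trend lines, annotation lines, and error bars

Basically, the reader should be able to understand your graph without having to wade through paragraphs of text. When in doubt, show your data visualization to someone who has not read your article or poster and ask them whether anything is unclear.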
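
As a rough illustration of the labeling checklist, the sketch below labels a scatter plot with labs(). It uses the mtcars dataset that ships with R rather than any data from this chapter, and the object names (cars, transmission, labeled_plot) and the label wording are my own choices, not part of the original text.

```r
library(ggplot2)

# mtcars ships with R; disp is engine displacement in cubic inches.
# am is coded 0/1, so it is recoded to readable legend labels.
cars <- mtcars
cars$transmission <- factor(cars$am,
                            levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

labeled_plot <- ggplot(cars,
                       aes(x = disp, y = mpg, color = transmission)) +
  geom_point(size = 2) +
  labs(title = "Fuel efficiency by engine displacement",
       subtitle = "32 automobiles, 1973-74 models",
       caption = "source: mtcars dataset (1974 Motor Trend magazine)",
       x = "Engine displacement (cu. in.)",
       y = "Fuel efficiency (miles per gallon)",
       color = "Transmission")

labeled_plot
```

Recoding the 0/1 variable before plotting is one simple way to get "Automatic" and "Manual" in the legend instead of raw codes.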


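11.2 Signal-to-noise ratio

In data science, the goal of data visualization is to communicate information. Anything that does not support this goal should be reduced or eliminated.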
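
One possible way to apply this in ggplot2 is sketched below. The example is an assumption of mine rather than something from the original text: it uses the mpg dataset bundled with ggplot2 and contrasts a plot where color merely repeats the x-axis with a plainer version of the same bar chart.

```r
library(ggplot2)

# noisy version: fill repeats what the x-axis already shows, so the
# legend and the extra colors carry no additional information
noisy_plot <- ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar() +
  labs(x = "Vehicle class", y = "Number of cars")

# cleaner version: one neutral fill, no legend, fewer grid lines
clean_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue") +
  labs(x = "Vehicle class", y = "Number of cars") +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())

clean_plot
```

Which elements count as noise is a judgment call. The point of the sketch is only that each remaining element should help the reader read the bars.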


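11.3 Color choice

Color choice is about more than aesthetics. Choose colors that help convey the information contained in the plot.

The article on how to pick the perfect color combination for your data visualization is a good place to start. Basically, think in terms of three types of color schemes: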

  • sequential - for plotting a quantitative variable that goes from low to high
  • diverging - for contrasting the extremes (low, medium, and high) of a quantitative variable
  • qualitative - for distinguishing among the levels of a categorical variable

The article above can help you choose among these schemes. In addition, the RColorBrewer package provides palettes categorized in this way.
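
The three scheme types can be explored with RColorBrewer roughly as follows. The palette names are standard ColorBrewer names, but picking "Blues", "RdBu", and "Set2" as representatives is an arbitrary choice made for this sketch.

```r
library(RColorBrewer)

# one example palette from each category, five colors each
sequential_colors  <- brewer.pal(n = 5, name = "Blues")
diverging_colors   <- brewer.pal(n = 5, name = "RdBu")
qualitative_colors <- brewer.pal(n = 5, name = "Set2")

# brewer.pal.info records each palette's category: "seq", "div", or "qual"
brewer.pal.info[c("Blues", "RdBu", "Set2"), ]

# preview every sequential palette in the package
display.brewer.all(type = "seq")
```

In a ggplot2 graph the same palettes can be applied by name, for example with scale_fill_brewer(palette = "Blues") or scale_color_brewer(palette = "Set2").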

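Other things to keep in mind:

  • Make sure that text is legible - avoid dark text on dark backgrounds, light text on light backgrounds, and colors that clash in a discordant fashion (i.e. they hurt to look at!).
  • Avoid combinations of red and green - it can be difficult for a colorblind audience to distinguish these colors.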
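
As a small sketch of the colorblind advice, RColorBrewer flags which of its palettes are considered colorblind friendly, and one of those can be used in place of a red/green pairing. The dataset (mpg, bundled with ggplot2), the object names, and the choice of "Dark2" are assumptions made for illustration.

```r
library(ggplot2)
library(RColorBrewer)

# names of the palettes that RColorBrewer flags as colorblind friendly
colorblind_safe <- rownames(brewer.pal.info)[brewer.pal.info$colorblind]
colorblind_safe

# "Dark2" is one of the flagged qualitative palettes
safe_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "Engine displacement (litres)",
       y = "Highway fuel efficiency (miles per gallon)",
       color = "Drive train")

safe_plot
```

The viridis scales built into ggplot2, such as scale_color_viridis_d(), are another commonly used option.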

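11.4 Y-axis scaling

Depending on how you scale a numeric y-axis, you can make an effect seem massive or insignificant. Consider the following example comparing the 9-month salaries of male and female assistant professors. The data come from the Academic Salaries dataset.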

# load data
data(Salaries, package="carData")

# get means, standard deviations, and 95% confidence intervals for assistant professor salary by sex 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- Salaries %>%
  filter(rank == "AsstProf") %>%
  group_by(sex) %>%
  summarize(n = n(),
            mean = mean(salary), 
            sd = sd(salary),
            se = sd / sqrt(n),
            ci = qt(0.975, df = n - 1) * se)

df
## # A tibble: 2 x 6
##   sex        n   mean    sd    se    ci
##   <fct>  <int>  <dbl> <dbl> <dbl> <dbl>
## 1 Female    11 78050. 9372. 2826. 6296.
## 2 Male      56 81311. 7901. 1056. 2116.
# create and save the plot
library(ggplot2)
p <- ggplot(df, 
            aes(x = sex,
                y = mean)) +
  geom_point(size = 4) +
  geom_line(aes(group = 1)) +
  scale_y_continuous(limits = c(77000, 82000),
                     labels = scales::dollar) +
  labs(title = "Mean salary differences by gender",
       subtitle = "9-mo academic salary in 2007-2008",
       caption = paste("source: Fox J. and Weisberg, S. (2011)",
                       "An R Companion to Applied Regression,", 
                       "Second Edition Sage"),
       x = "Gender",
       y = "Salary")

p

First, let’s plot this with a y-axis going from 77,000 to 82,000.

# plot in a narrow range of y
p + scale_y_continuous(limits=c(77000, 82000))
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

There appears to be a very large gender difference. Next, let's plot the same data with the y-axis going from 0 to 125,000.

# plot in a wide range of y
p + scale_y_continuous(limits = c(0, 125000))
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

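There doesn't appear to be any gender difference! The goal of data visualization is to present findings with minimal distortion, which means choosing an appropriate range for the y-axis.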
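
When settling on a range in ggplot2, it may help to know that the range can be set in two places, and they behave differently. The sketch below is an illustration under my own assumptions: the data frame salary_means is rebuilt from the rounded group means printed earlier, and the object names are hypothetical.

```r
library(ggplot2)

# group means rounded from the summary table printed earlier
salary_means <- data.frame(sex = c("Female", "Male"),
                           mean = c(78050, 81311))

base_plot <- ggplot(salary_means, aes(x = sex, y = mean)) +
  geom_point(size = 4)

# limits on the scale: values outside the range are treated as missing,
# so the Female point is dropped (ggplot2 warns about removed rows)
clipped_plot <- base_plot +
  scale_y_continuous(limits = c(80000, 85000),
                     labels = scales::dollar)

# limits on the coordinate system: the view is zoomed, no data are dropped
zoomed_plot <- base_plot +
  scale_y_continuous(labels = scales::dollar) +
  coord_cartesian(ylim = c(70000, 85000))

zoomed_plot
```

Setting the range through coord_cartesian() is often the safer choice when a statistic such as a smoothed line is computed from the data, since the underlying values are left intact.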

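Bar charts should almost always start at y = 0. For other charts, the limits really depend on subject-matter knowledge of the expected range of values. We can also improve the graph by adding an indicator of uncertainty.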

df %>% 
  ggplot(aes(sex,mean)) +
  geom_point(size = 3,col = "red") +
  geom_errorbar(aes(ymin = mean - ci,ymax = mean + ci),width = 0.2,col = "steelblue") +
  geom_line(aes(group = 1))

p <- ggplot(df, 
            aes(x = sex, y = mean)) +
  geom_point(size = 4) +
  geom_line(aes(group = 1)) +
  scale_y_continuous(limits = c(70000, 85000),
                     labels = scales::dollar) +
  labs(title = "Mean salary differences by gender",
       subtitle = "9-mo academic salary in 2007-2008",
       caption = paste("source: Fox J. and Weisberg, S. (2011)",
                       "An R Companion to Applied Regression,", 
                       "Second Edition Sage"),
       x = "Gender",
       y = "Salary")

# plot with confidence limits
p +  geom_errorbar(aes(ymin = mean - ci, 
                       ymax = mean + ci), 
                       width = .1) +
  ggplot2::annotate("text", 
           label = "I-bars are 95% \nconfidence intervals", 
           x=2, 
           y=73500,
           size = 4) +
  theme(text = element_text(family = "Times New Roman"))

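A line can also be drawn across a discrete x-axis with an explicit segment, because the two factor levels sit at the integer positions 1 and 2: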

df %>% 
  ggplot(aes(sex,mean)) +
  geom_point(size = 3) +
  # annotate() draws the segment once; the levels sit at x = 1 and x = 2
  annotate("segment", x = 1, xend = 2,
           y = df$mean[1], yend = df$mean[2], linewidth = 1)

df %>% 
  ggplot(aes(sex,mean)) +
  geom_point(size = 3) +
  geom_line(group = 1) +
  theme(text = element_text(family = "Times New Roman",face = "italic"))

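11.5 Attribution

Unless the data are your own, every graph should carry an attribution: a note directing the reader to the source of the data. This usually appears in the graph's caption.

11.6 Going further

If you would like to learn more about ggplot2, there are several good resources, including: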

  • the ggplot2 homepage
  • the book ggplot2: Elegant Graphics for Data Analysis (be sure to get the third edition)
  • the eBook R for Data Science - the data visualization chapter
  • the ggplot2 cheatsheet

All of them are free!

If you would like to learn more about data visualization in general, here are some useful resources:

  • Harvard Business Review - Visualizations That Really Work
  • the website Information is Beautiful
  • the book Beautiful Data: The Stories Behind Elegant Data Solutions
  • the Wall Street Journal’s - Guide to Information Graphics
  • the book The Truthful Art

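11.7 Final note

With the growth (or should I say deluge?) of readily available data, the field of data visualization is exploding. This explosion is supported by the availability of exciting graphical tools. It is a great time to learn and explore. Enjoy!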

