---
title: "Data Visualization with R: Chapter 11, Advice / Best Practices"
author: "LJJ"
date: "2020/3/25"
output: 
  html_document:
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: true
    code_folding: hide
    code_download: true
    
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,fig.show = "hold",fig.align = "center",cache = TRUE)
```

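This section collects some thoughts on how to make data visualizations effective. Most of the ideas come from books and posts written by others, but I take responsibility for putting them here.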

---

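## 11.1 Labeling

Everything on your graph should be labeled, including the following:

* title - a clear, short title that tells the reader what they are looking at: Relationship between experience and wages by gender

* subtitle - an optional second title (in a smaller font) giving additional information: Years 2016-2018

* caption - the source of the data: Source: US Department of Labor - www.bls.gov/bls/blswage.htm

* axis labels - labels for the x and y axes
  - short but descriptive
  - include units of measurement
     - Engine displacement (cu. in.)
     - Survival time (days)
     - Patient age (years)

* legend - short titles and labels
  - Male and Female, not 0 and 1!
* lines and bars - label any trend lines, annotation lines, and error bars

Basically, the reader should be able to understand your graph without having to wade through paragraphs of text. When in doubt, show your data visualization to someone who has not read your article or poster and ask them whether anything is unclear.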
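
As a minimal sketch of the checklist above, the chunk below labels every element of a scatterplot with `labs()`. It uses the `mpg` dataset that ships with ggplot2 rather than any data from this chapter, and the title, caption text, and legend labels are illustrative choices only.

```{r}
library(ggplot2)

# mpg ships with ggplot2; displ is engine displacement in litres,
# hwy is highway fuel economy in miles per gallon
labeled_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # readable legend labels instead of the raw codes "4", "f", "r"
  scale_color_discrete(labels = c("4" = "Four-wheel",
                                  "f" = "Front-wheel",
                                  "r" = "Rear-wheel")) +
  labs(title = "Fuel economy falls as engine size grows",
       subtitle = "Model years 1999 and 2008",
       caption = "Source: mpg dataset in the ggplot2 package",
       x = "Engine displacement (litres)",
       y = "Highway fuel economy (miles per gallon)",
       color = "Drive train")

labeled_plot
```

The axis labels carry their units, and the legend uses words rather than codes, so the figure can be read without the surrounding text.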

---

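## 11.2 Signal-to-noise ratio

In data science, **the goal of data visualization is to communicate information**. Anything that does not support this goal should be reduced or eliminated.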
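
One hedged illustration of reducing non-data clutter: the sketch below draws the same bar chart twice, once with the ggplot2 defaults and once with a lighter theme and fewer grid lines. The dataset (`mpg`) and the specific theme choices are assumptions made for illustration, not recommendations taken from the sources behind this chapter.

```{r}
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  labs(x = "Vehicle class", y = "Number of models")

# default theme: grey panel with major and minor grid lines
base_plot

# quieter version: white background, no vertical or minor grid lines
quiet_plot <- base_plot +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())

quiet_plot
```

Vertical grid lines add little to a bar chart with a categorical x-axis, which is why they are the first thing removed here.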

---

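## 11.3 Color choice

Color choice is about more than aesthetics. Choose colors that help convey the information contained in the plot.

The article [How to Pick the Perfect Color Combination for Your Data Visualization](https://blog.hubspot.com/marketing/color-combination-data-visualization) is a great place to start. Basically, think about choosing among three types of color scheme: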

* sequential - for plotting a quantitative variable that goes from low to high
* diverging - for contrasting the extremes (low, medium, and high) of a quantitative variable
* qualitative - for distinguishing among the levels of a categorical variable

The article above can help you choose among these schemes. In addition, the **RColorBrewer package** provides palettes categorized in this way.
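
A brief sketch of how the three scheme types map onto ColorBrewer palettes, assuming the RColorBrewer package is installed. The palette names used here (`Blues`, `RdBu`, `Set2`) are just one example of each type.

```{r}
library(RColorBrewer)
library(ggplot2)

# one palette of each type, five colors each
sequential  <- brewer.pal(n = 5, name = "Blues")  # low to high
diverging   <- brewer.pal(n = 5, name = "RdBu")   # two extremes around a midpoint
qualitative <- brewer.pal(n = 5, name = "Set2")   # unordered categories

# brewer.pal.info records which type each palette belongs to
brewer.pal.info[c("Blues", "RdBu", "Set2"), "category"]

# show every available palette, grouped by type
display.brewer.all()

# in ggplot2, a ColorBrewer palette can be applied by name
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2")
```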

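Other things to keep in mind:

* Make sure that text is legible: **avoid dark text on dark backgrounds**, light text on light backgrounds, and colors that clash discordantly (that is, they hurt to look at!).
* Avoid combinations of red and green: viewers with color blindness have difficulty telling these colors apart.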
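
To act on the red-green caution, one option (an assumption here, not something the sources above prescribe) is to restrict yourself to palettes that RColorBrewer flags as colorblind-friendly, or to use the viridis scales built into ggplot2. The `mpg` data is used purely as a stand-in.

```{r}
library(RColorBrewer)
library(ggplot2)

# names of the palettes that RColorBrewer marks as colorblind-friendly
cb_safe <- rownames(brewer.pal.info[brewer.pal.info$colorblind, ])
cb_safe

# viridis scales are designed to stay distinguishable under
# common forms of color blindness
ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(x = "Engine displacement (litres)",
       y = "Highway fuel economy (miles per gallon)",
       color = "City fuel economy\n(miles per gallon)")
```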

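## 11.4 y-axis scaling

Depending on how you scale the numeric y-axis, you can make an effect look huge or negligible. Consider the following example, which compares the 9-month salaries of male and female assistant professors. The data come from the Academic Salaries dataset (`Salaries` in the carData package).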
    
```{r}
# load data
data(Salaries, package="carData")

# get means, standard deviations, and 95% confidence intervals for assistant professor salary by sex 
library(dplyr)
df <- Salaries %>%
  filter(rank == "AsstProf") %>%
  group_by(sex) %>%
  summarize(n = n(),
            mean = mean(salary), 
            sd = sd(salary),
            se = sd / sqrt(n),
            ci = qt(0.975, df = n - 1) * se)

df
```
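
The `summarize()` call above encodes the usual t-based confidence interval for a mean. Written out, with $\bar{x}$ the group mean, $s$ the group standard deviation, and $n$ the group size:

$$
\mathrm{se} = \frac{s}{\sqrt{n}}, \qquad
\mathrm{ci} = t_{0.975,\,n-1} \times \mathrm{se}, \qquad
\text{interval} = \bar{x} \pm \mathrm{ci}
$$

As a rough worked case, the 11 female assistant professors have $s \approx 9372$, so $\mathrm{se} \approx 9372 / \sqrt{11} \approx 2826$. With $t_{0.975,\,10} \approx 2.228$, the half-width is $\mathrm{ci} \approx 2.228 \times 2826 \approx 6296$, which should match the `ci` column printed for that group up to rounding.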


```{r}
# create and save the plot
library(ggplot2)
p <- ggplot(df, 
            aes(x = sex,
                y = mean)) +
  geom_point(size = 4) +
  geom_line(aes(group = 1)) +
  scale_y_continuous(limits = c(77000, 82000),
                     labels = scales::dollar) +
  labs(title = "Mean salary differences by gender",
       subtitle = "9-mo academic salary in 2007-2008",
       caption = paste("source: Fox J. and Weisberg, S. (2011)",
                       "An R Companion to Applied Regression,", 
                       "Second Edition Sage"),
       x = "Gender",
       y = "Salary")

p

```

First, let’s plot this with a y-axis going from 77,000 to 82,000.

```{r}
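# plot in a narrow range of y
# (a second y scale replaces the one stored in p, so the dollar labels are repeated)
p + scale_y_continuous(limits = c(77000, 82000),
                       labels = scales::dollar)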
```

The gender difference appears to be very large. Next, let's plot the same data with the y-axis running from 0 to 125,000.

```{r}
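# plot in a wide range of y
# (again repeating the dollar labels, since this scale replaces the one in p)
p + scale_y_continuous(limits = c(0, 125000),
                       labels = scales::dollar)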
```

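Now there doesn't appear to be any gender difference! The goal of data visualization is to present findings with as little distortion as possible, which means choosing an appropriate range for the y-axis.

**Bar charts should almost always start at y = 0**. For other charts, the limits really depend on subject-matter knowledge of the expected range of values. We can also improve the graph by adding an indicator of uncertainty.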

```{r}
# quick version: group means with 95% confidence intervals as error bars
df %>% 
  ggplot(aes(x = sex, y = mean)) +
  geom_point(size = 3, color = "red") +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci),
                width = 0.2, color = "steelblue") +
  geom_line(aes(group = 1))

p <- ggplot(df, 
            aes(x = sex, y = mean)) +
  geom_point(size = 4) +
  geom_line(aes(group = 1)) +
  scale_y_continuous(limits = c(70000, 85000),
                     labels = scales::dollar) +
  labs(title = "Mean salary differences by gender",
       subtitle = "9-mo academic salary in 2007-2008",
       caption = paste("source: Fox J. and Weisberg, S. (2011)",
                       "An R Companion to Applied Regression,", 
                       "Second Edition Sage"),
       x = "Gender",
       y = "Salary")

# plot with confidence limits
p +  geom_errorbar(aes(ymin = mean - ci, 
                       ymax = mean + ci), 
                       width = .1) +
  ggplot2::annotate("text", 
           label = "I-bars are 95% \nconfidence intervals", 
           x=2, 
           y=73500,
           size = 4) +
  theme(text = element_text(family = "Times New Roman"))
```
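
One caveat worth sketching, based on how ggplot2 treats limits rather than on anything stated in the text above: `scale_y_continuous(limits = ...)` discards values outside the range, whereas `coord_cartesian(ylim = ...)` only zooms the view. This matters for error bars, which can disappear when an interval extends past the limits. The data frame below is made up for illustration.

```{r}
library(ggplot2)

# hypothetical group means with confidence-interval half-widths
toy <- data.frame(group = c("A", "B"),
                  mean  = c(78, 81),
                  ci    = c(6, 0.5))

err_plot <- ggplot(toy, aes(x = group, y = mean)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), width = 0.1)

# scale limits: group A's interval (72 to 84) falls outside 77 to 82,
# so its error bar is dropped with a warning
err_plot + scale_y_continuous(limits = c(77, 82))

# coord_cartesian: same window, nothing is dropped;
# group A's bar simply runs off the edges of the panel
err_plot + coord_cartesian(ylim = c(77, 82))
```

Choosing limits wide enough to contain every interval avoids the problem entirely.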

Here is a look at drawing a line segment across a discrete variable, which turns out to be kind of fun:

```{r}
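# connect the two means by hand: on a discrete x-axis the levels sit at
# positions 1 and 2, so a segment can be drawn between them
df %>% 
  ggplot(aes(sex, mean)) +
  geom_point(size = 3) +
  geom_segment(aes(x = 1, y = 78049.91, xend = 2, yend = 81311.46), size = 1)

# the same line, letting geom_line() join the points as a single group
df %>% 
  ggplot(aes(sex, mean)) +
  geom_point(size = 3) +
  geom_line(aes(group = 1)) +
  theme(text = element_text(family = "Times New Roman", face = "italic"))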
```
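
Returning to the rule that bar charts should start at y = 0, here is a small sketch of why. A bar encodes its value by length, so cutting off the bottom of the axis changes the apparent ratio between bars. The values are hypothetical and chosen only to make the effect easy to see.

```{r}
library(ggplot2)

# hypothetical values, for illustration only
toy_bars <- data.frame(group = c("A", "B"), value = c(78, 81))

bars <- ggplot(toy_bars, aes(x = group, y = value)) +
  geom_col()

# default: bars are anchored at zero, so B looks about 4% taller than A
bars

# zoomed axis: the visible part of B is four times as tall as that of A
bars + coord_cartesian(ylim = c(77, 82))
```

For points and lines, position rather than length carries the value, which is why a non-zero baseline is more defensible there.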

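## 11.5 Attribution

Unless the data are your own, every graph should carry an attribution: a note directing the reader to the source of the data. This usually appears in the caption of the graph.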

## 11.6 Going further

If you would like to learn more about ggplot2, there are several good resources, including:

* the ggplot2 homepage
* the book ggplot2: **Elegant Graphics for Data Analysis** (be sure to get the third edition)
* the eBook R for Data Science - the data visualization chapter
* the ggplot2 cheatsheet

They are all free!

If you would like to learn more about **data visualization**, here are some useful resources:

* Harvard Business Review - **Visualizations that really work**
* the website **Information is Beautiful**
* the book **Beautiful Data: The Stories Behind Elegant Data Solutions**
* the Wall Street Journal’s - **Guide to Information Graphics**
* the book **The Truthful Art**

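## 11.7 Final note

With the growth (or should I say deluge?) of readily available data, the field of data visualization is exploding. The availability of exciting graphics tools supports this explosive growth. This is a great time to learn and explore. Enjoy!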


