Chapter 4: Visualizing Data Distributions

Author

221527120+吴思婷

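1 Describing the raw data

faithful is a classic dataset that ships with R. It records eruptions of the Old Faithful geyser in Yellowstone National Park, USA, and is widely used in statistics teaching and data analysis examples.
The faithful dataset contains two variables and 272 observations.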
```
library(DT)    # provides datatable()

data <- faithful
datatable(data, rownames = FALSE)
```
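eruptions: eruption duration, a continuous numeric variable measured in minutes, ranging from 1.6 to 5.1 minutes.
waiting: waiting time between eruptions, a continuous numeric variable measured in minutes, ranging from 43 to 96 minutes.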
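
The description above can be checked directly against the data. The following is a minimal sketch in base R; the helper name `describe_var` is illustrative and not part of the original report.

```r
# faithful ships with base R (package datasets), so no extra packages are needed.
data(faithful)

# Number of observations and number of variables
dims <- dim(faithful)
print(dims)

# Hypothetical helper: minimum, mean and maximum of one numeric column
describe_var <- function(x) c(min = min(x), mean = mean(x), max = max(x))

# One column per variable, one row per statistic
summary_table <- sapply(faithful, describe_var)
print(round(summary_table, 2))
```

Both variables are numeric, so the same helper works for each column.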

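2 Univariate histogram

2.1 Plotting requirements

Draw a histogram of eruptions with geom_histogram(aes(y=..density..)), using the preset theme mytheme;
Add a rug plot to the histogram with geom_rug();
Add a kernel density curve to the histogram with geom_density();
Annotate the histogram with the kurtosis and skewness using annotate();
Add a vertical reference line at the mean with geom_vline();
Add a reference point at the median on the horizontal axis with geom_point(), and place a text annotation above the point.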
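
The skewness and kurtosis values requested in the list above can be cross-checked by hand. The sketch below computes moment-based coefficients in base R. It assumes the convention m3 / s^3 for skewness and m4 / s^4 - 3 for excess kurtosis, with s the sample standard deviation; to my understanding this corresponds to the default `type = 3` in e1071, but that correspondence is an assumption here and worth verifying against the package documentation. The variable names are illustrative.

```r
x <- faithful$eruptions

# Central moments of order 3 and 4 (divide by n, not n - 1)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

# Sample standard deviation (divides by n - 1)
s <- sd(x)

# Assumed convention: skewness = m3 / s^3, excess kurtosis = m4 / s^4 - 3
skew_manual <- m3 / s^3
kurt_manual <- m4 / s^4 - 3

print(round(c(skewness = skew_manual, kurtosis = kurt_manual), 4))
```

A negative skewness indicates a longer left tail, and a negative excess kurtosis indicates a flatter shape than the normal distribution.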

2.2 Plotting code

library(ggplot2)
library(e1071)    # provides skewness() and kurtosis()

df <- data
ggplot(data = df, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density)), fill = "lightgreen", color = "grey40", bins = 30) +    # histogram on the density scale
  geom_rug(linewidth = 0.2, color = "blue") +    # rug plot with tick width 0.2
  geom_density(color = "blue", linewidth = 0.7) +    # kernel density curve
  annotate("text", x = 2.5, y = 0.7, label = paste("Skewness =", round(skewness(df$eruptions), 4)), size = 3) +    # skewness annotation
  annotate("text", x = 2.5, y = 0.65, label = paste("Kurtosis =", round(kurtosis(df$eruptions), 4)), size = 3) +    # kurtosis annotation
  geom_vline(xintercept = mean(df$eruptions), linetype = "twodash", linewidth = 0.6, color = "red") +    # vertical reference line at the mean
  annotate("text", x = mean(df$eruptions), y = 0.7, label = paste("Mean =", round(mean(df$eruptions), 2)), size = 3) +    # label for the mean line
  annotate("point", x = median(df$eruptions), y = 0, shape = 21, size = 4, fill = "yellow") +    # a single point at the median (geom_point with constant aes would draw one point per row)
  annotate("text", x = median(df$eruptions), y = 0.1, label = "Median", size = 3, color = "red")    # label above the median point

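2.3 Observations on the figure and notes on writing the code

Insight into the distribution: the figure shows directly how the eruptions variable is distributed. The green bars of the histogram show how the observations are spread across intervals on the density scale, and the blue kernel density curve gives a smoothed picture of the same shape. The distribution is clearly bimodal, with one cluster of short eruptions around 2 minutes and one of long eruptions around 4.5 minutes, which suggests two distinct eruption patterns. The figure conveys these features far faster than a table of numbers. This showed me how valuable visualization is in the exploratory stage of an analysis: it supports a first judgment about the data and points the way for later, deeper analysis.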
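
The bimodality noted above can also be located numerically. The following is a rough sketch in base R that searches the default kernel density estimate for local maxima; the 10% height threshold used to ignore minor bumps is an arbitrary assumption, and the variable names are illustrative.

```r
x <- faithful$eruptions

# Kernel density estimate with the default Gaussian kernel and bandwidth
d <- density(x)

# An interior local maximum is where the slope changes from positive to negative
is_peak <- diff(sign(diff(d$y))) == -2
peak_idx <- which(is_peak) + 1

# Keep only pronounced peaks (at least 10% of the highest one; arbitrary threshold)
peak_idx <- peak_idx[d$y[peak_idx] > 0.1 * max(d$y)]

# Eruption durations (in minutes) at which the density peaks
modes <- d$x[peak_idx]
print(round(modes, 2))
```

The result depends on the bandwidth: a much larger bandwidth would smooth the two peaks into one, so the number of modes should be read together with the histogram.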
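Using the plotting package: figures like this are typically drawn with a powerful plotting package such as ggplot2. Working through the exercise deepened my understanding of how layers are stacked and how data are mapped to aesthetics. For example, to combine a histogram with a kernel density curve, geom_histogram() and geom_density() must be added as separate layers, and the data must be mapped correctly to the relevant aesthetics (the x variable, color, fill, and so on).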

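3 Overlaid and mirrored histograms

3.1 Plotting requirements

Draw overlaid histograms and mirrored histograms of the two variables eruptions and waiting, using the preset theme mytheme.
Convert the data to long format before drawing the overlaid histograms, and use scale_fill_brewer() to change the color scheme of the overlaid histograms to Set3.
In the mirrored histograms, eruptions goes in the positive direction and waiting in the negative direction, with bins=30 and a text label for each variable.
Both kinds of plot must be drawn for the raw data and for the standardized data. Variables can be standardized with scale(), and group-wise standardization can be done with plyr::ddply().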
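
The standardization step mentioned in the last requirement can be illustrated without any extra packages. The sketch below, in base R, compares scale() with the z-score formula and then standardizes within groups using ave(); the data frame `long` and its column names are illustrative stand-ins for the long-format data used later in the report.

```r
# Whole-column standardization: scale() returns a one-column matrix,
# so as.numeric() drops the matrix attributes.
z_scale  <- as.numeric(scale(faithful$waiting))
z_manual <- (faithful$waiting - mean(faithful$waiting)) / sd(faithful$waiting)
same <- isTRUE(all.equal(z_scale, z_manual))
print(same)

# Long-format copy of the data: one row per (variable, value) pair
long <- data.frame(
  variable = rep(c("eruptions", "waiting"), each = nrow(faithful)),
  value    = c(faithful$eruptions, faithful$waiting)
)

# Group-wise standardization: each variable gets its own mean and sd
long$z <- ave(long$value, long$variable,
              FUN = function(v) as.numeric(scale(v)))

# After standardization every group has mean 0 and sd 1
group_means <- tapply(long$z, long$variable, mean)
group_sds   <- tapply(long$z, long$variable, sd)
print(round(rbind(mean = group_means, sd = group_sds), 4))
```

Standardizing within each group is what makes the two variables comparable on one axis; standardizing the pooled values instead would leave the two groups far apart.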

3.2 Overlaid histogram code

library(tidyr)        # gather()
library(gridExtra)    # grid.arrange()

df <- data |>
    gather(eruptions, waiting, key = "指标", value = "指标值") |>    # reshape to long format
    plyr::ddply("指标", transform, 标准化值 = as.numeric(scale(指标值)))    # standardize within each variable and return a data frame

p1 <- ggplot(df) + aes(x = 指标值, y = after_stat(density), fill = 指标) +
    geom_histogram(position = "identity", color = "grey90", alpha = 0.5, bins = 30) +
    scale_fill_brewer(palette = "Set3") +
    theme(legend.position = c(0.8, 0.8),    # legend position
          legend.background = element_rect(fill = "grey90", color = "grey")) +
    ggtitle("(a) Overlaid histograms of the raw data")

p2 <- ggplot(df) + aes(x = 标准化值, y = after_stat(density), fill = 指标) +
    geom_histogram(position = "identity", color = "grey90", alpha = 0.5, bins = 30) +
    scale_fill_brewer(palette = "Set3") +
    theme(legend.position = c(0.8, 0.8),    # legend position
          legend.background = element_rect(fill = "grey90", color = "grey")) +
    ggtitle("(b) Overlaid histograms of the standardized data")

grid.arrange(p1, p2, ncol = 2)

3.3 Mirrored histogram code

library(dplyr)    # mutate()

df <- data |>
  mutate(
    std.eruptions = as.numeric(scale(eruptions)),
    std.waiting = as.numeric(scale(waiting))
    )

# Panel (a): mirrored histograms of the raw data
p1 <- ggplot(df) +
    geom_histogram(aes(x = eruptions, y = after_stat(density)), fill = "red", alpha = 0.3, bins = 30) + # eruptions histogram (above the axis)
    annotate("label", x = 20, y = 0.1, label = "eruptions", color = "red") + # text label
    geom_histogram(aes(x = waiting, y = -after_stat(density)), fill = "blue", alpha = 0.3, bins = 30) + # waiting histogram (below the axis)
    annotate("label", x = 60, y = -0.075, label = "waiting", color = "blue") +
    xlab("Value") + ggtitle("(a) Mirrored histograms of the raw data")

# Panel (b): mirrored histograms of the standardized data
p2 <- ggplot(df) +
    geom_histogram(aes(x = std.eruptions, y = after_stat(density)), fill = "red", alpha = 0.3, bins = 30) + # std.eruptions histogram (above the axis)
    annotate("label", x = -0.5, y = 0.5, label = "eruptions", color = "red") + # text label
    geom_histogram(aes(x = std.waiting, y = -after_stat(density)), fill = "blue", alpha = 0.3, bins = 30) + # std.waiting histogram (below the axis)
    annotate("label", x = -0.5, y = -0.5, label = "waiting", color = "blue") +
    xlab("Value") + ggtitle("(b) Mirrored histograms of the standardized data")

grid.arrange(p1, p2, ncol = 2)

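3.4 Observations on the figure and notes on writing the code

Comparing data features: comparing the mirrored histograms of the raw and standardized data makes the effect of standardization clear. In the raw-data panel, eruptions and waiting sit on very different scales (roughly 1.6 to 5.1 minutes versus 43 to 96 minutes). After standardization both variables are on a similar scale, so their distribution shapes can be compared directly, which highlights the role of standardization in unifying scales.
Mastering the plotting functions: when using plotting functions such as those in ggplot2, each parameter must be set accurately to get the intended figure. Examples include the fill, transparency and number of bins in geom_histogram(), as well as axis labels and titles. Repeated debugging made me more fluent with these functions.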

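4 Kernel density plots

4.1 Plotting requirements

Draw grouped, faceted and mirrored kernel density plots of the two variables eruptions and waiting.
For the grouped density plot, use geom_density(position="identity").
For the faceted density plot, use geom_density()+facet_wrap(~xx,scales="free").
In the mirrored density plot, eruptions goes in the positive direction and waiting in the negative direction, with a text label for each variable (the bins=30 setting belongs to histograms and does not apply to geom_density()).
The grouped and mirrored density plots must be drawn for both the raw data and the standardized data.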

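4.2 Grouped density plot

df <- data |>
  mutate(
    std.eruptions = as.numeric(scale(eruptions)),
    std.waiting = as.numeric(scale(waiting))
  )
# The two variables are drawn as separate layers on the raw scale,
# eruptions above the axis and waiting below it.
p2 <- ggplot(df) +
  geom_density(aes(x = eruptions, y = after_stat(density)), color = "grey50", fill = "red", alpha = 0.3) +  # eruptions density (above the axis)
  annotate("label", x = 20, y = 0.1, label = "eruptions", color = "red") +  # text label
  geom_density(aes(x = waiting, y = -after_stat(density)), fill = "blue", alpha = 0.3) +  # waiting density (below the axis)
  annotate("label", x = 60, y = -0.075, label = "waiting", color = "blue") +  # text label
  xlab("Value") + ggtitle("(b) Density curves of the raw data")
p2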
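
For comparison, a grouped density plot in the sense of the requirement in 4.1, with both variables overlaid in one panel through geom_density(position="identity"), might look like the following. This is a sketch under stated assumptions rather than the report's own code: the data frame `long` and its columns `variable`, `value` and `z` are illustrative names, and the long-format data are built with base R instead of gather().

```r
library(ggplot2)

# Long-format copy of faithful built with base R; the column names
# variable / value / z are illustrative, not the report's own.
long <- data.frame(
  variable = rep(c("eruptions", "waiting"), each = nrow(faithful)),
  value    = c(faithful$eruptions, faithful$waiting)
)

# Standardize within each variable so both have mean 0 and sd 1
long$z <- ave(long$value, long$variable,
              FUN = function(v) (v - mean(v)) / sd(v))

# Raw data: both groups share one panel, filled by group
p_raw <- ggplot(long, aes(x = value, fill = variable)) +
  geom_density(position = "identity", alpha = 0.5) +
  ggtitle("(a) Grouped density plot, raw data")

# Standardized data: same layers, standardized values on the x axis
p_std <- ggplot(long, aes(x = z, fill = variable)) +
  geom_density(position = "identity", alpha = 0.5) +
  ggtitle("(b) Grouped density plot, standardized data")
```

Printing `p_raw` and `p_std`, or passing them to gridExtra::grid.arrange(), displays the two panels. On the raw scale the two curves barely overlap because the variables have different units of magnitude, which is the reason for also drawing the standardized version.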

4.3 Faceted density plot

df <- data |>
  gather(eruptions, waiting, key = 指标, value = 指标值) |>  # reshape to long format
  plyr::ddply("指标", transform, 标准化值 = as.numeric(scale(指标值)))  # standardize within each variable


p3 <- ggplot(df) + aes(x = 标准化值, y = after_stat(density), fill = 指标) +
  geom_density(position = "identity", color = "gray60", alpha = 0.5) +
  scale_fill_brewer(palette = "Set3") +
  facet_wrap(~指标, scales = "free") +  # one panel per variable, each with its own axes
  theme(legend.position = c(0.8, 0.8),
        legend.background = element_rect(fill = "gray90", color = "grey"))
p3

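4.4 Mirrored density plot

ggplot(df) +
  aes(x = 标准化值, fill = 指标) +
  geom_density(data = subset(df, 指标 == "eruptions"), aes(y = after_stat(density)), color = "gray60", alpha = 0.5) +  # eruptions in the positive direction
  geom_density(data = subset(df, 指标 == "waiting"), aes(y = -after_stat(density)), color = "gray60", alpha = 0.5) +  # waiting in the negative direction
  scale_fill_brewer(palette = "Set3") +
  geom_hline(yintercept = 0, color = "gray50") +
  coord_flip() +  # only swaps the axes; the mirroring itself comes from negating the density
  theme(
    legend.position = c(0.8, 0.8),
    legend.background = element_rect(fill = "gray90", color = "grey")
  )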

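4.5 Observations on the figure and notes on writing the code

The "eruptions" variable: its standardized values span roughly -2 to 2, and the density is higher in some regions (for example near 0), meaning that observations are concentrated around those standardized values.
The "waiting" variable: it also spans roughly -2 to 2 and overlaps with "eruptions" over part of that range, but the two differ in density levels and in shape. In some regions the density of "waiting" rises or falls differently from that of "eruptions", which reflects the different distribution features of the two variables.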

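5 Box plots and violin plots

5.1 Plotting requirements

Draw box plots (geom_boxplot) and violin plots (geom_violin) of eruptions and waiting for both the raw data and the standardized data.
Add a mean point to the box plots and violin plots using stat_summary(fun="mean",geom="point").
The violin plots must also include a dot plot and a box plot.
Use the first two colors of the palette, brewer.pal(6,"Set2")[1:2], as the box fill colors. The full six-color palette is: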

"#66C2A5" "#FC8D62" "#8DA0CB" "#E78AC3" "#A6D854" "#FFD92F"

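5.2 Box plot code

df <- data |> select(eruptions, waiting) |>
  gather(eruptions, waiting, key = 指标, value = 指标值) |>  # reshape to long format
  plyr::ddply("指标", transform, 标准化值 = as.numeric(scale(指标值)))  # standardize within each variable

palette <- RColorBrewer::brewer.pal(6, "Set2")[1:2]         # first two colors of the discrete palette
p1 <- ggplot(df, aes(x = 指标, y = 指标值)) +
  geom_boxplot(fill = palette) +      # box plot of the raw data with the chosen fill colors
  stat_summary(fun = "mean", geom = "point", shape = 21, size = 2.5, fill = "white")  # mean point

p2 <- ggplot(df, aes(x = 指标, y = 标准化值)) +
  geom_boxplot(fill = palette) +      # box plot of the standardized data with the chosen fill colors
  stat_summary(fun = "mean", geom = "point", shape = 21, size = 2.5, fill = "white")  # mean point
gridExtra::grid.arrange(p1, p2, ncol = 2)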

5.3 Violin plot code

The data are reshaped to long format and standardized within each variable, then passed to a shared plotting function.

library(tidyverse)
library(gridExtra)

# Data preparation (dplyr and tidyr instead of plyr)
df <- data |>
  select(eruptions, waiting) |>
  pivot_longer(everything(), names_to = "指标", values_to = "指标值") |>
  group_by(指标) |>
  mutate(标准化值 = as.numeric(scale(指标值))) |>
  ungroup()

# Shared plotting function: violin + jittered points + box plot + mean point
create_plot <- function(data, y_var, title) {
  ggplot(data, aes(x = 指标, y = {{ y_var }}, fill = 指标)) +
    geom_violin(scale = "width", trim = FALSE, alpha = 0.7) +
    geom_jitter(width = 0.1, size = 0.8, alpha = 0.5) +    # dot layer required for both panels
    geom_boxplot(width = 0.2, fill = "white", outlier.size = 0.7) +
    stat_summary(fun = mean, geom = "point", shape = 21, size = 2, fill = "red") +    # red point marks the group mean
    scale_fill_brewer(palette = "Set2") +
    guides(fill = "none") +
    ggtitle(title) +
    theme_minimal()
}

# Draw the two panels
p1 <- create_plot(df, 指标值, "(a) Violin plot of the raw data")

p2 <- create_plot(df, 标准化值, "(b) Violin plot of the standardized data")

# Combine the panels
grid.arrange(p1, p2, ncol = 2, widths = c(1, 1))

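5.4 Observations on the figure and notes on writing the code

Identifying statistical features: the violin plot combines the strengths of the box plot and the kernel density plot. The box gives the median (the line inside the box) and the quartiles at a glance, the red point marks the mean, and the density outline shows how densely the data fall in each range of values. Together these help to judge whether the data are symmetric, whether there are outliers, and where the peaks of the distribution lie, which deepens the understanding of the data's statistical features.
Data preprocessing: both kinds of plot require the data to be prepared first. Reshaping to long format and standardizing within each variable are the key steps, because they put the two variables on a common scale.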
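
The quantities that the box and the mean point encode can be computed directly. The following is a minimal sketch in base R; the object name `box_stats` is illustrative.

```r
# Quartile summary plus the mean for each variable of faithful.
# quantile() labels its results "0%", "25%", "50%", "75%", "100%".
box_stats <- sapply(faithful, function(x) {
  c(quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)), mean = mean(x))
})
print(round(box_stats, 2))

# Difference between mean and median: a negative value means the mean
# lies below the median, which points to a longer left tail.
mean_minus_median <- box_stats["mean", ] - box_stats["50%", ]
print(round(mean_minus_median, 2))
```

For both variables the mean lies below the median, so the mean point sits below the median line in each box.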

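6 Wilkinson dot plots, beeswarm plots and raincloud plots

6.1 Plotting requirements

Draw Wilkinson dot plots, beeswarm plots and raincloud plots of the two variables eruptions and waiting.
All three kinds of plot use the standardized data.
The Wilkinson dot plot uses geom_dotplot(binaxis="y",bins=30,dotsize = 0.3), drawn with both centered stacking and upward stacking. Note that geom_dotplot() controls binning through binwidth rather than bins, which is why the code in 6.2 sets binwidth = 0.1.
The beeswarm plot uses geom_beeswarm(cex=0.8,shape=21,size=0.8), drawn both without and with a box plot.
The raincloud plot uses geom_violindot(dots_size=0.7,binwidth=0.07), drawn in both horizontal and vertical orientation.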
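
Since geom_dotplot() is parameterized by bin width, it can help to see what width corresponds to 30 bins. The sketch below, in base R, derives that width for the standardized waiting times and counts observations per bin. It uses plain fixed-width binning as an approximation; geom_dotplot() defaults to method = "dotdensity", which places bins adaptively, so the counts here are only indicative. All variable names are illustrative.

```r
# Standardized waiting times
z <- as.numeric(scale(faithful$waiting))

# Width that splits the observed range into 30 equal bins
binwidth_30 <- diff(range(z)) / 30
print(round(binwidth_30, 3))

# 31 equally spaced break points from min(z) to max(z) define 30 bins
breaks <- seq(min(z), max(z), length.out = 31)

# Count observations per bin; include.lowest keeps the minimum in the first bin
counts <- table(cut(z, breaks = breaks, include.lowest = TRUE))

# The tallest stack of dots under fixed-width binning
tallest <- max(counts)
print(tallest)
```

A smaller bin width gives more, shorter stacks; a larger one gives fewer, taller stacks, so binwidth and dotsize usually need to be tuned together.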

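6.2 Wilkinson dot plot code

Two versions are drawn: centered stacking and upward stacking.

mytheme <- theme_bw() + theme(legend.position = "none")
# Data preparation
df <- data |>
  gather(eruptions, waiting, key = "指标", value = "指标值") |>  # reshape to long format
  plyr::ddply("指标", transform, 标准化值 = as.numeric(scale(指标值)))  # standardize within each variable

# Base plot object
p <- ggplot(df, aes(x = 指标, y = 标准化值, fill = 指标))

# Panel (a): centered stacking
p1 <- p +
  geom_dotplot(
    binaxis = "y",         # bin along the y axis
    stackdir = "center",   # stack dots symmetrically around the center
    binwidth = 0.1,        # bin width
    dotsize = 0.3,         # dot diameter relative to binwidth
    stackratio = 0.9       # spacing between stacked dots
  ) +
  mytheme +
  ggtitle("(a) Centered stacking") +
  ylim(-3, 3)  # fixed y range for easier comparison

# Panel (b): upward stacking
p2 <- p +
  geom_dotplot(
    binaxis = "y",         # bin along the y axis
    stackdir = "up",       # stack dots in one direction
    binwidth = 0.1,        # bin width
    dotsize = 0.3,         # dot diameter relative to binwidth
    stackratio = 0.9       # spacing between stacked dots
  ) +
  mytheme +
  ggtitle("(b) Upward stacking") +
  ylim(-3, 3)  # fixed y range for easier comparison

# Combine the panels
grid.arrange(p1, p2, ncol = 2)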

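6.3 Beeswarm plot code

library(ggbeeswarm)    # provides geom_beeswarm()
mytheme <- theme_bw() + theme(legend.position = "none")

df <- data |>
  gather(eruptions, waiting, key = 指标, value = 指标值) |>  # reshape to long format
  plyr::ddply("指标", transform, 标准化值 = as.numeric(scale(指标值)))  # standardize within each variable and return a data frame

p <- ggplot(df, aes(x = 指标, y = 标准化值))
# Panel (a): beeswarm only
p1 <- p + geom_beeswarm(cex = 0.8, shape = 21, size = 0.8, aes(color = 指标)) +  # swarm spacing, point shape, size and outline color
  mytheme + ggtitle("(a) Beeswarm plot")

# Panel (b): box plot + beeswarm
p2 <- p + geom_boxplot(size = 0.5, outlier.size = 0.8, aes(color = 指标)) +
  geom_beeswarm(shape = 21, cex = 0.8, size = 0.8, aes(color = 指标)) +
  mytheme + ggtitle("(b) Box plot + beeswarm plot")

gridExtra::grid.arrange(p1, p2, ncol = 2)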

6.4 Raincloud plot code

library(see)  # provides geom_violindot() and theme_modern()
mytheme <- theme_modern() +
         theme(legend.position = "none",
               plot.title = element_text(size = 14, hjust = 0.5))
# Panel (a): vertical orientation (default)
p1 <- ggplot(df, aes(x = 指标, y = 标准化值, fill = 指标)) +
  geom_violindot(dots_size = 0.7, binwidth = 0.07) +
  mytheme +
  ggtitle("(a) Vertical orientation (default)") +
  labs(x = "指标", y = "标准化值")  # explicit axis labels


# Panel (b): horizontal orientation
p2 <- ggplot(df, aes(x = 指标, y = 标准化值, fill = 指标)) +
  geom_violindot(dots_size = 0.7, binwidth = 0.07) +
  coord_flip() +  # flip coordinates for horizontal orientation
  mytheme +
  ggtitle("(b) Horizontal orientation") +
  labs(x = "指标", y = "标准化值")


grid.arrange(p1, p2, ncol = 2)

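6.5 Observations on the figure and notes on writing the code

The raincloud plot shows the individual observations as densely packed dots and pairs them with a kernel density estimate (the filled region in the matching color). The red region and dots represent the "eruptions" variable, and the blue region and dots represent the "waiting" variable.
In the horizontal orientation the two variables are stacked one above the other, so the differences in position and shape between the standardized distributions of "waiting" and "eruptions" can be compared at a glance.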