Chapter 4 Data Distribution Visualization

Author

221527209 王琳慧

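1 Explaining the Raw Data

faithful is a classic dataset that ships with R. It records eruptions of the Old Faithful geyser in Yellowstone National Park, USA, and is widely used in statistics teaching and data analysis examples.
The faithful dataset contains two variables and 272 observations.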
```
library(DT)           # provides datatable()
data <- faithful
datatable(data, rownames = FALSE)
```
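eruptions: eruption duration, a continuous numeric variable in minutes, ranging from 1.6 to 5.1 minutes.
waiting: waiting time between two eruptions, a continuous numeric variable in minutes, ranging from 43 to 96 minutes.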
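The size and ranges quoted above can be checked directly. The following is a minimal sketch using base R only; the object names `dims` and `ranges` are introduced here for illustration and are not part of the original report.

```r
# faithful ships with base R (package "datasets"), so no extra library is needed
data <- faithful

dims <- dim(data)                # number of rows and columns
ranges <- sapply(data, range)    # row 1 = minimum, row 2 = maximum of each variable

print(dims)
print(ranges)
```

The first row of `ranges` holds the minimum and the second row the maximum of each variable, so the two columns can be read against the ranges listed for eruptions and waiting.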

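2 Single-Variable Histogram

2.1 Plotting requirements

Use geom_histogram(aes(y=..density..)) to draw a histogram of eruptions with the preset theme mytheme;
Use geom_rug() to add a rug plot to the histogram;
Use geom_density() to add a kernel density curve to the histogram;
Use annotate() to label the histogram with the kurtosis and skewness;
Use geom_vline() to add a vertical reference line at the mean;
Use geom_point() to add a reference point for the median on the horizontal axis, with a text annotation above the point.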

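2.2 Plotting code

library(ggplot2)
library(e1071)        # skewness() and kurtosis()

# mytheme is the preset theme and is assumed to be defined before this point
df <- data
# Density-scaled histogram: the vertical axis is density, not count
ggplot(data=df,aes(x=eruptions))+mytheme+
  geom_histogram(aes(y=..density..),bins=30,fill="lightgreen",color="gray50")+
  geom_rug(size=0.2,color="blue3")+     # rug plot along the horizontal axis
  geom_density(color="blue2",size=0.7)+ # kernel density curve
  annotate("text",x=2.5,y=0.7,label=paste0("Skewness = ",round(skewness(df$eruptions),4)),size=3)+
  annotate("text",x=2.5,y=0.6,label=paste0("Kurtosis = ",round(kurtosis(df$eruptions),4)),size=3)+
  geom_vline(xintercept=mean(df$eruptions),linetype="twodash",size=0.6,color="red")+  # vertical line at the mean
  annotate("text",x=mean(df$eruptions),y=0.7,label=paste0("Mean = ",round(mean(df$eruptions),2)),size=3)+
  annotate("point",x=median(df$eruptions),y=0,shape=21,size=4,fill="yellow")+  # a single point at the median
  annotate("text",x=median(df$eruptions),y=0.05,label=paste0("Median = ",round(median(df$eruptions),2)),size=3)  # text above the median point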

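2.3 Observations on the plot and notes on writing the code

The plot shows the distribution of geyser eruption duration (eruptions) with a histogram (light green) and a kernel density curve (blue). The horizontal axis covers roughly 2 to 5 minutes and the vertical axis is density. The distribution has two peaks, a smaller one near 2 minutes and a larger one near 4.5 minutes. The skewness coefficient is negative (-0.4135), so the distribution is left-skewed: most eruptions are long, with a smaller group of short ones on the left. The kurtosis coefficient of -1.5116 indicates a distribution flatter than the normal, which is consistent with the two separated peaks.

The red dashed line marks the mean (3.49) and the yellow point marks the median (4.0). The mean lies below the median, which agrees with the left skew. The rug (thin blue ticks) shows where the observations are dense, and the kernel density curve gives a smoothed view of the overall shape. Together with the summary statistics, the plot describes the central tendency, dispersion and skewness of eruption duration.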
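The direction of the skew can be verified without any add-on package. The sketch below computes the moment-based skewness by hand; the small-sample factor ((n-1)/n)^1.5 is what the e1071 documentation describes as its default (type 3), and the variable names are chosen here for illustration only.

```r
x <- faithful$eruptions
n <- length(x)

# Central moments of order 2 and 3
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)

# Moment skewness, then the small-sample adjustment
# (assumed to match the default "type 3" of e1071::skewness())
g1 <- m3 / m2^1.5
skew_type3 <- g1 * ((n - 1) / n)^1.5

# A mean below the median and a negative coefficient both indicate left skew
c(mean = mean(x), median = median(x), skewness = skew_type3)
```

Comparing the sign of `skew_type3` with the order of the mean and the median is a quick consistency check before interpreting the annotated values on the plot.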

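3 Overlaid and Mirrored Histograms

3.1 Plotting requirements

Draw overlaid and mirrored histograms of the two variables eruptions and waiting, using the preset theme mytheme.
Convert the data to long format before drawing the overlaid histogram, and use scale_fill_brewer() to change its color scheme to Set3.
In the mirrored histogram, put eruptions in the positive direction and waiting in the negative direction, use bins=30, and add text labels.
Both plots must be drawn for the raw data and for the standardized data. scale() can be used to standardize a variable, and plyr::ddply() can be used to standardize within groups.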
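As an illustration of what standardizing within groups means, here is a base-R sketch that does not rely on plyr or tidyr. It uses stack() and ave() in place of gather() and ddply(); the column names `value`, `variable` and `z` are assumptions made for this example.

```r
# Long format: one row per observation and variable (base R stack())
long <- stack(faithful)
names(long) <- c("value", "variable")

# Standardize within each variable: (x - mean) / sd
long$z <- ave(long$value, long$variable,
              FUN = function(v) (v - mean(v)) / sd(v))

# After standardization each group should have mean 0 and sd 1
group_mean <- tapply(long$z, long$variable, mean)
group_sd   <- tapply(long$z, long$variable, sd)
round(rbind(mean = group_mean, sd = group_sd), 6)
```

Standardizing the pooled column instead of each group separately would leave eruptions and waiting on very different scales, which is why the grouping step matters here.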

3.2 Overlaid histogram code

library(tidyr)        # gather()
library(plyr)         # ddply()
library(ggplot2)
library(gridExtra)    # grid.arrange()

df <- data |> 
  gather(eruptions,waiting,key=指标,value=指标值)|>
  ddply("指标",transform,标准化值=as.numeric(scale(指标值))) # standardize within each variable; returns a data frame

p1<-ggplot(df)+aes(x=指标值,y=..density..,fill=指标)+
  geom_histogram(position="identity",bins=30,color="gray60",alpha=0.5)+
  scale_fill_brewer(palette = "Set3")+
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+ # legend background and border color
  ggtitle("(a) Overlaid histogram of raw data")

p2<-ggplot(df)+aes(x=标准化值,y=..density..,fill=指标)+
  geom_histogram(position="identity",bins=30,color="gray60",alpha=0.5)+
  scale_fill_brewer(palette = "Set3")+   # palette names are case-sensitive
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+ # legend background and border color
  ggtitle("(b) Overlaid histogram of standardized data")

grid.arrange(p1,p2,ncol=2)

3.3 Mirrored histogram code

df <- data |>
  dplyr::mutate(
    std.eruptions=as.numeric(scale(eruptions)),   # as.numeric() drops the matrix returned by scale()
    std.waiting=as.numeric(scale(waiting))
  )

p1<-ggplot(df)+
   geom_histogram(aes(x=eruptions,y=..density..),bins=30,color="grey50",fill="red",alpha=0.3)+ # eruptions, upper half
   annotate("label",x=30,y=0.2,label="eruptions",color="red")+  # a single label rather than one per row
   geom_histogram(aes(x=waiting,y=-..density..),bins=30,color="grey50",fill="blue",alpha=0.3)+ # waiting, lower half
   annotate("label",x=60,y=-0.1,label="waiting",color="blue")+
   xlab("Value")+ggtitle("(a) Mirrored histogram of raw data")

p2<-ggplot(df)+
   geom_histogram(aes(x=std.eruptions,y=..density..),bins=30,color="grey50",fill="red",alpha=0.3)+ # std.eruptions, upper half
   annotate("label",x=0.5,y=0.3,label="eruptions",color="red")+
   geom_histogram(aes(x=std.waiting,y=-..density..),bins=30,color="grey50",fill="blue",alpha=0.3)+ # std.waiting, lower half
   annotate("label",x=0.5,y=-0.3,label="waiting",color="blue")+
   xlab("Value")+ggtitle("(b) Mirrored histogram of standardized data")

grid.arrange(p1,p2,ncol=2)

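3.4 Observations on the plot and notes on writing the code

An overlaid histogram gives a direct comparison of the distribution shapes of different groups (peaks, skewness, differences in range), while a mirrored histogram separates two distributions clearly (for example a treatment group versus a control group). When writing the code, note that both kinds of plot must be drawn twice, once for the raw data and once for the standardized data, so the standardized values have to be computed first.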

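4 Kernel Density Plots

4.1 Plotting requirements

Draw grouped, faceted and mirrored kernel density plots of the two variables eruptions and waiting.
For the grouped kernel density plot, use geom_density(position="identity").
For the faceted kernel density plot, use geom_density()+facet_wrap(~xx,scales="free").
In the mirrored kernel density plot, put eruptions in the positive direction and waiting in the negative direction, and add text labels.
The grouped and mirrored kernel density plots must be drawn for both the raw data and the standardized data.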

4.2 Grouped kernel density plot

df <- data |> 
  gather(eruptions,waiting,key=指标,value=指标值)|>
  ddply("指标",transform,标准化值=as.numeric(scale(指标值))) # standardize within each variable; returns a data frame

# geom_density() has no bins argument, so none is passed
p1<-ggplot(df)+aes(x=指标值,y=..density..,fill=指标)+
  geom_density(position="identity",color="gray60",alpha=0.5)+
  scale_fill_brewer(palette = "Set3")+
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+ # legend background and border color
  ggtitle("(a) Kernel density plot of raw data")

p2<-ggplot(df)+aes(x=标准化值,y=..density..,fill=指标)+
  geom_density(position="identity",color="gray60",alpha=0.5)+
  scale_fill_brewer(palette = "Set3")+   # palette names are case-sensitive
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+ # legend background and border color
  ggtitle("(b) Kernel density plot of standardized data")

grid.arrange(p1,p2,ncol=2)

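4.3 Faceted kernel density plot

ggplot(df)+aes(x=指标值,fill=指标)+
  geom_density(color="gray60")+
  scale_fill_brewer(palette = "Set3")+
  guides(fill="none")+                    # the facet strips already name the variables
  facet_wrap(~指标,ncol=2,scales="free")  # free scales: each panel gets its own axes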

4.4 Mirrored kernel density plot

df <- data |>
  dplyr::mutate(
    std.eruptions=as.numeric(scale(eruptions)),   # as.numeric() drops the matrix returned by scale()
    std.waiting=as.numeric(scale(waiting))
  )

p1<-ggplot(df)+
   geom_density(aes(x=eruptions,y=..density..),color="grey50",fill="red",alpha=0.3)+ # eruptions density, upper half
   annotate("label",x=30,y=0.2,label="eruptions",color="red")+  # a single label rather than one per row
   geom_density(aes(x=waiting,y=-..density..),color="grey50",fill="blue",alpha=0.3)+ # waiting density, lower half
   annotate("label",x=60,y=-0.1,label="waiting",color="blue")+
   xlab("Value")+ggtitle("(a) Mirrored kernel density plot of raw data")

p2<-ggplot(df)+
   geom_density(aes(x=std.eruptions,y=..density..),color="grey50",fill="red",alpha=0.3)+ # std.eruptions density, upper half
   annotate("label",x=0.5,y=0.3,label="eruptions",color="red")+
   geom_density(aes(x=std.waiting,y=-..density..),color="grey50",fill="blue",alpha=0.3)+ # std.waiting density, lower half
   annotate("label",x=0.5,y=-0.3,label="waiting",color="blue")+
   xlab("Value")+ggtitle("(b) Mirrored kernel density plot of standardized data")

grid.arrange(p1,p2,ncol=2)

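4.5 Observations on the plot and notes on writing the code

A grouped kernel density plot gives a direct comparison of the distribution shapes of several groups (peaks, skewness, multimodality) without the dependence on bin choice that a histogram has. A faceted kernel density plot shows each group in its own panel, which stays readable with many groups (especially five or more). A mirrored kernel density plot suits a comparison of two groups (for example treatment versus control): drawing one curve above the axis and the other below it avoids overlap and makes differences easy to see.

In the mirrored plots here, eruptions is drawn in the positive direction and waiting in the negative direction, for the raw data in panel (a) and for the standardized data in panel (b). In panel (a) the two variables share one horizontal axis even though eruptions spans about 1.6 to 5.1 minutes and waiting about 43 to 96 minutes, so the eruptions curve is squeezed into a narrow band at the left. After standardization, in panel (b), both variables have mean 0 and standard deviation 1, with most values between -2 and 2, so the two distributions can be compared directly on one scale.

Standardization is a linear rescaling. It changes the range of the values but not the shape of the distribution, so skewness and the number of peaks are unchanged: eruptions remains left-skewed (skewness coefficient -0.4135), and both variables keep their two peaks. The mirrored layout makes it easy to compare the positions and relative heights of these peaks between the two variables.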
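The statement that standardization changes the scale but not the shape of a kernel density estimate can be checked numerically. The sketch below uses base R density() with its default bandwidth; the helper `area()` and the other names are introduced only for this example, and the comparison tolerance is an assumption chosen to absorb floating-point differences.

```r
x <- faithful$eruptions
m <- mean(x)
s <- sd(x)
z <- (x - m) / s                 # standardized values

d_raw <- density(x)              # default Gaussian kernel and bandwidth
d_std <- density(z)

# Riemann-sum approximation of the area under an estimated density
area <- function(d) sum(d$y) * diff(d$x[1:2])
areas <- c(raw = area(d_raw), standardized = area(d_std))

# If only the scale changed, the standardized curve is the raw curve
# with heights multiplied by s (and the x grid shifted and divided by s)
same_shape <- isTRUE(all.equal(d_std$y, d_raw$y * s, tolerance = 1e-4))

areas
same_shape
```

Both areas should be close to 1, and `same_shape` indicates whether the two curves coincide after rescaling. The check depends on the default bandwidth rule scaling with the data, so a fixed user-supplied bandwidth would not give the same result.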

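5 Box Plots and Violin Plots

5.1 Plotting requirements

Draw box plots (geom_boxplot) and violin plots (geom_violin) of the two variables eruptions and waiting, for the raw data and for the standardized data.
Use stat_summary(fun="mean",geom="point") to add a point for the mean to the box plots and the violin plots.
Add a dot plot and a box plot inside the violin plots.
Use the first two colors of the palette, brewer.pal(6,"Set2")[1:2], as the fill colors of the boxes.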

"#66C2A5" "#FC8D62" "#8DA0CB" "#E78AC3" "#A6D854" "#FFD92F"

5.2 Box plot code

mytheme<-theme(plot.title=element_text(size=11),  # title font size
   axis.title=element_text(size=10),               # axis label font size
   axis.text=element_text(size=9),                 # axis tick font size
   legend.text=element_text(size=8))               # legend font size

# Data processing
df<-data |> 
  gather(eruptions,waiting,key=指标,value=指标值) |> 
  ddply("指标",transform,标准化值=as.numeric(scale(指标值))) # standardize within each variable; returns a data frame

palette<-RColorBrewer::brewer.pal(6,"Set2")[1:2] # first two colors of Set2
 
# Box plots
p1<-ggplot(df,aes(x=指标,y=log10(指标值)))+     # log10 of the raw values, so both variables fit on one axis
  geom_boxplot(fill=palette,outlier.size=0.8)+  # fill colors and outlier size
  scale_x_discrete(guide=guide_axis(n.dodge=2))+ # x-axis labels on two rows
  stat_summary(fun="mean",geom="point",shape=21,size=2.5,fill="white")+ # mean point
  mytheme+ylab("log10 value")+ggtitle("(a) Log transformation")

p2<-ggplot(df,aes(x=指标,y=标准化值))+
  geom_boxplot(fill=palette,outlier.size=0.8)+  # fill colors and outlier size
  scale_x_discrete(guide=guide_axis(n.dodge=2))+
  stat_summary(fun="mean",geom="point",shape=21,size=2.5,fill="white")+ # mean point
  mytheme+ggtitle("(b) Standardization")

grid.arrange(p1,p2,ncol=2)        # combine the two plots

5.3 Violin plot code

The data are first reshaped to long format and standardized within each variable, then used as the plotting input.

# Data processing
df<-data |> 
  gather(eruptions,waiting,key=指标,value=指标值) |> 
  ddply("指标",transform,标准化值=as.numeric(scale(指标值))) # standardize within each variable
df$指标<-factor(df$指标)   # a factor, so that as.numeric(指标) below gives 1 and 2 rather than NA

# Plot theme
mytheme<-theme(plot.title=element_text(size=11),  # title font size
   axis.title=element_text(size=10),               # axis label font size
   axis.text=element_text(size=9),                 # axis tick font size
   legend.text=element_text(size=8))               # legend font size

# Panel (a): violin plot of the raw data
p1<-ggplot(df,aes(x=指标,y=指标值,fill=指标))+
     geom_violin(scale="width",trim=FALSE)+
     geom_point(color="black",size=0.8)+  # raw data points
     geom_boxplot(outlier.size=0.7,outlier.color="white",size=0.3,
               width=0.2,fill="white")+  # box plot inside the violin
     scale_fill_brewer(palette="Set2")+  # two groups, so the first two Set2 colors are used
     guides(fill="none")+
     stat_summary(fun="mean",geom="point",shape=21,size=2.5,fill="white")+ # mean point
     geom_dotplot(aes(x=as.numeric(指标)-0.08,group=指标),
               width=0.5,binaxis="y",binwidth=2.5,stackdir="down")+
     mytheme+ggtitle("(a) Violin plot of raw data")

# Panel (b): violin plot of the standardized data
p2<-ggplot(df,aes(x=指标,y=标准化值,fill=指标))+
     geom_violin(scale="width")+
     geom_point(color="black",size=1)+
     geom_boxplot(outlier.size=0.7,outlier.color="black",size=0.3,
          width=0.2,fill="white")+
     scale_fill_brewer(palette="Set2")+
     guides(fill="none")+
     stat_summary(fun="mean",geom="point",shape=21,size=2.5,fill="white")+
     geom_dotplot(aes(x=as.numeric(指标)-0.08,group=指标),
               width=0.5,binaxis="y",binwidth=0.1,stackdir="down")+ # smaller binwidth: standardized values span only about -2 to 2
     mytheme+ggtitle("(b) Violin plot of standardized data")

grid.arrange(p1,p2,ncol=2)        # combine p1 and p2

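5.4 Observations on the plot and notes on writing the code

A box plot gives a compact view of the median, the interquartile range and extreme values, but it cannot show the actual shape of the distribution. A violin plot carries more information and reveals multiple peaks, skewness or clustering, at the cost of a more complex figure. When writing the violin plot code, a box plot, a mean marker or a dot plot should be overlaid on the violin.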
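The quantities a box plot displays can be computed by hand, which helps when reading the figure. The sketch below uses base R only and the usual 1.5 x IQR rule for the whisker limits; this is assumed to correspond to the default behaviour of geom_boxplot(), and the names `q`, `fences` and `outliers` are illustrative.

```r
x <- faithful$waiting

# Quartiles (default quantile type 7) and interquartile range
q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Points beyond 1.5 * IQR from the box are drawn as outliers
fences <- c(lower = q[1] - 1.5 * iqr, upper = q[3] + 1.5 * iqr)
outliers <- x[x < fences["lower"] | x > fences["upper"]]

list(quartiles = q, mean = mean(x), fences = fences,
     n_outliers = length(outliers))
```

The mean is reported next to the quartiles because the box plots in this section mark it with a separate point, and its distance from the median is a rough indicator of skewness.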

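6 Wilkinson Dot Plots, Beeswarm Plots and Raincloud Plots

6.1 Plotting requirements

Draw Wilkinson dot plots, beeswarm plots and raincloud plots of the two variables eruptions and waiting.
All three kinds of plot use the standardized data.
For the Wilkinson dot plot, use geom_dotplot(binaxis="y",bins=30,dotsize = 0.3), and draw both the centered-stacking and the upward-stacking versions.
For the beeswarm plot, use geom_beeswarm(cex=0.8,shape=21,size=0.8), and draw it both without and with a box plot.
For the raincloud plot, use geom_violindot(dots_size=0.7,binwidth=0.07), and draw both the horizontal and the vertical versions.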
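geom_dotplot() is controlled by binwidth rather than by a number of bins; according to its documentation the default bin width is 1/30 of the data range. The base-R sketch below shows what that default amounts to for the standardized eruptions values. The fixed-width binning with cut() is a simplification for illustration and is not the dot-density algorithm that geom_dotplot() uses by default.

```r
z <- as.numeric(scale(faithful$eruptions))   # standardized values

# Default bin width documented for geom_dotplot(): range / 30
default_binwidth <- diff(range(z)) / 30

# Fixed-width binning into 30 intervals, for illustration only
breaks <- seq(min(z), max(z), length.out = 31)
counts <- table(cut(z, breaks = breaks, include.lowest = TRUE))

default_binwidth
max(counts)      # height of the tallest stack under this simplified binning
```

A smaller bin width gives more, shorter stacks; the dotsize argument then scales the drawn dots relative to that bin width.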

6.2 Wilkinson dot plot code

Two versions are drawn: centered stacking and upward stacking.

library(ggplot2)
mytheme<-theme_bw()+theme(legend.position="none")

# Data processing
df<-data |> 
  gather(eruptions,waiting,key=指标,value=指标值) |> 
  ddply("指标",transform,标准化值=as.numeric(scale(指标值))) # standardize within each variable; returns a data frame

# geom_dotplot() has no bins argument; its default binwidth is 1/30 of the data range
p<-ggplot(df,aes(x=指标,y=标准化值,fill=指标))
p1<-p+geom_dotplot(binaxis="y",dotsize=0.3,stackdir="center")+ # dots stacked symmetrically around each group
  mytheme+ggtitle("(a) Centered stacking")

p2<-p+geom_dotplot(binaxis="y",dotsize=0.3,stackdir="up")+     # dots stacked to one side (the default)
  mytheme+ggtitle("(b) Upward stacking")

grid.arrange(p1,p2,ncol=2)        # combine p1 and p2

6.3 Beeswarm plot code

mytheme<-theme_bw()+theme(legend.position="none")

library(ggbeeswarm)
# Data processing
df<-data |> 
  gather(eruptions,waiting,key=指标,value=指标值) |> 
  ddply("指标",transform,标准化值=as.numeric(scale(指标值))) # standardize within each variable; returns a data frame

# Panel (a): beeswarm plot of the two variables
p<-ggplot(df,aes(x=指标,y=标准化值))
p1<-p+geom_beeswarm(cex=0.8,shape=21,fill="black",size=0.8,aes(color=指标))+ # swarm spacing, point shape, size and fill
mytheme+ggtitle("(a) Beeswarm plot")

# Panel (b): box plot plus beeswarm plot
p2<-p+geom_boxplot(size=0.5,outlier.size=0.8,aes(color=指标))+
geom_beeswarm(shape=21,cex=0.8,size=0.8,aes(color=指标))+
mytheme+ggtitle("(b) Box plot + beeswarm plot")

grid.arrange(p1,p2)

6.4 Raincloud plot code

library(see)  # provides geom_violindot() and theme_modern()
mytheme<-theme_modern()+
         theme(legend.position="none",
               plot.title=element_text(size=14,hjust=0.5))   # center the title

# Data processing
df<-data |> 
  gather(eruptions,waiting,key=指标,value=指标值) |> 
  ddply("指标",transform,标准化值=as.numeric(scale(指标值))) # standardize within each variable; returns a data frame

p1<-ggplot(df,aes(x=指标,y=标准化值,fill=指标))+
  geom_violindot(dots_size=0.7,binwidth=0.07)+ # raincloud plot; dot size and bin width
  mytheme+ggtitle("(a) Vertical layout (default)")

p2<-ggplot(df,aes(x=指标,y=标准化值,fill=指标))+
  geom_violindot(dots_size=0.7,binwidth=0.07)+
  coord_flip()+mytheme+ggtitle("(b) Horizontal layout")

grid.arrange(p1,p2,ncol=2)        # combine p1 and p2 in two columns

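6.5 Observations on the plot and notes on writing the code

These plots reveal the features of the data from several angles through different forms (dots, boxes, density curves). The Wilkinson dot plot stacks individual observations into a histogram-like shape, the beeswarm plot emphasizes the local density of individual points, and the raincloud plot combines a density summary with the individual points. Standardization puts both variables on one scale so that their shapes can be compared directly. Writing the code requires coordinating several geoms and fine-tuning their parameters. When writing the code for these three plots, it helps to reuse the data processing and the box plot code from the previous section.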