Chapter 4: Visualizing Data Distributions

Author

221527123丁成裕

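1 Describing the raw data

faithful is a classic dataset built into R. It records eruptions of the Old Faithful geyser in Yellowstone National Park, USA, and is widely used in statistics teaching and data-analysis examples.
The faithful dataset contains two variables and 272 observations.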
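```
library(DT)                         # provides datatable()
data <- faithful                    # built-in Old Faithful dataset
datatable(data, rownames = FALSE)   # interactive table without row names
```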
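eruptions: eruption duration, a continuous numeric variable in minutes, ranging from 1.6 to 5.1 minutes.
waiting: waiting time until the next eruption, a continuous numeric variable in minutes, ranging from 43 to 96 minutes.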
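The dimensions and ranges quoted above can be confirmed by inspecting the built-in data frame directly. The following is a minimal sketch in base R; the names `dims` and `ranges` are illustrative and do not appear in the original report.

```r
# Inspect the built-in faithful data frame (base R only)
data <- faithful

dims <- dim(data)                 # number of rows and columns
ranges <- sapply(data, range)     # row 1 = minimum, row 2 = maximum

print(dims)
print(ranges)
str(data)                         # both columns are numeric
```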

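2 Single-variable histogram

2.1 Plotting requirements

Use geom_histogram(aes(y=..density..)) to draw a histogram of eruptions with the preset theme mytheme;
use geom_rug() to add a rug plot to the histogram;
use geom_density() to add a kernel density curve to the histogram;
use annotate() to label the histogram with the kurtosis and skewness;
use geom_vline() to add a vertical reference line at the mean;
use geom_point() to add a reference point for the median on the horizontal axis, with a text annotation above the point.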
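Mapping y to the density rather than the count makes the bar areas sum to 1, which is what lets the kernel density curve share the same axis as the histogram. The base R sketch below checks that property; it uses hist() instead of ggplot2, and `breaks = 30` is only a suggestion that hist() rounds to "pretty" break points, so the bins need not match those of geom_histogram(). The variable names are illustrative.

```r
x <- faithful$eruptions

# Equal-width bins; breaks = 30 is a suggestion, not an exact bin count
h <- hist(x, breaks = 30, plot = FALSE)

# density = count / (n * bin width), so the bar areas sum to 1
bar_area <- sum(h$density * diff(h$breaks))
manual_density <- h$counts / (length(x) * diff(h$breaks))

print(bar_area)
print(all.equal(h$density, manual_density))
```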

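2.2 Plotting code

library(ggplot2)      # plotting
library(e1071)        # skewness() and kurtosis()

mytheme <- theme_bw()+theme(legend.position="none")   # preset theme (same definition as in section 6.2)
df <- data
# Histogram of eruptions on the density scale; after_stat(density) replaces the deprecated ..density..
ggplot(data=df,aes(x=eruptions))+mytheme+
  geom_histogram(aes(y=after_stat(density)),fill="lightgreen",color="gray50")+
  geom_rug(size=0.2,color="blue3")+    # rug plot with line width 0.2
  annotate("text",x=3,y=0.6,label=paste0("Skewness = ",round(skewness(df$eruptions),4)),size=3)+  # skewness label
  annotate("text",x=3,y=0.7,label=paste0("Kurtosis = ",round(kurtosis(df$eruptions),4)),size=3)+  # kurtosis label
  geom_point(data=data.frame(x=median(df$eruptions),y=0),aes(x=x,y=y),
             shape=21,size=4,fill="yellow")+   # a single median point on the x-axis
  annotate("text",x=median(df$eruptions),y=0.1,label="Median",size=3,color="red3")+ # label above the median point
  geom_density(color="blue2",size=0.7)+ # kernel density curve
  geom_vline(xintercept=mean(df$eruptions),linetype="twodash",size=0.6,color="red")+ # vertical line at the mean
  annotate("text",x=3.5,y=0.2,label="Mean line",size=3)+
  ggtitle("Histogram of eruptions with rug, density curve, mean line and median point")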

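2.3 Observations on the plot and notes on writing the code

Observations on the plot. Distribution shape:
- The skewness is -0.4135, indicating a slightly left-skewed distribution.
- The kurtosis is -1.5116 (excess kurtosis), so the distribution is flatter than a normal distribution (platykurtic).
- The kernel density curve has two peaks, which suggests that the data may be bimodal.
Central tendency:
- The mean line is at 3.49.
- The median marker is clearly visible in the plot.
- The mean lies to the left of the median, which is consistent with the negative skewness.
Visual elements:
- Combining the histogram with the kernel density curve shows both the binned distribution and a smoothed trend.
- The rug plot marks the actual data points along the x-axis.
- The annotated statistics are clearly labeled and easy to read.

Notes on the code: this code shows how the ggplot2 package can combine several layers into one distribution plot:
- the basic distribution (histogram)
- data detail (rug plot)
- a smoothed trend (kernel density estimate)
- statistical reference marks (mean, median)
- labels for the key statistics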
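The skewness and kurtosis quoted in the observations can be reproduced without e1071. The sketch below assumes the package defaults (`type = 3`), which to my understanding correspond to the third and fourth central moments divided by powers of the sample standard deviation; treat that equivalence as an assumption and compare against `e1071::skewness()` and `e1071::kurtosis()` if exact agreement matters.

```r
x <- faithful$eruptions

# Central moments and the sample standard deviation
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)
s  <- sd(x)                      # n - 1 denominator

skew <- m3 / s^3                 # assumed equivalent of e1071::skewness(x, type = 3)
kurt <- m4 / s^4 - 3             # assumed equivalent of e1071::kurtosis(x, type = 3), excess kurtosis

print(round(c(skewness = skew, kurtosis = kurt), 4))
print(c(mean = mean(x), median = median(x)))   # mean below median: left skew
```

A negative excess kurtosis here reflects the two separated clusters of eruption durations rather than a single flat peak.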

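3 Overlaid histograms and mirrored histograms

3.1 Plotting requirements

Draw overlaid and mirrored histograms of the two variables eruptions and waiting, using the preset theme mytheme.
Convert the data to long format before drawing the overlaid histogram, and use scale_fill_brewer() to change its color scheme to Set3.
In the mirrored histogram, eruptions goes in the positive direction and waiting in the negative direction, with bins=30 and a text label identifying each variable.
Both plots must be drawn for the raw data and for the standardized data. scale() can be used to standardize a variable, and plyr::ddply() can be used to standardize within groups.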
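The reshape-then-standardize step described above can be illustrated in base R. This sketch swaps in stack() and ave() for the gather() and ddply() calls named in the requirements; the column names `value`, `variable` and `std_value` are illustrative.

```r
# Reshape to long format: one row per (variable, value) pair
long <- stack(faithful)                  # columns: values, ind
names(long) <- c("value", "variable")

# Standardize within each variable: (x - mean) / sd
long$std_value <- ave(long$value, long$variable,
                      FUN = function(v) as.numeric(scale(v)))

# Each group should now have mean 0 and standard deviation 1
group_means <- tapply(long$std_value, long$variable, mean)
group_sds   <- tapply(long$std_value, long$variable, sd)
print(round(group_means, 10))
print(group_sds)
```

Standardizing within groups matters here: calling scale() on the pooled long column would mix minutes of eruption with minutes of waiting and leave the two variables on different effective scales.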

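3.2 Overlaid histogram code

library(tidyr)        # gather()
library(plyr)         # ddply()

# Column names: 指标 = variable, 指标值 = value, 标准化值 = standardized value
df <- data |>
  gather(eruptions,waiting,key=指标,value=指标值)|>   # reshape to long format
  ddply("指标",transform,标准化值 = scale(指标值))      # standardize within each variable

# (a) Overlaid histogram, raw data
p1<-ggplot(df)+aes(x=指标值,y=after_stat(density),fill=指标)+
  geom_histogram(position="identity",color="gray60",alpha=0.5)+
  scale_fill_brewer(palette = 'Set3')+
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+   # legend background and border colors
  ggtitle("(a) Overlaid histogram, raw data")

# (b) Overlaid histogram, standardized data
p2<-ggplot(df)+aes(x=标准化值,y=after_stat(density),fill=指标)+
  geom_histogram(position="identity",color="gray60",alpha=0.5)+
  scale_fill_brewer(palette = 'Set3')+
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+   # legend background and border colors
  ggtitle("(b) Overlaid histogram, standardized data")

gridExtra::grid.arrange(p1,p2,ncol=2)        # arrange the two plots side by side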

3.3 Mirrored histogram code

library(dplyr)        # mutate()

df <- data |>
  mutate(std.eruptions = as.numeric(scale(eruptions)),
         std.waiting = as.numeric(scale(waiting)))   # as.numeric() drops the matrix returned by scale()

# (a) Mirrored histogram, raw data
p1<-ggplot(df)+
   geom_histogram(aes(x=eruptions,y=after_stat(density)),color="grey50",fill="red",bins = 30,alpha=0.3)+ # eruptions, upper half
   annotate("label",x=50,y=0.1,label="eruptions",color="red")+  # label for the upper half
   geom_histogram(aes(x=waiting,y=-after_stat(density)),color="grey50",fill="blue",bins = 30,alpha=0.3)+ # waiting, lower half
   annotate("label",x=50,y=-0.1,label="waiting",color="blue")+  # label for the lower half
   xlab("Value")+ggtitle("(a) Mirrored histogram, raw data")

# (b) Mirrored histogram, standardized data
p2<-ggplot(df)+
   geom_histogram(aes(x=std.eruptions,y=after_stat(density)),color="grey50",fill="red",bins = 30,alpha=0.3)+ # eruptions, upper half
   annotate("label",x=-1,y=0.1,label="eruptions",color="red")+  # label for the upper half
   geom_histogram(aes(x=std.waiting,y=-after_stat(density)),color="grey50",fill="blue",bins = 30,alpha=0.3)+ # waiting, lower half
   annotate("label",x=-1,y=-0.1,label="waiting",color="blue")+  # label for the lower half
   xlab("Standardized value")+ggtitle("(b) Mirrored histogram, standardized data")

gridExtra::grid.arrange(p1,p2,ncol=2)        # arrange the two plots side by side

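3.4 Observations on the plots and notes on writing the code

Observations on the plots. Raw data (panel a):
- eruptions (red) is slightly left-skewed (skewness -0.4135), with values between 1.6 and 5.1 minutes
- waiting (blue) is clearly bimodal, with peaks at roughly 55 and 80 minutes
- the two variables are on very different scales, which makes a direct comparison difficult
Standardized data (panel b):
- after standardization the two variables are on the same scale and can be compared
- the standardized distribution of eruptions keeps its left skew, because standardization does not change the shape
- the bimodality of waiting is easier to see
- differences in the shapes of the two distributions can be compared more directly

Notes on the code: the mirrored histogram is well suited to:
1. comparing the distribution shapes of two variables measured on different scales
2. observing how a variable changes before and after standardization
3. spotting skewness and multimodality in the data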
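Standardization is a linear rescaling, so it changes the axis but not the shape of a distribution. The sketch below illustrates this by computing a moment-based skewness before and after scale(); the helper `skew()` is a hypothetical function written for this example, not part of the report.

```r
# Moment-based skewness: third central moment over sd cubed
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw <- sapply(faithful, skew)
std <- sapply(faithful, function(x) skew(as.numeric(scale(x))))

# The two rows should agree up to floating-point error
print(round(rbind(raw = raw, standardized = std), 4))
```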

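4 Kernel density plots

4.1 Plotting requirements

Draw grouped, faceted and mirrored kernel density plots of the two variables eruptions and waiting.
For the grouped kernel density plot, use geom_density(position="identity").
For the faceted kernel density plot, use geom_density()+facet_wrap(~xx,scale="free").
In the mirrored kernel density plot, eruptions goes in the positive direction and waiting in the negative direction, with a text label identifying each variable (the stated bins=30 setting applies to histograms, not to density curves).
The grouped and mirrored kernel density plots must be drawn for both the raw data and the standardized data.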
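What a kernel density layer draws can be examined with base R's density(), which to my knowledge uses the same default kernel (Gaussian) and bandwidth rule (`nrd0`) as geom_density(). The sketch below checks that the estimate integrates to about 1 and shows that the mirrored version is simply the negated curve; the variable names are illustrative.

```r
x <- faithful$waiting

# Gaussian kernel density estimate with the default bandwidth
d <- density(x)

# Trapezoidal rule: the area under the curve should be close to 1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Mirroring a curve just means negating its density values
mirrored_y <- -d$y

print(d$bw)      # bandwidth chosen by the default rule
print(area)
```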

4.2 Grouped kernel density plot

df <- data |>
  gather(eruptions,waiting,key=指标,value=指标值) |>    # reshape to long format
  ddply("指标",transform,标准化值=scale(指标值))         # standardize within each variable

# (a) Grouped density plot, raw data
p1<-ggplot(df)+aes(x=指标值,y=after_stat(density),fill=指标)+
  geom_density(position="identity",color="gray60",alpha=0.5)+
  scale_fill_brewer(palette = "Set3")+   # palette names are case-sensitive
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+   # legend background and border colors
  ggtitle("(a) Grouped density plot, raw data")

# (b) Grouped density plot, standardized data
p2<-ggplot(df)+aes(x=标准化值,y=after_stat(density),fill=指标)+
  geom_density(position="identity",color="gray60",alpha=0.5)+
  scale_fill_brewer(palette = "Set3")+
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+   # legend background and border colors
  ggtitle("(b) Grouped density plot, standardized data")

gridExtra::grid.arrange(p1,p2,ncol=2)        # arrange the two plots side by side

4.3 Faceted kernel density plot

df <- data |>
  gather(eruptions,waiting,key=指标,value=指标值)|>   # reshape to long format
  ddply("指标",transform,标准化值 = scale(指标值))      # standardize within each variable

# (a) Faceted density plot, raw data
p1<-ggplot(df)+aes(x=指标值,y=after_stat(density),fill=指标)+
  geom_density(position="identity",color="gray60",alpha=0.5)+
  facet_wrap(~指标,scales="free")+   # one panel per variable, each with its own axes
  scale_fill_brewer(palette = 'Set3')+
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+   # legend background and border colors
  ggtitle("(a) Faceted density plot, raw data")

# (b) Faceted density plot, standardized data
p2<-ggplot(df)+aes(x=标准化值,y=after_stat(density),fill=指标)+
  geom_density(position="identity",color="gray60",alpha=0.5)+
  facet_wrap(~指标,scales="free")+
  scale_fill_brewer(palette = 'Set3')+
  theme(legend.position=c(0.8,0.8),   # legend position
       legend.background=element_rect(fill="grey90",color="grey"))+   # legend background and border colors
  ggtitle("(b) Faceted density plot, standardized data")

gridExtra::grid.arrange(p1,p2,ncol=2)        # arrange the two plots side by side

4.4 Mirrored kernel density plot

df <- data |>
  mutate(
    std.eruptions=as.numeric(scale(eruptions)),   # as.numeric() drops the matrix returned by scale()
    std.waiting=as.numeric(scale(waiting))
  )

# (a) Mirrored density plot, raw data (geom_density() has no bins argument, so it is omitted)
p1<-ggplot(df)+
   geom_density(aes(x=eruptions,y=after_stat(density)),color="grey50",fill="red",alpha=0.3)+ # eruptions, upper half
   annotate("label",x=30,y=0.2,label="eruptions",color="red")+  # label for the upper half
   geom_density(aes(x=waiting,y=-after_stat(density)),color="grey50",fill="blue",alpha=0.3)+ # waiting, lower half
   annotate("label",x=60,y=-0.1,label="waiting",color="blue")+  # label for the lower half
   xlab("Value")+ggtitle("(a) Mirrored density plot, raw data")

# (b) Mirrored density plot, standardized data
p2<-ggplot(df)+
   geom_density(aes(x=std.eruptions,y=after_stat(density)),color="grey50",fill="red",alpha=0.3)+ # eruptions, upper half
   annotate("label",x=-0.5,y=0.3,label="eruptions",color="red")+  # label for the upper half
   geom_density(aes(x=std.waiting,y=-after_stat(density)),color="grey50",fill="blue",alpha=0.3)+ # waiting, lower half
   annotate("label",x=-0.5,y=-0.3,label="waiting",color="blue")+  # label for the lower half
   xlab("Standardized value")+ggtitle("(b) Mirrored density plot, standardized data")

gridExtra::grid.arrange(p1,p2,ncol=2)        # arrange the two plots side by side

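4.5 Observations on the plots and notes on writing the code

5 Box plots and violin plots

5.1 Plotting requirements

Draw box plots (geom_boxplot) and violin plots (geom_violin) of the two variables eruptions and waiting, for both the raw data and the standardized data.
Use stat_summary(fun="mean",geom="point") to add mean points to the box plot and the mean plot.
The violin plot must also include a dot plot and a box plot.
Use the first two colors of the palette, brewer.pal(6,"Set2")[1:2], as the box fill colors.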

"#66C2A5" "#FC8D62" "#8DA0CB" "#E78AC3" "#A6D854" "#FFD92F"

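5.2 Box plot code

library(forcats)      # fct_inorder()

df <- data |>
  gather(everything(),key=指标,value=指标值) |>   # reshape to long format
  mutate(指标=fct_inorder(指标)) |>               # keep the original variable order
  ddply("指标",transform,标准化值=scale(指标值))   # standardize within each variable
palette<-RColorBrewer::brewer.pal(6,"Set2")[1:2]          # first two colors of Set2
# Raw data
p1 <- ggplot(df,aes(x=指标,y=指标值))+
  geom_boxplot(fill=palette)+  # box plot with one fill color per variable
  stat_summary(fun="mean",geom="point",shape=21,size=2.5,fill="white")   # mean point
# Standardized data
p2 <- ggplot(df,aes(x=指标,y=标准化值))+
  geom_boxplot(fill=palette)+  # box plot with one fill color per variable
  stat_summary(fun="mean",geom="point",shape=21,size=2.5,fill="white")   # mean point
gridExtra::grid.arrange(p1,p2,ncol=2)        # arrange the two plots side by side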
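The numbers behind the box plots above can be tabulated directly. This is a minimal base R sketch that computes the quartiles with quantile() and adds the mean that stat_summary() overlays; the name `box_stats` is illustrative, and the whiskers drawn by geom_boxplot() follow the 1.5 × IQR rule, so they need not coincide with the 0% and 100% rows.

```r
# Quartiles plus the mean for each variable
box_stats <- sapply(faithful, function(x) {
  c(quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)), mean = mean(x))
})
print(round(box_stats, 3))

# Position of the mean relative to the median in each box
print(box_stats["mean", ] - box_stats["50%", ])
```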

5.3 Violin plot code

df <- data |>
  gather(everything(),key=指标,value=指标值) |>   # reshape to long format
  mutate(指标=fct_inorder(指标)) |>               # keep the original variable order
  ddply("指标",transform,标准化值=scale(指标值))   # standardize within each variable

# (a) Violin plot, raw data
p1<-ggplot(df,aes(x=指标,y=指标值,fill=指标))+
     geom_violin(scale="width",trim=FALSE)+
     geom_point(color="black",size=0.8)+  # raw data points
     geom_boxplot(outlier.size=0.7,outlier.color="white",size=0.3,
               width=0.2,fill="white")+  # narrow box plot inside the violin, with outlier settings
     scale_fill_brewer(palette="Set2")+
     stat_summary(fun=mean,geom="point",shape=21,size=2)+ # mean point
     guides(fill="none")+
     ggtitle("(a) Violin plot, raw data")

# (b) Violin plot, standardized data
p2<-ggplot(df,aes(x=指标,y=标准化值,fill=指标))+
     geom_violin(scale="width")+
     geom_point(color="black",size=1)+   # raw data points
     geom_boxplot(outlier.size=0.7,outlier.color="black",size=0.3,
          width=0.2,fill="white")+   # narrow box plot inside the violin
     scale_fill_brewer(palette="Set2")+
     guides(fill="none")+
     ggtitle("(b) Violin plot, standardized data")

gridExtra::grid.arrange(p1,p2,ncol=2)        # arrange p1 and p2 side by side

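5.4 Observations on the plots and notes on writing the code

6 Wilkinson dot plots, beeswarm plots and raincloud plots

6.1 Plotting requirements

Draw Wilkinson dot plots, beeswarm plots and raincloud plots of the two variables eruptions and waiting.
All three plot types use the standardized data.
For the Wilkinson dot plot, use geom_dotplot(binaxis="y",bins=30,dotsize = 0.3) and produce both a centered-stacking and an upward-stacking version.
For the beeswarm plot, use geom_beeswarm(cex=0.8,shape=21,size=0.8) and produce versions with and without a box plot.
For the raincloud plot, use geom_violindot(dots_size=0.7,binwidth=0.07) and produce both a horizontal and a vertical version.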
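The report does not include code for the beeswarm and raincloud plots, so the following is only a sketch of how the two requirements above might be met. It assumes that the ggbeeswarm package (for geom_beeswarm()) and the see package (for geom_violindot()) are installed, builds the standardized long data in base R rather than with gather() and ddply(), and uses illustrative object names throughout.

```r
library(ggplot2)
library(ggbeeswarm)   # geom_beeswarm()
library(see)          # geom_violindot()

# Long format with per-variable standardized values (base R)
df <- stack(faithful)
names(df) <- c("value", "variable")
df$std_value <- ave(df$value, df$variable,
                    FUN = function(v) as.numeric(scale(v)))

p <- ggplot(df, aes(x = variable, y = std_value, fill = variable)) +
  theme_bw() + theme(legend.position = "none")

# Beeswarm plots, without and with a box plot underneath
bee1 <- p + geom_beeswarm(cex = 0.8, shape = 21, size = 0.8)
bee2 <- p + geom_boxplot(width = 0.2, fill = "white", outlier.shape = NA) +
  geom_beeswarm(cex = 0.8, shape = 21, size = 0.8)

# Raincloud plots in two orientations (coord_flip() swaps the axes)
rain1 <- p + geom_violindot(dots_size = 0.7, binwidth = 0.07)
rain2 <- rain1 + coord_flip()
```

Printing any of the four objects, or passing them to gridExtra::grid.arrange(), renders the plots. The box plot in `bee2` hides its own outliers because the beeswarm layer already draws every point.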

6.2 Wilkinson dot plot code

mytheme<-theme_bw()+theme(legend.position="none")   # preset theme without a legend
df <- data |>
  gather(everything(),key=指标,value=指标值) |>   # reshape to long format
  mutate(指标=fct_inorder(指标)) |>               # keep the original variable order
  ddply("指标",transform,标准化值=scale(指标值))   # standardize within each variable

p<-ggplot(df,aes(x=指标,y=标准化值,fill=指标))
p1<-p+geom_dotplot(binaxis="y",binwidth=0.05,stackdir="center")+ # dot plot, dots stacked around the center
  mytheme+ggtitle("(a) Centered stacking")

p2<-p+geom_dotplot(binaxis="y",binwidth=0.05)+ # dot plot, dots stacked upward (default)
  mytheme+ggtitle("(b) Upward stacking")
gridExtra::grid.arrange(p1,p2,ncol=2)     # arrange the plots in 2 columns

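6.3 6.4 Observations on the plots and notes on writing the code

Observations on the plots: the raincloud plot shows the standardized distributions of the two Old Faithful variables, eruptions (eruption duration) and waiting (waiting time):
1. Vertical layout (panel a):
  - shows the range of both variables after standardization (roughly -2 to 2)
  - the distribution of eruptions is slightly left-skewed, consistent with its skewness of -0.4135
  - the distribution of waiting is clearly bimodal
  - the dots show where the actual data points fall
2. Horizontal layout (panel b):
  - a different visual presentation of the same data
  - better suited to long variable names or a larger number of variables
  - closer to the usual left-to-right reading direction

Notes on the code: the raincloud plot is particularly well suited to:
1. showing the shape of the distribution, its density and the raw data points at the same time
2. comparing the distributions of different variables
3. spotting skewness, multimodality and similar features in the data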