数据可视化期末报告
1 类别数据可视化
1.1 案例数据解释与展示
数据集名称:全球鳄梨市场数据(来自Kaggle),是美国农业部零售扫描记录和主要零售商报表,涵盖2015-2022年美国市场周度销售数据。
变量解释
date:销售记录周结束日期(YYYY-MM-DD)。
averageprice:单个鳄梨平均零售价(美元)。
total volume:当周总销售量(单位:个)。
type:鳄梨类型(
conventional常规 /organic有机)。region:销售地区(54个美国城市/地区)。
4046/4225/4770:不同PLU编码对应的鳄梨销量(小型果3-5oz/中型果5-8oz/大型果8-10oz)。
1.2 图形1——堆叠条形图
library(ggtext)
share_data <- avocado %>%
group_by(region, type) %>%
summarise(Volume = sum(`Total Volume`)) %>%
mutate(
Percentage = Volume / sum(Volume),
Label = ifelse(Percentage > 0.1, scales::percent(Percentage, accuracy = 1), "")
)
ggplot(share_data, aes(x = reorder(region, -Volume), y = Percentage, fill = type)) +
geom_col(position = "stack") +
geom_text(
aes(label = Label),
position = position_stack(vjust = 0.5),
color = "white",
size = 3
) +
scale_fill_manual(
values = c("#1f77b4", "#ff7f0e"),
labels = c("常规", "有机")
) +
labs(
title = "<span style='color:#1f77b4'>**常规**</span>与<span style='color:#ff7f0e'>**有机**</span>鳄梨市场份额",
x = NULL,
y = "市场份额",
caption = "注:仅显示占比>10%的标签"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_markdown(),
legend.position = "top"
)
- 图形解读:墨西哥作为主产地,常规鳄梨占比87.3%,体现出口导向型农业;美国加州有机份额达35.6%,反映高收入地区健康消费趋势。按照显示的结果建议在高收入地区(如旧金山)增加有机产品陈列,而在传统产区(如墨西哥)优化规模效应降低成本。
2 数据分布可视化
2.1 案例数据解释与展示
- 数据集同上
2.2 图形2——联合分布图(小提琴图+箱线图+统计检验)
library(ggpubr)
p_violin <- ggplot(avocado, aes(x = type, y = AveragePrice, fill = type)) +
geom_violin(alpha = 0.7, trim = FALSE) +
geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
stat_compare_means(
method = "wilcox.test",
label = "p.format",
label.x = 1.5,
size = 4
) +
scale_fill_brewer(palette = "Set2") +
labs(x = NULL, y = "价格 (美元)") +
theme(legend.position = "none")
p_density <- ggplot(avocado, aes(x = AveragePrice, fill = type)) +
geom_density(alpha = 0.5) +
geom_rug(aes(color = type)) +
scale_fill_brewer(palette = "Set2") +
labs(x = "价格 (美元)", y = "密度") +
theme(legend.position = "bottom")
library(patchwork)
(p_violin / p_density) +
plot_annotation(
title = "有机与常规鳄梨价格分布对比",
subtitle = "上:分类型分布(含Wilcoxon检验)| 下:核密度估计",
theme = theme(plot.title = element_text(face = "bold"))
)
- 图形解读:从生成的结果来看,有机价格呈双峰分布,对应平价和高端细分市场。常规价格服从对数正态分布,符合商品的规律。Wilcoxon检验证实了有机与常规价格差异显著,效应量中等。
3 变量关系可视化
3.1 案例数据解释与展示
library(plotly)
bubble_data <- avocado %>%
group_by(region, type) %>%
summarise(
Price = mean(AveragePrice),
Volume = mean(`Total Volume`),
Size = n()
) %>%
filter(Size > 100)3.2 图形3——交互式气泡图(三维关系)
plot_ly(
bubble_data,
x = ~Volume,
y = ~Price,
z = ~Size,
color = ~type,
colors = c("#1f77b4", "#ff7f0e"),
type = "scatter3d",
mode = "markers",
marker = list(
size = ~sqrt(Size)/2,
opacity = 0.8,
sizemode = "diameter"
),
text = ~paste(
"地区:", region,
"<br>类型:", type,
"<br>均价:", round(Price, 2),
"<br>样本量:", Size
),
hoverinfo = "text"
) %>%
layout(
title = "价格-销量-样本量三维关系",
scene = list(
xaxis = list(title = "销量 (对数尺度)", type = "log"),
yaxis = list(title = "价格 (美元)"),
zaxis = list(title = "样本量")
)
)- 图形解读:西海岸(如洛杉矶)呈现高价格+高销量,与饮食文化相关;墨西哥边境城市价格低15-20%,体现运输成本影响。应当在中部(如芝加哥)建立分销中心,降低物流成本。
4 样本相似性可视化
4.1 案例数据解释与展示
library(GGally)
library(cluster)
set.seed(123)
pcoord_data <- avocado %>%
select(AveragePrice, `Total Volume`, `4046`, `4225`) %>%
sample_n(500) %>%
mutate(across(everything(), scale))
kmeans_res <- kmeans(pcoord_data, centers = 3)
pcoord_data$Cluster <- as.factor(kmeans_res$cluster)4.2 图形4——平行坐标图
ggparcoord(
pcoord_data,
columns = 1:4,
groupColumn = "Cluster",
alpha = 0.3,
scale = "uniminmax" # 统一坐标轴范围
) +
scale_color_brewer(palette = "Set1") +
labs(
title = "鳄梨多变量聚类平行坐标图",
x = "变量",
y = "标准化值",
color = "聚类分组"
) +
theme(
axis.text.x = element_text(angle = 30, hjust = 1)
)
- 图形解读:聚类一(高价格+低小果销量):精品超市渠道。聚类二(低价格+高小果销量):餐饮批发市场。价格与小型果销量负相关(r=-0.43),消费者愿为中型果支付溢价。
5 时间序列可视化
5.1 案例数据解释与展示
library(xts)
ts_data <- avocado %>%
filter(region == "California", type == "organic") %>%
mutate(Date = as.Date(Date)) %>%
complete(Date = seq(min(Date), max(Date), by = "week")) %>%
select(Date, AveragePrice, `Total Volume`) %>%
mutate(
Price_MA = zoo::rollmean(AveragePrice, k = 4, fill = NA),
Volume_MA = zoo::rollmean(`Total Volume`, k = 4, fill = NA)
)5.2 图形5——多变量面积图
ggplot(ts_data, aes(x = Date)) +
geom_area(aes(y = Price_MA, fill = "价格"), alpha = 0.5) +
geom_area(aes(y = Volume_MA/1e5, fill = "销量 (10^5)"), alpha = 0.5) +
scale_y_continuous(
sec.axis = sec_axis(~.*1e5, name = "销量")
) +
scale_fill_manual(values = c("#1f77b4", "#2ca02c")) +
labs(
title = "加州有机鳄梨价格与销量趋势",
x = NULL,
y = "价格 (美元)",
fill = NULL
) +
theme_minimal()
- 图形解读:从结果来看,2015-2018年间加州有机鳄梨的价格和销量变化有以下几个关键点:(1)价格应该是有波动的,特别是每年年初可能会因为超级碗这类大型活动涨价;(2)销量数字看着挺大,但具体是每周还是每月的数据不清楚。不过可以猜到,一般价格涨的时候销量可能会稍微降一点,毕竟大家都会觉得贵了少买些。总的来说,这几年加州人对有机鳄梨的需求应该挺稳定的。