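1. Statistics show that the probability of an ordinary person's descendant becoming a "celebrity" (someone with high prestige in some field) is one in a thousand, while the probability of a celebrity's descendant becoming a celebrity is one in ten; therefore the bloodline theory has statistical support. Analyze this judgment.
Answer:
2. Read the attached paper Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained, briefly list the issues the paper says need attention when running A/B tests, and discuss them with your teammates.
Answer:
3. The attached file user_purchase_record.csv contains users' purchase records for a Douban product (some of the data has been obfuscated to protect privacy). The data format is: user ID, subject ID, amount. Each record represents one purchase, and the purchase amount may be zero. Divide the work within the team to complete the following tasks:
a. Use histograms to show the distributions of purchase frequency and purchase amount.
b. Use a scatter plot to show the relationship between purchase frequency and amount.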
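The plotting procedure is as follows:
1. Import the data
# import data: the file has no header row, so column names are supplied here
pr <- read.csv("~/Elrond/user_purchase_record.csv", header = FALSE, sep = ",",
               col.names = c("user_id", "subject_id", "amount"))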
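Before aggregating, it can help to confirm that the file parsed as expected. The sketch below is only an illustration: `pr_toy` is a small made-up data frame with the same three columns, not the real file. It checks the column types, the number of missing cells, and the share of zero-amount purchases, which the task statement says can occur.

```r
# Toy stand-in for the real file; the values are invented for illustration.
pr_toy <- read.csv(text = "u1,s1,0
u1,s2,5
u2,s1,3
u3,s3,0
u3,s1,12",
                   header = FALSE, sep = ",",
                   col.names = c("user_id", "subject_id", "amount"))

str(pr_toy)                                      # column types
n_missing  <- sum(is.na(pr_toy))                 # NA cells across all columns
zero_share <- mean(pr_toy$amount == 0)           # fraction of zero-amount purchases
n_users    <- length(unique(pr_toy$user_id))     # distinct users
n_subjects <- length(unique(pr_toy$subject_id))  # distinct subjects
```

The same calls applied to `pr` would give a first impression of the real data.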
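2. Clean and aggregate the data
library(plyr)
# number of purchases made by each user
fre_user <- ddply(pr, .(user_id), summarize, user_f = length(subject_id))
# number of times each subject was purchased
fre_subject <- ddply(pr, .(subject_id), summarize, subject_f = length(user_id))
# total revenue of each subject
amount_subject <- ddply(pr, .(subject_id), summarize, subject_a = sum(amount))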
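As a cross-check on the aggregation, the same three quantities can be computed with base R alone. This is a minimal sketch on a made-up data frame (`pr_toy` and its values are hypothetical), not a replacement for the `ddply` calls.

```r
# Toy purchase records; the values are invented for illustration.
pr_toy <- data.frame(
  user_id    = c("u1", "u1", "u2", "u3", "u3"),
  subject_id = c("s1", "s2", "s1", "s3", "s1"),
  amount     = c(0, 5, 3, 0, 12)
)

# Purchases per user and per subject (the quantities stored in user_f / subject_f).
user_f    <- table(pr_toy$user_id)
subject_f <- table(pr_toy$subject_id)

# Total revenue per subject (the quantity stored in subject_a).
subject_a <- tapply(pr_toy$amount, pr_toy$subject_id, sum)
```

A simple consistency check is that the per-subject totals sum to the overall total of `amount`.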
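3. Plot the histogram of user purchase frequency
library(ggplot2)
# one bar per purchase count (binwidth = 1); fill colour follows bar height
# after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
fre_user_plot <- ggplot(data = fre_user, aes(x = user_f))
fre_user_hist <- fre_user_plot + geom_histogram(binwidth = 1, aes(fill = after_stat(count)))
fre_user_done <- fre_user_hist + xlab("user frequency") + ylab("total users") +
  ggtitle("distribution of user frequency") + theme(legend.position = "none")
fre_user_done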
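The part with purchase frequency below 15 can be enlarged as follows:
# coord_cartesian() zooms without discarding data; xlim() would drop out-of-range rows before binning
fre_user_zoom <- fre_user_done + coord_cartesian(xlim = c(0, 15))
fre_user_zoom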
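One way to judge whether a cutoff such as 15 is reasonable is to look at the upper quantiles of the per-user counts and at the share of users the zoomed view still covers. The sketch below uses invented counts (`user_f_toy` is hypothetical), so the numbers say nothing about the real data.

```r
# Hypothetical per-user purchase counts, invented for illustration (100 users).
user_f_toy <- c(rep(1, 60), rep(2, 20), rep(3, 10), rep(5, 5), rep(10, 3), 40, 120)

# Upper quantiles show where most users sit.
q <- quantile(user_f_toy, probs = c(0.5, 0.9, 0.95, 0.99))

# Share of users that a view limited to counts below 15 still covers.
share_below_15 <- mean(user_f_toy < 15)
```

On the real data the same two lines could be run on `fre_user$user_f`.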
Observations from the plot:
4. Plot the histogram of user purchase amount
# each row of pr is one purchase, so this shows the distribution of per-purchase amounts
amount_user_plot <- ggplot(data = pr, aes(x = amount))
amount_user_hist <- amount_user_plot + geom_histogram(binwidth = 1, aes(fill = after_stat(count)))
amount_user_done <- amount_user_hist + xlab("expense") + ylab("total times") +
  ggtitle("distribution of user expense") + theme(legend.position = "none")
amount_user_done
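The part with purchase amount below 20 can be enlarged as follows:
# coord_cartesian() zooms without discarding data; xlim() would drop out-of-range rows before binning
amount_user_zoom <- amount_user_done + coord_cartesian(xlim = c(0, 20))
amount_user_zoom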
Observations from the plot:
5. Plot the histogram of subject purchase frequency
fre_subject_plot <- ggplot(data = fre_subject, aes(x = subject_f))
fre_subject_hist <- fre_subject_plot + geom_histogram(binwidth = 100, aes(fill = after_stat(count)))
fre_subject_done <- fre_subject_hist + xlab("subject frequency") + ylab("total subjects") +
  ggtitle("distribution of subject frequency") + theme(legend.position = "none")
fre_subject_done
Note: the bin width used here is 100.
Observations from the plot:
6. Plot the histogram of total revenue per subject
amount_subject_plot <- ggplot(data = amount_subject, aes(x = subject_a))
amount_subject_hist <- amount_subject_plot + geom_histogram(binwidth = 400,
                                                            aes(fill = after_stat(count)))
amount_subject_done <- amount_subject_hist + xlab("total revenue") + ylab("total subjects") +
  ggtitle("distribution of subject revenue") + theme(legend.position = "none")
amount_subject_done
Note: the bin width used here is 400.
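The bin widths of 100 and 400 appear to have been chosen by hand. As a point of comparison, a common rule of thumb is the Freedman-Diaconis rule, which sets the bin width to 2 * IQR / n^(1/3). The sketch below applies it to simulated skewed data (`subject_a_toy` is hypothetical), so the resulting width is not a recommendation for the real file.

```r
set.seed(1)
# Hypothetical skewed per-subject totals, simulated for illustration only.
subject_a_toy <- rexp(500, rate = 1 / 1000)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3).
fd_binwidth <- 2 * IQR(subject_a_toy) / length(subject_a_toy)^(1 / 3)
```

A width computed this way from `amount_subject$subject_a` could be passed to `geom_histogram(binwidth = ...)` and compared with the hand-picked value.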
Observations from the plot:
7. Plot the scatter plot of user purchase frequency versus amount
# per user: t_fre = number of purchases, m_amount = mean amount per purchase
fre_amount <- ddply(pr, .(user_id), summarize, t_fre = length(subject_id), m_amount = mean(amount))
fre_amount_plot <- ggplot(data = fre_amount, aes(y = t_fre, x = m_amount))
fre_amount_scatter <- fre_amount_plot + geom_point(aes(colour = t_fre), shape = 1)
fre_amount_done <- fre_amount_scatter + ylab("frequency") + xlab("average expenses") +
  ggtitle("users' frequency vs average expenses") + theme(legend.position = "none")
fre_amount_done
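A rank correlation can complement the scatter plot with a single number for the direction and strength of the relationship. The sketch below is an illustration on invented values (`fre_amount_toy` is hypothetical); Spearman's coefficient is used here as an assumption, because it depends only on ranks and is therefore less affected by a few very heavy buyers.

```r
# Hypothetical per-user summary, invented for illustration.
fre_amount_toy <- data.frame(
  t_fre    = c(1, 2, 3, 5, 8, 20),   # purchases per user
  m_amount = c(12, 9, 10, 4, 3, 1)   # mean amount per purchase
)

# Spearman rank correlation between frequency and mean amount.
rho <- cor(fre_amount_toy$t_fre, fre_amount_toy$m_amount, method = "spearman")
```

On the real data the same call would take `fre_amount$t_fre` and `fre_amount$m_amount`.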
Observations from the plot:
8. Plot the scatter plot of subject purchase frequency versus amount
# per subject: s_t_fre = number of purchases, s_m_amount = mean amount per purchase
s_fre_amount <- ddply(pr, .(subject_id), summarize, s_t_fre = length(user_id),
                      s_m_amount = mean(amount))
s_fre_amount_plot <- ggplot(data = s_fre_amount, aes(y = s_t_fre, x = s_m_amount))
s_fre_amount_scatter <- s_fre_amount_plot + geom_point(aes(colour = s_m_amount),
                                                       shape = 1)
s_fre_amount_done <- s_fre_amount_scatter + ylab("frequency") + xlab("average revenue") +
  ggtitle("subjects' frequency vs average revenue") + theme(legend.position = "none")
s_fre_amount_done
Observations from the plot: