Data and Models I: Homework

张弦

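1. Statistics show that the child of an ordinary person becomes a "celebrity" (someone with a high reputation in some field) with probability one in a thousand, whereas the child of a "celebrity" becomes a celebrity with probability one in ten; therefore, the bloodline theory has a statistical basis. Analyze this claim.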

Answer:

2. Read the attached paper Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained, briefly list the issues it says to watch for when running A/B tests, and discuss them with your teammates.

Answer:

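3. The attached file user_purchase_record.csv contains users' purchase records for a Douban product (some of the data has been obfuscated to protect privacy). The data format is: user ID, item ID, amount. Each record represents one purchase, and the amount may be zero. Divide the work within your group to complete the following tasks:
a. Use histograms to show the distributions of purchase frequency and purchase amount.
b. Use a scatter plot to show the relationship between purchase frequency and amount.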

The plotting process is as follows:

1. Import the data

# import data
pr <- read.csv("~/Elrond/user_purchase_record.csv", header = FALSE, sep = ",", 
    col.names = c("user_id", "subject_id", "amount"))
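Before aggregating, it can help to confirm that the import produced the expected columns and value ranges. The following is a minimal sketch in base R; `pr_demo` is a hypothetical toy data frame standing in for the real file, and its values are made up for illustration only.

```r
# Hypothetical toy data with the same three columns as the real file
pr_demo <- data.frame(
  user_id    = c(1, 1, 2, 3, 3, 3),
  subject_id = c(10, 11, 10, 10, 12, 12),
  amount     = c(0, 5, 3, 0, 8, 8)
)

str(pr_demo)                              # column names and types
n_missing  <- colSums(is.na(pr_demo))     # missing values per column
n_zero     <- sum(pr_demo$amount == 0)    # zero-amount purchases (allowed by the task)
n_negative <- sum(pr_demo$amount < 0)     # negative amounts would be suspicious

print(n_missing)
cat("zero-amount purchases:", n_zero, "\n")
cat("negative amounts:", n_negative, "\n")
```

The same checks can be run on `pr` once the real file is loaded.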

2. Tidy and aggregate the data

library(plyr)
# number of purchases made by each user
fre_user <- ddply(pr, .(user_id), summarize, user_f = length(subject_id))
# number of times each item was purchased
fre_subject <- ddply(pr, .(subject_id), summarize, subject_f = length(user_id))
# total revenue of each item
amount_subject <- ddply(pr, .(subject_id), summarize, subject_a = sum(amount))
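The per-user and per-item summaries can be cross-checked without plyr. The sketch below uses base R's `aggregate()` on a hypothetical toy data frame (`pr_demo` and the `*_demo` names are illustrative, not part of the assignment data). It is meant only as a sanity check, not as a replacement for the `ddply` calls.

```r
# Hypothetical toy data with the same three columns as the real file
pr_demo <- data.frame(
  user_id    = c(1, 1, 2, 3, 3, 3),
  subject_id = c(10, 11, 10, 10, 12, 12),
  amount     = c(0, 5, 3, 0, 8, 8)
)

# purchases per user (counterpart of fre_user)
fre_user_demo <- aggregate(subject_id ~ user_id, data = pr_demo, FUN = length)
names(fre_user_demo)[2] <- "user_f"

# total revenue per item (counterpart of amount_subject)
amount_subject_demo <- aggregate(amount ~ subject_id, data = pr_demo, FUN = sum)
names(amount_subject_demo)[2] <- "subject_a"

# the counts must add up to the number of records,
# and the per-item revenue must add up to the overall revenue
stopifnot(sum(fre_user_demo$user_f) == nrow(pr_demo))
stopifnot(sum(amount_subject_demo$subject_a) == sum(pr_demo$amount))
```

The two `stopifnot()` invariants also hold for `fre_user` and `amount_subject` on the real data, which makes them a cheap check after aggregation.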

3. Plot a histogram of users' purchase frequency

library(ggplot2)
# one bin per integer frequency; the fill colour encodes the bar height
# after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
fre_user_plot <- ggplot(data = fre_user, aes(x = user_f))
fre_user_hist <- fre_user_plot + geom_histogram(binwidth = 1, aes(fill = after_stat(count)))
fre_user_done <- fre_user_hist + xlab("user frequency") + ylab("total users") + 
    ggtitle("distribution of user frequency") + theme(legend.position = "none")
fre_user_done

[Figure: histogram "distribution of user frequency"; x-axis: user frequency, y-axis: total users]

We can zoom in on the part where the purchase frequency is below 15:

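# coord_cartesian() zooms the view only; xlim() would drop data and bars outside the limits
fre_user_zoom <- fre_user_done + coord_cartesian(xlim = c(0, 15))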
fre_user_zoom

[Figure: the histogram "distribution of user frequency" zoomed to user frequency 0 to 15]
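When zooming a ggplot2 histogram, it matters whether the limits are set on the scale (`xlim()`) or on the coordinate system (`coord_cartesian()`): scale limits remove out-of-range observations before binning, while coordinate limits only change the visible window. The sketch below compares how many observations each variant bins. `demo_freq` is a small made-up data frame, not the homework data.

```r
library(ggplot2)

# Hypothetical toy frequencies; 40 lies outside the zoom window 0..15
demo_freq <- data.frame(user_f = c(1, 1, 2, 3, 15, 15, 40))
base_plot <- ggplot(demo_freq, aes(x = user_f)) + geom_histogram(binwidth = 1)

# Scale limits: out-of-range rows become NA and are dropped before binning
bins_xlim <- suppressWarnings(layer_data(base_plot + xlim(0, 15), 1))
# Coordinate limits: every row is binned, only the viewport changes
bins_coord <- layer_data(base_plot + coord_cartesian(xlim = c(0, 15)), 1)

kept_xlim  <- sum(bins_xlim$count)
kept_coord <- sum(bins_coord$count)
cat("observations binned with xlim():", kept_xlim, "\n")
cat("observations binned with coord_cartesian():", kept_coord, "\n")
```

For a pure zoom, `coord_cartesian()` is usually the safer choice, since the bars stay identical to those in the full plot.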

From the figure we can see:

4. Plot a histogram of purchase amounts

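# one observation per purchase record; bins are 1 unit of amount wide
# after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
amount_user_plot <- ggplot(data = pr, aes(x = amount))
amount_user_hist <- amount_user_plot + geom_histogram(binwidth = 1, aes(fill = after_stat(count)))
amount_user_done <- amount_user_hist + xlab("expense") + ylab("total times") + 
    ggtitle("distribution of user expense") + theme(legend.position = "none")
amount_user_done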

[Figure: histogram "distribution of user expense"; x-axis: expense, y-axis: total times]

We can zoom in on the part where the purchase amount is below 20:

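# coord_cartesian() zooms the view only; xlim() would drop data and bars outside the limits
amount_user_zoom <- amount_user_done + coord_cartesian(xlim = c(0, 20))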
amount_user_zoom

[Figure: the histogram "distribution of user expense" zoomed to expense 0 to 20]

From the figure we can see:

5. Plot a histogram of how often each item is purchased

# one observation per item; after_stat(count) replaces the deprecated ..count.. (ggplot2 >= 3.3.0)
fre_subject_plot <- ggplot(data = fre_subject, aes(x = subject_f))
fre_subject_hist <- fre_subject_plot + geom_histogram(binwidth = 100, aes(fill = after_stat(count)))
fre_subject_done <- fre_subject_hist + xlab("subject frequency") + ylab("total subjects") + 
    ggtitle("distribution of subject frequency") + theme(legend.position = "none")
fre_subject_done

[Figure: histogram "distribution of subject frequency"; x-axis: subject frequency, y-axis: total subjects]

Note: the bin width here is 100.
From the figure we can see:

6. Plot a histogram of total revenue per item

# one observation per item; after_stat(count) replaces the deprecated ..count.. (ggplot2 >= 3.3.0)
amount_subject_plot <- ggplot(data = amount_subject, aes(x = subject_a))
amount_subject_hist <- amount_subject_plot + geom_histogram(binwidth = 400, 
    aes(fill = after_stat(count)))
amount_subject_done <- amount_subject_hist + xlab("total revenue") + ylab("total subjects") + 
    ggtitle("distribution of subject revenue") + theme(legend.position = "none")
amount_subject_done

[Figure: histogram "distribution of subject revenue"; x-axis: total revenue, y-axis: total subjects]

Note: the bin width here is 400.
From the figure we can see:

7. Plot a scatter plot of users' purchase frequency against amount

# per user: number of purchases (t_fre) and mean amount per purchase (m_amount)
fre_amount <- ddply(pr, .(user_id), summarize, t_fre = length(subject_id), m_amount = mean(amount))
fre_amount_plot <- ggplot(data = fre_amount, aes(y = t_fre, x = m_amount))
fre_amount_scatter <- fre_amount_plot + geom_point(aes(colour = t_fre), shape = 1)
fre_amount_done <- fre_amount_scatter + ylab("frequency") + xlab("average expenses") + 
    ggtitle("users' frequency vs average expenses") + theme(legend.position = "none")
fre_amount_done

[Figure: scatter plot "users' frequency vs average expenses"; x-axis: average expenses, y-axis: frequency]
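As a numerical companion to the scatter plot, a rank correlation can summarize whether users who buy more often also tend to spend more per purchase. The following is a minimal sketch in base R on hypothetical toy data (`pr_demo` and the `*_demo` names are illustrative, not the homework data). Spearman's rho is used here as one reasonable choice, because it is rank-based and a few extreme values do not dominate it.

```r
# Hypothetical toy data with the same three columns as the real file
pr_demo <- data.frame(
  user_id    = c(1, 1, 2, 3, 3, 3),
  subject_id = c(10, 11, 10, 10, 12, 12),
  amount     = c(0, 5, 3, 0, 8, 8)
)

# per-user purchase count and mean amount, mirroring t_fre and m_amount
t_fre_demo    <- tapply(pr_demo$subject_id, pr_demo$user_id, length)
m_amount_demo <- tapply(pr_demo$amount, pr_demo$user_id, mean)

# rank correlation between frequency and mean amount
rho_demo <- cor(as.numeric(t_fre_demo), as.numeric(m_amount_demo),
                method = "spearman")
cat("Spearman rho:", rho_demo, "\n")
```

On the real data, the same call could be applied to `fre_amount$t_fre` and `fre_amount$m_amount`.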

From the figure we can see:

8. Plot a scatter plot of items' purchase frequency against amount

# per item: number of purchases (s_t_fre) and mean amount per purchase (s_m_amount)
s_fre_amount <- ddply(pr, .(subject_id), summarize, s_t_fre = length(user_id), 
    s_m_amount = mean(amount))
s_fre_amount_plot <- ggplot(data = s_fre_amount, aes(y = s_t_fre, x = s_m_amount))
s_fre_amount_scatter <- s_fre_amount_plot + geom_point(aes(colour = s_m_amount), 
    shape = 1)
s_fre_amount_done <- s_fre_amount_scatter + ylab("frequency") + xlab("average revenue") + 
    ggtitle("subjects' frequency vs average revenue") + theme(legend.position = "none")
s_fre_amount_done

[Figure: scatter plot "subjects' frequency vs average revenue"; x-axis: average revenue, y-axis: frequency]

From the figure we can see: