組員:B044030015 許效軒 B046060036 王文弘 B054012011 黃冠綾 B046060027 姚詠馨 B046060035 張文瑜 M074020024 楊長舜
總結:數量跟金額上整體有逐漸上升的狀態,於2017年大幅成長,2018平緩化。除了因剛成立,處成長期而有上升的狀態,另外此平台也會因為特別活動而有突出的訂單數(e.g. 2017/11/24(六) 數量飆高,因為11/23是黑色星期五、2018五月初數量上升可能是因母親節節日)
# 篩掉同一個訂單
o=df[!duplicated(df$order_id), ]
# table(df$order_purchase_timestamp) %>% sort # 以月分劃分是最洽當的
ts=as.POSIXct(o$order_purchase_timestamp, format="%Y-%m-%d %H:%M:%S")
par(cex=0.8)
hist(ts,"month",las=2,freq=T,xlab="",main ="各月份訂單數",col ="#FF7744")
summary(o$order_id) # 總共97256筆
## Length Class Mode
## 97256 character character
# 細分
o$date= as.Date(o$order_purchase_timestamp,format="%Y-%m-%d")
lim <- as.Date(c("2017-01-01","2018-09-01"))
qplot(data=o,
x=date,
geom = "bar",fill="")+
# fill = customer_city) + # 依地區分
labs(x = "日期",y = "數量") +
scale_x_date(date_breaks="months",limits = lim) +
ggtitle("細分訂單數")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')+
theme(axis.text.x = element_text(angle = -60))
# 篩掉「訂單編號且訂單商品數一樣」的,避免同一商品因付款方式不同重複計算;最後再加總
cash = df %>% select(order_id,order_item_id,price,order_purchase_timestamp) %>% as.data.frame
cash1 =cash %>%
group_by(order_id,order_item_id) %>%
filter(row_number() == 1) %>%
ungroup()
cash2 = aggregate(price ~ order_id, cash1, sum)
cash2 =merge(cash2 ,cash1, by =c( "order_id"), all.x = TRUE)
cash2$date= as.Date(cash2$order_purchase_timestamp,format="%Y-%m-%d")
setDT(cash2)[, Month_Yr := format(as.Date(order_purchase_timestamp), "%Y-%m") ]
cash3 = aggregate(price.x ~ Month_Yr, cash2, sum)
# 視覺化
ggplot(cash3, aes(x = Month_Yr, y = price.x)) +
geom_point(color = '#FF7744') +
scale_y_continuous(labels = comma)+
#geom_text_repel(aes(label=price.x)) +
labs(x = "年月份",y = "銷售金額")+
ggtitle("總銷售金額趨勢")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')+
theme(axis.text.x = element_text(angle = -60))
總結:每次購買商品數量只買一項,客單價低,但是有許多離群值,因此可以依每個客戶潛力來做行銷,另外從其熱門商品來看,其潛力產品為健康保養(持續上升),夕陽產品則是3c產品[這邊可能要參考冠綾做的寫]。
此資料中最高是一個訂單中購買了21個商品,但大多仍是購買少量商品為主(90%的訂單都只單純購買一樣商品)。
pnumber = df %>% select(order_id,order_item_id) %>% as.data.frame
pnumber1 =pnumber %>%
group_by(order_id) %>%
arrange(order_id,desc(order_item_id)) %>%
filter(row_number() == 1) %>%
ungroup()
pnumber1 %>%
ggplot(aes(x=order_item_id,fill=""))+
geom_bar(aes(y = ..prop.., stat="count"))+
scale_y_continuous(breaks=seq(0,1,0.1),limits =c(0,1))+
labs(x = "產品銷售數量",y = "訂單數量百分比")+
ggtitle("單筆訂單中購買的產品數量")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')
以售出數量2000以上計算(前18名)分别為寢具用品、健康保養、運動休閒用品、家具及裝飾品、電腦配件、家居用品、手錶禮物、電話、園藝工具、汽車、玩具、coolstuff、香水、嬰兒用品、電器用品、文具、時尚包包配件、寵物用品;其較熱門的商品,反倒不像其他電商平台(銷售服飾雜貨等)。
df$product_category_name_english %>% table %>% sort %>%
as.data.frame %>%
filter(Freq>2000) %>%
ggplot(aes(x=.,y=Freq,fill=""))+
geom_bar(stat="identity")+
labs(x = "產品名稱",y = "售出數量")+
ggtitle("商品熱門排行")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')+
theme(axis.text.x = element_text(angle = -70))
以每筆訂單去做的單次購買商品的金額:其客單價(平均數)為137.85元巴西幣(約1097.5台幣),超過50%的人單筆花費(不含運費)少於100巴西幣(約800台幣),其實不高;但因為是購物平台,有各種不同產品,所以在整體價格上也會有很大的離群值(中位數86.8元巴西幣,約700元台幣),因此若在將來要實施行銷方案,可能必須先以各客戶的消費能力來作不同的營銷。
# 篩掉「訂單編號且訂單商品數一樣」的,避免同一商品因付款方式不同重複計算;最後再加總
price = df %>% select(order_id,order_item_id,price) %>% as.data.frame
price1 =price %>%
group_by(order_id,order_item_id) %>%
filter(row_number() == 1) %>%
ungroup()
price2 = aggregate(price ~ order_id, price1, sum) %>% as.data.frame
# 直方圖
ggplot(price2, aes(x=price,fill="")) +
geom_histogram(aes(y = (..count..)/sum(..count..)),binwidth=50)+
scale_x_continuous(breaks=seq(0,2000,100),limits =c(0,2000))+
scale_y_continuous(labels = percent)+
labs(x = "商品消費金額(巴西幣)",y = "數量/總數量")+
ggtitle("單筆消費商品金額(不含運費)")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')+
theme(axis.text.x = element_text(angle = -70))
# 盒狀圖
ggplot(price2, aes(x =1, y = price,fill="")) +
geom_boxplot( aes( y = price),binwidth=50)+
scale_y_continuous(breaks=seq(0,1000,100),limits =c(0,1000))+
labs(x = "整個平台的訂單",y = "總消費金額(不含運費)")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')
mean(price2$price) # 平均數 137.85
## [1] 137.8485
median(price2$price) # 中位數 86.8
## [1] 86.8
其含運費後平均提升至160.8元巴西幣(約1290台幣),中位數提升至105.28元(約850台幣)
# 會因為買的商品數量,重複計算到總金額;如果只篩訂單,又會把不同付款方式的各個金額篩掉
# 所以我們刪掉「訂單且付款方式一樣」的,並合併不同付款方式的各個金額變成各訂單總金額。
p =df %>%
group_by(order_id, payment_sequential) %>%
filter(row_number() == 1) %>%
ungroup()
pricecount = p %>% select(order_id,payment_value) %>% as.data.frame
pricecount1 = aggregate(payment_value ~ order_id, pricecount, sum) %>% as.data.frame
# 直方圖
ggplot(pricecount1, aes(x=payment_value,fill="")) +
geom_histogram(aes(y = (..count..)/sum(..count..)),binwidth=50)+
scale_x_continuous(breaks=seq(0,2000,100),limits =c(0,2000))+
scale_y_continuous(labels = percent)+
labs(x = "訂單總金額(巴西幣)",y = "數量/總數量")+
ggtitle("單筆消費總金額(含運費)")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')+
theme(axis.text.x = element_text(angle = -70))
# 盒狀圖
ggplot(pricecount1, aes(x =1, y = payment_value,fill="")) +
geom_boxplot( aes( y = payment_value),binwidth=50, outlier.alpha = 0.01)+
scale_y_continuous(breaks=seq(0,1000,100),limits =c(0,1000))+
labs(x = "整個平台的訂單",y = "總消費金額(含運費)")+
ggtitle("單筆消費總金額")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')
mean(pricecount1$payment_value) # 平均數 160.801
## [1] 160.801
median(pricecount1$payment_value) # 中位數 105.28
## [1] 105.28
總結:此平台吸引的消費族群,多位於聖保羅州等巴西東南側地帶,偏好於平日白天活動時間(10~16點)購物,並習慣使用信用卡支付,是較為信任網路資安,且經濟能力還算穩定者;從顧客忠誠度來看,可以感受到這些消費者仍在嘗鮮、試水溫的階段。
前6名為SP,RJ,MG,RS,PR,SC(聚集在巴西東南方);聖保羅州特別多
t= table(o$customer_state) %>% sort %>% as.data.frame
t %>%
ggplot(aes(x=Var1,y=Freq,fill=""))+
geom_bar(stat="identity")+
labs(x = "地區",y = "消費者數量")+
ggtitle("消費者所在地區")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')
購物時間主要在周一到周五的10至16點,還有周一、二的20至22點,但主要為上班時間或上課時段;假日反而沒人網購(除了周日晚上)
table(format(ts,"%u"), format(ts,"%H")) %>%
as.data.frame.matrix %>%
d3heatmap(F,F,col=colorRamp(c('seagreen','lightyellow','red')))
忠誠度低,多數顧客只買一次,估計有可能是此平台才成立不久,正在擴展客源的階段;並且可能是因商場優惠(黑色星期五)湧入的客源。
### 篩掉同一次的購買,以免重覆計算
as.data.frame(table(o$customer_id)) %>%
ggplot(aes(x=Freq,fill=""))+
geom_histogram(aes(y = (..count..)/sum(..count..)),binwidth=1)+
scale_x_continuous(breaks=seq(0,10,1),limits =c(0,5))+
scale_y_continuous(labels = percent)+
labs(x = "購買次數",y = "客戶數百分比")+
ggtitle("每位顧客消費次數")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')
相較於流行於巴西本地的現金支付方法boleto,這些消費者以使用信用卡付款為大宗,佔了約3/4,可以知道他們對於資訊及個資上的安全是較為信任的,且有能力申辦信用卡,代表其經濟能力是穩定且不錯的。
as.data.frame(table(o$payment_type))%>%
ggplot+
geom_bar(aes(x=factor(1),
y=Freq,
fill=Var1),
stat = "identity",
position = 'fill') +
coord_polar("y", start=0)+
labs(x = "",y = "")+
ggtitle("習慣支付方式")+
scale_fill_discrete(name="支付方式",
breaks=c("boleto", "credit_card","debit_card","voucher"),
labels=c("boleto本地現金支付方式", "信用卡","金融卡","禮券,代金券"))+
scale_y_continuous(labels = percent_format())
總結:我們也發現除了產品是否符合預期,物流的速度也是此消費族群重視的地方。而物流影響評價,評價影響銷量;若能提升物流,以提高評價,並連動提高銷售量,將是此平台極大的機會。
平均4.09335分,以五分為最多,然此數據非呈現遞減,一分數量卻高於兩分和三分。
### 篩掉同一個reviewid
r=df[!duplicated(df$review_id), ]
ggplot(r,aes(x=review_score,fill=""))+
geom_bar()+
labs(x = "評分(/5分)",y = "人數")+
ggtitle("評價分數")+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')
mean(r$review_score)
## [1] 4.09335
可以發現銷售量最高的日子:黑色星期五後給出的評價卻不到母親節檔期過後的高,另外我們也可以發現每一天五分的數量跟其他分數比起來都是最多的。
###各分數時間趨勢
r$d= as.Date(r$review_answer_timestamp,format="%Y/%m/%d")
reviewperiod=as.data.frame(table(r$d,r$review_score)) %>%
as.data.frame
reviewperiod$tsd=as.Date(reviewperiod$Var1, format="%Y-%m-%d")
reviewperiod %>%
ggplot(aes(x=tsd,y=Freq,color =Var2))+
geom_line()+
labs(x = "日期",y = "數量")+
ggtitle("評價分數趨勢")+
scale_x_date(date_breaks="month",limits = lim) +
scale_color_discrete(name="給分")+
theme_economist() +
theme(axis.text.x = element_text(angle = -60))
跟在期限內收到商品之間可能有關係
produto 產品/ de介係詞 /não不是/ prazo 交貨/ muito 非常/ entrega 交貨/ antes首先/chegou 到達/recebi 收到了/bom 良好 /
reviewword = as.data.frame(select(r,review_id,review_score,review_comment_title,review_comment_message))
colnames(reviewword) = c("id","score","title","message")
reviewword <- reviewword %>% # 刪除空行
filter(message!=""|!is.na(message))
reviewword1 <- reviewword %>% # 切割(尚未加入停用字)
ungroup() %>%
unnest_tokens(word, message) %>%
filter(!word %in% stop_words$word)
### 文字雲:
count <- reviewword1 %>%
anti_join(stop_words) %>%
count(word) %>%
arrange(desc(n))
count %>% wordcloud2
count
## # A tibble: 17,013 x 2
## word n
## <chr> <int>
## 1 produto 18008
## 2 de 11267
## 3 não 10559
## 4 prazo 8265
## 5 muito 7836
## 6 entrega 6441
## 7 antes 5533
## 8 chegou 5474
## 9 recebi 5168
## 10 bom 4520
## # … with 17,003 more rows
### 計算詞/句頻
n5 =reviewword %>%
ungroup() %>%
unnest_tokens(figram, message, token = "ngrams", n = 5) %>%
separate(figram, c("word1", "word2", "word3", "word4", "word5" ), sep = " ",na.rm = T) %>%
count(word1, word2, word3, word4,word5, sort = TRUE)
head(n5,20)
## # A tibble: 20 x 6
## word1 word2 word3 word4 word5 n
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 <NA> <NA> <NA> <NA> <NA> 66858
## 2 produto chegou antes do prazo 298
## 3 chegou bem antes do prazo 278
## 4 produto entregue antes do prazo 203
## 5 ainda não recebi o produto 183
## 6 o produto chegou antes do 167
## 7 foi entregue antes do prazo 145
## 8 o produto não foi entregue 122
## 9 chegou antes do prazo e 121
## 10 entregue bem antes do prazo 101
## 11 e chegou antes do prazo 93
## 12 produto chegou bem antes do 93
## 13 o produto antes do prazo 87
## 14 até o momento não recebi 84
## 15 não recebi o produto e 84
## 16 recebi o produto antes do 84
## 17 produto entregue dentro do prazo 83
## 18 chegou antes do prazo previsto 77
## 19 o produto chegou no prazo 77
## 20 produto chegou dentro do prazo 75
# 產品在截止日期前到達
# 在截止日期前到達
# 產品在截止日期前交付
# 我沒有收到產品
# 產品到貨之前到了
# 在截止日期前交付
# 產品未送達
# 在截止日期前抵達
# 在截止日期前交付
此平台有約90%的會準時到貨,平均物流抵達時間為12.5天內抵達居多,最長高達210天,最短一天不到,但光靠圖表,我們無法明顯看出兩者之間的關聯;我們使用相關性和回歸檢定,才發現呈現低度負相關,且回歸檢定上也是顯著的(指物流花越久,分數越低;物流越快,分數越高)。
# 物流準時時間
tspre=as.POSIXct(o$order_estimated_delivery_date, format="%Y-%m-%d %H:%M:%S")
tsaft=as.POSIXct(o$order_delivered_customer_date, format="%Y-%m-%d %H:%M:%S")
as.data.frame(table(tsaft > tspre))%>%
ggplot+
geom_bar(aes(x=factor(1),
y=Freq,
fill=Var1),
stat = "identity",
position = 'fill') +
labs(x = "",y = "")+
ggtitle("物流準時程度")+
coord_polar("y", start=0)+
scale_fill_discrete(name="到貨期",
breaks=c("FALSE", "TRUE"),
labels=c("準時期限內", "遲交貨"))+
theme_economist() +
scale_color_economist() +
scale_y_continuous(labels = percent_format())
# 關係
tsr1=as.POSIXct(r$order_purchase_timestamp, format="%Y-%m-%d %H:%M:%S")
tsr2=as.POSIXct(r$order_delivered_customer_date, format="%Y-%m-%d %H:%M:%S")
r = mutate(r,spendtime=difftime(tsr2,tsr1, units="day") ) #天數
# 視覺化
r %>%
ggplot(aes(review_score, spendtime)) +
geom_point(mapping = NULL, data = NULL, stat = "identity", position = "identity",color="#FF7744") +
labs(x = "評分(/5分)",y = "物流等待(天數)")+
ggtitle("物流跟評分")+
scale_y_continuous(labels = comma)+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')
summary(as.numeric(r$spendtime))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.5334 6.7635 10.2109 12.5501 15.7053 209.6286 2122
# 相關
cor.test(r$review_score, as.numeric(r$spendtime))
##
## Pearson's product-moment correlation
##
## data: r$review_score and as.numeric(r$spendtime)
## t = -109.76, df = 94930, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3412025 -0.3299126
## sample estimates:
## cor
## -0.3355696
# 回歸
model <- lm(formula= review_score ~ spendtime,
data=r)
summary(model)
##
## Call:
## lm(formula = review_score ~ spendtime, data = r)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6771 -0.4126 0.5124 0.7619 8.8013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7163697 0.0065356 721.6 <2e-16 ***
## spendtime -0.0454997 0.0004146 -109.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.219 on 94930 degrees of freedom
## (2122 observations deleted due to missingness)
## Multiple R-squared: 0.1126, Adjusted R-squared: 0.1126
## F-statistic: 1.205e+04 on 1 and 94930 DF, p-value: < 2.2e-16
評分五分的數量與銷售量之間呈現中度正相關,且其線性回歸顯著,代表當五分的評分數上升時,銷售量也會跟著提升。
r$o= as.Date(r$order_purchase_timestamp,format="%Y-%m-%d")
ro =as.data.frame(table(r$o))
reviewperiod1 = reviewperiod %>% filter(Var2 ==5) # 篩出五分
ro =merge(ro,reviewperiod1, by = "Var1", x.all = TRUE)
colnames(ro)=c("時間","sales","分數","評5分數","時間1")
ro %>%
ggplot(aes(sales,評5分數)) +
geom_point(mapping = NULL, data = NULL, stat = "identity", position = "identity",color="#FF7744",size=1) +
labs(x = "5分的評分數",y = "銷售量")+
ggtitle("產品銷售量與評分")+
scale_y_continuous(labels = comma)+
theme_economist() +
scale_color_economist()+
theme(legend.position='none')
cor.test(ro$sales, ro$評5分數)
##
## Pearson's product-moment correlation
##
## data: ro$sales and ro$評5分數
## t = 15.724, df = 595, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4825511 0.5961422
## sample estimates:
## cor
## 0.541816
電商平台因剛成立,處成長期而有上升的狀態,也會因為特別活動而有突出的訂單數,因此若能在特別節日中,舉辦相關活動,並且抓住成長期的大量顧客,則可以脫隱而出。
訂單中購買商品數量只買一項,且客單價低,但是有許多離群值,因此須依每個客戶潛力(如:將之分為高、低消費族群),做適合的行銷;另外也可依商品在不同地區、時間的熱門程度來做營銷。
此平台當前吸引的消費族群,多位於聖保羅州等巴西東南側地帶,偏好於平日白天活動時間(10~16點)購物,並習慣使用信用卡支付,是較為信任網路資安,且經濟能力還算穩定者;從顧客忠誠度來看,可以感受到這些消費者仍在嘗鮮、試水溫的階段。
從評論文字探索,除了產品是否符合預期,物流的速度也是此消費族群重視的地方。而物流影響評價,評價影響銷量;若能提升物流,以提高評價,並連動提高銷售量,將是此平台極大的機會。
因為透過單看各分數的文字雲無法看出實際問題,所以使用tf-idf
評分為1:insatisfeito不滿/ tento試圖 /parcela一部分 /decepção騙局 /enganosa 誤導/entrar 聯繫,進入/descaso 忽視
評分為2:tento試圖/entrar 聯繫,進入/flex靈活性/moist 潮濕/Pneu Maggion aro 某產品名稱 /esquerdo左
評分為3:identidade身分/taxas利率/pentes梳子/ tirando服用/apto適合/automotivo汽車/presilhas 緊/
評分為4:ótimo 美好/rapida很快/direitinho恰到好處/benefícios好處/baixar降低/Tirando 除此 評分為5:ótimo 美好/rapida很快/direitinho恰到好處/adorou喜歡/excelentes優點/superou超出預期/parabens恭喜
小結:估計低分是因為覺得被騙被誤導,但是聯絡賣家又被忽視;有一些詞是直接跟產品有關,像是2分出現的某輪胎牌子
### 每區間總共字數
reviewword2 =reviewword1 %>%
count(score, word, sort = TRUE)
total_words <- reviewword2 %>%
group_by(score) %>%
summarize(total = sum(n))
t_words <- left_join(reviewword2, total_words)
t_words
## # A tibble: 31,725 x 4
## score word n total
## <int> <chr> <int> <int>
## 1 5 produto 7963 142573
## 2 1 não 6008 125654
## 3 5 prazo 5569 142573
## 4 5 muito 5296 142573
## 5 1 produto 4861 125654
## 6 5 antes 4452 142573
## 7 1 de 3974 125654
## 8 5 de 3729 142573
## 9 5 entrega 3628 142573
## 10 5 recomendo 3227 142573
## # … with 31,715 more rows
###找出重要的單詞(專有詞)
t_words <-t_words %>%
bind_tf_idf(word,score, n)
t_words
## # A tibble: 31,725 x 7
## score word n total tf idf tf_idf
## <int> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 5 produto 7963 142573 0.0559 0 0
## 2 1 não 6008 125654 0.0478 0 0
## 3 5 prazo 5569 142573 0.0391 0 0
## 4 5 muito 5296 142573 0.0371 0 0
## 5 1 produto 4861 125654 0.0387 0 0
## 6 5 antes 4452 142573 0.0312 0 0
## 7 1 de 3974 125654 0.0316 0 0
## 8 5 de 3729 142573 0.0262 0 0
## 9 5 entrega 3628 142573 0.0254 0 0
## 10 5 recomendo 3227 142573 0.0226 0 0
## # … with 31,715 more rows
### 計算出tf-idf
t_words %>%
select(-total) %>%
arrange(desc(tf_idf))
## # A tibble: 31,725 x 6
## score word n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 5 otima 201 0.00141 0.223 0.000315
## 2 5 rapida 195 0.00137 0.223 0.000305
## 3 5 direitinho 73 0.000512 0.511 0.000262
## 4 5 adorou 70 0.000491 0.511 0.000251
## 5 5 excelentes 38 0.000267 0.916 0.000244
## 6 5 superou 143 0.00100 0.223 0.000224
## 7 4 direitinho 22 0.000423 0.511 0.000216
## 8 2 aussie 4 0.000131 1.61 0.000211
## 9 5 parabens 127 0.000891 0.223 0.000199
## 10 5 agradecer 29 0.000203 0.916 0.000186
## # … with 31,715 more rows
### 結果視覺化
t_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(score) %>%
top_n(7) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill =score)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~score, ncol = 2, scales = "free") +
coord_flip()
小結:極低度相關
### 產品照片數與評分的關係
cor(r$review_score, r$product_photos_qty)
## [1] 0.01435297
cor.test(r$review_score, r$product_photos_qty)
##
## Pearson's product-moment correlation
##
## data: r$review_score and r$product_photos_qty
## t = 4.4719, df = 97052, p-value = 7.763e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.008062374 0.020642439
## sample estimates:
## cor
## 0.01435297
r %>%
ggplot(aes(review_score, product_photos_qty,group=review_score)) +
geom_boxplot( width = .5, outlier.alpha = 0.01)+
geom_point(mapping = NULL, data = NULL, stat = "identity", position = "identity") +
labs(x = "評分(/5分)",y = "產品照片數量")+
ggtitle("產品照片數與評分")+
scale_y_continuous(labels = comma)
### 產品簡介字數與評分的關係
cor(r$review_score, r$product_description_lenght)
## [1] 0.01006433
cor.test(r$review_score, r$product_description_lenght)
##
## Pearson's product-moment correlation
##
## data: r$review_score and r$product_description_lenght
## t = 3.1355, df = 97052, p-value = 0.001716
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.003773238 0.016354620
## sample estimates:
## cor
## 0.01006433
r %>%
ggplot(aes(review_score, product_description_lenght,group=review_score)) +
geom_boxplot( width = .5, outlier.alpha = 0.01)+
geom_point(mapping = NULL, data = NULL, stat = "identity", position = "identity") +
labs(x = "評分(/5分)",y = "產品簡介字數")+
ggtitle("產品簡介字數與評分")+
scale_y_continuous(labels = comma)
小結:基本上關聯不大
cor(r$review_score, r$freight_value)
## [1] -0.03853995
cor.test(r$review_score, r$freight_value)
##
## Pearson's product-moment correlation
##
## data: r$review_score and r$freight_value
## t = -12.015, df = 97052, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04482041 -0.03225644
## sample estimates:
## cor
## -0.03853995
r %>%
ggplot(aes(review_score,freight_value)) +
geom_point(mapping = NULL, data = NULL, stat = "identity", position = "identity") +
labs(x = "評分(/5分)",y = "運費")+
ggtitle("運費與評分")+
scale_y_continuous(labels = comma)
SP聖保羅州 40801 /RJ里約熱內盧 12569 /MG米纳斯吉拉斯 11375 /RS南大河州 /PR巴拉那 /SC聖卡塔琳娜.....
這些州的特性???
axis_theme<-theme( #用來調整圖形的X,Y軸圖像
axis.title=element_text(
#family=NULL,
face = "bold", #字体("plain", "italic", "bold", "bold.italic")
colour = "red", #字体颜色
size = 20,#字体大小
hjust = .5, #调整轴标题1:纵轴靠上,横轴靠右;0.5居中;0:纵轴靠下,横轴靠左
vjust = .5, #1:靠图边框;0靠近轴线;.5居中
angle = 0 #为什么只对横轴标题有作用?
),
axis.title.x=element_text(colour="blue",vjust=-3),#x轴标题设置,优先级高于axis.title
axis.title.y=element_text(colour="orange",vjust=5),#同上
axis.text=element_text(colour="red"),#设置坐标轴刻度数字
axis.text.x=element_text(colour="blue"),#优先级高于aixis.text
axis.text.y=element_text(colour="orange"),#同上
axis.ticks=element_line(#坐标轴刻度线的设置
colour="red",
size=0.05,
linetype=1,
lineend=1),
axis.ticks.x=element_line(colour="blue"),#优先级高于axis.ticks
axis.ticks.y=element_line(colour="orange"),#同上
axis.ticks.length=unit(.4,"lines"),#设置刻度线的高度
axis.ticks.margin=unit(10,"mm"),#设置刻度数字与刻度线的距离
axis.line=element_line(#设置轴线
colour="red"),
axis.line.x=element_line(colour="white"),#设置x轴线,优先于axis.line
axis.line.y=element_line(colour="white"))#类似axis.line.
o=df[!duplicated(df$order_id), ] #篩掉同一個訂單
table(o$customer_state) %>%sort #依據州排序訂單量
##
## RR AP AC AM RO TO SE AL RN PI PB MS
## 44 68 79 146 242 275 343 407 476 486 523 700
## MA MT PA CE PE GO ES DF BA SC PR RS
## 730 891 951 1312 1630 1959 2008 2094 3314 3554 4928 5351
## MG RJ SP
## 11375 12569 40801
qplot(data=o,x=customer_state,geom = "bar")+
labs(x = "Cusotomer_State",y = "Total_amount",title = "Regional Distribution_Orders" )+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) +
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 30, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))+axis_theme
df$customer_state %>% table %>% sort %>% #依據州來排序訂單
as.data.frame %>%
ggplot(aes(x=.,y=Freq))+
geom_bar(stat="identity")+
labs(x = "Cusotomer_State",y = "Total_amount",title = "Regional Ranking_Orders" )+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme+
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
### 各地區銷售金額排行
state_money= select(df, customer_state, price, freight_value) #挑選出州,商品單價及運費
state_money %>% mutate(total_value=price+freight_value) %>% #總金額=商品單價+商品運費
group_by(customer_state)%>% #依照州分類
summarize(sum_value=sum(total_value),mean_value=mean(total_value)) %>% #運算總金額以及平均金額
arrange(sum_value)%>% as.data.frame %>% tail %>% #依照總金額進行排序,取出前五名
ggplot(aes(x=customer_state,y=sum_value))+ #X軸為州名稱,Y軸為總銷售金額
geom_bar(stat="identity")+
labs(x = "地區名稱",y = "銷售金額" ,title = "地區排行")+
theme_excel() +
theme_economist()+
theme_economist_white()+
theme(axis.text.x = element_text(angle = 90))
axis_theme<-theme(
axis.title=element_text(
#family=NULL,
face = "bold", #字体("plain", "italic", "bold", "bold.italic")
colour = "red", #字体颜色
size = 15,#字体大小
hjust = .5, #调整轴标题1:纵轴靠上,横轴靠右;0.5居中;0:纵轴靠下,横轴靠左
vjust = .5, #1:靠图边框;0靠近轴线;.5居中
angle = 0 #为什么只对横轴标题有作用?
),
axis.title.x=element_text(colour="blue"),#x轴标题设置,优先级高于axis.title
axis.title.y=element_text(colour="orange",vjust=5),#同上
axis.text=element_text(colour="red"),#设置坐标轴刻度数字
axis.text.x=element_text(colour="blue"),#优先级高于aixis.text
axis.text.y=element_text(colour="orange"),#同上
axis.ticks=element_line(#坐标轴刻度线的设置
colour="red",
size=.5,
linetype=1,
lineend=1),
axis.ticks.x=element_line(colour="blue"),#优先级高于axis.ticks
axis.ticks.y=element_line(colour="orange"),#同上
axis.ticks.length=unit(.4,"lines"),#设置刻度线的高度
axis.ticks.margin=unit(.4,"cm"),#设置刻度数字与刻度线的距离
axis.line=element_line(#设置轴线
colour="red"),
axis.line.x=element_line(colour="white"),#设置x轴线,优先于axis.line
axis.line.y=element_line(colour="white"))#类似axis.line.
state_money %>% mutate(total_value=price+freight_value) %>% #總金額=商品單價+商品運費
group_by(customer_state) %>%
summarize(sum_value=sum(total_value),mean_value=mean(total_value)) %>%
arrange(sum_value)%>% as.data.frame %>%
ggplot(aes(x=customer_state,y=sum_value)) +
geom_bar(aes(fill=customer_state),position="stack",stat="identity")+ #不同地區依照不同顏色區分
labs(x = "Cusotomer_State",y = "Total Sales amount",title = "Regional Ranking_Total Sales" )+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme+ #加上axis_theme調整字型大小
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
state_money %>% mutate(total_value=price+freight_value) %>%
group_by(customer_state) %>%
summarize(sum_value=sum(total_value),mean_value=mean(total_value)) %>%
arrange(sum_value)%>% as.data.frame %>% tail %>% #依照上一張圖,取出前五名
ggplot(aes(x=customer_state,y=sum_value)) +
geom_bar(aes(fill=customer_state),position="stack",stat="identity")+ #依照不同地區採不同顏色
labs(x = "Cusotomer_State",y = "Total Sales amount",title = "Regional Ranking_Total Sales" )+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme+
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
### 各地區運費金額排行
state_freight= select(df, customer_state, price, freight_value) #與運費有關的數據
state_freight %>% group_by(customer_state) %>% #依照州排序運費
summarize(sum_freight=sum(freight_value),mean_freight=mean(freight_value)) %>% #總運費金額及平均運費
arrange(mean_freight) %>%
ggplot(aes(x=customer_state,y=mean_freight))+ #x軸為州,y軸為平均運費
geom_bar(aes(fill=customer_state),stat="identity",position="stack")+ #不同州不同顏色
labs(x = "Cusotomer_State",y = "Shipping amount" ,title = "Regional_Ranking_Mean_Shipping")+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme+
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
state_freight= select(df, customer_state, price, freight_value)
state_freight %>% group_by(customer_state) %>%
summarize(sum_freight=sum(freight_value),mean_freight=mean(freight_value)) %>%
arrange(mean_freight) %>% tail %>% #採平均運費做排序,且只選取前五名
ggplot(aes(x=customer_state,y=mean_freight))+
geom_bar(aes(fill=customer_state),stat="identity",position="stack")+
labs(x = "Cusotomer_State",y = "Shipping amount" ,title = "Regional_TOP5_Mean_Shipping")+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme+
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
SP,RJ,MG,PR,RS
axis_theme_2<-theme( #不同的圖,需要不同的字體大小,比較美觀
axis.title=element_text(
#family=NULL,
face = "bold", #字体("plain", "italic", "bold", "bold.italic")
colour = "red", #字体颜色
size = 10,#字体大小
hjust = .5, #调整轴标题1:纵轴靠上,横轴靠右;0.5居中;0:纵轴靠下,横轴靠左
vjust = .5, #1:靠图边框;0靠近轴线;.5居中
angle = 0 #为什么只对横轴标题有作用?
),
axis.title.x=element_text(colour="blue",vjust=-3,size=15),#x轴标题设置,优先级高于axis.title
axis.title.y=element_text(colour="orange",vjust=5,size=15),#同上
axis.text=element_text(colour="red"),#设置坐标轴刻度数字
axis.text.x=element_text(colour="blue"),#优先级高于aixis.text
axis.text.y=element_text(colour="orange"),#同上
axis.ticks=element_line(#坐标轴刻度线的设置
colour="red",
size=.01,
linetype=1,
lineend=1),
axis.ticks.x=element_line(colour="blue"),#优先级高于axis.ticks
axis.ticks.y=element_line(colour="orange"),#同上
axis.ticks.length=unit(.4,"lines"),#设置刻度线的高度
axis.ticks.margin=unit(.4,"cm"),#设置刻度数字与刻度线的距离
axis.line=element_line(#设置轴线
colour="red"),
axis.line.x=element_line(colour="white"),#设置x轴线,优先于axis.line
axis.line.y=element_line(colour="white"))#类似axis.line.
mainstate = df %>% #選取總銷額金額top5的五大洲
filter(customer_state == "SP" | customer_state == "RJ" |
customer_state == "MG"| customer_state == "PR"| customer_state == "RS")
mainstate %>%
ggplot( aes(x=payment_value)) +
geom_histogram(aes(y = (..count..)/sum(..count..)),
binwidth=50,position="dodge")+ #觀看其各區間消費金額比率
facet_grid(~customer_state )+ #依照不同州來劃分
scale_x_continuous(breaks=seq(0,1000,300),limits =c(0,1000))+ #金額由1~1000,每300為一區間
scale_y_continuous(labels = percent)+ #y軸以比率顯示
theme(axis.text.x = element_text(angle = 90))+
labs(x = "Orders_value",y = "Amount/Total_Amount", title = "Distribution of consumption")+
theme_excel() +
theme_economist()+
theme_economist_white()+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme_2+
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
以售出數量2000以上計算(前18名)分别為寢具用品、健康保養、運動休閒用品、家具及裝飾品、電腦配件、家居用品、手錶禮物、電話、園藝工具、汽車、玩具、coolstuff?、香水、嬰兒用品、電器用品、文具、時尚包包配件、寵物用品
legend_theme<-theme( # 調整圖例大小及顏色
legend.background=element_rect(
colour=NA# 图例边框颜色
),# 图例背景填充色
legend.margin=unit(.1,"inches"),# 图例与图片区边缘的距离
legend.key=element_rect(fill="yellow"),
legend.key.size=unit(.5,"inches"),# 图例分类符号的大小
legend.key.height=unit(.2,"inches"),# 图例分类符号高度
legend.key.width=unit(.2,"inches"),# 图例符号的宽度
legend.text=element_text(colour="red",size=3),# 图例分类标签设置
legend.text.align=1,# 0左,1右,0.5居中, 图例分类标签的对齐方式
legend.title=element_text(colour="blue",size=3),# 图例标题设置
legend.title.align=1,#图例标题对齐方式
legend.position=c("right"),#"none","left","right","bottom","top",or
# two-element numeric vector,(0,0)-(1,1)
legend.direction="vertical",#"vertical" 图例排列方向
legend.justification=c("center"),#"center" or two-element numeric vector
legend.box="vertical",#"horizontal",对图例的排列方式
legend.box.just="top"#多图例的居中方式+
)
df$product_category_name_english %>% table %>% sort %>% #依照商品類別排序
as.data.frame %>%
filter(Freq>2000) %>% #選出消費次數超過2000次的商品類別
ggplot(aes(x=.,y=Freq))+
geom_bar(stat="identity")+
labs(x = "Product_Category",y = "Sells_Amount",title = "TOP Product_Ranking")+
theme(plot.title = element_text(hjust = 0.5)) +
theme_excel() +
theme_economist()+
theme_economist_white()+
theme(axis.text.x = element_text(angle = 90))
# 銷售總金額TOP5商品
top5= df %>% select(product_category_name_english,customer_state,price,freight_value) %>%
mutate(total_value=price+freight_value) %>% group_by(product_category_name_english) %>%
summarize(sum_value=sum(total_value),mean_value=mean(total_value)) %>%
arrange(sum_value)%>% as.data.frame %>% tail
# 挑選出銷售金額(商品價格+運費)TOP5的商品類別
ggplot(top5,aes(x=product_category_name_english),position="dodge") +
# 依照個商品類別畫圖
geom_bar(aes(y = sum_value,fill=product_category_name_english),
stat="identity",position="dodge")+ #不同類別給予不同顏色
labs(x = "Product_Category",y = "Total_Sales", title = "TOP5_Goods")+
scale_x_discrete(breaks=c("bed_bath_table","computers_accessories", "furniture_decor", "health_beauty","sports_leisure","watches_gifts"),
labels=c("寢具", "電腦配件", "家具","健康","運動","手錶"))+
# 將商品名稱由英文改成中文(美觀)
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) +
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))+axis_theme_2
mainstate = df %>%
filter(customer_state == "SP" | customer_state == "RJ" | customer_state == "MG"| customer_state == "PR"| customer_state == "RS")
top5goods_mainstate= mainstate %>%
filter(product_category_name_english=="health_beauty"|
product_category_name_english=="watches_gifts"|
product_category_name_english=="bed_bath_table" |
product_category_name_english=="sports_leisure"|
product_category_name_english=="computers_accessories") %>% #選出銷售金額top5的商品類別
mutate(total_value= price + freight_value) #總金額=商品單價+運費
# 銷售總金額TOP5商品在大都會的銷售分布差異
ggplot(top5goods_mainstate,aes(x=factor(customer_state)),position="dodge") + #x軸為州
geom_bar(aes(y = total_value,fill=customer_state),stat="identity",position="dodge")+ #y軸為總銷售金額
facet_grid(~product_category_name_english)+ #依照商品類別分成5類
labs(x = "Customer_State",y = "Total_Sales", title = "Distribution of consumption(region)")+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme_2+
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
### 地區經緯度與賣家買家關係
geolocation$geolocation_lat = round(geolocation$geolocation_lat,3) #四捨五入至小數點第三位
geolocation$geolocation_lng = round(geolocation$geolocation_lng,3)
selllocation = geolocation %>% group_by(geolocation_city) %>% summarise(selllat = max(geolocation_lat),selllng=max(geolocation_lng))
custlocation = geolocation %>% group_by(geolocation_city) %>% summarise(custlat = max(geolocation_lat),custlng=max(geolocation_lng))
MergedData<-merge(item,seller,by.x="seller_id",by.y="seller_id")
CustOrd<-merge(order,customer,by.x="customer_id",by.y="customer_id")
custsellord<-merge(CustOrd,MergedData,by="order_id")
custsellordprod<-merge(custsellord,product,by="product_id")
complete<-merge(custsellordprod,payment,by="order_id")
complete1<-merge(complete,selllocation,by.x="seller_city",by.y="geolocation_city")
complete2<-merge(complete1,custlocation,by.x="customer_city",by.y="geolocation_city")
complete3<-merge(complete2,name,by="product_category_name")
dist_list = list()
for (i in 1:nrow(complete2)) {
dist_list[[i]] <- gdist(lon.1 = complete2$selllng[i],
lat.1 = complete2$selllat[i],
lon.2 = complete2$custlng[i],
lat.2 = complete2$custlat[i],
units="miles")
}
head(dist_list)
## [[1]]
## [1] 815.5593
##
## [[2]]
## [1] 415.3141
##
## [[3]]
## [1] 351.2629
##
## [[4]]
## [1] 518.2484
##
## [[5]]
## [1] 208.3824
##
## [[6]]
## [1] 103.155
complete2$distbtwn<-as.integer(dist_list)
Brazil<-map_data("world")%>%filter(region=="Brazil")
# Removing some outliers
#Brazils most Northern spot is at 5 deg 16′ 27.8″ N latitude.;
geolocation = geolocation[geolocation$geolocation_lat <= 5.27438888,]
#it’s most Western spot is at 73 deg, 58′ 58.19″W Long.
geolocation = geolocation[geolocation$geolocation_lng >= -73.98283055,]
#It’s most southern spot is at 33 deg, 45′ 04.21″ S Latitude.
geolocation = geolocation[geolocation$geolocation_lat >= -33.75116944,]
#It’s most Eastern spot is 34 deg, 47′ 35.33″ W Long.
geolocation = geolocation[geolocation$geolocation_lng <= -34.79314722,]
glimpse(geolocation)
## Observations: 1,000,121
## Variables: 5
## $ geolocation_zip_code_prefix <int> 1037, 1046, 1046, 1041, 1035, 1012, …
## $ geolocation_lat <dbl> -23.546, -23.546, -23.546, -23.544, …
## $ geolocation_lng <dbl> -46.639, -46.645, -46.643, -46.639, …
## $ geolocation_city <chr> "sao paulo", "sao paulo", "sao paulo…
## $ geolocation_state <chr> "SP", "SP", "SP", "SP", "SP", "SP", …
complete2 = complete2[complete2$selllat <= 5.27438888,]
complete2 = complete2[complete2$custlat <= 5.27438888,]
ggplot() +
geom_polygon(data = Brazil, aes(x=long, y = lat, group = group), fill="black")+
geom_point(data= complete2,aes(x=selllng,y=selllat,color=seller_state),size=0.2)+
labs(x = "Logitude",y = "Latitude", title = "Distribution of consumption(Sellers)")+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme_2+
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
ggplot() +
geom_polygon(data = Brazil, aes(x=long, y = lat, group = group), fill="black")+
geom_point(data= complete2,aes(x=custlng,y=custlat,color=customer_state),size=0.2)+
labs(x = "Logitude",y = "Latitude", title = "Distribution of consumption(Consumers)")+
theme(plot.title = element_text(hjust = 0.5)) + #使標題置中
theme(panel.background = element_rect(size = 2)) + axis_theme_2+
theme(plot.background = element_rect( size = 1, linetype = 4),
plot.title = element_text(colour = "black", face = "bold",
size = 15, vjust = 2), plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"))
ggplot() +
geom_bar(data= complete3,aes(product_category_name_english,fill=seller_state),width=1)+ coord_flip()
#訂單編號VS消費金額、運費以及總付款金額(order_money)
order_money=select(item,order_id,price,freight_value)
order_money=
order_money%>%
mutate(total= price + freight_value)%>%
group_by(order_id)
#統整在一起
H_data = order_money
order_money = NULL
#訂單編號_購買(order_time)
order_time = select(order,order_id,order_purchase_timestamp)
order_time$order_purchase_timestamp=
as.POSIXct(order_time$order_purchase_timestamp,
format="%Y-%m-%d %H:%M:%S")
order_time$month=format(order_time$order_purchase_timestamp,format="%Y-%m")
#統整在一起
H_data = merge(H_data, order_time, by="order_id")
order_time =NULL
#訂單編號_類別(order_cato)
tras = select(product, product_id ,product_category_name)
product_cato = select(item ,order_id ,product_id)
order_cato = merge( tras, product_cato ,by = "product_id", na.rm=TRUE)
order_cato =
order_cato%>%
merge( name , by="product_category_name", na.rm=TRUE)%>%
select(order_id,product_category_name_english)
#統整在一起
H_data = merge(H_data, order_cato, by="order_id")
H_data$count = 1
order_cato =NULL
tras=NULL
product_cato=NULL
#物流時間整合
ooh = select(order,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,
order_delivered_customer_date,order_estimated_delivery_date)
ooh$shipping_limit_date = item$shipping_limit_date
ooh$order_approved_at= as.Date(ooh$order_approved_at, format= "%Y-%m-%d %H:%M:%S")
ooh$order_delivered_customer_date= as.Date(ooh$order_delivered_customer_date, format= "%Y-%m-%d %H:%M:%S")
ooh$shipping_limit_date= as.Date(ooh$shipping_limit_date, format= "%Y-%m-%d %H:%M:%S")
ooh$order_estimated_delivery_date= as.Date(ooh$order_estimated_delivery_date, format= "%Y-%m-%d %H:%M:%S")
ooh$order_purchase_timestamp= as.Date(ooh$order_purchase_timestamp, format= "%Y-%m-%d %H:%M:%S")
ooh$order_delivered_carrier_date= as.Date(ooh$order_delivered_carrier_date, format= "%Y-%m-%d %H:%M:%S")
ooh =
ooh%>%
mutate(
check=order_approved_at - order_purchase_timestamp,
pack = shipping_limit_date- order_approved_at,
true_deli = order_delivered_customer_date - shipping_limit_date,
balance =order_estimated_delivery_date - order_delivered_customer_date
)%>%
select(check, pack, true_deli, balance)
ooh$check = as.integer(ooh$check)
ooh$pack = as.integer(ooh$pack)
ooh$true_deli = as.integer(ooh$true_deli)
ooh$balance = as.integer(ooh$balance)
par(mfrow=c(1,4))
boxplot(ooh$check )
boxplot(ooh$pack)
boxplot(ooh$true_deli)
boxplot(ooh$balance)
# 購買時間(6-4)
# Check
# 付款被審核成功(6-5)
# 訂單過帳(6-6)
# pack
# 出貨時間(3-5)
# True_deli
# 實際交貨時間(6-7)
# Balance
# 預定交貨時間(6-8)
購買金額最高的前六個分別是: bed_bath_table,health_beauty,computer_accessories, furniture_decor,watches_gifts,sports_leisure
購買次數最多的前六個分別是: bed_bath_table,furniture_decor,health_beauty, computer_accessories, sports_leisure,horsewares
money_cato =
H_data%>%
select(product_category_name_english,total,count)%>%
group_by(product_category_name_english) %>%
summarise(total = sum(total),count=sum(count))
money_cato_t = arrange(money_cato,desc(total))
money_cato_t =money_cato_t[1:6, ]
money_cato_c = arrange(money_cato,desc(count))
money_cato_c =money_cato_c[1:6, ]
money_cato_t$product_category_name_english
## [1] "bed_bath_table" "health_beauty" "computers_accessories"
## [4] "furniture_decor" "watches_gifts" "sports_leisure"
money_cato_c$product_category_name_english
## [1] "bed_bath_table" "furniture_decor" "health_beauty"
## [4] "computers_accessories" "sports_leisure" "housewares"
money_cato_t = NULL
money_cato_c = NULL
K = H_data
K=
K%>%
group_by(month)%>%
summarise(total = sum(total),price = sum(price))
K$season =1
K$season[1:3]="16-4"
K$season[4:6]="17-1"
K$season[7:9]="17-2"
K$season[10:12]="17-3"
K$season[13:15]="17-4"
K$season[16:18]="18-1"
K$season[19:21]="18-2"
K$season[22:24]="18-3"
L = K%>%
group_by(season)%>%
summarise(total = sum(total),price = sum(price))
qplot(x=month,y=total,data=K,
color = rgb(255, 233, 139, maxColorValue = 255))+
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "日期",y = "金額",title = "總金額月趨勢圖")+
theme_bw()+
theme_light()
qplot(x=month, y=price,data=K,
color = rgb(93, 202, 253, maxColorValue = 255))+
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "日期",y = "金額",title = "商品金額月趨勢圖")+
theme_bw()+
theme_light()
qplot(season,total,data=L,
color = rgb(255, 233, 139, maxColorValue = 255) )+
labs(x = "日期",y = "金額",title = "總金額季趨勢圖")+
theme_bw()+
theme_light()
qplot(season,price, data=L,
color = rgb(93, 202, 253, maxColorValue = 255))+
labs(x = "日期",y = "金額",title = "消費金額季趨勢圖")+
theme_bw()+
theme_light()
K=L=O=P=NULL
time_cat_mon = filter(H_data,
product_category_name_english == "bed_bath_table"|
product_category_name_english == "health_beauty" |
product_category_name_english == "computers_accessories"|
product_category_name_english =="furniture_decor"|
product_category_name_english =="watches_gifts"|
product_category_name_english =="sports_leisure"|
product_category_name_english =="housewares" )
ggplot()+
geom_bar(data = time_cat_mon,
aes(x=month,y=total, fill = product_category_name_english),
stat = "identity")+
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "日期",y = "金額",title = "總金額月趨勢圖")+
theme_bw()+
theme_light()
K=
time_cat_mon%>%
group_by(month,product_category_name_english)%>%
summarise(total = sum(total),count=sum(count))
ggplot(data = K)+
geom_line(
aes(x=month,y=count,color = product_category_name_english)
)+
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "日期",y = "金額",title = "總金額月趨勢圖")+
theme_bw()
隨著時間增長,我們可看出健康食品的銷量是逐漸成長的,其餘前四名皆有下滑的趨勢,另外從表中我們可發現,在年末時,客人喜歡購買寢具與家具裝飾用品來迎接新的一年,而在2018年初,普遍五大熱門類別商品皆有很高的銷售金額,這可能與巴西嘉年華有關,此時期店家們多會打出優惠折扣,進而拉抬銷量
#bed_bath_table
K=
H_data%>%
filter(product_category_name_english=="bed_bath_table")%>%
group_by(month)%>%
summarise(total = sum(total))
ggplot(data = K)+
geom_point(
aes(x=month,y=total,col='red',size=60)
) +
# geom_abline(x=month,y=total,col='red')+
labs(x = "日期",y = "金額",title = "bed_bath_table月趨勢圖")+ guides(color=FALSE,size=FALSE)+
theme_bw()+
theme_light()+
theme(axis.text.x = element_text(angle = -60,size=20)) +
theme(axis.text.y = element_text(size=30))+
theme(axis.title.y = element_text(size=30)) +
theme(title = element_text(size=30))
#health_beauty
K=
H_data%>%
filter(product_category_name_english=="health_beauty")%>%
group_by(month)%>%
summarise(total = sum(total))
ggplot(data = K)+
geom_point(
aes(x=month,y=total,col="orange",size=60)
) +
labs(x = "日期",y = "金額",title = "health_beauty月趨勢圖")+guides(color=FALSE,size=FALSE)+
theme_bw()+
theme_light()+
theme(axis.text.x = element_text(angle = -60,size=20)) +
theme(axis.text.y = element_text(size=30))+
theme(axis.title.y = element_text(size=30)) +
theme(title = element_text(size=30))
#computers_accessories
K=
H_data%>%
filter(product_category_name_english=="computers_accessories")%>%
group_by(month)%>%
summarise(total = sum(total))
ggplot(data = K)+
geom_point(
aes(x=month,y=total,color ='yellow',size=60)
) +
labs(x = "日期",y = "金額",title = "computers_accessories月趨勢圖")+guides(color=FALSE,size=FALSE)+
theme_bw()+
theme_light()+
theme(axis.text.x = element_text(angle = -60,size=20)) +
theme(axis.text.y = element_text(size=30))+
theme(axis.title.y = element_text(size=30)) +
theme(title = element_text(size=30))
#furniture_decor
K=
H_data%>%
filter(product_category_name_english=="furniture_decor")%>%
group_by(month)%>%
summarise(total = sum(total))
ggplot(data = K)+
geom_point(
aes(x=month,y=total,col = "green",size=60)
) +
labs(x = "日期",y = "金額",title = "furniture_decor月趨勢圖")+guides(color=FALSE,size=FALSE)+
theme_bw()+
theme_light()+
theme(axis.text.x = element_text(angle = -60,size=20)) +
theme(axis.text.y = element_text(size=30))+
theme(axis.title.y = element_text(size=30)) +
theme(title = element_text(size=30))
#watches_gifts
K=
H_data%>%
filter(product_category_name_english=="watches_gifts")%>%
group_by(month)%>%
summarise(total = sum(total))
ggplot(data = K)+
geom_point(
aes(x=month,y=total,col= 'blue',size=60)
) +
labs(x = "日期",y = "金額",title = "watches_gifts月趨勢圖")+guides(color=FALSE,size=FALSE)+
theme_bw()+
theme_light()+
theme(axis.text.x = element_text(angle = -60,size=20)) +
theme(axis.text.y = element_text(size=30))+
theme(axis.title.y = element_text(size=30)) +
theme(title = element_text(size=30))
#sports_leisure
K=
H_data%>%
filter(product_category_name_english=="sports_leisure")%>%
group_by(month)%>%
summarise(total = sum(total))
ggplot(data = K)+
geom_point(
aes(x=month,y=total,col = 'purple',size=60)
) +
labs(x = "日期",y = "金額",title = "sports_leisure月趨勢圖")+guides(color=FALSE,size=FALSE)+
theme_bw()+
theme_light()+
theme(axis.text.x = element_text(angle = -60,size=20)) +
theme(axis.text.y = element_text(size=30))+
theme(axis.title.y = element_text(size=30)) +
theme(title = element_text(size=30))
#housewares
K=
H_data%>%
filter(product_category_name_english=="housewares")%>%
group_by(month)%>%
summarise(total = sum(total))
ggplot(data = K)+
geom_point(
aes(x=month,y=total,col='pink',size=150)
) +
labs(x = "日期",y = "金額",title = "housewares月趨勢圖")+guides(color=FALSE,size=FALSE)+
theme_bw()+
theme_light()+
theme(axis.text.x = element_text(angle = -60,size=20)) +
theme(axis.text.y = element_text(size=30))+
theme(axis.title.y = element_text(size=30)) +
theme(title = element_text(size=30))