本次欲探討的主題是建立在不同元素的添加,是如何影響消費者對於巧克力類零食偏好的程度。使用了常見於巧克力零食類產品的可可濃度、可可脂添加有無、香草精添加有無、卵磷脂添加有無、鹽添加有無、糖添加有無、甜味劑添加有無等變數。希望能對於掌握消費者偏好提出貢獻,更進一步促進廠商產品研發的精確性。
其中,依照主流大眾對於零食的偏好,各元素變數對於評分變數的影響假設如下:
“Cumulative logit model, also known as proportional odds model, is the go-to option when there is a natural ordering in the dependent variable, which is an ordinal response with more-than-one categories.”
\[P(Y\leq j) = \pi_1 + \pi_2 + ... + \pi_j \ ,\ j = 1,2,...,J\]
\[logit[P(Y\leq j)] = \log[{P(Y\leq j) \over 1-P(Y\leq j)}]\ ,\ j = 1,2,...,J-1\]
\[logit[P(Y\leq j)] = \alpha_j + \beta x\]ref | company | company_location | review_date | country_of_bean_origin | specific_bean_origin_or_bar_name | cocoa_percent | rating | counts_of_ingredients | beans | cocoa_butter | vanilla | lecithin | salt | sugar | sweetener | first_taste | second_taste | third_taste | fourth_taste |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2454 | 5150 | U.S.A | 2019 | Madagascar | Bejofo Estate, batch 1 | 76 | 3.75 | 3 | have_bean | have_cocoa_butter | have_not_vanila | have_not_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | cocoa | blackberry | full body | |
2458 | 5150 | U.S.A | 2019 | Dominican republic | Zorzal, batch 1 | 76 | 3.50 | 3 | have_bean | have_cocoa_butter | have_not_vanila | have_not_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | cocoa | vegetal | savory | |
2454 | 5150 | U.S.A | 2019 | Tanzania | Kokoa Kamili, batch 1 | 76 | 3.25 | 3 | have_bean | have_cocoa_butter | have_not_vanila | have_not_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | rich cocoa | fatty | bready | |
797 | A. Morin | France | 2012 | Peru | Peru | 63 | 3.75 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | fruity | melon | roasty | |
797 | A. Morin | France | 2012 | Bolivia | Bolivia | 70 | 3.50 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | vegetal | nutty | ||
1015 | A. Morin | France | 2013 | Venezuela | Chuao | 70 | 4.00 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | oily | nut | caramel | raspberry |
1019 | A. Morin | France | 2013 | Peru | Chanchamayo Province | 63 | 4.00 | 3 | have_bean | have_cocoa_butter | have_not_vanila | have_not_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | sweet | cocoa | tangerine | |
1011 | A. Morin | France | 2013 | Ecuador | Equateur | 70 | 3.75 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | sandy | nutty | cocoa | fig |
1019 | A. Morin | France | 2013 | Peru | Chanchamayo Province | 70 | 3.50 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | cocoa | sour | intense tangerine | |
1011 | A. Morin | France | 2013 | Brazil | Brazil | 70 | 3.25 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | mild tobacco | |||
1015 | A. Morin | France | 2013 | Papua new guinea | Papua New Guinea | 70 | 3.25 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | mild fruit | strong smoke | ||
1019 | A. Morin | France | 2013 | Peru | Piura | 70 | 3.25 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | green | nutty | cocoa | |
1011 | A. Morin | France | 2013 | Madagascar | Madagascar, Criollo | 70 | 3.00 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | sticky | red fruit | sour | |
1015 | A. Morin | France | 2013 | Burma | Birmanie | 70 | 3.00 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | sticky | smokey | grass | |
1011 | A. Morin | France | 2013 | Panama | Panama | 70 | 2.75 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | brief fruit note | earthy | nutty | |
1015 | A. Morin | France | 2013 | Colombia | Colombie | 70 | 2.75 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | burnt rubber | alkalyzed notes | ||
1319 | A. Morin | France | 2014 | Peru | Pablino | 70 | 4.00 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | delicate | hazelnut | brownie | |
1319 | A. Morin | France | 2014 | Venezuela | Puerto Cabello, Criollo | 70 | 3.75 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | astringent | nutty | chocolatey | |
1315 | A. Morin | France | 2014 | Cuba | Cuba | 70 | 3.50 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | sliglty dry | papaya | ||
1315 | A. Morin | France | 2014 | Venezuela | Sur del Lago, Criollo | 70 | 3.50 | 4 | have_bean | have_cocoa_butter | have_not_vanila | have_lecithin | have_not_salt | have_sugar | have_not_sweetener_without_sugar | nutty | mild choco | roasty |
總共有六個感興趣的食品添加物變數,所有自變數都處理成0/1的形式以利運算。其中1表有添加,0則表示無添加。
為避免多重共線性 (Multicollinaerity),我們利用相關係數矩陣圖來協助判斷是否有多重共線性的問題。由圖可知,糖 (Sugar) 和甜味劑 (Sweetener) 似乎有非常高的相關。事實上是很合理,甜味劑就是用來彌補不添加糖之下甜味不足的添加物,兩者是互補的性質。但是在計量經濟學的範疇,兩者僅需取其一就能得到資料所提供的資訊,否則會有模型效力不足的問題。在這報告中,我們將甜味劑移除。
變數移除之後,多重共線性問題也獲得解決,使我們能繼續進行模型配適。
target<- names[8]
variable<- names[c(7, 11:16)]
variables<- names[c(7, 11:15)]
par(mfrow = c(1, 2))
fig_1<- corrplot::corrplot.mixed(cor(data[variable]),
lower = "number", upper = "circle")
fig_2<- corrplot::corrplot.mixed(cor(data[variables]),
lower = "number", upper = "circle")
為了簡化模型配適,我們將目標變數-評分 (Rating) 劃分成四區間。整理區間詳細如下表所示:
1 | 1.5 | 1.75 | 2 | 2.25 | 2.5 | 2.75 | 3 | 3.25 | 3.5 | 3.75 | 4 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5 | 1 | 29 | 14 | 150 | 304 | 471 | 394 | 489 | 265 | 101 |
1 | 2 | 3 | 4 |
---|---|---|---|
200 | 775 | 883 | 366 |
\[logit[\hat{P}(Rating\leq 2.5)] = \hat{\alpha_1} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]
\[logit[\hat{P}(Rating\leq 3)] = \hat{\alpha_2} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]
\[logit[\hat{P}(Rating\leq 3.5)] = \hat{\alpha_3} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]
# cumulative logit model
f <- paste(target, "~", paste(variables, collapse=" + "))
cumu.logit=vglm(f,family=cumulative(parallel=TRUE), data=train)
# summary(cumu.logit, digits = 3)
Table 4. Estimated coefficients outlook
variables | Estimate | Std. Error | Z value | p-value | |
---|---|---|---|---|---|
(Intercept):1 | -4.177797 | 0.508384 | -8.218 | < 2e-16 | *** |
(Intercept):2 | -2.042951 | 0.500669 | -4.080 | 4.49e-05 | *** |
(Intercept):3 | -0.125167 | 0.500023 | -0.250 | 0.80234 | |
cocoa_percent | 0.031184 | 0.006153 | 5.068 | 4.02e-07 | *** |
cocoa_butter1 | -0.203703 | 0.073550 | -2.770 | 0.00561 | ** |
vanilla1 | 0.852514 | 0.096128 | 8.869 | < 2e-16 | *** |
lecithin1 | 0.227091 | 0.085531 | 2.655 | 0.00793 | ** |
salt1 | 0.548428 | 0.270291 | 2.029 | 0.04246 | * |
sugar1 | -0.501570 | 0.189077 | -2.653 | 0.00798 | ** |
去除掉顯著性最低的兩個自變數 (鹽(Salt)、卵磷脂(Lecithin)) 後,再配一次模型。此步驟其實是要簡化後續的模型解釋。經過配適後的估計模型如下所示:
\[logit[\hat{P}(Rating\leq 2.5)] = -3.851 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]
\[logit[\hat{P}(Rating\leq 3)] = -1.721 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]
\[logit[\hat{P}(Rating\leq 3.5)] = 0.192 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]
# cumulative logit model
variables2<- variables[-c(4,5)]
f2 <- paste(target, "~", paste(variables2, collapse=" + "))
cumu.logit2=vglm(f2, family=cumulative(parallel=TRUE), data=train)
# summary(cumu.logit2, digits = 3)
Table 5. Estimated coefficients outlook
variables | Estimate | Std. Error | Z value | p-value | |
---|---|---|---|---|---|
(Intercept):1 | -3.851430 | 0.493486 | -7.805 | 5.97e-15 | *** |
(Intercept):2 | -1.721491 | 0.486258 | -3.540 | 0.000400 | *** |
(Intercept):3 | 0.191847 | 0.485983 | 0.395 | 0.693019 | |
cocoa_percent | 0.028365 | 0.006096 | 4.653 | 3.27e-06 | *** |
cocoa_butter1 | -0.141684 | 0.070632 | -2.006 | 0.044863 | * |
vanilla1 | 0.923072 | 0.092287 | 10.002 | < 2e-16 | *** |
sugar1 | -0.621057 | 0.172767 | -3.595 | 0.000325 | *** |
org[, variables[-1]]<- lapply(org[, variables[-1]], function(x){
factor(x, levels = c("0", "1"))
}) %>% as.data.frame()
n_train<- org[train_int,]
n_test<- org[-train_int,]
rms_train<- n_train
# Had to reverse factors of response since I did not find an option to reverse hypothesis
# in the model fitting function orm()
# However, the two models from two different packages turn out to be identical
# in terms of their AICs and estimated coefficients.
rms_train$rating<- n_train$rating %>% as.factor()# %>% fct_rev()
ddist<- datadist(rms_train)
options(datadist = "ddist")
fit.blogit.rms<- rms::orm(as.formula(f2), data=rms_train)
coef<- fit.blogit.rms$coefficients
ff<- Newlabels(fit.blogit.rms, c(vanilla = "vanilla"))
fun2<- function(x) {
plogis(x- coef[1] + coef[2])
}
fun3<- function(x) {
plogis(x- coef[1] + coef[3])
}
nomo<- nomogram(ff,
fun = list("Prob Y>=2" = plogis,
"Prob Y>=3" = fun2,
"Prob Y>=4" = fun3),
lp = F,
fun.at = c(.01, .05, seq(.1, .9, by = .1), .95, .99))
plot(nomo, lmgp = .2, cex.axis = .6)