0. 前言

  • 本次報告的動機是想要深入了解多類別順序羅吉斯迴歸模型 (Cumulative Logit Model、Proportional Odds Model、Ordinal Logistic Regression Model皆是指該模型) 的理論解釋及適用情形。
  • 因其特殊的模型架構,適合的反應變數為多類別且具有順序意義的離散形式。本次使用的資料為在開放數據平台Kaggle上找到的巧克力棒零食評分資料集。
  • 簡而言之,我們希望透過模型的配適及解釋來了解到相關食品添加物的有無 (可可脂、鹽巴、糖等常見於巧克力類零食的添加物) 對消費者偏好評分 (rating) 的影響。
  • 本次報告除了有標準的模型配適流程及迴歸報表的解釋,也有使用列線圖 (Nomogram) 來協助呈現變數影響關係。

1. 假說

本次欲探討的主題是建立在不同元素的添加,是如何影響消費者對於巧克力類零食偏好的程度。使用了常見於巧克力零食類產品的可可濃度、可可脂添加有無、香草精添加有無、卵磷脂添加有無、鹽添加有無、糖添加有無、甜味劑添加有無等變數。希望能對於掌握消費者偏好提出貢獻,更進一步促進廠商產品研發的精確性。

其中,依照主流大眾對於零食的偏好,各元素變數對於評分變數的影響假設如下:

  • 可可濃度應有反向相關 (-)
  • 可可脂應有正向相關 (+)
  • 香草精應有正向相關 (+)
  • 卵磷脂方向不確定 (?)
  • 鹽應有正向相關 (?)
  • 糖應有正向相關 (+)
  • 甜味劑應有正向相關 (+)

2. 模型介紹

“Cumulative logit model, also known as proportional odds model, is the go-to option when there is a natural ordering in the dependent variable, which is an ordinal response with more-than-one categories.”

\[P(Y\leq j) = \pi_1 + \pi_2 + ... + \pi_j \ ,\ j = 1,2,...,J\]

\[logit[P(Y\leq j)] = \log[{P(Y\leq j) \over 1-P(Y\leq j)}]\ ,\ j = 1,2,...,J-1\]

\[logit[P(Y\leq j)] = \alpha_j + \beta x\]

3. 資料

長相

Table 1. Data outlook
ref company company_location review_date country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent rating counts_of_ingredients beans cocoa_butter vanilla lecithin salt sugar sweetener first_taste second_taste third_taste fourth_taste
2454 5150 U.S.A 2019 Madagascar Bejofo Estate, batch 1 76 3.75 3 have_bean have_cocoa_butter have_not_vanila have_not_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar cocoa blackberry full body
2458 5150 U.S.A 2019 Dominican republic Zorzal, batch 1 76 3.50 3 have_bean have_cocoa_butter have_not_vanila have_not_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar cocoa vegetal savory
2454 5150 U.S.A 2019 Tanzania Kokoa Kamili, batch 1 76 3.25 3 have_bean have_cocoa_butter have_not_vanila have_not_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar rich cocoa fatty bready
797 A. Morin France 2012 Peru Peru 63 3.75 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar fruity melon roasty
797 A. Morin France 2012 Bolivia Bolivia 70 3.50 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar vegetal nutty
1015 A. Morin France 2013 Venezuela Chuao 70 4.00 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar oily nut caramel raspberry
1019 A. Morin France 2013 Peru Chanchamayo Province 63 4.00 3 have_bean have_cocoa_butter have_not_vanila have_not_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar sweet cocoa tangerine
1011 A. Morin France 2013 Ecuador Equateur 70 3.75 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar sandy nutty cocoa fig
1019 A. Morin France 2013 Peru Chanchamayo Province 70 3.50 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar cocoa sour intense tangerine
1011 A. Morin France 2013 Brazil Brazil 70 3.25 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar mild tobacco
1015 A. Morin France 2013 Papua new guinea Papua New Guinea 70 3.25 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar mild fruit strong smoke
1019 A. Morin France 2013 Peru Piura 70 3.25 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar green nutty cocoa
1011 A. Morin France 2013 Madagascar Madagascar, Criollo 70 3.00 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar sticky red fruit sour
1015 A. Morin France 2013 Burma Birmanie 70 3.00 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar sticky smokey grass
1011 A. Morin France 2013 Panama Panama 70 2.75 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar brief fruit note earthy nutty
1015 A. Morin France 2013 Colombia Colombie 70 2.75 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar burnt rubber alkalyzed notes
1319 A. Morin France 2014 Peru Pablino 70 4.00 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar delicate hazelnut brownie
1319 A. Morin France 2014 Venezuela Puerto Cabello, Criollo 70 3.75 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar astringent nutty chocolatey
1315 A. Morin France 2014 Cuba Cuba 70 3.50 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar sliglty dry papaya
1315 A. Morin France 2014 Venezuela Sur del Lago, Criollo 70 3.50 4 have_bean have_cocoa_butter have_not_vanila have_lecithin have_not_salt have_sugar have_not_sweetener_without_sugar nutty mild choco roasty

視覺化

總共有六個感興趣的食品添加物變數,所有自變數都處理成0/1的形式以利運算。其中1表有添加,0則表示無添加。

變數相關性

為避免多重共線性 (Multicollinaerity),我們利用相關係數矩陣圖來協助判斷是否有多重共線性的問題。由圖可知,糖 (Sugar) 和甜味劑 (Sweetener) 似乎有非常高的相關。事實上是很合理,甜味劑就是用來彌補不添加糖之下甜味不足的添加物,兩者是互補的性質。但是在計量經濟學的範疇,兩者僅需取其一就能得到資料所提供的資訊,否則會有模型效力不足的問題。在這報告中,我們將甜味劑移除。

變數移除之後,多重共線性問題也獲得解決,使我們能繼續進行模型配適。

target<- names[8]
variable<- names[c(7, 11:16)]
variables<- names[c(7, 11:15)]

par(mfrow = c(1, 2))
fig_1<- corrplot::corrplot.mixed(cor(data[variable]),
                         lower = "number", upper = "circle")
fig_2<- corrplot::corrplot.mixed(cor(data[variables]),
                         lower = "number", upper = "circle")


4. 模型配適

目標變數分組

為了簡化模型配適,我們將目標變數-評分 (Rating) 劃分成四區間。整理區間詳細如下表所示:

Table 2. Original rating outlook
rating
Group 1
Group 2
Group 3
Group 4
1 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4
1 5 1 29 14 150 304 471 394 489 265 101
Table 3. Reorganized rating outlook
rating
1 2 3 4
200 775 883 366

Initial fit

\[logit[\hat{P}(Rating\leq 2.5)] = \hat{\alpha_1} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]

\[logit[\hat{P}(Rating\leq 3)] = \hat{\alpha_2} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]

\[logit[\hat{P}(Rating\leq 3.5)] = \hat{\alpha_3} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]

# cumulative logit model
f <- paste(target, "~", paste(variables, collapse=" + "))
cumu.logit=vglm(f,family=cumulative(parallel=TRUE), data=train)
# summary(cumu.logit, digits = 3)

Table 4. Estimated coefficients outlook

variables Estimate Std. Error Z value p-value
(Intercept):1 -4.177797 0.508384 -8.218 < 2e-16 ***
(Intercept):2 -2.042951 0.500669 -4.080 4.49e-05 ***
(Intercept):3 -0.125167 0.500023 -0.250 0.80234
cocoa_percent 0.031184 0.006153 5.068 4.02e-07 ***
cocoa_butter1 -0.203703 0.073550 -2.770 0.00561 **
vanilla1 0.852514 0.096128 8.869 < 2e-16 ***
lecithin1 0.227091 0.085531 2.655 0.00793 **
salt1 0.548428 0.270291 2.029 0.04246 *
sugar1 -0.501570 0.189077 -2.653 0.00798 **

Second-Round fit

去除掉顯著性最低的兩個自變數 (鹽(Salt)、卵磷脂(Lecithin)) 後,再配一次模型。此步驟其實是要簡化後續的模型解釋。經過配適後的估計模型如下所示:

\[logit[\hat{P}(Rating\leq 2.5)] = -3.851 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]

\[logit[\hat{P}(Rating\leq 3)] = -1.721 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]

\[logit[\hat{P}(Rating\leq 3.5)] = 0.192 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]

# cumulative logit model
variables2<- variables[-c(4,5)]
f2 <- paste(target, "~", paste(variables2, collapse=" + "))
cumu.logit2=vglm(f2, family=cumulative(parallel=TRUE), data=train)
# summary(cumu.logit2, digits = 3)

Table 5. Estimated coefficients outlook

variables Estimate Std. Error Z value p-value
(Intercept):1 -3.851430 0.493486 -7.805 5.97e-15 ***
(Intercept):2 -1.721491 0.486258 -3.540 0.000400 ***
(Intercept):3 0.191847 0.485983 0.395 0.693019
cocoa_percent 0.028365 0.006096 4.653 3.27e-06 ***
cocoa_butter1 -0.141684 0.070632 -2.006 0.044863 *
vanilla1 0.923072 0.092287 10.002 < 2e-16 ***
sugar1 -0.621057 0.172767 -3.595 0.000325 ***

5. 模型結果

列線圖 (Nomogram)

org[, variables[-1]]<- lapply(org[, variables[-1]], function(x){
  factor(x, levels = c("0", "1"))
}) %>% as.data.frame()
n_train<- org[train_int,]
n_test<- org[-train_int,]
rms_train<- n_train
# Had to reverse factors of response since I did not find an option to reverse hypothesis
# in the model fitting function orm()
# However, the two models from two different packages turn out to be identical
# in terms of their AICs and estimated coefficients.
rms_train$rating<- n_train$rating %>% as.factor()# %>% fct_rev()

ddist<- datadist(rms_train)
options(datadist = "ddist")
fit.blogit.rms<- rms::orm(as.formula(f2), data=rms_train)
coef<- fit.blogit.rms$coefficients
ff<- Newlabels(fit.blogit.rms, c(vanilla = "vanilla"))
fun2<- function(x) {
  plogis(x- coef[1] + coef[2])
}
fun3<- function(x) {
  plogis(x- coef[1] + coef[3])
}
nomo<- nomogram(ff,
                fun = list("Prob Y>=2" = plogis,
                           "Prob Y>=3" = fun2,
                           "Prob Y>=4" = fun3),
                           lp = F,
                           fun.at = c(.01, .05, seq(.1, .9, by = .1), .95, .99))
plot(nomo, lmgp = .2, cex.axis = .6)