0. 前言

本次報告的動機是想要深入了解多類別順序羅吉斯迴歸模型 (Cumulative Logit Model、Proportional Odds Model、Ordinal Logistic Regression Model皆是指該模型) 的理論解釋及適用情形。
因其特殊的模型架構，適合的反應變數為多類別且具有順序意義的離散形式。本次使用的資料為在開放數據平台Kaggle上找到的巧克力棒零食評分資料集。
簡而言之，我們希望透過模型的配適及解釋來了解到相關食品添加物的有無 (可可脂、鹽巴、糖等常見於巧克力類零食的添加物) 對消費者偏好評分 (rating) 的影響。
本次報告除了有標準的模型配適流程及迴歸報表的解釋，也有使用列線圖 (Nomogram) 來協助呈現變數影響關係。

1. 假說

本次欲探討的主題是建立在不同元素的添加，是如何影響消費者對於巧克力類零食偏好的程度。使用了常見於巧克力零食類產品的可可濃度、可可脂添加有無、香草精添加有無、卵磷脂添加有無、鹽添加有無、糖添加有無、甜味劑添加有無等變數。希望能對於掌握消費者偏好提出貢獻，更進一步促進廠商產品研發的精確性。

其中，依照主流大眾對於零食的偏好，各元素變數對於評分變數的影響假設如下：

可可濃度應有反向相關 (-)
可可脂應有正向相關 (+)
香草精應有正向相關 (+)
卵磷脂方向不確定 (?)
鹽應有正向相關 (?)
糖應有正向相關 (+)
甜味劑應有正向相關 (+)

2. 模型介紹

“Cumulative logit model, also known as proportional odds model, is the go-to option when there is a natural ordering in the dependent variable, which is an ordinal response with more-than-one categories.”

\[P(Y\leq j) = \pi_1 + \pi_2 + ... + \pi_j \ ,\ j = 1,2,...,J\]

\[logit[P(Y\leq j)] = \log[{P(Y\leq j) \over 1-P(Y\leq j)}]\ ,\ j = 1,2,...,J-1\]

\[logit[P(Y\leq j)] = \alpha_j + \beta x\]

3. 資料

長相

Table 1. Data outlook
ref	company	company_location	review_date	country_of_bean_origin	specific_bean_origin_or_bar_name	cocoa_percent	rating	counts_of_ingredients	beans	cocoa_butter	vanilla	lecithin	salt	sugar	sweetener	first_taste	second_taste	third_taste	fourth_taste
2454	5150	U.S.A	2019	Madagascar	Bejofo Estate, batch 1	76	3.75	3	have_bean	have_cocoa_butter	have_not_vanila	have_not_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	cocoa	blackberry	full body
2458	5150	U.S.A	2019	Dominican republic	Zorzal, batch 1	76	3.50	3	have_bean	have_cocoa_butter	have_not_vanila	have_not_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	cocoa	vegetal	savory
2454	5150	U.S.A	2019	Tanzania	Kokoa Kamili, batch 1	76	3.25	3	have_bean	have_cocoa_butter	have_not_vanila	have_not_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	rich cocoa	fatty	bready
797	A. Morin	France	2012	Peru	Peru	63	3.75	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	fruity	melon	roasty
797	A. Morin	France	2012	Bolivia	Bolivia	70	3.50	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	vegetal	nutty
1015	A. Morin	France	2013	Venezuela	Chuao	70	4.00	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	oily	nut	caramel	raspberry
1019	A. Morin	France	2013	Peru	Chanchamayo Province	63	4.00	3	have_bean	have_cocoa_butter	have_not_vanila	have_not_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	sweet	cocoa	tangerine
1011	A. Morin	France	2013	Ecuador	Equateur	70	3.75	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	sandy	nutty	cocoa	fig
1019	A. Morin	France	2013	Peru	Chanchamayo Province	70	3.50	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	cocoa	sour	intense tangerine
1011	A. Morin	France	2013	Brazil	Brazil	70	3.25	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	mild tobacco
1015	A. Morin	France	2013	Papua new guinea	Papua New Guinea	70	3.25	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	mild fruit	strong smoke
1019	A. Morin	France	2013	Peru	Piura	70	3.25	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	green	nutty	cocoa
1011	A. Morin	France	2013	Madagascar	Madagascar, Criollo	70	3.00	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	sticky	red fruit	sour
1015	A. Morin	France	2013	Burma	Birmanie	70	3.00	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	sticky	smokey	grass
1011	A. Morin	France	2013	Panama	Panama	70	2.75	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	brief fruit note	earthy	nutty
1015	A. Morin	France	2013	Colombia	Colombie	70	2.75	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	burnt rubber	alkalyzed notes
1319	A. Morin	France	2014	Peru	Pablino	70	4.00	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	delicate	hazelnut	brownie
1319	A. Morin	France	2014	Venezuela	Puerto Cabello, Criollo	70	3.75	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	astringent	nutty	chocolatey
1315	A. Morin	France	2014	Cuba	Cuba	70	3.50	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	sliglty dry	papaya
1315	A. Morin	France	2014	Venezuela	Sur del Lago, Criollo	70	3.50	4	have_bean	have_cocoa_butter	have_not_vanila	have_lecithin	have_not_salt	have_sugar	have_not_sweetener_without_sugar	nutty	mild choco	roasty

視覺化

總共有六個感興趣的食品添加物變數，所有自變數都處理成0/1的形式以利運算。其中1表有添加，0則表示無添加。

變數相關性

為避免多重共線性 (Multicollinaerity)，我們利用相關係數矩陣圖來協助判斷是否有多重共線性的問題。由圖可知，糖 (Sugar) 和甜味劑 (Sweetener) 似乎有非常高的相關。事實上是很合理，甜味劑就是用來彌補不添加糖之下甜味不足的添加物，兩者是互補的性質。但是在計量經濟學的範疇，兩者僅需取其一就能得到資料所提供的資訊，否則會有模型效力不足的問題。在這報告中，我們將甜味劑移除。

變數移除之後，多重共線性問題也獲得解決，使我們能繼續進行模型配適。

target<- names[8]
variable<- names[c(7, 11:16)]
variables<- names[c(7, 11:15)]

par(mfrow = c(1, 2))
fig_1<- corrplot::corrplot.mixed(cor(data[variable]),
                         lower = "number", upper = "circle")
fig_2<- corrplot::corrplot.mixed(cor(data[variables]),
                         lower = "number", upper = "circle")

4. 模型配適

目標變數分組

為了簡化模型配適，我們將目標變數-評分 (Rating) 劃分成四區間。整理區間詳細如下表所示：

Table 2. Original rating outlook
rating
Group 1						Group 2		Group 3		Group 4
1	1.5	1.75	2	2.25	2.5	2.75	3	3.25	3.5	3.75	4
1	5	1	29	14	150	304	471	394	489	265	101

Table 3. Reorganized rating outlook
rating
1	2	3	4
200	775	883	366

Initial fit

\[logit[\hat{P}(Rating\leq 2.5)] = \hat{\alpha_1} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]

\[logit[\hat{P}(Rating\leq 3)] = \hat{\alpha_2} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]

\[logit[\hat{P}(Rating\leq 3.5)] = \hat{\alpha_3} + \hat{\beta_1} \times cocoa\ percent + \hat{\beta_2} \times cocoa\ butter + \hat{\beta_3} \times vanilla + \hat{\beta_4} \times lecithin + \hat{\beta_5} \times salt + \hat{\beta_6} \times sugar\]

# cumulative logit model
f <- paste(target, "~", paste(variables, collapse=" + "))
cumu.logit=vglm(f,family=cumulative(parallel=TRUE), data=train)
# summary(cumu.logit, digits = 3)

Table 4. Estimated coefficients outlook

variables	Estimate	Std. Error	Z value	p-value
(Intercept):1	-4.177797	0.508384	-8.218	< 2e-16	***
(Intercept):2	-2.042951	0.500669	-4.080	4.49e-05	***
(Intercept):3	-0.125167	0.500023	-0.250	0.80234
cocoa_percent	0.031184	0.006153	5.068	4.02e-07	***
cocoa_butter1	-0.203703	0.073550	-2.770	0.00561	**
vanilla1	0.852514	0.096128	8.869	< 2e-16	***
lecithin1	0.227091	0.085531	2.655	0.00793	**
salt1	0.548428	0.270291	2.029	0.04246	*
sugar1	-0.501570	0.189077	-2.653	0.00798	**

Second-Round fit

去除掉顯著性最低的兩個自變數 (鹽(Salt)、卵磷脂(Lecithin)) 後，再配一次模型。此步驟其實是要簡化後續的模型解釋。經過配適後的估計模型如下所示：

\[logit[\hat{P}(Rating\leq 2.5)] = -3.851 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]

\[logit[\hat{P}(Rating\leq 3)] = -1.721 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]

\[logit[\hat{P}(Rating\leq 3.5)] = 0.192 + 0.0284 \times cocoa\ percent - 0.142 \times cocoa\ butter + 0.923 \times vanilla - 0.621 \times sugar\]

# cumulative logit model
variables2<- variables[-c(4,5)]
f2 <- paste(target, "~", paste(variables2, collapse=" + "))
cumu.logit2=vglm(f2, family=cumulative(parallel=TRUE), data=train)
# summary(cumu.logit2, digits = 3)

Table 5. Estimated coefficients outlook

variables	Estimate	Std. Error	Z value	p-value
(Intercept):1	-3.851430	0.493486	-7.805	5.97e-15	***
(Intercept):2	-1.721491	0.486258	-3.540	0.000400	***
(Intercept):3	0.191847	0.485983	0.395	0.693019
cocoa_percent	0.028365	0.006096	4.653	3.27e-06	***
cocoa_butter1	-0.141684	0.070632	-2.006	0.044863	*
vanilla1	0.923072	0.092287	10.002	< 2e-16	***
sugar1	-0.621057	0.172767	-3.595	0.000325	***

5. 模型結果

列線圖 (Nomogram)

org[, variables[-1]]<- lapply(org[, variables[-1]], function(x){
  factor(x, levels = c("0", "1"))
}) %>% as.data.frame()
n_train<- org[train_int,]
n_test<- org[-train_int,]
rms_train<- n_train
# Had to reverse factors of response since I did not find an option to reverse hypothesis
# in the model fitting function orm()
# However, the two models from two different packages turn out to be identical
# in terms of their AICs and estimated coefficients.
rms_train$rating<- n_train$rating %>% as.factor()# %>% fct_rev()

ddist<- datadist(rms_train)
options(datadist = "ddist")
fit.blogit.rms<- rms::orm(as.formula(f2), data=rms_train)
coef<- fit.blogit.rms$coefficients
ff<- Newlabels(fit.blogit.rms, c(vanilla = "vanilla"))
fun2<- function(x) {
  plogis(x- coef[1] + coef[2])
}
fun3<- function(x) {
  plogis(x- coef[1] + coef[3])
}
nomo<- nomogram(ff,
                fun = list("Prob Y>=2" = plogis,
                           "Prob Y>=3" = fun2,
                           "Prob Y>=4" = fun3),
                           lp = F,
                           fun.at = c(.01, .05, seq(.1, .9, by = .1), .95, .99))
plot(nomo, lmgp = .2, cex.axis = .6)

6. 資料來源

Cumulative Logit Model

財務金融學系碩一張碩文

2023/06/07

0. 前言

1. 假說

2. 模型介紹

3. 資料

長相

視覺化

變數相關性

4. 模型配適

目標變數分組

Initial fit

Second-Round fit

5. 模型結果

列線圖 (Nomogram)

6. 資料來源

Cumulative Logit Model

財務金融學系碩一 張碩文

2023/06/07

0. 前言

1. 假說

2. 模型介紹

3. 資料

長相

視覺化

變數相關性

4. 模型配適

目標變數分組

Initial fit

Second-Round fit

5. 模型結果

列線圖 (Nomogram)

6. 資料來源

財務金融學系碩一張碩文