Penelitian ini bertujuan untuk menganalisis faktor-faktor yang mempengaruhi tingkat obesitas menggunakan metode analisis diskriminan dan regresi logistik ordinal. Variabel dependen yang digunakan adalah NObeyesdad yang memiliki 7 kategori berurutan.
1. Import Library
library(readxl)
library(ggplot2)
library(biotools)
## Loading required package: MASS
## ---
## biotools version 4.3
library(car)
## Loading required package: carData
library(MASS)
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
2. Load Data
data <- read.csv(file.choose())
str(data)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ FAVC : chr "no" "no" "no" "no" ...
## $ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "yes" "no" "no" ...
## $ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr "no" "yes" "no" "no" ...
## $ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
summary(data)
## Gender Age Height Weight
## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :22.78 Median :1.700 Median : 83.00
## Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC FCVC
## Length:2111 Length:2111 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.386
## Mean :2.419
## 3rd Qu.:3.000
## Max. :3.000
## NCP CAEC SMOKE CH2O
## Min. :1.000 Length:2111 Length:2111 Min. :1.000
## 1st Qu.:2.659 Class :character Class :character 1st Qu.:1.585
## Median :3.000 Mode :character Mode :character Median :2.000
## Mean :2.686 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:2.477
## Max. :4.000 Max. :3.000
## SCC FAF TUE CALC
## Length:2111 Min. :0.0000 Min. :0.0000 Length:2111
## Class :character 1st Qu.:0.1245 1st Qu.:0.0000 Class :character
## Mode :character Median :1.0000 Median :0.6253 Mode :character
## Mean :1.0103 Mean :0.6579
## 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000
## MTRANS NObeyesdad
## Length:2111 Length:2111
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Dataset terdiri dari 2111 observasi dan 17 variabel. Variabel utama yang digunakan adalah Age, Height, Weight, FCVC, dan FAF.
3. Data Cleaning
sum(is.na(data))
## [1] 0
colSums(is.na(data))
## Gender Age
## 0 0
## Height Weight
## 0 0
## family_history_with_overweight FAVC
## 0 0
## FCVC NCP
## 0 0
## CAEC SMOKE
## 0 0
## CH2O SCC
## 0 0
## FAF TUE
## 0 0
## CALC MTRANS
## 0 0
## NObeyesdad
## 0
Tidak ditemukan missing value pada dataset sehingga seluruh data dapat digunakan untuk analisis.
4. Transformasi Variabel
data$NObeyesdad <- factor(data$NObeyesdad,
levels = c("Insufficient_Weight",
"Normal_Weight",
"Overweight_Level_I",
"Overweight_Level_II",
"Obesity_Type_I",
"Obesity_Type_II",
"Obesity_Type_III"),
ordered = TRUE)
table(data$NObeyesdad)
##
## Insufficient_Weight Normal_Weight Overweight_Level_I Overweight_Level_II
## 272 287 290 290
## Obesity_Type_I Obesity_Type_II Obesity_Type_III
## 351 297 324
Variabel NObeyesdad diubah menjadi faktor ordinal agar sesuai dengan metode regresi logistik ordinal.
5. Visualisasi Data
ggplot(data, aes(x=NObeyesdad)) +
geom_bar(fill="skyblue") +
theme_minimal()
Distribusi kategori obesitas relatif merata, dengan jumlah terbesar pada
Obesity Type I.
ggplot(data, aes(x=NObeyesdad, y=Weight)) +
geom_boxplot(fill="pink") +
theme_minimal()
Semakin tinggi kategori obesitas, maka berat badan cenderung
meningkat.
ggplot(data, aes(x=NObeyesdad, y=FAF)) +
geom_boxplot(fill="yellow") +
theme_minimal()
Kategori obesitas tinggi cenderung memiliki aktivitas fisik yang lebih
rendah.
ggplot(data, aes(x=Weight, y=FAF, color=NObeyesdad)) +
geom_point() +
theme_minimal()
Terdapat kecenderungan hubungan negatif antara berat badan dan aktivitas
fisik.
6. Uji Asumsi Normalitas
shapiro.test(data$Age)
##
## Shapiro-Wilk normality test
##
## data: data$Age
## W = 0.86606, p-value < 2.2e-16
shapiro.test(data$Height)
##
## Shapiro-Wilk normality test
##
## data: data$Height
## W = 0.99323, p-value = 2.772e-08
shapiro.test(data$Weight)
##
## Shapiro-Wilk normality test
##
## data: data$Weight
## W = 0.9765, p-value < 2.2e-16
Semua variabel memiliki p-value < 0.05 sehingga tidak berdistribusi normal
Homogenitas Kovarians
data_num <- data[, c("Age","Height","Weight")]
boxM(data_num, data$NObeyesdad)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: data_num
## Chi-Sq (approx.) = 3273.2, df = 36, p-value < 2.2e-16
Hasil menunjukkan kovarians tidak homogen.
model_lm <- lm(Age ~ Height + Weight + FCVC + FAF, data=data)
vif(model_lm)
## Height Weight FCVC FAF
## 1.513475 1.451400 1.086684 1.162765
Nilai VIF < 10 menunjukkan tidak terdapat multikolinearitas.
7. Analisis Diskriminan (LDA)
lda_model <- lda(NObeyesdad ~ Age + Height + Weight + FCVC + FAF, data=data)
pred <- predict(lda_model)
cm <- table(pred$class, data$NObeyesdad)
cm
##
## Insufficient_Weight Normal_Weight Overweight_Level_I
## Insufficient_Weight 246 30 0
## Normal_Weight 26 201 4
## Overweight_Level_I 0 55 247
## Overweight_Level_II 0 1 39
## Obesity_Type_I 0 0 0
## Obesity_Type_II 0 0 0
## Obesity_Type_III 0 0 0
##
## Overweight_Level_II Obesity_Type_I Obesity_Type_II
## Insufficient_Weight 0 0 0
## Normal_Weight 0 0 0
## Overweight_Level_I 71 0 0
## Overweight_Level_II 218 8 0
## Obesity_Type_I 1 332 5
## Obesity_Type_II 0 11 283
## Obesity_Type_III 0 0 9
##
## Obesity_Type_III
## Insufficient_Weight 0
## Normal_Weight 0
## Overweight_Level_I 0
## Overweight_Level_II 0
## Obesity_Type_I 0
## Obesity_Type_II 23
## Obesity_Type_III 301
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
## [1] 0.8659403
aper <- 1 - accuracy
aper
## [1] 0.1340597
Model memiliki akurasi sebesar 86.59% dengan tingkat kesalahan (APER) sebesar 13.41%. Hal ini menunjukkan bahwa model mampu mengklasifikasikan data dengan baik.
8. Regresi Logistik Ordinal Standarisasi
data$Weight_s <- scale(data$Weight)
data$FAF_s <- scale(data$FAF)
Standarisasi dilakukan agar variabel memiliki skala yang sama.
Model
model_ord <- polr(NObeyesdad ~ Weight_s + FAF_s, data=data, Hess=TRUE)
summary(model_ord)
## Call:
## polr(formula = NObeyesdad ~ Weight_s + FAF_s, data = data, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## Weight_s 4.8374 0.11982 40.37
## FAF_s -0.7101 0.04678 -15.18
##
## Intercepts:
## Value Std. Error t value
## Insufficient_Weight|Normal_Weight -6.2866 0.1680 -37.4206
## Normal_Weight|Overweight_Level_I -3.6912 0.1162 -31.7698
## Overweight_Level_I|Overweight_Level_II -1.6596 0.0873 -19.0144
## Overweight_Level_II|Obesity_Type_I 0.2285 0.0819 2.7898
## Obesity_Type_I|Obesity_Type_II 3.4620 0.1324 26.1576
## Obesity_Type_II|Obesity_Type_III 5.9875 0.1601 37.4055
##
## Residual Deviance: 4238.825
## AIC: 4254.825
Weight berpengaruh positif terhadap obesitas, sedangkan FAF berpengaruh negatif.
Uji Signifikansi Model
model_null <- polr(NObeyesdad ~ 1, data=data, Hess=TRUE)
anova(model_null, model_ord)
## Likelihood ratio tests of ordinal regression models
##
## Response: NObeyesdad
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 1 2105 8201.628
## 2 Weight_s + FAF_s 2103 4238.825 1 vs 2 2 3962.803 0
Null Model = 8201.628 Model = 4238.825 p-value < 0.05 Model signifikan secara keseluruhan.
Koefisien dan p-value
ctable <- coef(summary(model_ord))
p_value <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
cbind(ctable, p_value)
## Value Std. Error t value
## Weight_s 4.8373749 0.11981644 40.373216
## FAF_s -0.7100537 0.04677521 -15.180128
## Insufficient_Weight|Normal_Weight -6.2866142 0.16799876 -37.420598
## Normal_Weight|Overweight_Level_I -3.6911731 0.11618512 -31.769759
## Overweight_Level_I|Overweight_Level_II -1.6595635 0.08727945 -19.014368
## Overweight_Level_II|Obesity_Type_I 0.2285153 0.08191073 2.789809
## Obesity_Type_I|Obesity_Type_II 3.4620449 0.13235324 26.157615
## Obesity_Type_II|Obesity_Type_III 5.9875334 0.16007103 37.405479
## p_value
## Weight_s 0.000000e+00
## FAF_s 4.788494e-52
## Insufficient_Weight|Normal_Weight 1.807143e-306
## Normal_Weight|Overweight_Level_I 1.694232e-221
## Overweight_Level_I|Overweight_Level_II 1.296901e-80
## Overweight_Level_II|Obesity_Type_I 5.273911e-03
## Obesity_Type_I|Obesity_Type_II 8.073266e-151
## Obesity_Type_II|Obesity_Type_III 3.182916e-306
Semua variabel signifikan karena p-value < 0.05.
Odds Ratio
exp(coef(model_ord))
## Weight_s FAF_s
## 126.1377876 0.4916178
Weight meningkatkan peluang obesitas, sedangkan FAF menurunkan peluang.
Goodness of Fit
pR2(model_ord)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -2119.4124283 -4100.8140859 3962.8033153 0.4831728 0.8469845
## r2CU
## 0.8647499
Nilai Nagelkerke R² tinggi menunjukkan model sangat baik.
Cutpoint
model_ord$zeta
## Insufficient_Weight|Normal_Weight Normal_Weight|Overweight_Level_I
## -6.2866142 -3.6911731
## Overweight_Level_I|Overweight_Level_II Overweight_Level_II|Obesity_Type_I
## -1.6595635 0.2285153
## Obesity_Type_I|Obesity_Type_II Obesity_Type_II|Obesity_Type_III
## 3.4620449 5.9875334
Cutpoint menunjukkan batas antar kategori obesitas dan tersusun secara berurutan.
VGAM
model_vglm <- vglm(NObeyesdad ~ Weight_s + FAF_s,
family = cumulative(parallel=TRUE),
data=data)
summary(model_vglm)
##
## Call:
## vglm(formula = NObeyesdad ~ Weight_s + FAF_s, family = cumulative(parallel = TRUE),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -6.28661 0.16699 -37.646 < 2e-16 ***
## (Intercept):2 -3.69117 0.11693 -31.566 < 2e-16 ***
## (Intercept):3 -1.65956 0.08739 -18.991 < 2e-16 ***
## (Intercept):4 0.22852 0.08289 2.757 0.00584 **
## (Intercept):5 3.46203 0.12742 27.171 < 2e-16 ***
## (Intercept):6 5.98753 0.16089 37.214 < 2e-16 ***
## Weight_s -4.83737 0.11944 -40.500 < 2e-16 ***
## FAF_s 0.71006 0.04679 15.175 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 6
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3]), logitlink(P[Y<=4]), logitlink(P[Y<=5]), logitlink(P[Y<=6])
##
## Residual deviance: 4238.825 on 12658 degrees of freedom
##
## Log-likelihood: -2119.412 on 12658 degrees of freedom
##
## Number of Fisher scoring iterations: 9
##
##
## Exponentiated coefficients:
## Weight_s FAF_s
## 0.007927855 2.034106763
Model VGAM menghasilkan hasil yang konsisten dengan model sebelumnya.
Kesimpulan: Berdasarkan hasil analisis, model analisis diskriminan (LDA) mampu mengklasifikasikan data dengan baik dengan akurasi sebesar 86,59%. Hasil regresi logistik ordinal menunjukkan bahwa berat badan berpengaruh positif terhadap tingkat obesitas, sedangkan aktivitas fisik berpengaruh negatif. Artinya, semakin tinggi berat badan maka peluang obesitas meningkat, sementara aktivitas fisik yang tinggi dapat menurunkan peluang tersebut. Secara keseluruhan, model yang digunakan sudah cukup baik dalam menjelaskan tingkat obesitas.