Penelitian ini bertujuan untuk menganalisis faktor-faktor yang mempengaruhi tingkat obesitas menggunakan metode analisis diskriminan dan regresi logistik ordinal. Variabel dependen yang digunakan adalah NObeyesdad yang memiliki 7 kategori berurutan.

1. Import Library

library(readxl)
library(ggplot2)
library(biotools)
## Loading required package: MASS
## ---
## biotools version 4.3
library(car)
## Loading required package: carData
library(MASS)
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(VGAM)
## Loading required package: stats4
## Loading required package: splines

2. Load Data

data <- read.csv(file.choose())

str(data)
## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr  "no" "no" "no" "no" ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "yes" "no" "no" ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr  "no" "yes" "no" "no" ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
summary(data)
##     Gender               Age            Height          Weight      
##  Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
##  Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
##                     Mean   :24.31   Mean   :1.702   Mean   : 86.59  
##                     3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
##                     Max.   :61.00   Max.   :1.980   Max.   :173.00  
##  family_history_with_overweight     FAVC                FCVC      
##  Length:2111                    Length:2111        Min.   :1.000  
##  Class :character               Class :character   1st Qu.:2.000  
##  Mode  :character               Mode  :character   Median :2.386  
##                                                    Mean   :2.419  
##                                                    3rd Qu.:3.000  
##                                                    Max.   :3.000  
##       NCP            CAEC              SMOKE                CH2O      
##  Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
##  1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
##  Median :3.000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :2.686                                         Mean   :2.008  
##  3rd Qu.:3.000                                         3rd Qu.:2.477  
##  Max.   :4.000                                         Max.   :3.000  
##      SCC                 FAF              TUE             CALC          
##  Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
##  Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
##  Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
##                     Mean   :1.0103   Mean   :0.6579                     
##                     3rd Qu.:1.6667   3rd Qu.:1.0000                     
##                     Max.   :3.0000   Max.   :2.0000                     
##     MTRANS           NObeyesdad       
##  Length:2111        Length:2111       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Dataset terdiri dari 2111 observasi dan 17 variabel. Variabel utama yang digunakan adalah Age, Height, Weight, FCVC, dan FAF.

3. Data Cleaning

sum(is.na(data))
## [1] 0
colSums(is.na(data))
##                         Gender                            Age 
##                              0                              0 
##                         Height                         Weight 
##                              0                              0 
## family_history_with_overweight                           FAVC 
##                              0                              0 
##                           FCVC                            NCP 
##                              0                              0 
##                           CAEC                          SMOKE 
##                              0                              0 
##                           CH2O                            SCC 
##                              0                              0 
##                            FAF                            TUE 
##                              0                              0 
##                           CALC                         MTRANS 
##                              0                              0 
##                     NObeyesdad 
##                              0

Tidak ditemukan missing value pada dataset sehingga seluruh data dapat digunakan untuk analisis.

4. Transformasi Variabel

data$NObeyesdad <- factor(data$NObeyesdad,
                          levels = c("Insufficient_Weight",
                                     "Normal_Weight",
                                     "Overweight_Level_I",
                                     "Overweight_Level_II",
                                     "Obesity_Type_I",
                                     "Obesity_Type_II",
                                     "Obesity_Type_III"),
                          ordered = TRUE)

table(data$NObeyesdad)
## 
## Insufficient_Weight       Normal_Weight  Overweight_Level_I Overweight_Level_II 
##                 272                 287                 290                 290 
##      Obesity_Type_I     Obesity_Type_II    Obesity_Type_III 
##                 351                 297                 324

Variabel NObeyesdad diubah menjadi faktor ordinal agar sesuai dengan metode regresi logistik ordinal.

5. Visualisasi Data

ggplot(data, aes(x=NObeyesdad)) +
  geom_bar(fill="skyblue") +
  theme_minimal()

Distribusi kategori obesitas relatif merata, dengan jumlah terbesar pada Obesity Type I.

ggplot(data, aes(x=NObeyesdad, y=Weight)) +
  geom_boxplot(fill="pink") +
  theme_minimal()

Semakin tinggi kategori obesitas, maka berat badan cenderung meningkat.

ggplot(data, aes(x=NObeyesdad, y=FAF)) +
  geom_boxplot(fill="yellow") +
  theme_minimal()

Kategori obesitas tinggi cenderung memiliki aktivitas fisik yang lebih rendah.

ggplot(data, aes(x=Weight, y=FAF, color=NObeyesdad)) +
  geom_point() +
  theme_minimal()

Terdapat kecenderungan hubungan negatif antara berat badan dan aktivitas fisik.

6. Uji Asumsi Normalitas

shapiro.test(data$Age)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$Age
## W = 0.86606, p-value < 2.2e-16
shapiro.test(data$Height)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$Height
## W = 0.99323, p-value = 2.772e-08
shapiro.test(data$Weight)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$Weight
## W = 0.9765, p-value < 2.2e-16

Semua variabel memiliki p-value < 0.05 sehingga tidak berdistribusi normal

Homogenitas Kovarians

data_num <- data[, c("Age","Height","Weight")]
boxM(data_num, data$NObeyesdad)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  data_num
## Chi-Sq (approx.) = 3273.2, df = 36, p-value < 2.2e-16

Hasil menunjukkan kovarians tidak homogen.

model_lm <- lm(Age ~ Height + Weight + FCVC + FAF, data=data)
vif(model_lm)
##   Height   Weight     FCVC      FAF 
## 1.513475 1.451400 1.086684 1.162765

Nilai VIF < 10 menunjukkan tidak terdapat multikolinearitas.

7. Analisis Diskriminan (LDA)

lda_model <- lda(NObeyesdad ~ Age + Height + Weight + FCVC + FAF, data=data)

pred <- predict(lda_model)
cm <- table(pred$class, data$NObeyesdad)
cm
##                      
##                       Insufficient_Weight Normal_Weight Overweight_Level_I
##   Insufficient_Weight                 246            30                  0
##   Normal_Weight                        26           201                  4
##   Overweight_Level_I                    0            55                247
##   Overweight_Level_II                   0             1                 39
##   Obesity_Type_I                        0             0                  0
##   Obesity_Type_II                       0             0                  0
##   Obesity_Type_III                      0             0                  0
##                      
##                       Overweight_Level_II Obesity_Type_I Obesity_Type_II
##   Insufficient_Weight                   0              0               0
##   Normal_Weight                         0              0               0
##   Overweight_Level_I                   71              0               0
##   Overweight_Level_II                 218              8               0
##   Obesity_Type_I                        1            332               5
##   Obesity_Type_II                       0             11             283
##   Obesity_Type_III                      0              0               9
##                      
##                       Obesity_Type_III
##   Insufficient_Weight                0
##   Normal_Weight                      0
##   Overweight_Level_I                 0
##   Overweight_Level_II                0
##   Obesity_Type_I                     0
##   Obesity_Type_II                   23
##   Obesity_Type_III                 301
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
## [1] 0.8659403
aper <- 1 - accuracy
aper
## [1] 0.1340597

Model memiliki akurasi sebesar 86.59% dengan tingkat kesalahan (APER) sebesar 13.41%. Hal ini menunjukkan bahwa model mampu mengklasifikasikan data dengan baik.

8. Regresi Logistik Ordinal Standarisasi

data$Weight_s <- scale(data$Weight)
data$FAF_s <- scale(data$FAF)

Standarisasi dilakukan agar variabel memiliki skala yang sama.

Model

model_ord <- polr(NObeyesdad ~ Weight_s + FAF_s, data=data, Hess=TRUE)
summary(model_ord)
## Call:
## polr(formula = NObeyesdad ~ Weight_s + FAF_s, data = data, Hess = TRUE)
## 
## Coefficients:
##            Value Std. Error t value
## Weight_s  4.8374    0.11982   40.37
## FAF_s    -0.7101    0.04678  -15.18
## 
## Intercepts:
##                                        Value    Std. Error t value 
## Insufficient_Weight|Normal_Weight       -6.2866   0.1680   -37.4206
## Normal_Weight|Overweight_Level_I        -3.6912   0.1162   -31.7698
## Overweight_Level_I|Overweight_Level_II  -1.6596   0.0873   -19.0144
## Overweight_Level_II|Obesity_Type_I       0.2285   0.0819     2.7898
## Obesity_Type_I|Obesity_Type_II           3.4620   0.1324    26.1576
## Obesity_Type_II|Obesity_Type_III         5.9875   0.1601    37.4055
## 
## Residual Deviance: 4238.825 
## AIC: 4254.825

Weight berpengaruh positif terhadap obesitas, sedangkan FAF berpengaruh negatif.

Uji Signifikansi Model

model_null <- polr(NObeyesdad ~ 1, data=data, Hess=TRUE)
anova(model_null, model_ord)
## Likelihood ratio tests of ordinal regression models
## 
## Response: NObeyesdad
##              Model Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
## 1                1      2105   8201.628                              
## 2 Weight_s + FAF_s      2103   4238.825 1 vs 2     2 3962.803       0

Null Model = 8201.628 Model = 4238.825 p-value < 0.05 Model signifikan secara keseluruhan.

Koefisien dan p-value

ctable <- coef(summary(model_ord))
p_value <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
cbind(ctable, p_value)
##                                             Value Std. Error    t value
## Weight_s                                4.8373749 0.11981644  40.373216
## FAF_s                                  -0.7100537 0.04677521 -15.180128
## Insufficient_Weight|Normal_Weight      -6.2866142 0.16799876 -37.420598
## Normal_Weight|Overweight_Level_I       -3.6911731 0.11618512 -31.769759
## Overweight_Level_I|Overweight_Level_II -1.6595635 0.08727945 -19.014368
## Overweight_Level_II|Obesity_Type_I      0.2285153 0.08191073   2.789809
## Obesity_Type_I|Obesity_Type_II          3.4620449 0.13235324  26.157615
## Obesity_Type_II|Obesity_Type_III        5.9875334 0.16007103  37.405479
##                                              p_value
## Weight_s                                0.000000e+00
## FAF_s                                   4.788494e-52
## Insufficient_Weight|Normal_Weight      1.807143e-306
## Normal_Weight|Overweight_Level_I       1.694232e-221
## Overweight_Level_I|Overweight_Level_II  1.296901e-80
## Overweight_Level_II|Obesity_Type_I      5.273911e-03
## Obesity_Type_I|Obesity_Type_II         8.073266e-151
## Obesity_Type_II|Obesity_Type_III       3.182916e-306

Semua variabel signifikan karena p-value < 0.05.

Odds Ratio

exp(coef(model_ord))
##    Weight_s       FAF_s 
## 126.1377876   0.4916178

Weight meningkatkan peluang obesitas, sedangkan FAF menurunkan peluang.

Goodness of Fit

pR2(model_ord)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -2119.4124283 -4100.8140859  3962.8033153     0.4831728     0.8469845 
##          r2CU 
##     0.8647499

Nilai Nagelkerke R² tinggi menunjukkan model sangat baik.

Cutpoint

model_ord$zeta
##      Insufficient_Weight|Normal_Weight       Normal_Weight|Overweight_Level_I 
##                             -6.2866142                             -3.6911731 
## Overweight_Level_I|Overweight_Level_II     Overweight_Level_II|Obesity_Type_I 
##                             -1.6595635                              0.2285153 
##         Obesity_Type_I|Obesity_Type_II       Obesity_Type_II|Obesity_Type_III 
##                              3.4620449                              5.9875334

Cutpoint menunjukkan batas antar kategori obesitas dan tersusun secara berurutan.

VGAM

model_vglm <- vglm(NObeyesdad ~ Weight_s + FAF_s,
                   family = cumulative(parallel=TRUE),
                   data=data)

summary(model_vglm)
## 
## Call:
## vglm(formula = NObeyesdad ~ Weight_s + FAF_s, family = cumulative(parallel = TRUE), 
##     data = data)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -6.28661    0.16699 -37.646  < 2e-16 ***
## (Intercept):2 -3.69117    0.11693 -31.566  < 2e-16 ***
## (Intercept):3 -1.65956    0.08739 -18.991  < 2e-16 ***
## (Intercept):4  0.22852    0.08289   2.757  0.00584 ** 
## (Intercept):5  3.46203    0.12742  27.171  < 2e-16 ***
## (Intercept):6  5.98753    0.16089  37.214  < 2e-16 ***
## Weight_s      -4.83737    0.11944 -40.500  < 2e-16 ***
## FAF_s          0.71006    0.04679  15.175  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  6 
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3]), logitlink(P[Y<=4]), logitlink(P[Y<=5]), logitlink(P[Y<=6])
## 
## Residual deviance: 4238.825 on 12658 degrees of freedom
## 
## Log-likelihood: -2119.412 on 12658 degrees of freedom
## 
## Number of Fisher scoring iterations: 9 
## 
## 
## Exponentiated coefficients:
##    Weight_s       FAF_s 
## 0.007927855 2.034106763

Model VGAM menghasilkan hasil yang konsisten dengan model sebelumnya.

Kesimpulan: Berdasarkan hasil analisis, model analisis diskriminan (LDA) mampu mengklasifikasikan data dengan baik dengan akurasi sebesar 86,59%. Hasil regresi logistik ordinal menunjukkan bahwa berat badan berpengaruh positif terhadap tingkat obesitas, sedangkan aktivitas fisik berpengaruh negatif. Artinya, semakin tinggi berat badan maka peluang obesitas meningkat, sementara aktivitas fisik yang tinggi dapat menurunkan peluang tersebut. Secara keseluruhan, model yang digunakan sudah cukup baik dalam menjelaskan tingkat obesitas.