K-MEANS CLUSTERING

In today's information era, understanding the factors that influence global happiness is increasingly important for policymakers, researchers, and the general public. The World Happiness Report (WHR) is one of the main data sources on happiness levels across countries, based on indicators such as GDP per capita, social support, healthy life expectancy, and freedom to make life choices.

To analyze and group the happiness data effectively, clustering is a very useful tool. One popular clustering technique is k-means clustering, which partitions the observations into k groups so that each observation belongs to the cluster with the nearest centroid.
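
For reference, k-means is usually described as minimizing the total within-cluster sum of squares. The notation below (clusters C_j, centroids mu_j, observations x_i) is introduced here for illustration only and does not appear in the original analysis.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```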

library(tidyverse)  # data manipulation

library(cluster)    # clustering algorithms
library(factoextra) # clustering algorithms & visualization

library(PerformanceAnalytics)

library(ggpubr)

library(tibble)
library(MVN)

DATASET

data <- read.csv(file = "C:/Users/acer/Downloads/archive (11)/2019.csv", header = TRUE, sep = ",")
head(data)

##   Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1            1           Finland 7.769          1.340          1.587
## 2            2           Denmark 7.600          1.383          1.573
## 3            3            Norway 7.554          1.488          1.582
## 4            4           Iceland 7.494          1.380          1.624
## 5            5       Netherlands 7.488          1.396          1.522
## 6            6       Switzerland 7.480          1.452          1.526
##   Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1                   0.986                        0.596      0.153
## 2                   0.996                        0.592      0.252
## 3                   1.028                        0.603      0.271
## 4                   1.026                        0.591      0.354
## 5                   0.999                        0.557      0.322
## 6                   1.052                        0.572      0.263
##   Perceptions.of.corruption
## 1                     0.393
## 2                     0.410
## 3                     0.341
## 4                     0.118
## 5                     0.298
## 6                     0.343

The data are secondary data obtained from the Kaggle website and contain the variables that influence the global happiness score for 2019.

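# select the numeric indicator variables (columns 4 to 9)
index <- data[, 4:9]
# optional: label rows by country so plots show country names
# rownames(index) <- data$Country.or.region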

Data Preprocessing

# check whether the data contain any NA values
index %>% 
  anyNA()

## [1] FALSE

# count the number of NA values in each column
index %>% 
  is.na() %>% 
  colSums()

##               GDP.per.capita               Social.support 
##                            0                            0 
##      Healthy.life.expectancy Freedom.to.make.life.choices 
##                            0                            0 
##                   Generosity    Perceptions.of.corruption 
##                            0                            0

Checking Multicollinearity

cor_mat <- cor(index)        # correlation matrix of the indicators
vif <- diag(solve(cor_mat))  # VIFs are the diagonal of its inverse
vif

##               GDP.per.capita               Social.support 
##                     4.115838                     2.735651 
##      Healthy.life.expectancy Freedom.to.make.life.choices 
##                     3.572728                     1.575090 
##                   Generosity    Perceptions.of.corruption 
##                     1.224101                     1.431594
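
The VIF values above are read off the diagonal of the inverse correlation matrix. As a hedged illustration of why that shortcut works, the sketch below uses R's built-in mtcars data (an assumption, chosen only so the example runs without the Kaggle file) and checks that the shortcut agrees with the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing variable j on the remaining variables. The object names `vars`, `vif_shortcut`, and `vif_definition` are hypothetical.

```r
# Illustration on built-in data (mtcars), not the happiness dataset
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Shortcut: VIFs are the diagonal of the inverse correlation matrix
vif_shortcut <- diag(solve(cor(vars)))

# Definition: regress each variable on the others and use 1 / (1 - R^2)
vif_definition <- sapply(names(vars), function(v) {
  fit <- lm(reformulate(setdiff(names(vars), v), response = v), data = vars)
  1 / (1 - summary(fit)$r.squared)
})

# Both approaches should agree up to numerical precision
all.equal(vif_shortcut, vif_definition)
```

A common rule of thumb treats VIF values above about 10 (or 5 under stricter conventions) as a sign of serious multicollinearity. By that standard the values reported above, all below 5, would not usually be considered problematic.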

Determining the Number of Clusters

fviz_nbclust(index, kmeans, method="silhouette")
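
With method = "silhouette", the plot shows the average silhouette width for each candidate number of clusters; the k with the highest value is usually taken as the suggested choice. A minimal sketch of the same calculation is shown below. It assumes the built-in USArrests data (standardized) as a stand-in for `index`, and `avg_sil` is a hypothetical helper name, not a function from any package.

```r
library(cluster)  # provides silhouette()

# Stand-in data: built-in USArrests, standardized (not the happiness dataset)
x <- scale(USArrests)
d <- dist(x)  # Euclidean distances between observations

# Average silhouette width for a given number of clusters k
avg_sil <- function(k) {
  set.seed(100)  # reproducible starting centroids
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
}

k_values <- 2:6
sil_widths <- sapply(k_values, avg_sil)

# The k with the largest average silhouette width is the suggested choice
best_k <- k_values[which.max(sil_widths)]
best_k
```

Computing the widths by hand like this makes it easy to inspect the actual numbers behind the plot, for example to see how close the second-best k is to the best one.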

Clustering and Visualization

# k-means with 2 clusters
# use the pre-R 3.6.0 sampling algorithm so the seed reproduces older results
RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

k2 <- kmeans(x = index, centers = 2, nstart = 50)

# visualization
fviz_cluster(k2, data = index)

The 156 countries were successfully clustered into 2 groups. Each group has its own characteristics, and the countries within it are similar to one another on the factors considered.
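
To see what actually distinguishes the groups, one option is to compare the cluster sizes and the per-cluster mean of each variable. The sketch below is only an illustration under stated assumptions: it uses the built-in USArrests data in place of `index`, and the object names `km` and `profile` are hypothetical.

```r
# Stand-in data: built-in USArrests (not the happiness dataset)
x <- USArrests

set.seed(100)
km <- kmeans(x, centers = 2, nstart = 50)

# Number of observations assigned to each cluster
km$size

# Per-cluster mean of every variable; these match km$centers
profile <- aggregate(x, by = list(cluster = km$cluster), FUN = mean)
profile
```

Applied to the happiness indicators and the fitted k-means object, the same table would show which group scores higher on GDP per capita, social support, and the other factors.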

K-Means Clustering

Dhiya Ashilah Latief

2024-08-21
