1. Project Overview

EN.
This project analyzes the World Happiness Report 2024 dataset to understand how national happiness scores are related to socioeconomic, health, institutional, and social-support indicators.

KR.
이 프로젝트는 World Happiness Report 2024 데이터를 활용해 국가별 행복 점수가 경제, 건강, 제도 신뢰, 사회적 지지와 어떤 관계를 가지는지 분석한 개인 프로젝트입니다.

1.1. Motivation

EN.

During my exchange semester in Hungary, I became interested in comparing happiness and quality of life across countries. While living in Hungary, I noticed several aspects that felt comparable to South Korea in everyday life, such as urban routines, public transportation use, and the balance between academic life and personal time. At the same time, I also sensed differences in social atmosphere, perceived stability, and overall well-being.

This interest expanded as I traveled to neighboring Central and Eastern European countries. Even among geographically close countries, daily living conditions and social mood did not always feel the same. This made me wonder whether national happiness is mainly explained by economic conditions, or whether non-economic factors such as social support, healthy life expectancy, freedom, and institutional trust also play an important role.

Based on this experience, I used the World Happiness Report 2024 dataset to compare South Korea, Hungary, their neighboring countries, and the countries that rank highest in happiness. The analysis examines how happiness-related indicators differ across these countries and identifies which factors are most closely associated with national happiness scores.

KR.

헝가리에서 교환학생으로 생활하며 국가별 행복도와 삶의 질 차이에 관심을 갖게 되었습니다. 헝가리에서 생활하는 동안 도시 생활 방식, 대중교통 이용, 학업과 개인 생활의 균형 등에서 한국과 비슷하게 느껴지는 지점이 있었습니다. 동시에 사회적 분위기, 안정감, 전반적인 삶의 질에 대한 체감에서는 차이도 느낄 수 있었습니다.

이 관심은 주변 중·동유럽 국가를 여행하면서 더 커졌습니다. 지리적으로 가까운 국가들이라도 실제 생활환경과 사회적 분위기, 행복 수준은 서로 다르게 느껴졌습니다. 이를 계기로 국가별 행복 차이가 경제 수준만으로 설명되는지, 아니면 사회적 지지, 건강 기대수명, 자유, 제도 신뢰와 같은 비경제적 요인도 중요한 역할을 하는지 확인하고자 했습니다.

이러한 경험을 바탕으로 World Happiness Report 2024 데이터를 활용해 한국, 헝가리, 주변 국가, 행복도 상위권 국가들을 비교했습니다. 본 분석의 목적은 국가별 행복 관련 지표가 어떻게 다른지 살펴보고, 행복 점수와 가장 밀접하게 연결되는 요인을 확인하는 것입니다.

1.2. Research Questions

EN.
This project focuses on three questions.

Which variables are most strongly associated with national happiness scores?
Can countries be grouped into meaningful clusters based on happiness-related indicators?
Which dimensions explain the differences among clusters after dimensionality reduction?

KR.
본 분석은 다음 세 가지 질문을 중심으로 진행했습니다.

국가별 행복 점수와 가장 강하게 관련된 변수는 무엇인가?
행복 관련 지표를 기준으로 국가들을 의미 있는 군집으로 나눌 수 있는가?
차원축소 이후 각 군집을 구분하는 주요 요인은 무엇인가?

1.3. Analysis Flow

EN.
The analysis proceeds in the following order: data preparation, preprocessing, exploratory visualization, clustering, principal component analysis, and multinomial logistic regression.

KR.
분석은 데이터 준비, 전처리, 탐색적 시각화, 군집분석, 주성분분석, 다항 로지스틱 회귀 순서로 진행했습니다.

2. Load Libraries and Dataset

EN.
The required R packages and the World Happiness Report 2024 dataset are loaded. The dataset is prepared for preprocessing, visualization, clustering, PCA, and regression analysis.

KR.
분석에 필요한 R 패키지와 World Happiness Report 2024 데이터를 불러옵니다. 이후 전처리, 시각화, 군집분석, PCA, 회귀분석을 수행하기 위한 기본 데이터를 준비합니다.

rm(list = ls())

library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)
library(reshape2)
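library(tidyverse)   # also attaches ggplot2, dplyr, and tidyr
library(conflicted)  # turns ambiguous function names into errors, hence the explicit dplyr:: calls below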
library(factoextra)
library(nnet)

whr = read.csv('WHR2024.csv')

3. Data Structure

EN.
The dataset contains country-level happiness scores and variables that explain differences in life evaluation. Before modeling, the structure, dimensions, and summary statistics are reviewed to understand the data scale and missing values.

KR.
데이터는 국가별 행복 점수와 삶의 평가 차이를 설명하는 여러 지표로 구성되어 있습니다. 모델링에 앞서 데이터 구조, 크기, 요약 통계를 확인해 변수의 스케일과 결측 여부를 파악했습니다.

head(whr)

##   Country.name Ladder.score upperwhisker lowerwhisker
## 1      Finland        7.741        7.815        7.667
## 2      Denmark        7.583        7.665        7.500
## 3      Iceland        7.525        7.618        7.433
## 4       Sweden        7.344        7.422        7.267
## 5       Israel        7.341        7.405        7.277
## 6  Netherlands        7.319        7.383        7.256
##   Explained.by..Log.GDP.per.capita Explained.by..Social.support
## 1                            1.844                        1.572
## 2                            1.908                        1.520
## 3                            1.881                        1.617
## 4                            1.878                        1.501
## 5                            1.803                        1.513
## 6                            1.901                        1.462
##   Explained.by..Healthy.life.expectancy
## 1                                 0.695
## 2                                 0.699
## 3                                 0.718
## 4                                 0.724
## 5                                 0.740
## 6                                 0.706
##   Explained.by..Freedom.to.make.life.choices Explained.by..Generosity
## 1                                      0.859                    0.142
## 2                                      0.823                    0.204
## 3                                      0.819                    0.258
## 4                                      0.838                    0.221
## 5                                      0.641                    0.153
## 6                                      0.725                    0.247
##   Explained.by..Perceptions.of.corruption Dystopia...residual
## 1                                   0.546               2.082
## 2                                   0.548               1.881
## 3                                   0.182               2.050
## 4                                   0.524               1.658
## 5                                   0.193               2.298
## 6                                   0.372               1.906

dim(whr) # Expected: 143 countries, 11 columns

## [1] 143  11

str(whr)

## 'data.frame':    143 obs. of  11 variables:
##  $ Country.name                              : chr  "Finland" "Denmark" "Iceland" "Sweden" ...
##  $ Ladder.score                              : num  7.74 7.58 7.53 7.34 7.34 ...
##  $ upperwhisker                              : num  7.82 7.67 7.62 7.42 7.41 ...
##  $ lowerwhisker                              : num  7.67 7.5 7.43 7.27 7.28 ...
##  $ Explained.by..Log.GDP.per.capita          : num  1.84 1.91 1.88 1.88 1.8 ...
##  $ Explained.by..Social.support              : num  1.57 1.52 1.62 1.5 1.51 ...
##  $ Explained.by..Healthy.life.expectancy     : num  0.695 0.699 0.718 0.724 0.74 0.706 0.704 0.708 0.747 0.692 ...
##  $ Explained.by..Freedom.to.make.life.choices: num  0.859 0.823 0.819 0.838 0.641 0.725 0.835 0.801 0.759 0.756 ...
##  $ Explained.by..Generosity                  : num  0.142 0.204 0.258 0.221 0.153 0.247 0.224 0.146 0.173 0.225 ...
##  $ Explained.by..Perceptions.of.corruption   : num  0.546 0.548 0.182 0.524 0.193 0.372 0.484 0.432 0.498 0.323 ...
##  $ Dystopia...residual                       : num  2.08 1.88 2.05 1.66 2.3 ...

summary(whr)

##  Country.name        Ladder.score    upperwhisker    lowerwhisker  
##  Length:143         Min.   :1.721   Min.   :1.775   Min.   :1.667  
##  Class :character   1st Qu.:4.726   1st Qu.:4.846   1st Qu.:4.606  
##  Mode  :character   Median :5.785   Median :5.895   Median :5.674  
##                     Mean   :5.528   Mean   :5.641   Mean   :5.414  
##                     3rd Qu.:6.416   3rd Qu.:6.508   3rd Qu.:6.319  
##                     Max.   :7.741   Max.   :7.815   Max.   :7.667  
##                                                                    
##  Explained.by..Log.GDP.per.capita Explained.by..Social.support
##  Min.   :0.000                    Min.   :0.0000              
##  1st Qu.:1.078                    1st Qu.:0.9217              
##  Median :1.431                    Median :1.2375              
##  Mean   :1.379                    Mean   :1.1343              
##  3rd Qu.:1.742                    3rd Qu.:1.3833              
##  Max.   :2.141                    Max.   :1.6170              
##  NA's   :3                        NA's   :3                   
##  Explained.by..Healthy.life.expectancy
##  Min.   :0.0000                       
##  1st Qu.:0.3980                       
##  Median :0.5495                       
##  Mean   :0.5209                       
##  3rd Qu.:0.6485                       
##  Max.   :0.8570                       
##  NA's   :3                            
##  Explained.by..Freedom.to.make.life.choices Explained.by..Generosity
##  Min.   :0.0000                             Min.   :0.0000          
##  1st Qu.:0.5275                             1st Qu.:0.0910          
##  Median :0.6410                             Median :0.1365          
##  Mean   :0.6206                             Mean   :0.1463          
##  3rd Qu.:0.7360                             3rd Qu.:0.1925          
##  Max.   :0.8630                             Max.   :0.4010          
##  NA's   :3                                  NA's   :3               
##  Explained.by..Perceptions.of.corruption Dystopia...residual
##  Min.   :0.00000                         Min.   :-0.073     
##  1st Qu.:0.06875                         1st Qu.: 1.308     
##  Median :0.12050                         Median : 1.645     
##  Mean   :0.15412                         Mean   : 1.576     
##  3rd Qu.:0.19375                         3rd Qu.: 1.882     
##  Max.   :0.57500                         Max.   : 2.998     
##  NA's   :3                               NA's   :3

4. Data Preprocessing

4.1. Rename Columns

EN.
Column names are simplified to make the code easier to read and to reduce repeated long variable names in the analysis.

KR.
분석 코드의 가독성을 높이기 위해 변수명을 간결하게 변경했습니다. 긴 변수명을 반복적으로 사용하는 대신, 각 변수의 의미가 드러나는 짧은 이름을 사용했습니다.

colnames(whr) = c(
  'Country', 'Score', 'upperwhisker', 'lowerwhisker',
  'LogGDP', 'SocialSupport', 'LifeExpectancy',
  'Freedom', 'Generosity', 'Corruption', 'DystopiaResidual'
)

4.2. Remove Missing Values

EN.
Rows with missing values are removed to ensure that clustering, PCA, and regression are performed on complete observations. In this dataset, Bahrain, Tajikistan, and State of Palestine contained missing values.

KR.
군집분석, PCA, 회귀분석을 완전한 관측치 기준으로 수행하기 위해 결측치가 있는 행을 제거했습니다. 해당 데이터에서는 Bahrain, Tajikistan, State of Palestine에 결측치가 존재했습니다.

colSums(is.na(whr))

##          Country            Score     upperwhisker     lowerwhisker 
##                0                0                0                0 
##           LogGDP    SocialSupport   LifeExpectancy          Freedom 
##                3                3                3                3 
##       Generosity       Corruption DystopiaResidual 
##                3                3                3

whr[!complete.cases(whr), ]

##                Country Score upperwhisker lowerwhisker LogGDP SocialSupport
## 62             Bahrain 5.959        6.153        5.766     NA            NA
## 88          Tajikistan 5.281        5.361        5.201     NA            NA
## 103 State of Palestine 4.879        5.006        4.753     NA            NA
##     LifeExpectancy Freedom Generosity Corruption DystopiaResidual
## 62              NA      NA         NA         NA               NA
## 88              NA      NA         NA         NA               NA
## 103             NA      NA         NA         NA               NA

whr = whr[complete.cases(whr), ]
dim(whr) # Expected after removing missing values: 140 countries, 11 columns

## [1] 140  11

4.3. Add Region and SubRegion Columns

EN.
External continent-mapping data is merged with the happiness dataset to add regional context. Countries that were not matched automatically are manually mapped. This step allows the analysis to compare not only individual countries, but also broader regional patterns.

KR.
국가별 행복지표에 지역적 맥락을 추가하기 위해 외부 대륙 매핑 데이터를 병합했습니다. 자동 매칭되지 않은 국가는 수동으로 Region과 SubRegion을 부여했습니다. 이를 통해 개별 국가뿐 아니라 지역 단위의 패턴도 함께 해석할 수 있습니다.

cont = read.csv('continents2.csv')
head(cont)

##             name alpha.2 alpha.3 country.code    iso_3166.2  region
## 1    Afghanistan      AF     AFG            4 ISO 3166-2:AF    Asia
## 2  Åland Islands      AX     ALA          248 ISO 3166-2:AX  Europe
## 3        Albania      AL     ALB            8 ISO 3166-2:AL  Europe
## 4        Algeria      DZ     DZA           12 ISO 3166-2:DZ  Africa
## 5 American Samoa      AS     ASM           16 ISO 3166-2:AS Oceania
## 6        Andorra      AD     AND           20 ISO 3166-2:AD  Europe
##        sub.region intermediate.region region.code sub.region.code
## 1   Southern Asia                             142              34
## 2 Northern Europe                             150             154
## 3 Southern Europe                             150              39
## 4 Northern Africa                               2              15
## 5       Polynesia                               9              61
## 6 Southern Europe                             150              39
##   intermediate.region.code
## 1                       NA
## 2                       NA
## 3                       NA
## 4                       NA
## 5                       NA
## 6                       NA

cont = cont[, c('name', 'region', 'sub.region')]
colnames(cont) = c('Country', 'Region', 'SubRegion')

whr_new = merge(whr, cont, by = 'Country', all.x = TRUE)

whr_new[is.na(whr_new$Region), ]['Country']

##                       Country
## 13     Bosnia and Herzegovina
## 26        Congo (Brazzaville)
## 27           Congo (Kinshasa)
## 31                    Czechia
## 51  Hong Kong S.A.R. of China
## 61                Ivory Coast
## 67                     Kosovo
## 99            North Macedonia
## 123  Taiwan Province of China
## 128                   Turkiye

map = data.frame(
  Country = c(
    "Bosnia and Herzegovina", "Congo (Brazzaville)", "Congo (Kinshasa)",
    "Czechia", "Hong Kong S.A.R. of China", "Ivory Coast",
    "Kosovo", "North Macedonia", "Taiwan Province of China", "Turkiye"
  ),
  Region = c(
    "Europe", "Africa", "Africa",
    "Europe", "Asia", "Africa",
    "Europe", "Europe", "Asia", "Asia"
  ),
  SubRegion = c(
    "Southern Europe", "Sub-Saharan Africa", "Sub-Saharan Africa",
    "Eastern Europe", "Eastern Asia", "Sub-Saharan Africa",
    "Southern Europe", "Southern Europe", "Eastern Asia", "Western Asia"
  )
)

whr_new = merge(whr_new, map, by = 'Country', all.x = TRUE)

whr_new$Region = ifelse(is.na(whr_new$Region.x), whr_new$Region.y, whr_new$Region.x)
whr_new$SubRegion = ifelse(is.na(whr_new$SubRegion.x), whr_new$SubRegion.y, whr_new$SubRegion.x)

whr_new = whr_new %>%
  dplyr::select(-Region.x, -SubRegion.x, -Region.y, -SubRegion.y)

sum(is.na(whr_new['Region']))

## [1] 0
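
The two-step merge above can also be collapsed into a single lookup. The following is a minimal sketch under stated assumptions, not the method used in the original analysis: the data frames `scores`, `lookup`, and `manual` are hypothetical toy stand-ins for `whr`, `cont`, and `map`, and the `Value` column holds placeholder numbers. The manual mappings are appended to the lookup table first, so one `merge()` fills every row and no `.x`/`.y` columns need to be reconciled.

```r
# Hypothetical toy data; the project itself uses WHR2024.csv and continents2.csv.
scores <- data.frame(
  Country = c("Finland", "Czechia", "Kosovo"),
  Value   = c(1, 2, 3)  # placeholder values, not real scores
)

lookup <- data.frame(
  Country   = "Finland",
  Region    = "Europe",
  SubRegion = "Northern Europe"
)

manual <- data.frame(
  Country   = c("Czechia", "Kosovo"),
  Region    = c("Europe", "Europe"),
  SubRegion = c("Eastern Europe", "Southern Europe")
)

# Keep only manual rows whose country is absent from the lookup, then stack them.
full_lookup <- rbind(lookup, manual[!manual$Country %in% lookup$Country, ])

# A single left join now covers both automatic and manual matches.
merged <- merge(scores, full_lookup, by = "Country", all.x = TRUE)

# Any country still lacking a region would be listed here.
unmatched <- merged$Country[is.na(merged$Region)]
print(unmatched)
```

An empty `unmatched` vector plays the same role as the `sum(is.na(...))` check above.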

5. Exploratory Data Analysis

5.1. Numeric Variable Setting

EN.
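Numeric variables are selected for visualization, correlation analysis, clustering, and PCA. Country names and regional labels are excluded, as are the upperwhisker and lowerwhisker columns, so the numeric matrix holds Score and the seven explanatory variables.

KR.
시각화, 상관분석, 군집분석, PCA에 사용할 수치형 변수만 별도로 분리했습니다. 국가명과 지역 라벨, 그리고 upperwhisker와 lowerwhisker 열을 제외해 수치형 분석 행렬에는 Score와 7개의 설명 변수만 남도록 했습니다.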

num_var = setdiff(colnames(whr_new), c("Country", 'upperwhisker', 'lowerwhisker', "Region", "SubRegion"))
num_data = whr_new[, num_var]

5.2. Boxplots for Each Variable

EN.
Boxplots are used to identify outliers and compare the spread of each variable. This helps check whether certain countries show unusually high or low values in specific indicators.

KR.
각 변수의 이상치와 분포 범위를 확인하기 위해 boxplot을 사용했습니다. 이를 통해 특정 국가가 특정 지표에서 비정상적으로 높거나 낮은 값을 보이는지 확인할 수 있습니다.

plot_list = list()

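for (var in num_var) {
  # Base boxplot() draws the plot immediately and returns its summary statistics
  # (quartiles, whisker ends, outliers), so plot_list stores statistics, not plot objects.
  p = boxplot(
    num_data[[var]],
    main = paste("Boxplot of", var),
    col = "lightblue",
    outline = TRUE
  )
  plot_list[[var]] = p
}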
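
A boxplot shows that outliers exist but not which countries they are. The sketch below is one possible way to recover the country names, assuming a hypothetical toy data frame `toy` with illustrative values in place of `whr_new`. It relies on `boxplot.stats()`, which applies the same 1.5 × IQR rule that `boxplot()` uses to draw outlier points.

```r
# Hypothetical toy data standing in for whr_new; values are illustrative only.
toy <- data.frame(
  Country = c("A", "B", "C", "D", "E", "F", "G"),
  Value   = c(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 9.0)
)

# $out holds the values lying beyond the whiskers.
out_values <- boxplot.stats(toy$Value)$out

# Map the outlying values back to the countries that carry them.
outlier_countries <- toy$Country[toy$Value %in% out_values]
print(outlier_countries)
```

In the project, the same two lines could be run inside the loop for each `var` in `num_var`.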

5.3. Density Distribution for Each Variable

EN.
Density plots are used to inspect the distribution of each variable. The explanatory variables come from the report's "Explained by" columns, which express each factor as a contribution to the ladder score rather than in raw units, so this step focuses on distributional shape rather than raw-unit comparison.

KR.
각 변수의 분포 형태를 확인하기 위해 density plot을 그렸습니다. 설명 변수들은 보고서의 "Explained by" 열에서 가져온 값으로, 원 단위가 아니라 행복 점수(Ladder score)에 대한 기여분으로 표현되어 있으므로 절대 단위 비교보다 분포 형태 확인에 초점을 두었습니다.

plot_list1 = list()

for (var in num_var) {
  p = ggplot(whr_new, aes(x = .data[[var]])) +  # aes_string() is deprecated in current ggplot2
    geom_density(fill = "blue", alpha = 0.3) +
    labs(title = paste("Distribution of", var), x = var, y = "Density") +
    theme_minimal()
  plot_list1[[var]] = p
}

print(plot_list1)

[Figures: density plots of Score, LogGDP, SocialSupport, LifeExpectancy, Freedom, Generosity, Corruption, and DystopiaResidual]
5.4. Correlation Matrix

EN.
Correlation analysis is used to examine which variables are most strongly related to the happiness score. In the original analysis, SocialSupport, LifeExpectancy, and LogGDP showed strong relationships with Score, while Generosity had a relatively weak relationship.

KR.
상관분석을 통해 어떤 변수가 행복 점수와 강하게 관련되는지 확인했습니다. 기존 분석 결과에서는 SocialSupport, LifeExpectancy, LogGDP가 Score와 강한 관련성을 보였고, Generosity는 상대적으로 약한 관련성을 보였습니다.

corr_matrix = cor(num_data, use = "complete.obs")
melt_corr = melt(corr_matrix)

ggplot(data = melt_corr, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), size = 3, color = "black") +
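  scale_fill_gradient2(low = "darkred", mid = "white", high = "darkgreen", midpoint = 0, limits = c(-1, 1)) +  # diverging scale so negative correlations stay distinguishable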
  labs(title = "Correlation Matrix") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
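
The heatmap shows every pair of variables, while the first research question concerns only the Score column. A minimal sketch of how that column could be extracted and ranked is given below. It assumes a hypothetical simulated data frame `toy` in place of `num_data`; the variable names `A` and `B` are placeholders.

```r
# Hypothetical simulated data; in the project this would be num_data.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(
  Score = 2 * x1 + 0.2 * x2 + rnorm(n, sd = 0.5),
  A     = x1,
  B     = x2
)

corr_matrix <- cor(toy)

# Take the Score column and drop Score's correlation with itself.
score_corr <- corr_matrix[, "Score"]
score_corr <- score_corr[names(score_corr) != "Score"]

# Sort by absolute value so strong negative correlations also rank high.
ranked <- score_corr[order(abs(score_corr), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting by absolute value rather than by signed value is a design choice: it keeps a strongly negative association from being listed last.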

5.5. Relationship between LifeExpectancy and LogGDP

EN.
This plot checks whether countries with higher economic indicators also tend to have higher healthy life expectancy. A strong positive pattern would suggest that economic conditions and health-related outcomes are closely connected at the national level.

KR.
경제 수준이 높은 국가일수록 건강수명도 높은 경향이 있는지 확인했습니다. 강한 양의 관계가 나타난다면 국가 단위에서 경제적 조건과 건강 관련 지표가 밀접하게 연결되어 있음을 시사합니다.

ggplot(whr_new, aes(x = LifeExpectancy, y = LogGDP)) +
  geom_point(color = "red", alpha = 0.7) +
  geom_smooth(method = "lm", color = "darkred") +
  labs(
    title = "Relationship between Life Expectancy and Log GDP",
    x = "Life Expectancy",
    y = "Log GDP"
  ) +
  theme_minimal()

5.6. Relationship between SocialSupport and Score

EN.
This plot examines the relationship between social support and the happiness score. Social support is especially important because it reflects whether individuals feel they can rely on others in difficult situations.

KR.
사회적 지지와 행복 점수의 관계를 확인했습니다. SocialSupport는 어려운 상황에서 도움을 받을 수 있는 관계망이 있는지를 반영하는 지표로, 삶의 안정성과 행복 수준을 설명하는 데 중요한 변수입니다.

ggplot(whr_new, aes(x = SocialSupport, y = Score)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "darkblue") +
  labs(
    title = "Relationship between Social Support and Happiness Score",
    x = "Social Support",
    y = "Happiness Score"
  ) +
  theme_minimal()

5.7. Selected Country Comparison

EN.
A selected group of countries is compared to connect the statistical analysis with the exchange-student context. The comparison includes South Korea and neighboring East Asian countries, Hungary and neighboring European countries, and high-ranking Nordic countries.

KR.
교환학생 경험과 분석 주제를 연결하기 위해 관심 국가들을 별도로 비교했습니다. 한국과 동아시아 인접국, 헝가리와 주변 유럽 국가, 행복도 상위권 북유럽 국가들을 포함했습니다.

southkorea_neighbors = c("South Korea", "Japan", "China")
hungary_neighbors = c(
  "Hungary", "Slovakia", "Austria", "Czechia",
  "Slovenia", "Croatia", "Serbia", "Romania"
)
highscore_countries = c("Finland", "Denmark", "Iceland", "Sweden")

interesting_countries = c(southkorea_neighbors, hungary_neighbors, highscore_countries)

plot_list2 = list()

for (var in num_var) {
  filtered_data = whr_new %>%
    dplyr::filter(Country %in% interesting_countries) %>%
    mutate(Group = case_when(
      Country %in% southkorea_neighbors ~ "South Korea & Neighbors",
      Country %in% hungary_neighbors ~ "Hungary & Neighbors",
      Country %in% highscore_countries ~ "High Score Countries"
    ))

  p = ggplot(filtered_data, aes(x = reorder(Country, .data[[var]]), y = .data[[var]], fill = Group)) +
    geom_bar(stat = "identity", position = "dodge") +
    geom_text(aes(label = SubRegion), position = position_dodge(width = 0.9), size = 2) +
    coord_flip() +
    labs(title = paste("Barplot of", var), x = "Country", y = var) +
    scale_fill_manual(values = c(
      "South Korea & Neighbors" = "lightblue",
      "Hungary & Neighbors" = "orange",
      "High Score Countries" = "lightgreen"
    )) +
    theme_minimal() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1),
      axis.text.y = element_text(size = 7),
      legend.position = "bottom"
    )

  plot_list2[[var]] = p
}

plot_list2

[Figures: bar plots of Score, LogGDP, SocialSupport, LifeExpectancy, Freedom, Generosity, Corruption, and DystopiaResidual for the selected countries]
6. Clustering Analysis

6.1. Determine the Number of Clusters

EN.
Clustering is used to group countries with similar happiness-related profiles. The Elbow Method initially suggests a smaller number of clusters, but hierarchical clustering is also used to inspect whether a more detailed structure exists. Based on the dendrogram and interpretability, 10 clusters are used in the final analysis.

KR.
군집분석은 행복 관련 지표가 유사한 국가들을 그룹화하기 위해 사용했습니다. Elbow Method는 비교적 적은 수의 군집을 제안하지만, 계층적 군집분석을 통해 더 세분화된 구조가 존재하는지 확인했습니다. 덴드로그램과 해석 가능성을 함께 고려해 최종적으로 10개 군집을 사용했습니다.

num_var = setdiff(num_var, "Score")
df = whr_new[, num_var]
df = scale(df)

set.seed(123)  # fix the random starts so the elbow curve is reproducible
wss = numeric(30)
for (k in 1:30) {
  model = kmeans(df, centers = k, nstart = 25)
  wss[k] = model$tot.withinss
}

plot(
  1:30, wss,
  type = "b",
  pch = 19,
  xlab = "Number of Clusters (K)",
  ylab = "Total Within-SS",
  main = "Elbow Method for K-Means Clustering"
)

dist_mat = dist(df)
hc = hclust(dist_mat, method = "ward.D2")

plot(hc, labels = FALSE, main = "Dendrogram", xlab = "", sub = "")

clusters_hc = cutree(hc, k = 10)

k = 10

set.seed(123)
kmeans_result = kmeans(df, centers = k, nstart = 25)

whr_new$Cluster = as.factor(kmeans_result$cluster)
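
The hierarchical labels in `clusters_hc` are computed above, but the final assignment comes from k-means. One way to check whether the two methods agree is to cross-tabulate their labels. The sketch below illustrates the idea on hypothetical simulated data (`toy`, with three well-separated groups and `k = 3`), not on the project data.

```r
# Hypothetical simulated data: three well-separated groups in two dimensions.
set.seed(123)
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)
toy <- scale(toy)

k <- 3
hc_labels <- cutree(hclust(dist(toy), method = "ward.D2"), k = k)
km_labels <- kmeans(toy, centers = k, nstart = 25)$cluster

# Rows: hierarchical clusters; columns: k-means clusters.
# Cluster numbers are arbitrary, so agreement appears as one dominant cell per row.
agreement <- table(Hierarchical = hc_labels, KMeans = km_labels)
print(agreement)
```

In the project, `table(clusters_hc, kmeans_result$cluster)` would give the corresponding 10 × 10 table; rows spread across many columns would indicate that the two methods partition the countries differently.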

6.2. Visualize Clustering Results

EN.
The clustering result is visualized to examine how countries are separated in the plane of the first two principal components (fviz_cluster projects the data with PCA when there are more than two variables). The cluster-level average happiness score is then compared to identify high-scoring and low-scoring groups.

KR.
군집 결과를 시각화해 국가들이 첫 두 주성분으로 이루어진 평면에서 어떻게 구분되는지 확인했습니다 (fviz_cluster는 변수가 두 개를 초과하면 PCA로 데이터를 투영합니다). 이후 군집별 평균 행복 점수를 비교하여 상대적으로 행복 점수가 높은 군집과 낮은 군집을 파악했습니다.

all(rownames(df) == rownames(whr_new))

## [1] TRUE

fviz_cluster(
  kmeans_result,
  data = df,
  geom = "point",
  ellipse.type = "norm",
  ggtheme = theme_minimal(),
  main = "K-Means Clustering Results"
) +
  geom_text(aes(label = whr_new$Country), size = 2.5, vjust = -1)

cluster_avg_score = whr_new %>%
  group_by(Cluster) %>%
  summarise(avg_score = mean(Score, na.rm = TRUE)) %>%
  arrange(desc(avg_score))

print(cluster_avg_score)

## # A tibble: 10 × 2
##    Cluster avg_score
##    <fct>       <dbl>
##  1 8            6.93
##  2 10           6.18
##  3 5            6.16
##  4 2            6.08
##  5 3            5.15
##  6 9            4.65
##  7 7            4.30
##  8 4            4.19
##  9 6            3.49
## 10 1            2.66

ggplot(cluster_avg_score, aes(x = reorder(as.factor(Cluster), -avg_score), y = avg_score, fill = as.factor(Cluster))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average Happiness Score by Cluster",
    x = "Cluster",
    y = "Average Score"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

boxplots = lapply(num_var, function(var) {
  ggplot(whr_new, aes(x = as.factor(Cluster), y = .data[[var]], fill = as.factor(Cluster))) +
    geom_boxplot() +
    labs(title = paste("Boxplot of", var), x = "Cluster", y = var) +
    theme_minimal() +
    theme(legend.position = "none")
})

print(boxplots)

[Figures: boxplots by cluster of LogGDP, SocialSupport, LifeExpectancy, Freedom, Generosity, Corruption, and DystopiaResidual]

cluster_summary = whr_new %>%
  group_by(Cluster) %>%
  summarise(across(c(LogGDP, SocialSupport, LifeExpectancy, Freedom, Generosity, Corruption, DystopiaResidual), ~ mean(.x, na.rm = TRUE)))

print(cluster_summary)

## # A tibble: 10 × 8
##    Cluster LogGDP SocialSupport LifeExpectancy Freedom Generosity Corruption
##    <fct>    <dbl>         <dbl>          <dbl>   <dbl>      <dbl>      <dbl>
##  1 1        0.967         0.302          0.389   0.115     0.0957     0.0923
##  2 2        1.40          1.21           0.552   0.726     0.0987     0.122 
##  3 3        1.42          1.15           0.493   0.374     0.0668     0.0836
##  4 4        1.14          0.813          0.457   0.720     0.148      0.146 
##  5 5        1.54          1.37           0.580   0.720     0.244      0.0892
##  6 6        1.04          0.979          0.238   0.452     0.0593     0.131 
##  7 7        0.912         0.892          0.369   0.510     0.241      0.0778
##  8 8        1.92          1.42           0.703   0.754     0.198      0.419 
##  9 9        0.779         0.678          0.315   0.524     0.146      0.129 
## 10 10       1.69          1.38           0.646   0.629     0.100      0.117 
## # ℹ 1 more variable: DystopiaResidual <dbl>

long_summary = cluster_summary %>%
  pivot_longer(cols = -Cluster, names_to = "Variable", values_to = "Mean")

ggplot(long_summary, aes(x = Variable, y = Mean, fill = as.factor(Cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Mean of Variables by Cluster", x = "Variable", y = "Mean") +
  theme_minimal()

region_summary = whr_new %>%
  group_by(Cluster, Region) %>%
  summarise(Count = n(), .groups = "drop") %>%
  arrange(Cluster, desc(Count))

ggplot(region_summary, aes(x = as.factor(Cluster), y = Count, fill = Region)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Number of Countries by Region and Cluster", x = "Cluster", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top")

6.3. Selected Countries in Clusters

EN.
The selected countries are checked against the cluster assignments. In the original analysis, South Korea, Japan, China, Hungary, Slovakia, Slovenia, and Croatia were assigned to the same cluster, while several Nordic countries were assigned to another high-scoring cluster.

KR.
관심 국가들이 어떤 군집에 속하는지 확인했습니다. 기존 분석에서는 한국, 일본, 중국, 헝가리, 슬로바키아, 슬로베니아, 크로아티아가 같은 군집에 속했고, 북유럽 주요 국가들은 다른 고득점 군집에 속하는 것으로 나타났습니다.

interesting_df = whr_new %>%
  dplyr::filter(Country %in% interesting_countries)

interesting_df[c('Country', 'Cluster')]

##        Country Cluster
## 1      Austria       8
## 2        China      10
## 3      Croatia      10
## 4      Czechia       5
## 5      Denmark       8
## 6      Finland       8
## 7      Hungary      10
## 8      Iceland       5
## 9        Japan      10
## 10     Romania       2
## 11      Serbia       5
## 12    Slovakia      10
## 13    Slovenia      10
## 14 South Korea      10
## 15      Sweden       8

ggplot(whr_new, aes(x = LogGDP, y = Score, color = Cluster)) +
  geom_point(alpha = 0.6) +
  geom_point(data = interesting_df, aes(x = LogGDP, y = Score), color = "black", size = 3) +
  ggtitle("World Happiness by Log GDP with Selected Countries") +
  theme_minimal()

region_distribution = interesting_df %>%
  group_by(Region, Cluster, Country) %>%
  summarise(count = n(), .groups = 'drop')

ggplot(region_distribution, aes(x = Region, fill = as.factor(Cluster), y = count)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = Country), position = position_stack(vjust = 0.5), size = 3, check_overlap = TRUE) +
  ggtitle("Cluster Distribution by Region") +
  theme_minimal()

subregion_distribution = interesting_df %>%
  group_by(SubRegion, Cluster, Country) %>%
  summarise(count = n(), .groups = 'drop')

ggplot(subregion_distribution, aes(x = SubRegion, fill = as.factor(Cluster), y = count)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = Country), position = position_stack(vjust = 0.5), size = 3, check_overlap = TRUE) +
  ggtitle("Cluster Distribution by SubRegion") +
  theme_minimal()

7. Principal Component Analysis

EN.
PCA is applied to reduce the dimensionality of the explanatory variables while preserving most of the variance. This helps summarize multiple indicators into a smaller number of components that are easier to interpret.

KR.
여러 설명 변수를 더 적은 수의 축으로 요약하기 위해 PCA를 적용했습니다. 이를 통해 정보 손실을 줄이면서도 변수 구조를 더 간결하게 해석할 수 있습니다.

df_pca = whr_new %>%
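  dplyr::select(-Cluster, -Score, -upperwhisker, -lowerwhisker) %>%
  dplyr::select(where(is.numeric))  # select_if() is superseded by where()

pca_res = prcomp(df_pca, scale. = TRUE)  # the argument is named scale., with a trailing dot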

fviz_eig(pca_res, addlabels = TRUE, geom = "bar", bar_fill = "skyblue")

pca_res$rotation

##                         PC1         PC2         PC3         PC4         PC5
## LogGDP           0.50540701 -0.28927959 -0.01091025  0.04287087  0.09618203
## SocialSupport    0.48007894 -0.04944116  0.17718662  0.37283454  0.07607281
## LifeExpectancy   0.49942503 -0.21528282  0.03145675  0.10874463  0.24303850
## Freedom          0.37899917  0.39340427  0.06411383 -0.02705834 -0.82193779
## Generosity       0.07468933  0.64866833 -0.52881368  0.42679124  0.30329426
## Corruption       0.32802345  0.10023870 -0.43409802 -0.77491678  0.15172923
## DystopiaResidual 0.08856830  0.53098105  0.70377607 -0.25288767  0.36783601
##                          PC6         PC7
## LogGDP           -0.16279114  0.78941256
## SocialSupport     0.74236698 -0.19945748
## LifeExpectancy   -0.57259274 -0.55179966
## Freedom          -0.14323707 -0.02552224
## Generosity       -0.09677124  0.10248968
## Corruption        0.24345127 -0.10547720
## DystopiaResidual -0.07333363  0.10139382

fviz_pca_var(pca_res, col.var = "contrib", gradient.cols = c("blue", "green", "red"))

7.1. PCA Interpretation

EN.
The original analysis showed that three principal components were sufficient to explain approximately 80% of the variance. PC1 was mainly associated with LogGDP, LifeExpectancy, and SocialSupport. PC2 was influenced by Generosity and DystopiaResidual, while PC3 was related to DystopiaResidual, Generosity, and Corruption.

KR.
기존 분석 결과, 세 개의 주성분으로 약 80% 수준의 누적 설명력을 확보할 수 있었습니다. PC1은 주로 LogGDP, LifeExpectancy, SocialSupport와 관련되었고, PC2는 Generosity와 DystopiaResidual의 영향이 컸습니다. PC3는 DystopiaResidual, Generosity, Corruption과 관련성이 높게 나타났습니다.
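
The "approximately 80%" figure can be read off the scree plot, or computed directly from the standard deviations stored in the `prcomp` object. The sketch below shows the calculation on the built-in `USArrests` data, used here only as a stand-in for `df_pca`; the object name `pca_toy` and the 80% threshold variable are illustrative.

```r
# USArrests ships with R and is used here only as a stand-in for df_pca.
pca_toy <- prcomp(USArrests, scale. = TRUE)

# Each component's variance is its squared standard deviation.
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_explained <- cumsum(var_explained)
print(round(cum_explained, 3))

# Smallest number of components whose cumulative share reaches 80%.
n_components <- which(cum_explained >= 0.80)[1]
print(n_components)
```

Applied to `pca_res`, the same lines would state the exact cumulative share for three components instead of an approximate value.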

8. Multinomial Logistic Regression

EN.
A multinomial logistic regression model is trained using the first three principal components as predictors and cluster labels as the target variable. This step tests which principal components contribute to distinguishing cluster membership.

KR.
첫 세 개의 주성분을 설명 변수로 사용하고 군집 라벨을 종속 변수로 설정해 다항 로지스틱 회귀모형을 학습했습니다. 이 단계는 어떤 주성분이 군집 구분에 기여하는지 확인하기 위한 과정입니다.

pca_scores = data.frame(pca_res$x[, 1:3])
colnames(pca_scores) = paste0("PC", 1:3)

logit_data = cbind(pca_scores, Cluster = whr_new$Cluster)

logit = multinom(Cluster ~ ., data = logit_data)

## # weights:  50 (36 variable)
## initial  value 322.361913 
## iter  10 value 93.535186
## iter  20 value 54.715425
## iter  30 value 42.376508
## iter  40 value 38.119917
## iter  50 value 36.162987
## iter  60 value 34.516699
## iter  70 value 33.905992
## iter  80 value 33.709927
## iter  90 value 33.469725
## iter 100 value 33.174800
## final  value 33.174800 
## stopped after 100 iterations

summary(logit)

## Call:
## multinom(formula = Cluster ~ ., data = logit_data)
## 
## Coefficients:
##    (Intercept)       PC1       PC2         PC3
## 2     84.63824 67.247724 23.746593   1.4276877
## 3     47.78116 16.536280  5.788189  14.1045746
## 4     68.91433 31.459009 27.875945 -21.5715688
## 5     83.57561 73.526413 31.789008 -16.2217993
## 6     24.65002  6.385323  1.241900   5.1620740
## 7     66.65507 29.654972 29.847570 -20.8941607
## 8     74.31962 77.644844 30.486298 -20.5776998
## 9     37.00884  7.299077 14.078805  16.3068759
## 10    80.44084 72.306301 17.666545  -0.5224586
## 
## Std. Errors:
##    (Intercept)       PC1       PC2       PC3
## 2     36.44148 42.765586 10.881850 13.069422
## 3     30.13509 10.012607  6.775996  9.729776
## 4     30.77179 11.962410 11.036338 18.359411
## 5     36.40357 42.529568 11.748434 15.291844
## 6     23.19863  6.515801  2.718891  7.194339
## 7     30.76016 11.942285 11.058151 18.375236
## 8     36.83048 42.581189 11.912938 15.573643
## 9     28.29371  7.928251  6.465308  9.847288
## 10    36.40548 42.788454 10.800108 13.099919
## 
## Residual Deviance: 66.3496 
## AIC: 138.3496

# Wald z-statistics and two-sided p-values for each coefficient
logit_summary = summary(logit)
z_scores = logit_summary$coefficients / logit_summary$standard.errors
p_values = 2 * pnorm(-abs(z_scores))
print(p_values)

##    (Intercept)         PC1         PC2        PC3
## 2   0.02020175 0.115840923 0.029093278 0.91301311
## 3   0.11283788 0.098627168 0.392983343 0.14716175
## 4   0.02512147 0.008542958 0.011542441 0.24001081
## 5   0.02168686 0.083839440 0.006813843 0.28877451
## 6   0.28797981 0.327098369 0.647838447 0.47305398
## 7   0.03024029 0.013021136 0.006951755 0.25550377
## 8   0.04360342 0.068234641 0.010494610 0.18639610
## 9   0.19086536 0.357237587 0.029436410 0.09772661
## 10  0.02713430 0.091055613 0.101886735 0.96818673
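
As a worked check of the Wald calculation, one entry of the table can be reproduced by hand. The sketch below copies the PC1 coefficient and standard error for cluster 4 from the `summary(logit)` output above; the variable names `coef_pc1`, `se_pc1`, `z_pc1`, and `p_pc1` are hypothetical.

```r
# Values copied from the summary(logit) output above: cluster 4, PC1
coef_pc1 = 31.459009
se_pc1 = 11.962410

# Wald z-statistic: estimate divided by its standard error
z_pc1 = coef_pc1 / se_pc1

# Two-sided p-value under a standard normal reference distribution
p_pc1 = 2 * pnorm(-abs(z_pc1))

print(round(z_pc1, 2))  # 2.63
print(round(p_pc1, 4))  # 0.0085
```

The result agrees with the 0.008542958 printed for cluster 4 and PC1 in the p-value table.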

8.1. Regression Interpretation

EN.
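The p-values show that PC2 was the most consistent discriminator: relative to the reference cluster 1, its coefficient was significant at the 5% level for six of the nine other clusters (2, 4, 5, 7, 8, and 9). This indicates that not only economic and health-related variables but also the residual and generosity-related dimensions help differentiate country groups. PC1 was significant only for clusters 4 and 7, and PC3 was not significant for any cluster. Because the optimizer stopped at its 100-iteration limit and several coefficients have large standard errors, these Wald p-values should be read as indicative rather than conclusive.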

KR.
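p-value 기준으로 PC2가 군집 구분에 가장 일관되게 기여했습니다. 기준 군집(1번)과 비교했을 때 PC2의 계수는 나머지 9개 군집 중 6개(2, 4, 5, 7, 8, 9번)에서 5% 수준에서 유의했습니다. 이는 경제·건강 관련 변수뿐 아니라 잔차적 요인과 관대성 관련 차원도 국가군을 구분하는 데 의미가 있음을 시사합니다. PC1은 4번과 7번 군집에서만 유의했고, PC3는 어느 군집에서도 유의하지 않았습니다. 다만 최적화가 최대 반복 횟수인 100회에서 중단되었고 일부 계수의 표준오차가 크므로, 이 Wald p-value는 확정적 결론이 아니라 참고 수준으로 해석해야 합니다.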
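
The fitting log in Section 8 ends with `stopped after 100 iterations`, which means the optimizer reached its default iteration limit. One way to check this is sketched below, assuming the `nnet` package is available. The built-in `iris` data stands in for the PC scores and cluster labels, and `fit_default` and `fit_longer` are hypothetical names.

```r
library(nnet)  # multinom() is provided by the nnet package

# Illustrative sketch: iris stands in for the PC scores and cluster labels (assumption)
fit_default = multinom(Species ~ ., data = iris, trace = FALSE)
fit_longer = multinom(Species ~ ., data = iris, maxit = 1000, trace = FALSE)

# convergence is 0 when the optimizer converged and 1 when it hit the iteration limit
print(c(default = fit_default$convergence, longer = fit_longer$convergence))

# Residual deviance of each fit; the longer run should fit at least as well
print(c(default = fit_default$deviance, longer = fit_longer$deviance))
```

If the longer run still does not converge, or the coefficients keep growing, the classes may be almost perfectly separated by the predictors. In that case the individual coefficients and their standard errors are unstable even when the fit looks good.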

9. Key Findings

EN.
The analysis produced the following findings.

Happiness is not explained by income alone. SocialSupport, LifeExpectancy, and LogGDP showed strong relationships with Score, suggesting that happiness is shaped by both economic and non-economic conditions.
Regional context matters. European countries were frequently found in high-scoring clusters, while some African countries were concentrated in lower-scoring clusters.
South Korea and Hungary were analytically close in the cluster result. South Korea and neighboring East Asian countries were placed in the same cluster as Hungary and several of its neighboring European countries, suggesting that similar score structures can appear across different geographic regions.
PCA helped summarize the structure of the indicators. PC1 mainly represented economic, health, and social-support dimensions, while PC2 and PC3 captured additional dimensions such as generosity, residual factors, and corruption-related variation.
Cluster differentiation required more than one dimension. The multinomial logistic regression showed that the principal components contributed differently depending on the cluster (at the 5% level, PC2 was significant for six clusters, PC1 for two, and PC3 for none), so country groups cannot be explained along a single axis.

KR.
본 분석의 핵심 결과는 다음과 같습니다.

행복 점수는 소득만으로 설명되지 않습니다. SocialSupport, LifeExpectancy, LogGDP가 Score와 강한 관련성을 보였으며, 행복은 경제적 조건과 비경제적 조건이 함께 작용하는 지표임을 확인했습니다.
지역적 맥락이 중요합니다. 유럽 국가들은 고득점 군집에 자주 포함되었고, 일부 아프리카 국가는 상대적으로 낮은 점수의 군집에 집중되는 경향이 나타났습니다.
한국과 헝가리는 군집 결과에서 가까운 구조를 보였습니다. 한국 및 주변 동아시아 국가와 헝가리 및 일부 주변 유럽 국가가 같은 군집에 속해, 서로 다른 지역에서도 유사한 행복지표 구조가 나타날 수 있음을 확인했습니다.
PCA는 지표 구조를 요약하는 데 유용했습니다. PC1은 경제, 건강, 사회적 지지를 주로 설명했고, PC2와 PC3는 관대성, 잔차적 요인, 부패 인식 관련 차이를 포착했습니다.
국가군의 차이는 하나의 축으로만 설명되기 어렵습니다. 로지스틱 회귀 결과 군집별로 영향을 주는 주성분이 달랐기 때문에, 국가별 행복 구조는 다차원적으로 해석해야 합니다.

10. Limitations

EN.
This project has several limitations.

The analysis is based on single-year, cross-sectional country-level data, so it cannot identify causal relationships.
The cluster number was selected based on both statistical indicators and interpretability; different choices of K may lead to different groupings.
Regional labels were added from an external mapping dataset, and some countries required manual mapping.
Some variables in the World Happiness Report are based on survey responses, so cultural differences in response behavior may affect comparisons.

KR.
본 프로젝트에는 다음과 같은 한계가 있습니다.

단일 연도의 국가 단위 횡단면 데이터를 사용했기 때문에 인과관계를 식별할 수는 없습니다.
군집 수는 통계적 지표와 해석 가능성을 함께 고려해 선택했으므로, K값 선택에 따라 결과가 달라질 수 있습니다.
지역 라벨은 외부 매핑 데이터를 사용해 추가했으며, 일부 국가는 수동 매핑이 필요했습니다.
World Happiness Report의 일부 변수는 설문 응답을 기반으로 하므로, 문화권별 응답 방식 차이가 국가 간 비교에 영향을 줄 수 있습니다.
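
The sensitivity to K noted above can be examined by cross-tabulating the memberships produced by two candidate values. The sketch below is illustrative only: scaled `USArrests` data stands in for the WHR indicators, k-means stands in for the project's clustering method (an assumption), and `fit_k3`, `fit_k4`, and `membership_table` are hypothetical names.

```r
# Illustrative sketch: scaled USArrests data stands in for the WHR indicators (assumption),
# and k-means stands in for the project's clustering method (assumption)
set.seed(42)
x = scale(USArrests)

# Fit k-means for two candidate values of K
fit_k3 = kmeans(x, centers = 3, nstart = 25)
fit_k4 = kmeans(x, centers = 4, nstart = 25)

# Cross-tabulate memberships: rows are K = 3 clusters, columns are K = 4 clusters
membership_table = table(K3 = fit_k3$cluster, K4 = fit_k4$cluster)
print(membership_table)

# Total within-cluster sum of squares for each K
print(c(K3 = fit_k3$tot.withinss, K4 = fit_k4$tot.withinss))
```

A row whose counts are spread over several columns marks a cluster that is split when K increases. Rows concentrated in a single column mark groupings that are stable across the two choices.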

11. Conclusion

EN.
This project shows that national happiness should not be interpreted only as a country ranking or as a result of income level alone. The analysis suggests that happiness scores are closely related to multiple dimensions, including social support, healthy life expectancy, economic conditions, and institutional trust.

By combining correlation analysis, clustering, PCA, and multinomial logistic regression, the project identified both the variables associated with happiness scores and the structural differences among country groups. The results indicate that countries with similar happiness profiles can appear across different geographic regions, and that the differences among clusters should be interpreted through multiple dimensions rather than a single factor.

KR.
이 프로젝트는 국가별 행복도를 단순한 순위나 경제 수준만으로 해석하기 어렵다는 점을 보여줍니다. 분석 결과, 행복 점수는 사회적 지지, 건강 기대수명, 경제적 조건, 제도 신뢰와 같은 여러 차원과 함께 해석될 필요가 있음을 확인했습니다.

상관분석, 군집분석, PCA, 다항 로지스틱 회귀를 함께 활용해 행복 점수와 관련된 주요 변수뿐 아니라 국가군 간 구조적 차이도 확인했습니다. 그 결과, 지리적으로 다른 국가들도 유사한 행복지표 구조를 가질 수 있으며, 군집 간 차이는 하나의 요인보다 여러 차원을 함께 고려해 해석해야 함을 확인했습니다.

12. References

World Happiness Report 2024: https://www.worldhappiness.report/ed/2024/
World Happiness Report 2024 Statistical Appendix: https://files.worldhappiness.report/WHR24_Statistical_Appendix.pdf
Country Mapping Dataset: https://www.kaggle.com/datasets/andradaolteanu/country-mapping-iso-continent-region

World Happiness Report 2024 Analysis

Modern Computer-Based Methods of Statistics

Saeun Park