THE WORLD HAPPINESS REPORT

Latar Belakang Dataset

The world happiness report adalah survei tentang keadaan kebahagiaan secara global. Laporan pertama diterbitkan pada 2012, kedua pada 2013, ketiga pada 2015, dan keempat pada pembaharuan 2016. Laporan ini mendapat pengakuan global karena pemerintah, organisasi, dan masyarakat sipil semakin menggunakan indikator kebahagiaan untuk menginformasikan keputusan pembuatan kebijakan mereka. Laporan meninjau keadaan kebahagiaana di dunia saat ini.

Skor dan peringkat kebahagiaan menggunakan data dari Gallup World Poll. Skor didasarkan pada jawaban atas pertanyaan evaluasi kehidupan utama yang diajukan dalam jajak pendapat. Pertanyaan ini, yang dikenal sebagai tangga Cantril, meminta responden untuk memikirkan tangga dengan kemungkinan kehidupan terbaik bagi mereka adalah 10 dan kemungkinan kehidupan terburuk adalah 0 dan untuk menilai kehidupan mereka sendiri saat ini pada skala itu.

Kolom yang mengikuti skor kebahagiaan memperkirakan sejauh mana masing-masing dari enam faktor - produksi ekonomi, dukungan sosial, harapan hidup, kebebasan, tidak adanya korupsi, dan kemurahan hati - berkontribusi membuat evaluasi kehidupan lebih tinggi di setiap negara daripada di Dystopia, sebuah negara dengan hipotesis yang memiliki nilai sama dengan rata-rata nasional terendah di dunia untuk masing-masing dari enam faktor. Mereka tidak berdampak pada skor total yang dilaporkan untuk setiap negara, tetapi mereka menjelaskan mengapa beberapa negara memiliki peringkat lebih tinggi daripada yang lain.

Pada mini project ini, Saya akan menggunakan dataset laporan tahun 2019.

Pertanyaan Analisis

Faktor apa yang paling mempengaruhi tingkat kebahagiaan di dunia pada tahun 2019?

EDA (Exploratory Data Analysis)

# Import Library
library(heatmaply)
## Warning: package 'heatmaply' was built under R version 4.0.5
## Loading required package: plotly
## Warning: package 'plotly' was built under R version 4.0.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.5
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: viridis
## Warning: package 'viridis' was built under R version 4.0.5
## Loading required package: viridisLite
## Warning: package 'viridisLite' was built under R version 4.0.5
## 
## ======================
## Welcome to heatmaply version 1.3.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags: 
##   https://stackoverflow.com/questions/tagged/heatmaply
## ======================
library(visdat)
## Warning: package 'visdat' was built under R version 4.0.5
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.0.5
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
library(ggplot2) 
library(psych)
## Warning: package 'psych' was built under R version 4.0.5
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.0.5
library(plotly)
library(httr)
## Warning: package 'httr' was built under R version 4.0.5
## 
## Attaching package: 'httr'
## The following object is masked from 'package:plotly':
## 
##     config
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rjson)
## Warning: package 'rjson' was built under R version 4.0.3
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.0.5
library(RCurl)
## Warning: package 'RCurl' was built under R version 4.0.5
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
# Mengakses Data Set
df<- read.csv("D:/1. KULIAH/Lain-lain/PSDS/Kelas Dasar 3.0/World_Happiness_Report_2019.csv")
df[df==""] <- NA
df[0:10,]
##    Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1             1           Finland 7.769          1.340          1.587
## 2             2           Denmark 7.600          1.383          1.573
## 3             3            Norway 7.554          1.488          1.582
## 4             4           Iceland 7.494          1.380          1.624
## 5             5       Netherlands 7.488          1.396          1.522
## 6             6       Switzerland 7.480          1.452          1.526
## 7             7            Sweden 7.343          1.387          1.487
## 8             8       New Zealand 7.307          1.303          1.557
## 9             9            Canada 7.278          1.365          1.505
## 10           10           Austria 7.246          1.376          1.475
##    Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1                    0.986                        0.596      0.153
## 2                    0.996                        0.592      0.252
## 3                    1.028                        0.603      0.271
## 4                    1.026                        0.591      0.354
## 5                    0.999                        0.557      0.322
## 6                    1.052                        0.572      0.263
## 7                    1.009                        0.574      0.267
## 8                    1.026                        0.585      0.330
## 9                    1.039                        0.584      0.285
## 10                   1.016                        0.532      0.244
##    Perceptions.of.corruption
## 1                      0.393
## 2                      0.410
## 3                      0.341
## 4                      0.118
## 5                      0.298
## 6                      0.343
## 7                      0.373
## 8                      0.380
## 9                      0.308
## 10                     0.226

Tujuan Analisis Data

Melakukan analisis untuk memperoleh faktor yang paling mempengaruhi tingkat kebahagiaan.

Memahami Dataset

Variabel-variabel yang terdapat pada dataset ini adalah sebagai berikut:

  1. Country or region = negara atau wilayah milik negara
  2. Overall rank = Peringkat negara berdasarkan Skor Kebahagiaan
  3. Score = Sebuah metrik yang diukur pada tahun 2019 dengan mengajukan pertanyaan kepada orang-orang yang dijadikan sampel: “Bagaimana Anda menilai kebahagiaan Anda dalam skala 0 hingga 10 di mana 10 adalah yang paling bahagia.”
  4. GDP per capita = Sejauh mana kontribusi GDP terhadap perhitungan Skor Kebahagiaan.
  5. Social support = Sejauh mana Keluarga atau dukungan sosial berkontribusi dalam penghitungan Skor Kebahagiaan
  6. Healthy life expectancy = Sejauh mana Angka Harapan Hidup berkontribusi terhadap perhitungan Skor Kebahagiaan
  7. Freedom to make life choices = Sejauh mana Kebebasan berkontribusi pada perhitungan Happiness Score.
  8. Generosity = Sejauh mana Kedermawanan berkontribusi dalam penghitungan Skor Kebahagiaan.
  9. Perceptions of corruption = Sejauh mana Persepsi Korupsi berkontribusi terhadap Skor Kebahagiaan.
## DIMENSI DATA
dim(df)
## [1] 156   9

Artinya kita memiliki data dengan 9 kolom dan 156 baris.

## VARIABEL DATA SET
names(df)
## [1] "Overall.rank"                 "Country.or.region"           
## [3] "Score"                        "GDP.per.capita"              
## [5] "Social.support"               "Healthy.life.expectancy"     
## [7] "Freedom.to.make.life.choices" "Generosity"                  
## [9] "Perceptions.of.corruption"
# Mengubah nama setiap kolom terkhusus kolom yang akan dianalisis
names(df)[1]<-paste("Rank")
names(df)[2]<-paste("Country")
names(df)[6]<-paste("Life Expectancy")
names(df)[7]<-paste("Freedom")
names(df)
## [1] "Rank"                      "Country"                  
## [3] "Score"                     "GDP.per.capita"           
## [5] "Social.support"            "Life Expectancy"          
## [7] "Freedom"                   "Generosity"               
## [9] "Perceptions.of.corruption"
str(df)
## 'data.frame':    156 obs. of  9 variables:
##  $ Rank                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country                  : chr  "Finland" "Denmark" "Norway" "Iceland" ...
##  $ Score                    : num  7.77 7.6 7.55 7.49 7.49 ...
##  $ GDP.per.capita           : num  1.34 1.38 1.49 1.38 1.4 ...
##  $ Social.support           : num  1.59 1.57 1.58 1.62 1.52 ...
##  $ Life Expectancy          : num  0.986 0.996 1.028 1.026 0.999 ...
##  $ Freedom                  : num  0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
##  $ Generosity               : num  0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
##  $ Perceptions.of.corruption: num  0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...
## MENGECEK MISSING DATA
sapply(df, function(x) sum(is.na(x)))
##                      Rank                   Country                     Score 
##                         0                         0                         0 
##            GDP.per.capita            Social.support           Life Expectancy 
##                         0                         0                         0 
##                   Freedom                Generosity Perceptions.of.corruption 
##                         0                         0                         0
heatmaply_na(
  df[1:20,],
  showticklabels = c(TRUE, FALSE)
)
vis_miss(df)

Dari ketiga hasil di atas diperoleh bahwa tidak ada kolom dengan missing data. Jadi tidak perlu melakukan step hapus data.

## MENGECEK OUTLIER
num_cols <- unlist(lapply(df, is.numeric)) #Memilih kolom bertipe numerik
df_num <- df[ , num_cols]  
boxplot(df_num)

Titik lingkaran di luar boxplot adalah outlier. Sehingga kolom yang mengandung outlier adalah Social.support, Life.Expectancy, Freedom, Generosity, Perceptions.

## MELIHAT KORELASI DATA
plot_correlation(df_num)

## MELIHAT STATISTIK DATA
summary(df)
##       Rank          Country              Score       GDP.per.capita  
##  Min.   :  1.00   Length:156         Min.   :2.853   Min.   :0.0000  
##  1st Qu.: 39.75   Class :character   1st Qu.:4.545   1st Qu.:0.6028  
##  Median : 78.50   Mode  :character   Median :5.380   Median :0.9600  
##  Mean   : 78.50                      Mean   :5.407   Mean   :0.9051  
##  3rd Qu.:117.25                      3rd Qu.:6.184   3rd Qu.:1.2325  
##  Max.   :156.00                      Max.   :7.769   Max.   :1.6840  
##  Social.support  Life Expectancy     Freedom         Generosity    
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.056   1st Qu.:0.5477   1st Qu.:0.3080   1st Qu.:0.1087  
##  Median :1.272   Median :0.7890   Median :0.4170   Median :0.1775  
##  Mean   :1.209   Mean   :0.7252   Mean   :0.3926   Mean   :0.1848  
##  3rd Qu.:1.452   3rd Qu.:0.8818   3rd Qu.:0.5072   3rd Qu.:0.2482  
##  Max.   :1.624   Max.   :1.1410   Max.   :0.6310   Max.   :0.5660  
##  Perceptions.of.corruption
##  Min.   :0.0000           
##  1st Qu.:0.0470           
##  Median :0.0855           
##  Mean   :0.1106           
##  3rd Qu.:0.1412           
##  Max.   :0.4530
d <- melt(df_num)
## No id variables; using all as measure variables
ggplot(d,aes(x = value)) + 
    facet_wrap(~variable,scales = "free_x") + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## PAIR PLOT
pairs.panels(df_num, 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

Data Preprocessing

# SELEKSI KOLOM
df$Rank<- NULL
df$Score<- NULL
df$year<- NULL
head(df)
##       Country GDP.per.capita Social.support Life Expectancy Freedom Generosity
## 1     Finland          1.340          1.587           0.986   0.596      0.153
## 2     Denmark          1.383          1.573           0.996   0.592      0.252
## 3      Norway          1.488          1.582           1.028   0.603      0.271
## 4     Iceland          1.380          1.624           1.026   0.591      0.354
## 5 Netherlands          1.396          1.522           0.999   0.557      0.322
## 6 Switzerland          1.452          1.526           1.052   0.572      0.263
##   Perceptions.of.corruption
## 1                     0.393
## 2                     0.410
## 3                     0.341
## 4                     0.118
## 5                     0.298
## 6                     0.343
# MENGHAPUS OUTLIER
is_outlier <- function(x, na.rm = FALSE) {
  qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

  lowerq <- qs[1]
  upperq <- qs[2]
  iqr = upperq - lowerq 

  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)

  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df<- df[!is_outlier(df[[col]]),]
  }
  df
}


vars_of_interest <- c("Social.support", "Life.Expectancy", "Freedom", "Generosity","Perceptions")


df_filtered <- remove_outliers(df, vars_of_interest)
## Removing outliers in column:  Social.support  
## Removing outliers in column:  Life.Expectancy  
## Removing outliers in column:  Freedom  
## Removing outliers in column:  Generosity  
## Removing outliers in column:  Perceptions
boxplot(df_filtered)
# NORMALISASI DATA
unit_length <- function(x) {
                            x / sqrt(sum(x^2))
                            }
unit_length_df <- as.data.frame(lapply(df_num, unit_length))

head(unit_length_df)
##           Rank     Score GDP.per.capita Social.support Life.Expectancy
## 1 0.0008846904 0.1126892      0.1085420     0.10205270       0.1032818
## 2 0.0017693809 0.1102379      0.1120250     0.10115243       0.1043293
## 3 0.0026540713 0.1095707      0.1205302     0.10173118       0.1076813
## 4 0.0035387617 0.1087004      0.1117820     0.10443200       0.1074718
## 5 0.0044234522 0.1086133      0.1130781     0.09787285       0.1046436
## 6 0.0053081426 0.1084973      0.1176141     0.09813007       0.1101952
##     Freedom Generosity Perceptions.of.corruption
## 1 0.1142277 0.05894826                0.21654835
## 2 0.1134611 0.09709125                0.22591558
## 3 0.1155693 0.10441162                0.18789564
## 4 0.1132694 0.13639009                0.06501961
## 5 0.1067531 0.12406104                0.16420206
## 6 0.1096279 0.10132936                0.18899767

Data Visualization

library(plotly)
fig <- plot_ly(x = df['Country'],
               y = df['GDP.per.capita'],
               name = "GDP per capita",
               line = list(color = 'rgb(22, 96, 167)'),
               marker = list(color = 'rgb(22, 96, 167)'),
               type = 'scatter', mode = 'lines+markers')

fig <- fig %>% add_trace (x = df['Country'],
                          y = df['Social.support'],
                          name = "Social support",
                          line = list(color = 'rgb(96, 167, 22)'), 
                          marker = list(color = 'rgb(96, 167, 22)'),
                          type = 'scatter', mode = 'lines+markers') 

fig <- fig %>% add_trace(x = df['Country'],
                         y = df['Freedom'],
                         name = "Freedom",
                         line = list(color = 'rgb(255, 0, 0)'), 
                         marker = list(color = 'rgb(255, 0, 0)'),
                         type = 'scatter', mode = 'lines+markers')

fig <- fig %>% add_trace(x = df['Country'],
                         y = df['Generosity'],
                         name = "Generosity",
                         line = list(color = 'rgb(255, 255, 0)'), 
                         marker = list(color = 'rgb(255, 255, 0)'),
                         type = 'scatter', mode = 'lines+markers')

fig <-fig %>% add_trace(x = df['Country'],
                        y = df['Perceptions.of.corruption'],
                        name = "Perceptions of corruption",
                        line = list(color = 'rgb(0, 0, 128)'), 
                        marker = list(color = 'rgb(0, 0, 128)'),
                        type = 'scatter', mode = 'lines+markers')

fig <- fig %>% layout(title = "Faktor Yang Mempengaruhi Tingkat Kebahagiaan di Dunia",
                         xaxis = list(title = "Country"),
                         yaxis = list (title = "Happiness"))

fig