Group 9

Nabila Chesaria Octavia Putri	5052241006
Amelia Widiastuti	5052241007
Agata Corinna Aulia Widyawati	5052241036

Billionaire Data

The Billionaire dataset is a collection of information about the world's richest individuals. It covers attributes such as name, country of origin, total wealth, age, industry, global rank, and other relevant details. Analyzing it can yield insights into the factors behind becoming a billionaire, such as country of origin, industry sector, and whether the wealth was inherited.

What Can We Learn from This Data?

To find out what this data can answer, we formulated three main questions:

What patterns appear in the distribution of billionaire wealth across countries and industries?
Do self-made status and gender affect total wealth?
How are age and wealth related among billionaires?

Before moving on to visualization, we first process the data until it is clean and ready for analysis.

1. Pre-Processing Data

Before the analysis, we pre-process the raw data to clean it and bring it into a format that is ready for further analysis.

Load Library

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(ggcorrplot)

Read Data

billion <- read.csv("FIX_BILLIONAIRE.csv")
glimpse(billion)

## Rows: 2,640
## Columns: 35
## $ rank                                       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, …
## $ finalWorth                                 <int> 211000, 180000, 114000, 107…
## $ category                                   <chr> "Fashion & Retail", "Automo…
## $ personName                                 <chr> "Bernard Arnault & family",…
## $ age                                        <int> 74, 51, 59, 78, 92, 67, 81,…
## $ country                                    <chr> "France", "United States", …
## $ city                                       <chr> "Paris", "Austin", "Medina"…
## $ source                                     <chr> "LVMH", "Tesla, SpaceX", "A…
## $ industries                                 <chr> "Fashion & Retail", "Automo…
## $ countryOfCitizenship                       <chr> "France", "United States", …
## $ organization                               <chr> "LVMH Moët Hennessy Louis V…
## $ selfMade                                   <lgl> FALSE, TRUE, TRUE, TRUE, TR…
## $ status                                     <chr> "U", "D", "D", "U", "D", "D…
## $ gender                                     <chr> "M", "M", "M", "M", "M", "M…
## $ birthDate                                  <chr> "3/5/1949 0:00", "6/28/1971…
## $ lastName                                   <chr> "Arnault", "Musk", "Bezos",…
## $ firstName                                  <chr> "Bernard", "Elon", "Jeff", …
## $ title                                      <chr> "Chairman and CEO", "CEO", …
## $ date                                       <chr> "4/4/2023 5:01", "4/4/2023 …
## $ state                                      <chr> "", "Texas", "Washington", …
## $ residenceStateRegion                       <chr> "", "South", "West", "West"…
## $ birthYear                                  <int> 1949, 1971, 1964, 1944, 193…
## $ birthMonth                                 <int> 3, 6, 1, 8, 8, 10, 2, 1, 4,…
## $ birthDay                                   <int> 5, 28, 12, 17, 30, 28, 14, …
## $ cpi_country                                <dbl> 110.05, 117.24, 117.24, 117…
## $ cpi_change_country                         <dbl> 1.1, 7.5, 7.5, 7.5, 7.5, 7.…
## $ gdp_country                                <chr> "$2,715,518,274,227 ", "$21…
## $ gross_tertiary_education_enrollment        <dbl> 65.6, 88.2, 88.2, 88.2, 88.…
## $ gross_primary_education_enrollment_country <dbl> 102.5, 101.8, 101.8, 101.8,…
## $ life_expectancy_country                    <dbl> 82.5, 78.5, 78.5, 78.5, 78.…
## $ tax_revenue_country_country                <dbl> 24.2, 9.6, 9.6, 9.6, 9.6, 9…
## $ total_tax_rate_country                     <dbl> 60.7, 36.6, 36.6, 36.6, 36.…
## $ population_country                         <int> 67059887, 328239523, 328239…
## $ latitude_country                           <dbl> 46.22764, 37.09024, 37.0902…
## $ longitude_country                          <dbl> 2.213749, -95.712891, -95.7…

The output shows empty strings (“”) in several character columns; these represent missing values (NA). In addition, gdp_country and date are stored as character, whereas gdp_country should be numeric and date should be a Date.

Fix incorrect data types

library(readr)

billion <- billion %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y")) # month/day/year, as in birthDate

billion <- billion %>% 
  mutate(gdp_country = as.character(gdp_country), # make sure the type is character
         gdp_country = parse_number(gdp_country)) # strip the "$" sign and commas

glimpse(billion)

## Rows: 2,640
## Columns: 35
## $ rank                                       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, …
## $ finalWorth                                 <int> 211000, 180000, 114000, 107…
## $ category                                   <chr> "Fashion & Retail", "Automo…
## $ personName                                 <chr> "Bernard Arnault & family",…
## $ age                                        <int> 74, 51, 59, 78, 92, 67, 81,…
## $ country                                    <chr> "France", "United States", …
## $ city                                       <chr> "Paris", "Austin", "Medina"…
## $ source                                     <chr> "LVMH", "Tesla, SpaceX", "A…
## $ industries                                 <chr> "Fashion & Retail", "Automo…
## $ countryOfCitizenship                       <chr> "France", "United States", …
## $ organization                               <chr> "LVMH Moët Hennessy Louis V…
## $ selfMade                                   <lgl> FALSE, TRUE, TRUE, TRUE, TR…
## $ status                                     <chr> "U", "D", "D", "U", "D", "D…
## $ gender                                     <chr> "M", "M", "M", "M", "M", "M…
## $ birthDate                                  <chr> "3/5/1949 0:00", "6/28/1971…
## $ lastName                                   <chr> "Arnault", "Musk", "Bezos",…
## $ firstName                                  <chr> "Bernard", "Elon", "Jeff", …
## $ title                                      <chr> "Chairman and CEO", "CEO", …
## $ date                                       <date> 2023-04-04, 2023-04-04, 20…
## $ state                                      <chr> "", "Texas", "Washington", …
## $ residenceStateRegion                       <chr> "", "South", "West", "West"…
## $ birthYear                                  <int> 1949, 1971, 1964, 1944, 193…
## $ birthMonth                                 <int> 3, 6, 1, 8, 8, 10, 2, 1, 4,…
## $ birthDay                                   <int> 5, 28, 12, 17, 30, 28, 14, …
## $ cpi_country                                <dbl> 110.05, 117.24, 117.24, 117…
## $ cpi_change_country                         <dbl> 1.1, 7.5, 7.5, 7.5, 7.5, 7.…
## $ gdp_country                                <dbl> 2.715518e+12, 2.142770e+13,…
## $ gross_tertiary_education_enrollment        <dbl> 65.6, 88.2, 88.2, 88.2, 88.…
## $ gross_primary_education_enrollment_country <dbl> 102.5, 101.8, 101.8, 101.8,…
## $ life_expectancy_country                    <dbl> 82.5, 78.5, 78.5, 78.5, 78.…
## $ tax_revenue_country_country                <dbl> 24.2, 9.6, 9.6, 9.6, 9.6, 9…
## $ total_tax_rate_country                     <dbl> 60.7, 36.6, 36.6, 36.6, 36.…
## $ population_country                         <int> 67059887, 328239523, 328239…
## $ latitude_country                           <dbl> 46.22764, 37.09024, 37.0902…
## $ longitude_country                          <dbl> 2.213749, -95.712891, -95.7…

The columns now have the correct types, so the data can be cleaned further.
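The two conversions can be tried on a couple of hand-written values before they are applied to the whole table. The sketch below uses a small made-up `sample_raw` table (not the real file) and assumes the timestamps follow the month/day/year pattern visible in `birthDate`.

```r
library(dplyr)
library(readr)

# Hypothetical rows that mimic the raw format of `date` and `gdp_country`
sample_raw <- tibble(
  date = c("4/4/2023 5:01", "12/31/2022 0:00"),
  gdp_country = c("$2,715,518,274,227 ", "$21,427,700,000,000 ")
)

sample_fixed <- sample_raw %>%
  mutate(
    date = as.Date(date, format = "%m/%d/%Y"),  # the time-of-day part is ignored
    gdp_country = parse_number(gdp_country)     # drops "$", commas and spaces
  )

sample_fixed
```

A second row with a day larger than 12 is included on purpose: it would come back as NA under a day/month/year format, which makes a wrong format string easy to spot.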

Check duplicates and total NA

billion %>% duplicated() %>% sum()

## [1] 0

billion %>% filter(duplicated(.)) #show duplicated

##  [1] rank                                      
##  [2] finalWorth                                
##  [3] category                                  
##  [4] personName                                
##  [5] age                                       
##  [6] country                                   
##  [7] city                                      
##  [8] source                                    
##  [9] industries                                
## [10] countryOfCitizenship                      
## [11] organization                              
## [12] selfMade                                  
## [13] status                                    
## [14] gender                                    
## [15] birthDate                                 
## [16] lastName                                  
## [17] firstName                                 
## [18] title                                     
## [19] date                                      
## [20] state                                     
## [21] residenceStateRegion                      
## [22] birthYear                                 
## [23] birthMonth                                
## [24] birthDay                                  
## [25] cpi_country                               
## [26] cpi_change_country                        
## [27] gdp_country                               
## [28] gross_tertiary_education_enrollment       
## [29] gross_primary_education_enrollment_country
## [30] life_expectancy_country                   
## [31] tax_revenue_country_country               
## [32] total_tax_rate_country                    
## [33] population_country                        
## [34] latitude_country                          
## [35] longitude_country                         
## <0 rows> (or 0-length row.names)

billion <- billion %>%
  mutate(across(where(is.character), ~na_if(., "")))

colSums(is.na(billion))

##                                       rank 
##                                          0 
##                                 finalWorth 
##                                          0 
##                                   category 
##                                          0 
##                                 personName 
##                                          0 
##                                        age 
##                                         65 
##                                    country 
##                                         38 
##                                       city 
##                                         72 
##                                     source 
##                                          0 
##                                 industries 
##                                          0 
##                       countryOfCitizenship 
##                                          0 
##                               organization 
##                                       2315 
##                                   selfMade 
##                                          0 
##                                     status 
##                                          0 
##                                     gender 
##                                          0 
##                                  birthDate 
##                                         76 
##                                   lastName 
##                                          0 
##                                  firstName 
##                                          3 
##                                      title 
##                                       2301 
##                                       date 
##                                          0 
##                                      state 
##                                       1887 
##                       residenceStateRegion 
##                                       1893 
##                                  birthYear 
##                                         76 
##                                 birthMonth 
##                                         76 
##                                   birthDay 
##                                         76 
##                                cpi_country 
##                                        184 
##                         cpi_change_country 
##                                        184 
##                                gdp_country 
##                                        164 
##        gross_tertiary_education_enrollment 
##                                        182 
## gross_primary_education_enrollment_country 
##                                        181 
##                    life_expectancy_country 
##                                        182 
##                tax_revenue_country_country 
##                                        183 
##                     total_tax_rate_country 
##                                        182 
##                         population_country 
##                                        164 
##                           latitude_country 
##                                        164 
##                          longitude_country 
##                                        164
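Raw counts are easier to compare when expressed as a share of rows. The following is a minimal sketch on a made-up `toy` table (the column names are borrowed from the dataset, the values are invented) that converts empty strings to NA and then ranks columns by their share of missing values.

```r
library(dplyr)

# Hypothetical mini table with empty strings standing in for missing text
toy <- tibble(
  city  = c("Paris", "", "Austin", ""),
  title = c("", "", "CEO", ""),
  age   = c(74L, NA, 51L, 59L)
)

toy_na <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

# Share of missing values per column, sorted from most to least missing
na_share <- sort(colMeans(is.na(toy_na)), decreasing = TRUE)
na_share
```

Applied to `billion`, the same two steps would show at a glance which columns are mostly empty and which only miss a handful of values.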

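Replace NA in character columns with “Unknown”

# all character columns, including those with more than 1,500 NA
# (organization, title, state, residenceStateRegion)
na_cols <- billion %>% 
  select(where(is.character)) %>%
  names()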

billion <- billion %>% 
  mutate(across(all_of(na_cols), ~replace_na(., "Unknown")))

glimpse(billion)

## Rows: 2,640
## Columns: 35
## $ rank                                       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, …
## $ finalWorth                                 <int> 211000, 180000, 114000, 107…
## $ category                                   <chr> "Fashion & Retail", "Automo…
## $ personName                                 <chr> "Bernard Arnault & family",…
## $ age                                        <int> 74, 51, 59, 78, 92, 67, 81,…
## $ country                                    <chr> "France", "United States", …
## $ city                                       <chr> "Paris", "Austin", "Medina"…
## $ source                                     <chr> "LVMH", "Tesla, SpaceX", "A…
## $ industries                                 <chr> "Fashion & Retail", "Automo…
## $ countryOfCitizenship                       <chr> "France", "United States", …
## $ organization                               <chr> "LVMH Moët Hennessy Louis V…
## $ selfMade                                   <lgl> FALSE, TRUE, TRUE, TRUE, TR…
## $ status                                     <chr> "U", "D", "D", "U", "D", "D…
## $ gender                                     <chr> "M", "M", "M", "M", "M", "M…
## $ birthDate                                  <chr> "3/5/1949 0:00", "6/28/1971…
## $ lastName                                   <chr> "Arnault", "Musk", "Bezos",…
## $ firstName                                  <chr> "Bernard", "Elon", "Jeff", …
## $ title                                      <chr> "Chairman and CEO", "CEO", …
## $ date                                       <date> 2023-04-04, 2023-04-04, 20…
## $ state                                      <chr> "Unknown", "Texas", "Washin…
## $ residenceStateRegion                       <chr> "Unknown", "South", "West",…
## $ birthYear                                  <int> 1949, 1971, 1964, 1944, 193…
## $ birthMonth                                 <int> 3, 6, 1, 8, 8, 10, 2, 1, 4,…
## $ birthDay                                   <int> 5, 28, 12, 17, 30, 28, 14, …
## $ cpi_country                                <dbl> 110.05, 117.24, 117.24, 117…
## $ cpi_change_country                         <dbl> 1.1, 7.5, 7.5, 7.5, 7.5, 7.…
## $ gdp_country                                <dbl> 2.715518e+12, 2.142770e+13,…
## $ gross_tertiary_education_enrollment        <dbl> 65.6, 88.2, 88.2, 88.2, 88.…
## $ gross_primary_education_enrollment_country <dbl> 102.5, 101.8, 101.8, 101.8,…
## $ life_expectancy_country                    <dbl> 82.5, 78.5, 78.5, 78.5, 78.…
## $ tax_revenue_country_country                <dbl> 24.2, 9.6, 9.6, 9.6, 9.6, 9…
## $ total_tax_rate_country                     <dbl> 60.7, 36.6, 36.6, 36.6, 36.…
## $ population_country                         <int> 67059887, 328239523, 328239…
## $ latitude_country                           <dbl> 46.22764, 37.09024, 37.0902…
## $ longitude_country                          <dbl> 2.213749, -95.712891, -95.7…

Drop the remaining NA

bill_clean <- billion %>% 
  distinct() %>% 
  drop_na(where(is.numeric)) 
  
summary(bill_clean)

##       rank        finalWorth       category          personName       
##  Min.   :   1   Min.   :  1000   Length:2397        Length:2397       
##  1st Qu.: 636   1st Qu.:  1500   Class :character   Class :character  
##  Median :1272   Median :  2400   Mode  :character   Mode  :character  
##  Mean   :1276   Mean   :  4759                                        
##  3rd Qu.:1905   3rd Qu.:  4300                                        
##  Max.   :2540   Max.   :211000                                        
##       age           country              city              source         
##  Min.   : 18.00   Length:2397        Length:2397        Length:2397       
##  1st Qu.: 56.00   Class :character   Class :character   Class :character  
##  Median : 65.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 64.96                                                           
##  3rd Qu.: 74.00                                                           
##  Max.   :101.00                                                           
##   industries        countryOfCitizenship organization        selfMade      
##  Length:2397        Length:2397          Length:2397        Mode :logical  
##  Class :character   Class :character     Class :character   FALSE:713      
##  Mode  :character   Mode  :character     Mode  :character   TRUE :1684     
##                                                                            
##                                                                            
##                                                                            
##     status             gender           birthDate           lastName        
##  Length:2397        Length:2397        Length:2397        Length:2397       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   firstName            title                date               state          
##  Length:2397        Length:2397        Min.   :2023-04-04   Length:2397       
##  Class :character   Class :character   1st Qu.:2023-04-04   Class :character  
##  Mode  :character   Mode  :character   Median :2023-04-04   Mode  :character  
##                                        Mean   :2023-04-04                     
##                                        3rd Qu.:2023-04-04                     
##                                        Max.   :2023-04-04                     
##  residenceStateRegion   birthYear      birthMonth        birthDay    
##  Length:2397          Min.   :1921   Min.   : 1.000   Min.   : 1.00  
##  Class :character     1st Qu.:1948   1st Qu.: 2.000   1st Qu.: 1.00  
##  Mode  :character     Median :1958   Median : 6.000   Median :11.00  
##                       Mean   :1957   Mean   : 5.757   Mean   :12.28  
##                       3rd Qu.:1967   3rd Qu.: 9.000   3rd Qu.:21.00  
##                       Max.   :2004   Max.   :12.000   Max.   :31.00  
##   cpi_country     cpi_change_country  gdp_country       
##  Min.   : 99.55   Min.   :-1.900     Min.   :1.367e+10  
##  1st Qu.:117.24   1st Qu.: 1.700     1st Qu.:1.736e+12  
##  Median :117.24   Median : 2.900     Median :1.991e+13  
##  Mean   :127.90   Mean   : 4.401     Mean   :1.171e+13  
##  3rd Qu.:125.08   3rd Qu.: 7.500     3rd Qu.:2.143e+13  
##  Max.   :288.57   Max.   :53.500     Max.   :2.143e+13  
##  gross_tertiary_education_enrollment gross_primary_education_enrollment_country
##  Min.   :  4.00                      Min.   : 84.7                             
##  1st Qu.: 50.60                      1st Qu.:100.2                             
##  Median : 67.00                      Median :101.8                             
##  Mean   : 67.47                      Mean   :102.9                             
##  3rd Qu.: 88.20                      3rd Qu.:102.6                             
##  Max.   :136.60                      Max.   :142.1                             
##  life_expectancy_country tax_revenue_country_country total_tax_rate_country
##  Min.   :54.3            Min.   : 0.10               Min.   :  9.90        
##  1st Qu.:77.0            1st Qu.: 9.60               1st Qu.: 36.60        
##  Median :78.5            Median : 9.60               Median : 38.70        
##  Mean   :78.1            Mean   :12.58               Mean   : 43.81        
##  3rd Qu.:80.9            3rd Qu.:12.80               3rd Qu.: 59.10        
##  Max.   :84.2            Max.   :37.20               Max.   :106.30        
##  population_country  latitude_country longitude_country
##  Min.   :6.454e+05   Min.   :-40.90   Min.   :-106.35  
##  1st Qu.:6.706e+07   1st Qu.: 35.86   1st Qu.: -95.71  
##  Median :3.282e+08   Median : 37.09   Median :  10.45  
##  Mean   :5.103e+08   Mean   : 34.78   Mean   :  11.58  
##  3rd Qu.:1.366e+09   3rd Qu.: 38.96   3rd Qu.: 104.20  
##  Max.   :1.398e+09   Max.   : 61.92   Max.   : 174.89

Check boxplot

# Final Worth
Outlier_FinalWorth <- ggplot(bill_clean, aes(x = "", y = finalWorth)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot Final Worth", y = "Final Worth")

# Age
Outlier_Age <- ggplot(bill_clean, aes(x = "", y = age)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot Age", y = "Age")

# CPI Country
Outlier_CpiCountry <- ggplot(bill_clean, aes(x = "", y = cpi_country)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot CPI Country", y = "CPI Country")

# CPI Change
Outlier_CpiChange <- ggplot(bill_clean, aes(x = "", y = cpi_change_country)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot CPI Change Country", y = "CPI Change Country")

# Gross Tertiary Education Enrollment
Outlier_GrossTertiary <- ggplot(bill_clean, aes(x = "", y = gross_tertiary_education_enrollment)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot Gross Tertiary Education Enrollment", y = "Gross Tertiary Education Enrollment")

# Gross Primary Education Enrollment
Outlier_GrossPrimary <- ggplot(bill_clean, aes(x = "", y = gross_primary_education_enrollment_country)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot Gross Primary Education Enrollment", y = "Gross Primary Education Enrollment")

# Life Expectancy
Outlier_LifeExpectancy <- ggplot(bill_clean, aes(x = "", y = life_expectancy_country)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot Life Expectancy Country", y = "Life Expectancy Country")

# Tax Revenue
Outlier_TaxRevenue <- ggplot(bill_clean, aes(x = "", y = tax_revenue_country_country)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot Tax Revenue Country", y = "Tax Revenue Country")

# Total Tax Rate
Outlier_TotalTaxRate <- ggplot(bill_clean, aes(x = "", y = total_tax_rate_country)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot Total Tax Rate Country", y = "Total Tax Rate Country")

# Population
Outlier_Population <- ggplot(bill_clean, aes(x = "", y = population_country)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  labs(title = "Boxplot Population Country", y = "Population Country")

grid.arrange(
  Outlier_FinalWorth, 
  Outlier_Age, 
  Outlier_CpiCountry, 
  Outlier_CpiChange,
  Outlier_GrossTertiary, 
  Outlier_GrossPrimary, 
  Outlier_LifeExpectancy, 
  Outlier_TaxRevenue,
  Outlier_TotalTaxRate, 
  Outlier_Population, 
  ncol = 4
)
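The ten near-identical plot definitions could also be written once by reshaping the numeric columns into long form and faceting. This is only a sketch of that alternative, shown on a made-up `toy_clean` table with three of the columns; it is not the code used to produce the figure above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for `bill_clean` with three numeric columns
toy_clean <- tibble(
  finalWorth  = c(1000, 1500, 2400, 4300, 211000),
  age         = c(45, 56, 65, 74, 92),
  cpi_country = c(99.6, 117.2, 117.2, 125.1, 288.6)
)

# One row per (variable, value) pair
toy_long <- toy_clean %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")

# One panel per variable, each with its own y-axis range
outlier_plot <- ggplot(toy_long, aes(x = "", y = value)) +
  geom_boxplot(fill = "gray", outlier.color = "tan3") +
  facet_wrap(~variable, scales = "free_y", ncol = 4) +
  labs(title = "Boxplots of Numeric Variables", x = NULL, y = NULL)

outlier_plot
```

With the real data, `everything()` would be replaced by the ten column names plotted above.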

2. Data Analysis

After pre-processing the data, we move on to the analysis, using the questions we formulated to explore the factors behind becoming a billionaire.

a) What patterns appear in the distribution of billionaire wealth across countries and industries?

We look at how country and industry relate to a person's wealth.

Countries with the Most Billionaires

bill_clean %>%
  count(country, sort = TRUE) %>%
  ggplot(aes(x = reorder(country, n), y = n)) +
  geom_col(fill = "tan1") +
  coord_flip() + # swap the x and y axes
  geom_text(
        aes(label = paste0(n, " (", scales::percent(n/sum(n), accuracy = 0.1), ")")),
        hjust = -0.05,
        size = 3.5
    ) +
  scale_y_continuous(expand = expansion(mult = c(0,0.25))) +
  labs(title = "Number of Billionaires per Country",
       x = "Country", y = "Number of Billionaires") +
  theme_minimal()

bill_clean %>%
  count(country, sort = TRUE) %>%
  mutate(pct = n / sum(n)) %>% # share of all billionaires, computed before keeping the top 10
  head(10) %>%
  ggplot(aes(x = reorder(country, n), y = n)) +
  geom_col(fill = "tan1") +
  coord_flip() + # swap the x and y axes
  geom_text(
        aes(label = paste0(n, " (", scales::percent(pct, accuracy = 0.1), ")")),
        hjust = -0.05, size = 3.5
    ) +
  scale_y_continuous(expand = expansion(mult = c(0,0.15))) +
  labs(title = "Top 10 Countries with the Most Billionaires",
       x = "Country", y = "Number of Billionaires") +
  theme_minimal()

Both charts show that the United States and China have the most billionaires in the world. This suggests that country may play a role in a person's wealth, although a count alone does not establish a causal effect.

Top 10 Sectors by Total Wealth

top_industries <- bill_clean %>%
  group_by(industries) %>%
  summarise(total_wealth = sum(finalWorth, na.rm = TRUE)) %>%
  mutate(pct = total_wealth / sum(total_wealth)) %>% # share of total wealth across all industries
  arrange(desc(total_wealth)) %>%
  slice_head(n = 10)

ggplot(top_industries, aes(x = reorder(industries, total_wealth), y = total_wealth)) +
  geom_col(fill = "tan1") +
  geom_text(
    aes(label = paste0(
      comma(total_wealth, accuracy = 1), 
      " (", percent(pct, accuracy = 0.1), ")"
    )),
    hjust = -0.05,
    size = 3.5
  ) +
  labs(
    title = "Top 10 Industries by Total Wealth",
    x = "Industry",
    y = "Total Wealth (Million USD)"
  ) +
  coord_flip() +  
  theme_minimal() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.4)))  # leave room on the right for the labels

This chart shows that Technology, Fashion & Retail, and Finance & Investments are the three industries with the largest total wealth. A likely reason is recent progress, particularly in Technology, which has allowed these three industries to accumulate so much wealth.

Industry Distribution of Billionaires in the Top 5 Countries by Number of Billionaires

top_countries_order <- bill_clean %>%
  count(country, sort = TRUE) %>%
  slice_max(n, n = 5)

industry_country <- bill_clean %>%
  filter(country %in% top_countries_order$country, !is.na(industries)) %>%
  count(country, industries)

# Order industries by their total number of billionaires across the five countries
industry_levels <- industry_country %>%
  count(industries, wt = n, sort = TRUE) %>% # wt = n sums the counts instead of counting rows
  pull(industries)

# Turn industries and countries into factors so they are plotted in ranked order
industry_country <- industry_country %>%
  mutate(
    industries = factor(industries, levels = rev(industry_levels)),
    country = factor(country, levels = top_countries_order$country)
  )

# Faceted chart (countries ordered by rank)
ggplot(industry_country, aes(x = industries, y = n, fill = industries)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  geom_text(aes(label = n), hjust = 0, size = 2.5) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + # room for the count labels at the bar ends
  facet_wrap(~country, scales = "free_y") +
  labs(
    title = "Industry Distribution of Billionaires in the Top Countries",
    x = "Industry",
    y = "Number of Billionaires"
  ) +
  theme_bw(base_size = 10) +
  theme(
    strip.text = element_text(size = 10, face = "bold"),
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 7)
  )

This chart shows, for each of the countries with the most billionaires, which industries their billionaires are most active in. In other words, it hints at which locations and industries offer better odds for someone who wants to become a billionaire.

b) Do self-made status and gender affect total wealth?

Wealth Distribution by Gender and Source of Wealth (Inherited or Self-Made)

bill_clean %>%
  filter(!is.na(gender), !is.na(selfMade)) %>%
  ggplot(aes(x = selfMade, y = finalWorth, fill = gender)) +
  geom_boxplot() + 
  scale_fill_manual(values = c("M" = "navy", "F" = "hotpink")) + # gender is coded "M"/"F"
  labs(title = "Wealth Distribution by Self-Made Status and Gender",
       x = "Self-Made", y = "Wealth (Million USD)") +
  ylim(0, 10000) + # note: values above 10,000 are dropped before the boxplot statistics are computed
  theme_minimal()

The boxplot suggests that inherited fortunes (selfMade = FALSE) tend to be larger than self-made ones (selfMade = TRUE). It also shows that wealth varies more among female heirs than among male billionaires, whether heirs or self-made. A self-made billionaire can therefore reach a level of wealth comparable to an heir's, although the two groups still differ.
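Because the plot restricts the y-axis with `ylim()`, it is worth knowing how that differs from zooming with `coord_cartesian()`. The sketch below uses invented wealth values (not the dataset) and reads the median that ggplot2 computes in each case via `layer_data()`.

```r
library(ggplot2)

# Hypothetical wealth values (million USD) with one very large observation
toy_worth <- data.frame(
  selfMade   = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  finalWorth = c(1000, 2000, 3000, 9000, 180000)
)

base_plot <- ggplot(toy_worth, aes(x = selfMade, y = finalWorth)) +
  geom_boxplot()

# ylim() turns out-of-range values into NA before the statistics are computed
p_ylim <- base_plot + ylim(0, 10000)
# coord_cartesian() only zooms the view; all rows still feed the statistics
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 10000))

median_ylim <- layer_data(p_ylim)$middle
median_zoom <- layer_data(p_zoom)$middle
c(ylim = median_ylim, zoom = median_zoom)
```

The two medians differ because the `ylim()` version silently excludes the largest fortune (ggplot2 emits a warning about removed rows), so the choice between the two affects how the boxes should be read.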

Gender distribution by sector

bill_clean %>%
  group_by(industries, gender) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = reorder(industries, count), y = count, fill = gender)) +
  scale_fill_manual(values = c("M" = "navy", "F" = "hotpink")) +
  geom_col(position = "dodge") +
  geom_text(aes(label = count),
            position = position_dodge(width = 0.9), # match the default dodge width of the bars
            hjust = -0.1, size = 2) +
  coord_flip() +
  labs(title = "Number of Billionaires by Industry and Gender", x = "Industry", y = "Count", fill = "Gender") +
  theme_minimal()

The chart shows that every sector is still dominated by men.
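The degree of that dominance can be put into numbers by computing the share of women per industry. A minimal sketch on an invented `toy` table (the industry names are real categories, the rows are made up):

```r
library(dplyr)

# Hypothetical rows; gender is coded "M"/"F" as in the dataset
toy <- tibble(
  industries = c("Technology", "Technology", "Technology",
                 "Fashion & Retail", "Fashion & Retail"),
  gender     = c("M", "M", "F", "M", "F")
)

female_share <- toy %>%
  group_by(industries) %>%
  summarise(
    n            = n(),
    share_female = mean(gender == "F"),  # proportion of women in the industry
    .groups      = "drop"
  ) %>%
  arrange(desc(share_female))

female_share
```

Running the same pipeline on `bill_clean` would give one share per industry, which is easier to compare across sectors of very different size than raw counts.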

c) How are age and wealth related among billionaires?

Age versus Total Wealth

ggplot(bill_clean, aes(x = age, y = finalWorth, color = gender)) +
  geom_point(alpha = 0.9) +
  geom_smooth() +
  labs(title = "Age and Wealth by Gender",
       x = "Age", y = "Wealth (Million USD)") +
  theme_minimal()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The plot indicates that age has little influence on total wealth. Most points fall between roughly 50 and 75 years, which is consistent with the mean age of about 65 reported in the summary above.
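One way to back this visual impression with a number is a rank correlation, which is less sensitive to a few extreme fortunes than the default Pearson coefficient. The values below are invented for illustration and say nothing about the real result.

```r
# Hypothetical ages and wealth values (million USD); not taken from the dataset
toy <- data.frame(
  age        = c(35, 48, 52, 60, 67, 74, 81, 90),
  finalWorth = c(2400, 1000, 180000, 1500, 4300, 211000, 1200, 3000)
)

# Spearman's rho works on ranks, so a few very large fortunes do not dominate
rho <- cor(toy$age, toy$finalWorth, method = "spearman")
rho
```

On the real data the call would be `cor(bill_clean$age, bill_clean$finalWorth, method = "spearman")`; a value near zero would support the reading that age and wealth are only weakly related.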

Age Distribution of Billionaires by Sector

bill_clean %>%
  mutate(age_group = cut(age, breaks = c(0, 40, 60, 80, Inf), # Inf keeps ages above 100 (the maximum is 101)
                         labels = c("<=40", "41-60", "61-80", ">80"))) %>%
  filter(!is.na(industries)) %>%
  count(age_group, industries) %>%
  slice_max(n, n = 40) %>%  # keep the 40 largest groups so the chart stays readable
  ggplot(aes(x = reorder(industries, n), y = n, fill = age_group)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "Distribution of Billionaires by Age and Industry",
       x = "Industry",
       y = "Number of Billionaires",
       fill = "Age Group") +
  theme_minimal()

The chart shows that most billionaires in finance are 61-80 years old, which suggests they have been in the sector for a long time. The sector with the most varied age groups is technology, which has been especially popular in recent years.
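Since `summary(bill_clean)` reports a maximum age of 101, the upper break passed to `cut()` determines whether the oldest billionaire is assigned to a group at all. A small illustration on made-up ages:

```r
ages <- c(18, 40, 41, 60, 80, 81, 101)

# Upper break fixed at 100: the 101-year-old falls outside every interval
capped <- cut(ages, breaks = c(0, 40, 60, 80, 100),
              labels = c("<=40", "41-60", "61-80", ">80"))

# Open-ended upper break keeps every age in a group
open_ended <- cut(ages, breaks = c(0, 40, 60, 80, Inf),
                  labels = c("<=40", "41-60", "61-80", ">80"))

data.frame(ages, capped, open_ended)
```

Intervals are closed on the right by default, so 40 lands in "<=40" and 41 in "41-60", matching the labels.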

Correlation between numeric variables

vars_individu <- bill_clean %>%
  select(finalWorth, age)

vars_negara <- bill_clean %>%
  select(gdp_country, cpi_country, population_country,
         total_tax_rate_country, tax_revenue_country_country)

cor_individu <- cor(vars_individu, use = "complete.obs")
cor_negara <- cor(vars_negara, use = "complete.obs")

Per person

ggcorrplot(cor_individu,
           lab = TRUE,
           type = "full",
           colors = c("skyblue", "white", "firebrick"),
           title = "Correlation of Individual-Level Variables",
           lab_size = 4)

This agrees with the observation from the scatterplot above that age has little effect on wealth.

Per country

ggcorrplot(cor_negara,
           lab = TRUE,
           type = "full",
           colors = c("tan2", "white", "salmon"),
           title = "Correlation of Country-Level Variables",
           lab_size = 4)

Interpretation:

Note that these correlations are computed over billionaire rows rather than over one row per country, so countries with many billionaires (such as the United States and China) carry more weight.

total_tax_rate_country and population_country have a fairly strong positive correlation of 0.65: countries with larger populations tend to have a higher total tax rate.
tax_revenue_country_country has a fairly strong negative correlation with gdp_country (-0.6): the higher a country's GDP, the smaller its tax revenue as a share of GDP. One possible explanation is that high-GDP countries have larger non-tax revenue sources or different fiscal efficiency.
tax_revenue_country_country is also negatively correlated with population_country (-0.43), meaning that more populous countries tend to have a lower ratio of tax revenue to GDP.
gdp_country is positively correlated with population_country (0.43), which is expected because countries with large populations tend to produce a higher GDP.
cpi_country has a weak positive correlation with population_country (0.22) and total_tax_rate_country (0.24), but a negative one with gdp_country (-0.3). This may indicate that countries with a higher consumer price index (CPI) tend not to be the ones with the largest GDP.
total_tax_rate_country has a weak correlation with gdp_country (0.14), suggesting that the tax rate is not closely tied to the size of a country's economy.
The remaining correlations are weak, meaning there is no strong linear relationship between those variables.
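The country-level columns repeat for every billionaire from the same country, so a correlation over `bill_clean` weights each country by its number of billionaires. The sketch below, on an invented `toy_clean` table with hypothetical countries A, B and C, shows how collapsing to one row per country with `distinct()` can change the coefficient.

```r
library(dplyr)

# Hypothetical stand-in for `bill_clean`: one row per billionaire
toy_clean <- tibble(
  country            = c("A", "A", "A", "B", "C", "C"),
  gdp_country        = c(20, 20, 20, 3, 15, 15),
  population_country = c(330, 330, 330, 67, 1400, 1400)
)

# Collapse to one row per country before correlating
toy_country <- toy_clean %>%
  distinct(country, gdp_country, population_country)

cor_by_row     <- cor(toy_clean$gdp_country, toy_clean$population_country)
cor_by_country <- cor(toy_country$gdp_country, toy_country$population_country)

c(by_row = cor_by_row, by_country = cor_by_country)
```

Which version is appropriate depends on the question: the row-level figure describes the environment of a typical billionaire, while the country-level figure describes countries themselves.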

MAJOR ASSIGNMENT VDE_BILLIONAIRE

Nabila (006), Amelia (007), Agata (036)

2025-06-21