1 What Is a Billionaire?

“The term billionaire describes an individual with a net worth of at least one billion units in their native currency, such as dollars or euros. The assets can range from cash and cash equivalents to real estate, business, and personal property. Since 1987, Forbes has ranked the wealthiest global citizens according to their net worth.”

—Investopedia

1.1 Dataset Introduction

In this LBB project, we use a dataset provided by Kaggle that contains statistics on the world’s billionaires, including information about their businesses, industries, and personal details. It provides insights into the wealth distribution, business sectors, and demographics of billionaires worldwide.

These insights will be summarized with data visualization to give us a better understanding of the billionaires' backgrounds.

2 Data Preparation

2.1 Import Libraries

# Load the required packages
library(dplyr)     # data manipulation
library(ggplot2)   # for visualization
library(scales)    # for number formatting (thousands separators, etc.)
library(glue)      # string interpolation for tooltip labels
library(plotly)    # interactive plots
library(lubridate) # working with datetime
library(gtools)
library(leaflet)   # interactive maps

2.2 Read and Check Dataset Structure

# Read dataset
df_billion <- read.csv("data_input/Billionaires Statistics Dataset.csv")

2.2.1 Dataset Structure

# Check the structure of our dataset
str(df_billion)

#> 'data.frame':    2640 obs. of  35 variables:
#>  $ rank                                      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ finalWorth                                : int  211000 180000 114000 107000 106000 104000 94500 93000 83400 80700 ...
#>  $ category                                  : chr  "Fashion & Retail" "Automotive" "Technology" "Technology" ...
#>  $ personName                                : chr  "Bernard Arnault & family" "Elon Musk" "Jeff Bezos" "Larry Ellison" ...
#>  $ age                                       : int  74 51 59 78 92 67 81 83 65 67 ...
#>  $ country                                   : chr  "France" "United States" "United States" "United States" ...
#>  $ city                                      : chr  "Paris" "Austin" "Medina" "Lanai" ...
#>  $ source                                    : chr  "LVMH" "Tesla, SpaceX" "Amazon" "Oracle" ...
#>  $ industries                                : chr  "Fashion & Retail" "Automotive" "Technology" "Technology" ...
#>  $ countryOfCitizenship                      : chr  "France" "United States" "United States" "United States" ...
#>  $ organization                              : chr  "LVMH Moët Hennessy Louis Vuitton" "Tesla" "Amazon" "Oracle" ...
#>  $ selfMade                                  : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
#>  $ status                                    : chr  "U" "D" "D" "U" ...
#>  $ gender                                    : chr  "M" "M" "M" "M" ...
#>  $ birthDate                                 : chr  "3/5/1949 0:00" "6/28/1971 0:00" "1/12/1964 0:00" "8/17/1944 0:00" ...
#>  $ lastName                                  : chr  "Arnault" "Musk" "Bezos" "Ellison" ...
#>  $ firstName                                 : chr  "Bernard" "Elon" "Jeff" "Larry" ...
#>  $ title                                     : chr  "Chairman and CEO" "CEO" "Chairman and Founder" "CTO and Founder" ...
#>  $ date                                      : chr  "4/4/2023 5:01" "4/4/2023 5:01" "4/4/2023 5:01" "4/4/2023 5:01" ...
#>  $ state                                     : chr  "" "Texas" "Washington" "Hawaii" ...
#>  $ residenceStateRegion                      : chr  "" "South" "West" "West" ...
#>  $ birthYear                                 : int  1949 1971 1964 1944 1930 1955 1942 1940 1957 1956 ...
#>  $ birthMonth                                : int  3 6 1 8 8 10 2 1 4 3 ...
#>  $ birthDay                                  : int  5 28 12 17 30 28 14 28 19 24 ...
#>  $ cpi_country                               : num  110 117 117 117 117 ...
#>  $ cpi_change_country                        : num  1.1 7.5 7.5 7.5 7.5 7.5 7.5 3.6 7.7 7.5 ...
#>  $ gdp_country                               : chr  "$2,715,518,274,227 " "$21,427,700,000,000 " "$21,427,700,000,000 " "$21,427,700,000,000 " ...
#>  $ gross_tertiary_education_enrollment       : num  65.6 88.2 88.2 88.2 88.2 88.2 88.2 40.2 28.1 88.2 ...
#>  $ gross_primary_education_enrollment_country: num  102 102 102 102 102 ...
#>  $ life_expectancy_country                   : num  82.5 78.5 78.5 78.5 78.5 78.5 78.5 75 69.4 78.5 ...
#>  $ tax_revenue_country_country               : num  24.2 9.6 9.6 9.6 9.6 9.6 9.6 13.1 11.2 9.6 ...
#>  $ total_tax_rate_country                    : num  60.7 36.6 36.6 36.6 36.6 36.6 36.6 55.1 49.7 36.6 ...
#>  $ population_country                        : int  67059887 328239523 328239523 328239523 328239523 328239523 328239523 126014024 1366417754 328239523 ...
#>  $ latitude_country                          : num  46.2 37.1 37.1 37.1 37.1 ...
#>  $ longitude_country                         : num  2.21 -95.71 -95.71 -95.71 -95.71 ...

2.2.2 Dataset Description

The Dataset contains the following information:

rank: The ranking of the billionaire in terms of wealth.

finalWorth: The final net worth of the billionaire, in millions of U.S. dollars (for example, 211000 corresponds to $211 billion).

category: The category or industry in which the billionaire’s business operates.

personName: The full name of the billionaire.

age: The age of the billionaire.

country: The country in which the billionaire resides.

city: The city in which the billionaire resides.

source: The source of the billionaire’s wealth.

industries: The industries associated with the billionaire’s business interests.

countryOfCitizenship: The country of citizenship of the billionaire.

organization: The name of the organization or company associated with the billionaire.

selfMade: Indicates whether the billionaire is self-made (True/False).

status: “D” represents self-made billionaires (Founders/Entrepreneurs) and “U” indicates inherited or unearned wealth. The column contains six distinct values in total; the remaining codes are not described here.

gender: The gender of the billionaire.

birthDate: The birthdate of the billionaire.

lastName: The last name of the billionaire.

firstName: The first name of the billionaire.

title: The title or honorific of the billionaire.

date: The date of data collection.

state: The state in which the billionaire resides.

residenceStateRegion: The region or state of residence of the billionaire.

birthYear: The birth year of the billionaire.

birthMonth: The birth month of the billionaire.

birthDay: The birth day of the billionaire.

cpi_country: Consumer Price Index (CPI) for the billionaire’s country.

cpi_change_country: CPI change for the billionaire’s country.

gdp_country: Gross Domestic Product (GDP) for the billionaire’s country.

gross_tertiary_education_enrollment: Enrollment in tertiary education in the billionaire’s country.

gross_primary_education_enrollment_country: Enrollment in primary education in the billionaire’s country.

life_expectancy_country: Life expectancy in the billionaire’s country.

tax_revenue_country_country: Tax revenue in the billionaire’s country.

total_tax_rate_country: Total tax rate in the billionaire’s country.

population_country: Population of the billionaire’s country.

latitude_country: Latitude coordinate of the billionaire’s country.

longitude_country: Longitude coordinate of the billionaire’s country.
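
Because the finalWorth values in the str() output (211000 for the top-ranked entry) appear to be expressed in millions of U.S. dollars, it can help to derive a column in billions before labelling plots. The following is a minimal sketch on a toy data frame; the names toy_worth and worth_billion are illustrative and not part of the dataset.

```r
library(dplyr)

# Toy rows copied from the str() output above.
# Assumption: finalWorth is recorded in millions of U.S. dollars.
toy_worth <- data.frame(
  personName = c("Bernard Arnault & family", "Elon Musk", "Jeff Bezos"),
  finalWorth = c(211000, 180000, 114000)
)

# Derive net worth in billions (hypothetical helper column)
toy_worth <- toy_worth %>%
  mutate(worth_billion = finalWorth / 1000)

toy_worth
```

The same mutate() call could be applied to df_billion if a billions scale is preferred for axis labels or tooltips.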

3 Data Wrangling Process

3.1 Change Data Type

Based on the dataset description, we need to change the data types as follows:

- birthDate and date -> date-time format
- all other character variables -> factor

# Convert to date-time format (the raw values look like "3/5/1949 0:00")
df_billion$birthDate <- mdy_hm(df_billion$birthDate)
df_billion$date <- mdy_hm(df_billion$date)

# Convert all remaining character variables to factors
df_billion <- df_billion %>% 
  mutate(across(where(is.character), as.factor))

# Re-check that the data types were changed correctly
str(df_billion)

#> 'data.frame':    2640 obs. of  35 variables:
#>  $ rank                                      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ finalWorth                                : int  211000 180000 114000 107000 106000 104000 94500 93000 83400 80700 ...
#>  $ category                                  : Factor w/ 18 levels "Automotive","Construction & Engineering",..: 5 1 17 17 6 17 12 18 3 17 ...
#>  $ personName                                : Factor w/ 2638 levels "A. Jayson Adair",..: 222 587 988 1274 2392 239 1564 303 1634 2136 ...
#>  $ age                                       : int  74 51 59 78 92 67 81 83 65 67 ...
#>  $ country                                   : Factor w/ 79 levels "","Algeria","Andorra",..: 26 76 76 76 76 76 76 45 33 76 ...
#>  $ city                                      : Factor w/ 742 levels "","A Coruña",..: 496 29 403 334 478 403 455 411 434 262 ...
#>  $ source                                    : Factor w/ 906 levels "3D printing",..: 474 823 29 594 82 516 99 814 228 516 ...
#>  $ industries                                : Factor w/ 18 levels "Automotive","Construction & Engineering",..: 5 1 17 17 6 17 12 18 3 17 ...
#>  $ countryOfCitizenship                      : Factor w/ 77 levels "Algeria","Argentina",..: 23 74 74 74 74 74 74 42 31 74 ...
#>  $ organization                              : Factor w/ 295 levels "","ABC Supply",..: 163 260 8 194 29 33 36 9 213 159 ...
#>  $ selfMade                                  : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
#>  $ status                                    : Factor w/ 6 levels "D","E","N","R",..: 6 1 1 6 1 1 6 6 1 1 ...
#>  $ gender                                    : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ birthDate                                 : POSIXct, format: "1949-03-05 00:00:00" "1971-06-28 00:00:00" ...
#>  $ lastName                                  : Factor w/ 1736 levels "Aarnio-Wihuri",..: 69 1049 149 420 216 513 176 1415 43 102 ...
#>  $ firstName                                 : Factor w/ 1771 levels "","A. Jayson",..: 143 342 638 855 1561 152 1012 187 1052 1420 ...
#>  $ title                                     : Factor w/ 98 levels "","Advisor","Athlete",..: 18 5 21 52 5 36 5 77 68 85 ...
#>  $ date                                      : POSIXct, format: "2023-04-04 05:01:00" "2023-04-04 05:01:00" ...
#>  $ state                                     : Factor w/ 46 levels "","Alabama","Arizona",..: 1 40 44 10 26 44 30 1 1 44 ...
#>  $ residenceStateRegion                      : Factor w/ 6 levels "","Midwest","Northeast",..: 1 4 6 6 2 6 3 1 1 6 ...
#>  $ birthYear                                 : int  1949 1971 1964 1944 1930 1955 1942 1940 1957 1956 ...
#>  $ birthMonth                                : int  3 6 1 8 8 10 2 1 4 3 ...
#>  $ birthDay                                  : int  5 28 12 17 30 28 14 28 19 24 ...
#>  $ cpi_country                               : num  110 117 117 117 117 ...
#>  $ cpi_change_country                        : num  1.1 7.5 7.5 7.5 7.5 7.5 7.5 3.6 7.7 7.5 ...
#>  $ gdp_country                               : Factor w/ 69 levels "","$1,119,190,780,753 ",..: 22 26 26 26 26 26 26 3 21 26 ...
#>  $ gross_tertiary_education_enrollment       : num  65.6 88.2 88.2 88.2 88.2 88.2 88.2 40.2 28.1 88.2 ...
#>  $ gross_primary_education_enrollment_country: num  102 102 102 102 102 ...
#>  $ life_expectancy_country                   : num  82.5 78.5 78.5 78.5 78.5 78.5 78.5 75 69.4 78.5 ...
#>  $ tax_revenue_country_country               : num  24.2 9.6 9.6 9.6 9.6 9.6 9.6 13.1 11.2 9.6 ...
#>  $ total_tax_rate_country                    : num  60.7 36.6 36.6 36.6 36.6 36.6 36.6 55.1 49.7 36.6 ...
#>  $ population_country                        : int  67059887 328239523 328239523 328239523 328239523 328239523 328239523 126014024 1366417754 328239523 ...
#>  $ latitude_country                          : num  46.2 37.1 37.1 37.1 37.1 ...
#>  $ longitude_country                         : num  2.21 -95.71 -95.71 -95.71 -95.71 ...
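
One column worth a second look is gdp_country: it is stored as text such as "$2,715,518,274,227 ", so it ends up as a factor rather than a number. If the GDP values are needed for calculations, they could be parsed as in the sketch below. This is an optional step that is not part of the original workflow, and the names gdp_raw and gdp_num are illustrative.

```r
# Sample values copied from the str() output above (note the trailing space);
# the empty string stands for a missing value.
gdp_raw <- c("$2,715,518,274,227 ", "$21,427,700,000,000 ", "")

# Strip the dollar sign, commas, and whitespace, then convert to numeric.
# An empty string becomes NA.
gdp_num <- as.numeric(gsub("[$,[:space:]]", "", as.character(gdp_raw)))

gdp_num
```

The as.character() call keeps the sketch safe to use on a column that has already been converted to a factor.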

3.2 Check For Missing Values

# Count the NA values in every column
cols_names <- colnames(df_billion)
cols_NA_values <- colSums(is.na(df_billion))

# Temporary data frame holding each column name and its total number of NA values
temp_cols <- data.frame(cols_names, cols_NA_values)

# Keep only the columns that contain at least one NA value
cols_NA <- temp_cols[!(cols_NA_values == 0),]

cols_NA

The NA values above may simply mean that no information is available for the corresponding column. We will still keep these rows, because the other variables can still provide insight.

Therefore, we will not remove any data from our dataset.
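
Note that is.na() does not count empty strings, and the str() output shows an empty level "" in several factor columns (for example country, city, and state). A minimal sketch of counting both kinds of missing entries follows. It uses a toy data frame with made-up values, and the names toy_missing, na_count, and empty_count are illustrative.

```r
# Toy data frame (hypothetical values) mixing NA and empty strings
toy_missing <- data.frame(
  country = c("France", "", "United States", ""),
  age     = c(74L, NA, 59L, 51L),
  stringsAsFactors = TRUE
)

# Count NA values per column
na_count <- colSums(is.na(toy_missing))

# Count empty strings per column; as.character() makes this work for factors too
empty_count <- sapply(
  toy_missing,
  function(col) sum(as.character(col) == "", na.rm = TRUE)
)

data.frame(column = names(toy_missing), na_count, empty_count)
```

Running the same two counts on df_billion would show how many blank entries sit alongside the NA values reported above.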

4 Exploratory Data Analysis with Visualization

Using our dataset df_billion, let us explore several questions visually:

4.1 Countries Where the Most Billionaires Reside

We will use dplyr functions to count how often each value of country (the country where a billionaire resides) occurs, and show the top 10 countries with the most resident billionaires.

# Data Transformation
top10_country <- df_billion %>% 
  group_by(country) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  arrange(-freq) %>% 
  head(10) %>% 
  # Adding label for tooltip
  mutate(label = glue(
    "Residence Country: {country}
    Total: {comma(freq)} Billionaires"
  ))

# Making Static Plot
plot1 <- ggplot(data = top10_country, 
                aes(x = freq,
                    y = reorder(country, freq),
                    color = freq,
                    text = label)) +
  geom_point(size = 3) +
  geom_segment(aes(x = 0,
                   xend = freq,
                   yend = country),
                   size = 1.0) +
  scale_color_gradient(low = "lightblue", 
                       high = "darkblue") +
  scale_x_continuous(labels = comma) + 
  labs(title = "Top 10 Countries of Residence with the Most Billionaires",
       x = "Total Billionaires",
       y = "Residence Country") +
  theme_minimal() +
  theme(legend.position = "none") 

# Creating interactive plot
ggplotly(plot1, tooltip = "text")
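
The group_by() / summarise() / arrange() chain used for the counting step can also be written with dplyr::count(). The sketch below shows that variant on a toy data frame with made-up rows, and additionally drops blank country labels before ranking, since the str() output shows an empty "" level in country. The names toy_country and top_country are illustrative.

```r
library(dplyr)

# Toy data (hypothetical rows); one entry has a blank residence country
toy_country <- data.frame(
  country = c("United States", "China", "United States", "",
              "India", "China", "United States")
)

top_country <- toy_country %>%
  filter(country != "") %>%                    # drop blank residence countries
  count(country, name = "freq", sort = TRUE) %>% # count and sort descending
  head(2)                                      # keep the top 2 for this toy example

top_country
```

Whether blank countries should be excluded is a judgment call; the original transformation keeps them in the count.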

4.2 Which 10 Industries Have the Most Billionaires?

# Data Transformation
top10_industries <- df_billion %>% 
  group_by(industries) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  arrange(-freq) %>% 
  head(10) %>% 
  # Adding label for tooltip
  mutate(label = glue(
    "Industry: {industries}
    Total: {comma(freq)} Billionaires"
  ))

# Making Static Plot
plot2 <- ggplot(data = top10_industries, 
                aes(x = freq,
                    y = reorder(industries, freq),
                    text = label)) +
  geom_col(mapping = aes(fill = freq)) +
  scale_fill_gradient(low = "lightblue",
                      high = "darkblue") +
  scale_x_continuous(labels = comma) + 
  labs(title = "Top 10 Industries with Most Billionaires",
       x = "Total Billionaires",
       y = "Industry Category") +
  theme_minimal() 

# Creating interactive plot
ggplotly(plot2, tooltip = "text")

4.3 Overview Location Map of Top 50 Billionaires in the World

# Select the top 50 ranked billionaires (assumes the rows are already ordered by rank)
loca_top50 <- data.frame(rank = as.character(df_billion$rank),
                         Name = df_billion$personName,
                         Assets = df_billion$finalWorth,
                         Residence_Country = df_billion$country,
                         lat = df_billion$latitude_country,
                         lng = df_billion$longitude_country) %>% 
  head(50)

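# Plot the markers on the world map
# Note: lat/lng are country-level coordinates, so billionaires living in the
# same country share a single marker position
map <- leaflet()
map <- addTiles(map = map)
addMarkers(map = map, data = loca_top50,
           # finalWorth appears to be recorded in millions of USD (211000 for rank 1),
           # hence the "M" suffix
           popup = glue("<b>Rank = {loca_top50$rank}</b><br>
                         Name = {loca_top50$Name}<br>
                         Assets = $ {comma(loca_top50$Assets)} M<br>
                         Country Residence = {loca_top50$Residence_Country}"))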

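Because the coordinates in this dataset are given per country, markers for billionaires living in the same country are drawn on top of each other. One possible way to keep them accessible is leaflet's marker clustering, sketched below on a toy data frame. The names toy_loc and map_clustered are illustrative, and the rows reuse the country-level coordinates shown in the str() output.

```r
library(leaflet)

# Toy rows using the country-level coordinates from the str() output above
toy_loc <- data.frame(
  Name = c("Elon Musk", "Jeff Bezos", "Bernard Arnault & family"),
  lat  = c(37.1, 37.1, 46.2),
  lng  = c(-95.71, -95.71, 2.21)
)

# Overlapping markers are grouped into a cluster that expands on click
map_clustered <- leaflet(data = toy_loc) %>%
  addTiles() %>%
  addMarkers(lng = ~lng, lat = ~lat, popup = ~Name,
             clusterOptions = markerClusterOptions())

map_clustered
```

The same clusterOptions argument could be passed to the addMarkers() call for loca_top50.
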
5 Conclusions

There are many more variations and insights that we could draw by combining variables and summarizing the data with visualization. The visualizations above are only a starting point for more complex analysis. We hope they help introduce the general information contained in the dataset.

6 References

Billionaires Statistics Dataset (2023): https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset
What Is a Billionaire?: https://www.investopedia.com/terms/b/billionaire.asp

LBB 02: Billionaires Statistics Data Visualization

Melissa Rusli

23 October 2023