Introduction

In an era where wealth concentration and economic disparity are at the forefront of global discussions, understanding the dynamics of billionaire wealth is vital. This project contains statistics on the world’s billionaires, including information about their businesses, industries, and personal details. It provides insights into the wealth distribution, business sectors, and demographics of billionaires worldwide. We chose this to analyze what makes billionaires billionaires and how they compare to one another.

This is the question that drives our analysis: What makes a billionaire more successful than another?

Data Preprocessing

This code chunk represents the data preprocessing steps performed on the “Billionaires Statistics Dataset” to ensure that the data is clean, well-structured, and ready for analysis.

  1. Loading the Dataset: The code begins by loading the dataset from a CSV file named “Billionaires Statistics Dataset.csv” while suppressing the display of column data types.

  2. Column Renaming: Column names are renamed for clarity and consistency. Specifically, the column originally named ‘finalWorth’ is renamed to ‘netWorth’ for better interpretation.

  3. Handling Missing Values: Missing values in critical columns, such as ‘netWorth’ and ‘age,’ are addressed. Rows with missing values in these columns are removed from the dataset. Additionally, missing values in the ‘country’ column are replaced with the label ‘Unknown’ to ensure completeness.

  4. Data Type Conversion: The ‘selfMade’ column is converted to a factor variable, making it suitable for categorical analysis.

  5. Date Parsing: The ‘birthDate’ column is parsed into a standardized date-time format. Any date parsing failures are identified and marked as ‘NA.’

  6. Numeric Conversion and Text Standardization: The ‘gdp_country’ column, which may contain currency symbols and commas, is converted to a numeric format. The ‘country’ column’s text values are standardized by converting them to proper case.

  7. Handling Empty Strings: Any empty string values in the dataset are replaced with ‘NA’ for consistency.

  8. Outlier Detection and Removal: Outliers in the ‘netWorth’ column are identified and removed based on the Interquartile Range (IQR) method. This step helps ensure that extreme values do not distort the analysis.

  9. Duplicate Removal: Duplicate rows in the dataset are eliminated to maintain data integrity and avoid redundancy.

Please note that the code includes commented-out lines for viewing and summarizing the cleaned dataset, which can be uncommented if you wish to inspect the dataset after preprocessing.

These preprocessing steps are crucial for preparing the dataset for subsequent analysis and modeling, ensuring that the data is reliable and suitable for answering research questions and deriving meaningful insights.

# Loading the dataset
billionaires_data <- read_csv("Billionaires Statistics Dataset.csv", show_col_types = FALSE)

# Renaming columns for clarity
billionaires_data <- rename(billionaires_data, netWorth = finalWorth)

# Handling missing values
billionaires_data <- billionaires_data %>%
  filter(!is.na(netWorth) & !is.na(age))
billionaires_data$country <- ifelse(is.na(billionaires_data$country), 'Unknown', billionaires_data$country)

# Data type conversion
billionaires_data$selfMade <- as.factor(billionaires_data$selfMade)

# Parsing dates (corrected format)
billionaires_data$birthDate <- mdy_hm(billionaires_data$birthDate)
failed_dates <- which(is.na(billionaires_data$birthDate))
billionaires_data$birthDate[failed_dates] <- NA

# Parsing 'gdp_country' to numeric and standardizing text data
billionaires_data$gdp_country <- as.numeric(gsub("[\\$,]", "", billionaires_data$gdp_country))
billionaires_data$country <- str_to_title(tolower(billionaires_data$country))

# Handling empty strings
billionaires_data <- billionaires_data %>%
  mutate_if(is.character, na_if, "")

# Outlier detection and removal (for 'netWorth')
worth_iqr <- IQR(billionaires_data$netWorth, na.rm = TRUE)
worth_quantiles <- quantile(billionaires_data$netWorth, probs = c(0.25, 0.75), na.rm = TRUE)
worth_upper <- worth_quantiles[2] + 1.5 * worth_iqr
worth_lower <- worth_quantiles[1] - 1.5 * worth_iqr
billionaires_data <- billionaires_data %>%
  filter(netWorth >= worth_lower & netWorth <= worth_upper)

# Removing duplicates
billionaires_data <- distinct(billionaires_data)

# # Viewing and summarizing the cleaned dataset
# head(billionaires_data)
# summary(billionaires_data)
# str(billionaires_data)

The second chunk of code is designed to perform various data cleaning tasks on a dataset, but it does not include a step for outlier detection and removal. This is neccesary for some of our graphs and calculations.

# Loading the dataset 
billion <- read_csv("Billionaires Statistics Dataset.csv", show_col_types = FALSE)

# Handling missing values
billion <- billion %>%
  filter(!is.na(finalWorth) & !is.na(age))
billion$country <- ifelse(is.na(billion$country), 'Unknown', billion$country)

# Data type conversion
billion$selfMade <- as.factor(billion$selfMade)

# Parsing dates (corrected format)
billion$birthDate <- mdy_hm(billion$birthDate)
failed_dates <- which(is.na(billion$birthDate))
billion$birthDate[failed_dates] <- NA

# Parsing 'gdp_country' to numeric and standardizing text data
billion$gdp_country <- as.numeric(gsub("[\\$,]", "", billion$gdp_country))
billion$country <- str_to_title(tolower(billion$country))

# Handling empty strings
billion <- billion %>%
  mutate_if(is.character, na_if, "")

# Renaming columns for clarity
billion <- rename(billion, netWorth = finalWorth)

# Removing duplicates
billionaires_full <- distinct(billion)

# # Viewing and summarizing the cleaned dataset
# head(billionaires_full)
# summary(billionaires_full)
# str(billionaires_full)

Graphs

The series of world maps visually describe the global distribution of billionaires against the backdrop of national GDPs and total billionaire net worth. These visual insights suggest a correlation between the economic prosperity of a country and the concentration of its high-net-worth individuals. The histograms and box plots further dissect the age and industry factors, potentially illuminating paths to wealth accumulation.

# Renaming 'region' column in WorldData to 'country' for easy merging
colnames(WorldData)[colnames(WorldData) == 'region'] <- 'country'

# Creating a summarized dataset for merging
# This includes filtering, renaming, and aggregating data
billions <- 
  billionaires_full |>
  filter(country != "") |>
  mutate(country = if_else(country == "United States", "USA", country)) |>
  group_by(country) |>
  summarize(
    n = n(),  # Count of billionaires per country
    total_money = sum(netWorth),  # Total net worth per country
    gdp_country = first(gdp_country)  # GDP of the country
  )

# Merging the billionaire data with world map data
world_billion <- left_join(WorldData, billions, by = "country")
world_billion <- world_billion |> filter(long <= 180)

# Creating a world map visualization of the number of billionaires per country
world <- ggplot(data = world_billion,
                aes(x = long, y = lat, group = group)) +
         geom_polygon(aes(fill = n), color = "black") +
         theme_map() +
         labs(
           title = "Number of Billionaires per Country",
           fill = "Billionaire Count",
           caption = "*Grey countries have no billionaires"
         ) +
         scale_fill_viridis_b(breaks = c(seq(0, 800, 150))) +
         coord_map("moll") +
         theme(plot.title = element_text(hjust = 0.5))
world

# Creating a world map visualization of the GDP per country
world_gdp <- ggplot(data = world_billion,
                    aes(x = long, y = lat, group = group)) +
             geom_polygon(aes(fill = gdp_country), color = "black") +
             theme_map() +
             labs(
               title = "Country GDP",
               fill = "GDP",
               caption = "*Grey countries have no billionaires"
             ) +
             scale_fill_viridis_b(labels = c("5 Trillion", "10 Trillion", "15 Trillion", "20 Trillion")) +
             coord_map("moll") +
             theme(plot.title = element_text(hjust = 0.5))

world_gdp

# Creating a world map visualization of the total net worth per country
world_money <- ggplot(data = world_billion,
                      aes(x = long, y = lat, group = group)) +
               geom_polygon(aes(fill = total_money), color = "black") +
               theme_map() +
               labs(
                 title = "Total Net Worth per Country",
                 fill = "Total Net Worth",
                 caption = "*Grey countries have no billionaires"
               ) +
               scale_fill_viridis_b() +
               coord_map("moll") +
               theme(plot.title = element_text(hjust = 0.5))

world_money

The first map shows how many billionaires there are in each country. Lighter colors mean more billionaires, and it’s clear that the USA and China have the most. The second map does the same for each country’s GDP. Not surprisingly, there’s a strong resemblance between the two maps: countries with higher GDPs, like the USA and China, also have more billionaires.

The third map looks at the total wealth of all billionaires within a country (in Millions). This map also highlights the USA and China, showing that these countries don’t just have more billionaires, but billionaires residing in these countries are also wealthier overall.

All three maps point to a common idea: in countries with larger economies, there seems to be a higher chance of becoming a billionaire. This suggests that a strong economy might help people build more wealth.

Histogram of Billionaires by Age

The following histograms provide a snapshot of the age distribution across our billionaire dataset. The shape of the distribution, skewed towards middle age, may indicate peak periods of wealth accumulation.

# Converting 'selfMade' to a factor with descriptive labels
billionaires_full$selfMade <- factor(billionaires_full$selfMade,
                                     levels = c(FALSE, TRUE),
                                     labels = c("Not Self Made", "Self Made"))

# Creating a histogram to visualize the distribution of billionaires by age
age_hist <- ggplot(data = billionaires_full, aes(x = age)) +
            geom_histogram(binwidth = 3, color = "black", fill = "green4") +
            scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
            theme_bw() +
            labs(
              title = "Billionaires by Age",
              x = "Age",
              y = "Billionaire Count"
            ) +
            theme(plot.title = element_text(hjust = 0.5))

# Displaying the histogram
age_hist

# Creating a faceted histogram to compare self-made billionaires and others by age
self_made <- age_hist +
             facet_wrap(~selfMade) +
             labs(
               title = "Billionaires by Age: Self-made vs Not Self-Made",
               x = "Age",
               y = "Billionaire Count"
             )

# Displaying the faceted histogram
self_made

This code chunk produces a histogram that visualizes the distribution of billionaires by age, illustrating the frequency of billionaires at each age interval. The histogram appears to approximate a normal distribution, suggesting that it is unimodal and symmetrical, which is typical for demographic data like age.

The second part of the code enhances the histogram by adding facets to differentiate between self-made billionaires and those who are not. The faceted histogram reveals a higher number of self-made billionaires, as indicated by the ‘Self Made’ facet. This suggests that a larger proportion of billionaires have accumulated their wealth through their endeavors rather than inheriting it.

Additionally, the distribution for non-self-made billionaires skews younger, which aligns with the notion that inherited wealth can confer billionaire status at an earlier age. Conversely, the distribution for self-made billionaires peaks around middle age, indicating that self-made individuals typically reach billionaire status later in life, likely due to the time it takes to build and grow successful enterprises.

Box Plot and Average Net Worth by Industry

# Visualization: Box Plot of Net Worth by Industry
# Adjusting the boxplot orientation for better readability and presentation
ggplot(billionaires_data, aes(x = industries, y = netWorth)) + 
  geom_boxplot() +
  labs(title = "Net Worth by Industry", x = "Industry", y = "Net Worth (in millions)") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x labels for better visibility
        plot.title = element_text(hjust = 0.5)) +  # Center the plot title
  scale_y_continuous(labels = scales::comma)  # Format the net worth with commas for readability

The box plot visualization presents a comprehensive view of the net worth distribution across various industries, revealing the spread and central tendencies within each sector. To ensure clarity, outliers were removed, which allowed us to observe that the Telecom industry unexpectedly has the highest median net worth—surpassing even the Technology sector. This finding prompts further inquiry into the dynamics and characteristics unique to these industries.

Despite the removal of outliers, ther is still significant variability, this is indicated by the data points extending beyond the box and whisker boundaries. This suggests a broad range of wealth within industries, hinting at underlying factors that may influence wealth accumulation.

# Calculate average net worth by industry
average_net_worth <- billionaires_full %>%
  group_by(industries) %>%
  summarise(AverageNetWorth = mean(netWorth, na.rm = TRUE))

# Bar plot for average net worth by industry
ggplot(average_net_worth, aes(x = reorder(industries, -AverageNetWorth), y = AverageNetWorth)) +
  geom_col(fill = "dodgerblue") +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = "Average Net Worth by Industry", x = "", y = "Average Net Worth (USD)")

The average net worth by industry graph is interesting to see. The top 5 industries by net worth is surprising to us. Automotive is most likely highly skewed by Elon Musk, the second richest person in the world. The same can be said for Fashion and Retail by Bernard Arnault, being the richest man in the world. Technology falls into the sixth highest even though it the industry for 4 out of the top 10 richest people in this data set.

Trend of Self-Made Billionaires Over Time

# Aggregate and plot the count of self-made and non-self-made billionaires by birth year
trend <- billionaires_data %>%
  group_by(birthYear) %>%
  summarise(SelfMadeCount = sum(selfMade == "TRUE", na.rm = TRUE),
            NotSelfMadeCount = sum(selfMade == "FALSE", na.rm = TRUE))

# Line plot for self-made and non-self-made billionaires trend
ggplot(trend) +
  geom_line(aes(x = birthYear, y = SelfMadeCount, color = "Self-Made")) +
  geom_line(aes(x = birthYear, y = NotSelfMadeCount, color = "Not Self-Made")) +
  scale_color_manual(values = c("Self-Made" = "coral", "Not Self-Made" = "steelblue")) +
  theme_bw() +
  labs(title = "Trend of Self-Made vs. Non-Self-Made Billionaires Over Time",
       x = "Birth Year",
       y = "Count",
       color = "Type") +
  theme(legend.title = element_blank())

The trend line for self-made billionaires over time reflects the historical evolution of entrepreneurship and wealth creation.

As the data reveals, there is a point where the number of self-made billionaires began to rise sharply (in the late 1990’s). Following this surge, we observe an increase in the proportion of non-self-made billionaires, possibly suggesting that inheritance and established wealth in the self-made billionaire demographic has trickled down to their offspring or family members.

Analyzing the Factor of Gender

# Count the number of billionaires by gender
gender_count <- billion %>%
  count(gender)

# Plot the count of billionaires by gender
ggplot(data = gender_count, aes(x = gender, y = n, fill = gender)) +
  geom_col(color = "black") +
  labs(title = "Number of Male and Female Billionaires", x = "Gender", y = "Count") +
  theme_bw() +
  scale_fill_manual(values = c("M" = "navy", "F" = "magenta")) +
  scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
  guides(fill = F)

# Count the number of self-made billionaires by gender
gender_self_made_count <- billionaires_data %>%
  count(gender, selfMade)

# Plot the percentage of self-made billionaires by gender
ggplot(data = gender_self_made_count, aes(x = gender, y = n, fill = selfMade)) +
  geom_bar(stat = "identity", position = "fill", color = "black") +
  scale_fill_manual(values = c("TRUE" = "lightblue", "FALSE" = "tomato"), labels = c('Not Self Made', 'Self Made')) +
  scale_y_continuous(labels = scales::percent, expand = c(0, 0, 0, 0.05)) +
  theme_bw() +
  labs(title = "Self-Made Percentage by Gender", x = "Gender", y = "Percentage") +
  theme(legend.title = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))

Gender Distribution Among Billionaires:

The gender distribution graph reveals a stark disparity in the number of male versus female billionaires. This disproportion suggests enduring barriers that women may face in accumulating wealth comparable to their male counterparts. Historical factors, access to resources, and societal roles likely contribute to this uneven representation.

Self-Made Status by Gender:

By examining the proportion of self-made billionaires within each gender, the data indicates that self-made status is more common among male billionaires than female. This could reflect differences in opportunities for entrepreneurship and wealth creation between genders. The graph prompts a discussion on the impact of social structures and cultural norms on the paths to becoming a self-made billionaire.

# Scatter plot of secondary education rate vs. net worth
ggplot(billionaires_data, aes(x = gross_tertiary_education_enrollment, y = netWorth, color = selfMade)) +
  geom_point() +
  labs(
    title = "Tertiary Education Enrollment vs. Billionaire Net Worth",
    x = "Tertiary Education Enrollment Rate",
    y = "Net Worth (USD)",
    caption = "*Excludes outliers"
  ) +
  theme_bw() +
  scale_color_manual(values = c("TRUE" = "lightblue", "FALSE" = "tomato"),
                     labels = c('Self Made', 'Not Self Made')) +
  theme(legend.title = element_blank())

# Scatter plot of primary education rate vs. net worth
ggplot(billionaires_data, aes(x = gross_primary_education_enrollment_country, y = netWorth, color = selfMade)) +
  geom_point() +
  labs(
    title = "Primary Education Enrollment vs. Billionaire Net Worth",
    x = "Primary Education Enrollment Rate",
    y = "Net Worth (USD)",
    caption = "*Excludes outliers"
  ) +
  theme_bw() +
  scale_color_manual(values = c("TRUE" = "lightblue", "FALSE" = "tomato"),
                     labels = c('Self Made', 'Not Self Made')) +
  theme(legend.title = element_blank())

In these two graphs we compare the net worth of every billionaire and compare it to the education rate in their respective countries. In the first plot we look at Primary education which includes elementary, middle and high school; While in the second graph we look at the tertiary education rate which includes college and grad school. From the graphs we can see that there does not seem to be relationship between the quality of education in billionaires respective home countries and self made billionaires; There are both self made billionaires in countries with lower rates of education as well as countries with high rates of education.

# Create a bar chart to visualize the percentage of self-made billionaires by industry
ggplot(billionaires_data, aes(x = category, fill = selfMade)) +
  geom_bar(position = "fill") +
  labs(
    title = "Self-Made Percentage by Industry",
    x = "Industry",
    y = "Self-Made Percentage"
  ) +
  coord_flip() +  # Flip coordinates for easier reading of industries
  scale_y_continuous(labels = scales::percent, expand = expansion(mult = c(0.05, 0))) +
  theme_bw() +
  scale_fill_manual(
    values = c("TRUE" = "lightblue", "FALSE" = "tomato"),
    labels = c('Self Made', 'Not Self Made')
  ) +
  theme(
    legend.title = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis text for better legibility
  )

In this graph we see the percentage of self made billionaires by industry. We can see almost all industries have self made and not self made billionaires. Most industries have at least 50% of all billionaires are self made with a few exceptions. We also that that some industries have a high percentage of self made billionaires, such as technology and telecom.

Linear Regression Model

# Fitting a linear regression model to predict netWorth
lm_model <- lm(netWorth ~ age + country + industries + gdp_country, data = billionaires_data)
summary(lm_model)
## 
## Call:
## lm(formula = netWorth ~ age + country + industries + gdp_country, 
##     data = billionaires_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4840.0 -1169.9  -491.3   654.3  6158.9 
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           3327.749   1751.468   1.900   0.0576 .  
## age                                     14.068      3.109   4.525 6.37e-06 ***
## countryAndorra                       -3065.441   2428.931  -1.262   0.2071    
## countryArgentina                     -2745.182   1985.059  -1.383   0.1668    
## countryArmenia                       -3104.575   2429.575  -1.278   0.2015    
## countryAustralia                     -2307.976   1741.949  -1.325   0.1853    
## countryAustria                        -314.018   1804.071  -0.174   0.8618    
## countryBahrain                       -2975.562   2439.161  -1.220   0.2226    
## countryBelgium                       -3193.563   2104.386  -1.518   0.1293    
## countryBrazil                        -2248.076   1738.453  -1.293   0.1961    
## countryCambodia                      -2211.656   2458.280  -0.900   0.3684    
## countryCanada                        -1836.918   1741.434  -1.055   0.2916    
## countryChile                         -2015.894   1883.361  -1.070   0.2846    
## countryChina                         -1886.674   1720.983  -1.096   0.2731    
## countryColombia                       1588.964   2428.973   0.654   0.5131    
## countryCyprus                        -2338.655   1885.408  -1.240   0.2150    
## countryCzech Republic                 -986.749   1881.323  -0.524   0.6000    
## countryDenmark                        1622.904   1839.232   0.882   0.3777    
## countryEgypt                         -1018.608   1925.463  -0.529   0.5968    
## countryFinland                       -2360.473   1838.596  -1.284   0.1993    
## countryFrance                         -395.583   1750.394  -0.226   0.8212    
## countryGeorgia                         412.536   2428.763   0.170   0.8651    
## countryGermany                       -1602.702   1730.017  -0.926   0.3543    
## countryGreece                        -1585.758   1988.284  -0.798   0.4252    
## countryHungary                       -3062.556   1986.669  -1.542   0.1233    
## countryIndia                         -1956.589   1724.409  -1.135   0.2567    
## countryIndonesia                     -2301.165   1757.589  -1.309   0.1906    
## countryIsrael                        -2015.271   1752.254  -1.150   0.2502    
## countryItaly                         -1738.258   1734.858  -1.002   0.3165    
## countryJapan                         -2550.371   1744.111  -1.462   0.1438    
## countryKazakhstan                     -597.626   1840.489  -0.325   0.7454    
## countryLatvia                         -301.532   2428.728  -0.124   0.9012    
## countryLebanon                       -1835.631   2132.470  -0.861   0.3894    
## countryLiechtenstein                 -1874.890   2430.374  -0.771   0.4405    
## countryLuxembourg                     -961.584   2430.182  -0.396   0.6924    
## countryMalaysia                      -2220.792   1804.138  -1.231   0.2185    
## countryMexico                         -933.651   1801.183  -0.518   0.6043    
## countryMonaco                        -1306.256   1772.294  -0.737   0.4612    
## countryMorocco                       -3224.003   2104.364  -1.532   0.1257    
## countryNepal                         -2707.311   2431.511  -1.113   0.2657    
## countryNetherlands                   -1966.120   1802.981  -1.090   0.2756    
## countryNew Zealand                   -2928.164   2434.933  -1.203   0.2293    
## countryNigeria                        2663.792   2108.363   1.263   0.2066    
## countryNorway                        -1274.502   1813.394  -0.703   0.4822    
## countryOman                          -2046.473   2431.137  -0.842   0.4000    
## countryPeru                          -3437.310   2104.356  -1.633   0.1025    
## countryPhilippines                   -2008.014   1785.811  -1.124   0.2610    
## countryPoland                         -723.746   1883.160  -0.384   0.7008    
## countryPortugal                       -102.747   2431.215  -0.042   0.9663    
## countryQatar                         -3025.524   2104.560  -1.438   0.1507    
## countryRomania                       -2888.968   1985.105  -1.455   0.1457    
## countryRussia                        -1813.456   1732.868  -1.047   0.2954    
## countrySingapore                     -1881.490   1738.301  -1.082   0.2792    
## countrySlovakia                      -2741.947   2105.573  -1.302   0.1930    
## countrySouth Africa                  -1019.888   1924.871  -0.530   0.5963    
## countrySouth Korea                   -1751.762   1750.259  -1.001   0.3170    
## countrySpain                         -2045.965   1756.474  -1.165   0.2442    
## countrySweden                        -1950.535   1757.625  -1.110   0.2672    
## countrySwitzerland                   -1287.409   1732.090  -0.743   0.4574    
## countryTanzania                      -2725.943   2433.422  -1.120   0.2628    
## countryThailand                      -2221.909   1756.700  -1.265   0.2061    
## countryTurkey                        -2260.682   1754.268  -1.289   0.1977    
## countryUkraine                       -2405.685   1864.345  -1.290   0.1971    
## countryUnited Arab Emirates          -1043.696   1789.026  -0.583   0.5597    
## countryUnited Kingdom                -1528.292   1732.370  -0.882   0.3778    
## countryUnited States                 -1220.806   1719.384  -0.710   0.4778    
## countryUruguay                       -2989.224   2434.583  -1.228   0.2197    
## countryVietnam                       -2192.685   1856.308  -1.181   0.2377    
## industriesConstruction & Engineering  -268.773    360.265  -0.746   0.4557    
## industriesDiversified                  236.981    273.475   0.867   0.3863    
## industriesEnergy                       364.142    305.068   1.194   0.2328    
## industriesFashion & Retail             275.664    258.322   1.067   0.2860    
## industriesFinance & Investments        217.134    253.193   0.858   0.3912    
## industriesFood & Beverage              160.849    264.231   0.609   0.5428    
## industriesGambling & Casinos           628.779    454.681   1.383   0.1668    
## industriesHealthcare                   -67.304    260.892  -0.258   0.7965    
## industriesLogistics                    445.161    411.476   1.082   0.2794    
## industriesManufacturing                126.290    250.804   0.504   0.6146    
## industriesMedia & Entertainment        242.245    302.823   0.800   0.4238    
## industriesMetals & Mining              427.983    337.147   1.269   0.2044    
## industriesReal Estate                  191.877    268.895   0.714   0.4756    
## industriesService                       50.480    338.432   0.149   0.8814    
## industriesSports                      -322.951    371.774  -0.869   0.3851    
## industriesTechnology                   106.886    255.884   0.418   0.6762    
## industriesTelecom                      287.925    423.114   0.680   0.4963    
## gdp_country                                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1713 on 2114 degrees of freedom
##   (140 observations deleted due to missingness)
## Multiple R-squared:  0.09846,    Adjusted R-squared:  0.06264 
## F-statistic: 2.749 on 84 and 2114 DF,  p-value: 1.386e-14
# Scatter plot for Age vs Net Worth with a regression line
ggplot(billionaires_data, aes(x = age, y = netWorth)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +  # se = FALSE removes the confidence interval shading
  labs(title = "Effect of Age on Net Worth", x = "Age", y = "Net Worth (USD)") +
  theme_minimal()

Our regression analysis identifies age as a significant predictor of net worth, suggesting a potential compounding effect of wealth over time.

Age: This is a significant predictor of net worth, with a positive coefficient indicating that as age increases, so does net worth.

Country: The country coefficient is also significant, which suggests that the country a billionaire is from has a significant effect on their net worth.

GDP of Country (gdp_country): This has a very small but significant positive effect on net worth.

Industries: None of the industries are significantly predicting net worth in this model, given the p-values are greater than 0.05.

Regression Analysis to Identify Key Success Factors

## Regression Analysis to Identify Key Success Factors

# Removing rows with any missing values to ensure a clean dataset for the model
cleaned_data <- billionaires_data %>%
  select(netWorth, age, country, industries, gdp_country) %>%  # Select relevant variables
  na.omit()  # Remove rows with NA values

# Load the randomForest package for regression analysis
library(randomForest)

# Fit a Random Forest regression model to predict net worth based on selected features
rf_model <- randomForest(netWorth ~ ., data = cleaned_data)

# Extract the importance score for each feature used in the Random Forest model
importance_df <- as.data.frame(importance(rf_model))
importance_df$Feature <- rownames(importance_df)

# Plot the feature importance scores using a bar chart for better visualization
feature_importance_plot <- ggplot(importance_df, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip the plot for better readability of feature names
  labs(title = "Feature Importance in Predicting Net Worth", x = "Features", y = "Increase in Node Purity") +
  theme_minimal()  # Use a minimal theme for a clean look

# Output the feature importance plot to the report
feature_importance_plot

# Predict net worth using the fitted Random Forest model
predicted_values <- predict(rf_model, cleaned_data)

# Calculate residuals as the difference between the observed and predicted net worth
residuals <- cleaned_data$netWorth - predicted_values

# Plot residuals against the fitted values to assess the fit of the regression model
residuals_plot <- ggplot() +
  geom_point(aes(x = predicted_values, y = residuals)) +  # Scatter plot of residuals
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Reference line at zero
  labs(title = "Residuals vs. Fitted Values", x = "Fitted Values", y = "Residuals") +
  theme_minimal()  # Use a minimal theme for a clean look

# Output the residuals plot to the report
residuals_plot

### Feature Importance Plot:

This plot displays the importance of features as determined by the Random Forest model. The features are ranked by their contribution to increasing node purity within the model’s decision trees. The x-axis measures the increase in node purity, which reflects how much each feature contributes to homogenizing the nodes within the trees for the target variable, which in this case is net worth.

Key observations:

Age appears to be the most significant predictor of net worth, suggesting that a billionaire’s age has a strong relationship with their wealth.

Country and GDP of the Country (gdp_country) are also important features but to a lesser extent than age, indicating that the type of industry a billionaire is involved in and the economic output of their country are also influential.

Industries has the least importance among the four features shown, yet it still plays a role in the model’s predictions.

Residuals vs. Fitted Values Plot

This graph depicts the model’s residuals versus its fitted values. The residuals reflect the discrepancies between the actual and predicted net worths.

Key observations:

The spread of residuals across the range of fitted values shows a degree of variability in the model’s prediction accuracy. Ideally, the residuals should scatter randomly around the zero line, indicating that the model’s predictions are unbiased. The presence of patterns or trends in this plot might suggest model inadequacies, such as heteroscedasticity or omitted variable bias.

Classification Analysis

# Logistic Regression to predict selfMade status
logistic_model <- glm(selfMade ~ age + netWorth + industries, data = billionaires_data, family = "binomial")
summary(logistic_model)
## 
## Call:
## glm(formula = selfMade ~ age + netWorth + industries, family = "binomial", 
##     data = billionaires_data)
## 
## Coefficients:
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           7.343e-02  3.581e-01   0.205 0.837550    
## age                                   2.961e-03  3.674e-03   0.806 0.420262    
## netWorth                             -9.468e-05  2.633e-05  -3.595 0.000324 ***
## industriesConstruction & Engineering  8.592e-03  3.993e-01   0.022 0.982833    
## industriesDiversified                -1.886e-01  3.005e-01  -0.628 0.530306    
## industriesEnergy                      8.586e-01  3.520e-01   2.439 0.014710 *  
## industriesFashion & Retail            4.964e-01  2.910e-01   1.706 0.088050 .  
## industriesFinance & Investments       1.163e+00  2.876e-01   4.042 5.29e-05 ***
## industriesFood & Beverage             1.547e-01  2.979e-01   0.519 0.603557    
## industriesGambling & Casinos          1.316e+00  5.693e-01   2.312 0.020778 *  
## industriesHealthcare                  9.934e-01  3.065e-01   3.241 0.001191 ** 
## industriesLogistics                   1.478e+00  5.217e-01   2.833 0.004609 ** 
## industriesManufacturing               9.793e-01  2.882e-01   3.398 0.000679 ***
## industriesMedia & Entertainment       9.565e-01  3.565e-01   2.683 0.007291 ** 
## industriesMetals & Mining             1.097e+00  3.957e-01   2.773 0.005551 ** 
## industriesReal Estate                 7.739e-01  3.029e-01   2.555 0.010621 *  
## industriesService                     1.871e-01  3.816e-01   0.490 0.623951    
## industriesSports                      4.356e-01  4.283e-01   1.017 0.309121    
## industriesTechnology                  2.785e+00  3.661e-01   7.606 2.82e-14 ***
## industriesTelecom                     1.597e+00  5.560e-01   2.873 0.004070 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2871.5  on 2338  degrees of freedom
## Residual deviance: 2655.2  on 2319  degrees of freedom
## AIC: 2695.2
## 
## Number of Fisher Scoring iterations: 5

The classification model provides insights into the industries and factors that align with being self-made, highlighting the nuances that define self-made success.

Net Worth: It has a significant negative effect on the likelihood of being self-made. This could suggest that higher net worths are less likely to be self-made, but this interpretation should be considered cautiously without more context.

Age: Age is not a significant predictor of being self-made in this model.

Industries: Several industries are significant predictors of being self-made. For example, being in the Technology industry has a large positive coefficient, indicating a strong association with being self-made. Finance & Investments, Logistics, Media & Entertainment, and Metals & Mining are other significant industries.

Conclusion:

Our analysis aimed to uncover the variables that distinguish one billionaire from another. We observed that a high concentration of billionaires is found in countries with robust GDPs, suggesting a correlation between national economic health and the propensity for generating extreme wealth. Age emerged as a significant determinant of net worth, implying that wealth tends to accumulate over a lifetime. Additionally, we investigated the impact of being self-made on a billionaire’s success. The findings suggest that while net worth and industry play pivotal roles in classifying a billionaire as self-made, these factors do not necessarily correlate with the magnitude of their wealth. This insight aligns with the notion that wealth accumulation is a function of time, with the potential for net worth to escalate as one progresses through life.

Limitations and Recommendations:

Our study is constrained by its exclusive focus on billionaires, leaving a gap in comparison with the general population or millionaires. The dataset primarily captures macro-level variables such as net worth and industry, rather than individual-level factors like education, which may also influence a person’s financial trajectory. Future research could delve into these personal attributes to better understand the fabric of financial success. There’s a broader societal interest in the mechanisms of becoming a billionaire; an in-depth exploration could provide valuable insights. As socio-economic landscapes rapidly evolve, it is imperative to regularly update and expand datasets to reflect current realities and maintain relevance in predictive analyses.