In an era where wealth concentration and economic disparity are at the forefront of global discussions, understanding the dynamics of billionaire wealth is vital. This project contains statistics on the world’s billionaires, including information about their businesses, industries, and personal details. It provides insights into the wealth distribution, business sectors, and demographics of billionaires worldwide. We chose this to analyze what makes billionaires billionaires and how they compare to one another.
This is the question that drives our analysis: What makes a billionaire more successful than another?
This code chunk represents the data preprocessing steps performed on the “Billionaires Statistics Dataset” to ensure that the data is clean, well-structured, and ready for analysis.
Loading the Dataset: The code begins by loading the dataset from a CSV file named “Billionaires Statistics Dataset.csv” while suppressing the display of column data types.
Column Renaming: Column names are renamed for clarity and consistency. Specifically, the column originally named ‘finalWorth’ is renamed to ‘netWorth’ for better interpretation.
Handling Missing Values: Missing values in critical columns, such as ‘netWorth’ and ‘age,’ are addressed. Rows with missing values in these columns are removed from the dataset. Additionally, missing values in the ‘country’ column are replaced with the label ‘Unknown’ to ensure completeness.
Data Type Conversion: The ‘selfMade’ column is converted to a factor variable, making it suitable for categorical analysis.
Date Parsing: The ‘birthDate’ column is parsed into a standardized date-time format. Any date parsing failures are identified and marked as ‘NA.’
Numeric Conversion and Text Standardization: The ‘gdp_country’ column, which may contain currency symbols and commas, is converted to a numeric format. The ‘country’ column’s text values are standardized by converting them to proper case.
Handling Empty Strings: Any empty string values in the dataset are replaced with ‘NA’ for consistency.
Outlier Detection and Removal: Outliers in the ‘netWorth’ column are identified and removed based on the Interquartile Range (IQR) method. This step helps ensure that extreme values do not distort the analysis.
Duplicate Removal: Duplicate rows in the dataset are eliminated to maintain data integrity and avoid redundancy.
Please note that the code includes commented-out lines for viewing and summarizing the cleaned dataset, which can be uncommented if you wish to inspect the dataset after preprocessing.
These preprocessing steps are crucial for preparing the dataset for subsequent analysis and modeling, ensuring that the data is reliable and suitable for answering research questions and deriving meaningful insights.
# Loading the dataset
billionaires_data <- read_csv("Billionaires Statistics Dataset.csv", show_col_types = FALSE)
# Renaming columns for clarity
billionaires_data <- rename(billionaires_data, netWorth = finalWorth)
# Handling missing values
billionaires_data <- billionaires_data %>%
filter(!is.na(netWorth) & !is.na(age))
billionaires_data$country <- ifelse(is.na(billionaires_data$country), 'Unknown', billionaires_data$country)
# Data type conversion
billionaires_data$selfMade <- as.factor(billionaires_data$selfMade)
# Parsing dates (corrected format)
billionaires_data$birthDate <- mdy_hm(billionaires_data$birthDate)
failed_dates <- which(is.na(billionaires_data$birthDate))
billionaires_data$birthDate[failed_dates] <- NA
# Parsing 'gdp_country' to numeric and standardizing text data
billionaires_data$gdp_country <- as.numeric(gsub("[\\$,]", "", billionaires_data$gdp_country))
billionaires_data$country <- str_to_title(tolower(billionaires_data$country))
# Handling empty strings
billionaires_data <- billionaires_data %>%
mutate_if(is.character, na_if, "")
# Outlier detection and removal (for 'netWorth')
worth_iqr <- IQR(billionaires_data$netWorth, na.rm = TRUE)
worth_quantiles <- quantile(billionaires_data$netWorth, probs = c(0.25, 0.75), na.rm = TRUE)
worth_upper <- worth_quantiles[2] + 1.5 * worth_iqr
worth_lower <- worth_quantiles[1] - 1.5 * worth_iqr
billionaires_data <- billionaires_data %>%
filter(netWorth >= worth_lower & netWorth <= worth_upper)
# Removing duplicates
billionaires_data <- distinct(billionaires_data)
# # Viewing and summarizing the cleaned dataset
# head(billionaires_data)
# summary(billionaires_data)
# str(billionaires_data)
The second chunk of code is designed to perform various data cleaning tasks on a dataset, but it does not include a step for outlier detection and removal. This is neccesary for some of our graphs and calculations.
# Loading the dataset
billion <- read_csv("Billionaires Statistics Dataset.csv", show_col_types = FALSE)
# Handling missing values
billion <- billion %>%
filter(!is.na(finalWorth) & !is.na(age))
billion$country <- ifelse(is.na(billion$country), 'Unknown', billion$country)
# Data type conversion
billion$selfMade <- as.factor(billion$selfMade)
# Parsing dates (corrected format)
billion$birthDate <- mdy_hm(billion$birthDate)
failed_dates <- which(is.na(billion$birthDate))
billion$birthDate[failed_dates] <- NA
# Parsing 'gdp_country' to numeric and standardizing text data
billion$gdp_country <- as.numeric(gsub("[\\$,]", "", billion$gdp_country))
billion$country <- str_to_title(tolower(billion$country))
# Handling empty strings
billion <- billion %>%
mutate_if(is.character, na_if, "")
# Renaming columns for clarity
billion <- rename(billion, netWorth = finalWorth)
# Removing duplicates
billionaires_full <- distinct(billion)
# # Viewing and summarizing the cleaned dataset
# head(billionaires_full)
# summary(billionaires_full)
# str(billionaires_full)
The series of world maps visually describe the global distribution of billionaires against the backdrop of national GDPs and total billionaire net worth. These visual insights suggest a correlation between the economic prosperity of a country and the concentration of its high-net-worth individuals. The histograms and box plots further dissect the age and industry factors, potentially illuminating paths to wealth accumulation.
# Renaming 'region' column in WorldData to 'country' for easy merging
colnames(WorldData)[colnames(WorldData) == 'region'] <- 'country'
# Creating a summarized dataset for merging
# This includes filtering, renaming, and aggregating data
billions <-
billionaires_full |>
filter(country != "") |>
mutate(country = if_else(country == "United States", "USA", country)) |>
group_by(country) |>
summarize(
n = n(), # Count of billionaires per country
total_money = sum(netWorth), # Total net worth per country
gdp_country = first(gdp_country) # GDP of the country
)
# Merging the billionaire data with world map data
world_billion <- left_join(WorldData, billions, by = "country")
world_billion <- world_billion |> filter(long <= 180)
# Creating a world map visualization of the number of billionaires per country
world <- ggplot(data = world_billion,
aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = n), color = "black") +
theme_map() +
labs(
title = "Number of Billionaires per Country",
fill = "Billionaire Count",
caption = "*Grey countries have no billionaires"
) +
scale_fill_viridis_b(breaks = c(seq(0, 800, 150))) +
coord_map("moll") +
theme(plot.title = element_text(hjust = 0.5))
world
# Creating a world map visualization of the GDP per country
world_gdp <- ggplot(data = world_billion,
aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = gdp_country), color = "black") +
theme_map() +
labs(
title = "Country GDP",
fill = "GDP",
caption = "*Grey countries have no billionaires"
) +
scale_fill_viridis_b(labels = c("5 Trillion", "10 Trillion", "15 Trillion", "20 Trillion")) +
coord_map("moll") +
theme(plot.title = element_text(hjust = 0.5))
world_gdp
# Creating a world map visualization of the total net worth per country
world_money <- ggplot(data = world_billion,
aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = total_money), color = "black") +
theme_map() +
labs(
title = "Total Net Worth per Country",
fill = "Total Net Worth",
caption = "*Grey countries have no billionaires"
) +
scale_fill_viridis_b() +
coord_map("moll") +
theme(plot.title = element_text(hjust = 0.5))
world_money
The first map shows how many billionaires there are in each country. Lighter colors mean more billionaires, and it’s clear that the USA and China have the most. The second map does the same for each country’s GDP. Not surprisingly, there’s a strong resemblance between the two maps: countries with higher GDPs, like the USA and China, also have more billionaires.
The third map looks at the total wealth of all billionaires within a country (in Millions). This map also highlights the USA and China, showing that these countries don’t just have more billionaires, but billionaires residing in these countries are also wealthier overall.
All three maps point to a common idea: in countries with larger economies, there seems to be a higher chance of becoming a billionaire. This suggests that a strong economy might help people build more wealth.
The following histograms provide a snapshot of the age distribution across our billionaire dataset. The shape of the distribution, skewed towards middle age, may indicate peak periods of wealth accumulation.
# Converting 'selfMade' to a factor with descriptive labels
billionaires_full$selfMade <- factor(billionaires_full$selfMade,
levels = c(FALSE, TRUE),
labels = c("Not Self Made", "Self Made"))
# Creating a histogram to visualize the distribution of billionaires by age
age_hist <- ggplot(data = billionaires_full, aes(x = age)) +
geom_histogram(binwidth = 3, color = "black", fill = "green4") +
scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
theme_bw() +
labs(
title = "Billionaires by Age",
x = "Age",
y = "Billionaire Count"
) +
theme(plot.title = element_text(hjust = 0.5))
# Displaying the histogram
age_hist
# Creating a faceted histogram to compare self-made billionaires and others by age
self_made <- age_hist +
facet_wrap(~selfMade) +
labs(
title = "Billionaires by Age: Self-made vs Not Self-Made",
x = "Age",
y = "Billionaire Count"
)
# Displaying the faceted histogram
self_made
This code chunk produces a histogram that visualizes the distribution of billionaires by age, illustrating the frequency of billionaires at each age interval. The histogram appears to approximate a normal distribution, suggesting that it is unimodal and symmetrical, which is typical for demographic data like age.
The second part of the code enhances the histogram by adding facets to differentiate between self-made billionaires and those who are not. The faceted histogram reveals a higher number of self-made billionaires, as indicated by the ‘Self Made’ facet. This suggests that a larger proportion of billionaires have accumulated their wealth through their endeavors rather than inheriting it.
Additionally, the distribution for non-self-made billionaires skews younger, which aligns with the notion that inherited wealth can confer billionaire status at an earlier age. Conversely, the distribution for self-made billionaires peaks around middle age, indicating that self-made individuals typically reach billionaire status later in life, likely due to the time it takes to build and grow successful enterprises.
# Visualization: Box Plot of Net Worth by Industry
# Adjusting the boxplot orientation for better readability and presentation
ggplot(billionaires_data, aes(x = industries, y = netWorth)) +
geom_boxplot() +
labs(title = "Net Worth by Industry", x = "Industry", y = "Net Worth (in millions)") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x labels for better visibility
plot.title = element_text(hjust = 0.5)) + # Center the plot title
scale_y_continuous(labels = scales::comma) # Format the net worth with commas for readability
The box plot visualization presents a comprehensive view of the net worth distribution across various industries, revealing the spread and central tendencies within each sector. To ensure clarity, outliers were removed, which allowed us to observe that the Telecom industry unexpectedly has the highest median net worth—surpassing even the Technology sector. This finding prompts further inquiry into the dynamics and characteristics unique to these industries.
Despite the removal of outliers, ther is still significant variability, this is indicated by the data points extending beyond the box and whisker boundaries. This suggests a broad range of wealth within industries, hinting at underlying factors that may influence wealth accumulation.
# Calculate average net worth by industry
average_net_worth <- billionaires_full %>%
group_by(industries) %>%
summarise(AverageNetWorth = mean(netWorth, na.rm = TRUE))
# Bar plot for average net worth by industry
ggplot(average_net_worth, aes(x = reorder(industries, -AverageNetWorth), y = AverageNetWorth)) +
geom_col(fill = "dodgerblue") +
coord_flip() +
theme_bw() +
theme(legend.position = "none") +
labs(title = "Average Net Worth by Industry", x = "", y = "Average Net Worth (USD)")
The average net worth by industry graph is interesting to see. The top 5
industries by net worth is surprising to us. Automotive is most likely
highly skewed by Elon Musk, the second richest person in the world. The
same can be said for Fashion and Retail by Bernard Arnault, being the
richest man in the world. Technology falls into the sixth highest even
though it the industry for 4 out of the top 10 richest people in this
data set.
# Aggregate and plot the count of self-made and non-self-made billionaires by birth year
trend <- billionaires_data %>%
group_by(birthYear) %>%
summarise(SelfMadeCount = sum(selfMade == "TRUE", na.rm = TRUE),
NotSelfMadeCount = sum(selfMade == "FALSE", na.rm = TRUE))
# Line plot for self-made and non-self-made billionaires trend
ggplot(trend) +
geom_line(aes(x = birthYear, y = SelfMadeCount, color = "Self-Made")) +
geom_line(aes(x = birthYear, y = NotSelfMadeCount, color = "Not Self-Made")) +
scale_color_manual(values = c("Self-Made" = "coral", "Not Self-Made" = "steelblue")) +
theme_bw() +
labs(title = "Trend of Self-Made vs. Non-Self-Made Billionaires Over Time",
x = "Birth Year",
y = "Count",
color = "Type") +
theme(legend.title = element_blank())
The trend line for self-made billionaires over time reflects the historical evolution of entrepreneurship and wealth creation.
As the data reveals, there is a point where the number of self-made billionaires began to rise sharply (in the late 1990’s). Following this surge, we observe an increase in the proportion of non-self-made billionaires, possibly suggesting that inheritance and established wealth in the self-made billionaire demographic has trickled down to their offspring or family members.
# Count the number of billionaires by gender
gender_count <- billion %>%
count(gender)
# Plot the count of billionaires by gender
ggplot(data = gender_count, aes(x = gender, y = n, fill = gender)) +
geom_col(color = "black") +
labs(title = "Number of Male and Female Billionaires", x = "Gender", y = "Count") +
theme_bw() +
scale_fill_manual(values = c("M" = "navy", "F" = "magenta")) +
scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
guides(fill = F)
# Count the number of self-made billionaires by gender
gender_self_made_count <- billionaires_data %>%
count(gender, selfMade)
# Plot the percentage of self-made billionaires by gender
ggplot(data = gender_self_made_count, aes(x = gender, y = n, fill = selfMade)) +
geom_bar(stat = "identity", position = "fill", color = "black") +
scale_fill_manual(values = c("TRUE" = "lightblue", "FALSE" = "tomato"), labels = c('Not Self Made', 'Self Made')) +
scale_y_continuous(labels = scales::percent, expand = c(0, 0, 0, 0.05)) +
theme_bw() +
labs(title = "Self-Made Percentage by Gender", x = "Gender", y = "Percentage") +
theme(legend.title = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))
Gender Distribution Among Billionaires:
The gender distribution graph reveals a stark disparity in the number of male versus female billionaires. This disproportion suggests enduring barriers that women may face in accumulating wealth comparable to their male counterparts. Historical factors, access to resources, and societal roles likely contribute to this uneven representation.
Self-Made Status by Gender:
By examining the proportion of self-made billionaires within each gender, the data indicates that self-made status is more common among male billionaires than female. This could reflect differences in opportunities for entrepreneurship and wealth creation between genders. The graph prompts a discussion on the impact of social structures and cultural norms on the paths to becoming a self-made billionaire.
# Scatter plot of secondary education rate vs. net worth
ggplot(billionaires_data, aes(x = gross_tertiary_education_enrollment, y = netWorth, color = selfMade)) +
geom_point() +
labs(
title = "Tertiary Education Enrollment vs. Billionaire Net Worth",
x = "Tertiary Education Enrollment Rate",
y = "Net Worth (USD)",
caption = "*Excludes outliers"
) +
theme_bw() +
scale_color_manual(values = c("TRUE" = "lightblue", "FALSE" = "tomato"),
labels = c('Self Made', 'Not Self Made')) +
theme(legend.title = element_blank())
# Scatter plot of primary education rate vs. net worth
ggplot(billionaires_data, aes(x = gross_primary_education_enrollment_country, y = netWorth, color = selfMade)) +
geom_point() +
labs(
title = "Primary Education Enrollment vs. Billionaire Net Worth",
x = "Primary Education Enrollment Rate",
y = "Net Worth (USD)",
caption = "*Excludes outliers"
) +
theme_bw() +
scale_color_manual(values = c("TRUE" = "lightblue", "FALSE" = "tomato"),
labels = c('Self Made', 'Not Self Made')) +
theme(legend.title = element_blank())
In these two graphs we compare the net worth of every billionaire and compare it to the education rate in their respective countries. In the first plot we look at Primary education which includes elementary, middle and high school; While in the second graph we look at the tertiary education rate which includes college and grad school. From the graphs we can see that there does not seem to be relationship between the quality of education in billionaires respective home countries and self made billionaires; There are both self made billionaires in countries with lower rates of education as well as countries with high rates of education.
# Create a bar chart to visualize the percentage of self-made billionaires by industry
ggplot(billionaires_data, aes(x = category, fill = selfMade)) +
geom_bar(position = "fill") +
labs(
title = "Self-Made Percentage by Industry",
x = "Industry",
y = "Self-Made Percentage"
) +
coord_flip() + # Flip coordinates for easier reading of industries
scale_y_continuous(labels = scales::percent, expand = expansion(mult = c(0.05, 0))) +
theme_bw() +
scale_fill_manual(
values = c("TRUE" = "lightblue", "FALSE" = "tomato"),
labels = c('Self Made', 'Not Self Made')
) +
theme(
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis text for better legibility
)
In this graph we see the percentage of self made billionaires by industry. We can see almost all industries have self made and not self made billionaires. Most industries have at least 50% of all billionaires are self made with a few exceptions. We also that that some industries have a high percentage of self made billionaires, such as technology and telecom.
# Fitting a linear regression model to predict netWorth
lm_model <- lm(netWorth ~ age + country + industries + gdp_country, data = billionaires_data)
summary(lm_model)
##
## Call:
## lm(formula = netWorth ~ age + country + industries + gdp_country,
## data = billionaires_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4840.0 -1169.9 -491.3 654.3 6158.9
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3327.749 1751.468 1.900 0.0576 .
## age 14.068 3.109 4.525 6.37e-06 ***
## countryAndorra -3065.441 2428.931 -1.262 0.2071
## countryArgentina -2745.182 1985.059 -1.383 0.1668
## countryArmenia -3104.575 2429.575 -1.278 0.2015
## countryAustralia -2307.976 1741.949 -1.325 0.1853
## countryAustria -314.018 1804.071 -0.174 0.8618
## countryBahrain -2975.562 2439.161 -1.220 0.2226
## countryBelgium -3193.563 2104.386 -1.518 0.1293
## countryBrazil -2248.076 1738.453 -1.293 0.1961
## countryCambodia -2211.656 2458.280 -0.900 0.3684
## countryCanada -1836.918 1741.434 -1.055 0.2916
## countryChile -2015.894 1883.361 -1.070 0.2846
## countryChina -1886.674 1720.983 -1.096 0.2731
## countryColombia 1588.964 2428.973 0.654 0.5131
## countryCyprus -2338.655 1885.408 -1.240 0.2150
## countryCzech Republic -986.749 1881.323 -0.524 0.6000
## countryDenmark 1622.904 1839.232 0.882 0.3777
## countryEgypt -1018.608 1925.463 -0.529 0.5968
## countryFinland -2360.473 1838.596 -1.284 0.1993
## countryFrance -395.583 1750.394 -0.226 0.8212
## countryGeorgia 412.536 2428.763 0.170 0.8651
## countryGermany -1602.702 1730.017 -0.926 0.3543
## countryGreece -1585.758 1988.284 -0.798 0.4252
## countryHungary -3062.556 1986.669 -1.542 0.1233
## countryIndia -1956.589 1724.409 -1.135 0.2567
## countryIndonesia -2301.165 1757.589 -1.309 0.1906
## countryIsrael -2015.271 1752.254 -1.150 0.2502
## countryItaly -1738.258 1734.858 -1.002 0.3165
## countryJapan -2550.371 1744.111 -1.462 0.1438
## countryKazakhstan -597.626 1840.489 -0.325 0.7454
## countryLatvia -301.532 2428.728 -0.124 0.9012
## countryLebanon -1835.631 2132.470 -0.861 0.3894
## countryLiechtenstein -1874.890 2430.374 -0.771 0.4405
## countryLuxembourg -961.584 2430.182 -0.396 0.6924
## countryMalaysia -2220.792 1804.138 -1.231 0.2185
## countryMexico -933.651 1801.183 -0.518 0.6043
## countryMonaco -1306.256 1772.294 -0.737 0.4612
## countryMorocco -3224.003 2104.364 -1.532 0.1257
## countryNepal -2707.311 2431.511 -1.113 0.2657
## countryNetherlands -1966.120 1802.981 -1.090 0.2756
## countryNew Zealand -2928.164 2434.933 -1.203 0.2293
## countryNigeria 2663.792 2108.363 1.263 0.2066
## countryNorway -1274.502 1813.394 -0.703 0.4822
## countryOman -2046.473 2431.137 -0.842 0.4000
## countryPeru -3437.310 2104.356 -1.633 0.1025
## countryPhilippines -2008.014 1785.811 -1.124 0.2610
## countryPoland -723.746 1883.160 -0.384 0.7008
## countryPortugal -102.747 2431.215 -0.042 0.9663
## countryQatar -3025.524 2104.560 -1.438 0.1507
## countryRomania -2888.968 1985.105 -1.455 0.1457
## countryRussia -1813.456 1732.868 -1.047 0.2954
## countrySingapore -1881.490 1738.301 -1.082 0.2792
## countrySlovakia -2741.947 2105.573 -1.302 0.1930
## countrySouth Africa -1019.888 1924.871 -0.530 0.5963
## countrySouth Korea -1751.762 1750.259 -1.001 0.3170
## countrySpain -2045.965 1756.474 -1.165 0.2442
## countrySweden -1950.535 1757.625 -1.110 0.2672
## countrySwitzerland -1287.409 1732.090 -0.743 0.4574
## countryTanzania -2725.943 2433.422 -1.120 0.2628
## countryThailand -2221.909 1756.700 -1.265 0.2061
## countryTurkey -2260.682 1754.268 -1.289 0.1977
## countryUkraine -2405.685 1864.345 -1.290 0.1971
## countryUnited Arab Emirates -1043.696 1789.026 -0.583 0.5597
## countryUnited Kingdom -1528.292 1732.370 -0.882 0.3778
## countryUnited States -1220.806 1719.384 -0.710 0.4778
## countryUruguay -2989.224 2434.583 -1.228 0.2197
## countryVietnam -2192.685 1856.308 -1.181 0.2377
## industriesConstruction & Engineering -268.773 360.265 -0.746 0.4557
## industriesDiversified 236.981 273.475 0.867 0.3863
## industriesEnergy 364.142 305.068 1.194 0.2328
## industriesFashion & Retail 275.664 258.322 1.067 0.2860
## industriesFinance & Investments 217.134 253.193 0.858 0.3912
## industriesFood & Beverage 160.849 264.231 0.609 0.5428
## industriesGambling & Casinos 628.779 454.681 1.383 0.1668
## industriesHealthcare -67.304 260.892 -0.258 0.7965
## industriesLogistics 445.161 411.476 1.082 0.2794
## industriesManufacturing 126.290 250.804 0.504 0.6146
## industriesMedia & Entertainment 242.245 302.823 0.800 0.4238
## industriesMetals & Mining 427.983 337.147 1.269 0.2044
## industriesReal Estate 191.877 268.895 0.714 0.4756
## industriesService 50.480 338.432 0.149 0.8814
## industriesSports -322.951 371.774 -0.869 0.3851
## industriesTechnology 106.886 255.884 0.418 0.6762
## industriesTelecom 287.925 423.114 0.680 0.4963
## gdp_country NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1713 on 2114 degrees of freedom
## (140 observations deleted due to missingness)
## Multiple R-squared: 0.09846, Adjusted R-squared: 0.06264
## F-statistic: 2.749 on 84 and 2114 DF, p-value: 1.386e-14
# Scatter plot for Age vs Net Worth with a regression line
ggplot(billionaires_data, aes(x = age, y = netWorth)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue", se = FALSE) + # se = FALSE removes the confidence interval shading
labs(title = "Effect of Age on Net Worth", x = "Age", y = "Net Worth (USD)") +
theme_minimal()
Our regression analysis identifies age as a significant predictor of net
worth, suggesting a potential compounding effect of wealth over
time.
Age: This is a significant predictor of net worth, with a positive coefficient indicating that as age increases, so does net worth.
Country: The country coefficient is also significant, which suggests that the country a billionaire is from has a significant effect on their net worth.
GDP of Country (gdp_country): This has a very small but significant positive effect on net worth.
Industries: None of the industries are significantly predicting net worth in this model, given the p-values are greater than 0.05.
## Regression Analysis to Identify Key Success Factors
# Removing rows with any missing values to ensure a clean dataset for the model
cleaned_data <- billionaires_data %>%
select(netWorth, age, country, industries, gdp_country) %>% # Select relevant variables
na.omit() # Remove rows with NA values
# Load the randomForest package for regression analysis
library(randomForest)
# Fit a Random Forest regression model to predict net worth based on selected features
rf_model <- randomForest(netWorth ~ ., data = cleaned_data)
# Extract the importance score for each feature used in the Random Forest model
importance_df <- as.data.frame(importance(rf_model))
importance_df$Feature <- rownames(importance_df)
# Plot the feature importance scores using a bar chart for better visualization
feature_importance_plot <- ggplot(importance_df, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip the plot for better readability of feature names
labs(title = "Feature Importance in Predicting Net Worth", x = "Features", y = "Increase in Node Purity") +
theme_minimal() # Use a minimal theme for a clean look
# Output the feature importance plot to the report
feature_importance_plot
# Predict net worth using the fitted Random Forest model
predicted_values <- predict(rf_model, cleaned_data)
# Calculate residuals as the difference between the observed and predicted net worth
residuals <- cleaned_data$netWorth - predicted_values
# Plot residuals against the fitted values to assess the fit of the regression model
residuals_plot <- ggplot() +
geom_point(aes(x = predicted_values, y = residuals)) + # Scatter plot of residuals
geom_hline(yintercept = 0, linetype = "dashed", color = "red") + # Reference line at zero
labs(title = "Residuals vs. Fitted Values", x = "Fitted Values", y = "Residuals") +
theme_minimal() # Use a minimal theme for a clean look
# Output the residuals plot to the report
residuals_plot
### Feature Importance Plot:
This plot displays the importance of features as determined by the Random Forest model. The features are ranked by their contribution to increasing node purity within the model’s decision trees. The x-axis measures the increase in node purity, which reflects how much each feature contributes to homogenizing the nodes within the trees for the target variable, which in this case is net worth.
Key observations:
Age appears to be the most significant predictor of net worth, suggesting that a billionaire’s age has a strong relationship with their wealth.
Country and GDP of the Country (gdp_country) are also important features but to a lesser extent than age, indicating that the type of industry a billionaire is involved in and the economic output of their country are also influential.
Industries has the least importance among the four features shown, yet it still plays a role in the model’s predictions.
This graph depicts the model’s residuals versus its fitted values. The residuals reflect the discrepancies between the actual and predicted net worths.
Key observations:
The spread of residuals across the range of fitted values shows a degree of variability in the model’s prediction accuracy. Ideally, the residuals should scatter randomly around the zero line, indicating that the model’s predictions are unbiased. The presence of patterns or trends in this plot might suggest model inadequacies, such as heteroscedasticity or omitted variable bias.
# Logistic Regression to predict selfMade status
logistic_model <- glm(selfMade ~ age + netWorth + industries, data = billionaires_data, family = "binomial")
summary(logistic_model)
##
## Call:
## glm(formula = selfMade ~ age + netWorth + industries, family = "binomial",
## data = billionaires_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.343e-02 3.581e-01 0.205 0.837550
## age 2.961e-03 3.674e-03 0.806 0.420262
## netWorth -9.468e-05 2.633e-05 -3.595 0.000324 ***
## industriesConstruction & Engineering 8.592e-03 3.993e-01 0.022 0.982833
## industriesDiversified -1.886e-01 3.005e-01 -0.628 0.530306
## industriesEnergy 8.586e-01 3.520e-01 2.439 0.014710 *
## industriesFashion & Retail 4.964e-01 2.910e-01 1.706 0.088050 .
## industriesFinance & Investments 1.163e+00 2.876e-01 4.042 5.29e-05 ***
## industriesFood & Beverage 1.547e-01 2.979e-01 0.519 0.603557
## industriesGambling & Casinos 1.316e+00 5.693e-01 2.312 0.020778 *
## industriesHealthcare 9.934e-01 3.065e-01 3.241 0.001191 **
## industriesLogistics 1.478e+00 5.217e-01 2.833 0.004609 **
## industriesManufacturing 9.793e-01 2.882e-01 3.398 0.000679 ***
## industriesMedia & Entertainment 9.565e-01 3.565e-01 2.683 0.007291 **
## industriesMetals & Mining 1.097e+00 3.957e-01 2.773 0.005551 **
## industriesReal Estate 7.739e-01 3.029e-01 2.555 0.010621 *
## industriesService 1.871e-01 3.816e-01 0.490 0.623951
## industriesSports 4.356e-01 4.283e-01 1.017 0.309121
## industriesTechnology 2.785e+00 3.661e-01 7.606 2.82e-14 ***
## industriesTelecom 1.597e+00 5.560e-01 2.873 0.004070 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2871.5 on 2338 degrees of freedom
## Residual deviance: 2655.2 on 2319 degrees of freedom
## AIC: 2695.2
##
## Number of Fisher Scoring iterations: 5
The classification model provides insights into the industries and factors that align with being self-made, highlighting the nuances that define self-made success.
Net Worth: It has a significant negative effect on the likelihood of being self-made. This could suggest that higher net worths are less likely to be self-made, but this interpretation should be considered cautiously without more context.
Age: Age is not a significant predictor of being self-made in this model.
Industries: Several industries are significant predictors of being self-made. For example, being in the Technology industry has a large positive coefficient, indicating a strong association with being self-made. Finance & Investments, Logistics, Media & Entertainment, and Metals & Mining are other significant industries.
Our analysis aimed to uncover the variables that distinguish one billionaire from another. We observed that a high concentration of billionaires is found in countries with robust GDPs, suggesting a correlation between national economic health and the propensity for generating extreme wealth. Age emerged as a significant determinant of net worth, implying that wealth tends to accumulate over a lifetime. Additionally, we investigated the impact of being self-made on a billionaire’s success. The findings suggest that while net worth and industry play pivotal roles in classifying a billionaire as self-made, these factors do not necessarily correlate with the magnitude of their wealth. This insight aligns with the notion that wealth accumulation is a function of time, with the potential for net worth to escalate as one progresses through life.
Our study is constrained by its exclusive focus on billionaires, leaving a gap in comparison with the general population or millionaires. The dataset primarily captures macro-level variables such as net worth and industry, rather than individual-level factors like education, which may also influence a person’s financial trajectory. Future research could delve into these personal attributes to better understand the fabric of financial success. There’s a broader societal interest in the mechanisms of becoming a billionaire; an in-depth exploration could provide valuable insights. As socio-economic landscapes rapidly evolve, it is imperative to regularly update and expand datasets to reflect current realities and maintain relevance in predictive analyses.