Billionaires Statistics Project

Introduction

This project regards the world’s most elite billionaires and how their age relates to their net worth by industry. The variables being used are personName (the name of the billionaire), age, finalWorth (the net worth of the billionaire in $billions), and the top 5 biggest industries in the world. I plan to show you the relationship between the youngest and oldest billionaire’s net worth in each industry, comparing which generation billionaire has more money.

Source: Compilation from official reports, government databases, financial institutions, and reputable publications such as Forbes, Bloomberg, and World Bank.

Setting & Cleaning Up

I begin by loading in my libraries I plan to use and set my working directory. I also ask R to read the csv file so that there’s a simple overview.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("/Users/aashkanavale/Desktop/Montgomery College/MC Spring '24/DATA110/PROJECTS")
billionaires <- read_csv("billionaires.csv")

## Rows: 2640 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): category, personName, country, city, source, industries, countryOf...
## dbl (16): rank, finalWorth, age, birthYear, birthMonth, birthDay, cpi_countr...
## lgl  (1): selfMade
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now I removed any N/A values in my data set so that I don’t get N/A’s as a result of my work. After that, I filtered by industries and chose the top 5 biggest industries in the world currently, Fashion & Retail, Media & Entertainment, Technology, Healthcare, and Finance & Investments.

billionaires1 <- billionaires |>
  filter(!is.na(age)) |>
  filter(industries %in% c("Fashion & Retail", "Media & Entertainment", "Technology", "Healthcare", "Finance & Investments"))
billionaires1

## # A tibble: 1,212 × 35
##     rank finalWorth category    personName   age country city  source industries
##    <dbl>      <dbl> <chr>       <chr>      <dbl> <chr>   <chr> <chr>  <chr>     
##  1     1     211000 Fashion & … Bernard A…    74 France  Paris LVMH   Fashion &…
##  2     3     114000 Technology  Jeff Bezos    59 United… Medi… Amazon Technology
##  3     4     107000 Technology  Larry Ell…    78 United… Lanai Oracle Technology
##  4     5     106000 Finance & … Warren Bu…    92 United… Omaha Berks… Finance &…
##  5     6     104000 Technology  Bill Gates    67 United… Medi… Micro… Technology
##  6     7      94500 Media & En… Michael B…    81 United… New … Bloom… Media & E…
##  7    10      80700 Technology  Steve Bal…    67 United… Hunt… Micro… Technology
##  8    11      80500 Fashion & … Francoise…    69 France  Paris L'Oré… Fashion &…
##  9    12      79200 Technology  Larry Page    50 United… Palo… Google Technology
## 10    13      77300 Fashion & … Amancio O…    87 Spain   La C… Zara   Fashion &…
## # ℹ 1,202 more rows
## # ℹ 26 more variables: countryOfCitizenship <chr>, organization <chr>,
## #   selfMade <lgl>, status <chr>, gender <chr>, birthDate <chr>,
## #   lastName <chr>, firstName <chr>, title <chr>, date <chr>, state <chr>,
## #   residenceStateRegion <chr>, birthYear <dbl>, birthMonth <dbl>,
## #   birthDay <dbl>, cpi_country <dbl>, cpi_change_country <dbl>,
## #   gdp_country <chr>, gross_tertiary_education_enrollment <dbl>, …

Now I need to find the names and ages of the youngest billionaire in each industry. To do so, I grouped by industries and filtered by the minimum age to pull out the youngest billionaires from each of the 5 industries.

billionaires2 <- billionaires1 |>
  group_by(industries) |>
  filter(age == min(age))
billionaires2

## # A tibble: 5 × 35
## # Groups:   industries [5]
##    rank finalWorth category     personName   age country city  source industries
##   <dbl>      <dbl> <chr>        <chr>      <dbl> <chr>   <chr> <chr>  <chr>     
## 1   818       3500 Fashion & R… Clemente …    18 Italy   Milan Eyegl… Fashion &…
## 2   949       3100 Finance & I… Timur Tur…    35 Kazakh… Alma… Stock… Finance &…
## 3  1725       1700 Media & Ent… Kim Jung-…    19 <NA>    <NA>  Onlin… Media & E…
## 4  2405       1100 Technology   Ryan Bres…    28 United… Miami E-com… Technology
## 5  2540       1000 Healthcare   Ludwig Th…    33 Germany Mels… Medic… Healthcare
## # ℹ 26 more variables: countryOfCitizenship <chr>, organization <chr>,
## #   selfMade <lgl>, status <chr>, gender <chr>, birthDate <chr>,
## #   lastName <chr>, firstName <chr>, title <chr>, date <chr>, state <chr>,
## #   residenceStateRegion <chr>, birthYear <dbl>, birthMonth <dbl>,
## #   birthDay <dbl>, cpi_country <dbl>, cpi_change_country <dbl>,
## #   gdp_country <chr>, gross_tertiary_education_enrollment <dbl>,
## #   gross_primary_education_enrollment_country <dbl>, …

Since we found the minimum ages, we also have to find the maximum ages. To find the oldest billionaires, I once again group by industries and this time, I filtered by maximum age. This gives me the oldest billionaires from each of the 5 industries. Since there are 2 billionaires with the same age under the fashion & retail industry, Bernard Lewis and Nobutoshi Shimamura, I decided on a coin-toss for a fair pick between the two. Nobutoshi Shimamura won the toss.

billionaires3 <- billionaires1 |>
  group_by(industries) |>
  filter(age == max(age))
billionaires3

## # A tibble: 6 × 35
## # Groups:   industries [5]
##    rank finalWorth category     personName   age country city  source industries
##   <dbl>      <dbl> <chr>        <chr>      <dbl> <chr>   <chr> <chr>  <chr>     
## 1   352       6800 Technology   Gordon Mo…    94 United… Wood… Intel  Technology
## 2   591       4600 Media & Ent… Charles D…    96 United… Oyst… Cable… Media & E…
## 3  1368       2200 Healthcare   Alice Sch…    96 United… El C… Biote… Healthcare
## 4  2020       1400 Fashion & R… Bernard L…    97 United… Lond… Fashi… Fashion &…
## 5  2020       1400 Fashion & R… Nobutoshi…    97 Japan   Sait… Retail Fashion &…
## 6  2133       1300 Finance & I… George Jo…   101 United… Los … Insur… Finance &…
## # ℹ 26 more variables: countryOfCitizenship <chr>, organization <chr>,
## #   selfMade <lgl>, status <chr>, gender <chr>, birthDate <chr>,
## #   lastName <chr>, firstName <chr>, title <chr>, date <chr>, state <chr>,
## #   residenceStateRegion <chr>, birthYear <dbl>, birthMonth <dbl>,
## #   birthDay <dbl>, cpi_country <dbl>, cpi_change_country <dbl>,
## #   gdp_country <chr>, gross_tertiary_education_enrollment <dbl>,
## #   gross_primary_education_enrollment_country <dbl>, …

To remove Bernard Lewis from the dataset, I simply subsetted and removed the fourth row in the billionaires4 dataset.

billionaires3 <- billionaires3[-4, ]
billionaires3

## # A tibble: 5 × 35
## # Groups:   industries [5]
##    rank finalWorth category     personName   age country city  source industries
##   <dbl>      <dbl> <chr>        <chr>      <dbl> <chr>   <chr> <chr>  <chr>     
## 1   352       6800 Technology   Gordon Mo…    94 United… Wood… Intel  Technology
## 2   591       4600 Media & Ent… Charles D…    96 United… Oyst… Cable… Media & E…
## 3  1368       2200 Healthcare   Alice Sch…    96 United… El C… Biote… Healthcare
## 4  2020       1400 Fashion & R… Nobutoshi…    97 Japan   Sait… Retail Fashion &…
## 5  2133       1300 Finance & I… George Jo…   101 United… Los … Insur… Finance &…
## # ℹ 26 more variables: countryOfCitizenship <chr>, organization <chr>,
## #   selfMade <lgl>, status <chr>, gender <chr>, birthDate <chr>,
## #   lastName <chr>, firstName <chr>, title <chr>, date <chr>, state <chr>,
## #   residenceStateRegion <chr>, birthYear <dbl>, birthMonth <dbl>,
## #   birthDay <dbl>, cpi_country <dbl>, cpi_change_country <dbl>,
## #   gdp_country <chr>, gross_tertiary_education_enrollment <dbl>,
## #   gross_primary_education_enrollment_country <dbl>, …

# Source: ChatGPT

Now I’m going to take both tables and combine them to have all 10 billionaires in one dataset. To do so, I use the rbind() function.

billionaires4 <- rbind(billionaires2, billionaires3)
billionaires4

## # A tibble: 10 × 35
## # Groups:   industries [5]
##     rank finalWorth category    personName   age country city  source industries
##    <dbl>      <dbl> <chr>       <chr>      <dbl> <chr>   <chr> <chr>  <chr>     
##  1   818       3500 Fashion & … Clemente …    18 Italy   Milan Eyegl… Fashion &…
##  2   949       3100 Finance & … Timur Tur…    35 Kazakh… Alma… Stock… Finance &…
##  3  1725       1700 Media & En… Kim Jung-…    19 <NA>    <NA>  Onlin… Media & E…
##  4  2405       1100 Technology  Ryan Bres…    28 United… Miami E-com… Technology
##  5  2540       1000 Healthcare  Ludwig Th…    33 Germany Mels… Medic… Healthcare
##  6   352       6800 Technology  Gordon Mo…    94 United… Wood… Intel  Technology
##  7   591       4600 Media & En… Charles D…    96 United… Oyst… Cable… Media & E…
##  8  1368       2200 Healthcare  Alice Sch…    96 United… El C… Biote… Healthcare
##  9  2020       1400 Fashion & … Nobutoshi…    97 Japan   Sait… Retail Fashion &…
## 10  2133       1300 Finance & … George Jo…   101 United… Los … Insur… Finance &…
## # ℹ 26 more variables: countryOfCitizenship <chr>, organization <chr>,
## #   selfMade <lgl>, status <chr>, gender <chr>, birthDate <chr>,
## #   lastName <chr>, firstName <chr>, title <chr>, date <chr>, state <chr>,
## #   residenceStateRegion <chr>, birthYear <dbl>, birthMonth <dbl>,
## #   birthDay <dbl>, cpi_country <dbl>, cpi_change_country <dbl>,
## #   gdp_country <chr>, gross_tertiary_education_enrollment <dbl>,
## #   gross_primary_education_enrollment_country <dbl>, …

Here I reordered the dataset so that the youngest and oldest from each industry were side-by-side.

desiredorder <- c(1, 9, 3, 7, 4, 6, 5, 8, 2, 10)
billionaires5 <- billionaires4[desiredorder, ]
billionaires5

## # A tibble: 10 × 35
## # Groups:   industries [5]
##     rank finalWorth category    personName   age country city  source industries
##    <dbl>      <dbl> <chr>       <chr>      <dbl> <chr>   <chr> <chr>  <chr>     
##  1   818       3500 Fashion & … Clemente …    18 Italy   Milan Eyegl… Fashion &…
##  2  2020       1400 Fashion & … Nobutoshi…    97 Japan   Sait… Retail Fashion &…
##  3  1725       1700 Media & En… Kim Jung-…    19 <NA>    <NA>  Onlin… Media & E…
##  4   591       4600 Media & En… Charles D…    96 United… Oyst… Cable… Media & E…
##  5  2405       1100 Technology  Ryan Bres…    28 United… Miami E-com… Technology
##  6   352       6800 Technology  Gordon Mo…    94 United… Wood… Intel  Technology
##  7  2540       1000 Healthcare  Ludwig Th…    33 Germany Mels… Medic… Healthcare
##  8  1368       2200 Healthcare  Alice Sch…    96 United… El C… Biote… Healthcare
##  9   949       3100 Finance & … Timur Tur…    35 Kazakh… Alma… Stock… Finance &…
## 10  2133       1300 Finance & … George Jo…   101 United… Los … Insur… Finance &…
## # ℹ 26 more variables: countryOfCitizenship <chr>, organization <chr>,
## #   selfMade <lgl>, status <chr>, gender <chr>, birthDate <chr>,
## #   lastName <chr>, firstName <chr>, title <chr>, date <chr>, state <chr>,
## #   residenceStateRegion <chr>, birthYear <dbl>, birthMonth <dbl>,
## #   birthDay <dbl>, cpi_country <dbl>, cpi_change_country <dbl>,
## #   gdp_country <chr>, gross_tertiary_education_enrollment <dbl>,
## #   gross_primary_education_enrollment_country <dbl>, …

# Source: ChatGPT

Linear Regression Analysis

A linear regression analysis is a method to model the relationship between two or more quantitative variables. My regression shows the relationship between age and net worth.

I begin to plot my regression graph by calling upon my most recent dataset, billionaires5, and attaching ggplot() and geom_point() to create a scatterplot. Under aesthetics, I set the x-axis as age and y-axis as the net worth. Then I set the limits of my graph: x-axis from 0 - 101, since 101 is the oldest age and y-axis from 0 - 6800, because 6800 (in billions) is the highest net worth. I set the title to “Billionaire Age vs. Billionaire Net Worth” since we’re only trying to find the relationship between these two variables. I also changed the theme to minimal to have a simple grid in the background.

regression <- billionaires5 |>
  ggplot() +
  geom_point(aes(x = age, y = finalWorth)) +
   xlim(0, 101) + 
  ylim(0, 6800) +
  labs(title = "Billionaire Age vs. Billionaire Net Worth",
       x = "Billionaire Age",
       y = "Net Worth (in $billions)",
       caption = "Compilation from official reports, \n government databases, financial institutions, \n and reputable publications such as \n Forbes, Bloomberg, and World Bank") +
  theme_minimal(base_size = 12)

I’m not printing my graph just yet because there’s still some more to be added on. I use the geom_point() function to once again indicate the x-axis and y-axis variables and then set the color of the points or dots to be one of the colors from my custom color vector that will appear later. I use geom_smooth() to create that slope line. ” method = “lm” ” specifically creates the line, “lm” standing for linear model. This expression, “formula = y~x”, is used to specify a formula for a statistical model and after that, I chose the color of the line from a color vector that I manually created later in the project. The command, “se = FALSE” is used to state that “standard errors” are not to be included in the output. I chose the linetype to be a dot-dash line type and set the size to 1. Finally, I add in the title once more, reset the font to Times New Roman, and call upon my graph.

regression <- regression +
  geom_point(aes(x = age, y = finalWorth), col = "#233349") +
  geom_smooth(aes(x = age, y = finalWorth), 
              method = 'lm', formula = y~x, color = "#A8512F", se = FALSE, linetype = "dotdash", size = 1) +
  ggtitle("Billionaire Age vs. Billionaire Net Worth") +
  theme(text = element_text(family = "serif")) # Source: ChatGPT

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

regression

There is a slight but noticeable positive trend to the graph, meaning that the older the billionaire, the more likely they are to have a higher net worth than a younger counterpart.

cor(billionaires5$age, billionaires5$finalWorth)

## [1] 0.284433

fit1 <- lm(finalWorth ~ age, data = billionaires5)
summary(fit1)

## 
## Call:
## lm(formula = finalWorth ~ age, data = billionaires5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1929.6 -1218.5  -660.2  1283.7  3670.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1791.40    1207.81   1.483    0.176
## age            14.24      16.97   0.839    0.426
## 
## Residual standard error: 1904 on 8 degrees of freedom
## Multiple R-squared:  0.0809, Adjusted R-squared:  -0.03399 
## F-statistic: 0.7042 on 1 and 8 DF,  p-value: 0.4257

“cor()” stands for “correlation”. This value is always between -1 and 1. The correlation coefficient tells us how strong or weak the correlation is. Values closer to positive or negative 1 are strong correlation (the sign is determined by the linear slope and in this case, the linear slope is positive), values close to positive or negative 0.5 show a weak correlation, and values close to zero have no correlation.

Because my value is 0.284433, it’s above 0 but somewhat close to 0.5, meaning that it has a weak positive correlation, but it’s still there.

For a linear regression, the equation (y = mx + b) must be used.
The equation for my model is : finalWorth = 14.24(age) + 1791.4

How do we interpret the equation? As the age increases, there is a predicted increase in net worth by 14.24.

To check if the results are significant, we must look at the p-value. The levels of significance are typically 0.05, or 5%. My p-value is 0.4257, which is close to 0. The p-value is considered very significant to this entire experiment when we are investigating the correlation, it means that there is a weak yet still positive correlation between age and net worth.

Graphing

Because I want my graph to look more aesthetically pleasing, I created a color vector with hex codes from a color palette I found online.

desiredcolors <- c("#C4772A", "#9EAFB6", "#A8512F", "#EDD6CC", "#233349")

Originally, I didn’t include this chunk below and my bars were in alphabetical order. It took me quite some time to figure out what to do so that my bars were in correct, side-by-side position and including this code made it work.

billionaires5$personName <- factor(billionaires5$personName, levels = c("Clemente Del Vecchio", "Nobutoshi Shimamura", "Kim Jung-youn", "Charles Dolan & family", "Ryan Breslow", "Gordon Moore", "Ludwig Theodor Braun", "Alice Schwartz", "Timur Turlov", "George Joseph"))

Here is the actual graphing. I start by calling upon my most recent dataset, billionaires5, and I add to it. The general ggplot() function and I changed the theme to black and white to get a clean look behind my chart.

I began my bar chart by identifying x as the billionaire’s name, y as the net worth, and the bar colors by industry. Under aesthetics, I use stat = “identity” argument to specify that the y-values layer should be treated as the actual height of the bars. The position = “dodge” argument doesn’t do much if I remove it but if I’m being honest, there’s still more changes to be made to my chart and I’m afraid that removing it is going to alter my chart drastically so it stays for now. And then I set the color around the bars to another complimentary color.

I used the scale_x_discrete function to rename the names of the billionaires to include their ages. I couldn’t have mutated the variables because quantitative variables and categorical variables cannot combine. I used to enter a new line so that the names don’t overlap.

Regarding the legend, I used ChatGPT to help me figure out how to change the color of the outline and repositioned my legend to make the whole graph bigger and wider. After that, I used scale_fill_manual to set the colors of the bars to the manual color vector I created earlier.

Then came the labels. My title describes my graph as it is and each axis label is concise yet tells you everything you need to know. For the caption, my dataset didn’t come from a direct source. It came from a compilation of multiple sources, which is why the caption is so long. I also decided to change the font to Times New Roman and got the code from ChatGPT. Finally, I print my graph.

billgraph <- billionaires5 |>
  ggplot() +
  theme_classic() +
  geom_bar(aes(x = personName, y = finalWorth, fill = industries),
           stat = "identity", position = "dodge", color = "#D3A292") +
  scale_x_discrete(labels = c("Clemente D.\nVecchio\n(18 y/o)", "Nobutoshi\nShimamura\n(97 y/o)", "Kim\nJung-Youn\n(19 y/o)", "Charles\nDolan\n (96 y/o)", "Ryan\nBreslow\n(28 y/o)", "Gordon\nMoore\n(94 y/o)", "Ludwig T.\nBraun\n(33 y/o)", "Alice\nSchwartz\n(96 y/o)", "Timur\nTurlov\n(35 y/o)", "George\nJoseph\n(101 y/o)")) +
  theme(legend.box.background = element_rect(color = "#D3A292", size = 1)) + # Source: ChatGPT
  theme(legend.position = c(0.85, 0.86)) +
  scale_fill_manual(values = desiredcolors) +
  labs(title = "Net Worth Relationship Between the World's Youngest \n and Oldest Billionaires from the Top 5 Industries",
       x = "Billionaire Name",
       y = "Net Worth (in $billions)",
       fill = "Industry",
       caption = "Compilation from official reports, \n government databases, financial institutions, \n and reputable publications such as \n Forbes, Bloomberg, and World Bank") +
  theme(text = element_text(family = "serif")) # Source: ChatGPT

## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

billgraph

Conclusion

Cleaning Up

Looking back on my project, there are many things I’ve become confident about. I cleaned my data set thoroughly, removing all of the N/A values, filtering the specific billionaires I needed to complete my graph, grouping by industries, and mutating to combine the billionaire’s names and ages. Cleaning up was difficult, as it required many steps I did not know how to take. However, after completing this project, I have gained the knowledge needed for my next project.

Visualization Representation

My visualization represents the net worth relationship between the world’s youngest and oldest billionaire’s from the world’s current top 5 industries: fashion & retail, finance & investments, healthcare, media & entertainment, and technology. For each industry (each different color), the column on the left is the youngest billionaire and the column on the right is the oldest. Almost anyone would expect the oldest billionaire’s to have the higher net worth, and rightfully so. However, what I found surprising is that two of the youngest billionaires, Clemente Del Vecchio and Timur Turlov, have higher net worths than their older counterparts.

Reflection

In the beginning, I was helped with the filtering, mutating, and grouping by Professor Saidi and then I was able to move on. At the last possible minute, a friend helped me rename my x-values so that I could include the ages. It was a great help to me and pulled my graph together. Another aspect I’d have loved to include was another legend on whether the billionaire was self-made or not. If there are two young billionaires that have a higher net worth than the older ones, it would be interesting to know if they had earned the money by themselves or had their family to help. Timur Turlov is self-made, Clemente Del Vecchio is not.