(https://images.unsplash.com/photo-1579621970563-ebec7560ff3e?auto=format&fit=crop&q=80&w=1000)

Intro

This project analyzes the 2014 Billionaires dataset from the CORGIS repository. I chose this topic because I am fascinated by how economic environments and industrial innovation drive extreme financial success. This analysis has personal meaning to me as a Computer Science student because I am particularly interested in how the technology sector compares to traditional industries in wealth creation.

Variable Analysis: I am using a total of 6 variables for this project:

* 2 Categorical Variables: wealth.how.industry (Industry Sector) and location.citizenship (Country).
* 4 Quantitative Variables: wealth.worth in billions (Net Worth), demographics.age (Age), rank (Global Ranking), and location.gdp (National GDP).

# Loading necessary R libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)     
library(plotly)   
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.5.3
# Loading and cleaning the data


# Step 1: Load the data
billionaires <- read_csv("billionaires.csv")
## Rows: 2614 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): name, company.name, company.relationship, company.sector, company....
## dbl  (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
## lgl  (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Step 2: Clean and Subset 
bill_clean <- billionaires %>%
  filter(year == 2014) %>%             
  arrange(rank) %>% 
  slice_head(n = 800) %>%              
  select(worth = `wealth.worth in billions`, 
         age = demographics.age, 
         rank, 
         gdp = location.gdp, 
         industry = wealth.how.industry, 
         country = location.citizenship)

# Step 3: Remove NAs from Age
bill_clean <- bill_clean %>%
  filter(!is.na(age))

# Step 4: Simple exploration plot
hist(bill_clean$age, 
     main="Exploration of Billionaire Ages", 
     xlab="Age (Years)", 
     col="lightblue")
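
It is worth confirming what the NA filter in Step 3 actually removes. The regression output later in this report shows 797 residual degrees of freedom with 3 estimated coefficients, which implies all 800 sliced rows survived the filter. A minimal sketch of such a check follows, in base R on a made-up data frame. The sentinel values 0 and -1 are assumptions for illustration only, not confirmed codes in the CORGIS file.

```r
# Toy data (hypothetical values): one true NA and two assumed sentinel codes
toy <- data.frame(age = c(58, 0, NA, 74, -1, 41))

n_before <- nrow(toy)

# Same logic as filter(!is.na(age)), written in base R
no_na <- toy[!is.na(toy$age), , drop = FALSE]
n_removed_by_na_filter <- n_before - nrow(no_na)

# Ages that are not NA but cannot be real ages (assumed sentinels)
n_implausible <- sum(no_na$age <= 0)

cat("Rows removed by the NA filter:", n_removed_by_na_filter, "\n")
cat("Non-NA but implausible ages:", n_implausible, "\n")
```

If the second count is non-zero on the real data, those rows would need to be recoded to NA before the filter has any effect on them.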

#  Correlation Analysis for Variable Justification
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
# Focus strictly on worth, age, and rank
cor_data_plot <- bill_clean %>%
  select(worth, age, rank) %>%
  mutate(across(everything(), as.numeric))

# Calculate and plot 
res_plot <- cor(cor_data_plot, use = "pairwise.complete.obs")
corrplot(res_plot, method = "color", type = "upper", 
         addCoef.col = "black", 
         title = "Correlation Matrix: Wealth, Age, and Rank",
         mar = c(0,0,1,0))

I chose Rank as a primary predictor because the correlation matrix reveals a strong negative relationship (-0.66) with wealth, confirming that as rank improves (decreases in number), net worth increases. I chose to retain Age despite its very weak correlation (0.06) to statistically test whether it remains a non-significant factor once national GDP and global Rank are accounted for.
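
One caveat when reading the rank correlation: a wealth ranking is by construction an ordering of net worth, so the relationship is perfectly monotone even when the Pearson (linear) coefficient is well short of -1. The sketch below illustrates this on made-up net worth values. The numbers are hypothetical and the ranks are derived inside the example, not taken from the dataset.

```r
# Hypothetical net worths (billions), heavily right-skewed like real wealth data
worth <- c(76, 72, 64, 58, 40, 20, 10, 5, 2, 1)

# Rank 1 = richest, as in a global wealth ranking
rank_pos <- rank(-worth)

# Pearson measures linear association; Spearman measures monotone association
pearson  <- cor(worth, rank_pos, method = "pearson")
spearman <- cor(worth, rank_pos, method = "spearman")

cat("Pearson: ", round(pearson, 3), "\n")
cat("Spearman:", round(spearman, 3), "\n")
```

On the real data, `cor(..., method = "spearman")` could be run on the same three columns to see how much of the gap from -1 is due to non-linearity and ties.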

#  Multiple Linear Regression Analysis


# We are using age, rank, and gdp to predict billionaire worth.
model <- lm(worth ~ age + rank + gdp, data = bill_clean)

# Displaying the results 
summary(model)
## 
## Call:
## lm(formula = worth ~ age + rank + gdp, data = bill_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.379 -3.129 -1.134  1.794 61.110 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.0638197  0.8332314  16.879   <2e-16 ***
## age          0.0146106  0.0113031   1.293    0.197    
## rank        -0.0215591  0.0008797 -24.507   <2e-16 ***
## gdp                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.612 on 797 degrees of freedom
## Multiple R-squared:  0.4327, Adjusted R-squared:  0.4313 
## F-statistic:   304 on 2 and 797 DF,  p-value: < 2.2e-16

Model Analysis and Interpretation: The multiple linear regression model is defined by the following equation: \[Worth = \beta_0 + \beta_1(Age) + \beta_2(Rank) + \beta_3(GDP) + \epsilon\]
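
Substituting the estimates from the output above, and dropping the GDP term because its coefficient is reported as NA, gives the fitted equation below. The second line is a worked illustration for a hypothetical billionaire aged 60 and ranked 100th. These two values are chosen only for the example, and the result is in billions of USD.

\[\widehat{Worth} = 14.0638 + 0.0146(Age) - 0.0216(Rank)\]

\[\widehat{Worth} = 14.0638 + 0.0146(60) - 0.0216(100) \approx 12.78\]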

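Based on the statistical output, our model explains approximately 43.27% of the variance in billionaire wealth (Multiple R-squared = 0.4327; Adjusted R-squared = 0.4313). Rank is highly significant (p < 2e-16), while age is not (p = 0.197). The GDP coefficient is reported as NA ("not defined because of singularities"), so GDP contributes nothing to this fit.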
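
To keep the reported figures in step with the output, R-squared and any inestimable coefficients can be read from the fitted object instead of being typed by hand. The sketch below uses simulated data in which `gdp` is deliberately constant. That is one common cause of a singularity and is only an assumption about what happened in the real model.

```r
set.seed(1)
n <- 50

# Simulated stand-in for bill_clean; gdp is constant on purpose (assumption)
toy <- data.frame(
  age  = round(runif(n, 30, 90)),
  rank = seq_len(n),
  gdp  = rep(0, n)
)
toy$worth <- 20 - 0.3 * toy$rank + rnorm(n)

fit <- lm(worth ~ age + rank + gdp, data = toy)

# Coefficients that lm() could not estimate come back as NA
aliased <- names(coef(fit))[is.na(coef(fit))]

# R-squared taken directly from the model summary
r2 <- summary(fit)$r.squared

cat("Inestimable terms:", aliased, "\n")
cat("R-squared (%):", round(100 * r2, 2), "\n")
```

A constant predictor is indistinguishable from the intercept, which is why `lm()` drops it and reports NA.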

# Interactive Visual of Wealth by Industry

# Building the ggplot object
viz <- ggplot(bill_clean, aes(x = age, y = worth, color = industry)) +
  geom_point(alpha = 0.7, size = 2) +
  theme_economist() +                  
  # Left-align the title (hjust = 0) and set it in bold
  theme(plot.title = element_text(hjust = 0, face = "bold", size = 14),
        legend.position = "right") +
   
  scale_color_viridis_d() +            

  labs(title = "Global Billionaire Net Worth vs. Age (Top 800)", 
       x = "Age of Billionaire (Years)",               
       y = "Net Worth (Billions USD)",
       caption = "Data Source: CORGIS Billionaires Dataset", 
       color = "Industry Sector")       


ggplotly(viz)
# Global distribution of top 800 billionaires
library(maps)
## Warning: package 'maps' was built under R version 4.5.3
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(dplyr)

# 1. Prepare map data and fix naming mismatches
map_data_summary <- bill_clean %>%
  mutate(country = recode(country, 
                          "United States" = "USA", 
                          "United Kingdom" = "UK")) %>%
  group_by(country) %>%
  summarise(count = n())

# 2. Join billionaire counts with the world map coordinates
world_map <- map_data("world")
map_final <- left_join(world_map, map_data_summary, by = c("region" = "country"))

# 3. Create the visualization
ggplot(map_final, aes(x = long, y = lat, group = group, fill = count)) +
  geom_polygon(color = "white", linewidth = 0.1) +
  scale_fill_viridis_c(na.value = "grey90", name = "Billionaire Count") +
  theme_void() +
  labs(title = "Global Geographic Concentration of Billionaires (Top 800)",
       caption = "Data Source: CORGIS Billionaires Dataset")
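
The naming fix in step 1 of the map code covers two countries. Any other name that differs between the dataset and the map would silently appear as grey (NA) on the map. A minimal sketch of a check follows, using short made-up vectors in place of the real `country` column and the map's `region` column.

```r
# Hypothetical stand-ins for the dataset's countries and the map's regions
data_countries <- c("United States", "United Kingdom", "Germany", "China")
map_regions    <- c("USA", "UK", "Germany", "China", "France")

# Names present in the data but absent from the map
unmatched_before <- setdiff(data_countries, map_regions)

# Apply the same two renames used in the report
lookup  <- c("United States" = "USA", "United Kingdom" = "UK")
recoded <- ifelse(data_countries %in% names(lookup),
                  unname(lookup[data_countries]),
                  data_countries)

unmatched_after <- setdiff(recoded, map_regions)

cat("Unmatched before recode:", unmatched_before, "\n")
cat("Unmatched after recode: ", length(unmatched_after), "\n")
```

On the real data, the same `setdiff()` between the recoded `country` values and `unique(world_map$region)` would list any remaining mismatches.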

Conclusion

The interactive visualization reveals that billionaire wealth is widely distributed across ages, though the Technology sector shows a higher frequency of younger outliers compared to traditional industrial sectors.
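
The visual impression about younger billionaires in technology could be backed with a per-industry summary. The sketch below computes median age by industry on made-up rows. The industry labels and ages are hypothetical and may not match the exact category names in the dataset.

```r
# Hypothetical rows standing in for bill_clean
toy <- data.frame(
  industry = c("Technology", "Technology", "Technology",
               "Retail", "Retail", "Retail"),
  age      = c(30, 45, 52, 60, 71, 66)
)

# Median age within each industry
median_age <- tapply(toy$age, toy$industry, median)

print(median_age)
```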

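I initially hypothesized that billionaire success might be detached from local national economies, and I expected the regression to test this. It could not: the GDP coefficient is reported as NA because of a singularity, so this model provides no evidence for or against an effect of national GDP. Rank was the only statistically significant predictor (p < 2e-16), while age was not significant (p = 0.197).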

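A limitation of this study is the presence of zero values in the raw GDP data for certain entries. The pipeline above does not filter or recode them, and the regression reports the GDP coefficient as NA, so no conclusion about GDP can be drawn from this model.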
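
One way to handle this limitation is to treat zero GDP as missing, not as a true value, and to check that GDP still varies before using it as a predictor. A minimal sketch on made-up rows follows. The country codes and GDP figures are hypothetical.

```r
# Hypothetical rows; zeros stand for unrecorded GDP
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  gdp     = c(0, 1.2e12, 0, 3.5e11)
)

# How many entries carry the zero placeholder
n_zero <- sum(toy$gdp == 0)

# Recode zeros to NA so they are excluded instead of treated as real values
toy$gdp_clean <- ifelse(toy$gdp == 0, NA_real_, toy$gdp)

# A predictor with a single distinct value cannot be estimated by lm()
n_distinct_gdp <- length(unique(na.omit(toy$gdp_clean)))

cat("Zero GDP entries:", n_zero, "\n")
cat("Distinct non-missing GDP values:", n_distinct_gdp, "\n")
```

If the distinct count on the 2014 subset were 0 or 1, GDP would have to be dropped from the model or sourced from elsewhere.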


Bibliography

  1. Dataset: CORGIS Dataset Repository - Billionaires. https://corgis-edu.github.io/corgis/csv/billionaires/
  2. Image: Unsplash - Wealth Representation. https://images.unsplash.com/photo-1579621970563-ebec7560ff3e?auto=format&fit=crop&q=80&w=1000