This project analyzes the 2014 Billionaires dataset from the CORGIS repository. I chose this topic because I am fascinated by how economic environments and industrial innovation drive extreme financial success. This analysis has personal meaning to me as a Computer Science student because I am particularly interested in how the technology sector compares to traditional industries in wealth creation.
Variable Analysis: I am using a total of 6 variables for this project:
* 2 Categorical Variables: wealth.how.industry (Industry Sector) and location.citizenship (Country).
* 4 Quantitative Variables: wealth.worth in billions (Net Worth), demographics.age (Age), rank (Global Ranking), and location.gdp (National GDP).
#Loading Necessary R Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.5.3
#Loading and Cleaning the data
# Step 1: Load the data
billionaires <- read_csv("billionaires.csv")
## Rows: 2614 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): name, company.name, company.relationship, company.sector, company....
## dbl (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
## lgl (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Step 2: Clean and Subset
bill_clean <- billionaires %>%
  filter(year == 2014) %>%
  arrange(rank) %>%
  slice_head(n = 800) %>%
  select(worth = `wealth.worth in billions`,
         age = demographics.age,
         rank,
         gdp = location.gdp,
         industry = wealth.how.industry,
         country = location.citizenship)
# Step 3: Remove NAs from Age
bill_clean <- bill_clean %>%
  filter(!is.na(age))
# Step 4: Simple exploration plot
hist(bill_clean$age,
     main = "Exploration of Billionaire Ages",
     xlab = "Age (Years)",
     col = "lightblue")
# Correlation Analysis for Variable Justification
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
# Focus strictly on worth, age, and rank
cor_data_plot <- bill_clean %>%
  select(worth, age, rank) %>%
  mutate(across(everything(), as.numeric))
# Calculate and plot
res_plot <- cor(cor_data_plot, use = "pairwise.complete.obs")
corrplot(res_plot, method = "color", type = "upper",
         addCoef.col = "black",
         title = "Correlation Matrix: Wealth, Age, and Rank",
         mar = c(0, 0, 1, 0))
I chose Rank as a primary predictor because the correlation matrix
reveals a strong negative relationship (-0.66) with wealth, confirming
that as rank improves (decreases in number), net worth increases.
I retained Age despite its very weak correlation (0.06) to test
statistically whether it remains a non-significant factor once
national GDP and global Rank are accounted for.
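The correlations quoted above can also be checked with a formal test. The following is a minimal sketch using base R's `cor.test()` on a small invented data frame; the `toy` object and all of its values are assumptions for illustration, not rows from the billionaires data. With the real data, the same calls would take `bill_clean$worth`, `bill_clean$rank`, and `bill_clean$age`.

```r
# Toy data only: these values are invented for illustration and are
# not taken from the billionaires dataset.
toy <- data.frame(
  worth = c(50, 45, 41, 38, 30, 27, 25, 22, 21, 20),
  rank  = 1:10,
  age   = c(58, 74, 77, 83, 69, 78, 65, 64, 80, 66)
)

# Pearson correlation with a significance test and confidence interval
rank_test <- cor.test(toy$worth, toy$rank)
age_test  <- cor.test(toy$worth, toy$age)

rank_test$estimate  # sample correlation between worth and rank
rank_test$p.value   # p-value for H0: the true correlation is 0
age_test$conf.int   # 95% confidence interval for the worth-age correlation
```

A confidence interval that includes 0 for Age would be consistent with treating it as a weak predictor, though the regression is still needed to see its effect alongside the other variables.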
# Multiple Linear Regression Analysis
# We are using age, rank, and gdp to predict billionaire worth.
model <- lm(worth ~ age + rank + gdp, data = bill_clean)
# Displaying the results
summary(model)
##
## Call:
## lm(formula = worth ~ age + rank + gdp, data = bill_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.379 -3.129 -1.134 1.794 61.110
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.0638197 0.8332314 16.879 <2e-16 ***
## age 0.0146106 0.0113031 1.293 0.197
## rank -0.0215591 0.0008797 -24.507 <2e-16 ***
## gdp NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.612 on 797 degrees of freedom
## Multiple R-squared: 0.4327, Adjusted R-squared: 0.4313
## F-statistic: 304 on 2 and 797 DF, p-value: < 2.2e-16
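Model Analysis and Interpretation: The multiple linear regression model is defined by the following equation: \[Worth = \beta_0 + \beta_1(Age) + \beta_2(Rank) + \beta_3(GDP) + \epsilon\]
Based on the statistical output, our model explains approximately 43.27% of the variance in billionaire wealth (Multiple R-squared: 0.4327; Adjusted R-squared: 0.4313). Rank is highly significant (p < 2e-16), while Age is not (p = 0.197). The GDP coefficient is reported as NA ("1 not defined because of singularities"), so the GDP term was not estimated and the fitted model effectively contains only Age and Rank.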
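The "not defined because of singularities" note in the output above is what `lm()` prints when a predictor carries no independent information, for example when it is constant across all rows. The sketch below reproduces that behaviour on invented data; the `toy` data frame and the choice of a constant `gdp` column are assumptions for illustration, and whether this is the actual cause in `bill_clean` would need to be checked against the real column.

```r
# Toy data only: invented values, not the billionaires dataset.
set.seed(1)
toy <- data.frame(
  age  = c(45, 52, 60, 67, 71, 58, 49, 80),
  rank = 1:8,
  gdp  = 0  # assumed: a predictor with no variation at all
)
toy$worth <- 40 - 3 * toy$rank + rnorm(8)

fit <- lm(worth ~ age + rank + gdp, data = toy)

coef(fit)               # gdp is NA: a constant column cannot be estimated
summary(fit)$r.squared  # R-squared comes from the remaining terms only

# Count distinct values per predictor before modelling;
# a count of 1 flags a column that lm() will drop
sapply(toy[c("age", "rank", "gdp")], function(x) length(unique(x)))
```

Running the same `sapply()` check on `bill_clean[c("age", "rank", "gdp")]` would show whether GDP varies within the 2014 subset.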
# Interactive Visual of Wealth by Industry
# Building the ggplot object
viz <- ggplot(bill_clean, aes(x = age, y = worth, color = industry)) +
  geom_point(alpha = 0.7, size = 2) +
  theme_economist() +
  # Left-align the title (hjust = 0) and make it bold
  theme(plot.title = element_text(hjust = 0, face = "bold", size = 14),
        legend.position = "right") +
  scale_color_viridis_d() +
  labs(title = "Global Billionaire Net Worth vs. Age (Top 800)",
       x = "Age of Billionaire (Years)",
       y = "Net Worth (Billions USD)",
       caption = "Data Source: CORGIS Billionaires Dataset",
       color = "Industry Sector")
ggplotly(viz)
#Global Distribution of Top 800 Billionaires
library(maps)
## Warning: package 'maps' was built under R version 4.5.3
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
library(dplyr)
# 1. Prepare map data and fix naming mismatches
map_data_summary <- bill_clean %>%
  mutate(country = recode(country,
                          "United States" = "USA",
                          "United Kingdom" = "UK")) %>%
  group_by(country) %>%
  summarise(count = n())
# 2. Join billionaire counts with the world map coordinates
world_map <- map_data("world")
map_final <- left_join(world_map, map_data_summary, by = c("region" = "country"))
# 3. Create the visualization
ggplot(map_final, aes(x = long, y = lat, group = group, fill = count)) +
  # linewidth replaces the deprecated size argument for line widths (ggplot2 >= 3.4.0)
  geom_polygon(color = "white", linewidth = 0.1) +
  scale_fill_viridis_c(na.value = "grey90", name = "Billionaire Count") +
  theme_void() +
  labs(title = "Global Geographic Concentration of Billionaires (Top 800)",
       caption = "Data Source: CORGIS Billionaires Dataset")
The interactive visualization reveals that billionaire wealth is widely distributed across ages, though the Technology sector shows a higher frequency of younger outliers compared to traditional industrial sectors.
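A visual impression like this can be backed up with a grouped summary. The sketch below shows one way to compare age by sector with dplyr; the `toy` tibble, its industry labels, and the under-45 cutoff are all assumptions chosen for illustration, not values from the dataset.

```r
library(dplyr)

# Toy data only: invented ages and hypothetical industry labels.
toy <- tibble(
  industry = c("Technology", "Technology", "Technology",
               "Retail", "Retail", "Energy"),
  age      = c(29, 41, 58, 66, 72, 75)
)

# Median age and share of younger billionaires per industry
age_by_industry <- toy %>%
  group_by(industry) %>%
  summarise(
    n              = n(),
    median_age     = median(age),
    share_under_45 = mean(age < 45)  # 45 is an arbitrary cutoff
  ) %>%
  arrange(median_age)

age_by_industry
```

Applied to `bill_clean`, the same pipeline would show whether the technology-related sectors really have a lower median age or a larger share of young billionaires.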
I initially hypothesized that billionaire success might be detached from local national economies. This model cannot confirm or refute that hypothesis: the GDP coefficient is reported as NA in the regression output, so the significance of national GDP was not actually tested here.
A limitation of this study was the presence of zero values in the raw GDP data for certain entries, which required specific filtering.
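Zero values used as placeholders can be handled in at least two ways, sketched below on invented data. The `toy` tibble and its values are assumptions for illustration; which option suits the real `gdp` column depends on how many rows would be lost.

```r
library(dplyr)

# Toy data only: invented values, not the billionaires dataset.
toy <- tibble(
  country = c("A", "B", "C", "D"),
  gdp     = c(1.5e12, 0, 3.2e11, 0)
)

# Option 1: drop rows where GDP was recorded as zero
toy_filtered <- toy %>%
  filter(gdp > 0)

# Option 2: keep every row but mark zero GDP as missing,
# so functions like cor() and lm() treat it as NA rather than a real 0
toy_marked <- toy %>%
  mutate(gdp = na_if(gdp, 0))

nrow(toy_filtered)          # rows remaining after filtering
sum(is.na(toy_marked$gdp))  # rows flagged as missing
```

Option 2 keeps the sample size visible and makes the missingness explicit, at the cost of handling NA values in later steps.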