Global Billionaires Analysis

Author

Flioria Akesse

Introduction

This project analyzes a dataset on billionaires around the world. The dataset includes information about each person's wealth, age, industry, company, and geographic location. It has 2,614 observations and 22 variables. Each observation is a billionaire in a given list year, so the same person can appear more than once (for example, Bill Gates in 1996, 2001, and 2014). The quantitative variables include wealth in billions of USD, age, GDP, company founding year, and rank. The qualitative variables include gender, region, country, industry, company type, and wealth category. The purpose of the analysis is to look for patterns in billionaire wealth and to test whether wealth is related to other factors such as age or region. The data were obtained from the Awesome Public Datasets GitHub repository.

Load Libraries and Data

# Load libraries needed for the analysis
library(readr)
Warning: package 'readr' was built under R version 4.5.3
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.5.3
# Load the dataset
billionaires <- read_csv("billionaires.csv")
Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl  (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl  (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first rows of the dataset
head(billionaires)
# A tibble: 6 × 22
  name              rank  year company.founded company.name company.relationship
  <chr>            <dbl> <dbl>           <dbl> <chr>        <chr>               
1 Bill Gates           1  1996            1975 Microsoft    founder             
2 Bill Gates           1  2001            1975 Microsoft    founder             
3 Bill Gates           1  2014            1975 Microsoft    founder             
4 Warren Buffett       2  1996            1962 Berkshire H… founder             
5 Warren Buffett       2  2001            1962 Berkshire H… founder             
6 Carlos Slim Helu     2  2014            1990 Telmex       founder             
# ℹ 16 more variables: company.sector <chr>, company.type <chr>,
#   demographics.age <dbl>, demographics.gender <chr>,
#   location.citizenship <chr>, `location.country code` <chr>,
#   location.gdp <dbl>, location.region <chr>, wealth.type <chr>,
#   `wealth.worth in billions` <dbl>, wealth.how.category <chr>,
#   `wealth.how.from emerging` <lgl>, wealth.how.industry <chr>,
#   wealth.how.inherited <chr>, `wealth.how.was founder` <lgl>, …

Data Cleaning

Before the analysis, some basic data cleaning was carried out. The structure of the dataset was examined to confirm that R read each variable with the intended type and to identify missing values. Rows containing any missing value were then removed so that the analysis uses complete observations only.

# Check structure of the dataset
str(billionaires)
spc_tbl_ [2,614 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ name                    : chr [1:2614] "Bill Gates" "Bill Gates" "Bill Gates" "Warren Buffett" ...
 $ rank                    : num [1:2614] 1 1 1 2 2 2 3 3 3 4 ...
 $ year                    : num [1:2614] 1996 2001 2014 1996 2001 ...
 $ company.founded         : num [1:2614] 1975 1975 1975 1962 1962 ...
 $ company.name            : chr [1:2614] "Microsoft" "Microsoft" "Microsoft" "Berkshire Hathaway" ...
 $ company.relationship    : chr [1:2614] "founder" "founder" "founder" "founder" ...
 $ company.sector          : chr [1:2614] "Software" "Software" "Software" "Finance" ...
 $ company.type            : chr [1:2614] "new" "new" "new" "new" ...
 $ demographics.age        : num [1:2614] 40 45 58 65 70 74 0 48 77 68 ...
 $ demographics.gender     : chr [1:2614] "male" "male" "male" "male" ...
 $ location.citizenship    : chr [1:2614] "United States" "United States" "United States" "United States" ...
 $ location.country code   : chr [1:2614] "USA" "USA" "USA" "USA" ...
 $ location.gdp            : num [1:2614] 8.10e+12 1.06e+13 0.00 8.10e+12 1.06e+13 ...
 $ location.region         : chr [1:2614] "North America" "North America" "North America" "North America" ...
 $ wealth.type             : chr [1:2614] "founder non-finance" "founder non-finance" "founder non-finance" "founder non-finance" ...
 $ wealth.worth in billions: num [1:2614] 18.5 58.7 76 15 32.3 72 13.1 30.4 64 12.7 ...
 $ wealth.how.category     : chr [1:2614] "New Sectors" "New Sectors" "New Sectors" "Traded Sectors" ...
 $ wealth.how.from emerging: logi [1:2614] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ wealth.how.industry     : chr [1:2614] "Technology-Computer" "Technology-Computer" "Technology-Computer" "Consumer" ...
 $ wealth.how.inherited    : chr [1:2614] "not inherited" "not inherited" "not inherited" "not inherited" ...
 $ wealth.how.was founder  : logi [1:2614] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ wealth.how.was political: logi [1:2614] TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, "spec")=
  .. cols(
  ..   name = col_character(),
  ..   rank = col_double(),
  ..   year = col_double(),
  ..   company.founded = col_double(),
  ..   company.name = col_character(),
  ..   company.relationship = col_character(),
  ..   company.sector = col_character(),
  ..   company.type = col_character(),
  ..   demographics.age = col_double(),
  ..   demographics.gender = col_character(),
  ..   location.citizenship = col_character(),
  ..   `location.country code` = col_character(),
  ..   location.gdp = col_double(),
  ..   location.region = col_character(),
  ..   wealth.type = col_character(),
  ..   `wealth.worth in billions` = col_double(),
  ..   wealth.how.category = col_character(),
  ..   `wealth.how.from emerging` = col_logical(),
  ..   wealth.how.industry = col_character(),
  ..   wealth.how.inherited = col_character(),
  ..   `wealth.how.was founder` = col_logical(),
  ..   `wealth.how.was political` = col_logical()
  .. )
 - attr(*, "problems")=<externalptr> 
# Check for missing values
sum(is.na(billionaires))
[1] 201
# Remove missing values if any
billionaires_clean <- na.omit(billionaires)
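
As a rough sketch of how the cleaning step could be audited, the snippet below counts missing values per column and the number of rows that na.omit() drops, since sum(is.na(...)) counts missing cells rather than rows. It also counts zero ages, because the str() output shows zeros in demographics.age and location.gdp that may be placeholders for unknown values; that interpretation is an assumption to verify against the data source. The `toy` data frame and its values are hypothetical stand-ins, not rows from the dataset.

```r
# Hypothetical toy data standing in for the billionaires tibble
toy <- data.frame(
  name             = c("A", "B", "C", "D"),
  demographics.age = c(40, 0, 65, NA),
  location.gdp     = c(8.1e12, 0, NA, 1.06e13)
)

# Missing values per column, rather than one grand total of cells
na_per_column <- colSums(is.na(toy))

# Number of rows that na.omit() would remove
rows_dropped <- nrow(toy) - nrow(na.omit(toy))

# Zeros that may be placeholders for unknown ages (an assumption to verify)
zero_age <- sum(toy$demographics.age == 0, na.rm = TRUE)

print(na_per_column)
print(rows_dropped)
print(zero_age)
```

Applied to the real data, the same calls would use `billionaires` in place of `toy`. If the zeros do turn out to be placeholders, they could be recoded to NA before na.omit() is called.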

Data Visualization

names(billionaires_clean)
 [1] "name"                     "rank"                    
 [3] "year"                     "company.founded"         
 [5] "company.name"             "company.relationship"    
 [7] "company.sector"           "company.type"            
 [9] "demographics.age"         "demographics.gender"     
[11] "location.citizenship"     "location.country code"   
[13] "location.gdp"             "location.region"         
[15] "wealth.type"              "wealth.worth in billions"
[17] "wealth.how.category"      "wealth.how.from emerging"
[19] "wealth.how.industry"      "wealth.how.inherited"    
[21] "wealth.how.was founder"   "wealth.how.was political"
ggplot(billionaires_clean,
       aes(x = wealth.how.industry,
           y = `wealth.worth in billions`,
           fill = wealth.how.industry)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Billionaire Wealth by Industry",
    x = "Industry",
    y = "Wealth (Billions USD)",
    caption = "Source: Forbes Billionaires Dataset"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The visualization above shows the distribution of billionaire wealth by industry. Each box spans the interquartile range of wealth within one industry, the line inside the box marks the median, and points beyond the whiskers are outliers. Some industries appear to have higher median wealth than others, which could mean that those industries tend to produce larger fortunes.
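
The medians that the boxplot shows visually can also be tabulated. The sketch below, in base R, computes the median wealth and the number of observations per industry on a hypothetical `toy` data frame. The column names `industry` and `worth` and all values are illustrative assumptions, not data from the report.

```r
# Hypothetical toy data: industry labels and wealth in billions USD
toy <- data.frame(
  industry = c("Tech", "Tech", "Retail", "Retail", "Retail"),
  worth    = c(10, 20, 2, 4, 3)
)

# Median wealth per industry, sorted from highest to lowest
median_by_industry <- sort(tapply(toy$worth, toy$industry, median),
                           decreasing = TRUE)

# Number of observations per industry (a different question from the median)
count_by_industry <- table(toy$industry)

print(median_by_industry)
print(count_by_industry)
```

Keeping the two summaries separate makes it clear whether a statement is about how wealthy billionaires in an industry are or about how many of them there are.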

Linear Regression Analysis

# Linear regression: Wealth and age

model <- lm(`wealth.worth in billions` ~ demographics.age, data = billionaires_clean)

summary(model)

Call:
lm(formula = `wealth.worth in billions` ~ demographics.age, data = billionaires_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-3.517 -2.202 -1.327  0.018 72.344 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      2.229920   0.246755   9.037  < 2e-16 ***
demographics.age 0.024591   0.004142   5.937 3.31e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.112 on 2537 degrees of freedom
Multiple R-squared:  0.0137,    Adjusted R-squared:  0.01331 
F-statistic: 35.24 on 1 and 2537 DF,  p-value: 3.309e-09
# Diagnostic plots for the regression model
par(mfrow = c(2, 2))
plot(model)

A linear regression model was fitted to examine the relationship between billionaire wealth and age. The model has the form Wealth = b0 + b1 * Age, where b0 is the intercept and b1 is the change in wealth (in billions of USD) per additional year of age. The p-value for b1 tests whether age is a statistically significant predictor of wealth, and the adjusted R-squared gives the share of the variation in wealth that age explains. Diagnostic plots were used to check the assumptions of the linear model.

A linear regression analysis was performed to examine the relationship between wealth and age of billionaires. The equation of the regression line is:

$$\text{Wealth} = 2.2299 + 0.0246\,(\text{Age})$$

The regression equation indicates that for each additional year of age, wealth is expected to increase by 0.0246 billion USD (about 24.6 million USD) on average.
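
To make the slope concrete, the fitted equation can be evaluated at specific ages. The sketch below copies the two coefficients from the summary(model) output; the helper `predict_wealth` is a hypothetical name introduced only for illustration.

```r
# Coefficients copied from the summary(model) output
b0 <- 2.229920   # intercept
b1 <- 0.024591   # slope for demographics.age

# Hypothetical helper: fitted wealth (billions USD) at a given age
predict_wealth <- function(age) b0 + b1 * age

wealth_60 <- predict_wealth(60)
wealth_70 <- predict_wealth(70)

# Ten additional years of age shift the fitted value by 10 * b1
ten_year_gap <- wealth_70 - wealth_60

print(c(wealth_60 = wealth_60, wealth_70 = wealth_70, gap = ten_year_gap))
```

With these coefficients, the fitted value at age 60 is about 3.71 billion USD, and ten additional years add about 0.25 billion USD.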

The p-value for the age coefficient, 3.31 x 10^-9, is far below 0.05. This indicates that age is a statistically significant predictor of wealth.

However, the adjusted R² is 0.0133 (the multiple R² is 0.0137), which means that age explains only about 1.33% of the variation in wealth. Therefore, although age is statistically significant, it is not an important factor in explaining wealth.
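
Statistical significance and explanatory power are separate quantities, and both can be read directly from the object returned by summary(). The sketch below fits a model to simulated data in which the true slope is small relative to the noise. The variable names, sample size, and simulation parameters are all assumptions chosen for illustration and are unrelated to the actual dataset.

```r
# Simulated data: a small true slope buried in a lot of noise (illustrative only)
set.seed(1)
n     <- 2500
age   <- runif(n, min = 30, max = 90)
worth <- 2 + 0.025 * age + rnorm(n, sd = 5)

fit <- lm(worth ~ age)
s   <- summary(fit)

# Slope estimate and its p-value, taken from the coefficient table
slope   <- coef(s)["age", "Estimate"]
p_value <- coef(s)["age", "Pr(>|t|)"]

# Share of variance explained, raw and adjusted for the number of predictors
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared

print(c(slope = slope, p_value = p_value, r2 = r2, adj_r2 = adj_r2))
```

With a large sample, even a weak relationship can yield a very small p-value, so the R² values are the better guide to how much of the outcome a predictor accounts for.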

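Diagnostic plots were also used to check the assumptions of the linear regression: linearity, constant variance and normality of the errors, and the presence of influential observations. The residual summary (a median of -1.327 against a maximum of 72.344) already suggests that the residuals are strongly right-skewed, so the normality assumption should be treated with caution.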
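
The checks that the diagnostic plots show graphically can be complemented with a few numbers. The sketch below fits a model to simulated right-skewed data, then summarizes the residuals and computes Cook's distance to flag potentially influential observations. The data are simulated, and the 4/n cutoff is a common rule of thumb rather than a fixed standard; both are assumptions for illustration.

```r
# Simulated right-skewed outcome (illustrative only, not the report's data)
set.seed(42)
n     <- 100
age   <- runif(n, min = 30, max = 90)
worth <- 0.02 * age + rexp(n, rate = 1 / 3)

fit <- lm(worth ~ age)

# Residual summary: a median well below the mean hints at right skew
res         <- residuals(fit)
res_summary <- summary(res)

# Cook's distance per observation; 4/n is a rule-of-thumb threshold
cooks       <- cooks.distance(fit)
influential <- which(cooks > 4 / n)

print(res_summary)
print(influential)
```

Flagged observations are candidates for closer inspection, not automatic removal.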

Discussion and Conclusion

During data preparation, the dataset was examined to understand its structure and to identify problems that could affect the analysis. The structure was inspected with str(), and missing values were counted. Rows with missing values were removed with the na.omit() function so that the analysis was conducted on complete observations only. Cleaning the dataset in this way helps keep the analysis accurate.

The visualization created for this project uses boxplots to show the distribution of billionaire wealth by industry, which makes it easier to compare industries side by side. From the visualization, some industries appear to have higher median wealth than others, which could imply that those industries tend to produce larger fortunes. The boxplot does not show how many billionaires each industry has, so it does not support conclusions about counts.

Taken together, the visualization and the regression analysis reveal certain patterns. The regression shows a statistically significant positive association between age and billionaire wealth. However, the adjusted R² value shows that age accounts for only a small portion of the variation in wealth. This suggests that other variables, such as industry, location, inheritance, and business success, likely have much more influence on billionaire wealth.