Project 2

Introduce your topic and dataset in a paragraph or two at the beginning of your markdown document. Be sure you describe any variables included, what kind of variables they are, where the data came from and how you cleaned it up (be detailed and specific, using proper terminology where appropriate). Be sure to explain why you chose this topic and dataset – what meaning does it have for you?

In this project I explore the distribution of billionaire wealth across industries and countries using a dataset of 992 billionaires. I chose this topic because business success is interesting to me, especially how different sectors create extremely high levels of wealth.

The dataset includes both categorical and quantitative variables. The categorical variables used in this project are Industry and Country. The quantitative variables used are Net Worth, Age, Daily Change ($B), and Daily Change (%).

The data came from www.kaggle.com shared dataset file that was converted to CSV format for use in R. To prepare the dataset, I imported it with readr::read_csv(), checked variable types, renamed columns, removed observations with missing values only for variables needed in the analysis, and filtered the dataset down to no more than 800 observations for the final project requirements.

This topic is meaningful to me because it connects economics, business, and global inequality. I wanted to explore is age help explain billionaire net worth, and whether patterns differ across industries and countries.

Load the necessary libraries. Load your dataset using the readr::read_csv() command (do NOT use read.csv() ). Never use na.omit(dataset) or drop_na(datatset). You may use na.omit(dataset$variable) though.

library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(shiny)
billionaires <- readr::read_csv("billionaires_1000.csv")
Rows: 992 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Daily Change ($B), Daily Change (%), Industry, Country
dbl (3): Rank, Net Worth ($B), Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean and explore the data variables and keep track of all of your cleaning and explorations in quarto/markdown document. Be sure to include subtitles and comments for every chunk to describe your actions. Be sure your subtitle formatting works.

billionaires_clean <- billionaires %>%
  rename(
    rank = Rank,
    name = Name,
    net_worth_b = `Net Worth ($B)`,
    daily_change_b = `Daily Change ($B)`,
    daily_change_percent = `Daily Change (%)`,
    industry = Industry,
    age = Age,
    country = Country
  ) %>%
  mutate(
    change_b_text = daily_change_b,
    change_pct_text = daily_change_percent,
    
    # daily_change_b: Remove the parentheses, then make the expression negative if parentheses were present.
    daily_change_b = as.numeric(gsub("[()]", "", change_b_text)),
    daily_change_b = ifelse(grepl("\\(", change_b_text),
                            -daily_change_b,
                            daily_change_b),
    
    # daily_change_percent: Remove the percent signs and parentheses.
    daily_change_percent = as.numeric(gsub("[%()]", "", change_pct_text)),
    daily_change_percent = ifelse(grepl("\\(", change_pct_text),
                                  -daily_change_percent,
                                  daily_change_percent)
  ) %>%
  select(-change_b_text, -change_pct_text) %>%
  filter(
    !is.na(age),
    !is.na(net_worth_b),
    !is.na(industry),
    !is.na(country)
  ) %>%
  arrange(rank) %>%
  slice_head(n = 800)
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `daily_change_b = as.numeric(gsub("[()]", "", change_b_text))`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
glimpse(billionaires_clean)
Rows: 800
Columns: 8
$ rank                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ name                 <chr> "Elon Musk", "Larry Page", "Sergey Brin", "Jeff B…
$ net_worth_b          <dbl> 824.0, 244.9, 226.1, 221.6, 204.7, 191.5, 151.1, …
$ daily_change_b       <dbl> 0.559, -3.400, -3.100, -2.200, -2.600, -5.200, -1…
$ daily_change_percent <dbl> 0.07, -1.39, -1.38, -1.01, -1.27, -2.69, -0.70, -…
$ industry             <chr> "Tesla, SpaceX", "Google", "Google", "Amazon", "F…
$ age                  <dbl> 54, 52, 52, 62, 41, 81, 63, 61, 77, 95, 81, 77, 8…
$ country              <chr> "United States", "United States", "United States"…

Explore both quantitative and categorical variables with simple plots to determine what you want to focus on for your final visualization.

Net Worth Distribution:

ggplot(billionaires_clean, aes(x = net_worth_b)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  scale_x_log10() +
  labs(
    title = "Distribution of Billionaire Net Worth (Log Scale)",
    x = "Net Worth (Billions USD, log scale)",
    y = "Count"
  ) +
  theme_minimal()

Most billionaires have lower wealth compared to a few extremely rich individuals. The data shows a clear concentration at lower values and a long tail at higher values.

Age vs Net Worth:

# Scatterplot of age and net worth
ggplot(billionaires_clean, aes(x = age, y = net_worth_b)) +
  geom_point(color = "darkred", alpha = 0.7) +
  labs(
    title = "Relationship Between Age and Net Worth",
    x = "Age",
    y = "Net Worth (Billions of USD)"
  ) +
  theme_light()

The scatterplot shows that there is no strong linear relationship between age and net worth. Most billionaires, regardless of age, have relatively lower net worth, while a few extreme outliers have significantly higher wealth.

TOP Countries:

# Count billionaires by country
top_countries <- billionaires_clean %>%
  count(country, sort = TRUE) %>%
  slice_head(n = 10)

top_countries
# A tibble: 10 × 2
   country            n
   <chr>          <int>
 1 United States    301
 2 China             85
 3 Germany           54
 4 India             33
 5 Russia            33
 6 Canada            24
 7 France            21
 8 United Kingdom    18
 9 Hong Kong         17
10 Italy             17
# Bar chart of top 10 countries
ggplot(top_countries, aes(x = reorder(country, n), y = n, fill = country)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Top 10 Countries by Number of Billionaires",
    x = "Country",
    y = "Number of Billionaires"
  ) +
  theme_bw()

The bar chart shows that the United States has by far the highest number of billionaires, significantly exceeding all other countries. A small group of countries, including China and Germany, also have notable counts, while the rest have much fewer billionaires.

TOP Industries

# Count billionaires by industry
top_industries <- billionaires_clean %>%
  count(industry, sort = TRUE) %>%
  slice_head(n = 10)

top_industries
# A tibble: 10 × 2
   industry            n
   <chr>           <int>
 1 Real estate        31
 2 Pharmaceuticals    29
 3 Investments        26
 4 Diversified        18
 5 Hedge funds        18
 6 Private equity     15
 7 Software           14
 8 Shipping           11
 9 Finance            10
10 Supermarkets        9
# Bar chart of top industries
ggplot(top_industries, aes(x = reorder(industry, n), y = n, fill = industry)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Top 10 Industries by Number of Billionaires",
    x = "Industry",
    y = "Number of Billionaires"
  ) +
  theme_minimal()

The chart shows that billionaire wealth is concentrated in a few key industries, with real estate, pharmaceuticals, and investments having the highest number of billionaires. Other industries have noticeably fewer billionaires, indicating that wealth is not evenly distributed across sectors.

Include a multiple linear regression analysis of 3 or more quantitative variables. Provide a clear justification for your choice of variables (e.g., using backward elimination or correlation plots). Write the equation for your model and analyze your model based on p-
values, adjusted R^2 values.

To build the regression model, I selected Net Worth as the response variable because it represents billionaire wealth and is the main outcome of interest in this project. I chose Age, Daily Change, and Daily Change Percent as predictor variables because they are quantitative and may help explain differences in billionaire net worth. I also looked at the correlation between the variables to make sure they are related before using them in the regression model.

reg_data <- billionaires_clean %>%
  filter(
    !is.na(net_worth_b),
    !is.na(age),
    !is.na(daily_change_b),
    !is.na(daily_change_percent)
  )

glimpse(reg_data)
Rows: 461
Columns: 8
$ rank                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ name                 <chr> "Elon Musk", "Larry Page", "Sergey Brin", "Jeff B…
$ net_worth_b          <dbl> 824.0, 244.9, 226.1, 221.6, 204.7, 191.5, 151.1, …
$ daily_change_b       <dbl> 0.559, -3.400, -3.100, -2.200, -2.600, -5.200, -1…
$ daily_change_percent <dbl> 0.07, -1.39, -1.38, -1.01, -1.27, -2.69, -0.70, -…
$ industry             <chr> "Tesla, SpaceX", "Google", "Google", "Amazon", "F…
$ age                  <dbl> 54, 52, 52, 62, 41, 81, 63, 61, 77, 95, 81, 77, 8…
$ country              <chr> "United States", "United States", "United States"…
reg_data %>%
  select(net_worth_b, age, daily_change_b, daily_change_percent) %>%
  cor()
                     net_worth_b         age daily_change_b
net_worth_b           1.00000000 -0.01973963    -0.28415654
age                  -0.01973963  1.00000000     0.02043353
daily_change_b       -0.28415654  0.02043353     1.00000000
daily_change_percent -0.03678485 -0.03955254     0.55013110
                     daily_change_percent
net_worth_b                   -0.03678485
age                           -0.03955254
daily_change_b                 0.55013110
daily_change_percent           1.00000000
model1 <- lm(net_worth_b ~ age + daily_change_b + daily_change_percent, data = reg_data)

summary(model1)

Call:
lm(formula = net_worth_b ~ age + daily_change_b + daily_change_percent, 
    data = reg_data)

Residuals:
   Min     1Q Median     3Q    Max 
-33.15 -11.41  -8.84  -3.60 826.36 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           19.40517   11.15088   1.740   0.0825 .  
age                   -0.01916    0.16221  -0.118   0.9060    
daily_change_b       -37.55551    5.28089  -7.112 4.45e-12 ***
daily_change_percent   3.81231    1.18588   3.215   0.0014 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.78 on 457 degrees of freedom
Multiple R-squared:  0.1013,    Adjusted R-squared:  0.09536 
F-statistic: 17.16 on 3 and 457 DF,  p-value: 1.408e-10

The regression results show that daily_change_b (0.00000000000445) and daily_change_percent ( 0.0014) are statistically significant predictors of net worth, as their p-values are very small (p < 0.01). In contrast, age (0.9060) is not a statistically significant predictor, since its p-value is large and does not provide strong evidence of an effect.

The coefficient for daily_change_b is negative (-37.55551), suggesting that larger negative daily changes are associated with lower net worth, while the positive coefficient for daily_change_percent (3.81231) indicates that higher percentage changes are associated with higher net worth. However, the adjusted R² is relatively low (about 0.095), meaning that the model explains only a small portion of the variation in billionaire wealth, and other factors likely play an important role.

Plot at least one visualization (you can do more than one) that includes all of the following elements:

a. meaningful labels for axes

b.at least 3 colors used in the visualization(s)

c.a detailed title

d. a caption for the data source

e. some sort of legend to make sense of colors, shapes, and sizes that describe any variables.

f. At least one shinyapp, Highcharter or plotly for mouseover interactivity

g. a non-default ggplot theme and a non-default color palette

industry_summary <- billionaires_clean %>%
  group_by(industry) %>%
  summarize(
    avg_net_worth = mean(net_worth_b, na.rm = TRUE),
    avg_age = mean(age, na.rm = TRUE),
    count = n()
  ) %>%
  filter(count >= 8) %>%
  arrange(desc(avg_net_worth)) %>%
  slice_head(n = 10) %>%
  mutate(
    wealth_group = case_when(
      avg_net_worth >= 40 ~ "High average wealth",
      avg_net_worth >= 20 ~ "Medium average wealth",
      TRUE ~ "Lower average wealth"
    )
  )

industry_summary
# A tibble: 10 × 5
   industry       avg_net_worth avg_age count wealth_group         
   <chr>                  <dbl>   <dbl> <int> <chr>                
 1 Shipping               20.4     68.8    11 Medium average wealth
 2 Telecom                19.5     69       9 Lower average wealth 
 3 Diversified            17.9     74.3    18 Lower average wealth 
 4 Hedge funds            15.6     69.8    18 Lower average wealth 
 5 E-commerce             13.8     51.5     8 Lower average wealth 
 6 Banking                12.4     67.6     8 Lower average wealth 
 7 Supermarkets            9.57    73.2     9 Lower average wealth 
 8 Private equity          9.37    65.7    15 Lower average wealth 
 9 Real estate             9.23    76.5    31 Lower average wealth 
10 Investments             9.16    66.8    26 Lower average wealth 
ggplot(
  industry_summary,
  aes(
    x = reorder(industry, avg_net_worth),
    y = avg_net_worth,
    fill = wealth_group,
    size = count
  )
) +
  geom_point(shape = 21, color = "black", alpha = 0.9) +
  coord_flip() +
  labs(
    title = "Average Billionaire Net Worth by Industry",
    subtitle = "Top industries in the dataset, with bubble size showing number of billionaires",
    x = "Industry",
    y = "Average Net Worth (Billions of USD)",
    fill = "Wealth Category",
    size = "Number of Billionaires",
    caption = "Data source: billionaires_1000.csv"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_light()

This visualization shows the average net worth of billionaires across different industries. Industries such as shipping and telecom have the highest average net worth, while industries like real estate and investments have lower average values. The size of each point represents the number of billionaires in that industry, indicating that some industries have more individuals but lower average wealth. Overall, the chart suggests that wealth levels vary significantly across industries and are not evenly distributed.

Interactive Shiny App:

ui <- fluidPage(
  titlePanel("Billionaires Explorer"),
  
  sidebarLayout(
    sidebarPanel(
      selectInput(
        "xvar",
        "Choose X-axis:",
        choices = c("age", "daily_change_b", "daily_change_percent"),
        selected = "age"
      ),
      
      selectInput(
        "yvar",
        "Choose Y-axis:",
        choices = c("net_worth_b", "age", "daily_change_b"),
        selected = "net_worth_b"
      ),
      
      selectInput(
        "colorvar",
        "Color points by:",
        choices = c("industry", "country"),
        selected = "industry"
      ),
      
      checkboxInput("smooth", "Add regression line", value = FALSE)
    ),
    
    mainPanel(
      plotOutput("scatterplot"),
      verbatimTextOutput("summary_text")
    )
  )
)

server <- function(input, output) {
  
  output$scatterplot <- renderPlot({
    p <- ggplot(
      billionaires_clean,
      aes_string(x = input$xvar, y = input$yvar, color = input$colorvar)
    ) +
      geom_point(alpha = 0.7, size = 2) +
      labs(
        title = "Interactive Scatterplot of Billionaire Variables",
        x = input$xvar,
        y = input$yvar,
        color = input$colorvar,
        caption = "Data source: billionaires_1000.csv"
      ) +
      theme_light()
    
    if (input$smooth) {
      p <- p + geom_smooth(method = "lm", se = FALSE)
    }
    
    p
  })
  
  output$summary_text <- renderPrint({
    summary(billionaires_clean[, c(input$xvar, input$yvar)])
  })
}

shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents

This Shiny app allows the user to interactively explore relationships between quantitative variables in the billionaire dataset. The user can choose the variables for the x-axis and y-axis, color the points by industry or country, and optionally add a regression line.

Conclusion:

The project established that the distribution of billionaire wealth is heavily right-skewed: the majority possess relatively modest capital, while only a small minority of the population holds extremely high levels of wealth. The analysis revealed the clear dominance of the United States and China in terms of the number of billionaires—a key indicator of national financial wealth. It was also found that certain sectors—such as real estate, pharmaceuticals, and investment—are disproportionately represented among the world’s wealthiest individuals.

Regression analysis results indicate that daily changes in capital (both in absolute and percentage terms) are statistically significant factors, whereas age does not exert a positive influence on wealth levels. This model accounts for only a small fraction of the observed variation (indicated by a low adjusted R²), suggesting the existence of other factors that influence wealth accumulation. Overall, this project contributes to a better understanding of wealth distribution and the key variables associated with the status of billionaires.