This project examines whether CEO background characteristics are
associated with firm success among Fortune 100 companies. The main
outcome variable is company revenue, measured as revenue_m.
Because revenue is strongly right skewed, I also use
log_revenue in plots and summaries to better understand the
distribution. The analysis focuses on whether background variables such
as tenure, athletic background, gender, Ivy League education,
undergraduate major category, and postgraduate major category are
related to differences in firm revenue.
The dataset contains 100 Fortune 100 CEOs and their companies. Each row represents one CEO and the company they lead. The population of interest is CEOs of Fortune 100 firms.
Building this dataset required manual collection because much of the information was not available in one place. To collect undergraduate school, postgraduate school, major, and tenure, I first searched for each CEO on LinkedIn. If the relevant information was not listed there, I used the CEO biography or leadership page on the company website. Information on collegiate sports participation and company revenue came from a dataset published by Psychology Today. Since this dataset covers Fortune 100 CEOs specifically rather than a random sample of all CEOs, the results should be interpreted as describing this group rather than all corporate executives.
Before beginning the analysis, I checked that the data were in a
rectangular format with one row per CEO and one column per variable. I
also verified that the variables had the expected types, converted the
indicator variables into factors with readable labels, and created
log_revenue to better handle the skewness in company
revenue.
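The preparation steps described above can be sketched in base R roughly as follows. This is a minimal illustration, not the code used for the report: the data frame `ceo_toy` and all of its values are invented, and it assumes the indicator variables were stored as 0/1 before being converted to factors.

```r
# Toy data with invented values; the real data frame has 100 rows.
# Assumption: the indicators are stored as 0/1 before conversion.
ceo_toy <- data.frame(
  ceo_name          = c("CEO A", "CEO B", "CEO C", "CEO D"),
  female_indicator  = c(0, 0, 1, 0),
  athlete_indicator = c(1, 0, 1, 1),
  revenue_m         = c(50000, 80000, 120000, 600000),
  tenure            = c(2, 5, 8, 30),
  stringsAsFactors  = FALSE
)

# Rectangular check: one row per CEO, so no name may appear twice.
stopifnot(!anyDuplicated(ceo_toy$ceo_name))

# Recode the 0/1 indicators as factors with readable labels.
ceo_toy$female_indicator <- factor(ceo_toy$female_indicator,
                                   levels = c(0, 1),
                                   labels = c("Male", "Female"))
ceo_toy$athlete_indicator <- factor(ceo_toy$athlete_indicator,
                                    levels = c(0, 1),
                                    labels = c("Non-Athlete", "Athlete"))

# Natural log of revenue (revenue_m is in millions of dollars).
ceo_toy$log_revenue <- log(ceo_toy$revenue_m)

# Inspect the resulting column types.
str(ceo_toy)
```

Listing the levels explicitly keeps the label order fixed, so that "Non-Athlete" appears before "Athlete" in later tables and plots.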
I then checked for missing values and unusual features. The only
variables with missing values were postgrad_college and
postgrad_major_category. In these cases, the missing values
reflected CEOs who did not attend postgraduate school, so I replaced
those values with "None" to make the dataset easier to
interpret. The quantitative variables, especially revenue and tenure,
include some large values, but these appear to be real observations
rather than obvious errors, so they were kept in the analysis.
| variable | missing |
|---|---|
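One way to carry out the missing-value check and replacement described above is sketched here in base R. The data frame `ceo_toy`, the school names, and the category values are invented for illustration. If the summary lists only the columns that still contain missing values, it comes back empty once the replacement has been made.

```r
# Toy data with invented values; NA marks a CEO with no postgraduate study.
ceo_toy <- data.frame(
  ceo_name                = c("CEO A", "CEO B", "CEO C"),
  postgrad_college        = c("School X", NA, "School Y"),
  postgrad_major_category = c("Business", NA, "Law"),
  stringsAsFactors        = FALSE
)

# Count missing values in every column before any cleaning.
missing_before <- colSums(is.na(ceo_toy))

# Replace NA with "None" in the two postgraduate columns only.
postgrad_cols <- c("postgrad_college", "postgrad_major_category")
for (col in postgrad_cols) {
  ceo_toy[[col]][is.na(ceo_toy[[col]])] <- "None"
}

# Treat the major category as a factor so "None" becomes its own level.
ceo_toy$postgrad_major_category <- factor(ceo_toy$postgrad_major_category)

# Recount, then keep only the columns that still have missing values.
missing_after <- colSums(is.na(ceo_toy))
missing_after[missing_after > 0]
```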
The first six rows of the data frame are shown below. The full dataset has 100 observations and 14 variables.
## # A tibble: 6 × 14
## ceo_name female_indicator athlete_indicator undergrad_college
## <chr> <fct> <fct> <chr>
## 1 Douglas McMillon Male Athlete University of Arkansas
## 2 Andrew Jassy Male Athlete Harvard University
## 3 Tim Cook Male Athlete Auburn University
## 4 Andrew Witty Male Athlete University of Nottingham
## 5 Warren Buffett Male Non-Athlete University of Nebraska
## 6 Karen Lynch Female Athlete Boston College
## # ℹ 10 more variables: undergrad_major_category <fct>, postgrad_college <chr>,
## # postgrad_major_category <fct>, ivy_undergrad <fct>, ivy_postgrad <fct>,
## # company <chr>, revenue_m <dbl>, fortune_rank <dbl>, tenure <dbl>,
## # log_revenue <dbl>
Each row represents one Fortune 100 CEO. The most important columns
used in this analysis are: revenue_m, company revenue in
millions of dollars; log_revenue, the natural log of
revenue_m; tenure, years as CEO;
female_indicator, CEO gender (Male or Female);
athlete_indicator, whether the CEO has an athletic
background (Athlete or Non-Athlete); ivy_undergrad, whether the CEO
attended an Ivy League college for undergraduate study (Ivy or
Non-Ivy); ivy_postgrad, whether the CEO attended an Ivy League
institution for postgraduate study; undergrad_major_category, broad
undergraduate major category; and postgrad_major_category, broad
postgraduate major category.
Because this dataset contains several variables, it is important to be selective rather than show every possible graph or summary. I focus on the variables that are most relevant to the research question: company revenue, log revenue, tenure, athletic background, Ivy League undergraduate education, and undergraduate major category. This makes the analysis more coherent and keeps attention on the variables that are most informative for understanding possible differences in firm success.
For quantitative variables, the main features I focus on are center, spread, skewness, and potential outliers. For categorical variables, I focus on the counts in each category and whether the distribution is balanced or concentrated in only a few groups. I also make sure that the response variable, revenue, is summarized carefully since it is the main measure of firm success in this project.
| n | mean_revenue | median_revenue | sd_revenue | min_revenue | max_revenue |
|---|---|---|---|---|---|
| 100 | 122435.7 | 80296 | 107869.7 | 43452 | 648125 |
Revenue is strongly right skewed, with a few companies having much larger revenues than the rest. The mean (about $122,436 million) is roughly 1.5 times the median ($80,296 million), which is another sign of right skewness. The histogram shows that most firms are grouped at lower revenue values, while a small number of very large companies, with revenue up to $648,125 million, create a long right tail.
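A summary table with the same columns as the revenue table can be produced along these lines. This is a sketch using a short vector of invented revenue values, not the real data, and the comparison of mean and median at the end is only a quick numerical check of the direction of skew.

```r
# Invented revenue values in millions of dollars (not the real data).
revenue_m <- c(45000, 60000, 75000, 85000, 120000, 650000)

# One-row summary with the same columns as the revenue table.
summary_tbl <- data.frame(
  n              = length(revenue_m),
  mean_revenue   = mean(revenue_m),
  median_revenue = median(revenue_m),
  sd_revenue     = sd(revenue_m),
  min_revenue    = min(revenue_m),
  max_revenue    = max(revenue_m)
)
summary_tbl

# A mean well above the median is a quick signal of right skew.
mean_to_median <- summary_tbl$mean_revenue / summary_tbl$median_revenue
mean_to_median > 1
```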
| mean_log_revenue | median_log_revenue | sd_log_revenue | min_log_revenue | max_log_revenue |
|---|---|---|---|---|
| 11.46798 | 11.29325 | 0.6502731 | 10.67941 | 13.38184 |
Taking the log of revenue makes the distribution noticeably more symmetric and easier to analyze. The extreme right tail seen in the original revenue variable is reduced, although the mean of log revenue (11.47) still sits above the median (11.29), so some right skew remains. The transformation is helpful because it makes comparisons across groups clearer and is better suited to later modeling.
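The effect of the log transformation on skewness can also be checked numerically. The sketch below uses invented revenue values and a hypothetical helper, `moment_skewness`, which computes the usual moment-based skewness coefficient because base R has no built-in skewness function. It is not part of the original analysis.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2).
moment_skewness <- function(x) {
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^(3 / 2)
}

# Invented revenue values in millions of dollars (not the real data).
revenue_m   <- c(45000, 60000, 75000, 85000, 120000, 650000)
log_revenue <- log(revenue_m)

skew_raw <- moment_skewness(revenue_m)
skew_log <- moment_skewness(log_revenue)

# Positive values indicate a right tail; the log scale should shrink it.
c(raw = skew_raw, log = skew_log)
```

A value near zero would indicate an approximately symmetric distribution. A positive value on the log scale means some right skew remains after the transformation.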
| mean_tenure | median_tenure | sd_tenure | min_tenure | max_tenure |
|---|---|---|---|---|
| 7.85 | 5.5 | 7.75167 | 1 | 54 |
Tenure is also right skewed. Most CEOs have relatively short tenures (the median is 5.5 years), while a smaller number have been in the role for many years, up to a maximum of 54. The histogram shows a clear concentration at lower values and a few larger observations in the tail, which suggests that long-serving CEOs are less common in this group.
| athlete_indicator | n |
|---|---|
| Non-Athlete | 32 |
| Athlete | 68 |
This bar chart shows that most CEOs in the dataset (68 of 100) are classified as athletes, though a substantial minority (32) have no athletic background. Since the categories are not evenly split, any comparison across these groups should be interpreted carefully. Still, the plot is useful because it quickly shows how common athletic participation is among Fortune 100 CEOs.
| ivy_undergrad | n |
|---|---|
| Non-Ivy | 86 |
| Ivy | 14 |
Most CEOs in the dataset (86 of 100) did not attend an Ivy League school for their undergraduate degree. The distribution is quite unbalanced, with the non-Ivy group about six times the size of the Ivy group. This suggests that while Ivy League education may be notable, it is not the typical undergraduate path among these CEOs.
| undergrad_major_category | n |
|---|---|
| Business | 44 |
| Economics | 5 |
| Engineering | 22 |
| Humanities | 9 |
| Mixed | 6 |
| Science | 9 |
| Unknown | 5 |
The distribution of undergraduate majors is concentrated in a few categories, especially business and engineering. Other categories appear less often, which suggests that some academic backgrounds are much more common than others among Fortune 100 CEOs. The bar chart is useful here because it makes the imbalance across categories easy to see.
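The concentration in business and engineering can be quantified directly from the counts in the table above. The sketch below types those counts in by hand as a named vector. With the full data frame, `table()` on the undergrad_major_category column would presumably give the same counts.

```r
# Counts copied from the undergraduate major table above.
major_counts <- c(Business = 44, Economics = 5, Engineering = 22,
                  Humanities = 9, Mixed = 6, Science = 9, Unknown = 5)

# Convert counts to proportions of all CEOs.
major_share <- prop.table(major_counts)

# Sort from most to least common and add up the two largest groups.
major_sorted  <- sort(major_share, decreasing = TRUE)
top_two_share <- sum(major_sorted[1:2])

round(major_sorted, 2)
top_two_share
```

Business and engineering together account for 66 of the 100 CEOs, which is the imbalance the bar chart makes visible.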
To understand how CEO background characteristics relate to firm success, I focus on a small number of relationship plots rather than plotting every possible pair of variables. These plots were chosen because they connect directly to the research question and compare the response variable, log revenue, to several meaningful CEO characteristics. The scatterplot is especially useful for looking at the association between two quantitative variables, while the boxplots are useful for comparing a quantitative outcome across categorical groups.
I chose this scatterplot because both tenure and revenue are central to the research question, and it allows me to examine whether more years in the CEO role are associated with differences in firm revenue. The plot suggests a slight positive relationship, meaning firms led by longer-tenured CEOs may have somewhat higher revenues on average. However, the points are widely scattered, so the association appears weak.
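A correlation coefficient is a natural numerical companion to the scatterplot. The sketch below uses invented tenure and revenue values, so the resulting numbers say nothing about the real data. It shows only how direction and strength could be summarized.

```r
# Toy values invented for illustration only.
ceo_toy <- data.frame(
  tenure    = c(1, 3, 5, 6, 10, 20),
  revenue_m = c(50000, 90000, 60000, 150000, 80000, 200000)
)
ceo_toy$log_revenue <- log(ceo_toy$revenue_m)

# Pearson correlation: the sign gives direction, the size gives strength.
r_pearson <- cor(ceo_toy$tenure, ceo_toy$log_revenue)

# Spearman rank correlation is less sensitive to a few very long tenures.
r_spearman <- cor(ceo_toy$tenure, ceo_toy$log_revenue, method = "spearman")

round(c(pearson = r_pearson, spearman = r_spearman), 2)
```

Calling `plot(log_revenue ~ tenure, data = ceo_toy)` draws the corresponding scatterplot. Because tenure is right skewed, reporting the rank-based coefficient alongside the Pearson value is a reasonable precaution.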
I included this boxplot because athletic background was one of the main CEO characteristics collected in the dataset. The distributions overlap heavily, which suggests that firms led by former athletes and non-athletes have fairly similar revenue patterns. While there may be some difference in center, the plot does not show a large or obvious separation between the two groups.
This boxplot was chosen to examine whether CEOs with Ivy League undergraduate education tend to lead higher revenue firms. The two groups look fairly similar, with substantial overlap in both spread and center. This suggests that Ivy League undergraduate education does not have a strong visible relationship with revenue in this dataset.
I also included undergraduate major because it is another meaningful background characteristic that varies across CEOs. Some categories appear to have slightly different centers, but there is still a large amount of overlap, and no category stands out as dramatically different from the others. This suggests that undergraduate major may not be strongly related to firm revenue on its own.
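The visual comparisons in these boxplots can be backed up with group medians and the five-number summaries that a boxplot is built from. The sketch below uses invented values and the athlete grouping as an example. The same pattern would apply to ivy_undergrad or undergrad_major_category. The overlap check on the boxes is an illustrative heuristic, not a formal test.

```r
# Toy values invented for illustration only.
ceo_toy <- data.frame(
  athlete_indicator = factor(
    c("Athlete", "Athlete", "Athlete",
      "Non-Athlete", "Non-Athlete", "Non-Athlete"),
    levels = c("Non-Athlete", "Athlete")
  ),
  revenue_m = c(50000, 100000, 400000, 60000, 90000, 300000)
)
ceo_toy$log_revenue <- log(ceo_toy$revenue_m)

# Center and spread of log revenue within each group.
group_median <- tapply(ceo_toy$log_revenue, ceo_toy$athlete_indicator, median)
group_iqr    <- tapply(ceo_toy$log_revenue, ceo_toy$athlete_indicator, IQR)

# Boxplot statistics without drawing; the rows of bp$stats are the
# lower whisker, lower hinge, median, upper hinge, and upper whisker.
bp <- boxplot(log_revenue ~ athlete_indicator, data = ceo_toy, plot = FALSE)

# The boxes overlap if the higher lower hinge is below the lower upper hinge.
boxes_overlap <- max(bp$stats[2, ]) < min(bp$stats[4, ])

group_median
boxes_overlap
```

Dropping `plot = FALSE` draws the boxplot itself.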
Overall, the descriptive analysis suggests that company revenue varies widely across Fortune 100 firms and is strongly right skewed, which makes the log transformation useful for interpretation. CEO background variables such as tenure, athlete status, Ivy League undergraduate education, and undergraduate major category show some differences across groups, but the plots do not suggest strong or dramatic relationships with revenue.
The clearest pattern is that most of the relationship plots show substantial overlap across groups and considerable variability within groups. This means that while CEO background characteristics may be associated with small differences in firm revenue, they do not appear to explain large amounts of variation on their own. Because the data are observational and limited to Fortune 100 CEOs, these results should be interpreted as descriptive associations rather than causal effects.