Introduction

This project examines whether CEO background characteristics are associated with firm success among Fortune 100 companies. The main outcome variable is company revenue, measured as revenue_m. Because revenue is strongly right-skewed, I also use log_revenue, the natural log of revenue, in plots and summaries to better understand the distribution. The analysis focuses on whether background variables such as tenure, athletic background, gender, Ivy League education, undergraduate major category, and postgraduate major category are related to differences in firm revenue.

Data

The dataset contains 100 Fortune 100 CEOs and their companies. Each row represents one CEO and the company they lead. The population of interest is CEOs of Fortune 100 firms.

Building this dataset required manual collection because much of the information was not available in one place. To collect undergraduate school, postgraduate school, major, and tenure, I first searched for each CEO on LinkedIn. If the relevant information was not listed there, I used the CEO biography or leadership page on the company website. Information on collegiate sports participation and company revenue came from a dataset published by Psychology Today. Because this dataset covers Fortune 100 CEOs specifically rather than a random sample of all CEOs, the results should be interpreted as describing this group rather than all corporate executives.

Data preparation

Before beginning the analysis, I checked that the data were in a rectangular format with one row per CEO and one column per variable. I also verified that the variables had the expected types, converted the indicator variables into factors with readable labels, and created log_revenue to better handle the skewness in company revenue.

I then checked for missing values and unusual features. The only variables with missing values were postgrad_college and postgrad_major_category. In these cases, the missing values reflected CEOs who did not attend postgraduate school, so I replaced those values with "None" to make the dataset easier to interpret. The quantitative variables, especially revenue and tenure, include some large values, but these appear to be real observations rather than obvious errors, so they were kept in the analysis.

Variables with remaining missing values
variable  missing
(no rows: no variable has missing values after the replacement step)
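
The preparation steps described above can be sketched as follows. This is a minimal Python illustration, not the code used to produce this report: the two rows are made-up placeholders rather than records from the dataset, and the helper names prepare and count_missing are hypothetical. Only the column names and the "None" replacement come from the text.

```python
import math

# Placeholder rows (NOT real records). Column names follow the dataset
# description; a Python None marks a missing value.
rows = [
    {"ceo_name": "CEO A", "revenue_m": 100000.0,
     "postgrad_college": None, "postgrad_major_category": None},
    {"ceo_name": "CEO B", "revenue_m": 50000.0,
     "postgrad_college": "Example University",
     "postgrad_major_category": "Business"},
]

POSTGRAD_COLUMNS = ("postgrad_college", "postgrad_major_category")


def prepare(row):
    """Return a cleaned copy of one row (hypothetical helper)."""
    cleaned = dict(row)
    for col in POSTGRAD_COLUMNS:
        # A missing postgraduate field means no postgraduate school.
        if cleaned[col] is None:
            cleaned[col] = "None"
    # Natural log of revenue, used to reduce right skew.
    cleaned["log_revenue"] = math.log(cleaned["revenue_m"])
    return cleaned


def count_missing(data):
    """Count missing values per column, listing only columns with any."""
    counts = {}
    for row in data:
        for col, value in row.items():
            if value is None:
                counts[col] = counts.get(col, 0) + 1
    return counts


prepared = [prepare(r) for r in rows]
missing_after = count_missing(prepared)
print(missing_after)  # {} : nothing is missing after the replacement
```

An empty result here corresponds to the empty table of remaining missing values.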

First rows of the data

The first few rows of the data frame are shown below. There are 100 total observations in the dataset.

## # A tibble: 6 × 14
##   ceo_name         female_indicator athlete_indicator undergrad_college       
##   <chr>            <fct>            <fct>             <chr>                   
## 1 Douglas McMillon Male             Athlete           University of Arkansas  
## 2 Andrew Jassy     Male             Athlete           Harvard University      
## 3 Tim Cook         Male             Athlete           Auburn University       
## 4 Andrew Witty     Male             Athlete           University of Nottingham
## 5 Warren Buffett   Male             Non-Athlete       University of Nebraska  
## 6 Karen Lynch      Female           Athlete           Boston College          
## # ℹ 10 more variables: undergrad_major_category <fct>, postgrad_college <chr>,
## #   postgrad_major_category <fct>, ivy_undergrad <fct>, ivy_postgrad <fct>,
## #   company <chr>, revenue_m <dbl>, fortune_rank <dbl>, tenure <dbl>,
## #   log_revenue <dbl>

Each row represents one Fortune 100 CEO. The most important columns used in this analysis are:

revenue_m: company revenue in millions of dollars
log_revenue: the natural log of revenue
tenure: years as CEO
female_indicator: CEO gender indicator
athlete_indicator: whether the CEO has an athletic background
ivy_undergrad: whether the CEO attended an Ivy League college for undergraduate study
ivy_postgrad: whether the CEO attended an Ivy League institution for postgraduate study
undergrad_major_category: broad undergraduate major category
postgrad_major_category: broad postgraduate major category

Graphical and numerical summaries

Because this dataset contains several variables, it is important to be selective rather than show every possible graph or summary. I focus on the variables that are most relevant to the research question: company revenue, log revenue, tenure, athletic background, Ivy League undergraduate education, and undergraduate major category. This makes the analysis more coherent and keeps attention on the variables that are most informative for understanding possible differences in firm success.

For quantitative variables, the main features I focus on are center, spread, skewness, and potential outliers. For categorical variables, I focus on the counts in each category and whether the distribution is balanced or concentrated in only a few groups. I also make sure that the response variable, revenue, is summarized carefully since it is the main measure of firm success in this project.

Revenue

Numerical summary of revenue (millions of dollars)
n    mean_revenue  median_revenue  sd_revenue  min_revenue  max_revenue
100  122435.7      80296           107869.7    43452        648125

Revenue is strongly right-skewed, with a few companies having much larger revenues than the rest. The mean (about $122,436 million) is roughly 1.5 times the median ($80,296 million), which is another sign of right skewness. The histogram shows that most firms are grouped at lower revenue values, while a small number of very large companies create a long right tail.
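
The comparison of mean and median can be made concrete with the values from the summary table. The sketch below is a small Python illustration, not part of the original analysis. It computes the mean-to-median ratio and Pearson's median skewness coefficient, 3 × (mean − median) / sd, a rough indicator of skew that was not reported in the original output.

```python
# Values copied from the numerical summary of revenue (millions of dollars).
mean_revenue = 122435.7
median_revenue = 80296.0
sd_revenue = 107869.7

# How many times larger the mean is than the median.
mean_to_median = mean_revenue / median_revenue

# Pearson's median skewness coefficient: positive values indicate
# right skew, values near zero indicate rough symmetry.
median_skewness = 3 * (mean_revenue - median_revenue) / sd_revenue

print(round(mean_to_median, 2))   # 1.52
print(round(median_skewness, 2))  # 1.17
```

Both numbers point the same way as the histogram: the mean is pulled well above the median by the largest firms.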

Log revenue

Numerical summary of log revenue
mean_log_revenue  median_log_revenue  sd_log_revenue  min_log_revenue  max_log_revenue
11.46798          11.29325            0.6502731       10.67941         13.38184

Taking the natural log of revenue makes the distribution more symmetric and easier to analyze. The extreme right tail seen in the original revenue variable is compressed, so the histogram shows a more balanced shape, although the mean (11.47) still exceeds the median (11.29), so some right skew remains. This is helpful because it makes comparisons across groups clearer and keeps a few very large firms from dominating later modeling.
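
As a quick consistency check, the reported minimum and maximum of log revenue can be recomputed from the minimum and maximum of revenue, since the log is an increasing function. The Python sketch below is an illustration only. It uses only the summary values reported in this report and shows how a roughly 15-fold range in revenue becomes a gap of about 2.7 on the log scale.

```python
import math

# Minimum and maximum revenue from the revenue summary (millions of dollars).
min_revenue = 43452
max_revenue = 648125

# The natural log preserves order, so these should match the reported
# min_log_revenue (10.67941) and max_log_revenue (13.38184).
min_log = math.log(min_revenue)
max_log = math.log(max_revenue)

# A multiplicative range on the raw scale becomes an additive gap on the
# log scale: log(max / min) == log(max) - log(min).
ratio = max_revenue / min_revenue
log_gap = max_log - min_log

print(round(min_log, 4), round(max_log, 4))
print(round(ratio, 1), round(log_gap, 2))
```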

Tenure

Numerical summary of tenure (years)
mean_tenure  median_tenure  sd_tenure  min_tenure  max_tenure
7.85         5.5            7.75167    1           54

Tenure is also right-skewed: the mean (7.85 years) exceeds the median (5.5 years), and the maximum is 54 years. Most CEOs have relatively short tenures, while a smaller number have been in the role for many years. The histogram shows a clear concentration at lower values and a few larger observations in the tail, which suggests that long-serving CEOs are less common in this group.
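
One way to see how lopsided the tenure distribution is, using only the reported summary values, is to express the minimum and maximum as distances from the mean in standard deviation units. This Python sketch is an illustration under that framing and was not part of the original output.

```python
# Values copied from the numerical summary of tenure (years).
mean_tenure = 7.85
sd_tenure = 7.75167
min_tenure = 1
max_tenure = 54

# Standardized distance of each extreme from the mean.
z_min = (min_tenure - mean_tenure) / sd_tenure
z_max = (max_tenure - mean_tenure) / sd_tenure

print(round(z_min, 2))  # -0.88
print(round(z_max, 2))  # 5.95
```

The shortest tenure sits less than one standard deviation below the mean, while the longest sits almost six standard deviations above it, which is consistent with a long right tail.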

Athlete background

Counts for athlete background
athlete_indicator n
Non-Athlete 32
Athlete 68

This bar chart shows that most CEOs in the dataset (68 of 100) are classified as athletes, while a substantial minority (32) are classified as non-athletes. Since the categories are not evenly split, any comparison across these groups should be interpreted carefully. Still, the plot is useful because it quickly shows how common athletic participation is among Fortune 100 CEOs.

Ivy League undergraduate education

Counts for Ivy League undergraduate education
ivy_undergrad n
Non-Ivy 86
Ivy 14

Most CEOs in the dataset (86 of 100) did not attend an Ivy League school for their undergraduate degree. The distribution is strongly unbalanced, with the non-Ivy group about six times as large as the Ivy group. This suggests that while Ivy League education may be notable, it is not the typical undergraduate path among these CEOs. With only 14 CEOs in the Ivy group, comparisons involving that group rest on a small number of observations.

Undergraduate major category

Counts for undergraduate major category
undergrad_major_category n
Business 44
Economics 5
Engineering 22
Humanities 9
Mixed 6
Science 9
Unknown 5

The distribution of undergraduate majors is concentrated in a few categories, especially business (44) and engineering (22), which together account for 66 of the 100 CEOs. Each of the other categories contains fewer than 10 CEOs, including 5 whose major is unknown, which suggests that some academic backgrounds are much more common than others among Fortune 100 CEOs. The bar chart is useful here because it makes the imbalance across categories easy to see.
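
The concentration in a few categories can be quantified directly from the counts table. The following Python sketch is an illustration, not the original analysis code. It converts the reported counts to shares and finds the combined share of the two largest categories.

```python
# Counts copied from the undergraduate major category table.
major_counts = {
    "Business": 44,
    "Economics": 5,
    "Engineering": 22,
    "Humanities": 9,
    "Mixed": 6,
    "Science": 9,
    "Unknown": 5,
}

total = sum(major_counts.values())

# Share of CEOs in each category.
shares = {major: count / total for major, count in major_counts.items()}

# The two largest categories and their combined share.
top_two = sorted(major_counts, key=major_counts.get, reverse=True)[:2]
top_two_share = sum(major_counts[m] for m in top_two) / total

print(total)          # 100
print(top_two)        # ['Business', 'Engineering']
print(top_two_share)  # 0.66
```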

Relationships between variables

To understand how CEO background characteristics relate to firm success, I focus on a small number of relationship plots rather than plotting every possible pair of variables. These plots were chosen because they connect directly to the research question and compare the response variable, log revenue, to several meaningful CEO characteristics. The scatterplot is especially useful for looking at the association between two quantitative variables, while the boxplots are useful for comparing a quantitative outcome across categorical groups.

Tenure and log revenue

I chose this scatterplot because both tenure and revenue are central to the research question, and it allows me to examine whether more years in the CEO role are associated with differences in firm revenue. The plot suggests a slight positive relationship, meaning firms led by longer-tenured CEOs may have somewhat higher revenues on average. However, the points are widely scattered, so the association appears weak.
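
A visual impression of a weak positive association is usually backed up with a correlation coefficient. The report does not state one, so the Python sketch below only shows how Pearson's r could be computed. The function name pearson_r is hypothetical, and the tenure and log revenue values are made-up toy numbers, not data from this project, so the resulting r says nothing about the real dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy values for illustration only (NOT taken from the dataset).
toy_tenure = [1, 3, 5, 8, 12, 20]
toy_log_revenue = [11.2, 10.9, 11.6, 11.1, 11.8, 11.4]

r = pearson_r(toy_tenure, toy_log_revenue)
print(round(r, 2))
```

Values of r near 0 indicate a weak linear association, and values near 1 or -1 indicate a strong one.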

Athlete background and log revenue

I included this boxplot because athletic background was one of the main CEO characteristics collected in the dataset. The distributions overlap heavily, which suggests that firms led by athletes and non-athletes have fairly similar revenue patterns. While there may be some difference in center, the plot does not show a large or obvious separation between the two groups.
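
The quantities a side-by-side boxplot displays (minimum, quartiles, median, maximum per group) can also be tabulated. The Python sketch below is an illustration only. The helper five_number_summary is hypothetical, and the (group, log revenue) pairs are made-up toy values, not observations from the dataset. Quartile conventions differ between tools, so the numbers may not match a given plotting library exactly.

```python
import statistics
from collections import defaultdict


def five_number_summary(values):
    """Min, quartiles, and max of a list with at least two values."""
    # "inclusive" treats the data as covering the full range of interest.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2,
            "q3": q3, "max": max(values)}


# Toy (group, log_revenue) pairs for illustration only (NOT real data).
toy_records = [
    ("Athlete", 11.0), ("Athlete", 11.3), ("Athlete", 11.5),
    ("Athlete", 12.1), ("Athlete", 10.9),
    ("Non-Athlete", 11.1), ("Non-Athlete", 11.2), ("Non-Athlete", 11.9),
    ("Non-Athlete", 10.8),
]

# Collect the outcome values for each group.
by_group = defaultdict(list)
for group, value in toy_records:
    by_group[group].append(value)

summaries = {group: five_number_summary(vals)
             for group, vals in by_group.items()}

for group, summary in summaries.items():
    print(group, summary)
```

Heavily overlapping interquartile ranges (q1 to q3) across groups correspond to the kind of overlap described for the boxplots in this section.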

Ivy League undergraduate education and log revenue

This boxplot was chosen to examine whether CEOs with Ivy League undergraduate education tend to lead higher-revenue firms. The two groups look fairly similar, with substantial overlap in both spread and center. This suggests that Ivy League undergraduate education does not have a strong visible relationship with revenue in this dataset.

Undergraduate major category and log revenue

I also included undergraduate major because it is another meaningful background characteristic that varies across CEOs. Some categories appear to have slightly different centers, but there is still a large amount of overlap, and no category stands out as dramatically different from the others. This suggests that undergraduate major may not be strongly related to firm revenue on its own.

Conclusion

Overall, the descriptive analysis suggests that company revenue varies widely across Fortune 100 firms and is strongly right-skewed, which makes the log transformation useful for interpretation. CEO background variables such as tenure, athlete status, Ivy League undergraduate education, and undergraduate major category show some differences in log revenue, but the plots do not suggest strong relationships with revenue.

The clearest pattern is that most of the relationship plots show substantial overlap across groups and considerable variability within groups. This means that while CEO background characteristics may be associated with small differences in firm revenue, they do not appear to explain large amounts of variation on their own. Because the data are observational and limited to Fortune 100 CEOs, these results should be interpreted as descriptive associations rather than causal effects.