The dataset documents the reasons for CEO departures in S&P 1500 firms from 2000 through 2018. The goal is to predict whether a departure was a dismissal (ceo_dismissal) using the departures dataset.

Import Data

data <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-27/departures.csv")

Explore Data

skimr::skim(data)

Data summary
Name	data
Number of rows	9423
Number of columns	19
_______________________
Column type frequency:
character	9
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
coname	0	1.00	2	30	3860
exec_fullname	0	1.00	5	790	8701
interim_coceo	9105	0.03	6	7	6
leftofc	1802	0.81	20	20	3627
still_there	7311	0.22	3	10	77
notes	1644	0.83	5	3117	7755
sources	1475	0.84	18	1843	7915
eight_ks	4499	0.52	69	3884	4914
X_merge	0	1.00	11	11	1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
dismissal_dataset_id	0	1.00	5684.10	25005.46	1	2305.5	4593	6812.5	559044	▇▁▁▁▁
gvkey	0	1.00	40132.48	53921.34	1004	7337.0	14385	60900.5	328795	▇▁▁▁▁
fyear	0	1.00	2007.74	8.19	1987	2000.0	2008	2016.0	2020	▁▆▅▅▇
co_per_rol	0	1.00	25580.22	18202.38	-1	8555.5	22980	39275.5	64602	▇▆▅▃▃
departure_code	1667	0.82	5.20	1.53	1	5.0	5	7.0	9	▁▃▇▅▁
ceo_dismissal	1813	0.81	0.20	0.40	0	0.0	0	0.0	1	▇▁▁▁▂
tenure_no_ceodb	0	1.00	1.03	0.17	0	1.0	1	1.0	3	▁▇▁▁▁
max_tenure_ceodb	0	1.00	1.05	0.24	1	1.0	1	1.0	4	▇▁▁▁▁
fyear_gone	1802	0.81	2006.64	13.63	1980	2000.0	2007	2013.0	2997	▇▁▁▁▁
cik	245	0.97	741469.17	486551.43	1750	106413.0	857323	1050375.8	1808065	▆▁▇▂▁

Issues With The Data

Missing Values

interim_coceo (97% missing)
still_there (78% missing)
eight_ks (48% missing)
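
The missing-value percentages above can be recomputed directly rather than read off the skim output. The following is a minimal sketch on a made-up data frame; the helper name `missing_share` and the 0.4 cutoff are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: share of NA values per column, largest first.
missing_share <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}

# Toy data frame standing in for `data`; the values are invented.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4
)

shares <- missing_share(toy)

# Columns above a chosen cutoff (0.4 is an arbitrary assumption)
# are candidates for removal.
high_missing <- names(shares[shares > 0.4])

print(shares)
print(high_missing)
```

On the real data, the same call would be `missing_share(data)`.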

Factors or Numeric Variables

departure_code (categorical but currently numeric)
interim_coceo (needs to be a factor)
leftofc (needs to be a factor)
still_there (needs to be a factor)

Zero-Variance Variables

X_merge
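
A column such as X_merge, which skim reports with a single unique value, can also be detected programmatically. This is a small sketch on invented data; `is_constant` is a hypothetical helper name.

```r
# Hypothetical helper: TRUE when a column has at most one distinct non-missing value.
is_constant <- function(x) {
  length(unique(x[!is.na(x)])) <= 1
}

# Toy data frame; `flag` plays the role of a zero-variance column.
toy <- data.frame(
  id    = 1:3,
  flag  = c("both", "both", "both"),
  score = c(0.1, 0.5, 0.9)
)

# Names of the columns that carry no variation.
constant_cols <- names(toy)[vapply(toy, is_constant, logical(1))]
print(constant_cols)
```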

Character Names

coname
exec_fullname
sources

Unbalanced Target Variable

ceo_dismissal
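
The skim output shows a mean of 0.20 for the 0/1 variable ceo_dismissal, so roughly one in five non-missing departures is a dismissal. The class shares can be checked as in the sketch below, which uses a made-up vector in place of `data$ceo_dismissal`.

```r
# Toy target vector standing in for data$ceo_dismissal (values are invented).
target <- c(0, 0, 0, 0, 1, NA)

# Count each class, ignoring missing values, then convert to proportions.
counts <- table(target, useNA = "no")
shares <- prop.table(counts)

print(counts)
print(shares)
```

On the real data, `prop.table(table(data$ceo_dismissal))` would give the corresponding shares.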

ID Variables

dismissal_dataset_id
gvkey
cik
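
Before relying on an ID column, it helps to check whether its values are actually unique. The sketch below counts repeated IDs in an invented vector; on the real data, the vector would be `data$dismissal_dataset_id`.

```r
# Toy ID vector (values are invented for illustration).
toy_ids <- c(101, 102, 102, 103, 103, 103)

# duplicated() flags every occurrence after the first one.
n_dup <- sum(duplicated(toy_ids))

# Number of distinct IDs that would remain after deduplication.
n_distinct_ids <- length(unique(toy_ids))

print(n_dup)
print(n_distinct_ids)
```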

Data Cleaning

# Load dplyr for the pipe and the data manipulation verbs used below
library(dplyr)

# Clean the data and ensure ceo_dismissal is a factor
data_clean <- data %>%
  # Keep rows with a known outcome, then recode the target as a labeled factor
  filter(!is.na(ceo_dismissal)) %>%
  mutate(ceo_dismissal = if_else(ceo_dismissal == 1, "dismissed", "not_dis")) %>%
  mutate(ceo_dismissal = as.factor(ceo_dismissal)) %>%
  
  # Remove variables with a high share of missing values
  select(-c(interim_coceo, still_there, eight_ks)) %>%

  # Remove irrelevant variables that don't seem to have predictive power
  select(-c(X_merge, sources)) %>%

  # Remove variable with info that only becomes available after the fact
  select(-departure_code) %>%

  # Remove redundant variables 
  select(-c(gvkey, cik, co_per_rol, leftofc, fyear)) %>%

  # Remove duplicates in dismissal_dataset_id, which is the id variable
  distinct(dismissal_dataset_id, .keep_all = TRUE) %>%

  # Drop the implausible year 2997 in fyear_gone (rows with a missing fyear_gone are dropped too)
  filter(fyear_gone < 2025) %>%

  # Convert numeric variables that should be factors
  mutate(across(c(tenure_no_ceodb, max_tenure_ceodb, fyear_gone), as.factor)) %>%

  # Convert all character variables to factors
  mutate(across(where(is.character), as.factor)) %>%

  # Convert notes back to character, since it is free text rather than a category
  mutate(notes = as.character(notes)) %>%
  
  # Remove rows with any remaining missing values
  na.omit()

skimr::skim(data_clean)

Data summary
Name	data_clean
Number of rows	7458
Number of columns	8
_______________________
Column type frequency:
character	1
factor	6
numeric	1
________________________
Group variables	None