Executive Summary

This report presents an exploratory data analysis of a large financial ratio dataset (25,121 observations, 67 variables) used for bankruptcy prediction. I focus on two key quantitative variables—net_profit_over_total_assets (profitability) and total_liabilities_over_total_assets (leverage)—and two qualitative variables—forecasting_period and class (Solvent vs. Bankrupt). These variables were selected based on their theoretical importance in the financial distress literature and to satisfy the requirement of analyzing both quantitative and qualitative data.

The univariate analysis shows that both profitability and leverage exhibit extreme skewness and heavy tails, with many firms clustered near typical values but a small number of firms in extreme financial conditions. The class variable is highly imbalanced, with relatively few bankrupt firms, while forecasting periods are reasonably well distributed. Bivariate analyses reveal a weak but statistically significant negative correlation between profitability and leverage, and a statistically significant association between forecasting horizon and bankruptcy status: longer horizons show slightly higher bankruptcy rates. These patterns justify the use of robust descriptive statistics, careful visualization, and future modeling strategies that account for class imbalance and non-normality.

Introduction

The purpose of this assignment is to continue the Exploratory Data Analysis (EDA) process using the assigned dataset, train 14(in).csv. The dataset contains 25,121 observations and 67 variables describing financial ratios and other characteristics for firms, along with a binary target variable, class, which indicates whether the firm is in financial distress (1) or not (0).

For this analysis, I focus on four variables:

These variables were selected because they are conceptually important in the bankruptcy prediction literature. Profitability and leverage are classic predictors of financial distress, while forecasting period and class allow examination of how predicted failure status varies over different forecast horizons.

Data Import and Preparation

# Read the dataset (assumes the CSV is in the working directory)
df <- read.csv("train 14(in).csv", stringsAsFactors = FALSE)

# Basic structure
glimpse(df)
## Rows: 25,121
## Columns: 67
## $ id                                                                                                                                 <int> …
## $ forecasting.period                                                                                                                 <int> …
## $ net_profit_over_total_assets                                                                                                       <chr> …
## $ total_liabilities_over_total_assets                                                                                                <chr> …
## $ working_capital_over_total_assets                                                                                                  <chr> …
## $ current_assets_over_short_term_liabilities                                                                                         <chr> …
## $ cash_plus_short_term_securities_plus_receivables_minus_short_term_liabilities_over_operating_expenses_minus_depreciation_times_365 <chr> …
## $ retained_earnings_over_total_assets                                                                                                <chr> …
## $ EBIT_over_total_assets                                                                                                             <chr> …
## $ book_value_of_equity_over_total_liabilities                                                                                        <chr> …
## $ sales_over_total_assets                                                                                                            <chr> …
## $ equity_over_total_assets                                                                                                           <chr> …
## $ gross_profit_plus_extraordinary_items_plus_financial_expenses_over_total_assets                                                    <chr> …
## $ gross_profit_over_short_term_liabilities                                                                                           <chr> …
## $ gross_profit_plus_depreciation_over_sales                                                                                          <chr> …
## $ gross_profit_plus_interest_over_total_assets                                                                                       <chr> …
## $ total_liabilities_times_365_over_gross_profit_plus_depreciation                                                                    <chr> …
## $ gross_profit_plus_depreciation_over_total_liabilities                                                                              <chr> …
## $ total_assets_over_total_liabilities                                                                                                <chr> …
## $ gross_profit_over_total_assets                                                                                                     <chr> …
## $ gross_profit_over_sales                                                                                                            <chr> …
## $ inventory_times_365_over_sales                                                                                                     <chr> …
## $ sales_n_over_sales_n_minus_1                                                                                                       <chr> …
## $ profit_on_operating_activities_over_total_assets                                                                                   <chr> …
## $ net_profit_over_sales                                                                                                              <chr> …
## $ gross_profit_in_3_years_over_total_assets                                                                                          <chr> …
## $ equity_minus_share_capital_over_total_assets                                                                                       <chr> …
## $ net_profit_plus_depreciation_over_total_liabilities                                                                                <chr> …
## $ profit_on_operating_activities_over_financial_expenses                                                                             <chr> …
## $ working_capital_over_fixed_assets                                                                                                  <chr> …
## $ logarithm_of_total_assets                                                                                                          <chr> …
## $ total_liabilities_minus_cash_over_sales                                                                                            <chr> …
## $ gross_profit_plus_interest_over_sales                                                                                              <chr> …
## $ current_liabilities_times_365_over_cost_of_products_sold                                                                           <chr> …
## $ operating_expenses_over_short_term_liabilities                                                                                     <chr> …
## $ operating_expenses_over_total_liabilities                                                                                          <chr> …
## $ profit_on_sales_over_total_assets                                                                                                  <chr> …
## $ total_sales_over_total_assets                                                                                                      <chr> …
## $ current_assets_minus_inventories_over_long_term_liabilities                                                                        <chr> …
## $ constant_capital_over_total_assets                                                                                                 <chr> …
## $ profit_on_sales_over_sales                                                                                                         <chr> …
## $ current_assets_minus_inventory_minus_receivables_over_short_term_liabilities                                                       <chr> …
## $ total_liabilities_over_profit_on_operating_activities_plus_depreciation_times_12_over_365                                          <chr> …
## $ profit_on_operating_activities_over_sales                                                                                          <chr> …
## $ rotation_receivables_plus_inventory_turnover_in_days                                                                               <chr> …
## $ receivables_times_365_over_sales                                                                                                   <chr> …
## $ net_profit_over_inventory                                                                                                          <chr> …
## $ current_assets_minus_inventory_over_short_term_liabilities                                                                         <chr> …
## $ inventory_times_365_over_cost_of_products_sold                                                                                     <chr> …
## $ EBITDA_profit_on_operating_activities_minus_depreciation_over_total_assets                                                         <chr> …
## $ EBITDA_profit_on_operating_activities_minus_depreciation_over_sales                                                                <chr> …
## $ current_assets_over_total_liabilities                                                                                              <chr> …
## $ short_term_liabilities_over_total_assets                                                                                           <chr> …
## $ short_term_liabilities_times_365_over_cost_of_products_sold                                                                        <chr> …
## $ equity_over_fixed_assets                                                                                                           <chr> …
## $ constant_capital_over_fixed_assets                                                                                                 <chr> …
## $ working_capital                                                                                                                    <chr> …
## $ sales_minus_cost_of_products_sold_over_sales                                                                                       <chr> …
## $ current_assets_minus_inventory_minus_short_term_liabilities_over_sales_minus_gross_profit_minus_depreciation                       <chr> …
## $ total_costs_overtotal_sales                                                                                                        <chr> …
## $ long_term_liabilities_over_equity                                                                                                  <chr> …
## $ sales_over_inventory                                                                                                               <chr> …
## $ sales_over_receivables                                                                                                             <chr> …
## $ short_term_liabilities_times_365_over_sales                                                                                        <chr> …
## $ sales_over_short_term_liabilities                                                                                                  <chr> …
## $ sales_over_fixed_assets                                                                                                            <chr> …
## $ class                                                                                                                              <int> …

The financial ratio variables are stored as character strings, so they must be converted to numeric. In addition, forecasting period and class are recoded as factors for qualitative analysis.

df <- df %>%
  # rename forecasting period to a syntactically friendly name
  rename(forecasting_period = `forecasting.period`) %>%
  mutate(
    net_profit_over_total_assets = as.numeric(net_profit_over_total_assets),
    total_liabilities_over_total_assets = as.numeric(total_liabilities_over_total_assets),
    forecasting_period = factor(forecasting_period),
    class = factor(class,
                   levels = c(0, 1),
                   labels = c("Solvent", "Bankrupt"))
  )

# Focused dataset with four variables of interest
df_focus <- df %>%
  select(net_profit_over_total_assets,
         total_liabilities_over_total_assets,
         forecasting_period,
         class)

summary(df_focus)
##  net_profit_over_total_assets total_liabilities_over_total_assets
##  Min.   :-2.569e+02           Min.   :-72.1620                   
##  1st Qu.: 3.354e-03           1st Qu.:  0.2710                   
##  Median : 4.986e-02           Median :  0.4770                   
##  Mean   : 4.688e-02           Mean   :  0.6119                   
##  3rd Qu.: 1.314e-01           3rd Qu.:  0.6936                   
##  Max.   : 5.265e+01           Max.   :480.9600                   
##  NA's   :3                    NA's   :3                          
##  forecasting_period      class      
##  1:4712             Solvent :23995  
##  2:6900             Bankrupt: 1126  
##  3:7018                             
##  4:6491                             
##                                     
##                                     
## 

Missing Data

# Percent missing for each selected variable
missing_pct <- sapply(df_focus, function(x) mean(is.na(x))) * 100
missing_table <- data.frame(
  Variable = names(missing_pct),
  PercentMissing = round(missing_pct, 4)
)

missing_table %>%
  kable(caption = "Percentage of Missing Values for Selected Variables") %>%
  kable_styling(full_width = FALSE)
Percentage of Missing Values for Selected Variables
Variable PercentMissing
net_profit_over_total_assets net_profit_over_total_assets 0.0119
total_liabilities_over_total_assets total_liabilities_over_total_assets 0.0119
forecasting_period forecasting_period 0.0000
class class 0.0000

In this dataset, the four selected variables have essentially no missing data (on the order of 0.01% or less for the quantitative variables and none for the qualitative variables). Because missingness is negligible, I proceed with listwise deletion (dropping any rows with missing values in these variables), which is unlikely to bias the results.

df_complete <- df_focus %>%
  drop_na(net_profit_over_total_assets,
          total_liabilities_over_total_assets,
          forecasting_period,
          class)

nrow(df_focus); nrow(df_complete)
## [1] 25121
## [1] 25118

The number of observations remains virtually unchanged after listwise deletion, confirming that missing data are not a substantive issue for this assignment.

Step 2: Univariate Descriptive Statistics for Quantitative Variables

quant_vars <- c("net_profit_over_total_assets",
                "total_liabilities_over_total_assets")

# Helper function to compute required statistics
univar_stats <- function(x) {
  x_clean <- na.omit(x)
  data.frame(
    N = length(x_clean),
    Mean = mean(x_clean),
    Median = median(x_clean),
    SD = sd(x_clean),
    Variance = var(x_clean),
    Range = diff(range(x_clean)),
    IQR = IQR(x_clean),
    Skewness = skewness(x_clean),
    Kurtosis = kurtosis(x_clean)  # excess kurtosis; 0 = normal
  )
}

stats_list <- lapply(df_complete[quant_vars], univar_stats)
stats_df <- bind_rows(stats_list, .id = "Variable")

stats_df %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(caption = "Univariate Descriptive Statistics for Quantitative Variables") %>%
  kable_styling(full_width = FALSE)
Univariate Descriptive Statistics for Quantitative Variables
Variable N Mean Median SD Variance Range IQR Skewness Kurtosis
net_profit_over_total_assets 25118 0.0469 0.0499 1.8217 3.3186 309.542 0.1281 -115.7549 15971.870
total_liabilities_over_total_assets 25118 0.6119 0.4770 5.9334 35.2058 553.122 0.4226 75.7450 5913.556

Visualizations for Quantitative Variables

for (v in quant_vars) {
  p_hist <- ggplot(df_complete, aes_string(x = v)) +
    geom_histogram(aes(y = ..density..), bins = 30, alpha = 0.7) +
    geom_density(color = "black", linewidth = 0.7) +
    labs(title = paste("Histogram and Density:", v),
         x = v, y = "Density")

  p_box <- ggplot(df_complete, aes_string(y = v)) +
    geom_boxplot(alpha = 0.7) +
    labs(title = paste("Boxplot:", v),
         y = v)

  print(p_hist)
  print(p_box)
}

Interpretation of Quantitative Univariate Statistics

For net_profit_over_total_assets, the mean is close to zero, and the median is slightly higher. The standard deviation is sizable, but the range is extremely large, and the interquartile range (IQR) is relatively small. This pattern indicates that the middle 50% of firms have profit ratios clustered near zero, while a small number of firms have extremely large positive or negative values, producing extreme outliers. The skewness is highly negative, and kurtosis is extremely large, confirming a distribution with a very long left tail and heavy tails overall. This variable is far from normal and heavily influenced by a few extreme cases.

For total_liabilities_over_total_assets, the mean is higher, and the median is lower than the mean, with a moderate IQR. Again, the range is very large, skewness is highly positive, and kurtosis is very large. These indices indicate a strong right-skewed distribution: most firms have leverage ratios in a plausible range, but a small number of firms have extremely high or negative leverage ratios, again creating extreme outliers. Visual inspection of the histograms and boxplots supports this interpretation, showing a dense central region with a few values far from the bulk of the data.

Given the magnitude of skewness and kurtosis for both variables, the assumption of normality is clearly violated. In a predictive modeling context, this would motivate transformations or robust methods, but for this EDA assignment the untransformed distributions highlight the presence of extreme financial conditions.

Step 3: Univariate Descriptive Statistics for Qualitative Variables

# Frequency tables for qualitative variables
df_complete %>%
  tabyl(forecasting_period) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  kable(caption = "Frequency and Percentage of Forecasting Period") %>%
  kable_styling(full_width = FALSE)
Frequency and Percentage of Forecasting Period
forecasting_period n percent
1 4711 18.8%
2 6899 27.5%
3 7018 27.9%
4 6490 25.8%
Total 25118 100.0%
df_complete %>%
  tabyl(class) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  kable(caption = "Frequency and Percentage of Class (Solvent vs. Bankrupt)") %>%
  kable_styling(full_width = FALSE)
Frequency and Percentage of Class (Solvent vs. Bankrupt)
class n percent
Solvent 23992 95.5%
Bankrupt 1126 4.5%
Total 25118 100.0%

Visualizations for Qualitative Variables

# Bar plot for forecasting period
ggplot(df_complete, aes(x = forecasting_period)) +
  geom_bar() +
  labs(title = "Counts by Forecasting Period",
       x = "Forecasting Period (Years)",
       y = "Count")

# Bar plot for class
ggplot(df_complete, aes(x = class)) +
  geom_bar() +
  labs(title = "Counts by Class",
       x = "Class",
       y = "Count")

Interpretation of Qualitative Univariate Statistics

The forecasting_period variable takes integer values from 1 to 4 years. The frequencies are fairly balanced across periods 2, 3, and 4, with somewhat fewer observations at a 1-year horizon. Overall, the dataset covers a range of short- to medium-term forecasting horizons, which is useful for studying how predictive performance may vary with the length of the forecast.

The class variable is highly imbalanced: the majority of firms are labeled Solvent, while only a small proportion are labeled Bankrupt. This imbalance is typical in financial distress datasets and has implications for model building (e.g., the need to consider class imbalance methods), but here it simply indicates that bankruptcy is relatively rare compared to non-bankruptcy in the observed data.

Step 4: Bivariate Statistics

Correlation Between Quantitative Variables

cor_mat <- cor(df_complete[, quant_vars], use = "complete.obs")
cor_mat %>%
  round(4) %>%
  kable(caption = "Correlation Matrix for Quantitative Variables") %>%
  kable_styling(full_width = FALSE)
Correlation Matrix for Quantitative Variables
net_profit_over_total_assets total_liabilities_over_total_assets
net_profit_over_total_assets 1.0000 -0.0736
total_liabilities_over_total_assets -0.0736 1.0000
# Correlation test between the two variables
cor_test <- cor.test(df_complete$net_profit_over_total_assets,
                     df_complete$total_liabilities_over_total_assets,
                     use = "complete.obs",
                     method = "pearson")
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  df_complete$net_profit_over_total_assets and df_complete$total_liabilities_over_total_assets
## t = -11.7, df = 25116, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08591227 -0.06131261
## sample estimates:
##         cor 
## -0.07362364

Interpretation of Correlation

The Pearson correlation between net_profit_over_total_assets and total_liabilities_over_total_assets is small and negative but statistically significant due to the large sample size. This indicates that firms with higher leverage tend to have slightly lower profitability, on average, but the effect size is weak in practical terms (|r| < 0.10). The significance test reflects the large sample size, not a strong association.

From an EDA standpoint, this suggests that while leverage and profitability are related in the expected direction (higher liabilities associated with lower profits), profitability is influenced by many other factors not captured by this single ratio.

Crosstabs for Qualitative Variables

ct <- table(df_complete$forecasting_period, df_complete$class)
ct %>%
  kable(caption = "Crosstab: Forecasting Period by Class") %>%
  kable_styling(full_width = FALSE)
Crosstab: Forecasting Period by Class
Solvent Bankrupt
4535 176
6635 264
6683 335
6139 351
prop.table(ct, margin = 1) %>%
  round(3) %>%
  kable(caption = "Row Proportions: Bankruptcy Rates by Forecasting Period") %>%
  kable_styling(full_width = FALSE)
Row Proportions: Bankruptcy Rates by Forecasting Period
Solvent Bankrupt
0.963 0.037
0.962 0.038
0.952 0.048
0.946 0.054
chi <- chisq.test(ct)
chi
## 
##  Pearson's Chi-squared test
## 
## data:  ct
## X-squared = 27.442, df = 3, p-value = 4.757e-06
chi$expected %>%
  round(2) %>%
  kable(caption = "Expected Counts under Independence") %>%
  kable_styling(full_width = FALSE)
Expected Counts under Independence
Solvent Bankrupt
4499.81 211.19
6589.73 309.27
6703.39 314.61
6199.06 290.94

Mosaic/Heatmap-Style Visualization for Crosstab

ct_df <- as.data.frame(ct)
colnames(ct_df) <- c("forecasting_period", "class", "count")

ggplot(ct_df, aes(x = forecasting_period, y = class, fill = count)) +
  geom_tile() +
  geom_text(aes(label = count), color = "white") +
  labs(title = "Heatmap of Forecasting Period by Class",
       x = "Forecasting Period",
       y = "Class",
       fill = "Count")

Interpretation of Crosstabs and Chi-Square Test

The crosstab shows the joint distribution of forecasting_period and class. When we examine row proportions (bankruptcy rates within each forecasting period), the proportion of Bankrupt firms increases gradually with longer forecasting horizons. For example, the bankruptcy rate is lower at 1–2 years and rises at 3–4 years.

The chi-square test of independence indicates that forecasting_period and class are not independent. In other words, the probability of bankruptcy depends on the forecast horizon: longer forecast horizons are associated with slightly higher probabilities of firms being classified as bankrupt. Although the effect is modest in magnitude, it is statistically significant and consistent with the intuition that longer horizons allow more time for financial distress to materialize. The heatmap visually reinforces this pattern, with higher counts of bankrupt firms appearing in longer forecasting horizons.

Step 5: Interpretation of Descriptive and Bivariate Statistics

Bringing together the univariate and bivariate results:

Overall, the descriptive and bivariate statistics reveal a dataset characterized by extreme outliers, non-normal distributions, an imbalanced outcome variable, and modest but meaningful associations between key ratios and the class label.

Step 6: Importance of Statistics and Visualizations in EDA

EDA is inherently iterative and exploratory, combining numerical summaries with visual displays to reveal structure in the data (Tukey, 1977). In this analysis, histograms, density plots, and boxplots were essential for diagnosing extreme skewness and outliers that simple means and standard deviations would obscure. Statistical measures such as skewness, kurtosis, correlations, and chi-square tests complement these visuals by providing quantitative evidence about distributional shape and associations.

For example, histograms quickly revealed that financial ratios were concentrated near a central mass with a few extreme points, while skewness and kurtosis quantified this non-normality. Similarly, bar plots and crosstabs illustrate how bankruptcy rates vary across forecasting periods, and the heatmap visually summarizes their joint distribution. The chi-square test formally assesses whether those differences are likely due to chance. Using both statistics and visualizations provides a richer and more reliable understanding of the data than either approach alone.

Step 7: Summary of EDA Findings and Justification of Choices

In this assignment, I selected two quantitatively important financial ratios—net profit over total assets (profitability) and total liabilities over total assets (leverage)—along with two categorical variables, forecasting period and class. These choices were guided by the bankruptcy prediction literature, which emphasizes profitability and leverage as key determinants of financial distress, and by the need to analyze the outcome variable (class) across different forecast horizons.

The EDA steps included:

  1. Data preparation and missing-data assessment, confirming that missingness was negligible and justifying listwise deletion.
  2. Univariate descriptive statistics for quantitative variables (mean, median, SD, variance, range, IQR, skewness, kurtosis), revealing extreme non-normality and highlighting the importance of robust measures such as the median and IQR.
  3. Univariate descriptive statistics for qualitative variables, documenting class imbalance and the distribution of forecasting periods.
  4. Bivariate statistics:
    • Pearson correlation between profitability and leverage, showing a weak but significant negative association.
    • Crosstabs, heatmap, and a chi-square test between forecasting period and class, revealing a significant increase in bankruptcy rates at longer horizons.

These analyses were chosen to satisfy the assignment requirements and to reflect good EDA practice: starting with basic univariate summaries, then moving to bivariate relationships that are theoretically meaningful. The results provide an empirical foundation for later modeling by clarifying the distributional properties of the predictors, the extent of class imbalance, and the nature of associations among key variables.

References

Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of Accounting Research, 4, 71–111.

Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Wiley.

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.