This report presents an exploratory data analysis of a large financial ratio dataset (25,121 observations, 67 variables) used for bankruptcy prediction. I focus on two key quantitative variables—net_profit_over_total_assets (profitability) and total_liabilities_over_total_assets (leverage)—and two qualitative variables—forecasting_period and class (Solvent vs. Bankrupt). These variables were selected based on their theoretical importance in the financial distress literature and to satisfy the requirement of analyzing both quantitative and qualitative data.
The univariate analysis shows that both profitability and leverage exhibit extreme skewness and heavy tails, with many firms clustered near typical values but a small number of firms in extreme financial conditions. The class variable is highly imbalanced, with relatively few bankrupt firms, while forecasting periods are reasonably well distributed. Bivariate analyses reveal a weak but statistically significant negative correlation between profitability and leverage, and a statistically significant association between forecasting horizon and bankruptcy status: longer horizons show slightly higher bankruptcy rates. These patterns justify the use of robust descriptive statistics, careful visualization, and future modeling strategies that account for class imbalance and non-normality.
The purpose of this assignment is to continue the Exploratory Data
Analysis (EDA) process using the assigned dataset,
train 14(in).csv. The dataset contains 25,121 observations
and 67 variables describing financial ratios and other characteristics
for firms, along with a binary target variable, class,
which indicates whether the firm is in financial distress (1) or not
(0).
For this analysis, I focus on four variables:
net_profit_over_total_assets (profitability ratio)total_liabilities_over_total_assets (leverage
ratio)forecasting period (the forecast horizon in years)class (0 = solvent, 1 = bankrupt/financially
distressed)These variables were selected because they are conceptually important in the bankruptcy prediction literature. Profitability and leverage are classic predictors of financial distress, while forecasting period and class allow examination of how predicted failure status varies over different forecast horizons.
# Read the dataset (assumes the CSV is in the working directory)
df <- read.csv("train 14(in).csv", stringsAsFactors = FALSE)
# Basic structure
glimpse(df)
## Rows: 25,121
## Columns: 67
## $ id <int> …
## $ forecasting.period <int> …
## $ net_profit_over_total_assets <chr> …
## $ total_liabilities_over_total_assets <chr> …
## $ working_capital_over_total_assets <chr> …
## $ current_assets_over_short_term_liabilities <chr> …
## $ cash_plus_short_term_securities_plus_receivables_minus_short_term_liabilities_over_operating_expenses_minus_depreciation_times_365 <chr> …
## $ retained_earnings_over_total_assets <chr> …
## $ EBIT_over_total_assets <chr> …
## $ book_value_of_equity_over_total_liabilities <chr> …
## $ sales_over_total_assets <chr> …
## $ equity_over_total_assets <chr> …
## $ gross_profit_plus_extraordinary_items_plus_financial_expenses_over_total_assets <chr> …
## $ gross_profit_over_short_term_liabilities <chr> …
## $ gross_profit_plus_depreciation_over_sales <chr> …
## $ gross_profit_plus_interest_over_total_assets <chr> …
## $ total_liabilities_times_365_over_gross_profit_plus_depreciation <chr> …
## $ gross_profit_plus_depreciation_over_total_liabilities <chr> …
## $ total_assets_over_total_liabilities <chr> …
## $ gross_profit_over_total_assets <chr> …
## $ gross_profit_over_sales <chr> …
## $ inventory_times_365_over_sales <chr> …
## $ sales_n_over_sales_n_minus_1 <chr> …
## $ profit_on_operating_activities_over_total_assets <chr> …
## $ net_profit_over_sales <chr> …
## $ gross_profit_in_3_years_over_total_assets <chr> …
## $ equity_minus_share_capital_over_total_assets <chr> …
## $ net_profit_plus_depreciation_over_total_liabilities <chr> …
## $ profit_on_operating_activities_over_financial_expenses <chr> …
## $ working_capital_over_fixed_assets <chr> …
## $ logarithm_of_total_assets <chr> …
## $ total_liabilities_minus_cash_over_sales <chr> …
## $ gross_profit_plus_interest_over_sales <chr> …
## $ current_liabilities_times_365_over_cost_of_products_sold <chr> …
## $ operating_expenses_over_short_term_liabilities <chr> …
## $ operating_expenses_over_total_liabilities <chr> …
## $ profit_on_sales_over_total_assets <chr> …
## $ total_sales_over_total_assets <chr> …
## $ current_assets_minus_inventories_over_long_term_liabilities <chr> …
## $ constant_capital_over_total_assets <chr> …
## $ profit_on_sales_over_sales <chr> …
## $ current_assets_minus_inventory_minus_receivables_over_short_term_liabilities <chr> …
## $ total_liabilities_over_profit_on_operating_activities_plus_depreciation_times_12_over_365 <chr> …
## $ profit_on_operating_activities_over_sales <chr> …
## $ rotation_receivables_plus_inventory_turnover_in_days <chr> …
## $ receivables_times_365_over_sales <chr> …
## $ net_profit_over_inventory <chr> …
## $ current_assets_minus_inventory_over_short_term_liabilities <chr> …
## $ inventory_times_365_over_cost_of_products_sold <chr> …
## $ EBITDA_profit_on_operating_activities_minus_depreciation_over_total_assets <chr> …
## $ EBITDA_profit_on_operating_activities_minus_depreciation_over_sales <chr> …
## $ current_assets_over_total_liabilities <chr> …
## $ short_term_liabilities_over_total_assets <chr> …
## $ short_term_liabilities_times_365_over_cost_of_products_sold <chr> …
## $ equity_over_fixed_assets <chr> …
## $ constant_capital_over_fixed_assets <chr> …
## $ working_capital <chr> …
## $ sales_minus_cost_of_products_sold_over_sales <chr> …
## $ current_assets_minus_inventory_minus_short_term_liabilities_over_sales_minus_gross_profit_minus_depreciation <chr> …
## $ total_costs_overtotal_sales <chr> …
## $ long_term_liabilities_over_equity <chr> …
## $ sales_over_inventory <chr> …
## $ sales_over_receivables <chr> …
## $ short_term_liabilities_times_365_over_sales <chr> …
## $ sales_over_short_term_liabilities <chr> …
## $ sales_over_fixed_assets <chr> …
## $ class <int> …
The financial ratio variables are stored as character strings, so
they must be converted to numeric. In addition,
forecasting period and class are recoded as
factors for qualitative analysis.
df <- df %>%
# rename forecasting period to a syntactically friendly name
rename(forecasting_period = `forecasting.period`) %>%
mutate(
net_profit_over_total_assets = as.numeric(net_profit_over_total_assets),
total_liabilities_over_total_assets = as.numeric(total_liabilities_over_total_assets),
forecasting_period = factor(forecasting_period),
class = factor(class,
levels = c(0, 1),
labels = c("Solvent", "Bankrupt"))
)
# Focused dataset with four variables of interest
df_focus <- df %>%
select(net_profit_over_total_assets,
total_liabilities_over_total_assets,
forecasting_period,
class)
summary(df_focus)
## net_profit_over_total_assets total_liabilities_over_total_assets
## Min. :-2.569e+02 Min. :-72.1620
## 1st Qu.: 3.354e-03 1st Qu.: 0.2710
## Median : 4.986e-02 Median : 0.4770
## Mean : 4.688e-02 Mean : 0.6119
## 3rd Qu.: 1.314e-01 3rd Qu.: 0.6936
## Max. : 5.265e+01 Max. :480.9600
## NA's :3 NA's :3
## forecasting_period class
## 1:4712 Solvent :23995
## 2:6900 Bankrupt: 1126
## 3:7018
## 4:6491
##
##
##
# Percent missing for each selected variable
missing_pct <- sapply(df_focus, function(x) mean(is.na(x))) * 100
missing_table <- data.frame(
Variable = names(missing_pct),
PercentMissing = round(missing_pct, 4)
)
missing_table %>%
kable(caption = "Percentage of Missing Values for Selected Variables") %>%
kable_styling(full_width = FALSE)
| Variable | PercentMissing | |
|---|---|---|
| net_profit_over_total_assets | net_profit_over_total_assets | 0.0119 |
| total_liabilities_over_total_assets | total_liabilities_over_total_assets | 0.0119 |
| forecasting_period | forecasting_period | 0.0000 |
| class | class | 0.0000 |
In this dataset, the four selected variables have essentially no missing data (on the order of 0.01% or less for the quantitative variables and none for the qualitative variables). Because missingness is negligible, I proceed with listwise deletion (dropping any rows with missing values in these variables), which is unlikely to bias the results.
df_complete <- df_focus %>%
drop_na(net_profit_over_total_assets,
total_liabilities_over_total_assets,
forecasting_period,
class)
nrow(df_focus); nrow(df_complete)
## [1] 25121
## [1] 25118
The number of observations remains virtually unchanged after listwise deletion, confirming that missing data are not a substantive issue for this assignment.
quant_vars <- c("net_profit_over_total_assets",
"total_liabilities_over_total_assets")
# Helper function to compute required statistics
univar_stats <- function(x) {
x_clean <- na.omit(x)
data.frame(
N = length(x_clean),
Mean = mean(x_clean),
Median = median(x_clean),
SD = sd(x_clean),
Variance = var(x_clean),
Range = diff(range(x_clean)),
IQR = IQR(x_clean),
Skewness = skewness(x_clean),
Kurtosis = kurtosis(x_clean) # excess kurtosis; 0 = normal
)
}
stats_list <- lapply(df_complete[quant_vars], univar_stats)
stats_df <- bind_rows(stats_list, .id = "Variable")
stats_df %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(caption = "Univariate Descriptive Statistics for Quantitative Variables") %>%
kable_styling(full_width = FALSE)
| Variable | N | Mean | Median | SD | Variance | Range | IQR | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| net_profit_over_total_assets | 25118 | 0.0469 | 0.0499 | 1.8217 | 3.3186 | 309.542 | 0.1281 | -115.7549 | 15971.870 |
| total_liabilities_over_total_assets | 25118 | 0.6119 | 0.4770 | 5.9334 | 35.2058 | 553.122 | 0.4226 | 75.7450 | 5913.556 |
for (v in quant_vars) {
p_hist <- ggplot(df_complete, aes_string(x = v)) +
geom_histogram(aes(y = ..density..), bins = 30, alpha = 0.7) +
geom_density(color = "black", linewidth = 0.7) +
labs(title = paste("Histogram and Density:", v),
x = v, y = "Density")
p_box <- ggplot(df_complete, aes_string(y = v)) +
geom_boxplot(alpha = 0.7) +
labs(title = paste("Boxplot:", v),
y = v)
print(p_hist)
print(p_box)
}
For net_profit_over_total_assets, the mean is close to zero, and the median is slightly higher. The standard deviation is sizable, but the range is extremely large, and the interquartile range (IQR) is relatively small. This pattern indicates that the middle 50% of firms have profit ratios clustered near zero, while a small number of firms have extremely large positive or negative values, producing extreme outliers. The skewness is highly negative, and kurtosis is extremely large, confirming a distribution with a very long left tail and heavy tails overall. This variable is far from normal and heavily influenced by a few extreme cases.
For total_liabilities_over_total_assets, the mean is higher, and the median is lower than the mean, with a moderate IQR. Again, the range is very large, skewness is highly positive, and kurtosis is very large. These indices indicate a strong right-skewed distribution: most firms have leverage ratios in a plausible range, but a small number of firms have extremely high or negative leverage ratios, again creating extreme outliers. Visual inspection of the histograms and boxplots supports this interpretation, showing a dense central region with a few values far from the bulk of the data.
Given the magnitude of skewness and kurtosis for both variables, the assumption of normality is clearly violated. In a predictive modeling context, this would motivate transformations or robust methods, but for this EDA assignment the untransformed distributions highlight the presence of extreme financial conditions.
# Frequency tables for qualitative variables
df_complete %>%
tabyl(forecasting_period) %>%
adorn_totals("row") %>%
adorn_pct_formatting(digits = 1) %>%
kable(caption = "Frequency and Percentage of Forecasting Period") %>%
kable_styling(full_width = FALSE)
| forecasting_period | n | percent |
|---|---|---|
| 1 | 4711 | 18.8% |
| 2 | 6899 | 27.5% |
| 3 | 7018 | 27.9% |
| 4 | 6490 | 25.8% |
| Total | 25118 | 100.0% |
df_complete %>%
tabyl(class) %>%
adorn_totals("row") %>%
adorn_pct_formatting(digits = 1) %>%
kable(caption = "Frequency and Percentage of Class (Solvent vs. Bankrupt)") %>%
kable_styling(full_width = FALSE)
| class | n | percent |
|---|---|---|
| Solvent | 23992 | 95.5% |
| Bankrupt | 1126 | 4.5% |
| Total | 25118 | 100.0% |
# Bar plot for forecasting period
ggplot(df_complete, aes(x = forecasting_period)) +
geom_bar() +
labs(title = "Counts by Forecasting Period",
x = "Forecasting Period (Years)",
y = "Count")
# Bar plot for class
ggplot(df_complete, aes(x = class)) +
geom_bar() +
labs(title = "Counts by Class",
x = "Class",
y = "Count")
The forecasting_period variable takes integer values from 1 to 4 years. The frequencies are fairly balanced across periods 2, 3, and 4, with somewhat fewer observations at a 1-year horizon. Overall, the dataset covers a range of short- to medium-term forecasting horizons, which is useful for studying how predictive performance may vary with the length of the forecast.
The class variable is highly imbalanced: the majority of firms are labeled Solvent, while only a small proportion are labeled Bankrupt. This imbalance is typical in financial distress datasets and has implications for model building (e.g., the need to consider class imbalance methods), but here it simply indicates that bankruptcy is relatively rare compared to non-bankruptcy in the observed data.
cor_mat <- cor(df_complete[, quant_vars], use = "complete.obs")
cor_mat %>%
round(4) %>%
kable(caption = "Correlation Matrix for Quantitative Variables") %>%
kable_styling(full_width = FALSE)
| net_profit_over_total_assets | total_liabilities_over_total_assets | |
|---|---|---|
| net_profit_over_total_assets | 1.0000 | -0.0736 |
| total_liabilities_over_total_assets | -0.0736 | 1.0000 |
# Correlation test between the two variables
cor_test <- cor.test(df_complete$net_profit_over_total_assets,
df_complete$total_liabilities_over_total_assets,
use = "complete.obs",
method = "pearson")
cor_test
##
## Pearson's product-moment correlation
##
## data: df_complete$net_profit_over_total_assets and df_complete$total_liabilities_over_total_assets
## t = -11.7, df = 25116, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08591227 -0.06131261
## sample estimates:
## cor
## -0.07362364
The Pearson correlation between net_profit_over_total_assets and total_liabilities_over_total_assets is small and negative but statistically significant due to the large sample size. This indicates that firms with higher leverage tend to have slightly lower profitability, on average, but the effect size is weak in practical terms (|r| < 0.10). The significance test reflects the large sample size, not a strong association.
From an EDA standpoint, this suggests that while leverage and profitability are related in the expected direction (higher liabilities associated with lower profits), profitability is influenced by many other factors not captured by this single ratio.
ct <- table(df_complete$forecasting_period, df_complete$class)
ct %>%
kable(caption = "Crosstab: Forecasting Period by Class") %>%
kable_styling(full_width = FALSE)
| Solvent | Bankrupt |
|---|---|
| 4535 | 176 |
| 6635 | 264 |
| 6683 | 335 |
| 6139 | 351 |
prop.table(ct, margin = 1) %>%
round(3) %>%
kable(caption = "Row Proportions: Bankruptcy Rates by Forecasting Period") %>%
kable_styling(full_width = FALSE)
| Solvent | Bankrupt |
|---|---|
| 0.963 | 0.037 |
| 0.962 | 0.038 |
| 0.952 | 0.048 |
| 0.946 | 0.054 |
chi <- chisq.test(ct)
chi
##
## Pearson's Chi-squared test
##
## data: ct
## X-squared = 27.442, df = 3, p-value = 4.757e-06
chi$expected %>%
round(2) %>%
kable(caption = "Expected Counts under Independence") %>%
kable_styling(full_width = FALSE)
| Solvent | Bankrupt |
|---|---|
| 4499.81 | 211.19 |
| 6589.73 | 309.27 |
| 6703.39 | 314.61 |
| 6199.06 | 290.94 |
ct_df <- as.data.frame(ct)
colnames(ct_df) <- c("forecasting_period", "class", "count")
ggplot(ct_df, aes(x = forecasting_period, y = class, fill = count)) +
geom_tile() +
geom_text(aes(label = count), color = "white") +
labs(title = "Heatmap of Forecasting Period by Class",
x = "Forecasting Period",
y = "Class",
fill = "Count")
The crosstab shows the joint distribution of forecasting_period and class. When we examine row proportions (bankruptcy rates within each forecasting period), the proportion of Bankrupt firms increases gradually with longer forecasting horizons. For example, the bankruptcy rate is lower at 1–2 years and rises at 3–4 years.
The chi-square test of independence indicates that forecasting_period and class are not independent. In other words, the probability of bankruptcy depends on the forecast horizon: longer forecast horizons are associated with slightly higher probabilities of firms being classified as bankrupt. Although the effect is modest in magnitude, it is statistically significant and consistent with the intuition that longer horizons allow more time for financial distress to materialize. The heatmap visually reinforces this pattern, with higher counts of bankrupt firms appearing in longer forecasting horizons.
Bringing together the univariate and bivariate results:
Quantitative variables
Both net_profit_over_total_assets and
total_liabilities_over_total_assets exhibit extreme
non-normality, with heavy tails and strong skewness. The median
and IQR provide more robust descriptions of central tendency and
dispersion than the mean and standard deviation, which are heavily
influenced by a small number of extreme outliers. For EDA, this
highlights the importance of looking beyond simple means and considering
robust measures and visualizations such as boxplots.
Qualitative variables
forecasting_period is roughly balanced across levels, while
class is highly imbalanced, with bankruptcy being rare.
This imbalance reflects the real-world rarity of corporate failure and
suggests that any subsequent modeling would need to explicitly account
for class imbalance.
Correlations
The small negative correlation between profitability and leverage is
statistically significant but practically weak,
indicating only a modest tendency for more heavily leveraged firms to be
less profitable. This aligns with theoretical expectations, yet suggests
that leverage alone is not a strong predictor of profitability.
Crosstabs
The relationship between forecasting horizon and class is statistically
significant: longer horizons are associated with higher bankruptcy
rates. This suggests that the timing of prediction matters, and EDA
reveals a potentially important interaction between time horizon and
failure risk that would warrant further modeling.
Overall, the descriptive and bivariate statistics reveal a dataset characterized by extreme outliers, non-normal distributions, an imbalanced outcome variable, and modest but meaningful associations between key ratios and the class label.
EDA is inherently iterative and exploratory, combining numerical summaries with visual displays to reveal structure in the data (Tukey, 1977). In this analysis, histograms, density plots, and boxplots were essential for diagnosing extreme skewness and outliers that simple means and standard deviations would obscure. Statistical measures such as skewness, kurtosis, correlations, and chi-square tests complement these visuals by providing quantitative evidence about distributional shape and associations.
For example, histograms quickly revealed that financial ratios were concentrated near a central mass with a few extreme points, while skewness and kurtosis quantified this non-normality. Similarly, bar plots and crosstabs illustrate how bankruptcy rates vary across forecasting periods, and the heatmap visually summarizes their joint distribution. The chi-square test formally assesses whether those differences are likely due to chance. Using both statistics and visualizations provides a richer and more reliable understanding of the data than either approach alone.
In this assignment, I selected two quantitatively important financial
ratios—net profit over total assets (profitability) and total
liabilities over total assets (leverage)—along with two categorical
variables, forecasting period and class. These choices were guided by
the bankruptcy prediction literature, which emphasizes profitability and
leverage as key determinants of financial distress, and by the need to
analyze the outcome variable (class) across different
forecast horizons.
The EDA steps included:
These analyses were chosen to satisfy the assignment requirements and to reflect good EDA practice: starting with basic univariate summaries, then moving to bivariate relationships that are theoretically meaningful. The results provide an empirical foundation for later modeling by clarifying the distributional properties of the predictors, the extent of class imbalance, and the nature of associations among key variables.
Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of Accounting Research, 4, 71–111.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Wiley.
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.