Link: https://rpubs.com/staszkiewicz/RID_Ins_Step_2

Initial checks and adding up variables to make the panel working

I start with the data form th file Step_1 and called them the df2. They are still in the wide format, I will make few basic checks to see the structure of the revised data, but first I need to make the index, because curen CoyIndex index only by the individual excel sheets there were used for download from the Orbis

df2 <- readRDS("InterimData/DB_final.rds")

Check the doubles and repettive

sum(duplicated(df2))

## [1] 0

We do not have the duplicated data.

Check whether BvdIdNumber is unique

sum(duplicated(df2$BvdIdNumber))

## [1] 0

check wheter this equait to the number o row

nrow(df2)

## [1] 35566

length(unique(df2$BvdIdNumber))

## [1] 35566

No problems as well here

Coutnries distribution

sort(table(df2$Country), decreasing = TRUE)

## 
##         Poland        Hungary      Lithuania Czech Republic       Slovakia 
##          15641           6666           4190           2665           2313 
##         Latvia       Slovenia        Estonia         Cyprus          Malta 
##           1651           1401            874            120             44 
##        COUNTRY 
##              1

Chceking the currency strage allocation atc.

Suspisious datastructure

#Search for terms like USD, GBP, PLN, currency, local currency, etc.

names(df2)[grepl("Usd|USD|Gbp|GBP|Pln|PLN|Currency|Local",
                 names(df2))]

## character(0)

# Find all non-EUR financial variables
names(df2)[grepl("Assets|Sales|Revenue|Profit|Debt|Cash",
                 names(df2)) &
           !grepl("Eur", names(df2))]

## character(0)

No problem with US dollars any more

Data consistency check

Simple scaling check

quantile(df2$TotalAssetsEur2025,
         probs = c(.01,.05,.5,.95,.99),
         na.rm = TRUE)

##           1%           5%          50%          95%          99% 
## 2.957131e+05 8.668501e+05 1.043993e+07 4.783625e+08 1.621946e+10

more or less ok

Simple total assets check

year <- 2025

diff <- df2[[paste0("TotalAssetsEur", year)]] -
        df2[[paste0("TotalShareholdersFundsLiabilitiesEur", year)]]

problems <- df2[abs(diff) > 1e-6,
                 c("CompanyNameLatinAlphabet",
                   "BvdIdNumber")]

head(problems)

##      CompanyNameLatinAlphabet BvdIdNumber
## NA                       <NA>        <NA>
## NA.1                     <NA>        <NA>
## NA.2                     <NA>        <NA>
## NA.3                     <NA>        <NA>
## NA.4                     <NA>        <NA>
## NA.5                     <NA>        <NA>

full chcek accors all years:

years <- 2015:2025

checks <- lapply(years, function(y) {

  diff <- df2[[paste0("TotalAssetsEur", y)]] -
          df2[[paste0("TotalShareholdersFundsLiabilitiesEur", y)]]

  data.frame(
    Year = y,
    Mean_Difference = mean(diff, na.rm = TRUE),
    Median_Difference = median(diff, na.rm = TRUE),
    Min_Difference = min(diff, na.rm = TRUE),
    Max_Difference = max(diff, na.rm = TRUE),
    N_NonZero = sum(abs(diff) > 1e-6, na.rm = TRUE)
  )
})

checks_table <- do.call(rbind, checks)

checks_table

##    Year Mean_Difference Median_Difference Min_Difference Max_Difference
## 1  2015       -9.532658                 0  -2.771154e+05       54389.71
## 2  2016     2220.604293                 0  -1.550000e+07    79571424.00
## 3  2017      135.947529                 0  -4.924544e+06     3873340.98
## 4  2018      231.350501                 0  -9.291826e+05     1538432.62
## 5  2019      424.637843                 0  -1.640752e+07     7726536.58
## 6  2020      990.278322                 0  -5.000106e+06    17930013.00
## 7  2021   -24522.427109                 0  -8.043695e+08    10165849.00
## 8  2022        8.307208                 0  -7.620984e+06    11039034.55
## 9  2023   -36905.368342                 0  -1.229405e+09     4687711.93
## 10 2024      100.888994                 0  -7.533912e+04      738409.14
## 11 2025     3130.346279                 0  -2.345356e+02    10004819.24
##    N_NonZero
## 1         37
## 2         82
## 3         82
## 4         92
## 5         85
## 6         88
## 7         84
## 8         84
## 9         68
## 10        69
## 11         5

Conlusion: resonblea ala bmedia zero no significatn probles accross the years, fluctuation possible due to rounding up or non consisten treatement of negative equity

Negative equity (streessed companie across all data)

years <- 2015:2025

negative_equity <- data.frame(
  Year = years,
  Negative_Equity_Count = sapply(years, function(y) {
    sum(df2[[paste0("ShareholdersFundsEur", y)]] < 0,
        na.rm = TRUE)
  }),
  Percentage = sapply(years, function(y) {
    round(
      mean(df2[[paste0("ShareholdersFundsEur", y)]] < 0,
           na.rm = TRUE) * 100,
      2
    )
  })
)

negative_equity

##    Year Negative_Equity_Count Percentage
## 1  2015                   980       4.00
## 2  2016                  1339       4.57
## 3  2017                  1418       4.65
## 4  2018                  1497       4.77
## 5  2019                  1565       4.80
## 6  2020                  1688       5.12
## 7  2021                  1606       4.79
## 8  2022                  1625       4.79
## 9  2023                  1534       4.58
## 10 2024                  1503       4.63
## 11 2025                    68       2.08

Probably we will have to trim 2025 year, becasue of the data issue (chcek after the panel has been finalized) or exclude the 2024 from the main modeling, add it up as the robustness checks

Denisty and uabilty of the data

mean(is.na(df2)) * 100

## [1] 46.06101

Quite sparsam data (small models on acutal and perhaps consider the imputation for non-runing variables)

missing data by variables decresing to check wherea are the most likely gaps:

head(sort(colMeans(is.na(df2)) * 100, decreasing = TRUE), 20)

##                         Latitude                        Longitude 
##                         99.99719                         99.99719 
##          WomanOwnedIndicatorInUs EthnicMinorityOwnedIndicatorInUs 
##                         99.99719                         99.99719 
##                      Status_1983                      Status_1961 
##                         99.99719                         99.99719 
##                      Status_1981                      Status_1985 
##                         99.99719                         99.99719 
##                      Status_1972                      Status_1955 
##                         99.99719                         99.99719 
##                      Status_1946               Status_STATUS_DATE 
##                         99.99719                         99.99719 
##                      Status_1963                      Status_1988 
##                         99.99719                         99.99438 
##                      Status_1986                      Status_1947 
##                         99.99438                         99.99438 
##                      Status_1977                      Status_1984 
##                         99.99156                         99.98875 
##                      Status_1980                      Status_1982 
##                         99.98875                         99.98594

tail(sort(colMeans(is.na(df2)) * 100, decreasing = TRUE), 20)

##           FiscalYear2023           FiscalYear2021           FiscalYear2022 
##              2.890400945              2.704830456              0.427374459 
##      NaceRev2MainSection          NationalIdLabel               NationalId 
##              0.064668504              0.028116741              0.005623348 
##                 CoyIndex   NumberOfAvailableYears              Source.Name 
##              0.002811674              0.002811674              0.000000000 
## CompanyNameLatinAlphabet                  Country           CountryIsoCode 
##              0.000000000              0.000000000              0.000000000 
##              BvdIdNumber         BvdAccountNumber            OrbisIdNumber 
##              0.000000000              0.000000000              0.000000000 
##       SizeClassification                   Status             TypeOfEntity 
##              0.000000000              0.000000000              0.000000000 
##        ConsolidationCode               FilingType 
##              0.000000000              0.000000000

Ups, only the main data form balance sheet and profit and loss, the detailed format rather missing, audit data sparsame, should be crossed with Audit Analytics, estimation of minimum valuable data 35k x10 years = 350k x 0.1

3-35k minimum model, for regression and classification ok, for the AI and data driven model 35k at the border of the power, the power on the country level 3000/10 = 300 year-firm observation - likely sufficient for segments depends on distribtuion (small counties in other group likely)

For Slovak methods, if imputation was done, the same imputation, has to be done here on those data to perserve the comparablity of the models

Missing year structure after the reshaping

library(stringr)

years_in_data <- str_extract(names(df2), "\\d{4}")
years_in_data <- as.numeric(years_in_data)

sort(unique(na.omit(years_in_data)))

##  [1] 1946 1947 1955 1961 1963 1972 1974 1975 1976 1977 1978 1979 1980 1981 1982
## [16] 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## [31] 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## [46] 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026

We will need to tream status year outside the 2015-2025 year ragne

missing_years <- setdiff(2015:2025, unique(na.omit(years_in_data)))
missing_years

## integer(0)

All years are presents, not gaps

Conclusion

Initial checks for data passed.

No saving of the data from this stage is necessary the DB_finals.rds data ok, for further research steps

Next steps Widing the data with Polish insolvency data, and audit analytics’s data only of the robustness purposes, major calculation could be done on those data “DB_final.rds”

Cleaning up

Removing all object from the 2 Stage

# rm(ls=list())

Step 2 RID Insovlency data integreti check wide format