Link: https://rpubs.com/staszkiewicz/RID_Ins_Step_2
df2 <- readRDS("InterimData/DB_final.rds")
sum(duplicated(df2))
## [1] 0
We do not have the duplicated data.
sum(duplicated(df2$BvdIdNumber))
## [1] 0
check wheter this equait to the number o row
nrow(df2)
## [1] 35566
length(unique(df2$BvdIdNumber))
## [1] 35566
No problems as well here
sort(table(df2$Country), decreasing = TRUE)
##
## Poland Hungary Lithuania Czech Republic Slovakia
## 15641 6666 4190 2665 2313
## Latvia Slovenia Estonia Cyprus Malta
## 1651 1401 874 120 44
## COUNTRY
## 1
#Search for terms like USD, GBP, PLN, currency, local currency, etc.
names(df2)[grepl("Usd|USD|Gbp|GBP|Pln|PLN|Currency|Local",
names(df2))]
## character(0)
# Find all non-EUR financial variables
names(df2)[grepl("Assets|Sales|Revenue|Profit|Debt|Cash",
names(df2)) &
!grepl("Eur", names(df2))]
## character(0)
No problem with US dollars any more
Simple scaling check
quantile(df2$TotalAssetsEur2025,
probs = c(.01,.05,.5,.95,.99),
na.rm = TRUE)
## 1% 5% 50% 95% 99%
## 2.957131e+05 8.668501e+05 1.043993e+07 4.783625e+08 1.621946e+10
more or less ok
Simple total assets check
year <- 2025
diff <- df2[[paste0("TotalAssetsEur", year)]] -
df2[[paste0("TotalShareholdersFundsLiabilitiesEur", year)]]
problems <- df2[abs(diff) > 1e-6,
c("CompanyNameLatinAlphabet",
"BvdIdNumber")]
head(problems)
## CompanyNameLatinAlphabet BvdIdNumber
## NA <NA> <NA>
## NA.1 <NA> <NA>
## NA.2 <NA> <NA>
## NA.3 <NA> <NA>
## NA.4 <NA> <NA>
## NA.5 <NA> <NA>
years <- 2015:2025
checks <- lapply(years, function(y) {
diff <- df2[[paste0("TotalAssetsEur", y)]] -
df2[[paste0("TotalShareholdersFundsLiabilitiesEur", y)]]
data.frame(
Year = y,
Mean_Difference = mean(diff, na.rm = TRUE),
Median_Difference = median(diff, na.rm = TRUE),
Min_Difference = min(diff, na.rm = TRUE),
Max_Difference = max(diff, na.rm = TRUE),
N_NonZero = sum(abs(diff) > 1e-6, na.rm = TRUE)
)
})
checks_table <- do.call(rbind, checks)
checks_table
## Year Mean_Difference Median_Difference Min_Difference Max_Difference
## 1 2015 -9.532658 0 -2.771154e+05 54389.71
## 2 2016 2220.604293 0 -1.550000e+07 79571424.00
## 3 2017 135.947529 0 -4.924544e+06 3873340.98
## 4 2018 231.350501 0 -9.291826e+05 1538432.62
## 5 2019 424.637843 0 -1.640752e+07 7726536.58
## 6 2020 990.278322 0 -5.000106e+06 17930013.00
## 7 2021 -24522.427109 0 -8.043695e+08 10165849.00
## 8 2022 8.307208 0 -7.620984e+06 11039034.55
## 9 2023 -36905.368342 0 -1.229405e+09 4687711.93
## 10 2024 100.888994 0 -7.533912e+04 738409.14
## 11 2025 3130.346279 0 -2.345356e+02 10004819.24
## N_NonZero
## 1 37
## 2 82
## 3 82
## 4 92
## 5 85
## 6 88
## 7 84
## 8 84
## 9 68
## 10 69
## 11 5
Conlusion: resonblea ala bmedia zero no significatn probles accross the years, fluctuation possible due to rounding up or non consisten treatement of negative equity
years <- 2015:2025
negative_equity <- data.frame(
Year = years,
Negative_Equity_Count = sapply(years, function(y) {
sum(df2[[paste0("ShareholdersFundsEur", y)]] < 0,
na.rm = TRUE)
}),
Percentage = sapply(years, function(y) {
round(
mean(df2[[paste0("ShareholdersFundsEur", y)]] < 0,
na.rm = TRUE) * 100,
2
)
})
)
negative_equity
## Year Negative_Equity_Count Percentage
## 1 2015 980 4.00
## 2 2016 1339 4.57
## 3 2017 1418 4.65
## 4 2018 1497 4.77
## 5 2019 1565 4.80
## 6 2020 1688 5.12
## 7 2021 1606 4.79
## 8 2022 1625 4.79
## 9 2023 1534 4.58
## 10 2024 1503 4.63
## 11 2025 68 2.08
Probably we will have to trim 2025 year, becasue of the data issue (chcek after the panel has been finalized) or exclude the 2024 from the main modeling, add it up as the robustness checks
mean(is.na(df2)) * 100
## [1] 46.06101
Quite sparsam data (small models on acutal and perhaps consider the imputation for non-runing variables)
missing data by variables decresing to check wherea are the most likely gaps:
head(sort(colMeans(is.na(df2)) * 100, decreasing = TRUE), 20)
## Latitude Longitude
## 99.99719 99.99719
## WomanOwnedIndicatorInUs EthnicMinorityOwnedIndicatorInUs
## 99.99719 99.99719
## Status_1983 Status_1961
## 99.99719 99.99719
## Status_1981 Status_1985
## 99.99719 99.99719
## Status_1972 Status_1955
## 99.99719 99.99719
## Status_1946 Status_STATUS_DATE
## 99.99719 99.99719
## Status_1963 Status_1988
## 99.99719 99.99438
## Status_1986 Status_1947
## 99.99438 99.99438
## Status_1977 Status_1984
## 99.99156 99.98875
## Status_1980 Status_1982
## 99.98875 99.98594
tail(sort(colMeans(is.na(df2)) * 100, decreasing = TRUE), 20)
## FiscalYear2023 FiscalYear2021 FiscalYear2022
## 2.890400945 2.704830456 0.427374459
## NaceRev2MainSection NationalIdLabel NationalId
## 0.064668504 0.028116741 0.005623348
## CoyIndex NumberOfAvailableYears Source.Name
## 0.002811674 0.002811674 0.000000000
## CompanyNameLatinAlphabet Country CountryIsoCode
## 0.000000000 0.000000000 0.000000000
## BvdIdNumber BvdAccountNumber OrbisIdNumber
## 0.000000000 0.000000000 0.000000000
## SizeClassification Status TypeOfEntity
## 0.000000000 0.000000000 0.000000000
## ConsolidationCode FilingType
## 0.000000000 0.000000000
Ups, only the main data form balance sheet and profit and loss, the detailed format rather missing, audit data sparsame, should be crossed with Audit Analytics, estimation of minimum valuable data 35k x10 years = 350k x 0.1
3-35k minimum model, for regression and classification ok, for the AI and data driven model 35k at the border of the power, the power on the country level 3000/10 = 300 year-firm observation - likely sufficient for segments depends on distribtuion (small counties in other group likely)
For Slovak methods, if imputation was done, the same imputation, has to be done here on those data to perserve the comparablity of the models
library(stringr)
years_in_data <- str_extract(names(df2), "\\d{4}")
years_in_data <- as.numeric(years_in_data)
sort(unique(na.omit(years_in_data)))
## [1] 1946 1947 1955 1961 1963 1972 1974 1975 1976 1977 1978 1979 1980 1981 1982
## [16] 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## [31] 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## [46] 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026
We will need to tream status year outside the 2015-2025 year ragne
missing_years <- setdiff(2015:2025, unique(na.omit(years_in_data)))
missing_years
## integer(0)
All years are presents, not gaps
Initial checks for data passed.
No saving of the data from this stage is necessary the DB_finals.rds data ok, for further research steps
Next steps Widing the data with Polish insolvency data, and audit analytics’s data only of the robustness purposes, major calculation could be done on those data “DB_final.rds”
Removing all object from the 2 Stage
# rm(ls=list())