Podstawowe operacje w R - część 4.
Czyszczenie danych
Walidacja danych
Definicja
An activity, checking whether a combination of values comes from a predefined set of allowed value combinations.
Przykłady
Czy zyski są trzymane jako liczba (nie text)?
Czy wiek >= 0?
Kiedy wiek < 15, czy status_pracy == “brak pracy”?
Czy średni obrót w firmie jest dodatni?
Reguły
Rozpatrzmy następujący zbiór danych. Co widzisz w początkowych wierszach?
data(retailers)
head(retailers[3:7],3)## staff turnover other.rev total.rev staff.costs
## 1 75 NA NA 1130 NA
## 2 9 1607 NA 1607 131
## 3 NA 6886 -33 6919 324
# dodajmy ID
retailers$rec_id <- sprintf("%03d",1:nrow(retailers))Zdefiniujmy szereg reguł, które muszą spełniać nasze dane.
rules <- validator(
turnover + other.rev == total.rev
, turnover >= 0
, other.rev >= 0
, total.rev >= 0
, if (staff > 0) staff.costs > 0
)Pora zobaczyć, czy nasz zbiór danych spełnia te reguły…
cf <- confront(retailers, rules, key="rec_id")
summary(cf)## name items passes fails nNA error warning
## 1 V1 60 19 4 37 FALSE FALSE
## 2 V2 60 56 0 4 FALSE FALSE
## 3 V3 60 23 1 36 FALSE FALSE
## 4 V4 60 58 0 2 FALSE FALSE
## 5 V5 60 50 0 10 FALSE FALSE
## expression
## 1 abs(turnover + other.rev - total.rev) <= 0.00000001
## 2 turnover - 0 >= -0.00000001
## 3 other.rev - 0 >= -0.00000001
## 4 total.rev - 0 >= -0.00000001
## 5 staff <= 0 | (staff.costs > 0)
barplot(cf, main="retailers")## Warning: The 'barplot' method for confrontation objects is deprecated. Use
## 'plot' instead
#as.data.frame(cf) %>% head()Zarządzanie regułami
Problem
Zestawy reguł często rosną organicznie i prawie nie są przycinane. W rezultacie, mogą zawierać redundancje lub sprzeczności, które sprawiają, że zestaw reguł staje się nieefektywny i trudny do zrozumienia.
Pomysł
Automatycznie wyeliminuj nadmiarowość i sprzeczności (na tyle jak to możliwe)!
rules## Object of class 'validator' with 5 elements:
## V1: turnover + other.rev == total.rev
## V2: turnover >= 0
## V3: other.rev >= 0
## V4: total.rev >= 0
## V5: staff <= 0 | (staff.costs > 0)
rules <- simplify_rules(rules)## No fixed values found.
Modyfikacje danych
Obserwacja Do około 50% zmian w danych można wykonać ręcznie (lub skryptowo) na podstawie bezpośredniego wkładu ekspertów dziedzinowych.
Pytanie Czy możemy wesprzeć wykorzystanie wiedzy z domeny zewnętrznej w procesie czyszczenia danych, jednocześnie oddzielając ją od kodu?
Spójrzmy na możliwości pakietu ‘dcmodify’.
Zdefiniuje on reguły modyfikujące w wierszu poleceń lub oddzielnym pliku tekstowym
Dodaje metadane do reguł modyfikujących
Czyta, sprawdza, manipuluje i stosuje reguły dla danych
m <- modifier(
if (other.rev < 0) other.rev <- -1 * other.rev
)
modified <- modify(retailers, m)
head(modified[3:7], 3)## staff turnover other.rev total.rev staff.costs
## 1 75 NA NA 1130 NA
## 2 9 1607 NA 1607 131
## 3 NA 6886 33 6919 324
Lokalizacja błędów
Pytanie
Wiedząc, że rekord narusza szereg zasad, jakie pola mam zmienić, żebym to naprawić?
Odpowiedź
Znajdź najmniejszą (ważoną) liczbę pól, których wartości mogą być zastąpione, aby wszystkie zasady mogły być spełnione!
error_locations <- locate_errors(modified, rules)
values(error_locations)[30:37, 3:7]## staff turnover other.rev total.rev staff.costs
## [1,] FALSE TRUE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE NA TRUE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE NA FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE TRUE FALSE FALSE
## [8,] FALSE TRUE FALSE FALSE FALSE
summary(error_locations)## Variable:
## name errors missing
## 4 turnover 2 4
## 5 other.rev 1 36
## 6 total.rev 1 2
## 1 size 0 0
## 2 incl.prob 0 0
## 3 staff 0 6
## 7 staff.costs 0 10
## 8 total.costs 0 5
## 9 profit 0 5
## 10 vat 0 12
## 11 rec_id 0 0
## Errors per record:
## errors records
## 1 0 56
## 2 1 4
Zamieńmy błędne pola na NA:
fixable_data <- replace_errors(retailers, rules)
# check nr of missings
sum(is.na(retailers))## [1] 80
sum(is.na(fixable_data))## [1] 85
Dedukcyjne czyszczenie danych
impute_lr
Wyprowadzaj unikalne imputacje (tam, gdzie to możliwe) na podstawie
ograniczeń liniowych. Jeśli total.rev = 0, wtedy
turnover oraz other.rev muszą być równe
0.
turnover + other.rev == total.rev
turnover >= 0
other.rev >= 0
lr_imputed <- impute_lr(fixable_data, rules)
# check nr of imputations using validate::cells
cells(start=retailers, fixable=fixable_data, impute_lr=lr_imputed
, compare='sequential')## Object of class cellComparison:
##
## cells(start = retailers, fixable = fixable_data, impute_lr = lr_imputed, compare = "sequential")
##
## start fixable impute_lr
## cells 660 660 660
## available 580 575 611
## still_available 580 575 575
## unadapted 580 575 575
## adapted 0 0 0
## imputed 0 0 36
## missing 80 85 49
## still_missing 80 80 49
## removed 0 5 0
correct_typos
Sprawdź, czy ograniczenia równowagi liniowej można naprawić, zakładając błąd typograficzny w jednej z liczb.
Jeśli turnover = 100, other.rev = 50 oraz total.rev = 105, zamiana ostatnich 2 cyfr w total.rev naprawia błąd.
typos_corrected <- correct_typos(lr_imputed,rules[1:3])
# Compare progress on rule violation using validate::compare
compare(rules, lr_imputed, typos_corrected)## Object of class validatorComparison:
##
## compare(x = rules, lr_imputed, typos_corrected)
##
## Version
## Status D0001 D0002
## validations 240 240
## verifiable 216 216
## unverifiable 24 24
## still_unverifiable 24 24
## new_unverifiable 0 0
## satisfied 216 216
## still_satisfied 216 216
## new_satisfied 0 0
## violated 0 0
## still_violated 0 0
## new_violated 0 0
Brakujące obserwacje
Missing Completely at Random (MCAR)
MCAR to wartości w zbiorze danych, które brakują całkowicie losowo (MCAR), tj. jeśli zdarzenia, które prowadzą do braku danego elementu danych są niezależne zarówno od obserwowalnych zmiennych, jak i nieobserwowalnych parametrów i występują całkowicie losowo.
Kiedy dane są MCAR, analiza przeprowadzona na danych jest nieobciążona; jednakże dane rzadko są MCAR.
W przypadku MCAR, brak danych nie jest związany z żadną zmienną w badaniu.
Missing at Random (MAR)
Brak losowy (MAR) występuje wtedy, gdy brak nie jest losowy, ale gdy brak może być w pełni wyjaśniony przez zmienne, o których istnieje pełna informacja.
Ponieważ MAR jest założeniem, którego nie da się zweryfikować statystycznie, musimy polegać na jego merytorycznej zasadności.
Przykładem może być to, że mężczyźni rzadziej wypełniają ankietę dotyczącą depresji, ale nie ma to nic wspólnego z ich poziomem depresji, po uwzględnieniu płci.
Missing Not at Random (MNAR)
MNAR (znane również jako nonignorable nonresponse) to dane, które nie są ani MAR, ani MCAR (tj. wartość zmiennej, której brakuje, jest związana z powodem jej braku).
Rozszerzając poprzedni przykład, taka sytuacja miałaby miejsce, gdyby mężczyźni nie wypełnili ankiety dotyczącej depresji z powodu poziomu depresji.
Analiza braków danych
Jak dokonywać detekcji oraz imputacji przedstawiono na poniższym schemacie:
Imputacja/interpolacja - zastępowanie średnią, medianą, dominantą, modelem, zaawansowane algorytmy, usuwanie - rozwiązania są dostępne w następujących pakietach:
Metody dla zmiennej ilościowej
zastępowanie średnią, medianą, dominantą
KNN – K-najbliższych sąsiadów
RPART – drzewa losowe
“mice” - Multivariate Imputation by Chained Equations – wielowymiarowe wypełnianie przez równania łańcuchowe
Metody dla zmiennej jakościowej
dominanta
RPART - drzewa losowe
“mice”
Wizualizacja brakujących obserwacji
VIM::aggr(typos_corrected[3:7])Inspekcja brakujących danych
VIM::pbox(typos_corrected[3:7], pos=1, las=2)SIMPUTACJE
Tzw. “simputacje” dostarczają:
jednolity interfejs,
konsekwentnym zachowanie wśród powszechnie stosowanych metodologii,
ułatwiają eksperymenty,
konfigurację danych do produkcji,
integrację z innymi etapami procesu.
impute_<model>(data, <imputed vars> ~ <predictor vars>)
Przykład: liniowa imputacja
typos_corrected[3:7] %>%
impute_lm(other.rev ~ turnover) %>%
head(3)## staff turnover other.rev total.rev staff.costs
## 1 75 NA NA 1130 NA
## 2 9 1607 0 1607 131
## 3 NA 6886 33 6919 324
Łańcuchowe imputacje:
typos_corrected[3:7] %>%
impute_lm(other.rev ~ turnover + staff) %>%
impute_lm(other.rev ~ staff) %>%
head(3)## staff turnover other.rev total.rev staff.costs
## 1 75 NA 207.312 1130 NA
## 2 9 1607 0.000 1607 131
## 3 NA 6886 33.000 6919 324
Odporne (“robust”) imputacje:
typos_corrected[3:7] %>%
impute_rlm(other.rev ~ turnover + staff) %>%
impute_rlm(other.rev ~ staff) %>%
head(3)## staff turnover other.rev total.rev staff.costs
## 1 75 NA 55.3733 1130 NA
## 2 9 1607 0.0000 1607 131
## 3 NA 6886 33.0000 6919 324
Wiele zmiennych, te same predyktory:
typos_corrected %>%
impute_rlm(other.rev + total.rev ~ turnover)## size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1 sc0 0.02 75 NA NA 1130.0 NA 18915
## 2 sc3 0.14 9 1607 0.0000 1607.0 131 1544
## 3 sc3 0.14 NA 6886 33.0000 6919.0 324 6493
## 4 sc3 0.14 NA 3861 13.0000 3874.0 290 3600
## 5 sc3 0.14 NA 5565 37.0000 5602.0 314 5530
## 6 sc0 0.02 1 25 0.0000 25.0 NA 22
## 7 sc3 0.14 5 NA NA 1335.0 135 136
## 8 sc1 0.02 3 404 13.0000 417.0 NA 342
## 9 sc3 0.14 6 2596 0.0000 2596.0 147 2486
## 10 sc2 0.05 5 NA NA NA NA NA
## 11 sc2 0.05 5 645 0.0000 645.0 130 636
## 12 sc2 0.05 5 2872 0.0000 2872.0 182 2652
## 13 sc3 0.14 13 5678 12.0000 5690.0 326 5656
## 14 sc1 0.02 NA 931397 0.0000 931397.0 36872 841489
## 15 sc1 0.02 3 80000 6.3938 80006.4 40000 NA
## 16 sc0 0.02 52 9067 622.0000 9689.0 1125 9911
## 17 sc3 0.14 10 1500 20.0000 1520.0 195 1384
## 18 sc1 0.02 4 440 0.0000 440.0 16 379
## 19 sc2 0.05 3 690 0.0000 690.0 19000 464507
## 20 sc3 0.14 8 1852 0.0000 1852.0 120 1812
## 21 sc0 0.02 2 359 9.0000 368.0 NA 339
## 22 sc0 0.02 3 839 0.0000 839.0 2 717
## 23 sc1 0.02 2 471 0.0000 471.0 34 411
## 24 sc1 0.02 4 933 2.0000 935.0 31 814
## 25 sc2 0.05 3 1665 0.0000 1665.0 70 186
## 26 sc3 0.14 6 2318 0.0000 2318.0 184 390
## 27 sc2 0.05 2 1175 12.0000 1187.0 114 NA
## 28 sc3 0.14 16 2946 7.0000 2953.0 245 2870
## 29 sc0 0.02 1 492 0.0000 492.0 NA 470
## 30 sc2 0.05 6 1831 1831.0000 3662.0 53 1443
## 31 sc3 0.14 29 7271 30.0000 7301.0 451 7242
## 32 sc2 0.05 8 NA NA 107.0 28 95
## 33 sc3 0.14 13 4118 11.0000 4129.0 57 3601
## 34 sc3 0.14 9 2803 0.0000 2803.0 106 2643
## 35 sc3 0.14 15 2876 33.0000 2909.0 539 2627
## 36 sc3 0.14 14 2649 98.0000 2747.0 221302 2725410
## 37 sc2 0.05 6 202 4.0000 206.0 64 170
## 38 sc2 0.05 53 9842 0.0000 9842.0 837 10000
## 39 sc2 0.05 7 2463 38.0000 2501.0 87 2347
## 40 sc3 0.14 NA 4445 98.0000 4543.0 369 4266
## 41 sc3 0.14 20 3284 11.0000 3295.0 181 3168
## 42 sc2 0.05 2 814 0.0000 814.0 107 175
## 43 sc1 0.02 NA 1210 0.0000 1210.0 52 1124
## 44 sc0 0.02 1 343 0.0000 343.0 NA NA
## 45 sc2 0.05 3 952 0.0000 952.0 79 NA
## 46 sc0 0.02 1 41 0.0000 41.0 NA 32
## 47 sc3 0.14 60 3633 0.0000 3633.0 257 3626
## 48 sc3 0.14 8 2906 0.0000 2906.0 144 453
## 49 sc3 0.14 10 2333 6.0000 2339.0 193 2353
## 50 sc3 0.14 12 2275 5.0000 2280.0 222 2302
## 51 sc2 0.05 7 1728 0.0000 1728.0 153 1681
## 52 sc3 0.14 24 6872 32.0000 6904.0 485 6729
## 53 sc3 0.14 29 3571 76.0000 3647.0 311 3554
## 54 sc3 0.14 11 1021 0.0000 1021.0 235 472
## 55 sc0 0.02 1 197 0.0000 197.0 NA 168
## 56 sc2 0.05 7 917 0.0000 917.0 30 781
## 57 sc2 0.05 8 2000 0.0000 2000.0 NA 1700
## 58 sc3 0.14 3 200 0.0000 200.0 49 177
## 59 sc2 0.05 4 342 0.0000 342.0 30 299
## 60 sc2 0.05 6 1 1410.0000 1411.0 179 1215
## profit vat rec_id
## 1 20045 NA 001
## 2 63 NA 002
## 3 426 NA 003
## 4 274 NA 004
## 5 72 NA 005
## 6 3 NA 006
## 7 1 1346 007
## 8 75 NA 008
## 9 110 NA 009
## 10 NA NA 010
## 11 9 NA 011
## 12 220 NA 012
## 13 34 NA 013
## 14 89908 863 014
## 15 NA 813 015
## 16 -222 964 016
## 17 136 733 017
## 18 60 296 018
## 19 225493 486 019
## 20 40 1312 020
## 21 29 257 021
## 22 122 654 022
## 23 60 377 023
## 24 121 811 024
## 25 1478 1472 025
## 26 86 2082 026
## 27 17 1058 027
## 28 83 2670 028
## 29 22 449 029
## 30 388 1695 030
## 31 59 6754 031
## 32 100 905 032
## 33 528 3841 033
## 34 160 2668 034
## 35 282 2758 035
## 36 22457 2548 036
## 37 37 995 037
## 38 -160 9655 038
## 39 154 2441 039
## 40 277 4412 040
## 41 127 3263 041
## 42 NA 810 042
## 43 86 1205 043
## 44 NA 343 044
## 45 149 952 045
## 46 9 41 046
## 47 7 3634 047
## 48 53 2907 048
## 49 -14 2335 049
## 50 -22 2277 050
## 51 47 1742 051
## 52 174 6959 052
## 53 93 3700 053
## 54 549 1067 054
## 55 30 221 055
## 56 136 1030 056
## 57 NA 2271 057
## 58 222 251 058
## 59 43 1068 059
## 60 196 1389 060
typos_corrected %>%
impute_rlm( . - turnover ~ turnover)## size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1 sc0 0.02 75.00000 NA NA 1130.0 NA 18915.000
## 2 sc3 0.14 9.00000 1607 0.0000 1607.0 131.0000 1544.000
## 3 sc3 0.14 7.65358 6886 33.0000 6919.0 324.0000 6493.000
## 4 sc3 0.14 7.60328 3861 13.0000 3874.0 290.0000 3600.000
## 5 sc3 0.14 7.63162 5565 37.0000 5602.0 314.0000 5530.000
## 6 sc0 0.02 1.00000 25 0.0000 25.0 91.0412 22.000
## 7 sc3 0.14 5.00000 NA NA 1335.0 135.0000 136.000
## 8 sc1 0.02 3.00000 404 13.0000 417.0 106.0146 342.000
## 9 sc3 0.14 6.00000 2596 0.0000 2596.0 147.0000 2486.000
## 10 sc2 0.05 5.00000 NA NA NA NA NA
## 11 sc2 0.05 5.00000 645 0.0000 645.0 130.0000 636.000
## 12 sc2 0.05 5.00000 2872 0.0000 2872.0 182.0000 2652.000
## 13 sc3 0.14 13.00000 5678 12.0000 5690.0 326.0000 5656.000
## 14 sc1 0.02 23.02605 931397 0.0000 931397.0 36872.0000 841489.000
## 15 sc1 0.02 3.00000 80000 6.3938 80006.4 40000.0000 72369.653
## 16 sc0 0.02 52.00000 9067 622.0000 9689.0 1125.0000 9911.000
## 17 sc3 0.14 10.00000 1500 20.0000 1520.0 195.0000 1384.000
## 18 sc1 0.02 4.00000 440 0.0000 440.0 16.0000 379.000
## 19 sc2 0.05 3.00000 690 0.0000 690.0 19000.0000 464507.000
## 20 sc3 0.14 8.00000 1852 0.0000 1852.0 120.0000 1812.000
## 21 sc0 0.02 2.00000 359 9.0000 368.0 104.2367 339.000
## 22 sc0 0.02 3.00000 839 0.0000 839.0 2.0000 717.000
## 23 sc1 0.02 2.00000 471 0.0000 471.0 34.0000 411.000
## 24 sc1 0.02 4.00000 933 2.0000 935.0 31.0000 814.000
## 25 sc2 0.05 3.00000 1665 0.0000 1665.0 70.0000 186.000
## 26 sc3 0.14 6.00000 2318 0.0000 2318.0 184.0000 390.000
## 27 sc2 0.05 2.00000 1175 12.0000 1187.0 114.0000 1160.898
## 28 sc3 0.14 16.00000 2946 7.0000 2953.0 245.0000 2870.000
## 29 sc0 0.02 1.00000 492 0.0000 492.0 109.4912 470.000
## 30 sc2 0.05 6.00000 1831 1831.0000 3662.0 53.0000 1443.000
## 31 sc3 0.14 29.00000 7271 30.0000 7301.0 451.0000 7242.000
## 32 sc2 0.05 8.00000 NA NA 107.0 28.0000 95.000
## 33 sc3 0.14 13.00000 4118 11.0000 4129.0 57.0000 3601.000
## 34 sc3 0.14 9.00000 2803 0.0000 2803.0 106.0000 2643.000
## 35 sc3 0.14 15.00000 2876 33.0000 2909.0 539.0000 2627.000
## 36 sc3 0.14 14.00000 2649 98.0000 2747.0 221302.0000 2725410.000
## 37 sc2 0.05 6.00000 202 4.0000 206.0 64.0000 170.000
## 38 sc2 0.05 53.00000 9842 0.0000 9842.0 837.0000 10000.000
## 39 sc2 0.05 7.00000 2463 38.0000 2501.0 87.0000 2347.000
## 40 sc3 0.14 7.61299 4445 98.0000 4543.0 369.0000 4266.000
## 41 sc3 0.14 20.00000 3284 11.0000 3295.0 181.0000 3168.000
## 42 sc2 0.05 2.00000 814 0.0000 814.0 107.0000 175.000
## 43 sc1 0.02 7.55920 1210 0.0000 1210.0 52.0000 1124.000
## 44 sc0 0.02 1.00000 343 0.0000 343.0 103.6046 409.287
## 45 sc2 0.05 3.00000 952 0.0000 952.0 79.0000 959.445
## 46 sc0 0.02 1.00000 41 0.0000 41.0 91.6733 32.000
## 47 sc3 0.14 60.00000 3633 0.0000 3633.0 257.0000 3626.000
## 48 sc3 0.14 8.00000 2906 0.0000 2906.0 144.0000 453.000
## 49 sc3 0.14 10.00000 2333 6.0000 2339.0 193.0000 2353.000
## 50 sc3 0.14 12.00000 2275 5.0000 2280.0 222.0000 2302.000
## 51 sc2 0.05 7.00000 1728 0.0000 1728.0 153.0000 1681.000
## 52 sc3 0.14 24.00000 6872 32.0000 6904.0 485.0000 6729.000
## 53 sc3 0.14 29.00000 3571 76.0000 3647.0 311.0000 3554.000
## 54 sc3 0.14 11.00000 1021 0.0000 1021.0 235.0000 472.000
## 55 sc0 0.02 1.00000 197 0.0000 197.0 97.8365 168.000
## 56 sc2 0.05 7.00000 917 0.0000 917.0 30.0000 781.000
## 57 sc2 0.05 8.00000 2000 0.0000 2000.0 169.0685 1700.000
## 58 sc3 0.14 3.00000 200 0.0000 200.0 49.0000 177.000
## 59 sc2 0.05 4.00000 342 0.0000 342.0 30.0000 299.000
## 60 sc2 0.05 6.00000 1 1410.0000 1411.0 179.0000 1215.000
## profit vat rec_id
## 1 20045.0000 NA 001
## 2 63.0000 1669.26 002
## 3 426.0000 1664.79 003
## 4 274.0000 1667.35 004
## 5 72.0000 1665.91 005
## 6 3.0000 1670.60 006
## 7 1.0000 1346.00 007
## 8 75.0000 1670.28 008
## 9 110.0000 1668.43 009
## 10 NA NA 010
## 11 9.0000 1670.08 011
## 12 220.0000 1668.19 012
## 13 34.0000 1665.82 013
## 14 89908.0000 863.00 014
## 15 7659.4122 813.00 015
## 16 -222.0000 964.00 016
## 17 136.0000 733.00 017
## 18 60.0000 296.00 018
## 19 225493.0000 486.00 019
## 20 40.0000 1312.00 020
## 21 29.0000 257.00 021
## 22 122.0000 654.00 022
## 23 60.0000 377.00 023
## 24 121.0000 811.00 024
## 25 1478.0000 1472.00 025
## 26 86.0000 2082.00 026
## 27 17.0000 1058.00 027
## 28 83.0000 2670.00 028
## 29 22.0000 449.00 029
## 30 388.0000 1695.00 030
## 31 59.0000 6754.00 031
## 32 100.0000 905.00 032
## 33 528.0000 3841.00 033
## 34 160.0000 2668.00 034
## 35 282.0000 2758.00 035
## 36 22457.0000 2548.00 036
## 37 37.0000 995.00 037
## 38 -160.0000 9655.00 038
## 39 154.0000 2441.00 039
## 40 277.0000 4412.00 040
## 41 127.0000 3263.00 041
## 42 10.8422 810.00 042
## 43 86.0000 1205.00 043
## 44 -34.6517 343.00 044
## 45 149.0000 952.00 045
## 46 9.0000 41.00 046
## 47 7.0000 3634.00 047
## 48 53.0000 2907.00 048
## 49 -14.0000 2335.00 049
## 50 -22.0000 2277.00 050
## 51 47.0000 1742.00 051
## 52 174.0000 6959.00 052
## 53 93.0000 3700.00 053
## 54 549.0000 1067.00 054
## 55 30.0000 221.00 055
## 56 136.0000 1030.00 056
## 57 125.3979 2271.00 057
## 58 222.0000 251.00 058
## 59 43.0000 1068.00 059
## 60 196.0000 1389.00 060
Przykład: grupowanie
typos_corrected %>%
impute_rlm(total.rev ~ turnover | size)## size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1 sc0 0.02 75 NA NA 1130.0 NA 18915
## 2 sc3 0.14 9 1607 0 1607.0 131 1544
## 3 sc3 0.14 NA 6886 33 6919.0 324 6493
## 4 sc3 0.14 NA 3861 13 3874.0 290 3600
## 5 sc3 0.14 NA 5565 37 5602.0 314 5530
## 6 sc0 0.02 1 25 0 25.0 NA 22
## 7 sc3 0.14 5 NA NA 1335.0 135 136
## 8 sc1 0.02 3 404 13 417.0 NA 342
## 9 sc3 0.14 6 2596 0 2596.0 147 2486
## 10 sc2 0.05 5 NA NA NA NA NA
## 11 sc2 0.05 5 645 0 645.0 130 636
## 12 sc2 0.05 5 2872 0 2872.0 182 2652
## 13 sc3 0.14 13 5678 12 5690.0 326 5656
## 14 sc1 0.02 NA 931397 0 931397.0 36872 841489
## 15 sc1 0.02 3 80000 NA 80000.9 40000 NA
## 16 sc0 0.02 52 9067 622 9689.0 1125 9911
## 17 sc3 0.14 10 1500 20 1520.0 195 1384
## 18 sc1 0.02 4 440 0 440.0 16 379
## 19 sc2 0.05 3 690 0 690.0 19000 464507
## 20 sc3 0.14 8 1852 0 1852.0 120 1812
## 21 sc0 0.02 2 359 9 368.0 NA 339
## 22 sc0 0.02 3 839 0 839.0 2 717
## 23 sc1 0.02 2 471 0 471.0 34 411
## 24 sc1 0.02 4 933 2 935.0 31 814
## 25 sc2 0.05 3 1665 0 1665.0 70 186
## 26 sc3 0.14 6 2318 0 2318.0 184 390
## 27 sc2 0.05 2 1175 12 1187.0 114 NA
## 28 sc3 0.14 16 2946 7 2953.0 245 2870
## 29 sc0 0.02 1 492 0 492.0 NA 470
## 30 sc2 0.05 6 1831 1831 3662.0 53 1443
## 31 sc3 0.14 29 7271 30 7301.0 451 7242
## 32 sc2 0.05 8 NA NA 107.0 28 95
## 33 sc3 0.14 13 4118 11 4129.0 57 3601
## 34 sc3 0.14 9 2803 0 2803.0 106 2643
## 35 sc3 0.14 15 2876 33 2909.0 539 2627
## 36 sc3 0.14 14 2649 98 2747.0 221302 2725410
## 37 sc2 0.05 6 202 4 206.0 64 170
## 38 sc2 0.05 53 9842 0 9842.0 837 10000
## 39 sc2 0.05 7 2463 38 2501.0 87 2347
## 40 sc3 0.14 NA 4445 98 4543.0 369 4266
## 41 sc3 0.14 20 3284 11 3295.0 181 3168
## 42 sc2 0.05 2 814 0 814.0 107 175
## 43 sc1 0.02 NA 1210 0 1210.0 52 1124
## 44 sc0 0.02 1 343 0 343.0 NA NA
## 45 sc2 0.05 3 952 0 952.0 79 NA
## 46 sc0 0.02 1 41 0 41.0 NA 32
## 47 sc3 0.14 60 3633 0 3633.0 257 3626
## 48 sc3 0.14 8 2906 0 2906.0 144 453
## 49 sc3 0.14 10 2333 6 2339.0 193 2353
## 50 sc3 0.14 12 2275 5 2280.0 222 2302
## 51 sc2 0.05 7 1728 0 1728.0 153 1681
## 52 sc3 0.14 24 6872 32 6904.0 485 6729
## 53 sc3 0.14 29 3571 76 3647.0 311 3554
## 54 sc3 0.14 11 1021 0 1021.0 235 472
## 55 sc0 0.02 1 197 0 197.0 NA 168
## 56 sc2 0.05 7 917 0 917.0 30 781
## 57 sc2 0.05 8 2000 0 2000.0 NA 1700
## 58 sc3 0.14 3 200 0 200.0 49 177
## 59 sc2 0.05 4 342 0 342.0 30 299
## 60 sc2 0.05 6 1 1410 1411.0 179 1215
## profit vat rec_id
## 1 20045 NA 001
## 2 63 NA 002
## 3 426 NA 003
## 4 274 NA 004
## 5 72 NA 005
## 6 3 NA 006
## 7 1 1346 007
## 8 75 NA 008
## 9 110 NA 009
## 10 NA NA 010
## 11 9 NA 011
## 12 220 NA 012
## 13 34 NA 013
## 14 89908 863 014
## 15 NA 813 015
## 16 -222 964 016
## 17 136 733 017
## 18 60 296 018
## 19 225493 486 019
## 20 40 1312 020
## 21 29 257 021
## 22 122 654 022
## 23 60 377 023
## 24 121 811 024
## 25 1478 1472 025
## 26 86 2082 026
## 27 17 1058 027
## 28 83 2670 028
## 29 22 449 029
## 30 388 1695 030
## 31 59 6754 031
## 32 100 905 032
## 33 528 3841 033
## 34 160 2668 034
## 35 282 2758 035
## 36 22457 2548 036
## 37 37 995 037
## 38 -160 9655 038
## 39 154 2441 039
## 40 277 4412 040
## 41 127 3263 041
## 42 NA 810 042
## 43 86 1205 043
## 44 NA 343 044
## 45 149 952 045
## 46 9 41 046
## 47 7 3634 047
## 48 53 2907 048
## 49 -14 2335 049
## 50 -22 2277 050
## 51 47 1742 051
## 52 174 6959 052
## 53 93 3700 053
## 54 549 1067 054
## 55 30 221 055
## 56 136 1030 056
## 57 NA 2271 057
## 58 222 251 058
## 59 43 1068 059
## 60 196 1389 060
# or, using dplyr::group_by
typos_corrected %>%
group_by(size) %>%
impute_rlm(total.rev ~ turnover)## # A tibble: 60 × 11
## # Groups: size [4]
## size incl.prob staff turnover other.rev total…¹ staff…² total…³ profit vat
## * <fct> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
## 1 sc0 0.02 75 NA NA 1130 NA 18915 20045 NA
## 2 sc3 0.14 9 1607 0 1607 131 1544 63 NA
## 3 sc3 0.14 NA 6886 33 6919 324 6493 426 NA
## 4 sc3 0.14 NA 3861 13 3874 290 3600 274 NA
## 5 sc3 0.14 NA 5565 37 5602 314 5530 72 NA
## 6 sc0 0.02 1 25 0 25 NA 22 3 NA
## 7 sc3 0.14 5 NA NA 1335 135 136 1 1346
## 8 sc1 0.02 3 404 13 417 NA 342 75 NA
## 9 sc3 0.14 6 2596 0 2596 147 2486 110 NA
## 10 sc2 0.05 5 NA NA NA NA NA NA NA
## # … with 50 more rows, 1 more variable: rec_id <chr>, and abbreviated variable
## # names ¹total.rev, ²staff.costs, ³total.costs
Przykład: losowe reszty
typos_corrected %>%
impute_rlm(total.rev ~ turnover | size,
add_residual="observed")## size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1 sc0 0.02 75 NA NA 1130.0 NA 18915
## 2 sc3 0.14 9 1607 0 1607.0 131 1544
## 3 sc3 0.14 NA 6886 33 6919.0 324 6493
## 4 sc3 0.14 NA 3861 13 3874.0 290 3600
## 5 sc3 0.14 NA 5565 37 5602.0 314 5530
## 6 sc0 0.02 1 25 0 25.0 NA 22
## 7 sc3 0.14 5 NA NA 1335.0 135 136
## 8 sc1 0.02 3 404 13 417.0 NA 342
## 9 sc3 0.14 6 2596 0 2596.0 147 2486
## 10 sc2 0.05 5 NA NA NA NA NA
## 11 sc2 0.05 5 645 0 645.0 130 636
## 12 sc2 0.05 5 2872 0 2872.0 182 2652
## 13 sc3 0.14 13 5678 12 5690.0 326 5656
## 14 sc1 0.02 NA 931397 0 931397.0 36872 841489
## 15 sc1 0.02 3 80000 NA 79999.9 40000 NA
## 16 sc0 0.02 52 9067 622 9689.0 1125 9911
## 17 sc3 0.14 10 1500 20 1520.0 195 1384
## 18 sc1 0.02 4 440 0 440.0 16 379
## 19 sc2 0.05 3 690 0 690.0 19000 464507
## 20 sc3 0.14 8 1852 0 1852.0 120 1812
## 21 sc0 0.02 2 359 9 368.0 NA 339
## 22 sc0 0.02 3 839 0 839.0 2 717
## 23 sc1 0.02 2 471 0 471.0 34 411
## 24 sc1 0.02 4 933 2 935.0 31 814
## 25 sc2 0.05 3 1665 0 1665.0 70 186
## 26 sc3 0.14 6 2318 0 2318.0 184 390
## 27 sc2 0.05 2 1175 12 1187.0 114 NA
## 28 sc3 0.14 16 2946 7 2953.0 245 2870
## 29 sc0 0.02 1 492 0 492.0 NA 470
## 30 sc2 0.05 6 1831 1831 3662.0 53 1443
## 31 sc3 0.14 29 7271 30 7301.0 451 7242
## 32 sc2 0.05 8 NA NA 107.0 28 95
## 33 sc3 0.14 13 4118 11 4129.0 57 3601
## 34 sc3 0.14 9 2803 0 2803.0 106 2643
## 35 sc3 0.14 15 2876 33 2909.0 539 2627
## 36 sc3 0.14 14 2649 98 2747.0 221302 2725410
## 37 sc2 0.05 6 202 4 206.0 64 170
## 38 sc2 0.05 53 9842 0 9842.0 837 10000
## 39 sc2 0.05 7 2463 38 2501.0 87 2347
## 40 sc3 0.14 NA 4445 98 4543.0 369 4266
## 41 sc3 0.14 20 3284 11 3295.0 181 3168
## 42 sc2 0.05 2 814 0 814.0 107 175
## 43 sc1 0.02 NA 1210 0 1210.0 52 1124
## 44 sc0 0.02 1 343 0 343.0 NA NA
## 45 sc2 0.05 3 952 0 952.0 79 NA
## 46 sc0 0.02 1 41 0 41.0 NA 32
## 47 sc3 0.14 60 3633 0 3633.0 257 3626
## 48 sc3 0.14 8 2906 0 2906.0 144 453
## 49 sc3 0.14 10 2333 6 2339.0 193 2353
## 50 sc3 0.14 12 2275 5 2280.0 222 2302
## 51 sc2 0.05 7 1728 0 1728.0 153 1681
## 52 sc3 0.14 24 6872 32 6904.0 485 6729
## 53 sc3 0.14 29 3571 76 3647.0 311 3554
## 54 sc3 0.14 11 1021 0 1021.0 235 472
## 55 sc0 0.02 1 197 0 197.0 NA 168
## 56 sc2 0.05 7 917 0 917.0 30 781
## 57 sc2 0.05 8 2000 0 2000.0 NA 1700
## 58 sc3 0.14 3 200 0 200.0 49 177
## 59 sc2 0.05 4 342 0 342.0 30 299
## 60 sc2 0.05 6 1 1410 1411.0 179 1215
## profit vat rec_id
## 1 20045 NA 001
## 2 63 NA 002
## 3 426 NA 003
## 4 274 NA 004
## 5 72 NA 005
## 6 3 NA 006
## 7 1 1346 007
## 8 75 NA 008
## 9 110 NA 009
## 10 NA NA 010
## 11 9 NA 011
## 12 220 NA 012
## 13 34 NA 013
## 14 89908 863 014
## 15 NA 813 015
## 16 -222 964 016
## 17 136 733 017
## 18 60 296 018
## 19 225493 486 019
## 20 40 1312 020
## 21 29 257 021
## 22 122 654 022
## 23 60 377 023
## 24 121 811 024
## 25 1478 1472 025
## 26 86 2082 026
## 27 17 1058 027
## 28 83 2670 028
## 29 22 449 029
## 30 388 1695 030
## 31 59 6754 031
## 32 100 905 032
## 33 528 3841 033
## 34 160 2668 034
## 35 282 2758 035
## 36 22457 2548 036
## 37 37 995 037
## 38 -160 9655 038
## 39 154 2441 039
## 40 277 4412 040
## 41 127 3263 041
## 42 NA 810 042
## 43 86 1205 043
## 44 NA 343 044
## 45 149 952 045
## 46 9 41 046
## 47 7 3634 047
## 48 53 2907 048
## 49 -14 2335 049
## 50 -22 2277 050
## 51 47 1742 051
## 52 174 6959 052
## 53 93 3700 053
## 54 549 1067 054
## 55 30 221 055
## 56 136 1030 056
## 57 NA 2271 057
## 58 222 251 058
## 59 43 1068 059
## 60 196 1389 060
typos_corrected %>%
impute_rlm(total.rev ~ turnover | size,
add_residual="normal")## size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1 sc0 0.02 75 NA NA 1130.0 NA 18915
## 2 sc3 0.14 9 1607 0 1607.0 131 1544
## 3 sc3 0.14 NA 6886 33 6919.0 324 6493
## 4 sc3 0.14 NA 3861 13 3874.0 290 3600
## 5 sc3 0.14 NA 5565 37 5602.0 314 5530
## 6 sc0 0.02 1 25 0 25.0 NA 22
## 7 sc3 0.14 5 NA NA 1335.0 135 136
## 8 sc1 0.02 3 404 13 417.0 NA 342
## 9 sc3 0.14 6 2596 0 2596.0 147 2486
## 10 sc2 0.05 5 NA NA NA NA NA
## 11 sc2 0.05 5 645 0 645.0 130 636
## 12 sc2 0.05 5 2872 0 2872.0 182 2652
## 13 sc3 0.14 13 5678 12 5690.0 326 5656
## 14 sc1 0.02 NA 931397 0 931397.0 36872 841489
## 15 sc1 0.02 3 80000 NA 80009.9 40000 NA
## 16 sc0 0.02 52 9067 622 9689.0 1125 9911
## 17 sc3 0.14 10 1500 20 1520.0 195 1384
## 18 sc1 0.02 4 440 0 440.0 16 379
## 19 sc2 0.05 3 690 0 690.0 19000 464507
## 20 sc3 0.14 8 1852 0 1852.0 120 1812
## 21 sc0 0.02 2 359 9 368.0 NA 339
## 22 sc0 0.02 3 839 0 839.0 2 717
## 23 sc1 0.02 2 471 0 471.0 34 411
## 24 sc1 0.02 4 933 2 935.0 31 814
## 25 sc2 0.05 3 1665 0 1665.0 70 186
## 26 sc3 0.14 6 2318 0 2318.0 184 390
## 27 sc2 0.05 2 1175 12 1187.0 114 NA
## 28 sc3 0.14 16 2946 7 2953.0 245 2870
## 29 sc0 0.02 1 492 0 492.0 NA 470
## 30 sc2 0.05 6 1831 1831 3662.0 53 1443
## 31 sc3 0.14 29 7271 30 7301.0 451 7242
## 32 sc2 0.05 8 NA NA 107.0 28 95
## 33 sc3 0.14 13 4118 11 4129.0 57 3601
## 34 sc3 0.14 9 2803 0 2803.0 106 2643
## 35 sc3 0.14 15 2876 33 2909.0 539 2627
## 36 sc3 0.14 14 2649 98 2747.0 221302 2725410
## 37 sc2 0.05 6 202 4 206.0 64 170
## 38 sc2 0.05 53 9842 0 9842.0 837 10000
## 39 sc2 0.05 7 2463 38 2501.0 87 2347
## 40 sc3 0.14 NA 4445 98 4543.0 369 4266
## 41 sc3 0.14 20 3284 11 3295.0 181 3168
## 42 sc2 0.05 2 814 0 814.0 107 175
## 43 sc1 0.02 NA 1210 0 1210.0 52 1124
## 44 sc0 0.02 1 343 0 343.0 NA NA
## 45 sc2 0.05 3 952 0 952.0 79 NA
## 46 sc0 0.02 1 41 0 41.0 NA 32
## 47 sc3 0.14 60 3633 0 3633.0 257 3626
## 48 sc3 0.14 8 2906 0 2906.0 144 453
## 49 sc3 0.14 10 2333 6 2339.0 193 2353
## 50 sc3 0.14 12 2275 5 2280.0 222 2302
## 51 sc2 0.05 7 1728 0 1728.0 153 1681
## 52 sc3 0.14 24 6872 32 6904.0 485 6729
## 53 sc3 0.14 29 3571 76 3647.0 311 3554
## 54 sc3 0.14 11 1021 0 1021.0 235 472
## 55 sc0 0.02 1 197 0 197.0 NA 168
## 56 sc2 0.05 7 917 0 917.0 30 781
## 57 sc2 0.05 8 2000 0 2000.0 NA 1700
## 58 sc3 0.14 3 200 0 200.0 49 177
## 59 sc2 0.05 4 342 0 342.0 30 299
## 60 sc2 0.05 6 1 1410 1411.0 179 1215
## profit vat rec_id
## 1 20045 NA 001
## 2 63 NA 002
## 3 426 NA 003
## 4 274 NA 004
## 5 72 NA 005
## 6 3 NA 006
## 7 1 1346 007
## 8 75 NA 008
## 9 110 NA 009
## 10 NA NA 010
## 11 9 NA 011
## 12 220 NA 012
## 13 34 NA 013
## 14 89908 863 014
## 15 NA 813 015
## 16 -222 964 016
## 17 136 733 017
## 18 60 296 018
## 19 225493 486 019
## 20 40 1312 020
## 21 29 257 021
## 22 122 654 022
## 23 60 377 023
## 24 121 811 024
## 25 1478 1472 025
## 26 86 2082 026
## 27 17 1058 027
## 28 83 2670 028
## 29 22 449 029
## 30 388 1695 030
## 31 59 6754 031
## 32 100 905 032
## 33 528 3841 033
## 34 160 2668 034
## 35 282 2758 035
## 36 22457 2548 036
## 37 37 995 037
## 38 -160 9655 038
## 39 154 2441 039
## 40 277 4412 040
## 41 127 3263 041
## 42 NA 810 042
## 43 86 1205 043
## 44 NA 343 044
## 45 149 952 045
## 46 9 41 046
## 47 7 3634 047
## 48 53 2907 048
## 49 -14 2335 049
## 50 -22 2277 050
## 51 47 1742 051
## 52 174 6959 052
## 53 93 3700 053
## 54 549 1067 054
## 55 30 221 055
## 56 136 1030 056
## 57 NA 2271 057
## 58 222 251 058
## 59 43 1068 059
## 60 196 1389 060
Przykład: naucz na A, zastosuj dla B
m <- MASS::rlm(other.rev ~ turnover + staff
, data=typos_corrected)
impute(retailers, other.rev ~ m)## size incl.prob staff turnover other.rev total.rev staff.costs
## 1 sc0 0.02 75 NA NA 1130 NA
## 2 sc3 0.14 9 1607 5.1118178 1607 131
## 3 sc3 0.14 NA 6886 -33.0000000 6919 324
## 4 sc3 0.14 NA 3861 13.0000000 3874 290
## 5 sc3 0.14 NA NA 37.0000000 5602 314
## 6 sc0 0.02 1 25 -0.1941089 25 NA
## 7 sc3 0.14 5 NA NA 1335 135
## 8 sc1 0.02 3 404 13.0000000 417 NA
## 9 sc3 0.14 6 2596 8.0157561 2596 147
## 10 sc2 0.05 5 NA NA NA NA
## 11 sc2 0.05 5 645 1.9299772 645 130
## 12 sc2 0.05 5 2872 8.8177527 2872 182
## 13 sc3 0.14 13 5678 12.0000000 5690 326
## 14 sc1 0.02 NA 931397 NA 931397 36872
## 15 sc1 0.02 3 80000 247.2597777 NA 40000
## 16 sc0 0.02 52 9067 622.0000000 9689 1125
## 17 sc3 0.14 10 1500 20.0000000 1520 195
## 18 sc1 0.02 4 440 1.2443132 440 16
## 19 sc2 0.05 3 690 1.9658956 690 19000
## 20 sc3 0.14 8 1852 5.8179360 1852 120
## 21 sc0 0.02 2 359 9.0000000 368 NA
## 22 sc0 0.02 3 839 2.4267302 839 2
## 23 sc1 0.02 2 471 1.2369318 471 34
## 24 sc1 0.02 4 933 2.0000000 935 31
## 25 sc2 0.05 3 1665 4.9814238 1665 70
## 26 sc3 0.14 6 2318 7.1559440 2318 184
## 27 sc2 0.05 2 1175 12.0000000 1187 114
## 28 sc3 0.14 16 2946 7.0000000 2953 245
## 29 sc0 0.02 1 492 1.2502517 492 NA
## 30 sc2 0.05 6 1831 1831.0000000 1831 53
## 31 sc3 0.14 29 7271 30.0000000 7301 451
## 32 sc2 0.05 8 971 3.0931357 107 28
## 33 sc3 0.14 13 4118 11.0000000 4129 57
## 34 sc3 0.14 9 2803 8.8108656 2803 106
## 35 sc3 0.14 15 2876 33.0000000 2909 539
## 36 sc3 0.14 14 2649 98350.0000000 2747 221302
## 37 sc2 0.05 6 1024 4.0000000 206 64
## 38 sc2 0.05 53 9842 32.8531468 9842 837
## 39 sc2 0.05 7 2463 38.0000000 2501 87
## 40 sc3 0.14 NA 4445 98.0000000 4543 369
## 41 sc3 0.14 20 3284 11.0000000 3295 181
## 42 sc2 0.05 2 814 2.2977791 814 107
## 43 sc1 0.02 NA 1210 NA 1210 52
## 44 sc0 0.02 1 343 0.7894172 343 NA
## 45 sc2 0.05 3 952 2.7762222 952 79
## 46 sc0 0.02 1 41 -0.1446233 41 NA
## 47 sc3 0.14 60 3633 14.0110544 3633 257
## 48 sc3 0.14 8 2906 9.0777992 2906 144
## 49 sc3 0.14 10 2333 6.0000000 2339 193
## 50 sc3 0.14 12 2275 5.0000000 2280 222
## 51 sc2 0.05 7 1728 5.3827928 1728 153
## 52 sc3 0.14 24 6872 32.0000000 6904 485
## 53 sc3 0.14 29 3571 76.0000000 3647 311
## 54 sc3 0.14 11 1021 3.4026678 1021 235
## 55 sc0 0.02 1 197 0.3378612 197 NA
## 56 sc2 0.05 7 917 2.8744920 917 30
## 57 sc2 0.05 8 2000 6.2756777 2000 NA
## 58 sc3 0.14 3 200 0.4503995 200 49
## 59 sc2 0.05 4 342 0.9412140 342 30
## 60 sc2 0.05 6 1 -0.0101879 1411 179
## total.costs profit vat rec_id
## 1 18915 20045 NA 001
## 2 1544 63 NA 002
## 3 6493 426 NA 003
## 4 3600 274 NA 004
## 5 5530 72 NA 005
## 6 22 3 NA 006
## 7 136 1 1346 007
## 8 342 75 NA 008
## 9 2486 110 NA 009
## 10 NA NA NA 010
## 11 636 9 NA 011
## 12 2652 220 NA 012
## 13 5656 34 NA 013
## 14 841489 89908 863 014
## 15 NA NA 813 015
## 16 9911 -222 964 016
## 17 1384 136 733 017
## 18 379 60 296 018
## 19 464507 225493 486 019
## 20 1812 40 1312 020
## 21 339 29 257 021
## 22 717 122 654 022
## 23 411 60 377 023
## 24 814 121 811 024
## 25 186 1478 1472 025
## 26 390 86 2082 026
## 27 NA 17 1058 027
## 28 2870 83 2670 028
## 29 470 22 449 029
## 30 1443 388 1695 030
## 31 7242 59 6754 031
## 32 95 100 905 032
## 33 3601 528 3841 033
## 34 2643 160 2668 034
## 35 2627 282 2758 035
## 36 2725410 22457 2548 036
## 37 170 37 995 037
## 38 10000 -160 9655 038
## 39 2347 154 2441 039
## 40 4266 277 4412 040
## 41 3168 127 3263 041
## 42 175 NA 810 042
## 43 1124 86 1205 043
## 44 NA NA 343 044
## 45 NA 149 952 045
## 46 32 9 41 046
## 47 3626 7 3634 047
## 48 453 53 2907 048
## 49 2353 -14 2335 049
## 50 2302 -22 2277 050
## 51 1681 47 1742 051
## 52 6729 174 6959 052
## 53 3554 93 3700 053
## 54 472 549 1067 054
## 55 168 30 221 055
## 56 781 136 1030 056
## 57 1700 NA 2271 057
## 58 177 222 251 058
## 59 299 43 1068 059
## 60 1215 196 1389 060
Logi zmian
Każdy analityk chciałby wiedzieć, która operacja czyszczenia danych miała wpływ na wartości, statystyki, wyniki walidacji…
Rozwiązanie:
Wszystkie dane przepływają przez fajkę %>%.
Widzi ona dane wejściowe i wyjściowe. Użyjmy zatem specjalnego operatora
fajki, który mierzy i przechowuje różnice między wejściem a
wyjściem.
Operator %>>%
Pakiet “lumberjack” pozwala na zamianę fajki typu magrittr
%>% na operatora “drwala” %>>%
Rejestruje on, co dzieje się z Twoimi danymi, gdy przepływają przez
%>>% Używa on loggera wyeksportowanego przez “drwala”
lub walidacji lub definiowania przez własny rejestrator.
imputed <- typos_corrected %>>%
start_log(log = validate::lbj_cells()) %>>%
impute_lm(turnover ~ staff) %>>%
impute_median(other.rev + turnover ~ size) %>>%
dump_log()## Dumped a log at C:\Users\Maciek\Downloads\lbj_cells.csv
Zobaczmy, jak wygląda plik z logami (jego początek):
read.csv("lbj_cells.csv") %>% head()## step time expression cells
## 1 0 2022-12-11 09:58:52 660
## 2 1 2022-12-11 09:58:52 impute_lm(turnover ~ staff) 660
## 3 2 2022-12-11 09:58:52 impute_median(other.rev + turnover ~ size) 660
## available still_available unadapted adapted imputed missing still_missing
## 1 611 611 611 0 0 49 49
## 2 615 611 611 0 4 45 45
## 3 620 615 615 0 5 40 40
## removed
## 1 0
## 2 0
## 3 0
Pakiet dlookr
Transformacje danych z pakietem “dlookr”:
find_na()- znajduje zmienną, która zawiera brakujące obserwacjeimputate_na()- wypełnia brakujące obserwacjesummary.imputation()orazplot.imputation()- przedstawiają podsumowania i wizualizują przeprowadzone imputacje brakówfind_skewness()- znajduje zmienne skośne i raportuje tę skośnośćtransform()- przeprowadza standaryzację, normalizację zmiennych numerycznychsummary.transform()orazplot.transform()- przedstawiają podsumowania i wizualizują przeprowadzone transformacjebinning()orazbinning_by()konwertuje dane do kategorycznych (ilościowe do jakościowych)transformation_web_report()- przeprowadza w/w i tworzy raport z transformacji
Jako przykład, zabrudzimy paczkę danych “Carseats”.
?Carseats## uruchamianie serwera httpd dla pomocy ... wykonano
str(Carseats)## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
data(Carseats)
attach(Carseats)
#Wprowadzamy braki danych:
carseats <- ISLR::Carseats
suppressWarnings(RNGversion("3.5.0"))
set.seed(123)
carseats[sample(seq(NROW(carseats)), 20), "Income"] <- NA
suppressWarnings(RNGversion("3.5.0"))
set.seed(456)
carseats[sample(seq(NROW(carseats)), 10), "Urban"] <- NAPrzykład 1. Zamieniamy brakujące dochody (20 sztuk) za pomocą mediany:
#wypełniamy brakujące obserwacje
# ?imputate_na
dochod<-imputate_na(carseats, Income, method = "median")
summary(dochod)## Impute missing values with median
##
## * Information of Imputation (before vs after)
## Original Imputation
## described_variables "value" "value"
## n "380" "400"
## na "20" " 0"
## mean "68.8605" "68.8675"
## sd "28.0916" "27.3785"
## se_mean "1.44107" "1.36893"
## IQR "48.25" "45.25"
## skewness "0.0449060" "0.0452954"
## kurtosis "-1.089201" "-0.987569"
## p00 "21" "21"
## p01 "21.79" "21.99"
## p05 "26" "26"
## p10 "30.0" "30.9"
## p20 "39" "40"
## p25 "42.75" "44.75"
## p30 "48" "52"
## p40 "62" "63"
## p50 "69" "69"
## p60 "78" "76"
## p70 "86.3" "84.0"
## p75 "91" "90"
## p80 "96.2" "94.2"
## p90 "108.1" "106.1"
## p95 "115.05" "115.00"
## p99 "119.21" "119.01"
## p100 "120" "120"
plot(dochod)Przykład 2. Zamieniamy brakujące dochody (20 sztuk) za pomocą metody “mice” wielorównaniowej:
dochod<-imputate_na(carseats, Income, Urban, method = "mice")##
## iter imp variable
## 1 1 Income
## 1 2 Income
## 1 3 Income
## 1 4 Income
## 1 5 Income
## 2 1 Income
## 2 2 Income
## 2 3 Income
## 2 4 Income
## 2 5 Income
## 3 1 Income
## 3 2 Income
## 3 3 Income
## 3 4 Income
## 3 5 Income
## 4 1 Income
## 4 2 Income
## 4 3 Income
## 4 4 Income
## 4 5 Income
## 5 1 Income
## 5 2 Income
## 5 3 Income
## 5 4 Income
## 5 5 Income
summary(dochod)## * Impute missing values based on Multivariate Imputation by Chained Equations
## - method : mice
## - random seed : 37295
##
## * Information of Imputation (before vs after)
## Original Imputation
## described_variables "value" "value"
## n "380" "400"
## na "20" " 0"
## mean "68.8605" "69.1370"
## sd "28.0916" "27.4940"
## se_mean "1.44107" "1.37470"
## IQR "48.25" "45.25"
## skewness "0.044906" "0.018794"
## kurtosis "-1.08920" "-1.01939"
## p00 "21" "21"
## p01 "21.79" "21.99"
## p05 "26" "26"
## p10 "30.0" "30.9"
## p20 "39" "40"
## p25 "42.75" "44.75"
## p30 "48" "52"
## p40 "62" "63"
## p50 "69" "69"
## p60 "78" "78"
## p70 "86.3" "84.0"
## p75 "91" "90"
## p80 "96.2" "94.2"
## p90 "108.1" "106.1"
## p95 "115.05" "115.00"
## p99 "119.21" "119.01"
## p100 "120" "120"
plot(dochod)Obserwacje odstające
Jednowymiarowe:
Wielowymiarowe:
Metody wykrywania i usuwania obserwacji odstających:
wykres ramkowy, rozrzutu
Z-score (reguła 3 sigm - 3 odchyleń od średniej)
testy statystyczne (Cook’a, Grubbs’a, Tukey’a)
percentyle
filtr Hampela
dystans Cook’a - dla wielowymiarowych
Przykład 3. Zamiana wartości odstających komendą imputate_outlier()
# ?imputate_outlier
ceny<-imputate_outlier(carseats, Price, method="capping")
summary(ceny)## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## described_variables "value" "value"
## n "400" "400"
## na "0" "0"
## mean "115.795" "115.893"
## sd "23.6767" "22.6109"
## se_mean "1.18383" "1.13055"
## IQR "31" "31"
## skewness "-0.1252862" "-0.0461621"
## kurtosis " 0.451885" "-0.303058"
## p00 "24" "54"
## p01 "54.99" "67.96"
## p05 "77" "77"
## p10 "87" "87"
## p20 "96.8" "96.8"
## p25 "100" "100"
## p30 "104" "104"
## p40 "110" "110"
## p50 "117" "117"
## p60 "122" "122"
## p70 "128.3" "128.3"
## p75 "131" "131"
## p80 "134" "134"
## p90 "146" "146"
## p95 "155.050" "155.002"
## p99 "166.05" "164.02"
## p100 "191" "173"
plot(ceny)Standaryzacja danych
Standaryzacja:
Metoda z-score – jest to taka standaryzacja danych, która sprowadzi naszą zmienną do skali uniwersalnej, bez wpływu średniej i odchylenie standardowego – od zmiennej musimy odjąć średnią i podzielić przez odchylenie: dane otrzymują średnią równą 0 i odchylenie równe 1.
Metoda “minmax” - inna odmiana standaryzacji - względem minimum i maksimum: od wartości zmiennej odejmujemy minimum i dzielimy to przez rozstęp: (x-min)/(max-min).
Jak rozwiązać problem ze skośnymi danymi? Transformować!
log – transformacja z użyciem logarytmu log(x)
log+1 – tak jak wyżej, ale umożliwia badanie danych, które zawierają 0
sqrt – pierwiastek
1/x
x^2
x^3
metoda Boxa-Coxa
Przykład 4. Za pomocą komendy “mutate” oraz “transform” dokonaj standaryzacji tworząc nową zmienną metodą z-score oraz minmax i wyświetl je na wykresie ramkowym:
carseats$Income<-as.numeric(carseats$Income) #usuwam informacje o imputacjach
carseats %>%
mutate(dochody_z = transform(carseats$Income, method = "zscore"),
dochody_minmax = transform(carseats$Income, method = "minmax")) %>%
select(dochody_z, dochody_minmax) %>%
boxplot()Przykład 5. Znajdź zmienne (zmienną) w ramce danych, które stwarza(ją) problem skośności.
znajdujemy zmienne - które to zmienne stwarzają problem?
find_skewness(carseats)## [1] 4to samo co powyżej, ale po nazwie:
find_skewness(carseats, index=FALSE)## [1] "Advertising"
- policzmy wskaźnik asymetrii dla zmiennych:
find_skewness(carseats, value=TRUE)## Sales CompPrice Income Advertising Population Price
## 0.185 -0.043 0.045 0.637 -0.051 -0.125
## Age Education
## -0.077 0.044
- policzmy wskaźnik asymetrii dla zmiennych i filtrujemy tylko te, które mają skośność > 0.1:
find_skewness(carseats, value=TRUE, thres=0.1)## Sales Advertising Price
## 0.185 0.637 -0.125
hist(carseats$Advertising)w takim razie musimy dokonać transformacji tej skośnej zmiennej:
advertising_log <- transform(carseats$Advertising, method = "log")
advertising_log## [1] 2.397895 2.772589 2.302585 1.386294 1.098612 2.564949 -Inf 2.708050
## [9] -Inf -Inf 2.197225 1.386294 0.693147 2.397895 2.397895 1.609438
## [17] -Inf 2.564949 -Inf 2.772589 0.693147 2.484907 1.791759 -Inf
## [25] 2.772589 -Inf 2.397895 -Inf -Inf 2.708050 -Inf 2.772589
## [33] 2.484907 2.564949 -Inf 2.397895 -Inf 1.609438 -Inf -Inf
## [41] -Inf -Inf -Inf 2.397895 1.791759 -Inf 2.639057 -Inf
## [49] -Inf -Inf 2.890372 -Inf 1.098612 2.564949 2.564949 1.609438
## [57] -Inf -Inf 2.708050 1.386294 2.944439 -Inf -Inf 2.302585
## [65] 2.484907 -Inf -Inf 2.639057 2.995732 -Inf 2.708050 2.772589
## [73] -Inf 2.302585 1.609438 3.135494 2.302585 2.484907 0.000000 -Inf
## [81] 2.772589 -Inf 1.386294 1.945910 -Inf -Inf 2.197225 1.945910
## [89] 1.945910 1.098612 -Inf 2.397895 -Inf -Inf 1.609438 2.302585
## [97] 2.302585 1.609438 3.178054 1.098612 2.397895 -Inf -Inf -Inf
## [105] -Inf 2.079442 -Inf -Inf 0.693147 -Inf 1.945910 2.484907
## [113] 1.609438 2.397895 2.197225 -Inf -Inf -Inf 0.693147 2.079442
## [121] 2.397895 2.302585 1.609438 -Inf -Inf -Inf 0.693147 1.098612
## [129] 1.098612 1.945910 2.564949 1.098612 2.197225 0.693147 -Inf 2.639057
## [137] -Inf -Inf 2.484907 2.302585 2.302585 -Inf -Inf 1.945910
## [145] -Inf 2.397895 -Inf 2.197225 -Inf 2.564949 2.079442 2.833213
## [153] -Inf 1.945910 2.302585 -Inf -Inf 2.079442 0.000000 -Inf
## [161] -Inf 1.609438 -Inf -Inf -Inf 1.945910 2.833213 -Inf
## [169] -Inf 2.708050 2.484907 2.484907 2.564949 1.609438 -Inf -Inf
## [177] 2.197225 -Inf 2.639057 1.098612 2.708050 -Inf 1.386294 1.791759
## [185] 1.945910 2.397895 -Inf -Inf -Inf 2.890372 2.564949 2.564949
## [193] -Inf 1.945910 2.890372 1.386294 1.791759 -Inf 1.609438 1.609438
## [201] -Inf -Inf 1.386294 -Inf -Inf 0.000000 -Inf -Inf
## [209] -Inf 2.397895 0.693147 2.639057 2.944439 1.609438 1.098612 2.708050
## [217] -Inf -Inf 2.484907 2.944439 2.708050 -Inf 1.791759 2.197225
## [225] -Inf -Inf -Inf 2.302585 2.564949 -Inf -Inf -Inf
## [233] 2.302585 2.890372 2.397895 2.079442 2.772589 2.079442 -Inf -Inf
## [241] -Inf -Inf -Inf 2.564949 -Inf -Inf 2.995732 -Inf
## [249] -Inf -Inf 2.302585 1.609438 -Inf 1.609438 3.135494 2.079442
## [257] -Inf 2.639057 -Inf 2.302585 2.079442 1.386294 2.708050 1.791759
## [265] 1.609438 2.302585 2.484907 1.945910 -Inf -Inf -Inf -Inf
## [273] -Inf 2.079442 0.693147 2.397895 2.639057 2.484907 0.693147 2.564949
## [281] 2.302585 1.945910 -Inf -Inf 2.397895 2.397895 2.397895 1.386294
## [289] -Inf 3.218876 2.639057 -Inf 2.772589 -Inf 1.098612 2.639057
## [297] 2.564949 2.564949 -Inf 2.833213 0.000000 -Inf 2.564949 2.772589
## [305] 2.484907 3.258097 0.000000 -Inf 2.944439 2.564949 3.367296 2.484907
## [313] 1.609438 1.098612 2.302585 2.079442 1.609438 -Inf 2.302585 2.944439
## [321] 2.484907 1.609438 2.302585 2.890372 1.386294 2.397895 -Inf 2.833213
## [329] 0.000000 2.197225 -Inf 2.708050 2.995732 1.945910 2.197225 2.708050
## [337] 1.791759 -Inf -Inf 1.386294 -Inf -Inf 2.564949 2.302585
## [345] -Inf -Inf -Inf -Inf 2.995732 2.890372 2.833213 2.772589
## [353] 2.639057 2.484907 0.000000 -Inf -Inf 1.098612 2.302585 2.397895
## [361] 1.945910 2.302585 -Inf 0.000000 2.772589 -Inf 2.397895 -Inf
## [369] 2.302585 3.091042 3.091042 -Inf -Inf -Inf 1.945910 1.386294
## [377] 2.944439 -Inf 1.098612 -Inf 2.302585 3.044522 2.944439 -Inf
## [385] 2.708050 2.564949 -Inf 2.639057 2.397895 2.079442 2.197225 -Inf
## [393] 2.564949 2.302585 2.944439 2.833213 1.098612 2.484907 1.945910 -Inf
## attr(,"method")
## [1] "log"
## attr(,"origin")
## [1] 11 16 10 4 3 13 0 15 0 0 9 4 2 11 11 5 0 13 0 16 2 12 6 0 16
## [26] 0 11 0 0 15 0 16 12 13 0 11 0 5 0 0 0 0 0 11 6 0 14 0 0 0
## [51] 18 0 3 13 13 5 0 0 15 4 19 0 0 10 12 0 0 14 20 0 15 16 0 10 5
## [76] 23 10 12 1 0 16 0 4 7 0 0 9 7 7 3 0 11 0 0 5 10 10 5 24 3
## [101] 11 0 0 0 0 8 0 0 2 0 7 12 5 11 9 0 0 0 2 8 11 10 5 0 0
## [126] 0 2 3 3 7 13 3 9 2 0 14 0 0 12 10 10 0 0 7 0 11 0 9 0 13
## [151] 8 17 0 7 10 0 0 8 1 0 0 5 0 0 0 7 17 0 0 15 12 12 13 5 0
## [176] 0 9 0 14 3 15 0 4 6 7 11 0 0 0 18 13 13 0 7 18 4 6 0 5 5
## [201] 0 0 4 0 0 1 0 0 0 11 2 14 19 5 3 15 0 0 12 19 15 0 6 9 0
## [226] 0 0 10 13 0 0 0 10 18 11 8 16 8 0 0 0 0 0 13 0 0 20 0 0 0
## [251] 10 5 0 5 23 8 0 14 0 10 8 4 15 6 5 10 12 7 0 0 0 0 0 8 2
## [276] 11 14 12 2 13 10 7 0 0 11 11 11 4 0 25 14 0 16 0 3 14 13 13 0 17
## [301] 1 0 13 16 12 26 1 0 19 13 29 12 5 3 10 8 5 0 10 19 12 5 10 18 4
## [326] 11 0 17 1 9 0 15 20 7 9 15 6 0 0 4 0 0 13 10 0 0 0 0 20 18
## [351] 17 16 14 12 1 0 0 3 10 11 7 10 0 1 16 0 11 0 10 22 22 0 0 0 7
## [376] 4 19 0 3 0 10 21 19 0 15 13 0 14 11 8 9 0 13 10 19 17 3 12 7 0
## attr(,"class")
## [1] "transform" "numeric"
summary(advertising_log)## * Resolving Skewness with log
##
## * Information of Transformation (before vs after)
## Original Transformation
## n 400.000000 400.000000
## na 0.000000 0.000000
## mean 6.635000 -Inf
## sd 6.650364 NaN
## se_mean 0.332518 NaN
## IQR 12.000000 Inf
## skewness 0.639586 NaN
## kurtosis -0.545118 NaN
## p00 0.000000 -Inf
## p01 0.000000 -Inf
## p05 0.000000 -Inf
## p10 0.000000 -Inf
## p20 0.000000 -Inf
## p25 0.000000 -Inf
## p30 0.000000 -Inf
## p40 2.000000 0.693147
## p50 5.000000 1.609438
## p60 8.400000 2.126555
## p70 11.000000 2.397895
## p75 12.000000 2.484907
## p80 13.000000 2.564949
## p90 16.000000 2.772589
## p95 19.000000 2.944439
## p99 23.010000 3.135920
## p100 29.000000 3.367296
plot(advertising_log)Kategoryzacja danych
Kategoryzacja danych – tzw. BINNING: binning() -
transformuje numeryczne, ilościowe zmienne w jakościowe - kategoryzowane
z etykietami.
Następujące typy kategoryzacji są wspierane:
“quantile” - kategoryzowanie po kwantylach rozkładu zmiennej (zapewnia się w ten sposób takie same liczebności w kolejnych kategoriach)
“equal” - kategoryzowanie tak, aby każda nowo utworzona klasa miała tę samą długość
“pretty” - kompromis między 2 wyżej wymienionymi
“kmeans” - kategoryzacja z użyciem algorytmu K-średnich
“bclust” - kategoryzacja z użyciem algorytmu “bagged clustering”
Przykład 6. Używając metody kwantylowej dokonaj podziału i kategoryzacji zmiennej “Income”:
dochod_kat<-binning(carseats$Income, type="quantile")
summary(dochod_kat)## levels freq rate
## 1 [21,30] 40 0.1000
## 2 (30,39] 37 0.0925
## 3 (39,48] 38 0.0950
## 4 (48,62] 40 0.1000
## 5 (62,69] 42 0.1050
## 6 (69,78] 33 0.0825
## 7 (78,86.56667] 36 0.0900
## 8 (86.56667,96.6] 38 0.0950
## 9 (96.6,108.6333] 38 0.0950
## 10 (108.6333,120] 38 0.0950
## 11 <NA> 20 0.0500
plot(dochod_kat)Możemy też przeprowadzić kategoryzację wg liczby kategorii:
dochod_4<- binning(carseats$Income, nbins = 4, labels = c("niskie","srednie","wysokie", "bwysokie"))
summary(dochod_4)## levels freq rate
## 1 niskie 95 0.2375
## 2 srednie 102 0.2550
## 3 wysokie 89 0.2225
## 4 bwysokie 94 0.2350
## 5 <NA> 20 0.0500
plot(dochod_4)Dyskretyzacja zmiennych
Jeśli zależy nam na stworzeniu nowej zmiennej kategoryzowanej, której rozkład będzie zależny idealnie od zmiennej referencyjnej – w takim przypadku należy skorzystać z algorytmu optymalnej kategoryzacji.
Np. kategoryzacja wg zmiennej Yes/No, wg ryzyka kredytowego itp.
Przykład 7. Kategoryzujemy zmienną “Advertising” wg zmiennej referencyjnej “US”:
# optimal binning
bin <- binning_by(carseats, y="US", x="Advertising")## Warning in binning_by(carseats, y = "US", x = "Advertising"): The factor y has been changed to a numeric vector consisting of 0 and 1.
## 'Yes' changed to 1 (positive) and 'No' changed to 0 (negative).
summary(bin)## ── Binning Table ──────────────────────── Several Metrics ──
## Bin CntRec CntPos CntNeg RatePos RateNeg Odds WoE IV JSD
## 1 [-1,0] 144 19 125 0.07364 0.88028 0.1520 -2.48101 2.00128 0.20093
## 2 (0,6] 69 54 15 0.20930 0.10563 3.6000 0.68380 0.07089 0.00869
## 3 (6,29] 187 185 2 0.71705 0.01408 92.5000 3.93008 2.76272 0.21861
## 4 Total 400 258 142 1.00000 1.00000 1.8169 NA 4.83489 0.42823
## AUC
## 1 0.03241
## 2 0.01883
## 3 0.00903
## 4 0.06028
##
## ── General Metrics ─────────────────────────────────────────
## • Gini index : -0.87944
## • IV (Jeffrey) : 4.83489
## • JS (Jensen-Shannon) Divergence : 0.42823
## • Kolmogorov-Smirnov Statistics : 0.80664
## • HHI (Herfindahl-Hirschman Index) : 0.37791
## • HHI (normalized) : 0.06687
## • Cramer's V : 0.81863
##
## ── Significance Tests ──────────────────── Chisquare Test ──
## Bin A Bin B statistics p_value
## 1 [-1,0] (0,6] 87.6706 0.00000000000000000000773135
## 2 (0,6] (6,29] 34.7335 0.00000000378070568864985667
UWAGA! Co to takiego information value i jak ją się interpretuje? Zwróć uwagę na inne metryki i testy w raporcie.
Możemy w końcu zwizualizować optymalną kategoryzację:
plot(bin)Na koniec - możemy wykonać raport podsumowujący wszystkie operacje związane z czyszczeniem danych:
w formacie PDF (pamiętajmy o Tex, sterownikach…).
w formacie HTML:
#carseats %>% transformation_web_report(target = US, output_format = "html", output_file ="transformation_carseats.html")Zadanie domowe
Korzystając z paczki danych “germancredit” dotyczącą oceny kredytowej (creditability) wybranych klientów pewnego banku:
Czy w zbiorze danych mamy obserwacje brakujące?
Proszę dokonać kategoryzacji zmiennej “age.in.years” (wiek w latach) wg oceny kredytowej “creditability”.
Podaj i zinterpretuj wskaźniki informacyjne. Oceń skośność zmiennych ilościowych.
Sprawdź, czy nie mamy obserwacji odstających dla zmiennej “age.in.years” (wiek w latach). Jeśli są - dokonaj imputacji wybraną przez siebie metodą.
data("germancredit")
attach(germancredit)
sum(is.na(germancredit))## [1] 0
aggr(germancredit)bin<- binning_by(germancredit, y="creditability", x="age.in.years")## Warning in binning_by(germancredit, y = "creditability", x = "age.in.years"): The factor y has been changed to a numeric vector consisting of 0 and 1.
## 'good' changed to 1 (positive) and 'bad' changed to 0 (negative).
plot(bin)summary(bin)## ── Binning Table ──────────────────────── Several Metrics ──
## Bin CntRec CntPos CntNeg RatePos RateNeg Odds WoE IV JSD
## 1 [19,25] 190 110 80 0.15714 0.26667 1.37500 -0.52884 0.05792 0.00716
## 2 (25,75] 810 590 220 0.84286 0.73333 2.68182 0.13920 0.01525 0.00190
## 3 Total 1000 700 300 1.00000 1.00000 2.33333 NA 0.07317 0.00906
## AUC
## 1 0.02095
## 2 0.42429
## 3 0.44524
##
## ── General Metrics ─────────────────────────────────────────
## • Gini index : -0.10952
## • IV (Jeffrey) : 0.07317
## • JS (Jensen-Shannon) Divergence : 0.00906
## • Kolmogorov-Smirnov Statistics : 0.10952
## • HHI (Herfindahl-Hirschman Index) : 0.6922
## • HHI (normalized) : 0.3844
## • Cramer's V : 0.12794
##
## ── Significance Tests ──────────────────── Chisquare Test ──
## Bin A Bin B statistics p_value
## 1 [19,25] (25,75] 16.3681 0.0000521562
find_skewness(germancredit, value = TRUE ,thres = 0.1)## duration.in.month
## 1.093
## credit.amount
## 1.947
## installment.rate.in.percentage.of.disposable.income
## -0.531
## present.residence.since
## -0.272
## age.in.years
## 1.019
## number.of.existing.credits.at.this.bank
## 1.271
## number.of.people.being.liable.to.provide.maintenance.for
## 1.907
boxplot(germancredit$age.in.years)wiek<-imputate_outlier(germancredit, age.in.years, method="capping")
summary(wiek)## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## described_variables "value" "value"
## n "1000" "1000"
## na "0" "0"
## mean "35.546" "35.350"
## sd "11.3755" "10.8530"
## se_mean "0.359724" "0.343202"
## IQR "15" "15"
## skewness "1.020739" "0.821878"
## kurtosis " 0.595780" "-0.132573"
## p00 "19" "19"
## p01 "20" "20"
## p05 "22" "22"
## p10 "23" "23"
## p20 "26" "26"
## p25 "27" "27"
## p30 "28" "28"
## p40 "30" "30"
## p50 "33" "33"
## p60 "36" "36"
## p70 "39" "39"
## p75 "42" "42"
## p80 "45" "45"
## p90 "52" "52"
## p95 "60" "60"
## p99 "67.01" "63.00"
## p100 "75" "64"
plot(wiek)extract(bin)## [1] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [10] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25]
## [19] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [28] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [37] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75]
## [46] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
## [55] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75]
## [64] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75]
## [73] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
## [82] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [91] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
## [100] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
## [109] (25,75] (25,75] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75]
## [118] (25,75] [19,25] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75]
## [127] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25]
## [136] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [145] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [154] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [163] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75]
## [172] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [181] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [190] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [199] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [208] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [217] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [226] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] [19,25]
## [235] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [244] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
## [253] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
## [262] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [271] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [280] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [289] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [298] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [307] [19,25] (25,75] [19,25] [19,25] (25,75] [19,25] (25,75] [19,25] (25,75]
## [316] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [325] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [334] [19,25] [19,25] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75]
## [343] [19,25] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25]
## [352] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25]
## [361] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75]
## [370] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
## [379] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] [19,25] [19,25]
## [388] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75]
## [397] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [406] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [415] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75]
## [424] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [433] (25,75] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75]
## [442] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [451] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [460] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [469] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25]
## [478] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
## [487] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [496] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [505] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [514] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25]
## [523] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [532] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
## [541] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [550] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
## [559] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75]
## [568] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75] [19,25]
## [577] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [586] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [595] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [604] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
## [613] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
## [622] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [631] (25,75] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75]
## [640] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [649] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
## [658] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75]
## [667] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
## [676] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [685] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75]
## [694] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [703] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75]
## [712] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [721] (25,75] [19,25] [19,25] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75]
## [730] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
## [739] (25,75] (25,75] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] [19,25]
## [748] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75]
## [757] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75]
## [766] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
## [775] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75]
## [784] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [793] (25,75] (25,75] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75]
## [802] (25,75] [19,25] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75]
## [811] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
## [820] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [829] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75] [19,25]
## [838] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [847] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [856] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [865] [19,25] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [874] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [883] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
## [892] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [901] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75]
## [910] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
## [919] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25]
## [928] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75]
## [937] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [946] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
## [955] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
## [964] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75]
## [973] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
## [982] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75]
## [991] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [1000] (25,75]
## Levels: [19,25] < (25,75]
Po więcej informacji nt. pakietu ‘dlookr’ zapraszam na jego stronę domową z rozwiązanymi przykładami.