Podstawowe operacje w R - część 4.

Czyszczenie danych

Walidacja danych

Definicja

An activity, checking whether a combination of values comes from a predefined set of allowed value combinations.

Przykłady

  • Czy zyski są trzymane jako liczba (nie text)?

  • Czy wiek >= 0?

  • Kiedy wiek < 15, czy status_pracy == “brak pracy”?

  • Czy średni obrót w firmie jest dodatni?

Reguły

Rozpatrzmy następujący zbiór danych. Co widzisz w początkowych wierszach?

data(retailers)
head(retailers[3:7],3)
##   staff turnover other.rev total.rev staff.costs
## 1    75       NA        NA      1130          NA
## 2     9     1607        NA      1607         131
## 3    NA     6886       -33      6919         324
# dodajmy ID
retailers$rec_id <- sprintf("%03d",1:nrow(retailers))

Zdefiniujmy szereg reguł, które muszą spełniać nasze dane.

rules <- validator(
turnover + other.rev == total.rev
, turnover >= 0
, other.rev >= 0
, total.rev >= 0
, if (staff > 0) staff.costs > 0
)

Pora zobaczyć, czy nasz zbiór danych spełnia te reguły…

cf <- confront(retailers, rules, key="rec_id")
summary(cf)
##   name items passes fails nNA error warning
## 1   V1    60     19     4  37 FALSE   FALSE
## 2   V2    60     56     0   4 FALSE   FALSE
## 3   V3    60     23     1  36 FALSE   FALSE
## 4   V4    60     58     0   2 FALSE   FALSE
## 5   V5    60     50     0  10 FALSE   FALSE
##                                            expression
## 1 abs(turnover + other.rev - total.rev) <= 0.00000001
## 2                         turnover - 0 >= -0.00000001
## 3                        other.rev - 0 >= -0.00000001
## 4                        total.rev - 0 >= -0.00000001
## 5                      staff <= 0 | (staff.costs > 0)
barplot(cf, main="retailers")
## Warning: The 'barplot' method for confrontation objects is deprecated. Use
## 'plot' instead

#as.data.frame(cf) %>% head()

Zarządzanie regułami

Problem

Zestawy reguł często rosną organicznie i prawie nie są przycinane. W rezultacie, mogą zawierać redundancje lub sprzeczności, które sprawiają, że zestaw reguł staje się nieefektywny i trudny do zrozumienia.

Pomysł

Automatycznie wyeliminuj nadmiarowość i sprzeczności (na tyle jak to możliwe)!

rules
## Object of class 'validator' with 5 elements:
##  V1: turnover + other.rev == total.rev
##  V2: turnover >= 0
##  V3: other.rev >= 0
##  V4: total.rev >= 0
##  V5: staff <= 0 | (staff.costs > 0)
rules <- simplify_rules(rules)
## No fixed values found.

Modyfikacje danych

Obserwacja Do około 50% zmian w danych można wykonać ręcznie (lub skryptowo) na podstawie bezpośredniego wkładu ekspertów dziedzinowych.

Pytanie Czy możemy wesprzeć wykorzystanie wiedzy z domeny zewnętrznej w procesie czyszczenia danych, jednocześnie oddzielając ją od kodu?

Spójrzmy na możliwości pakietu ‘dcmodify’.

  1. Zdefiniuje on reguły modyfikujące w wierszu poleceń lub oddzielnym pliku tekstowym

  2. Dodaje metadane do reguł modyfikujących

  3. Czyta, sprawdza, manipuluje i stosuje reguły dla danych

m <- modifier(
if (other.rev < 0) other.rev <- -1 * other.rev
)
modified <- modify(retailers, m)
head(modified[3:7], 3)
##   staff turnover other.rev total.rev staff.costs
## 1    75       NA        NA      1130          NA
## 2     9     1607        NA      1607         131
## 3    NA     6886        33      6919         324

Lokalizacja błędów

Pytanie

Wiedząc, że rekord narusza szereg zasad, jakie pola mam zmienić, żebym to naprawić?

Odpowiedź

Znajdź najmniejszą (ważoną) liczbę pól, których wartości mogą być zastąpione, aby wszystkie zasady mogły być spełnione!

error_locations <- locate_errors(modified, rules)
values(error_locations)[30:37, 3:7]
##      staff turnover other.rev total.rev staff.costs
## [1,] FALSE     TRUE     FALSE     FALSE       FALSE
## [2,] FALSE    FALSE     FALSE     FALSE       FALSE
## [3,] FALSE    FALSE        NA      TRUE       FALSE
## [4,] FALSE    FALSE     FALSE     FALSE       FALSE
## [5,] FALSE    FALSE        NA     FALSE       FALSE
## [6,] FALSE    FALSE     FALSE     FALSE       FALSE
## [7,] FALSE    FALSE      TRUE     FALSE       FALSE
## [8,] FALSE     TRUE     FALSE     FALSE       FALSE
summary(error_locations)
## Variable:
##           name errors missing
## 4     turnover      2       4
## 5    other.rev      1      36
## 6    total.rev      1       2
## 1         size      0       0
## 2    incl.prob      0       0
## 3        staff      0       6
## 7  staff.costs      0      10
## 8  total.costs      0       5
## 9       profit      0       5
## 10         vat      0      12
## 11      rec_id      0       0
## Errors per record:
##   errors records
## 1      0      56
## 2      1       4

Zamieńmy błędne pola na NA:

fixable_data <- replace_errors(retailers, rules)
# check nr of missings
sum(is.na(retailers))
## [1] 80
sum(is.na(fixable_data))
## [1] 85

Dedukcyjne czyszczenie danych

impute_lr

Wyprowadzaj unikalne imputacje (tam, gdzie to możliwe) na podstawie ograniczeń liniowych. Jeśli total.rev = 0, wtedy turnover oraz other.rev muszą być równe 0.

turnover + other.rev == total.rev

turnover >= 0

other.rev >= 0

lr_imputed <- impute_lr(fixable_data, rules)
# check nr of imputations using validate::cells
cells(start=retailers, fixable=fixable_data, impute_lr=lr_imputed
, compare='sequential')
## Object of class cellComparison:
## 
##    cells(start = retailers, fixable = fixable_data, impute_lr = lr_imputed, compare = "sequential")
## 
##                 start fixable impute_lr
## cells             660     660       660
## available         580     575       611
## still_available   580     575       575
## unadapted         580     575       575
## adapted             0       0         0
## imputed             0       0        36
## missing            80      85        49
## still_missing      80      80        49
## removed             0       5         0

correct_typos

Sprawdź, czy ograniczenia równowagi liniowej można naprawić, zakładając błąd typograficzny w jednej z liczb.

Jeśli turnover = 100, other.rev = 50 oraz total.rev = 105, zamiana ostatnich 2 cyfr w total.rev naprawia błąd.

typos_corrected <- correct_typos(lr_imputed,rules[1:3])

# Compare progress on rule violation using validate::compare
compare(rules, lr_imputed, typos_corrected)
## Object of class validatorComparison:
## 
##    compare(x = rules, lr_imputed, typos_corrected)
## 
##                     Version
## Status               D0001 D0002
##   validations          240   240
##   verifiable           216   216
##   unverifiable          24    24
##   still_unverifiable    24    24
##   new_unverifiable       0     0
##   satisfied            216   216
##   still_satisfied      216   216
##   new_satisfied          0     0
##   violated               0     0
##   still_violated         0     0
##   new_violated           0     0

Brakujące obserwacje

Missing Completely at Random (MCAR)

MCAR to wartości w zbiorze danych, które brakują całkowicie losowo (MCAR), tj. jeśli zdarzenia, które prowadzą do braku danego elementu danych są niezależne zarówno od obserwowalnych zmiennych, jak i nieobserwowalnych parametrów i występują całkowicie losowo.

Kiedy dane są MCAR, analiza przeprowadzona na danych jest nieobciążona; jednakże dane rzadko są MCAR.

W przypadku MCAR, brak danych nie jest związany z żadną zmienną w badaniu.

Missing at Random (MAR)

Brak losowy (MAR) występuje wtedy, gdy brak nie jest losowy, ale gdy brak może być w pełni wyjaśniony przez zmienne, o których istnieje pełna informacja.

Ponieważ MAR jest założeniem, którego nie da się zweryfikować statystycznie, musimy polegać na jego merytorycznej zasadności.

Przykładem może być to, że mężczyźni rzadziej wypełniają ankietę dotyczącą depresji, ale nie ma to nic wspólnego z ich poziomem depresji, po uwzględnieniu płci.

Missing Not at Random (MNAR)

MNAR (znane również jako nonignorable nonresponse) to dane, które nie są ani MAR, ani MCAR (tj. wartość zmiennej, której brakuje, jest związana z powodem jej braku).

Rozszerzając poprzedni przykład, taka sytuacja miałaby miejsce, gdyby mężczyźni nie wypełnili ankiety dotyczącej depresji z powodu poziomu depresji.

Analiza braków danych

Jak dokonywać detekcji oraz imputacji przedstawiono na poniższym schemacie:

Imputacja/interpolacja - zastępowanie średnią, medianą, dominantą, modelem, zaawansowane algorytmy, usuwanie - rozwiązania są dostępne w następujących pakietach:

Metody dla zmiennej ilościowej

  • zastępowanie średnią, medianą, dominantą

  • KNN – K-najbliższych sąsiadów

  • RPART – drzewa losowe

  • “mice” - Multivariate Imputation by Chained Equations – wielowymiarowe wypełnianie przez równania łańcuchowe

Metody dla zmiennej jakościowej

  • dominanta

  • RPART - drzewa losowe

  • “mice”

Wizualizacja brakujących obserwacji

VIM::aggr(typos_corrected[3:7])

Inspekcja brakujących danych

VIM::pbox(typos_corrected[3:7], pos=1, las=2)

SIMPUTACJE

Tzw. “simputacje” dostarczają:

  • jednolity interfejs,

  • konsekwentnym zachowanie wśród powszechnie stosowanych metodologii,

  • ułatwiają eksperymenty,

  • konfigurację danych do produkcji,

  • integrację z innymi etapami procesu.

impute_<model>(data, <imputed vars> ~ <predictor vars>)

Przykład: liniowa imputacja

typos_corrected[3:7] %>%
impute_lm(other.rev ~ turnover) %>%
head(3)
##   staff turnover other.rev total.rev staff.costs
## 1    75       NA        NA      1130          NA
## 2     9     1607         0      1607         131
## 3    NA     6886        33      6919         324

Łańcuchowe imputacje:

typos_corrected[3:7] %>%
impute_lm(other.rev ~ turnover + staff) %>%
impute_lm(other.rev ~ staff) %>%
head(3)
##   staff turnover other.rev total.rev staff.costs
## 1    75       NA   207.312      1130          NA
## 2     9     1607     0.000      1607         131
## 3    NA     6886    33.000      6919         324

Odporne (“robust”) imputacje:

typos_corrected[3:7] %>%
impute_rlm(other.rev ~ turnover + staff) %>%
impute_rlm(other.rev ~ staff) %>%
head(3)
##   staff turnover other.rev total.rev staff.costs
## 1    75       NA   55.3733      1130          NA
## 2     9     1607    0.0000      1607         131
## 3    NA     6886   33.0000      6919         324

Wiele zmiennych, te same predyktory:

typos_corrected %>%
impute_rlm(other.rev + total.rev ~ turnover)
##    size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1   sc0      0.02    75       NA        NA    1130.0          NA       18915
## 2   sc3      0.14     9     1607    0.0000    1607.0         131        1544
## 3   sc3      0.14    NA     6886   33.0000    6919.0         324        6493
## 4   sc3      0.14    NA     3861   13.0000    3874.0         290        3600
## 5   sc3      0.14    NA     5565   37.0000    5602.0         314        5530
## 6   sc0      0.02     1       25    0.0000      25.0          NA          22
## 7   sc3      0.14     5       NA        NA    1335.0         135         136
## 8   sc1      0.02     3      404   13.0000     417.0          NA         342
## 9   sc3      0.14     6     2596    0.0000    2596.0         147        2486
## 10  sc2      0.05     5       NA        NA        NA          NA          NA
## 11  sc2      0.05     5      645    0.0000     645.0         130         636
## 12  sc2      0.05     5     2872    0.0000    2872.0         182        2652
## 13  sc3      0.14    13     5678   12.0000    5690.0         326        5656
## 14  sc1      0.02    NA   931397    0.0000  931397.0       36872      841489
## 15  sc1      0.02     3    80000    6.3938   80006.4       40000          NA
## 16  sc0      0.02    52     9067  622.0000    9689.0        1125        9911
## 17  sc3      0.14    10     1500   20.0000    1520.0         195        1384
## 18  sc1      0.02     4      440    0.0000     440.0          16         379
## 19  sc2      0.05     3      690    0.0000     690.0       19000      464507
## 20  sc3      0.14     8     1852    0.0000    1852.0         120        1812
## 21  sc0      0.02     2      359    9.0000     368.0          NA         339
## 22  sc0      0.02     3      839    0.0000     839.0           2         717
## 23  sc1      0.02     2      471    0.0000     471.0          34         411
## 24  sc1      0.02     4      933    2.0000     935.0          31         814
## 25  sc2      0.05     3     1665    0.0000    1665.0          70         186
## 26  sc3      0.14     6     2318    0.0000    2318.0         184         390
## 27  sc2      0.05     2     1175   12.0000    1187.0         114          NA
## 28  sc3      0.14    16     2946    7.0000    2953.0         245        2870
## 29  sc0      0.02     1      492    0.0000     492.0          NA         470
## 30  sc2      0.05     6     1831 1831.0000    3662.0          53        1443
## 31  sc3      0.14    29     7271   30.0000    7301.0         451        7242
## 32  sc2      0.05     8       NA        NA     107.0          28          95
## 33  sc3      0.14    13     4118   11.0000    4129.0          57        3601
## 34  sc3      0.14     9     2803    0.0000    2803.0         106        2643
## 35  sc3      0.14    15     2876   33.0000    2909.0         539        2627
## 36  sc3      0.14    14     2649   98.0000    2747.0      221302     2725410
## 37  sc2      0.05     6      202    4.0000     206.0          64         170
## 38  sc2      0.05    53     9842    0.0000    9842.0         837       10000
## 39  sc2      0.05     7     2463   38.0000    2501.0          87        2347
## 40  sc3      0.14    NA     4445   98.0000    4543.0         369        4266
## 41  sc3      0.14    20     3284   11.0000    3295.0         181        3168
## 42  sc2      0.05     2      814    0.0000     814.0         107         175
## 43  sc1      0.02    NA     1210    0.0000    1210.0          52        1124
## 44  sc0      0.02     1      343    0.0000     343.0          NA          NA
## 45  sc2      0.05     3      952    0.0000     952.0          79          NA
## 46  sc0      0.02     1       41    0.0000      41.0          NA          32
## 47  sc3      0.14    60     3633    0.0000    3633.0         257        3626
## 48  sc3      0.14     8     2906    0.0000    2906.0         144         453
## 49  sc3      0.14    10     2333    6.0000    2339.0         193        2353
## 50  sc3      0.14    12     2275    5.0000    2280.0         222        2302
## 51  sc2      0.05     7     1728    0.0000    1728.0         153        1681
## 52  sc3      0.14    24     6872   32.0000    6904.0         485        6729
## 53  sc3      0.14    29     3571   76.0000    3647.0         311        3554
## 54  sc3      0.14    11     1021    0.0000    1021.0         235         472
## 55  sc0      0.02     1      197    0.0000     197.0          NA         168
## 56  sc2      0.05     7      917    0.0000     917.0          30         781
## 57  sc2      0.05     8     2000    0.0000    2000.0          NA        1700
## 58  sc3      0.14     3      200    0.0000     200.0          49         177
## 59  sc2      0.05     4      342    0.0000     342.0          30         299
## 60  sc2      0.05     6        1 1410.0000    1411.0         179        1215
##    profit  vat rec_id
## 1   20045   NA    001
## 2      63   NA    002
## 3     426   NA    003
## 4     274   NA    004
## 5      72   NA    005
## 6       3   NA    006
## 7       1 1346    007
## 8      75   NA    008
## 9     110   NA    009
## 10     NA   NA    010
## 11      9   NA    011
## 12    220   NA    012
## 13     34   NA    013
## 14  89908  863    014
## 15     NA  813    015
## 16   -222  964    016
## 17    136  733    017
## 18     60  296    018
## 19 225493  486    019
## 20     40 1312    020
## 21     29  257    021
## 22    122  654    022
## 23     60  377    023
## 24    121  811    024
## 25   1478 1472    025
## 26     86 2082    026
## 27     17 1058    027
## 28     83 2670    028
## 29     22  449    029
## 30    388 1695    030
## 31     59 6754    031
## 32    100  905    032
## 33    528 3841    033
## 34    160 2668    034
## 35    282 2758    035
## 36  22457 2548    036
## 37     37  995    037
## 38   -160 9655    038
## 39    154 2441    039
## 40    277 4412    040
## 41    127 3263    041
## 42     NA  810    042
## 43     86 1205    043
## 44     NA  343    044
## 45    149  952    045
## 46      9   41    046
## 47      7 3634    047
## 48     53 2907    048
## 49    -14 2335    049
## 50    -22 2277    050
## 51     47 1742    051
## 52    174 6959    052
## 53     93 3700    053
## 54    549 1067    054
## 55     30  221    055
## 56    136 1030    056
## 57     NA 2271    057
## 58    222  251    058
## 59     43 1068    059
## 60    196 1389    060
typos_corrected %>%
impute_rlm( . - turnover ~ turnover)
##    size incl.prob    staff turnover other.rev total.rev staff.costs total.costs
## 1   sc0      0.02 75.00000       NA        NA    1130.0          NA   18915.000
## 2   sc3      0.14  9.00000     1607    0.0000    1607.0    131.0000    1544.000
## 3   sc3      0.14  7.65358     6886   33.0000    6919.0    324.0000    6493.000
## 4   sc3      0.14  7.60328     3861   13.0000    3874.0    290.0000    3600.000
## 5   sc3      0.14  7.63162     5565   37.0000    5602.0    314.0000    5530.000
## 6   sc0      0.02  1.00000       25    0.0000      25.0     91.0412      22.000
## 7   sc3      0.14  5.00000       NA        NA    1335.0    135.0000     136.000
## 8   sc1      0.02  3.00000      404   13.0000     417.0    106.0146     342.000
## 9   sc3      0.14  6.00000     2596    0.0000    2596.0    147.0000    2486.000
## 10  sc2      0.05  5.00000       NA        NA        NA          NA          NA
## 11  sc2      0.05  5.00000      645    0.0000     645.0    130.0000     636.000
## 12  sc2      0.05  5.00000     2872    0.0000    2872.0    182.0000    2652.000
## 13  sc3      0.14 13.00000     5678   12.0000    5690.0    326.0000    5656.000
## 14  sc1      0.02 23.02605   931397    0.0000  931397.0  36872.0000  841489.000
## 15  sc1      0.02  3.00000    80000    6.3938   80006.4  40000.0000   72369.653
## 16  sc0      0.02 52.00000     9067  622.0000    9689.0   1125.0000    9911.000
## 17  sc3      0.14 10.00000     1500   20.0000    1520.0    195.0000    1384.000
## 18  sc1      0.02  4.00000      440    0.0000     440.0     16.0000     379.000
## 19  sc2      0.05  3.00000      690    0.0000     690.0  19000.0000  464507.000
## 20  sc3      0.14  8.00000     1852    0.0000    1852.0    120.0000    1812.000
## 21  sc0      0.02  2.00000      359    9.0000     368.0    104.2367     339.000
## 22  sc0      0.02  3.00000      839    0.0000     839.0      2.0000     717.000
## 23  sc1      0.02  2.00000      471    0.0000     471.0     34.0000     411.000
## 24  sc1      0.02  4.00000      933    2.0000     935.0     31.0000     814.000
## 25  sc2      0.05  3.00000     1665    0.0000    1665.0     70.0000     186.000
## 26  sc3      0.14  6.00000     2318    0.0000    2318.0    184.0000     390.000
## 27  sc2      0.05  2.00000     1175   12.0000    1187.0    114.0000    1160.898
## 28  sc3      0.14 16.00000     2946    7.0000    2953.0    245.0000    2870.000
## 29  sc0      0.02  1.00000      492    0.0000     492.0    109.4912     470.000
## 30  sc2      0.05  6.00000     1831 1831.0000    3662.0     53.0000    1443.000
## 31  sc3      0.14 29.00000     7271   30.0000    7301.0    451.0000    7242.000
## 32  sc2      0.05  8.00000       NA        NA     107.0     28.0000      95.000
## 33  sc3      0.14 13.00000     4118   11.0000    4129.0     57.0000    3601.000
## 34  sc3      0.14  9.00000     2803    0.0000    2803.0    106.0000    2643.000
## 35  sc3      0.14 15.00000     2876   33.0000    2909.0    539.0000    2627.000
## 36  sc3      0.14 14.00000     2649   98.0000    2747.0 221302.0000 2725410.000
## 37  sc2      0.05  6.00000      202    4.0000     206.0     64.0000     170.000
## 38  sc2      0.05 53.00000     9842    0.0000    9842.0    837.0000   10000.000
## 39  sc2      0.05  7.00000     2463   38.0000    2501.0     87.0000    2347.000
## 40  sc3      0.14  7.61299     4445   98.0000    4543.0    369.0000    4266.000
## 41  sc3      0.14 20.00000     3284   11.0000    3295.0    181.0000    3168.000
## 42  sc2      0.05  2.00000      814    0.0000     814.0    107.0000     175.000
## 43  sc1      0.02  7.55920     1210    0.0000    1210.0     52.0000    1124.000
## 44  sc0      0.02  1.00000      343    0.0000     343.0    103.6046     409.287
## 45  sc2      0.05  3.00000      952    0.0000     952.0     79.0000     959.445
## 46  sc0      0.02  1.00000       41    0.0000      41.0     91.6733      32.000
## 47  sc3      0.14 60.00000     3633    0.0000    3633.0    257.0000    3626.000
## 48  sc3      0.14  8.00000     2906    0.0000    2906.0    144.0000     453.000
## 49  sc3      0.14 10.00000     2333    6.0000    2339.0    193.0000    2353.000
## 50  sc3      0.14 12.00000     2275    5.0000    2280.0    222.0000    2302.000
## 51  sc2      0.05  7.00000     1728    0.0000    1728.0    153.0000    1681.000
## 52  sc3      0.14 24.00000     6872   32.0000    6904.0    485.0000    6729.000
## 53  sc3      0.14 29.00000     3571   76.0000    3647.0    311.0000    3554.000
## 54  sc3      0.14 11.00000     1021    0.0000    1021.0    235.0000     472.000
## 55  sc0      0.02  1.00000      197    0.0000     197.0     97.8365     168.000
## 56  sc2      0.05  7.00000      917    0.0000     917.0     30.0000     781.000
## 57  sc2      0.05  8.00000     2000    0.0000    2000.0    169.0685    1700.000
## 58  sc3      0.14  3.00000      200    0.0000     200.0     49.0000     177.000
## 59  sc2      0.05  4.00000      342    0.0000     342.0     30.0000     299.000
## 60  sc2      0.05  6.00000        1 1410.0000    1411.0    179.0000    1215.000
##         profit     vat rec_id
## 1   20045.0000      NA    001
## 2      63.0000 1669.26    002
## 3     426.0000 1664.79    003
## 4     274.0000 1667.35    004
## 5      72.0000 1665.91    005
## 6       3.0000 1670.60    006
## 7       1.0000 1346.00    007
## 8      75.0000 1670.28    008
## 9     110.0000 1668.43    009
## 10          NA      NA    010
## 11      9.0000 1670.08    011
## 12    220.0000 1668.19    012
## 13     34.0000 1665.82    013
## 14  89908.0000  863.00    014
## 15   7659.4122  813.00    015
## 16   -222.0000  964.00    016
## 17    136.0000  733.00    017
## 18     60.0000  296.00    018
## 19 225493.0000  486.00    019
## 20     40.0000 1312.00    020
## 21     29.0000  257.00    021
## 22    122.0000  654.00    022
## 23     60.0000  377.00    023
## 24    121.0000  811.00    024
## 25   1478.0000 1472.00    025
## 26     86.0000 2082.00    026
## 27     17.0000 1058.00    027
## 28     83.0000 2670.00    028
## 29     22.0000  449.00    029
## 30    388.0000 1695.00    030
## 31     59.0000 6754.00    031
## 32    100.0000  905.00    032
## 33    528.0000 3841.00    033
## 34    160.0000 2668.00    034
## 35    282.0000 2758.00    035
## 36  22457.0000 2548.00    036
## 37     37.0000  995.00    037
## 38   -160.0000 9655.00    038
## 39    154.0000 2441.00    039
## 40    277.0000 4412.00    040
## 41    127.0000 3263.00    041
## 42     10.8422  810.00    042
## 43     86.0000 1205.00    043
## 44    -34.6517  343.00    044
## 45    149.0000  952.00    045
## 46      9.0000   41.00    046
## 47      7.0000 3634.00    047
## 48     53.0000 2907.00    048
## 49    -14.0000 2335.00    049
## 50    -22.0000 2277.00    050
## 51     47.0000 1742.00    051
## 52    174.0000 6959.00    052
## 53     93.0000 3700.00    053
## 54    549.0000 1067.00    054
## 55     30.0000  221.00    055
## 56    136.0000 1030.00    056
## 57    125.3979 2271.00    057
## 58    222.0000  251.00    058
## 59     43.0000 1068.00    059
## 60    196.0000 1389.00    060

Przykład: grupowanie

typos_corrected %>%
impute_rlm(total.rev ~ turnover | size)
##    size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1   sc0      0.02    75       NA        NA    1130.0          NA       18915
## 2   sc3      0.14     9     1607         0    1607.0         131        1544
## 3   sc3      0.14    NA     6886        33    6919.0         324        6493
## 4   sc3      0.14    NA     3861        13    3874.0         290        3600
## 5   sc3      0.14    NA     5565        37    5602.0         314        5530
## 6   sc0      0.02     1       25         0      25.0          NA          22
## 7   sc3      0.14     5       NA        NA    1335.0         135         136
## 8   sc1      0.02     3      404        13     417.0          NA         342
## 9   sc3      0.14     6     2596         0    2596.0         147        2486
## 10  sc2      0.05     5       NA        NA        NA          NA          NA
## 11  sc2      0.05     5      645         0     645.0         130         636
## 12  sc2      0.05     5     2872         0    2872.0         182        2652
## 13  sc3      0.14    13     5678        12    5690.0         326        5656
## 14  sc1      0.02    NA   931397         0  931397.0       36872      841489
## 15  sc1      0.02     3    80000        NA   80000.9       40000          NA
## 16  sc0      0.02    52     9067       622    9689.0        1125        9911
## 17  sc3      0.14    10     1500        20    1520.0         195        1384
## 18  sc1      0.02     4      440         0     440.0          16         379
## 19  sc2      0.05     3      690         0     690.0       19000      464507
## 20  sc3      0.14     8     1852         0    1852.0         120        1812
## 21  sc0      0.02     2      359         9     368.0          NA         339
## 22  sc0      0.02     3      839         0     839.0           2         717
## 23  sc1      0.02     2      471         0     471.0          34         411
## 24  sc1      0.02     4      933         2     935.0          31         814
## 25  sc2      0.05     3     1665         0    1665.0          70         186
## 26  sc3      0.14     6     2318         0    2318.0         184         390
## 27  sc2      0.05     2     1175        12    1187.0         114          NA
## 28  sc3      0.14    16     2946         7    2953.0         245        2870
## 29  sc0      0.02     1      492         0     492.0          NA         470
## 30  sc2      0.05     6     1831      1831    3662.0          53        1443
## 31  sc3      0.14    29     7271        30    7301.0         451        7242
## 32  sc2      0.05     8       NA        NA     107.0          28          95
## 33  sc3      0.14    13     4118        11    4129.0          57        3601
## 34  sc3      0.14     9     2803         0    2803.0         106        2643
## 35  sc3      0.14    15     2876        33    2909.0         539        2627
## 36  sc3      0.14    14     2649        98    2747.0      221302     2725410
## 37  sc2      0.05     6      202         4     206.0          64         170
## 38  sc2      0.05    53     9842         0    9842.0         837       10000
## 39  sc2      0.05     7     2463        38    2501.0          87        2347
## 40  sc3      0.14    NA     4445        98    4543.0         369        4266
## 41  sc3      0.14    20     3284        11    3295.0         181        3168
## 42  sc2      0.05     2      814         0     814.0         107         175
## 43  sc1      0.02    NA     1210         0    1210.0          52        1124
## 44  sc0      0.02     1      343         0     343.0          NA          NA
## 45  sc2      0.05     3      952         0     952.0          79          NA
## 46  sc0      0.02     1       41         0      41.0          NA          32
## 47  sc3      0.14    60     3633         0    3633.0         257        3626
## 48  sc3      0.14     8     2906         0    2906.0         144         453
## 49  sc3      0.14    10     2333         6    2339.0         193        2353
## 50  sc3      0.14    12     2275         5    2280.0         222        2302
## 51  sc2      0.05     7     1728         0    1728.0         153        1681
## 52  sc3      0.14    24     6872        32    6904.0         485        6729
## 53  sc3      0.14    29     3571        76    3647.0         311        3554
## 54  sc3      0.14    11     1021         0    1021.0         235         472
## 55  sc0      0.02     1      197         0     197.0          NA         168
## 56  sc2      0.05     7      917         0     917.0          30         781
## 57  sc2      0.05     8     2000         0    2000.0          NA        1700
## 58  sc3      0.14     3      200         0     200.0          49         177
## 59  sc2      0.05     4      342         0     342.0          30         299
## 60  sc2      0.05     6        1      1410    1411.0         179        1215
##    profit  vat rec_id
## 1   20045   NA    001
## 2      63   NA    002
## 3     426   NA    003
## 4     274   NA    004
## 5      72   NA    005
## 6       3   NA    006
## 7       1 1346    007
## 8      75   NA    008
## 9     110   NA    009
## 10     NA   NA    010
## 11      9   NA    011
## 12    220   NA    012
## 13     34   NA    013
## 14  89908  863    014
## 15     NA  813    015
## 16   -222  964    016
## 17    136  733    017
## 18     60  296    018
## 19 225493  486    019
## 20     40 1312    020
## 21     29  257    021
## 22    122  654    022
## 23     60  377    023
## 24    121  811    024
## 25   1478 1472    025
## 26     86 2082    026
## 27     17 1058    027
## 28     83 2670    028
## 29     22  449    029
## 30    388 1695    030
## 31     59 6754    031
## 32    100  905    032
## 33    528 3841    033
## 34    160 2668    034
## 35    282 2758    035
## 36  22457 2548    036
## 37     37  995    037
## 38   -160 9655    038
## 39    154 2441    039
## 40    277 4412    040
## 41    127 3263    041
## 42     NA  810    042
## 43     86 1205    043
## 44     NA  343    044
## 45    149  952    045
## 46      9   41    046
## 47      7 3634    047
## 48     53 2907    048
## 49    -14 2335    049
## 50    -22 2277    050
## 51     47 1742    051
## 52    174 6959    052
## 53     93 3700    053
## 54    549 1067    054
## 55     30  221    055
## 56    136 1030    056
## 57     NA 2271    057
## 58    222  251    058
## 59     43 1068    059
## 60    196 1389    060
# or, using dplyr::group_by
typos_corrected %>%
group_by(size) %>%
impute_rlm(total.rev ~ turnover)
## # A tibble: 60 × 11
## # Groups:   size [4]
##    size  incl.prob staff turnover other.rev total…¹ staff…² total…³ profit   vat
##  * <fct>     <dbl> <int>    <dbl>     <dbl>   <dbl>   <int>   <int>  <int> <int>
##  1 sc0        0.02    75       NA        NA    1130      NA   18915  20045    NA
##  2 sc3        0.14     9     1607         0    1607     131    1544     63    NA
##  3 sc3        0.14    NA     6886        33    6919     324    6493    426    NA
##  4 sc3        0.14    NA     3861        13    3874     290    3600    274    NA
##  5 sc3        0.14    NA     5565        37    5602     314    5530     72    NA
##  6 sc0        0.02     1       25         0      25      NA      22      3    NA
##  7 sc3        0.14     5       NA        NA    1335     135     136      1  1346
##  8 sc1        0.02     3      404        13     417      NA     342     75    NA
##  9 sc3        0.14     6     2596         0    2596     147    2486    110    NA
## 10 sc2        0.05     5       NA        NA      NA      NA      NA     NA    NA
## # … with 50 more rows, 1 more variable: rec_id <chr>, and abbreviated variable
## #   names ¹​total.rev, ²​staff.costs, ³​total.costs

Przykład: losowe reszty

typos_corrected %>%
impute_rlm(total.rev ~ turnover | size,
add_residual="observed")
##    size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1   sc0      0.02    75       NA        NA    1130.0          NA       18915
## 2   sc3      0.14     9     1607         0    1607.0         131        1544
## 3   sc3      0.14    NA     6886        33    6919.0         324        6493
## 4   sc3      0.14    NA     3861        13    3874.0         290        3600
## 5   sc3      0.14    NA     5565        37    5602.0         314        5530
## 6   sc0      0.02     1       25         0      25.0          NA          22
## 7   sc3      0.14     5       NA        NA    1335.0         135         136
## 8   sc1      0.02     3      404        13     417.0          NA         342
## 9   sc3      0.14     6     2596         0    2596.0         147        2486
## 10  sc2      0.05     5       NA        NA        NA          NA          NA
## 11  sc2      0.05     5      645         0     645.0         130         636
## 12  sc2      0.05     5     2872         0    2872.0         182        2652
## 13  sc3      0.14    13     5678        12    5690.0         326        5656
## 14  sc1      0.02    NA   931397         0  931397.0       36872      841489
## 15  sc1      0.02     3    80000        NA   79999.9       40000          NA
## 16  sc0      0.02    52     9067       622    9689.0        1125        9911
## 17  sc3      0.14    10     1500        20    1520.0         195        1384
## 18  sc1      0.02     4      440         0     440.0          16         379
## 19  sc2      0.05     3      690         0     690.0       19000      464507
## 20  sc3      0.14     8     1852         0    1852.0         120        1812
## 21  sc0      0.02     2      359         9     368.0          NA         339
## 22  sc0      0.02     3      839         0     839.0           2         717
## 23  sc1      0.02     2      471         0     471.0          34         411
## 24  sc1      0.02     4      933         2     935.0          31         814
## 25  sc2      0.05     3     1665         0    1665.0          70         186
## 26  sc3      0.14     6     2318         0    2318.0         184         390
## 27  sc2      0.05     2     1175        12    1187.0         114          NA
## 28  sc3      0.14    16     2946         7    2953.0         245        2870
## 29  sc0      0.02     1      492         0     492.0          NA         470
## 30  sc2      0.05     6     1831      1831    3662.0          53        1443
## 31  sc3      0.14    29     7271        30    7301.0         451        7242
## 32  sc2      0.05     8       NA        NA     107.0          28          95
## 33  sc3      0.14    13     4118        11    4129.0          57        3601
## 34  sc3      0.14     9     2803         0    2803.0         106        2643
## 35  sc3      0.14    15     2876        33    2909.0         539        2627
## 36  sc3      0.14    14     2649        98    2747.0      221302     2725410
## 37  sc2      0.05     6      202         4     206.0          64         170
## 38  sc2      0.05    53     9842         0    9842.0         837       10000
## 39  sc2      0.05     7     2463        38    2501.0          87        2347
## 40  sc3      0.14    NA     4445        98    4543.0         369        4266
## 41  sc3      0.14    20     3284        11    3295.0         181        3168
## 42  sc2      0.05     2      814         0     814.0         107         175
## 43  sc1      0.02    NA     1210         0    1210.0          52        1124
## 44  sc0      0.02     1      343         0     343.0          NA          NA
## 45  sc2      0.05     3      952         0     952.0          79          NA
## 46  sc0      0.02     1       41         0      41.0          NA          32
## 47  sc3      0.14    60     3633         0    3633.0         257        3626
## 48  sc3      0.14     8     2906         0    2906.0         144         453
## 49  sc3      0.14    10     2333         6    2339.0         193        2353
## 50  sc3      0.14    12     2275         5    2280.0         222        2302
## 51  sc2      0.05     7     1728         0    1728.0         153        1681
## 52  sc3      0.14    24     6872        32    6904.0         485        6729
## 53  sc3      0.14    29     3571        76    3647.0         311        3554
## 54  sc3      0.14    11     1021         0    1021.0         235         472
## 55  sc0      0.02     1      197         0     197.0          NA         168
## 56  sc2      0.05     7      917         0     917.0          30         781
## 57  sc2      0.05     8     2000         0    2000.0          NA        1700
## 58  sc3      0.14     3      200         0     200.0          49         177
## 59  sc2      0.05     4      342         0     342.0          30         299
## 60  sc2      0.05     6        1      1410    1411.0         179        1215
##    profit  vat rec_id
## 1   20045   NA    001
## 2      63   NA    002
## 3     426   NA    003
## 4     274   NA    004
## 5      72   NA    005
## 6       3   NA    006
## 7       1 1346    007
## 8      75   NA    008
## 9     110   NA    009
## 10     NA   NA    010
## 11      9   NA    011
## 12    220   NA    012
## 13     34   NA    013
## 14  89908  863    014
## 15     NA  813    015
## 16   -222  964    016
## 17    136  733    017
## 18     60  296    018
## 19 225493  486    019
## 20     40 1312    020
## 21     29  257    021
## 22    122  654    022
## 23     60  377    023
## 24    121  811    024
## 25   1478 1472    025
## 26     86 2082    026
## 27     17 1058    027
## 28     83 2670    028
## 29     22  449    029
## 30    388 1695    030
## 31     59 6754    031
## 32    100  905    032
## 33    528 3841    033
## 34    160 2668    034
## 35    282 2758    035
## 36  22457 2548    036
## 37     37  995    037
## 38   -160 9655    038
## 39    154 2441    039
## 40    277 4412    040
## 41    127 3263    041
## 42     NA  810    042
## 43     86 1205    043
## 44     NA  343    044
## 45    149  952    045
## 46      9   41    046
## 47      7 3634    047
## 48     53 2907    048
## 49    -14 2335    049
## 50    -22 2277    050
## 51     47 1742    051
## 52    174 6959    052
## 53     93 3700    053
## 54    549 1067    054
## 55     30  221    055
## 56    136 1030    056
## 57     NA 2271    057
## 58    222  251    058
## 59     43 1068    059
## 60    196 1389    060
typos_corrected %>%
impute_rlm(total.rev ~ turnover | size,
add_residual="normal")
##    size incl.prob staff turnover other.rev total.rev staff.costs total.costs
## 1   sc0      0.02    75       NA        NA    1130.0          NA       18915
## 2   sc3      0.14     9     1607         0    1607.0         131        1544
## 3   sc3      0.14    NA     6886        33    6919.0         324        6493
## 4   sc3      0.14    NA     3861        13    3874.0         290        3600
## 5   sc3      0.14    NA     5565        37    5602.0         314        5530
## 6   sc0      0.02     1       25         0      25.0          NA          22
## 7   sc3      0.14     5       NA        NA    1335.0         135         136
## 8   sc1      0.02     3      404        13     417.0          NA         342
## 9   sc3      0.14     6     2596         0    2596.0         147        2486
## 10  sc2      0.05     5       NA        NA        NA          NA          NA
## 11  sc2      0.05     5      645         0     645.0         130         636
## 12  sc2      0.05     5     2872         0    2872.0         182        2652
## 13  sc3      0.14    13     5678        12    5690.0         326        5656
## 14  sc1      0.02    NA   931397         0  931397.0       36872      841489
## 15  sc1      0.02     3    80000        NA   80009.9       40000          NA
## 16  sc0      0.02    52     9067       622    9689.0        1125        9911
## 17  sc3      0.14    10     1500        20    1520.0         195        1384
## 18  sc1      0.02     4      440         0     440.0          16         379
## 19  sc2      0.05     3      690         0     690.0       19000      464507
## 20  sc3      0.14     8     1852         0    1852.0         120        1812
## 21  sc0      0.02     2      359         9     368.0          NA         339
## 22  sc0      0.02     3      839         0     839.0           2         717
## 23  sc1      0.02     2      471         0     471.0          34         411
## 24  sc1      0.02     4      933         2     935.0          31         814
## 25  sc2      0.05     3     1665         0    1665.0          70         186
## 26  sc3      0.14     6     2318         0    2318.0         184         390
## 27  sc2      0.05     2     1175        12    1187.0         114          NA
## 28  sc3      0.14    16     2946         7    2953.0         245        2870
## 29  sc0      0.02     1      492         0     492.0          NA         470
## 30  sc2      0.05     6     1831      1831    3662.0          53        1443
## 31  sc3      0.14    29     7271        30    7301.0         451        7242
## 32  sc2      0.05     8       NA        NA     107.0          28          95
## 33  sc3      0.14    13     4118        11    4129.0          57        3601
## 34  sc3      0.14     9     2803         0    2803.0         106        2643
## 35  sc3      0.14    15     2876        33    2909.0         539        2627
## 36  sc3      0.14    14     2649        98    2747.0      221302     2725410
## 37  sc2      0.05     6      202         4     206.0          64         170
## 38  sc2      0.05    53     9842         0    9842.0         837       10000
## 39  sc2      0.05     7     2463        38    2501.0          87        2347
## 40  sc3      0.14    NA     4445        98    4543.0         369        4266
## 41  sc3      0.14    20     3284        11    3295.0         181        3168
## 42  sc2      0.05     2      814         0     814.0         107         175
## 43  sc1      0.02    NA     1210         0    1210.0          52        1124
## 44  sc0      0.02     1      343         0     343.0          NA          NA
## 45  sc2      0.05     3      952         0     952.0          79          NA
## 46  sc0      0.02     1       41         0      41.0          NA          32
## 47  sc3      0.14    60     3633         0    3633.0         257        3626
## 48  sc3      0.14     8     2906         0    2906.0         144         453
## 49  sc3      0.14    10     2333         6    2339.0         193        2353
## 50  sc3      0.14    12     2275         5    2280.0         222        2302
## 51  sc2      0.05     7     1728         0    1728.0         153        1681
## 52  sc3      0.14    24     6872        32    6904.0         485        6729
## 53  sc3      0.14    29     3571        76    3647.0         311        3554
## 54  sc3      0.14    11     1021         0    1021.0         235         472
## 55  sc0      0.02     1      197         0     197.0          NA         168
## 56  sc2      0.05     7      917         0     917.0          30         781
## 57  sc2      0.05     8     2000         0    2000.0          NA        1700
## 58  sc3      0.14     3      200         0     200.0          49         177
## 59  sc2      0.05     4      342         0     342.0          30         299
## 60  sc2      0.05     6        1      1410    1411.0         179        1215
##    profit  vat rec_id
## 1   20045   NA    001
## 2      63   NA    002
## 3     426   NA    003
## 4     274   NA    004
## 5      72   NA    005
## 6       3   NA    006
## 7       1 1346    007
## 8      75   NA    008
## 9     110   NA    009
## 10     NA   NA    010
## 11      9   NA    011
## 12    220   NA    012
## 13     34   NA    013
## 14  89908  863    014
## 15     NA  813    015
## 16   -222  964    016
## 17    136  733    017
## 18     60  296    018
## 19 225493  486    019
## 20     40 1312    020
## 21     29  257    021
## 22    122  654    022
## 23     60  377    023
## 24    121  811    024
## 25   1478 1472    025
## 26     86 2082    026
## 27     17 1058    027
## 28     83 2670    028
## 29     22  449    029
## 30    388 1695    030
## 31     59 6754    031
## 32    100  905    032
## 33    528 3841    033
## 34    160 2668    034
## 35    282 2758    035
## 36  22457 2548    036
## 37     37  995    037
## 38   -160 9655    038
## 39    154 2441    039
## 40    277 4412    040
## 41    127 3263    041
## 42     NA  810    042
## 43     86 1205    043
## 44     NA  343    044
## 45    149  952    045
## 46      9   41    046
## 47      7 3634    047
## 48     53 2907    048
## 49    -14 2335    049
## 50    -22 2277    050
## 51     47 1742    051
## 52    174 6959    052
## 53     93 3700    053
## 54    549 1067    054
## 55     30  221    055
## 56    136 1030    056
## 57     NA 2271    057
## 58    222  251    058
## 59     43 1068    059
## 60    196 1389    060

Przykład: naucz na A, zastosuj dla B

m <- MASS::rlm(other.rev ~ turnover + staff
, data=typos_corrected)
impute(retailers, other.rev ~ m)
##    size incl.prob staff turnover     other.rev total.rev staff.costs
## 1   sc0      0.02    75       NA            NA      1130          NA
## 2   sc3      0.14     9     1607     5.1118178      1607         131
## 3   sc3      0.14    NA     6886   -33.0000000      6919         324
## 4   sc3      0.14    NA     3861    13.0000000      3874         290
## 5   sc3      0.14    NA       NA    37.0000000      5602         314
## 6   sc0      0.02     1       25    -0.1941089        25          NA
## 7   sc3      0.14     5       NA            NA      1335         135
## 8   sc1      0.02     3      404    13.0000000       417          NA
## 9   sc3      0.14     6     2596     8.0157561      2596         147
## 10  sc2      0.05     5       NA            NA        NA          NA
## 11  sc2      0.05     5      645     1.9299772       645         130
## 12  sc2      0.05     5     2872     8.8177527      2872         182
## 13  sc3      0.14    13     5678    12.0000000      5690         326
## 14  sc1      0.02    NA   931397            NA    931397       36872
## 15  sc1      0.02     3    80000   247.2597777        NA       40000
## 16  sc0      0.02    52     9067   622.0000000      9689        1125
## 17  sc3      0.14    10     1500    20.0000000      1520         195
## 18  sc1      0.02     4      440     1.2443132       440          16
## 19  sc2      0.05     3      690     1.9658956       690       19000
## 20  sc3      0.14     8     1852     5.8179360      1852         120
## 21  sc0      0.02     2      359     9.0000000       368          NA
## 22  sc0      0.02     3      839     2.4267302       839           2
## 23  sc1      0.02     2      471     1.2369318       471          34
## 24  sc1      0.02     4      933     2.0000000       935          31
## 25  sc2      0.05     3     1665     4.9814238      1665          70
## 26  sc3      0.14     6     2318     7.1559440      2318         184
## 27  sc2      0.05     2     1175    12.0000000      1187         114
## 28  sc3      0.14    16     2946     7.0000000      2953         245
## 29  sc0      0.02     1      492     1.2502517       492          NA
## 30  sc2      0.05     6     1831  1831.0000000      1831          53
## 31  sc3      0.14    29     7271    30.0000000      7301         451
## 32  sc2      0.05     8      971     3.0931357       107          28
## 33  sc3      0.14    13     4118    11.0000000      4129          57
## 34  sc3      0.14     9     2803     8.8108656      2803         106
## 35  sc3      0.14    15     2876    33.0000000      2909         539
## 36  sc3      0.14    14     2649 98350.0000000      2747      221302
## 37  sc2      0.05     6     1024     4.0000000       206          64
## 38  sc2      0.05    53     9842    32.8531468      9842         837
## 39  sc2      0.05     7     2463    38.0000000      2501          87
## 40  sc3      0.14    NA     4445    98.0000000      4543         369
## 41  sc3      0.14    20     3284    11.0000000      3295         181
## 42  sc2      0.05     2      814     2.2977791       814         107
## 43  sc1      0.02    NA     1210            NA      1210          52
## 44  sc0      0.02     1      343     0.7894172       343          NA
## 45  sc2      0.05     3      952     2.7762222       952          79
## 46  sc0      0.02     1       41    -0.1446233        41          NA
## 47  sc3      0.14    60     3633    14.0110544      3633         257
## 48  sc3      0.14     8     2906     9.0777992      2906         144
## 49  sc3      0.14    10     2333     6.0000000      2339         193
## 50  sc3      0.14    12     2275     5.0000000      2280         222
## 51  sc2      0.05     7     1728     5.3827928      1728         153
## 52  sc3      0.14    24     6872    32.0000000      6904         485
## 53  sc3      0.14    29     3571    76.0000000      3647         311
## 54  sc3      0.14    11     1021     3.4026678      1021         235
## 55  sc0      0.02     1      197     0.3378612       197          NA
## 56  sc2      0.05     7      917     2.8744920       917          30
## 57  sc2      0.05     8     2000     6.2756777      2000          NA
## 58  sc3      0.14     3      200     0.4503995       200          49
## 59  sc2      0.05     4      342     0.9412140       342          30
## 60  sc2      0.05     6        1    -0.0101879      1411         179
##    total.costs profit  vat rec_id
## 1        18915  20045   NA    001
## 2         1544     63   NA    002
## 3         6493    426   NA    003
## 4         3600    274   NA    004
## 5         5530     72   NA    005
## 6           22      3   NA    006
## 7          136      1 1346    007
## 8          342     75   NA    008
## 9         2486    110   NA    009
## 10          NA     NA   NA    010
## 11         636      9   NA    011
## 12        2652    220   NA    012
## 13        5656     34   NA    013
## 14      841489  89908  863    014
## 15          NA     NA  813    015
## 16        9911   -222  964    016
## 17        1384    136  733    017
## 18         379     60  296    018
## 19      464507 225493  486    019
## 20        1812     40 1312    020
## 21         339     29  257    021
## 22         717    122  654    022
## 23         411     60  377    023
## 24         814    121  811    024
## 25         186   1478 1472    025
## 26         390     86 2082    026
## 27          NA     17 1058    027
## 28        2870     83 2670    028
## 29         470     22  449    029
## 30        1443    388 1695    030
## 31        7242     59 6754    031
## 32          95    100  905    032
## 33        3601    528 3841    033
## 34        2643    160 2668    034
## 35        2627    282 2758    035
## 36     2725410  22457 2548    036
## 37         170     37  995    037
## 38       10000   -160 9655    038
## 39        2347    154 2441    039
## 40        4266    277 4412    040
## 41        3168    127 3263    041
## 42         175     NA  810    042
## 43        1124     86 1205    043
## 44          NA     NA  343    044
## 45          NA    149  952    045
## 46          32      9   41    046
## 47        3626      7 3634    047
## 48         453     53 2907    048
## 49        2353    -14 2335    049
## 50        2302    -22 2277    050
## 51        1681     47 1742    051
## 52        6729    174 6959    052
## 53        3554     93 3700    053
## 54         472    549 1067    054
## 55         168     30  221    055
## 56         781    136 1030    056
## 57        1700     NA 2271    057
## 58         177    222  251    058
## 59         299     43 1068    059
## 60        1215    196 1389    060

Logi zmian

Każdy analityk chciałby wiedzieć, która operacja czyszczenia danych miała wpływ na wartości, statystyki, wyniki walidacji…

Rozwiązanie:

Wszystkie dane przepływają przez fajkę %>%. Widzi ona dane wejściowe i wyjściowe. Użyjmy zatem specjalnego operatora fajki, który mierzy i przechowuje różnice między wejściem a wyjściem.

Operator %>>%

Pakiet “lumberjack” pozwala na zamianę fajki typu magrittr %>% na operatora “drwala” %>>% Rejestruje on, co dzieje się z Twoimi danymi, gdy przepływają przez %>>% Używa on loggera wyeksportowanego przez “drwala” lub walidacji lub definiowania przez własny rejestrator.

imputed <- typos_corrected %>>%
start_log(log = validate::lbj_cells()) %>>%
impute_lm(turnover ~ staff) %>>%
impute_median(other.rev + turnover ~ size) %>>%
dump_log()
## Dumped a log at C:\Users\Maciek\Downloads\lbj_cells.csv

Zobaczmy, jak wygląda plik z logami (jego początek):

read.csv("lbj_cells.csv") %>% head()
##   step                time                                 expression cells
## 1    0 2022-12-11 09:58:52                                              660
## 2    1 2022-12-11 09:58:52                impute_lm(turnover ~ staff)   660
## 3    2 2022-12-11 09:58:52 impute_median(other.rev + turnover ~ size)   660
##   available still_available unadapted adapted imputed missing still_missing
## 1       611             611       611       0       0      49            49
## 2       615             611       611       0       4      45            45
## 3       620             615       615       0       5      40            40
##   removed
## 1       0
## 2       0
## 3       0

Pakiet dlookr

Transformacje danych z pakietem “dlookr”:

  • find_na() - znajduje zmienną, która zawiera brakujące obserwacje

  • imputate_na() - wypełnia brakujące obserwacje

  • summary.imputation() oraz plot.imputation() - przedstawiają podsumowania i wizualizują przeprowadzone imputacje braków

  • find_skewness() - znajduje zmienne skośne i raportuje tę skośność

  • transform() - przeprowadza standaryzację, normalizację zmiennych numerycznych

  • summary.transform() oraz plot.transform() - przedstawiają podsumowania i wizualizują przeprowadzone transformacje

  • binning() oraz binning_by() konwertuje dane do kategorycznych (ilościowe do jakościowych)

  • transformation_web_report() - przeprowadza w/w i tworzy raport z transformacji

Jako przykład, zabrudzimy paczkę danych “Carseats”.

?Carseats
## uruchamianie serwera httpd dla pomocy ... wykonano
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
data(Carseats)
attach(Carseats)
#Wprowadzamy braki danych: 
carseats <- ISLR::Carseats  
suppressWarnings(RNGversion("3.5.0")) 
set.seed(123) 
carseats[sample(seq(NROW(carseats)), 20), "Income"] <- NA 
suppressWarnings(RNGversion("3.5.0")) 
set.seed(456) 
carseats[sample(seq(NROW(carseats)), 10), "Urban"] <- NA

Przykład 1. Zamieniamy brakujące dochody (20 sztuk) za pomocą mediany:

#wypełniamy brakujące obserwacje

# ?imputate_na
dochod<-imputate_na(carseats, Income, method = "median")

summary(dochod)
## Impute missing values with median
## 
## * Information of Imputation (before vs after)
##                     Original    Imputation 
## described_variables "value"     "value"    
## n                   "380"       "400"      
## na                  "20"        " 0"       
## mean                "68.8605"   "68.8675"  
## sd                  "28.0916"   "27.3785"  
## se_mean             "1.44107"   "1.36893"  
## IQR                 "48.25"     "45.25"    
## skewness            "0.0449060" "0.0452954"
## kurtosis            "-1.089201" "-0.987569"
## p00                 "21"        "21"       
## p01                 "21.79"     "21.99"    
## p05                 "26"        "26"       
## p10                 "30.0"      "30.9"     
## p20                 "39"        "40"       
## p25                 "42.75"     "44.75"    
## p30                 "48"        "52"       
## p40                 "62"        "63"       
## p50                 "69"        "69"       
## p60                 "78"        "76"       
## p70                 "86.3"      "84.0"     
## p75                 "91"        "90"       
## p80                 "96.2"      "94.2"     
## p90                 "108.1"     "106.1"    
## p95                 "115.05"    "115.00"   
## p99                 "119.21"    "119.01"   
## p100                "120"       "120"
plot(dochod)

Przykład 2. Zamieniamy brakujące dochody (20 sztuk) za pomocą metody “mice” wielorównaniowej:

dochod<-imputate_na(carseats, Income, Urban, method = "mice")
## 
##  iter imp variable
##   1   1  Income
##   1   2  Income
##   1   3  Income
##   1   4  Income
##   1   5  Income
##   2   1  Income
##   2   2  Income
##   2   3  Income
##   2   4  Income
##   2   5  Income
##   3   1  Income
##   3   2  Income
##   3   3  Income
##   3   4  Income
##   3   5  Income
##   4   1  Income
##   4   2  Income
##   4   3  Income
##   4   4  Income
##   4   5  Income
##   5   1  Income
##   5   2  Income
##   5   3  Income
##   5   4  Income
##   5   5  Income
summary(dochod)
## * Impute missing values based on Multivariate Imputation by Chained Equations
##  - method : mice
##  - random seed : 37295
## 
## * Information of Imputation (before vs after)
##                     Original   Imputation
## described_variables "value"    "value"   
## n                   "380"      "400"     
## na                  "20"       " 0"      
## mean                "68.8605"  "69.1370" 
## sd                  "28.0916"  "27.4940" 
## se_mean             "1.44107"  "1.37470" 
## IQR                 "48.25"    "45.25"   
## skewness            "0.044906" "0.018794"
## kurtosis            "-1.08920" "-1.01939"
## p00                 "21"       "21"      
## p01                 "21.79"    "21.99"   
## p05                 "26"       "26"      
## p10                 "30.0"     "30.9"    
## p20                 "39"       "40"      
## p25                 "42.75"    "44.75"   
## p30                 "48"       "52"      
## p40                 "62"       "63"      
## p50                 "69"       "69"      
## p60                 "78"       "78"      
## p70                 "86.3"     "84.0"    
## p75                 "91"       "90"      
## p80                 "96.2"     "94.2"    
## p90                 "108.1"    "106.1"   
## p95                 "115.05"   "115.00"  
## p99                 "119.21"   "119.01"  
## p100                "120"      "120"
plot(dochod)

Obserwacje odstające

Jednowymiarowe:

Wielowymiarowe:

Metody wykrywania i usuwania obserwacji odstających:

  • wykres ramkowy, rozrzutu

  • Z-score (reguła 3 sigm - 3 odchyleń od średniej)

  • testy statystyczne (Cook’a, Grubbs’a, Tukey’a)

  • percentyle

  • filtr Hampela

  • dystans Cook’a - dla wielowymiarowych

Przykład 3. Zamiana wartości odstających komendą imputate_outlier()

# ?imputate_outlier

ceny<-imputate_outlier(carseats, Price, method="capping")

summary(ceny)
## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##                     Original     Imputation  
## described_variables "value"      "value"     
## n                   "400"        "400"       
## na                  "0"          "0"         
## mean                "115.795"    "115.893"   
## sd                  "23.6767"    "22.6109"   
## se_mean             "1.18383"    "1.13055"   
## IQR                 "31"         "31"        
## skewness            "-0.1252862" "-0.0461621"
## kurtosis            " 0.451885"  "-0.303058" 
## p00                 "24"         "54"        
## p01                 "54.99"      "67.96"     
## p05                 "77"         "77"        
## p10                 "87"         "87"        
## p20                 "96.8"       "96.8"      
## p25                 "100"        "100"       
## p30                 "104"        "104"       
## p40                 "110"        "110"       
## p50                 "117"        "117"       
## p60                 "122"        "122"       
## p70                 "128.3"      "128.3"     
## p75                 "131"        "131"       
## p80                 "134"        "134"       
## p90                 "146"        "146"       
## p95                 "155.050"    "155.002"   
## p99                 "166.05"     "164.02"    
## p100                "191"        "173"
plot(ceny)

Standaryzacja danych

Standaryzacja:

  • Metoda z-score – jest to taka standaryzacja danych, która sprowadzi naszą zmienną do skali uniwersalnej, bez wpływu średniej i odchylenie standardowego – od zmiennej musimy odjąć średnią i podzielić przez odchylenie: dane otrzymują średnią równą 0 i odchylenie równe 1.

  • Metoda “minmax” - inna odmiana standaryzacji - względem minimum i maksimum: od wartości zmiennej odejmujemy minimum i dzielimy to przez rozstęp: (x-min)/(max-min).

Jak rozwiązać problem ze skośnymi danymi? Transformować!

  • log – transformacja z użyciem logarytmu log(x)

  • log+1 – tak jak wyżej, ale umożliwia badanie danych, które zawierają 0

  • sqrt – pierwiastek

  • 1/x

  • x^2

  • x^3

  • metoda Boxa-Coxa

Przykład 4. Za pomocą komendy “mutate” oraz “transform” dokonaj standaryzacji tworząc nową zmienną metodą z-score oraz minmax i wyświetl je na wykresie ramkowym:

carseats$Income<-as.numeric(carseats$Income)   #usuwam informacje o imputacjach

carseats %>% 
  mutate(dochody_z = transform(carseats$Income, method = "zscore"),
         dochody_minmax = transform(carseats$Income, method = "minmax"))  %>% 
  select(dochody_z, dochody_minmax) %>% 
  boxplot()

Przykład 5. Znajdź zmienne (zmienną) w ramce danych, które stwarza(ją) problem skośności.

  • znajdujemy zmienne - które to zmienne stwarzają problem?

    find_skewness(carseats)
    ## [1] 4
  • to samo co powyżej, ale po nazwie:

find_skewness(carseats, index=FALSE)
## [1] "Advertising"
  • policzmy wskaźnik asymetrii dla zmiennych:
find_skewness(carseats, value=TRUE)
##       Sales   CompPrice      Income Advertising  Population       Price 
##       0.185      -0.043       0.045       0.637      -0.051      -0.125 
##         Age   Education 
##      -0.077       0.044
  • policzmy wskaźnik asymetrii dla zmiennych i filtrujemy tylko te, które mają skośność > 0.1:
find_skewness(carseats, value=TRUE, thres=0.1)
##       Sales Advertising       Price 
##       0.185       0.637      -0.125
hist(carseats$Advertising)

w takim razie musimy dokonać transformacji tej skośnej zmiennej:

advertising_log <- transform(carseats$Advertising, method = "log")
advertising_log
##   [1] 2.397895 2.772589 2.302585 1.386294 1.098612 2.564949     -Inf 2.708050
##   [9]     -Inf     -Inf 2.197225 1.386294 0.693147 2.397895 2.397895 1.609438
##  [17]     -Inf 2.564949     -Inf 2.772589 0.693147 2.484907 1.791759     -Inf
##  [25] 2.772589     -Inf 2.397895     -Inf     -Inf 2.708050     -Inf 2.772589
##  [33] 2.484907 2.564949     -Inf 2.397895     -Inf 1.609438     -Inf     -Inf
##  [41]     -Inf     -Inf     -Inf 2.397895 1.791759     -Inf 2.639057     -Inf
##  [49]     -Inf     -Inf 2.890372     -Inf 1.098612 2.564949 2.564949 1.609438
##  [57]     -Inf     -Inf 2.708050 1.386294 2.944439     -Inf     -Inf 2.302585
##  [65] 2.484907     -Inf     -Inf 2.639057 2.995732     -Inf 2.708050 2.772589
##  [73]     -Inf 2.302585 1.609438 3.135494 2.302585 2.484907 0.000000     -Inf
##  [81] 2.772589     -Inf 1.386294 1.945910     -Inf     -Inf 2.197225 1.945910
##  [89] 1.945910 1.098612     -Inf 2.397895     -Inf     -Inf 1.609438 2.302585
##  [97] 2.302585 1.609438 3.178054 1.098612 2.397895     -Inf     -Inf     -Inf
## [105]     -Inf 2.079442     -Inf     -Inf 0.693147     -Inf 1.945910 2.484907
## [113] 1.609438 2.397895 2.197225     -Inf     -Inf     -Inf 0.693147 2.079442
## [121] 2.397895 2.302585 1.609438     -Inf     -Inf     -Inf 0.693147 1.098612
## [129] 1.098612 1.945910 2.564949 1.098612 2.197225 0.693147     -Inf 2.639057
## [137]     -Inf     -Inf 2.484907 2.302585 2.302585     -Inf     -Inf 1.945910
## [145]     -Inf 2.397895     -Inf 2.197225     -Inf 2.564949 2.079442 2.833213
## [153]     -Inf 1.945910 2.302585     -Inf     -Inf 2.079442 0.000000     -Inf
## [161]     -Inf 1.609438     -Inf     -Inf     -Inf 1.945910 2.833213     -Inf
## [169]     -Inf 2.708050 2.484907 2.484907 2.564949 1.609438     -Inf     -Inf
## [177] 2.197225     -Inf 2.639057 1.098612 2.708050     -Inf 1.386294 1.791759
## [185] 1.945910 2.397895     -Inf     -Inf     -Inf 2.890372 2.564949 2.564949
## [193]     -Inf 1.945910 2.890372 1.386294 1.791759     -Inf 1.609438 1.609438
## [201]     -Inf     -Inf 1.386294     -Inf     -Inf 0.000000     -Inf     -Inf
## [209]     -Inf 2.397895 0.693147 2.639057 2.944439 1.609438 1.098612 2.708050
## [217]     -Inf     -Inf 2.484907 2.944439 2.708050     -Inf 1.791759 2.197225
## [225]     -Inf     -Inf     -Inf 2.302585 2.564949     -Inf     -Inf     -Inf
## [233] 2.302585 2.890372 2.397895 2.079442 2.772589 2.079442     -Inf     -Inf
## [241]     -Inf     -Inf     -Inf 2.564949     -Inf     -Inf 2.995732     -Inf
## [249]     -Inf     -Inf 2.302585 1.609438     -Inf 1.609438 3.135494 2.079442
## [257]     -Inf 2.639057     -Inf 2.302585 2.079442 1.386294 2.708050 1.791759
## [265] 1.609438 2.302585 2.484907 1.945910     -Inf     -Inf     -Inf     -Inf
## [273]     -Inf 2.079442 0.693147 2.397895 2.639057 2.484907 0.693147 2.564949
## [281] 2.302585 1.945910     -Inf     -Inf 2.397895 2.397895 2.397895 1.386294
## [289]     -Inf 3.218876 2.639057     -Inf 2.772589     -Inf 1.098612 2.639057
## [297] 2.564949 2.564949     -Inf 2.833213 0.000000     -Inf 2.564949 2.772589
## [305] 2.484907 3.258097 0.000000     -Inf 2.944439 2.564949 3.367296 2.484907
## [313] 1.609438 1.098612 2.302585 2.079442 1.609438     -Inf 2.302585 2.944439
## [321] 2.484907 1.609438 2.302585 2.890372 1.386294 2.397895     -Inf 2.833213
## [329] 0.000000 2.197225     -Inf 2.708050 2.995732 1.945910 2.197225 2.708050
## [337] 1.791759     -Inf     -Inf 1.386294     -Inf     -Inf 2.564949 2.302585
## [345]     -Inf     -Inf     -Inf     -Inf 2.995732 2.890372 2.833213 2.772589
## [353] 2.639057 2.484907 0.000000     -Inf     -Inf 1.098612 2.302585 2.397895
## [361] 1.945910 2.302585     -Inf 0.000000 2.772589     -Inf 2.397895     -Inf
## [369] 2.302585 3.091042 3.091042     -Inf     -Inf     -Inf 1.945910 1.386294
## [377] 2.944439     -Inf 1.098612     -Inf 2.302585 3.044522 2.944439     -Inf
## [385] 2.708050 2.564949     -Inf 2.639057 2.397895 2.079442 2.197225     -Inf
## [393] 2.564949 2.302585 2.944439 2.833213 1.098612 2.484907 1.945910     -Inf
## attr(,"method")
## [1] "log"
## attr(,"origin")
##   [1] 11 16 10  4  3 13  0 15  0  0  9  4  2 11 11  5  0 13  0 16  2 12  6  0 16
##  [26]  0 11  0  0 15  0 16 12 13  0 11  0  5  0  0  0  0  0 11  6  0 14  0  0  0
##  [51] 18  0  3 13 13  5  0  0 15  4 19  0  0 10 12  0  0 14 20  0 15 16  0 10  5
##  [76] 23 10 12  1  0 16  0  4  7  0  0  9  7  7  3  0 11  0  0  5 10 10  5 24  3
## [101] 11  0  0  0  0  8  0  0  2  0  7 12  5 11  9  0  0  0  2  8 11 10  5  0  0
## [126]  0  2  3  3  7 13  3  9  2  0 14  0  0 12 10 10  0  0  7  0 11  0  9  0 13
## [151]  8 17  0  7 10  0  0  8  1  0  0  5  0  0  0  7 17  0  0 15 12 12 13  5  0
## [176]  0  9  0 14  3 15  0  4  6  7 11  0  0  0 18 13 13  0  7 18  4  6  0  5  5
## [201]  0  0  4  0  0  1  0  0  0 11  2 14 19  5  3 15  0  0 12 19 15  0  6  9  0
## [226]  0  0 10 13  0  0  0 10 18 11  8 16  8  0  0  0  0  0 13  0  0 20  0  0  0
## [251] 10  5  0  5 23  8  0 14  0 10  8  4 15  6  5 10 12  7  0  0  0  0  0  8  2
## [276] 11 14 12  2 13 10  7  0  0 11 11 11  4  0 25 14  0 16  0  3 14 13 13  0 17
## [301]  1  0 13 16 12 26  1  0 19 13 29 12  5  3 10  8  5  0 10 19 12  5 10 18  4
## [326] 11  0 17  1  9  0 15 20  7  9 15  6  0  0  4  0  0 13 10  0  0  0  0 20 18
## [351] 17 16 14 12  1  0  0  3 10 11  7 10  0  1 16  0 11  0 10 22 22  0  0  0  7
## [376]  4 19  0  3  0 10 21 19  0 15 13  0 14 11  8  9  0 13 10 19 17  3 12  7  0
## attr(,"class")
## [1] "transform" "numeric"
summary(advertising_log)
## * Resolving Skewness with log
## 
## * Information of Transformation (before vs after)
##            Original Transformation
## n        400.000000     400.000000
## na         0.000000       0.000000
## mean       6.635000           -Inf
## sd         6.650364            NaN
## se_mean    0.332518            NaN
## IQR       12.000000            Inf
## skewness   0.639586            NaN
## kurtosis  -0.545118            NaN
## p00        0.000000           -Inf
## p01        0.000000           -Inf
## p05        0.000000           -Inf
## p10        0.000000           -Inf
## p20        0.000000           -Inf
## p25        0.000000           -Inf
## p30        0.000000           -Inf
## p40        2.000000       0.693147
## p50        5.000000       1.609438
## p60        8.400000       2.126555
## p70       11.000000       2.397895
## p75       12.000000       2.484907
## p80       13.000000       2.564949
## p90       16.000000       2.772589
## p95       19.000000       2.944439
## p99       23.010000       3.135920
## p100      29.000000       3.367296
plot(advertising_log)

Kategoryzacja danych

Kategoryzacja danych – tzw. BINNING: binning() - transformuje numeryczne, ilościowe zmienne w jakościowe - kategoryzowane z etykietami.

Następujące typy kategoryzacji są wspierane:

  • “quantile” - kategoryzowanie po kwantylach rozkładu zmiennej (zapewnia się w ten sposób takie same liczebności w kolejnych kategoriach)

  • “equal” - kategoryzowanie tak, aby każda nowo utworzona klasa miała tę samą długość

  • “pretty” - kompromis między 2 wyżej wymienionymi

  • “kmeans” - kategoryzacja z użyciem algorytmu K-średnich

  • “bclust” - kategoryzacja z użyciem algorytmu “bagged clustering”

Przykład 6. Używając metody kwantylowej dokonaj podziału i kategoryzacji zmiennej “Income”:

dochod_kat<-binning(carseats$Income, type="quantile")
summary(dochod_kat)
##             levels freq   rate
## 1          [21,30]   40 0.1000
## 2          (30,39]   37 0.0925
## 3          (39,48]   38 0.0950
## 4          (48,62]   40 0.1000
## 5          (62,69]   42 0.1050
## 6          (69,78]   33 0.0825
## 7    (78,86.56667]   36 0.0900
## 8  (86.56667,96.6]   38 0.0950
## 9  (96.6,108.6333]   38 0.0950
## 10  (108.6333,120]   38 0.0950
## 11            <NA>   20 0.0500
plot(dochod_kat)

Możemy też przeprowadzić kategoryzację wg liczby kategorii:

dochod_4<- binning(carseats$Income, nbins = 4, labels = c("niskie","srednie","wysokie", "bwysokie"))
summary(dochod_4)
##     levels freq   rate
## 1   niskie   95 0.2375
## 2  srednie  102 0.2550
## 3  wysokie   89 0.2225
## 4 bwysokie   94 0.2350
## 5     <NA>   20 0.0500
plot(dochod_4)

Dyskretyzacja zmiennych

Jeśli zależy nam na stworzeniu nowej zmiennej kategoryzowanej, której rozkład będzie zależny idealnie od zmiennej referencyjnej – w takim przypadku należy skorzystać z algorytmu optymalnej kategoryzacji.

Np. kategoryzacja wg zmiennej Yes/No, wg ryzyka kredytowego itp.

Przykład 7. Kategoryzujemy zmienną “Advertising” wg zmiennej referencyjnej “US”:

# optimal binning
bin <- binning_by(carseats, y="US", x="Advertising")
## Warning in binning_by(carseats, y = "US", x = "Advertising"): The factor y has been changed to a numeric vector consisting of 0 and 1.
## 'Yes' changed to 1 (positive) and 'No' changed to 0 (negative).
summary(bin)
## ── Binning Table ──────────────────────── Several Metrics ── 
##      Bin CntRec CntPos CntNeg RatePos RateNeg    Odds      WoE      IV     JSD
## 1 [-1,0]    144     19    125 0.07364 0.88028  0.1520 -2.48101 2.00128 0.20093
## 2  (0,6]     69     54     15 0.20930 0.10563  3.6000  0.68380 0.07089 0.00869
## 3 (6,29]    187    185      2 0.71705 0.01408 92.5000  3.93008 2.76272 0.21861
## 4  Total    400    258    142 1.00000 1.00000  1.8169       NA 4.83489 0.42823
##       AUC
## 1 0.03241
## 2 0.01883
## 3 0.00903
## 4 0.06028
## 
## ── General Metrics ───────────────────────────────────────── 
## • Gini index                       :  -0.87944
## • IV (Jeffrey)                     :  4.83489
## • JS (Jensen-Shannon) Divergence   :  0.42823
## • Kolmogorov-Smirnov Statistics    :  0.80664
## • HHI (Herfindahl-Hirschman Index) :  0.37791
## • HHI (normalized)                 :  0.06687
## • Cramer's V                       :  0.81863 
## 
## ── Significance Tests ──────────────────── Chisquare Test ── 
##    Bin A  Bin B statistics                      p_value
## 1 [-1,0]  (0,6]    87.6706 0.00000000000000000000773135
## 2  (0,6] (6,29]    34.7335 0.00000000378070568864985667

UWAGA! Co to takiego information value i jak ją się interpretuje? Zwróć uwagę na inne metryki i testy w raporcie.

Możemy w końcu zwizualizować optymalną kategoryzację:

plot(bin)

Na koniec - możemy wykonać raport podsumowujący wszystkie operacje związane z czyszczeniem danych:

  • w formacie PDF (pamiętajmy o Tex, sterownikach…).

  • w formacie HTML:

#carseats %>% transformation_web_report(target = US, output_format = "html", output_file ="transformation_carseats.html")

Zadanie domowe

Korzystając z paczki danych “germancredit” dotyczącą oceny kredytowej (creditability) wybranych klientów pewnego banku:

Czy w zbiorze danych mamy obserwacje brakujące?

Proszę dokonać kategoryzacji zmiennej “age.in.years” (wiek w latach) wg oceny kredytowej “creditability”.

Podaj i zinterpretuj wskaźniki informacyjne. Oceń skośność zmiennych ilościowych.

Sprawdź, czy nie mamy obserwacji odstających dla zmiennej “age.in.years” (wiek w latach). Jeśli są - dokonaj imputacji wybraną przez siebie metodą.

data("germancredit")
attach(germancredit)
sum(is.na(germancredit))
## [1] 0
aggr(germancredit)

bin<- binning_by(germancredit, y="creditability", x="age.in.years")
## Warning in binning_by(germancredit, y = "creditability", x = "age.in.years"): The factor y has been changed to a numeric vector consisting of 0 and 1.
## 'good' changed to 1 (positive) and 'bad' changed to 0 (negative).
plot(bin)

summary(bin)
## ── Binning Table ──────────────────────── Several Metrics ── 
##       Bin CntRec CntPos CntNeg RatePos RateNeg    Odds      WoE      IV     JSD
## 1 [19,25]    190    110     80 0.15714 0.26667 1.37500 -0.52884 0.05792 0.00716
## 2 (25,75]    810    590    220 0.84286 0.73333 2.68182  0.13920 0.01525 0.00190
## 3   Total   1000    700    300 1.00000 1.00000 2.33333       NA 0.07317 0.00906
##       AUC
## 1 0.02095
## 2 0.42429
## 3 0.44524
## 
## ── General Metrics ───────────────────────────────────────── 
## • Gini index                       :  -0.10952
## • IV (Jeffrey)                     :  0.07317
## • JS (Jensen-Shannon) Divergence   :  0.00906
## • Kolmogorov-Smirnov Statistics    :  0.10952
## • HHI (Herfindahl-Hirschman Index) :  0.6922
## • HHI (normalized)                 :  0.3844
## • Cramer's V                       :  0.12794 
## 
## ── Significance Tests ──────────────────── Chisquare Test ── 
##     Bin A   Bin B statistics      p_value
## 1 [19,25] (25,75]    16.3681 0.0000521562
find_skewness(germancredit, value = TRUE ,thres = 0.1)
##                                        duration.in.month 
##                                                    1.093 
##                                            credit.amount 
##                                                    1.947 
##      installment.rate.in.percentage.of.disposable.income 
##                                                   -0.531 
##                                  present.residence.since 
##                                                   -0.272 
##                                             age.in.years 
##                                                    1.019 
##                  number.of.existing.credits.at.this.bank 
##                                                    1.271 
## number.of.people.being.liable.to.provide.maintenance.for 
##                                                    1.907
boxplot(germancredit$age.in.years)

wiek<-imputate_outlier(germancredit, age.in.years, method="capping")
summary(wiek)
## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##                     Original    Imputation 
## described_variables "value"     "value"    
## n                   "1000"      "1000"     
## na                  "0"         "0"        
## mean                "35.546"    "35.350"   
## sd                  "11.3755"   "10.8530"  
## se_mean             "0.359724"  "0.343202" 
## IQR                 "15"        "15"       
## skewness            "1.020739"  "0.821878" 
## kurtosis            " 0.595780" "-0.132573"
## p00                 "19"        "19"       
## p01                 "20"        "20"       
## p05                 "22"        "22"       
## p10                 "23"        "23"       
## p20                 "26"        "26"       
## p25                 "27"        "27"       
## p30                 "28"        "28"       
## p40                 "30"        "30"       
## p50                 "33"        "33"       
## p60                 "36"        "36"       
## p70                 "39"        "39"       
## p75                 "42"        "42"       
## p80                 "45"        "45"       
## p90                 "52"        "52"       
## p95                 "60"        "60"       
## p99                 "67.01"     "63.00"    
## p100                "75"        "64"
plot(wiek)

extract(bin)
##    [1] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##   [10] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25]
##   [19] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##   [28] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##   [37] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75]
##   [46] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
##   [55] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75]
##   [64] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75]
##   [73] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
##   [82] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##   [91] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [100] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [109] (25,75] (25,75] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [118] (25,75] [19,25] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [127] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25]
##  [136] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [145] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [154] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [163] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75]
##  [172] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [181] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [190] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [199] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [208] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [217] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [226] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] [19,25]
##  [235] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [244] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
##  [253] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
##  [262] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [271] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [280] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [289] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [298] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [307] [19,25] (25,75] [19,25] [19,25] (25,75] [19,25] (25,75] [19,25] (25,75]
##  [316] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [325] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [334] [19,25] [19,25] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75]
##  [343] [19,25] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25]
##  [352] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25]
##  [361] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75]
##  [370] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [379] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] [19,25] [19,25]
##  [388] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75]
##  [397] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [406] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [415] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75]
##  [424] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [433] (25,75] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75]
##  [442] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [451] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [460] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [469] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [478] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [487] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [496] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [505] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [514] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25]
##  [523] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [532] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [541] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [550] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
##  [559] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75]
##  [568] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75] [19,25]
##  [577] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [586] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
##  [595] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [604] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
##  [613] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
##  [622] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [631] (25,75] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75]
##  [640] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [649] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
##  [658] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75]
##  [667] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
##  [676] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [685] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75]
##  [694] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [703] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75]
##  [712] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [721] (25,75] [19,25] [19,25] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75]
##  [730] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
##  [739] (25,75] (25,75] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] [19,25]
##  [748] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] [19,25] (25,75] (25,75]
##  [757] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75]
##  [766] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
##  [775] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75] (25,75]
##  [784] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [793] (25,75] (25,75] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [802] (25,75] [19,25] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75]
##  [811] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
##  [820] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [829] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] [19,25] (25,75] [19,25]
##  [838] [19,25] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [847] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [856] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [865] [19,25] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [874] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [883] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75]
##  [892] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [901] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75]
##  [910] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [919] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25]
##  [928] (25,75] (25,75] (25,75] [19,25] [19,25] (25,75] (25,75] [19,25] (25,75]
##  [937] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [946] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75]
##  [955] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75]
##  [964] (25,75] [19,25] (25,75] [19,25] (25,75] (25,75] (25,75] [19,25] (25,75]
##  [973] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75]
##  [982] (25,75] (25,75] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75]
##  [991] (25,75] (25,75] [19,25] (25,75] (25,75] (25,75] (25,75] (25,75] [19,25]
## [1000] (25,75]
## Levels: [19,25] < (25,75]

Po więcej informacji nt. pakietu ‘dlookr’ zapraszam na jego stronę domową z rozwiązanymi przykładami.