Čistenie údajov

Predpokladajme, že v názvoch stĺpcov v pôvodnom súbore máme medzery, teda Náhľad na xls databázu otvorenú v tabuľkovom procesore

Tu sa v názvoch stĺpcov vyskytujú medzery. Názvy stĺpcov sa v prostredí R stávajú názvami premenných a tie nesmú byť súčasťou názvu premennej. Neprípustné znaky v názvoch premenných vo všeobecnosti môžeme nahradiť s pomocou knižnice janitor.

Import údajov

Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).

# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ";": variables are separated by semicolons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors

udaje1 <- read.csv2(
  "udaje/ChybnaDatabaza.csv",
  header = TRUE,
  sep = ";",
  dec = ".",
  na.strings = c("", "NA"),
  stringsAsFactors = FALSE
)

# Show the first rows of the dataset
head(udaje1)
##   YEARS                          COMPANIES   EXCHANGE.SECTOR PRIMARY.BUSINESS
## 1  2013 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 2  2014 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 3  2015 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 4    NA Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 5  2017 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 6  2018 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
##   TOBIN.Q MARKET.CAPITALIZATION RETURN.ON.ASSETS DEBT.TO.ASSET FIRM.SIZE
## 1    1.08                  6.44            14.41         36.85      6.46
## 2    0.96                  6.36            11.19         41.01      6.53
## 3    0.75                  6.20             5.74         34.32      6.51
## 4    0.63                  6.17             3.51         40.16      6.56
## 5    0.60                    NA             9.78         43.69      6.63
## 6    0.70                  6.24            13.43         40.72      6.65
##   SOCIAL.DISCLOSURE.INDEX ENVIRONMENTAL.DISCLOSURE.INDEX
## 1                    0.36                             NA
## 2                    0.29                              0
## 3                    0.29                              0
## 4                    0.29                              0
## 5                    0.29                              0
## 6                    0.29                              0
##   GOVERNANCE.DISCLOSURE.INDEX ESG.INDEX
## 1                        0.38      0.25
## 2                        0.38      0.22
## 3                        0.38      0.22
## 4                        0.38      0.22
## 5                        0.38      0.22
## 6                        0.38      0.22

Upravovanie názvov premenných

# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)

# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
  rename_with(~ abbreviate(.x, strict = FALSE))

# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))

# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
  Original_Name = old_names,
  Shortened_Name = names(udaje1)
)

print(comparison)
##                     Original_Name Shortened_Name
## 1                           YEARS           YEAR
## 2                       COMPANIES           COMP
## 3                 EXCHANGE.SECTOR           EXCH
## 4                PRIMARY.BUSINESS           PRIM
## 5                         TOBIN.Q           TOBI
## 6           MARKET.CAPITALIZATION           MARK
## 7                RETURN.ON.ASSETS           RETU
## 8                   DEBT.TO.ASSET           DEBT
## 9                       FIRM.SIZE           FIRM
## 10        SOCIAL.DISCLOSURE.INDEX           SOCI
## 11 ENVIRONMENTAL.DISCLOSURE.INDEX           ENVI
## 12    GOVERNANCE.DISCLOSURE.INDEX           GOVE
## 13                      ESG.INDEX           ESG.

Počiatočné čistenie obsahu databázy, imputácia chýbahúcich údajov

Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.

E3te raz si pozrime našu pôvodnú databázu s chýbajúcimi údajmi:

Databáza s chýbajúcimi údajmi
Databáza s chýbajúcimi údajmi
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Registered S3 method overwritten by 'car':
##   method           from
##   na.action.merMod lme4
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
# Count missing values in each column
print("pocet chybajucich udajov za jednotlive premenne")
## [1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG. 
##    1    0    0    1    0    1    0    1    1    1    1    1    0

Štatistika vyššie nám hovorí, koľko NA má ktorý stĺpec databázy. Ďalšie riadky nám hovoria o štruktúre záznamov, kde sa nachádzajú chýbajúce hodnoty. Posledný riadok hovorí o počte chýbajúcich údajov za jednotlivé premenné a za celú databázu. Máme 760 záznamov , z kotých 754 je úplných a mámo 8 chýbajúcich hodnôt. Podbnú informáciu nám dáva nasledovný graf.

# pattern of missingness
md.pattern(udaje1)

##     COMP EXCH TOBI RETU ESG. YEAR PRIM MARK DEBT FIRM SOCI ENVI GOVE  
## 754    1    1    1    1    1    1    1    1    1    1    1    1    1 0
## 1      1    1    1    1    1    1    1    1    1    1    1    0    1 1
## 1      1    1    1    1    1    1    1    1    1    0    1    1    1 1
## 1      1    1    1    1    1    1    1    1    0    1    0    1    0 3
## 1      1    1    1    1    1    1    1    0    1    1    1    1    1 1
## 1      1    1    1    1    1    1    0    1    1    1    1    1    1 1
## 1      1    1    1    1    1    0    1    1    1    1    1    1    1 1
##        0    0    0    0    0    1    1    1    1    1    1    1    1 8
# visualize missing data
aggr(udaje1, bars=FALSE,col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE)  # cervena farba signalizuje chybahuce polozky

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##      YEAR 0.001315789
##      PRIM 0.001315789
##      MARK 0.001315789
##      DEBT 0.001315789
##      FIRM 0.001315789
##      SOCI 0.001315789
##      ENVI 0.001315789
##      GOVE 0.001315789
##      COMP 0.000000000
##      EXCH 0.000000000
##      TOBI 0.000000000
##      RETU 0.000000000
##      ESG. 0.000000000
# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
imp <- mice(udaje1, seed=123)   # konkretne parametre imputacie vieme nastavovat - pozri help
## 
##  iter imp variable
##   1   1  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   1   2  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   1   3  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   1   4  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   1   5  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   2   1  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   2   2  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   2   3  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   2   4  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   2   5  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   3   1  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   3   2  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   3   3  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   3   4  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   3   5  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   4   1  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   4   2  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   4   3  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   4   4  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   4   5  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   5   1  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   5   2  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   5   3  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   5   4  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
##   5   5  YEAR  MARK  DEBT  FIRM  SOCI  ENVI  GOVE
## Warning: Number of logged events: 88
udaje_imputovane <- complete(imp, 1)
udaje1 <- udaje_imputovane
head(udaje1)
##   YEAR                               COMP              EXCH             PRIM
## 1 2013 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 2 2014 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 3 2015 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 4 2019 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 5 2017 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 6 2018 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
##   TOBI MARK  RETU  DEBT FIRM SOCI ENVI GOVE ESG.
## 1 1.08 6.44 14.41 36.85 6.46 0.36    0 0.38 0.25
## 2 0.96 6.36 11.19 41.01 6.53 0.29    0 0.38 0.22
## 3 0.75 6.20  5.74 34.32 6.51 0.29    0 0.38 0.22
## 4 0.63 6.17  3.51 40.16 6.56 0.29    0 0.38 0.22
## 5 0.60 5.66  9.78 43.69 6.63 0.29    0 0.38 0.22
## 6 0.70 6.24 13.43 40.72 6.65 0.29    0 0.38 0.22
rm(imp)
rm(udaje_imputovane)
print("pocet chybajucich udajov za jednotlive premenne")
## [1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG. 
##    0    0    0    1    0    0    0    0    0    0    0    0    0

Celkove nám teda ostala nevyplnená jedna premenná - textová - ktorá označuje Primary Business referencovanej firmy