Predpokladajme, že v názvoch stĺpcov v pôvodnom súbore máme medzery,
teda
Tu sa v názvoch stĺpcov vyskytujú medzery. Názvy stĺpcov sa v prostredí R stávajú názvami premenných a tie nesmú byť súčasťou názvu premennej. Neprípustné znaky v názvoch premenných vo všeobecnosti môžeme nahradiť s pomocou knižnice janitor.
Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).
# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ";": variables are separated by semicolons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors
udaje1 <- read.csv2(
"udaje/ChybnaDatabaza.csv",
header = TRUE,
sep = ";",
dec = ".",
na.strings = c("", "NA"),
stringsAsFactors = FALSE
)
# Show the first rows of the dataset
head(udaje1)## YEARS COMPANIES EXCHANGE.SECTOR PRIMARY.BUSINESS
## 1 2013 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 2 2014 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 3 2015 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 4 NA Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 5 2017 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 6 2018 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## TOBIN.Q MARKET.CAPITALIZATION RETURN.ON.ASSETS DEBT.TO.ASSET FIRM.SIZE
## 1 1.08 6.44 14.41 36.85 6.46
## 2 0.96 6.36 11.19 41.01 6.53
## 3 0.75 6.20 5.74 34.32 6.51
## 4 0.63 6.17 3.51 40.16 6.56
## 5 0.60 NA 9.78 43.69 6.63
## 6 0.70 6.24 13.43 40.72 6.65
## SOCIAL.DISCLOSURE.INDEX ENVIRONMENTAL.DISCLOSURE.INDEX
## 1 0.36 NA
## 2 0.29 0
## 3 0.29 0
## 4 0.29 0
## 5 0.29 0
## 6 0.29 0
## GOVERNANCE.DISCLOSURE.INDEX ESG.INDEX
## 1 0.38 0.25
## 2 0.38 0.22
## 3 0.38 0.22
## 4 0.38 0.22
## 5 0.38 0.22
## 6 0.38 0.22
# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)
# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
rename_with(~ abbreviate(.x, strict = FALSE))
# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))
# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
Original_Name = old_names,
Shortened_Name = names(udaje1)
)
print(comparison)## Original_Name Shortened_Name
## 1 YEARS YEAR
## 2 COMPANIES COMP
## 3 EXCHANGE.SECTOR EXCH
## 4 PRIMARY.BUSINESS PRIM
## 5 TOBIN.Q TOBI
## 6 MARKET.CAPITALIZATION MARK
## 7 RETURN.ON.ASSETS RETU
## 8 DEBT.TO.ASSET DEBT
## 9 FIRM.SIZE FIRM
## 10 SOCIAL.DISCLOSURE.INDEX SOCI
## 11 ENVIRONMENTAL.DISCLOSURE.INDEX ENVI
## 12 GOVERNANCE.DISCLOSURE.INDEX GOVE
## 13 ESG.INDEX ESG.
Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.
E3te raz si pozrime našu pôvodnú databázu s chýbajúcimi údajmi:
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
## Loading required package: colorspace
## Loading required package: grid
## Registered S3 method overwritten by 'car':
## method from
## na.action.merMod lme4
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
## [1] "pocet chybajucich udajov za jednotlive premenne"
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 0 0 1 0 1 0 1 1 1 1 1 0
Štatistika vyššie nám hovorí, koľko NA má ktorý stĺpec databázy. Ďalšie riadky nám hovoria o štruktúre záznamov, kde sa nachádzajú chýbajúce hodnoty. Posledný riadok hovorí o počte chýbajúcich údajov za jednotlivé premenné a za celú databázu. Máme 760 záznamov , z kotých 754 je úplných a mámo 8 chýbajúcich hodnôt. Podbnú informáciu nám dáva nasledovný graf.
## COMP EXCH TOBI RETU ESG. YEAR PRIM MARK DEBT FIRM SOCI ENVI GOVE
## 754 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 0 3
## 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
## 0 0 0 0 0 1 1 1 1 1 1 1 1 8
# visualize missing data
aggr(udaje1, bars=FALSE,col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE) # cervena farba signalizuje chybahuce polozky##
## Variables sorted by number of missings:
## Variable Count
## YEAR 0.001315789
## PRIM 0.001315789
## MARK 0.001315789
## DEBT 0.001315789
## FIRM 0.001315789
## SOCI 0.001315789
## ENVI 0.001315789
## GOVE 0.001315789
## COMP 0.000000000
## EXCH 0.000000000
## TOBI 0.000000000
## RETU 0.000000000
## ESG. 0.000000000
# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
imp <- mice(udaje1, seed=123) # konkretne parametre imputacie vieme nastavovat - pozri help##
## iter imp variable
## 1 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 1 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 1 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 1 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 1 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 2 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 2 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 2 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 2 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 2 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 3 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 3 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 3 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 3 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 3 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 4 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 4 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 4 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 4 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 4 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 5 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 5 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 5 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 5 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## 5 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
## Warning: Number of logged events: 88
## YEAR COMP EXCH PRIM
## 1 2013 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 2 2014 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 3 2015 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 4 2019 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 5 2017 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## 6 2018 Industrial & Medical Gases Nigeria Natural Resources Industrial Gases
## TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 1.08 6.44 14.41 36.85 6.46 0.36 0 0.38 0.25
## 2 0.96 6.36 11.19 41.01 6.53 0.29 0 0.38 0.22
## 3 0.75 6.20 5.74 34.32 6.51 0.29 0 0.38 0.22
## 4 0.63 6.17 3.51 40.16 6.56 0.29 0 0.38 0.22
## 5 0.60 5.66 9.78 43.69 6.63 0.29 0 0.38 0.22
## 6 0.70 6.24 13.43 40.72 6.65 0.29 0 0.38 0.22
## [1] "pocet chybajucich udajov za jednotlive premenne"
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 0 0 0 1 0 0 0 0 0 0 0 0 0
Celkove nám teda ostala nevyplnená jedna premenná - textová - ktorá označuje Primary Business referencovanej firmy