Čistenie údajov

Predpokladajme, že v názvoch stĺpcov v pôvodnom súbore máme medzery, teda Náhľad na xls databázu otvorenú v tabuľkovom procesore

Tu sa v názvoch stĺpcov vyskytujú medzery. Názvy stĺpcov sa v prostredí R stávajú názvami premenných a tie nesmú byť súčasťou názvu premennej. Neprípustné znaky v názvoch premenných vo všeobecnosti môžeme nahradiť s pomocou knižnice janitor.

Import údajov

Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).

# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ";": variables are separated by semicolons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors

udaje1 <- read.csv2(
  "imputacia/udaje/ChybnaDatabaza.csv",
  header = TRUE,
  sep = ";",
  dec = ".",
  na.strings = c("", "NA"),
  stringsAsFactors = FALSE
)

# Show the first rows of the dataset
head(udaje1)

Upravovanie názvov premenných

# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)

# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
  rename_with(~ abbreviate(.x, strict = FALSE))

# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))

# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
  Original_Name = old_names,
  Shortened_Name = names(udaje1)
)

print(comparison)
##                     Original_Name Shortened_Name
## 1                           YEARS           YEAR
## 2                       COMPANIES           COMP
## 3                 EXCHANGE.SECTOR           EXCH
## 4                PRIMARY.BUSINESS           PRIM
## 5                         TOBIN.Q           TOBI
## 6           MARKET.CAPITALIZATION           MARK
## 7                RETURN.ON.ASSETS           RETU
## 8                   DEBT.TO.ASSET           DEBT
## 9                       FIRM.SIZE           FIRM
## 10        SOCIAL.DISCLOSURE.INDEX           SOCI
## 11 ENVIRONMENTAL.DISCLOSURE.INDEX           ENVI
## 12    GOVERNANCE.DISCLOSURE.INDEX           GOVE
## 13                      ESG.INDEX           ESG.

Počiatočné čistenie obsahu databázy, imputácia chýbahúcich údajov

Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.

E3te raz si pozrime našu pôvodnú databázu s chýbajúcimi údajmi:

Databáza s chýbajúcimi údajmi
Databáza s chýbajúcimi údajmi
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
# Count missing values in each column
print("pocet chybajucich udajov za jednotlive premenne")
## [1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG. 
##    1    9    9   10    9   11    9   17   11   10   10   11   11

Štatistika vyššie nám hovorí, koľko NA má ktorý stĺpec databázy. Ďalšie riadky nám hovoria o štruktúre záznamov, kde sa nachádzajú chýbajúce hodnoty. Posledný riadok hovorí o počte chýbajúcich údajov za jednotlivé premenné a za celú databázu. Máme 760 záznamov , z kotých 754 je úplných a mámo 8 chýbajúcich hodnôt. Podbnú informáciu nám dáva nasledovný graf.

# pattern of missingness
md.pattern(udaje1)

##     YEAR COMP EXCH TOBI RETU PRIM SOCI ENVI MARK FIRM GOVE ESG. DEBT    
## 744    1    1    1    1    1    1    1    1    1    1    1    1    1   0
## 6      1    1    1    1    1    1    1    1    1    1    1    1    0   1
## 1      1    1    1    1    1    1    1    1    1    1    1    0    1   1
## 1      1    1    1    1    1    1    1    1    1    1    1    0    0   2
## 1      1    1    1    1    1    1    1    1    1    1    0    1    1   1
## 1      1    1    1    1    1    1    1    1    1    0    1    1    1   1
## 2      1    1    1    1    1    1    1    1    0    1    1    1    1   1
## 1      1    1    1    1    1    1    1    0    1    1    1    1    1   1
## 1      1    1    1    1    1    1    0    1    1    1    0    1    0   3
## 1      1    1    1    1    1    0    1    1    1    0    1    1    1   2
## 9      1    0    0    0    0    0    0    0    0    0    0    0    0  12
## 1      0    1    1    1    1    1    1    1    1    1    1    1    1   1
##        1    9    9    9    9   10   10   10   11   11   11   11   17 128
# visualize missing data
aggr(udaje1, bars=FALSE,col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE)  # cervena farba signalizuje chybahuce polozky

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##      DEBT 0.02210663
##      MARK 0.01430429
##      FIRM 0.01430429
##      GOVE 0.01430429
##      ESG. 0.01430429
##      PRIM 0.01300390
##      SOCI 0.01300390
##      ENVI 0.01300390
##      COMP 0.01170351
##      EXCH 0.01170351
##      TOBI 0.01170351
##      RETU 0.01170351
##      YEAR 0.00130039
# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
imp <- mice(udaje1, seed=123)   # konkretne parametre imputacie vieme nastavovat - pozri help
## 
##  iter imp variable
##   1   1  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   1   2  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   1   3  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   1   4  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   1   5  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   2   1  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   2   2  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   2   3  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   2   4  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   2   5  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   3   1  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   3   2  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   3   3  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   3   4  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   3   5  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   4   1  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   4   2  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   4   3  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   4   4  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   4   5  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   5   1  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   5   2  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   5   3  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   5   4  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
##   5   5  YEAR  TOBI  MARK  RETU  DEBT  FIRM  SOCI  ENVI  GOVE  ESG.
## Warning: Number of logged events: 3
udaje_imputovane <- complete(imp, 1)
udaje1 <- udaje_imputovane
head(udaje1)
rm(imp)
rm(udaje_imputovane)
print("pocet chybajucich udajov za jednotlive premenne")
## [1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG. 
##    0    9    9   10    0    0    0    0    0    0    0    0    0

Celkove nám teda ostala nevyplnená jedna premenná - textová - ktorá označuje Primary Business referencovanej firmy

Tabuľky, grafy, jednoduché štatistiky

Grafy

ggplot2 - knižnica pre grafy

Výber a následné triedenie

library(dplyr)

udaje.2019 <- udaje1 %>%
  filter(YEAR == 2019) %>%
  dplyr::select(RETU, ESG., DEBT, FIRM)

Knižnica .ggplot2. je v súčasnosti najčastejšie používaná grafická knižnica, pričom predpripravené kódy k jednotlivým obrázkom si viete nájsť v R Graph Gallery. Tu si uvedieme jednoduchšie z nich.

Scatter plot

# Basic scatter plot
library(ggplot2)
ggplot(udaje.2019, aes(x = DEBT, y = ESG.)) +            # specifikacia osi
  geom_point() +                                                   # typ grafu 
  theme_minimal() +
  labs(title = "ESG index vs Debt", x = "Debt to Asset", y = "ESG Score")      # oznacenie osi

Boxplot

# Bar plot with grouping
library(ggplot2)

library(ggplot2)

ggplot(udaje1, aes(x = factor(YEAR), y = DEBT)) +        # specifikacia osi
  geom_boxplot(fill = "lightgreen", color = "blue") +      # typ grafu - boxplot
  labs(                                                       # oznacenie nazov grafu
    title = "Debt to Asset by Year",
    x = "Year",
    y = "Debt to Asset"
  ) +
  theme_minimal()

Základné štatistiky.

knitr - tabuľka

library(dplyr)
library(knitr)

# Summarise basic statistics
esg.stats <- udaje1 %>%
  filter(YEAR %in% 2015:2019) %>%
  group_by(YEAR) %>%
  summarise(
    n     = n(),
    mean  = mean(DEBT, na.rm = TRUE),
    sd    = sd(DEBT, na.rm = TRUE),
    min   = min(DEBT, na.rm = TRUE),
    q25   = quantile(DEBT, 0.25, na.rm = TRUE),
    median= median(DEBT, na.rm = TRUE),
    q75   = quantile(DEBT, 0.75, na.rm = TRUE),
    max   = max(DEBT, na.rm = TRUE),
    .groups = "drop"
  )

# Create knitr table
kable(esg.stats, digits = 2, caption = "Basic statistics of DEBT (2015-2019)")
Basic statistics of DEBT (2015-2019)
YEAR n mean sd min q25 median q75 max
2015 76 63.94 34.13 1.61 43.34 60.64 75.97 286.14
2016 75 66.64 45.92 1.67 43.36 63.49 77.93 380.77
2017 76 70.12 51.89 2.38 45.03 60.92 78.88 395.45
2018 77 68.40 46.28 -20.78 43.48 60.71 77.52 292.73
2019 76 69.48 45.66 -11.59 44.25 59.68 79.96 285.85

alebo krajšie tabuľky s pomocou .kableExtra.:

library(dplyr)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# Summarise basic statistics
esg.stats <- udaje1 %>%
  filter(YEAR %in% 2015:2019) %>%
  group_by(YEAR) %>%
  summarise(
    n      = n(),
    mean   = mean(DEBT, na.rm = TRUE),
    sd     = sd(DEBT, na.rm = TRUE),
    min    = min(DEBT, na.rm = TRUE),
    q25    = quantile(DEBT, 0.25, na.rm = TRUE),
    median = median(DEBT, na.rm = TRUE),
    q75    = quantile(DEBT, 0.75, na.rm = TRUE),
    max    = max(DEBT, na.rm = TRUE),
    .groups = "drop"
  )

# Create styled kableExtra table
esg.stats %>%
  kable(digits = 1, caption = "Basic statistics of DEBT (2015–2019)") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "condensed", "hover")) %>%
  column_spec(1, bold = TRUE) %>%          # make YEAR bold
  row_spec(0, bold = TRUE, background = "#d9edf7") %>%  # style header row
  add_header_above(c(" " = 1, "DEBT statistics" = 8))
Basic statistics of DEBT (2015–2019)
DEBT statistics
YEAR n mean sd min q25 median q75 max
2015 76 63.9 34.1 1.6 43.3 60.6 76.0 286.1
2016 75 66.6 45.9 1.7 43.4 63.5 77.9 380.8
2017 76 70.1 51.9 2.4 45.0 60.9 78.9 395.4
2018 77 68.4 46.3 -20.8 43.5 60.7 77.5 292.7
2019 76 69.5 45.7 -11.6 44.3 59.7 80.0 285.9

Testovanie hypotéz

t-test: Porovnanie priemeru ESG indexu v rokoch 2015 a 2019

t.test.result <- t.test(
  udaje1$ESG.[udaje1$YEAR == 2015],
  udaje1$ESG.[udaje1$YEAR == 2019]
)

print(t.test.result)
## 
##  Welch Two Sample t-test
## 
## data:  udaje1$ESG.[udaje1$YEAR == 2015] and udaje1$ESG.[udaje1$YEAR == 2019]
## t = -2.0364, df = 147.87, p-value = 0.04349
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.081667223 -0.001227514
## sample estimates:
## mean of x mean of y 
## 0.2510526 0.2925000

ANOVA: Comparing Reading Scores Across Programs

anova.result <- aov(ESG. ~ YEAR, data = udaje1)
summary(anova.result)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## YEAR          1  0.457  0.4575   20.33 7.53e-06 ***
## Residuals   767 17.258  0.0225                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear Regression: Predicting Math Scores

library(dplyr)

udaje.2015 <- udaje1 %>%
  filter(YEAR == 2015)

model <- lm(ESG. ~ RETU + FIRM + DEBT, data = udaje.2015)
summary(model)
## 
## Call:
## lm(formula = ESG. ~ RETU + FIRM + DEBT, data = udaje.2015)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.171445 -0.050215 -0.000873  0.050132  0.263040 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.614e-01  9.101e-02  -3.971 0.000168 ***
## RETU        -3.911e-06  8.977e-06  -0.436 0.664397    
## FIRM         8.932e-02  1.251e-02   7.137 6.17e-10 ***
## DEBT        -2.953e-04  3.119e-04  -0.947 0.346871    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0916 on 72 degrees of freedom
## Multiple R-squared:  0.4185, Adjusted R-squared:  0.3943 
## F-statistic: 17.27 on 3 and 72 DF,  p-value: 1.504e-08