Predpokladajme, že v názvoch stĺpcov v pôvodnom súbore máme medzery,
teda 
Tu sa v názvoch stĺpcov vyskytujú medzery. Názvy stĺpcov sa v prostredí R stávajú názvami premenných a tie nesmú byť súčasťou názvu premennej. Neprípustné znaky v názvoch premenných vo všeobecnosti môžeme nahradiť s pomocou knižnice janitor.
Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).
# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ";": variables are separated by semicolons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors
udaje1 <- read.csv2(
"imputacia/udaje/ChybnaDatabaza.csv",
header = TRUE,
sep = ";",
dec = ".",
na.strings = c("", "NA"),
stringsAsFactors = FALSE
)
# Show the first rows of the dataset
head(udaje1)
# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)
# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
rename_with(~ abbreviate(.x, strict = FALSE))
# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))
# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
Original_Name = old_names,
Shortened_Name = names(udaje1)
)
print(comparison)
## Original_Name Shortened_Name
## 1 YEARS YEAR
## 2 COMPANIES COMP
## 3 EXCHANGE.SECTOR EXCH
## 4 PRIMARY.BUSINESS PRIM
## 5 TOBIN.Q TOBI
## 6 MARKET.CAPITALIZATION MARK
## 7 RETURN.ON.ASSETS RETU
## 8 DEBT.TO.ASSET DEBT
## 9 FIRM.SIZE FIRM
## 10 SOCIAL.DISCLOSURE.INDEX SOCI
## 11 ENVIRONMENTAL.DISCLOSURE.INDEX ENVI
## 12 GOVERNANCE.DISCLOSURE.INDEX GOVE
## 13 ESG.INDEX ESG.
Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.
E3te raz si pozrime našu pôvodnú databázu s chýbajúcimi údajmi:
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# Count missing values in each column
print("pocet chybajucich udajov za jednotlive premenne")
## [1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 9 9 10 9 11 9 17 11 10 10 11 11
Štatistika vyššie nám hovorí, koľko NA má ktorý stĺpec databázy. Ďalšie riadky nám hovoria o štruktúre záznamov, kde sa nachádzajú chýbajúce hodnoty. Posledný riadok hovorí o počte chýbajúcich údajov za jednotlivé premenné a za celú databázu. Máme 760 záznamov , z kotých 754 je úplných a mámo 8 chýbajúcich hodnôt. Podbnú informáciu nám dáva nasledovný graf.
# pattern of missingness
md.pattern(udaje1)
## YEAR COMP EXCH TOBI RETU PRIM SOCI ENVI MARK FIRM GOVE ESG. DEBT
## 744 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 6 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2
## 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 0 1 0 3
## 1 1 1 1 1 1 0 1 1 1 0 1 1 1 2
## 9 1 0 0 0 0 0 0 0 0 0 0 0 0 12
## 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 9 9 9 9 10 10 10 11 11 11 11 17 128
# visualize missing data
aggr(udaje1, bars=FALSE,col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE) # cervena farba signalizuje chybahuce polozky
##
## Variables sorted by number of missings:
## Variable Count
## DEBT 0.02210663
## MARK 0.01430429
## FIRM 0.01430429
## GOVE 0.01430429
## ESG. 0.01430429
## PRIM 0.01300390
## SOCI 0.01300390
## ENVI 0.01300390
## COMP 0.01170351
## EXCH 0.01170351
## TOBI 0.01170351
## RETU 0.01170351
## YEAR 0.00130039
# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
imp <- mice(udaje1, seed=123) # konkretne parametre imputacie vieme nastavovat - pozri help
##
## iter imp variable
## 1 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## Warning: Number of logged events: 3
udaje_imputovane <- complete(imp, 1)
udaje1 <- udaje_imputovane
head(udaje1)
rm(imp)
rm(udaje_imputovane)
print("pocet chybajucich udajov za jednotlive premenne")
## [1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 0 9 9 10 0 0 0 0 0 0 0 0 0
Celkove nám teda ostala nevyplnená jedna premenná - textová - ktorá označuje Primary Business referencovanej firmy
Výber a následné triedenie
library(dplyr)
udaje.2019 <- udaje1 %>%
filter(YEAR == 2019) %>%
dplyr::select(RETU, ESG., DEBT, FIRM)
Knižnica .ggplot2. je v súčasnosti najčastejšie používaná grafická knižnica, pričom predpripravené kódy k jednotlivým obrázkom si viete nájsť v R Graph Gallery. Tu si uvedieme jednoduchšie z nich.
# Basic scatter plot
library(ggplot2)
ggplot(udaje.2019, aes(x = DEBT, y = ESG.)) + # specifikacia osi
geom_point() + # typ grafu
theme_minimal() +
labs(title = "ESG index vs Debt", x = "Debt to Asset", y = "ESG Score") # oznacenie osi
# Bar plot with grouping
library(ggplot2)
library(ggplot2)
ggplot(udaje1, aes(x = factor(YEAR), y = DEBT)) + # specifikacia osi
geom_boxplot(fill = "lightgreen", color = "blue") + # typ grafu - boxplot
labs( # oznacenie nazov grafu
title = "Debt to Asset by Year",
x = "Year",
y = "Debt to Asset"
) +
theme_minimal()
library(dplyr)
library(knitr)
# Summarise basic statistics
esg.stats <- udaje1 %>%
filter(YEAR %in% 2015:2019) %>%
group_by(YEAR) %>%
summarise(
n = n(),
mean = mean(DEBT, na.rm = TRUE),
sd = sd(DEBT, na.rm = TRUE),
min = min(DEBT, na.rm = TRUE),
q25 = quantile(DEBT, 0.25, na.rm = TRUE),
median= median(DEBT, na.rm = TRUE),
q75 = quantile(DEBT, 0.75, na.rm = TRUE),
max = max(DEBT, na.rm = TRUE),
.groups = "drop"
)
# Create knitr table
kable(esg.stats, digits = 2, caption = "Basic statistics of DEBT (2015-2019)")
| YEAR | n | mean | sd | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|---|---|
| 2015 | 76 | 63.94 | 34.13 | 1.61 | 43.34 | 60.64 | 75.97 | 286.14 |
| 2016 | 75 | 66.64 | 45.92 | 1.67 | 43.36 | 63.49 | 77.93 | 380.77 |
| 2017 | 76 | 70.12 | 51.89 | 2.38 | 45.03 | 60.92 | 78.88 | 395.45 |
| 2018 | 77 | 68.40 | 46.28 | -20.78 | 43.48 | 60.71 | 77.52 | 292.73 |
| 2019 | 76 | 69.48 | 45.66 | -11.59 | 44.25 | 59.68 | 79.96 | 285.85 |
alebo krajšie tabuľky s pomocou .kableExtra.:
library(dplyr)
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# Summarise basic statistics
esg.stats <- udaje1 %>%
filter(YEAR %in% 2015:2019) %>%
group_by(YEAR) %>%
summarise(
n = n(),
mean = mean(DEBT, na.rm = TRUE),
sd = sd(DEBT, na.rm = TRUE),
min = min(DEBT, na.rm = TRUE),
q25 = quantile(DEBT, 0.25, na.rm = TRUE),
median = median(DEBT, na.rm = TRUE),
q75 = quantile(DEBT, 0.75, na.rm = TRUE),
max = max(DEBT, na.rm = TRUE),
.groups = "drop"
)
# Create styled kableExtra table
esg.stats %>%
kable(digits = 1, caption = "Basic statistics of DEBT (2015–2019)") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "condensed", "hover")) %>%
column_spec(1, bold = TRUE) %>% # make YEAR bold
row_spec(0, bold = TRUE, background = "#d9edf7") %>% # style header row
add_header_above(c(" " = 1, "DEBT statistics" = 8))
| YEAR | n | mean | sd | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|---|---|
| 2015 | 76 | 63.9 | 34.1 | 1.6 | 43.3 | 60.6 | 76.0 | 286.1 |
| 2016 | 75 | 66.6 | 45.9 | 1.7 | 43.4 | 63.5 | 77.9 | 380.8 |
| 2017 | 76 | 70.1 | 51.9 | 2.4 | 45.0 | 60.9 | 78.9 | 395.4 |
| 2018 | 77 | 68.4 | 46.3 | -20.8 | 43.5 | 60.7 | 77.5 | 292.7 |
| 2019 | 76 | 69.5 | 45.7 | -11.6 | 44.3 | 59.7 | 80.0 | 285.9 |
t.test.result <- t.test(
udaje1$ESG.[udaje1$YEAR == 2015],
udaje1$ESG.[udaje1$YEAR == 2019]
)
print(t.test.result)
##
## Welch Two Sample t-test
##
## data: udaje1$ESG.[udaje1$YEAR == 2015] and udaje1$ESG.[udaje1$YEAR == 2019]
## t = -2.0364, df = 147.87, p-value = 0.04349
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.081667223 -0.001227514
## sample estimates:
## mean of x mean of y
## 0.2510526 0.2925000
anova.result <- aov(ESG. ~ YEAR, data = udaje1)
summary(anova.result)
## Df Sum Sq Mean Sq F value Pr(>F)
## YEAR 1 0.457 0.4575 20.33 7.53e-06 ***
## Residuals 767 17.258 0.0225
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(dplyr)
udaje.2015 <- udaje1 %>%
filter(YEAR == 2015)
model <- lm(ESG. ~ RETU + FIRM + DEBT, data = udaje.2015)
summary(model)
##
## Call:
## lm(formula = ESG. ~ RETU + FIRM + DEBT, data = udaje.2015)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.171445 -0.050215 -0.000873 0.050132 0.263040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.614e-01 9.101e-02 -3.971 0.000168 ***
## RETU -3.911e-06 8.977e-06 -0.436 0.664397
## FIRM 8.932e-02 1.251e-02 7.137 6.17e-10 ***
## DEBT -2.953e-04 3.119e-04 -0.947 0.346871
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0916 on 72 degrees of freedom
## Multiple R-squared: 0.4185, Adjusted R-squared: 0.3943
## F-statistic: 17.27 on 3 and 72 DF, p-value: 1.504e-08
udaje <- read.csv2("imputacia/udaje/ChybnaDatabaza.csv", header=TRUE, sep=";", dec=".")
names(udaje) <- make.names(names(udaje))
model <- lm(ESG.INDEX ~ FIRM.SIZE + DEBT.TO.ASSET + MARKET.CAPITALIZATION, data = udaje)
print("Odhadnuté koeficienty sú: ")
## [1] "Odhadnuté koeficienty sú: "
print(model$coefficients)
## (Intercept) FIRM.SIZE DEBT.TO.ASSET
## -0.3404672099 0.0325900066 -0.0001929163
## MARKET.CAPITALIZATION
## 0.0591326315
print("Odhadnuté rezíduá: ")
## [1] "Odhadnuté rezíduá: "
print(model$residuals[1:11])
## 1 2 3 4 6 8
## 0.006230587 -0.020517572 -0.011695161 -0.010424051 -0.017388402 -0.010100773
## 28 29 31 32 33
## 0.228798099 0.399326274 0.078599913 0.062282320 0.085645461
print("Vyrovnané hodnoty vysvetľovanej premennej sú: ")
## [1] "Vyrovnané hodnoty vysvetľovanej premennej sú: "
print(model$fitted.values[1:11])
## 1 2 3 4 6 8 28 29
## 0.2437694 0.2405176 0.2316952 0.2304241 0.2373884 0.2601008 0.3812019 0.4006737
## 31 32 33
## 0.2614001 0.2777177 0.2543545
summary(model)
##
## Call:
## lm(formula = ESG.INDEX ~ FIRM.SIZE + DEBT.TO.ASSET + MARKET.CAPITALIZATION,
## data = udaje)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31085 -0.07647 -0.01489 0.05827 0.53103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.405e-01 3.628e-02 -9.384 < 2e-16 ***
## FIRM.SIZE 3.259e-02 9.107e-03 3.579 0.000368 ***
## DEBT.TO.ASSET -1.929e-04 9.991e-05 -1.931 0.053885 .
## MARKET.CAPITALIZATION 5.913e-02 7.843e-03 7.540 1.37e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1175 on 743 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.3672, Adjusted R-squared: 0.3647
## F-statistic: 143.7 on 3 and 743 DF, p-value: < 2.2e-16
Do modelu som pridal premennú MARKET.CAPITALIZATION (trhová hodnota firmy). Ukázalo sa, že má silný pozitívny vplyv na ESG index. Premenná FIRM.SIZE má tiež pozitívny vplyv, ale slabší. Premenná DEBT.TO.ASSET má negatívny vplyv, ale len slabý a nie úplne jednoznačný. Po pridaní MARKET.CAPITALIZATION sa model výrazne zlepšil.