Predpokladajme, že v názvoch stĺpcov v pôvodnom súbore máme medzery,
teda 
Tu sa v názvoch stĺpcov vyskytujú medzery. Názvy stĺpcov sa v prostredí R stávajú názvami premenných a tie nesmú byť súčasťou názvu premennej. Neprípustné znaky v názvoch premenných vo všeobecnosti môžeme nahradiť s pomocou knižnice janitor.
Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).
# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ";": variables are separated by semicolons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors
udaje1 <- read.csv2(
"imputacia/udaje/ChybnaDatabaza.csv",
header = TRUE,
sep = ";",
dec = ".",
na.strings = c("", "NA"),
stringsAsFactors = FALSE
)
# Show the first rows of the dataset
head(udaje1)
# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)
# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
rename_with(~ abbreviate(.x, strict = FALSE))
# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))
# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
Original_Name = old_names,
Shortened_Name = names(udaje1)
)
print(comparison)
## Original_Name Shortened_Name
## 1 YEARS YEAR
## 2 COMPANIES COMP
## 3 EXCHANGE.SECTOR EXCH
## 4 PRIMARY.BUSINESS PRIM
## 5 TOBIN.Q TOBI
## 6 MARKET.CAPITALIZATION MARK
## 7 RETURN.ON.ASSETS RETU
## 8 DEBT.TO.ASSET DEBT
## 9 FIRM.SIZE FIRM
## 10 SOCIAL.DISCLOSURE.INDEX SOCI
## 11 ENVIRONMENTAL.DISCLOSURE.INDEX ENVI
## 12 GOVERNANCE.DISCLOSURE.INDEX GOVE
## 13 ESG.INDEX ESG.
Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.
E3te raz si pozrime našu pôvodnú databázu s chýbajúcimi údajmi:
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# Count missing values in each column
print("pocet chybajucich udajov za jednotlive premenne")
## [1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 9 9 10 9 11 9 17 11 10 10 11 11
Štatistika vyššie nám hovorí, koľko NA má ktorý stĺpec databázy. Ďalšie riadky nám hovoria o štruktúre záznamov, kde sa nachádzajú chýbajúce hodnoty. Posledný riadok hovorí o počte chýbajúcich údajov za jednotlivé premenné a za celú databázu. Máme 760 záznamov , z kotých 754 je úplných a mámo 8 chýbajúcich hodnôt. Podbnú informáciu nám dáva nasledovný graf.
# pattern of missingness
md.pattern(udaje1)
## YEAR COMP EXCH TOBI RETU PRIM SOCI ENVI MARK FIRM GOVE ESG. DEBT
## 744 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 6 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2
## 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 0 1 0 3
## 1 1 1 1 1 1 0 1 1 1 0 1 1 1 2
## 9 1 0 0 0 0 0 0 0 0 0 0 0 0 12
## 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 9 9 9 9 10 10 10 11 11 11 11 17 128
# visualize missing data
aggr(udaje1, bars=FALSE,col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE) # cervena farba signalizuje chybahuce polozky
##
## Variables sorted by number of missings:
## Variable Count
## DEBT 0.02210663
## MARK 0.01430429
## FIRM 0.01430429
## GOVE 0.01430429
## ESG. 0.01430429
## PRIM 0.01300390
## SOCI 0.01300390
## ENVI 0.01300390
## COMP 0.01170351
## EXCH 0.01170351
## TOBI 0.01170351
## RETU 0.01170351
## YEAR 0.00130039
# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
imp <- mice(udaje1, seed=123) # konkretne parametre imputacie vieme nastavovat - pozri help
##
## iter imp variable
## 1 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 1 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 2 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 3 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 4 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 1 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 2 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 3 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 4 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 5 5 YEAR TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## Warning: Number of logged events: 3
udaje_imputovane <- complete(imp, 1)
udaje1 <- udaje_imputovane
head(udaje1)
rm(imp)
rm(udaje_imputovane)
print("pocet chybajucich udajov za jednotlive premenne")
## [1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
## YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
## 0 9 9 10 0 0 0 0 0 0 0 0 0
Celkove nám teda ostala nevyplnená jedna premenná - textová - ktorá označuje Primary Business referencovanej firmy
Výber a následné triedenie
library(dplyr)
udaje.2019 <- udaje1 %>%
filter(YEAR == 2019) %>%
dplyr::select(RETU, ESG., DEBT, FIRM)
Knižnica .ggplot2. je v súčasnosti najčastejšie používaná grafická knižnica, pričom predpripravené kódy k jednotlivým obrázkom si viete nájsť v R Graph Gallery. Tu si uvedieme jednoduchšie z nich.
# Basic scatter plot
library(ggplot2)
ggplot(udaje.2019, aes(x = DEBT, y = ESG.)) + # specifikacia osi
geom_point() + # typ grafu
theme_minimal() +
labs(title = "ESG index vs Debt", x = "Debt to Asset", y = "ESG Score") # oznacenie osi
# Bar plot with grouping
library(ggplot2)
library(ggplot2)
ggplot(udaje1, aes(x = factor(YEAR), y = DEBT)) + # specifikacia osi
geom_boxplot(fill = "lightgreen", color = "blue") + # typ grafu - boxplot
labs( # oznacenie nazov grafu
title = "Debt to Asset by Year",
x = "Year",
y = "Debt to Asset"
) +
theme_minimal()
library(dplyr)
library(knitr)
# Summarise basic statistics
esg.stats <- udaje1 %>%
filter(YEAR %in% 2015:2019) %>%
group_by(YEAR) %>%
summarise(
n = n(),
mean = mean(DEBT, na.rm = TRUE),
sd = sd(DEBT, na.rm = TRUE),
min = min(DEBT, na.rm = TRUE),
q25 = quantile(DEBT, 0.25, na.rm = TRUE),
median= median(DEBT, na.rm = TRUE),
q75 = quantile(DEBT, 0.75, na.rm = TRUE),
max = max(DEBT, na.rm = TRUE),
.groups = "drop"
)
# Create knitr table
kable(esg.stats, digits = 2, caption = "Basic statistics of DEBT (2015-2019)")
| YEAR | n | mean | sd | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|---|---|
| 2015 | 76 | 63.94 | 34.13 | 1.61 | 43.34 | 60.64 | 75.97 | 286.14 |
| 2016 | 75 | 66.64 | 45.92 | 1.67 | 43.36 | 63.49 | 77.93 | 380.77 |
| 2017 | 76 | 70.12 | 51.89 | 2.38 | 45.03 | 60.92 | 78.88 | 395.45 |
| 2018 | 77 | 68.40 | 46.28 | -20.78 | 43.48 | 60.71 | 77.52 | 292.73 |
| 2019 | 76 | 69.48 | 45.66 | -11.59 | 44.25 | 59.68 | 79.96 | 285.85 |
alebo krajšie tabuľky s pomocou .kableExtra.:
library(dplyr)
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# Summarise basic statistics
esg.stats <- udaje1 %>%
filter(YEAR %in% 2015:2019) %>%
group_by(YEAR) %>%
summarise(
n = n(),
mean = mean(DEBT, na.rm = TRUE),
sd = sd(DEBT, na.rm = TRUE),
min = min(DEBT, na.rm = TRUE),
q25 = quantile(DEBT, 0.25, na.rm = TRUE),
median = median(DEBT, na.rm = TRUE),
q75 = quantile(DEBT, 0.75, na.rm = TRUE),
max = max(DEBT, na.rm = TRUE),
.groups = "drop"
)
# Create styled kableExtra table
esg.stats %>%
kable(digits = 1, caption = "Basic statistics of DEBT (2015–2019)") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "condensed", "hover")) %>%
column_spec(1, bold = TRUE) %>% # make YEAR bold
row_spec(0, bold = TRUE, background = "#d9edf7") %>% # style header row
add_header_above(c(" " = 1, "DEBT statistics" = 8))
| YEAR | n | mean | sd | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|---|---|
| 2015 | 76 | 63.9 | 34.1 | 1.6 | 43.3 | 60.6 | 76.0 | 286.1 |
| 2016 | 75 | 66.6 | 45.9 | 1.7 | 43.4 | 63.5 | 77.9 | 380.8 |
| 2017 | 76 | 70.1 | 51.9 | 2.4 | 45.0 | 60.9 | 78.9 | 395.4 |
| 2018 | 77 | 68.4 | 46.3 | -20.8 | 43.5 | 60.7 | 77.5 | 292.7 |
| 2019 | 76 | 69.5 | 45.7 | -11.6 | 44.3 | 59.7 | 80.0 | 285.9 |
t.test.result <- t.test(
udaje1$ESG.[udaje1$YEAR == 2015],
udaje1$ESG.[udaje1$YEAR == 2019]
)
print(t.test.result)
##
## Welch Two Sample t-test
##
## data: udaje1$ESG.[udaje1$YEAR == 2015] and udaje1$ESG.[udaje1$YEAR == 2019]
## t = -2.0364, df = 147.87, p-value = 0.04349
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.081667223 -0.001227514
## sample estimates:
## mean of x mean of y
## 0.2510526 0.2925000
anova.result <- aov(ESG. ~ YEAR, data = udaje1)
summary(anova.result)
## Df Sum Sq Mean Sq F value Pr(>F)
## YEAR 1 0.457 0.4575 20.33 7.53e-06 ***
## Residuals 767 17.258 0.0225
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(dplyr)
udaje.2015 <- udaje1 %>%
filter(YEAR == 2015)
model <- lm(ESG. ~ RETU + FIRM + DEBT, data = udaje.2015)
summary(model)
##
## Call:
## lm(formula = ESG. ~ RETU + FIRM + DEBT, data = udaje.2015)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.171445 -0.050215 -0.000873 0.050132 0.263040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.614e-01 9.101e-02 -3.971 0.000168 ***
## RETU -3.911e-06 8.977e-06 -0.436 0.664397
## FIRM 8.932e-02 1.251e-02 7.137 6.17e-10 ***
## DEBT -2.953e-04 3.119e-04 -0.947 0.346871
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0916 on 72 degrees of freedom
## Multiple R-squared: 0.4185, Adjusted R-squared: 0.3943
## F-statistic: 17.27 on 3 and 72 DF, p-value: 1.504e-08