Predpokladajme, že v názvoch stĺpcov v pôvodnom súbore máme medzery,
teda
Tu sa v názvoch stĺpcov vyskytujú medzery. Názvy stĺpcov sa v prostredí R stávajú názvami premenných a tie nesmú byť súčasťou názvu premennej. Neprípustné znaky v názvoch premenných vo všeobecnosti môžeme nahradiť s pomocou knižnice janitor.
Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).
# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ";": variables are separated by semicolons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors
udaje1 <- read.csv2(
"udaje/ChybnaDatabaza.csv",
header = TRUE,
sep = ";",
dec = ".",
na.strings = c("", "NA"),
stringsAsFactors = FALSE
)
# Show the first rows of the dataset
head(udaje1)
NA
NA
# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)
# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)
# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
rename_with(~ abbreviate(.x, strict = FALSE))
# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))
# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
Original_Name = old_names,
Shortened_Name = names(udaje1)
)
print(comparison)
NA
NA
Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.
E3te raz si pozrime našu pôvodnú databázu s chýbajúcimi údajmi:
library(mice)
library(VIM)
# Count missing values in each column
print("pocet chybajucich udajov za jednotlive premenne")
[1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
1 0 0 1 0 1 0 1 1 1 1 1 0
Štatistika vyššie nám hovorí, koľko NA má ktorý stĺpec databázy. Ďalšie riadky nám hovoria o štruktúre záznamov, kde sa nachádzajú chýbajúce hodnoty. Posledný riadok hovorí o počte chýbajúcich údajov za jednotlivé premenné a za celú databázu. Máme 760 záznamov , z kotých 754 je úplných a mámo 8 chýbajúcich hodnôt. Podbnú informáciu nám dáva nasledovný graf.
# pattern of missingness
md.pattern(udaje1)
COMP EXCH TOBI RETU ESG. YEAR PRIM MARK DEBT FIRM SOCI ENVI GOVE
754 1 1 1 1 1 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
1 1 1 1 1 1 1 1 1 0 1 0 1 0 3
1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1 1 1 1 8
# visualize missing data
aggr(udaje1, bars=FALSE,col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE) # cervena farba signalizuje chybahuce polozky
Variables sorted by number of missings:
# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
imp <- mice(udaje1, seed=123) # konkretne parametre imputacie vieme nastavovat - pozri help
iter imp variable
1 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
1 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
1 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
1 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
1 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
2 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
2 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
2 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
2 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
2 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
3 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
3 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
3 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
3 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
3 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
4 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
4 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
4 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
4 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
4 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
5 1 YEAR MARK DEBT FIRM SOCI ENVI GOVE
5 2 YEAR MARK DEBT FIRM SOCI ENVI GOVE
5 3 YEAR MARK DEBT FIRM SOCI ENVI GOVE
5 4 YEAR MARK DEBT FIRM SOCI ENVI GOVE
5 5 YEAR MARK DEBT FIRM SOCI ENVI GOVE
udaje_imputovane <- complete(imp, 1)
udaje1 <- udaje_imputovane
head(udaje1)
rm(imp)
rm(udaje_imputovane)
print("pocet chybajucich udajov za jednotlive premenne")
[1] "pocet chybajucich udajov za jednotlive premenne"
colSums(is.na(udaje1))
YEAR COMP EXCH PRIM TOBI MARK RETU DEBT FIRM SOCI ENVI GOVE ESG.
0 0 0 1 0 0 0 0 0 0 0 0 0
Celkove nám teda ostala nevyplnená jedna premenná - textová - ktorá označuje Primary Business referencovanej firmy
Výber a následné triedenie
library(dplyr)
udaje.2013 <- udaje1 %>%
filter(YEAR == 2013) %>%
dplyr::select(RETU, ESG., DEBT, FIRM)
Knižnica .ggplot2. je v súčasnosti najčastejšie používaná grafická knižnica, pričom predpripravené kódy k jednotlivým obrázkom si viete nájsť v R Graph Gallery. Tu si uvedieme jednoduchšie z nich.
# Basic scatter plot
library(ggplot2)
ggplot(udaje.2013, aes(x = FIRM, y = ESG.)) + # specifikacia osi
geom_point() + # typ grafu - scatterplot
theme_minimal() +
labs(title = "ESG index", x = "Veľkosť firmy", y = "Score") # oznacenie osi
# Bar plot with grouping
library(ggplot2)
library(ggplot2)
ggplot(udaje1, aes(x = factor(YEAR), y = ESG.)) + # specifikacia osi
geom_boxplot(fill = "lightblue", color = "darkblue") + # typ grafu - boxplot
labs( # oznacenie osi, nazov grafu
title = "ESG Index by YEAR",
x = "Year",
y = "ESG Index"
) +
theme_minimal()
library(dplyr)
library(knitr)
# Summarise basic statistics
esg.stats <- udaje1 %>%
filter(YEAR %in% 2013:2016) %>%
group_by(YEAR) %>%
summarise(
n = n(),
mean = mean(ESG., na.rm = TRUE),
sd = sd(ESG., na.rm = TRUE),
min = min(ESG., na.rm = TRUE),
q25 = quantile(ESG., 0.25, na.rm = TRUE),
median= median(ESG., na.rm = TRUE),
q75 = quantile(ESG., 0.75, na.rm = TRUE),
max = max(ESG., na.rm = TRUE),
.groups = "drop"
)
# Create knitr table
kable(esg.stats, digits = 2, caption = "Basic statistics of ESG Index (2013–2016)")
| YEAR | n | mean | sd | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|---|---|
| 2013 | 76 | 0.23 | 0.10 | 0 | 0.15 | 0.22 | 0.29 | 0.58 |
| 2014 | 76 | 0.24 | 0.11 | 0 | 0.16 | 0.24 | 0.31 | 0.57 |
| 2015 | 76 | 0.25 | 0.12 | 0 | 0.17 | 0.25 | 0.31 | 0.65 |
| 2016 | 75 | 0.26 | 0.12 | 0 | 0.17 | 0.27 | 0.32 | 0.64 |
alebo krajšie tabuľky s pomocou .kableExtra.:
library(dplyr)
library(knitr)
library(kableExtra)
# Summarise basic statistics
esg.stats <- udaje1 %>%
filter(YEAR %in% 2013:2016) %>%
group_by(YEAR) %>%
summarise(
n = n(),
mean = mean(ESG., na.rm = TRUE),
sd = sd(ESG., na.rm = TRUE),
min = min(ESG., na.rm = TRUE),
q25 = quantile(ESG., 0.25, na.rm = TRUE),
median = median(ESG., na.rm = TRUE),
q75 = quantile(ESG., 0.75, na.rm = TRUE),
max = max(ESG., na.rm = TRUE),
.groups = "drop"
)
# Create styled kableExtra table
esg.stats %>%
kable(digits = 2, caption = "Basic statistics of ESG Index (2013–2016)") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(1, bold = TRUE) %>% # make YEAR bold
row_spec(0, bold = TRUE, background = "#f2f2f2") %>% # style header row
add_header_above(c(" " = 2, "ESG Index Statistics" = 7))
| YEAR | n | mean | sd | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|---|---|
| 2013 | 76 | 0.23 | 0.10 | 0 | 0.15 | 0.22 | 0.29 | 0.58 |
| 2014 | 76 | 0.24 | 0.11 | 0 | 0.16 | 0.24 | 0.31 | 0.57 |
| 2015 | 76 | 0.25 | 0.12 | 0 | 0.17 | 0.25 | 0.31 | 0.65 |
| 2016 | 75 | 0.26 | 0.12 | 0 | 0.17 | 0.27 | 0.32 | 0.64 |
t.test.result <- t.test(
udaje1$ESG.[udaje1$YEAR == 2013],
udaje1$ESG.[udaje1$YEAR == 2015]
)
print(t.test.result)
Welch Two Sample t-test
data: udaje1$ESG.[udaje1$YEAR == 2013] and udaje1$ESG.[udaje1$YEAR == 2015]
t = -1.2984, df = 147.86, p-value = 0.1962
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.05906743 0.01222533
sample estimates:
mean of x mean of y
0.2276316 0.2510526
anova.result <- aov(ESG. ~ YEAR, data = udaje1)
summary(anova.result)
Df Sum Sq Mean Sq F value Pr(>F)
YEAR 1 0.617 0.6169 29.88 6.25e-08 ***
Residuals 758 15.651 0.0206
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model <- lm(ESG. ~ RETU + FIRM + DEBT, data = udaje.2013)
summary(model)
Call:
lm(formula = ESG. ~ RETU + FIRM + DEBT, data = udaje.2013)
Residuals:
Min 1Q Median 3Q Max
-0.189450 -0.054050 -0.001079 0.049273 0.237250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.080e-01 8.210e-02 -3.751 0.000354 ***
RETU 1.668e-04 9.115e-05 1.830 0.071345 .
FIRM 8.080e-02 1.136e-02 7.115 6.78e-10 ***
DEBT -6.201e-04 4.205e-04 -1.475 0.144688
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08095 on 72 degrees of freedom
Multiple R-squared: 0.4217, Adjusted R-squared: 0.3976
F-statistic: 17.5 on 3 and 72 DF, p-value: 1.24e-08