Adapted from and inspired by Jason Locklin's notebook: Introduction to R for Education Data Analysis and Visualization
For working with data (a database) we most often use the data.frame type. It is a table made up of columns of different types, where one row represents one database record.
Suppose we have data on pupils described by three variables, Meno (name), Vek (age) and Body (points):
# Working with data frames
Meno <- c("Jana", "Jozef", "Mária")
Vek <- c(10, 11, 9)
Body <- c(85, 92, 78)

These three variables are not yet connected in any way; they are isolated table columns. We combine them into a table as follows
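The call that builds the table is not shown here. A minimal sketch, assuming the table is named udaje (the name used further on in this notebook):

```r
# The three vectors from above
Meno <- c("Jana", "Jozef", "Mária")
Vek <- c(10, 11, 9)
Body <- c(85, 92, 78)

# Combine them into one table; each vector becomes a column
udaje <- data.frame(Meno, Vek, Body)
print(udaje)
```

The column names are taken from the variable names, so no extra naming is needed in this form.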
Explanation: the data frame has three columns: Meno, Vek and Body. Some operations on data organized in a data.frame are shown below
[1] 10 11 9
[1] 10
[1] "Jana"
     Meno                Vek            Body
 Length:3           Min.   : 9.0   Min.   :78.0
 Class :character   1st Qu.: 9.5   1st Qu.:81.5
 Mode  :character   Median :10.0   Median :85.0
                    Mean   :10.0   Mean   :85.0
                    3rd Qu.:10.5   3rd Qu.:88.5
                    Max.   :11.0   Max.   :92.0
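The output above is consistent with calls along the following lines. The exact calls are not shown in the source, so treat them as an assumption; udaje is rebuilt here so that the block runs on its own.

```r
udaje <- data.frame(
  Meno = c("Jana", "Jozef", "Mária"),
  Vek  = c(10, 11, 9),
  Body = c(85, 92, 78)
)

udaje$Vek          # a whole column as a vector
udaje$Vek[1]       # first element of that column
udaje[1, "Meno"]   # one cell, addressed by row index and column name
summary(udaje)     # per-column summary statistics
```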
To add an additional column to the table, we proceed as follows
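A minimal sketch of adding a column. The column name MaAuto ("has a car") and its values are taken from the table printed further on; the assignment form itself is an assumption, since the original call is not shown.

```r
udaje <- data.frame(
  Meno = c("Jana", "Jozef", "Mária"),
  Vek  = c(10, 11, 9),
  Body = c(85, 92, 78)
)

# Assigning to a new name with $ appends a column;
# the vector must have one value per existing row
udaje$MaAuto <- c(TRUE, FALSE, TRUE)
print(udaje)
```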
To add a row:
# New record (must match column order/types)
novy.riadok <- data.frame(Meno = "Diana", Vek = 22.485, Body = 42, MaAuto = FALSE)
# Append
udaje <- rbind(udaje, novy.riadok)
print(udaje)

library(knitr)
library(kableExtra)
kable(
  udaje,
  # format,
  digits = 2,
  # row.names = NA,
  # col.names = NA,
  align = c("l", "c", "l", "r"),
  caption = "This is a table"
  # label = NULL,
  # format.args = list(),
  # escape = TRUE,
  # ...
) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )

| Meno | Vek | Body | MaAuto |
|---|---|---|---|
| Jana | 10.00 | 85 | TRUE |
| Jozef | 11.00 | 92 | FALSE |
| Mária | 9.00 | 78 | TRUE |
| Diana | 22.48 | 42 | FALSE |
The tidyverse is a collection of packages designed to simplify working with data. They share a common interface and complement one another.
dplyr provides the basic tools for data manipulation, for example:
filter(): selects rows
select(): selects columns
mutate(): creates new table columns
arrange(): sorts rows
summarise(): computes summaries
The following example uses so-called pipes. The %>% operator (or the native |> in R 4.1 and later) passes the result of one function directly into the call of the next. This makes code easier to read; the convention has caught on and is widely used.
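To see what a pipe does without loading any package, here is a small sketch using R's native pipe |> (available from R 4.1). The values are the Body column from the example above; the variable names are illustrative.

```r
scores <- c(85, 92, 78)

# Nested form: has to be read from the inside out
nested <- head(sort(scores, decreasing = TRUE), 2)

# Piped form: read left to right; each result becomes
# the first argument of the next call
piped <- scores |> sort(decreasing = TRUE) |> head(2)

identical(nested, piped)
```

Both forms compute the same thing; the piped version only changes the order in which the steps are written.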
# Selection followed by sorting
library(dplyr)
udaje %>%
  filter(Body > 50) %>%        # keep records with more than 50 points
  arrange(desc(Body)) %>%      # sort the result in descending order of Body
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )

| Meno | Vek | Body | MaAuto |
|---|---|---|---|
| Jozef | 11 | 92 | FALSE |
| Jana | 10 | 85 | TRUE |
| Mária | 9 | 78 | TRUE |
# Group and summarise
udaje %>%
  group_by(MaAuto) %>%         # group records by MaAuto
  summarise(
    Priem.Body = mean(Body),   # mean of Body within each group
    count = n()                # number of records in each group
  ) %>%
  kable(
    caption = "Mean Body by MaAuto",
    col.names = c("Has Car", "Mean Body", "Count"),
    align = "c"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )

| Has Car | Mean Body | Count |
|---|---|---|
| FALSE | 67.0 | 2 |
| TRUE | 81.5 | 2 |
# Creating new variables
udaje %>%
  mutate(
    grade = case_when(         # new variable grade, assigned by the first matching rule
      Body >= 90 ~ "A",
      Body >= 80 ~ "B",
      Body >= 70 ~ "C",
      TRUE ~ "D"
    ),
    VekPoPlnoletosti = round(Vek - 18, 0)   # years past the age of 18
  ) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )

| Meno | Vek | Body | MaAuto | grade | VekPoPlnoletosti |
|---|---|---|---|---|---|
| Jana | 10.000 | 85 | TRUE | B | -8 |
| Jozef | 11.000 | 92 | FALSE | A | -7 |
| Mária | 9.000 | 78 | TRUE | C | -9 |
| Diana | 22.485 | 42 | FALSE | D | 4 |
library(datasets)
# datasets available in the 'datasets' package - the following code was written for me by ChatGPT
ds <- as.data.frame(utils::data(package = "datasets")$results)[, c("Item", "Title")]
knitr::kable(head(ds, 20), col.names = c("Dataset", "Title"))   # first 20 datasets

| Dataset | Title |
|---|---|
| AirPassengers | Monthly Airline Passenger Numbers 1949-1960 |
| BJsales | Sales Data with Leading Indicator |
| BJsales.lead (BJsales) | Sales Data with Leading Indicator |
| BOD | Biochemical Oxygen Demand |
| CO2 | Carbon Dioxide Uptake in Grass Plants |
| ChickWeight | Weight versus age of chicks on different diets |
| DNase | Elisa assay of DNase |
| EuStockMarkets | Daily Closing Prices of Major European Stock Indices, 1991-1998 |
| Formaldehyde | Determination of Formaldehyde |
| HairEyeColor | Hair and Eye Color of Statistics Students |
| Harman23.cor | Harman Example 2.3 |
| Harman74.cor | Harman Example 7.4 |
| Indometh | Pharmacokinetics of Indomethacin |
| InsectSprays | Effectiveness of Insect Sprays |
| JohnsonJohnson | Quarterly Earnings per Johnson & Johnson Share |
| LakeHuron | Level of Lake Huron 1875-1972 |
| LifeCycleSavings | Intercountry Life-Cycle Savings Data |
| Loblolly | Growth of Loblolly Pine Trees |
| Nile | Flow of the River Nile |
| Orange | Growth of Orange Trees |
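Any entry in this list can be used by name once the package is attached. A minimal sketch with Orange, chosen here only as an illustration; the result name orange.max is hypothetical.

```r
library(datasets)

# Column names and types of the built-in data frame
str(Orange)

# Largest recorded circumference for each tree
orange.max <- aggregate(circumference ~ Tree, data = Orange, FUN = max)
print(orange.max)
```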
# The datasets package contains a dataset named CO2. We can refer to it by name, for example:
head(CO2)

We can also use a data collection intended for econometrics, the wooldridge package
# install.packages("wooldridge")
library(wooldridge)
ds <- as.data.frame(utils::data(package = "wooldridge")$results)[, c("Item", "Title")]
knitr::kable(head(ds, 20), col.names = c("Dataset", "Title")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )

| Dataset | Title |
|---|---|
| admnrev | admnrev |
| affairs | affairs |
| airfare | airfare |
| alcohol | alcohol |
| apple | apple |
| approval | approval |
| athlet1 | athlet1 |
| athlet2 | athlet2 |
| attend | attend |
| audit | audit |
| barium | barium |
| beauty | beauty |
| benefits | benefits |
| beveridge | beveridge |
| big9salary | big9salary |
| bwght | bwght |
| bwght2 | bwght2 |
| campus | campus |
| card | card |
| catholic | catholic |
I chose the data from [Abosede Tiamiyu: Environmental, Social, and Governance Reporting Evidencing Firm Performance in Emerging Economy](https://data.mendeley.com/datasets/7k8pjhsrwb/1). The page contains the file Dataset ESG and Firm Performance.xlsx, which I downloaded and exported to CSV format. I chose the semicolon (;) as the field separator, I use a decimal point rather than a decimal comma, and text values are enclosed in quotation marks ("). The first row contains the column names, which later act as variables. They contained spaces, which are not allowed in a variable name, so I replaced them with ".".
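R can also do this renaming itself. As a sketch, base R's make.names() turns arbitrary strings into syntactically valid names, replacing spaces with periods. The header strings below are modelled on the variable names used in the regression code later on and are not copied from the file.

```r
# Hypothetical column headers containing spaces
header <- c("ESG INDEX", "RETURN ON ASSETS", "FIRM SIZE")

# Each character that is invalid in a name is replaced by "."
valid <- make.names(header)
print(valid)
```

Base R's read.csv() applies the same conversion to headers by default (check.names = TRUE), so editing the header by hand is optional there.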
Then all that remains is to import the data into a data.frame, as follows
library(readr)
udaje <- read_delim("4.tyzden/udaje/HealthRiskData_Ver_3.csv",
                    delim = ";", escape_double = FALSE, trim_ws = TRUE)

Error:
! 4.tyzden/udaje/HealthRiskData_Ver_3.csv does not exist in current working directory: /cloud/project/4.tyzden.
Run `rlang::last_trace()` to see where the error occurred.
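The error arises because a relative path is resolved against the current working directory, which according to the message already ends in 4.tyzden. A minimal sketch of checking a path before reading, using a temporary file as a stand-in for the real CSV; the file contents and the name stand.in are made up for illustration.

```r
# Write a small semicolon-delimited file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Weight;BMI", "47.0;22", "50.5;20"), tmp)

getwd()            # relative paths are resolved against this directory
file.exists(tmp)   # check that the file is where we think it is

# sep = ";" matches the field separator, dec = "." the decimal mark.
# A file exported with decimal commas would need dec = "," instead.
stand.in <- read.csv(tmp, sep = ";", dec = ".")
str(stand.in)

unlink(tmp)        # remove the temporary file
```

If numeric columns come back as character, a mismatch between the decimal mark in the file and the one the reader expects is a likely cause.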
Selection and subsequent sorting
ggplot2 is currently one of the most widely used graphics packages in R, and ready-made code for individual figures can be found in the R Graph Gallery. Here we show some of the simpler ones.
# Basic scatter plot
library(ggplot2)
ggplot(udaje, aes(x = Weight, y = BMI)) +                # axis mapping
  geom_point() +                                         # plot type: scatter plot
  theme_minimal() +
  labs(title = "BMI index", x = "Weight", y = "Value")   # axis labels

# Box plot by group
ggplot(udaje, aes(x = factor(Weight), y = BMI)) +         # axis mapping
  geom_boxplot(fill = "lightgreen", color = "purple") +   # plot type: box plot
  labs(                                                   # axis labels and plot title
    title = "BMI Index",
    x = "Weight",
    y = "BMI"
  ) +
  theme_minimal()

library(dplyr)
library(knitr)
# Summarise basic statistics
esg.stats <- udaje %>%
filter(BMI %in% 20:25) %>%
group_by(Weight) %>%
summarise(
n = n(),
min = min(BMI, na.rm = TRUE),
max = max(BMI, na.rm= TRUE),
.groups = "drop"
)
# Create knitr table
kable(esg.stats, digits = 2, caption = "Basic statistics of BMI Index (20–25)")

| Weight | n | min | max |
|---|---|---|---|
| 47 | 1 | 22 | 22 |
| 50 | 1 | 20 | 20 |
| 51,6 | 1 | 22 | 22 |
| 55 | 1 | 21 | 21 |
| 55,9 | 2 | 21 | 21 |
| 56,5 | 1 | 20 | 20 |
| 56,9 | 1 | 22 | 22 |
| 57 | 2 | 25 | 25 |
| 58 | 1 | 21 | 21 |
| 59,9 | 1 | 22 | 22 |
| 60,4 | 1 | 23 | 23 |
| 61,1 | 2 | 23 | 24 |
| 64,2 | 1 | 21 | 21 |
| 64,6 | 1 | 24 | 24 |
| 64,8 | 1 | 23 | 23 |
| 65 | 2 | 23 | 23 |
| 65,9 | 2 | 22 | 22 |
| 66,8 | 1 | 24 | 24 |
| 66,9 | 1 | 21 | 21 |
| 67,00 | 1 | 24 | 24 |
| 68 | 2 | 25 | 25 |
| 76,8 | 1 | 24 | 24 |
or nicer tables with the help of kableExtra:
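A minimal sketch of such a styled table, reusing the kable_styling() options from the earlier sections. The small data frame retypes the first three rows of the table above so that the block runs on its own; in the notebook, esg.stats would take its place. The names stats.demo and styled are hypothetical.

```r
library(knitr)
library(kableExtra)

# Stand-in for esg.stats: first three rows of the table above
stats.demo <- data.frame(
  Weight = c(47, 50, 55),
  n      = c(1, 1, 1),
  min    = c(22, 20, 21),
  max    = c(22, 20, 21)
)

styled <- kable(stats.demo, format = "html", digits = 2,
                caption = "Basic statistics of BMI Index (20 to 25)") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
styled
```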
t.test.result <- t.test(
  udaje$BirthAge[udaje$BirthAge <= 22],
  udaje$BirthAge[udaje$BirthAge >= 35]
)
print(t.test.result)

	Welch Two Sample t-test

data:  udaje$BirthAge[udaje$BirthAge <= 22] and udaje$BirthAge[udaje$BirthAge >= 35]
t = -49.91, df = 368.05, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -27.18760 -25.12647
sample estimates:
mean of x mean of y
 20.16667  46.32370
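The "Welch" in the heading reflects the default of t.test(), which does not assume equal variances in the two groups (var.equal = FALSE). A small sketch on the built-in mtcars data, used here only as a stand-in; the variable names are illustrative.

```r
# Fuel consumption split by transmission type (0 = automatic, 1 = manual)
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

welch  <- t.test(auto, manual)                     # default: unequal variances (Welch)
pooled <- t.test(auto, manual, var.equal = TRUE)   # classical pooled-variance test

welch$method
pooled$method
```

The Welch version adjusts the degrees of freedom, which is why df in such output is usually not a whole number.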
             Df Sum Sq Mean Sq F value   Pr(>F)
Weight      358  86374   241.3   2.064 1.14e-11 ***
Residuals   343  40097   116.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: unexpected '=' in "model <- lm (BirthAge ~ BMI + Weight + data ="
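This parse error is what R reports when data = is attached to the formula with + instead of being passed as a separate, comma-separated argument. A sketch of the corrected call shape on the built-in mtcars data, which is a stand-in; with the notebook's data the call would presumably read lm(BirthAge ~ BMI + Weight, data = udaje).

```r
# The formula and the data argument are separated by a comma
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
```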
# install.packages(c("broom", "kableExtra", "dplyr", "stringr"))
library(broom)
library(dplyr)
library(kableExtra)
library(stringr)

# Coefficient table with confidence intervals and significance stars
coef.tbl <- tidy(model, conf.int = TRUE) %>%
  mutate(
    term = dplyr::recode(term,
      "(Intercept)" = "Intercept",
      "RETURN.ON.ASSETS" = "Return on Assets",
      "FIRM.SIZE" = "Firm Size",
      "DEBT.TO.ASSET" = "Debt to Asset"
    ),
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01 ~ "**",
      p.value < 0.05 ~ "*",
      p.value < 0.1 ~ "·",
      TRUE ~ ""
    )
  ) %>%
  transmute(
    Term = term,
    Estimate = estimate,
    `Std. Error` = std.error,
    `t value` = statistic,
    `p value` = p.value,
    `95% CI` = str_c("[", round(conf.low, 2), ", ", round(conf.high, 2), "]"),
    Sig = stars
  )

coef.tbl %>%
  kable(
    digits = 3,
    caption = "OLS Regression Coefficients (ESG.INDEX ~ RETURN.ON.ASSETS + FIRM.SIZE + DEBT.TO.ASSET)"
  ) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f2f2f2") %>%
  footnote(
    general = "Signif. codes: *** p<0.001, ** p<0.01, * p<0.05, · p<0.1.",
    threeparttable = TRUE
  )

# Model-level fit statistics
fit.tbl <- glance(model) %>%
  transmute(
    `R-squared` = r.squared,
    `Adj. R-squared` = adj.r.squared,
    `F-statistic` = statistic,
    `F p-value` = p.value,
    `AIC` = AIC,
    `BIC` = BIC,
    `Num. obs.` = nobs
  )

fit.tbl %>%
  kable(digits = 2, caption = "Model Fit Statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("condensed"))