Cieľ
V tomto dokumente vykonáme lineárnu regresiu, kde
budeme modelovať závislosť počtu úmrtí (deaths) od počtu
prípadov (cases) v dátach COVID. Ako vysvetľujúce premenné
použijeme základnú formu: \[ \texttt{deaths}
= \beta_0 + \beta_1 \cdot \texttt{cases} + \varepsilon. \]
Pre inšpiráciu je štruktúra dokumentu podobná súboru
Cvicenie5.Rmd.
Balíčky
library(broom)
library(scales)
library(knitr)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(tidyr)
library(magrittr)
Načítanie a príprava
dát
# Názvy stĺpcov: country, date, cases, deaths, population
covid<- read.csv("Covid.csv", stringsAsFactors = FALSE, sep = ";")
head(covid)
Základná deskriptívna
štatistika
summary_tbl <- covid %>%
summarise(
n_riadkov = n(),
n_krajin = n_distinct(country),
datum_min = min(date, na.rm = TRUE),
datum_max = max(date, na.rm = TRUE),
cases_min = min(cases, na.rm = TRUE),
cases_median = median(cases, na.rm = TRUE),
cases_mean = mean(cases, na.rm = TRUE),
cases_max = max(cases, na.rm = TRUE),
deaths_min = min(deaths, na.rm = TRUE),
deaths_median = median(deaths, na.rm = TRUE),
deaths_mean = mean(deaths, na.rm = TRUE),
deaths_max = max(deaths, na.rm = TRUE)
)
kable(summary_tbl, caption = "Základná sumarizácia datasetu") %>%
kable_classic(full_width = FALSE)
Základná sumarizácia datasetu
|
n_riadkov
|
n_krajin
|
datum_min
|
datum_max
|
cases_min
|
cases_median
|
cases_mean
|
cases_max
|
deaths_min
|
deaths_median
|
deaths_mean
|
deaths_max
|
|
4372
|
4
|
2020-01-04
|
2022-12-31
|
0
|
1043490
|
1627459
|
6368388
|
0
|
20371.5
|
30075.81
|
118533
|
Prepočty na 100 tis.
obyvateľov (voliteľné)
Niektoré analýzy je vhodné robiť per-capita. Tu doplníme prepočítané
metriky.
covid <- covid %>%
mutate(
cases_per_100k = if_else(!is.na(population) & population > 0, 1e5 * cases / population, NA_real_),
deaths_per_100k = if_else(!is.na(population) & population > 0, 1e5 * deaths / population, NA_real_)
)
covid %>%
select(country, date, cases, deaths, population, cases_per_100k, deaths_per_100k) %>%
tail(10) %>%
kable(caption = "Ukážka posledných riadkov s per-capita prepočtami") %>%
kable_classic(full_width = FALSE)
Ukážka posledných riadkov s per-capita prepočtami
|
|
country
|
date
|
cases
|
deaths
|
population
|
cases_per_100k
|
deaths_per_100k
|
|
4363
|
Slovakia
|
2022-12-22
|
1858400
|
20794
|
5473195
|
33954.57
|
379.9243
|
|
4364
|
Slovakia
|
2022-12-23
|
1858531
|
20797
|
5473195
|
33956.97
|
379.9792
|
|
4365
|
Slovakia
|
2022-12-24
|
1858653
|
20799
|
5473195
|
33959.20
|
380.0157
|
|
4366
|
Slovakia
|
2022-12-25
|
1858687
|
20803
|
5473195
|
33959.82
|
380.0888
|
|
4367
|
Slovakia
|
2022-12-26
|
1858709
|
20806
|
5473195
|
33960.22
|
380.1436
|
|
4368
|
Slovakia
|
2022-12-27
|
1858747
|
20809
|
5473195
|
33960.91
|
380.1984
|
|
4369
|
Slovakia
|
2022-12-28
|
1858879
|
20813
|
5473195
|
33963.32
|
380.2715
|
|
4370
|
Slovakia
|
2022-12-29
|
1858997
|
20817
|
5473195
|
33965.48
|
380.3446
|
|
4371
|
Slovakia
|
2022-12-30
|
1859083
|
20821
|
5473195
|
33967.05
|
380.4177
|
|
4372
|
Slovakia
|
2022-12-31
|
1859186
|
20823
|
5473195
|
33968.93
|
380.4542
|
Vizualizácia
Vzťah deaths ~ cases:
scatter + regresná priamka
ggplot(covid, aes(x = cases, y = deaths)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = TRUE) +
scale_x_continuous(labels = label_number(big.mark = " ")) +
scale_y_continuous(labels = label_number(big.mark = " ")) +
labs(
title ="Vzťah medzi prípadmi a úmrtiami",
x = "Prípady (cases)",
y = "Úmrtia (deaths)")

Lineárna regresia
Základný model:
deaths ~ cases
model_basic <- lm(deaths ~ cases, data = covid)
summary(model_basic)
##
## Call:
## lm(formula = deaths ~ cases, data = covid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37478 -2783 -1848 10053 27372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.771e+03 2.683e+02 10.33 <2e-16 ***
## cases 1.678e-02 1.104e-04 151.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13170 on 4370 degrees of freedom
## Multiple R-squared: 0.8408, Adjusted R-squared: 0.8408
## F-statistic: 2.308e+04 on 1 and 4370 DF, p-value: < 2.2e-16
tidy(model_basic) %>% kable(caption = "Odhady parametrov modelu") %>% kable_classic(full_width = FALSE)
Odhady parametrov modelu
|
term
|
estimate
|
std.error
|
statistic
|
p.value
|
|
(Intercept)
|
2770.7331477
|
268.2815311
|
10.32771
|
0
|
|
cases
|
0.0167777
|
0.0001104
|
151.93532
|
0
|
glance(model_basic) %>% kable(caption = "Diagnostické štatistiky modelu") %>% kable_classic(full_width = FALSE)
Diagnostické štatistiky modelu
|
r.squared
|
adj.r.squared
|
sigma
|
statistic
|
p.value
|
df
|
logLik
|
AIC
|
BIC
|
deviance
|
df.residual
|
nobs
|
|
0.8408266
|
0.8407902
|
13170.79
|
23084.34
|
0
|
1
|
-47674.33
|
95354.66
|
95373.81
|
758062882014
|
4370
|
4372
|
Interpretácia
- Intercept (\(\beta_0\)): očakávaný počet úmrtí,
ak je počet prípadov 0 (interpretácia závisí od rozsahu dát).
- Smernica (\(\beta_1\)): priemerná zmena úmrtí
pri náraste o jednu jednotku prípadov.
- \(R^2\): podiel
variability v
deaths vysvetlený premennou
cases.
- p-hodnoty: test významnosti koeficientov.
Alternatívny model
(per-capita): deaths_per_100k ~ cases_per_100k
covid_pc <- covid %>% filter(!is.na(cases_per_100k), !is.na(deaths_per_100k))
model_pc <- lm(deaths_per_100k ~ cases_per_100k, data = covid_pc)
summary(model_pc)
##
## Call:
## lm(formula = deaths_per_100k ~ cases_per_100k, data = covid_pc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147.40 -73.08 -22.50 56.09 192.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.451e+01 1.921e+00 38.78 <2e-16 ***
## cases_per_100k 1.089e-02 1.115e-04 97.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90.62 on 4370 degrees of freedom
## Multiple R-squared: 0.6858, Adjusted R-squared: 0.6857
## F-statistic: 9538 on 1 and 4370 DF, p-value: < 2.2e-16
tidy(model_pc) %>% kable(caption = "Odhady parametrov (per 100k)") %>% kable_classic(full_width = FALSE)
Odhady parametrov (per 100k)
|
term
|
estimate
|
std.error
|
statistic
|
p.value
|
|
(Intercept)
|
74.5053816
|
1.9211905
|
38.78084
|
0
|
|
cases_per_100k
|
0.0108906
|
0.0001115
|
97.66021
|
0
|
glance(model_pc) %>% kable(caption = "Diagnostické štatistiky (per 100k)") %>% kable_classic(full_width = FALSE)
Diagnostické štatistiky (per 100k)
|
r.squared
|
adj.r.squared
|
sigma
|
statistic
|
p.value
|
df
|
logLik
|
AIC
|
BIC
|
deviance
|
df.residual
|
nobs
|
|
0.6857814
|
0.6857095
|
90.62193
|
9537.516
|
0
|
1
|
-25905.88
|
51817.75
|
51836.9
|
35887903
|
4370
|
4372
|
Diagnostika modelu
Predpoklady lineárnej
regresie
Skontrolujeme reziduá a diagnostické grafy pre základný model.
par(mfrow = c(2,2))
plot(model_basic)

par(mfrow = c(1,1))
Reziduá vs. predikcie
(ggplot)
augment(model_basic) %>%
ggplot(aes(.fitted, .resid)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_point(alpha = 0.5) +
labs(title = "Reziduá vs. predikcie", x = "Predikcie", y = "Reziduá")

Analýza po krajinách
(voliteľné)
Nižšie ukážeme, ako spraviť separátny fit pre každú
krajinu a porovnať sklon.
by_cty <- covid %>%
group_by(country) %>%
filter(sum(!is.na(cases) & !is.na(deaths)) >= 10) %>%
nest()
fit_by <- by_cty %>%
mutate(
fit = purrr::map(data, ~ lm(deaths ~ cases, data = .x)),
tidy = purrr::map(fit, broom::tidy),
glance= purrr::map(fit, broom::glance)
) %>%
select(-data, -fit) %>%
tidyr::unnest(cols = c(tidy, glance), names_sep = "_") %>%
filter(tidy_term == "cases") %>%
transmute(
country,
estimate = tidy_estimate,
std.error = tidy_std.error,
statistic = tidy_statistic,
p.value = tidy_p.value,
r.squared = glance_r.squared
) %>%
arrange(desc(estimate))
kable(fit_by %>% select(country, estimate, std.error, statistic, p.value, r.squared) ,
digits = 4,
caption = "Odhad podľa krajín") %>%
kable_classic(full_width = FALSE)
Odhad podľa krajín
|
country
|
estimate
|
std.error
|
statistic
|
p.value
|
r.squared
|
|
Hungary
|
0.0231
|
2e-04
|
135.4734
|
0
|
0.9439
|
|
Poland
|
0.0191
|
1e-04
|
184.6926
|
0
|
0.9690
|
|
Slovakia
|
0.0103
|
1e-04
|
74.1946
|
0
|
0.8346
|
|
Czechia
|
0.0091
|
1e-04
|
84.0029
|
0
|
0.8661
|
Vizualizácia po
krajinách (facet)
covid %>%
group_by(country) %>%
filter(sum(!is.na(cases) & !is.na(deaths)) >= 20) %>%
ungroup() %>%
ggplot(aes(cases, deaths)) +
geom_point(alpha = 0.35) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ country, scales = "free") +
labs(title = "Vzťah cases vs. deaths podľa krajín",
x = "Prípady", y = "Úmrtia")

Záver
- Základný model deaths ~ cases odhaduje priemerný
nárast úmrtí pri raste počtu prípadov.
- Per-capita model môže byť z hľadiska
porovnateľnosti vhodnejší.
- Predpoklady lineárnej regresie treba vždy overiť (homoskedasticita,
normalita reziduí, linearita).
- Pre robustnejší model je možné pridať ďalšie kovariáty (napr. časový
posun/lags, krajinu ako faktor, nelinearity).