Cieľ cvičenia:
1) Odhadnúť lineárny model miery nezamestnanosti z Eurostat dát
(une_rt_a__custom_18708117_linear.csv).
2) Diagnostikovať heteroskedasticitu (vizuálne +
testy).
3) Ukázať riešenia: robustné (HC) štandardné chyby,
WLS/Feasible GLS, a
transformáciu.
Všade je priložená interpretácia.
# === BALÍČKY ===
libs <- c("tidyverse","readr","janitor","broom","lmtest","sandwich","car","modelsummary")
to_install <- libs[!libs %in% installed.packages()[,"Package"]]
if(length(to_install)) install.packages(to_install, quiet = TRUE)
invisible(lapply(libs, library, character.only = TRUE))
theme_set(theme_minimal())
options(digits = 4)
Komentár:
- Očakáva sa CSV s ročnou mierou nezamestnanosti.
- Kód je robustný voči názvom stĺpcov (hľadá
time/time_period, geo, value
atď.).
- Filtrovanie: Total populácia a percentuálna jednotka (ak
dostupné).
# Lokálna cesta; ak nie je, fallback (napr. v cloudovom prostredí)
csv_path <- "une_rt_a__custom_18708117_linear.csv"
if(!file.exists(csv_path)){
alt <- "/mnt/data/une_rt_a__custom_18708117_linear.csv"
if(file.exists(alt)) csv_path <- alt
}
if(!file.exists(csv_path)) stop("CSV súbor 'une_rt_a__custom_18708117_linear.csv' nebol nájdený.")
raw <- readr::read_csv(csv_path, show_col_types = FALSE) %>%
janitor::clean_names()
glimpse(raw)
## Rows: 33
## Columns: 11
## $ dataflow <chr> "ESTAT:UNE_RT_A(1.0)", "ESTAT:UNE_RT_A(1.0)", "ESTAT:UNE_R…
## $ last_update <chr> "11/09/25 23:00:00", "11/09/25 23:00:00", "11/09/25 23:00:…
## $ freq <chr> "Annual", "Annual", "Annual", "Annual", "Annual", "Annual"…
## $ age <chr> "From 15 to 74 years", "From 15 to 74 years", "From 15 to …
## $ unit <chr> "Percentage of population in the labour force", "Percentag…
## $ sex <chr> "Total", "Total", "Total", "Total", "Total", "Total", "Tot…
## $ geo <chr> "Czechia", "Czechia", "Czechia", "Czechia", "Czechia", "Cz…
## $ time_period <dbl> 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023…
## $ obs_value <dbl> 6.1, 5.1, 4.0, 2.9, 2.2, 2.0, 2.6, 2.8, 2.2, 2.6, 2.6, 4.7…
## $ obs_flag <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ conf_status <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# Robustná identifikácia bežných stĺpcov (Eurostat štýl)
nm <- names(raw)
keep_first <- function(x) { x[which(x %in% nm)][1] }
time_col <- keep_first(c("time_period","time","year"))
geo_col <- keep_first(c("geo","country","geo_name"))
sex_col <- keep_first(c("sex"))
age_col <- keep_first(c("age"))
unit_col <- keep_first(c("unit"))
value_col <- keep_first(c("value","obs_value","rate","unemployment_rate"))
if(is.na(value_col) || is.null(value_col)){
num_cols <- raw %>% select(where(is.numeric)) %>% names()
if(length(num_cols)==0) stop("V CSV nie je žiaden numerický stĺpec použiteľný ako 'y'.")
value_col <- num_cols[1]
}
df <- raw
# Filtrovanie Total/Percent (ak dostupné)
if(!is.na(sex_col) && any(df[[sex_col]] %in% c("T","Total","TOTAL"))) {
df <- df %>% filter(.data[[sex_col]] %in% c("T","Total","TOTAL"))
}
if(!is.na(age_col) && any(df[[age_col]] %in% c("Y15-74","TOTAL","Y_GE15","Y15-64","Total"))) {
df <- df %>% filter(.data[[age_col]] %in% c("Y15-74","TOTAL","Y_GE15","Y15-64","Total"))
}
if(!is.na(unit_col) && any(df[[unit_col]] %in% c("PC","PC_POP","PERC","Percent","Percentage"))) {
df <- df %>% filter(.data[[unit_col]] %in% c("PC","PC_POP","PERC","Percent","Percentage"))
}
# Tvorba pracovných premenných
df <- df %>%
mutate(
y = .data[[value_col]],
year = if(!is.na(time_col)) suppressWarnings(readr::parse_number(as.character(.data[[time_col]]))) else NA_real_,
geo_factor = if(!is.na(geo_col)) as.factor(.data[[geo_col]]) else factor("ALL")
) %>%
filter(!is.na(y))
if("year" %in% names(df)) df <- df %>% filter(!is.na(year))
# Prehľad počtu pozorovaní podľa krajiny (top 12)
df %>% count(geo_factor, sort = TRUE) %>% head(12)
Interpretácia dát:
- y = miera nezamestnanosti v % (závislá premenná).
- year = rok (ak v CSV existuje), geo_factor =
krajina (faktor).
- Ďalej odhadneme trendový model s krajinnými efektmi.
\[ y_{it} = \alpha + \beta \cdot \text{year}_t + \sum_{i\neq \text{ref}} \gamma_i \cdot \mathbb{1}\{\text{geo}_i\} + \varepsilon_{it}. \]
has_year <- "year" %in% names(df) && dplyr::n_distinct(df$year) > 1
has_geo <- "geo_factor" %in% names(df) && dplyr::n_distinct(df$geo_factor) > 1
if(has_year && has_geo){
mod_ols <- lm(y ~ year + geo_factor, data = df)
} else if (has_year){
mod_ols <- lm(y ~ year, data = df)
} else if (has_geo){
mod_ols <- lm(y ~ geo_factor, data = df)
} else {
mod_ols <- lm(y ~ 1, data = df) # fallback
}
summary(mod_ols)
##
## Call:
## lm(formula = y ~ year + geo_factor, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.045 -0.714 -0.290 0.751 3.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 742.8791 135.0208 5.50 6.3e-06 ***
## year -0.3664 0.0669 -5.48 6.7e-06 ***
## geo_factorGermany 0.3909 0.5180 0.75 0.46
## geo_factorSlovakia 4.5545 0.5180 8.79 1.1e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.21 on 29 degrees of freedom
## Multiple R-squared: 0.812, Adjusted R-squared: 0.792
## F-statistic: 41.7 on 3 and 29 DF, p-value: 1.22e-10
Interpretácia (OLS):
- Koeficient pri year ~ priemerná medziročná zmena (v
p. b.).
- Koeficienty geo_factor ~ odchýlky od referenčnej krajiny
(fixné rozdiely).
- Ďalej skontrolujeme, či majú reziduá konštantný
rozptyl.
aug <- broom::augment(mod_ols)
p1 <- ggplot(aug, aes(.fitted, .resid)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(alpha = 0.7) +
labs(title = "Reziduá vs. predikované hodnoty",
x = "Predikované hodnoty", y = "Reziduá")
p2 <- ggplot(aug, aes(.fitted, abs(.resid))) +
geom_smooth(se = FALSE, method = "loess") +
geom_point(alpha = 0.7) +
labs(title = "Absolútne reziduá vs. predikované hodnoty (LOESS)",
x = "Predikované hodnoty", y = "|reziduá|")
p3 <- ggplot(aug, aes(sample = .std.resid)) +
stat_qq() + stat_qq_line() +
labs(title = "QQ-plot štandardizovaných reziduí")
p1; p2; p3
Interpretácia (vizuálne):
- Ak v p1/p2 vidíme lievik
(rozptyl rastie/klesá s úrovňou), ide o podozrenie na
heteroskedasticitu.
- QQ-plot je skôr o normalite; heteroskedasticitu sledujeme najmä v
p1/p2.
# Breusch–Pagan (H0: homoskedasticita)
bp <- lmtest::bptest(mod_ols)
# Studentizovaný Breusch–Pagan (citlivejší)
bp_stud <- lmtest::bptest(mod_ols, studentize = TRUE)
# White (pomocná regresia reziduá^2 ~ fitted + fitted^2)
fit_vals <- fitted(mod_ols)
white_aux <- lm(resid(mod_ols)^2 ~ fit_vals + I(fit_vals^2))
white_stat <- summary(white_aux)$r.squared * nrow(aug)
white_df <- 2
white_p <- 1 - pchisq(white_stat, df = white_df)
list(
breusch_pagan = bp,
breusch_pagan_studentized = bp_stud,
white_test = c(statistic = white_stat, df = white_df, p_value = white_p)
)
## $breusch_pagan
##
## studentized Breusch-Pagan test
##
## data: mod_ols
## BP = 6.9, df = 3, p-value = 0.08
##
##
## $breusch_pagan_studentized
##
## studentized Breusch-Pagan test
##
## data: mod_ols
## BP = 6.9, df = 3, p-value = 0.08
##
##
## $white_test
## statistic df p_value
## 1.588e+01 2.000e+00 3.566e-04
Interpretácia (testy):
- H0: homoskedasticita.
- p < 0.05 ⇒ zamietame H0 ⇒ dôkazy
heteroskedasticity.
- White je robustnejší k nešpecifikovanému tvaru (kvadratické efekty
fitted).
# HC3 je často odporúčaná voľba
vcov_hc3 <- sandwich::vcovHC(mod_ols, type = "HC3")
coeftest_hc3 <- lmtest::coeftest(mod_ols, vcov = vcov_hc3)
# Tabuľky porovnania
modelsummary::modelsummary(
list("OLS (klasické SE)" = mod_ols),
statistic = c("std.error", "p.value"),
gof_omit = "IC|R2$"
)
| OLS (klasické SE) | |
|---|---|
| (Intercept) | 742.879 |
| (135.021) | |
| (<0.001) | |
| year | -0.366 |
| (0.067) | |
| (<0.001) | |
| geo_factorGermany | 0.391 |
| (0.518) | |
| (0.457) | |
| geo_factorSlovakia | 4.555 |
| (0.518) | |
| (<0.001) | |
| Num.Obs. | 33 |
| R2 Adj. | 0.792 |
| Log.Lik. | -51.115 |
| F | 41.667 |
| RMSE | 1.14 |
modelsummary::modelsummary(
list("OLS + HC3 robustné SE" = mod_ols),
vcov = list(vcov_hc3),
statistic = c("std.error", "p.value"),
gof_omit = "IC|R2$"
)
| OLS + HC3 robustné SE | |
|---|---|
| (Intercept) | 742.879 |
| (182.019) | |
| (<0.001) | |
| year | -0.366 |
| (0.090) | |
| (<0.001) | |
| geo_factorGermany | 0.391 |
| (0.422) | |
| (0.362) | |
| geo_factorSlovakia | 4.555 |
| (0.621) | |
| (<0.001) | |
| Num.Obs. | 33 |
| R2 Adj. | 0.792 |
| Log.Lik. | -51.115 |
| RMSE | 1.14 |
| Std.Errors | Custom |
coeftest_hc3
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 742.8791 182.0189 4.08 0.00032 ***
## year -0.3664 0.0902 -4.06 0.00034 ***
## geo_factorGermany 0.3909 0.4218 0.93 0.36171
## geo_factorSlovakia 4.5545 0.6214 7.33 4.5e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretácia:
- Koeficienty sa nemenia, menia sa štandardné
chyby a p-hodnoty.
- Pri heteroskedasticite je to základ pre korektnú štatistickú
inferenciu.
Idea: ak \(\mathrm{Var}(\varepsilon_i)\) súvisí s úrovňou predpovede, vážime pozorovania inverzne k odhadnutému rozptylu.
# 1) Model pre |reziduá| ~ fitted (proxy sigma)
sigma_mod <- lm(abs(resid(mod_ols)) ~ fitted(mod_ols))
sigma_hat <- pmax(fitted(sigma_mod), .Machine$double.eps)
# 2) Váhy ~ 1 / sigma_hat^2
w <- 1 / (sigma_hat^2)
# Rovnaká štruktúra ako pri OLS
if("year" %in% all.vars(formula(mod_ols)) && "geo_factor" %in% all.vars(formula(mod_ols))){
mod_wls <- lm(y ~ year + geo_factor, data = df, weights = w)
} else if ("year" %in% all.vars(formula(mod_ols))) {
mod_wls <- lm(y ~ year, data = df, weights = w)
} else if ("geo_factor" %in% all.vars(formula(mod_ols))) {
mod_wls <- lm(y ~ geo_factor, data = df, weights = w)
} else {
mod_wls <- lm(y ~ 1, data = df, weights = w)
}
summary(mod_wls)
##
## Call:
## lm(formula = y ~ year + geo_factor, data = df, weights = w)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.625 -0.814 -0.205 0.524 3.082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 478.340 107.026 4.47 0.00011 ***
## year -0.235 0.053 -4.44 0.00012 ***
## geo_factorGermany 0.543 0.355 1.53 0.13701
## geo_factorSlovakia 4.296 0.492 8.73 1.3e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.16 on 29 degrees of freedom
## Multiple R-squared: 0.78, Adjusted R-squared: 0.757
## F-statistic: 34.2 on 3 and 29 DF, p-value: 1.16e-09
# Porovnanie
modelsummary::modelsummary(
list("OLS" = mod_ols, "WLS (Feasible GLS)" = mod_wls),
statistic = c("std.error","p.value"),
gof_omit = "IC|R2$"
)
| OLS | WLS (Feasible GLS) | |
|---|---|---|
| (Intercept) | 742.879 | 478.340 |
| (135.021) | (107.026) | |
| (<0.001) | (<0.001) | |
| year | -0.366 | -0.235 |
| (0.067) | (0.053) | |
| (<0.001) | (<0.001) | |
| geo_factorGermany | 0.391 | 0.543 |
| (0.518) | (0.355) | |
| (0.457) | (0.137) | |
| geo_factorSlovakia | 4.555 | 4.297 |
| (0.518) | (0.492) | |
| (<0.001) | (<0.001) | |
| Num.Obs. | 33 | 33 |
| R2 Adj. | 0.792 | 0.757 |
| Log.Lik. | -51.115 | -45.283 |
| F | 41.667 | 34.236 |
| RMSE | 1.14 | 1.22 |
Interpretácia:
- WLS môže znížiť štandardné chyby (efektívnejší odhad), ak váhy
vystihujú štruktúru rozptylu.
- Porovnajte významnosti a diagnostiku reziduí aj po WLS.
if(all(df$y > 0, na.rm = TRUE)){
mod_log <- lm(log(y) ~ year + geo_factor, data = df)
summary(mod_log)
modelsummary::modelsummary(
list("OLS (y)" = mod_ols, "OLS (log y)" = mod_log),
statistic = c("std.error","p.value"),
gof_omit = "IC|R2$"
)
} else {
cat("Log-transformácia vynechaná (y obsahuje nepozitívne hodnoty).\n")
}
| OLS (y) | OLS (log y) | |
|---|---|---|
| (Intercept) | 742.879 | 130.387 |
| (135.021) | (21.347) | |
| (<0.001) | (<0.001) | |
| year | -0.366 | -0.064 |
| (0.067) | (0.011) | |
| (<0.001) | (<0.001) | |
| geo_factorGermany | 0.391 | 0.169 |
| (0.518) | (0.082) | |
| (0.457) | (0.047) | |
| geo_factorSlovakia | 4.555 | 0.907 |
| (0.518) | (0.082) | |
| (<0.001) | (<0.001) | |
| Num.Obs. | 33 | 33 |
| R2 Adj. | 0.792 | 0.843 |
| Log.Lik. | -51.115 | 9.754 |
| F | 41.667 | 58.477 |
| RMSE | 1.14 | 0.18 |
Interpretácia:
- Transformácie (napr. log, sqrt) môžu stabilizovať rozptyl, no menia
interpretáciu: pri log(y) je koeficient pri
year približne percentuálna zmena (pri
malých zmenách).