V zadaní pracujeme s údajmi o miere nezamestnanosti z
Eurostatu.
Cieľom je odhadnúť lineárny model a skontrolovať prítomnosť
multikolinearity.
# načítanie dát – súbor musí byť v rovnakom priečinku ako tento Rmd
udaje_raw <- read.csv("une_rt_a__custom_18708117_linear.csv")
str(udaje_raw)
## 'data.frame': 33 obs. of 11 variables:
## $ DATAFLOW : chr "ESTAT:UNE_RT_A(1.0)" "ESTAT:UNE_RT_A(1.0)" "ESTAT:UNE_RT_A(1.0)" "ESTAT:UNE_RT_A(1.0)" ...
## $ LAST.UPDATE: chr "11/09/25 23:00:00" "11/09/25 23:00:00" "11/09/25 23:00:00" "11/09/25 23:00:00" ...
## $ freq : chr "Annual" "Annual" "Annual" "Annual" ...
## $ age : chr "From 15 to 74 years" "From 15 to 74 years" "From 15 to 74 years" "From 15 to 74 years" ...
## $ unit : chr "Percentage of population in the labour force" "Percentage of population in the labour force" "Percentage of population in the labour force" "Percentage of population in the labour force" ...
## $ sex : chr "Total" "Total" "Total" "Total" ...
## $ geo : chr "Czechia" "Czechia" "Czechia" "Czechia" ...
## $ TIME_PERIOD: int 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 ...
## $ OBS_VALUE : num 6.1 5.1 4 2.9 2.2 2 2.6 2.8 2.2 2.6 ...
## $ OBS_FLAG : chr "" "" "" "" ...
## $ CONF_STATUS: logi NA NA NA NA NA NA ...
head(udaje_raw)
## DATAFLOW LAST.UPDATE freq age
## 1 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 2 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 3 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 4 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 5 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 6 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## unit sex geo TIME_PERIOD
## 1 Percentage of population in the labour force Total Czechia 2014
## 2 Percentage of population in the labour force Total Czechia 2015
## 3 Percentage of population in the labour force Total Czechia 2016
## 4 Percentage of population in the labour force Total Czechia 2017
## 5 Percentage of population in the labour force Total Czechia 2018
## 6 Percentage of population in the labour force Total Czechia 2019
## OBS_VALUE OBS_FLAG CONF_STATUS
## 1 6.1 NA
## 2 5.1 NA
## 3 4.0 NA
## 4 2.9 NA
## 5 2.2 NA
## 6 2.0 NA
Vyfiltrujeme pozorovania podľa zadania – vek 15–74 rokov, celkový
súčet za pohlavia.
Necháme si len krajinu, rok a mieru nezamestnanosti.
udaje <- subset(
udaje_raw,
age == "From 15 to 74 years" & sex == "Total"
)
udaje <- udaje[, c("geo", "TIME_PERIOD", "OBS_VALUE")]
names(udaje) <- c("country", "year", "unemp")
head(udaje)
## country year unemp
## 1 Czechia 2014 6.1
## 2 Czechia 2015 5.1
## 3 Czechia 2016 4.0
## 4 Czechia 2017 2.9
## 5 Czechia 2018 2.2
## 6 Czechia 2019 2.0
table(udaje$country)
##
## Czechia Germany Slovakia
## 11 11 11
sort(unique(udaje$year))
## [1] 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Vytvoríme vysvetľujúce premenné:
time – rok (trend),covid – dummy pre covidové obdobie (2020+),DE – dummy pre Nemecko,SK – dummy pre Slovensko (referenčná krajina je tá bez
dummy – napr. Česko, ak je v dátach).udaje$time <- udaje$year
udaje$covid <- ifelse(udaje$year >= 2020, 1, 0)
udaje$DE <- ifelse(udaje$country == "Germany", 1, 0)
udaje$SK <- ifelse(udaje$country == "Slovakia", 1, 0)
head(udaje)
## country year unemp time covid DE SK
## 1 Czechia 2014 6.1 2014 0 0 0
## 2 Czechia 2015 5.1 2015 0 0 0
## 3 Czechia 2016 4.0 2016 0 0 0
## 4 Czechia 2017 2.9 2017 0 0 0
## 5 Czechia 2018 2.2 2018 0 0 0
## 6 Czechia 2019 2.0 2019 0 0 0
Najskôr odhadneme jednoduchý model s časovým trendom:
\[ Unemp_{it} = \beta_0 + \beta_1 time_{it} + u_{it}. \]
model_basic <- lm(unemp ~ time, data = udaje)
summary(model_basic)
##
## Call:
## lm(formula = unemp ~ time, data = udaje)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0058 -1.9385 -0.6403 2.2270 6.4288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 744.5276 270.0271 2.757 0.00968 **
## time -0.3664 0.1337 -2.739 0.01012 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.43 on 31 degrees of freedom
## Multiple R-squared: 0.1949, Adjusted R-squared: 0.1689
## F-statistic: 7.504 on 1 and 31 DF, p-value: 0.01012
Stručná interpretácia:
time ukazuje, či má nezamestnanosť v
čase skôr klesajúci alebo rastúci trend.Teraz pridáme dummy premenné a covidovú premennú:
\[ Unemp_{it} = \beta_0 + \beta_1 time_{it} + \beta_2 covid_{it} + \beta_3 DE_i + \beta_4 SK_i + u_{it}. \]
model_full <- lm(unemp ~ time + covid + DE + SK, data = udaje)
summary(model_full)
##
## Call:
## lm(formula = unemp ~ time + covid + DE + SK, data = udaje)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5182 -0.7046 -0.2176 0.3376 3.0566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1306.6549 245.3397 5.326 1.14e-05 ***
## time -0.6461 0.1217 -5.310 1.19e-05 ***
## covid 2.0511 0.7727 2.655 0.0129 *
## DE 0.3909 0.4712 0.830 0.4138
## SK 4.5545 0.4712 9.666 2.03e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.105 on 28 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8281
## F-statistic: 39.53 on 4 and 28 DF, p-value: 3.925e-11
Stručná interpretácia koeficientov:
time – dlhodobý trend nezamestnanosti pri fixnom
covid a krajine,covid – priemerný rozdiel nezamestnanosti v covidových
rokoch oproti predcovidovým,DE, SK – rozdiel v priemernej
nezamestnanosti Nemecka a Slovenska oproti referenčnej krajine.Skontrolujeme párové vzťahy medzi vysvetľujúcimi premennými.
xvars <- udaje[, c("time", "covid", "DE", "SK")]
round(cor(xvars), 3)
## time covid DE SK
## time 1.000 0.866 0.0 0.0
## covid 0.866 1.000 0.0 0.0
## DE 0.000 0.000 1.0 -0.5
## SK 0.000 0.000 -0.5 1.0
Scatterplotová matica:
pairs(xvars,
main = "Scatterplotová matica – time, covid, DE, SK")
Komentár:
time a
covid, keďže covid je „zapnutý“ len v posledných
rokoch.DE a SK sú navzájom
čiastočne späté, ale nie perfektne.Na diagnostiku multikolinearity použijeme Variance Inflation Factor
(VIF).
Definujeme si jednoduchú vlastnú funkciu:
vif_manual <- function(model) {
X <- model.matrix(model)[, -1] # bez interceptu
k <- ncol(X)
vif_values <- numeric(k)
names(vif_values) <- colnames(X)
for (j in 1:k) {
xj <- X[, j]
others <- X[, -j, drop = FALSE]
fit_j <- lm(xj ~ others)
R2_j <- summary(fit_j)$r.squared
vif_values[j] <- 1 / (1 - R2_j)
}
return(vif_values)
}
vif_manual(model_full)
## time covid DE SK
## 4.000000 4.000000 1.333333 1.333333
Interpretácia VIF:
Pri time a covid očakávame vyššie VIF,
pretože sú silno prepojené.
Ďalšou charakteristikou multikolinearity je tzv. Condition
Number.
Najprv premenne štandardizujeme:
udaje$time_sc <- scale(udaje$time, center = TRUE, scale = TRUE)
udaje$covid_sc <- scale(udaje$covid, center = TRUE, scale = TRUE)
udaje$DE_sc <- scale(udaje$DE, center = TRUE, scale = TRUE)
udaje$SK_sc <- scale(udaje$SK, center = TRUE, scale = TRUE)
model_scaled <- lm(unemp ~ time_sc + covid_sc + DE_sc + SK_sc,
data = udaje)
summary(model_scaled)
##
## Call:
## lm(formula = unemp ~ time_sc + covid_sc + DE_sc + SK_sc, data = udaje)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5182 -0.7046 -0.2176 0.3376 3.0566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8394 0.1924 25.157 < 2e-16 ***
## time_sc -2.0747 0.3907 -5.310 1.19e-05 ***
## covid_sc 1.0371 0.3907 2.655 0.0129 *
## DE_sc 0.1871 0.2256 0.830 0.4138
## SK_sc 2.1803 0.2256 9.666 2.03e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.105 on 28 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8281
## F-statistic: 39.53 on 4 and 28 DF, p-value: 3.925e-11
Vypočítame Condition Number z matice \(X'X\):
X <- model.matrix(model_scaled)[, -1] # bez interceptu
XtX <- t(X) %*% X
eig <- eigen(XtX)
condition_number <- sqrt(max(eig$values) / min(eig$values))
condition_number
## [1] 3.732051
Interpretácia (orientačne):
Spolu s VIF hodnotíme, či je multikolinearita skôr mierna alebo vážna.
Jedným z možných riešení multikolinearity je vynechanie jednej z
kolineárnych premenných.
Odhadneme model bez covid:
\[ Unemp_{it} = \beta_0 + \beta_1 time_{it} + \beta_3 DE_i + \beta_4 SK_i + u_{it}. \]
model_no_covid <- lm(unemp ~ time + DE + SK, data = udaje)
summary(model_no_covid)
##
## Call:
## lm(formula = unemp ~ time + DE + SK, data = udaje)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0455 -0.7136 -0.2900 0.7509 3.5227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 742.87909 135.02078 5.502 6.29e-06 ***
## time -0.36636 0.06687 -5.478 6.72e-06 ***
## DE 0.39091 0.51801 0.755 0.457
## SK 4.55455 0.51801 8.792 1.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.215 on 29 degrees of freedom
## Multiple R-squared: 0.8117, Adjusted R-squared: 0.7922
## F-statistic: 41.67 on 3 and 29 DF, p-value: 1.224e-10
vif_manual(model_no_covid)
## time DE SK
## 1.000000 1.333333 1.333333
Komentár:
covid klesnúť,time a covid
existuje silnejšia väzba, čo sa prejavilo zvýšenými VIF a vyšším
Condition Number.covid) – za cenu zmeny ekonomickej
interpretácie modelu.