1. Údaje a príprava

V zadaní pracujeme s údajmi o miere nezamestnanosti z Eurostatu.
Cieľom je odhadnúť lineárny model a skontrolovať prítomnosť multikolinearity.

# načítanie dát – súbor musí byť v rovnakom priečinku ako tento Rmd
udaje_raw <- read.csv("une_rt_a__custom_18708117_linear.csv")

str(udaje_raw)
## 'data.frame':    33 obs. of  11 variables:
##  $ DATAFLOW   : chr  "ESTAT:UNE_RT_A(1.0)" "ESTAT:UNE_RT_A(1.0)" "ESTAT:UNE_RT_A(1.0)" "ESTAT:UNE_RT_A(1.0)" ...
##  $ LAST.UPDATE: chr  "11/09/25 23:00:00" "11/09/25 23:00:00" "11/09/25 23:00:00" "11/09/25 23:00:00" ...
##  $ freq       : chr  "Annual" "Annual" "Annual" "Annual" ...
##  $ age        : chr  "From 15 to 74 years" "From 15 to 74 years" "From 15 to 74 years" "From 15 to 74 years" ...
##  $ unit       : chr  "Percentage of population in the labour force" "Percentage of population in the labour force" "Percentage of population in the labour force" "Percentage of population in the labour force" ...
##  $ sex        : chr  "Total" "Total" "Total" "Total" ...
##  $ geo        : chr  "Czechia" "Czechia" "Czechia" "Czechia" ...
##  $ TIME_PERIOD: int  2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 ...
##  $ OBS_VALUE  : num  6.1 5.1 4 2.9 2.2 2 2.6 2.8 2.2 2.6 ...
##  $ OBS_FLAG   : chr  "" "" "" "" ...
##  $ CONF_STATUS: logi  NA NA NA NA NA NA ...
head(udaje_raw)
##              DATAFLOW       LAST.UPDATE   freq                 age
## 1 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 2 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 3 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 4 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 5 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
## 6 ESTAT:UNE_RT_A(1.0) 11/09/25 23:00:00 Annual From 15 to 74 years
##                                           unit   sex     geo TIME_PERIOD
## 1 Percentage of population in the labour force Total Czechia        2014
## 2 Percentage of population in the labour force Total Czechia        2015
## 3 Percentage of population in the labour force Total Czechia        2016
## 4 Percentage of population in the labour force Total Czechia        2017
## 5 Percentage of population in the labour force Total Czechia        2018
## 6 Percentage of population in the labour force Total Czechia        2019
##   OBS_VALUE OBS_FLAG CONF_STATUS
## 1       6.1                   NA
## 2       5.1                   NA
## 3       4.0                   NA
## 4       2.9                   NA
## 5       2.2                   NA
## 6       2.0                   NA

Vyfiltrujeme pozorovania podľa zadania – vek 15–74 rokov, celkový súčet za pohlavia.
Necháme si len krajinu, rok a mieru nezamestnanosti.

udaje <- subset(
  udaje_raw,
  age == "From 15 to 74 years" & sex == "Total"
)

udaje <- udaje[, c("geo", "TIME_PERIOD", "OBS_VALUE")]
names(udaje) <- c("country", "year", "unemp")

head(udaje)
##   country year unemp
## 1 Czechia 2014   6.1
## 2 Czechia 2015   5.1
## 3 Czechia 2016   4.0
## 4 Czechia 2017   2.9
## 5 Czechia 2018   2.2
## 6 Czechia 2019   2.0
table(udaje$country)
## 
##  Czechia  Germany Slovakia 
##       11       11       11
sort(unique(udaje$year))
##  [1] 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024

Vytvoríme vysvetľujúce premenné:

udaje$time  <- udaje$year
udaje$covid <- ifelse(udaje$year >= 2020, 1, 0)
udaje$DE    <- ifelse(udaje$country == "Germany", 1, 0)
udaje$SK    <- ifelse(udaje$country == "Slovakia", 1, 0)

head(udaje)
##   country year unemp time covid DE SK
## 1 Czechia 2014   6.1 2014     0  0  0
## 2 Czechia 2015   5.1 2015     0  0  0
## 3 Czechia 2016   4.0 2016     0  0  0
## 4 Czechia 2017   2.9 2017     0  0  0
## 5 Czechia 2018   2.2 2018     0  0  0
## 6 Czechia 2019   2.0 2019     0  0  0

2. Základný model

Najskôr odhadneme jednoduchý model s časovým trendom:

\[ Unemp_{it} = \beta_0 + \beta_1 time_{it} + u_{it}. \]

model_basic <- lm(unemp ~ time, data = udaje)
summary(model_basic)
## 
## Call:
## lm(formula = unemp ~ time, data = udaje)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0058 -1.9385 -0.6403  2.2270  6.4288 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 744.5276   270.0271   2.757  0.00968 **
## time         -0.3664     0.1337  -2.739  0.01012 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.43 on 31 degrees of freedom
## Multiple R-squared:  0.1949, Adjusted R-squared:  0.1689 
## F-statistic: 7.504 on 1 and 31 DF,  p-value: 0.01012

Stručná interpretácia:


3. Rozšírený model

Teraz pridáme dummy premenné a covidovú premennú:

\[ Unemp_{it} = \beta_0 + \beta_1 time_{it} + \beta_2 covid_{it} + \beta_3 DE_i + \beta_4 SK_i + u_{it}. \]

model_full <- lm(unemp ~ time + covid + DE + SK, data = udaje)
summary(model_full)
## 
## Call:
## lm(formula = unemp ~ time + covid + DE + SK, data = udaje)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5182 -0.7046 -0.2176  0.3376  3.0566 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1306.6549   245.3397   5.326 1.14e-05 ***
## time          -0.6461     0.1217  -5.310 1.19e-05 ***
## covid          2.0511     0.7727   2.655   0.0129 *  
## DE             0.3909     0.4712   0.830   0.4138    
## SK             4.5545     0.4712   9.666 2.03e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.105 on 28 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8281 
## F-statistic: 39.53 on 4 and 28 DF,  p-value: 3.925e-11

Stručná interpretácia koeficientov:


4. Korelačná matica a scatterploty

Skontrolujeme párové vzťahy medzi vysvetľujúcimi premennými.

xvars <- udaje[, c("time", "covid", "DE", "SK")]
round(cor(xvars), 3)
##        time covid   DE   SK
## time  1.000 0.866  0.0  0.0
## covid 0.866 1.000  0.0  0.0
## DE    0.000 0.000  1.0 -0.5
## SK    0.000 0.000 -0.5  1.0

Scatterplotová matica:

pairs(xvars,
      main = "Scatterplotová matica – time, covid, DE, SK")

Komentár:


5. Diagnostika multikolinearity – VIF

Na diagnostiku multikolinearity použijeme Variance Inflation Factor (VIF).
Definujeme si jednoduchú vlastnú funkciu:

vif_manual <- function(model) {
  X <- model.matrix(model)[, -1]  # bez interceptu
  k <- ncol(X)
  vif_values <- numeric(k)
  names(vif_values) <- colnames(X)
  
  for (j in 1:k) {
    xj <- X[, j]
    others <- X[, -j, drop = FALSE]
    fit_j <- lm(xj ~ others)
    R2_j <- summary(fit_j)$r.squared
    vif_values[j] <- 1 / (1 - R2_j)
  }
  
  return(vif_values)
}

vif_manual(model_full)
##     time    covid       DE       SK 
## 4.000000 4.000000 1.333333 1.333333

Interpretácia VIF:

Pri time a covid očakávame vyššie VIF, pretože sú silno prepojené.


6. Condition Number

Ďalšou charakteristikou multikolinearity je tzv. Condition Number.
Najprv premenne štandardizujeme:

udaje$time_sc  <- scale(udaje$time, center = TRUE, scale = TRUE)
udaje$covid_sc <- scale(udaje$covid, center = TRUE, scale = TRUE)
udaje$DE_sc    <- scale(udaje$DE, center = TRUE, scale = TRUE)
udaje$SK_sc    <- scale(udaje$SK, center = TRUE, scale = TRUE)

model_scaled <- lm(unemp ~ time_sc + covid_sc + DE_sc + SK_sc,
                   data = udaje)
summary(model_scaled)
## 
## Call:
## lm(formula = unemp ~ time_sc + covid_sc + DE_sc + SK_sc, data = udaje)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5182 -0.7046 -0.2176  0.3376  3.0566 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.8394     0.1924  25.157  < 2e-16 ***
## time_sc      -2.0747     0.3907  -5.310 1.19e-05 ***
## covid_sc      1.0371     0.3907   2.655   0.0129 *  
## DE_sc         0.1871     0.2256   0.830   0.4138    
## SK_sc         2.1803     0.2256   9.666 2.03e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.105 on 28 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8281 
## F-statistic: 39.53 on 4 and 28 DF,  p-value: 3.925e-11

Vypočítame Condition Number z matice \(X'X\):

X  <- model.matrix(model_scaled)[, -1]   # bez interceptu
XtX <- t(X) %*% X
eig <- eigen(XtX)

condition_number <- sqrt(max(eig$values) / min(eig$values))
condition_number
## [1] 3.732051

Interpretácia (orientačne):

Spolu s VIF hodnotíme, či je multikolinearita skôr mierna alebo vážna.


7. Model bez covidovej premennej (ilustrácia riešenia)

Jedným z možných riešení multikolinearity je vynechanie jednej z kolineárnych premenných.
Odhadneme model bez covid:

\[ Unemp_{it} = \beta_0 + \beta_1 time_{it} + \beta_3 DE_i + \beta_4 SK_i + u_{it}. \]

model_no_covid <- lm(unemp ~ time + DE + SK, data = udaje)
summary(model_no_covid)
## 
## Call:
## lm(formula = unemp ~ time + DE + SK, data = udaje)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0455 -0.7136 -0.2900  0.7509  3.5227 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 742.87909  135.02078   5.502 6.29e-06 ***
## time         -0.36636    0.06687  -5.478 6.72e-06 ***
## DE            0.39091    0.51801   0.755    0.457    
## SK            4.55455    0.51801   8.792 1.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.215 on 29 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.7922 
## F-statistic: 41.67 on 3 and 29 DF,  p-value: 1.224e-10
vif_manual(model_no_covid)
##     time       DE       SK 
## 1.000000 1.333333 1.333333

Komentár:


8. Stručné zhrnutie