Import súboru
Multicollinearity detection and testing
- Variance Inflation Factor
install.packages(“car”) # run once if not installed
- Číslo podmienenosti
Eliminácia problému multikolinearity
- Odstránenie dvoch z korelovaných premenných

Import súboru

# import the dataset and create a data.frame udaje
udaje_svet <- read.csv("Life-Expectancy-Data-Updated.csv",header=TRUE,sep=",",dec=".",check.names = TRUE)
head(udaje_svet)

 udaje_svet <- udaje_svet[-992,]

# z databázy udaje_svet si vyberieme len tie pozorovania, ktoré sa týkajú Portugalska 
udaje <- subset(udaje_svet, Country == "Portugal")

# vyrovnanie priebehu očakávanej dĺžky dožitia v čase
model <- lm(Life_expectancy ~ Alcohol_consumption+Adult_mortality+Incidents_HIV,data = udaje)
library(broom)
library(knitr)
library(kableExtra)

# koeficienty regresie
tidy(model) %>%
  kable(digits = 3, caption = "Odhadnuté koeficienty regresie") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Odhadnuté koeficienty regresie
term	estimate	std.error	statistic	p.value
(Intercept)	89.127	2.353	37.870	0.000
Alcohol_consumption	-0.474	0.228	-2.084	0.059
Adult_mortality	0.035	0.071	0.499	0.627
Incidents_HIV	-47.654	25.950	-1.836	0.091

# kvalita vyrovnania
glance(model) %>%
  kable(digits = 3, caption = "Ukazovatele kvality vyrovnania") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Ukazovatele kvality vyrovnania
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.967	0.959	0.322	116.512	0	3	-2.282	14.564	18.426	1.246	12	16

# výber premenných
X <- udaje[, c("Alcohol_consumption", "Adult_mortality", "Incidents_HIV")]

# výpočet korelačnej matice
cor_matrix <- cor(X, use = "complete.obs")

# zaokrúhlenie
round(cor_matrix, 4)

##                     Alcohol_consumption Adult_mortality Incidents_HIV
## Alcohol_consumption              1.0000          0.9328        0.9323
## Adult_mortality                  0.9328          1.0000        0.9947
## Incidents_HIV                    0.9323          0.9947        1.0000

library(knitr)
round(cor_matrix, 4) %>%
  kable(caption = "Korelačná matica")

Korelačná matica
	Alcohol_consumption	Adult_mortality	Incidents_HIV
Alcohol_consumption	1.0000	0.9328	0.9323
Adult_mortality	0.9328	1.0000	0.9947
Incidents_HIV	0.9323	0.9947	1.0000

cor.test (udaje$Adult_mortality, udaje$Alcohol_consumption)

## 
##  Pearson's product-moment correlation
## 
## data:  udaje$Adult_mortality and udaje$Alcohol_consumption
## t = 9.6832, df = 14, p-value = 1.39e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8129992 0.9768213
## sample estimates:
##       cor 
## 0.9327848

cor.test (udaje$Alcohol_consumption, udaje$Incidents_HIV)

## 
##  Pearson's product-moment correlation
## 
## data:  udaje$Alcohol_consumption and udaje$Incidents_HIV
## t = 9.6423, df = 14, p-value = 1.464e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8116555 0.9766395
## sample estimates:
##       cor 
## 0.9322696

cor.test (udaje$Adult_mortality, udaje$Incidents_HIV )

## 
##  Pearson's product-moment correlation
## 
## data:  udaje$Adult_mortality and udaje$Incidents_HIV
## t = 36.329, df = 14, p-value = 2.954e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9844731 0.9982227
## sample estimates:
##       cor 
## 0.9947378

Multicollinearity detection and testing

if (!requireNamespace("corrplot", quietly = TRUE)) {
  install.packages("corrplot")}

library(corrplot)

corrplot(cor_matrix, method = "number", type = "upper")

Variance Inflation Factor

Pre premennú \(x_j\) definujeme

\[ VIF_j = \frac{1}{1-R_j^2}, \]

kde \(R_j^2\) je koeficient determinácie z pomocnej regresie, v ktorej je \(x_j\) vysvetľovaná ostatnými regresormi.

Ak je \(R_j^2\) blízko jednej, potom je \(VIF_j\) veľký a premenná \(x_j\) je silno lineárne vysvetliteľná ostatnými premennými.

install.packages(“car”) # run once if not installed

library(car)
# Variance Inflation Factors
vif_values <- vif(model)

vif_values

## Alcohol_consumption     Adult_mortality       Incidents_HIV 
##            7.808001           97.351215           96.636514

Interpretation:

\(VIF_j = 1\): žiadna multikolinearita,
\(1 < VIF_j < 5\): mierna multikolinearita
\(VIF_j \geq 5\): silná multikolinearita (niektorí autori používajú prísnejší prah \(VIF_j \geq 10\)).

Číslo podmienenosti

Číslo podmienenosti je založené na vlastných číslach matice \(X'X\). Ak sú vlastné čísla veľmi rozdielne, matica je zle podmienená.

\[ \kappa = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}. \]

X_scaled <- scale(X)
eigen_values <- eigen(cor(X_scaled))$values

condition_number <- sqrt(max(eigen_values) / min(eigen_values))
eigen_values

## [1] 2.906833311 0.087905633 0.005261057

condition_number

## [1] 23.50572

Interpretácia čísla podmienenosti:P - \(\kappa \approx 1\): žiadna multikolinearita, - \(\kappa > 10\): mierna multikolinearita, - \(\kappa > 30\): silná multikolinearita.

Malá p-hodnota znamená, že zamietame hypotézu \(R=I_k\), teda medzi vysvetľujúcimi premennými existuje štatisticky významná korelačná štruktúra.

Eliminácia problému multikolinearity

Multikolinearita nie je porušením exogenity. Nie je teda automaticky dôvodom na zamietnutie OLS modelu. Problém je hlavne inferenčný: veľké štandardné chyby a nestabilné individuálne koeficienty.

Možné riešenia:

získať viac dát,
odstrániť alebo spojiť silno korelované premenné,
transformovať premenné,
použiť teoretické obmedzenia,
použiť regularizačné metódy, napríklad ridge regresiu.

Odstránenie dvoch z korelovaných premenných

Dve premenné (Adult mortality a Incidents_HIV ) sú veľmi vysoké, tak ich vynecháme.

model_reduced <- lm(Life_expectancy ~ Alcohol_consumption, data = udaje)

summary(model)

## 
## Call:
## lm(formula = Life_expectancy ~ Alcohol_consumption + Adult_mortality + 
##     Incidents_HIV, data = udaje)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45968 -0.22271 -0.02747  0.16006  0.70128 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          89.12688    2.35348  37.870 7.39e-14 ***
## Alcohol_consumption  -0.47444    0.22771  -2.084   0.0593 .  
## Adult_mortality       0.03545    0.07099   0.499   0.6266    
## Incidents_HIV       -47.65355   25.95043  -1.836   0.0912 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3222 on 12 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9585 
## F-statistic: 116.5 on 3 and 12 DF,  p-value: 3.865e-09

summary(model_reduced)

## 
## Call:
## lm(formula = Life_expectancy ~ Alcohol_consumption, data = udaje)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82921 -0.28194 -0.00681  0.21178  0.94241 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          95.5365     1.4875   64.23  < 2e-16 ***
## Alcohol_consumption  -1.4714     0.1298  -11.33 1.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5134 on 14 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8947 
## F-statistic: 128.4 on 1 and 14 DF,  p-value: 1.945e-08

Ak je spotreba alkoholu 0, očakávaná sĺžka života je 95,53 rokov. Každé zvýšenie spotreby alkoholu o 1 jednotku znižuje očakávanú dĺžku života približne o 1,47 roka. Štatistická významnosť- p- hodnota pre alkohol 1,94 - 0,8 -> veľmi silne štatisticky význbamná (prakticky nulová pravdepodobnosť náhody).Model ako celok je štatisticky významný.

Multikolinearita v lineárnej regresii

Lucia Faltýnková