Príprava údajov

Pripravme si údaje - vysvetľovacie premenné, ktoré označíme ako x1 a x2. tieto vektory majú šiestu zložku odľahlú (pozri hodnoty 25,15).

x1<-c(1,2,4,6,3,25,8,1,0)
x2 <-c(2,0,1,3,4,15,6,0,1)
udaje <- data.frame(cbind(x1,x2))
print(udaje)
  x1 x2
1  1  2
2  2  0
3  4  1
4  6  3
5  3  4
6 25 15
7  8  6
8  1  0
9  0  1

Toto odľahlé pozorovanie by sa malo prejaviť vo vysokej hodnote leverage, t.j. 6-tom prvku diagonály Hat matice \[\mathbf{H} = \mathbf{X}(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}= \{h_{i,j}\}_{i,j \in \{1..n\}}\]

leverage vyjadrujú prvky na hlavnej diagonále tejto matice \(h_{ii}\), pričom \(h_{ii}\in \[0,1\]\). Čím je toto číslo bližšie k 1, tým je dané pozorovanie odľahlejšie.

plot(udaje)

X <- as.matrix(udaje)
X
      x1 x2
 [1,]  1  2
 [2,]  2  0
 [3,]  4  1
 [4,]  6  3
 [5,]  3  4
 [6,] 25 15
 [7,]  8  6
 [8,]  1  0
 [9,]  0  1
H <- X%*%solve((t(X)%*%X))%*%t(X)
H
             [,1]        [,2]        [,3]        [,4]        [,5]        [,6]        [,7]        [,8]        [,9]
 [1,]  0.14695661 -0.12431735 -0.14407705 -0.05927912  0.23175454  0.01439778  0.13007646 -0.06215867  0.10455764
 [2,] -0.12431735  0.11597657  0.14080032  0.07447125 -0.19064641  0.08241485 -0.08301062  0.05798828 -0.09115282
 [3,] -0.14407705  0.14080032  0.17436203  0.10068514 -0.21775395  0.15142488 -0.08023036  0.07040016 -0.10723861
 [4,] -0.05927912  0.07447125  0.10068514  0.07864164 -0.08132261  0.20703009  0.00834078  0.03723563 -0.04825737
 [5,]  0.23175454 -0.19064641 -0.21775395 -0.08132261  0.36818588  0.07000298  0.21864760 -0.09532321  0.16353887
 [6,]  0.01439778  0.08241485  0.15142488  0.20703009  0.07000298  0.82911330  0.24923046  0.04120743 -0.01340483
 [7,]  0.13007646 -0.08301062 -0.08023036  0.00834078  0.21864760  0.24923046  0.18270281 -0.04150531  0.08579088
 [8,] -0.06215867  0.05798828  0.07040016  0.03723563 -0.09532321  0.04120743 -0.04150531  0.02899414 -0.04557641
 [9,]  0.10455764 -0.09115282 -0.10723861 -0.04825737  0.16353887 -0.01340483  0.08579088 -0.04557641  0.07506702
print(paste("Prvky na hlavnej diagonále matice H sú: ",diag(H)))
[1] "Prvky na hlavnej diagonále matice H sú:  0.146956608082613" 
[2] "Prvky na hlavnej diagonále matice H sú:  0.115976566378711" 
[3] "Prvky na hlavnej diagonále matice H sú:  0.174362029589911" 
[4] "Prvky na hlavnej diagonále matice H sú:  0.0786416443252903"
[5] "Prvky na hlavnej diagonále matice H sú:  0.368185880250222" 
[6] "Prvky na hlavnej diagonále matice H sú:  0.829113295601233" 
[7] "Prvky na hlavnej diagonále matice H sú:  0.182702810048654" 
[8] "Prvky na hlavnej diagonále matice H sú:  0.0289941415946777"
[9] "Prvky na hlavnej diagonále matice H sú:  0.0750670241286861"
print(paste("Leverage šiesteho pozorovania je: ",diag(H)[6]))
[1] "Leverage šiesteho pozorovania je:  0.829113295601233"
print(paste("Suma prvkov na diagonále sa rovná počtu stĺpcov v X: ",sum(diag(H))))    # tejto sume hovoríme tiež stopa matice 
[1] "Suma prvkov na diagonále sa rovná počtu stĺpcov v X:  2"

Pridáme vysvetľovanú premennú y

y <- c(1,2,4,3,5,9,8,2,6)

a vyčíslime regresiu

model <- lm(y~+1+x1+x2,data=udaje)
summary(model)

Call:
lm(formula = y ~ +1 + x1 + x2, data = udaje)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1123 -0.6214 -0.4397  1.1781  2.4515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   2.8030     0.8457   3.314   0.0161 *
x1           -0.1816     0.3375  -0.538   0.6098  
x2            0.7455     0.5529   1.348   0.2262  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.982 on 6 degrees of freedom
Multiple R-squared:  0.6213,    Adjusted R-squared:  0.4951 
F-statistic: 4.923 on 2 and 6 DF,  p-value: 0.05429
plot(model)

Štandardizované rezíduá

\[ r_i = \frac{e_i}{\sqrt{MSE} \sqrt{1 - h_{ii}}} \]

kde \(r_i\) je štandardizované rezíduum majú približne strednú hodnotu nula a štandardnú odchýlku 1. Ak \(|r_i|>2\), je to potenciálny outlier, ak \(|r_i|>3\), je to už problematické pozorovanie.

Cook’s distance

\[ D_i = \frac{e_i^2}{p \cdot \mathrm{MSE}} \cdot \frac{h_{ii}}{(1 - h_{ii})^2} \]

where:


🔹 Interpretation Rules of Thumb

  • \(D_i < 0.5\): typically safe
  • \(0.5 \le D_i < 1.0\): moderate influence
  • \(D_i \ge 1.0\): highly influential observation, investigate or consider removal

(But the thresholds depend on context and sample size.)

Rozdelenie pravdepodobnosti zošikmené do ľava

# 1. Generate a left-skewed distribution
# Trick: take a right-skewed variable (chi-squared) and multiply by -1
set.seed(123)
x <- -rchisq(1000, df = 5)  # left-skewed because of the minus sign

# 2. Plot histogram
hist(x,
     breaks = 30,
     main = "Histogram of Left-Skewed Distribution",
     xlab = "Values",
     col = "skyblue",
     border = "white")


# 3. Q-Q plot
qqnorm(x,
       main = "Q-Q Plot of Left-Skewed Distribution")
qqline(x, col = "red", lwd = 2)

# 1. Generate a right-skewed distribution
# Example: Chi-squared with 5 degrees of freedom
set.seed(123)
x <- rchisq(1000, df = 5)   # right-skewed

# 2. Plot histogram
hist(x,
     breaks = 30,
     main = "Histogram of Right-Skewed Distribution",
     xlab = "Values",
     col = "orange",
     border = "white")


# 3. Q-Q plot
qqnorm(x,
       main = "Q-Q Plot of Right-Skewed Distribution")
qqline(x, col = "red", lwd = 2)

# 1. Generate symmetric heavy-tailed data
# t-distribution with low df => high kurtosis
set.seed(123)
x <- rt(1000, df = 2)  # symmetric but heavy tails (high kurtosis)

# 2. Plot histogram
hist(x,
     breaks = 30,
     main = "Histogram of Symmetric Distribution with High Kurtosis",
     xlab = "Values",
     col = "lightgreen",
     border = "white")


# 3. Q-Q plot
qqnorm(x,
       main = "Q-Q Plot of Symmetric Heavy-Tailed Distribution")
qqline(x, col = "red", lwd = 2)

# 1. Generate uniform distribution
set.seed(123)
x <- runif(1000, min = -3, max = 3)  # symmetric around 0, flat tails

# 2. Plot histogram
hist(x,
     breaks = 30,
     main = "Histogram of Uniform Distribution (Low Kurtosis)",
     xlab = "Values",
     col = "lightblue",
     border = "white")


# 3. Q-Q plot
qqnorm(x,
       main = "Q-Q Plot of Uniform Distribution")
qqline(x, col = "red", lwd = 2)

LS0tCnRpdGxlOiAiSGF0IE1hdHJpeCBhIExldmVyYWdlIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgojIyBQcsOtcHJhdmEgw7pkYWpvdgoKUHJpcHJhdm1lIHNpIMO6ZGFqZSAtIHZ5c3ZldMS+b3ZhY2llIHByZW1lbm7DqSwga3RvcsOpIG96bmHEjcOtbWUgYWtvIHgxIGEgeDIuIHRpZXRvIHZla3RvcnkgbWFqw7ogxaFpZXN0dSB6bG/Fvmt1IG9kxL5haGzDuiAocG96cmkgaG9kbm90eSAyNSwxNSkuIAoKYGBge3J9CngxPC1jKDEsMiw0LDYsMywyNSw4LDEsMCkKeDIgPC1jKDIsMCwxLDMsNCwxNSw2LDAsMSkKdWRhamUgPC0gZGF0YS5mcmFtZShjYmluZCh4MSx4MikpCnByaW50KHVkYWplKQpgYGAKClRvdG8gb2TEvmFobMOpIHBvem9yb3ZhbmllIGJ5IHNhIG1hbG8gcHJlamF2acWlIHZvIHZ5c29rZWogaG9kbm90ZSBsZXZlcmFnZSwgdC5qLiA2LXRvbSBwcnZrdSBkaWFnb27DoWx5IEhhdCBtYXRpY2UgCiQkXG1hdGhiZntIfSA9IFxtYXRoYmZ7WH0oXG1hdGhiZntYfV50XG1hdGhiZntYfSleey0xfVxtYXRoYmZ7WH09IFx7aF97aSxqfVx9X3tpLGogXGluIFx7MS4ublx9fSQkCgpsZXZlcmFnZSB2eWphZHJ1asO6IHBydmt5IG5hIGhsYXZuZWogZGlhZ29uw6FsZSB0ZWp0byBtYXRpY2UgJGhfe2lpfSQsIHByacSNb20gJGhfe2lpfVxpbiBcWzAsMVxdJC4gxIzDrW0gamUgdG90byDEjcOtc2xvIGJsacW+xaFpZSBrIDEsIHTDvW0gamUgZGFuw6kgcG96b3JvdmFuaWUgb2TEvmFobGVqxaFpZS4KCgpgYGB7cn0KcGxvdCh1ZGFqZSkKWCA8LSBhcy5tYXRyaXgodWRhamUpClgKYGBgCmBgYHtyfQpIIDwtIFglKiVzb2x2ZSgodChYKSUqJVgpKSUqJXQoWCkKSApwcmludChwYXN0ZSgiUHJ2a3kgbmEgaGxhdm5laiBkaWFnb27DoWxlIG1hdGljZSBIIHPDujogIixkaWFnKEgpKSkKcHJpbnQocGFzdGUoIkxldmVyYWdlIMWhaWVzdGVobyBwb3pvcm92YW5pYSBqZTogIixkaWFnKEgpWzZdKSkKcHJpbnQocGFzdGUoIlN1bWEgcHJ2a292IG5hIGRpYWdvbsOhbGUgc2Egcm92bsOhIHBvxI10dSBzdMS6cGNvdiB2IFg6ICIsc3VtKGRpYWcoSCkpKSkgICAgIyB0ZWp0byBzdW1lIGhvdm9yw61tZSB0aWXFviBzdG9wYSBtYXRpY2UgCmBgYAoKCiMjIFByaWTDoW1lIHZ5c3ZldMS+b3ZhbsO6IHByZW1lbm7DuiB5CgpgYGB7cn0KeSA8LSBjKDEsMiw0LDMsNSw5LDgsMiw2KQpgYGAKYSB2ecSNw61zbGltZSByZWdyZXNpdQoKYGBge3J9Cm1vZGVsIDwtIGxtKHl+KzEreDEreDIsZGF0YT11ZGFqZSkKc3VtbWFyeShtb2RlbCkKYGBgCgpgYGB7cn0KcGxvdChtb2RlbCkKYGBgCiMjIyDFoHRhbmRhcmRpem92YW7DqSByZXrDrWR1w6EKClxbCnJfaSA9IFxmcmFje2VfaX17XHNxcnR7TVNFfSBcc3FydHsxIC0gaF97aWl9fX0KXF0KCmtkZSAkcl9pJCBqZSDFoXRhbmRhcmRpem92YW7DqSByZXrDrWR1dW0gbWFqw7ogcHJpYmxpxb5uZSBzdHJlZG7DuiBob2Rub3R1IG51bGEgYSDFoXRhbmRhcmRuw7ogb2RjaMO9bGt1IDEuIEFrICR8cl9pfD4yJCwgamUgdG8gcG90ZW5jacOhbG55IG91dGxpZXIsIGFrICR8cl9pfD4zJCwgamUgdG8gdcW+IHByb2JsZW1hdGlja8OpIHBvem9yb3ZhbmllLgoKCgoKIyMgQ29vaydzIGRpc3RhbmNlCgpcWwpEX2kgPSBcZnJhY3tlX2leMn17cCBcY2RvdCBcbWF0aHJte01TRX19IFxjZG90IFxmcmFje2hfe2lpfX17KDEgLSBoX3tpaX0pXjJ9ClxdCgp3aGVyZToKCi0gXCggZV9pIFwpID0gcmV6w61kdXVtIFwoIGkgXCkgIAotIFwoIFxtYXRocm17TVNFfSBcKSA9IHN0cmVkbsOhIMWhdHZvcmNvdsOhIGNoeWJhIG1vZGVsdSAgCi0gXCggaF97aWl9IFwpID0gbGV2ZXJhZ2UgcG96b3JvdmFuaWEgXCggaSBcKSAoZGlhZ29uw6FsbnkgZWxlbWVudCBIYXQgbWF0aWNlIFwoIEggPSBYIChYJ1gpXnstMX0gWCcgXCkpICAKLSBcKCBwIFwpID0gcG/EjWV0IHBhcmFtZXRyb3YgKHZyw6F0YW5lICRcYmV0YV8wJCkKCi0tLQoKIyMjIPCflLkgKipJbnRlcnByZXRhdGlvbiBSdWxlcyBvZiBUaHVtYioqCgotIFwoIERfaSA8IDAuNSBcKTogdHlwaWNhbGx5ICoqc2FmZSoqICAKLSBcKCAwLjUgXGxlIERfaSA8IDEuMCBcKTogKiptb2RlcmF0ZSBpbmZsdWVuY2UqKiAgCi0gXCggRF9pIFxnZSAxLjAgXCk6ICoqaGlnaGx5IGluZmx1ZW50aWFsIG9ic2VydmF0aW9uKiosIGludmVzdGlnYXRlIG9yIGNvbnNpZGVyIHJlbW92YWwgIAoKKihCdXQgdGhlIHRocmVzaG9sZHMgZGVwZW5kIG9uIGNvbnRleHQgYW5kIHNhbXBsZSBzaXplLikqCgoKCgoKCgoKCiMjIFJvemRlbGVuaWUgcHJhdmRlcG9kb2Jub3N0aSB6b8WhaWttZW7DqSBkbyDEvmF2YQoKYGBge3J9CiMgMS4gR2VuZXJhdGUgYSBsZWZ0LXNrZXdlZCBkaXN0cmlidXRpb24KIyBUcmljazogdGFrZSBhIHJpZ2h0LXNrZXdlZCB2YXJpYWJsZSAoY2hpLXNxdWFyZWQpIGFuZCBtdWx0aXBseSBieSAtMQpzZXQuc2VlZCgxMjMpCnggPC0gLXJjaGlzcSgxMDAwLCBkZiA9IDUpICAjIGxlZnQtc2tld2VkIGJlY2F1c2Ugb2YgdGhlIG1pbnVzIHNpZ24KCiMgMi4gUGxvdCBoaXN0b2dyYW0KaGlzdCh4LAogICAgIGJyZWFrcyA9IDMwLAogICAgIG1haW4gPSAiSGlzdG9ncmFtIG9mIExlZnQtU2tld2VkIERpc3RyaWJ1dGlvbiIsCiAgICAgeGxhYiA9ICJWYWx1ZXMiLAogICAgIGNvbCA9ICJza3libHVlIiwKICAgICBib3JkZXIgPSAid2hpdGUiKQoKIyAzLiBRLVEgcGxvdApxcW5vcm0oeCwKICAgICAgIG1haW4gPSAiUS1RIFBsb3Qgb2YgTGVmdC1Ta2V3ZWQgRGlzdHJpYnV0aW9uIikKcXFsaW5lKHgsIGNvbCA9ICJyZWQiLCBsd2QgPSAyKQoKYGBgCgpgYGB7cn0KIyAxLiBHZW5lcmF0ZSBhIHJpZ2h0LXNrZXdlZCBkaXN0cmlidXRpb24KIyBFeGFtcGxlOiBDaGktc3F1YXJlZCB3aXRoIDUgZGVncmVlcyBvZiBmcmVlZG9tCnNldC5zZWVkKDEyMykKeCA8LSByY2hpc3EoMTAwMCwgZGYgPSA1KSAgICMgcmlnaHQtc2tld2VkCgojIDIuIFBsb3QgaGlzdG9ncmFtCmhpc3QoeCwKICAgICBicmVha3MgPSAzMCwKICAgICBtYWluID0gIkhpc3RvZ3JhbSBvZiBSaWdodC1Ta2V3ZWQgRGlzdHJpYnV0aW9uIiwKICAgICB4bGFiID0gIlZhbHVlcyIsCiAgICAgY29sID0gIm9yYW5nZSIsCiAgICAgYm9yZGVyID0gIndoaXRlIikKCiMgMy4gUS1RIHBsb3QKcXFub3JtKHgsCiAgICAgICBtYWluID0gIlEtUSBQbG90IG9mIFJpZ2h0LVNrZXdlZCBEaXN0cmlidXRpb24iKQpxcWxpbmUoeCwgY29sID0gInJlZCIsIGx3ZCA9IDIpCgpgYGAKCmBgYHtyfQojIDEuIEdlbmVyYXRlIHN5bW1ldHJpYyBoZWF2eS10YWlsZWQgZGF0YQojIHQtZGlzdHJpYnV0aW9uIHdpdGggbG93IGRmID0+IGhpZ2gga3VydG9zaXMKc2V0LnNlZWQoMTIzKQp4IDwtIHJ0KDEwMDAsIGRmID0gMikgICMgc3ltbWV0cmljIGJ1dCBoZWF2eSB0YWlscyAoaGlnaCBrdXJ0b3NpcykKCiMgMi4gUGxvdCBoaXN0b2dyYW0KaGlzdCh4LAogICAgIGJyZWFrcyA9IDMwLAogICAgIG1haW4gPSAiSGlzdG9ncmFtIG9mIFN5bW1ldHJpYyBEaXN0cmlidXRpb24gd2l0aCBIaWdoIEt1cnRvc2lzIiwKICAgICB4bGFiID0gIlZhbHVlcyIsCiAgICAgY29sID0gImxpZ2h0Z3JlZW4iLAogICAgIGJvcmRlciA9ICJ3aGl0ZSIpCgojIDMuIFEtUSBwbG90CnFxbm9ybSh4LAogICAgICAgbWFpbiA9ICJRLVEgUGxvdCBvZiBTeW1tZXRyaWMgSGVhdnktVGFpbGVkIERpc3RyaWJ1dGlvbiIpCnFxbGluZSh4LCBjb2wgPSAicmVkIiwgbHdkID0gMikKCmBgYAoKYGBge3J9CiMgMS4gR2VuZXJhdGUgdW5pZm9ybSBkaXN0cmlidXRpb24Kc2V0LnNlZWQoMTIzKQp4IDwtIHJ1bmlmKDEwMDAsIG1pbiA9IC0zLCBtYXggPSAzKSAgIyBzeW1tZXRyaWMgYXJvdW5kIDAsIGZsYXQgdGFpbHMKCiMgMi4gUGxvdCBoaXN0b2dyYW0KaGlzdCh4LAogICAgIGJyZWFrcyA9IDMwLAogICAgIG1haW4gPSAiSGlzdG9ncmFtIG9mIFVuaWZvcm0gRGlzdHJpYnV0aW9uIChMb3cgS3VydG9zaXMpIiwKICAgICB4bGFiID0gIlZhbHVlcyIsCiAgICAgY29sID0gImxpZ2h0Ymx1ZSIsCiAgICAgYm9yZGVyID0gIndoaXRlIikKCiMgMy4gUS1RIHBsb3QKcXFub3JtKHgsCiAgICAgICBtYWluID0gIlEtUSBQbG90IG9mIFVuaWZvcm0gRGlzdHJpYnV0aW9uIikKcXFsaW5lKHgsIGNvbCA9ICJyZWQiLCBsd2QgPSAyKQoKYGBgCgo=