Pripravme si údaje - vysvetľovacie premenné, ktoré označíme ako x1 a x2. tieto vektory majú šiestu zložku odľahlú (pozri hodnoty 25,15).
x1<-c(1,2,4,6,3,25,8,1,0)
x2 <-c(2,0,1,3,4,15,6,0,1)
udaje <- data.frame(cbind(x1,x2))
print(udaje)
x1 x2
1 1 2
2 2 0
3 4 1
4 6 3
5 3 4
6 25 15
7 8 6
8 1 0
9 0 1
Toto odľahlé pozorovanie by sa malo prejaviť vo vysokej hodnote leverage, t.j. 6-tom prvku diagonály Hat matice \[\mathbf{H} = \mathbf{X}(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}= \{h_{i,j}\}_{i,j \in \{1..n\}}\]
leverage vyjadrujú prvky na hlavnej diagonále tejto matice \(h_{ii}\), pričom \(h_{ii}\in \[0,1\]\). Čím je toto číslo bližšie k 1, tým je dané pozorovanie odľahlejšie.
plot(udaje)
X <- as.matrix(udaje)
X
x1 x2
[1,] 1 2
[2,] 2 0
[3,] 4 1
[4,] 6 3
[5,] 3 4
[6,] 25 15
[7,] 8 6
[8,] 1 0
[9,] 0 1
H <- X%*%solve((t(X)%*%X))%*%t(X)
H
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0.14695661 -0.12431735 -0.14407705 -0.05927912 0.23175454 0.01439778 0.13007646 -0.06215867 0.10455764
[2,] -0.12431735 0.11597657 0.14080032 0.07447125 -0.19064641 0.08241485 -0.08301062 0.05798828 -0.09115282
[3,] -0.14407705 0.14080032 0.17436203 0.10068514 -0.21775395 0.15142488 -0.08023036 0.07040016 -0.10723861
[4,] -0.05927912 0.07447125 0.10068514 0.07864164 -0.08132261 0.20703009 0.00834078 0.03723563 -0.04825737
[5,] 0.23175454 -0.19064641 -0.21775395 -0.08132261 0.36818588 0.07000298 0.21864760 -0.09532321 0.16353887
[6,] 0.01439778 0.08241485 0.15142488 0.20703009 0.07000298 0.82911330 0.24923046 0.04120743 -0.01340483
[7,] 0.13007646 -0.08301062 -0.08023036 0.00834078 0.21864760 0.24923046 0.18270281 -0.04150531 0.08579088
[8,] -0.06215867 0.05798828 0.07040016 0.03723563 -0.09532321 0.04120743 -0.04150531 0.02899414 -0.04557641
[9,] 0.10455764 -0.09115282 -0.10723861 -0.04825737 0.16353887 -0.01340483 0.08579088 -0.04557641 0.07506702
print(paste("Prvky na hlavnej diagonále matice H sú: ",diag(H)))
[1] "Prvky na hlavnej diagonále matice H sú: 0.146956608082613"
[2] "Prvky na hlavnej diagonále matice H sú: 0.115976566378711"
[3] "Prvky na hlavnej diagonále matice H sú: 0.174362029589911"
[4] "Prvky na hlavnej diagonále matice H sú: 0.0786416443252903"
[5] "Prvky na hlavnej diagonále matice H sú: 0.368185880250222"
[6] "Prvky na hlavnej diagonále matice H sú: 0.829113295601233"
[7] "Prvky na hlavnej diagonále matice H sú: 0.182702810048654"
[8] "Prvky na hlavnej diagonále matice H sú: 0.0289941415946777"
[9] "Prvky na hlavnej diagonále matice H sú: 0.0750670241286861"
print(paste("Leverage šiesteho pozorovania je: ",diag(H)[6]))
[1] "Leverage šiesteho pozorovania je: 0.829113295601233"
print(paste("Suma prvkov na diagonále sa rovná počtu stĺpcov v X: ",sum(diag(H)))) # tejto sume hovoríme tiež stopa matice
[1] "Suma prvkov na diagonále sa rovná počtu stĺpcov v X: 2"
y <- c(1,2,4,3,5,9,8,2,6)
a vyčíslime regresiu
model <- lm(y~+1+x1+x2,data=udaje)
summary(model)
Call:
lm(formula = y ~ +1 + x1 + x2, data = udaje)
Residuals:
Min 1Q Median 3Q Max
-3.1123 -0.6214 -0.4397 1.1781 2.4515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8030 0.8457 3.314 0.0161 *
x1 -0.1816 0.3375 -0.538 0.6098
x2 0.7455 0.5529 1.348 0.2262
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.982 on 6 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.4951
F-statistic: 4.923 on 2 and 6 DF, p-value: 0.05429
plot(model)
\[ r_i = \frac{e_i}{\sqrt{MSE} \sqrt{1 - h_{ii}}} \]
kde \(r_i\) je štandardizované rezíduum majú približne strednú hodnotu nula a štandardnú odchýlku 1. Ak \(|r_i|>2\), je to potenciálny outlier, ak \(|r_i|>3\), je to už problematické pozorovanie.
\[ D_i = \frac{e_i^2}{p \cdot \mathrm{MSE}} \cdot \frac{h_{ii}}{(1 - h_{ii})^2} \]
where:
(But the thresholds depend on context and sample size.)
# 1. Generate a left-skewed distribution
# Trick: take a right-skewed variable (chi-squared) and multiply by -1
set.seed(123)
x <- -rchisq(1000, df = 5) # left-skewed because of the minus sign
# 2. Plot histogram
hist(x,
breaks = 30,
main = "Histogram of Left-Skewed Distribution",
xlab = "Values",
col = "skyblue",
border = "white")
# 3. Q-Q plot
qqnorm(x,
main = "Q-Q Plot of Left-Skewed Distribution")
qqline(x, col = "red", lwd = 2)
# 1. Generate a right-skewed distribution
# Example: Chi-squared with 5 degrees of freedom
set.seed(123)
x <- rchisq(1000, df = 5) # right-skewed
# 2. Plot histogram
hist(x,
breaks = 30,
main = "Histogram of Right-Skewed Distribution",
xlab = "Values",
col = "orange",
border = "white")
# 3. Q-Q plot
qqnorm(x,
main = "Q-Q Plot of Right-Skewed Distribution")
qqline(x, col = "red", lwd = 2)
# 1. Generate symmetric heavy-tailed data
# t-distribution with low df => high kurtosis
set.seed(123)
x <- rt(1000, df = 2) # symmetric but heavy tails (high kurtosis)
# 2. Plot histogram
hist(x,
breaks = 30,
main = "Histogram of Symmetric Distribution with High Kurtosis",
xlab = "Values",
col = "lightgreen",
border = "white")
# 3. Q-Q plot
qqnorm(x,
main = "Q-Q Plot of Symmetric Heavy-Tailed Distribution")
qqline(x, col = "red", lwd = 2)
# 1. Generate uniform distribution
set.seed(123)
x <- runif(1000, min = -3, max = 3) # symmetric around 0, flat tails
# 2. Plot histogram
hist(x,
breaks = 30,
main = "Histogram of Uniform Distribution (Low Kurtosis)",
xlab = "Values",
col = "lightblue",
border = "white")
# 3. Q-Q plot
qqnorm(x,
main = "Q-Q Plot of Uniform Distribution")
qqline(x, col = "red", lwd = 2)