library(sjPlot)
library(sjmisc)
library(sjlabelled)
To address the different types of influence measures, we first fix some notation:
Regression model: \(Y = X\beta + \epsilon\)
Predicted response: \(\hat{y} = Xb\)
Estimated coefficients: \(b = (X^{T}X)^{-1}X^{T}y\)
The predicted response above can be re-written as \(\hat{y} = X(X^{T}X)^{-1}X^{T}y\)
We call \(H = X(X^{T}X)^{-1}X^{T}\) the hat matrix; it helps us identify leverage points.
The predicted response can therefore also be written as \(\hat{y} = Hy\)
Residual: \(e_i = y_i - \hat{y}_i\)
Standardized residual for each observation: \(r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE(1-h_{ii})}}\)
DFFITS: \(DFFITS_{i} = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}h_{ii}}}\)
Cook's distance: \(D_i = \frac{(y_i - \hat{y}_i)^2}{(k+1)MSE}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right]\)
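To make the hat-matrix algebra concrete, here is a minimal sketch (on simulated data, so the numbers are only illustrative) checking that \(\hat{y} = Hy\) reproduces the fitted values from lm():
set.seed(1)
x = runif(20, 0, 10)                      # simulated predictor
y = 2 + 5 * x + rnorm(20, sd = 3)         # simulated response
X = cbind(1, x)                           # design matrix with intercept column
H = X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix H = X (X'X)^{-1} X'
y_hat = H %*% y                           # predicted response via the hat matrix
all.equal(as.vector(y_hat), unname(fitted(lm(y ~ x))))  # should be TRUE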
Properties of leverages:
The leverage \(h_{ii}\) is a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
The leverage \(h_{ii}\) is a number between 0 and 1, inclusive.
The sum of the \(h_{ii}\) equals k+1, the number of parameters (regression coefficients including the intercept).
An x observation is flagged as high-leverage if \(h_{ii} > 3 \left( \frac{k+1}{n} \right)\); a quick base R check is sketched below.
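In base R, hatvalues() returns the diagonal of H directly, so both the sum property and the cutoff rule can be checked in a couple of lines. A minimal sketch, again on simulated data with one deliberately extreme x value:
set.seed(2)
x = c(runif(24, 0, 10), 30)    # last point is far from the others in x
y = 2 + 5 * x + rnorm(25, sd = 3)
fit = lm(y ~ x)
h = hatvalues(fit)             # diagonal of the hat matrix, h_ii
sum(h)                         # equals k + 1 = 2 (intercept plus one slope)
which(h > 3 * (1 + 1) / 25)    # flags observation 25 as high-leverage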
An outlier is a y observation that is extreme with respect to the other y observations (not in its x values). An influential point is one whose predictor x value(s) noticeably change part of the regression results (coefficients, hypothesis tests, etc.).
Outlier
The chart below shows an example of an outlier. The standard error of \(\beta_1\) is larger when row #21 is included, which in turn widens the resulting confidence intervals.
Input = ("
Row x y
1 0.1 -0.0716
2 0.45401 4.1673
3 1.09765 6.5703
4 1.27936 13.815
5 2.20611 11.4501
6 2.50064 12.9554
7 3.0403 20.1575
8 3.23583 17.5633
9 4.45308 26.0317
10 4.1699 22.7573
11 5.28474 26.303
12 5.59238 30.6885
13 5.92091 33.9402
14 6.66066 30.9228
15 6.79953 34.11
16 7.97943 44.4536
17 8.41536 46.5022
18 8.71607 50.0568
19 8.70156 46.5475
20 9.16463 45.7762
21 4 40
")
influential = read.table(textConnection(Input), header = T)
plot(x = influential$x, y = influential$y)
influential_lm_model = lm(y ~ x, data = influential)                        # fit with all 21 rows
influential_remove_row_21__lm_model = lm(y ~ x, data = influential[-c(21),]) # fit with row #21 removed
tab_model(influential_lm_model, influential_remove_row_21__lm_model, show.se = T, show.re.var = T)
Left columns: fit on all 21 rows; right columns: fit with row #21 removed.

| Predictors | Estimates | std. Error | CI | p | Estimates | std. Error | CI | p |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 2.96 | 2.01 | -0.98 – 6.90 | 0.157 | 1.73 | 1.12 | -0.46 – 3.93 | 0.140 |
| x | 5.04 | 0.36 | 4.33 – 5.75 | <0.001 | 5.12 | 0.20 | 4.72 – 5.51 | <0.001 |
| Observations | 21 | | | | 20 | | | |
| R² / adjusted R² | 0.910 / 0.905 | | | | 0.973 / 0.972 | | | |
ordered_influential = influential[order(influential$x),]   # sort by x so the leverage curve plots cleanly
l = length(ordered_influential$Row)
X = matrix(rep(1, l))                                      # intercept column
X = cbind(X, ordered_influential$x)                        # design matrix
H = X %*% solve(t(X) %*% X) %*% t(X)                       # hat matrix
plot(X[,2], rep(-0.5, l), ylim = c(-0.5, 1))               # data locations along x
lines(X[,2], diag(H))                                      # leverage h_ii as a function of x
sprintf("Since the Hii are relatively low, we can conclude that row#21 has no high-leverage")
## [1] "Since the Hii are relatively low, we can conclude that row#21 has no high-leverage"
# standardized residuals, using the equations from the beginning of this chapter
residuals_ = (ordered_influential$y - H %*% ordered_influential$y)   # e = y - Hy
residuals_var = sum(residuals_^2) / 19                               # MSE on n - k - 1 = 19 df
standardized_residuals = residuals_ / sqrt(residuals_var * (1 - diag(H)))
sprintf("Since the std_res for row#21(original data set) = %f > 3, we can conclude that this is an outlier", standardized_residuals[9,1] )
## [1] "Since the std_res for row#21(original data set) = 3.681098 > 3, we can conclude that this is an outlier"
Why should I watch for outliers? The mean squared error (MSE) is substantially inflated, from 6.72 to 22.19, by the presence of the outlier. Recall that MSE appears in all of our confidence and prediction interval formulas, so the inflated MSE causes a detrimental increase in the width of all of those intervals.
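Those two MSE values can be read off the fitted models directly; a one-line sketch using the two models from this section:
deviance(influential_lm_model) / df.residual(influential_lm_model)                                 # about 22.19, with the outlier
deviance(influential_remove_row_21__lm_model) / df.residual(influential_remove_row_21__lm_model)   # about 6.72, outlier removed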
High-Leverage and Outlier
The chart below shows an example of an influential point that is both an outlier and a high-leverage point.
Input = ("
Row x y
1 0.1 -0.0716
2 0.45401 4.1673
3 1.09765 6.5703
4 1.27936 13.815
5 2.20611 11.4501
6 2.50064 12.9554
7 3.0403 20.1575
8 3.23583 17.5633
9 4.45308 26.0317
10 4.1699 22.7573
11 5.28474 26.303
12 5.59238 30.6885
13 5.92091 33.9402
14 6.66066 30.9228
15 6.79953 34.11
16 7.97943 44.4536
17 8.41536 46.5022
18 8.71607 50.0568
19 8.70156 46.5475
20 9.16463 45.7762
21 13 15
")
influential = read.table(textConnection(Input), header = T)
plot(x=influential$x,y=influential$y)
influential_lm_model = lm(y~x, data = influential)
influential_remove_row_21__lm_model = lm(y~x, data = influential[-c(21),])
tab_model(influential_lm_model,influential_remove_row_21__lm_model,show.se = T,show.re.var = T)
Left columns: fit on all 21 rows; right columns: fit with row #21 removed.

| Predictors | Estimates | std. Error | CI | p | Estimates | std. Error | CI | p |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 8.50 | 4.22 | 0.23 – 16.78 | 0.058 | 1.73 | 1.12 | -0.46 – 3.93 | 0.140 |
| x | 3.32 | 0.69 | 1.97 – 4.66 | <0.001 | 5.12 | 0.20 | 4.72 – 5.51 | <0.001 |
| Observations | 21 | | | | 20 | | | |
| R² / adjusted R² | 0.552 / 0.528 | | | | 0.973 / 0.972 | | | |
ordered_influential = influential[order(influential$x),]
l = length(ordered_influential$Row)
X = matrix(rep(1,l))
X = cbind(X, ordered_influential$x)
H = X %*% solve(t(X) %*%X) %*% t(X)
plot(X[,2],rep(-0.5,l),ylim = c(-0.5,1))
lines(X[,2], diag(H))
Hii_cutoff = 3 * (sum(diag(H))) / l   # 3(k+1)/n, since trace(H) = k + 1
sprintf("Since H[21,21] = %f > Hii_cutoff = %f, we conclude that row #21 is a high-leverage point", H[21,21], Hii_cutoff)
## [1] "Since H[21,21] = 0.311532 > Hii_cutoff = 0.285714, we conclude that row #21 is a high-leverage point"
High-Leverage
The chart below shows an example of a high-leverage point that is not an outlier: row #21 is extreme in x but falls close to the line through the other points, so the fitted coefficients barely change.
Input = ("
Row x y
1 0.1 -0.0716
2 0.45401 4.1673
3 1.09765 6.5703
4 1.27936 13.815
5 2.20611 11.4501
6 2.50064 12.9554
7 3.0403 20.1575
8 3.23583 17.5633
9 4.45308 26.0317
10 4.1699 22.7573
11 5.28474 26.303
12 5.59238 30.6885
13 5.92091 33.9402
14 6.66066 30.9228
15 6.79953 34.11
16 7.97943 44.4536
17 8.41536 46.5022
18 8.71607 50.0568
19 8.70156 46.5475
20 9.16463 45.7762
21 14 68
")
influential = read.table(textConnection(Input), header = T)
plot(x=influential$x,y=influential$y)
influential_lm_model = lm(y~x, data = influential)
influential_remove_row_21__lm_model = lm(y~x, data = influential[-c(21),])
tab_model(influential_lm_model,influential_remove_row_21__lm_model,show.se = T)
Left columns: fit on all 21 rows; right columns: fit with row #21 removed.

| Predictors | Estimates | std. Error | CI | p | Estimates | std. Error | CI | p |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 2.47 | 1.08 | 0.36 – 4.58 | 0.033 | 1.73 | 1.12 | -0.46 – 3.93 | 0.140 |
| x | 4.93 | 0.17 | 4.59 – 5.26 | <0.001 | 5.12 | 0.20 | 4.72 – 5.51 | <0.001 |
| Observations | 21 | | | | 20 | | | |
| R² / adjusted R² | 0.977 / 0.976 | | | | 0.973 / 0.972 | | | |
ordered_influential = influential[order(influential$x),]
l = length(ordered_influential$Row)
X = matrix(rep(1,l))
X = cbind(X, ordered_influential$x)
H = X %*% solve(t(X) %*%X) %*% t(X)
plot(X[,2],rep(-0.5,l),ylim = c(-0.5,1))
lines(X[,2], diag(H))
Hii_cutoff = 3 * (sum(diag(H))) / l   # 3(k+1)/n, since trace(H) = k + 1
sprintf("Since H[21,21] = %f > Hii_cutoff = %f, we conclude that row #21 is a high-leverage point", H[21,21], Hii_cutoff)
## [1] "Since H[21,21] = 0.357535 > Hii_cutoff = 0.285714, we conclude that row #21 is a high-leverage point"
Input = ("
Row x y
1 0.1 -0.0716
2 0.45401 4.1673
3 1.09765 6.5703
4 1.27936 13.815
5 2.20611 11.4501
6 2.50064 12.9554
7 3.0403 20.1575
8 3.23583 17.5633
9 4.45308 26.0317
10 4.1699 22.7573
11 5.28474 26.303
12 5.59238 30.6885
13 5.92091 33.9402
14 6.66066 30.9228
15 6.79953 34.11
16 7.97943 44.4536
17 8.41536 46.5022
18 8.71607 50.0568
19 8.70156 46.5475
20 9.16463 45.7762
21 4 40
")
influential = read.table(textConnection(Input), header = T)
l = length(influential$Row)
X = matrix(rep(1, l))                                   # intercept column
X = cbind(X, influential$x)                             # design matrix (data left unsorted, so row #21 keeps index 21)
H_orig = X %*% solve(t(X) %*% X) %*% t(X)               # hat matrix
residuals_ = (influential$y - H_orig %*% influential$y)
residuals_var = sum(residuals_^2) / 19                  # MSE on n - k - 1 = 19 df
standardized_residuals_ = residuals_ / sqrt(residuals_var * (1 - diag(H_orig)))
studentized_residuals_ = standardized_residuals_ * sqrt((21-1-2) / (21-1-1 - standardized_residuals_^2))  # convert internal to external
pt(studentized_residuals_[21,1], 18, lower.tail = F)    # one-sided p-value on n - k - 2 = 18 df
## [1] 1.414868e-06
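The conversion on the second-to-last line above uses the standard relation between the internally studentized residual \(r_i\) and the externally studentized (deleted) residual \(t_i\); with n = 21 and k = 1 the constants reduce to 18 and 19:
\[t_i = r_i \sqrt{\frac{n-k-2}{n-k-1-r_i^2}}\]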
The externally studentized residuals can also be obtained directly with rstudent(), as shown below.
rstudent(lm(y~x,data=influential))
## 1 2 3 4 5 6
## -0.81916651 -0.24290537 -0.42596177 0.99808661 -0.57149911 -0.56405976
## 7 8 9 10 11 12
## 0.40458192 -0.36264263 0.13610958 -0.25597698 -0.70363292 -0.09336169
## 13 14 15 16 17 18
## 0.24640779 -1.24719479 -0.67326065 0.28548280 0.25561479 0.72218968
## 19 20 21
## -0.05413613 -0.76838197 6.69012861
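Since we scanned all 21 residuals and are testing only the most extreme one, the p-value above should be Bonferroni-adjusted. A minimal sketch of that adjustment (the car package's outlierTest() performs the same calculation, if that package is available):
t_max = max(abs(rstudent(lm(y ~ x, data = influential))))   # most extreme studentized residual (6.69 here)
min(1, 21 * 2 * pt(t_max, df = 18, lower.tail = FALSE))     # Bonferroni over n = 21 tests, two-sided; still well below 0.05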
DFFITS
\[DFFITS_{i} = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}h_{ii}}}\]
Cut-off point: \[2 \sqrt{ \frac{k+2}{n-k-2}}\]
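For this example, with k = 1 predictor and n = 21 observations, the cutoff works out to \(2\sqrt{\frac{1+2}{21-1-2}} = 2\sqrt{\frac{3}{18}} \approx 0.8165\), which matches the cutoff_value printed below.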
Input = ("
Row x y
1 0.1 -0.0716
2 0.45401 4.1673
3 1.09765 6.5703
4 1.27936 13.815
5 2.20611 11.4501
6 2.50064 12.9554
7 3.0403 20.1575
8 3.23583 17.5633
9 4.45308 26.0317
10 4.1699 22.7573
11 5.28474 26.303
12 5.59238 30.6885
13 5.92091 33.9402
14 6.66066 30.9228
15 6.79953 34.11
16 7.97943 44.4536
17 8.41536 46.5022
18 8.71607 50.0568
19 8.70156 46.5475
20 9.16463 45.7762
21 4 40
")
k = 1 # number of predictors
n = dim(influential)[1] # number of observations
p = 2 # number of estimated coefficients (intercept + slope)
cutoff_value = 2 * sqrt((k + 2) / (n - k - 2))
my_diffts = rep(0, n)
for (i in 1:n) {
  row_removal = i
  influential = read.table(textConnection(Input), header = T)
  X = matrix(rep(1, n))
  X = cbind(X, influential$x)
  H = X %*% solve(t(X) %*% X) %*% t(X)
  residuals_ = (influential$y - H %*% influential$y)
  influential_del = influential[-c(row_removal),]
  X_del = matrix(rep(1, n - 1)) # n - 1 because we remove one observation at a time
  X_del = cbind(X_del, influential_del$x)
  H_del = X_del %*% solve(t(X_del) %*% X_del) %*% t(X_del)
  y_i = H %*% influential$y # fitted y from the full model, same as model$fitted.values
  y_i_del = H_del %*% influential_del$y # fitted y from the reduced model with row i removed
  MSE_del = sum((influential_del$y - y_i_del)^2) / (n - 1 - p) # reduced-model MSE, used in the DFFITS formula below
  beta = solve(t(X_del) %*% X_del) %*% t(X_del) %*% influential_del$y # coefficients of the reduced model
  New_X = matrix(c(1, influential$x[row_removal]))
  New_Y = t(New_X) %*% beta # predict the removed row using the reduced-model coefficients
  y_i_del = append(y_i_del, New_Y, after = row_removal - 1) # splice the prediction back in at position i
  dfitts_for_a_row_removal = (y_i - y_i_del) / sqrt(MSE_del * diag(H))
  my_diffts[row_removal] = dfitts_for_a_row_removal[row_removal]
}
cutoff_value
## [1] 0.8164966
my_diffts[which(my_diffts > cutoff_value)]
## [1] 1.5505
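Base R also computes DFFITS directly. As a sketch, the built-in values should agree with the manual loop above (about 1.55 for row #21, the only value past the cutoff):
dffits_builtin = dffits(lm(y ~ x, data = influential))
dffits_builtin[abs(dffits_builtin) > cutoff_value]   # should again flag row #21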
Input = ("
Row x y
1 0.1 -0.0716
2 0.45401 4.1673
3 1.09765 6.5703
4 1.27936 13.815
5 2.20611 11.4501
6 2.50064 12.9554
7 3.0403 20.1575
8 3.23583 17.5633
9 4.45308 26.0317
10 4.1699 22.7573
11 5.28474 26.303
12 5.59238 30.6885
13 5.92091 33.9402
14 6.66066 30.9228
15 6.79953 34.11
16 7.97943 44.4536
17 8.41536 46.5022
18 8.71607 50.0568
19 8.70156 46.5475
20 9.16463 45.7762
21 4 40
")
k = 1 # number of predictors
n = dim(influential)[1] # number of observations
p = 2 # number of estimated coefficients (intercept + slope)
cutoff_value = 2 * sqrt((k + 2) / (n - k - 2))
my_diffts = rep(0, n)
my_dcooks = rep(0, n)
for (i in 1:n) {
  row_removal = i
  influential = read.table(textConnection(Input), header = T)
  X = matrix(rep(1, n))
  X = cbind(X, influential$x)
  H = X %*% solve(t(X) %*% X) %*% t(X)
  residuals_ = (influential$y - H %*% influential$y)
  MSE = (t(residuals_) %*% residuals_) / (n - k - 1) # full-model MSE, used in Cook's distance below
  influential_del = influential[-c(row_removal),]
  X_del = matrix(rep(1, n - 1)) # n - 1 because we remove one observation at a time
  X_del = cbind(X_del, influential_del$x)
  H_del = X_del %*% solve(t(X_del) %*% X_del) %*% t(X_del)
  y_i = H %*% influential$y # fitted y from the full model
  y_i_del = H_del %*% influential_del$y # fitted y from the reduced model with row i removed
  MSE_del = sum((influential_del$y - y_i_del)^2) / (n - 1 - p) # reduced-model MSE for DFFITS
  beta = solve(t(X_del) %*% X_del) %*% t(X_del) %*% influential_del$y # coefficients of the reduced model
  New_X = matrix(c(1, influential$x[row_removal]))
  New_Y = t(New_X) %*% beta # predict the removed row using the reduced-model coefficients
  y_i_del = append(y_i_del, New_Y, after = row_removal - 1) # splice the prediction back in at position i
  dfitts_for_a_row_removal = (y_i - y_i_del) / sqrt(MSE_del * diag(H))
  my_diffts[row_removal] = dfitts_for_a_row_removal[row_removal]
  my_dcooks[row_removal] = (sum((y_i - y_i_del)^2) / (MSE * (k + 1))) # Cook's distance for row i
}
cutoff_value
## [1] 0.8164966
my_diffts[which(my_diffts > cutoff_value)]
## [1] 1.5505
Cook's distance is nonnegative; a common rule of thumb is that any value greater than 1 deserves further investigation.
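Base R again provides this directly; a sketch that should reproduce the my_dcooks values accumulated in the loop above:
cd = cooks.distance(lm(y ~ x, data = influential))
cd[cd > 1]                                       # observations whose Cook's distance exceeds 1
plot(lm(y ~ x, data = influential), which = 4)   # built-in Cook's distance plot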
References:
https://newonlinecourses.science.psu.edu/stat462/node/173/
https://en.wikipedia.org/wiki/Cook's_distance#cite_note-6
https://web.stanford.edu/class/stats191/notebooks/