library(sjPlot)
library(sjmisc)
library(sjlabelled)

To address the different types of measures, we will work through leverages, residuals, standardized residuals, deleted residuals (PRESS prediction errors), studentized residuals, DFFITS, and Cook's distances. First, the key definitions:

Regression Model: \(Y = X\beta + \epsilon\)

Predicted Response: \(\hat{y} = Xb\)

Estimated coefficients: \(b = (X^{T}X)^{-1}X^{T}y\)

Predicted Response above can be re-written as: \(\hat{y} = X(X^{T}X)^{-1}X^{T}y\)

Hat matrix, used to identify leverage points: \(H = X(X^{T}X)^{-1}X^{T}\)

The predicted response above can also be re-written as \(\hat{y} = Hy\)

Residual: \(e_i = y_i - \hat{y}_i\)

Standardized residual for each observation: \(r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE(1-h_{ii})}}\)

DFFITS: \(DFFITS_{i} = \frac{\hat{y_i} - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}h_{ii}}}\)

Cook's Distance: \(D_i = \frac{\sum_{j=1}^{n}(\hat{y}_j - \hat{y}_{j(i)})^2}{(k+1)MSE} = \frac{e_i^2}{(k+1)MSE}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right]\)
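
As a quick sanity check (not from the original notes), the sketch below verifies the matrix formulas above against R's built-in lm() on a small simulated data set; the data frame df and the fit are assumptions made only for this example.

set.seed(1)
df = data.frame(x = 1:10)
df$y = 2 + 3 * df$x + rnorm(10)
fit = lm(y ~ x, data = df)

X = model.matrix(fit)                       # design matrix: intercept column plus x
b = solve(t(X) %*% X) %*% t(X) %*% df$y     # b = (X'X)^{-1} X' y
H = X %*% solve(t(X) %*% X) %*% t(X)        # hat matrix H = X (X'X)^{-1} X'

all.equal(as.vector(b), unname(coef(fit)))              # estimated coefficients agree
all.equal(as.vector(H %*% df$y), unname(fitted(fit)))   # y-hat = H y agrees with fitted(fit)
all.equal(as.vector(diag(H)), unname(hatvalues(fit)))   # leverages h_ii agree with hatvalues(fit)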


Properties of leverages: each leverage satisfies \(0 \le h_{ii} \le 1\), and they sum to the number of estimated coefficients, \(\sum_{i=1}^{n} h_{ii} = k+1\).


An x observation is flagged as high-leverage if \(h_{ii} > 3 ( \frac{k+1}{n})\), i.e. more than three times the average leverage.
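
For instance, a minimal sketch of this rule using base R's hatvalues() (on the built-in cars data set, chosen only for illustration) might look like:

fit = lm(dist ~ speed, data = cars)       # example fit on R's built-in cars data
k   = length(coef(fit)) - 1               # number of predictors
n   = nrow(cars)                          # number of observations
which(hatvalues(fit) > 3 * (k + 1) / n)   # rows flagged as high-leverage by the 3(k+1)/n rule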

1 Example of leverages

An outlier is a y observation that is extreme with respect to the other y observations (not in its x values). An influential point is one whose predictor value(s) x materially affect part of the regression results (coefficients, hypothesis tests, etc.).



Outlier

The chart below shows an example of an outlier. The \(\beta_1\) standard error is larger when row #21 is included, which in turn widens the confidence intervals.

Input = ("
Row x   y
1   0.1 -0.0716
2   0.45401 4.1673
3   1.09765 6.5703
4   1.27936 13.815
5   2.20611 11.4501
6   2.50064 12.9554
7   3.0403  20.1575
8   3.23583 17.5633
9   4.45308 26.0317
10  4.1699  22.7573
11  5.28474 26.303
12  5.59238 30.6885
13  5.92091 33.9402
14  6.66066 30.9228
15  6.79953 34.11
16  7.97943 44.4536
17  8.41536 46.5022
18  8.71607 50.0568
19  8.70156 46.5475
20  9.16463 45.7762
21  4   40
")
influential = read.table(textConnection(Input), header = T)
plot(x=influential$x,y=influential$y)

influential_lm_model = lm(y~x, data = influential)
influential_remove_row_21__lm_model = lm(y~x, data = influential[-c(21),])
tab_model(influential_lm_model,influential_remove_row_21__lm_model,show.se = T,show.re.var = T)
Dependent variable: y
                  Model 1 (all 21 rows)                      Model 2 (row #21 removed)
Predictors        Estimates  std. Error  CI            p         Estimates  std. Error  CI            p
(Intercept)       2.96       2.01        -0.98 – 6.90  0.157     1.73       1.12        -0.46 – 3.93  0.140
x                 5.04       0.36        4.33 – 5.75   <0.001    5.12       0.20        4.72 – 5.51   <0.001
Observations      21                                             20
R2 / adjusted R2  0.910 / 0.905                                   0.973 / 0.972
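
If sjPlot is not installed, the same estimates and standard errors can be read from summary() as an optional cross-check:

summary(influential_lm_model)$coefficients                  # full data set, including row #21
summary(influential_remove_row_21__lm_model)$coefficients   # row #21 removed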
ordered_influential = influential[order(influential$x),]


l = length(ordered_influential$Row)
X = matrix(rep(1,l))
X  = cbind(X, ordered_influential$x)
H = X %*% solve(t(X) %*%X) %*% t(X)

plot(X[,2],rep(-0.5,l),ylim = c(-0.5,1))
lines(X[,2], diag(H))

sprintf("Since the Hii are relatively low, we can conclude that row#21 has no high-leverage")
## [1] "Since the Hii are relatively low, we can conclude that row#21 has no high-leverage"
# standardized residuals, computed from the equations at the beginning of this chapter
residuals_ = (ordered_influential$y - H %*% ordered_influential$y)
residuals_var = sum( residuals_^2 )/19   # MSE = SSE / (n - k - 1) = SSE / 19
standardized_residuals =  residuals_/sqrt(residuals_var*(1-diag(H)))

sprintf("Since the std_res for row#21(original data set) = %f > 3, we can conclude that this is an outlier", standardized_residuals[9,1] )
## [1] "Since the std_res for row#21(original data set) = 3.681098 > 3, we can conclude that this is an outlier"


Why should we watch for outliers? The mean square error (MSE) is substantially inflated from 6.72 to 22.19 by the presence of the outlier. Since MSE appears in all of our confidence and prediction interval formulas, this inflation widens every one of those intervals.
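
One hedged way to reproduce that comparison is to square the residual standard error reported by summary(), since summary(fit)$sigma^2 is the MSE:

summary(influential_lm_model)$sigma^2                 # MSE with the outlier (about 22.19 per the text above)
summary(influential_remove_row_21__lm_model)$sigma^2  # MSE without it (about 6.72 per the text above)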



High-Leverage and Outlier

The chart below shows an example of an influential point: row #21 is both an outlier and a high-leverage point.

Input = ("
Row x   y
1   0.1 -0.0716
2   0.45401 4.1673
3   1.09765 6.5703
4   1.27936 13.815
5   2.20611 11.4501
6   2.50064 12.9554
7   3.0403  20.1575
8   3.23583 17.5633
9   4.45308 26.0317
10  4.1699  22.7573
11  5.28474 26.303
12  5.59238 30.6885
13  5.92091 33.9402
14  6.66066 30.9228
15  6.79953 34.11
16  7.97943 44.4536
17  8.41536 46.5022
18  8.71607 50.0568
19  8.70156 46.5475
20  9.16463 45.7762
21  13  15
         ")


influential = read.table(textConnection(Input), header = T)
plot(x=influential$x,y=influential$y)

influential_lm_model = lm(y~x, data = influential)
influential_remove_row_21__lm_model = lm(y~x, data = influential[-c(21),])
tab_model(influential_lm_model,influential_remove_row_21__lm_model,show.se = T,show.re.var = T)
Dependent variable: y
                  Model 1 (all 21 rows)                      Model 2 (row #21 removed)
Predictors        Estimates  std. Error  CI            p         Estimates  std. Error  CI            p
(Intercept)       8.50       4.22        0.23 – 16.78  0.058     1.73       1.12        -0.46 – 3.93  0.140
x                 3.32       0.69        1.97 – 4.66   <0.001    5.12       0.20        4.72 – 5.51   <0.001
Observations      21                                             20
R2 / adjusted R2  0.552 / 0.528                                   0.973 / 0.972
ordered_influential = influential[order(influential$x),]


l = length(ordered_influential$Row)
X = matrix(rep(1,l))
X  = cbind(X, ordered_influential$x)
H = X %*% solve(t(X) %*%X) %*% t(X)

plot(X[,2],rep(-0.5,l),ylim = c(-0.5,1))
lines(X[,2], diag(H))

Hii_cutoff = 3*(sum(diag(H)))/l

sprintf("Since the Hii[21,21] = %f > Hii_cutoff = %f, we can conclude that row#21 has a high-leverage", H[21,21], Hii_cutoff)
## [1] "Since the Hii[21,21] = 0.311532 > Hii_cutoff = 0.285714, we can conclude that row#21 has a high-leverage"



High-Leverage

The chart below shows an example of a high-leverage point that is not an outlier: row #21 has an extreme x value but follows the overall trend, so the fitted line changes very little (compare the two models below).

Input = ("
Row x   y
1   0.1 -0.0716
2   0.45401 4.1673
3   1.09765 6.5703
4   1.27936 13.815
5   2.20611 11.4501
6   2.50064 12.9554
7   3.0403  20.1575
8   3.23583 17.5633
9   4.45308 26.0317
10  4.1699  22.7573
11  5.28474 26.303
12  5.59238 30.6885
13  5.92091 33.9402
14  6.66066 30.9228
15  6.79953 34.11
16  7.97943 44.4536
17  8.41536 46.5022
18  8.71607 50.0568
19  8.70156 46.5475
20  9.16463 45.7762
21  14  68
         ")

influential = read.table(textConnection(Input), header = T)
plot(x=influential$x,y=influential$y)

influential_lm_model = lm(y~x, data = influential)
influential_remove_row_21__lm_model = lm(y~x, data = influential[-c(21),])
tab_model(influential_lm_model,influential_remove_row_21__lm_model,show.se = T)
Dependent variable: y
                  Model 1 (all 21 rows)                      Model 2 (row #21 removed)
Predictors        Estimates  std. Error  CI            p         Estimates  std. Error  CI            p
(Intercept)       2.47       1.08        0.36 – 4.58   0.033     1.73       1.12        -0.46 – 3.93  0.140
x                 4.93       0.17        4.59 – 5.26   <0.001    5.12       0.20        4.72 – 5.51   <0.001
Observations      21                                             20
R2 / adjusted R2  0.977 / 0.976                                   0.973 / 0.972
ordered_influential = influential[order(influential$x),]


l = length(ordered_influential$Row)
X = matrix(rep(1,l))
X  = cbind(X, ordered_influential$x)
H = X %*% solve(t(X) %*%X) %*% t(X)

plot(X[,2],rep(-0.5,l),ylim = c(-0.5,1))
lines(X[,2], diag(H))

Hii_cutoff = 3*(sum(diag(H)))/l

sprintf("Since the Hii[21,21] = %f > Hii_cutoff = %f, we can conclude that row#21 has a high-leverage", H[21,21], Hii_cutoff)
## [1] "Since the Hii[21,21] = 0.357535 > Hii_cutoff = 0.285714, we can conclude that row#21 has a high-leverage"




2 Residuals

3 Standardized residuals

4 Deleted residuals (or PRESS prediction errors)

5 Studentized residuals

Input = ("
Row x   y
1   0.1 -0.0716
2   0.45401 4.1673
3   1.09765 6.5703
4   1.27936 13.815
5   2.20611 11.4501
6   2.50064 12.9554
7   3.0403  20.1575
8   3.23583 17.5633
9   4.45308 26.0317
10  4.1699  22.7573
11  5.28474 26.303
12  5.59238 30.6885
13  5.92091 33.9402
14  6.66066 30.9228
15  6.79953 34.11
16  7.97943 44.4536
17  8.41536 46.5022
18  8.71607 50.0568
19  8.70156 46.5475
20  9.16463 45.7762
21  4   40
")
influential = read.table(textConnection(Input), header = T)

l = length(influential$Row)
X = matrix(rep(1,l))
X  = cbind(X, influential$x)
H_orig = X %*% solve(t(X) %*%X) %*% t(X)

residuals_ = (influential$y - H_orig %*% influential$y)
residuals_var = sum( residuals_^2 )/19   # MSE = SSE / (n - k - 1) = SSE / 19
standardized_residuals_ =  residuals_/sqrt(residuals_var*(1-diag(H_orig)))
# externally studentized residuals: t_i = r_i * sqrt((n - k - 2) / (n - k - 1 - r_i^2)), with n = 21 and k = 1
studentized_residuals_ = standardized_residuals_* sqrt((21-1-2)/(21-1-1- standardized_residuals_^2))

# upper-tail p-value for row #21 on n - k - 2 = 18 degrees of freedom
pt(studentized_residuals_[21,1],18,lower.tail = F)
## [1] 1.414868e-06

Another way to obtain the externally studentized residuals is shown below:

rstudent(lm(y~x,data=influential))
##           1           2           3           4           5           6 
## -0.81916651 -0.24290537 -0.42596177  0.99808661 -0.57149911 -0.56405976 
##           7           8           9          10          11          12 
##  0.40458192 -0.36264263  0.13610958 -0.25597698 -0.70363292 -0.09336169 
##          13          14          15          16          17          18 
##  0.24640779 -1.24719479 -0.67326065  0.28548280  0.25561479  0.72218968 
##          19          20          21 
## -0.05413613 -0.76838197  6.69012861
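
Item 4 above, the deleted residuals (PRESS prediction errors), is not computed explicitly in this chapter's code; the sketch below is a hedged illustration based on the standard identity \(d_i = \frac{e_i}{1-h_{ii}}\):

fit = lm(y ~ x, data = influential)
d_i = residuals(fit) / (1 - hatvalues(fit))   # deleted (PRESS) residuals d_i = e_i / (1 - h_ii)
sum(d_i^2)                                    # PRESS statistic
d_i[21]                                       # deleted residual for row #21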

6 difference in fits (DFFITS)

\[DFFITS_{i} = \frac{\hat{y_i} - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}h_{ii}}}\]

Cut-off point \[2 \sqrt{ \frac{k+2}{n-k-2}}\]

Input = ("
Row x   y
1   0.1 -0.0716
2   0.45401 4.1673
3   1.09765 6.5703
4   1.27936 13.815
5   2.20611 11.4501
6   2.50064 12.9554
7   3.0403  20.1575
8   3.23583 17.5633
9   4.45308 26.0317
10  4.1699  22.7573
11  5.28474 26.303
12  5.59238 30.6885
13  5.92091 33.9402
14  6.66066 30.9228
15  6.79953 34.11
16  7.97943 44.4536
17  8.41536 46.5022
18  8.71607 50.0568
19  8.70156 46.5475
20  9.16463 45.7762
21  4   40
")

k = 1 #number of predictors
n = dim(influential)[1] #number of observations
p = 2 #number of estimated coefficients (intercept and slope), i.e. k + 1

cutoff_value =2*sqrt( (k+2)/(n-k-2))


my_diffts = rep(0,n)

for (i in 1:n) {

  
row_removal = i

influential = read.table(textConnection(Input), header = T)
X = matrix(rep(1,n))
X  = cbind(X, influential$x)
H = X %*% solve(t(X) %*%X) %*% t(X)
residuals_ = (influential$y - H %*% influential$y)


influential_del = influential[-c(row_removal),]
X_del = matrix(rep(1,n-1)) # n - 1 because we remove one observation at a time
X_del  = cbind(X_del, influential_del$x)
H_del = X_del %*% solve(t(X_del) %*%X_del) %*% t(X_del)
residuals_del_ = (influential_del$y - H_del %*% influential_del$y)




y_i = H%*%influential$y  #fitted y with full model, same as model$fitted.values

y_i_del = H_del%*%influential_del$y #fitted y with the reduced model (row i removed), same as the reduced model's fitted.values
MSE_del = sum ( (influential_del$y - y_i_del)^2)/(n-1-p)  #MSE of the reduced model, used in the DFFITS equation below (n - 1 because one row was removed)
#predicted_y_i_del = X_del %*%beta # - another way to compute y hats


beta = solve(t(X_del)%*%X_del)%*%t(X_del)%*%influential_del$y  #beta coefficient estimates for the reduced model
New_X = matrix(c(1,influential$x[row_removal]))
New_Y = t(New_X)%*%beta #predict the removed row from its x value using the reduced-model betas
#y_i_del = c(y_i_del, New_Y) #alternative: append the predicted value at the end rather than in place
y_i_del = append(y_i_del, New_Y, after = row_removal-1) #insert the prediction back at the removed row's position

dfitts_for_a_row_removal = ( y_i - y_i_del)/sqrt(MSE_del*diag(H))

my_diffts[row_removal] = dfitts_for_a_row_removal[row_removal]

#sprintf("%f",dfitts_[row_removal])

#dfitts_[row_removal]

}

cutoff_value
## [1] 0.8164966
my_diffts[which(my_diffts > cutoff_value)]
## [1] 1.5505
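
As an optional cross-check on the loop above, the built-in dffits() should reproduce these values (absolute values are compared against the cutoff here) and flag the same observation:

dffits_builtin = dffits(lm(y ~ x, data = influential))
which(abs(dffits_builtin) > cutoff_value)   # should flag row #21, matching the manual result above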

7 Cook’s distances

Input = ("
Row x   y
1   0.1 -0.0716
2   0.45401 4.1673
3   1.09765 6.5703
4   1.27936 13.815
5   2.20611 11.4501
6   2.50064 12.9554
7   3.0403  20.1575
8   3.23583 17.5633
9   4.45308 26.0317
10  4.1699  22.7573
11  5.28474 26.303
12  5.59238 30.6885
13  5.92091 33.9402
14  6.66066 30.9228
15  6.79953 34.11
16  7.97943 44.4536
17  8.41536 46.5022
18  8.71607 50.0568
19  8.70156 46.5475
20  9.16463 45.7762
21  4   40
")

k = 1 #number of predictors
n = dim(influential)[1] #number of observations
p = 2 #number of estimated coefficients (intercept and slope), i.e. k + 1

cutoff_value =2*sqrt( (k+2)/(n-k-2))


my_diffts = rep(0,n)
my_dcooks = rep(0,n)

for (i in 1:n) {

  
row_removal = i

influential = read.table(textConnection(Input), header = T)
X = matrix(rep(1,n))
X  = cbind(X, influential$x)
H = X %*% solve(t(X) %*%X) %*% t(X)
residuals_ = (influential$y - H %*% influential$y)
MSE = (t(residuals_)%*%residuals_)/(n-k-1)


influential_del = influential[-c(row_removal),]
X_del = matrix(rep(1,n-1)) # n - 1 because we remove one observation at a time
X_del  = cbind(X_del, influential_del$x)
H_del = X_del %*% solve(t(X_del) %*%X_del) %*% t(X_del)
residuals_del_ = (influential_del$y - H_del %*% influential_del$y)




y_i = H%*%influential$y  #fitted y with full model, same as model$fitted.values

y_i_del = H_del%*%influential_del$y #fitted y with the reduced model (row i removed), same as the reduced model's fitted.values
MSE_del = sum ( (influential_del$y - y_i_del)^2)/(n-1-p)  #MSE of the reduced model (n - 1 because one row was removed)
#predicted_y_i_del = X_del %*%beta # - another way to compute y hats


beta = solve(t(X_del)%*%X_del)%*%t(X_del)%*%influential_del$y  #beta coefficient estimates for the reduced model
New_X = matrix(c(1,influential$x[row_removal]))
New_Y = t(New_X)%*%beta #predict the removed row from its x value using the reduced-model betas
#y_i_del = c(y_i_del, New_Y) #alternative: append the predicted value at the end rather than in place
y_i_del = append(y_i_del, New_Y, after = row_removal-1) #insert the prediction back at the removed row's position

dfitts_for_a_row_removal = ( y_i - y_i_del)/sqrt(MSE_del*diag(H))

dcooks = (residuals_^2 / ((k+1)*MSE[1,1])) * (diag(H)/(1-diag(H))^2) #per-observation Cook's distance via the leverage identity (computed as a check; not used below)

my_diffts[row_removal] = dfitts_for_a_row_removal[row_removal]
my_dcooks[row_removal] = (sum( (y_i - y_i_del)^2)/ (MSE*(k+1))) #Cook's distance D_i = sum_j (yhat_j - yhat_j(i))^2 / ((k+1)*MSE)
#sum( (m1$fitted.values - y_i_del)^2)/ (22.19*2)
#sprintf("%f",dfitts_[row_removal])

#dfitts_[row_removal]

}

cutoff_value
## [1] 0.8164966
my_diffts[which(my_diffts > cutoff_value)]
## [1] 1.5505

Cook's distance is non-negative; a common rule of thumb is that any value greater than 1 deserves further investigation.
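
As an optional check against the manual loop, the built-in cooks.distance() computes the same quantity, so any observation exceeding the rule of thumb can be flagged directly:

cd = cooks.distance(lm(y ~ x, data = influential))
cd[21]            # Cook's distance for row #21 of the outlier data set
which(cd > 1)     # observations above the rule-of-thumb cutoff (there may be none here)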

References:
https://newonlinecourses.science.psu.edu/stat462/node/173/
https://en.wikipedia.org/wiki/Cook's_distance#cite_note-6
https://web.stanford.edu/class/stats191/notebooks/