Final Project - Simple Linear Regression

Own data from the shared file

Author

Fernando G. Loayza Pizarro

Published

December 21, 2024

1 Fitting and Inference of the Regression Model

1.1 Exercise 1

We begin with the scatter plots of the variables.


Interpretation: The scatter of x1, x2, and x3 against y could follow a linear pattern with negative slope. The variables x4, x5, and x6 show no clearly defined pattern.

We compute the correlations.
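The correlations can also be checked numerically before the graphical summary; a minimal sketch, assuming dat contains y and the predictors x1 to x6:

round(cor(dat), 2)   # numeric correlation matrix, rounded for readability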

# df <- dat %>% select(x1,x6, y)

# plotting the variables
# plot(df$x1, df$y, pch=19, col=2,las=1)

library("PerformanceAnalytics")
px = ncol(dat) - 5
chart.Correlation(dat[, 1:px], histogram=TRUE, pch=19)

Interpretation: Consistent with the comments above, now backed by numbers: the first three variables have a strong inverse correlation with y, while the remaining variables show low to moderate correlations.

# the matrix X, known as the model (design) matrix
X=model.matrix(~x1+x6, data=dat); X
   (Intercept)    x1 x6
1            1 350.0  4
2            1 350.0  4
3            1 250.0  1
4            1 351.0  2
5            1 225.0  1
6            1 440.0  4
7            1 231.0  2
8            1 262.0  2
9            1  89.7  2
10           1  96.9  2
11           1 350.0  4
12           1  85.3  2
13           1 171.0  2
14           1 258.0  1
15           1 140.0  2
16           1 302.0  2
17           1 500.0  4
18           1 440.0  4
19           1 350.0  4
20           1 318.0  2
21           1 231.0  2
22           1 360.0  2
23           1 400.0  4
24           1  96.9  2
25           1 140.0  2
26           1 460.0  4
27           1 133.6  2
28           1 318.0  2
29           1 351.0  2
30           1 351.0  2
31           1 360.0  4
32           1 360.0  4
attr(,"assign")
[1] 0 1 2
# response vector
y=dat$y;y
 [1] 18.90 17.00 20.00 18.25 20.07 11.20 22.12 21.47 34.70 30.40 16.50 36.50
[13] 21.50 19.70 20.30 17.80 14.39 14.89 17.80 16.41 23.54 21.47 16.59 31.90
[25] 29.40 13.27 23.90 19.73 13.90 13.27 13.77 16.50

The least-squares estimator is given by \[ \hat{\beta} = (X^{\top}X)^{-1}X^\top y \]

betas = (solve(t(X)%*%X))%*%t(X)%*%y; betas
                   [,1]
(Intercept) 32.88455083
x1          -0.05314767
x6           0.95922305

The fitted regression model is \[ \hat{y} = 32.8846 - 0.0531\, x_1 + 0.9592\, x_6 \]


lm1 = lm(y ~ x1 + x6, data = dat); lm1

Call:
lm(formula = y ~ x1 + x6, data = dat)

Coefficients:
(Intercept)           x1           x6  
   32.88455     -0.05315      0.95922  
anova(lm(y ~ 1, data = dat), lm1)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ x1 + x6
  Res.Df     RSS Df Sum of Sq      F   Pr(>F)    
1     31 1237.54                                 
2     29  263.23  2    974.31 53.669 1.79e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: Given the p-value (1.79e-10), we have strong evidence at the 0.001 significance level to reject the null hypothesis: at least one of the predictor variables contributes significantly to predicting y.
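As a check, the global F statistic can be recomputed by hand from the RSS values in the ANOVA table above:

# F = ((RSS_null - RSS_full)/df) / (RSS_full/res.df)
((1237.54 - 263.23)/2) / (263.23/29)
# about 53.67, matching the reported F statistic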

library(stargazer)
stargazer(lm1, type="text", title="Modelo Lineal")

Modelo Lineal
===============================================
                        Dependent variable:    
                    ---------------------------
                                 y             
-----------------------------------------------
x1                           -0.053***         
                              (0.006)          
                                               
x6                             0.959           
                              (0.670)          
                                               
Constant                     32.885***         
                              (1.535)          
                                               
-----------------------------------------------
Observations                    32             
R2                             0.787           
Adjusted R2                    0.773           
Residual Std. Error       3.013 (df = 29)      
F Statistic           53.669*** (df = 2; 29)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Interpretation: Comparing the two values, adjusted R² applies only a small penalty (both values are close), indicating that the regressors used carry most of the predictive value.
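The size of the penalty follows from the definition of adjusted R²; a quick check with n = 32 observations and p = 2 regressors (the names n, p, R2 are illustrative):

n = 32; p = 2; R2 = 0.787
1 - (1 - R2)*(n - 1)/(n - p - 1)
# about 0.772, in line with the reported 0.773 (difference due to rounding of R2)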

confint(lm1)
                  2.5 %      97.5 %
(Intercept) 29.74428901 36.02481266
x1          -0.06569892 -0.04059641
x6          -0.41164739  2.33009349

Interpretation: the 95% interval for x6 contains zero, so x6 is not individually significant at the 5% level; this anticipates the partial F-test below.

# reduced models
lmredu1 = lm(y ~ x1, data=dat)
lmredu2= lm(y ~ x6, data=dat)
anova(lmredu1, lm1)
Analysis of Variance Table

Model 1: y ~ x1
Model 2: y ~ x1 + x6
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1     30 281.82                          
2     29 263.24  1     18.59 2.048 0.1631

Interpretation: At the 0.05 level, the partial F-test does not reject the null hypothesis (p = 0.1631), so the regressor x6 has no significant effect on the response y once x1 is in the model.
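The same partial F value can be recovered by hand from the two residual sums of squares:

Fpart = ((281.82 - 263.24)/1) / (263.24/29); Fpart
pf(Fpart, 1, 29, lower.tail = FALSE)   # p-value, about 0.163, matching the table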

anova(lmredu2, lm1)
Analysis of Variance Table

Model 1: y ~ x6
Model 2: y ~ x1 + x6
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
1     30 944.04                                 
2     29 263.23  1    680.81 75.003 1.55e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: At the 0.001 level, the regressor x1 does have an effect on the response y (the null hypothesis is rejected, p = 1.55e-09). Therefore x1 must not be excluded from the model.

xnew = data.frame(x1=275, x6=2)
predict(lm1, xnew, interval="confidence", level=0.95)
       fit      lwr      upr
1 20.18739 18.87221 21.50257

xnew = data.frame(x1=275, x6=2)
predict(lm1, xnew, interval="prediction", level=0.95)
       fit     lwr      upr
1 20.18739 13.8867 26.48808
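For reference, the confidence interval can be reconstructed from the design matrix; a minimal sketch, assuming lm1 and X are still the Exercise 1 objects defined above:

x0 = c(1, 275, 2)                       # new point in design-matrix order: intercept, x1, x6
s2 = summary(lm1)$sigma^2               # estimated error variance
se_mean = sqrt( s2 * t(x0) %*% solve(t(X) %*% X) %*% x0 )   # s.e. of the estimated mean response
20.18739 + c(-1, 1) * qt(0.975, df = 29) * c(se_mean)       # should reproduce lwr and upr above
# for the prediction interval use s2 * (1 + t(x0) %*% solve(t(X) %*% X) %*% x0) instead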

1.2 Exercise 2

Interpretation: With respect to the response variable, x1, x3, and x4 show scatter that could follow a linear pattern with positive slope. The other variables show no clearly defined pattern.

# df <- dat %>% select(x1,x6, y)

# plotting the variables
# plot(df$x1, df$y, pch=19, col=2,las=1)

library("PerformanceAnalytics")
px = ncol(dat)
chart.Correlation(dat[, 1:px], histogram=TRUE, pch=19)

Interpretation: Consistent with the above, the first four variables have a strong direct correlation; some care is needed with x2 and its correlation with y, while the remaining variables show low to moderate correlations.
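To make "high" precise, the correlations of each regressor with y can be listed directly; a minimal sketch:

sort(round(cor(dat)[ , "y"], 2), decreasing = TRUE)   # correlations with the response, largest first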

# the matrix X, known as the model (design) matrix
X=model.matrix(~x1+x2+x3+x4+x5+x6+x7+x8+x9, data=dat); X
   (Intercept)     x1  x2     x3    x4  x5 x6 x7 x8 x9
1            1 5.0208 1.0 3.5310 1.500 2.0  7  4 62  0
2            1 4.5429 1.0 2.2750 1.175 1.0  6  3 40  0
3            1 4.5573 1.0 4.0500 1.232 1.0  6  3 54  0
4            1 5.0597 1.0 4.4550 1.121 1.0  6  3 42  0
5            1 3.8910 1.0 4.4550 0.988 1.0  6  3 56  0
6            1 5.8980 1.0 5.8500 1.240 1.0  7  3 51  1
7            1 5.6039 1.0 9.5200 1.501 0.0  6  3 32  0
8            1 5.8282 1.0 6.4350 1.225 2.0  6  3 32  0
9            1 5.3003 1.0 4.9883 1.552 1.0  6  3 30  0
10           1 6.2712 1.0 5.5200 0.975 1.0  5  2 30  0
11           1 5.9592 1.0 6.6660 1.121 2.0  6  3 32  0
12           1 5.0500 1.0 5.0000 1.020 0.0  5  2 46  1
13           1 8.2464 1.5 5.1500 1.664 2.0  8  4 50  0
14           1 6.6969 1.5 6.9020 1.488 1.5  7  3 22  1
15           1 7.7841 1.5 7.1020 1.376 1.0  6  3 17  0
16           1 9.0384 1.0 7.8000 1.500 1.5  7  3 23  0
17           1 5.9894 1.0 5.5200 1.256 2.0  6  3 40  1
18           1 7.5422 1.5 5.0000 1.690 1.0  6  3 22  0
19           1 8.7951 1.5 9.8900 1.820 2.0  8  4 50  1
20           1 6.0831 1.5 6.7265 1.652 1.0  6  3 44  0
21           1 8.3607 1.5 9.1500 1.777 2.0  8  4 48  1
22           1 8.1400 1.0 8.0000 1.504 2.0  7  3  3  0
23           1 9.1416 1.5 7.3262 1.831 1.5  8  4 31  0
24           1 4.9176 1.0 3.4720 0.998 1.0  7  4 42  0
attr(,"assign")
 [1] 0 1 2 3 4 5 6 7 8 9
# response vector
y=dat$y;y
 [1] 29.5 27.9 25.9 29.9 29.9 30.9 28.9 35.9 31.5 31.0 30.9 30.0 36.9 41.9 40.5
[16] 43.9 37.5 37.9 44.5 37.9 38.9 36.9 45.8 25.9
betas = (solve(t(X)%*%X))%*%t(X)%*%y; betas
                   [,1]
(Intercept) 14.92764759
x1           1.92472156
x2           7.00053420
x3           0.14917793
x4           2.72280790
x5           2.00668402
x6          -0.41012376
x7          -1.40323530
x8          -0.03714908
x9           1.55944663

The least-squares estimator was used. The fitted regression model is

\[ \hat{y} = 14.9276 + 1.9247\, x_1 + 7.0005\, x_2 + 0.1492\, x_3 + 2.7228\, x_4 + 2.0067\, x_5 - 0.4101\, x_6 - 1.4032\, x_7 - 0.0371\, x_8 + 1.5594\, x_9 \]

lm2 = lm(y~x1+x2+x3+x4+x5+x6+x7+x8+x9, data=dat)
stargazer(lm2, type="text", title="Modelo Lineal")

Modelo Lineal
===============================================
                        Dependent variable:    
                    ---------------------------
                                 y             
-----------------------------------------------
x1                            1.925*           
                              (1.030)          
                                               
x2                             7.001           
                              (4.300)          
                                               
x3                             0.149           
                              (0.490)          
                                               
x4                             2.723           
                              (4.360)          
                                               
x5                             2.007           
                              (1.374)          
                                               
x6                            -0.410           
                              (2.379)          
                                               
x7                            -1.403           
                              (3.396)          
                                               
x8                            -0.037           
                              (0.067)          
                                               
x9                             1.559           
                              (1.937)          
                                               
Constant                     14.928**          
                              (5.913)          
                                               
-----------------------------------------------
Observations                    24             
R2                             0.853           
Adjusted R2                    0.759           
Residual Std. Error       2.949 (df = 14)      
F Statistic            9.037*** (df = 9; 14)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Interpretation: Comparing the two values, adjusted R² (0.759) penalizes the fit noticeably relative to R² (0.853) given the number of regressors. This suggests there are variables that could be dropped without materially affecting the model.

C1 = summary(lm2)$cov.unscaled  # (X'X)^{-1}
Inv = solve(C1)                 # recovers X'X
det(Inv)
[1] 7184427528
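A large determinant of X'X is hard to judge on its own; variance inflation factors are a more direct collinearity diagnostic. A minimal sketch using the faraway package (loaded later in this document; its vif() works on the matrix of predictors):

library(faraway)
vif(model.matrix(lm2)[ , -1])   # VIF per regressor; values above 10 flag strong collinearity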

library(jtools)
summ(lm2, digits = 5)
MODEL INFO:
Observations: 24
Dependent Variable: y
Type: OLS linear regression 

MODEL FIT:
F(9,14) = 9.03703, p = 0.00019
R² = 0.85315
Adj. R² = 0.75874 

Standard errors:OLS
-----------------------------------------------------------
                        Est.      S.E.     t val.         p
----------------- ---------- --------- ---------- ---------
(Intercept)         14.92765   5.91285    2.52461   0.02428
x1                   1.92472   1.02990    1.86884   0.08271
x2                   7.00053   4.30037    1.62789   0.12584
x3                   0.14918   0.49039    0.30420   0.76545
x4                   2.72281   4.35955    0.62456   0.54230
x5                   2.00668   1.37351    1.46099   0.16610
x6                  -0.41012   2.37854   -0.17243   0.86557
x7                  -1.40324   3.39554   -0.41326   0.68568
x8                  -0.03715   0.06672   -0.55679   0.58646
x9                   1.55945   1.93750    0.80488   0.43435
-----------------------------------------------------------

Interpretation: In these individual t-tests, at the 0.05 level none of the regressors is individually significant (the smallest regressor p-value is 0.083, for x1), even though the global F-test is highly significant. A significant overall fit combined with no significant individual coefficients is a classic symptom of multicollinearity among the predictors.

From the output above we can read off the estimated weight of each regressor.

confint(lm2)
                 2.5 %     97.5 %
(Intercept)  2.2458422 27.6094530
x1          -0.2841970  4.1336401
x2          -2.2228458 16.2239142
x3          -0.9025984  1.2009542
x4          -6.6275044 12.0731202
x5          -0.9391990  4.9525670
x6          -5.5115940  4.6913465
x7          -8.6859484  5.8794778
x8          -0.1802490  0.1059509
x9          -2.5960687  5.7149620

Interpretation: every slope interval contains zero, consistent with the individual t-tests above.

2 Outliers and Influence Measures

2.1 Exercise 3

      y     x1    x2    x3    x4    x5
1 271.8 783.35 33.53 40.55 16.66 13.20
2 264.0 748.45 36.50 36.19 16.46 14.11
3 238.8 684.45 34.66 37.31 17.66 15.68
4 230.7 827.80 33.13 32.52 17.50 10.53
5 251.6 860.45 35.75 33.71 16.40 11.00
6 257.9 875.15 34.46 34.14 16.28 11.31
library(MASS)
library(faraway)
library(MPV)
lm3 = lm(y~x1+x2+x3+x4+x5, data=dat)
# model matrix X
X=model.matrix(~x1+x2+x3+x4+x5, data=dat)
#names(lm3);names(summary(lm3))
# hat matrix H = X (X'X)^{-1} X'
Hat = X%*%(solve(t(X)%*%X))%*%t(X); round(Hat)

(29 x 29 hat matrix: rounded to whole numbers, every entry is 0 except the [1,1] element, which rounds to 1, so observation 1 has the largest leverage)
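In practice the leverages (the diagonal of H) are extracted directly rather than printing the whole matrix; a minimal sketch with the conventional 2p/n cutoff (p = 6 estimated coefficients here):

hii = hatvalues(lm3)            # same values as lm.influence(lm3)$hat
which(hii > 2*6/nrow(dat))      # flag high-leverage observations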

# residuals
eres1 = lm3$residuals; eres1
          1           2           3           4           5           6 
  4.2291482   0.9854539   4.2904379  17.3962202  -1.4975487   1.9644395 
          7           8           9          10          11          12 
 -4.0273552 -11.5890237 -11.1038139  -7.1792166   0.3363472  -8.6690794 
         13          14          15          16          17          18 
 -2.0918681   0.6273152  -2.6708219   5.2910015   0.6297434 -13.6847756 
         19          20          21          22          23          24 
  5.5813751  -2.7687671   3.8852532  14.3562521   3.5635546 -11.9948459 
         25          26          27          28          29 
 -0.4650692   3.8069245   3.9166343  -1.1352766   8.0173612 
# square root of the MSE, i.e. the estimated standard deviation
sigmahat = summary(lm3)$sigma
# semi-studentized residual
di = eres1/sigmahat;di
          1           2           3           4           5           6 
 0.52607747  0.12258381  0.53370150  2.16397227 -0.18628494  0.24436300 
          7           8           9          10          11          12 
-0.50097579 -1.44159626 -1.38123944 -0.89304605  0.04183932 -1.07837491 
         13          14          15          16          17          18 
-0.26021426  0.07803377 -0.33223221  0.65816485  0.07833583 -1.70229364 
         19          20          21          22          23          24 
 0.69428536 -0.34441593  0.48329925  1.78582078  0.44328212 -1.49207781 
         25          26          27          28          29 
-0.05785147  0.47355569  0.48720285 -0.14122074  0.99730558 
# leverages hii
hii = lm.influence(lm3)$hat #; hii
# MSE, the estimated variance
sigma2hat = sigmahat^2 #; sigma2hat
# internally studentized residual
ri = eres1/sqrt( sigma2hat*(1-hii) );ri
          1           2           3           4           5           6 
 0.96934442  0.12920241  0.56185097  2.89421917 -0.20531914  0.26615419 
          7           8           9          10          11          12 
-0.56750170 -1.63685277 -1.51529596 -0.95327459  0.04349406 -1.11458349 
         13          14          15          16          17          18 
-0.27268309  0.08913632 -0.35087335  0.85766253  0.08552698 -1.90584353 
         19          20          21          22          23          24 
 0.79995049 -0.37952157  0.51827868  2.09118155  0.48033454 -1.79975685 
         25          26          27          28          29 
-0.07169299  0.52172323  0.52917617 -0.15107487  1.08649415 
# PRESS residual
eipress = eres1/(1-hii)
# R-Student residual
p1 = 5  # number of slope coefficients; the usual convention also counts the intercept (p = 6), so ti below differs slightly from rstudent(lm3)
n1 = nrow(dat)
# estimate of the deleted variance
S2i = ( (n1-p1)*sigma2hat - (eres1^2/(1-hii)) )/(n1-p1-1)
# R-Student residual
ti = eres1/(sqrt(S2i*(1-hii)))
cbind(eres1, di, ri,eipress,ti)
         eres1          di          ri     eipress          ti
1    4.2291482  0.52607747  0.96934442  14.3585220  0.96807473
2    0.9854539  0.12258381  0.12920241   1.0947408  0.12652607
3    4.2904379  0.53370150  0.56185097   4.7549620  0.55367455
4   17.3962202  2.16397227  2.89421917  31.1181910  3.51160963
5   -1.4975487 -0.18628494 -0.20531914  -1.8192163 -0.20117291
6    1.9644395  0.24436300  0.26615419   2.3304208  0.26093568
7   -4.0273552 -0.50097579 -0.56750170  -5.1679796 -0.55931840
8  -11.5890237 -1.44159626 -1.63685277 -14.9409699 -1.70009298
9  -11.1038139 -1.38123944 -1.51529596 -13.3637753 -1.55988547
10  -7.1792166 -0.89304605 -0.95327459  -8.1802274 -0.95138882
11   0.3363472  0.04183932  0.04349406   0.3634783  0.04257997
12  -8.6690794 -1.07837491 -1.11458349  -9.2610160 -1.12050114
13  -2.0918681 -0.26021426 -0.27268309  -2.2971456 -0.26735623
14   0.6273152  0.07803377  0.08913632   0.8185214  0.08727401
15  -2.6708219 -0.33223221 -0.35087335  -2.9789433 -0.34437011
16   5.2910015  0.65816485  0.85766253   8.9846547  0.85277466
17   0.6297434  0.07833583  0.08552698   0.7506699  0.08373897
18 -13.6847756 -1.70229364 -1.90584353 -17.1531234 -2.02525410
19   5.5813751  0.69428536  0.79995049   7.4095429  0.79376121
20  -2.7687671 -0.34441593 -0.37952157  -3.3619629 -0.37265067
21   3.8852532  0.48329925  0.51827868   4.4680062  0.51022966
22  14.3562521  1.78582078  2.09118155  19.6856068  2.26375353
23   3.5635546  0.44328212  0.48033454   4.1841826  0.47249774
24 -11.9948459 -1.49207781 -1.79975685 -17.4517669 -1.89432689
25  -0.4650692 -0.05785147 -0.07169299  -0.7142367 -0.07019101
26   3.8069245  0.47355569  0.52172323   4.6207503  0.51365949
27   3.9166343  0.48720285  0.52917617   4.6205529  0.52108324
28  -1.1352766 -0.14122074 -0.15107487  -1.2992394 -0.14796436
29   8.0173612  0.99730558  1.08649415   9.5154590  1.09078199
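R's built-in functions provide a cross-check; because p1 above counts only the 5 slopes while the usual convention also counts the intercept (p = 6), ti differs slightly from rstudent(). A minimal sketch:

head(cbind(ri, rstandard(lm3)))   # internally studentized: exact match
head(cbind(ti, rstudent(lm3)))    # R-Student: close, exact only if p1 = 6 is used above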


hii = lm.influence(lm3)$hat
eipress = eres1/(1-hii)
press1 = sum(eipress^2)

sst = sum(dat$y^2) - (sum(dat$y)^2/nrow(dat))

R2pred = 1-(press1/sst); R2pred
[1] 0.7881786

Interpretation: The predictive R² indicates the model explains about 78.82% of the variability in y when predicting new observations.

# fitted (predicted) values
yfit3 = lm3$fitted.values
# R-Student residuals, ti
ti2 = rstudent(lm3)
#
op=par(mfrow=c(1,2))
# normal probability (Q-Q) plot
qqnorm(rstudent(lm3),las=1, ylab="residuos", pch=19, col=2)
qqline(rstudent(lm3))
# fitted values vs ti
plot(yfit3, ti2, las=1,cex=1,pch=19, col=2)
text(x=yfit3, y=ti2, labels=rownames(dat), col=4, pos = 4)

We observe that points 4 and 22 can be considered atypical (outliers), since they lie somewhat away from the normal line.

dfit1=dffits(lm3)
pfit1=2*sqrt(p1/n1)
plot(dffits(lm3), pch="*", las=1, cex=2, main="MEDIDA DE INFLUENCIA ", xlab="obs", ylab = "DFFITS_i")
abline(h =pfit1, col="red", lwd=2) 
abline(h =-pfit1, col="red", lwd=2) 
text(x=1:n1, y=dfit1, labels=ifelse(dfit1>pfit1, names(dfit1),""), col="red", pos = 4)
text(x=1:n1, y=dfit1, labels=ifelse(dfit1<(-pfit1), names(dfit1),""), col="red", pos = 4)

library(olsrr)
ols_plot_dffits(lm3)

We observe that point 4 is a very influential observation.
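Cook's distance is a complementary influence measure that combines residual size and leverage; a minimal sketch using base R and the common 4/n rule of thumb:

cd = cooks.distance(lm3)
plot(cd, type = "h", las = 1, xlab = "obs", ylab = "Cook's D")
abline(h = 4/n1, col = "red", lwd = 2)                     # rule-of-thumb cutoff
text(which(cd > 4/n1), cd[cd > 4/n1], labels = which(cd > 4/n1), pos = 4, col = 2)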