To understand formally:
No, there is not a single correlation coefficient. There are three main ones:
| Method | Type of relationship | Requires normality | Sensitive to outliers |
|---|---|---|---|
| Pearson | Linear | Yes (for inference) | Yes |
| Spearman | Monotonic | No | Less sensitive |
| Kendall | Monotonic | No | More robust |
Pearson's coefficient:
\[ r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2}\sqrt{\sum (y_i-\bar{y})^2}} \]
Spearman's coefficient is based on ranks. When there are no ties:
\[ \rho_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)} \]
where \(d_i\) is the difference between the ranks of \(x_i\) and \(y_i\).
Kendall's coefficient is based on concordant and discordant pairs:
\[ \tau = \frac{C-D}{\frac{n(n-1)}{2}} \]
Where:
- C = number of concordant pairs
- D = number of discordant pairs
set.seed(123)
x <- 1:10
y <- c(2,4,5,4,6,7,8,9,10,12)
cor.test(x,y, method="pearson")
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 13.54, df = 8, p-value = 8.501e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9102563 0.9951583
## sample estimates:
## cor
## 0.9788709
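As a rough check of the Pearson formula, the sketch below recomputes the coefficient by hand for the same `x` and `y`. The names `dx`, `dy`, `r_manual` and `t_manual` are illustrative only, and the t statistic uses the standard relation \(t = r\sqrt{n-2}/\sqrt{1-r^2}\), which these notes do not state explicitly.

```r
# Same vectors as in the example above
x <- 1:10
y <- c(2, 4, 5, 4, 6, 7, 8, 9, 10, 12)

# Deviations from the sample means
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r: sum of cross-products over the product of root sums of squares
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
n <- length(x)
t_manual <- r_manual * sqrt(n - 2) / sqrt(1 - r_manual^2)

print(c(r = r_manual, t = t_manual))
print(all.equal(r_manual, cor(x, y)))
```

Both values should agree with the `cor` and `t` entries reported by `cor.test()`.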
cor.test(x,y, method="spearman")
## Warning in cor.test.default(x, y, method = "spearman"): Cannot compute exact
## p-value with ties
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 3.5099, p-value = 8.731e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9787279
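The tie warning matters here because `y` contains the value 4 twice. A minimal sketch, assuming the same data, compares the shortcut formula based on \(d_i\) with the Pearson correlation of the ranks, which is what R appears to report when ties are present. The variable names are illustrative.

```r
x <- 1:10
y <- c(2, 4, 5, 4, 6, 7, 8, 9, 10, 12)

rx <- rank(x)
ry <- rank(y)  # tied values receive the average rank (2.5 and 2.5)

d <- rx - ry
n <- length(x)

# Shortcut formula, exact only when there are no ties
rho_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Pearson correlation computed on the ranks, valid with ties
rho_ranks <- cor(rx, ry)

print(c(shortcut = rho_shortcut, on_ranks = rho_ranks))
```

The two results differ slightly (about 0.97879 versus 0.97873), and only the second matches the `rho` in the output above.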
cor.test(x,y, method="kendall")
## Warning in cor.test.default(x, y, method = "kendall"): Cannot compute exact
## p-value with ties
##
## Kendall's rank correlation tau
##
## data: x and y
## z = 3.7717, p-value = 0.0001621
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.9438798
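To connect the output with the definition of \(\tau\), the following sketch counts concordant and discordant pairs directly. It assumes the same data. The quantity called `tau_a` is the formula given earlier, and `tau_b` is the usual tie-corrected variant, a name and formula not used in these notes.

```r
x <- 1:10
y <- c(2, 4, 5, 4, 6, 7, 8, 9, 10, 12)
n <- length(x)

# All n(n-1)/2 index pairs, one pair per column
idx <- combn(n, 2)
sx <- sign(x[idx[1, ]] - x[idx[2, ]])
sy <- sign(y[idx[1, ]] - y[idx[2, ]])

C <- sum(sx * sy > 0)  # concordant pairs
D <- sum(sx * sy < 0)  # discordant pairs

n0 <- n * (n - 1) / 2  # total number of pairs
tx <- sum(sx == 0)     # pairs tied in x
ty <- sum(sy == 0)     # pairs tied in y

tau_a <- (C - D) / n0
tau_b <- (C - D) / sqrt((n0 - tx) * (n0 - ty))

print(c(C = C, D = D, tau_a = tau_a, tau_b = tau_b))
```

With one tied pair in `y`, this gives C = 43 and D = 1. The tie-corrected value is the one that matches the `tau` reported by `cor.test()`.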
Population model:
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
Data:
| X | Y |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
| 4 | 4 |
| 5 | 6 |
\[ \bar{x} = 3 \]
\[ \bar{y} = 4 \]
\[ \hat{\beta}_1 = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})} {\sum (x_i-\bar{x})^2} \]
Auxiliary table:
| X | Y | X-3 | Y-4 | Product | (X-3)^2 |
|---|---|---|---|---|---|
| 1 | 2 | -2 | -2 | 4 | 4 |
| 2 | 3 | -1 | -1 | 1 | 1 |
| 3 | 5 | 0 | 1 | 0 | 0 |
| 4 | 4 | 1 | 0 | 0 | 1 |
| 5 | 6 | 2 | 2 | 4 | 4 |
Sums:
\[ \sum (x_i-\bar{x})(y_i-\bar{y}) = 9 \]
\[ \sum (x_i-\bar{x})^2 = 10 \]
Then:
\[ \hat{\beta}_1 = 9/10 = 0.9 \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
\[ \hat{\beta}_0 = 4 - 0.9(3) \]
\[ \hat{\beta}_0 = 1.3 \]
\[ \hat{Y} = 1.3 + 0.9X \]
Interpretation:
For each one-unit increase in X, Y increases by 0.9 units on average.
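The hand calculation above can be reproduced in a few lines of R. This is only a sketch of the same arithmetic; the names `Sxy`, `Sxx`, `b0` and `b1` are illustrative.

```r
X <- c(1, 2, 3, 4, 5)
Y <- c(2, 3, 5, 4, 6)

# Sums from the auxiliary table
Sxy <- sum((X - mean(X)) * (Y - mean(Y)))
Sxx <- sum((X - mean(X))^2)

# Least-squares slope and intercept
b1 <- Sxy / Sxx
b0 <- mean(Y) - b1 * mean(X)

# Fitted values from the estimated line
Y_hat <- b0 + b1 * X

print(c(Sxy = Sxy, Sxx = Sxx, intercept = b0, slope = b1))
print(Y_hat)
```

The sums come out as 9 and 10, so the slope is 0.9 and the intercept 1.3, as in the derivation.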
datos <- data.frame(
X=c(1,2,3,4,5),
Y=c(2,3,5,4,6)
)
modelo <- lm(Y~X, data=datos)
summary(modelo)
##
## Call:
## lm(formula = Y ~ X, data = datos)
##
## Residuals:
## 1 2 3 4 5
## -0.2 -0.1 1.0 -0.9 0.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3000 0.8347 1.558 0.2172
## X 0.9000 0.2517 3.576 0.0374 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7958 on 3 degrees of freedom
## Multiple R-squared: 0.81, Adjusted R-squared: 0.7467
## F-statistic: 12.79 on 1 and 3 DF, p-value: 0.03739
OLS stands for Ordinary Least Squares.
It is the method that minimizes the sum of squared residuals:
\[ S(\beta_0,\beta_1)=\sum (y_i-\beta_0-\beta_1x_i)^2 \]
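As a brief sketch of why this minimization leads to the estimator formulas used in the worked example, set both partial derivatives of \(S\) to zero. This is the standard derivation of the normal equations, which these notes do not spell out:

\[ \frac{\partial S}{\partial \beta_0} = -2\sum (y_i-\beta_0-\beta_1 x_i) = 0 \]
\[ \frac{\partial S}{\partial \beta_1} = -2\sum x_i\,(y_i-\beta_0-\beta_1 x_i) = 0 \]

The first equation gives \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). Substituting it into the second gives:

\[ \hat{\beta}_1 = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} \]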
Linearity assumption:
\[ E(Y|X)=\beta_0+\beta_1X \]
It means that the conditional mean of Y is a linear function of X.
This is called the zero-mean error assumption:
\[ E(\varepsilon_i|X_i)=0 \]
Together with the population model, it implies:
\[ E(Y|X)=\beta_0+\beta_1X \]
The error does not bias the model.
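\[ Var(\varepsilon_i)=\sigma^2 \]
Homoscedasticity: the errors have constant variance.
\[ Cov(\varepsilon_i,\varepsilon_j)=0 \quad (i \neq j) \]
No autocorrelation: errors from different observations are uncorrelated.
\[ \varepsilon_i \sim N(0,\sigma^2) \]
Normality of the errors: needed for exact inference (t and F tests) in small samples.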
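Test:
\[ H_0: \beta_1=0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0 \]
Statistic:
\[ t=\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \]
Under \(H_0\) and the normality assumption, t follows a Student's t distribution with n-2 degrees of freedom.
In R: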
summary(modelo)
##
## Call:
## lm(formula = Y ~ X, data = datos)
##
## Residuals:
## 1 2 3 4 5
## -0.2 -0.1 1.0 -0.9 0.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3000 0.8347 1.558 0.2172
## X 0.9000 0.2517 3.576 0.0374 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7958 on 3 degrees of freedom
## Multiple R-squared: 0.81, Adjusted R-squared: 0.7467
## F-statistic: 12.79 on 1 and 3 DF, p-value: 0.03739
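To show where the slope's row in the coefficient table comes from, the sketch below rebuilds the standard error, the t statistic and the two-sided p-value from the residuals. It assumes the same five data points. The object names (`fit`, `s2`, `se_b1`, `t_stat`, `p_value`) are illustrative and not part of the original code.

```r
X <- c(1, 2, 3, 4, 5)
Y <- c(2, 3, 5, 4, 6)
n <- length(X)

fit <- lm(Y ~ X)
e <- resid(fit)

# Residual variance estimate with n - 2 degrees of freedom
s2 <- sum(e^2) / (n - 2)

# Standard error of the slope
se_b1 <- sqrt(s2 / sum((X - mean(X))^2))

# t statistic and two-sided p-value for H0: beta_1 = 0
t_stat <- coef(fit)[["X"]] / se_b1
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(se = se_b1, t = t_stat, p = p_value))
```

These values should reproduce the `Std. Error`, `t value` and `Pr(>|t|)` entries for `X`.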
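Interpretation:
With t = 3.576 and p = 0.0374 < 0.05, \(H_0\) is rejected at the 5% level: the slope differs significantly from zero.
95% confidence intervals for the coefficients: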
confint(modelo)
## 2.5 % 97.5 %
## (Intercept) -1.35627846 3.956278
## X 0.09910191 1.700898
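The interval for the slope can be rebuilt as estimate ± critical t value × standard error. The following is a minimal sketch under the same data, with illustrative names (`est`, `se`, `t_crit`, `ci_slope`):

```r
X <- c(1, 2, 3, 4, 5)
Y <- c(2, 3, 5, 4, 6)
fit <- lm(Y ~ X)

# Estimate and standard error of the slope from the coefficient table
tab <- coef(summary(fit))
est <- tab["X", "Estimate"]
se <- tab["X", "Std. Error"]

# 97.5% quantile of Student's t with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci_slope <- est + c(-1, 1) * t_crit * se
print(ci_slope)
```

Because the resulting interval, roughly (0.099, 1.701), excludes zero, it is consistent with rejecting \(H_0: \beta_1 = 0\) at the 5% level.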
\[ R^2 = 1-\frac{SSE}{SST} \]
In R:
summary(modelo)$r.squared
## [1] 0.81
Interpretation:
Proportion of the variability in Y explained by X; here, 81%.
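As a worked check with the numbers from this example, take the residuals reported by `summary(modelo)` and the deviations \(y_i-\bar{y}\) from the auxiliary table:

\[ SSE = \sum (y_i-\hat{y}_i)^2 = 0.04+0.01+1.00+0.81+0.04 = 1.9 \]
\[ SST = \sum (y_i-\bar{y})^2 = 4+1+1+0+4 = 10 \]
\[ R^2 = 1-\frac{1.9}{10} = 0.81 \]

In simple linear regression this equals the squared Pearson correlation. Here \(r = 9/\sqrt{10 \cdot 10} = 0.9\), so \(r^2 = 0.81\).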
Diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage):
plot(modelo)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(modelo)
##
## Durbin-Watson test
##
## data: modelo
## DW = 3.1789, p-value = 0.8406
## alternative hypothesis: true autocorrelation is greater than 0
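The DW statistic itself is easy to reproduce from the residuals. The sketch below uses base R only and assumes the same data; it does not reproduce the p-value computed by `dwtest()`. The name `dw` is illustrative.

```r
X <- c(1, 2, 3, 4, 5)
Y <- c(2, 3, 5, 4, 6)
fit <- lm(Y ~ X)
e <- resid(fit)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)
```

Values near 2 suggest no first-order autocorrelation. A value above 2, as here, points toward negative rather than positive autocorrelation, which fits the large p-value for the "greater than 0" alternative. With only five observations, the test has very little power.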
bptest(modelo)
##
## studentized Breusch-Pagan test
##
## data: modelo
## BP = 0.34137, df = 1, p-value = 0.559
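For intuition, the studentized Breusch-Pagan statistic can be sketched as n times the \(R^2\) of an auxiliary regression of the squared residuals on X, compared against a chi-squared distribution with 1 degree of freedom. This is a base-R illustration under the same data, not the internal code of `bptest()`. The names `aux`, `bp` and `p_bp` are illustrative.

```r
X <- c(1, 2, 3, 4, 5)
Y <- c(2, 3, 5, 4, 6)
n <- length(X)

fit <- lm(Y ~ X)
e2 <- resid(fit)^2

# Auxiliary regression of squared residuals on the regressor
aux <- lm(e2 ~ X)

# Studentized (Koenker) form of the statistic and its chi-squared p-value
bp <- n * summary(aux)$r.squared
p_bp <- pchisq(bp, df = 1, lower.tail = FALSE)

print(c(BP = bp, p = p_bp))
```

A large p-value, as here, gives no evidence against constant error variance.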
shapiro.test(modelo$residuals)
##
## Shapiro-Wilk normality test
##
## data: modelo$residuals
## W = 0.97256, p-value = 0.8914