1 Introduction

Regression is a statistical model that establishes the possible relationship between one or more explanatory variables in an observed data set and a response variable. The goal is to estimate the relationship between a variable of interest (the response) and the value of one or more explanatory variables. All the variables involved (explanatory and response) must be numeric for this type of analysis.

\[ \begin{align} \vec{y} &= f\left({X}_{1},{X}_{2},\ldots,{X}_{p}\right)\\ &= \boldsymbol{X}\vec{\beta}+\vec{\varepsilon} \end{align} \]

Where:

  • \(\vec{y}\) is an \(n{\times}1\) vector of responses.

  • \(\boldsymbol{X}\) is an \(n{\times}p\) matrix of explanatory variables, known as the design matrix.

  • \(\vec{\beta}\) is a \(p{\times}1\) vector of parameters.

  • \(\vec{\varepsilon}\) is an \(n{\times}1\) vector of random errors.

It is assumed that \(\vec{\varepsilon}\) is a vector of random variables such that:

\[E(\vec{\varepsilon}) = \vec{0}\]

\[V(\vec{\varepsilon}) = {\sigma}^{2}\boldsymbol{I}\]

From this it follows that:

\[ \begin{align} E(\vec{y}) &= E(\boldsymbol{X}\vec{\beta}+\vec{\varepsilon})\\ &= \boldsymbol{X}\vec{\beta}+E(\vec{\varepsilon})\\ &= \boldsymbol{X}\vec{\beta}\\ \end{align} \]

\[ \begin{align} V(\vec{y}) &= V(\boldsymbol{X}\vec{\beta}+\vec{\varepsilon})\\ &= V(\vec{\varepsilon})\\ &= {\sigma}^{2}\boldsymbol{I} \end{align} \]

We are interested in:

  • Estimating the parameters of the model or, when that is not possible, estimating some linear functions of the parameters.

  • Making predictions of the response variable.

Note:

“It is necessary to have more observations than parameters (\(n{>}p\)); otherwise there are singularity problems with the matrices involved.”
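As a quick illustration (a sketch, not from the original text), the following R snippet shows what goes wrong when \(n{<}p\): the matrix \(\boldsymbol{X}^{t}\boldsymbol{X}\) is rank-deficient and cannot be inverted.

set.seed(1)
X <- matrix(rnorm(3 * 5), nrow = 3)    # hypothetical design: n = 3 < p = 5
XtX <- t(X) %*% X                      # 5 x 5 matrix of rank at most 3
qr(XtX)$rank                           # returns 3, less than p = 5
# solve(XtX) would fail here because XtX is singular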

2 Simple linear regression

Simple linear regression describes the relationship between numeric variables, specifically between a response variable (\(y\)) and an explanatory variable (\(x\)). It therefore attempts to predict the value of one quantitative variable from another.

\[ \begin{align} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} &= \begin{bmatrix} {\beta}_{0}\\ {\beta}_{0}\\ \vdots\\ {\beta}_{0} \end{bmatrix} + \begin{bmatrix} {\beta}_{1}{x}_{1}\\ {\beta}_{1}{x}_{2}\\ \vdots\\ {\beta}_{1}{x}_{n} \end{bmatrix} + \begin{bmatrix} {\varepsilon}_{1}\\ {\varepsilon}_{2}\\ \vdots\\ {\varepsilon}_{n} \end{bmatrix} \end{align} \]

with \({\varepsilon}_{i}{\sim}N(0,{\sigma}^{2})\), or equivalently

\[ \begin{align} \begin{bmatrix} {\varepsilon}_{1}\\ {\varepsilon}_{2}\\ \vdots\\ {\varepsilon}_{n} \end{bmatrix} &\sim N\left( \begin{bmatrix} 0\\ 0\\ \vdots\\ 0\\ \end{bmatrix},{\sigma}^{2}\begin{bmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{bmatrix} \right) \end{align} \]

2.1 Introduction and model formulation

Regression analysis is a statistical tool for investigating relationships between variables. Usually, the investigator seeks to determine the causal effect of one variable on another: the effect of a price increase on demand, for example, or the effect of changes in the money supply on the inflation rate. To explore these questions, the investigator gathers data on the underlying variables of interest and uses regression to estimate the quantitative effect of the causal variables on the variable they influence. The investigator also typically assesses the “statistical significance” of the estimated relationships, that is, the degree of confidence that the true relationship is close to the estimated one.

library(statsr)
## Loading required package: BayesFactor
## Loading required package: coda
## Loading required package: Matrix
## ************
## Welcome to BayesFactor 0.9.12-4.4. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************
datos <- data.frame(x = rnorm(10), y = rnorm(10))

2.2 Estimation of the model parameters

In statistics, linear regression or linear fitting is a mathematical model used to approximate the relationship between a response variable \(\vec{y}\) and \(p\) independent variables \(\vec{x}_{i}\), with \(p{\in}\mathbb{Z}^{+}\), plus a random term \(\vec{\varepsilon}\). The method applies to many situations in which the relationship between two or more variables is studied or a behavior is to be predicted, some of them entirely unrelated to technology. When a regression model cannot be applied to a study, it is said that there is no correlation between the variables studied.

library(statsr)
plot_ss(x = x, y = y, data = datos)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     0.08674      0.74037  
## 
## Sum of Squares:  5.317

Given the regression model stated in general form

\[{y}_{i} = {\beta}_{0} + {\beta}_{1}{x}_{i} + {\varepsilon}_{i}\]

Solving for the errors, each error equals the difference

\[{\varepsilon}_{i} = {y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}\]

The idea, in order to find the best-fitting line, is then to minimize the sum of squared errors, i.e. the squared vertical distances from the observations to the fitted line, so the expression to minimize is:

\[ \begin{align} S\left({\beta}_{0}, {\beta}_{1}\right) &= \vec{\varepsilon}^{t}\vec{\varepsilon}\\ &= \begin{bmatrix} {y}_{1} - {\beta}_{0} - {\beta}_{1}{x}_{1}, & {y}_{2} - {\beta}_{0} - {\beta}_{1}{x}_{2}, & \cdots, & {y}_{n} - {\beta}_{0} - {\beta}_{1}{x}_{n} \end{bmatrix} \begin{bmatrix} {y}_{1} - {\beta}_{0} - {\beta}_{1}{x}_{1}\\ {y}_{2} - {\beta}_{0} - {\beta}_{1}{x}_{2}\\ \vdots\\ {y}_{n} - {\beta}_{0} - {\beta}_{1}{x}_{n} \end{bmatrix}\\ &= {\sum}_{i=1}^{n}\left({{y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}}\right)^{2} \end{align} \]
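As a numerical check (a sketch, using the small datos set simulated earlier), the objective \(S\left({\beta}_{0},{\beta}_{1}\right)\) can be minimized with a general-purpose optimizer and compared against lm:

# Sum of squared errors as a function of (beta0, beta1)
S <- function(beta, x, y) sum((y - beta[1] - beta[2] * x)^2)
# Numerical minimization starting from (0, 0)
optim(c(0, 0), S, x = datos$x, y = datos$y)$par
# Closed-form least squares solution for comparison
coef(lm(y ~ x, data = datos))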

2.2.1 Ordinary least squares method

\[ \begin{align} \frac{{\partial}}{{\partial}{\beta}_{0}}S\left({\beta}_{0}, {\beta}_{1}\right) &= 2{\sum}_{i=1}^{n}\left({{y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}}\right)\left(-1\right)\\ &= -2{\sum}_{i=1}^{n}\left({{y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}}\right)\\ &= -2{{\sum}_{i=1}^{n}{y}_{i} + 2{\sum}_{i=1}^{n}{\beta}_{0} + 2{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}} \end{align} \]

\[ \begin{align} \frac{{\partial}}{{\partial}{\beta}_{1}}S\left({\beta}_{0}, {\beta}_{1}\right) &= 2{\sum}_{i=1}^{n}\left({{y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}}\right)\left(-{x}_{i}\right)\\ &= -2{\sum}_{i=1}^{n}{x}_{i}\left({{y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}}\right)\\ &= -2\left({\sum}_{i=1}^{n}{{x}_{i}{y}_{i} - {\beta}_{0}{x}_{i} - {\beta}_{1}{x}_{i}^{2}}\right)\\ &= -2{{\sum}_{i=1}^{n}{x}_{i}{y}_{i} + 2{\beta}_{0}{\sum}_{i=1}^{n}{x}_{i} + 2{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}^{2}}\\ \end{align} \]

Setting these derivatives equal to zero yields:

\[ \begin{align} \frac{{\partial}}{{\partial}{\beta}_{0}}S({\beta}_{0}, {\beta}_{1}) = 0 &\rightarrow -2{{\sum}_{i=1}^{n}{y}_{i} + 2\widehat{\beta}_{0}{\sum}_{i=1}^{n}1 + 2\widehat{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}} = 0\\ &\rightarrow {-{\sum}_{i=1}^{n}{y}_{i} + \widehat{\beta}_{0}{\sum}_{i=1}^{n}1 + \widehat{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}} = 0\\ &\rightarrow {\widehat{\beta}_{0}{\sum}_{i=1}^{n}1 = {\sum}_{i=1}^{n}{y}_{i} - \widehat{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}}\\ &\rightarrow {\widehat{\beta}_{0} = \frac{{\sum}_{i=1}^{n}{y}_{i}}{{\sum}_{i=1}^{n}1} - \widehat{\beta}_{1}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}1}}\\ &\rightarrow {\widehat{\beta}_{0} = \frac{{\sum}_{i=1}^{n}{y}_{i}}{n} - \widehat{\beta}_{1}\frac{{\sum}_{i=1}^{n}{x}_{i}}{n}}\\ &\rightarrow {\widehat{\beta}_{0} = \overline{y} - \widehat{\beta}_{1}\overline{x}} \end{align} \]

\[ \begin{align} \frac{{\partial}}{{\partial}{\beta}_{1}}S({\beta}_{0}, {\beta}_{1}) = 0 &\rightarrow -2{{\sum}_{i=1}^{n}{x}_{i}{y}_{i} + 2\widehat{\beta}_{0}{\sum}_{i=1}^{n}{x}_{i} + 2\widehat{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}^{2}} = 0\\ &\rightarrow {-{\sum}_{i=1}^{n}{x}_{i}{y}_{i} + \widehat{\beta}_{0}{\sum}_{i=1}^{n}{x}_{i} + \widehat{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}^{2}} = 0\\ &\rightarrow {\widehat{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}^{2} = {\sum}_{i=1}^{n}{x}_{i}{y}_{i} - \widehat{\beta}_{0}{\sum}_{i=1}^{n}{x}_{i}}\\ &\rightarrow {\widehat{\beta}_{1} = \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \widehat{\beta}_{0}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\\ &\rightarrow {\widehat{\beta}_{1} = \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \left(\overline{y} - \widehat{\beta}_{1}\overline{x}\right)\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\\ &\rightarrow {\widehat{\beta}_{1} = \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} + \widehat{\beta}_{1}\overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\\ &\rightarrow {\widehat{\beta}_{1} - \widehat{\beta}_{1}\overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} = \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\\ &\rightarrow {\widehat{\beta}_{1}\left(1 - \overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}\right) = \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\\ &\rightarrow \widehat{\beta}_{1} = \frac{\frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}{1 - \overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\\ &\rightarrow \widehat{\beta}_{1} = \frac{\frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}{\frac{{\sum}_{i=1}^{n}{x}_{i}^{2}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\\ &\rightarrow \widehat{\beta}_{1} = \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - \overline{y}{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2} - \overline{x}{\sum}_{i=1}^{n}{x}_{i}}\\ &\rightarrow \widehat{\beta}_{1} = \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - n\frac{{\sum}_{i=1}^{n}{y}_{i}}{n}\overline{x}}{{\sum}_{i=1}^{n}{x}_{i}^{2} - \frac{{\sum}_{i=1}^{n}{x}_{i}}{n}{{\sum}_{i=1}^{n}{x}_{i}}}\\ &\rightarrow \widehat{\beta}_{1} = \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - n\overline{y}\overline{x}}{{\sum}_{i=1}^{n}{x}_{i}^{2} - \frac{1}{n}\left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}}\\ &\rightarrow \widehat{\beta}_{1} = \frac{\frac{1}{n}{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - \overline{y}\overline{x}}{\frac{1}{n}{\sum}_{i=1}^{n}{x}_{i}^{2} - \left(\frac{{{\sum}_{i=1}^{n}{x}_{i}}}{n}\right)^{2}}\\ &\rightarrow \widehat{\beta}_{1} = \frac{{S}_{x,y}}{{S}_{x,x}} \end{align} \]
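The closed-form estimators \(\widehat{\beta}_{1}={S}_{x,y}/{S}_{x,x}\) and \(\widehat{\beta}_{0}=\overline{y}-\widehat{\beta}_{1}\overline{x}\) can be verified directly in R (a sketch, again using the simulated datos):

x <- datos$x; y <- datos$y; n <- length(x)
Sxy <- sum(x * y) - n * mean(x) * mean(y)   # S_{x,y}
Sxx <- sum(x^2) - n * mean(x)^2             # S_{x,x}
b1 <- Sxy / Sxx                 # slope: S_{x,y} / S_{x,x}
b0 <- mean(y) - b1 * mean(x)    # intercept: ybar - b1 * xbar
c(b0, b1)                       # should match coef(lm(y ~ x, data = datos))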

2.2.2 Maximum likelihood method

with \({y}_{i}{\sim}N({\beta}_{0} + {\beta}_{1}{x}_{i},{\sigma}^{2})\), or equivalently

\[ \begin{align} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} &\sim N\left( \begin{bmatrix} {\beta}_{0} + {\beta}_{1}{x}_{1}\\ {\beta}_{0} + {\beta}_{1}{x}_{2}\\ \vdots\\ {\beta}_{0} + {\beta}_{1}{x}_{n}\\ \end{bmatrix},{\sigma}^{2}\begin{bmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{bmatrix} \right) \end{align} \]

\[ \begin{align} \mathbb{L}\left({\beta}_{0}, {\beta}_{1}, {\sigma}^{2}|{x}_{1}, {x}_{2}, \cdots, {x}_{n}, {y}_{1}, {y}_{2}, \cdots, {y}_{n}\right) &= {\prod}_{i=1}^{n}{\frac{1}{\sqrt{2{\pi}{\sigma}^{2}}}}\exp{\left\{-\frac{1}{2{\sigma}^{2}}\left({y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}\right)^{2}\right\}}\\ &= {\left(\frac{1}{\sqrt{2{\pi}{\sigma}^{2}}}\right)^{n}}\exp{\left\{-\frac{1}{2{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}\right)^{2}\right\}}\\ &= {\frac{1}{\left({2{\pi}{\sigma}^{2}}\right)^\frac{n}{2}}}\exp{\left\{-\frac{1}{2{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}\right)^{2}\right\}} \end{align} \]

\[ \begin{align} \ln\left[{\mathbb{L}\left({\beta}_{0}, {\beta}_{1}, {\sigma}^{2}|{x}_{1}, {x}_{2}, \cdots, {x}_{n}, {y}_{1}, {y}_{2}, \cdots, {y}_{n}\right)}\right] &= \ln\left[{\frac{1}{\left({2{\pi}{\sigma}^{2}}\right)^\frac{n}{2}}}\exp{\left\{-\frac{1}{2{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}\right)^{2}\right\}}\right]\\ &= \ln\left[{{\left({2{\pi}{\sigma}^{2}}\right)^{-\frac{n}{2}}}}\right]{-\frac{1}{2{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}\right)^{2}}\\ &= {-\frac{n}{2}}\ln{{\left({2{\pi}{\sigma}^{2}}\right)}}{-\frac{1}{2{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - {\beta}_{0} - {\beta}_{1}{x}_{i}\right)^{2}}\\ \end{align} \]

\[ \begin{align} \frac{{\partial}}{{\partial}{\beta}_{0}}\ln\left[{\mathbb{L}\left({\beta}_{0}, {\beta}_{1}, {\sigma}^{2}|{x}_{1}, {x}_{2}, \cdots, {x}_{n}, {y}_{1}, {y}_{2}, \cdots, {y}_{n}\right)}\right] = 0 &{\rightarrow} {-2\frac{1}{2\widehat{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)}\left(-1\right) = 0\\ &{\rightarrow} {\frac{1}{2\widehat{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)} = 0\\ &{\rightarrow} {{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)} = 0\\ &{\rightarrow} {{\sum}_{i=1}^{n}{y}_{i} - {\sum}_{i=1}^{n}\widehat{\beta}_{0} - {\sum}_{i=1}^{n}\widehat{\beta}_{1}{x}_{i}} = 0\\ &{\rightarrow} {{\sum}_{i=1}^{n}{y}_{i} - {\sum}_{i=1}^{n}\widehat{\beta}_{1}{x}_{i}} = {\sum}_{i=1}^{n}\widehat{\beta}_{0}\\ &{\rightarrow} {{\sum}_{i=1}^{n}{y}_{i} - \widehat{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}} = {n}\widehat{\beta}_{0}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}}{n} - \widehat{\beta}_{1}\frac{{\sum}_{i=1}^{n}{x}_{i}}{n}} = \widehat{\beta}_{0}\\ &{\rightarrow} {\overline{y} - \widehat{\beta}_{1}\overline{x}} = \widehat{\beta}_{0} \end{align} \]

\[ \begin{align} \frac{{\partial}}{{\partial}{\beta}_{1}}\ln\left[{\mathbb{L}\left({\beta}_{0}, {\beta}_{1}, {\sigma}^{2}|{x}_{1}, {x}_{2}, \cdots, {x}_{n}, {y}_{1}, {y}_{2}, \cdots, {y}_{n}\right)}\right] = 0 &{\rightarrow} {-2\frac{1}{2\widehat{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)}\left(-{x}_{i}\right) = 0\\ &{\rightarrow} {\frac{1}{2\widehat{\sigma}^{2}}{\sum}_{i=1}^{n}\left({y}_{i}{x}_{i} - \widehat{\beta}_{0}{x}_{i} - \widehat{\beta}_{1}{x}_{i}^{2}\right)} = 0\\ &{\rightarrow} {{\sum}_{i=1}^{n}\left({y}_{i}{x}_{i} - \widehat{\beta}_{0}{x}_{i} - \widehat{\beta}_{1}{x}_{i}^{2}\right)} = 0\\ &{\rightarrow} {{\sum}_{i=1}^{n}{y}_{i}{x}_{i} - {\sum}_{i=1}^{n}\widehat{\beta}_{0}{x}_{i} - {\sum}_{i=1}^{n}\widehat{\beta}_{1}{x}_{i}^{2}} = 0\\ &{\rightarrow} {{\sum}_{i=1}^{n}{y}_{i}{x}_{i} - {\sum}_{i=1}^{n}\widehat{\beta}_{0}{x}_{i} = {\sum}_{i=1}^{n}\widehat{\beta}_{1}{x}_{i}^{2}}\\ &{\rightarrow} {{\sum}_{i=1}^{n}{y}_{i}{x}_{i} - \widehat{\beta}_{0}{\sum}_{i=1}^{n}{x}_{i} = \widehat{\beta}_{1}{\sum}_{i=1}^{n}{x}_{i}^{2}}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \widehat{\beta}_{0}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} = \widehat{\beta}_{1}}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \left(\overline{y} - \widehat{\beta}_{1}\overline{x}\right)\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} = \widehat{\beta}_{1}}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} + \widehat{\beta}_{1}\overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} = \widehat{\beta}_{1}}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} = \widehat{\beta}_{1} - \widehat{\beta}_{1}\overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} = \widehat{\beta}_{1}\left(1 - \overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}\right)}\\ &{\rightarrow} {\frac{\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}} - \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}{\left(1 - \overline{x}\frac{{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}\right)} = \widehat{\beta}_{1}}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i} - \overline{y}{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2} - \overline{x}{\sum}_{i=1}^{n}{x}_{i}} = \widehat{\beta}_{1}}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i} - \overline{y}\frac{n}{n}{\sum}_{i=1}^{n}{x}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2} - \overline{x}\frac{n}{n}{\sum}_{i=1}^{n}{x}_{i}} = \widehat{\beta}_{1}}\\ &{\rightarrow} {\frac{{\sum}_{i=1}^{n}{y}_{i}{x}_{i} - {n}\overline{y}\overline{x}}{{\sum}_{i=1}^{n}{x}_{i}^{2} - {n}\overline{x}\overline{x}} = \widehat{\beta}_{1}}\\ &{\rightarrow} {\frac{\frac{1}{n}{\sum}_{i=1}^{n}{y}_{i}{x}_{i} - \overline{y}\overline{x}}{\frac{1}{n}{\sum}_{i=1}^{n}{x}_{i}^{2} - \overline{x}^{2}} = \widehat{\beta}_{1}}\\ &{\rightarrow} {\frac{{S}_{x,y}}{{S}_{x,x}} = \widehat{\beta}_{1}} \end{align} \]

\[ \begin{align} \frac{{\partial}}{{\partial}{\sigma}^{2}}\ln\left[{\mathbb{L}\left({\beta}_{0}, {\beta}_{1}, {\sigma}^{2}|{x}_{1}, {x}_{2}, \cdots, {x}_{n}, {y}_{1}, {y}_{2}, \cdots, {y}_{n}\right)}\right] = 0 &{\rightarrow} - {\frac{n}{2}}\frac{1}{{{2{\pi}\widehat{\sigma}^{2}}}}{2{\pi}} - {\frac{1}{2\left(\widehat{\sigma}^{2}\right)^{2}}(-1){\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}} = 0\\ &{\rightarrow} - \frac{n}{{2{\widehat{\sigma}^{2}}}} + {\frac{1}{2\left(\widehat{\sigma}^{2}\right)^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}} = 0\\ &{\rightarrow} {\frac{1}{2\left(\widehat{\sigma}^{2}\right)^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}} = \frac{n}{{2{\widehat{\sigma}^{2}}}}\\ &{\rightarrow} {\frac{1}{\left(\widehat{\sigma}^{2}\right)^{2}}{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}} = \frac{n}{{{\widehat{\sigma}^{2}}}}\\ &{\rightarrow} {\frac{1}{n}{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}} = \widehat{\sigma}^{2}\\ \end{align} \]
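Note that the maximum-likelihood estimator of \({\sigma}^{2}\) divides by \(n\) and is therefore biased; lm reports instead the unbiased version that divides by \(n-2\). A small sketch of the difference, using the simulated datos:

fit <- lm(y ~ x, data = datos)
n <- nrow(datos)
sum(resid(fit)^2) / n          # maximum-likelihood estimate (biased)
sum(resid(fit)^2) / (n - 2)    # unbiased estimate, equal to summary(fit)$sigma^2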

2.3 Analysis of variance and hypothesis testing

2.3.1 Sums of squares

\[ \begin{align} {SC}_{T} &= {\sum}_{i=1}^{n}\left({y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i} + \widehat{y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left[\left({y}_{i} - \widehat{y}_{i}\right) + \left(\widehat{y}_{i} - \overline{y}\right)\right]^{2}\\ &= {\sum}_{i=1}^{n}\left[\left({y}_{i} - \widehat{y}_{i}\right)^{2} + 2\left({y}_{i} - \widehat{y}_{i}\right)\left(\widehat{y}_{i} - \overline{y}\right) + \left(\widehat{y}_{i} - \overline{y}\right)^{2}\right]\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)^{2} + 2{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)\left(\widehat{y}_{i} - \overline{y}\right) + {\sum}_{i=1}^{n}\left(\widehat{y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)^{2} + 2{\sum}_{i=1}^{n}\widehat{\varepsilon}_{i}\left(\widehat{y}_{i} - \overline{y}\right) + {\sum}_{i=1}^{n}\left(\widehat{y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)^{2} + 2{\sum}_{i=1}^{n}\widehat{\varepsilon}_{i}\left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i} - \overline{y}\right) + {\sum}_{i=1}^{n}\left(\widehat{y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)^{2} + 2\left(\widehat{\beta}_{0} - \overline{y}\right){\sum}_{i=1}^{n}\widehat{\varepsilon}_{i} + 2\widehat{\beta}_{1}{\sum}_{i=1}^{n}\widehat{\varepsilon}_{i}{x}_{i} + {\sum}_{i=1}^{n}\left(\widehat{y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)^{2} + 2\left(\widehat{\beta}_{0} - \overline{y}\right){\cdot}0 + 2\widehat{\beta}_{1}{\sum}_{i=1}^{n}\left({{y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}}\right){x}_{i} + {\sum}_{i=1}^{n}\left(\widehat{y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)^{2} - \widehat{\beta}_{1}\frac{{\partial}}{{\partial}{\beta}_{1}}S\left(\widehat{\beta}_{0}, \widehat{\beta}_{1}\right) + {\sum}_{i=1}^{n}\left(\widehat{y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)^{2} - \widehat{\beta}_{1}{\cdot}0 + {\sum}_{i=1}^{n}\left(\widehat{y}_{i} - \overline{y}\right)^{2}\\ &= {\sum}_{i=1}^{n}\left({y}_{i} - \widehat{y}_{i}\right)^{2} + {\sum}_{i=1}^{n}\left(\widehat{y}_{i} - \overline{y}\right)^{2} \end{align} \]

Here \(\widehat{\varepsilon}_{i}={y}_{i}-\widehat{y}_{i}\) are the residuals; the cross terms vanish because the normal equations imply \({\sum}_{i=1}^{n}\widehat{\varepsilon}_{i}=0\) and \({\sum}_{i=1}^{n}\widehat{\varepsilon}_{i}{x}_{i}=0\).

\[ \begin{align} {SC}_{T} &= {SC}_{E} + {SC}_{R} \end{align} \]

\[ \begin{align} {SC}_{T} &= {\sum}_{i=1}^{n}\left({y}_{i} - \overline{y}\right)^{2} \end{align} \]

\[ \begin{align} {SC}_{E} &= {\sum}_{i=1}^{n}\left[{y}_{i} - \left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i}\right)\right]^{2} \end{align} \]

\[ \begin{align} {SC}_{R} &= {\sum}_{i=1}^{n}\left[\left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i}\right) - \overline{y}\right]^{2} \end{align} \]

Under the normality assumption on the errors (and, for \({SC}_{T}\) and \({SC}_{R}\), under \({H}_{0}:{\beta}_{1}=0\)), the sums of squares scaled by \({\sigma}^{2}\) follow chi-squared distributions with the indicated degrees of freedom:

\[ \begin{align} \frac{{SC}_{T}}{{\sigma}^{2}} = \frac{{\sum}_{i=1}^{n}\left({y}_{i} - \overline{y}\right)^{2}}{{\sigma}^{2}} &{\sim} \chi_{(n-1)}^{2}, & {gl}_{T} = n-1 \end{align} \]

\[ \begin{align} \frac{{SC}_{E}}{{\sigma}^{2}} = \frac{{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}}{{\sigma}^{2}} &{\sim} \chi_{(n-2)}^{2}, & {gl}_{E} = n-2 \end{align} \]

\[ \begin{align} \frac{{SC}_{R}}{{\sigma}^{2}} = \frac{{SC}_{T} - {SC}_{E}}{{\sigma}^{2}} &{\sim} \chi_{[(n - 1) - (n - 2)]}^{2} = \chi_{(1)}^{2}, & {gl}_{R} = 1 \end{align} \]

\[ \begin{align} \frac{{SC}_{R}}{{\sigma}^{2}} = \frac{{\sum}_{i=1}^{n}\left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i} - \overline{y}\right)^{2}}{{\sigma}^{2}} &{\sim} \chi_{(1)}^{2} \end{align} \]

The ratio of the mean squares is therefore an \(F\) statistic:

\[ \begin{align} \frac{\frac{{SC}_{R}}{{gl}_{R}}}{\frac{{SC}_{E}}{{gl}_{E}}} = \frac{\frac{{\sum}_{i=1}^{n}\left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i} - \overline{y}\right)^{2}}{1}}{\frac{{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}}{n-2}} &{\sim} F_{(1,n-2)} \end{align} \]

2.3.2 Hypotheses

\[ \begin{align} {H}_{0}: {y}_{i} = {\beta}_{0} + {\varepsilon}_{i} &{\text{ versus }} {H}_{1}: {y}_{i} = {\beta}_{0} + {\beta}_{1}{x}_{i} + {\varepsilon}_{i}\\ {H}_{0}: {\beta}_{1} = {0} &{\text{ versus }} {H}_{1}: {\beta}_{1} \not= {0} \end{align} \]

2.3.3 Analysis of variance

| Source | df | Sum of squares | Mean square | \(F_{(1,n-2)}\) |
|---|---|---|---|---|
| Regression | \(1\) | \(SC_R={\sum}_{i=1}^{n}\left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i} - \overline{y}\right)^{2}\) | \(CM_R=\frac{SC_R}{1}\) | \(\frac{CM_R}{CM_E}\) |
| Error | \(n-2\) | \(SC_E={\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}\) | \(CM_E=\frac{SC_E}{n-2}\) | |
| Total | \(n-1\) | \(SC_T={\sum}_{i=1}^{n}\left({y}_{i} - \overline{y}\right)^{2}\) | \(CM_T=\frac{SC_T}{n-1}\) | |
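In R this table is produced by anova() applied to a fitted lm object; a sketch with the simulated datos, which also verifies the decomposition \({SC}_{T}={SC}_{R}+{SC}_{E}\):

fit <- lm(y ~ x, data = datos)
anova(fit)                        # rows: x (regression, 1 df) and Residuals (n - 2 df)
# The sums of squares add up to the total sum of squares:
sum(anova(fit)[["Sum Sq"]])
sum((datos$y - mean(datos$y))^2)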

2.4 Goodness of fit of the model

2.4.1 Coefficient of determination

\[ \begin{align} R^{2} &= \frac{SC_R}{SC_T}\\ &= \frac{{\sum}_{i=1}^{n}\left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i} - \overline{y}\right)^{2}}{{\sum}_{i=1}^{n}\left({y}_{i} - \overline{y}\right)^{2}} \end{align} \]

2.4.2 Adjusted coefficient of determination

\[ \begin{align} R_{a}^{2} &= 1 - \frac{\frac{SC_E}{n-(k+1)}}{\frac{SC_T}{n-1}}\\ &= 1 - \frac{\frac{{\sum}_{i=1}^{n}\left({y}_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1}{x}_{i}\right)^{2}}{n-(k+1)}}{\frac{{\sum}_{i=1}^{n}\left({y}_{i} - \overline{y}\right)^{2}}{n-1}} \end{align} \]

where \(k\) is the number of explanatory variables (\(k=1\) in simple regression).
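Both coefficients are easy to compute by hand and to check against summary() (a sketch with the simulated datos; here \(k=1\)):

fit <- lm(y ~ x, data = datos)
SCT <- sum((datos$y - mean(datos$y))^2)    # total sum of squares
SCE <- sum(resid(fit)^2)                   # error sum of squares
n <- nrow(datos); k <- 1                   # one explanatory variable
R2  <- 1 - SCE / SCT                       # coefficient of determination
R2a <- 1 - (SCE / (n - (k + 1))) / (SCT / (n - 1))  # adjusted version
c(R2, summary(fit)$r.squared, R2a, summary(fit)$adj.r.squared)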

2.5 Regression in matrix terms

\[ \begin{align} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} &= \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} + \begin{bmatrix} {\varepsilon}_{1}\\ {\varepsilon}_{2}\\ \vdots\\ {\varepsilon}_{n} \end{bmatrix} \end{align} \]

2.6 Least squares

\[ \begin{align} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} - \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} &= \begin{bmatrix} {\varepsilon}_{1}\\ {\varepsilon}_{2}\\ \vdots\\ {\varepsilon}_{n} \end{bmatrix} \end{align} \]

\[ \begin{align} \begin{bmatrix} {\varepsilon}_{1}, & {\varepsilon}_{2}, & \cdots, & {\varepsilon}_{n} \end{bmatrix}\begin{bmatrix} {\varepsilon}_{1}\\ {\varepsilon}_{2}\\ \vdots\\ {\varepsilon}_{n} \end{bmatrix} &= S(\vec{\beta}) \end{align} \]

\[ \begin{align} \begin{bmatrix} \begin{bmatrix} {y}_{1}, & {y}_{2}, \cdots, & {y}_{n} \end{bmatrix} - \begin{bmatrix} {\beta}_{0}, & {\beta}_{1} \end{bmatrix}\begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \end{bmatrix} \begin{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} - \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} \end{bmatrix} &= S(\vec{\beta})\\ \begin{bmatrix} {y}_{1}, & {y}_{2}, \cdots, & {y}_{n} \end{bmatrix} \begin{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} - \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} \end{bmatrix} - \begin{bmatrix} {\beta}_{0}, & {\beta}_{1} \end{bmatrix}\begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} - \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} \end{bmatrix} &= \\ \begin{bmatrix} {y}_{1}, & {y}_{2}, \cdots, & {y}_{n} \end{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} - \begin{bmatrix} {y}_{1}, & {y}_{2}, \cdots, & {y}_{n} \end{bmatrix} \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} - \begin{bmatrix} {\beta}_{0}, & {\beta}_{1} \end{bmatrix}\begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} + \begin{bmatrix} {\beta}_{0}, & {\beta}_{1} \end{bmatrix}\begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} &= \\ \begin{bmatrix} {y}_{1}, & {y}_{2}, \cdots, & {y}_{n} \end{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} - 2\begin{bmatrix} {\beta}_{0}, & {\beta}_{1} \end{bmatrix}\begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} + \begin{bmatrix} {\beta}_{0}, & {\beta}_{1} \end{bmatrix}\begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} &= \end{align} \]

\[ \begin{align} - 2\begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} + 2\begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} {\beta}_{0}\\ {\beta}_{1} \end{bmatrix} &= \frac{\partial}{\partial\vec{\beta}}S(\vec{\beta}) \end{align} \]

\[ \begin{align} - \begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} + \begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} \widehat{\beta}_{0}\\ \widehat{\beta}_{1} \end{bmatrix} &= \begin{bmatrix} 0\\ 0 \end{bmatrix} \\ \begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix} \begin{bmatrix} 1 & {x}_{1}\\ 1 & {x}_{2}\\ \vdots & \vdots\\ 1 & {x}_{n} \end{bmatrix}\begin{bmatrix} \widehat{\beta}_{0}\\ \widehat{\beta}_{1} \end{bmatrix} &= \begin{bmatrix} 1, & 1, & \cdots, & 1\\ {x}_{1}, & {x}_{2}, & \cdots, & {x}_{n} \end{bmatrix}\begin{bmatrix} {y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n} \end{bmatrix} \\ \begin{bmatrix} n & {\sum}_{i=1}^{n}{x}_{i}\\ {\sum}_{i=1}^{n}{x}_{i} & {\sum}_{i=1}^{n}{x}_{i}^{2} \end{bmatrix}\begin{bmatrix} \widehat{\beta}_{0}\\ \widehat{\beta}_{1} \end{bmatrix} &= \begin{bmatrix} {\sum}_{i=1}^{n}{y}_{i}\\ {\sum}_{i=1}^{n}{x}_{i}{y}_{i} \end{bmatrix} \\ \frac{1}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2}-\left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}}\begin{bmatrix} {\sum}_{i=1}^{n}{x}_{i}^{2} & -{\sum}_{i=1}^{n}{x}_{i}\\ -{\sum}_{i=1}^{n}{x}_{i} & n \end{bmatrix}\begin{bmatrix} n & {\sum}_{i=1}^{n}{x}_{i}\\ {\sum}_{i=1}^{n}{x}_{i} & {\sum}_{i=1}^{n}{x}_{i}^{2} \end{bmatrix}\begin{bmatrix} \widehat{\beta}_{0}\\ \widehat{\beta}_{1} \end{bmatrix} &= \frac{1}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2}-\left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}}\begin{bmatrix} {\sum}_{i=1}^{n}{x}_{i}^{2} & -{\sum}_{i=1}^{n}{x}_{i}\\ -{\sum}_{i=1}^{n}{x}_{i} & n \end{bmatrix}\begin{bmatrix} {\sum}_{i=1}^{n}{y}_{i}\\ {\sum}_{i=1}^{n}{x}_{i}{y}_{i} \end{bmatrix} \\ \begin{bmatrix} \widehat{\beta}_{0}\\ \widehat{\beta}_{1} \end{bmatrix} &= \frac{1}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2}-\left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}}\begin{bmatrix} {\sum}_{i=1}^{n}{x}_{i}^{2}{\sum}_{i=1}^{n}{y}_{i} - {\sum}_{i=1}^{n}{x}_{i}{\sum}_{i=1}^{n}{x}_{i}{y}_{i}\\ n{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - {\sum}_{i=1}^{n}{x}_{i}{\sum}_{i=1}^{n}{y}_{i} \end{bmatrix} \\ &= \begin{bmatrix} \frac{{\sum}_{i=1}^{n}{x}_{i}^{2}{\sum}_{i=1}^{n}{y}_{i} - {\sum}_{i=1}^{n}{x}_{i}{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2} - \left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}}\\ \frac{n{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - {\sum}_{i=1}^{n}{x}_{i}{\sum}_{i=1}^{n}{y}_{i}}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2} - \left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{{\sum}_{i=1}^{n}{x}_{i}^{2}\frac{n}{n}{\sum}_{i=1}^{n}{y}_{i} - \frac{n}{n}{\sum}_{i=1}^{n}{x}_{i}{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2} - \left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}}\\ \frac{n{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - \frac{n}{n}{\sum}_{i=1}^{n}{x}_{i}\frac{n}{n}{\sum}_{i=1}^{n}{y}_{i}}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2} - \left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{{n}\overline{y}{\sum}_{i=1}^{n}{x}_{i}^{2} - {n}\overline{x}{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2}-\left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}}\\ \frac{n{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - {n}\overline{x}{n}\overline{y}}{{n}{\sum}_{i=1}^{n}{x}_{i}^{2} - \left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\overline{y}{\sum}_{i=1}^{n}{x}_{i}^{2} - 
\overline{x}{\sum}_{i=1}^{n}{x}_{i}{y}_{i}}{{\sum}_{i=1}^{n}{x}_{i}^{2} - \frac{1}{n}\left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}}\\ \frac{{\sum}_{i=1}^{n}{x}_{i}{y}_{i} - {n}\overline{x}\overline{y}}{{\sum}_{i=1}^{n}{x}_{i}^{2}-\frac{1}{n}\left({{\sum}_{i=1}^{n}{x}_{i}}\right)^{2}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\overline{y}{\sum}_{i=1}^{n}{x}_{i}^{2} - \overline{x}({S}_{x,y} + {n}\overline{x}\overline{y})}{{S}_{x,x}}\\ \frac{{S}_{x,y}}{{S}_{x,x}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\overline{y}{\sum}_{i=1}^{n}{x}_{i}^{2}}{{S}_{x,x}} - \frac{\overline{x}{S}_{x,y} + {n}\overline{y}\overline{x}^{2}}{{S}_{x,x}}\\ \frac{{S}_{x,y}}{{S}_{x,x}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\overline{y}{\sum}_{i=1}^{n}{x}_{i}^{2}}{{S}_{x,x}} - \frac{\overline{x}{S}_{x,y}}{{S}_{x,x}} - \frac{{n}\overline{y}\overline{x}^{2}}{{S}_{x,x}}\\ \frac{{S}_{x,y}}{{S}_{x,x}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\overline{y}{\sum}_{i=1}^{n}{x}_{i}^{2}}{{S}_{x,x}} - \frac{{n}\overline{y}\overline{x}^{2}}{{S}_{x,x}} - \frac{\overline{x}{S}_{x,y}}{{S}_{x,x}}\\ \frac{{S}_{x,y}}{{S}_{x,x}} \end{bmatrix} \\ &= \begin{bmatrix} \overline{y}\frac{{\sum}_{i=1}^{n}{x}_{i}^{2}}{{S}_{x,x}} - \overline{y}\frac{{n}\overline{x}^{2}}{{S}_{x,x}} - \overline{x}\frac{{S}_{x,y}}{{S}_{x,x}}\\ \frac{{S}_{x,y}}{{S}_{x,x}} \end{bmatrix} \\ &= \begin{bmatrix} \overline{y}\left(\frac{{\sum}_{i=1}^{n}{x}_{i}^{2} - {n}\overline{x}^{2}}{{S}_{x,x}}\right) - \frac{{S}_{x,y}}{{S}_{x,x}}\overline{x}\\ \frac{{S}_{x,y}}{{S}_{x,x}} \end{bmatrix} \\ &= \begin{bmatrix} \overline{y} - \frac{{S}_{x,y}}{{S}_{x,x}}\overline{x}\\ \frac{{S}_{x,y}}{{S}_{x,x}} \end{bmatrix} \end{align} \]
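The same normal equations can be solved numerically in R (a sketch with the simulated datos): build the design matrix with a column of ones and solve \(\left(\boldsymbol{X}^{t}\boldsymbol{X}\right)\vec{\beta}=\boldsymbol{X}^{t}\vec{y}\).

X <- cbind(1, datos$x)              # design matrix: intercept column plus x
y <- datos$y
solve(t(X) %*% X, t(X) %*% y)       # solves the normal equations (X'X) b = X'y
coef(lm(y ~ x, data = datos))       # same estimates from lm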

2.6.1 Weighted least squares

In this particular case, the objective is to minimize the following function

\[ \begin{align} {\sum}_{i=1}^{n}{\omega}_{i}\left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i} - {y}_{i}\right)^{2} \end{align} \]

The most common choice is to set the weights \({\omega}_{i}\) as the inverse of the variance associated with each observation; in practice, however, that variance is unknown, so it is instead assumed to be proportional to the value \({x}_{i}\):

\[ \begin{align} {\omega}_{i} &= \frac{k}{c{\cdot}{x}_{i}} \end{align} \]

\[ \begin{align} {\omega}_{i} &{\approxeq} \frac{k}{c{\cdot}{x}_{i}^{\alpha}} \end{align} \]

\[ \begin{align} {\arg{\min}}_{{\beta}_{0}, {\beta}_{1}}{{\sum}_{i=1}^{n}{\omega}_{i}\left(\widehat{\beta}_{0} + \widehat{\beta}_{1}{x}_{i} - {y}_{i}\right)^{2} } \end{align} \]

\[ \begin{align} \widehat{\beta}_{0} &= \frac{\left({\sum}_{i=1}^{n}{\omega}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)}{\left({\sum}_{i=1}^{n}{\omega}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)^{2}} \end{align} \]

\[ \begin{align} \widehat{\beta}_{1} &= \frac{\left({\sum}_{i=1}^{n}{\omega}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}{y}_{i}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)}{\left({\sum}_{i=1}^{n}{\omega}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)^{2}} \end{align} \]

If it is assumed that \({\sum}_{i=1}^{n}{\omega}_{i}=1\), then:

\[ \begin{align} \widehat{\beta}_{0} &= \frac{\left({\sum}_{i=1}^{n}{\omega}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)}{\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)^{2}} \end{align} \]

\[ \begin{align} \widehat{\beta}_{1} &= \frac{\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}{y}_{i}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)}{\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)^{2}} \end{align} \]

From the above, for

\[ \begin{align} \hat{x} &= {\sum}_{i=1}^{n}{\omega}_{i}{x}_{i} \end{align} \]

it follows that

\[ \begin{align} \widehat{y}_{i} &= \widehat{\beta}_{0} + \widehat{\beta}_{1}\hat{x}\\ &= \frac{\left({\sum}_{i=1}^{n}{\omega}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)}{\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)^{2}} + \frac{\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}{y}_{i}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{y}_{i}\right)\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)}{\left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}^{2}\right) - \left({\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\right)^{2}}{\sum}_{i=1}^{n}{\omega}_{i}{x}_{i}\\ &= {\sum}_{i=1}^{n}{\omega}_{i}{y}_{i} \end{align} \]
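In R, weighted least squares is fitted through the weights argument of lm. A sketch with hypothetical data whose error variance grows with \({x}_{i}\), using weights \({\omega}_{i}=1/{x}_{i}\):

set.seed(2)
xw <- runif(50, min = 1, max = 10)            # positive regressor (hypothetical)
yw <- 2 + 3 * xw + rnorm(50, sd = sqrt(xw))   # error variance proportional to x
lm(yw ~ xw, weights = 1 / xw)                 # weights: inverse of the assumed variance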

2.6.2 Generalized linear models

Components of a generalized linear model (an illustrative glm() sketch follows this list):

  • Random component: identifies the response variable and its probability distribution

\[g\left({y}_{i} | {\theta}_{i}\right) = a\left({\theta}_{i}\right)b\left({y}_{i}\right)\exp{\left\{{y}_{i}Q\left({\theta}_{i}\right)\right\}}\]

  • Systematic component: specifies the explanatory variables that enter the linear predictor

\[{\beta}_{0}+{\beta}_{1}{x}_{1}\]

  • Link function: a function of the expected value of \(\vec{y}\), expressed as a linear combination of the predictor variables

\[g\left[E({y})\right] = g\left({\mu}\right) = {\beta}_{0}+{\beta}_{1}{x}_{1}\]
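In R these three components are specified through glm(): the family argument fixes the random component and the link, and the formula gives the systematic component. An illustrative sketch with simulated Poisson counts and a log link (hypothetical data, not part of the examples below):

set.seed(3)
x1 <- rnorm(100)
mu <- exp(0.5 + 0.8 * x1)                     # inverse link applied to the linear predictor
y1 <- rpois(100, lambda = mu)                 # random component: Poisson counts
glm(y1 ~ x1, family = poisson(link = "log"))  # link: g(mu) = log(mu)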

2.7 Implementation in R

2.7.1 Linear regression equation

\[gastos=\beta_0 + \beta_1{\times}ingresos+error\]

\[error{\sim}N(0,\sigma_{error}^2)\]

\[y=m{\times}x+b+error\]

2.7.1.1 Simulating one hundred thousand incomes

ingresos <- as.data.frame(877802 * rgamma(n=100000, 87))
colnames(ingresos) <- "ingresos"
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ tidyr::pack()   masks Matrix::pack()
## ✖ tidyr::unpack() masks Matrix::unpack()
ingresos %>% 
  ggplot(aes(ingresos, fill=cut(ingresos, 100))) + geom_histogram(bins=sqrt(nrow(ingresos)), show.legend=FALSE)

gastos <- rnorm(n=100000, mean=400000, sd=80000) + 0.3 * (ingresos + rnorm(n=100000, mean=219450.5, sd=438901))
colnames(gastos) <- "gastos"
library(tidyverse)
gastos %>% 
  ggplot(aes(gastos, fill=cut(gastos, 100))) + geom_histogram(bins=sqrt(nrow(gastos)), show.legend=FALSE)

2.7.1.2 Creating a data frame with the data

data <- as.data.frame(cbind(ingresos, gastos))

2.7.1.3 Drawing a random sample of size 40000

sample <- data %>% 
  sample_n(size=40000)

2.7.1.4 Scatter plot of ingresos versus gastos

library(tidyverse)
sample %>% 
  ggplot(aes(x=ingresos, y=gastos, color=ingresos)) +
  geom_point(shape=16, show.legend=FALSE) +
  scale_color_gradient(low="#32aeff", high="#f2aeff")

2.7.1.5 Fitting a regression model

modelo.1 <- lm(gastos~ingresos, data=sample)
summary(modelo.1)
## 
## Call:
## lm(formula = gastos ~ ingresos, data = sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -680682 -103413     120  103707  633250 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.706e+05  7.233e+03   65.06   <2e-16 ***
## ingresos    2.999e-01  9.417e-05 3185.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 153500 on 39998 degrees of freedom
## Multiple R-squared:  0.9961, Adjusted R-squared:  0.9961 
## F-statistic: 1.015e+07 on 1 and 39998 DF,  p-value: < 2.2e-16

\[ \begin{align} {gastos}_{i}&=\widehat{\beta}_0+\widehat{\beta}_1{\times}{ingresos}_{i}+{error}_{i}\\ &=4.7057583\times 10^{5}+0.2999475{\times}{ingresos}_{i}+{error}_{i} \end{align} \]

\[ {error}_{i}{\sim}N(0,\sigma_{error}^{2}) \]

\[ \begin{align} \widehat{gastos}_{i}&=\widehat{\beta}_0+\widehat{\beta}_1{\times}{ingresos}_{i}\\ &=4.7057583\times 10^{5}+0.2999475{\times}{ingresos}_{i} \end{align} \]

\[ {error}_{i}={gastos}_{i}-\widehat{gastos}_{i} \]
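The residuals can be extracted directly or recomputed by hand from this definition (a sketch):

head(residuals(modelo.1))
head(sample$gastos - predict(modelo.1))   # same values by the definition above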

modelo.1$coefficients
##  (Intercept)     ingresos 
## 4.705758e+05 2.999475e-01

2.7.1.6 Distribution of the regression coefficients

\[ t=\frac{\widehat{\beta}_i-{\beta}_i}{\widehat{\sigma}_{\widehat{\beta}_i}}{\sim}t_{\left[n-2\right]} \]

where \(n-2\) is the residual degrees of freedom of the simple model (two coefficients are estimated).

2.7.1.7 Confidence interval

\[ \begin{align} 1-\alpha&=P\left(-t_{\left[n-2,1-\frac{\alpha}{2}\right]}{\leq}t=\frac{\widehat{\beta}_i-{\beta}_i}{\widehat{\sigma}_{\widehat{\beta}_i}}{\leq}+t_{\left[n-2,1-\frac{\alpha}{2}\right]}\right)\\ &=P\left(-t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}{\leq}\widehat{\beta}_i-{\beta}_i{\leq}+t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}\right)\\ &=P\left(-t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}-\widehat{\beta}_i{\leq}-{\beta}_i{\leq}+t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}-\widehat{\beta}_i\right)\\ &=P\left(t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}+\widehat{\beta}_i{\geq}{\beta}_i{\geq}-t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}+\widehat{\beta}_i\right)\\ &=P\left(\widehat{\beta}_i-t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}{\leq}{\beta}_i{\leq}\widehat{\beta}_i+t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}\right)\\ \end{align} \]

\[IC_{1-\alpha}=\left(\widehat{\beta}_i-t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i};\widehat{\beta}_i+t_{\left[n-2,1-\frac{\alpha}{2}\right]}\widehat{\sigma}_{\widehat{\beta}_i}\right)\]

2.7.1.8 Confidence intervals for the model parameters

confint(modelo.1)
##                    2.5 %       97.5 %
## (Intercept) 4.563995e+05 4.847522e+05
## ingresos    2.997629e-01 3.001320e-01
2.7.1.8.1 Confidence interval and hypothesis test for the intercept
confint(modelo.1)[1,]
##    2.5 %   97.5 % 
## 456399.5 484752.2

\[ \begin{align} IC_{1-\alpha}&=\left(4.7057583\times 10^{5}-1.9600233\widehat{\sigma}_{\widehat{\beta}_i};4.7057583\times 10^{5}+1.9600233\widehat{\sigma}_{\widehat{\beta}_i}\right)\\ &=\left(4.7057583\times 10^{5}-1.9600233{\times}7232.7511007;4.7057583\times 10^{5}+1.9600233{\times}7232.7511007\right) \end{align} \]

resumen <- summary(modelo.1)  # summary object holding the coefficient table
c(resumen$coefficients[1,1])+
  c(-qt(0.975,nrow(sample)-2)*resumen$coefficients[1,2],+qt(0.975,nrow(sample)-2)*resumen$coefficients[1,2])
## [1] 456399.5 484752.2
2.7.1.8.2 Confidence interval and hypothesis test for the slope
confint(modelo.1)[2,]
##     2.5 %    97.5 % 
## 0.2997629 0.3001320

\[ \begin{align} IC_{1-\alpha}&=\left(0.2999475-1.9600233\widehat{\sigma}_{\widehat{\beta}_i};0.2999475+1.9600233\widehat{\sigma}_{\widehat{\beta}_i}\right)\\ &=\left(0.2999475-1.9600233{\times}9.416562\times 10^{-5};0.2999475+1.9600233{\times}9.416562\times 10^{-5}\right) \end{align} \]

c(resumen$coefficients[2,1])+
  c(-qt(0.975,nrow(sample)-2)*resumen$coefficients[2,2],+qt(0.975,nrow(sample)-2)*resumen$coefficients[2,2])
## [1] 0.2997629 0.3001320

2.7.1.9 Hypotheses

2.7.1.9.1 Null hypothesis

\[H_0:{\beta}_i=0\]

2.7.1.9.2 Under the null hypothesis

\[ t=\frac{\widehat{\beta}_i-0}{\widehat{\sigma}_{\widehat{\beta}_i}}{\sim}t_{\left[n-2\right]} \]

2.7.1.9.3 Alternative hypothesis

\[H_1:{\beta}_i{\neq}0\]

2.7.1.9.4 Conclusion
2.7.1.9.4.1 For the intercept \(\beta_0\)
## [1] "RECHAZO LA HIPÓTESIS NULA DE QUE EL INTERCEPTO ES IGUAL A CERO EN FAVOR DE QUE ES DISTINTO DE CERO"
2.7.1.9.4.2 For the slope \(\beta_1\)
## [1] "RECHAZO LA HIPÓTESIS NULA DE QUE LA PENDIENTE ES IGUAL A CERO EN FAVOR DE QUE ES DISTINTO DE CERO"

2.7.1.10 Model assumptions

2.7.1.10.1 Residuals
ggplot(sample, aes(x=ingresos, y=residuals(modelo.1), color=abs(residuals(modelo.1)))) +
    geom_point() + geom_smooth(formula=y~x, color="blue", method="loess") +
    theme(legend.position="bottom")
## Warning: Computation failed in `stat_smooth()`:
## workspace required (2400430050) is too large probably because of setting 'se = TRUE'.

2.7.1.10.2 Outliers
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_plot_cooksd_bar(modelo.1)

2.7.1.10.3 Checking the assumptions
library(ggfortify)
autoplot(modelo.1)

2.7.1.11 Multicollinearity

# The variance inflation factor (VIF) requires at least two predictors,
# so it cannot be computed for this simple model with a single regressor:
#library(regclass)
#VIF(modelo.1)

2.7.1.12 Fitted regression line

library(tidyverse)
sample %>% 
  ggplot(aes(x=ingresos, y=gastos, color=ingresos)) +
  geom_point(shape=16, show.legend=FALSE) +
  scale_color_gradient(low="#32aeff", high="#f2aeff") +
  geom_smooth(formula=y~x,method=lm,  linetype="dashed", 
             color="darkgray", fill="blue")

2.7.1.13 Fit criteria

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
## 
##     cement
## The following object is masked from 'package:dplyr':
## 
##     select
summary(stepAIC(modelo.1, trace=0))
## 
## Call:
## lm(formula = gastos ~ ingresos, data = sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -680682 -103413     120  103707  633250 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.706e+05  7.233e+03   65.06   <2e-16 ***
## ingresos    2.999e-01  9.417e-05 3185.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 153500 on 39998 degrees of freedom
## Multiple R-squared:  0.9961, Adjusted R-squared:  0.9961 
## F-statistic: 1.015e+07 on 1 and 39998 DF,  p-value: < 2.2e-16
library(MASS)
AIC(modelo.1)
## [1] 1068852

2.7.2 Alligators: weight and snout vent length

These data come from a study in central Florida in which 15 alligators were captured and two measurements were taken on each animal. The weight (in pounds) was recorded along with the snout vent length (in inches; this is the distance between the back of the head and the end of the nose).

alligator = data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
    3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
    3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
library(lattice)

As with most analyses, the first step is to explore the data to get a visual impression of whether there is a relationship between weight and snout vent length and what form it is likely to take. We create a scatter plot of the data as follows:

xyplot(lnWeight ~ lnLength, data = alligator,
  xlab = "Snout vent length (inches) on log scale",
  ylab = "Weight (pounds) on log scale",
  main = "Alligators in Central Florida"
)

The plot suggests that weight (on the log scale) increases linearly with snout vent length (again on the log scale), so we will fit a simple linear regression model to the data and save the fitted model to an object for further analysis:

alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)

The lm function fits a linear model to the data. We specify the model using a formula in which the response variable appears on the left-hand side, separated by a tilde (~) from the explanatory variables. The formula provides a flexible way of specifying several different functional forms for the relationship. The data argument tells R where to look for the variables used in the formula.

Now that the model is saved as an object, we can use some of the general-purpose functions to extract information from it about the linear model, e.g. the parameters or the residuals. A great advantage of R is that functions with the same name, such as summary, are defined for different types of models, and the system determines which function we intend to use based on the type of the saved object. To create a summary of the fitted model:

summary(alli.mod1)
## 
## Call:
## lm(formula = lnWeight ~ lnLength, data = alligator)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24348 -0.03186  0.03740  0.07727  0.12669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.4761     0.5007  -16.93 3.08e-10 ***
## lnLength      3.4311     0.1330   25.80 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1229 on 13 degrees of freedom
## Multiple R-squared:  0.9808, Adjusted R-squared:  0.9794 
## F-statistic: 665.8 on 1 and 13 DF,  p-value: 1.495e-12

Here we obtain a lot of useful information.

The estimate of the model intercept is -8.4761 and the coefficient measuring the slope of the relationship with snout vent length is 3.4311; information on the standard errors of these estimates is also provided in the Coefficients table. The significance tests for the model coefficients are summarized in that table as well, so we can see that there is strong evidence that the coefficient is significantly different from zero: as snout vent length increases, so does weight.

Rather than stopping here, we carry out some investigation using residual diagnostics to determine whether the various assumptions underpinning linear regression are reasonable for our data, or whether there is evidence suggesting that additional variables or other alterations to the model are needed to better describe what determines how weight changes.

A plot of the residuals against the fitted values is used to determine whether there are systematic patterns, such as overestimation for most large values or increasing spread as the fitted values of the model increase. To create this plot we could use the following code:

xyplot(resid(alli.mod1) ~ fitted(alli.mod1),
  xlab = "Fitted Values",
  ylab = "Residuals",
  main = "Residual Diagnostic Plot",
  panel = function(x, y, ...)
  {
    panel.grid(h = -1, v = -1)
    panel.abline(h = 0)
    panel.xyplot(x, y, ...)
  }
)

We create our own custom panel function using the building blocks provided by the lattice package. We start by creating a set of grid lines as the base layer; h = -1 and v = -1 tell lattice to align them with the axis labels. We then draw a solid horizontal line to help distinguish between positive and negative residuals. Finally, the points are plotted on the top layer.

2.7.2.1 Residual diagnostic plot

The plot is probably fine, although there are more positive residuals than negative ones, and when we examine a normal probability plot we see some shortcomings of the model:

qqmath( ~ resid(alli.mod1),
  xlab = "Theoretical Quantiles",
  ylab = "Residuals"
)

The resid function extracts the model residuals from the fitted model object.

2.7.3 Boston housing: factors affecting house prices

The Boston housing data are a data set in the MASS package, with 506 rows and 14 columns. We analyze and evaluate the factors affecting the median value of owner-occupied homes in the suburbs of Boston; the factors include variables on structural quality, neighborhood, accessibility, and air pollution, such as the per capita crime rate by town, the proportion of non-retail business acres per town, an index of accessibility to radial highways, and so on.

library(MASS)
library(ggplot2)
attach(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
##Sample the dataset. The return for this is row nos.
set.seed(1)
row.number <- sample(1:nrow(Boston), 0.8*nrow(Boston))
train = Boston[row.number,]
test = Boston[-row.number,]
dim(train)
## [1] 404  14
dim(test)
## [1] 102  14
##Explore the data.
ggplot(Boston, aes(medv)) + geom_density(fill="blue")

ggplot(train, aes(log(medv))) + geom_density(fill="blue")

ggplot(train, aes(sqrt(medv))) + geom_density(fill="blue")

2.7.3.1 Building model 1

#Let's make default model.
model1 = lm(log(medv)~., data=train)
summary(model1)
## 
## Call:
## lm(formula = log(medv) ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73932 -0.09713 -0.01923  0.08883  0.86529 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.133e+00  2.370e-01  17.438  < 2e-16 ***
## crim        -1.166e-02  1.636e-03  -7.123 5.14e-12 ***
## zn           1.116e-03  6.129e-04   1.821  0.06941 .  
## indus        2.134e-03  2.718e-03   0.785  0.43286    
## chas         1.084e-01  3.797e-02   2.854  0.00454 ** 
## nox         -7.142e-01  1.727e-01  -4.135 4.35e-05 ***
## rm           8.303e-02  1.907e-02   4.353 1.72e-05 ***
## age         -9.102e-05  5.898e-04  -0.154  0.87743    
## dis         -5.104e-02  9.132e-03  -5.589 4.29e-08 ***
## rad          1.645e-02  2.885e-03   5.700 2.36e-08 ***
## tax         -7.018e-04  1.624e-04  -4.322 1.96e-05 ***
## ptratio     -3.593e-02  6.048e-03  -5.941 6.29e-09 ***
## black        4.138e-04  1.201e-04   3.447  0.00063 ***
## lstat       -2.957e-02  2.238e-03 -13.213  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1921 on 390 degrees of freedom
## Multiple R-squared:  0.7914, Adjusted R-squared:  0.7844 
## F-statistic: 113.8 on 13 and 390 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model1)

2.7.3.2 Building model 2

# remove the less significant feature
model2 = update(model1, ~.-zn-indus-age) 
summary(model2) 
## 
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73727 -0.10583 -0.02177  0.09436  0.86896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.1511772  0.2361239  17.581  < 2e-16 ***
## crim        -0.0114749  0.0016304  -7.038 8.74e-12 ***
## chas         0.1098383  0.0377278   2.911 0.003804 ** 
## nox         -0.7160222  0.1599660  -4.476 9.97e-06 ***
## rm           0.0854763  0.0184393   4.636 4.85e-06 ***
## dis         -0.0450161  0.0073599  -6.116 2.31e-09 ***
## rad          0.0156919  0.0027803   5.644 3.19e-08 ***
## tax         -0.0006071  0.0001455  -4.171 3.74e-05 ***
## ptratio     -0.0390424  0.0056372  -6.926 1.78e-11 ***
## black        0.0004127  0.0001198   3.445 0.000632 ***
## lstat       -0.0294784  0.0021172 -13.923  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1923 on 393 degrees of freedom
## Multiple R-squared:  0.7894, Adjusted R-squared:  0.784 
## F-statistic: 147.3 on 10 and 393 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2)

2.7.3.3 Plots of the residuals against each predictor

##Plot the residual plot with all predictors.
attach(train)
## The following objects are masked from Boston:
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
require(gridExtra)
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
plot1 <- ggplot(train, aes(crim, residuals(model2))) + geom_point() + geom_smooth()
# chas is a 0/1 indicator, so the loess smooth degenerates on it;
# this is what triggers the simpleLoess warnings below
plot2 <- ggplot(train, aes(chas, residuals(model2))) + geom_point() + geom_smooth()
plot3 <- ggplot(train, aes(nox, residuals(model2))) + geom_point() + geom_smooth()
plot4 <- ggplot(train, aes(rm, residuals(model2))) + geom_point() + geom_smooth()
plot5 <- ggplot(train, aes(dis, residuals(model2))) + geom_point() + geom_smooth()
plot6 <- ggplot(train, aes(rad, residuals(model2))) + geom_point() + geom_smooth()
plot7 <- ggplot(train, aes(tax, residuals(model2))) + geom_point() + geom_smooth()
plot8 <- ggplot(train, aes(ptratio, residuals(model2))) + geom_point() + geom_smooth()
plot9 <- ggplot(train, aes(black, residuals(model2))) + geom_point() + geom_smooth()
plot10 <- ggplot(train, aes(lstat, residuals(model2))) + geom_point() + geom_smooth()
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, plot8, plot9, plot10, ncol = 2, nrow = 5)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at -0.005
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 2.5e-05
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -0.005
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.005
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1.01
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf en llamada a una función externa (arg 5)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

2.7.3.4 Building models 3 and 4

# Add a squared term for each predictor to the reduced model
# (chas is a 0/1 dummy, so I(chas^2) duplicates chas and is dropped
# as singular in the summary below)
model3 <- lm(log(medv) ~ crim + chas + nox + rm + dis + rad + tax + ptratio +
               black + lstat + I(crim^2) + I(chas^2) + I(nox^2) + I(rm^2) +
               I(dis^2) + I(rad^2) + I(tax^2) + I(ptratio^2) + I(black^2) +
               I(lstat^2), data = train)
summary(model3)
summary(model3)
## 
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat + I(crim^2) + I(chas^2) + I(nox^2) + 
##     I(rm^2) + I(dis^2) + I(rad^2) + I(tax^2) + I(ptratio^2) + 
##     I(black^2) + I(lstat^2), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58373 -0.08611 -0.01228  0.08528  0.77344 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.273e+00  8.749e-01   9.456  < 2e-16 ***
## crim         -3.291e-02  4.505e-03  -7.306 1.61e-12 ***
## chas          1.124e-01  3.223e-02   3.487 0.000546 ***
## nox          -6.286e-01  1.074e+00  -0.585 0.558693    
## rm           -8.026e-01  1.324e-01  -6.063 3.20e-09 ***
## dis          -1.202e-01  2.452e-02  -4.900 1.41e-06 ***
## rad           1.628e-02  9.436e-03   1.726 0.085217 .  
## tax          -3.393e-04  5.300e-04  -0.640 0.522477    
## ptratio      -1.592e-01  7.163e-02  -2.222 0.026843 *  
## black         1.314e-03  5.115e-04   2.568 0.010594 *  
## lstat        -5.419e-02  5.487e-03  -9.876  < 2e-16 ***
## I(crim^2)     2.961e-04  6.690e-05   4.426 1.25e-05 ***
## I(chas^2)            NA         NA      NA       NA    
## I(nox^2)     -2.450e-01  8.002e-01  -0.306 0.759664    
## I(rm^2)       6.752e-02  1.036e-02   6.520 2.22e-10 ***
## I(dis^2)      6.899e-03  1.936e-03   3.564 0.000411 ***
## I(rad^2)      2.739e-04  3.730e-04   0.734 0.463258    
## I(tax^2)     -4.613e-07  6.474e-07  -0.712 0.476601    
## I(ptratio^2)  3.751e-03  2.040e-03   1.839 0.066742 .  
## I(black^2)   -2.355e-06  1.129e-06  -2.085 0.037695 *  
## I(lstat^2)    7.380e-04  1.520e-04   4.854 1.77e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1627 on 384 degrees of freedom
## Multiple R-squared:  0.8526, Adjusted R-squared:  0.8453 
## F-statistic: 116.9 on 19 and 384 DF,  p-value: < 2.2e-16
# Remove the terms that contributed little in model 3
model4 <- update(model3, ~ . - nox - rad - tax - I(crim^2) - I(chas^2) -
                   I(rad^2) - I(tax^2) - I(ptratio^2) - I(black^2))
summary(model4)
summary(model4)
## 
## Call:
## lm(formula = log(medv) ~ crim + chas + rm + dis + ptratio + black + 
##     lstat + I(nox^2) + I(rm^2) + I(dis^2) + I(lstat^2), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75555 -0.08920 -0.00584  0.08572  0.83906 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.8984602  0.4469706  15.434  < 2e-16 ***
## crim        -0.0113984  0.0013782  -8.271 2.10e-15 ***
## chas         0.1335828  0.0340957   3.918 0.000105 ***
## rm          -0.9135506  0.1388913  -6.577 1.53e-10 ***
## dis         -0.0771922  0.0230393  -3.350 0.000885 ***
## ptratio     -0.0210271  0.0049197  -4.274 2.41e-05 ***
## black        0.0002769  0.0001078   2.568 0.010585 *  
## lstat       -0.0506485  0.0056777  -8.921  < 2e-16 ***
## I(nox^2)    -0.5290802  0.1127763  -4.691 3.75e-06 ***
## I(rm^2)      0.0778438  0.0108068   7.203 3.03e-12 ***
## I(dis^2)     0.0038669  0.0019568   1.976 0.048840 *  
## I(lstat^2)   0.0005754  0.0001559   3.691 0.000255 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1747 on 392 degrees of freedom
## Multiple R-squared:  0.8266, Adjusted R-squared:  0.8217 
## F-statistic: 169.9 on 11 and 392 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model4)

2.7.3.5 Prediction

# Predict on the test set; predictions are on the log(medv) scale
pred1 <- predict(model4, newdata = test)
# Back-transform with exp() before computing the test RMSE
rmse <- sqrt(sum((exp(pred1) - test$medv)^2) / length(test$medv))
c(RMSE = rmse, R2 = summary(model4)$r.squared)
##      RMSE        R2 
## 4.8235100 0.8265999
par(mfrow=c(1,1))
plot(test$medv, exp(pred1))

2.7.3.6 Conclusion

This example shows one way to approach linear regression modeling. The model still has room for improvement: techniques such as outlier detection and correlation screening could be applied to further sharpen the predictions. More advanced techniques, such as random forests and boosting, can also be tried to check whether the accuracy can be improved further. One caveat is that we should refrain from overfitting the training data, since an overfitted model's accuracy drops on the test data.
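As a hedged illustration of that last remark, the sketch below (assuming the same train/test split used above and the randomForest package; rf_fit and rf_pred are our names, not part of the original analysis) checks whether a random forest lowers the test RMSE relative to model4:

library(randomForest)
set.seed(1)
# Fit on all predictors except the response, keeping the log scale used above
rf_fit <- randomForest(x = train[, setdiff(names(train), "medv")],
                       y = log(train$medv), ntree = 500)
rf_pred <- predict(rf_fit, newdata = test[, setdiff(names(test), "medv")])
sqrt(mean((exp(rf_pred) - test$medv)^2))  # back-transform before computing RMSE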

3 Multiple linear regression

3.1 Introduction and model formulation

library(statsr)
# Simulated toy data used to illustrate the regression plane below
datos <- data.frame(x1 = rnorm(10), x2 = rnorm(10), y = rnorm(10))

3.2 Estimation of the model parameters

In statistics, linear regression (or linear fitting) is a mathematical model used to approximate the relationship between an explained variable \(\vec{y}\) and \(p\) explanatory variables \(\vec{x}_{i}\), with \(p{\in}\mathbb{Z}^{+}\), plus a random term \(\vec{\varepsilon}\). The method applies in many situations where the relationship between two or more variables is studied or a behavior is to be predicted. When no regression model can be applied in a study, it is said that there is no correlation between the variables under study.

library(scatterplot3d)
plot3d <- scatterplot3d(datos$x1, datos$x2, datos$y,
                        angle = 55, scale.y = 0.7, pch = 16,
                        color = "red", main = "Regression Plane")
my.lm <- lm(y ~ x1 + x2, data = datos)
plot3d$plane3d(my.lm, lty.box = "solid")

Given the regression model stated in its general form

\[\vec{y} = \boldsymbol{X}\vec{\beta} + \vec{\varepsilon}\]

Isolating the errors, they equal the difference

\[\vec{\varepsilon} = \vec{y} - \boldsymbol{X}\vec{\beta}\]

The idea, in order to find the best fit, is to minimize the sum of squared errors (the vertical distances from the observations to the fitted surface), so the expression to minimize is:

\[ \begin{align} S(\widehat{\vec{\beta}}) &= \widehat{\vec{\varepsilon}}^{t}\widehat{\vec{\varepsilon}}\\ &= \left(\vec{y} - \boldsymbol{X}\widehat{\vec{\beta}}\right)^{t}\left(\vec{y} - \boldsymbol{X}\widehat{\vec{\beta}}\right)\\ &= \vec{y}^{t}\left(\vec{y} - \boldsymbol{X}\widehat{\vec{\beta}}\right) - \left(\boldsymbol{X}\widehat{\vec{\beta}}\right)^{t}\left(\vec{y} - \boldsymbol{X}\widehat{\vec{\beta}}\right)\\ &= \vec{y}^{t}\vec{y} - \vec{y}^{t}\boldsymbol{X}\widehat{\vec{\beta}} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y} + \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\boldsymbol{X}\widehat{\vec{\beta}}\\ &= \vec{y}^{t}\vec{y} - 2\widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y} + \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\boldsymbol{X}\widehat{\vec{\beta}} \end{align} \]

where the last step uses that \(\vec{y}^{t}\boldsymbol{X}\widehat{\vec{\beta}}\) is a scalar and therefore equals its transpose \(\widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y}\).

3.2.1 Ordinary least squares method

\[ \begin{align} \frac{{\partial}}{{\partial}\vec{\beta}}S(\vec{\beta}) &= \frac{{\partial}}{{\partial}\vec{\beta}}\left(\vec{y}^{t}\vec{y} - 2\vec{\beta}^{t}\boldsymbol{X}^{t}\vec{y} + \vec{\beta}^{t}\boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta}\right)\\ &= -2\boldsymbol{X}^{t}\vec{y} + 2\boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta} \end{align} \]

Setting this derivative equal to zero yields:

\[ \begin{align} \frac{{\partial}}{{\partial}\vec{\beta}}S(\vec{\beta}) = 0 &\rightarrow {2\boldsymbol{X}^{t}}{\vec{y} = 2\boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta}}\\ &\rightarrow {\boldsymbol{X}^{t}}{\vec{y} = \boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta}} \end{align} \]

Note:

This has a solution if:

\[rank(\boldsymbol{X}^{t}\boldsymbol{X}|\boldsymbol{X}^{t}\vec{y})=rank(\boldsymbol{X}^{t}\boldsymbol{X})\]

This holds because of the following (a numerical check in R appears after the list):

  • \(rank(\boldsymbol{X}^{t}\boldsymbol{X}|\boldsymbol{X}^{t}\vec{y}){\geq}rank(\boldsymbol{X}^{t}\boldsymbol{X})\)

  • \(rank(\boldsymbol{X}^{t}\boldsymbol{X}|\boldsymbol{X}^{t}\vec{y})=rank\left[\boldsymbol{X}^{t}(\boldsymbol{X}|\vec{y})\right]{\leq}rank(\boldsymbol{X}){=}rank(\boldsymbol{X}^{t}\boldsymbol{X})\)
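As that check, a minimal sketch solving the normal equations directly for the simulated `datos` created earlier (the comparison against `coef()` and the names `X` and `beta_hat` are our additions; it assumes \(\boldsymbol{X}^{t}\boldsymbol{X}\) is nonsingular, as it is here):

X <- model.matrix(~ x1 + x2, data = datos)               # design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, datos$y))   # solves (X'X) beta = X'y
cbind(manual = beta_hat, lm = coef(lm(y ~ x1 + x2, data = datos)))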

3.2.2 Maximum likelihood method

\[ \begin{align} \vec{y} = \begin{bmatrix}{y}_{1}\\ {y}_{2}\\ \vdots\\ {y}_{n}\\ \end{bmatrix} &{\sim} N\left( \vec{\mu} = \begin{bmatrix} {\mu}_{1}\\ {\mu}_{2}\\ \vdots\\ {\mu}_{n}\\ \end{bmatrix},\boldsymbol{{\Sigma}} = \begin{bmatrix} {\sigma}_{1}^{2} & 0 & \cdots & 0\\ 0 & {\sigma}_{2}^{2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & {\sigma}_{n}^{2} \end{bmatrix} \right) \end{align} \]

\[ \begin{align} f\left(\vec{y};\vec{\mu},\boldsymbol{\Sigma}\right) &= \frac{1}{\left({2{\pi}}\right)^{n/2}\left|{\boldsymbol{\Sigma}}\right|^{1/2}}{\exp{\left\{-\frac{1}{2}\left({\vec{y}-\vec{\mu}}\right)^{t}{\boldsymbol{\Sigma}}^{-1}\left({\vec{y}-\vec{\mu}}\right)\right\}}} \end{align} \]

and, taking logarithms, the log-likelihood is

\[ \begin{align} l\left(\vec{y};\vec{\mu},\boldsymbol{\Sigma}\right) &= -\frac{n}{2}\log\left({2{\pi}}\right) - \frac{1}{2}\log{\left|{\boldsymbol{\Sigma}}\right|} - \frac{1}{2}\left({\vec{y}-\vec{\mu}}\right)^{t}{\boldsymbol{\Sigma}}^{-1}\left({\vec{y}-\vec{\mu}}\right) \end{align} \]

With \(\vec{\mu} = \boldsymbol{X}\vec{\beta}\) and \(\boldsymbol{\Sigma} = {\sigma}^{2}\boldsymbol{I}\), maximizing \(l\) over \(\vec{\beta}\) is equivalent to minimizing \(\left(\vec{y}-\boldsymbol{X}\vec{\beta}\right)^{t}\left(\vec{y}-\boldsymbol{X}\vec{\beta}\right)\), so the maximum likelihood estimator of \(\vec{\beta}\) coincides with the least squares estimator.
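A quick numerical sanity check of this equivalence (again with the simulated `datos`; `sigma2_ml` and `ll_manual` are our names):

fit <- lm(y ~ x1 + x2, data = datos)
n <- nobs(fit)
sigma2_ml <- sum(residuals(fit)^2) / n                   # ML variance estimate divides by n, not n - p
ll_manual <- -n / 2 * (log(2 * pi) + log(sigma2_ml) + 1)
c(manual = ll_manual, logLik = as.numeric(logLik(fit)))  # the two values should agree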

3.2.2.1 Gauss–Markov theorem

For the multiple linear regression model, the best linear unbiased estimator (BLUE) of an estimable linear parametric function \({\vec{\lambda}}^{t}\vec{\beta}\) is \({\vec{\lambda}}^{t}\widehat{\vec{\beta}}\), where \(\widehat{\vec{\beta}}\) is a solution of the normal equations

\[ \boldsymbol{X}^{t}\boldsymbol{X}\widehat{\vec{\beta}}=\boldsymbol{X}^{t}\vec{y} \]

  • Unbiasedness

In what follows, \(\boldsymbol{G}\) denotes a generalized inverse of \(\boldsymbol{X}^{t}\boldsymbol{X}\), so that \(\widehat{\vec{\beta}} = \boldsymbol{G}\boldsymbol{X}^{t}\vec{y}\), and \(\boldsymbol{H} = \boldsymbol{G}\boldsymbol{X}^{t}\boldsymbol{X}\).

\[ \begin{align} E\left[{\vec{\lambda}}^{t}\widehat{\vec{\beta}}\right] &= E\left[{\vec{\lambda}}^{t}\boldsymbol{G}\boldsymbol{X}^{t}\vec{y}\right]\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}\boldsymbol{X}^{t}E\left[\vec{y}\right]\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}\boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta}\\ &= {\vec{\lambda}}^{t}\boldsymbol{H}\vec{\beta}\\ &= {\vec{\lambda}}^{t}\vec{\beta} \end{align} \]

Estimability condition: \({\vec{\lambda}}^{t}\boldsymbol{H} = {\vec{\lambda}}^{t}\) (used in the last step).

  • Minimum variance

Let \({\vec{d}}^{t}\vec{y}\) be any other linear unbiased estimator of \({\vec{\lambda}}^{t}\vec{\beta}\):

\[ \begin{align} E\left[{\vec{d}}^{t}\vec{y}\right] &= {\vec{\lambda}}^{t}\vec{\beta}\\ {\vec{d}}^{t}E\left[\vec{y}\right] &= {\vec{\lambda}}^{t}\vec{\beta}\\ {\vec{d}}^{t}\boldsymbol{X}\vec{\beta} &= {\vec{\lambda}}^{t}\vec{\beta} \end{align} \]

Since this must hold for every \(\vec{\beta}\), it implies \({\vec{d}}^{t}\boldsymbol{X} = {\vec{\lambda}}^{t}\).

\[ \begin{align} V\left[{\vec{\lambda}}^{t}\widehat{\vec{\beta}}\right] &= {\vec{\lambda}}^{t}V\left[\widehat{\vec{\beta}}\right]{\vec{\lambda}}\\ &= {\vec{\lambda}}^{t}V\left[\boldsymbol{G}\boldsymbol{X}^{t}{\vec{y}}\right]{\vec{\lambda}}\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}\boldsymbol{X}^{t}V\left[{\vec{y}}\right]\boldsymbol{X}\boldsymbol{G}^{t}{\vec{\lambda}}\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}\boldsymbol{X}^{t}\boldsymbol{X}\boldsymbol{G}^{t}{\vec{\lambda}}\sigma^{2}\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}\boldsymbol{H}^{t}{\vec{\lambda}}\sigma^{2}\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}{\vec{\lambda}}\sigma^{2} \end{align} \]

\[ \begin{align} C\left[{\vec{\lambda}}^{t}\widehat{\vec{\beta}},{\vec{d}}^{t}\vec{y}\right] &= C\left[{\vec{\lambda}}^{t}\boldsymbol{G}\boldsymbol{X}^{t}\vec{y},{\vec{d}}^{t}\vec{y}\right]\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}\boldsymbol{X}^{t}{\vec{d}}\sigma^{2}\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}{\vec{\lambda}}\sigma^{2} \end{align} \]

where the last step again uses \({\vec{d}}^{t}\boldsymbol{X} = {\vec{\lambda}}^{t}\).

\[ \begin{align} V\left[{\vec{\lambda}}^{t}\widehat{\vec{\beta}}-{\vec{d}}^{t}\vec{y}\right] &= V\left[{\vec{\lambda}}^{t}\widehat{\vec{\beta}}\right]-2C\left[{\vec{\lambda}}^{t}\widehat{\vec{\beta}},{\vec{d}}^{t}\vec{y}\right]+V\left[{\vec{d}}^{t}\vec{y}\right]\\ &= {\vec{\lambda}}^{t}\boldsymbol{G}{\vec{\lambda}}\sigma^{2}-2{\vec{\lambda}}^{t}\boldsymbol{G}{\vec{\lambda}}\sigma^{2}+V\left[{\vec{d}}^{t}\vec{y}\right]\\ &= -{\vec{\lambda}}^{t}\boldsymbol{G}{\vec{\lambda}}\sigma^{2}+V\left[{\vec{d}}^{t}\vec{y}\right]\\ \end{align} \]

\[ \begin{align} -{\vec{\lambda}}^{t}\boldsymbol{G}{\vec{\lambda}}\sigma^{2}+V\left[{\vec{d}}^{t}\vec{y}\right] &\geq 0\\ V\left[{\vec{d}}^{t}\vec{y}\right] &\geq {\vec{\lambda}}^{t}\boldsymbol{G}{\vec{\lambda}}\sigma^{2}\\ V\left[{\vec{d}}^{t}\vec{y}\right] &\geq V\left[{\vec{\lambda}}^{t}\widehat{\vec{\beta}}\right] \end{align} \]

Hence \({\vec{\lambda}}^{t}\widehat{\vec{\beta}}\) has the smallest variance among all linear unbiased estimators of \({\vec{\lambda}}^{t}\vec{\beta}\), with equality only when

\[{\vec{d}}^{t}\vec{y} = {\vec{\lambda}}^{t}\widehat{\vec{\beta}}\]
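The theorem can also be illustrated by simulation. A minimal sketch of our own (not part of the original text): in simple regression, the endpoint slope \((y_{n}-y_{1})/(x_{n}-x_{1})\) is also linear and unbiased, so by Gauss–Markov its variance cannot be smaller than that of the OLS slope.

set.seed(1)
x <- seq(0, 1, length.out = 20)
sims <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(20, sd = 0.5)
  c(ols = unname(coef(lm(y ~ x))[2]),             # OLS slope
    endpoint = (y[20] - y[1]) / (x[20] - x[1]))   # another linear unbiased slope
})
apply(sims, 1, var)  # the OLS slope shows the smaller Monte Carlo variance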

3.3 Analysis of variance and hypothesis tests

\[ \begin{align} SC_E &= \left(\vec{y} - \boldsymbol{X}\widehat{\vec{\beta}}\right)^{t}\left(\vec{y} - \boldsymbol{X}\widehat{\vec{\beta}}\right)\\ &= \left(\vec{y} - \boldsymbol{X}\widehat{\vec{\beta}}\right)^{t}\vec{y} - \left(\vec{y} - \boldsymbol{X}\widehat{\vec{\beta}}\right)^{t}\boldsymbol{X}\widehat{\vec{\beta}}\\ &= \left(\vec{y}^{t} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\right)\vec{y} - \left(\vec{y}^{t} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\right)\boldsymbol{X}\widehat{\vec{\beta}}\\ &= \vec{y}^{t}\vec{y} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y} - \vec{y}^{t}\boldsymbol{X}\widehat{\vec{\beta}} + \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\boldsymbol{X}\widehat{\vec{\beta}}\\ &= \vec{y}^{t}\vec{y} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y} - \left(\vec{y}^{t}\boldsymbol{X}\widehat{\vec{\beta}}\right)^{t} + \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y}\\ &= \vec{y}^{t}\vec{y} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y} + \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y}\\ &= \vec{y}^{t}\vec{y} - 2\widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y} + \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y}\\ &= \vec{y}^{t}\vec{y} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y} \end{align} \]

The quantity \(\widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y}\) is called the regression (or model) sum of squares, \(SC_R\), also written \({SC}\left[{\beta}_{1},{\beta}_{2},\ldots,{\beta}_{p}\right]\).

The quantity \(\vec{y}^{t}\vec{y} = {\sum}_{i=1}^{n}{y}_{i}^{2}\) is called the (uncorrected) total sum of squares, \(SC_T\); in summary, we have arrived at:

\[ SC_T = SC_R +SC_E \]

\[ \begin{align} E\left[SC_R\right] &= E\left[SC_T - SC_E\right]\\ &= E\left[SC_T\right] - E\left[SC_E\right]\\ &= E\left[{\sum}_{i=1}^{n}{y}_{i}^{2}\right] - \left(n-r\right){\sigma}^{2}\\ &= {\sum}_{i=1}^{n}\left\{V\left[{y}_{i}\right] + E\left[{y}_{i}\right]^{2}\right\} - \left(n-r\right){\sigma}^{2}\\ &= n{\sigma}^{2} + E\left[\vec{y}\right]^{t}E\left[\vec{y}\right] - \left(n-r\right){\sigma}^{2}\\ &= \left[n - \left(n - r\right)\right]{\sigma}^{2} + \vec{\beta}^{t}\boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta}\\ &= r{\sigma}^{2} + \vec{\beta}^{t}\boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta} \end{align} \]

where \(r = rank(\boldsymbol{X})\), \(E\left[SC_E\right] = (n-r){\sigma}^{2}\), and \(E\left[\vec{y}\right] = \boldsymbol{X}\vec{\beta}\).

| Source | df | Sum of squares (SC) | Mean square (CM) | \(E[CM]\) |
|---|---|---|---|---|
| Model | \(r\) | \(SC_R=\widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y}\) | \(CM_R=\frac{SC_R}{r}\) | \(E[CM_R]={\sigma}^{2} + \frac{1}{r}\vec{\beta}^{t}\boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta}\) |
| Error | \(n-r\) | \(SC_E=\vec{y}^{t}\vec{y} - \widehat{\vec{\beta}}^{t}\boldsymbol{X}^{t}\vec{y}\) | \(CM_E=\frac{SC_E}{n-r}\) | \(E[CM_E]={\sigma}^{2}\) |
| Total | \(n\) | \(SC_T=\vec{y}^{t}\vec{y}\) | \(CM_T=\frac{SC_T}{n}\) | |

Hence \(E[CM_R] {\geq} E[CM_E]\), with equality only if \(\vec{\beta}^{t}\boldsymbol{X}^{t}\boldsymbol{X}\vec{\beta} = 0\), or equivalently \(\boldsymbol{X}\vec{\beta} = \vec{0}\).
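The decomposition above can be verified numerically; a short sketch with the simulated `datos` from earlier (the names `SC_T`, `SC_R` and `SC_E` mirror the notation in this section):

fit <- lm(y ~ x1 + x2, data = datos)
X <- model.matrix(fit)
y <- datos$y
bhat <- coef(fit)
SC_T <- sum(y^2)                             # y'y (uncorrected total)
SC_R <- as.numeric(t(bhat) %*% t(X) %*% y)   # bhat' X' y
SC_E <- SC_T - SC_R                          # y'y - bhat' X' y
c(SC_E = SC_E, deviance = deviance(fit))     # SC_E should match the residual deviance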

3.4 Goodness of fit of the model

3.5 Assessing the model assumptions

3.6 Variables

3.6.1 Predictors

3.6.2 Categorical variables

3.6.3 Confounding

3.6.4 Interaction

3.7 Variable selection and model building

3.7.1 Stepwise methods

3.7.2 Backward elimination

3.7.3 Forward selection

3.8 Regression model diagnostics

3.9 Prediction and modeling in linear regression

3.10 Implementation in R

3.10.1 The multiple linear regression equation

\[y_i=\beta_0 + \beta_1{\times}x_{i,1} + \beta_2{\times}x_{i,2} + \cdots + \beta_p{\times}x_{i,p} + error_{i}\]

\[error_{i}{\sim}N(0,\sigma_{error}^2)\]

\[\boldsymbol{y}_{(n{\times}1)}=\boldsymbol{X}_{(n{\times}p)}\boldsymbol{\beta}_{(p{\times}1)}+\boldsymbol{error}_{(n{\times}1)}\]

\[\boldsymbol{error}_{(n{\times}1)}{\sim}N(\boldsymbol{0}_{(n{\times}1)},\sigma_{error}^2\boldsymbol{I}_{(n{\times}n)})\]
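In R, the design matrix \(\boldsymbol{X}\) of this matrix formulation is what `model.matrix()` builds; a tiny sketch with hypothetical simulated data (the names `d` and `X` are ours):

set.seed(42)
d <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
X <- model.matrix(~ x1 + x2, data = d)
dim(X)  # n x p = 5 x 3: an intercept column plus the two predictors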

3.10.2 Libraries

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(broom)
library(tidyverse)
library(ggfortify)
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## The following objects are masked from 'package:car':
## 
##     deltaMethod, logit
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following object is masked from 'package:BayesFactor':
## 
##     compare
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(huxtable)
## 
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
## 
##     add_rownames
## The following object is masked from 'package:ggplot2':
## 
##     theme_grey
library(jtools)
library(latex2exp)
library(pubh)
## Loading required package: emmeans
## Loading required package: gtsummary
## 
## Attaching package: 'gtsummary'
## The following object is masked from 'package:huxtable':
## 
##     as_flextable
## The following object is masked from 'package:MASS':
## 
##     select
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(sjlabelled)
## 
## Attaching package: 'sjlabelled'
## The following object is masked from 'package:huxtable':
## 
##     set_label
## The following object is masked from 'package:forcats':
## 
##     as_factor
## The following object is masked from 'package:dplyr':
## 
##     as_label
## The following object is masked from 'package:ggplot2':
## 
##     as_label
library(sjPlot)
## 
## Attaching package: 'sjPlot'
## The following object is masked from 'package:huxtable':
## 
##     font_size
library(sjmisc)
## 
## Attaching package: 'sjmisc'
## The following objects are masked from 'package:jtools':
## 
##     %nin%, center
## The following objects are masked from 'package:huxtable':
## 
##     add_columns, add_rows, print_html, print_md
## The following object is masked from 'package:purrr':
## 
##     is_empty
## The following object is masked from 'package:tidyr':
## 
##     replace_na
## The following object is masked from 'package:tibble':
## 
##     add_case
library(Ecdat)
## Loading required package: Ecfun
## 
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
## 
##     sign
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:carData':
## 
##     Mroz
## The following object is masked from 'package:MASS':
## 
##     SP500
## The following object is masked from 'package:datasets':
## 
##     Orange

3.10.3 The data set

data(birthwt, package = "MASS")
library(tidyverse)
birthwt <- birthwt %>%
  mutate(
    age = as.numeric(age),
    lwt = as.numeric(lwt),
    smoke = factor(smoke, labels = c("Non-smoker", "Smoker")),
    race = factor(race, labels = c("White", "African American", "Other")),
    bwt = as.numeric(bwt)
    ) %>%
  var_labels(
    bwt = 'Birth weight (g)',
    smoke = 'Smoking status',
    race = 'Race'
    )
  • low: indicator of birth weight below 2.5 kg

glimpse(birthwt)
## Rows: 189
## Columns: 10
## $ low   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age   <dbl> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18, …
## $ lwt   <dbl> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95, 1…
## $ race  <fct> African American, Other, White, White, White, Other, White, Othe…
## $ smoke <fct> Non-smoker, Non-smoker, Smoker, Smoker, Smoker, Non-smoker, Non-…
## $ ptl   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ht    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ui    <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1…
## $ ftv   <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2, 3…
## $ bwt   <dbl> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 2722…

3.10.4 Exploratory analysis

birthwt %>%
  group_by(race, smoke) %>%
  summarise(
    n = n(),
    Mean = mean(bwt, na.rm = TRUE),
    Median = median(bwt, na.rm = TRUE),
    SD = sd(bwt, na.rm = TRUE),
    CV = rel_dis(bwt)
  ) 
## `summarise()` has grouped output by 'race'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 7
## # Groups:   race [3]
##   race             smoke          n  Mean Median    SD    CV
##   <fct>            <fct>      <int> <dbl>  <dbl> <dbl> <dbl>
## 1 White            Non-smoker    44 3429.  3593   710. 0.207
## 2 White            Smoker        52 2827.  2776.  626. 0.222
## 3 African American Non-smoker    16 2854.  2920   621. 0.218
## 4 African American Smoker        10 2504   2381   637. 0.254
## 5 Other            Non-smoker    55 2816.  2807   709. 0.252
## 6 Other            Smoker        12 2757.  3146.  810. 0.294
birthwt %>%
  gen_bst_df(bwt ~ race|smoke)
| Birth weight (g) | LowerCI | UpperCI | Race | Smoking status |
|---|---|---|---|---|
| 3.43e+03 | 3.22e+03 | 3.64e+03 | White | Non-smoker |
| 2.83e+03 | 2.66e+03 | 2.99e+03 | White | Smoker |
| 2.85e+03 | 2.56e+03 | 3.13e+03 | African American | Non-smoker |
| 2.5e+03 | 2.07e+03 | 2.87e+03 | African American | Smoker |
| 2.82e+03 | 2.63e+03 | 2.99e+03 | Other | Non-smoker |
| 2.76e+03 | 2.3e+03 | 3.15e+03 | Other | Smoker |
birthwt %>%
  bar_error(bwt ~ race, fill = ~ smoke) %>%
  axis_labs() %>%
  gf_labs(fill = "Smoking status:")

3.10.4.1 Correlation analysis

library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
# Columns 2, 3 and 10 of birthwt are the numeric variables age, lwt and bwt
chart.Correlation(birthwt[, c(2, 3, 10)], histogram = TRUE, pch = 19)

3.10.4.2 Missing data analysis

sapply(birthwt, function(x) sum(is.na(x)))
##   low   age   lwt  race smoke   ptl    ht    ui   ftv   bwt 
##     0     0     0     0     0     0     0     0     0     0
# Matrix of p-values from pairwise correlation tests (cor.test) between columns
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}

p.mat <- cor.mtest(birthwt[,c(2,3,10)])

library(corrplot)
## corrplot 0.92 loaded
## 
## Attaching package: 'corrplot'
## The following object is masked _by_ '.GlobalEnv':
## 
##     cor.mtest
birthwt.cor <- cor(birthwt[,c(2,3,10)])
corrplot(birthwt.cor, method = "number", type = "upper",
         tl.cex = 0.9, number.cex = 0.6,  order="hclust",  diag = FALSE,
         addCoef.col = "black", tl.col = "black", 
         p.mat = p.mat, sig.level = 0.05, insig = "blank")

3.10.5 Splitting the data set

set.seed(0123456789)

library(dplyr)
birthwt.train <- sample_frac(tbl = birthwt, replace = FALSE, size = 0.80)
birthwt.test <- anti_join(birthwt, birthwt.train)
## Joining, by = c("low", "age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv",
## "bwt")

3.10.6 Fitting a linear model to the data

model_norm <- lm(bwt ~ smoke + race, data = birthwt.train)

3.10.6.1 Linear model diagnostics

library(ggfortify)
autoplot(model_norm)

3.10.6.2 Linear model summary

model_norm %>% augment() %>% as_tibble()
| .rownames | bwt | smoke | race | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|
| 212 | 3.94e+03 | Non-smoker | Other | 2.91e+03 | 1.03e+03 | 0.0208 | 611 | 0.0152 | 1.69 |
| 155 | 3.27e+03 | Non-smoker | Other | 2.91e+03 | 364 | 0.0208 | 617 | 0.00189 | 0.597 |
| 199 | 3.77e+03 | Non-smoker | Other | 2.91e+03 | 860 | 0.0208 | 613 | 0.0106 | 1.41 |
| 101 | 2.77e+03 | Smoker | White | 2.91e+03 | -139 | 0.019 | 617 | 0.000253 | -0.229 |
| 131 | 3.06e+03 | Non-smoker | White | 3.34e+03 | -276 | 0.0221 | 617 | 0.00116 | -0.454 |
| 132 | 3.06e+03 | Smoker | White | 2.91e+03 | 154 | 0.019 | 617 | 0.000307 | 0.252 |

(first 6 of the 151 training observations shown; the remaining rows follow the same pattern)

3.10.6.3 Model coefficients

model_norm %>% tidy()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.34e+03 | 91.5 | 36.5 | 1.58e-75 |
| smokeSmoker | -430 | 108 | -3.99 | 0.000104 |
| raceAfrican American | -320 | 149 | -2.14 | 0.0341 |
| raceOther | -428 | 117 | -3.65 | 0.000367 |
3.10.6.3.1 Confidence intervals
model_norm %>% confint() %>% as_tibble()
| 2.5 % | 97.5 % |
|---|---|
| 3.16e+03 | 3.52e+03 |
| -643 | -217 |
| -615 | -24.3 |
| -659 | -196 |
model_norm %>% 
  glm_coef(labels = model_labels(model_norm))
| Parameter | Coefficient | Pr(>\|t\|) |
|---|---|---|
| Constant | 3338.12 (3157.2, 3519.04) | < 0.001 |
| Smoking status: Smoker | -429.74 (-642.58, -216.9) | < 0.001 |
| Race: African American | -319.5 (-614.66, -24.35) | 0.034 |
| Race: Other | -427.65 (-659.34, -195.95) | < 0.001 |
model_norm %>%
  glm_coef(se_rob = TRUE, labels = model_labels(model_norm))
| Parameter | Coefficient | Pr(>\|t\|) |
|---|---|---|
| Constant | 3338.12 (3157.12, 3519.13) | < 0.001 |
| Smoking status: Smoker | -429.74 (-644.83, -214.65) | < 0.001 |
| Race: African American | -319.5 (-587.4, -51.61) | 0.02 |
| Race: Other | -427.65 (-671.48, -183.81) | < 0.001 |
model_norm %>%
  plot_model("pred", terms = ~race|smoke, dot.size = 1.5, title = "")

emmip(model_norm, smoke ~ race) %>%
  gf_labs(y = get_label(birthwt$bwt), x = "", col = "Smoking status")

3.10.6.4 Multicollinearity of the model

library(regclass)
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following objects are masked from 'package:mosaic':
## 
##     chisq, logit
## The following object is masked from 'package:car':
## 
##     logit
## The following object is masked from 'package:coda':
## 
##     nvar
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
## 
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
## 
##     qq
model_norm %>% VIF() %>% as_tibble()
| GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|
| 1.12 | 1 | 1.06 |
| 1.12 | 2 | 1.03 |

3.10.6.5 Residuals of the fitted model

# Columns 2, 3 and 10 of birthwt.train are age, lwt and bwt
p1 <- ggplot(birthwt.train, aes(birthwt.train[, 2], residuals(model_norm))) +
  geom_point() + geom_smooth(color = "blue")
p2 <- ggplot(birthwt.train, aes(birthwt.train[, 3], residuals(model_norm))) +
  geom_point() + geom_smooth(color = "blue")
p3 <- ggplot(birthwt.train, aes(birthwt.train[, 10], residuals(model_norm))) +
  geom_point() + geom_smooth(color = "blue")

library(pdp)
## 
## Attaching package: 'pdp'
## The following object is masked from 'package:purrr':
## 
##     partial
library(gridExtra)
grid.arrange(p1, p2, p3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3.10.6.6 Outliers

library(olsrr)
model_norm %>% ols_plot_cooksd_bar()

3.10.6.7 Refitting the linear regression model

3.10.6.7.1 Stepwise algorithm
model_norm %>% 
  Anova() %>% 
  tidy()
| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| smoke | 6.03e+06 | 1 | 15.9 | 0.000104 |
| race | 5.46e+06 | 2 | 7.21 | 0.00103 |
| Residuals | 5.57e+07 | 147 | | |
model_norm %>% 
  tidy()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.34e+03 | 91.5 | 36.5 | 1.58e-75 |
| smokeSmoker | -430 | 108 | -3.99 | 0.000104 |
| raceAfrican American | -320 | 149 | -2.14 | 0.0341 |
| raceOther | -428 | 117 | -3.65 | 0.000367 |
model_norm %>% 
  glm_coef(labels = model_labels(model_norm))
| Parameter | Coefficient | Pr(>\|t\|) |
|---|---|---|
| Constant | 3338.12 (3157.2, 3519.04) | < 0.001 |
| Smoking status: Smoker | -429.74 (-642.58, -216.9) | < 0.001 |
| Race: African American | -319.5 (-614.66, -24.35) | 0.034 |
| Race: Other | -427.65 (-659.34, -195.95) | < 0.001 |
model_norm %>%
  glm_coef(se_rob = TRUE, labels = model_labels(model_norm))
| Parameter | Coefficient | Pr(>\|t\|) |
|---|---|---|
| Constant | 3338.12 (3157.12, 3519.13) | < 0.001 |
| Smoking status: Smoker | -429.74 (-644.83, -214.65) | < 0.001 |
| Race: African American | -319.5 (-587.4, -51.61) | 0.02 |
| Race: Other | -427.65 (-671.48, -183.81) | < 0.001 |

3.10.6.8 Fit criteria for the linear model

model_norm %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.136 | 0.119 | 615 | 7.73 | 7.88e-05 | 3 | -1.18e+03 | 2.37e+03 | 2.39e+03 | 5.57e+07 | 147 | 151 |
library(MASS)
model_norm_AIC <- stepAIC(model_norm, trace = 0)
model_norm_AIC %>% 
  Anova() %>% 
  tidy()
| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| smoke | 6.03e+06 | 1 | 15.9 | 0.000104 |
| race | 5.46e+06 | 2 | 7.21 | 0.00103 |
| Residuals | 5.57e+07 | 147 | | |
model_norm_AIC %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.136 | 0.119 | 615 | 7.73 | 7.88e-05 | 3 | -1.18e+03 | 2.37e+03 | 2.39e+03 | 5.57e+07 | 147 | 151 |

3.10.6.9 Analysis of variance

model_norm %>% Anova() %>% tidy()
| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| smoke | 6.03e+06 | 1 | 15.9 | 0.000104 |
| race | 5.46e+06 | 2 | 7.21 | 0.00103 |
| Residuals | 5.57e+07 | 147 | | |

3.10.6.10 Comparison of criteria

AIC(model_norm, model_norm_AIC)
| | df | AIC |
|---|---|---|
| model_norm | 5 | 2.37e+03 |
| model_norm_AIC | 5 | 2.37e+03 |

Both fits have the same AIC because stepAIC did not drop any term from model_norm.

3.10.6.11 Relative importance of each variable

# Relative importance of each predictor via the relaimpo package (left unevaluated in the original)
#library(relaimpo)
#calc.relimp(model_norm_AIC, type = c("lmg", "last", "first", "pratt", "betasq"), rela = T)
#boot <- boot.relimp(model_norm, b = 1000, type = c("lmg", "last", "first", "pratt"), 
#                    rank = TRUE, diff = TRUE, rela = TRUE)
#booteval.relimp(boot) 
#plot(booteval.relimp(boot,sort=TRUE)) 

4 Other regression models

4.1 Polynomial regression

4.2 Poisson regression

4.3 Logistic regression

4.4 Robust regression

4.5 Nonlinear regression