Linear Regression
Maestría en Asuntos Políticos y Políticas Públicas
Diego Solís Delgadillo
Best-Fit Line
Outcome \((Y)\): the variable we want to explain or predict
Predictor \((X)\): the variable we use to explain the variability in \(Y\)
\[ Y = \text{Model} + \text{Error} \]
\[ \text{Budget}_i = \alpha + \beta_1 \text{Enrollment}_i \]
\[ Y_i = \alpha + \beta_1 x_i \]
\[ \text{residual}_i = \text{budget}_i - \text{prediction}_i \]
We compute the residual for each data point
Some values will be positive and others negative
To make them all positive, we square them
The sum of these squared residuals is known as the sum of squared errors (SSE)
Different candidate lines have different SSEs; the best-fit line is the one that minimizes the SSE
Tip
X | Y |
---|---|
1 | 2 |
2 | 4 |
3 | 5 |
4 | 4 |
5 | 5 |
We compute the means: \(\bar{x} = 3\) and \(\bar{y} = 4\)
The fitted line passes through the point of means \((\bar{x}, \bar{y})\)
We compute the differences
First we compute the distances between \(x\) and \(\bar{x}\)
X | Y | \(x-\bar{x}\) |
---|---|---|
1 | 2 | -2 |
2 | 4 | -1 |
3 | 5 | 0 |
4 | 4 | 1 |
5 | 5 | 2 |
We do the same with the distances between \(y\) and \(\bar{y}\)
X | Y | \(x-\bar{x}\) | \(y-\bar{y}\) |
---|---|---|---|
1 | 2 | -2 | -2 |
2 | 4 | -1 | 0 |
3 | 5 | 0 | 1 |
4 | 4 | 1 | 0 |
5 | 5 | 2 | 1 |
X | Y | \(x-\bar{x}\) | \(y-\bar{y}\) | \((x-\bar{x})^2\) | \((x-\bar{x})(y-\bar{y})\) |
---|---|---|---|---|---|
1 | 2 | -2 | -2 | 4 | 4 |
2 | 4 | -1 | 0 | 1 | 0 |
3 | 5 | 0 | 1 | 0 | 0 |
4 | 4 | 1 | 0 | 1 | 0 |
5 | 5 | 2 | 1 | 4 | 2 |
Total | | | | 10 | 6 |
\[ \beta=\frac{\Sigma(x-\bar{x})(y-\bar{y}) }{\Sigma(x-\bar{x})^2} \] \[ \beta=\frac{6}{10}=0.6 \]
\[ y= \alpha+\beta x \] \[ 4= \alpha+0.6(3) \] \[ \alpha= 4-1.8=2.2 \]
\[ R^2 = 1 - \frac{SSE}{TSS} \]
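A quick check of these hand calculations in R (a sketch using the five points from the table above):

```r
# The five observations from the table
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Slope and intercept from the formulas above
beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # 0.6
alpha <- mean(y) - beta * mean(x)                                   # 2.2

# SSE, TSS, and R^2 for the fitted line
pred <- alpha + beta * x
SSE  <- sum((y - pred)^2)      # 2.4
TSS  <- sum((y - mean(y))^2)   # 6
1 - SSE / TSS                  # 0.6

# lm() reproduces the same estimates
summary(lm(y ~ x))
```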
Tip
\(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\)
Note
🎯 The standard error of \(\beta\) (\(SE(\hat{\beta})\)) measures the precision of our estimate of the slope.
If the standard error is small → greater precision.
If the standard error is large → greater uncertainty.
It is computed as:
\[ SE(\hat{\beta}) = \sqrt{ \frac{SSE}{(n-2) \sum (x-\bar{x})^2} } \]
Where \(SSE\) is the sum of squared errors, \(n\) is the number of observations, and \(\sum (x-\bar{x})^2\) is the sum of squared deviations of \(x\).
We know that \(SSE = 2.4\), \(n - 2 = 3\), and \(\sum (x-\bar{x})^2 = 10\).
Therefore:
\[ SE(\hat{\beta})= \sqrt{ \frac{2.4}{3 \times 10} } = \sqrt{0.08} = 0.2828 \]
Note
\[ t= \frac{\hat{\beta} - \beta_0}{SE} \]
Tip
We know that \(\hat{\beta} = 0.6\), \(\beta_0 = 0\), and \(SE(\hat{\beta}) = 0.2828\).
Therefore:
\[ t= \frac{0.6-0}{0.2828}=2.122 \]
\[ df = n - 2 \]
Tip
Therefore:
\[ df = 5 - 2 = 3 \]
Tip
In our example:
The computed t value was:
\[ t=2.122 \]
With \(df = 3\) we obtain:
\[ p=0.121 \]
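The standard error, t statistic, and p-value can also be reproduced in R (a sketch; the data and estimates are the ones computed above):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)
beta  <- 0.6
alpha <- 2.2

# Standard error of the slope
SSE     <- sum((y - (alpha + beta * x))^2)               # 2.4
se_beta <- sqrt(SSE / ((n - 2) * sum((x - mean(x))^2)))  # 0.2828

# t statistic and two-sided p-value with n - 2 = 3 degrees of freedom
t_stat <- (beta - 0) / se_beta                        # 2.122
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)   # ~0.12
```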
| | (1) |
|---|---|
| (Intercept) | 53.956*** |
| | (0.315) |
| gdpPercap | 0.001*** |
| | (0.000) |
| Num.Obs. | 1704 |
| R2 | 0.341 |
| R2 Adj. | 0.340 |
| AIC | 12850.4 |
| BIC | 12866.7 |
| Log.Lik. | -6422.205 |
| F | 879.577 |
| RMSE | 10.49 |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
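The table above looks like `modelsummary` output for a bivariate regression of life expectancy on GDP per capita. A sketch that would produce a comparable table (treating the use of the `gapminder` data and `lifeExp` as the outcome as an assumption, inferred from the `gdpPercap` coefficient and the 1704 observations):

```r
library(gapminder)      # assumption: the data behind column (1)
library(modelsummary)

m_gdp <- lm(lifeExp ~ gdpPercap, data = gapminder)
modelsummary(m_gdp, stars = TRUE)
```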
\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 \]
Tip
Coefficient of determination
In multiple regression, the coefficient of determination still tells us how much of the variation in \(Y\) is explained by the model, now by all the predictors jointly
Which variables should we include?
\[ \text{InspectionScore} = \beta_0 + \beta_1 \text{NumberofLocations} + \epsilon \]
What does the test measure?
Warning
Tip
The constant is our prediction for \(Y\) when all the predictors are equal to zero
⚠️ If the values of the independent variable cannot reach zero, then the intercept is not very informative
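A common fix when the predictor cannot be zero is to center it, so the intercept becomes the prediction at the predictor's mean. A minimal sketch using the penguins data that appears later in these slides (body mass can never be 0 g):

```r
library(palmerpenguins)

# After centering, the intercept is the predicted flipper length
# at the average body mass rather than at an impossible mass of 0 g
m_centered <- lm(flipper_length_mm ~ I(body_mass_g - mean(body_mass_g, na.rm = TRUE)),
                 data = penguins)
summary(m_centered)
```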
Example
\[ Y = \beta_g + \beta_t + \beta_1 X_{it} + \beta_2 W_{i} + \epsilon_{it} \]

- This occurs when we have panel data: the same units \(i\) observed over time \(t\), with group effects (\(\beta_g\)) and time effects (\(\beta_t\)); see the sketch below.
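A minimal sketch of how such a specification can be estimated with `lm()`, using dummies for the group and time effects (the data frame `panel_df` and its columns `y`, `x`, `w`, `group`, and `year` are hypothetical placeholders):

```r
# Hypothetical panel data with columns y, x, w, group, year
# factor(group) and factor(year) add the beta_g and beta_t dummies
fe_model <- lm(y ~ x + w + factor(group) + factor(year), data = panel_df)
summary(fe_model)
```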
# install.packages("causaldata")  # install once if needed
library(tidyverse)
res <- causaldata::restaurant_inspections
# Number of locations per chain (assumes the data identify chains by business_name)
res <- res %>%
  group_by(business_name) %>%
  mutate(NumberofLocations = n()) %>%
  ungroup()
m1 <- lm(inspection_score ~ NumberofLocations, data = res)
Call:
lm(formula = inspection_score ~ NumberofLocations, data = res)
Residuals:
Min 1Q Median 3Q Max
-27.1673 -3.5449 0.9835 5.4362 17.3253
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.8656964 0.0462975 2049.05 <2e-16 ***
NumberofLocations -0.0188715 0.0004356 -43.32 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.052 on 27176 degrees of freedom
Multiple R-squared: 0.0646, Adjusted R-squared: 0.06456
F-statistic: 1877 on 1 and 27176 DF, p-value: < 2.2e-16
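The second set of results adds `Year` as a control; judging from the `Call:` printed below, it comes from:

```r
m2 <- lm(inspection_score ~ NumberofLocations + Year, data = res)
summary(m2)
```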
Call:
lm(formula = inspection_score ~ NumberofLocations + Year, data = res)
Residuals:
Min 1Q Median 3Q Max
-27.4407 -3.5210 0.9142 5.3014 18.0722
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.253e+02 1.241e+01 18.16 <2e-16 ***
NumberofLocations -1.919e-02 4.358e-04 -44.03 <2e-16 ***
Year -6.489e-02 6.173e-03 -10.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.039 on 27175 degrees of freedom
Multiple R-squared: 0.06838, Adjusted R-squared: 0.06832
F-statistic: 997.4 on 2 and 27175 DF, p-value: < 2.2e-16
| | (1) | (2) |
|---|---|---|
| (Intercept) | 94.866 *** | 225.333 *** |
| | (0.046) | (12.411) |
| NumberofLocations | -0.019 *** | -0.019 *** |
| | (0.000) | (0.000) |
| Year | | -0.065 *** |
| | | (0.006) |
| N | 27178 | 27178 |
| R2 | 0.065 | 0.068 |
| logLik | -87491.813 | -87436.667 |
| AIC | 174989.627 | 174881.334 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | |
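The side-by-side layout with `N`, `logLik`, and `AIC` rows matches the default output of `huxreg()`; a sketch that would produce a comparable comparison table (the use of the huxtable package is an assumption):

```r
library(huxtable)

# Compare the two models side by side
huxreg(m1, m2)
```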
library(palmerpenguins)
data(penguins)
# Flipper length regressed on sex (a dummy/binary predictor)
modelo <- lm(flipper_length_mm ~ sex, data = penguins)
summary(modelo)
Call:
lm(formula = flipper_length_mm ~ sex, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-26.506 -10.364 -4.364 12.636 26.494
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 197.364 1.057 186.792 < 2e-16 ***
sexmale 7.142 1.488 4.801 2.39e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.57 on 331 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.06511, Adjusted R-squared: 0.06229
F-statistic: 23.05 on 1 and 331 DF, p-value: 2.391e-06
Important
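With a dummy predictor, the intercept is the mean of the reference category (here `female`) and `sexmale` is the difference between the male and female means. A quick check (sketch):

```r
library(dplyr)
library(palmerpenguins)

# The female mean should match the intercept;
# the male mean should match intercept + sexmale
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(sex) %>%
  summarise(mean_flipper = mean(flipper_length_mm, na.rm = TRUE))
```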
library(palmerpenguins)
data(penguins)
# Flipper length regressed on island (a categorical predictor with three levels)
modelo <- lm(flipper_length_mm ~ island, data = penguins)
summary(modelo)
Call:
lm(formula = flipper_length_mm ~ island, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-37.707 -5.196 1.804 6.927 21.293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.7066 0.8621 243.25 <2e-16 ***
islandDream -16.6340 1.3207 -12.60 <2e-16 ***
islandTorgersen -18.5105 1.7824 -10.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.14 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.376, Adjusted R-squared: 0.3723
F-statistic: 102.1 on 2 and 339 DF, p-value: < 2.2e-16
Interpreting the intercept
Warning
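With a categorical predictor, the intercept is the mean of the reference category (here Biscoe, the first level alphabetically). One way to change the reference island, as a sketch:

```r
library(palmerpenguins)

# Make Dream the reference category: the intercept becomes the Dream mean
modelo_dream <- lm(flipper_length_mm ~ relevel(island, ref = "Dream"),
                   data = penguins)
summary(modelo_dream)
```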
library(tidyverse)
data(starwars)
# Log-log specification: both height and mass enter the model in logs
modelo2 <- lm(log(height) ~ log(mass), data = starwars)
summary(modelo2)
Call:
lm(formula = log(height) ~ log(mass), data = starwars)
Residuals:
Min 1Q Median 3Q Max
-0.80720 -0.01361 0.04195 0.08930 0.24794
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.90613 0.18197 21.47 < 2e-16 ***
log(mass) 0.28638 0.04205 6.81 6.6e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1937 on 57 degrees of freedom
(28 observations deleted due to missingness)
Multiple R-squared: 0.4486, Adjusted R-squared: 0.4389
F-statistic: 46.37 on 1 and 57 DF, p-value: 6.601e-09
Important
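In a log-log model the slope is read as an elasticity: a 1% increase in mass is associated with roughly a 0.29% increase in height. A small check of that reading, reusing `modelo2` fitted above (sketch):

```r
# Implied percentage change in height for a 10% increase in mass
b <- coef(modelo2)["log(mass)"]
(1.10^b - 1) * 100   # about 2.8%
```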
\[ Y= \beta_0+ \beta_1X + \beta_2Z+ \beta_3XZ+\epsilon\]
library(tidyverse)
df <- causaldata::restaurant_inspections
# Interaction between NumberofLocations and Weekend, controlling for Year
m3 <- lm(inspection_score ~ NumberofLocations * Weekend + Year, data = df)
summary(m3)
Call:
lm(formula = inspection_score ~ NumberofLocations * Weekend +
Year, data = df)
Residuals:
Min 1Q Median 3Q Max
-27.4313 -3.5208 0.9174 5.3044 18.0344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.251e+02 1.241e+01 18.134 < 2e-16 ***
NumberofLocations -1.911e-02 4.366e-04 -43.759 < 2e-16 ***
WeekendTRUE 1.759e+00 4.878e-01 3.606 0.000311 ***
Year -6.479e-02 6.174e-03 -10.494 < 2e-16 ***
NumberofLocations:WeekendTRUE -9.840e-03 7.529e-03 -1.307 0.191245
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.038 on 27173 degrees of freedom
Multiple R-squared: 0.06884, Adjusted R-squared: 0.06871
F-statistic: 502.3 on 4 and 27173 DF, p-value: < 2.2e-16
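The interaction term tells us how the slope of NumberofLocations differs on weekends. Both slopes can be recovered from the fitted model (sketch, reusing `m3` from above); note that the interaction's p-value (0.19) means the weekend difference in slopes is not statistically distinguishable from zero:

```r
b <- coef(m3)

# Slope of NumberofLocations on weekdays
b["NumberofLocations"]

# Slope of NumberofLocations on weekends (main effect + interaction)
b["NumberofLocations"] + b["NumberofLocations:WeekendTRUE"]
```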