Primero, cargamos los paquetes necesarios para este laboratorio (y limpiamos el environment, obvio!)
library(data.table)
library(ggplot2)
library(moderndive)
library(knitr)
rm(list=ls())
Cargamos la base de datos de Airbnb Seattle llamada seattle_01.csv.
red_wine <- fread("wine/Red.csv")
sparkling <- fread("wine/Sparkling.csv")
1. Exploración de datos Primero, vamos a explorar un poco los datos con unos gráficos simples. En este caso, nos gustaría predecir el precio de los vinos con la variable rating
ggplot(red_wine, aes(x=Rating, y=Price)) + geom_point() + labs(title="Red Wine")
ggplot(sparkling, aes(x=Rating, y=Price)) + geom_point() + labs(title="Sparkling Wine")
Para que las estimaciones de MCO sean más precisas, no podemos tener valores demasiado atípicos que ensucien nuestra muestra. Sabemos que pueden haber botellas de vino específicas con precios muy muy altos, y que esos precios se deban a características que no podemos observar en nuestros datos. Dados los gráficos anteriores, eliminemos esos outliers de la muestra. Luego, volvemos a hacer los gráficos para visualizar la distribución.
red_wine <- red_wine[!Price>3000,]
sparkling <- sparkling[!Price>400]
ggplot(red_wine, aes(x=Rating, y=Price)) + geom_point() + labs(title="Red Wine")
ggplot(sparkling, aes(x=Rating, y=Price)) + geom_point() + labs(title="Sparkling Wine")
2. Estimaciones
Ahora, vamos a nuestras estimaciones y predicciones. Con la función lm vamos a generar nuestra primera regresión lineal simple. Vamos a estimar el precio del vino para
reg_red <- lm(Price~Rating , data = red_wine)
reg_spark <- lm(Price~Rating , data = sparkling)
get_regression_table(reg_red) %>% kable()
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -431.782 | 9.13 | -47.294 | 0 | -449.679 | -413.885 |
| Rating | 120.953 | 2.34 | 51.700 | 0 | 116.367 | 125.539 |
get_regression_table(reg_spark) %>% kable()
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -409.397 | 13.969 | -29.307 | 0 | -436.809 | -381.984 |
| Rating | 114.281 | 3.592 | 31.811 | 0 | 107.231 | 121.330 |
2. Predicciones
red_wine[,prediccion:=predict(reg_red)]
sparkling[,prediccion:=predict(reg_spark)]
ggplot(red_wine, aes(x=Rating, y=Price)) + geom_point() + stat_smooth(method = lm) + labs(title = "Red Wine")
## `geom_smooth()` using formula 'y ~ x'
ggplot(sparkling, aes(x=Rating, y=Price)) + geom_point() + stat_smooth(method = lm) + labs(title = "Sparkling Wine")
## `geom_smooth()` using formula 'y ~ x'
3. Residuos Podemos calcular nuestro error de predicción con la función
resid que nos permite ver cuánto nos equivocamos
red_wine[,residuo:=resid(reg_red)]
sparkling[,residuo:=resid(reg_spark)]
4. Agregando más variables Sabemos que en realidad no es solamente el Rating lo que define el precio de un vino. Sabemos que en el error de predicción están presentes, en este caso, características tanto observables como inobservables que influyen en el precio. También sabemos que en el caso de los vinos, la antigüedad es algo muy importante y que puede influir muhco en el precio. Agreguemos el factor año en nuestras regresiones para ver como mejoran nuestras predicciones y en cuanto se reducen nuestros residuos.
reg_red_2 <- lm(Price~Rating + Year , data = red_wine)
reg_spark_2 <- lm(Price~Rating + Year, data = sparkling)
get_regression_table(reg_red_2) %>% kable()
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -227.703 | 61.167 | -3.723 | 0.000 | -347.604 | -107.802 |
| Rating | 101.698 | 2.275 | 44.704 | 0.000 | 97.239 | 106.158 |
| Year1989 | 811.900 | 73.951 | 10.979 | 0.000 | 666.938 | 956.863 |
| Year1990 | 403.965 | 73.950 | 5.463 | 0.000 | 259.006 | 548.924 |
| Year1991 | 646.820 | 85.390 | 7.575 | 0.000 | 479.436 | 814.204 |
| Year1992 | 601.343 | 69.720 | 8.625 | 0.000 | 464.675 | 738.011 |
| Year1993 | 821.860 | 85.390 | 9.625 | 0.000 | 654.475 | 989.246 |
| Year1995 | -10.403 | 67.507 | -0.154 | 0.878 | -142.732 | 121.927 |
| Year1996 | -45.126 | 66.144 | -0.682 | 0.495 | -174.784 | 84.531 |
| Year1997 | 20.483 | 64.548 | 0.317 | 0.751 | -106.047 | 147.013 |
| Year1998 | -96.868 | 65.217 | -1.485 | 0.137 | -224.710 | 30.973 |
| Year1999 | -42.151 | 62.359 | -0.676 | 0.499 | -164.391 | 80.088 |
| Year2000 | -51.533 | 61.948 | -0.832 | 0.405 | -172.966 | 69.900 |
| Year2001 | -115.041 | 62.846 | -1.831 | 0.067 | -238.235 | 8.152 |
| Year2002 | -98.086 | 64.548 | -1.520 | 0.129 | -224.616 | 28.444 |
| Year2003 | -92.177 | 63.064 | -1.462 | 0.144 | -215.799 | 31.444 |
| Year2004 | -78.222 | 61.488 | -1.272 | 0.203 | -198.753 | 42.310 |
| Year2005 | -87.336 | 60.576 | -1.442 | 0.149 | -206.080 | 31.407 |
| Year2006 | -53.850 | 61.062 | -0.882 | 0.378 | -173.546 | 65.846 |
| Year2007 | -117.792 | 61.080 | -1.928 | 0.054 | -237.524 | 1.939 |
| Year2008 | -95.473 | 60.757 | -1.571 | 0.116 | -214.572 | 23.626 |
| Year2009 | -78.123 | 60.740 | -1.286 | 0.198 | -197.187 | 40.941 |
| Year2010 | -72.257 | 60.554 | -1.193 | 0.233 | -190.958 | 46.444 |
| Year2011 | -119.218 | 60.491 | -1.971 | 0.049 | -237.796 | -0.641 |
| Year2012 | -131.496 | 60.465 | -2.175 | 0.030 | -250.021 | -12.970 |
| Year2013 | -120.149 | 60.437 | -1.988 | 0.047 | -238.620 | -1.678 |
| Year2014 | -129.003 | 60.424 | -2.135 | 0.033 | -247.448 | -10.558 |
| Year2015 | -136.054 | 60.407 | -2.252 | 0.024 | -254.466 | -17.643 |
| Year2016 | -136.462 | 60.404 | -2.259 | 0.024 | -254.868 | -18.057 |
| Year2017 | -139.145 | 60.409 | -2.303 | 0.021 | -257.561 | -20.729 |
| Year2018 | -137.566 | 60.423 | -2.277 | 0.023 | -256.009 | -19.123 |
| Year2019 | -136.330 | 60.657 | -2.248 | 0.025 | -255.231 | -17.428 |
| YearN.V. | -155.228 | 64.044 | -2.424 | 0.015 | -280.768 | -29.687 |
get_regression_table(reg_spark_2) %>% kable()
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -290.651 | 18.956 | -15.333 | 0.000 | -327.850 | -253.452 |
| Rating | 80.061 | 3.410 | 23.476 | 0.000 | 73.369 | 86.754 |
| Year1996 | 170.074 | 27.321 | 6.225 | 0.000 | 116.461 | 223.688 |
| Year1999 | 190.644 | 21.639 | 8.810 | 0.000 | 148.180 | 233.109 |
| Year2002 | 172.538 | 18.187 | 9.487 | 0.000 | 136.848 | 208.228 |
| Year2003 | 76.104 | 27.321 | 2.786 | 0.005 | 22.491 | 129.718 |
| Year2004 | 149.995 | 17.328 | 8.656 | 0.000 | 115.991 | 184.000 |
| Year2005 | 118.956 | 21.608 | 5.505 | 0.000 | 76.552 | 161.359 |
| Year2006 | 64.856 | 15.002 | 4.323 | 0.000 | 35.416 | 94.296 |
| Year2007 | 77.397 | 15.216 | 5.086 | 0.000 | 47.537 | 107.258 |
| Year2008 | 40.904 | 14.762 | 2.771 | 0.006 | 11.935 | 69.872 |
| Year2009 | 17.104 | 15.165 | 1.128 | 0.260 | -12.655 | 46.864 |
| Year2010 | 19.647 | 15.134 | 1.298 | 0.195 | -10.052 | 49.346 |
| Year2011 | 28.328 | 15.556 | 1.821 | 0.069 | -2.200 | 58.856 |
| Year2012 | 20.008 | 14.472 | 1.382 | 0.167 | -8.392 | 48.408 |
| Year2013 | 10.241 | 14.850 | 0.690 | 0.491 | -18.900 | 39.383 |
| Year2014 | 17.971 | 14.656 | 1.226 | 0.220 | -10.788 | 46.731 |
| Year2015 | 5.789 | 14.305 | 0.405 | 0.686 | -22.283 | 33.860 |
| Year2016 | -1.079 | 14.516 | -0.074 | 0.941 | -29.565 | 27.408 |
| Year2017 | 1.718 | 14.384 | 0.119 | 0.905 | -26.510 | 29.945 |
| Year2018 | -11.496 | 14.330 | -0.802 | 0.423 | -39.617 | 16.626 |
| Year2019 | -18.617 | 15.115 | -1.232 | 0.218 | -48.278 | 11.045 |
| YearN.V. | 10.547 | 13.647 | 0.773 | 0.440 | -16.233 | 37.328 |
Ahora, volvamos a calcular las predicciones y residuos para comparar:
red_wine[,prediccion_2:=predict(reg_red_2)]
sparkling[,prediccion_2:=predict(reg_spark_2)]
red_wine[,residuo_2:=resid(reg_red_2)]
sparkling[,residuo_2:=resid(reg_spark_2)]