Clase 8: Regresiones

Primero, cargamos los paquetes necesarios para este laboratorio (y limpiamos el environment, obvio!)

library(data.table)
library(ggplot2)
library(moderndive)
library(knitr)
rm(list=ls())

Cargamos la base de datos de Airbnb Seattle llamada seattle_01.csv.

red_wine <- fread("wine/Red.csv")
sparkling <- fread("wine/Sparkling.csv")

1. Exploración de datos Primero, vamos a explorar un poco los datos con unos gráficos simples. En este caso, nos gustaría predecir el precio de los vinos con la variable rating

ggplot(red_wine,  aes(x=Rating, y=Price)) + geom_point() + labs(title="Red Wine")

ggplot(sparkling,  aes(x=Rating, y=Price)) + geom_point() + labs(title="Sparkling Wine")

Para que las estimaciones de MCO sean más precisas, no podemos tener valores demasiado atípicos que ensucien nuestra muestra. Sabemos que pueden haber botellas de vino específicas con precios muy muy altos, y que esos precios se deban a características que no podemos observar en nuestros datos. Dados los gráficos anteriores, eliminemos esos outliers de la muestra. Luego, volvemos a hacer los gráficos para visualizar la distribución.

red_wine <- red_wine[!Price>3000,]
sparkling <- sparkling[!Price>400]

ggplot(red_wine,  aes(x=Rating, y=Price)) + geom_point() + labs(title="Red Wine")

ggplot(sparkling,  aes(x=Rating, y=Price)) + geom_point() + labs(title="Sparkling Wine")

2. Estimaciones

Ahora, vamos a nuestras estimaciones y predicciones. Con la función lm vamos a generar nuestra primera regresión lineal simple. Vamos a estimar el precio del vino para

reg_red <- lm(Price~Rating , data = red_wine)
reg_spark <- lm(Price~Rating , data = sparkling)

get_regression_table(reg_red) %>% kable()

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	-431.782	9.13	-47.294	0	-449.679	-413.885
Rating	120.953	2.34	51.700	0	116.367	125.539

get_regression_table(reg_spark) %>% kable()

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	-409.397	13.969	-29.307	0	-436.809	-381.984
Rating	114.281	3.592	31.811	0	107.231	121.330

2. Predicciones

red_wine[,prediccion:=predict(reg_red)]
sparkling[,prediccion:=predict(reg_spark)]

ggplot(red_wine, aes(x=Rating, y=Price)) +   geom_point() + stat_smooth(method = lm) + labs(title = "Red Wine")

## `geom_smooth()` using formula 'y ~ x'

ggplot(sparkling, aes(x=Rating, y=Price)) +   geom_point() + stat_smooth(method = lm) + labs(title = "Sparkling Wine")

## `geom_smooth()` using formula 'y ~ x'

3. Residuos Podemos calcular nuestro error de predicción con la función resid que nos permite ver cuánto nos equivocamos

red_wine[,residuo:=resid(reg_red)]
sparkling[,residuo:=resid(reg_spark)]

4. Agregando más variables Sabemos que en realidad no es solamente el Rating lo que define el precio de un vino. Sabemos que en el error de predicción están presentes, en este caso, características tanto observables como inobservables que influyen en el precio. También sabemos que en el caso de los vinos, la antigüedad es algo muy importante y que puede influir muhco en el precio. Agreguemos el factor año en nuestras regresiones para ver como mejoran nuestras predicciones y en cuanto se reducen nuestros residuos.

reg_red_2 <- lm(Price~Rating + Year , data = red_wine)
reg_spark_2 <- lm(Price~Rating + Year, data = sparkling)

get_regression_table(reg_red_2) %>% kable()

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	-227.703	61.167	-3.723	0.000	-347.604	-107.802
Rating	101.698	2.275	44.704	0.000	97.239	106.158
Year1989	811.900	73.951	10.979	0.000	666.938	956.863
Year1990	403.965	73.950	5.463	0.000	259.006	548.924
Year1991	646.820	85.390	7.575	0.000	479.436	814.204
Year1992	601.343	69.720	8.625	0.000	464.675	738.011
Year1993	821.860	85.390	9.625	0.000	654.475	989.246
Year1995	-10.403	67.507	-0.154	0.878	-142.732	121.927
Year1996	-45.126	66.144	-0.682	0.495	-174.784	84.531
Year1997	20.483	64.548	0.317	0.751	-106.047	147.013
Year1998	-96.868	65.217	-1.485	0.137	-224.710	30.973
Year1999	-42.151	62.359	-0.676	0.499	-164.391	80.088
Year2000	-51.533	61.948	-0.832	0.405	-172.966	69.900
Year2001	-115.041	62.846	-1.831	0.067	-238.235	8.152
Year2002	-98.086	64.548	-1.520	0.129	-224.616	28.444
Year2003	-92.177	63.064	-1.462	0.144	-215.799	31.444
Year2004	-78.222	61.488	-1.272	0.203	-198.753	42.310
Year2005	-87.336	60.576	-1.442	0.149	-206.080	31.407
Year2006	-53.850	61.062	-0.882	0.378	-173.546	65.846
Year2007	-117.792	61.080	-1.928	0.054	-237.524	1.939
Year2008	-95.473	60.757	-1.571	0.116	-214.572	23.626
Year2009	-78.123	60.740	-1.286	0.198	-197.187	40.941
Year2010	-72.257	60.554	-1.193	0.233	-190.958	46.444
Year2011	-119.218	60.491	-1.971	0.049	-237.796	-0.641
Year2012	-131.496	60.465	-2.175	0.030	-250.021	-12.970
Year2013	-120.149	60.437	-1.988	0.047	-238.620	-1.678
Year2014	-129.003	60.424	-2.135	0.033	-247.448	-10.558
Year2015	-136.054	60.407	-2.252	0.024	-254.466	-17.643
Year2016	-136.462	60.404	-2.259	0.024	-254.868	-18.057
Year2017	-139.145	60.409	-2.303	0.021	-257.561	-20.729
Year2018	-137.566	60.423	-2.277	0.023	-256.009	-19.123
Year2019	-136.330	60.657	-2.248	0.025	-255.231	-17.428
YearN.V.	-155.228	64.044	-2.424	0.015	-280.768	-29.687

get_regression_table(reg_spark_2) %>% kable()

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	-290.651	18.956	-15.333	0.000	-327.850	-253.452
Rating	80.061	3.410	23.476	0.000	73.369	86.754
Year1996	170.074	27.321	6.225	0.000	116.461	223.688
Year1999	190.644	21.639	8.810	0.000	148.180	233.109
Year2002	172.538	18.187	9.487	0.000	136.848	208.228
Year2003	76.104	27.321	2.786	0.005	22.491	129.718
Year2004	149.995	17.328	8.656	0.000	115.991	184.000
Year2005	118.956	21.608	5.505	0.000	76.552	161.359
Year2006	64.856	15.002	4.323	0.000	35.416	94.296
Year2007	77.397	15.216	5.086	0.000	47.537	107.258
Year2008	40.904	14.762	2.771	0.006	11.935	69.872
Year2009	17.104	15.165	1.128	0.260	-12.655	46.864
Year2010	19.647	15.134	1.298	0.195	-10.052	49.346
Year2011	28.328	15.556	1.821	0.069	-2.200	58.856
Year2012	20.008	14.472	1.382	0.167	-8.392	48.408
Year2013	10.241	14.850	0.690	0.491	-18.900	39.383
Year2014	17.971	14.656	1.226	0.220	-10.788	46.731
Year2015	5.789	14.305	0.405	0.686	-22.283	33.860
Year2016	-1.079	14.516	-0.074	0.941	-29.565	27.408
Year2017	1.718	14.384	0.119	0.905	-26.510	29.945
Year2018	-11.496	14.330	-0.802	0.423	-39.617	16.626
Year2019	-18.617	15.115	-1.232	0.218	-48.278	11.045
YearN.V.	10.547	13.647	0.773	0.440	-16.233	37.328

Ahora, volvamos a calcular las predicciones y residuos para comparar:

red_wine[,prediccion_2:=predict(reg_red_2)]
sparkling[,prediccion_2:=predict(reg_spark_2)]

red_wine[,residuo_2:=resid(reg_red_2)]
sparkling[,residuo_2:=resid(reg_spark_2)]

Clase 8: Regresiones

Profesora: Magdalena Cortina

4/28/2021